Presentation: Cassandra in Production @ Digg - Arin Sarkissian
It looks like the Digg guys are the most public about their usage of Cassandra. Arin’s presentation below is a bit less technical than the ☞ article published a while back, but also has some nice additions.
My notes:
- what it is like to use an alpha-stage project when you have no idea how others are using it
- the problem with sharding is that there’s no standard way of doing it
- if you start giving up features in your RDBMS, why not also look at alternatives?
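To make the sharding note above concrete, here is a minimal sketch of three common ways to pick a shard for a record. Nothing here comes from the Digg slides: the function names, the shard count, and the range boundaries are all hypothetical, chosen only to show that each team ends up writing its own routing rule.

```python
import bisect
import zlib
from typing import Sequence


def shard_by_modulo(user_id: int, num_shards: int) -> int:
    """Route by numeric id modulo the shard count (hypothetical scheme)."""
    return user_id % num_shards


def shard_by_hash(key: str, num_shards: int) -> int:
    """Route by a stable hash of a string key (hypothetical scheme).

    crc32 is used instead of the built-in hash() because hash() is
    randomized per process for strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_shards


def shard_by_range(user_id: int, upper_bounds: Sequence[int]) -> int:
    """Route by id range; upper_bounds are sorted, exclusive upper limits.

    Ids at or above the last bound go to one extra, final shard.
    """
    return bisect.bisect_right(upper_bounds, user_id)


if __name__ == "__main__":
    # The same record can land on a different shard under each scheme.
    print(shard_by_modulo(42, 8))
    print(shard_by_hash("user:42", 8))
    print(shard_by_range(42, [100, 1000]))
```

Each scheme behaves differently when shards are added: the modulo and hash variants remap most keys, while the range variant only needs a new boundary. That choice is exactly the part that has no standard answer.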
why Cassandra:
- easy administration (nb at least the promise of)
- no SPOF (single point of failure)
- more flexible than key-value stores
loading data: MySQL -> Hadoop -> Cassandra
This sounds like a complex process. Arin mentions the use of Scribe at Digg, and I was wondering whether using Scribe to get data directly into Cassandra wouldn’t have been easier. Anyway, it’s difficult to say without knowing the details.
- 12 servers initially, later scaled back to 8; 3TB of data
- Performance: < 1ms writes, ~4-5ms reads (nb: these are the numbers from the slides, but I find them odd)
I would have preferred to implement this service layer in Java, as managing resources and connection pooling would have been easier.
No, we don’t hate SQL
- open-sourced Python library for Cassandra: lazyboy ☞