Presentation: Cassandra in Production @ Digg - Arin Sarkissian

It looks like the Digg guys are the most public about their usage of Cassandra. Arin’s presentation below is a bit less technical than the ☞ article published a while back, but also has some nice additions.

My notes:

  • what it is like to use an alpha-stage project when you have no idea how others are using it
  • the problem with sharding is that there’s no standard way of doing it
  • if you start giving up features in your RDBMS, why not also look at alternatives?
  • why Cassandra:

      • easy administration (nb: at least the promise of it)
      • no SPOF (single point of failure)
      • more flexible than key-value stores
  • loading data: MySQL -> Hadoop -> Cassandra

    This sounds like a complex process. Arin mentions the use of Scribe at Digg, and I was wondering whether using Scribe to get data directly into Cassandra wouldn’t have been easier. Anyway, it’s difficult to say without knowing the details.

  • 12 servers initially, later scaled back to 8; 3TB of data
  • Performance: < 1ms writes, ~4-5ms reads (nb: these are the numbers from the slides, but I find them odd)
  • I would have preferred to implement this service layer in Java as managing resources/pooling would have been better
  • No, we don’t hate SQL
  • open-sourced Python library for Cassandra: lazyboy ☞
Digg new architecture
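
To make the "more flexible than key-value stores" point from my notes a bit more concrete, here is a toy sketch in plain Python. It is not the Cassandra API, not lazyboy, and not anything shown in Arin's slides. The class names, row keys and column names are all made up for illustration. The only thing it tries to show is the general idea: a plain key-value store gives you one opaque value per key, while a column-family style store keeps many named columns per row and can return a sorted slice of them.

```python
from bisect import bisect_left, bisect_right


class ToyKeyValueStore:
    """Hypothetical plain key-value store: one opaque value per key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class ToyColumnFamily:
    """Hypothetical in-memory model: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = {}

    def insert(self, row_key, column, value):
        # Each row can hold any number of independently written columns.
        self._rows.setdefault(row_key, {})[column] = value

    def get_slice(self, row_key, start, finish):
        """Return (name, value) pairs with start <= name <= finish, sorted by name."""
        columns = self._rows.get(row_key, {})
        names = sorted(columns)
        lo = bisect_left(names, start)
        hi = bisect_right(names, finish)
        return [(name, columns[name]) for name in names[lo:hi]]


# Made-up example data: one row per story, one column per (date, user) pair.
diggs = ToyColumnFamily()
diggs.insert("story:42", "2009-11-02:bob", "dugg")
diggs.insert("story:42", "2009-11-01:alice", "dugg")
diggs.insert("story:42", "2009-12-15:carol", "dugg")

# Fetch only the November columns of that row, in column-name order.
november = diggs.get_slice("story:42", "2009-11-01", "2009-11-30~")
```

With the plain key-value version you would have to read, modify and rewrite the whole value for "story:42" to get the same effect, which is roughly what I understand by "more flexible" here.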