


Digg Going The Cassandra Way

I’ve just read about another high-profile web site, Digg, going the Cassandra way. This is not entirely new, as we’ve already heard about Cassandra in production at Digg, but the important bit is in this quote:

At the time of writing, we’ve reimplemented most of Digg’s functionality using Cassandra as our primary datastore.

I also found it interesting to see what motivated Digg to reach this decision and why a NoSQL solution fits their specific scenario:

[…] the increasing difficulty of building a high performance, write intensive, application on a data set that is growing quickly, with no end in sight.


Our domain area, news, doesn’t exact strict consistency requirements, so (according to Brewer’s theorem) relaxing this allows gains in availability and partition tolerance (i.e. operations completing, even in degraded system states). […]

As our system grows, it’s important for us to span multiple data centers for redundancy and network performance and to add capacity or replace failed nodes with no downtime. We plan to continue using commodity hardware, and to continue assuming that it will fail regularly. All of this is increasingly difficult with MySQL.
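To make the consistency trade-off in the quote a bit more concrete, here is a minimal sketch of the quorum arithmetic commonly used to reason about Dynamo-style replicated stores. It is my own illustration, not something taken from Digg's article: the function names and the replica counts are assumptions, and it deliberately ignores details such as hinted handoff and read repair.

```python
# Illustrative sketch only: N replicas per key, reads wait for R replies,
# writes wait for W acknowledgements. All names here are hypothetical.

def quorums_overlap(n, r, w):
    """True when every read quorum shares at least one replica with every write quorum."""
    return r + w > n


def can_serve(n, r, w, nodes_down):
    """True when enough replicas are still alive to satisfy both the read and the write quorum."""
    alive = n - nodes_down
    return alive >= r and alive >= w


if __name__ == "__main__":
    n = 3
    for r, w in [(2, 2), (1, 1)]:
        for down in range(n):
            print(
                f"N={n} R={r} W={w} down={down} "
                f"overlap={quorums_overlap(n, r, w)} "
                f"serves={can_serve(n, r, w, down)}"
            )
```

With three replicas, R=2 and W=2 keep reads and writes overlapping but stop completing once two nodes are down, while R=1 and W=1 give up that overlap and keep completing operations with a single live replica. That is the kind of relaxation the quote describes for a domain like news.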

The same article mentions several improvements Digg has made to Cassandra to make it more Digg-usable (all of which they have promised to open source):

  • full text, relational and graph indexing systems
  • increased comparator speed
  • better compaction threading
  • reduced logging overhead and Scribe support for logging
  • support for row-level caching
  • support for multi-get
  • slow query logging
  • improved bulk import functionality

I’d definitely be interested to hear more about the details of this process, so if you have any contacts at Digg, it would be great if you could make the introductions! I bet their story will be as exciting as Twitter’s.