


Digg: All content tagged as Digg in NoSQL databases and polyglot persistence

How Digg is Built? Using a Bunch of NoSQL technologies

The picture should speak for Digg’s polyglot persistency approach:

Digg Data Storage Architecture

But here is also a description of the data stores in use:

Digg stores data in multiple types of systems depending on the type of data and the access patterns, and also for historical reasons in some cases :)

  • Cassandra: The primary store for “Object-like” access patterns for such things as Items (stories), Users, Diggs and the indexes that surround them. Since the Cassandra 0.6 version we use does not support secondary indexes, these are computed by application logic and stored here. […]

  • HDFS: Logs from site and API events, user activity. Data source and destination for batch jobs run with Map-Reduce and Hive in Hadoop. Big Data and Big Compute!

  • MySQL: This is mainly the current store for the story promotion algorithm and calculations, because it requires lots of JOIN heavy operations which is not a natural fit for the other data stores at this time. However… HBase looks interesting.

  • Redis: The primary store for the personalized news data because it needs to be different for every user and quick to access and update. We use Redis to provide the Digg Streaming API and also for the real time view and click counts since it provides super low latency as a memory-based data storage system.

  • Scribe: the log collecting service. Although this is a primary store, the logs are rotated out of this system regularly and summaries written to HDFS.

I know this will sound strange, but isn’t that too many systems in there?
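The Cassandra item above says that secondary indexes are computed by application logic and stored next to the data. Digg's actual code is not shown in the post, so what follows is only a minimal sketch of the general idea, using plain Python dictionaries in place of Cassandra. The class name `IndexedStore`, its methods, and the `submitter` field are all hypothetical.

```python
from collections import defaultdict


class IndexedStore:
    """Toy key-value store with an index maintained by the application."""

    def __init__(self, indexed_field):
        self.indexed_field = indexed_field
        self.rows = {}                 # primary data: row key -> dict of columns
        self.index = defaultdict(set)  # index data: field value -> set of row keys

    def put(self, key, columns):
        old = self.rows.get(key)
        # Drop the stale index entry first, so an updated row is no longer
        # reachable under its previous value.
        if old is not None and self.indexed_field in old:
            old_value = old[self.indexed_field]
            self.index[old_value].discard(key)
            if not self.index[old_value]:
                del self.index[old_value]
        self.rows[key] = dict(columns)
        if self.indexed_field in columns:
            self.index[columns[self.indexed_field]].add(key)

    def find_by(self, value):
        # Two steps: read the index entry, then fetch each row by its key.
        keys = self.index.get(value, set())
        return [self.rows[k] for k in sorted(keys)]


stories = IndexedStore("submitter")
stories.put("story:1", {"title": "Cassandra at Digg", "submitter": "alice"})
stories.put("story:2", {"title": "Redis counters", "submitter": "alice"})
stories.put("story:1", {"title": "Cassandra at Digg", "submitter": "bob"})
print([row["title"] for row in stories.find_by("alice")])
```

The cost of this approach is that every write touches two places, and in a real distributed store the data write and the index write are not atomic, so the application also has to cope with an index that is briefly out of date.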


Original title and link: How Digg is Built? Using a Bunch of NoSQL technologies (NoSQL databases © myNoSQL)


Redis at Digg: Story View Counts

Digg just rolled out a new feature, cumulative page event counters (page views plus clicks), that uses Redis as its underlying solution.

Clickstream information is extracted in real time from logs, and then Redis’s support for incrementing values comes into play. And in case you are wondering how these counters deal with concurrent updates, keep in mind that Redis is a single-threaded engine, so all operations are executed sequentially.
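To see why sequential execution is enough, here is a small simulation in plain Python. It is not Redis and not Digg's code: the class `SingleThreadedCounterServer`, the function `simulate_clicks`, and the key name are hypothetical. Many client threads submit increment commands, but one worker thread applies them one at a time, which is roughly the role the Redis `INCR` command plays on a real server.

```python
import queue
import threading


class SingleThreadedCounterServer:
    """Applies increment commands one at a time on a single worker thread."""

    def __init__(self):
        self.counters = {}
        self.commands = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            command = self.commands.get()
            if command is None:  # shutdown marker
                break
            key, reply = command
            # Only this thread ever touches self.counters, so the
            # read-modify-write below cannot interleave with another one.
            self.counters[key] = self.counters.get(key, 0) + 1
            reply.put(self.counters[key])

    def incr(self, key):
        """Queue one increment and wait for the new value."""
        reply = queue.Queue(maxsize=1)
        self.commands.put((key, reply))
        return reply.get()

    def shutdown(self):
        self.commands.put(None)
        self.worker.join()


def simulate_clicks(clients, clicks_per_client, key="story:42:events"):
    server = SingleThreadedCounterServer()

    def client():
        for _ in range(clicks_per_client):
            server.incr(key)

    threads = [threading.Thread(target=client) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    server.shutdown()
    return server.counters[key]


print(simulate_clicks(clients=8, clicks_per_client=500))
```

Because the clients never read a value and write it back themselves, no update can be lost: the final count always equals the number of commands submitted.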

In Digg’s own words: “Redis rocks!”



Cassandra Status Inside Facebook, Twitter, Digg, and More

As everyone probably knows by now, Cassandra originated at Facebook as a solution for inbox search and was then open sourced under the ASF umbrella and an Apache license. Since then, Twitter, Digg, Reddit and quite a few others have started using it, but not much has been heard from Facebook.

So, in case you are wondering ☞ what’s up with Cassandra, here’s a very concise update:

  1. Twitter and Digg are not planning to fork the project. In fact there are clear plans to contribute back their work on Cassandra (see this for more details)
  2. Facebook is still using Cassandra internally for the inbox search, but they are using their own version
  3. Even though Facebook has stopped contributing to the Cassandra project after the initial code share, the community at the ASF is doing well (read: growing)
  4. Riptano, the company founded by Cassandra project lead Jonathan Ellis and Matt Pfeil, is offering technical support, professional services, and training for Cassandra

Update: an interesting ☞ note (dated July 7th) from Twitter engineer Nick Kallen:

Twitter no longer intends to use Cassandra for any critical data-stores in the near term future.

MemcacheDB History at Reddit

Steve Huffman (co-founder and programmer of Reddit) speaking at ☞ FOWA Miami 2010 (around min.18:30)[1]:

And then there is another software that is really handy MemcacheDB, which is like memcached but is persistent. […] It’s very very fast, super-handy, we store far more data in MemcacheDB than we do in Postgres

Then bam! MemcacheDB’s bursts of blocking writes led Reddit to switch to Cassandra, as its friends at Digg and Twitter did.

Lesson learned: take such pieces of advice with a grain of salt and always test your own scenario.

  1. It looks like Steve was no longer working at Reddit when the presentation was given, so he might not have been aware of the problems related to MemcacheDB.

Presentation: NoSQL: Dealing with the Data Deluge

A presentation by John Quinn (@doofdoofsf) on NoSQL, relational databases and massive amounts of data. In a way, it is a nicer and extended form of “NoSQL is here to stay”.

Digg Going The Cassandra Way

I’ve just read about another high profile web site, Digg, going the Cassandra way. While this is not absolutely new as we’ve already heard about Cassandra in production @ Digg, the important bit is in this quote:

At the time of writing, we’ve reimplemented most of Digg’s functionality using Cassandra as our primary datastore.

I also found it interesting to see what motivated Digg to reach this decision and why a NoSQL solution fits their specific scenario:

[…] the increasing difficulty of building a high performance, write intensive, application on a data set that is growing quickly, with no end in sight.


Our domain area, news, doesn’t exact strict consistency requirements, so (according to Brewer’s theorem) relaxing this allows gains in availability and partition tolerance (i.e. operations completing, even in degraded system states). […]

As our system grows, it’s important for us to span multiple data centers for redundancy and network performance and to add capacity or replace failed nodes with no downtime. We plan to continue using commodity hardware, and to continue assuming that it will fail regularly. All of this is increasingly difficult with MySQL.

The same article mentions a number of improvements Digg has added to Cassandra to make it more Digg-usable (all of which Digg has promised to open source):

  • full text, relational and graph indexing systems
  • increased comparator speed
  • better compaction threading
  • reduced logging overhead and Scribe support for logging
  • support for row-level caching
  • support for multi-get
  • slow query logging
  • improved bulk import functionality

I’d definitely be interested to hear more about the details of this process, so if you have any contacts at Digg it would be great if you could make the introductions! I bet their story will be as exciting as Twitter’s.