The picture should speak for itself about Digg’s polyglot persistence approach, but here is also a description of the data stores in use:
Digg stores data in multiple types of systems, depending on the type of data and the access patterns, and in some cases for historical reasons :)
Cassandra: The primary store for “Object-like” access patterns for such things as Items (stories), Users, Diggs and the indexes that surround them. Since the Cassandra 0.6 version we use does not support secondary indexes, these are computed by application logic and stored here. […]
HDFS: Logs from site and API events, user activity. Data source and destination for batch jobs run with Map-Reduce and Hive in Hadoop. Big Data and Big Compute!
MySQL: This is mainly the current store for the story promotion algorithm and calculations, because it requires lots of JOIN-heavy operations, which are not a natural fit for the other data stores at this time. However… HBase looks interesting.
Redis: The primary store for the personalized news data, because it needs to be different for every user and quick to access and update. We use Redis to provide the Digg Streaming API and also for the real-time view and click counts, since it provides very low latency as a memory-based data storage system.
Scribe: The log-collecting service. Although this is a primary store, the logs are rotated out of this system regularly and summaries are written to HDFS.
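The Cassandra entry above mentions secondary indexes that are "computed by application logic". As a rough illustration of what that usually means, here is a minimal sketch that uses plain Python dictionaries as stand-ins for two column families. The class name, method names, and the "items by user" index are all hypothetical; this is not Digg's code and it does not use any Cassandra client API.

```python
from collections import defaultdict


class ItemStore:
    """In-memory stand-in for a primary table plus a hand-maintained index.

    Assumption: `items` plays the role of the primary rows and
    `items_by_user` plays the role of a second column family that the
    application keeps in sync itself, since the store will not do it.
    """

    def __init__(self):
        self.items = {}                         # item_id -> item fields
        self.items_by_user = defaultdict(set)   # user_id -> set of item_ids

    def put_item(self, item_id, user_id, title):
        # If the item already exists under another user, drop the stale
        # index entry first so the index never points at the wrong owner.
        old = self.items.get(item_id)
        if old is not None and old["user_id"] != user_id:
            self.items_by_user[old["user_id"]].discard(item_id)
        self.items[item_id] = {"user_id": user_id, "title": title}
        self.items_by_user[user_id].add(item_id)

    def delete_item(self, item_id):
        # Deleting the row also means deleting its index entry by hand.
        old = self.items.pop(item_id, None)
        if old is not None:
            self.items_by_user[old["user_id"]].discard(item_id)

    def items_for_user(self, user_id):
        # The "secondary index" lookup: no scan over all items is needed.
        return sorted(self.items_by_user.get(user_id, ()))


store = ItemStore()
store.put_item("story-1", "alice", "First story")
store.put_item("story-2", "alice", "Second story")
store.put_item("story-3", "bob", "Third story")
print(store.items_for_user("alice"))
```

The point of the sketch is the cost: every write path has to update both structures, and in a real distributed store those two writes are not atomic, so the application also has to tolerate or repair index entries that drift out of sync.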
I know this will sound strange, but isn’t that too much in there?
Original title and link: How Digg is Built? Using a Bunch of NoSQL technologies (NoSQL databases © myNoSQL)