


Redis-powered Facebook-like newsfeeds

As we’ve learned over time, there are only two ways to keep a service usable: either make every access fast, or do the work upfront. Each approach comes with its own limitations and costs, and for the proposed solution, which uses the precomputing approach, these are well explained in the linked article:

In any real-life architecture, there are trade-offs made between speed, storage type and memory usage. In this case, memory and disk space are being compromised for the sake of speed of access. Retrieving any type of newsfeed is virtually free, which means that page loading speed will not be affected. Writes are still very fast, as there is no disk or SQL access involved. Although memory usage is relatively high, the minimal amount of data which is stored, and the use of MD5 keys to avoid unnecessary data replication, help to keep it within reason. Additionally, appropriate use of Redis’ automatic key expiry settings, and a regular cronjob pruning of old lists, will help even further.
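To make the pattern concrete, here is a minimal sketch of a precomputed newsfeed. It uses a toy in-memory stand-in for the handful of Redis commands involved (LPUSH, LTRIM, LRANGE, SET, GET); in a real deployment these would be calls to an actual Redis client, with EXPIRE set on the feed keys. The function names and the feed-length cap are illustrative, not taken from the article.

```python
import hashlib
import json

class TinyRedis:
    """Toy in-memory stand-in for the Redis commands used below."""
    def __init__(self):
        self.store = {}

    def lpush(self, key, value):
        self.store.setdefault(key, []).insert(0, value)

    def ltrim(self, key, start, stop):
        self.store[key] = self.store.get(key, [])[start:stop + 1]

    def lrange(self, key, start, stop):
        return self.store.get(key, [])[start:stop + 1]

    def set(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

r = TinyRedis()
FEED_LEN = 100  # assumed cap on each precomputed feed

def publish(author, followers, payload):
    # Store the item body once, keyed by its MD5 digest, so fanning
    # out to many feeds only replicates the small key, not the data.
    body = json.dumps(payload)
    item_key = "item:" + hashlib.md5(body.encode()).hexdigest()
    r.set(item_key, body)
    # Fan out: push the key onto every follower's precomputed feed
    # and keep each list bounded.
    for follower in followers:
        feed_key = "feed:%s" % follower
        r.lpush(feed_key, item_key)
        r.ltrim(feed_key, 0, FEED_LEN - 1)

def read_feed(user, count=10):
    # Reading is a single list fetch plus key lookups; no SQL involved.
    keys = r.lrange("feed:%s" % user, 0, count - 1)
    return [json.loads(r.get(k)) for k in keys]

publish("alice", ["bob", "carol"], {"text": "hello"})
print(read_feed("bob"))  # → [{'text': 'hello'}]
```

Note how the read path does no work proportional to the number of people followed: the cost has already been paid at write time.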

While nothing written in the article is incorrect per se, there are some additional aspects of complexity which are not covered:

  • write explosion: when using the precomputing approach with denormalized data, there are many situations in which a single core write triggers many additional write operations. The simplest example is a user with a huge number of followers. Continuing to perform these writes synchronously will soon become noticeable to users (however fast individual writes are, at some point the aggregate operation becomes noticeable), and that’s something you’ll want to avoid
  • data explosion, or data no longer fitting on a single machine. At that point, for each write you’ll also need to determine the different locations to which the data must be pushed; basically, you’ll have to route your data to various machines in your cluster, and this routing comes with its own costs. Also, without a very good data partitioning strategy (and I’m referring not only to the core data, i.e. the newsfeed, but also to the additional information that is usually presented along with it), data retrieval might require multiple roundtrips to various machines, and that will add to the complexity of your system.
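The usual answer to the write-explosion problem in the first bullet is to take the fan-out off the request path. The sketch below, with illustrative names and an in-process queue standing in for a real task queue (which could itself be a Redis list drained by workers), shows the idea: the synchronous part of a post stays O(1), while a background worker absorbs the burst of feed writes.

```python
import queue
import threading

# Hypothetical in-process job queue; in production this would be a
# durable task queue (e.g. a Redis list consumed by worker processes).
fanout_jobs = queue.Queue()
feeds = {}  # follower -> list of item ids (stand-in for Redis lists)

def enqueue_post(item_id, followers):
    # The synchronous part of the write stays cheap: record one job.
    fanout_jobs.put((item_id, followers))

def fanout_worker():
    # The worker performs the "write explosion" off the request path.
    while True:
        item_id, followers = fanout_jobs.get()
        if item_id is None:  # sentinel: shut down
            break
        for f in followers:
            feeds.setdefault(f, []).insert(0, item_id)
        fanout_jobs.task_done()

worker = threading.Thread(target=fanout_worker)
worker.start()

# A user with many followers: one core write triggers 100,000 feed
# writes, but the caller returns immediately after enqueueing.
enqueue_post("post:42", ["user:%d" % i for i in range(100000)])
fanout_jobs.put((None, None))  # stop the worker once drained
worker.join()
print(len(feeds))  # → 100000
```

The second bullet (partitioning) is not addressed here; once feeds live on multiple machines, the worker must additionally route each write to the shard owning that follower’s list.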

Bottom line: this is a common problem for services like Facebook, Twitter, and many others. As far as I know, Twitter uses the precompute strategy, while Facebook uses both precomputation and the make-it-fast-for-every-access strategy, relying on heavy parallelization and various optimizations[1].


  • [1] Disclaimer: I haven’t worked for either of these companies; my comment is based on the architecture talks I’ve watched about the two services.