


Cassandra, HBase, Riak: Choosing the Right Solution

Mozilla shows us the right way to choose a storage solution (as opposed to this completely incorrect way):

  1. list as many requirements and details as you have
  2. identify critical features
  3. install, experiment and compare with your checklist
  4. analyze and document missing features, nice-to-haves, etc.

Not only that, but the post goes on to explain how Cassandra, HBase, and Riak each answer the following requirements:

  • Scalability — Deliver a solution that can handle the expected starting load and that can easily scale out as that load goes up.
  • Elasticity — Because the peak traffic periods are relatively short and the non-peak hours are almost idle, it is important to consider ways to ensure the allocated hardware is not sitting idle, and that you aren’t starved for resources during the peak traffic periods.
  • Reliability — Stability and high availability are important. They aren’t as critical as they might be in certain other projects, but if we were down for several hours during the peak traffic period, the client layer needs to be able to retain the data and resubmit it at a later date.
  • Storage — Need enough room to store active experiments and also recent experiments that are being analyzed. It is expected that data will become stale over time and can be archived off of the active cluster.
  • Analysis — What do we have to put together to provide a friendly system for the analysts?
  • Cost — Actual cost of the additional hardware needed to deploy the initial solution and to scale through at least the end of the year.
  • Manpower — How much time and effort will it take us to deliver the first critical stage of the project and the subsequent stages? Also consider ongoing maintenance and ownership of the code.
  • Security — Because we will be accepting data from an outside, untrusted source, we need to consider what steps are necessary to ensure the health of the system and the privacy of users.
  • Extensibility — Deliver a platform that can readily evolve to meet the future needs of the project and, hopefully, other projects as well.
  • Disaster Recovery / Migration — If the original system fails to meet the requirements after going live, what options do we have to recover from that situation? If we decide to switch to another technology, how do we move the data?
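A checklist like the one above lends itself naturally to a weighted decision matrix: weight each requirement by how critical it is, score each candidate against it during the experiments, and rank the totals. The sketch below shows one minimal way to do this; the weights and scores are invented placeholders for illustration, not Mozilla's actual evaluation results.

```python
def rank(candidates, weights):
    """Return (candidate, total) pairs sorted by weighted score, best first."""
    totals = {
        name: sum(weights[req] * score for req, score in reqs.items())
        for name, reqs in candidates.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Requirement -> weight (3 = critical, 1 = nice-to-have). Illustrative only.
weights = {"scalability": 3, "reliability": 3, "elasticity": 2,
           "cost": 2, "manpower": 3}

# Candidate -> per-requirement score (0-5), filled in while experimenting.
# These numbers are hypothetical placeholders, not real benchmark results.
scores = {
    "Cassandra": {"scalability": 4, "reliability": 4, "elasticity": 4,
                  "cost": 3, "manpower": 3},
    "HBase":     {"scalability": 4, "reliability": 3, "elasticity": 2,
                  "cost": 3, "manpower": 2},
    "Riak":      {"scalability": 3, "reliability": 4, "elasticity": 4,
                  "cost": 3, "manpower": 3},
}

for name, total in rank(scores, weights):
    print(f"{name}: {total}")
```

The point of the exercise is less the final number than the documentation it forces: every missing feature and nice-to-have discovered in step 4 becomes an explicit, comparable entry.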

While they are not the only ones doing such extensive investigative work — see also Cassandra at Twitter and HBase at Adobe — there are many things to be learned from their experience. Thanks Mozilla for sharing it with us!

Also available: a comparison of Cassandra, HBase, and PNUTS, and Cassandra and HBase compared.