Adku's Choice: Cassandra or HBase

The 6 reasons[1] (8 in the original list) Adku prefers Cassandra to HBase:

  1. Reliability
  2. Performance
  3. Consistency
  4. Single point of failure (crossed out)
  5. Hot spot problem
  6. MapReduce (crossed out)
  7. Simpler, Hackable
  8. Community support

Before jumping to any conclusions make sure you read the disclaimer:

While these decisions apply to Adku, they might not apply to your situation. Always do your own investigation and experimentation before choosing any large part of your system.

Update: JD Cryans[2] commented on the points listed above (thanks JD):

This comparison reminds me of the pain we went through in late 2009, when lots of similar comparisons came out from all sides: the “NoSQL war”. Unfortunately, as we all found out, no one wins.

But let’s look at the points mentioned in this post.

  • Reliability: As far as I can tell, that’s not a reliability test. The first thing that raises questions is the large number of region server crashes. Considering that the data set used (1 million rows of the full “Alice in Wonderland” text) is small compared to the ones other HBase users (StumbleUpon, Mozilla) are handling, that points to a configuration problem that wasn’t taken care of.

    One could say it’s because HBase is hard to configure or that the default configurations aren’t good, and to some extent I agree, but you don’t quantify reliability based on these.

  • Hot Spot Problem: This point is an interesting one, and most likely falls under the disclaimer.

    Distribution based on timestamp row keys will be better with Cassandra. But usually when using timestamps you also want range scans, which are impossible with hashing. For example, OpenTSDB provides a very efficient way to store time series by using a clever row key design, a design that you’ll probably also use if you need scans in Cassandra.

    Not to mention that using MapReduce will require sorted row keys anyway.

  • Community Support: Comparing communities based only on the number of IRC users is too much of a simplification. Someone looking to use an open source project should spend some time getting to know and interact with the users before stating that “one community is more helpful” than the other, a message that could also be perceived as disrespectful.
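
To make the hot spot trade-off from the comment above concrete, here is a minimal Python sketch. It is not HBase, Cassandra, or OpenTSDB code: the key layouts, the split points, and every function name are hypothetical, and SHA-256 is only a stand-in for whatever hash a real partitioner uses. It compares three row key designs on a range-partitioned store: a raw timestamp key, a hashed key, and a composite key (series id first, then timestamp) loosely in the spirit of the OpenTSDB design mentioned above.

```python
import bisect
import hashlib
import struct
from collections import Counter
from typing import Iterable, List, Sequence


def timestamp_key(ts: int) -> bytes:
    """Row key that is only the big-endian timestamp, so it sorts by time."""
    return struct.pack(">I", ts)


def composite_key(series_id: int, ts: int) -> bytes:
    """Series id first, then timestamp: sorts by series, then by time."""
    return struct.pack(">HI", series_id, ts)


def hashed_key(series_id: int, ts: int) -> bytes:
    """Hash of the composite key: spreads writes, but loses time order."""
    return hashlib.sha256(composite_key(series_id, ts)).digest()


def region_of(key: bytes, split_points: Sequence[bytes]) -> int:
    """Range partitioning: region i holds keys in [split[i-1], split[i])."""
    return bisect.bisect_right(split_points, key)


def distribution(keys: Iterable[bytes], split_points: Sequence[bytes]) -> Counter:
    """Count how many keys land in each region."""
    return Counter(region_of(k, split_points) for k in keys)


def range_scan(sorted_keys: List[bytes], start: bytes, stop: bytes) -> List[bytes]:
    """Return the contiguous slice of keys in [start, stop)."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Hypothetical workload: 4 series, one point per second for 250 seconds.
WRITES = [(s, ts) for ts in range(1_000_000, 1_000_250) for s in range(4)]

# Hypothetical split points, chosen so that each design has 4 regions.
TS_SPLITS = [timestamp_key(t) for t in (999_000, 999_500, 1_000_000)]
HASH_SPLITS = [bytes([b]) for b in (0x40, 0x80, 0xC0)]
SERIES_SPLITS = [struct.pack(">H", s) for s in (1, 2, 3)]

by_timestamp = distribution((timestamp_key(ts) for _, ts in WRITES), TS_SPLITS)
by_hash = distribution((hashed_key(s, ts) for s, ts in WRITES), HASH_SPLITS)
by_composite = distribution((composite_key(s, ts) for s, ts in WRITES), SERIES_SPLITS)

# With the composite key, one series' time window is one contiguous scan.
composite_sorted = sorted(composite_key(s, ts) for s, ts in WRITES)
window = range_scan(
    composite_sorted,
    composite_key(2, 1_000_100),
    composite_key(2, 1_000_200),
)

print("timestamp keys:", dict(by_timestamp))
print("hashed keys:   ", dict(by_hash))
print("composite keys:", dict(by_composite))
print("rows in scan window:", len(window))
```

In this toy setup every new timestamp key sorts after the last split point, so all writes hit a single region. The hashed keys spread across regions but no longer sort by time, so a time window cannot be read as one contiguous scan. The composite key spreads writes across series while keeping each series' points in time order.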

There are also a couple of points that are mentioned in the post even if HBase is the “winner” (MapReduce) or the feature is not a hard requirement (consistency).

I left performance for last, as the post mentions similar write performance results. But there is too little information about the benchmark to be able to comment on it. At first glance those results look weird, considering they weren’t using a Hadoop version that supports append, which, as shown by the original YCSB paper, would make quite a difference.

After the Adku blog post came out, Edward Capriolo wrote this response (rant?) to all who attempt similar comparisons, and I think it’s worth the read.

  1. From the original list I have crossed out MapReduce, as the author considers HBase the “winner”. Also, commenters on the original post have clarified the confusion about HBase’s single point of failure.

  2. Jean-Daniel Cryans: Apache HBase committer and DB Engineer at StumbleUpon, @jdcryans

Original title and link: Adku’s Choice: Cassandra or HBase (NoSQL databases © myNoSQL)