HBase, MapReduce and Data Analysis

Joydeep Sen Sarma looks at the possibility of separating the I/O paths for real-time requests and analytic queries while continuing to use the same HBase storage:

  • Applications that require up-to-date versions go through the RegionServer (tablet server in BigTable parlance) API
  • Applications that do not need the very latest updates can instead read compacted files directly from HDFS
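
The split above amounts to a routing decision based on how much staleness a reader tolerates. Here is a minimal sketch of that decision, not taken from the original article: the names `ReadRequest`, `choose_read_path`, and the two path labels are hypothetical, and no real HBase or HDFS client is involved.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical labels for the two I/O paths; these are not HBase identifiers.
REGION_SERVER_PATH = "regionserver"
HDFS_PATH = "hdfs_compacted_files"


@dataclass
class ReadRequest:
    table: str
    # How old the returned data may be; timedelta(0) means "must be the latest".
    max_staleness: timedelta


def choose_read_path(request: ReadRequest, compaction_lag: timedelta) -> str:
    """Route a read to HDFS only when the compacted files are fresh enough.

    compaction_lag is an assumed upper bound on the age of the data in the
    compacted files (for example, the time since the last major compaction).
    """
    if request.max_staleness >= compaction_lag:
        return HDFS_PATH
    return REGION_SERVER_PATH
```

With an assumed compaction lag of one day, a request declaring `max_staleness=timedelta(days=2)` would be sent to the compacted files, while a request with `timedelta(0)` would go through the RegionServer.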

As regards the advantages of such an approach, he writes:

Many analytical and reporting applications are perfectly fine with data that’s about a day old or so. Issuing large IO against HDFS will always be more performant than streaming data out of region servers. And it allows one to size the HBase clusters based on the (more steady) real-time traffic - not based on large bursty batch jobs (alternately put - partially isolates the online traffic from resource contention by batch jobs).

As a commenter pointed out, this article doesn’t look at a possible third scenario, stream analysis, and I’m not sure there is any Hadoop/HBase solution for that. But for the cases it does cover, the solution sounds interesting, as long as you can avoid data collisions and make sure that reading files directly doesn’t lead to inconsistencies in the analyzed data.
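
One way to think about the consistency concern is to have each batch job fix its set of input files up front, so that it never mixes data from different points in time. The sketch below only illustrates that idea under loud assumptions: `CompactedFile` and `snapshot_for_job` are hypothetical names, the paths are made up, and it ignores the fact that a later compaction may delete files a running job still expects to read.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CompactedFile:
    path: str               # hypothetical HDFS path of a compacted file
    completed_at: datetime  # when the compaction that wrote it finished


def snapshot_for_job(files, cutoff):
    """Return the files completed at or before `cutoff`, sorted by path.

    Fixing the cutoff once per job gives every task the same input set.
    """
    selected = [f for f in files if f.completed_at <= cutoff]
    return sorted(selected, key=lambda f: f.path)
```

A job would call `snapshot_for_job` once, at submission time, and hand the resulting list to all of its tasks rather than letting each task list the directory on its own.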