


Lessons Learned from Using Hadoop and HBase in Production

Some notes and questions about ☞ ReadPath's usage of HBase and Hadoop.

The dictionary processing went from a system that was having trouble keeping up with the incoming stream of content ( ReadPath adds ~1,500 new items / minute) to one that could completely rebuild a dictionary from 250 Million content items in under 3 hours (this equates to ~1,400,000 items / minute).
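The throughput claim is easy to sanity-check with back-of-envelope arithmetic (figures taken from the quoted post):

```python
# Back-of-envelope check on ReadPath's quoted numbers
items = 250_000_000          # content items in the rebuilt dictionary
minutes = 3 * 60             # "under 3 hours", taken as exactly 3 hours
rate = items / minutes
print(f"{rate:,.0f} items / minute")   # ≈ 1,388,889 — the ~1,400,000 claimed

incoming = 1_500             # new items / minute in the live stream
print(f"{rate / incoming:,.0f}x the incoming rate")  # ≈ 926x headroom
```

So the rebuild runs roughly three orders of magnitude faster than the live ingest rate, which explains why a full rebuild became practical.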

This sounds extremely cool, but I’d appreciate more details about the hardware involved. The only mention is that they are currently running an 8 node HBase/Hadoop cluster. Better stats would be: initial items/node vs. current items/node; initial algorithm LOC vs. HBase/Hadoop algorithm LOC; etc.

One of the main items that was keeping me from pulling the trigger on porting to HBase was concerns about data loss. In my first day of playing with HBase, I had a bad server take out the .META. table and result in complete loss of HBase tables. I pulled that server and haven’t had any data loss since […]

If you thought that using a NoSQL solution would automatically solve your backup and restore needs, you were wrong.

Next steps include […] looking at using HBase for the link graph system that ReadPath needs to sort items. The link graph is a much more difficult system: the read/write pattern is completely random, which blows away any caching. In preliminary tests, the system ends up being disk bound.
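Why a uniformly random access pattern "blows away any caching" can be illustrated with a small LRU cache simulation (a hypothetical sketch, not ReadPath's workload; key counts and cache size are made up):

```python
import random
from collections import OrderedDict

def hit_rate(accesses, cache_size):
    """Replay an access trace against a simple LRU cache, return hit ratio."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)        # mark as most recently used
        else:
            cache[key] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)

random.seed(42)
n_keys = 100_000
cache_size = 10_000   # cache holds 10% of the key space

# Skewed workload: 90% of reads go to a hot 10% of keys (typical web traffic)
skewed = [random.randrange(n_keys // 10) if random.random() < 0.9
          else random.randrange(n_keys)
          for _ in range(200_000)]

# Uniformly random workload, like a random link-graph traversal
uniform = [random.randrange(n_keys) for _ in range(200_000)]

print(f"skewed  hit rate: {hit_rate(skewed, cache_size):.2f}")
print(f"uniform hit rate: {hit_rate(uniform, cache_size):.2f}")
```

With a skewed workload the cache absorbs most reads; with a uniform one the hit rate collapses to roughly the cache-to-keyspace ratio, so nearly every read goes to disk.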

Wondering if a graph database wouldn’t be a better fit for this scenario. But then, I am not sure how well MapReduce/Hadoop is supported by graph databases.