


Expert Tips for Optimizing Hadoop and MapReduce

You are running, or at least planning to run, a Hadoop cluster. Whether it is for a one-time job, as the NY Times or Digg have used it, or for the long term, you will always want to take advantage of the optimizations and improvements that others made before you.

You can start by using Amdahl’s law for Hadoop provisioning and only then look into the expert tips shared with us by Todd Lipcon of Cloudera and Nathan Marz of BackType.

The “7 Tips for Improving MapReduce Performance”, included below for reference, are described in much more detail and accompanied by diagnostics and benchmarks in the ☞ original article.

  1. Configure your cluster correctly
  2. Use LZO compression
  3. Tune the number of map and reduce tasks appropriately
  4. Write a combiner
  5. Use the most appropriate and compact Writable type for your data
  6. Reuse Writables
  7. Use “poor man’s profiling” to see what your tasks are doing (note: make sure you also learn how to get custom stats from Hadoop)
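The idea behind tip 4 is worth a quick illustration: a combiner pre-aggregates a mapper’s output on the local node, so far fewer key/value pairs have to cross the network during the shuffle. Below is a minimal, hypothetical Python sketch in the Hadoop Streaming spirit of a word count; the function names are mine, not Hadoop’s:

```python
from collections import defaultdict

def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: aggregate mapper output locally, so only one
    (word, count) pair per distinct word is shuffled to the reducers."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

# Without the combiner, all six (word, 1) pairs would be shuffled;
# with it, only four aggregated pairs are.
mapped = map_words("to be or not to be")
print(combine(mapped))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In real Hadoop jobs the combiner is usually the same class as the reducer (set via `Job.setCombinerClass`), which is why tip 4 is such a cheap win for associative operations like counting and summing.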

In case you are using Cascading[6] on top of Hadoop, then you’ll most probably find the tips ☞ here quite useful:

  1. Filter data as early as possible
  2. Eliminate reduces by using MultiGroupBy
  3. Define serialization tokens for custom writables
  4. Deal with unbalanced joins
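The first of these tips applies to any MapReduce pipeline, not just Cascading: records dropped in the map phase never pay shuffle, sort, or reduce costs. A small Python sketch of the principle, with hypothetical record fields of my own choosing:

```python
def mapper(records, min_score=10):
    """Filter as early as possible: records that fail the predicate
    are dropped here, so the shuffle/sort phase never sees them."""
    for record in records:
        if record["score"] >= min_score:
            yield (record["user"], record["score"])

records = [
    {"user": "a", "score": 42},
    {"user": "b", "score": 3},   # dropped before the shuffle
    {"user": "c", "score": 17},
]
print(list(mapper(records)))  # [('a', 42), ('c', 17)]
```

The same filtering done in the reducer would produce an identical result, but only after every record had been serialized, shuffled across the network, and sorted, which is exactly the cost the tip tells you to avoid.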

Or if you are using Amazon Elastic MapReduce[8], you’ll probably want to check ☞ Pete Warden’s tips:

  1. Don’t put underscores in bucket names
  2. Start off small
  3. Use the log files
  4. GZipped input
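The first of these tips exists because an S3 bucket name containing underscores is not a valid DNS hostname, which breaks the virtual-hosted-style S3 URLs that Elastic MapReduce jobs rely on. A small, hypothetical Python helper (the function name is mine) to sanity-check a bucket name before a job even launches:

```python
import re

# DNS-safe bucket names: lowercase letters, digits, and hyphens only,
# starting and ending with a letter or digit. Underscores are the
# classic trap called out in the tip above.
_BUCKET_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")

def is_emr_safe_bucket_name(name):
    """Return True if the bucket name is DNS-safe for use with EMR."""
    return 3 <= len(name) <= 63 and bool(_BUCKET_RE.match(name))

print(is_emr_safe_bucket_name("my-emr-logs"))  # True
print(is_emr_safe_bucket_name("my_emr_logs"))  # False: underscore
```

Catching this at submission time is far cheaper than diagnosing a cryptic job failure after the cluster has already spun up.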

It is always nice to have a checklist like this around before going into production, so you should probably bookmark this page for when that time comes!