You are running, or at least planning to run, a Hadoop cluster. Whether it is for a one-time job, the way the NY Times or Digg have used it, or for the long term, you will always want to take advantage of the optimizations and improvements that others made before you.
You can start by applying Amdahl's Law for Hadoop provisioning and only then look into the expert tips shared with us by Todd Lipcon of Cloudera and Nathan Marz of BackType.
The “7 Tips for Improving MapReduce Performance”, included below for reference, are described in detail, with diagnostics and benchmarks, in the ☞ original article.
- Configure your cluster correctly
- Use LZO compression
- Tune the number of map and reduce tasks appropriately
- Write a combiner
- Use the most appropriate and compact Writable type for your data
- Reuse Writables
- Use “poor man’s profiling” to see what your tasks are doing (note: make sure you also learn how to get custom stats from Hadoop)
Nathan Marz's ☞ tips for optimizing Cascading flows add a few more:
- Filter data as early as possible
- Eliminate reduces by using MultiGroupBy
- Define serialization tokens for custom writables
- Dealing with unbalanced joins
And from the ☞ Elastic MapReduce tips:
- Don’t put underscores in bucket names
- Start off small
- Use the log files
- GZipped input
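The first three of Lipcon's tips largely come down to configuration. As a minimal, illustrative sketch (not a recommendation for any particular cluster), a 0.20-era `mapred-site.xml` fragment that enables LZO compression of map output (assuming the hadoop-lzo libraries are installed on all nodes) and sets an explicit number of reduce tasks might look like:

```xml
<configuration>
  <!-- Compress intermediate map output with LZO to cut shuffle I/O -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <!-- The right value depends on cluster size; 24 is just a placeholder -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>24</value>
  </property>
</configuration>
```

The number of map tasks, by contrast, is mostly driven by input split size, which is why the original article treats map and reduce task tuning separately.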
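To see what "write a combiner" buys you, here is a small sketch of the word-count case in Hadoop Streaming terms (plain Python; the function name and shapes are our own, not from the article). The combiner collapses a mapper's locally sorted output so that only one record per key crosses the network:

```python
from itertools import groupby

def combine(pairs):
    """Combiner for a word count: sum the counts of consecutive
    identical keys in a mapper's (already sorted) local output,
    so the reducer receives one record per word instead of one
    record per occurrence."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

# In a real Hadoop Streaming job the same logic would read
# tab-separated "word\t1" lines on stdin and print "word\tsum"
# lines on stdout.
```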
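"Filter data as early as possible" applies just as well to plain MapReduce as to Cascading: drop records in the mapper so they never reach the shuffle at all. A hypothetical sketch (the `min_length` threshold is invented for illustration):

```python
def mapper(lines, min_length=4):
    """Map phase of a word count that discards short words up front.
    Filtering here, rather than in the reducer, means the dropped
    records are never serialized, sorted, or sent over the network."""
    for line in lines:
        for word in line.split():
            if len(word) >= min_length:
                yield word, 1
```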
It is always nice to have a list of things to check before going into production, so you should probably bookmark this page for when that time comes!
-  Hadoop, NY Times and Open Source Libraries
-  Digg moving data from MySQL to Cassandra using Hadoop
-  Applying Amdahl’s Law for Hadoop Provisioning
-  ☞ 7 Tips for Improving MapReduce Performance
-  Generate custom stats from Hadoop
-  ☞ Cascading (↩)
-  ☞ Tips for Optimizing Cascading Flows
-  ☞ Amazon Elastic MapReduce (↩)
-  ☞ Elastic MapReduce Tips