One of the first things you learn when speaking to people handling tons of data is that they tend to use a different definition of commodity hardware: commodity is not synonymous with crappy (or cheap) hardware. So, before starting to build your next Hadoop cluster, you should ask yourself what hardware to use and how to choose it.
The Cloudera support team’s article on this ☞ topic is both a fantastic and useful read. It not only provides some recommendations, but also walks you through the process of gathering the requirements for your cluster:
It is straightforward to measure live workloads and determine bottlenecks by putting thorough monitoring in place on the Hadoop cluster. We recommend installing Ganglia on all Hadoop machines to provide real-time statistics about CPU, disk, and network load. With Ganglia installed a Hadoop administrator can then run their MapReduce jobs and check the Ganglia dashboard to see how each machine is performing.
I’d say that for scenarios where you don’t know beforehand whether your MapReduce jobs will be IO-bound or CPU-bound, the cloud is probably the easiest (cheapest?) way to experiment and find out. I also think that except for cases where you can predict the workload will always remain the same, the cloud can remain a better alternative to owning your own cluster.
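As a quick illustration of the Ganglia approach from the quote above: gmetad serves cluster metrics as XML over TCP (port 8651 by default), so spotting the busiest datanodes is a small parsing exercise. The sketch below parses a sample of that XML; the host name and values are made up for illustration, and a real setup would read from the gmetad socket rather than a hard-coded string:

```python
import xml.etree.ElementTree as ET

# A trimmed sample of the XML that Ganglia's gmetad serves.
# Host name and metric values here are invented for illustration.
SAMPLE = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmetad">
 <GRID NAME="unspecified">
  <CLUSTER NAME="hadoop" LOCALTIME="0">
   <HOST NAME="datanode01" REPORTED="0">
    <METRIC NAME="cpu_idle" VAL="12.3" TYPE="float" UNITS="%"/>
    <METRIC NAME="bytes_in" VAL="98000000" TYPE="float" UNITS="bytes/sec"/>
   </HOST>
  </CLUSTER>
 </GRID>
</GANGLIA_XML>"""

def busiest_hosts(xml_text, metric="cpu_idle", limit=5):
    """Return (host, value) pairs for `metric`, lowest values first.

    For cpu_idle, the lowest idle percentage means the busiest CPU.
    """
    root = ET.fromstring(xml_text)
    readings = []
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric:
                readings.append((host.get("NAME"), float(m.get("VAL"))))
    return sorted(readings, key=lambda hv: hv[1])[:limit]

print(busiest_hosts(SAMPLE))  # [('datanode01', 12.3)]
```

The same function works for any metric gmetad reports (disk, network), which is exactly the kind of per-machine view the article suggests checking while a MapReduce job runs.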
The Cloudera team looks at the 4 types of nodes in a Hadoop cluster and makes some generic recommendations:
We recommend the following specifications for datanodes/tasktrackers in a balanced Hadoop cluster:
- 4 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
- 2 quad core CPUs, running at least 2-2.5GHz
- 16-24GBs of RAM (24-32GBs if you’re considering HBase)
- Gigabit Ethernet
[…] We recommend our customers purchase hardened machines for running the namenodes and jobtrackers, with redundant power and enterprise-grade RAIDed disks. Namenodes also require more RAM relative to the number of data blocks in the cluster. A good rule of thumb is to assume 1GB of namenode memory for every one million blocks stored in the distributed file system. With 100 datanodes in a cluster, 32GBs of RAM on the namenode provides plenty of room to grow. We also recommend having a standby machine to replace the namenode or jobtracker, in the case when one of these fails suddenly.
but also recommendations based on different workloads:
- Light Processing Configuration (1U/machine): Two quad core CPUs, 8GB memory, and 4 disk drives (1TB or 2TB). Note that CPU-intensive work such as natural language processing involves loading large models into RAM before processing data and should be configured with 2GB RAM/core instead of 1GB RAM/core.
- Balanced Compute Configuration (1U/machine): Two quad core CPUs, 16 to 24GB memory, and 4 disk drives (1TB or 2TB) directly attached using the motherboard controller. These are often available as twins with two motherboards and 8 drives in a single 2U cabinet.
- Storage Heavy Configuration (2U/machine): Two quad core CPUs, 16 to 24GB memory, and 12 disk drives (1TB or 2TB). The power consumption for this type of machine starts around ~200W in idle state and can go as high as ~350W when active.
- Compute Intensive Configuration (2U/machine): Two quad core CPUs, 48-72GB memory, and 8 disk drives (1TB or 2TB). These are often used when a combination of large in-memory models and heavy reference data caching is required.
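To put the namenode rule of thumb and these configurations together, here is a back-of-the-envelope sizing sketch. The function name is mine, and the 128MB block size is an assumption (64MB was the default in older Hadoop versions); real block counts also depend on file sizes, since each small file still consumes a block:

```python
def cluster_sizing(nodes, disks_per_node, disk_tb, replication=3, block_mb=128):
    """Rough HDFS capacity math; illustrative only."""
    raw_tb = nodes * disks_per_node * disk_tb
    usable_tb = raw_tb / replication  # HDFS keeps `replication` copies of each block
    # Rule of thumb from the article: ~1GB of namenode memory per million blocks.
    blocks = (usable_tb * 1024 * 1024) / block_mb  # TB -> MB, then / block size
    namenode_gb = round(blocks / 1_000_000, 1)
    return raw_tb, usable_tb, namenode_gb

# 100 "balanced" datanodes with 4 x 1TB disks:
print(cluster_sizing(100, 4, 1))
```

For that example the math gives 400TB raw, ~133TB usable, and on the order of 1GB of namenode memory for block metadata, which is consistent with the article’s point that 32GB on the namenode of a 100-node cluster leaves plenty of room to grow.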
There is a lot more in the article and I’d strongly encourage you to add it to your reading list before jumping in to build a Hadoop cluster. You should also check Amdahl’s law applied to Hadoop provisioning to understand the dynamics of the cluster.
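For reference, Amdahl’s law itself is a one-liner; it shows why adding nodes stops paying off once the serial portion of a job dominates. The 95% figure below is just an illustrative assumption:

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's law: speedup on n machines when only part of a job parallelizes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n)

# Even with 95% of the job parallelizable, 100 nodes give ~17x, not 100x:
print(round(amdahl_speedup(0.95, 100), 1))
```

This is the dynamic to keep in mind when deciding between a few heavy compute-intensive nodes and many lighter ones.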