After some Hadoop hardware recommendations and using Amdahl's law for Hadoop provisioning, Cloudera shares its know-how on Hadoop/HBase capacity planning, covering aspects like network, memory, disk, and CPU:
Since we are talking about data, the first crucial parameter is how much disk space you need across all of the Hadoop nodes to store all of your data, and what compression algorithm you are going to use to store it. For the MapReduce components, an important consideration is how much computational power you need to process the data and whether the jobs you are going to run on the cluster are CPU- or I/O-intensive. […] Finally, HBase is mainly memory driven, so we need to consider the data access pattern of your application and how much memory you need so that the HBase nodes do not swap data to disk too often. Most written data ends up in memstores before it is finally flushed to disk, so you should plan for more memory in write-intensive workloads like web crawling.
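To make the disk-space part concrete, here is a minimal back-of-the-envelope sizing sketch. The replication factor, compression ratio, temporary-space overhead, and headroom figures are my own illustrative assumptions, not numbers from the Cloudera post:

```python
# Rough disk-sizing sketch for a Hadoop cluster (illustrative assumptions only).

def raw_disk_needed_tb(data_tb, replication=3, compression_ratio=0.4,
                       temp_overhead=0.25, headroom=0.30):
    """Estimate the raw disk capacity (TB) needed across the cluster.

    data_tb           -- uncompressed data you plan to store
    replication       -- HDFS replication factor (3 is the common default)
    compression_ratio -- on-disk size / raw size after compression (assumed 0.4)
    temp_overhead     -- extra space for MapReduce intermediate/shuffle data (assumed 25%)
    headroom          -- fraction of each disk kept free for safety (assumed 30%)
    """
    stored = data_tb * compression_ratio * replication
    with_temp = stored * (1 + temp_overhead)
    return with_temp / (1 - headroom)

if __name__ == "__main__":
    # e.g. 100 TB of raw data -> approximate raw cluster capacity required
    print(f"{raw_disk_needed_tb(100):.0f} TB of raw disk")
```

With these assumed figures, 100 TB of raw data works out to roughly 215 TB of raw disk across the cluster; plug in your own compression and overhead numbers to get a first-cut estimate before sizing nodes.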
Original title and link for this post: Hadoop/HBase Capacity Planning (published on the NoSQL blog: myNoSQL)