Hadoop on EC2: A Detailed Guide

Lars George, with his very (very) detailed style, explains step by step how to set up Hadoop on EC2.

Let’s jump into it head first and solve the problem of actually launching a cluster. You have heard that Hadoop is shipped with EC2 support, but how do you actually start up a Hadoop cluster on EC2? You do have a couple of choices and as Tom’s article above explains you could start all instances in the cluster by hand. But why would you want to do that if there are scripts available that do all the work for you? And to complicate matters, how do you select the AMI (the Amazon Machine Image) that has the Hadoop version you need or want? Does it have Hive installed for your subsequent analysis of the collected data? Just running a check to count the available public Hadoop images returns 41! That gets daunting very quickly. Sure you can roll your own - but that implies even more manual labor that you probably better spend on productive work. But there is help available.

Lars uses the CDH tools for this tutorial and points out the incubating Whirr project.

