Ricky Ho has two great articles on how MapReduce is implemented by Hadoop and Cloud MapReduce:
Cloud MapReduce inherits the scalability and resiliency of the underlying cloud services, which greatly simplifies its architecture.
- Cloud MapReduce doesn’t need central coordinator components (like the NameNode and JobTracker in the Hadoop environment). It simply stores job progress status in a distributed metadata store (SimpleDB).
- Cloud MapReduce doesn’t need to worry about scalability in the communication path, or about how data can be moved efficiently between nodes; all of this is taken care of by the underlying cloud OS.
- Cloud MapReduce doesn’t need to worry about disk I/O issues, because all storage is effectively remote and handled by the cloud OS.
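The coordinator-free idea in the first point can be sketched in a few lines. This is a minimal illustration, not the actual Cloud MapReduce code: the `MetadataStore` class, key layout, and `committed` status value are all hypothetical stand-ins for SimpleDB and its item/attribute model. Each worker writes its own progress record, and any node can decide a phase is finished by polling the store, so no JobTracker-style process is needed.

```python
class MetadataStore:
    """Hypothetical stand-in for a distributed metadata store like SimpleDB."""

    def __init__(self):
        self._items = {}

    def put(self, key, attrs):
        # SimpleDB-style put: store a set of attributes under an item key.
        self._items[key] = attrs

    def query(self, prefix):
        # SimpleDB-style query: return all items whose key matches a prefix.
        return {k: v for k, v in self._items.items() if k.startswith(prefix)}


def record_progress(store, job_id, phase, worker_id, status):
    # Each worker writes its own status record; there is no central tracker.
    store.put(f"{job_id}/{phase}/{worker_id}", {"status": status})


def phase_done(store, job_id, phase, expected_workers):
    # Any node can detect phase completion by polling the shared store,
    # instead of asking a coordinator like the JobTracker.
    rows = store.query(f"{job_id}/{phase}/")
    committed = [r for r in rows.values() if r["status"] == "committed"]
    return len(committed) == expected_workers


store = MetadataStore()
for w in range(3):
    record_progress(store, "job-1", "map", f"worker-{w}", "committed")
print(phase_done(store, "job-1", "map", expected_workers=3))  # -> True
```

The design choice this illustrates: coordination state lives in storage rather than in a process, so there is no single component whose failure stalls the job.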
The Cloud MapReduce implementation is detailed in this ☞ paper (PDF).
These are very interesting details on how to build a scalable (and probably also fault-tolerant) solution.