An interesting look at what happens during the map phase in Hadoop and the cost of emitting a large number of key-value pairs:
- a direct negative impact on map time and CPU usage, due to more serialization
- an indirect negative impact on CPU, due to more spilling and the additional deserialization in the combine step
- a direct negative impact on the map task, due to more intermediate files, which make the final merge more expensive (see the sketch after this list)
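To make the takeaway concrete, here is a minimal sketch of the usual mitigation: in-mapper combining, which aggregates locally and emits one pair per distinct key instead of one per occurrence. The word-count scenario and the class name `AggregatingWordCountMapper` are illustrative assumptions, not from the dynaTrace post:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical illustration (not from the dynaTrace post): counts are
// aggregated in a local map and emitted once per distinct word in
// cleanup(), instead of once per occurrence in map().
public class AggregatingWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // No context.write() here: nothing is serialized per occurrence.
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit each distinct word once; far fewer pairs get serialized,
        // spilled, and merged than with one write per occurrence.
        Text word = new Text();
        IntWritable count = new IntWritable();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            word.set(e.getKey());
            count.set(e.getValue());
            context.write(word, count);
        }
    }
}
```

A combiner configured via `job.setCombinerClass(...)` also reduces intermediate data, but it runs after pairs have already been serialized into the map output buffer, so it still pays the serialization and deserialization costs listed above; in-mapper aggregation avoids emitting the extra pairs in the first place, at the cost of holding the aggregate map in memory.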
The main point of the dynaTrace blog post is that even though Hadoop makes it easy to throw more hardware at a problem, wasting resources through inefficient code in MapReduce tasks carries a noticeable and measurable cost.
Original title and link: MapReduce With Hadoop: What Happens During Mapping (©myNoSQL)