Hadoop and Complex Data Processing Workflows with Cascading
Understanding the basic concepts behind MapReduce is not very difficult, but those who use MapReduce extensively inside Hadoop already face new challenges, such as:
- how can you run multiple map and/or reduce phases in your data processing?
- how can you better coordinate the data processing execution flow for more complex scenarios?
- how can you perform additional work between map/reduce phases?
Addressing these new challenges is the goal of the Cascading project:
Cascading is a feature rich API for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.
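To make the three questions above more concrete, here is a minimal in-memory sketch in plain Python. It is not Cascading's API and does not touch Hadoop. It only illustrates the shape of the problem: two map/reduce phases chained together, with an extra step between them. All names (`run_phase`, `run_workflow`, `drop_rare`, and so on) are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def run_phase(records, mapper, reducer):
    """Run one map -> shuffle -> reduce phase entirely in memory."""
    groups = defaultdict(list)
    for record in records:
        # A mapper may emit zero or more (key, value) pairs per record.
        for key, value in mapper(record):
            groups[key].append(value)
    # Visiting keys in sorted order mimics the sorted input a reducer sees.
    return [reducer(key, groups[key]) for key in sorted(groups)]


def run_workflow(records, steps):
    """Coordinate the flow: each step consumes the previous step's output."""
    for step in steps:
        records = step(records)
    return records


# Phase 1: count how often each word occurs.
def split_words(line):
    for word in line.lower().split():
        yield word, 1


def sum_counts(word, ones):
    return word, sum(ones)


# Work between phases: drop words seen fewer than twice (arbitrary threshold).
def drop_rare(pairs):
    return [(word, count) for word, count in pairs if count >= 2]


# Phase 2: group the surviving words by their count.
def key_by_count(pair):
    word, count = pair
    yield count, word


def collect_words(count, words):
    return count, sorted(words)


def word_frequency_workflow(lines):
    steps = [
        lambda recs: run_phase(recs, split_words, sum_counts),
        drop_rare,
        lambda recs: run_phase(recs, key_by_count, collect_words),
    ]
    return run_workflow(lines, steps)
```

Calling `word_frequency_workflow(["a b a", "b c a", "d d"])` returns `[(2, ["b", "d"]), (3, ["a"])]`. On a real cluster each phase would be a separate MapReduce job, and wiring such jobs together by hand is the coordination work that a workflow API like Cascading aims to take over.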
Christopher Curtin’s slides, embedded below, offer a good overview of what can be achieved with Cascading (starting with slide 20).