Hadoop and Complex Data Processing Workflows with Cascading

Understanding the basic concepts behind MapReduce is not very difficult, but those making extensive use of MapReduce inside Hadoop quickly face new challenges such as the following (a sketch of the manual approach appears after the list):

  • how can you run multiple map and/or reduce phases in your data processing?
  • how can you coordinate the execution flow of more complex processing scenarios?
  • how can you perform additional work between map/reduce phases?
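
To see what the first two questions mean in practice, here is a minimal sketch of chaining two phases by hand with the plain Hadoop API: each phase is a separate Job, intermediate results are parked in HDFS, and the driver sequences everything itself. The stock identity Mapper and Reducer stand in for real logic, and the paths are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoPhaseDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Phase 1: the identity Mapper/Reducer are placeholders for real logic.
            Job first = Job.getInstance(conf, "phase-1");
            first.setJarByClass(TwoPhaseDriver.class);
            first.setMapperClass(Mapper.class);       // placeholder mapper
            first.setReducerClass(Reducer.class);     // placeholder reducer
            first.setOutputKeyClass(LongWritable.class);
            first.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(first, new Path("/input"));         // hypothetical path
            FileOutputFormat.setOutputPath(first, new Path("/tmp/phase-1")); // intermediate data in HDFS
            if (!first.waitForCompletion(true)) System.exit(1);              // manual coordination

            // Phase 2 must be wired to phase 1's output by hand.
            Job second = Job.getInstance(conf, "phase-2");
            second.setJarByClass(TwoPhaseDriver.class);
            second.setMapperClass(Mapper.class);
            second.setReducerClass(Reducer.class);
            second.setOutputKeyClass(LongWritable.class);
            second.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(second, new Path("/tmp/phase-1"));
            FileOutputFormat.setOutputPath(second, new Path("/output"));     // hypothetical path
            System.exit(second.waitForCompletion(true) ? 0 : 1);
        }
    }

Every additional phase means another Job block like these, another intermediate HDFS directory to manage, and more hand-rolled sequencing logic in the driver.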

Addressing these new challenges is the goal of the ☞ Cascading project:

Cascading is a feature rich API for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.
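
To make that concrete, below is a minimal sketch modeled on Cascading's well-known word-count example (package names follow the Cascading 2.x layout; the paths are again hypothetical). Note there is no explicit Job anywhere: the pipe assembly describes the logical workflow, and Cascading plans it into however many map/reduce phases are needed.

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCountFlow {
        public static void main(String[] args) {
            // Taps abstract the physical data source and sink (here: HDFS text files).
            Tap source = new Hfs(new TextLine(new Fields("line")), "/input");  // hypothetical path
            Tap sink = new Hfs(new TextLine(), "/output", SinkMode.REPLACE);   // hypothetical path

            // The pipe assembly is the logical workflow.
            Pipe assembly = new Pipe("wordcount");
            assembly = new Each(assembly, new Fields("line"),
                    new RegexSplitGenerator(new Fields("word"), "\\s+")); // split lines into words
            assembly = new GroupBy(assembly, new Fields("word"));         // group identical words
            assembly = new Every(assembly, new Count(new Fields("count"))); // count each group

            // Bind the assembly to concrete taps and run it on the Hadoop cluster.
            Flow flow = new HadoopFlowConnector().connect(source, sink, assembly);
            flow.complete();
        }
    }

Because Cascading's planner owns the translation into map/reduce phases, extending the assembly with another Each or GroupBy does not require hand-writing and coordinating another Hadoop job.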

Christopher Curtin's slides embedded below offer a good overview of what can be achieved with Cascading (starting with slide 20).