Hadoop, NY Times and Open Source Libraries

I guess everyone with some interest in Hadoop already knows the story of the NY Times converting more than 130 years' worth of articles (11 million articles in TIFF format) into PDFs using Hadoop and Amazon EC2 [1]. What I didn’t know, though, is that this wasn’t a one-time project: the NY Times continues to use Hadoop for other projects [2], and they open sourced [3] the Map/Reduce Toolkit (MRToolkit) [4] for use with a lesser-known Hadoop feature, Hadoop Streaming [5].

It takes care of the details of setting up and running Apache Hadoop jobs, and encapsulates most of the complexity of writing map and reduce steps. The toolkit, which is Ruby-based, provides the framework — you only have to supply the details of the map and reduce steps.
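
To make the "map and reduce steps" concrete: Hadoop Streaming runs any executable that reads records from standard input and writes tab-separated key/value lines to standard output, and it sorts the mapper output by key before the reducer sees it. The sketch below is a word count in plain Python, not MRToolkit's Ruby API; the names `map_lines` and `reduce_sorted` and the `map`/`reduce` command-line argument are hypothetical choices made for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Map step: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_sorted(lines):
    """Reduce step: sum the counts per key. Input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: the script runs as a mapper when called with
# "map" and as a reducer when called with any other argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_sorted
    sys.stdout.writelines(out + "\n" for out in step(sys.stdin))
```

Assuming the file is saved as a script, the same file could be passed to Streaming's `-mapper` and `-reducer` options with the matching argument. Locally, `sorted(map_lines(...))` fed into `reduce_sorted` imitates the sort that Hadoop performs between the two steps.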

There is also another Ruby library for Hadoop Streaming, ☞ wukong, which simplifies the data interaction layer:

Treat your dataset like a

  • stream of lines when it’s efficient to process by lines
  • stream of field arrays when it’s efficient to deal directly with fields
  • stream of lightweight objects when it’s efficient to deal with objects
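
The three views can be pictured without wukong itself. What follows is a sketch in plain Python, not wukong's Ruby API; the `Article` record type, its two fields, and the tab-separated input layout are assumptions made only for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Article(NamedTuple):
    """Hypothetical record type; not taken from wukong."""
    year: int
    title: str


def as_lines(raw: Iterable[str]) -> Iterator[str]:
    """View 1: a stream of lines, with the trailing newline removed."""
    for line in raw:
        yield line.rstrip("\n")


def as_fields(raw: Iterable[str]) -> Iterator[List[str]]:
    """View 2: a stream of field arrays, split on tabs."""
    for line in as_lines(raw):
        yield line.split("\t")


def as_objects(raw: Iterable[str]) -> Iterator[Article]:
    """View 3: a stream of lightweight objects built from the fields."""
    for year, title in as_fields(raw):
        yield Article(int(year), title)
```

Each view is built on the previous one, so a job pays for field splitting or object construction only when it asks for that level of structure.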

Do you have any favorite library that you use with Hadoop? Is it in our NoSQL libraries list?