Hadoop, NY Times and Open Source Libraries
I guess everyone with some interest in Hadoop already knows the story of the NY Times converting more than 130 years’ worth of articles (11 million articles in TIFF format) into PDFs using Hadoop and Amazon EC2 [1]. What I didn’t know, though, is that this wasn’t a one-time project: the NY Times continues to use Hadoop for other projects [2], and they open sourced [3] the Map/Reduce Toolkit (MRToolkit) [4] for use with a not so well known feature, Hadoop Streaming [5].
It takes care of the details of setting up and running Apache Hadoop jobs, and encapsulates most of the complexity of writing map and reduce steps. The toolkit, which is Ruby-based, provides the framework — you only have to supply the details of the map and reduce steps.
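To make "the details of the map and reduce steps" concrete, here is a minimal word-count sketch in the line-oriented style that Hadoop Streaming expects: the map step emits tab-separated key/value records, the framework sorts them by key, and the reduce step aggregates consecutive records that share a key. It is written in plain Python rather than with MRToolkit, and the names `map_step`, `reduce_step` and `run_locally` are made up for illustration; `run_locally` only imitates the sort that Hadoop performs between the two steps.

```python
from itertools import groupby


def map_step(lines):
    """Map step: emit one 'word<TAB>1' record per word in each input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_step(sorted_records):
    """Reduce step: sum the counts of consecutive records sharing a key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Hypothetical stand-in for the framework: map, sort by key, reduce."""
    return list(reduce_step(sorted(map_step(lines))))
```

In a real streaming job the two steps would typically live in separate scripts that read `sys.stdin` and print each record, with Hadoop doing the sorting and distribution in between.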
There is also another Ruby library for Hadoop Streaming, ☞ wukong, which simplifies the data interaction layer:
Treat your dataset like a
- stream of lines when it’s efficient to process by lines
- stream of field arrays when it’s efficient to deal directly with fields
- stream of lightweight objects when it’s efficient to deal with objects
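The three views in the list above can be sketched as layered generators. This is a plain Python illustration of the idea, not wukong's actual API: the `Article` record and the function names `as_lines`, `as_fields` and `as_objects` are assumptions made up for the example, as is the tab-separated three-column layout.

```python
from collections import namedtuple

# Hypothetical record layout, for illustration only.
Article = namedtuple("Article", ["year", "section", "title"])


def as_lines(raw):
    """View 1: a stream of lines with the trailing newline removed."""
    for line in raw:
        yield line.rstrip("\n")


def as_fields(raw, sep="\t"):
    """View 2: a stream of field arrays, split on the separator."""
    for line in as_lines(raw):
        yield line.split(sep)


def as_objects(raw, sep="\t"):
    """View 3: a stream of lightweight objects built from the fields."""
    for year, section, title in as_fields(raw, sep):
        yield Article(int(year), section, title)
```

Each view builds on the previous one, so a job can stop at whichever level is cheapest for the work at hand: lines for simple filtering, field arrays for column arithmetic, objects when named attributes make the logic clearer.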
Do you have any favorite library that you use with Hadoop? Is it in our NoSQL libraries list?
References
- [1] ☞ Self-service, Prorated Super Computing Fun! (2007)
- [2] ☞ The New York Times Archives + Amazon Web Services = TimesMachine (2008)
- [3] ☞ Announcing the Map/Reduce Toolkit (2009)
- [4] ☞ mrtoolkit
- [5] ☞ Hadoop Streaming
- [6] ☞ Easy Map-Reduce with Hadoop Streaming
- [7] ☞ wukong