Integrating MongoDB and Hadoop at Groupon

After looking at the two default options, Groupon engineers came up with a custom solution: a complicated procedure for backing up MongoDB’s data files into a Hadoop cluster, plus a custom InputFormat to read them:

To solve this problem we backup raw Mongo data files to our Hadoop Distributed File System (HDFS), then read them directly using an InputFormat. This approach has the drawback of not reading the most current Mongo data for each MapReduce, but it means we have a backup of our data in HDFS and can map over an entire collection faster because of the throughput of our Hadoop cluster. Moving data from a sharded Mongo cluster into HDFS, however, has challenges of its own.
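
As a rough illustration of what the record-reading half of such an InputFormat has to do, here is a minimal Python sketch. It is not Groupon's code, and it makes a big simplifying assumption: it reads a plain stream of concatenated BSON documents (the layout `mongodump` writes), not the raw Mongo data files described above, whose on-disk layout is more involved. The only format fact it relies on is that every BSON document starts with its own total size as a little-endian int32 and ends with a zero byte. The names `iter_bson_documents` and `encode_int_doc` are hypothetical helpers made up for this example.

```python
import io
import struct
from typing import BinaryIO, Iterator


def iter_bson_documents(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each raw BSON document from a stream of concatenated documents."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        # The first 4 bytes hold the total document size (prefix included).
        (size,) = struct.unpack("<i", header)
        if size < 5:
            # Smallest valid document: 4-byte size plus the trailing 0x00.
            raise ValueError(f"invalid document size: {size}")
        body = stream.read(size - 4)
        if len(body) != size - 4 or body[-1] != 0:
            raise ValueError("truncated or corrupt document")
        yield header + body


def encode_int_doc(key: str, value: int) -> bytes:
    """Hypothetical helper: hand-encode {key: value} with an int32 value."""
    # Element = type byte 0x10 (int32) + key as C string + little-endian value.
    element = b"\x10" + key.encode("utf-8") + b"\x00" + struct.pack("<i", value)
    body = element + b"\x00"  # document terminator
    return struct.pack("<i", 4 + len(body)) + body


if __name__ == "__main__":
    data = encode_int_doc("n", 7) + encode_int_doc("n", 8)
    for raw in iter_bson_documents(io.BytesIO(data)):
        print(len(raw), "bytes")
```

In a real Hadoop job this loop would live inside a Java RecordReader, and the harder part is the one the quote hints at: working out where a file split begins so that each mapper starts on a document boundary.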

While I used “integrating” in the title, this looks more like patching the two to work together.

Original title and link: Integrating MongoDB and Hadoop at Groupon (NoSQL database © myNoSQL)