After looking at the two default options, Groupon engineers came up with their own solution: an involved procedure for backing up MongoDB's raw data files into a Hadoop cluster, paired with a custom InputFormat to read them:
> To solve this problem we backup raw Mongo data files to our Hadoop Distributed File System (HDFS), then read them directly using an InputFormat. This approach has the drawback of not reading the most current Mongo data for each MapReduce, but it means we have a backup of our data in HDFS and can map over an entire collection faster because of the throughput of our Hadoop cluster. Moving data from a sharded Mongo cluster into HDFS, however, has challenges of its own.
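Groupon's post doesn't show the reader itself, so here is only a minimal sketch of the shape such an InputFormat takes (class names are mine, not Groupon's). It assumes the backed-up files have been reduced to back-to-back BSON documents; parsing Mongo's actual on-disk extent and record layout, which a real reader over raw data files would have to handle, is elided.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical InputFormat: emits each BSON document in a backed-up file
// as a raw byte blob; a mapper would decode it with a BSON library.
public class MongoBackupInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Documents are not aligned to HDFS block boundaries, so this
    // sketch reads each backed-up file as a single split.
    return false;
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new BsonRecordReader();
  }

  static class BsonRecordReader
      extends RecordReader<NullWritable, BytesWritable> {
    private FSDataInputStream in;
    private long length;
    private long pos;
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException {
      FileSplit fileSplit = (FileSplit) split;
      Configuration conf = context.getConfiguration();
      FileSystem fs = fileSplit.getPath().getFileSystem(conf);
      in = fs.open(fileSplit.getPath());
      length = fileSplit.getLength();
      pos = 0;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (pos >= length) {
        return false;
      }
      // Per the BSON spec, a document starts with its total size as a
      // little-endian int32 (the size includes these four bytes).
      byte[] header = new byte[4];
      in.readFully(header);
      int docLen = (header[0] & 0xff)
          | (header[1] & 0xff) << 8
          | (header[2] & 0xff) << 16
          | (header[3] & 0xff) << 24;
      byte[] doc = new byte[docLen];
      System.arraycopy(header, 0, doc, 0, 4);
      in.readFully(doc, 4, docLen - 4);
      value.set(doc, 0, docLen);
      pos += docLen;
      return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() {
      return length == 0 ? 1.0f : Math.min(1.0f, pos / (float) length);
    }

    @Override
    public void close() throws IOException {
      if (in != null) { in.close(); }
    }
  }
}
```

Treating each file as one unsplittable split keeps the sketch simple; a production reader would compute document-aligned splits so the map phase actually benefits from the Hadoop cluster's throughput, which is the whole point of the approach.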
While I used "integrating" in the title, this looks more like patching the two to work together.
Original title and link: Integrating MongoDB and Hadoop at Groupon (©myNoSQL)