Joe Ennever, Data Infrastructure Engineer, describes how Foursquare is integrating Hadoop with MongoDB. The process is complicated and looks quite fragile:
Every six hours a process running on each node issues an LVM snapshot command for the disk on which a given mongo database runs. A central coordinator ensures that this happens as close as possible to simultaneously across all shards. […]

A separate process is continuously monitoring this directory, waiting for a complete cluster to be available (i.e., every shard from the cluster exists). Once that happens, the files are downloaded, decompressed, and extracted in parallel across several servers. When an individual file finishes downloading, we launch the mongodump utility to write the data back to HDFS.
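To make the coordination step concrete, here is a minimal sketch of the "wait for a complete cluster, then fan out" logic the post describes. Everything in it is an assumption: the snapshot directory, the .snap.gz naming, shardCount, and processShard are hypothetical stand-ins, not Foursquare's actual code.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import scala.jdk.CollectionConverters._

object SnapshotWatcher {
  // Hypothetical layout: every shard uploads one "<shard>.snap.gz" file here.
  def shardFiles(dir: Path): List[Path] =
    Files.list(dir).iterator.asScala
      .filter(_.toString.endsWith(".snap.gz")).toList

  // Poll until a snapshot file exists for every shard of the cluster.
  def awaitCompleteCluster(dir: Path, shardCount: Int, pollMs: Long = 60000L): List[Path] = {
    var files = shardFiles(dir)
    while (files.size < shardCount) {
      Thread.sleep(pollMs)
      files = shardFiles(dir)
    }
    files
  }

  // Placeholder for the per-shard download / decompress / mongodump-to-HDFS work.
  def processShard(snapshot: Path): Unit =
    println(s"processing ${snapshot.getFileName}")

  def main(args: Array[String]): Unit = {
    val snapshots = awaitCompleteCluster(Paths.get("/snapshots/checkins"), shardCount = 12)
    // Shards are independent, so they can be handled in parallel.
    val work = Future.traverse(snapshots)(s => Future(processShard(s)))
    Await.result(work, Duration.Inf)
  }
}
```

Note that even this sketch is optimistic in the sense discussed below: the watcher only checks that a file per shard exists, not that the snapshots were actually taken simultaneously.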
I’d say calling this process of moving data between MongoDB and Hadoop "integration" is an overstatement. It looks more like a multi-step, centrally coordinated, but still optimistic ETL process. And it doesn’t stop here, as the process also performs a BSON-to-Thrift conversion:
Having our entire Mongo database available in Hadoop is nice, but it’s not very useful unless it can be read by MapReduce jobs. Every Mongo collection that gets read in Hadoop has an associated Thrift definition, and a custom input format turns the BSON into the Thrift object. Since we use Scala, the Thrift objects are generated Scala code, using Spindle.
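The post doesn’t show the custom input format itself, so here is a minimal sketch of the conversion step underneath it: reading documents out of a mongodump-style .bson file (documents are simply concatenated, each carrying its own little-endian length prefix) and mapping them onto a typed Scala record. The Checkin case class and its field names are hypothetical stand-ins for a Spindle-generated Thrift class; BasicBSONDecoder comes from the MongoDB Java driver’s org.bson package.

```scala
import java.io.{BufferedInputStream, DataInputStream, FileInputStream}
import org.bson.{BSONObject, BasicBSONDecoder}

object BsonToThriftSketch {
  // Hypothetical stand-in for a Spindle-generated Thrift/Scala record.
  case class Checkin(userId: Long, venueId: String)

  // Read one document from a mongodump .bson file, or None at end of file.
  def readDoc(in: DataInputStream, decoder: BasicBSONDecoder): Option[BSONObject] = {
    val header = new Array[Byte](4)
    val n = in.read(header)
    if (n == -1) return None                       // clean end of file
    if (n < 4) in.readFully(header, n, 4 - n)
    // The int32 prefix counts the whole document, prefix included.
    val len = (header(0) & 0xff) | ((header(1) & 0xff) << 8) |
              ((header(2) & 0xff) << 16) | ((header(3) & 0xff) << 24)
    val buf = new Array[Byte](len)
    Array.copy(header, 0, buf, 0, 4)
    in.readFully(buf, 4, len - 4)
    Some(decoder.readObject(buf))
  }

  // The field names here are assumptions; a real converter would mirror the
  // collection's Thrift definition field by field.
  def toCheckin(doc: BSONObject): Checkin =
    Checkin(
      userId  = doc.get("userId").asInstanceOf[Number].longValue,
      venueId = String.valueOf(doc.get("venueId"))
    )

  def main(args: Array[String]): Unit = {
    val in = new DataInputStream(new BufferedInputStream(new FileInputStream(args(0))))
    val decoder = new BasicBSONDecoder
    try {
      Iterator.continually(readDoc(in, decoder))
        .takeWhile(_.isDefined)
        .flatten
        .map(toCheckin)
        .foreach(println)
    } finally in.close()
  }
}
```

In the real pipeline a loop like this would presumably live inside a Hadoop RecordReader, so that each map task deserializes its own slice of the dump.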
All this makes me think that pulling data from MongoDB into Hadoop is not really a critical part of Foursquare’s data flow.
✚ As a side note, here’s what I’d call an ideal solution. And it’s already available.
Original title and link: Integrating MongoDB and Hadoop at Foursquare