NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



Integrating MongoDB and Hadoop at Foursquare

Joe Ennever, Data Infrastructure Engineer, describes how Foursquare is integrating Hadoop with MongoDB. The process is complicated and looks quite fragile:

Every six hours a process running on each node issues an LVM snapshot command for the disk on which a given mongo database runs. A central coordinator ensures that this happens as close as possible to simultaneously across all clusters. […]

A separate process is continuously monitoring this directory, waiting for a complete cluster to be available (i.e., every shard from the cluster exists). Once that happens, the files are downloaded, decompressed, and extracted in parallel across several servers. When an individual file finishes downloading, we launch the mongodump utility to write the data back to HDFS.

I’d say calling this process of moving data between MongoDB and Hadoop integration is an overstatement. It looks more like a multi-step, centrally coordinated but still optimistical ETL process. And it doesn’t stop here as the process also performs a BSON to Thrift conversion:

Having our entire Mongo database available in Hadoop is nice, but it’s not very useful unless it can be read by MapReduce jobs. Every Mongo collection that gets read in Hadoop has an associated Thrift definition, and a custom input format turns the BSON into the Thrift object. Since we use Scala, the Thrift objects are generated Scala code, using Spindle.

All these make me think that that pulling data from MongoDB into Hadoop is not really a critical part of the Foursquare’s data flow.

✚ As a side note, here’s what I’d call an ideal solution. And it’s already available.

Original title and link: Integrating MongoDB and Hadoop at Foursquare (NoSQL database©myNoSQL)