While I found the whole post very educational, and quite balanced considering the topic, the part I'm linking to is about integrating MongoDB with Hadoop. After reading the story of integrating MongoDB and Hadoop at Foursquare, quite a few questions were bugging me. This post doesn't answer any of them, but it adds some details about the existing tools, describes a completely different solution, and reinforces what seems to be an overarching theme whenever Hadoop and MongoDB appear in the same sentence:
We’re big users of Hadoop MapReduce and tend to lean on it whenever we need
to make large scale migrations, especially ones with lots of transformation.
That fact, along with our existing conversion project from before, led us to
10gen’s mongo-hadoop project, which has input and output formats for Hadoop.
We immediately realized that the InputFormat which connected to a MongoDB
cluster was ill-suited to our usage. We had 3TB of partially-overlapping
data across 2 clusters. After calculating input splits for a few hours, it
began pulling documents at an uncomfortably slow pace. It was slow enough,
in fact, that we developed an alternative plan.
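For readers unfamiliar with Hadoop internals: an InputFormat's getSplits() divides the input into independent chunks, one per map task, and mongo-hadoop's splitter computes those boundaries by querying the live MongoDB cluster, which is roughly why the step above took hours over 3TB. A toy, self-contained sketch of the splitting idea (hypothetical names and a simple count-based scheme, not mongo-hadoop's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of InputFormat.getSplits(): carve a collection of
// `totalDocs` documents into [start, end) ranges of at most
// `splitSize` documents, one range per map task. Illustration only;
// mongo-hadoop's real splitter asks the MongoDB cluster for
// shard/chunk boundaries instead of using a flat document count.
public class ToySplitter {
    public static List<long[]> getSplits(long totalDocs, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < totalDocs; start += splitSize) {
            long end = Math.min(start + splitSize, totalDocs);
            splits.add(new long[] {start, end}); // one mapper reads this range
        }
        return splits;
    }

    public static void main(String[] args) {
        for (long[] s : getSplits(10, 4)) {
            System.out.println(s[0] + "-" + s[1]); // prints 0-4, 4-8, 8-10
        }
    }
}
```

The point of splitting is parallelism: each range is handed to a separate mapper, so the split calculation has to finish before any documents are pulled at all.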
You’ll have to read the post to learn how they accomplished their goal, but as a spoiler, it was once again more of an ETL process than a true integration.
✚ The corresponding HN thread; it focuses mostly on the MongoDB-to-Cassandra parts of the migration.
Original title and link: The Hadoop as ETL part in migrating from MongoDB to Cassandra at FullContact