ETL: All content tagged as ETL in NoSQL databases and polyglot persistence

The Hadoop as ETL part in migrating from MongoDB to Cassandra at FullContact

While I’ve found the whole post very educational, and very balanced considering the topic, the part I’m linking to is about integrating MongoDB with Hadoop. After reading the story of integrating MongoDB and Hadoop at Foursquare, there were quite a few questions bugging me. This post doesn’t answer any of them, but it brings in some more details about existing tools, a completely different solution, and what seems to be an overarching theme whenever Hadoop and MongoDB show up in the same sentence:

We’re big users of Hadoop MapReduce and tend to lean on it whenever we need to make large scale migrations, especially ones with lots of transformation. That fact along with our existing conversion project from before, we used 10gen’s mongo-hadoop project which has input and output formats for Hadoop. We immediately realized that the InputFormat which connected to a MongoDB cluster was ill-suited to our usage. We had 3TB of partially-overlapping data across 2 clusters. After calculating input splits for a few hours, it began pulling documents at an uncomfortably slow pace. It was slow enough, in fact, that we developed an alternative plan.

You’ll have to read the post to learn how they accomplished their goal, but as a spoiler, it was once again more of an ETL process than an integration.
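For context, here is roughly how a MapReduce job gets wired to MongoDB through the mongo-hadoop connector’s MongoInputFormat, the piece that spent hours computing input splits in their case. This is a minimal sketch, not FullContact’s code; the class and property names follow 10gen’s mongo-hadoop project as I remember it, and the mapper, database, and field names are made up.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;

public class MongoToHdfsExport {

    /** Hypothetical mapper: emits the document id and one field as text. */
    public static class PersonMapper extends Mapper<Object, BSONObject, Text, Text> {
        @Override
        protected void map(Object id, BSONObject doc, Context ctx)
                throws IOException, InterruptedException {
            Object email = doc.get("email");
            ctx.write(new Text(String.valueOf(id)),
                      new Text(email == null ? "" : email.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The connector reads straight from a live cluster; input splits are
        // computed against the source collection before any mapper runs,
        // which is the step that proved painfully slow at 3TB scale.
        conf.set("mongo.input.uri", "mongodb://mongo-host:27017/contacts.people");

        Job job = Job.getInstance(conf, "mongo-export");
        job.setJarByClass(MongoToHdfsExport.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setMapperClass(PersonMapper.class);
        job.setNumReduceTasks(0);                 // map-only extract/transform
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/staging/mongo-dump"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```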

✚ The corresponding HN thread; it focuses mostly on the MongoDB-to-Cassandra migration parts.

Original title and link: The Hadoop as ETL part in migrating from MongoDB to Cassandra at FullContact (NoSQL database©myNoSQL)

via: http://www.fullcontact.com/blog/mongo-to-cassandra-migration/


On the topic of importing data into Neo4j

This post, authored by Rik van Bruggen, mentions using the Talend ETL tool, which brought an import job down from more than an hour to a couple of minutes:

This is where it got interesting. The spreadsheet import mechanism worked ok - but it really wasn’t great. It took more than an hour to get the dataset to load - so I had to look for alternatives. Thanks to my French friend and colleague Cédric, I bumped into the Talend ETL (Extract - Transform - Load) tools. I found out that there was a proper neo4j connector that was developed by Zenika, a French integrator that really seems to know their stuff.

There’s also a short video in the original post demoing Talend.

✚ I’ve mentioned what I see as the complexity of importing data into graph databases in On Importing Data into Neo4j.

Original title and link: On the topic of importing data into Neo4j (NoSQL database©myNoSQL)

via: http://blog.neo4j.org/2013/07/fun-with-music-neo4j-and-talend.html


On Importing Data into Neo4j

For operations where massive amounts of data flow into or out of a Neo4j database, the interaction with the available APIs needs more care than your usual ad-hoc, local graph queries.

I’ll tell you the truth: when thinking about importing large amounts of data into a graph database I don’t feel very comfortable. And it’s not about the amount. It’s about the complexity of the data. Nodes. Properties of nodes. Relationships and their properties. And direction.
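To make that complexity concrete, here is a minimal sketch of a programmatic import using the batch insertion API Neo4j shipped in that era (org.neo4j.unsafe.batchinsert). The toy music data and property names are mine, not from any of the linked posts; the point is only that every source record has to be decomposed into nodes, node properties, relationships, relationship properties, and a direction.

```java
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class SimpleGraphImport {
    public static void main(String[] args) {
        BatchInserter inserter = BatchInserters.inserter("target/graph.db");
        try {
            RelationshipType PLAYED_ON = DynamicRelationshipType.withName("PLAYED_ON");

            // Nodes and their properties.
            Map<String, Object> artist = new HashMap<>();
            artist.put("name", "Miles Davis");
            long artistId = inserter.createNode(artist);

            Map<String, Object> album = new HashMap<>();
            album.put("title", "Kind of Blue");
            long albumId = inserter.createNode(album);

            // The relationship, its properties, and its direction (artist -> album).
            Map<String, Object> relProps = new HashMap<>();
            relProps.put("instrument", "trumpet");
            inserter.createRelationship(artistId, albumId, PLAYED_ON, relProps);
        } finally {
            inserter.shutdown();
        }
    }
}
```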

I hope this series started by Michael Hunger will help me learn more about graph database ETL.

Original title and link: On Importing Data into Neo4j (NoSQL database©myNoSQL)

via: http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/


4 Modern Data Loading Tips to Save Time, Money & Your Sanity

April Healy for Attunity1:

Here are 4 tips for modern data loading that can save you a lot of time, money, and your sanity!

  1. Ditch the development — Look into data ingestion/integration software that frees up your developer hours. Save those hours for revenue generating projects that exploit the opportunities you uncover with all the new data you are able to analyze.
  2. Find your “wingman” — Partner with a company that has the right solutions for your business needs. Things you might need: the capability to handle heterogeneous data sources, minimal impact on the source systems, automated processes, and/or graphical interfaces that make set up a snap. Your wingman should be able to help you get the data into your BI environment when you want it without a lot of hassle.
  3. Consider ingesting all the data in raw form — Gone are the days when data had to be cleansed, no, pristine, before it entered the hallowed halls of the data warehouse. Let your data analysts sort out what is valuable within the BI environment. Who knows what nuggets they will find.
  4. Boost your BI System ROI by loading smarter — Don’t wait weeks or months to access new data sources; enhance your loading processes so you can “do it now!”

All good advice (well, the first three at least). After considering them, draw a line and compare:

  1. what will you gain? (probably a combination of solving the problem faster and learning about the process from experts in the field)
  2. what will you lose? (probably that all your eggs end up in that vendor’s basket: whatever they charge for the next update, change, or upgrade, you’ll have to pay)

My generic answer is that I don’t believe in all-white or all-black choices: the optimal solution is probably a combination of finding a set of good tools and hiring people who can take care of them and evolve the solution. Still, there might be companies out there where having a solution right now would show great ROI, pushing team building down to a second-place priority. Or vice versa.


  1. You guessed it right: they sell ETL solutions. 

Original title and link: 4 Modern Data Loading Tips to Save Time, Money & Your Sanity (NoSQL database©myNoSQL)

via: http://www.attunity.com/blog/4-modern-data-loading-tips-save-time-money-your-sanity


Michael Stonebraker's New Data Company Raises Seed Funding From Google Ventures and NEA

According to boston.com, Michael Stonebraker’s new company Data Tamer has raised seed funding from Google Ventures and NEA for an ETL-in-the-cloud service:

This year, there’s yet another Stonebraker startup: Data Tamer, which has raised seed funding from Google Ventures and NEA. The business side of the fledgling company is being run by Andy Palmer, a frequent Stonebraker co-conspirator. And the technological underpinnings of Data Tamer come from Stonebraker’s lab at MIT, as well as work done at Brandeis by Mitch Cherniack and at Brown by Stan Zdonik.

I don’t think I’d be wrong in saying that Michael Stonebraker is the most prolific person in the data space (INGRES, Illustra with Postgres, Cohera with Mariposa, StreamBase with Aurora, Vertica with the C-Store-based homonymous product, VoltDB with the H-Store-based homonymous product, SciDB, Data Tamer).

Daniel Abadi

Original title and link: Michael Stonebraker’s New Data Company Raises Seed Funding From Google Ventures and NEA (NoSQL database©myNoSQL)

via: http://www.boston.com/business/technology/innoeco/2013/02/new_startup_data_tamer_raises.html


Moving Data From Oracle to MongoDB : Bridging the Gap With JRuby

A homegrown ETL process for migrating data from Oracle to MongoDB, built on JRuby’s chameleonic capabilities: a Ruby implementation that integrates well into a Java environment:

Rather than having to re-map one database or the other in the other persistence technology to facilitate the ETL process (not DRY), JRuby allowed the two persistence technologies to interoperate. By utilizing JRuby’s powerful embedding capabilities, we were able to read data out of Oracle via Hibernate and write data to MongoDB via MongoMapper.
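As an illustration of that embedding approach, here is a minimal Java sketch built around JRuby’s ScriptingContainer. The Customer entity, the MongoMapper model, and the database name are all hypothetical, and the real migration pipeline surely looks different; only the embedding mechanism itself is the point.

```java
import org.jruby.embed.ScriptingContainer;

public class OracleToMongoBridge {

    /** Hands a Hibernate-loaded entity (hypothetical) to an embedded Ruby runtime. */
    public static void copyCustomer(Object hibernateCustomer) {
        ScriptingContainer ruby = new ScriptingContainer();

        // Expose the Java object to Ruby under a global variable.
        ruby.put("$customer", hibernateCustomer);

        // Ruby sees a plain object and can call its getters directly,
        // then persist through MongoMapper.
        ruby.runScriptlet(
            "require 'mongo_mapper'                                  \n" +
            "MongoMapper.database = 'crm'                            \n" +
            "class Customer                                          \n" +
            "  include MongoMapper::Document                         \n" +
            "  key :name,  String                                    \n" +
            "  key :email, String                                    \n" +
            "end                                                     \n" +
            "Customer.create(:name  => $customer.getName,            \n" +
            "                :email => $customer.getEmail)           \n");
    }
}
```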

Original title and link: Moving Data From Oracle to MongoDB : Bridging the Gap With JRuby (NoSQL database©myNoSQL)

via: http://blog.jruby.org/2012/05/bridging-the-gap-with-jruby/


Introducing Databus: LinkedIn's Low Latency Change Data Capture Tool

Great article by Siddharth Anand1 introducing LinkedIn’s Databus: a low-latency change data capture system for transferring data between data stores:

Databus offers the following features:

  • Pub-sub semantics
  • In-commit-order delivery guarantees
  • Commits at the source are grouped by transaction
    • ACID semantics are preserved through the entire pipeline
  • Supports partitioning of streams
    • Ordering guarantees are then per partition
  • Like other messaging systems, offers very low latency consumption for recently-published messages
  • Unlike other messaging systems, offers arbitrarily-long look-back with no impact to the source
  • High Availability and Reliability

The ESB model is well known, but like NoSQL databases, Databus specializes in handling specific requirements of distributed systems and high-volume data processing architectures.
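To make those guarantees a bit more concrete, below is an illustrative consumer contract in Java. It is not Databus’s actual API, just a sketch of the consumption model the feature list implies: events arrive per partition, in commit order, grouped by source transaction, and a consumer that falls too far behind can be re-seeded from a long look-back store without touching the source database.

```java
import java.util.List;

// Illustrative only -- not the Databus API.
public interface ChangeCaptureConsumer {

    /** A single row-level change, tagged with the source commit it belongs to. */
    record ChangeEvent(String table, long commitSequence, int partition, byte[] payload) {}

    /** Called once per source transaction, with its events in commit order;
     *  ordering is guaranteed per partition. */
    void onTransaction(int partition, List<ChangeEvent> events);

    /** Called when the consumer has fallen behind the in-memory relay and must
     *  be re-seeded from a long-term (bootstrap) store instead of the source. */
    void onRewind(int partition, long fromCommitSequence);
}
```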


  1. Siddharth Anand: senior member of LinkedIn’s Distributed Data Systems team 

Original title and link: Introducing Databus: LinkedIn’s Low Latency Change Data Capture Tool (NoSQL database©myNoSQL)

via: http://highscalability.com/blog/2012/3/19/linkedin-creating-a-low-latency-change-data-capture-system-w.html


A MongoDB Map/Reduce Job Explained

A real-world MongoDB map/reduce example, used by Fiesta, the private group mailing list tool, explained in detail. The only part I don’t agree with is emphasized below:

Map/Reduce is a great way to do aggregations and ETL-type operations with MongoDB.

Probably nitpicking, but MongoDB’s MapReduce (and this applies to most NoSQL databases’ MapReduce implementations: CouchDB, Riak, etc.) can do only the transform part, much less so the load[1], and no extract at all.


  1. One could argue that MongoDB’s out option can be seen as equivalent to the load phase, but we can agree that having the results replace or be merged into a collection covers just one narrow use case.  
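To see why the out option is only a thin “load”, here is a minimal sketch of the kind of aggregation Fiesta describes, written against the 2.x-era MongoDB Java driver; the database, collection, and field names are made up, and the exact driver overloads should be treated as an assumption for your driver version.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MapReduceOutput;
import com.mongodb.MongoClient;

public class MessageCountsByList {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost");
        DB db = client.getDB("fiesta");
        DBCollection messages = db.getCollection("messages");

        // The map and reduce functions are the "transform" part.
        String map = "function() { emit(this.list_id, 1); }";
        String reduce = "function(key, values) { return Array.sum(values); }";

        // The output collection named here is the closest thing to a "load":
        // results simply land in another MongoDB collection. There is no
        // "extract" step at all -- the job runs over data already in MongoDB.
        MapReduceOutput out =
            messages.mapReduce(map, reduce, "message_counts", new BasicDBObject());
        System.out.println("output collection: " + out.getOutputCollection().getName());

        client.close();
    }
}
```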

Original title and link: A MongoDB Map/Reduce Job Explained (NoSQL database©myNoSQL)

via: http://blog.fiesta.cc/post/10980328832/walkthrough-a-mongodb-map-reduce-job


Hadoop and Netezza: Differences & Similarities

Most of the time, vendor videos emphasize the superiority of their own commercial platform. This short video, however, gives a fair overview of the similarities and differences between Hadoop and Netezza.

The video is 5 minutes long and well worth watching.