ETL: All content tagged as ETL in NoSQL databases and polyglot persistence

4 Modern Data Loading Tips to Save Time, Money & Your Sanity

April Healy for Attunity[1]:

Here are 4 tips for modern data loading that can save you a lot of time, money, and your sanity!

  1. Ditch the development — Look into data ingestion/integration software that frees up your developer hours. Save those hours for revenue generating projects that exploit the opportunities you uncover with all the new data you are able to analyze.
  2. Find your “wingman” — Partner with a company that has the right solutions for your business needs. Things you might need: the capability to handle heterogeneous data sources, minimal impact on the source systems, automated processes, and/or graphical interfaces that make set up a snap. Your wingman should be able to help you get the data into your BI environment when you want it without a lot of hassle.
  3. Consider ingesting all the data in raw form — Gone are the days when data had to be cleansed, no pristine, before it entered the hallowed halls of the data warehouse. Let your data analysts sort out what is valuable within the BI environment. Who knows what nuggets they will find.
  4. Boost your BI System ROI by loading smarter — Don’t wait weeks or months to access new data sources; enhance your loading processes so you can “do it now!”

All good advice (well, the first three at least). After considering them, draw a line and compare:

  1. What will you gain? (Probably a combination of solving the problem faster and learning about the process from experts in the field.)
  2. What will you lose? (Probably that all your eggs end up in that vendor’s basket: whatever it decides to charge for the next update, change, or upgrade, you’ll have to pay.)

My generic answer is that I don’t believe in pure white or pure black: the optimal solution is probably a combination of finding a set of good tools and hiring people who can maintain and evolve the solution. Even that answer is not absolute. There might be companies out there where having a solution right now could show great ROI, which pushes building a team down to a second-place priority. Or vice versa.


  1. You guessed it right: they sell ETL solutions. 

Original title and link: 4 Modern Data Loading Tips to Save Time, Money & Your Sanity (NoSQL database©myNoSQL)

via: http://www.attunity.com/blog/4-modern-data-loading-tips-save-time-money-your-sanity


Michael Stonebraker's New Data Company Raises Seed Funding From Google Ventures and NEA

According to boston.com, Michael Stonebraker’s new company Data Tamer has raised seed funding from Google Ventures and NEA for an ETL-in-the-cloud service:

This year, there’s yet another Stonebraker startup: Data Tamer, which has raised seed funding from Google Ventures and NEA. The business side of the fledgling company is being run by Andy Palmer, a frequent Stonebraker co-conspirator. And the technological underpinnings of Data Tamer come from Stonebraker’s lab at MIT, as well as work done at Brandeis by Mitch Cherniack and at Brown by Stan Zdonik.

I don’t think I’d be wrong in saying that Michael Stonebraker is the most prolific person in the data space (INGRES, Illustra with Postgres, Cohera with Mariposa, StreamBase with Aurora, Vertica with the eponymous C-Store-based product, VoltDB with the eponymous H-Store-based product, SciDB, Data Tamer).

Daniel Abadi

Original title and link: Michael Stonebraker’s New Data Company Raises Seed Funding From Google Ventures and NEA (NoSQL database©myNoSQL)

via: http://www.boston.com/business/technology/innoeco/2013/02/new_startup_data_tamer_raises.html


Moving Data From Oracle to MongoDB : Bridging the Gap With JRuby

A homegrown ETL process for migrating data from Oracle to MongoDB, built on JRuby’s chameleonic capabilities (it is a Ruby implementation that integrates well in a Java environment):

Rather than having to re-map one database or the other in the other persistence technology to facilitate the ETL process (not DRY), JRuby allowed the two persistence technologies to interoperate. By utilizing JRuby’s powerful embedding capabilities, we were able to read data out of Oracle via Hibernate and write data to MongoDB via MongoMapper.
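The extract, transform, and load steps of such a migration can be sketched in a few lines. This is not the JRuby code from the linked post: it is a minimal Python illustration in which an in-memory SQLite database stands in for Oracle and a plain dict stands in for a MongoDB collection. The table names, columns, and function names are all hypothetical.

```python
import sqlite3


def build_source():
    """Create a tiny relational source (an assumed stand-in for Oracle)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT);
        INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
        INSERT INTO orders VALUES (1, 1, 'book'), (2, 1, 'pen');
    """)
    return conn


def extract(conn):
    """Extract: read flat, joined rows from the relational source."""
    query = """
        SELECT u.id, u.name, o.item
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        ORDER BY u.id, o.id
    """
    return conn.execute(query).fetchall()


def transform(rows):
    """Transform: fold the flat rows into one nested document per user."""
    docs = {}
    for user_id, name, item in rows:
        doc = docs.setdefault(user_id, {"_id": user_id, "name": name, "orders": []})
        if item is not None:  # users without orders keep an empty list
            doc["orders"].append(item)
    return list(docs.values())


def load(docs, collection):
    """Load: upsert each document into the target, keyed by _id."""
    for doc in docs:
        collection[doc["_id"]] = doc


def run_etl():
    """Run the three steps end to end against the toy source."""
    target = {}  # assumed stand-in for a document collection
    load(transform(extract(build_source())), target)
    return target
```

Calling `run_etl()` yields one nested document per user, with that user's order items collected into a list. Keeping the three steps as separate functions mirrors the point of the quote: each side keeps its own persistence mapping, and only the thin glue between them is new code.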

Original title and link: Moving Data From Oracle to MongoDB : Bridging the Gap With JRuby (NoSQL database©myNoSQL)

via: http://blog.jruby.org/2012/05/bridging-the-gap-with-jruby/


Introducing Databus: LinkedIn's Low Latency Change Data Capture Tool

Great article by Siddharth Anand[1] introducing LinkedIn’s Databus: a low latency system used for transferring data between data stores (change data capture system):

Databus offers the following features:

  • Pub-sub semantics
  • In-commit-order delivery guarantees
  • Commits at the source are grouped by transaction
    • ACID semantics are preserved through the entire pipeline
  • Supports partitioning of streams
    • Ordering guarantees are then per partition
  • Like other messaging systems, offers very low latency consumption for recently-published messages
  • Unlike other messaging systems, offers arbitrarily-long look-back with no impact to the source
  • High Availability and Reliability

The ESB model is well-known, but like NoSQL databases, Databus is specialized in handling specific requirements related to distributed systems and high volume data processing architectures.
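Several of the guarantees in the feature list can be made concrete with a toy model. The sketch below is not Databus code and makes no claim about how Databus is implemented: it is a minimal in-memory Python illustration of commit-order delivery, transaction grouping, per-partition ordering, and look-back. The `ChangeLog` and `Consumer` classes, the `seq` counter, and the modulo partitioning are all assumptions made for the example.

```python
from collections import defaultdict


class ChangeLog:
    """Toy change log: whole transactions are appended in commit order."""

    def __init__(self, num_partitions=2):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)  # partition -> [(seq, [changes])]
        self.next_seq = 1  # hypothetical global commit sequence number

    def commit(self, key, changes):
        """Append every change of one transaction as a single unit."""
        partition = key % self.num_partitions  # integer keys, for determinism
        seq = self.next_seq
        self.next_seq += 1
        self.partitions[partition].append((seq, list(changes)))
        return seq

    def read(self, partition, after_seq=0):
        """Return transactions newer than after_seq, in commit order."""
        return [(seq, changes)
                for seq, changes in self.partitions[partition]
                if seq > after_seq]


class Consumer:
    """Subscriber that tracks its own position within one partition."""

    def __init__(self, log, partition):
        self.log = log
        self.partition = partition
        self.last_seq = 0  # 0 means "start from the beginning" (look-back)
        self.seen = []

    def poll(self):
        """Pull any transactions committed since the last poll."""
        for seq, changes in self.log.read(self.partition, self.last_seq):
            self.seen.append((seq, changes))
            self.last_seq = seq
        return self.seen


log = ChangeLog(num_partitions=2)
log.commit(1, ["insert member 1", "insert profile 1"])  # one transaction
log.commit(2, ["insert member 2"])
log.commit(3, ["update member 3"])

early = Consumer(log, partition=1)
early.poll()  # sees the transactions for keys 1 and 3, in commit order

late = Consumer(log, partition=1)
late.poll()   # a late subscriber replays the same history from the log
```

Because each consumer keeps its own position and reads only from the log, a late subscriber can replay history without touching the source database, which is the property the list describes as look-back with no impact on the source.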


  1. Siddharth Anand: senior member of LinkedIn’s Distributed Data Systems team 

Original title and link: Introducing Databus: LinkedIn’s Low Latency Change Data Capture Tool (NoSQL database©myNoSQL)

via: http://highscalability.com/blog/2012/3/19/linkedin-creating-a-low-latency-change-data-capture-system-w.html


A MongoDB Map/Reduce Job Explained

A real-world MongoDB map/reduce example used by the private group mailing lists tool Fiesta explained in detail. The only part I don’t agree with is emphasized below:

Map/Reduce is a great way to do aggregations and ETL-type operations with MongoDB.

Probably nitpicking, but MongoDB’s MapReduce (and this applies to the MapReduce implementations of most NoSQL databases: CouchDB, Riak, etc.) can do only the transform part, very little of the load part[1], and none of the extract part.
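To make the distinction concrete, here is a minimal pure-Python sketch of what a map/reduce job does. It is not MongoDB’s API and not the job from the Fiesta post: `map_reduce`, `write_out`, `emit_domain`, and `count` are hypothetical names, and the sample documents are made up. The only thing borrowed from MongoDB is the idea that job output can either replace a collection or be merged into it.

```python
from collections import defaultdict


def map_reduce(docs, map_fn, reduce_fn):
    """Transform only: group the values emitted by map_fn, then fold each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def write_out(results, collection, mode="replace"):
    """Load-like step: 'replace' drops old contents, 'merge' overwrites matching keys."""
    if mode == "replace":
        collection.clear()
    elif mode != "merge":
        raise ValueError(f"unknown mode: {mode}")
    collection.update(results)
    return collection


def emit_domain(doc):
    """Map: emit (mail domain, 1) for each message document."""
    yield doc["email"].split("@")[1], 1


def count(key, values):
    """Reduce: sum the emitted counts for one key."""
    return sum(values)


# The input already lives inside the store; nothing is extracted from elsewhere.
messages = [{"email": "a@x.org"}, {"email": "b@y.org"}, {"email": "c@x.org"}]
totals = map_reduce(messages, emit_domain, count)

merged = write_out(totals, {"z.org": 5}, mode="merge")      # keeps z.org
replaced = write_out(totals, {"z.org": 5}, mode="replace")  # drops z.org
```

Note that `map_reduce` never reaches outside the documents it is given, and `write_out` can only target a collection in the same store. Pulling data from an external system, or pushing results into one, needs separate tooling.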


  1. One could argue that MongoDB’s out option can be seen as equivalent to the load phase, but we can agree that having the results replace a collection or be merged into one is just one use case.

Original title and link: A MongoDB Map/Reduce Job Explained (NoSQL database©myNoSQL)

via: http://blog.fiesta.cc/post/10980328832/walkthrough-a-mongodb-map-reduce-job


Hadoop and Netezza: Differences & Similarities

Most of the time, vendor videos emphasize the superiority of their own commercial platform. But this short video gives a fair overview of the similarities and differences between Hadoop and Netezza.

The video is 5 minutes long and well worth watching.