3 Steps for a Fast Relational Database to Hadoop Data Load [Sponsor]

Words from this week’s sponsor, Pervasive/Actian:


So, you want to pull a buttload (That’s a technical term.) of data out of a relational database and slam it into HDFS or HBase for processing. Well, maybe you’ve got a nice, powerful Hadoop cluster, but that old school database isn’t designed for parallel data exports. How do you get the data moved into Hadoop before you’re eligible for retirement?

Here’s how:

  1. Use the new Actian RushLoader. It’s a nice, simple, free tool that allows you to pull data from any database that has a JDBC driver, as well as from log files, delimited files, HBase and ARFF files. RushLoader runs on any operating system with a JVM and with any file system, including Amazon S3, UNIX and HDFS.

    The nice thing about RushLoader is that on the surface, it’s a quick and easy, point-and-click workflow tool, a cut-down version of the KNIME open source data mining platform. Under the covers, it uses the DataRush engine, which divides and optimizes workloads at runtime, so it takes full advantage of as much parallel hardware power as you give it, without you having to do any coding work to make it happen.

  2. Configure the data query in the RushLoader database reader like this:

    (t = a table name, c = a column name)
     SELECT * FROM t WHERE c = ?

  3. Set up a parameter query for ? like this:

     SELECT DISTINCT c FROM t

With these three steps in place, the parameter query returns all the distinct values in the column, and a separate data query is sent to the database for each value. Because each of those queries is independent, the DataRush engine can automatically spread the work across the available machines and threads, giving you a high speed parallel data pull. There’s more info on parameter queries in the DataRush docs, and the new Actian big data community provides a DataRush toolset discussion forum if you run into trouble.
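To make the idea concrete outside of the tool, here is a minimal sketch of the same pattern in plain Python: one query collects the distinct values of c, then one parameterized query per value runs on a worker thread. This is not RushLoader or DataRush code. It assumes a local SQLite file standing in for the relational database, and the helper names (build_demo_db, fetch_partition, parallel_pull) and the demo data are made up for illustration.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def build_demo_db(path):
    """Create a small table t with a partitioning column c (demo data only)."""
    conn = sqlite3.connect(path)
    with conn:  # commits the inserts on success
        conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, c TEXT, payload TEXT)")
        rows = [(i, "region_%d" % (i % 4), "row-%d" % i) for i in range(20)]
        conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    conn.close()


def fetch_partition(path, value):
    """Run the data query for one parameter value on its own connection."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT * FROM t WHERE c = ?", (value,)).fetchall()
    finally:
        conn.close()


def parallel_pull(path, workers=4):
    """Parameter query first, then one data query per distinct value in parallel."""
    conn = sqlite3.connect(path)
    try:
        values = [r[0] for r in conn.execute("SELECT DISTINCT c FROM t")]
    finally:
        conn.close()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda v: fetch_partition(path, v), values)
        # Flatten the per-value result sets into one list of rows.
        return [row for part in parts for row in part]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        build_demo_db(db_path)
        print(len(parallel_pull(db_path)))
```

Two things to keep in mind when picking the column: rows where c is NULL never match c = ?, so they need a separate query, and a column with very few or very uneven distinct values gives the workers little to share out.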

The free RushLoader includes simple row and column filtering. If you want to get any more sophisticated about the load - add data quality checks, do aggregations, sorting, source joins, lookups, that sort of thing - you have to move up to the commercial version, RushAnalytics. If all you need is a lot of data pulled from an RDBMS and slammed into Hadoop, RushLoader can do the job faster by far than anything else on the market.

Original title and link: 3 Steps for a Fast Relational Database to Hadoop Data Load [Sponsor] (NoSQL database©myNoSQL)