Why Database-To-Hadoop Connectors Are Fundamentally Flawed and Entirely Unnecessary

This post by Daniel Abadi dates from July 2012, but it’s so right that I wonder why I didn’t link to it before:

Many people don’t realize that Hadoop and parallel relational databases have an extremely similar design. Both are capable of storing large data sets by breaking the data into pieces and storing them on multiple independent (“shared-nothing”) machines in a cluster. Both scale processing over these large data sets by parallelizing the processing of the data over these independent machines. Both do as much independent processing as possible across individual partitions of data, in order to reduce the amount of data that must be exchanged between machines. Both store data redundantly in order to increase fault tolerance. The algorithms for scaling operations like selecting data, projecting data, grouping data, aggregating data, sorting data, and even joining data are the same. If you squint, the basic data processing technology of Hadoop and parallel database systems are identical.

There is absolutely no technical reason why there needs to be two separate systems doing the exact same type of parallel processing.
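To make the shared pattern in the quote concrete, here is a minimal single-process sketch of partitioned aggregation: split the rows across independent workers, aggregate each partition locally, then exchange only the small partial results. It is an illustration only. The function names (`partition`, `local_aggregate`, `merge`) and the sample rows are hypothetical and do not correspond to any Hadoop or database API.

```python
from collections import Counter


def partition(rows, n):
    """Spread rows round-robin over n hypothetical shared-nothing nodes."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def local_aggregate(part):
    """Sum values per key using only the rows held by one node."""
    totals = Counter()
    for key, value in part:
        totals[key] += value
    return totals


def merge(partials):
    """Combine per-node partial sums; only these small results are exchanged."""
    result = Counter()
    for partial in partials:
        result.update(partial)  # Counter.update adds counts key by key
    return dict(result)


# Made-up sample data: (key, value) pairs
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
totals = merge(local_aggregate(p) for p in partition(rows, 3))
```

Read one way, `local_aggregate` is a combiner and `merge` a reducer in a MapReduce job; read another way, they are the local and global phases of a parallel database's GROUP BY. The code is the same in both readings, which is the point the quote makes.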

Someone defending the opposite position could counter by pointing out that Daniel Abadi’s product, Hadapt, competes in this exact segment of the market. But in doing so, that someone would only prove how right Abadi is.
