This post by Daniel Abadi dates from July 2012, but it’s so right that I wonder why I didn’t link to it before:
Many people don’t realize that Hadoop and parallel relational
databases have an extremely similar design. Both are capable of
storing large data sets by breaking the data into pieces and storing
them on multiple independent (“shared-nothing”) machines in a
cluster. Both scale processing over these large data sets by
parallelizing the processing of the data over these independent
machines. Both do as much independent processing as possible across
individual partitions of data, in order to reduce the amount of data
that must be exchanged between machines. Both store data redundantly
in order to increase fault tolerance. The algorithms for scaling
operations like selecting data, projecting data, grouping data,
aggregating data, sorting data, and even joining data are the same.
If you squint, the basic data processing technology of Hadoop and
parallel database systems is identical.
There is absolutely no technical reason why there needs to be two
separate systems doing the exact same type of parallel processing.
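The shared pattern Abadi describes — independent per-partition work followed by a minimal exchange of partial results — is the same whether the engine is MapReduce or a parallel database executing a grouped aggregation. A minimal sketch in Python (the data, partitioning, and function names are illustrative, not taken from either system):

```python
# Each "machine" aggregates its own partition independently; only the small
# partial results are exchanged and merged. This mirrors both a MapReduce
# combine/reduce and a parallel database's partial/final aggregation plan.
from collections import defaultdict

def local_aggregate(partition):
    """Per-partition work: group rows by key and sum values locally."""
    partial = defaultdict(int)
    for key, value in partition:
        partial[key] += value
    return partial

def merge(partials):
    """Exchange step: combine the small per-partition results."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)

# Rows sharded across three "shared-nothing" nodes.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5)],
]
result = merge(local_aggregate(p) for p in partitions)
print(result)  # {'a': 4, 'b': 7, 'c': 4}
```

Note that the only data crossing "machine" boundaries is the compact partial aggregates, not the raw rows — exactly the data-movement discipline both architectures rely on.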
A skeptic could counter that Daniel Abadi's own product, Hadapt, competes in this exact segment of the market. But that objection would only prove how right Abadi is.
Original title and link: Why Database-To-Hadoop Connectors Are Fundamentally Flawed and Entirely Unnecessary ( ©myNoSQL)