


MPP: All content tagged as MPP in NoSQL databases and polyglot persistence

Counterpoint: Why Some Hadoop Adapters Make *Perfect* Sense

Jeff Darcy’s reply to Daniel Abadi’s “Why Database-To-Hadoop Connectors Are Fundamentally Flawed and Entirely Unnecessary”:

Going back to Daniel’s argument, RDBMS-to-Hadoop connectors are indeed silly because they incur a migration cost without adding semantic value. Moving from one structured silo to another structured silo really is a waste of time. That is also exactly why filesystem-to-Hadoop connectors do make sense, because they flip that equation on its head — they do add semantic value, and they avoid a migration cost that would otherwise exist when importing data into HDFS. Things like GlusterFS’s UFO or MapR’s “direct access NFS” decrease total time to solution vs. the HDFS baseline.

Original title and link: Counterpoint: Why Some Hadoop Adapters Make *Perfect* Sense (NoSQL database©myNoSQL)


Why Database-To-Hadoop Connectors Are Fundamentally Flawed and Entirely Unnecessary

This post by Daniel Abadi dates from July 2012, but it’s so right that I wonder why I didn’t link to it before:

Many people don’t realize that Hadoop and parallel relational databases have an extremely similar design. Both are capable of storing large data sets by breaking the data into pieces and storing them on multiple independent (“shared-nothing”) machines in a cluster. Both scale processing over these large data sets by parallelizing the processing of the data over these independent machines. Both do as much independent processing as possible across individual partitions of data, in order to reduce the amount of data that must be exchanged between machines. Both store data redundantly in order to increase fault tolerance.  The algorithms for scaling operations like selecting data, projecting data, grouping data, aggregating data, sorting data, and even joining data are the same. If you squint, the basic data processing technology of Hadoop and parallel database systems are identical.

There is absolutely no technical reason why there needs to be two separate systems doing the exact same type of parallel processing.
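The shared-nothing pattern Abadi describes — aggregate each partition independently, then exchange only small partial results — can be illustrated with a toy sketch (my own example, not from the post; the data and names are made up). The same skeleton underlies both a MapReduce job and a parallel GROUP BY:

```python
from collections import Counter
from functools import reduce

# Toy "shared-nothing" aggregation: each partition lives on its own
# machine and is processed independently; only small partial aggregates
# ever cross the network.
partitions = [
    ["ads", "search", "ads"],    # data on machine 1
    ["search", "mail", "ads"],   # data on machine 2
]

# Local phase: aggregate within each partition, with no data exchange.
partials = [Counter(p) for p in partitions]

# Merge phase: combine the small per-partition results.
totals = reduce(lambda a, b: a + b, partials)

print(totals["ads"])  # 3
```

Whether the engine calls these phases map/reduce or partial/final aggregation, the data movement pattern is the same — which is the point of the quote above.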

One could counter this perspective by pointing out that Daniel Abadi’s own product, Hadapt, competes in exactly this segment of the market. But doing so would only prove how right Abadi is.

Original title and link: Why Database-To-Hadoop Connectors Are Fundamentally Flawed and Entirely Unnecessary (NoSQL database©myNoSQL)


Big Data Implications for IT Architecture and Infrastructure

Teradata’s Martin Willcox:

From an IT architecture / infrastructure perspective, I think that the key thing to understand about all of this is that, at least for the foreseeable future, we’ll need at least two different types of “database” technology to efficiently manage and exploit the relational and non-relational data, respectively: an integrated data warehouse, built on a Massively Parallel Processing (MPP) DBMS platform for the relational data, and the relational meta-data that we generate by processing the non-relational data (for example, that a call was made at this date and time, by this customer, and that they were assessed as being stressed and agitated); and another platform for the processing of the non-relational data, that enables us to parallelise complex algorithms - and so bring them to bear on large data-sets - using the MapReduce programming model. Since the value of these data are much greater in combination than in isolation – and because we may be shipping very large volumes of data between the different platforms - considerations of how best to connect and integrate these two repositories become very important.

One of the few corporate blog posts that do not try to position Hadoop (and implicitly MapReduce) in a corner.

This sane perspective could be a validation of my thoughts about the Teradata and Hortonworks partnership.

Original title and link: Big Data Implications for IT Architecture and Infrastructure (NoSQL database©myNoSQL)


MapReduce and Massively Parallel Processing (MPP): Two Sides of the Big Data

Andrew Brust for ZDNet:

But, for a variety of reasons, MPP and MapReduce are used in rather different scenarios. You will find MPP employed in high-end data warehousing appliances. […] MPP gets used on expensive, specialized hardware tuned for CPU, storage and network performance. MapReduce and Hadoop find themselves deployed to clusters of commodity servers that in turn use commodity disks. The commodity nature of typical Hadoop hardware (and the free nature of Hadoop software) means that clusters can grow as data volumes do, whereas MPP products are bound by the cost of, and finite hardware in, the appliance and the relative high cost of the software. […] MPP and MapReduce are separated by more than just hardware. MapReduce’s native control mechanism is Java code (to implement the Map and Reduce logic), whereas MPP products are queried with SQL (Structured Query Language). […] Nonetheless, Hadoop is natively controlled through imperative code while MPP appliances are queried through declarative queries. In a great many cases, SQL is easier and more productive than is writing MapReduce jobs, and database professionals with the SQL skill set are more plentiful and less costly than Hadoop specialists.
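The declarative-vs-imperative distinction Brust draws can be made concrete with a small sketch (my own toy example; the table and data are hypothetical). Both halves compute the same per-key count — one as SQL that an engine plans for you, one as hand-written map and reduce logic:

```python
import sqlite3
from functools import reduce

rows = [("ads",), ("search",), ("ads",)]

# Declarative: describe the result; the engine decides how to compute
# it (an MPP engine would also decide how to parallelize it).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (topic TEXT)")
db.executemany("INSERT INTO events VALUES (?)", rows)
sql_result = dict(db.execute(
    "SELECT topic, COUNT(*) FROM events GROUP BY topic"))

# Imperative: the programmer writes the Map and Reduce logic explicitly,
# as one would in a Hadoop job.
def map_fn(row):
    return (row[0], 1)

def reduce_fn(acc, kv):
    key, count = kv
    acc[key] = acc.get(key, 0) + count
    return acc

mr_result = reduce(reduce_fn, map(map_fn, rows), {})

assert sql_result == mr_result  # {"ads": 2, "search": 1}
```

The results are identical; what differs is who carries the burden of specifying *how* — which is exactly why SQL skills transfer more cheaply than MapReduce skills.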

I totally agree with Andrew Brust that none of these are good reasons for these platforms to remain separate. Actually, when analyzing the importance of the Teradata (MPP) and Hortonworks (Hadoop) partnership, I wrote:

Depending on the level of integration the two teams will pull together, this partnership might result in one of the most complete and powerful structured and unstructured data warehouse and analytics platforms.

This very same thing could be said about any platform that offered a viable, fully integrated, cost-effective, distributed, structured and unstructured data warehouse or analytics platform. MPP and MapReduce do not represent different sides of Big Data, but rather complementary approaches to Big Data.

Original title and link: MapReduce and Massively Parallel Processing (MPP): Two Sides of the Big Data (NoSQL database©myNoSQL)


Improving Hadoop Performance by (Up To) 1000x

LinkedIn’s Adam Silberstein and Daniel Tunkelang provide a fantastic summary of a presentation I wish I could attend: Daniel Abadi’s “Improving Hadoop Performance by (up to) 1000x”.

Overly simplified, Daniel Abadi’s proposal is to create an analytical platform by using the best of two worlds: Hadoop and row-based or column-based relational database storage and query engines.

Hadapt, the company founded by Daniel Abadi, is in my list of the 8 most interesting companies for Hadoop’s future because I think that an interesting product can be built by combining the long-optimized and well-tested storage and query engines of relational databases with Hadoop’s fault tolerance, scalability, and power, topped with a resource management layer.
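The split-execution idea can be sketched in miniature (a hedged toy of my own, not Hadapt’s actual architecture or API): each “node” keeps its partition in a relational engine and answers the per-partition part of a query with SQL, while a Hadoop-like coordination layer merges the small partial results:

```python
import sqlite3
from collections import Counter

def make_node(rows):
    """Simulate one cluster node holding its partition in an RDBMS."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (topic TEXT)")
    db.executemany("INSERT INTO events VALUES (?)", rows)
    return db

nodes = [
    make_node([("ads",), ("ads",)]),
    make_node([("search",), ("ads",)]),
]

# Per-node: the relational engine does the heavy lifting locally.
partials = [
    dict(db.execute("SELECT topic, COUNT(*) FROM events GROUP BY topic"))
    for db in nodes
]

# Coordination layer (roughly Hadoop's role here): merge partial results,
# and in a real system also handle scheduling and node failures.
totals = sum((Counter(p) for p in partials), Counter())

print(dict(totals))  # {'ads': 3, 'search': 1}
```

The appeal is that the per-node work runs on a mature, heavily optimized query engine, while the distributed layer contributes fault tolerance and scale-out.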

Original title and link: Improving Hadoop Performance by (Up To) 1000x (NoSQL database©myNoSQL)


Sybase: Distributed Shared-everything MPP Query Processing Architecture

Using an MPP shared-everything architecture, Sybase IQ 15.3 PlexQ Distributed Query Platform surpasses typical shared-nothing MPP architectures with better concurrency, self service ad-hoc queries, and independent scale out of compute and storage resources. With this architecture, PlexQ can exceed Service Level Agreements (SLAs) through simple and flexible resource provisioning that allows nodes to be grouped together as unified images that can be assigned to different application profiles.

Is this going against what the web, MapReduce, Hadoop, and (some) NoSQL databases are teaching us?

Update: I realized that my question above can be misinterpreted so here are my real questions:

  1. How does this shared-everything model work?
  2. What are the pros/cons of this shared-everything approach?

Markus ‘maol’ Perdrizat

Original title and link: Sybase: Distributed Shared-everything MPP Query Processing Architecture (NoSQL databases © myNoSQL)