


Main difference between Hadapt and Microsoft Polybase, HAWQ, SQL-H

Daniel Abadi, in an email to Curt Monash analyzing the Microsoft Polybase paper[1]:

The basic difference between Polybase and Hadapt is the following. With Polybase, the basic interface to the user is the MPP database software (and DBMS storage) that Microsoft is selling. Hadoop is viewed as a secondary source of data — if you have a dataset stored inside Hadoop instead of the database system for whatever reason, then the database system can access that Hadoop data on the fly and include that data in query processing alongside data that is already stored inside the database system. However, the user must be aware that she might want to query the data in Hadoop in advance — she must register this Hadoop data to the MPP database through an external table definition (and ideally statistics should be generated in advance to help the optimizer). Furthermore, the Hadoop data must be structured, since the external table definition requires this (so you can’t really access arbitrary unstructured data in Hadoop). The same is true for SQL-H and Hawq — they all can access data in Hadoop (in particular data stored in HDFS), but there needs to be some sort of structured schema defined in order for the database to understand how to access it via SQL. So, bottom line, Polybase/SQL-H/Hawq let you dynamically get at data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along, but for some reason is being stored in Hadoop instead of the DBMS.

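It’s a long paragraph, but the distinction Daniel Abadi emphasizes is critical: Polybase, SQL-H, and HAWQ reach “data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along”, that is, data structured enough to be described by a schema up front.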
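To make the schema requirement concrete, here is a toy sketch in Python. It is not Polybase, SQL-H, or HAWQ code: the in-memory string standing in for an HDFS file, the `EXTERNAL_SCHEMA` list, and both function names are hypothetical, and SQLite merely plays the role of the MPP database. Unlike the real systems, which read Hadoop data on the fly, this sketch copies the rows into a temporary table. The point it illustrates is only that the engine needs a declared structure before it can treat external bytes as rows in a SQL query.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for a delimited file stored in HDFS.
HDFS_FILE = "1,alice,30\n2,bob,45\n"

# Hypothetical "external table definition": the column names and types
# that must be declared before the raw text can be read as rows.
EXTERNAL_SCHEMA = [("user_id", int), ("name", str), ("minutes", int)]


def read_external(raw, schema):
    """Parse raw delimited text into typed rows using the declared schema."""
    rows = []
    for record in csv.reader(io.StringIO(raw)):
        if len(record) != len(schema):
            # Data that does not fit the declared structure cannot be queried.
            raise ValueError(f"row does not match schema: {record}")
        rows.append(tuple(cast(value) for (_, cast), value in zip(schema, record)))
    return rows


def query_with_external(raw, schema):
    """Join a table native to the database with externally stored rows."""
    db = sqlite3.connect(":memory:")
    # Data that already lives inside the database.
    db.execute("CREATE TABLE plans (user_id INTEGER, plan TEXT)")
    db.executemany("INSERT INTO plans VALUES (?, ?)", [(1, "pro"), (2, "free")])
    # "Register" the external data under the declared column names.
    columns = ", ".join(name for name, _ in schema)
    placeholders = ", ".join("?" for _ in schema)
    db.execute(f"CREATE TEMP TABLE usage_ext ({columns})")
    db.executemany(
        f"INSERT INTO usage_ext VALUES ({placeholders})",
        read_external(raw, schema),
    )
    # One SQL query spanning both sources.
    return db.execute(
        "SELECT p.plan, u.name, u.minutes FROM plans p "
        "JOIN usage_ext u ON u.user_id = p.user_id ORDER BY u.user_id"
    ).fetchall()
```

Calling `query_with_external(HDFS_FILE, EXTERNAL_SCHEMA)` joins the two sources, while passing free-form text such as a raw log line to `read_external` raises `ValueError`, which mirrors Abadi's remark that arbitrary unstructured data is out of reach for this approach.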

  1. According to the Microsoft Gray Systems Lab page on Polybase:

    […] the goal of the Polybase project is to allow SQL Server PDW users to execute queries against data stored in Hadoop, specifically the Hadoop distributed file system (HDFS). Polybase is agnostic on both the type of the Hadoop cluster (Linux or Windows) and whether it is a separate cluster or whether the Hadoop nodes are co-located with the nodes of the PDW appliance.

    And here are my (very) brief thoughts on Polybase from when I first learned about it.

Original title and link: Main difference between Hadapt and Microsoft Polybase, HAWQ, SQL-H (NoSQL database©myNoSQL)