


HAWQ: All content tagged as HAWQ in NoSQL databases and polyglot persistence

When should I use Greenplum Database versus HAWQ?

Jon Roberts writes about the use cases for Greenplum and HAWQ, both technologies offered by Pivotal:

Greenplum is a robust MPP database that works very well for Data Marts and Enterprise Data Warehouses that tackles historical Business Intelligence reporting as well as predictive analytical use cases. HAWQ provides the most robust SQL interface for Hadoop and can tackle data exploration and transformation in HDFS.

The first questions that popped into my mind:

  1. why isn’t HAWQ good for reporting?
  2. why isn’t HAWQ good for predictive analytics?

I don’t have a good answer to either of these. For the first, I assume the implied answer is Hadoop’s latency. On the other hand, what I know is that Microsoft and Hortonworks are trying to bring Hadoop data into Excel with HDInsight. That is not traditional reporting, but if the latency is acceptable there, I’m not sure why it wouldn’t be acceptable for reporting too.

For the second question, Hadoop and the tools built around it are well known for predictive analytics. So maybe this separation is specific to HAWQ. Another explanation could be product positioning.

This last part seems to be confirmed by the rest of the post, which makes the point that data stored in HDFS is temporary: once it is processed with HAWQ, it is moved into Greenplum.

Greenplum and HAWQ

In other words, HAWQ is just for ETL/ELT on Hadoop.

✚ I’m pretty sure that many traditional data warehouse companies forced to come up with coherent architectures based on their core products and Hadoop are facing the same product positioning problem: it’s difficult to admit in front of customers that Hadoop might be capable of replacing core functionality of the products you are selling.

What is the best answer to this positioning dilemma?

  1. Find a spot for Hadoop that is not hurting your core products. Let’s say ETL.
  2. Propose an architecture where your core products and Hadoop are fully complementing and interacting with each other.

You already know my answer.

Original title and link: When should I use Greenplum Database versus HAWQ? (NoSQL database©myNoSQL)


Everything is faster than Hive

Derrick Harris has brought together a series of benchmarks conducted by the different SQL-on-Hadoop implementors comparing their solutions (Impala, Stinger/Tez, HAWQ, Shark) with Hive:

For what it’s worth, everyone is faster than Hive — that’s the whole point of all of these SQL-on-Hadoop technologies. How they compare with each other is harder to gauge, and a determination probably best left to individual companies to test on their own workloads as they’re making their own buying decisions. But for what it’s worth, here is a collection of more benchmark tests showing the performance of various Hadoop query engines against Hive, relational databases and, sometimes, themselves.

As Derrick Harris remarks, the only direct comparisons are the one between HAWQ and Impala (which seems to be old, as it mentions Impala being in beta) and the AMPLab benchmark (run by the team behind Shark) comparing Redshift, Hive, Shark, and Impala.

The good part is that both the Hive Testbench and AMPlab benchmark are available on GitHub.

Original title and link: Everything is faster than Hive (NoSQL database©myNoSQL)


Aster Data, HAWQ, GPDB and the First Hadoop Squeeze

Rob Klopp:

But there are three products, the Greenplum database (GPDB), HAWQ, and Aster Data, that will be squeezed more quickly as they are positioned either in between the EDW and Hadoop… or directly over Hadoop. In this post I’ll explain what I suspect Pivotal and Teradata are trying to do… why I believe their strategy will not work for long… and why readers of this blog should be careful moving forward.

This is a very interesting analysis of the enterprise data warehouse market. There’s also a nice visualization of this prediction:


Here’s an alternative though. As shown in the picture above, the expansion of in-memory databases depends heavily on the evolution of memory prices. It’s hard to argue against price predictions or Moore’s law, but accidents, even if rare, are still possible. Any significant change in the trend of memory costs, or in other hardware market conditions (e.g. an unpredicted drop in the price of SSDs), could give Teradata and Pivotal the extra time and conditions to break into advanced hybrid storage solutions: slightly slower, but also less expensive, than their competitors’ in-memory databases.

Original title and link: Aster Data, HAWQ, GPDB and the First Hadoop Squeeze (NoSQL database©myNoSQL)


Main difference between Hadapt and Microsoft Polybase, HAWQ, SQL-H

Daniel Abadi, in an email to Curt Monash, analyzing the Microsoft Polybase paper1:

The basic difference between Polybase and Hadapt is the following. With Polybase, the basic interface to the user is the MPP database software (and DBMS storage) that Microsoft is selling. Hadoop is viewed as a secondary source of data — if you have a dataset stored inside Hadoop instead of the database system for whatever reason, then the database system can access that Hadoop data on the fly and include that data in query processing alongside data that is already stored inside the database system. However, the user must be aware that she might want to query the data in Hadoop in advance — she must register this Hadoop data to the MPP database through an external table definition (and ideally statistics should be generated in advance to help the optimizer). Furthermore, the Hadoop data must be structured, since the external table definition requires this (so you can’t really access arbitrary unstructured data in Hadoop). The same is true for SQL-H and Hawq — they all can access data in Hadoop (in particular data stored in HDFS), but there needs to be some sort of structured schema defined in order for the database to understand how to access it via SQL. So, bottom line, Polybase/SQL-H/Hawq let you dynamically get at data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along, but for some reason is being stored in Hadoop instead of the DBMS.

It’s a long paragraph, but the difference Daniel Abadi is emphasizing is critical: “Hadoop/HDFS data that could theoretically have been stored in DBMS all along”.
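To make the registration step Abadi describes concrete, here is a hedged sketch of what such an external table definition might look like in HAWQ via its PXF layer. The table name, columns, and location are hypothetical, and exact syntax varies by product and version; the point is that a structured schema must be declared before the HDFS data becomes queryable:

```sql
-- Hypothetical example: registering HDFS data with the MPP database
-- so SQL queries can reach it. All names and the location are made up.
CREATE EXTERNAL TABLE clicks_ext (
    user_id    BIGINT,
    url        TEXT,
    clicked_at TIMESTAMP
)
LOCATION ('pxf://namenode:51200/data/clicks?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Once registered, the HDFS data can be joined and aggregated
-- alongside data already stored inside the database:
-- SELECT url, count(*) FROM clicks_ext GROUP BY url;
```

Note that nothing here is unstructured: the external table definition is exactly the “structured schema defined in advance” that Abadi says Polybase, SQL-H, and HAWQ all require.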

  1. According to Microsoft’s Gray Systems Lab page on Polybase:

    […] the goal of the Polybase project is to allow SQL Server PDW users to execute queries against data stored in Hadoop, specifically the Hadoop distributed file system (HDFS). Polybase is agnostic on both the type of the Hadoop cluster (Linux or Windows) and whether it is a separate cluster or whether the Hadoop nodes are co-located with the nodes of the PDW appliance.

    And here are my (very) brief thoughts about Polybase from when I first learned about it.

Original title and link: Main difference between Hadapt and Microsoft Polybase, HAWQ, SQL-H (NoSQL database©myNoSQL)