Hive: All content tagged as Hive in NoSQL databases and polyglot persistence
In 4 years of writing this blog I haven’t seen such a prolific month:
- Apache Hadoop 2.2.0
- Apache HBase 0.96
- Apache Hive 0.12
- Apache Ambari 1.4.1
- Apache Pig 0.12
- Apache Oozie 4.0.0
- Plus Presto.
Actually, I don’t think I’ve ever seen an ecosystem like the one created around Hadoop.
Original title and link: A prolific season for Hadoop and its ecosystem ( ©myNoSQL)
And at the end of October, Hortonworks shared a new set of results:
Historically, even simple Hive queries could not run in less than 30 seconds, yet many of these queries are running in less than 10 seconds. How did that happen? The answer mainly boils down to Apache Tez and Apache Hadoop YARN, which proves that Hadoop is more than just batch. Tez features such as container pre-launch and re-use overcome Hadoop’s traditional latency barriers, and are available to any data processing framework running in Hadoop.
Original title and link: Status update on Project Stinger, the interactive query for Apache Hive ( ©myNoSQL)
Another weekend read, this time from Facebook and The Ohio State University, and closer to the hot topic of the last two weeks (SQL, MapReduce, Hadoop):
MapReduce has become an effective approach to big data analytics in large cluster systems, where SQL-like queries play important roles to interface between users and systems. However, based on our Facebook daily operation results, certain types of queries are executed at an unacceptable low speed by Hive (a production SQL-to-MapReduce translator). In this paper, we demonstrate that existing SQL-to-MapReduce translators that operate in a one-operation-to-one-job mode and do not consider query correlations cannot generate high-performance MapReduce programs for certain queries, due to the mismatch between complex SQL structures and simple MapReduce framework. We propose and develop a system called YSmart, a correlation aware SQL-to-MapReduce translator. YSmart applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators. We have implemented YSmart with intensive evaluation for complex queries on two Amazon EC2 clusters and one Facebook production cluster. The results show that YSmart can outperform Hive and Pig, two widely used SQL-to-MapReduce translators, by more than four times for query execution.
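To make the "one-operation-to-one-job" problem a bit more concrete, here is a toy Python sketch. It is not YSmart's algorithm and it does not use Hadoop at all: `run_job`, `naive_plan`, `merged_plan`, and the `clicks` data are all made up for illustration, and "correlated" is simplified here to mean "reads the same input and groups by the same key". It only shows why two aggregations plus a join can collapse from three jobs into one.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Toy single-process stand-in for one MapReduce job (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:                  # map phase: one full scan of the input
        for key, value in mapper(record):
            groups[key].append(value)       # "shuffle": group values by key
    # reduce phase: one reducer call per key
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical click log: (user, seconds_on_page)
clicks = [("ann", 30), ("bob", 5), ("ann", 12), ("cat", 7), ("bob", 20)]


def naive_plan(rows):
    """One job per operation: two aggregations plus a join, so three jobs."""
    counts = run_job(rows, lambda r: [(r[0], 1)], lambda k, v: sum(v))
    totals = run_job(rows, lambda r: [(r[0], r[1])], lambda k, v: sum(v))
    # The join job reads both intermediate results, tagged by their origin.
    tagged = [(u, ("count", c)) for u, c in counts.items()]
    tagged += [(u, ("total", t)) for u, t in totals.items()]
    joined = run_job(tagged, lambda r: [(r[0], r[1])], lambda k, v: dict(v))
    result = {u: (d["count"], d["total"]) for u, d in joined.items()}
    return result, 3                        # (answer, number of jobs used)


def merged_plan(rows):
    """Same input, same grouping key: a single job computes both aggregates."""
    result = run_job(
        rows,
        lambda r: [(r[0], r[1])],
        lambda k, v: (len(v), sum(v)),
    )
    return result, 1                        # (answer, number of jobs used)


if __name__ == "__main__":
    print(naive_plan(clicks))
    print(merged_plan(clicks))
```

Both plans return the same per-user (count, total) pairs, but the merged plan scans the table once and skips the intermediate results and the join job. That is the kind of redundant computation and I/O the abstract says a correlation-aware translator can avoid.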