hive: All content tagged as hive in NoSQL databases and polyglot persistence

Paper: YSmart - Yet Another SQL-to-MapReduce Translator

Another weekend read, this time from Facebook and The Ohio State University and closer to the hot topic of the last two weeks: SQL, MapReduce, Hadoop:

MapReduce has become an effective approach to big data analytics in large cluster systems, where SQL-like queries play important roles to interface between users and systems. However, based on our Facebook daily operation results, certain types of queries are executed at an unacceptably low speed by Hive (a production SQL-to-MapReduce translator). In this paper, we demonstrate that existing SQL-to-MapReduce translators that operate in a one-operation-to-one-job mode and do not consider query correlations cannot generate high-performance MapReduce programs for certain queries, due to the mismatch between complex SQL structures and the simple MapReduce framework. We propose and develop a system called YSmart, a correlation-aware SQL-to-MapReduce translator. YSmart applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators. We have implemented YSmart with intensive evaluation for complex queries on two Amazon EC2 clusters and one Facebook production cluster. The results show that YSmart can outperform Hive and Pig, two widely used SQL-to-MapReduce translators, by more than four times for query execution.
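To make the correlation idea concrete, here is a minimal sketch (hypothetical table and column names) of the kind of query the paper targets: two aggregations over the same table, joined on the same key.

-- Two group-bys over the same input, joined on the grouping key.
-- A one-operation-to-one-job translator compiles this into three MapReduce jobs
-- (two aggregations plus a join); a correlation-aware translator such as YSmart
-- can evaluate all three in a single job, since they scan the same table and
-- partition on the same key.
SELECT a.user_id, a.order_count, b.total_spend
FROM (SELECT user_id, COUNT(*) AS order_count FROM orders GROUP BY user_id) a
JOIN (SELECT user_id, SUM(amount) AS total_spend FROM orders GROUP BY user_id) b
  ON (a.user_id = b.user_id);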


What Makes Amazon Redshift Faster Than Hive?

I’m not implying that this question appeared on Quora after my link and comments about Redshift’s performance and costs at Airbnb, but Reynold Xin’s answer covers, in a more formal way, the reasons Redshift is faster than Hive that I suggested in that post:

Some of the advantages you gain from massive scale and flexibility make it challenging to build a more performant query engine. The following outlines how various features (or lack of features) influence performance:

  1. data format
  2. task launch overhead (nb: this can be optimized in Hive/Hadoop)
  3. intermediate data materialization vs pipelining
  4. columnar data format
  5. columnar query engine
  6. faster S3 connection

Original title and link: What Makes Amazon Redshift Faster Than Hive? (NoSQL database©myNoSQL)

via: http://www.quora.com/Hive-computing/What-makes-Amazon-Redshift-faster-than-Hive


Writing Hive UDFs With Java - a Tutorial

Alexander Dean’s tutorial published in SDJ:

In this article you will learn how to write a user-defined function (“UDF”) to work with the Apache Hive platform. We will start gently with an introduction to Hive, then move on to developing the UDF and writing tests for it. We will write our UDF in Java, but use Scala’s SBT as our build tool and write our tests in Scala with Specs2.

As far as I know, it’s quite easy to write UDFs for Pig and Hive in any language that has a JVM implementation (Python with Jython, Ruby with JRuby, Groovy).
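As a rough sketch of how a compiled UDF ends up being used from Hive (the jar path, class name, and function name below are hypothetical), registration and invocation look like this:

-- make the compiled UDF jar available to the session
ADD JAR /tmp/my-udfs.jar;
-- bind a Hive function name to the UDF class
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';
-- call it like any built-in function
SELECT to_upper(name) FROM users LIMIT 10;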

Original title and link: Writing Hive UDFs With Java - a Tutorial (NoSQL database©myNoSQL)

via: http://snowplowanalytics.com/blog/2013/02/08/writing-hive-udfs-and-serdes/


Hadoop in 2013: What Hortonworks Will Focus On

Shaun Connolly, summarizing a recent webinar about where Hortonworks’ work on Hadoop will focus in 2013:

[…] Interactive Query, Business Continuity (DR, Snapshots, etc.), Secure Access, as well as ongoing investments in Data Integration, Management (i.e. Ambari), and Online Data (i.e. HBase).
[…] Rather than abandon the Apache Hive community, Hortonworks is focused on working in the community to optimize Hive’s ability to serve big data exploration and interactive query in support of important BI use cases. Moreover, we are focused on enabling Hive to take advantage of YARN in Apache Hadoop 2.0, which will help ensure fast query workloads don’t compete for resources with the other jobs running in the cluster. Enabling Hadoop to predictably support enterprise workloads that span Batch, Interactive, and Online use cases is an important area of focus for us.

Basically this says that Hortonworks sees YARN and Hive as the answer to online or real-time interactive querying of Hadoop data. Cloudera’s take on this is different.

Original title and link: Hadoop in 2013: What Hortonworks Will Focus On (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/the-road-ahead-for-hortonworks-and-hadoop/


It Is About Apache Hive, but What Is a SerDe?

The original title of the article is “How-to: Use a SerDe in Apache Hive”, so I knew it was something about Hive, but still had no idea what a SerDe is:

The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.
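For a concrete picture of where a SerDe shows up, it is typically attached to a table when the table is created; the class and table names below are hypothetical:

-- the named class handles both directions: its Deserializer runs on SELECT,
-- its Serializer on INSERT
CREATE TABLE events (
  event_time STRING,
  payload    STRING
)
ROW FORMAT SERDE 'com.example.hive.serde.JsonEventSerDe'
STORED AS TEXTFILE;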

On one side we have the Spring frameworks with names like PreAuthenticatedGrantedAuthoritiesWebAuthenticationDetails1, then we have YouAreDeadException, and we end with SerDe. No middle ground in the Java world.


  1. Jacek found this Spring class name which has 59 characters. His post is from 2011, so who knows if there isn’t a longer one since then. 

Original title and link: It Is About Apache Hive, but What Is a SerDe? (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/


11 Interesting Releases From the First Weeks of January

The list of releases I wanted to post about has been growing fast these last couple of weeks, so instead of waiting any longer, here it is (in no particular order1):

  1. (Jan.2nd) Cassandra 1.2 — announcement on DataStax’s blog. I’m currently learning and working on a post looking at what’s new in Cassandra 1.2.
  2. (Jan.10th) Apache Pig 0.10.1 — Hortonworks wrote about it
  3. (Jan.10th) DataStax Community Edition 1.2 and OpsCenter 2.1.3 — DataStax announcement
  4. (Jan.10th) CouchDB 1.0.4, 1.1.2, and 1.2.1 — releases fixing some security vulnerabilities
  5. (Jan.11th) MongoDB 2.3.2 unstable — announcement. This dev release includes support for full text indexing. For more details you can check:

    […] an open source project extending Hadoop and Hive with a collection of useful user-defined-functions. Its aim is to make the Hive Big Data developer more productive, and to enable scalable and robust dataflows.


  1. I’ve tried to order it chronologically, but most probably I’ve failed. 

Original title and link: 11 Interesting Releases From the First Weeks of January (NoSQL database©myNoSQL)


What Is the Spring Data Project?

Short answer: another sign that the Spring framework wants to do everything everywhere. A mammoth1.

Version 1.0 was released in 2004 as a lightweight alternative to Enterprise Java Beans (EJB). Since then, Spring has expanded into many other areas of enterprise development, such as enterprise integration (Spring Integration), batch processing (Spring Batch), web development (Spring MVC, Spring Webflow), and security (Spring Security). Spring continues to push the envelope for mobile applications (Spring Mobile), social media (Spring Social), rich web applications (Spring MVC, s2js Javascript libraries), and NoSQL data access (Spring Data).

[…]

The complete pipeline can be implemented using Spring for Apache Hadoop along with Spring Integration and Spring Batch. However, Hadoop has its own set of challenges which the Spring for Apache Hadoop project is designed to address. Like all Spring projects, it leverages the Spring Framework to provide a consistent structure and simplify writing Hadoop applications. For example, Hadoop applications rely heavily on command shell tools. So applications end up being a hodge-podge of Perl, Python, Ruby, and bash scripts. Spring for Apache Hadoop provides a dedicated XML namespace for configuring Hadoop jobs, with embedded scripting features and support for Hive and Pig.


  1. There’s a business reason for doing this though: when you have tons of clients, you want to make sure they don’t have a chance to step outside. Is this New Year’s resolution a heresy: I plan to use vastly less Spring this year.

Original title and link: What Is the Spring Data Project? (NoSQL database©myNoSQL)

via: http://www.odbms.org/blog/2013/01/the-spring-data-project-interview-with-david-turanski/


Hive Top-K Optimization

A simple optimization of top-k queries that can make a huge difference: going from the default behavior of:

  1. sifting through all the data (necessary),
  2. sorting it all (necessary),
  3. writing all the results to disk (unnecessary—saving all the limit results from each map is enough), and
  4. having the reducer process again all the data (unnecessary—the previous step already reduced the amount of data down to the limit * number_of_partitions).

For reference, a top-k query looks like this:

SELECT * FROM T ORDER BY a DESC LIMIT 10

Original title and link: Hive Top-K Optimization (NoSQL database©myNoSQL)

via: http://www.qubole.com/blog/index.php/top-k-optimization/


Pig Performance and Optimization Analysis

Although Pig is designed as a data flow language, it supports all the functionalities required by TPC-H; thus it makes sense to use TPC-H to benchmark Pig’s performance. Below is the final result.

[Chart: TPC-H benchmark results at 100 GB]

Original title and link: Pig Performance and Optimization Analysis (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/pig-performance-and-optimization-analysis/


Klout Data Architecture: MySQL, HBase, Hive, Pig, Elastic Search, MongoDB, SSAS

Just found a slide deck (embedded below) describing the data workflow at Klout. Their architecture includes many interesting pieces, combining both NoSQL and relational databases with Hadoop, Hive, Pig, and traditional BI. Even Excel gets a mention in the slides:

  1. Pig and Hive
  2. HBase
  3. Elastic Search
  4. MongoDB
  5. MySQL

[Embedded slides: Klout Data Architecture]


Comparing File Formats and Compression Methods in HDFS and Hive

The post is a bit old, but the data it contains comparing different compression methods is helpful:

[Chart: comparison of file formats and compression methods in HDFS and Hive]

Original title and link: Comparing File Formats and Compression Methods in HDFS and Hive (NoSQL database©myNoSQL)


Hortonworks Data Platform 1.0

Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.

Some info points:

  1. Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop

    [Diagram: Hortonworks Data Platform components]

    1. HDP 1.0 is based on Apache Hadoop 1.0
    2. Apache Ambari is used for installation and provisioning
    3. The same Apache Ambari is behind the Hortonworks Management Console
    4. For data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
    5. Apache HCatalog is the solution offering metadata and table management
  2. Hortonworks Data Platform is 100% open source—I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community

  3. HDP comes with 3 levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster

One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on using VMware-powered virtual machines for the NameNode and JobTracker. My first thought about this approach is that it was chosen to strengthen a partnership with VMware. On the other hand, Hadoop 2.0 already contains a new highly-available version of the NameNode (Cloudera's Hadoop Distribution uses this solution), and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.

You can read a lot of posts about this announcement, but you’ll find all the details in the post by Hortonworks’ John Kreisa and in the PR announcement.

Original title and link: Hortonworks Data Platform 1.0 (NoSQL database©myNoSQL)