hive: All content tagged as hive in NoSQL databases and polyglot persistence
Friday, 24 May 2013
Optimizing Joins running on HDInsight Hive on Azure
Two notable things in Denny Lee’s post about optimizing some of the Hive joins used by Microsoft’s Online Services Division:
- Microsoft is drinking its own HDInsight on Azure champagne. This should take the HDInsight product far, as they’ll always have first-hand feedback about the parts of the system that need improvement.
- Know the different types of JOINs supported by Hive and don’t be afraid of experimenting.
✚ An extra point for the link to Liyin Tang and Namit Jain’s Join strategies in Hive (PDF)
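To make the second point a bit more concrete, here is a minimal sketch of the idea behind a map-side join, one of the join types Hive supports: the smaller table is loaded into an in-memory hash table and the larger table is streamed past it, so no shuffle to reducers is needed for the join itself. This is plain Python, not Hive code, and the function name, table contents, and column layout are all made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, str]  # (join key, payload); a hypothetical two-column table


def map_side_join(small: Iterable[Row], large: Iterable[Row]) -> Iterator[Tuple[int, str, str]]:
    """Inner join that builds a hash table from the small input
    and streams the large input past it (no sort, no shuffle)."""
    lookup: Dict[int, List[str]] = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)

    for key, value in large:
        # Rows of the large table without a match are dropped (inner join).
        for match in lookup.get(key, []):
            yield key, value, match


# Hypothetical data: a small dimension table and a larger fact table.
users = [(1, "ann"), (2, "bob")]
clicks = [(1, "/home"), (3, "/faq"), (1, "/docs"), (2, "/home")]

joined = list(map_side_join(users, clicks))
for row in joined:
    print(row)
```

The trade-off is memory: this only works when one side fits in memory on every task, which is why it pays to know which join type a query ends up using.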
Original title and link: Optimizing Joins running on HDInsight Hive on Azure (©myNoSQL)
via: http://dennyglee.com/2013/04/26/optimizing-joins-running-on-hdinsight-hive-on-azure-at-gfs/
RCFile - OCFile - Parquet: Storing Big Data With Hive
Christian Prokopp explaining the advantages of the RCFile storage:
The state-of-the-art solution for Hive is the RCFile. The format has been co-developed by Facebook, which is running the largest Hadoop and Hive installation in the world. RCFile has been adopted by the Hive and Pig projects as the core format for table like data storage. The goal of the format development was “(1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns,” as can be seen in this PDF from the development teams.
Questions:
- Is there any connection between RCFile and Parquet, the new columnar storage format? At first glance, the goals of the two are pretty similar.
- It looks like there’s already a new format that will supersede RCFile: ORC Files. Are all these 3 approaches independent of each other? If yes, then what are the pros and cons of each of them?
Original title and link: RCFile - OCFile - Parquet: Storing Big Data With Hive (©myNoSQL)
via: http://www.bigdatarepublic.com/author.asp?section_id=2840&doc_id=262756
Tuesday, 21 May 2013
Apache Hive 0.11: Stinger Phase 1 Delivered
Owen O’Malley on Hortonworks’ blog:
As representatives of this open, community led effort we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, of which there were 28 new features and 276 bug fixes. There were FIFTY-FIVE developers involved in this and I would like to thank every one of them.
This is indeed the power of open. But don’t forget that too much bragging might diminish it: keep repeating a word and its value will slowly vanish.
Original title and link: Apache Hive 0.11: Stinger Phase 1 Delivered (©myNoSQL)
via: http://hortonworks.com/blog/apache-hive-0-11-stinger-phase-1-delivered/
Friday, 15 March 2013
Paper: YSmart - Yet Another SQL-to-MapReduce Translator
Another weekend read, this time from Facebook and The Ohio State University and closer to the hot topic of the last two weeks: SQL, MapReduce, Hadoop:
MapReduce has become an effective approach to big data analytics in large cluster systems, where SQL-like queries play important roles to interface between users and systems. However, based on our Facebook daily operation results, certain types of queries are executed at an unacceptable low speed by Hive (a production SQL-to-MapReduce translator). In this paper, we demonstrate that existing SQL-to-MapReduce translators that operate in a one-operation-to-one-job mode and do not consider query correlations cannot generate high-performance MapReduce programs for certain queries, due to the mismatch between complex SQL structures and simple MapReduce framework. We propose and develop a system called YSmart, a correlation aware SQL-to-MapReduce translator. YSmart applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators. We have implemented YSmart with intensive evaluation for complex queries on two Amazon EC2 clusters and one Facebook production cluster. The results show that YSmart can outperform Hive and Pig, two widely used SQL-to-MapReduce translators, by more than four times for query execution.
Tuesday, 26 February 2013
What Makes Amazon Redshift Faster Than Hive?
I’m not implying that this question appeared on Quora after my link and comments about Redshift’s performance and costs at Airbnb, but Reynold Xin’s answer covers, in a more formal way, the reasons I suggested in that post for Redshift being faster than Hive:
Some of the advantages you gain from massive scale and flexibility make it challenging to build a more performant query engine. The following outlines how various features (or lack of features) influences performance:
- data format
- task launch overhead (nb: this can be optimized in Hive/Hadoop)
- intermediate data materialization vs pipelining
- columnar data format
- columnar query engine
- faster S3 connection
Original title and link: What Makes Amazon Redshift Faster Than Hive? (©myNoSQL)
via: http://www.quora.com/Hive-computing/What-makes-Amazon-Redshift-faster-than-Hive
Monday, 11 February 2013
Writing Hive UDFs With Java - a Tutorial
Alexander Dean’s tutorial published in SDJ:
In this article you will learn how to write a user-defined function (“UDF”) to work with the Apache Hive platform. We will start gently with an introduction to Hive, then move on to developing the UDF and writing tests for it. We will write our UDF in Java, but use Scala’s SBT as our build tool and write our tests in Scala with Specs2.
As far as I know it’s quite easy to write UDFs for Pig and Hive in any language that has a JVM implementation (Python with Jython, Ruby with JRuby, Groovy).
Original title and link: Writing Hive UDFs With Java - a Tutorial (©myNoSQL)
via: http://snowplowanalytics.com/blog/2013/02/08/writing-hive-udfs-and-serdes/
Wednesday, 30 January 2013
Hadoop in 2013: What Hortonworks Will Focus On
Shaun Connolly summarizing a recent webinar about where Hortonworks’ work on Hadoop will focus in 2013:
[…] Interactive Query, Business Continuity (DR, Snapshots, etc.), Secure Access, as well as ongoing investments in Data Integration, Management (i.e. Ambari), and Online Data (i.e. HBase).
[…] Rather than abandon the Apache Hive community, Hortonworks is focused on working in the community to optimize Hive’s ability to serve big data exploration and interactive query in support of important BI use cases. Moreover, we are focused on enabling Hive to take advantage of YARN in Apache Hadoop 2.0, which will help ensure fast query workloads don’t compete for resources with the other jobs running in the cluster. Enabling Hadoop to predictably support enterprise workloads that span Batch, Interactive, and Online use cases is an important area of focus for us.
Basically this says that Hortonworks sees YARN and Hive as the answer to online or real-time interactive querying of Hadoop data. Cloudera’s take on this is different.
Original title and link: Hadoop in 2013: What Hortonworks Will Focus On (©myNoSQL)
via: http://hortonworks.com/blog/the-road-ahead-for-hortonworks-and-hadoop/
Thursday, 24 January 2013
It Is About Apache Hive, but What Is a SerDe?
The original title of the article is “How-to: Use a SerDe in Apache Hive”, so I knew it was something about Hive, but still had no idea what a SerDe is:
The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.
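A toy version of that pairing may help. The sketch below is plain Python and deliberately does not use Hive's actual Java SerDe interface; the class name, the tab delimiter, and the column names are assumptions made up for illustration. It only shows the two directions: stored text to an in-memory row on the read path, and an in-memory row back to text on the write path.

```python
from typing import Dict, List


class DelimitedSerDe:
    """Toy SerDe for delimiter-separated text records (illustrative only)."""

    def __init__(self, columns: List[str], delimiter: str = "\t") -> None:
        self.columns = columns
        self.delimiter = delimiter

    def deserialize(self, raw: str) -> Dict[str, str]:
        """Read path (think SELECT): stored text -> row the engine can work with."""
        values = raw.rstrip("\n").split(self.delimiter)
        return dict(zip(self.columns, values))

    def serialize(self, row: Dict[str, str]) -> str:
        """Write path (think INSERT-SELECT): in-memory row -> text to store."""
        return self.delimiter.join(row[column] for column in self.columns)


# Hypothetical two-column table.
serde = DelimitedSerDe(["id", "name"])
row = serde.deserialize("42\thive\n")
line = serde.serialize(row)
print(row)
print(repr(line))
```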
On one side we have the Spring frameworks with names like PreAuthenticatedGrantedAuthoritiesWebAuthenticationDetails [1], then we have YouAreDeadException and end with SerDe. No middle ground in the Java world.
[1] Jacek found this Spring class name which has 59 characters. His post is from 2011, so who knows if there isn’t a longer one since then.
Original title and link: It Is About Apache Hive, but What Is a SerDe? (©myNoSQL)
via: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
Monday, 21 January 2013
11 Interesting Releases From the First Weeks of January
The list of releases I wanted to post about has been growing fast these last couple of weeks, so instead of waiting any longer, here it is (in no particular order [1]):
- (Jan.2nd) Cassandra 1.2 — announcement on DataStax’s blog. I’m currently learning and working on a post looking at what’s new in Cassandra 1.2.
- (Jan.10th) Apache Pig 0.10.1 — Hortonworks wrote about it
- (Jan.10th) DataStax Community Edition 1.2 and OpsCenter 2.1.3 — DataStax announcement
- (Jan.10th) CouchDB 1.0.4, 1.1.2, and 1.2.1 — releases fixing some security vulnerabilities
- (Jan.11th) MongoDB 2.3.2 unstable: announcement. This dev release includes support for full text indexing. For more details you can check:
  - MongoDB Full Text Search Explained and MongoDB Text Search Tutorial
  - Full text search in MongoDB: details about supported languages and queries
  - Indexing a Markdown blog using MongoDB full text indexing
  - Short demo of MongoDB text search and hashed shard keys
- (Jan.12th) Apache HBase 0.94.4 — announcement and release notes
- (Jan.14th) Apache Hive 0.10.0: Hortonworks’s post about it
- (Jan.15th) Hortonworks Data Platform 1.2 featuring Apache Ambari: official PR announcement
- (Jan.16th) Redis 2.6.9 — release notes
- (Jan.16th) HyperDex 1.0RC1 — no docs
- (Jan.16th) Klout’s Brickhouse — announcement:
[…] an open source project extending Hadoop and Hive with a collection of useful user-defined-functions. Its aim is to make the Hive Big Data developer more productive, and to enable scalable and robust dataflows.
[1] I’ve tried to order it chronologically, but most probably I’ve failed.
Original title and link: 11 Interesting Releases From the First Weeks of January (©myNoSQL)
Thursday, 3 January 2013
What Is the Spring Data Project?
Short answer: another sign that the Spring framework wants to do everything everywhere. A mammoth [1].
Version 1.0 was released in 2004 as a lightweight alternative to Enterprise Java Beans (EJB). Since then, Spring has expanded into many other areas of enterprise development, such as enterprise integration (Spring Integration), batch processing (Spring Batch), web development (Spring MVC, Spring Webflow), security (Spring Security). Spring continues to push the envelope for mobile applications (Spring Mobile), social media (Spring Social), rich web applications (Spring MVC, s2js Javascript libraries), and NoSQL data access (Spring Data).
[…]
The complete pipeline can be implemented using Spring for Apache Hadoop along with Spring Integration and Spring Batch. However, Hadoop has its own set of challenges which the Spring for Apache Hadoop project is designed to address. Like all Spring projects, it leverages the Spring Framework to provide a consistent structure and simplify writing Hadoop applications. For example, Hadoop applications rely heavily on command shell tools. So applications end up being a hodge-podge of Perl, Python, Ruby, and bash scripts. Spring for Apache Hadoop, provides a dedicated XML namespace for configuring Hadoop jobs with embedded scripting features and support for Hive and Pig.
[1] There’s a business reason for doing this though: when you have tons of clients you want to make sure they don’t have a chance to step outside. Is this New Year’s resolution a heresy: I plan to use vastly less Spring this year?
Original title and link: What Is the Spring Data Project? (©myNoSQL)
via: http://www.odbms.org/blog/2013/01/the-spring-data-project-interview-with-david-turanski/
Monday, 17 December 2012
Hive Top-K Optimization
A simple optimization of top-k queries that can make a huge difference: going from the default behavior of:
- sifting through all the data (necessary),
- sorting it all (necessary),
- writing all the results to disk (unnecessary: saving the limit results from each map is enough), and
- having the reducer process all the data again (unnecessary: the previous step already reduced the amount of data down to limit * number_of_partitions).
For reference a top-k query is:
SELECT * FROM T ORDER BY a DESC LIMIT 10
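The optimized flow can be sketched in a few lines of Python. This is only an illustration of the idea described above, not Hive's or Qubole's implementation: the function names and the sample values of column a are made up, and each inner list stands in for the data seen by one map task.

```python
import heapq
from typing import Iterable, List


def map_top_k(partition: Iterable[int], k: int) -> List[int]:
    """Map side: keep only the k largest values instead of emitting everything."""
    return heapq.nlargest(k, partition)


def reduce_top_k(partials: Iterable[List[int]], k: int) -> List[int]:
    """Reduce side: sees at most k * number_of_partitions values."""
    merged = [value for partial in partials for value in partial]
    return heapq.nlargest(k, merged)


def top_k(partitions: List[List[int]], k: int) -> List[int]:
    partials = [map_top_k(partition, k) for partition in partitions]
    return reduce_top_k(partials, k)


# Hypothetical values of column `a`, split across three map tasks.
partitions = [[5, 1, 9, 3], [8, 2, 7], [4, 10, 6]]
result = top_k(partitions, 3)
print(result)
```

The answer is the same as sorting everything and taking the first rows; what changes is how much data is written by the maps and read by the reducer.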
Original title and link: Hive Top-K Optimization (©myNoSQL)
via: http://www.qubole.com/blog/index.php/top-k-optimization/
Tuesday, 4 September 2012
Pig Performance and Optimization Analysis
Although Pig is designed as a data flow language, it supports all the functionalities required by TPC-H; thus it makes sense to use TPC-H to benchmark Pig’s performance. Below is the final result.
Original title and link: Pig Performance and Optimization Analysis (©myNoSQL)
via: http://hortonworks.com/blog/pig-performance-and-optimization-analysis/
