hive: All content tagged as hive in NoSQL databases and polyglot persistence

Scaling the Facebook data warehouse to 300 PB

Fascinating read, raising interesting observations on different levels:

  1. At Facebook, data warehouse means Hadoop and Hive.

    Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB.

  2. I don’t see how in-memory solutions, like SAP Hana, will expand their market.

    In the Enterprise Data Warehouses and the first Hadoop squeeze, Rob Klopp predicted a squeeze of the EDW market under the pressure of in-memory DBMS and Hadoop. I still think that in-memory will become just a custom engine in the Hadoop toolkit and existing EDW products.

    As for the oft-repeated argument that “not everybody is Facebook”, I think the part that gets swept under the rug is that today’s data size is the smallest you’ll ever have.

    In the last year, the warehouse has seen a 3x growth in the amount of data stored. Given this growth trajectory, storage efficiency is and will continue to be a focus for our warehouse infrastructure.

  3. At Facebook’s scale, balancing availability and costs is again a challenge. But there’s no mention of network-attached storage.

    There are many areas we are innovating in to improve storage efficiency for the warehouse – building cold storage data centers, adopting techniques like RAID in HDFS to reduce replication ratios (while maintaining high availability), and using compression for data reduction before it’s written to HDFS.

  4. For the nuts and bolts of effectively optimizing compression, read the rest of the post, which covers the optimizations Facebook brought to the ORCFile format. (A minimal ORC table sketch follows this list.)

    There seem to be two competing formats at play: ORCFile (with support from Hortonworks and Facebook) and Parquet (with support from Twitter and Cloudera). Unfortunately I don’t have any good comparison of the two. And I couldn’t find one (why?).
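
To make the storage-efficiency point concrete, here is a minimal HiveQL sketch of an ORC-backed table with compression enabled. The table, columns, and codec choice are illustrative assumptions, not details from the Facebook post:

```sql
-- Minimal sketch: a Hive table stored as ORCFile with ZLIB compression.
-- Table and column names are hypothetical.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (ds STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```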

Original title and link: Scaling the Facebook data warehouse to 300 PB (NoSQL database©myNoSQL)

via: https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/


SQL on Hadoop: An overview of frameworks and their applicability

An overview of the three SQL-on-Hadoop execution models (batch: tens of minutes and up; interactive: seconds to minutes; operational: sub-second), their applicability to different classes of applications, and the main characteristics of the tools/frameworks in each category:

Within the big data landscape there are multiple approaches to accessing, analyzing, and manipulating data in Hadoop. Each depends on key considerations such as latency, ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs.

The usual suspects are included: Hive, Impala, Presto, Spark/Shark, Drill.

[Figure: SQL-on-Hadoop segments diagram]

Original title and link: SQL on Hadoop: An overview of frameworks and their applicability (NoSQL database©myNoSQL)

via: http://www.mapr.com/products/sql-on-hadoop-details


Everything is faster than Hive

Derrick Harris has brought together a series of benchmarks conducted by the different SQL-on-Hadoop implementors, each comparing their solution (Impala, Stinger/Tez, HAWQ, Shark) with Hive:

For what it’s worth, everyone is faster than Hive — that’s the whole point of all of these SQL-on-Hadoop technologies. How they compare with each other is harder to gauge, and a determination probably best left to individual companies to test on their own workloads as they’re making their own buying decisions. But for what it’s worth, here is a collection of more benchmark tests showing the performance of various Hadoop query engines against Hive, relational databases and, sometimes, themselves.

As Derrick Harris remarks, the only direct comparisons are between HAWQ and Impala (and this one seems dated, as it mentions Impala still being in beta) and the benchmark run by AMPlab (the team behind Shark) comparing Redshift, Hive, Shark, and Impala.

The good part is that both the Hive Testbench and AMPlab benchmark are available on GitHub.

Original title and link: Everything is faster than Hive (NoSQL database©myNoSQL)

via: http://gigaom.com/2014/01/13/cloudera-says-impala-is-faster-than-hive-which-isnt-saying-much/


HAWK: Performance monitoring tool for Hive

JunHo Cho’s slides introducing HAWK, a performance monitoring tool for Hive:

✚ I couldn’t find a link for HAWK. The slides point to NexR.

Original title and link: HAWK: Performance monitoring tool for Hive (NoSQL database©myNoSQL)


A quick guide to using Sentry authorization in Hive

A guide to Apache Sentry:

Sentry brings in fine-grained authorization support for both data and metadata in a Hadoop cluster. It is already being used in production systems to secure the data and provide fine-grained access to its users. It is also integrated with the version of Hive shipping in CDH (upstream contribution is pending), Cloudera Impala, and Cloudera Search.
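
For a taste of what fine-grained authorization looks like in practice, here is a minimal sketch using the SQL-based policy grammar Sentry later exposed through Hive (early releases used an ini-style policy file instead); the role, group, and database names are made up:

```sql
-- Hypothetical sketch of Sentry's SQL-based grammar via Hive.
-- Role, group, and database names are illustrative.
CREATE ROLE analyst_role;
GRANT SELECT ON DATABASE sales TO ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;
```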

Original title and link: A quick guide to using Sentry authorization in Hive (NoSQL database©myNoSQL)

via: https://blogs.apache.org/sentry/entry/getting_started


A prolific season for Hadoop and its ecosystem

In 4 years of writing this blog I haven’t seen such a prolific month:

  • Apache Hadoop 2.2.0 (more links here)
  • Apache HBase 0.96 (here and here)
  • Apache Hive 0.12 (more links here)
  • Apache Ambari 1.4.1
  • Apache Pig 0.12
  • Apache Oozie 4.0.0
  • Plus Presto.

Actually, I don’t think I’ve ever seen an ecosystem like the one created around Hadoop.

Original title and link: A prolific season for Hadoop and its ecosystem (NoSQL database©myNoSQL)


Status update on Project Stinger, the interactive query for Apache Hive

Cloudera is investing in Impala. Pivotal in HAWQ. Facebook, which created Hive, has announced Presto.

Hortonworks continues to work on Hive with project Stinger and Apache Tez. Mid-October, they announced Hive 0.12:

[Figure: Hive 0.12 release]

And at the end of October, Hortonworks has shared a new set of results:

Historically, even simple Hive queries could not run in less than 30 seconds, yet many of these queries are running in less than 10 seconds. How did that happen? The answer mainly boils down to Apache Tez and Apache Hadoop YARN, which proves that Hadoop is more than just batch. Tez features such as container pre-launch and re-use overcome Hadoop’s traditional latency barriers, and are available to any data processing framework running in Hadoop.

[Figure: Stinger performance results]

Pretty impressive.
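
For those who want to try Tez-backed Hive themselves, a minimal sketch follows. Caveats: hive.execution.engine shipped with Hive 0.13, and container re-use is a Tez-side setting, so treat the exact property names as version-dependent assumptions:

```sql
-- Switch Hive from MapReduce to Tez (Hive 0.13+).
SET hive.execution.engine=tez;
-- Tez container re-use, one of the latency optimizations mentioned above.
SET tez.am.container.reuse.enabled=true;

-- Any subsequent query now runs on Tez; the table name is hypothetical.
SELECT ds, COUNT(*) FROM page_views GROUP BY ds;
```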

Original title and link: Status update on Project Stinger, the interactive query for Apache Hive (NoSQL database©myNoSQL)


Hive Cheat Sheet for SQL Users

Nice resource for people familiar with SQL looking into Hive:

Simple Hive Cheat Sheet for SQL Users
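
As a flavor of what such a cheat sheet covers, here is a small hedged sketch of Hive DDL that has no direct ANSI SQL counterpart; the table name, columns, and HDFS path are made up:

```sql
-- Hive-specific DDL a SQL user won't find in ANSI SQL.
-- Table name, columns, and path are illustrative.
CREATE TABLE logs (
  ip  STRING,
  msg STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Bulk-load a file from HDFS instead of inserting row by row.
LOAD DATA INPATH '/data/raw/logs.tsv' INTO TABLE logs;
```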

Original title and link: Hive Cheat Sheet for SQL Users (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/


How Safari Books Online uses Google BigQuery for BI

Looking for alternative solutions to build our dashboards and enable interactive ad-hoc querying, we played with several technologies, including Hadoop. In the end, we decided to use Google BigQuery.

Compare the original processing flow:

[Figure: BigQuery processing flow]

with these two possible alternatives and tell me if you notice any significant differences.

[Figure: alternatives to BigQuery]

Original title and link: How Safari Books Online uses Google BigQuery for BI (NoSQL database©myNoSQL)

via: http://googlecloudplatform.blogspot.com/2013/07/how-safari-books-online-uses-google.html


Optimizing Joins running on HDInsight Hive on Azure

Two notable things in Denny Lee’s post about optimizing some of the Hive joins used by Microsoft’s Online Services Division:

  1. Microsoft is drinking its own HDInsight-on-Azure champagne. This will take the HDInsight product far, as they’ll always have first-hand feedback about the parts of the system that need improvement.
  2. Know the different types of JOINs supported by Hive and don’t be afraid of experimenting (a map-join sketch follows below).

✚ An extra point for the link to Liyin Tang and Namit Jain’s Join strategies in Hive (PDF)
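
As promised above, a minimal sketch of one such experiment: steering Hive toward a map-side join, which streams the large table while holding the small one in memory. The settings, hint syntax, and table names are assumptions based on the Hive of that era:

```sql
-- Let Hive convert joins automatically when one side is small enough.
SET hive.auto.convert.join=true;

-- Or force a map-side join with the older MAPJOIN hint;
-- dim_dates is assumed to be the small dimension table.
SELECT /*+ MAPJOIN(d) */ f.user_id, d.week
FROM fact_clicks f
JOIN dim_dates d ON (f.ds = d.ds);
```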

Original title and link: Optimizing Joins running on HDInsight Hive on Azure (NoSQL database©myNoSQL)

via: http://dennyglee.com/2013/04/26/optimizing-joins-running-on-hdinsight-hive-on-azure-at-gfs/


RCFile - ORCFile - Parquet: Storing Big Data With Hive

Christian Prokopp explaining the advantages of the RCFile storage:

The state-of-the-art solution for Hive is the RCFile. The format has been co-developed by Facebook, which is running the largest Hadoop and Hive installation in the world. RCFile has been adopted by the Hive and Pig projects as the core format for table like data storage. The goal of the format development was “(1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns,” as can be seen in this PDF from the development teams.

Questions:

  1. Is there any connection between RCFile and Parquet, the new columnar storage format? At first glance, the goals of the two are pretty similar.
  2. It looks like there’s already a new format that will supersede RCFile: ORC Files. Are all three approaches independent of each other? If so, what are the pros and cons of each? (A storage-format sketch follows this list.)
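
For context, the storage format is a per-table choice in Hive, so at the DDL level switching among the three is a one-line change. A hedged sketch with made-up table names; note that native ORC and Parquet support arrived in Hive releases after RCFile:

```sql
-- Storage format is declared per table; names are illustrative.
CREATE TABLE metrics_rc      (k STRING, v BIGINT) STORED AS RCFILE;
-- The competing formats discussed above, in later Hive versions:
-- CREATE TABLE metrics_orc     (k STRING, v BIGINT) STORED AS ORC;
-- CREATE TABLE metrics_parquet (k STRING, v BIGINT) STORED AS PARQUET;
```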

Original title and link: RCFile - ORCFile - Parquet: Storing Big Data With Hive (NoSQL database©myNoSQL)

via: http://www.bigdatarepublic.com/author.asp?section_id=2840&doc_id=262756


Apache Hive 0.11: Stinger Phase 1 Delivered

Owen O’Malley on Hortonworks’ blog:

As representatives of this open, community led effort we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, of which there were 28 new features and 276 bug fixes. There were FIFTY-FIVE developers involved in this and I would like to thank every one of them.

This is indeed the power of open. But don’t forget that too much bragging might diminish it: keep repeating a word and its value will slowly vanish.

Original title and link: Apache Hive 0.11: Stinger Phase 1 Delivered (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/apache-hive-0-11-stinger-phase-1-delivered/