hive: All content tagged as hive in NoSQL databases and polyglot persistence
Tuesday, 17 July 2012
Klout Data Architecture: MySQL, HBase, Hive, Pig, Elastic Search, MongoDB, SSAS
Just found a slide deck (embedded below) describing the data workflow at Klout. Their architecture has many interesting pieces, combining NoSQL and relational databases with Hadoop, Hive, Pig, and traditional BI. Even Excel gets a mention in the slides:
- Pig and Hive
- HBase
- Elastic Search
- MongoDB
- MySQL
Monday, 16 July 2012
Comparing File Formats and Compression Methods in HDFS and Hive
The post is a bit old, but its data comparing different compression methods is still helpful.
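To get a feel for this kind of comparison, here is a minimal sketch that measures compression ratios with the codecs in Python's standard library. It does not reproduce the post's numbers: the sample data is made up, and LZO and Snappy, both common in Hadoop, are left out because they need third-party packages.

```python
import bz2
import gzip
import lzma


def compression_report(data: bytes) -> dict:
    """Return compressed size and ratio (compressed / original) per codec."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    report = {}
    for name, (compress, decompress) in codecs.items():
        packed = compress(data)
        # Round-trip check: each codec must restore the original bytes.
        if decompress(packed) != data:
            raise ValueError(f"{name} failed to round-trip the data")
        report[name] = {"bytes": len(packed), "ratio": len(packed) / len(data)}
    return report


# Hypothetical sample: repetitive, log-like text invented for this sketch.
sample = b"".join(
    b"2012-07-16 10:00:%02d GET /item/%d 200\n" % (i % 60, i % 500)
    for i in range(5000)
)

if __name__ == "__main__":
    for name, stats in compression_report(sample).items():
        print(f"{name:6s} {stats['bytes']:8d} bytes  ratio={stats['ratio']:.3f}")
```

Ratios depend heavily on the data, so run it against a slice of your own files before picking a codec.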
Original title and link: Comparing File Formats and Compression Methods in HDFS and Hive (©myNoSQL)
Friday, 15 June 2012
Hortonworks Data Platform 1.0
Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.
Some info points:
- Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop
- HDP 1.0 is based on Apache Hadoop 1.0
- Apache Ambari is used for installation and provisioning
- The same Apache Ambari is behind the Hortonworks Management Console
- For Data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
- Apache HCatalog is the solution offering metadata and table management
- Hortonworks Data Platform is 100% open source. I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community
- HDP comes with 3 levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster
One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on using VMware-powered virtual machines for the NameNode and JobTracker. My first thought about this approach was that it was chosen to strengthen a partnership with VMware. On the other hand, Hadoop 2.0 already contains a new highly available version of the NameNode (Cloudera’s Hadoop distribution uses this solution), and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.
You can read a lot of posts about this announcement, but you’ll find all the details in Hortonworks’s John Kreisa’s post here and the PR announcement.
Original title and link: Hortonworks Data Platform 1.0 (©myNoSQL)
Thursday, 24 May 2012
Using R With Cassandra Through JDBC or Hive
A short post by Jake Luciani listing 2 R modules—RJDBC module and RCassandra—that enable using R with Cassandra through either the JDBC or Hive drivers.
This is a good example of what I meant by designing products with openness and integration in mind.
Original title and link: Using R With Cassandra Through JDBC or Hive (©myNoSQL)
via: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive
Wednesday, 4 April 2012
Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop
The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.
Currently packaging:
- Apache Hadoop 1.0.x
- Apache Zookeeper 3.4.3
- Apache HBase 0.92.0
- Apache Hive 0.8.1
- Apache Pig 0.9.2
- Apache Mahout 0.6.1
- Apache Oozie 3.1.3
- Apache Sqoop 1.4.1
- Apache Flume 1.0.0
- Apache Whirr 0.7.0
Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. In practice, though, its goal is to ensure that all the components of the wide Hadoop ecosystem remain interoperable.
Original title and link: Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop (©myNoSQL)
Monday, 26 March 2012
Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark
Sami Badawi enumerates the issues he encountered while trying all these tools (Pig, Scalding, Scoobi, Hive, Spark, Scrunch, Cascalog) for a simple experiment with Hadoop:
The task was to read log files join with other data do some statistics on arrays of doubles. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data.
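For a sense of scale, the task described above fits in a few lines of plain Python on a toy dataset. This is only a sketch of the shape of the job (parse, join, aggregate), not Sami Badawi's code; the log format, the user table, and all names are invented for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical inputs, invented for this sketch.
LOG_LINES = [
    "u1 0.5,1.5,2.5",
    "u2 4.0,6.0",
    "u1 3.5",
    "u3 9.0",  # no matching user record, so the join drops it
]
USERS = {"u1": "US", "u2": "DE"}  # user id -> country


def parse(line):
    """Map step: split a log line into (user_id, list of doubles)."""
    user_id, raw = line.split()
    return user_id, [float(x) for x in raw.split(",")]


def stats_by_country(lines, users):
    """Join parsed log records with the user table, then aggregate per country."""
    grouped = defaultdict(list)
    for user_id, values in map(parse, lines):
        country = users.get(user_id)
        if country is not None:  # inner join on user id
            grouped[country].extend(values)
    return {
        country: {
            "n": len(values),
            "mean": statistics.mean(values),
            "stdev": statistics.pstdev(values),  # population standard deviation
        }
        for country, values in grouped.items()
    }


if __name__ == "__main__":
    for country, stats in sorted(stats_by_country(LOG_LINES, USERS).items()):
        print(country, stats)
```

The tools in the list exist to express these same three steps over data that no longer fits on one machine.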
- Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
- Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
- Spark: open source cluster computing system that aims to make data analytics fast: both fast to run and fast to write.
- Cascalog: a fully-featured Clojure-based data processing and querying library for Hadoop.
Original title and link: Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark (©myNoSQL)
via: http://blog.samibadawi.com/2012/03/hive-pig-scalding-scoobi-scrunch-and.html
Monday, 20 February 2012
Lightning Talk on Cascalog
Just 19 slides, but Paul Lam manages to provide both a comparison of Cascalog and Hive and an overview of the most interesting bits of Cascalog.
Cascalog vs Hive

Cascalog Query Pipe Assembly

Highly recommended for understanding what’s in the Cascalog box.
Monday, 13 February 2012
The components and their functions in the Hadoop ecosystem
Edd Dumbill enumerates the various components of the Hadoop ecosystem:

My quick reference of the Hadoop ecosystem includes a couple of other tools that are not in this list; the exceptions are Ambari and HCatalog, which were released later.
Original title and link: The components and their functions in the Hadoop ecosystem (©myNoSQL)
Wednesday, 8 February 2012
Visualizing Hadoop data with Tableau Software and Cloudera Connector for Tableau
Put together one of the most impressive visualization tools, Tableau Software, with one of the best solutions for big data, Hadoop, and you’ll probably get some astonishing results.

Credit Cloudera.
While Tableau Software works with structured data only, with this connector it gets access to Hive through HiveQL.
Original title and link: Visualizing Hadoop data with Tableau Software and Cloudera Connector for Tableau (©myNoSQL)
via: http://www.cloudera.com/blog/2012/02/cloudera-connector-for-tableau-has-been-released/
Wednesday, 1 February 2012
Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support
Starting today customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab in the Job Flow Details page. These metrics are pushed to CloudWatch every five minutes at no cost to you and include information on:
- Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.
- Job flow contention including metrics on HDFS utilization, map and reduce slots open, jobs running, and the ratio between map tasks remaining and map slots.
- Job flow health including metrics on whether your job flow is idle, if there are missing data blocks, and if there are any dead nodes.
That’s like free pr0n for operations teams.
On a different note, I’ve noticed that the Hadoop stack (Hadoop, Hive, Pig) on Amazon Elastic MapReduce is based on second-to-last versions, which suggests that Amazon performs extensive testing before rolling new versions out:
- Hadoop: 0.20.205 (precursor of Hadoop 1.0.0; supports append and security, but doesn’t have RAID, symlinks, or MR2)
- Hive: 0.7.1 (precursor of latest 0.8.0)
- Pig: 0.9.1 (precursor of latest 0.9.2)
Original title and link: Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support (©myNoSQL)
Monday, 30 January 2012
Powered by Hadoop and Hive: Budgeting for snow removal in your local community
I don’t know how I ended up becoming the head of our local community association. Anyhow, I’m now responsible for laying out next year’s budget. Most of our expenses seem to be fixed from one year to another, but then there’s the expense for the snow removal service. This year, no snow. Last year, most snow on record in 30 years! How do you budget for something as volatile as snow? I need more data!
Instead of just googling the answer, we’re going to fetch some raw data and feed it into Hadoop Hive.
Hadoop FTW!
Original title and link: Powered by Hadoop and Hive: Budgeting for snow removal in your local community (©myNoSQL)
via: http://magnusljadas.wordpress.com/2012/01/29/search-for-snow-with-hadoop-hive/
Friday, 27 January 2012
Measuring User Retention With Hadoop and Hive
A very practical example of how Hive and Hadoop can deliver value when applied to clickstreams, the most common data for any web property:
Hadoop, Hive, and related technologies are formidable tools for unlocking value from data. […] Retention measurements are particularly significant because they paint a detailed picture about the overall stickiness of a product across the entire userbase.
The same clickstream data can be used to calculate visitors’ conversion with the Bayesian discriminant using Hadoop.
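To make the retention idea concrete, here is a minimal sketch of a cohort retention calculation in plain Python. It is not the query from the linked post: the event data, the function name, and the definition used here (a user counts as retained if seen again exactly N days after their first visit) are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical clickstream reduced to (user_id, day of visit) pairs.
EVENTS = [
    ("a", date(2012, 1, 1)), ("a", date(2012, 1, 2)), ("a", date(2012, 1, 8)),
    ("b", date(2012, 1, 1)), ("b", date(2012, 1, 2)),
    ("c", date(2012, 1, 2)), ("c", date(2012, 1, 9)),
]


def retention(events, offset_days):
    """Per first-visit cohort, the fraction of users seen again exactly offset_days later."""
    days_by_user = defaultdict(set)
    for user, day in events:
        days_by_user[user].add(day)

    cohort_size = defaultdict(int)
    returned = defaultdict(int)
    for days in days_by_user.values():
        first = min(days)  # the cohort is the user's first active day
        cohort_size[first] += 1
        if any((day - first).days == offset_days for day in days):
            returned[first] += 1
    return {cohort: returned[cohort] / size for cohort, size in cohort_size.items()}


if __name__ == "__main__":
    for offset in (1, 7):
        print(f"day-{offset} retention:", retention(EVENTS, offset))
```

In Hive the same logic would typically be a grouped self-join over the event table; the sketch only shows what is being counted.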
Original title and link: Measuring User Retention With Hadoop and Hive (©myNoSQL)
via: http://blog.polarmobile.com/2012/01/measuring-user-retention-with-hadoop-and-hive/


