Hive: All content tagged as Hive in NoSQL databases and polyglot persistence
Friday, 15 June 2012
Hortonworks Data Platform 1.0
Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.
Some info points:
- Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop
- HDP 1.0 is based on Apache Hadoop 1.0
- Apache Ambari is used for installation and provisioning
- The same Apache Ambari is behind the Hortonworks Management Console
- For data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
- Apache HCatalog is the solution offering metadata and table management
- Hortonworks Data Platform is 100% open source. I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community
- HDP comes with 3 levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster
One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on using VMware-powered virtual machines for the NameNode and JobTracker. My first thought about this approach is that it was chosen to strengthen a partnership with VMware. On the other hand, Hadoop 2.0 already contains a new highly-available version of the NameNode (Cloudera Hadoop Distribution uses this solution) and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.
You can read a lot of posts about this announcement, but you’ll find all the details in Hortonworks’s John Kreisa’s post here and the PR announcement.
Original title and link: Hortonworks Data Platform 1.0 (©myNoSQL)
Thursday, 24 May 2012
Using R With Cassandra Through JDBC or Hive
A short post by Jake Luciani listing two R modules, RJDBC and RCassandra, that enable using R with Cassandra through either the JDBC or Hive drivers.
This is a good example of what I meant by designing products with openness and integration in mind.
via: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive
Wednesday, 4 April 2012
Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop
The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.
Currently packaging:
- Apache Hadoop 1.0.x
- Apache Zookeeper 3.4.3
- Apache HBase 0.92.0
- Apache Hive 0.8.1
- Apache Pig 0.9.2
- Apache Mahout 0.6.1
- Apache Oozie 3.1.3
- Apache Sqoop 1.4.1
- Apache Flume 1.0.0
- Apache Whirr 0.7.0
Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. Practically, though, its goal is to ensure that all the components of the wide Hadoop ecosystem remain interoperable.
Monday, 26 March 2012
Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark
Sami Badawi enumerates the issues he encountered while trying all these tools (Pig, Scalding, Scoobi, Hive, Spark, Scrunch, Cascalog) for a simple experiment with Hadoop:
The task was to read log files, join with other data, [and] do some statistics on arrays of doubles. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data.
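To make the shape of that task concrete, here is a minimal plain-Python sketch of it: parse log lines, join them with side data, and compute statistics over lists of doubles. It is not taken from Sami Badawi's post; the log format, the `USERS` lookup table, and the function names are all assumptions made for illustration.

```python
from statistics import mean, pstdev

# Hypothetical log lines: "<user_id>\t<comma-separated doubles>"
LOG_LINES = [
    "u1\t1.0,2.0,3.0",
    "u2\t4.0,6.0",
    "u1\t5.0",
    "u9\t7.0",  # user missing from the side data, dropped by the join
]

# Hypothetical side data to join against: user_id -> country
USERS = {"u1": "DK", "u2": "US"}


def parse(line):
    """Split one log line into (user_id, list of floats)."""
    user_id, raw = line.split("\t")
    return user_id, [float(x) for x in raw.split(",")]


def stats_by_country(lines, users):
    """Join log records with user data, then summarize the values per country."""
    grouped = {}
    for line in lines:
        user_id, values = parse(line)
        country = users.get(user_id)
        if country is None:  # inner join: skip records with no matching user
            continue
        grouped.setdefault(country, []).extend(values)
    return {
        country: {"count": len(v), "mean": mean(v), "stdev": pstdev(v)}
        for country, v in grouped.items()
    }


result = stats_by_country(LOG_LINES, USERS)
```

On a single machine this is a dozen lines; the tools compared in the post differ mainly in how much ceremony they need to express the same parse, join, and aggregate steps on a cluster.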
- Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
- Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
- Spark: open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write.
- Cascalog: a fully-featured Clojure-based data processing and querying library for Hadoop.
via: http://blog.samibadawi.com/2012/03/hive-pig-scalding-scoobi-scrunch-and.html
Monday, 20 February 2012
Lightning Talk on Cascalog
Just 19 slides, but Paul Lam manages to provide both a comparison of Cascalog and Hive and an overview of the most interesting bits of Cascalog.
Cascalog vs Hive

Cascalog Query Pipe Assembly

Highly recommended for understanding what’s in the Cascalog box.
Monday, 13 February 2012
The components and their functions in the Hadoop ecosystem
Edd Dumbill enumerates the various components of the Hadoop ecosystem:

My quick reference of the Hadoop ecosystem includes a couple of other tools that are not in this list; the exceptions are Ambari and HCatalog, which were released later.
Wednesday, 8 February 2012
Visualizing Hadoop data with Tableau Software and Cloudera Connector for Tableau
Put together one of the most impressive visualization tools, Tableau Software, with one of the best solutions for big data, Hadoop, and you’ll probably get some astonishing results.

Credit Cloudera.
While Tableau Software works with structured data only, with this connector it gets access to Hive through HiveQL.
via: http://www.cloudera.com/blog/2012/02/cloudera-connector-for-tableau-has-been-released/
Wednesday, 1 February 2012
Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support
Starting today customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab in the Job Flow Details page. These metrics are pushed to CloudWatch every five minutes at no cost to you and include information on:
- Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.
- Job flow contention including metrics on HDFS utilization, map and reduce slots open, jobs running, and the ratio between map tasks remaining and map slots.
- Job flow health including metrics on whether your job flow is idle, if there are missing data blocks, and if there are any dead nodes.
That’s like free pr0n for operations teams.
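As a rough sketch of how an operations team might consume the three groups of metrics listed above, the snippet below derives a contention ratio and a health flag from a single snapshot. The dictionary keys and the `summarize` function are hypothetical names chosen for illustration; they are not the actual CloudWatch metric names, and no AWS API is called.

```python
def summarize(snapshot):
    """Derive contention, health, and idleness signals from one metrics snapshot."""
    slots = snapshot["map_slots"]
    # Contention: map tasks still waiting per available map slot.
    ratio = snapshot["map_tasks_remaining"] / slots if slots else float("inf")
    # Health: no missing data blocks and no dead nodes.
    healthy = snapshot["missing_blocks"] == 0 and snapshot["dead_nodes"] == 0
    return {
        "remaining_maps_per_slot": ratio,
        "healthy": healthy,
        "idle": snapshot["jobs_running"] == 0,
    }


# Hypothetical snapshot values, as they might look mid-job.
report = summarize({
    "map_tasks_remaining": 120,
    "map_slots": 40,
    "missing_blocks": 0,
    "dead_nodes": 1,
    "jobs_running": 2,
})
```

A ratio well above 1 suggests the job flow would benefit from more map slots, while a failing health flag points at the cluster rather than the job.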
On a different note, I’ve noticed that the Hadoop stack (Hadoop, Hive, Pig) on Amazon Elastic MapReduce is based on second-to-last versions, which suggests that extensive testing is performed on Amazon’s side before rolling new versions out:
- Hadoop: 0.20.205 (precursor of Hadoop 1.0.0), which supports append and security, but doesn’t have RAID, symlinks, or MR2
- Hive: 0.7.1 (precursor of latest 0.8.0)
- Pig: 0.9.1 (precursor of latest 0.9.2)
Monday, 30 January 2012
Powered by Hadoop and Hive: Budgeting for snow removal in your local community
I don’t know how I ended up becoming the head of our local community association. Anyhow, I’m now responsible for laying out next year’s budget. Most of our expenses seem to be fixed from one year to another, but then there’s the expense for the snow removal service. This year, no snow. Last year, most snow on record in 30 years! How do you budget for something as volatile as snow? I need more data!
Instead of just googling the answer, we’re going to fetch some raw data and feed it into Hadoop Hive.
Hadoop FTW!
via: http://magnusljadas.wordpress.com/2012/01/29/search-for-snow-with-hadoop-hive/
Friday, 27 January 2012
Measuring User Retention With Hadoop and Hive
A very practical example of how Hive and Hadoop could deliver value when applied to clickstreams, the most common data for each web property:
Hadoop, Hive, and related technologies are formidable tools for unlocking value from data. […] Retention measurements are particularly significant because they paint a detailed picture about the overall stickiness of a product across the entire userbase.
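For a sense of what a retention measurement over clickstream data boils down to, here is a small in-memory sketch. It is not the Hive query from the linked post: the event format, the cohort definition (users grouped by the day they were first seen), and the `retention` function are assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical clickstream reduced to (user_id, day of activity) pairs.
EVENTS = [
    ("a", date(2012, 1, 1)),
    ("b", date(2012, 1, 1)),
    ("a", date(2012, 1, 2)),
    ("c", date(2012, 1, 2)),
    ("b", date(2012, 1, 8)),
    ("a", date(2012, 1, 8)),
]


def retention(events, cohort_day, offset_days):
    """Fraction of users first seen on cohort_day who are active offset_days later."""
    first_seen = {}  # user -> earliest day of activity
    active = {}      # day -> set of users active that day
    for user, day in events:
        active.setdefault(day, set()).add(user)
        if user not in first_seen or day < first_seen[user]:
            first_seen[user] = day
    cohort = {u for u, d in first_seen.items() if d == cohort_day}
    if not cohort:
        return 0.0
    target = cohort_day + timedelta(days=offset_days)
    return len(cohort & active.get(target, set())) / len(cohort)


day1 = retention(EVENTS, date(2012, 1, 1), 1)
day7 = retention(EVENTS, date(2012, 1, 1), 7)
```

In Hive the same idea would be expressed as a self-join or a grouped aggregation over the clickstream table; the logic per cohort stays the same.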
The same clickstream data can be used to calculate visitors’ conversion with the Bayesian discriminant using Hadoop.
via: http://blog.polarmobile.com/2012/01/measuring-user-retention-with-hadoop-and-hive/
Tuesday, 3 January 2012
Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0
Let’s start the year with a quick review of the latest releases that happened in December. Make sure that you scroll to the end as there are quite a few important ones.
MongoDB 2.0.2
Announced on Dec. 15th, MongoDB 2.0.2 is a bug fix release:
- Hit config server only once per mongos on meta data change to not overwhelm
- Removed unnecessary connection close and open between mongos and mongod after getLastError
- Replica set primaries close all sockets on stepDown()
- Do not require authentication for the buildInfo command
- scons option for using system libraries
Apache Hive 0.8.0
Apache Hive 0.8.0 came out on Dec. 19th. The list of new features, improvements, and bug fixes is extremely long.
Just as a side note, who came up with the idea of having a Hive fans’ page on Facebook?
Apache ZooKeeper 3.4.2
ZooKeeper 3.4.0 has been followed up shortly by two new minor version updates fixing some critical bugs. The list of issues fixed in ZooKeeper 3.4.1 can be found here and for ZooKeeper 3.4.2 the 2 fixed bugs are listed here.
As with ZooKeeper 3.4.0, these versions are not yet production ready.
Apache Whirr 0.7.0
Apache Whirr 0.7.0 was released on Dec. 21st, featuring 56 improvements and bug fixes including support for Puppet & Chef, and Mahout and Ganglia as a service. The complete list can be found here.
Some more details about Whirr 0.7.0 can be found here.
Apache HBase 0.90.5
Released Dec. 23rd, HBase 0.90.5 packs 81 bug fixes. The complete list can be found here.
Redis 2.4.5
Redis 2.4.5 was released on Dec. 23rd and provides 4 bug fixes:
- [BUGFIX] Fixed a ZUNIONSTORE/ZINTERSTORE bug that can cause a NaN to be inserted as a sorted set element score. This happens when one of the elements has +inf/-inf score and the weight used is 0.
- [BUGFIX] Fixed memory leak in CLIENT INFO.
- [BUGFIX] Fixed a non critical SORT bug (Issue 224).
- [BUGFIX] Fixed a replication bug: now the timeout configuration is respected during the connection with the master.
- --quiet option implemented in the Redis test.
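The ZUNIONSTORE/ZINTERSTORE bug in the list above comes down to floating-point arithmetic: under IEEE 754, multiplying an infinite score by a weight of 0 yields NaN. The sketch below reproduces that arithmetic in Python. Both functions are hypothetical, and mapping the NaN to 0 is an assumed policy for illustration, not a statement about how Redis implemented its fix.

```python
import math


def weighted_score(score, weight):
    """Naive weighted score, as a union/intersection aggregate might compute it."""
    return score * weight


def safe_weighted_score(score, weight):
    """Guarded variant: map the NaN from inf * 0 to 0.0 (an assumed policy)."""
    value = score * weight
    return 0.0 if math.isnan(value) else value


naive = weighted_score(float("inf"), 0)        # inf * 0 is NaN under IEEE 754
guarded = safe_weighted_score(float("inf"), 0)
```

A NaN score is a real problem for a sorted set, because NaN does not compare as less than, greater than, or equal to any other value, so the element has no well-defined position.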
Last but definitely one of the most important announcements that came in December:
Hadoop 1.0.0
Based on the 0.20-security code line, Hadoop 1.0.0 was announced on Dec. 29. This release includes support for:
- HBase (append/hsync/hflush) and Security
- WebHDFS (with full support for security)
- Performance enhanced access to local files for HBase
- Other performance enhancements, bug fixes, and features
- All version 0.20.205 and prior 0.20.2xx features
Complete release notes are available here.
Stéphane Fréchette, Ryan Slobojan, Duane Moore, Arun C. Murthy
And with this we are ready for 2012.
Tuesday, 20 December 2011
Doug Cutting About Hadoop, Its Adoption and Future, and Its Relationship With Relational Databases
Jaikumar Vijayan (Computerworld) interviews Doug Cutting:
Q: How would you describe Hadoop to a CIO or a CFO? Why should enterprises care about it?
A: At a really simple level, it lets you affordably save and process vastly more data than you could before. With more data and the ability to process it, companies can see more, they can learn more, they can do more. [With Hadoop] you can start to do all sorts of analyses that just weren’t practical before. You can start to look at patterns over years, over seasons, across demographics. You have enough data to fill in patterns and make predictions and decide, “How should we price things?” and “What should we be selling now?” and “How should we advertise?” It is not only about having data for longer periods, but also richer data about any given period.
The interview covers topics like why the interest in Hadoop, Hadoop adoption in the enterprise world and outside, and the limitations of relational databases. It is a must read, if only they had added some newlines here and there.
via: http://www.computerworld.com/s/article/9222758/The_Grill_Doug_Cutting
