hive: All content tagged as hive in NoSQL databases and polyglot persistence
Monday, 13 February 2012
The components and their functions in the Hadoop ecosystem
Edd Dumbill enumerates the various components of the Hadoop ecosystem:

My quick reference of the Hadoop ecosystem includes a couple of other tools that are not in this list, though it lacks Ambari and HCatalog, which were released later.
Original title and link: The components and their functions in the Hadoop ecosystem (©myNoSQL)
Wednesday, 8 February 2012
Visualizing Hadoop data with Tableau Software and Cloudera Connector for Tableau
Put together one of the most impressive visualization tools, Tableau Software, with one of the best solutions for big data, Hadoop, and you’ll probably get some astonishing results.

Credit Cloudera.
While Tableau Software works with structured data only, with this connector it gets access to Hive through HiveQL.
via: http://www.cloudera.com/blog/2012/02/cloudera-connector-for-tableau-has-been-released/
Wednesday, 1 February 2012
Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support
Starting today customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab in the Job Flow Details page. These metrics are pushed to CloudWatch every five minutes at no cost to you and include information on:
- Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.
- Job flow contention including metrics on HDFS utilization, map and reduce slots open, jobs running, and the ratio between map tasks remaining and map slots.
- Job flow health including metrics on whether your job flow is idle, if there are missing data blocks, and if there are any dead nodes.
That’s like free pr0n for operations teams.
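To make the contention and health metrics above a bit more concrete, here is a minimal sketch of how an operations script might interpret one five-minute sample. The `JobFlowSnapshot` class and its field names are hypothetical and chosen only for illustration; they are not the actual CloudWatch metric names, and nothing here calls the AWS API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobFlowSnapshot:
    """One five-minute sample. Field names are hypothetical, not CloudWatch names."""
    remaining_map_tasks: int
    map_slots: int
    missing_blocks: int
    dead_nodes: int
    is_idle: bool


def map_contention(s: JobFlowSnapshot) -> float:
    """Ratio between map tasks remaining and map slots; higher means more queued work."""
    if s.map_slots == 0:
        # No slots at all: any remaining work is infinitely contended.
        return float("inf") if s.remaining_map_tasks else 0.0
    return s.remaining_map_tasks / s.map_slots


def health_warnings(s: JobFlowSnapshot) -> List[str]:
    """Collect human-readable warnings for the health metrics named in the post."""
    warnings = []
    if s.is_idle:
        warnings.append("job flow is idle")
    if s.missing_blocks > 0:
        warnings.append(f"{s.missing_blocks} missing data block(s)")
    if s.dead_nodes > 0:
        warnings.append(f"{s.dead_nodes} dead node(s)")
    return warnings


snapshot = JobFlowSnapshot(
    remaining_map_tasks=120, map_slots=40,
    missing_blocks=0, dead_nodes=1, is_idle=False,
)
print(map_contention(snapshot), health_warnings(snapshot))
```

A ratio well above 1 suggests the job flow has more pending map work than slots to run it, which is the kind of signal that might prompt adding nodes.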
On a different note, I’ve noticed that the Hadoop stack (Hadoop, Hive, Pig) on Amazon Elastic MapReduce is based on second-to-last versions, which suggests that extensive testing is performed on Amazon’s side before new versions are rolled out:
- Hadoop: 0.20.205 (precursor of Hadoop 1.0.0); supports append and security, but doesn’t have RAID, symlinks, or MR2
- Hive: 0.7.1 (precursor of latest 0.8.0)
- Pig: 0.9.1 (precursor of latest 0.9.2)
Monday, 30 January 2012
Powered by Hadoop and Hive: Budgeting for snow removal in your local community
I don’t know how I ended up becoming the head of our local community association. Anyhow, I’m now responsible for laying out next year’s budget. Most of our expenses seem to be fixed from one year to another, but then there’s the expense for the snow removal service. This year, no snow. Last year, most snow on record in 30 years! How do you budget for something as volatile as snow? I need more data!
Instead of just googling the answer, we’re going to fetch some raw data and feed it into Hadoop Hive.
Hadoop FTW!
via: http://magnusljadas.wordpress.com/2012/01/29/search-for-snow-with-hadoop-hive/
Friday, 27 January 2012
Measuring User Retention With Hadoop and Hive
A very practical example of how Hive and Hadoop could deliver value when applied to clickstreams, the most common data for each web property:
Hadoop, Hive, and related technologies are formidable tools for unlocking value from data. […] Retention measurements are particularly significant because they paint a detailed picture about the overall stickiness of a product across the entire userbase.
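As a rough illustration of what a retention measurement computes, here is a small in-memory sketch. It is not the linked article's Hive query; at clickstream scale the same logic would be expressed as joins and aggregations in Hive. The `retention` function and the `(user, day)` event shape are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date, timedelta


def retention(events, cohort_day, offset_days):
    """Fraction of users first seen on cohort_day who were active offset_days later.

    events is an iterable of (user_id, day) pairs; this shape is an assumption.
    """
    first_seen = {}
    active = defaultdict(set)
    for user, day in events:
        active[day].add(user)
        if user not in first_seen or day < first_seen[user]:
            first_seen[user] = day

    # The cohort is everyone whose first recorded activity falls on cohort_day.
    cohort = {u for u, d in first_seen.items() if d == cohort_day}
    if not cohort:
        return 0.0

    target_day = cohort_day + timedelta(days=offset_days)
    returned = cohort & active.get(target_day, set())
    return len(returned) / len(cohort)


events = [
    ("a", date(2012, 1, 1)),
    ("b", date(2012, 1, 1)),
    ("a", date(2012, 1, 2)),
    ("c", date(2012, 1, 2)),
]
print(retention(events, date(2012, 1, 1), 1))
```

Here users "a" and "b" form the January 1 cohort, and only "a" comes back the next day, so the day-1 retention is one half.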
The same clickstream data can be used to calculate visitors’ conversion with the Bayesian discriminant using Hadoop.
via: http://blog.polarmobile.com/2012/01/measuring-user-retention-with-hadoop-and-hive/
Tuesday, 3 January 2012
Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0
Let’s start the year with a quick review of the latest releases that happened in December. Make sure that you scroll to the end as there are quite a few important ones.
MongoDB 2.0.2
Announced on Dec. 15th, MongoDB 2.0.2 is a bug fix release:
- Hit config server only once per mongos on meta data change to not overwhelm
- Removed unnecessary connection close and open between mongos and mongod after getLastError
- Replica set primaries close all sockets on stepDown()
- Do not require authentication for the buildInfo command
- scons option for using system libraries
Apache Hive 0.8.0
Apache Hive 0.8.0 came out on Dec. 19th. The list of new features, improvements, and bug fixes is extremely long.
Just as a side note, who came up with the idea of having a Hive fans’ page on Facebook?
Apache ZooKeeper 3.4.2
ZooKeeper 3.4.0 was shortly followed by two minor version updates fixing some critical bugs. The list of issues fixed in ZooKeeper 3.4.1 can be found here and for ZooKeeper 3.4.2 the 2 fixed bugs are listed here.
As with ZooKeeper 3.4.0, these versions are not yet production ready.
Apache Whirr 0.7.0
Apache Whirr 0.7.0 was released on Dec. 21st, featuring 56 improvements and bug fixes, including support for Puppet and Chef, and Mahout and Ganglia as a service. The complete list can be found here.
Some more details about Whirr 0.7.0 can be found here.
Apache HBase 0.90.5
Released Dec. 23rd, HBase 0.90.5 packs 81 bug fixes. The complete list can be found here.
Redis 2.4.5
Redis 2.4.5 was released on Dec. 23rd and provides 4 bug fixes:
- [BUGFIX] Fixed a ZUNIONSTORE/ZINTERSTORE bug that can cause a NaN to be inserted as a sorted set element score. This happens when one of the elements has +inf/-inf score and the weight used is 0.
- [BUGFIX] Fixed memory leak in CLIENT INFO.
- [BUGFIX] Fixed a non critical SORT bug (Issue 224).
- [BUGFIX] Fixed a replication bug: now the timeout configuration is respected during the connection with the master.
- --quiet option implemented in the Redis test.
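The ZUNIONSTORE/ZINTERSTORE bug in the list above comes down to IEEE 754 arithmetic: an infinite score multiplied by a weight of 0 is NaN. The sketch below reproduces that arithmetic in Python. The `safe_weighted_score` guard is only one possible way to avoid the NaN and is an assumption for illustration, not a description of the actual Redis patch.

```python
import math


def weighted_score(score: float, weight: float) -> float:
    """Naive weighting, as an aggregation over sorted sets might apply it."""
    return score * weight


def safe_weighted_score(score: float, weight: float) -> float:
    """Hypothetical guard: map the undefined inf * 0 product to 0.0."""
    product = score * weight
    return 0.0 if math.isnan(product) else product


naive = weighted_score(float("inf"), 0)
guarded = safe_weighted_score(float("inf"), 0)
print(math.isnan(naive), guarded)
```

NaN is a poor sorted set score because it does not compare as less than, greater than, or equal to any other value, so ordering by score stops being well defined.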
Last but definitely one of the most important announcements that came in December:
Hadoop 1.0.0
Based on the 0.20-security code line, Hadoop 1.0.0 was announced on Dec. 29th. This release includes support for:
- HBase (append/hsync/hflush) and Security
- Webhdfs (with full support for security)
- Performance enhanced access to local files for HBase
- Other performance enhancements, bug fixes, and features
- All version 0.20.205 and prior 0.20.2xx features
Complete release notes are available here.
Stéphane Fréchette, Ryan Slobojan, Duane Moore, Arun C. Murthy
And with this we are ready for 2012.
Tuesday, 20 December 2011
Doug Cutting About Hadoop, Its Adoption and Future, and Its Relationship With Relational Databases
Jaikumar Vijayan (Computerworld) interviews Doug Cutting:
Q: How would you describe Hadoop to a CIO or a CFO? Why should enterprises care about it?
A: At a really simple level, it lets you affordably save and process vastly more data than you could before. With more data and the ability to process it, companies can see more, they can learn more, they can do more. [With Hadoop] you can start to do all sorts of analyses that just weren’t practical before. You can start to look at patterns over years, over seasons, across demographics. You have enough data to fill in patterns and make predictions and decide, “How should we price things?” and “What should we be selling now?” and “How should we advertise?” It is not only about having data for longer periods, but also richer data about any given period.
The interview covers topics such as the reasons behind the interest in Hadoop, Hadoop adoption inside and outside the enterprise world, and the limitations of relational databases. It is a must-read, if only they had added some newlines here and there.
via: http://www.computerworld.com/s/article/9222758/The_Grill_Doug_Cutting
Monday, 5 December 2011
Looking for a Map Reduce Language
Java, Cascading, Pipes (C++), Hive, Pig, Rhipe, Dumbo, Cascalog… which one of these should you use for writing Map Reduce code?
Antonio Piccolboni puts them to the test:
At the end of this by necessity incomplete and unscientific language and library comparison, there is a winner and there isn’t. There isn’t because language comparison is always multidimensional and subjective but also because the intended applications are very different. On the other hand, looking for a general purpose, moderately elegant, not necessarily most efficient, not necessarily mature language for exploration purposes, Rhipe seems to fit the bill pretty nicely.
via: http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html
Thursday, 1 December 2011
What Is Informatica HParser for Hadoop?
Sifting through the PRish announcements related to Informatica HParser, what I’ve figured out so far is:
- it is the T in ETL
- a visual tool for creating parsing definitions for formats like web logs, XML, JSON, FIX, SWIFT, HL7, CDR, WORD, PDF, XLS, etc.
- transformations can be accessed from Hadoop MapReduce, Hive, or Pig
- the benefits of using HParser come from being able to share the same parsing definitions/transformations in the context of the Hadoop distributed environment
- HParser tries to provide an optimal transformation solution when streaming, splitting, and processing large files
- HParser is available in two licensing formats: community and commercial
Monday, 27 June 2011
Biodiversity Indexing: Offline Processing With Hadoop, Hive, Sqoop, Oozie
The architecture for offline processing of biodiversity data, based on Sqoop, Hadoop, Oozie, and Hive:

And its future:
Following this processing work, we expect to modify our crawling to harvest directly into HBase. The flexibility HBase offers will allow us to grow incrementally the richness of the terms indexed in the Portal, while integrating nicely into Hadoop based workflows. The addition of coprocessors to HBase is of particular interest to further reduce the latency involved in processing, by eliminating batch processing altogether.
Many companies working with large datasets have to deal with multiple systems and duplicate data between the online services and offline processors. While the infrastructure costs are going down, the costs of complexity are not. The HBase + Hadoop and Cassandra + Brisk combos are starting to address this problem.
via: http://www.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/
Thursday, 16 June 2011
Choosing Technologies: The Library of Congress and the Twitter Archive
Remember when everyone was suggesting solutions for Twitter architecture? Now the Library of Congress is trying to figure out what technologies to use to store the Twitter archive:
The project is still very much under construction, and the team is weighing a number of different open source technologies in order to build out the storage, management and querying of the Twitter archive. While the decision hasn’t been made yet on which tools to use, the library is testing the following in various combinations: Hive, ElasticSearch, Pig, Elephant-bird, HBase, and Hadoop.
Note that in terms of storage only HBase is mentioned, even though Twitter’s main tweet storage is MySQL.
Friday, 3 June 2011
Experimenting with Hadoop using Cloudera VirtualBox Demo

If you don’t count the download, you’ll get this up and running in 5 minutes tops. At the end you’ll have Hadoop, Sqoop, Pig, Hive, HBase, ZooKeeper, Oozie, Hue, Flume, and Whirr all configured and ready to experiment with.
Making it easy for users to experiment with these tools increases the chances for adoption. Adoption means business.