Hive: All content tagged as Hive in NoSQL databases and polyglot persistence
Wednesday, 16 June 2010
Integrating Hive and HBase at Facebook
While definitely interesting, something doesn’t seem to add up:
It (nb HBase) sidesteps Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on data distribution changes. Since it is based on Hadoop, making HBase interoperate with Hive is straightforward, meaning HBase tables can be accessed as if they were native Hive tables. As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Likewise, Hive’s INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within HBase itself.
What I don't understand is:
- Facebook is already using Cassandra
- Cassandra works well with Hadoop, at least starting with version 0.6.0
- Hive works on top of Hadoop
So why HBase?
via: http://www.cloudera.com/blog/2010/06/integrating-hive-and-hbase/
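To make the quoted description a bit more concrete, here is a minimal sketch of the general idea: keep recent writes in memory, flush them to new immutable files, and merge those files later. It is a toy in plain Python, not HBase code. The class name `ToyStore`, its methods, and the `flush_threshold` parameter are all made up for illustration.

```python
class ToyStore:
    """Toy log-structured store: an in-memory buffer plus immutable 'files'.

    Hypothetical illustration only; none of these names come from HBase.
    """

    def __init__(self, flush_threshold=3):
        self.memstore = {}   # recently updated data, kept in memory
        self.files = []      # immutable sorted snapshots, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as a NEW file; existing files are never modified.
        if self.memstore:
            self.files.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        # Newest data wins: check memory first, then files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for snapshot in reversed(self.files):
            for k, v in snapshot:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all files into one new file, keeping the newest value per key.
        merged = {}
        for snapshot in self.files:  # oldest first, so later writes overwrite
            merged.update(snapshot)
        self.files = [tuple(sorted(merged.items()))] if merged else []


store = ToyStore(flush_threshold=2)
store.put("a", 1)
store.put("b", 2)    # buffer is full, so it is flushed to a first file
store.put("a", 10)   # the update lives in memory; the old file is untouched
store.put("c", 3)    # second flush
store.compact()      # the two files are rewritten into one
```

An update never touches a file that has already been written, which is how a store like this can sit on top of append-only storage.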
Wednesday, 9 June 2010
Presentation: Hive - A Petabyte Scale Data Warehouse Using Hadoop
Lately I’ve mentioned Hive quite a few times when writing about working with NoSQL data, but I was missing a good slide deck covering the Hive architecture, usage scenarios, and other interesting details.

The presentation embedded below, from the Facebook Data Infrastructure team, provides all these details and much more (e.g. Hive usage at Facebook, Hadoop and Hive clusters).
Thursday, 3 June 2010
Amazon Elastic MapReduce Upgrades Hadoop, Hive and Pig
Amazon has upgraded its set of tools for working with NoSQL data (and more):
Customers can now take advantage of improved Hadoop performance and the following new features:
- Multiple inputs class for reading multiple types of data.
- Multiple outputs class for writing multiple types of data.
- ChainMapper and ChainReducer which allows users to perform M+RM* within one Hadoop job. Previously customers could only run one mapper and one reducer per job.
- Skip bad records in the dataset that cause jobs to fail. This allows a job to complete even if some records in a dataset are erroneous.
- JVM reuse across task boundaries to increase performance when processing small files.
- Support for bzip2 compression.
via: http://developer.amazonwebservices.com/connect/ann.jspa?annID=697
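Two of the features in the list, chaining (the M+RM* pattern) and skipping bad records, are easier to picture with a small example. The sketch below is plain Python, not the Hadoop ChainMapper/ChainReducer API. The function `run_chain`, its parameters, and the sample mappers are hypothetical names I made up for illustration.

```python
from collections import defaultdict


def run_chain(records, mappers, reducer, post_mappers=(), skip_bad_records=False):
    """Run one or more mappers, a reducer, then zero or more mappers (M+ R M*)."""
    skipped = 0
    mapped = []
    for key, value in records:
        pairs = [(key, value)]
        try:
            for mapper in mappers:  # M+: each mapper feeds the next one
                pairs = [out for k, v in pairs for out in mapper(k, v)]
        except ValueError:
            if not skip_bad_records:
                raise
            skipped += 1            # drop the bad record and keep going
            continue
        mapped.extend(pairs)

    groups = defaultdict(list)      # group mapper output by key
    for k, v in mapped:
        groups[k].append(v)

    # R: a single reduce step over each key's values
    result = [out for k in sorted(groups) for out in reducer(k, groups[k])]

    for mapper in post_mappers:     # M*: optional mappers after the reducer
        result = [out for k, v in result for out in mapper(k, v)]
    return result, skipped


def parse(_, line):
    count, fruit = line.split()     # raises ValueError on malformed lines
    yield fruit, int(count)


def singular(fruit, count):
    yield fruit.rstrip("s"), count


def total(fruit, counts):
    yield fruit, sum(counts)


def label(fruit, count):
    yield fruit, f"{count} total"


records = [(0, "3 apples"), (1, "oops"), (2, "4 apples"), (3, "2 pears")]
result, skipped = run_chain(
    records, [parse, singular], total, post_mappers=[label], skip_bad_records=True
)
```

With `skip_bad_records=False` the malformed `"oops"` line would abort the whole run, which is the failure mode the skip feature is meant to avoid.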
Thursday, 20 May 2010
Google BigQuery SQL-like API
At Google I/O 2010, Google announced, but has not yet launched, a new API for ad-hoc analysis, reporting, and data exploration of massive datasets: ☞ BigQuery. What I find interesting is that BigQuery uses ☞ an SQL flavor instead of MapReduce, Hive, or Pig.
It still strikes me that Google hasn’t yet figured out a way to expose access to their MapReduce implementation. Judging by the numbers in the industry, I’d say that by now Hadoop is probably handling the largest volumes of data.
Thursday, 25 March 2010
Cloudera Distribution for Hadoop will include Pig and Hive, and why it matters
Cloudera distributes an easy-to-install, pre-packaged version of Hadoop that includes various bug fixes and optimizations. Yesterday they announced the availability of a new version called ☞ CDH2 (nb Cloudera Distribution for Hadoop), as well as the first beta of the upcoming version, which will include support for Pig and Hive, the tools that help you put your NoSQL data to work.
But why is this important? While NoSQL solutions are helping us tackle problems like
- cost[1], complexity[2], and productivity
- availability, scalability
- storing huge amounts of data[3]
none of these are really the end goals. While I don’t feel comfortable disagreeing with Google’s chief scientist, Peter Norvig:
We don’t have better algorithms than anyone else. We just have more data.
I don’t really think it is only about the data, but rather about the intel that can be built around the data. And that’s exactly what tools like Hadoop, Pig, and Hive will help us achieve.
References
- [1] In the interview about Cassandra usage at Twitter, Ryan mentioned: “We have a system in place based on shared mysql + memcache but its quickly becoming prohibitively costly (in terms of manpower) to operate.”
- [2] Scalability is not only about size, but also complexity: The new dimension of NoSQL scalability: complexity
- [3] Why NoSQL is here to stay?