Hive: All content tagged as Hive in NoSQL databases and polyglot persistence

Integrating Hive and HBase at Facebook

While definitely interesting, something doesn’t seem to add up:

It (nb HBase) sidesteps Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on data distribution changes. Since it is based on Hadoop, making HBase interoperate with Hive is straightforward, meaning HBase tables can be accessed as if they were native Hive tables. As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Likewise, Hive’s INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within HBase itself.

What I don’t seem to understand is:

So why HBase?

via: http://www.cloudera.com/blog/2010/06/integrating-hive-and-hbase/
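
To make the quoted description a bit more concrete, here is a minimal sketch of the idea of "keeping recently updated data in memory and incrementally rewriting data to new files". It is a toy model in plain Python, not HBase code: the class name `ToyStore`, its methods, and the `flush_threshold` parameter are all made up for illustration, and Python dicts stand in for immutable files on HDFS.

```python
from typing import Dict, List, Optional


class ToyStore:
    """Toy write path: an in-memory buffer plus immutable, write-once files.

    Hypothetical illustration only; none of these names come from HBase.
    """

    def __init__(self, flush_threshold: int = 2) -> None:
        self.memstore: Dict[str, str] = {}     # recently updated data, kept in memory
        self.files: List[Dict[str, str]] = []  # stand-ins for immutable files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key: str, value: str) -> None:
        # Updates only ever touch memory; nothing on "disk" is modified in place.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Write the buffer out as a brand-new sorted file, then start a fresh buffer.
        if self.memstore:
            self.files.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key: str) -> Optional[str]:
        # Newest data wins: check memory first, then files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for data_file in reversed(self.files):
            if key in data_file:
                return data_file[key]
        return None

    def compact(self) -> None:
        # Merge all files into one new file, keeping only the latest value per key.
        merged: Dict[str, str] = {}
        for data_file in self.files:  # oldest first, so newer values overwrite older ones
            merged.update(data_file)
        self.files = [dict(sorted(merged.items()))] if merged else []
```

In this sketch an update to an existing row never rewrites an old file: the new value lands in memory, is later flushed to a new file, and a read simply prefers the newest copy until `compact()` merges the files. Splitting files based on data distribution, which the quote also mentions, is left out.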


Presentation: Hive - A Petabyte Scale Data Warehouse Using Hadoop

Lately I’ve been mentioning Hive quite a few times when writing about working with NoSQL data, but I was missing a good slide deck covering the Hive architecture, usage scenarios, and other interesting details.

The presentation embedded below, from the Facebook Data Infrastructure team, provides all these details and much more (Hive usage at Facebook, Hadoop and Hive clusters, and so on).


Amazon Elastic MapReduce Upgrades Hadoop, Hive and Pig

Amazon has upgraded its set of tools for working with NoSQL data (and more):

Customers can now take advantage of improved Hadoop performance and the following new features:

  • Multiple inputs class for reading multiple types of data.
  • Multiple outputs class for writing multiple types of data.
  • ChainMapper and ChainReducer which allows users to perform M+RM* within one Hadoop job. Previously customers could only run one mapper and one reducer per job.
  • Skip bad records in the dataset that cause jobs to fail. This allows a job to complete even if some records in a dataset are erroneous.
  • JVM reuse across task boundaries to increase performance when processing small files.
  • Support for bzip2 compression.

via: http://developer.amazonwebservices.com/connect/ann.jspa?annID=697
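
The "M+RM*" shape from the announcement (one or more mappers, then one reducer, then zero or more mappers, all within a single job) can be sketched in a few lines. This is a plain-Python simulation of the pattern, not the Hadoop ChainMapper/ChainReducer API: `run_chain` and the small mapper and reducer functions are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable, List, Sequence, Tuple

KV = Tuple[Any, Any]
Mapper = Callable[[Any, Any], Iterable[KV]]
Reducer = Callable[[Any, List[Any]], Iterable[KV]]


def run_chain(records: Iterable[KV],
              pre_mappers: Sequence[Mapper],
              reducer: Reducer,
              post_mappers: Sequence[Mapper] = ()) -> List[KV]:
    """Simulate one job shaped as M+ R M*: mappers, one reducer, more mappers."""
    if not pre_mappers:
        raise ValueError("M+ means at least one mapper before the reducer")

    def apply_mappers(pairs: List[KV], mappers: Sequence[Mapper]) -> List[KV]:
        # Each mapper consumes the output of the previous one.
        for mapper in mappers:
            pairs = [out for key, value in pairs for out in mapper(key, value)]
        return pairs

    mapped = apply_mappers(list(records), pre_mappers)

    # Shuffle step: group values by key before handing them to the reducer.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    reduced = [out for key in sorted(groups) for out in reducer(key, groups[key])]
    return apply_mappers(reduced, post_mappers)


# Hypothetical example stages: a word count with filtering before and after.
def lowercase(key, value):
    yield key.lower(), value


def drop_short(key, value):
    if len(key) > 2:
        yield key, value


def sum_counts(key, values):
    yield key, sum(values)


def keep_frequent(key, value):
    if value >= 2:
        yield key, value
```

A call such as `run_chain(words, [lowercase, drop_short], sum_counts, [keep_frequent])` runs two mappers, one reducer, and one trailing mapper in a single pass, which is the kind of pipeline that previously would have needed several separate jobs.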


Google BigQuery SQL-like API

At Google I/O 2010, Google announced, but has not yet launched, a new API for ad-hoc analysis, reporting, and data exploration of massively large datasets: ☞ BigQuery. What I find interesting is that BigQuery uses ☞ an SQL flavor instead of MapReduce, Hive, or Pig.

It still strikes me that Google hasn’t yet figured out a way to expose access to their MapReduce implementation. Judging by the numbers in the industry, I’d say that by now Hadoop is probably handling the largest volumes of data.


Cloudera Distribution for Hadoop will include Pig and Hive, and why it matters

Cloudera distributes an easy-to-install, pre-packaged version of Hadoop that includes various bug fixes and optimizations. Yesterday they announced the availability of a new version called ☞ CDH2 (nb Cloudera Distribution for Hadoop), as well as the first beta of the upcoming version, which will include support for Pig and Hive, the tools that help you put your NoSQL data to work.

But why is this important? While NoSQL solutions are helping us tackle problems like

  • cost[1] and complexity[2] and productivity
  • availability, scalability
  • storing huge amounts of data[3]

none of these are really the end goals. While I don’t feel comfortable disagreeing with Google’s chief scientist, Peter Norvig:

We don’t have better algorithms than anyone else. We just have more data.

I don’t really think it is only about the data, but rather about the intel that can be built around the data. And that’s exactly what tools like Hadoop, Pig, and Hive will help us achieve.

We have a system in place based on shared mysql + memcache but its quickly becoming prohibitively costly (in terms of manpower) to operate.

References