


Hive: All content tagged as Hive in NoSQL databases and polyglot persistence

Hive and HBase in Toad for Cloud Demo

Jeremiah Peschka put together two short videos demoing the Toad for Cloud Eclipse plugin with Hive and HBase. Those complaining about the lack of SQL in NoSQL databases should check it out. On a different note though, I did express a few concerns about such a tool, related to the complexity and performance of building the indirection layer and of supporting operations that are not native to the target system. I’d add to these the fact that some NoSQL databases are continuously adding features that can radically change the way this tool performs (e.g. Cassandra 0.7 will feature secondary indexes).

Hive and Toad for Cloud

HBase and Toad for Cloud

How slowly the tool performs, and the fact that it doesn’t offer any sort of results pagination, seem to confirm some of the concerns expressed above. On the other hand, it is rather difficult to understand a tool just by watching a video.

Original title and link: Hive and HBase in Toad for Cloud Demo (NoSQL databases © myNoSQL)

Amazon Elastic MapReduce Updates

Updates from Amazon, including upgraded Hive support, multipart upload, and optimized JDBC drivers:

  • Support for S3’s Large Objects and Multipart Upload

Amazon Elastic MapReduce supports this feature too, allowing MapReduce to begin the upload before the Hadoop task is finished.

  • Upgraded Hive Support

Currently you can run both Hive 0.5 and 0.7 concurrently in the same cluster.

  • JDBC Drivers for Hive

Optimized JDBC drivers for Hive.

Original title and link: Amazon Elastic MapReduce Updates (NoSQL databases © myNoSQL)


Quick Reference: Hadoop Tools Ecosystem

Just a quick reference to the continuously growing Hadoop tools ecosystem.


Hadoop

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.


HDFS

A distributed file system that provides high-throughput access to application data.


MapReduce

A software framework for distributed processing of large data sets on compute clusters.
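The model is easy to picture: a map phase that emits key/value pairs, a shuffle that groups them by key, and a reduce phase that folds each group. A toy, in-memory Python sketch of that cycle (the classic word count; this illustrates the idea only, not the actual distributed Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    # Map: emit a (word, 1) pair for every word in an input line
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Reduce: sum the counts emitted for one key
    return (key, sum(values))

def mapreduce(records):
    # Shuffle/sort: group the intermediate pairs by key, as Hadoop does
    pairs = sorted(kv for r in records for kv in map_phase(r))
    return [reduce_phase(k, (v for _, v in group))
            for k, group in groupby(pairs, key=itemgetter(0))]

print(mapreduce(["hadoop stores data", "hadoop processes data"]))
# → [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```

Hadoop runs the same three phases, but with the map and reduce tasks spread across a cluster and the intermediate pairs spilled to disk rather than held in memory.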

Amazon Elastic MapReduce

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

Cloudera Distribution for Hadoop (CDH)

Cloudera’s Distribution for Hadoop (CDH) sets a new standard for Hadoop-based data management platforms.


ZooKeeper

A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.


HBase

A scalable, distributed database that supports structured data storage for large tables.


Avro

A data serialization system. Similar to ☞ Thrift and ☞ Protocol Buffers.


Sqoop

Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:

  • Imports individual tables or entire databases to files in HDFS
  • Generates Java classes to allow you to interact with your imported data
  • Provides the ability to import from SQL databases straight into your Hive data warehouse
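The first of those capabilities amounts to a full-table SELECT dumped to delimited files. A minimal Python sketch of that idea, using an in-memory SQLite table as a stand-in for the source RDBMS (the table and column names are invented for illustration; real Sqoop writes to HDFS and also generates the Java classes mentioned above):

```python
import csv
import io
import sqlite3

def import_table(conn, table):
    # Sqoop-style import: SELECT the whole table and write it out
    # as delimited text (here to an in-memory buffer instead of HDFS).
    out = io.StringIO()
    writer = csv.writer(out)
    for row in conn.execute(f"SELECT * FROM {table}"):
        writer.writerow(row)
    return out.getvalue()

# Hypothetical source table standing in for a production RDBMS
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (id INTEGER, country TEXT)")
conn.executemany("INSERT INTO ads VALUES (?, ?)", [(1, "US"), (2, "DE")])
print(import_table(conn, "ads"))
```

Sqoop additionally parallelizes this scan by splitting the table on a key column, with each split imported by a separate map task.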


Flume

Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.


Hive

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad-hoc querying, and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, and a simple query language called Hive QL, based on SQL, which enables users familiar with SQL to query the data. At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by its built-in capabilities.
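A rough Python sketch of what “putting structure on this data” means: overlay a schema on a raw delimited file, then query the named columns. The file contents, schema, and query below are invented for illustration; Hive itself compiles such a query into map/reduce jobs rather than scanning in-process:

```python
import io

# A raw, tab-delimited "Hadoop file" and the schema Hive would
# overlay on it (contents and column names are hypothetical).
raw = io.StringIO("2010-06-01\tUS\t3\n2010-06-01\tDE\t5\n2010-06-02\tUS\t2\n")
schema = ("dt", "country", "clicks")

# Impose structure: each line becomes a row keyed by column name
rows = [dict(zip(schema, line.rstrip("\n").split("\t"))) for line in raw]

# Hive QL equivalent: SELECT SUM(clicks) FROM t WHERE country = 'US';
total = sum(int(r["clicks"]) for r in rows if r["country"] == "US")
print(total)  # → 5
```

The point of Hive QL is that the schema and the scan-plus-aggregate plan are derived from the SQL-like statement, so the user never writes this loop by hand.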


Pig

A high-level data-flow language and execution framework for parallel computation. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
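To make the data-flow style concrete, here is a hypothetical Pig Latin pipeline (shown in comments) next to a rough Python rendering of the same steps: LOAD, FILTER, GROUP, and a per-group COUNT (the relation and field names are invented):

```python
from collections import Counter

# Pig Latin equivalent (hypothetical script):
#   logs    = LOAD 'logs' AS (user, status);
#   errors  = FILTER logs BY status >= 500;
#   by_user = GROUP errors BY user;
#   counts  = FOREACH by_user GENERATE group, COUNT(errors);
logs = [("alice", 200), ("bob", 503), ("alice", 500), ("bob", 200)]
errors = [(user, status) for user, status in logs if status >= 500]
counts = Counter(user for user, _ in errors)
print(dict(counts))  # → {'bob': 1, 'alice': 1}
```

Each Pig statement names an intermediate relation, and Pig turns the resulting chain into a sequence of map/reduce jobs, parallelizing the filter and the grouping across the cluster.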


Oozie

Oozie is a workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce).


Cascading

Cascading is a query API and query planner used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on a Hadoop cluster.


Cascalog

Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting-edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.


Hue

Hue is a graphical user interface to operate and develop applications for Hadoop. Hue applications are collected into a desktop-style environment and delivered as a Web application, requiring no additional installation for individual users.

You can read more about Hue on the ☞ Cloudera blog.


Chukwa

Chukwa is a data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.


Mahout

A scalable machine learning and data mining library.

Integration with Relational databases

Integration with Data Warehouses

The only list I have is MapReduce, RDBMS, and Data Warehouse, but I’m afraid it is quite old. So maybe someone could help me update it.

Anything else? Once we validate this list, I’ll probably have to move it to the NoSQL reference.

Original title and link: Quick Reference: Hadoop Tools Ecosystem (NoSQL databases © myNoSQL)

Cloudera: All Your Big Data Are Belong to Us

Matt Asay (GigaOm):

Where Cloudera shines, however, is in taking these different contributions and making Hadoop relevant for enterprise IT, where data mining has waxed and waned over the years. […] Cloudera, in other words, is banking on the complexity of Hadoop to drive enterprise IT to its own Cloudera Enterprise tools.

Additionally, I think what Cloudera is “selling” is a good set of tools — Hadoop, HBase, Hive, Pig, Oozie, Sqoop, Flume, Zookeeper, Hue — put together based on their expertise in the field.

Original title and link: Cloudera: All Your Big Data Are Belong to Us (NoSQL databases © myNoSQL)


MongoDB or Hadoop?

Posted on the MongoDB mailing list:

I have about 500M log file entries each representing an “ad impression” (we are an advertising company). Each “hit” has about 50 attributes to it (example: Country, State, City, Adsize, Browser, OS, etc) .. I want to load all 500M into some form of database and then run queries against this set.

As you could expect, MongoDB is considered as a possibility. But I’d call that biased vendor advice. I’ll be blunt: invest in your future by using Hadoop and Pig. Hive may fit too.
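For the workload described, slicing a flat set of impressions by its attributes, the query shape is a plain group-by scan, which is exactly what Hive and Pig parallelize well. A toy Python sketch of that shape (attribute names taken from the question, records invented):

```python
from collections import Counter

# Each "hit" is a flat record with ~50 attributes; three shown here
hits = [
    {"country": "US", "state": "CA", "browser": "Firefox"},
    {"country": "US", "state": "NY", "browser": "Chrome"},
    {"country": "CA", "state": "ON", "browser": "Firefox"},
]

# "Impressions per (country, browser)" as a group-by over two keys;
# at 500M rows this becomes one full scan that Hadoop parallelizes
by_dim = Counter((h["country"], h["browser"]) for h in hits)
print(dict(by_dim))
```

In Hive this is one GROUP BY statement; in Pig, a GROUP plus a per-group COUNT. Either way, 500M rows is a batch scan, not a workload that benefits from MongoDB’s point-lookup strengths.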

Original title and link for this post: MongoDB or Hadoop? (published on the NoSQL blog: myNoSQL)


Pig and Hive at Yahoo!

Fantastic post on the Yahoo! Hadoop blog presenting a series of scenarios where using Pig and Hive makes things a lot better:

The widespread use of Pig at Yahoo! has enabled the migration of our data factory processing to Hadoop. With the adoption of Hive, we will be able to move much of our data warehousing to Hadoop as well. Having the data factory and the data warehouse on the same system will lower data-loading time into the warehouse — as soon as the factory is finished, it is available in the warehouse. It will also enable us to share — across both the factory and the warehouse — metadata, monitoring, and management tools; support and operations teams; and hardware. So we are excited to add Hive to our toolkit, and look forward to using both these tools together as we lean on Hadoop to do more and more of our data processing.

The use cases mentioned in the post:

  • data preparation and presentation:

    Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.

  • data factories: pipelines (Pig + Oozie), iterative processing (Pig), research (Pig)
  • data warehouse: business-intelligence analysis and ad-hoc queries

    In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field. The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop.

Yahoo! gets way too little credit for its work on big data and its contributions to open source.

Original title and link for this post: Pig and Hive at Yahoo! (published on the NoSQL blog: myNoSQL)


Howl: Unifying Metadata Layer for Hive and Pig

Yet another contribution from Yahoo!:

Common metadata layer for Hadoop’s Map Reduce, Pig, and Hive

Howl: Unifying Metadata Layer for Hive and Pig originally posted on the NoSQL blog: myNoSQL


NoSQL Databases and Data Warehousing

I didn’t know data warehousing strictly imposes a relational model:

From a philosophical standpoint, my largest problem with NoSQL databases is that they don’t respect relational theory. In short, they aren’t meant to deal with sets of data, but lists. Relational algebra was created to deal with the large sets of data and have them interact. Reporting and analytics rely on that.

I’d bet people building and using Hive, Pig, Flume and other data warehousing tools would disagree with Eric Hewitt.

NoSQL Databases and Data Warehousing originally posted on the NoSQL blog: myNoSQL


Hive integrations at Facebook: HBase & RCFile

Recently I’ve written about more integrations for Hive, mentioning that Facebook is working on integrating Hive with HBase.

Yahoo! Developer Network Blog has ☞ published the video of John Sichi and Yongqiang He of Facebook presenting about Hive and HBase integration at Hadoop summit[1]:

  1. You can find some of the news coming from the Hadoop 2010 Summit in this post: Hadoop and HBase status updates after Hadoop Summit

5 Years Old Hadoop Celebration at Hadoop Summit, Plus New Tools

I didn’t realize Hadoop has been on the market for so long: 5 years. In just a couple of hours, the celebration will start at the ☞ Hadoop Summit in Santa Clara.

Yahoo!, the most active contributor to Hadoop, will ☞ open source today two new tools: Hadoop with Security and Oozie, a workflow engine.

Hadoop Security integrates Hadoop with Kerberos, providing secure access and processing of business-sensitive data. This enables organizations to leverage and extract value from their data and hardware investment in Hadoop across the enterprise while maintaining data security, allowing new collaborations and applications with business-critical data.

Oozie is an open-source workflow solution to manage jobs running on Hadoop, including HDFS, Pig, and MapReduce. Oozie — a name for an elephant tamer — was designed for Yahoo!’s rigorous use case of managing complex workflows and data pipelines at global scale. It is integrated with Hadoop Security and is quickly becoming the de-facto standard for ETL (extraction, transformation, loading) processing at Yahoo!.

Update: It looks like the news is not stopping here, with Cloudera making ☞ a big announcement accompanying the new release of Cloudera’s Distribution for Hadoop, CDHv3 Beta 2:

The additional packages include HBase, the popular distributed columnar storage system with fast read-write access to data managed by HDFS, Hive and Pig for query access to data stored in a Hadoop cluster, Apache Zookeeper for distributed process coordination and Sqoop for moving data between Hadoop and relational database systems. We’ve adopted the outstanding workflow engine out of Yahoo!, Oozie, and have made contributions of our own to adapt it for widespread use by general enterprise customers. We’ve also released – this is a big deal, and I’m really pleased to announce it – our continuous data loading system, Flume, and our Hadoop User Environment software (formerly Cloudera Desktop, and henceforth “Hue”) under the Apache Software License, version 2.

Also worth mentioning, going forward Cloudera will also have a commercial offering: ☞ Cloudera Enterprise:

Cloudera Enterprise combines the open source CDHv3 platform with critical monitoring, management and administrative tools that our enterprise customers have told us they need to put Hadoop into production. We’ve added dashboards for critical IT tasks, including monitoring cluster status and activity, keeping track of data flows into Hadoop in real time based on the services that Flume provides, and controlling access to data and resources by users and groups. We’ve integrated access controls with Active Directory and other LDAP implementations so that IT staff can control rights and identities in the same way as they do for other business platforms they use. Cloudera Enterprise is available by annual subscription and includes maintenance, updates and support.

More Integrations for Hive

Hive is a data warehouse infrastructure built on top of Hadoop, offering tools for data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop[1]. To better understand the benefits of Hive, you can check how Facebook is using Hive to deal with a petabyte-scale data warehouse.

Recently, John Sichi, a member of the data infrastructure team at Facebook, published an article on integrating Hive and HBase. There is also interest in having Hive work with Cassandra, and this is ☞ tracked in the Cassandra JIRA (nb: not sure there’s been any progress on this yet though).

Hypertable, another wide-column store, provides a way of integrating with Hive, described ☞ here:

Hypertable-Hive integration allows Hive QL statements to read and write to Hypertable via SELECT and INSERT commands. […] Currently the Hypertable storage handler only supports external, non-native tables.

Somehow all this work to provide a common data warehouse infrastructure on top of existing NoSQL solutions (or at least the wide-column stores which are focused on large scale datasets) seems to confirm there’s no need for a common NoSQL language.

Integrating Hive and HBase at Facebook

While definitely interesting, something doesn’t seem to add up:

It (nb HBase) sidesteps Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on data distribution changes. Since it is based on Hadoop, making HBase interoperate with Hive is straightforward, meaning HBase tables can be accessed as if they were native Hive tables. As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Likewise, Hive’s INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within HBase itself.

What I don’t quite understand is:

So why HBase?