NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



Hive: All content tagged as Hive in NoSQL databases and polyglot persistence

Hadoop, Hive and Redis for Foursquare Analytics

Foursquare’s move from querying the production databases to a data analytics system using Hadoop and Hive with Redis playing the role of a cache:

  • Provide an easy-to-use end-point to run data exploration queries (using SQL and simple web-forms).
  • Cache the results of queries (in a database) to power reports, so that the data is available to everyone, whenever it is needed.
  • Allow our hadoop cluster to be totally dynamic without having to move data around (we shut it down at night and on weekends).
  • Add new data in a simple way (just put it in Amazon S3!).
  • Analyse data from several data sources (mongodb, postgres, log-files).

Foursquare Analytics Architecture

One of the most often heard complains about NoSQL databases is about their reduced querying capabilities. Running reports and analysis against the production servers is only going to work when you have little data and the set of queries is limitted and stable over time. Otherwise you’ll want to run these against a copy of your data to avoid bringing down production databases and avoid corrupting data.

Original title and link: Hadoop, Hive and Redis for Foursquare Analytics (NoSQL databases © myNoSQL)


Cloudera’s Distribution for Apache Hadoop version 3 Beta 4

New version of Cloudera’s Hadoop distro — complete release notes available here:

CDH3 Beta 4 also includes new versions of many components. Highlights include:

  • HBase 0.90.1, including much improved stability and operability.
  • Hive 0.7.0rc0, including the beginnings of authorization support, support for multiple databases, and many other new features.
  • Pig 0.8.0, including many new features like scalar types, custom partitioners, and improved UDF language support.
  • Flume 0.9.3, including support for Windows and improved monitoring capabilities.
  • Sqoop 1.2, including improvements to usability and Oracle integration.
  • Whirr 0.3, including support for starting HBase clusters on popular cloud platforms.

Plus many scalability improvements contributed by Yahoo!.

Cloudera’s CDH is the most popular Hadoop distro bringing together many components of the Hadoop ecosystem. Yahoo remains the main innovator behind Hadoop.

Original title and link: Cloudera’s Distribution for Apache Hadoop version 3 Beta 4 (NoSQL databases © myNoSQL)


The Backstory of Yahoo and Hadoop

We currently have nearly 100 people working on Apache Hadoop and related projects, such as Pig, ZooKeeper, Hive, Howl, HBase and Oozie. Over the last 5 years, we’ve invested nearly 300 person-years into these projects. […] Today Yahoo runs on over 40,000 Hadoop machines (>300k cores). They are used by over a thousand regular users from our science and development teams. Hadoop is at the center of our research in search, advertising, spam detection, personalization and many other topics.

I assume there’s no surpise to anyone I’m a big fan of Yahoo! open source initiatives.

Original title and link: The Backstory of Yahoo and Hadoop (NoSQL databases © myNoSQL)


Hive and HBase in Toad for Cloud Demo

Jeremiah Peschka put together two short videos demoing Toad for Cloud Eclipse plugin with Hive and HBase. Those complaining about lack of SQL in NoSQL databases should check it out. On a different note though, I did express a few concerns about such a tool related to the complexity and performance of building the indirection layer and supporting operations that are not native to target system. I’d add to these the fact that some NoSQL databases are continuously adding features that can radically change the way this tool performs (e.g. Cassandra 0.7 will feature secondary indexes).

Hive and Toad for Cloud

HBase and Toad for Cloud

Looking at how slow the tool performs and the fact that it doesn’t have any sort of results pagination seems to be a confirmation of some of the concerns expressed above. On the other hand it is kind of difficult to understand a tool by just watching a video.

Original title and link: Hive and HBase in Toad for Cloud Demo (NoSQL databases © myNoSQL)

Amazon Elastic MapReduce Updates

Updates from Amazon including upgraded Hive, multipart upload, optimized JDBC drivers:

  • Support for S3’s Large Objects and Multipart Upload

Amazon Elastic MapReduce supports this feature too allowing MapREduce to behin the upload before the Hadoop task is finished

  • Upgraded Hive Support

Currently you can run both Hive 0.5 and 0.7 concurrenty in the same cluster

  • JDBC Drivers for Hive

Optimized JDBC drivers for Hive.

Original title and link: Amazon Elastic MapReduce Updates (NoSQL databases © myNoSQL)


Quick Reference: Hadoop Tools Ecosystem

Just a quick reference of the continuously growing Hadoop tools ecosystem.


The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.


A distributed file system that provides high throughput access to application data.


A software framework for distributed processing of large data sets on compute clusters.

Amazon Elastic MapReduce

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

Cloudera Distribution for Hadoop (CDH)

Cloudera’s Distribution for Hadoop (CDH) sets a new standard for Hadoop-based data management platforms.


A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.


A scalable, distributed database that supports structured data storage for large tables.


A data serialization system. Similar to ☞ Thrift and ☞ Protocol Buffers.


Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:

  • Imports individual tables or entire databases to files in HDFS
  • Generates Java classes to allow you to interact with your imported data
  • Provides the ability to import from SQL databases straight into your Hive data warehouse


Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.


Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called Hive QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language.


A high-level data-flow language and execution framework for parallel computation. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.


Oozie is a workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce).


Cascading is a Query API and Query Planner used for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.


Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.


Hue is a graphical user interface to operate and develop applications for Hadoop. Hue applications are collected into a desktop-style environment and delivered as a Web application, requiring no additional installation for individual users.

You can read more about HUE on ☞ Cloudera blog.


Chukwa is a data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.


A Scalable machine learning and data mining library.

Integration with Relational databases

Integration with Data Warehouses

The only list I have is MapReduce, RDBMS, and Data Warehouse, but I’m afraid it is quite a bit old. So maybe someone could help me update it.

Anything else? Once we validate this list, I’ll probably have to move it on the NoSQL reference

Original title and link: Quick Reference: Hadoop Tools Ecosystem (NoSQL databases © myNoSQL)

Cloudera: All Your Big Data Are Belong to Us

Matt Asay (GigaOm):

Where Cloudera shines, however, is in taking these different contributions and making Hadoop relevant for enterprise IT[8], where data mining has waxed and waned over the years. […] Cloudera, in other words, is banking on the complexity of Hadoop to drive enterprise IT to its own Cloudera Enteprise tools.

Additionally, I think what Cloudera is “selling” is a good set of tools — Hadoop, HBase, Hive, Pig, Oozie, Sqoop, Flume, Zookeeper, Hue — put together based on their expertise in the field.

Original title and link: Cloudera: All Your Big Data Are Belong to Us (NoSQL databases © myNoSQL)


MongoDB or Hadoop?

Posted on the MongoDB mailing list:

I have about 500M log file entries each representing an “ad impression” (we are an advertising company). Each “hit” has about 50 attributes to it (example: Country, State, City, Adsize, Browser, OS, etc) .. I want to load all 500M into some form of database and then run queries against this set.

As you could expect MongoDB is considered as a possibility. But I’d call that a biased vendor advise. I’ll be blunt: invest in your future by using Hadoop and Pig. Hive may fit too.

Original title and link for this post: MongoDB or Hadoop? (published on the NoSQL blog: myNoSQL)


Pig and Hive at Yahoo!

Fantastic post on Yahoo! Hadoop blog presenting a series of scenarios where using Pig and Hive makes things a lot better:

The widespread use of Pig at Yahoo! has enabled the migration of our data factory processing to Hadoop. With the adoption of Hive, we will be able to move much of our data warehousing to Hadoop as well. Having the data factory and the data warehouse on the same system will lower data-loading time into the warehouse — as soon as the factory is finished, it is available in the warehouse. It will also enable us to share — across both the factory and the warehouse — metadata, monitoring, and management tools; support and operations teams; and hardware. So we are excited to add Hive to our toolkit, and look forward to using both these tools together as we lean on Hadoop to do more and more of our data processing.

The use cases mentioned in the post:

  • data preparation and presentation:

    Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.

  • data factories: pipelines (Pig + Oozie), iterative processing (Pig), research (Pig)
  • data warehouse: business-intelligence analysis and ad-hoc queries

    In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field. The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop.

Yahoo! gets way to little credit for its work on bigdata and its contributions to the open source.

Original title and link for this post: Pig and Hive at Yahoo! (published on the NoSQL blog: myNoSQL)


Howl: Unifying Metadata Layer for Hive and Pig

Yet another contribution from Yahoo!:

Common metadata layer for Hadoop’s Map Reduce, Pig, and Hive

Howl: Unifying Metadata Layer for Hive and Pig originally posted on the NoSQL blog: myNoSQL


NoSQL Databases and Data Warehousing

I didn’t know data warehousing strictly imposes a relational model:

From a philosophical standpoint, my largest problem with NoSQL databases is that they don’t respect relational theory. In short, they aren’t meant to deal with sets of data, but lists. Relational algebra was created to deal with the large sets of data and have them interact. Reporting and analytics rely on that.

I’d bet people building and using Hive, Pig, Flume and other data warehousing tools would disagree with Eric Hewitt.

NoSQL Databases and Data Warehousing originally posted on the NoSQL blog: myNoSQL


Hive integrations at Facebook: HBase & RCFile

Recently I’ve written about more integration for Hive mentioning that Facebook is working on integrating Hive with HBase.

Yahoo! Developer Network Blog has ☞ published the video of John Sichi and Yongqiang He of Facebook presenting about Hive and HBase integration at Hadoop summit[1]:

  1. You can find some of the news coming from Hadoop 2010 Summit in this post Hadoop and HBase status updates after Hadoop Summit  ()