


cdh: All content tagged as cdh in NoSQL databases and polyglot persistence

Cloudera shipped a mountain... what can you read between the lines

Cloudera Engineering (@ClouderaEng) shipped a mountain of new product (production-grade software, not just technical previews): Cloudera Impala, Cloudera Search, Cloudera Navigator, Cloudera Development Kit (now Kite SDK), new Apache Accumulo packages for CDH, and several iterative releases of CDH and Cloudera Manager. (And the Cloudera Enterprise 5 beta release was made available to the world.) Furthermore, as always, a ton of bug fixes and new features went upstream, most notably, but not exclusively, HiveServer2 and Apache Sentry (incubating).

How many things can you read in this paragraph?

  1. a not-so-subtle stab at Hortonworks’ series of technical previews.
  2. more and more projects brought under the CDH umbrella. Does more ever become too much? (I cannot explain why, but my first thought was “this feels so Oracle-style”)
  3. Cloudera’s current big bet is Impala. SQL and low latency querying. A big win for the project, but not necessarily a direct financial win for Cloudera, was its addition as a supported service on Amazon Elastic MapReduce.

Original title and link: Cloudera shipped a mountain… what can you read between the lines (NoSQL database©myNoSQL)


With New Product Packaging, Adopting the Platform for Big Data is Even Easier

In addition, by choosing Cloudera Enterprise, you open the door to add other capabilities to your subscription as you wish – powerful tools like:

  • Cloudera Enterprise RTD (Real Time Delivery) – Support for HBase
  • Cloudera Enterprise RTQ (Real Time Query) – Support for Impala
  • Cloudera Enterprise BDR (Backup and Disaster Recovery) – Support for backup and disaster recovery
  • Cloudera Navigator – Data management for your Cloudera Enterprise deployment

And when Cloudera Search (beta) becomes generally available, you’ll be able to add:

  • RTS (Real Time Search) – Support for Cloudera Search

Isn’t this called nickel-and-diming?

Original title and link: With New Product Packaging, Adopting the Platform for Big Data is Even Easier (NoSQL database©myNoSQL)


Hadoop Cluster Automation APIs: Ambari and Cloudera Manager

Two links for those interested in seeing what an automation API for Hadoop looks like:

  1. Ambari API reference v1
  2. Cloudera Manager API v1

At first glance, both APIs support the same range of resources/endpoints.

Cloudera Manager comes in two editions, free and enterprise, with some of the automation features (service monitoring & management, security) being available only in the latter. I’m not sure if all the endpoints are available through the free edition of Cloudera Manager.
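Both managers expose versioned REST endpoints over HTTP with basic authentication, and at the v1 level the resource layout is strikingly similar. A minimal sketch in Python (host names, ports, and credentials below are hypothetical placeholders; the default ports shown are Cloudera Manager’s 7180 and Ambari’s 8080):

```python
# Minimal sketch of addressing the two v1 REST APIs with the Python
# standard library. Hosts and credentials are hypothetical placeholders.
import base64
import urllib.request

def endpoint(base_url, *resource):
    """Build a v1 API URL; both managers use the /api/v1/... prefix."""
    return "/".join([base_url.rstrip("/"), "api", "v1", *resource])

def get_json(url, user, password):
    """Issue an authenticated GET; both APIs accept HTTP basic auth."""
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

# The same relative path reaches the clusters resource on either manager:
cm_url = endpoint("http://cm-host:7180", "clusters")          # Cloudera Manager
ambari_url = endpoint("http://ambari-host:8080", "clusters")  # Ambari
```

From there, listing services or hosts is a matter of extending the path (e.g. `clusters/<name>/services`), though the exact sub-resources should be checked against each API reference.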

Original title and link: Hadoop Cluster Automation APIs: Ambari and Cloudera Manager (NoSQL database©myNoSQL)

Nokia’s Big Data Ecosystem: Hadoop, Teradata, Oracle, MySQL

Nokia’s big data ecosystem consists of a centralized, petabyte-scale Hadoop cluster that is interconnected with a 100-TB Teradata enterprise data warehouse (EDW), numerous Oracle and MySQL data marts, and visualization technologies that allow Nokia’s 60,000+ users around the world tap into the massive data store. Multi-structured data is constantly being streamed into Hadoop from the relational systems, and hundreds of thousands of Scribe processes run every day to move data from, for example, servers in Singapore to a Hadoop cluster in the UK. Nokia is also a big user of Apache Sqoop and Apache HBase.

In the coming years you’ll hear, more and more often, stories (sales pitches, really) about single unified platforms solving all these problems at once. But the platforms that will survive and thrive are those that accomplish two things:

  1. keep the data gates open: in and out.
  2. interoperate well with other platforms to make this efficient for users

Original title and link: Nokia’s Big Data Ecosystem: Hadoop, Teradata, Oracle, MySQL (NoSQL database©myNoSQL)


Whirr and Hadoop Quickstart Guide: Automating a Rackspace Hadoop Cluster

Even though most of the examples show Whirr in action on the Amazon cloud, Whirr is cloud-neutral. Bob Gourley uses Whirr to fire up a CDH1 cluster on Rackspace.

  1. Cloudera Distribution of Hadoop. 
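Whirr launches are driven by a properties file. The sketch below is hypothetical: the key names follow Whirr’s recipe conventions, but the exact jclouds provider ids and CDH install-function names should be checked against the Whirr release in use.

```properties
# Hypothetical whirr.properties sketch for a small CDH cluster on Rackspace.
whirr.cluster-name=cdh-demo
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=cloudservers-us
whirr.identity=${env:RACKSPACE_USERNAME}
whirr.credential=${env:RACKSPACE_API_KEY}
# Pull CDH packages instead of stock Apache Hadoop:
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
```

The cluster would then be brought up with `whirr launch-cluster --config whirr.properties` and torn down with `whirr destroy-cluster --config whirr.properties`.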

Original title and link: Whirr and Hadoop Quickstart Guide: Automating a Rackspace Hadoop Cluster (NoSQL database©myNoSQL)


Hadoop Versions Take 2: What You Wanted to Know About Hadoop, but Were Too Afraid to Ask: Genealogy of Elephants

Another great diagram explaining the complicated tree of Hadoop versions.

Apache Hadoop Versions

Click for full size image. Credit Konstantin I. Boudnik & Cos

When compared with the other diagram of Apache Hadoop versions, this one contains some very interesting details about the versions of Hadoop used by third party distributions like EMC, IBM, MapR, and even Azure:

The diagram above clearly shows a few important gaps in the rest of the commercial offerings:

  • none of them supports Kerberos security (EMC, IBM, and MapR)
  • unavailability of HBase due to the lack of HDFS append in their systems (EMC, IBM). In the case of MapR, you end up using a custom HBase distribution shipped by MapR. I don’t want to speculate about the latter in this article.

If I were in a position to choose which version of Hadoop to use for a project, here is where I’d start:

  1. if the project had a budget for prototyping and experimentation, my first choice would be the latest official Apache release. This gives access to the latest and greatest (though not always bug-free), but more importantly it would allow the team to tap into the Hadoop community’s know-how
  2. if the project required getting up to speed as fast as possible (and I could get some budget for training), I’d start my investigation with the Cloudera Distribution of Hadoop. Even without a budget for Cloudera support, the advantage would be having everything well packaged together.

Original title and link: Hadoop Versions Take 2: What You Wanted to Know About Hadoop, but Were Too Afraid to Ask: Genealogy of Elephants (NoSQL database©myNoSQL)


Cloudera Enterprise: Cloudera Manager and Cloudera support

Cloudera Enterprise is what Cloudera sells in addition to their Cloudera Hadoop Distribution (CDH):

  • Cloudera Manager and Cloudera support
  • Cloudera Manager: end-to-end management application for Apache Hadoop
    • Deploy: automated installation
    • Discover: service health and monitoring, including events and alerts
    • Diagnose
      • Job analytics
      • Log search
      • Configuration recommendations
    • Act
      • Service and configuration management
      • Security management
    • Optimize
      • Resource and quota management
  • Free and Enterprise editions
  • Free edition: up to 50 nodes
  • Enterprise edition: no available pricing
  • Feature comparison
Cloudera Manager Editions

After the break: a short video about Cloudera Manager and media coverage.

Cloudera Hadoop Distribution on Dell's Commodity Servers


On the hardware side, the package can come with either Dell PowerEdge C2100, C6100 or C6105 servers. The PowerEdge C-series servers are uniquely suited for Hadoop’s multiserver deployments because of their modest physical size and power usage, […] A deployment based on the reference architecture could scale from six nodes to 720 nodes.


The cost of a minimum configuration would run from US$118,000 to $124,000, depending on the support options.
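For scale, the quoted range works out to roughly $20,000 per node for the six-node minimum configuration:

```python
# Back-of-the-envelope cost per node for the minimum six-node configuration,
# using the $118,000-$124,000 range quoted above.
low, high, nodes = 118_000, 124_000, 6
per_node_low = low / nodes    # about $19,667 per node
per_node_high = high / nodes  # about $20,667 per node
```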

What’s that definition of commodity servers again?

Original title and link: Cloudera Hadoop Distribution on Dell’s Commodity Servers (NoSQL database©myNoSQL)


Hadoop on EC2: A Detailed Guide

Lars George, with his very (very) detailed style, explains step by step how to setup Hadoop on EC2.

Let’s jump into it head first and solve the problem of actually launching a cluster. You have heard that Hadoop is shipped with EC2 support, but how do you actually start up a Hadoop cluster on EC2? You do have a couple of choices and as Tom’s article above explains you could start all instances in the cluster by hand. But why would you want to do that if there are scripts available that do all the work for you? And to complicate matters, how do you select the AMI (the Amazon Machine Image) that has the Hadoop version you need or want? Does it have Hive installed for your subsequent analysis of the collected data? Just running a check to count the available public Hadoop images returns 41! That gets daunting very quickly. Sure you can roll your own - but that implies even more manual labor that you probably better spend on productive work. But there is help available…

Lars is using the CDH tools for this tutorial and points out the ☞ incubating Whirr project.

Original title and link: Hadoop on EC2: A Detailed Guide (NoSQL databases © myNoSQL)


Quick Reference: Hadoop Tools Ecosystem

Just a quick reference of the continuously growing Hadoop tools ecosystem.


Apache Hadoop

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.


HDFS

A distributed file system that provides high-throughput access to application data.


MapReduce

A software framework for distributed processing of large data sets on compute clusters.
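To make the programming model concrete, here is a toy, in-process word count sketching the map, shuffle, and reduce phases that Hadoop distributes across a cluster (plain Python, not the Hadoop API):

```python
# Toy illustration of the MapReduce model: word count, in-process.
# Hadoop runs these same three phases distributed across many machines.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts["the"] == 3, counts["fox"] == 2
```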

Amazon Elastic MapReduce

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

Cloudera Distribution for Hadoop (CDH)

Cloudera’s Distribution for Hadoop (CDH) sets a new standard for Hadoop-based data management platforms.


ZooKeeper

A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.


HBase

A scalable, distributed database that supports structured data storage for large tables.


Avro

A data serialization system. Similar to ☞ Thrift and ☞ Protocol Buffers.


Sqoop

Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:

  • Imports individual tables or entire databases to files in HDFS
  • Generates Java classes to allow you to interact with your imported data
  • Provides the ability to import from SQL databases straight into your Hive data warehouse
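A sketch of what such an invocation looks like in practice, assembled here as an argv list (the flags are standard Sqoop options; the JDBC URL, table, and paths are hypothetical placeholders):

```python
# Sketch: a Sqoop import of one MySQL table into HDFS, as an argv list.
# --connect/--table/--target-dir are standard Sqoop flags; the JDBC URL,
# table name, and HDFS path are made up for illustration.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host/sales",  # source database
    "--table", "orders",                        # single table to import
    "--target-dir", "/user/etl/orders",         # destination in HDFS
]
# Swap the HDFS target for --hive-import to land the table in the
# Hive warehouse instead:
hive_variant = sqoop_import[:-2] + ["--hive-import"]
```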


Flume

Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.


Hive

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad-hoc querying, and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, and it also provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.


Pig

A high-level data-flow language and execution framework for parallel computation. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.


Oozie

Oozie is a workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce).


Cascading

Cascading is a query API and query planner used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on a Hadoop cluster.


Cascalog

Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting-edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.


Hue

Hue is a graphical user interface to operate and develop applications for Hadoop. Hue applications are collected into a desktop-style environment and delivered as a Web application, requiring no additional installation for individual users.

You can read more about HUE on ☞ Cloudera blog.


Chukwa

Chukwa is a data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.


Mahout

A scalable machine learning and data mining library.

Integration with Relational databases

Integration with Data Warehouses

The only list I have is MapReduce, RDBMS, and Data Warehouse, but I’m afraid it’s quite old, so maybe someone could help me update it.

Anything else? Once we validate this list, I’ll probably move it to the NoSQL reference.

Original title and link: Quick Reference: Hadoop Tools Ecosystem (NoSQL databases © myNoSQL)