CDH: All content tagged as CDH in NoSQL databases and polyglot persistence
Wednesday, 22 May 2013
Nokia’s Big Data Ecosystem: Hadoop, Teradata, Oracle, MySQL
Nokia’s big data ecosystem consists of a centralized, petabyte-scale Hadoop cluster that is interconnected with a 100-TB Teradata enterprise data warehouse (EDW), numerous Oracle and MySQL data marts, and visualization technologies that allow Nokia’s 60,000+ users around the world to tap into the massive data store. Multi-structured data is constantly being streamed into Hadoop from the relational systems, and hundreds of thousands of Scribe processes run every day to move data from, for example, servers in Singapore to a Hadoop cluster in the UK. Nokia is also a big user of Apache Sqoop and Apache HBase.
In the coming years you’ll hear more and more stories (sales pitches, really) about single unified platforms solving all these problems at once. But the platforms that will survive and thrive are those that accomplish two things:
- keep the data gates open: in and out
- work with other platforms to make moving data efficient for users
Original title and link: Nokia’s Big Data Ecosystem: Hadoop, Teradata, Oracle, MySQL (©myNoSQL)
Monday, 6 February 2012
Whirr and Hadoop Quickstart Guide: Automating a Rackspace Hadoop Cluster
Even if most of the examples show Whirr in action on the Amazon cloud, Whirr is cloud-neutral. Bob Gourley uses Whirr to fire up a CDH (Cloudera Distribution of Hadoop) cluster on Rackspace.
Tuesday, 17 January 2012
Hadoop Versions Take 2: What You Wanted to Know About Hadoop, but Were Too Afraid to Ask: Genealogy of Elephants
Another great diagram explaining the complicated tree of Hadoop versions.

[Diagram: genealogy of Hadoop versions. Credit: Konstantin I. Boudnik & Cos]
Compared with the other diagram of Apache Hadoop versions, this one contains some very interesting details about the versions of Hadoop used by third-party distributions from EMC, IBM, MapR, and even Azure:
The diagram above clearly shows a few important gaps of the rest of commercial offerings:
- none of them supports Kerberos security (EMC, IBM, and MapR)
- unavailability of HBase due to the lack of HDFS append in their systems (EMC, IBM). In the case of MapR you end up using a custom HBase distributed by MapR. I don’t want to speculate about the latter in this article.
If I were in a position to choose which version of Hadoop to use for a project, here is where I’d start:
- if the project had a budget for prototyping and experimentation, my first choice would be the latest official Apache distribution. This would give access to the latest and greatest (though not always bug-free) code and, more importantly, to the Hadoop community’s know-how
- if the project required getting up to speed as fast as possible (and I were able to get some budget for training), I’d start my investigation with the Cloudera Distribution of Hadoop. Even with no budget for Cloudera support, the advantage would be in having everything well packaged together.
via: https://blogs.apache.org/bigtop/entry/all_you_wanted_to_know
Thursday, 8 December 2011
Cloudera Enterprise: Cloudera Manager and Cloudera support
Cloudera Enterprise is what Cloudera sells in addition to their Cloudera Hadoop Distribution (CDH):
- Cloudera Manager and Cloudera support
- Cloudera Manager: end-to-end management application for Apache Hadoop
  - Deploy: automated installation
  - Discover: service health and monitoring, including events and alerts
  - Diagnose
    - Job analytics
    - Log search
    - Configuration recommendations
  - Act
    - Service and configuration management
    - Security management
  - Optimize
    - Resource and quota management
- Free and Enterprise editions
  - Free edition: up to 50 nodes
  - Enterprise edition: no available pricing
  - Feature comparison
Tuesday, 23 August 2011
Cloudera Hadoop Distribution on Dell's Commodity Servers
Given:
On the hardware side, the package can come with either Dell PowerEdge C2100, C6100 or C6105 servers. The PowerEdge C-series servers are uniquely suited for Hadoop’s multiserver deployments because of their modest physical size and power usage, […] A deployment based on the reference architecture could scale from six nodes to 720 nodes.
and
The cost of a minimum configuration would run from US$118,000 to $124,000, depending on the support options.
what’s that definition of commodity servers again?
via: http://www.networkworld.com/news/2011/080411-dell-sells-preconfigured-hadoop.html
Tuesday, 23 November 2010
Hadoop on EC2: A Detailed Guide
Lars George, in his very (very) detailed style, explains step by step how to set up Hadoop on EC2.
Let’s jump into it head first and solve the problem of actually launching a cluster. You have heard that Hadoop is shipped with EC2 support, but how do you actually start up a Hadoop cluster on EC2? You do have a couple of choices and as Tom’s article above explains you could start all instances in the cluster by hand. But why would you want to do that if there are scripts available that do all the work for you? And to complicate matters, how do you select the AMI (the Amazon Machine Image) that has the Hadoop version you need or want? Does it have Hive installed for your subsequent analysis of the collected data? Just running a check to count the available public Hadoop images returns 41! That gets daunting very quickly. Sure you can roll your own - but that implies even more manual labor that you probably better spend on productive work. But there is help available..
Lars is using the CDH tools for this tutorial and points out the ☞ incubating Whirr project.
via: http://www.larsgeorge.com/2010/10/hadoop-on-ec2-primer.html
Thursday, 11 November 2010
Quick Reference: Hadoop Tools Ecosystem
Just a quick reference of the continuously growing Hadoop tools ecosystem.
Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
HDFS
A distributed file system that provides high throughput access to application data.
MapReduce
A software framework for distributed processing of large data sets on compute clusters.
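To make the model behind that one-line definition concrete, here is a minimal in-memory sketch of the three steps of a MapReduce job (map, shuffle, reduce) applied to word counting. It is plain Python, not the Hadoop API: the function names are my own, and a real job would run the map and reduce steps in parallel across the cluster instead of in a single process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values collected for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The point of the split is that `map_phase` only ever sees one record and `reduce_phase` only ever sees one key, which is what lets a framework distribute both steps across many machines.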
Amazon Elastic MapReduce
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
aws.amazon.com/elasticmapreduce/
Cloudera Distribution for Hadoop (CDH)
Cloudera’s Distribution for Hadoop (CDH) sets a new standard for Hadoop-based data management platforms.
ZooKeeper
A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
HBase
A scalable, distributed database that supports structured data storage for large tables.
Avro
A data serialization system. Similar to ☞ Thrift and ☞ Protocol Buffers.
Sqoop
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
- Imports individual tables or entire databases to files in HDFS
- Generates Java classes to allow you to interact with your imported data
- Provides the ability to import from SQL databases straight into your Hive data warehouse
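As a rough illustration of those capabilities, the sketch below builds (but does not run) a Sqoop command line from Python. The helper name, the JDBC URL, the user, and the table are all hypothetical; the flags are the ones I know from the Sqoop command-line tool (`import`, `import-all-tables`, `--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--hive-import`), so check them against the version you have installed.

```python
import shlex


def sqoop_import_command(jdbc_url, username, table=None,
                         target_dir=None, hive_import=False):
    """Build the argv list for a Sqoop import without executing it."""
    if table is None:
        # No table named: import every table in the database.
        args = ["sqoop", "import-all-tables"]
    else:
        args = ["sqoop", "import", "--table", table]
    # -P asks for the password on the console instead of putting it in argv.
    args += ["--connect", jdbc_url, "--username", username, "-P"]
    if target_dir and table is not None:
        # An explicit HDFS target directory only applies to a single table.
        args += ["--target-dir", target_dir]
    if hive_import:
        # Load the imported data straight into Hive.
        args.append("--hive-import")
    return args


if __name__ == "__main__":
    # Hypothetical database and table, for illustration only.
    cmd = sqoop_import_command("jdbc:mysql://db.example.com/sales", "etl",
                               table="orders", hive_import=True)
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing the resulting list to `subprocess.run` on a machine with Sqoop installed would start the import; printing it first is a cheap way to review what is about to run.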
Flume
Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.
archive.cloudera.com/cdh/3/flume/
Hive
Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
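One way custom mappers get plugged in is through Hive QL’s `TRANSFORM ... USING` clause, which streams rows to an external script as tab-separated lines on standard input and reads tab-separated rows back from standard output. The sketch below is such a script under my own assumptions: the two-column (user, url) input and the host-extraction logic are made up for illustration.

```python
import sys


def transform(lines):
    """Turn tab-separated (user, url) rows into (user, host) rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows rather than fail the whole job
        user, url = fields
        # Drop the scheme, then keep everything up to the first slash.
        host = url.split("//", 1)[-1].split("/", 1)[0]
        yield f"{user}\t{host}"


def main(stream=sys.stdin, out=sys.stdout):
    """Entry point: read rows from stream, write transformed rows to out."""
    for row in transform(stream):
        out.write(row + "\n")

# When deployed as a standalone script, end the file with a call to main().
```

With the script shipped to the cluster, a query along the lines of `SELECT TRANSFORM (user, url) USING 'python extract_host.py' AS (user, host) FROM page_views;` would run it over every row; the script, table, and column names here are hypothetical.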
Pig
A high-level data-flow language and execution framework for parallel computation. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce).
Cascading
Cascading is a Query API and Query Planner used for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.
Cascalog
Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.
github.com/nathanmarz/cascalog
HUE
Hue is a graphical user interface to operate and develop applications for Hadoop. Hue applications are collected into a desktop-style environment and delivered as a Web application, requiring no additional installation for individual users.
You can read more about HUE on the ☞ Cloudera blog.
Chukwa
Chukwa is a data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
Mahout
A scalable machine learning and data mining library.
Integration with Relational databases
- Oracle
  - Hadoop connector for Oracle Ora-Oop
  - Hadoop and Oracle Parallel Processing
Integration with Data Warehouses
The only list I have is MapReduce, RDBMS, and Data Warehouse, but I’m afraid it is quite old, so maybe someone could help me update it.
Anything else? Once we validate this list, I’ll probably move it to the NoSQL reference.
