Cloudera: All content tagged as Cloudera in NoSQL databases and polyglot persistence
Sunday, 17 April 2011
Cloudera: An Operating System for BigData
From a must-read Bloomberg article about BigData being used mainly for ads optimization:
Cloudera is essentially trying to build a type of operating system, à la Windows, for examining huge stockpiles of information. Where Windows manages the basic functions of a PC and its software, Cloudera’s technology helps companies break data into digestible chunks that can be spread across relatively cheap computers.
Original title and link: Cloudera: An Operating System for BigData (NoSQL databases © myNoSQL)
via: http://www.businessweek.com/print/magazine/content/11_17/b4225060960537.htm
Friday, 18 March 2011
Cloudera: A Business Intelligence Leader
The Informatica accord is Cloudera’s second partnership this year with a leading DI player. Back in August, Cloudera cemented a deal with open source software (OSS) data integration (DI) specialist Talend. It also has partnerships with Teradata Corp., the former Netezza Inc., the former Greenplum Software Corp., Aster Data Systems Inc., Vertica Inc., and Pentaho.
One thing’s for sure: Cloudera is certainly attracting attention.
The strategy is surprisingly simple: make it easy to put data in and get it out.
via: http://tdwi.org/articles/2011/02/16/cloudera-leader-bi-hadoop.aspx
Monday, 28 February 2011
Cloudera’s Distribution for Apache Hadoop version 3 Beta 4
New version of Cloudera’s Hadoop distro — complete release notes available here:
CDH3 Beta 4 also includes new versions of many components. Highlights include:
- HBase 0.90.1, including much improved stability and operability.
- Hive 0.7.0rc0, including the beginnings of authorization support, support for multiple databases, and many other new features.
- Pig 0.8.0, including many new features like scalar types, custom partitioners, and improved UDF language support.
- Flume 0.9.3, including support for Windows and improved monitoring capabilities.
- Sqoop 1.2, including improvements to usability and Oracle integration.
- Whirr 0.3, including support for starting HBase clusters on popular cloud platforms.
Plus many scalability improvements contributed by Yahoo!.
Cloudera’s CDH is the most popular Hadoop distro, bringing together many components of the Hadoop ecosystem. Yahoo remains the main innovator behind Hadoop.
via: http://www.cloudera.com/blog/2011/02/cdh3-beta-4-now-available
Thursday, 17 February 2011
Hadoop and Membase Case Study: AOL Advertising Architecture
Combining Hadoop and Membase to solve these challenges:
- How to analyze billions of user-related events, presented as a mix of structured and unstructured data, to infer demographic, psychographic and behavioral characteristics that are encapsulated into hundreds of millions of “cookie profiles”
- How to make hundreds of millions of cookie profiles available to their ad targeting platform with sub-millisecond, random read latency
- How to keep the user profiles fresh and current
In a much simplified form:
- crunch (nb: read it as pre-process and prepare) tons of data with Hadoop
- feed the results into a low-latency, high-throughput key-value store that serves them online
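To make the two steps above a bit more concrete, here is a minimal sketch in plain Python. It is not AOL's actual pipeline: the event data, the `build_profiles` function and the `ProfileStore` class are all hypothetical, and an in-memory dict stands in for both the Hadoop job output and the Membase store.

```python
from collections import defaultdict

# Hypothetical raw events: (cookie_id, category) pairs standing in for logged user activity.
EVENTS = [
    ("cookie-1", "sports"),
    ("cookie-2", "travel"),
    ("cookie-1", "sports"),
    ("cookie-1", "finance"),
    ("cookie-2", "travel"),
]

def build_profiles(events):
    """Batch step (the Hadoop role): aggregate events into one profile per cookie."""
    counts = defaultdict(lambda: defaultdict(int))
    for cookie_id, category in events:
        counts[cookie_id][category] += 1
    # Convert nested defaultdicts to plain dicts so the result is easy to store.
    return {cookie_id: dict(cats) for cookie_id, cats in counts.items()}

class ProfileStore:
    """Serving step (the Membase role): a plain dict standing in for a key-value store."""

    def __init__(self):
        self._data = {}

    def load(self, profiles):
        # Bulk-load the batch output; re-running the batch and loading again refreshes profiles.
        self._data.update(profiles)

    def get(self, cookie_id):
        # Single-key lookup, the access pattern an ad server would use.
        return self._data.get(cookie_id, {})

store = ProfileStore()
store.load(build_profiles(EVENTS))
top_interest = max(store.get("cookie-1").items(), key=lambda kv: kv[1])[0]
print(top_interest)
```

The point of the split is that the expensive aggregation runs offline, while the online path is nothing more than a lookup by key.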
Tuesday, 8 February 2011
NoSQL databases, Quest Software, and Toad for Cloud
I have written a couple of times about Quest Software’s Toad for Cloud[1], the free Eclipse tool that allows connecting to NoSQL databases and working with data in a tabular, SQLish format. But with Quest’s business in mind, something was bugging me about the initiative: why would they make these tools? Even more, why would they make them available for free?
I think I have an answer to these questions. One of Quest Software’s areas of expertise is relational database management tools. If these free tools make it easy to see all data as tabular once again, and moreover make it easy to move it back into a relational database, then Quest Software’s database management tools will continue to sell well and also enter the NoSQL databases market.
This is somewhat similar to what Cloudera is doing for Hadoop: creating an ecosystem that enables everyone to import data into Hadoop. The more companies use Hadoop, the better the chances that Cloudera’s business will prosper.
[1] Hive and HBase in Toad for Cloud, and Riptano and Quest Partnership for Cassandra (nb: Riptano has been renamed to DataStax)
Monday, 10 January 2011
Basic Setup for Cloudera Hadoop Distribution
This is the third time I’ve turned a vanilla mac into a ninja coding machine, and following my design principle of “first time = coincidence, second time = annoying, third time = pattern”, I’ve decided to write down the details for the next time.
Getting Cloudera’s Hadoop distribution up and running sounds pretty easy.
via: http://www.cloudera.com/blog/2011/01/setting-up-cdh3-hadoop-on-my-new-macbook-pro/
Thursday, 25 November 2010
Why the Cloudera - Membase partnership?
For those scenarios that require both scalable low latency data access and batch analytics to complete the application’s mission. This kind of hybrid, bidirectional data integration is the topological requirement of new applications – AOL Advertising and ShareThis are joint customers with these requirements. A Flume interface provides a streaming interface from Membase to Hadoop; a Sqoop utility can be used for batch transfers between the two. Both of these utilities will be familiar to Hadoop watchers.
Basically OLTP (Membase) and OLAP (Cloudera/Hadoop). And I told you everybody Flumes.
Tuesday, 23 November 2010
Hadoop on EC2: A Detailed Guide
Lars George, with his very (very) detailed style, explains step by step how to set up Hadoop on EC2.
Let’s jump into it head first and solve the problem of actually launching a cluster. You have heard that Hadoop is shipped with EC2 support, but how do you actually start up a Hadoop cluster on EC2? You do have a couple of choices and as Tom’s article above explains you could start all instances in the cluster by hand. But why would you want to do that if there are scripts available that do all the work for you? And to complicate matters, how do you select the AMI (the Amazon Machine Image) that has the Hadoop version you need or want? Does it have Hive installed for your subsequent analysis of the collected data? Just running a check to count the available public Hadoop images returns 41! That gets daunting very quickly. Sure you can roll your own - but that implies even more manual labor that you probably better spend on productive work. But there is help available..
Lars is using the CDH tools for this tutorial and points out the ☞ incubating Whirr project.
via: http://www.larsgeorge.com/2010/10/hadoop-on-ec2-primer.html
Saturday, 13 November 2010
Videos from Hadoop World
There was one NoSQL conference that I’ve missed and I was really pissed off: Hadoop World. Even if I’ve followed and curated the Twitter feed, resulting in Hadoop World in tweets, the feeling of not being there made me really sad. But now, thanks to Cloudera I’ll be able to watch most of the presentations. Many of them have already been published and the complete list can be found ☞ here.
Based on the twitter activity on that day, I’ve selected below the ones that seemed to have generated most buzz. The list contains names like Facebook, Twitter, eBay, Yahoo!, StumbleUpon, comScore, Mozilla, AOL. And there are quite a few more …
Thursday, 11 November 2010
Quick Reference: Hadoop Tools Ecosystem
Just a quick reference of the continuously growing Hadoop tools ecosystem.
Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
HDFS
A distributed file system that provides high throughput access to application data.
MapReduce
A software framework for distributed processing of large data sets on compute clusters.
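As a rough illustration of the programming model (not Hadoop's actual Java API), here is the classic word count written as map, shuffle and reduce steps in plain Python. The function names are made up for this sketch, and everything runs in a single process instead of across a cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # On a cluster each document would be mapped on a different node;
    # here the map calls simply run one after another.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big clusters", "data everywhere"])
print(counts)
```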
Amazon Elastic MapReduce
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
aws.amazon.com/elasticmapreduce/
Cloudera Distribution for Hadoop (CDH)
Cloudera’s Distribution for Hadoop (CDH) sets a new standard for Hadoop-based data management platforms.
ZooKeeper
A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
HBase
A scalable, distributed database that supports structured data storage for large tables.
Avro
A data serialization system. Similar to ☞ Thrift and ☞ Protocol Buffers.
Sqoop
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
- Imports individual tables or entire databases to files in HDFS
- Generates Java classes to allow you to interact with your imported data
- Provides the ability to import from SQL databases straight into your Hive data warehouse
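For a feel of how these capabilities map to the command line, here is a small Python helper that only assembles a Sqoop invocation and does not run it. The helper name, the JDBC URL and the table name are hypothetical. The `import`, `import-all-tables`, `--connect`, `--table` and `--hive-import` arguments are the ones I know from Sqoop 1, so check them against the user guide for your version.

```python
import shlex

def sqoop_import_args(jdbc_url, table=None, hive=False):
    """Build an argument list for a Sqoop import; nothing is executed here."""
    if table is None:
        # No table given: import every table in the database.
        args = ["sqoop", "import-all-tables", "--connect", jdbc_url]
    else:
        # Import a single table into files in HDFS.
        args = ["sqoop", "import", "--connect", jdbc_url, "--table", table]
    if hive:
        # Also load the imported data into the Hive warehouse.
        args.append("--hive-import")
    return args

# Hypothetical database and table, used only to show the resulting command line.
args = sqoop_import_args("jdbc:mysql://db.example.com/shop", table="orders", hive=True)
print(" ".join(shlex.quote(a) for a in args))
```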
Flume
Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.
archive.cloudera.com/cdh/3/flume/
Hive
Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called Hive QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language.
Pig
A high-level data-flow language and execution framework for parallel computation. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce).
Cascading
Cascading is a Query API and Query Planner used for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.
Cascalog
Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.
github.com/nathanmarz/cascalog
HUE
Hue is a graphical user interface to operate and develop applications for Hadoop. Hue applications are collected into a desktop-style environment and delivered as a Web application, requiring no additional installation for individual users.
You can read more about HUE on ☞ Cloudera blog.
Chukwa
Chukwa is a data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
Mahout
A scalable machine learning and data mining library.
Integration with Relational databases
- Oracle
- Hadoop connector for Oracle Ora-Oop
- Hadoop and Oracle Parallel Processing
Integration with Data Warehouses
The only list I have is MapReduce, RDBMS, and Data Warehouse, but I’m afraid it is rather old. So maybe someone could help me update it.
Anything else? Once we validate this list, I’ll probably have to move it to the NoSQL reference.
Tuesday, 26 October 2010
Cloudera Raises Another $25 million in Funding
WSJ.com:
Cloudera, a key distributor of software that helps companies analyze big piles of data, has raised another $25 million.
[…]
Cloudera, like other open-source companies, makes money on what is essentially free software by charging for Hadoop training classes and professional services to help companies get the software up and running. In June, the company started offering proprietary software tools to make it easier for companies to run Hadoop at large scale.
Personally, I think Cloudera’s secret lies in building, improving, supporting and distributing tools, or, simply put, making it really easy for others to get their data into Hadoop. And I’m referring here to: Flume, Sqoop, Oozie, Hue.
Last, but not least, it doesn’t seem like everyone is agreeing with the statement from Beyond Hadoop - Next-Generation Big Data Architectures: “people who really do have cutting edge performance and scalability requirements today have already moved on from the Hadoop model”.
Update: ☞ New York Times article.
via: http://blogs.wsj.com/digits/2010/10/26/cloudera-raises-hefty-funding-round/
Friday, 15 October 2010
Membase and Cloudera with Flume and Sqoop
James Phillips (Membase):
On the technology integration front, we have built and are making available to customers two mechanisms for integrating Membase and Cloudera Distribution for Hadoop (CDH). The first is a Membase NodeCode module that can stream data from Membase to CDH in real-time. As new operational data enters Membase, it can be massaged in real time and pumped into a CDH cluster for processing. The second is a Sqoop-derived batch loader utility that enables loading of data from Membase to CDH, and vice versa.
Real-time integration using Flume. Batch integration using Sqoop. Sounds like Cloudera’s tools are delivering.
via: http://www.infoq.com/news/2010/10/membase-cdh-integration
