


Cloudera: All content tagged as Cloudera in NoSQL databases and polyglot persistence

Cloudera Announces Cloudera Developer Kit, Enabling Developers to Build Hadoop Apps Faster

I didn’t know what to think of this announcement after reading the WSJ title. After checking the project’s GitHub page, I still don’t know what to make of it.

Original title and link: Cloudera Announces Cloudera Developer Kit, Enabling Developers to Build Hadoop Apps Faster (NoSQL database©myNoSQL)

Cloudera Impala 1.0 Release Notes and A Couple of Questions

This is what I’ve been looking for since posting about Impala 1.0: the release notes. From the new features list:

  • Support for ALTER TABLE
  • REFRESH for a single table
  • Hints for specifying particular join strategies
  • Dynamic resource management, allowing high concurrency for Impala queries

Question: if I remember correctly, Impala uses a single process on each machine to execute queries.

  1. Is it multi-threaded?
  2. Does it do any memory/CPU management so that a single query cannot completely exhaust either of these resources?
  3. What happens to the queries that are executing when this process fails?

Original title and link: Cloudera Impala 1.0 Release Notes and A Couple of Questions (NoSQL database©myNoSQL)

Impala 1.0 - That was fast

Cloudera announces Impala 1.0 GA release.

That was fast. I guess this is one of the (little) advantages of having Hortonworks working on Stinger, Pivotal on HAWQ, and Qubole offering Hive, Pig, and Sqoop as a service.

Original title and link: Impala 1.0 - That was fast (NoSQL database©myNoSQL)

MapR Raises $30mil in Series C

Where is MapR today?

  1. MapR raised a total of $59mil.
  2. According to John Schroeder (CEO) “92% of MapR customers pay primarily for licenses and not for ancillary services and support”.
  3. According to Wikibon, MapR had $23mil in revenue in 2012, 49% of which came from services (nb: this seems to contradict the point above).
  4. Support for MapR installations is offered by Accenture and Booz Allen Hamilton

How will MapR use the new capital?

With the new funding, the company plans to invest in research & development, and expand into Asia.

How is MapR seeing its competitors?

John Schroeder (CEO):

“Our competitors’ model is very cash intensive and you have to wonder whether or not they’ll ever be cash-flow positive”.

Cloudera has raised $141mil so far:

  1. Series A: $5mil
  2. Series B: $6mil
  3. Series C: $25mil
  4. Series D: $40mil
  5. Series E: $65mil

According to this, Cloudera raised $36mil in the first 3 rounds. I couldn’t find any official data about the capital raised by Hortonworks, but the number I’ve seen in a couple of places is $50mil. So far MapR raised $59mil.
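As a quick sanity check on the arithmetic, here is a minimal sketch that adds up the rounds listed above. The variable names are mine, and the Hortonworks figure is the unofficial number mentioned in the post, not confirmed data.

```python
# Funding rounds in millions of USD, as listed in the post.
cloudera_rounds = {"A": 5, "B": 6, "C": 25, "D": 40, "E": 65}

cloudera_total = sum(cloudera_rounds.values())
cloudera_first_three = sum(cloudera_rounds[r] for r in ("A", "B", "C"))

# MapR's total after its Series C, and the unofficial Hortonworks figure.
mapr_total = 59
hortonworks_unofficial = 50

print(f"Cloudera total: ${cloudera_total}mil")
print(f"Cloudera after Series C: ${cloudera_first_three}mil")
print(f"MapR after Series C: ${mapr_total}mil")
print(f"Hortonworks (unofficial): ${hortonworks_unofficial}mil")
```

Comparing at the same stage, Cloudera had raised less by its Series C than MapR has now.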

Sources for these bits:

Original title and link: MapR Raises $30mil in Series C (NoSQL database©myNoSQL)

How Does MapR Compare to Cloudera?

Staying in MapR land, the question of comparing MapR to Cloudera is answered by people from all sides (MapR, Cloudera and Hortonworks). My summary: “cool proprietary technology addressing some of the current limitations of Hadoop, but also missing some of the features the Hadoop community has come up with”.

Original title and link: How Does MapR Compare to Cloudera? (NoSQL database©myNoSQL)


Parquet - Columnar Storage Format for Hadoop by Twitter and Cloudera

Announced 2 hours ago, by Twitter’s analytics infrastructure engineer Dmitriy Ryaboy, here comes Parquet:

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

The Parquet format page describes the details of the Apache Thrift metadata encoding, supported types, Thrift definitions, etc.
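To make the "compressed, efficient columnar data representation" idea concrete, here is a toy sketch in plain Python. It is not Parquet's actual on-disk format or encoding scheme; the records, field names, and helper functions are all made up for illustration. It only shows why storing values column by column helps: a query touches just the columns it needs, and runs of similar values compress well.

```python
from itertools import groupby

# Hypothetical records; the field names are made up for illustration.
rows = [
    {"user": "alice", "country": "US", "clicks": 3},
    {"user": "bob", "country": "US", "clicks": 7},
    {"user": "carol", "country": "US", "clicks": 1},
    {"user": "dave", "country": "DE", "clicks": 4},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# A query that needs only one field reads only that column's values.
total_clicks = sum(columns["clicks"])

# Values within one column are often repetitive, so they compress well.
encoded_country = run_length_encode(columns["country"])

print(total_clicks)
print(encoded_country)
```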

Original title and link: Parquet - Columnar Storage Format for Hadoop by Twitter and Cloudera (NoSQL database©myNoSQL)


How Many Hadoops?

The short answer is there is only one Apache Hadoop distribution.

The long answer is that there are many distributions that include Apache Hadoop or are claiming compatibility with Apache Hadoop.

The oldest and probably most popular: Cloudera’s Distribution of Hadoop (CDH).

The 100% open source: Hortonworks Data Platform.

The proprietary: MapR.

The blue one: IBM InfoSphere BigInsights.

The latest: WANdisco Hadoop WDD, Intel Distribution of Hadoop and Pivotal HD from EMC Greenplum.

There’s also the version Facebook’s running on their cluster which includes Facebook Corona: a different approach to job scheduling and resource management.

But this list is not complete as it doesn’t include appliances featuring Hadoop. In this category we have:

  1. Oracle’s Big Data appliance featuring Cloudera’s Distribution of Hadoop
  2. NetApp’s Hadooplers
  3. EMC Greenplum DCA
  4. Teradata Aster Discovery Platform featuring the Hortonworks Data Platform
  5. Data Direct Networks (DDN)

I hope I didn’t miss any important ones [1]. As a conclusion for this list, my question is: who is actually benefiting from all these distributions?

[1] I have left Hadoop-as-a-Service aside for now.

Original title and link: How Many Hadoops? (NoSQL database©myNoSQL)

Cloudera Pissed Off

Charles Zedlewski states Cloudera’s position on the recent attacks on Hadoop and Impala:

I’m reminded of our open source strategy this week not only because of the further validation of Hadoop’s popularity but also because of the entry of a new round of proprietary imitators. At one point there were six distinct vendors all promoting proprietary filesystems as alternatives to HDFS, many of which included breathless claims of how they could make Apache Hadoop faster and “more powerful.” This year we get to see history repeat itself, this time with SQL engines. The marketing is nearly identical to that of the proprietary filesystem era: damning open source with faint praise, pointing out its limitations and extolling the virtues of some feature(s) proprietary to that particular vendor.

Proprietary SQL vendors will pull a page from the proprietary storage playbook: damn open source Impala with faint praise and point out its limitations, both real and contrived. They will be equally ineffective. We will continue to bet on an open, integrated, and highly flexible big data platform. Saying you are “all in on Hadoop” while simultaneously promoting a proprietary platform means you are missing the point.

Neither Cloudera nor the other companies that have invested heavily, or even everything, in the Hadoop ecosystem are big enough to ignore large corporations attacking their bets. Every corporation is trying to emulate the Microsoft strategy: wait for a new technology to be confirmed, then jump at the opportunity with full force. But I really hope open source will prevail.

Original title and link: Cloudera Pissed Off (NoSQL database©myNoSQL)


Inside Cloudera Impala: Runtime Code Generation

Nong Li on Cloudera’s Impala implementation:

Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post will discuss how we use runtime code generation to significantly improve our CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code-generation eliminates and go over in more detail one of the queries in the TPCH workload where code generation improves overall query speeds by close to 3x.

This reminded me of the days I was working on Java AOP frameworks whose implementation was based on bytecode generation for the same purpose of optimization. Everything worked perfectly well as long as the underlying assumptions remained the same.
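For readers unfamiliar with the technique, here is a minimal sketch of the general idea in Python. It is not Impala's implementation, and the predicate, column names, and helper functions are hypothetical. It contrasts walking an expression tree for every row with compiling that tree once into a function specialized for the query.

```python
import operator

# Hypothetical predicate: price > 10 AND qty < 5, as a small expression tree.
PREDICATE = ("and",
             ("gt", ("col", "price"), ("const", 10)),
             ("lt", ("col", "qty"), ("const", 5)))

OPS = {"gt": operator.gt, "lt": operator.lt}
SYMBOLS = {"gt": ">", "lt": "<", "and": "and"}


def interpret(node, row):
    """Walk the tree for every row: branching and dispatch on each call."""
    kind = node[0]
    if kind == "col":
        return row[node[1]]
    if kind == "const":
        return node[1]
    if kind == "and":
        return interpret(node[1], row) and interpret(node[2], row)
    return OPS[kind](interpret(node[1], row), interpret(node[2], row))


def to_source(node):
    """Translate the tree into a Python expression over the variable `row`."""
    kind = node[0]
    if kind == "col":
        return f"row[{node[1]!r}]"
    if kind == "const":
        return repr(node[1])
    return f"({to_source(node[1])} {SYMBOLS[kind]} {to_source(node[2])})"


def generate(node):
    """Compile the tree once into a function specialized for this query."""
    source = f"lambda row: {to_source(node)}"
    return eval(compile(source, "<generated>", "eval"))


rows = [
    {"price": 12, "qty": 3},
    {"price": 8, "qty": 1},
    {"price": 20, "qty": 9},
]

compiled = generate(PREDICATE)
interpreted_result = [interpret(PREDICATE, r) for r in rows]
generated_result = [compiled(r) for r in rows]
print(generated_result)
```

The generated function bakes in the column names and constants of this one query, which is where the speedup comes from and also why it depends on those assumptions staying fixed.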

Original title and link: Inside Cloudera Impala: Runtime Code Generation (NoSQL database©myNoSQL)


Hadoop in 2013: What Hortonworks Will Focus On

Shaun Connolly, summarizing a recent webinar about where Hortonworks’ work on Hadoop will focus in 2013:

[…] Interactive Query, Business Continuity (DR, Snapshots, etc.), Secure Access, as well as ongoing investments in Data Integration, Management (i.e. Ambari), and Online Data (i.e. HBase).
[…] Rather than abandon the Apache Hive community, Hortonworks is focused on working in the community to optimize Hive’s ability to serve big data exploration and interactive query in support of important BI use cases. Moreover, we are focused on enabling Hive to take advantage of YARN in Apache Hadoop 2.0, which will help ensure fast query workloads don’t compete for resources with the other jobs running in the cluster. Enabling Hadoop to predictably support enterprise workloads that span Batch, Interactive, and Online use cases is an important area of focus for us.

Basically this says that Hortonworks sees YARN and Hive as the answer to online or real-time interactive querying of Hadoop data. Cloudera’s take on this is different.

Original title and link: Hadoop in 2013: What Hortonworks Will Focus On (NoSQL database©myNoSQL)


Hortonworks Joins OpenStack Foundation

Hortonworks, a leading contributor to Apache Hadoop, today announced it has joined the OpenStack Foundation, which promotes the development, distribution and adoption of the OpenStack cloud operating system. By contributing to the OpenStack ecosystem, Hortonworks is supporting the open source community and facilitating adoption of 100-percent open source Apache Hadoop-based solutions in the cloud. Now customers will be able to access an enterprise-ready Hortonworks Data Platform built for the cloud that alleviates the time and complexities of manually deploying a big data solution.

What took so long? Cloudera has been part of OpenStack since 2010.

Original title and link: Hortonworks Joins OpenStack Foundation (NoSQL database©myNoSQL)


Hadoop in the Cloud: Skytap and Joyent

Besides the well-established Amazon Elastic MapReduce and Windows Azure HDInsight, there are two new Hadoop-in-the-cloud services:

  • Skytap, which offers Cloudera CDH4 Enterprise experimentation clusters of up to 50 nodes
  • Joyent Solution for Hadoop, which is offered in partnership with Hortonworks. I hesitated for a bit to mention Joyent, considering the page says “Sign up now to talk to a Joyent Solutions Architect”, which is anything but a cloud service.

Original title and link: Hadoop in the Cloud: Skytap and Joyent (NoSQL database©myNoSQL)