


Cloudera: All content tagged as Cloudera in NoSQL databases and polyglot persistence

Cloudera Pissed Off

Charles Zedlewski states Cloudera’s position on the recent attacks on Hadoop and Impala:

I’m reminded of our open source strategy this week not only because of the further validation of Hadoop’s popularity but also because of the entry of a new round of proprietary imitators. At one point there were six distinct vendors all promoting proprietary filesystems as alternatives to HDFS, many of which included breathless claims of how they could make Apache Hadoop faster and “more powerful.” This year we get to see history repeat itself, this time with SQL engines. The marketing is nearly identical to that of the proprietary filesystem era: damning open source with faint praise, pointing out its limitations and extolling the virtues of some feature(s) proprietary to that particular vendor.

Proprietary SQL vendors will pull a page from the proprietary storage playbook: damn open source Impala with faint praise and point out its limitations, both real and contrived. They will be equally ineffective. We will continue to bet on an open, integrated, and highly flexible big data platform. Saying you are “all in on Hadoop” while simultaneously promoting a proprietary platform means you are missing the point.

Neither Cloudera nor the other companies that have bet heavily, in some cases everything, on the Hadoop ecosystem are large enough to ignore big corporations attacking those bets. Every corporation is trying to emulate the Microsoft strategy: wait for a new technology to prove itself, then jump at the opportunity with full force. But I really hope open source will prevail.

Original title and link: Cloudera Pissed Off (NoSQL database©myNoSQL)


Inside Cloudera Impala: Runtime Code Generation

Nong Li on Cloudera’s Impala implementation:

Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post will discuss how we use runtime code generation to significantly improve our CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code-generation eliminates and go over in more detail one of the queries in the TPCH workload where code generation improves overall query speeds by close to 3x.

This reminded me of the days I was working on Java AOP frameworks whose implementation was based on bytecode generation for the same purpose of optimization. Everything worked perfectly well as long as the underlying assumptions remained the same.
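To make the idea concrete, here is a minimal sketch of runtime code generation in Python. It is not how Impala does it: the expression format, the `interpret` and `generate` helpers, and the sample predicate are all hypothetical, chosen only to illustrate the general technique. The interpreted version walks the expression tree and dispatches on node type for every row; the generated version turns the tree into source code once, compiles it, and afterwards evaluates each row with a single function call.

```python
import operator

# Hypothetical expression format (an assumption for this sketch):
#   ("col", name) | ("const", value) | (op, left, right)
OPS = {"+": operator.add, "*": operator.mul, "<": operator.lt}


def interpret(expr, row):
    """Evaluate expr against one row by walking the tree every time."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    # Per-row overhead: a dictionary lookup plus two recursive calls.
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Translate the tree into a Python expression string."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    if kind not in OPS:
        raise ValueError(f"unsupported operator: {kind!r}")
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def generate(expr):
    """Compile the expression once; the result is a plain function of a row."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


# price * 2 < 100
predicate = ("<", ("*", ("col", "price"), ("const", 2)), ("const", 100))
rows = [{"price": 10}, {"price": 60}]

compiled = generate(predicate)
interpreted_results = [interpret(predicate, r) for r in rows]
compiled_results = [compiled(r) for r in rows]
```

Both paths return the same answers; the generated function simply skips the per-row tree walk. It also shows the caveat above: the generated code is only valid as long as the expression and the row layout it was specialized for stay the same.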



Hadoop in 2013: What Hortonworks Will Focus On

Shaun Connolly summarizes a recent webinar on where Hortonworks’ Hadoop work will focus in 2013:

[…] Interactive Query, Business Continuity (DR, Snapshots, etc.), Secure Access, as well as ongoing investments in Data Integration, Management (i.e. Ambari), and Online Data (i.e. HBase).
[…] Rather than abandon the Apache Hive community, Hortonworks is focused on working in the community to optimize Hive’s ability to serve big data exploration and interactive query in support of important BI use cases. Moreover, we are focused on enabling Hive to take advantage of YARN in Apache Hadoop 2.0, which will help ensure fast query workloads don’t compete for resources with the other jobs running in the cluster. Enabling Hadoop to predictably support enterprise workloads that span Batch, Interactive, and Online use cases is an important area of focus for us.

Basically this says that Hortonworks sees YARN and Hive as the answer to online or real-time interactive querying of Hadoop data. Cloudera’s take on this is different.



Hortonworks Joins OpenStack Foundation

Hortonworks, a leading contributor to Apache Hadoop, today announced it has joined the OpenStack Foundation, which promotes the development, distribution and adoption of the OpenStack cloud operating system. By contributing to the OpenStack ecosystem, Hortonworks is supporting the open source community and facilitating adoption of 100-percent open source Apache Hadoop-based solutions in the cloud. Now customers will be able to access an enterprise-ready Hortonworks Data Platform built for the cloud that alleviates the time and complexities of manually deploying a big data solution.

What took them so long? Cloudera has been part of OpenStack since 2010.



Hadoop in the Cloud: Skytap and Joyent

Besides the well-established Amazon Elastic MapReduce and Windows Azure HDInsight, there are two new Hadoop-in-the-cloud services:

  • Skytap, which offers Cloudera CDH4 Enterprise experimentation clusters of up to 50 nodes
  • Joyent Solution for Hadoop, which is offered in partnership with Hortonworks. I hesitated for a bit to mention Joyent, considering the page says “Sign up now to talk to a Joyent Solutions Architect”, which is anything but a cloud service.


Hadoop Business Ecosystem as of January 2013

As I was hoping and expecting, Datameer updated the chart visualizing Hadoop’s business side ecosystem:


It shouldn’t be a surprise to anyone that the most connected companies in the Hadoop space are Cloudera and Hortonworks. They outrank the IT industry mammoths: IBM, HP, Microsoft, Oracle, SAP, etc.



Video Interview With Cloudera’s Jeff Hammerbacher on Building Big Data Systems

I wasn’t expecting to see this on TechCrunch, so it took me a while to decide to link to it. I did it for Jeff Hammerbacher.



Overview of Dremel-Like Solutions: Moving Beyond Hadoop for Big Data Needs

Until I learn more about the recently announced Cloudera Impala and Druid from Metamarkets, this article by Jaikumar Vijayan should offer (with some inherent mistakes [1]) a good overview of the solutions aiming to offer alternatives to the batch-processing nature of Hadoop:

  • Google Dremel (BigQuery)
  • Cloudera Impala
  • Metamarkets Druid
  • Nodeable StreamReduce
  • SAP HANA integrated with Hadoop, etc.

  1. Just an example: “If you can stand latencies of a few seconds, Hadoop is fine. But Hadoop MapReduce is never going to be useful for sub-second latencies”. Then “The technology [nb Google Dremel] can run queries over trillion-row data tables in seconds…”

    Maybe just one more: consider the title “Moving beyond Hadoop” and then the quote from Google’s Ju-kay Kwek: “Google uses Dremel in conjunction with MapReduce. […] Hadoop and Dremel are distributed computing technologies, but each was built to address very different problems.”



Cloudera Distribution of Hadoop 4.1 Released

The yearly major release of CDH is out.



HttpFS: Another Hadoop File System Over HTTP

Just a new HTTP interface for the Hadoop file system. The main differences between HttpFS and WebHDFS are that this one was created by Cloudera, not Hortonworks (on top of their previous Hoop library), and:

HttpFs is a proxy so, unlike WebHDFS, it does not require clients be able to access every machine in the cluster. This allows clients to access a cluster that is behind a firewall via the WebHDFS REST API.

The question is: if they are API compatible and both open source, why not unify them?
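As a rough illustration of what API compatibility means here, the sketch below builds WebHDFS-style REST URLs in Python. The `webhdfs_url` helper and the host names are hypothetical, and the ports are assumptions (14000 and 50070 were the usual defaults for HttpFS and for the NameNode web interface at the time; a given cluster may differ). The point is that the request looks the same either way and only the endpoint changes: with HttpFS the client talks to a single gateway, while with plain WebHDFS it also has to reach the DataNodes it gets redirected to.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, user=None, **params):
    """Build a WebHDFS REST URL (hypothetical helper for this sketch).

    path is an absolute HDFS path such as "/user/alice/data.txt".
    """
    query = {"op": op}
    if user is not None:
        # Simple (non-Kerberos) authentication passes the user name as a parameter.
        query["user.name"] = user
    query.update(params)
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"


# Same operation, two entry points (host names and ports are assumptions).
via_httpfs = webhdfs_url("gateway.example.com", 14000, "/user/alice", "LISTSTATUS", user="alice")
via_webhdfs = webhdfs_url("namenode.example.com", 50070, "/user/alice", "LISTSTATUS", user="alice")
```

Everything after the host and port is identical in both URLs, which is what makes the two implementations interchangeable from the client’s point of view.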



Cloudera and HP Partnership to Simplify Hadoop Deployments

As I was expecting after the series of announcements coming from MapR, Cloudera is announcing its partnership with HP:

Under the terms of the joint development and licensing agreement, the two companies will deliver open standards-based reference architectures that simplify management and accelerate deployment of Hadoop Cluster environments. Clients can purchase the Cloudera Enterprise platform and future Cloudera products either directly from HP or bundled in HP AppSystem for Apache Hadoop.

The new HP reference architecture for Apache Hadoop for Cloudera and HP AppSystem for Apache Hadoop—Cloudera are based on HP Converged Infrastructure. They include the Cloudera Enterprise platform and HP Insight Cluster Manager Utility (CMU) software.


MapR Claims Title as De Facto Standard for Hadoop

Maureen O’Gara:

The champagne has been flowing over at MapR since Google announced the integration of its Distribution for Hadoop with Google Compute Engine, the start-up’s second big win in a row.

Indeed, MapR on Amazon Elastic MapReduce and on Google Compute Engine are two very important events in the life of MapR and for the Hadoop ecosystem in general. But it is still a long way from there to being a de facto standard.
