Cloudera: All content tagged as Cloudera in NoSQL databases and polyglot persistence

Dell and Cloudera and Intel join forces for appliances

Me in Intel kills a Hadoop and feeds another:

As for Intel, what if this investment also sealed an exclusive deal for Hadoop-centric Cloudera-supported Intel-powered appliance?

I didn’t know about the existing Dell-Cloudera-Intel partnership, but it is reinforced by the recent announcement of an in-memory appliance.

Since 2011, Cloudera, Dell and Intel have built pre-validated reference architectures for Hadoop. […]

The Dell In-Memory Appliances for Cloudera Enterprise is yet another proof point of the collaboration and synergies between the three companies. As the first of a family of appliances, it includes leading Dell hardware, Cloudera’s enterprise data hub -based on Cloudera Enterprise, Intel architecture for fast processing, and ScaleMP’s Versatile SMP (vSMP) architecture to aggregate multiple x86 servers into a single virtual machine to create large memory pools for in-memory processing.

Original title and link: Dell and Cloudera and Intel join forces for appliances (NoSQL database©myNoSQL)


Project Rhino goal: at-rest encryption for Apache Hadoop

Although network encryption has been provided in the Apache Hadoop platform for some time (since Hadoop 2.02-alpha/CDH 4.1), at-rest encryption, the encryption of data stored on persistent storage such as disk, is not. To meet that requirement in the platform, Cloudera and Intel are working with the rest of the Hadoop community under the umbrella of Project Rhino — an effort to bring a comprehensive security framework for data protection to Hadoop, which also now includes Apache Sentry (incubating) — to implement at-rest encryption for HDFS (HDFS-6134 and HADOOP-10150).

Looks like I got this wrong: Apache Sentry will become part of Project Rhino.

via: http://blog.cloudera.com/blog/2014/06/project-rhino-goal-at-rest-encryption/


Hadoop security: unifying Project Rhino and Sentry

One result of Intel’s investment in Cloudera is putting together the teams to work on the same projects:

As the goals of Project Rhino and Sentry to develop more robust authorization mechanisms in Apache Hadoop are in complete alignment, the efforts of the engineers and security experts from both companies have merged, and their work now contributes to both projects. The specific goal is “unified authorization”, which goes beyond setting up authorization policies for multiple Hadoop components in a single administrative tool; it means setting an access policy once (typically tied to a “group” defined in an external user directory) and having it enforced across all of the different tools that this group of people uses to access data in Hadoop – for example access through Hive, Impala, search, as well as access from tools that execute MapReduce, Pig, and beyond.

A great first step.

You know what would be even better? A single security framework for Hadoop instead of two.
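
To make the “set an access policy once” idea from the quote a bit more concrete, here is a minimal Python sketch. It is not the Sentry or Project Rhino API; the `PolicyStore`, `Grant`, and `Tool` names, the users, and the groups are all made up for illustration. The only point is that a policy is attached to a group once, and every tool asks the same store.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    group: str      # group name, as it would come from an external user directory
    resource: str   # e.g. a table name
    action: str     # e.g. "select"


@dataclass
class PolicyStore:
    """Hypothetical central store: each grant is defined once, for a group."""
    grants: set = field(default_factory=set)
    memberships: dict = field(default_factory=dict)  # user -> set of groups

    def grant(self, group, resource, action):
        self.grants.add(Grant(group, resource, action))

    def is_allowed(self, user, resource, action):
        groups = self.memberships.get(user, set())
        return any(Grant(g, resource, action) in self.grants for g in groups)


class Tool:
    """Stand-in for Hive, Impala, search, etc.; every tool asks the same store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def read(self, user, table):
        if not self.store.is_allowed(user, table, "select"):
            raise PermissionError(f"{self.name}: {user} may not read {table}")
        return f"{self.name}: rows from {table} for {user}"


store = PolicyStore(memberships={"alice": {"analysts"}, "bob": {"interns"}})
store.grant("analysts", "sales", "select")  # the policy is set exactly once
tools = [Tool(name, store) for name in ("hive", "impala", "search")]
```

With this setup, `alice` can read `sales` through any of the three tools and `bob` is rejected by all of them, without the policy being repeated per tool.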

via: http://vision.cloudera.com/project-rhino-and-sentry-onward-to-unified-authorization/


Cloudera, Hadoop, Data warehouses and SLR camera

Amr Awadallah in an interview with Dan Woods for Forbes:

Our advantage is that we can encompass more data and run more workloads with less friction than any other platform. The analogy I use most often is the difference between the SLR camera and the camera on your smart phone. Almost everyone takes more pictures on their smart phone than on their SLR.

The SLR camera is like the enterprise data warehouse. The SLR camera is really, really good at taking pictures, in the same sense that an enterprise data warehouse is really, really good at running queries. But that’s the only thing it does. The data it picks is only exposed to that workload. The system we provide, the enterprise data hub, is more like the smartphone. It can take decent pictures—they won’t be as good as the SLR camera, and in this I’m referring to the Impala system. So Impala will run queries. The queries won’t run at the same interactive OLAP speeds that you get from a high-end data warehouse. However, for many use cases, that performance might be good enough, given that the cost is 10 times lower.

I’ve linked in the past to Ben Thompson’s visualization of the innovator’s dilemma:

[Figure: Ben Thompson’s visualization of the innovator’s dilemma]

The explanation goes like this: incumbents’ products usually over-serve consumer needs, leaving room for new entrants’ good-enough, lower-priced products.

via: http://www.forbes.com/sites/danwoods/2014/05/09/clouderas-strategy-for-conquering-big-data-the-enterprise/


Choice of NoSQL databases from Cloudera

Adam Fowler[1] looks at the potential confusion for Cloudera’s customers when talking about NoSQL databases:

As for Cloudera customers I’m not too sure. It may confuse people asking Cloudera about NoSQL. Below is a potential conversation that, as a sales engineer for NoSQL vendor MarkLogic, I can see easily happening:

This announcement struck me as overly publicized. It’s normal for companies with similar interests to partner, but a fair amount of care should go into clearing up any possible confusion, and I don’t think that happened.

Just to summarize: Cloudera provides support for HBase and Accumulo, and it has deals with MongoDB and Oracle. I assume that in the sales process, Cloudera will go with: “we work with whatever you already have in place”. As for recommending a NoSQL solution to their customers, it will probably go as in Adam Fowler’s post. To which we could probably add Oracle too.


  1. Adam Fowler works for MarkLogic. 

via: http://adamfowlerml.wordpress.com/2014/05/05/choice-of-nosql-databases-from-cloudera/


Intel kills a Hadoop and feeds another

I seriously doubt you could have missed the 2nd part of this, but here’s the shortest executive summary:

  1. Intel has killed its own distribution of Hadoop — is there anyone that would disagree this is a good idea?
  2. Intel has invested $740mil in Cloudera (for 18%). That’s not a typo: 740 million.

The main questions:

  1. where will Cloudera put the $900mil raised in the last round(s)?
  2. why did Intel invest so much?

These questions were also asked by Dan Primack for CNN Money, and after looking at different angles he comes up empty.

So let’s check other sources:

  1. TechCrunch initially speculated that much of the investment went to existing shareholders.

    The post was later updated with a comment from Cloudera’s VP of marketing stating that the majority of the money went to the company. But no word on how it will be used.

  2. Reuters writes that Intel made the investment to ensure their leading position in server processors:

    Intel hopes that encouraging more companies to leap into Big Data analysis will lead to higher sales of its high-end Xeon server processors. The chipmaker believes that hitching its wagon to Cloudera’s version of Hadoop, instead of pushing its own version, will make that happen faster.

    Still no word on how Cloudera will be using the money.

  3. Derrick Harris for GigaOm writes that the deal makes a lot of sense for both companies[1]:

    Cloudera needs capital and Intel’s huge sales force to keep up its engineering efforts and grow the company internationally.

    As part of the deal, Cloudera will be an early adopter of Intel gear and will optimize its Hadoop software to run on Intel’s latest technologies. Intel will port some of its work into the Cloudera distribution and will maintain its own Hadoop engineering team that will work alongside Cloudera’s engineers to help unite the two company’s goals.

  4. Jeff Kelly for SiliconAngle emphasizes the same channel advantages:

    Cloudera’s biggest reseller partner is Oracle. Based on my reading of the Intel announcement, the deal is not an official reseller partnership, but Intel will “market and promote CDH and Cloudera Enterprise to its customers as its preferred Hadoop platform.” Not quite as nice as having the Intel salesforce closing deals for it, but Cloudera stands to gain significant new business from the arrangement.


So how about this short list on how this round will be used by Cloudera:

  1. a part goes for international expansion
  2. a larger part goes to early shareholders
  3. the largest part goes into acquisitions

As for Intel, what if this investment also sealed an exclusive deal for Hadoop-centric Cloudera-supported Intel-powered appliance?


  1. Insert snarky comment here about a $740m deal that would not make sense to one of the parties. How about not making sense to any of them? 


Cloudera Search Interface: Inside Cloudera's customer support Enterprise Data Hub

Great use of their own technologies to better serve the customer:

This application goes way beyond simple indexing and searching. We are using Cloudera Search, HBase, and MapReduce to process, store, and visualize stack traces that wouldn’t be possible with just a search index. How Monocle Stack Trace integrates with the larger CSI application goes way beyond that, though. It’s a great feeling when you are able to execute a search in Monocle Stack Trace that links directly to a point in time in a customer log file that an Impala query returned after churning through tens of GBs of data — done interactively from a Web UI on the order of a second or two.

I can easily see this becoming a real product used by software companies that offer direct customer support.

via: http://blog.cloudera.com/blog/2014/02/secrets-of-cloudera-support-inside-our-own-enterprise-data-hub/


The Forrester Wave for Hadoop market

Update: I’d like to thank the people that pointed out in the comment thread that I’ve messed up quite a few aspects in my comments about the report. I don’t believe in taking down posts that have been out for a while, so please be warned that basically this article can be ignored.

Thank you, and my apologies for those comments that were a misinterpretation of the report.


This is the Q1 2014 Forrester Wave for Hadoop:

[Figure: Forrester Wave for Hadoop, Q1 2014]

A couple of thoughts:

  1. Cloudera, Hortonworks, MapR are positioned very (very) close.

    1. Hortonworks is positioned closer to the top right, meaning they report more customers/a larger install base.
    2. MapR is higher on the vertical axis meaning that MapR’s strategy is slightly better.

      For me, MapR’s strategy can be briefly summarized as:

      1. address some of the limitations in the Hadoop ecosystem
      2. provide API-compatible products for major components of the Hadoop ecosystem
      3. use the (trademarked) Apache product names to advertise their products

      I think the 1st point above explains the better positioning of MapR’s current offering.

    3. Even though Cloudera was the first pure-play Hadoop distribution, it’s positioned behind both Hortonworks and MapR.

  2. IBM has the largest market presence. That’s a big surprise, as I very rarely hear clear messages from IBM.

  3. IBM and Pivotal Software are considered to have the strongest strategy. That’s another interesting point in Forrester’s report. Except for the fact that IBM has a ton of data products and that Pivotal Software offers more than Hadoop, I don’t know what exactly explains this position.

    Strategy positioning in the Forrester report is based on quantifying the following categories: Licensing and pricing, Ability to execute, Product road map, Customer support. IBM and Pivotal are ranked first in all these categories (with maximum marks for the last 3). As a comparison, Hortonworks has 3/5 for Ability to execute (this must be related only to budget), and Cloudera has 3/5 for both Ability to execute and Customer support.

    Pivotal is third from last in terms of current offering. I guess my hypothesis for ranking Pivotal as 1st in terms of strategy is wrong.

  4. Microsoft, which through its collaboration with Hortonworks came up with HDInsight (basically enabling Hadoop for Excel and its data warehouse offering), is positioned second to last on all 3 axes.

    No one seems to love Microsoft anymore.

  5. While not a pure Hadoop player, DataStax has been offering the DataStax Enterprise platform, which includes support for analytics through Hadoop and search through Solr, for at least 2 years. That’s actually well before any other company in Forrester’s report had anything similar[1].

    This report focuses only on “general-purpose Hadoop solutions based on a differentiated, commercial Hadoop distribution”.

You can download the report after registering on Hortonworks’ site.


  1. DataStax is my employer. But what I wrote is a pure fact. 


Bloomberg says Cloudera raises at least $200m in new round

Dina Bass and Serena Saitto (Bloomberg):

Cloudera Inc. is raising at least $200 million in a new round of financing from investors including Intel Corp., according to people with knowledge of the situation.

Not confirmed yet.

via: http://www.bloomberg.com/news/2014-03-18/cloudera-said-to-raise-at-least-200-million-in-funding.html


A guide to write and run Giraph jobs on Hadoop

A good setup guide by Mirko Kämpf:

In this how-to, you will learn how to use Giraph 1.0.0 on top of CDH 4.x using a simple example dataset, and run example jobs that are already implemented in Giraph. You will also learn how to set up your own Giraph-based development environment. The end result will be a setup (not intended for production) for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.

[Figure: Anatomy of the Giraph data flow]
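
If you haven’t seen what a Giraph job looks like conceptually: Giraph follows the Pregel “think like a vertex” model, where the computation runs in supersteps and vertices talk to each other only through messages. The sketch below is plain Python, not the Giraph API (which is Java), and the `shortest_paths` function and the tiny sample graph are made up for illustration. It assumes non-negative edge weights.

```python
import math


def shortest_paths(edges, source):
    """Single-source shortest paths in a Pregel-style superstep loop.

    edges: dict mapping vertex -> list of (neighbor, weight) pairs.
    """
    vertices = set(edges) | {n for outs in edges.values() for n, _ in outs}
    dist = {v: math.inf for v in vertices}
    inbox = {source: [0]}  # messages to be delivered in the next superstep

    while inbox:  # halt when no vertex has pending messages
        outbox = {}
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:  # value improved: update and notify neighbors
                dist[v] = best
                for neighbor, weight in edges.get(v, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox  # barrier: messages are only visible in the next superstep
    return dist


graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
result = shortest_paths(graph, "a")
```

A real Giraph job distributes the vertices across workers and runs the per-vertex logic in parallel; the loop here only mimics the superstep and message-passing structure on a single machine.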

via: http://blog.cloudera.com/blog/2014/02/how-to-write-and-run-giraph-jobs-on-hadoop/


Cloudera shipped a mountain... what can you read between the lines

Cloudera Engineering (@ClouderaEng) shipped a mountain of new product (production-grade software, not just technical previews): Cloudera Impala, Cloudera Search, Cloudera Navigator, Cloudera Development Kit (now Kite SDK), new Apache Accumulo packages for CDH, and several iterative releases of CDH and Cloudera Manager. (And, the Cloudera Enterprise 5 Beta release was made available to the world.). Furthermore, as always, a ton of bug fixes and new features went upstream, with the features notably but not exclusively HiveServer2 and Apache Sentry (incubating).

How many things can you read in this paragraph?

  1. a not-so-subtle stab at Hortonworks’ series of technical previews.
  2. more and more projects brought under the CDH umbrella. Does more ever become too much? (I cannot explain why, but my first thought was “this feels so Oracle-style”)
  3. Cloudera’s current big bet is Impala. SQL and low latency querying. A big win for the project, but not necessarily a direct financial win for Cloudera, was its addition as a supported service on Amazon Elastic MapReduce.

via: http://blog.cloudera.com/blog/2014/01/this-month-and-year-in-the-ecosystem-december-2013/


Integrating R with Cloudera Impala for Real-Time queries on Hadoop

A very long tutorial by Istvan Szegedi on how to integrate R with Cloudera Impala, through the ODBC driver:

Cloudera Impala is an exciting new technology to provide real-time, interactive queries in Hadoop environment. It supports ODBC connectors and this makes it possible to integrate it with many popular BI tools and statistical software such as R. Together R and Impala provide an excellent combination for data analyst to process massive data sets efficiently and they can also support graphical representation of the result sets.

via: http://bighadoop.wordpress.com/2013/11/25/integrating-r-with-cloudera-impala-for-real-time-queries-on-hadoop/