Cloudera: All content tagged as Cloudera in NoSQL databases and polyglot persistence
Monday, 13 May 2013
Cloudera Announces Cloudera Developer Kit, Enabling Developers to Build Hadoop Apps Faster
I didn’t know what to think of this announcement after reading the WSJ title. After checking the project GitHub page, I still don’t know what to make of it.
Original title and link: Cloudera Announces Cloudera Developer Kit, Enabling Developers to Build Hadoop Apps Faster (©myNoSQL)
Tuesday, 30 April 2013
Cloudera Impala 1.0 Release Notes and A Couple of Questions
This is what I’ve been looking for since posting about Impala 1.0: the release notes. From the new features list:
- support for ALTER TABLE REFRESH for a single table
- Hints for specifying particular join strategies
- Dynamic resource management, allowing high concurrency for Impala queries
Question: if I remember correctly, Impala uses a single process on each machine to execute queries.
- is it multi-threaded?
- does it do any memory/CPU management so one query is not completely exhausting any of these resources?
- what happens with the queries executing when this process fails?
Original title and link: Cloudera Impala 1.0 Release Notes and A Couple of Questions (©myNoSQL)
Impala 1.0 - That was fast
Cloudera announces Impala 1.0 GA release.
That was fast. I guess this is one of the (little) advantages of having Hortonworks working on Stinger, Pivotal on HAWQ, and Qubole offering Hive, Pig and Sqoop as-a-Service.
Original title and link: Impala 1.0 - That was fast (©myNoSQL)
Monday, 18 March 2013
How Does MapR Compare to Cloudera?
Staying in MapR land, the question of comparing MapR to Cloudera is answered by people from all sides (MapR, Cloudera and Hortonworks). My summary: “cool proprietary technology addressing some of the current limitations of Hadoop, but also missing some of the features the Hadoop community has come up with”.
Original title and link: How Does MapR Compare to Cloudera? (©myNoSQL)
via: http://www.quora.com/How-does-MapR-plan-to-compete-with-Cloudera
Tuesday, 12 March 2013
Parquet - Columnar Storage Format for Hadoop by Twitter and Cloudera
Announced 2 hours ago, by Twitter’s analytics infrastructure engineer Dmitriy Ryaboy, here comes Parquet:
We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
The Parquet format page describes the details of the Apache Thrift metadata encoding, supported types, Thrift definitions, etc.
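To make the "columnar data representation" idea concrete, here is a minimal sketch in Python. It is not Parquet's actual on-disk format or API. The sample records, the field names, and the helpers `to_columns` and `run_length_encode` are all hypothetical, chosen only to show why storing values column by column helps compression and lets a query read just the fields it needs.

```python
from itertools import groupby

# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"user": "alice", "country": "US", "clicks": 3},
    {"user": "bob", "country": "US", "clicks": 5},
    {"user": "carol", "country": "US", "clicks": 1},
    {"user": "dave", "country": "DE", "clicks": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every record has the same keys.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Values of one column sit next to each other, so repeated values compress well.
encoded_country = run_length_encode(columns["country"])

# An aggregate over one field only touches that column, not whole rows.
total_clicks = sum(columns["clicks"])
```

With rows stored one after another, the `country` values would be interleaved with the other fields and the runs would be broken up; in the columnar layout they are adjacent.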
Original title and link: Parquet - Columnar Storage Format for Hadoop by Twitter and Cloudera (©myNoSQL)
Monday, 4 March 2013
How Many Hadoops?
The short answer is there is only one Apache Hadoop distribution.
The long answer is that there are many distributions that include Apache Hadoop or are claiming compatibility with Apache Hadoop.
The oldest and probably most popular: Cloudera’s Distribution of Hadoop (CDH).
The 100% open source: Hortonworks Data Platform.
The proprietary: MapR.
The blue one: IBM InfoSphere BigInsights.
The latest: WANdisco Hadoop WDD, Intel Distribution of Hadoop and Pivotal HD from EMC Greenplum.
There’s also the version Facebook’s running on their cluster which includes Facebook Corona: a different approach to job scheduling and resource management.
But this list is not complete as it doesn’t include appliances featuring Hadoop. In this category we have:
- Oracle’s Big Data appliance featuring Cloudera’s Distribution of Hadoop
- NetApp’s Hadooplers
- EMC Greenplum DCA
- Teradata Aster Discovery Platform featuring Hortonworks’s Hadoop Data Platform
- Data Direct Networks (DDN)
I hope I didn’t miss any important ones [1]. As a conclusion for this list, my question is: who is actually benefiting from all these distributions?
[1] I left aside for now Hadoop-as-a-Service.
Original title and link: How Many Hadoops? (©myNoSQL)
Tuesday, 26 February 2013
Cloudera Pissed Off
Charles Zedlewski responds on Cloudera’s behalf to the recent attacks on Hadoop and Impala:
I’m reminded of our open source strategy this week not only because of the further validation of Hadoop’s popularity but also because of the entry of a new round of proprietary imitators. At one point there were six distinct vendors all promoting proprietary filesystems as alternatives to HDFS, many of which included breathless claims of how they could make Apache Hadoop faster and “more powerful.” This year we get to see history repeat itself, this time with SQL engines. The marketing is nearly identical to that of the proprietary filesystem era: damning open source with faint praise, pointing out its limitations and extolling the virtues of some feature(s) proprietary to that particular vendor.
Proprietary SQL vendors will pull a page from the proprietary storage playbook: damn open source Impala with faint praise and point out its limitations, both real and contrived. They will be equally ineffective. We will continue to bet on an open, integrated, and highly flexible big data platform. Saying you are “all in on Hadoop” while simultaneously promoting a proprietary platform means you are missing the point.
Neither Cloudera nor the other companies that have invested heavily, or everything, in the Hadoop ecosystem are big enough to ignore large corporations attacking their bets. Every corporation is trying to emulate the Microsoft strategy: wait for a new technology to be confirmed, then jump at the opportunity with all your force. But I really hope open source will prevail.
Original title and link: Cloudera Pissed Off (©myNoSQL)
via: http://blog.cloudera.com/blog/2013/02/open-source-flattery-and-the-platform-for-big-data/
Tuesday, 19 February 2013
Inside Cloudera Impala: Runtime Code Generation
Nong Li about Cloudera’s Impala implementation:
Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post will discuss how we use runtime code generation to significantly improve our CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code-generation eliminates and go over in more detail one of the queries in the TPCH workload where code generation improves overall query speeds by close to 3x.
This reminded me of the days I was working on Java AOP frameworks whose implementation was based on bytecode generation for the same purpose of optimization. Everything worked perfectly well as long as the underlying assumptions remained the same.
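As a rough illustration of the idea in the quoted post, the sketch below contrasts interpreting an expression tree for every row with generating a specialized function once and reusing it. It is a Python analogy only, not Impala's implementation. The tuple-based expression format, the sample predicate, and the helpers `interpret`, `to_source` and `generate` are hypothetical.

```python
import operator

# Hypothetical predicate: (price * quantity) > 100.
# Tuples are (operator, left, right); strings are column names; numbers are literals.
EXPR = (">", ("*", "price", "quantity"), 100)

OPS = {"*": operator.mul, "+": operator.add, ">": operator.gt}


def interpret(node, row):
    """Evaluate the tree for one row, paying for type checks and dispatch each time."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](interpret(left, row), interpret(right, row))
    if isinstance(node, str):
        return row[node]
    return node


def to_source(node):
    """Translate the tree into a Python expression string."""
    if isinstance(node, tuple):
        op, left, right = node
        return f"({to_source(left)} {op} {to_source(right)})"
    if isinstance(node, str):
        return f"row[{node!r}]"
    return repr(node)


def generate(node):
    """Build, once per query, a function specialized to this exact expression."""
    source = f"def predicate(row):\n    return {to_source(node)}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["predicate"]


rows = [
    {"price": 20, "quantity": 6},
    {"price": 5, "quantity": 3},
]

predicate = generate(EXPR)
interpreted = [interpret(EXPR, row) for row in rows]
generated = [predicate(row) for row in rows]
```

The generated function bakes the shape of the expression into straight-line code, so the per-row tree walk disappears. It also shows the caveat from the paragraph above: the generated code is only valid while the assumptions it was built from (here, the expression and the column names) stay the same.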
Original title and link: Inside Cloudera Impala: Runtime Code Generation (©myNoSQL)
via: http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala-runtime-code-generation/
Wednesday, 30 January 2013
Hadoop in 2013: What Hortonworks Will Focus On
Shaun Connolly summarizing a recent webinar about where Hortonworks’ work on Hadoop will focus in 2013:
[…] Interactive Query, Business Continuity (DR, Snapshots, etc.), Secure Access, as well as ongoing investments in Data Integration, Management (i.e. Ambari), and Online Data (i.e. HBase).
[…] Rather than abandon the Apache Hive community, Hortonworks is focused on working in the community to optimize Hive’s ability to serve big data exploration and interactive query in support of important BI use cases. Moreover, we are focused on enabling Hive to take advantage of YARN in Apache Hadoop 2.0, which will help ensure fast query workloads don’t compete for resources with the other jobs running in the cluster. Enabling Hadoop to predictably support enterprise workloads that span Batch, Interactive, and Online use cases is an important area of focus for us.
Basically this says that Hortonworks sees YARN and Hive as the answer to online or real-time interactive querying of Hadoop data. Cloudera’s take on this is different.
Original title and link: Hadoop in 2013: What Hortonworks Will Focus On (©myNoSQL)
via: http://hortonworks.com/blog/the-road-ahead-for-hortonworks-and-hadoop/
Tuesday, 29 January 2013
Hortonworks Joins OpenStack Foundation
Hortonworks, a leading contributor to Apache Hadoop, today announced it has joined the OpenStack Foundation, which promotes the development, distribution and adoption of the OpenStack cloud operating system. By contributing to the OpenStack ecosystem, Hortonworks is supporting the open source community and facilitating adoption of 100-percent open source Apache Hadoop-based solutions in the cloud. Now customers will be able to access an enterprise-ready Hortonworks Data Platform built for the cloud that alleviates the time and complexities of manually deploying a big data solution.
What took so long? Cloudera has been part of OpenStack since 2010.
Original title and link: Hortonworks Joins OpenStack Foundation (©myNoSQL)
via: http://hortonworks.com/about-us/news/hortonworks-joins-openstack-foundation/
Thursday, 24 January 2013
Hadoop in the Cloud: Skytap and Joyent
Besides the well-established Amazon Elastic MapReduce and Windows Azure HDInsight, there are two new Hadoop-in-the-cloud services:
- Skytap, which offers Cloudera CDH4 Enterprise experimentation clusters of up to 50 nodes
- Joyent Solution for Hadoop which is offered in partnership with Hortonworks. I hesitated for a bit to mention Joyent considering the page says “Sign up now to talk to a Joyent Solutions Architect” which is anything but a cloud service.
Original title and link: Hadoop in the Cloud: Skytap and Joyent (©myNoSQL)
Monday, 21 January 2013
Hadoop Business Ecosystem as of January 2013
As I was hoping and expecting, Datameer updated the chart visualizing Hadoop’s business side ecosystem:
It shouldn’t be a surprise to anyone that the most connected companies in the Hadoop space are Cloudera and Hortonworks. They outrank the IT industry mammoths: IBM, HP, Microsoft, Oracle, SAP, etc.
Original title and link: Hadoop Business Ecosystem as of January 2013 (©myNoSQL)
via: http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html