cloudera: All content tagged as cloudera in NoSQL databases and polyglot persistence
Monday, 28 November 2011
Why Is Cloudera Packing Mahout With Hadoop?
Machine learning is an entire field devoted to Information Retrieval, Statistics, Linear Algebra, Analysis of Algorithms, and many other subjects. This field allows us to examine things such as recommendation engines involving new friends, love interests, and new products. We can do incredibly advanced analysis around genetic sequencing and examination, distributed search and frequency pattern matching, as well as mathematical analysis with vectors, matrices, and singular value decomposition (SVD).
All these fields have deep connections in the big data space.
Original title and link: Why Is Cloudera Packing Mahout With Hadoop? (©myNoSQL)
via: http://www.cloudera.com/blog/2011/11/cdh3u2-apache-mahout-integration/
Tuesday, 1 November 2011
Hortonworks Data Platform: Hortonworks’ Hadoop Distribution
Announcement came out today[1]:
Hortonworks Data Platform, powered by Apache Hadoop — As we began to interact with enterprises and ecosystem partners, the one constant was the need for a base distribution of Apache Hadoop that is 100% open source and that contains the essential components used with every Hadoop installation. A distribution was needed to provide an easy to install, tightly integrated and well tested set of servers and tools. As we interacted with potential partners, we also heard the message loud and clear that they wanted open and secure APIs to easily integrate and extend Hadoop. We believe we have succeeded on both fronts. The Hortonworks Data Platform is such an open source distribution. It is powered by Apache Hadoop and includes the essential Hadoop components, plus some that make it more manageable, open and extensible. Our distribution is based on Hadoop 0.20.205, the first Apache Hadoop release that supports security and HBase. It also includes some new APIs, such as WebHDFS and those in Ambari and HCatalog, which will make it easy for our partners to integrate their products with Apache Hadoop. For those new to Ambari, it is an open source Apache project that will bring improved installation and management to Hadoop. HCatalog is a metadata management service for simplifying the sharing of data between Hadoop and other data systems. We are releasing Hortonworks Data Platform initially as a limited technology preview with plans to open it up to the public in early 2012.
The fight is on, even if for now the tone is still polite. And if we add MapR and LexisNexis’ HPCC to the mix, not to mention the armies of marketers and salespeople coming from Oracle, IBM, EMC, NetApp, etc., this actually smells like war.
Edward Ribeiro aptly commented: “This reminds me of Linux distros war circa 2001”.
[1] The emphasis in the text is mine, to underline the most important aspects of the announcement.
Original title and link: Hortonworks Data Platform: Hortonworks’ Hadoop Distribution (©myNoSQL)
Hadoop, Hortonworks, Cloudera: A Page of History
At a time when everyone is reading, writing, or talking about Steve Jobs’ biography, Wired has published a long article looking at the history of Hadoop (Yahoo era), the Hortonworks spin-off, and Cloudera. While the article doesn’t cover the late rush into the Hadoop world by giants like Oracle, IBM, EMC, and others that all want a piece of it, it gives an interesting overview of the Hadoop ecosystem dynamics:
The initial result is an amusingly heated rivalry between Cloudera and Hortonworks — the kind of rivalry you only see in the open source world. […] But ultimately, this Hadoop civil war shows just how vibrant the platform is.
“Additional investment in the platform and more people concentrating on the open source distro is good for community and good for Cloudera,” Olson says. It’s the sort of thing you always hear from a competitor when a new company enters a market. But in this case, there’s a truth to it. Bearden and Baldeschwieler’s efforts to expand the open source project can only help Cloudera — and the rest of the market.
Original title and link: Hadoop, Hortonworks, Cloudera: A Page of History (©myNoSQL)
via: http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop/all/1
Monday, 31 October 2011
Datameer Is the First BI/Analytics Platform Built Natively on Hadoop
Brian Smith (Datameer Regional Director of Sales):
DAS is an open book at every stage of the data pipeline, with plug and play support at each phase – integration, analysis and visualization. Under the covers, DAS generates Java/MapReduce code that runs natively on the Hadoop cluster. All current Hadoop distros are supported – we’re Switzerland when it comes to platform support for Apache, Cloudera, MapR, IBM and the rest, we run all of it in a browser on Windows, Mac and Linux.
As always I won’t comment on statements referring to “first” or “best”. But I find Brian Smith’s assessment of the Hadoop economics very accurate:
The economics are compelling — Hadoop is moving out costly analytic databases and warehouses, driving IT to re-look at ADBMS sales cycles, shifting IT dollars and vendor roadmaps, and generally wreaking havoc in the traditional vendor community. We’ve gone from one or two distributions to nine in the last year! And, literally every vendor in the BI/DBMS space has a Hadoop connector, the latest being the recent Oracle announcement. Everybody is on board this train — All this based upon the premise of unlimited scale and data variety at a fraction of traditional costs. Technical challenges exist, but its clear that there’s a sea change.
Original title and link: Datameer Is the First BI/Analytics Platform Built Natively on Hadoop (©myNoSQL)
via: http://datameer.com/blog/uncategorized/why-i-am-at-datameer.html
Thursday, 6 October 2011
Mine Is Bigger Than Yours: Hadoop Code Contributions
Who’s bigger? Hortonworks’ The Yahoo! Effect or Cloudera’s The Community Effect?
This is ugly and should never happen to an open source project.
Still, Joe Brockmeier (RWW) describes this as a superb win-win situation:
It might seem unhealthy for companies to be clamoring for credit in open source projects, but it’s a sign of health for projects. If companies position themselves to be top contributors, and care about their standing, the projects win. Users win too. Developers in the ecosystem also win – since it’s far easier to hire existing contributors than trying to push outsiders in to a project.
But there’s just a minor thing missing. Who gets the cheese?
Original title and link: Mine Is Bigger Than Yours: Hadoop Code Contributions (©myNoSQL)
Tuesday, 27 September 2011
R and Hadoop: Revolution Analytics and Cloudera Partnership Announced
In the series of big announcements coming out this month, Cloudera and Revolution Analytics, the enterprise provider of R software, have announced a partnership to integrate Cloudera’s Hadoop distribution with the Revolution R Enterprise platform. The integration gives R developers direct access to Hadoop data stores and lets them write MapReduce jobs directly in R.
The integration packages, named RevoConnectR for Apache Hadoop, are already freely available on GitHub, and they will also get commercial support with Revolution R Enterprise 5.0 Server for Linux.
You can read more about this announcement on:
Original title and link: R and Hadoop: Revolution Analytics and Cloudera Partnership Announced (©myNoSQL)
Sunday, 14 August 2011
Hoop - Hadoop HDFS Over HTTP
Cloudera has created a set of tools named Hoop that allow access to HDFS over HTTP/S. My first question was: why would you use HTTP to access HDFS? Here is the answer:
- Transfer data between clusters running different versions of Hadoop (thereby overcoming RPC versioning issues).
- Access data in a HDFS cluster behind a firewall. The Hoop server acts as a gateway and is the only system that is allowed to go through the firewall.
I'm not sure, though, how many will use HTTP for transferring large amounts of data. But if you want to see how it is implemented, you can find the source code on GitHub.
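To make the idea of HTTP access to HDFS concrete, here is a minimal Python sketch that only builds a request URL and does not touch the network. It rests on several assumptions: it follows the WebHDFS-style REST layout (/webhdfs/v1/<path>?op=...) exposed by HttpFS, the Apache Hadoop component that grew out of Hoop, and Hoop's own 2011 URL scheme may well differ. The gateway host name, the port, the user name, and the helper hdfs_http_url are hypothetical and chosen for illustration only.

```python
from urllib.parse import quote, urlencode


def hdfs_http_url(host, path, op, port=14000, user=None, scheme="http"):
    """Build a WebHDFS-style REST URL for an absolute HDFS path.

    Assumption: the gateway speaks the /webhdfs/v1 layout; this is not
    taken from the Hoop announcement itself.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    params = {"op": op}
    if user is not None:
        # Simple (non-Kerberos) setups pass the caller as a query parameter.
        params["user.name"] = user
    # quote() keeps "/" intact and escapes characters such as spaces.
    return f"{scheme}://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


# Hypothetical gateway in front of a firewalled cluster.
example = hdfs_http_url(
    "gateway.example.com", "/logs/2011/07/part-00000", "OPEN", user="alice"
)
```

Because the client only needs to reach the gateway over HTTP, the same URL works regardless of which Hadoop RPC version the cluster behind it runs, which is the point of the two use cases listed above.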
Original title and link: Hoop - Hadoop HDFS Over HTTP (©myNoSQL)
via: http://www.cloudera.com/blog/2011/07/hoop-hadoop-hdfs-over-http/
Tuesday, 12 July 2011
Hadoop and IBM Netezza: Compete or Co-Exist?
I assume people on both sides of data warehouses (users and providers) are asking the same question. IBM Netezza and Cloudera seem to agree on the answer:
IBM Netezza had worked with Cloudera to put together a compelling demo to highlight the value of our combined solution of CDH/Hadoop and Netezza. Through an interesting use case, the demo showed how businesses could have their “hot” data (most recent data) residing in Netezza, “warm” data (longer time range data) residing in HDFS, while leveraging the Cloudera Connector for Netezza and Oozie (workflow engine part of CDH) to provide deeper insights to business executives.
I would have liked to know more details about the use case though. Just categorizing data into “hot” and “warm” is not enough to understand the advantages of each piece.
Original title and link: Hadoop and IBM Netezza: Compete or Co-Exist? (©myNoSQL)
via: http://www.cloudera.com/blog/2011/06/reflections-from-enzee-universe-2011/
Thursday, 7 July 2011
Petabyte-Scale Hadoop Clusters
Curt Monash quoting Omer Trajman (Cloudera) in a post counting petabyte-scale Hadoop deployments:
The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.
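The mean-versus-median remark in the quote is easy to see with a few made-up numbers. The Python sketch below uses hypothetical cluster sizes (not Cloudera's survey data) in which most clusters are under 30 nodes while two very large ones pull the average up to 200.

```python
from statistics import mean, median

# Hypothetical node counts for ten clusters: eight small, two very large.
cluster_sizes = [10, 12, 15, 18, 20, 22, 25, 28, 850, 1000]

mean_size = mean(cluster_sizes)      # sum is 2000 over 10 clusters
median_size = median(cluster_sizes)  # midpoint of the two middle values, 20 and 22
small_clusters = sum(1 for n in cluster_sizes if n < 30)

print(f"mean={mean_size}, median={median_size}, under 30 nodes: {small_clusters}")
```

With these numbers the mean is 200 nodes while the median is 21, so an "average cluster size" of 200 says little about what a typical deployment looks like.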
The first questions that popped into my head after reading this quote:
- How many deployments does DataStax’s Brisk have? How many are close to or over a petabyte?
- How many clients run EMC Greenplum HD and how many are close to this scale?
- Same question about NetApp Hadoopler clients.
- Same question for MapR.
Answering these questions would give us a good overview of the Hadoop ecosystem.
Original title and link: Petabyte-Scale Hadoop Clusters (©myNoSQL)
via: http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/
Thursday, 23 June 2011
Moving Away From Amazon’s EMR Service to an In-House Hadoop Cluster
Many of our systems use Amazon’s S3 as a backup repository for log data. Our data became too large to process by traditional techniques, so we started using Amazon’s Elastic MapReduce (EMR) to do more expensive queries on our data stored in S3. The major advantage of EMR for us was the lack of operational overhead. With a simple API call, we could have a 20 or 40 node cluster running to crunch our data, which we shutdown at the conclusion of the run. We had two systems interacting with EMR. The first consisted of shell scripts to start an EMR cluster, run a pig script, and load the output data from S3 into our data warehousing system. The second was a Java application that launched pig jobs on an EMR cluster via the Java API and consumed the data in S3 produced by EMR.
What might make you consider moving from Amazon Elastic MapReduce, the cloud version of MapReduce, to an on-premises Hadoop cluster:
- performance and tuning
- monitoring
- API access
- lack of latest features
Original title and link: Moving Away From Amazon’s EMR Service to an In-House Hadoop Cluster (NoSQL database©myNoSQL)
Friday, 3 June 2011
Experimenting with Hadoop using Cloudera VirtualBox Demo

If you don’t count the download, you’ll get this up and running in 5 minutes tops. At the end you’ll have Hadoop, Sqoop, Pig, Hive, HBase, ZooKeeper, Oozie, Hue, Flume, and Whirr all configured and ready to experiment with.
Making it easy for users to experiment with these tools increases the chances for adoption. Adoption means business.
Original title and link: Experimenting with Hadoop using Cloudera VirtualBox Demo (NoSQL databases © myNoSQL)
Tuesday, 19 April 2011
Adopting Apache Hadoop and Hive
Moving federal government analytics from MySQL to Hadoop and Hive:
HDFS offered us a distributed, resilient, and scalable filesystem while Hadoop promised to bring the work to where the data resided so we could make efficient use of local disk on multiple nodes. Hive, however, really pushed our decision in favor of a Hadoop-based system. Our data is just unstructured enough to make traditional RDBMS schemas a bit brittle and restrictive, but has enough structure to make a schema-less NoSQL system unnecessarily vague. Hive let us compromise between the two — it’s sort of a “SomeSQL” system.
Original title and link: Adopting Apache Hadoop and Hive (NoSQL databases © myNoSQL)
via: http://www.cloudera.com/blog/2011/04/adopting-apache-hadoop-in-the-federal-government/