mapreduce: All content tagged as mapreduce in NoSQL databases and polyglot persistence
Thursday, 14 March 2013
Your Hadoop in Amazon's Cloud
Adam Horwich of MetaBroadcast shares their experience of running a Hadoop cluster on Amazon, taking advantage of availability zones, spot instances, and other tricks:
Oh Hadoop, how you infuriate me with your spurious failures and endless bugs, but how fantastic you can actually be when it comes down to it. I’ve been fighting with Hadoop a lot this past year, from a Region Server domino apocalypse, to the seemingly impossible job of duplicating a cluster. […] But to make the most of what you’ve got, I’ve been researching better ways of using resources available. There’s, of course, always been the option of using Amazon’s EMR service, but we originally built our cluster before that existed as a product, and have built our services around a standardised Hadoop cluster, with local DataNodes. This blog post will be about adding in some nice EMR style features to your dedicated Hadoop cluster running in AWS.
Original title and link: Your Hadoop in Amazon’s Cloud (©myNoSQL)
Wednesday, 13 March 2013
Oracle Paper: The Cost of Do-It-Yourself Hadoop vs Oracle Big Data Appliance
Based on ESG’s modeling of a medium-sized Hadoop-oriented big data project, the preconfigured Oracle Big Data Appliance is 39% less costly than a “build” equivalent do-it-yourself infrastructure. And using Oracle Big Data Appliance will cut the project length by about one-third. For most enterprises planning to take big data beyond experimentation and proof-of-concept, ESG suggests skipping the idea of in-house development, on-going management, and expansion of your own big data infrastructure, to instead look to purpose-built infrastructure solutions such as Oracle Big Data Appliance.
This is an extract from Oracle’s whitepaper “Getting Real about Big Data: Build Versus Buy”. It’s a nice reading exercise to better understand how the database leader is positioning its Oracle Big Data Appliance against a commodity-hardware Hadoop cluster.
I’d love to see the equivalent paper from Hortonworks [1].
[1] The only reason I’m referring directly to Hortonworks and not also Cloudera is that the Hadoop part of Oracle Big Data Appliance is offered by Cloudera.
Original title and link: Oracle Paper: The Cost of Do-It-Yourself Hadoop vs Oracle Big Data Appliance (©myNoSQL)
Tuesday, 12 March 2013
Proprietary Hadoop Is a Losing Strategy
Matt Asay (10gen) for ReadWrite adding to the long discussion around EMC’s Pivotal HD announcement:
EMC has seemingly bottomless resources to throw at Hadoop, and every incentive to do so. It’s a smart, highly successful company and no doubt will prove successful with Pivotal HD. However, I can’t see it ever dominating an open-source infrastructure market with a proprietary distribution. Open source is the foundation for today’s most interesting markets, from Big Data to mobile to cloud computing. It’s unlikely that EMC will somehow stem this tide with a proprietary product, no matter its short-term performance or functionality advantages.
While I’ve linked to different perspectives about this topic, I’m not sure anyone outside our bubble actually came to a conclusion.
- Dan Woods (Forbes): Why SQL Matters, the Limits of Open Source, and Other Lessons of EMC Greenplum’s Pivotal HD
- Matthew Aslett (the451group): What It Means to Be “all In” on Hadoop
- Michael Hausenblas (MapR): Hadoop: What Matters Are Open and Standardized Interfaces
- Merv Adrian (Gartner): Open Source “Purity”, Hadoop, and Market Realities
- Steve Loughran: Hadoop Distributions: If There Is a Problem in the Hadoop JARs, How Are You Going to Fix It?
- Shaun Connolly (Hortonworks): Did EMC Just Say Fork You To The Hadoop Community?
What I know, though, is that EMC is benefiting from this. A lot. Three weeks ago, I wasn’t reading anything about EMC and Hadoop. Today all major websites have at least a couple of articles about it.
Original title and link: Proprietary Hadoop Is a Losing Strategy (©myNoSQL)
via: http://readwrite.com/2013/03/12/proprietary-hadoop-is-a-losing-strategy
Parquet - Columnar Storage Format for Hadoop by Twitter and Cloudera
Announced 2 hours ago, by Twitter’s analytics infrastructure engineer Dmitriy Ryaboy, here comes Parquet:
We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
The Parquet format page describes the details of the Apache Thrift metadata encoding, supported types, Thrift definitions, etc.
Original title and link: Parquet - Columnar Storage Format for Hadoop by Twitter and Cloudera (©myNoSQL)
Monday, 11 March 2013
What It Means to Be “all In” on Hadoop
Another post about Pivotal HD and the accompanying statements, this time from Matthew Aslett:
Pivotal HD is not Hadoop
Neither is Cloudera’s Distribution, including Apache Hadoop.
Nor the Hortonworks Data Platform.
Nor the MapR Distribution.
Nor IBM’s InfoSphere BigInsights.
Nor the WANdisco Distro.
Nor Intel’s Distribution for Apache Hadoop.
Original title and link: What It Means to Be “all In” on Hadoop (©myNoSQL)
via: http://blogs.the451group.com/information_management/2013/03/11/all-in-on-hadoop/
Hadoop: What Matters Are Open and Standardized Interfaces
Michael Hausenblas (MapR) on the topic of the day, “Hadoop distributions”, about which I’ve already linked to Steve Loughran’s “If There Is a Problem in the Hadoop JARs, How Are You Going to Fix It?”; Merv Adrian’s “Open Source ‘Purity’, Hadoop, and Market Realities”; and Matthew Aslett’s “What It Means to Be ‘all In’ on Hadoop”:
One aspect I’d like to highlight is the importance of ‘standard’ interfaces, defined through community consensus, and enforced by the Apaches and the likes. I think it makes perfect sense to offer a commercial implementation that is superior to the implementation you get ‘for free’ — as long as you’re 100% compatible with the community-defined standard.
Here’s something I don’t understand about the above. The “Defining Hadoop” wiki page dedicates a complete paragraph to compatibility. The most important and relevant part of it is:
Other entities may claim that other products (including derivative works) are compatible with Apache Hadoop. The Apache Hadoop development team is not a standards body, and cannot confirm or deny such assertions. All that we can say is “there is no official certification that a product is compatible with Hadoop, other than when a release of the Apache source tree is declared a new release of Apache Hadoop itself”.
Going back to MapR’s post, my question is: if the Apache Hadoop project doesn’t offer a certification toolkit and the project team doesn’t validate compatibility, what exactly does it mean to be “100% compatible” with something that can change at any time and is completely out of your control?
Original title and link: Hadoop: What Matters Are Open and Standardized Interfaces (©myNoSQL)
via: http://www.mapr.com/blog/hadoop-what-matters-are-open-and-standardized-interfaces
Open Source “Purity,” Hadoop, and Market Realities
Merv Adrian (Gartner):
The question is whether it is somehow inappropriate, even “evil,” for EMC to enter the market without having “enough” committers to open source Apache projects. More broadly, it’s about whether other people can use, incorporate, add to and profit from Apache Hadoop.
After reading a lot of reactions to EMC’s announcement, the question floating in my head was: how many similar complaints have I read about IBM, Amazon, and all the other companies that either distribute Hadoop or offer services around it without contributing directly to the Apache Hadoop project? None.
I love open source and I would love it if every business using an open source project found a way to contribute back. But the reality today is different. There are many businesses making use of open source and contributing nothing back. There are also numerous companies making money from open source and contributing back almost nothing. There are very few companies making money directly from their open source projects. And there are very few open source projects that receive any sort of funds to support their communities. Maybe things will change. Or maybe we should take another look at how the open source market works and come up with a different, more sustainable approach.
Original title and link: Open Source “Purity,” Hadoop, and Market Realities (©myNoSQL)
via: http://blogs.gartner.com/merv-adrian/2013/03/09/open-source-purity-hadoop-and-market-realities/
Hadoop Distributions: If There Is a Problem in the Hadoop JARs, How Are You Going to Fix It?
A long post by Steve Loughran about the implications of forking Hadoop and the different evolution paths. There’s no clear conclusion, except the advice to include the following question in discussions with the various vendors:
“if there is a problem in the Hadoop JARs — how are you going to fix it?”
Original title and link: Hadoop Distributions: If There Is a Problem in the Hadoop JARs, How Are You Going to Fix It? (©myNoSQL)
via: http://steveloughran.blogspot.co.uk/2013/03/apache-hadoop-yes-but-how-are-you-going.html
Thursday, 7 March 2013
Separating Open Source Signal From Enterprise Hadoop Noise
Shaun Connolly for Hortonworks’s blog:
We also believe that any company that thinks they are “all in” on making open source Apache Hadoop into an enterprise-viable platform needs to have key committers working on the open source technologies (Hortonworks has 50+ committers) or partner with a company like Hortonworks who is focused on working with the ecosystem on ensuring Hadoop integrates and interoperates well with existing enterprise systems and tools.
Original title and link: Separating Open Source Signal From Enterprise Hadoop Noise (©myNoSQL)
via: http://hortonworks.com/blog/separating-open-source-signal-from-enterprise-hadoop-noise/
Tuesday, 5 March 2013
The Hadoop Ecosystem Infographic
A very timely infographic from GigaOm about the Hadoop ecosystem, augmenting my earlier “How Many Hadoops?” post:

Original title and link: The Hadoop Ecosystem Infographic (©myNoSQL)
Monday, 4 March 2013
How Many Hadoops?
The short answer is there is only one Apache Hadoop distribution.
The long answer is that there are many distributions that include Apache Hadoop or are claiming compatibility with Apache Hadoop.
The oldest and probably most popular: Cloudera’s Distribution of Hadoop (CDH)
The 100% open source: Hortonworks Data Platform.
The proprietary: MapR.
The blue one: IBM InfoSphere BigInsights.
The latest: WANdisco Hadoop WDD, Intel Distribution of Hadoop and Pivotal HD from EMC Greenplum.
There’s also the version Facebook’s running on their cluster which includes Facebook Corona: a different approach to job scheduling and resource management.
But this list is not complete as it doesn’t include appliances featuring Hadoop. In this category we have:
- Oracle’s Big Data appliance featuring Cloudera’s Distribution of Hadoop
- NetApp’s Hadooplers
- EMC Greenplum DCA
- Teradata Aster Discovery Platform featuring the Hortonworks Data Platform
- Data Direct Networks (DDN)
I hope I didn’t miss any important ones [1]. As a conclusion for this list, my question is: who is actually benefiting from all these distributions?
[1] I left aside for now Hadoop-as-a-Service.
Original title and link: How Many Hadoops? (©myNoSQL)
The History of Hadoop
Nice article and interviews with important people in the history of Hadoop by Derrick Harris for GigaOm.
Depending on how one defines its birth, Hadoop is now 10 years old. In that decade, Hadoop has gone from being the hopeful answer to Yahoo’s search-engine woes to a general-purpose computing platform that’s poised to be the foundation for the next generation of data-based applications.
This is the first article of a four-part series that promises to explain everything about Hadoop.
Original title and link: The History of Hadoop (©myNoSQL)
via: http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/