bigdata: All content tagged as bigdata in NoSQL databases and polyglot persistence
Monday, 11 March 2013
What It Means to Be “all In” on Hadoop
Another post about Pivotal HD and the accompanying statements, this time from Matthew Aslett:
Pivotal HD is not Hadoop
Neither is Cloudera’s Distribution, including Apache Hadoop.
Nor the Hortonworks Data Platform.
Nor the MapR Distribution.
Nor IBM’s InfoSphere BigInsights.
Nor the WANdisco Distro.
Nor Intel’s Distribution for Apache Hadoop.
Original title and link: What It Means to Be “all In” on Hadoop (©myNoSQL)
via: http://blogs.the451group.com/information_management/2013/03/11/all-in-on-hadoop/
Hadoop: What Matters Are Open and Standardized Interfaces
Michael Hausenblas (MapR) about the topic of the day: “Hadoop distributions”, about which I’ve already linked to Steve Loughran’s If There Is a Problem in the Hadoop JARs, How Are You Going to Fix It?, Merv Adrian’s Open Source “Purity”, Hadoop, and Market Realities and Matthew Aslett’s What It Means to Be “all In” on Hadoop:
One aspect I’d like to highlight is the importance of ‘standard’ interfaces, defined through community consensus, and enforced by the Apaches and the likes. I think it makes perfect sense to offer a commercial implementation that is superior to the implementation you get ‘for free’ — as long as you’re 100% compatible with the community-defined standard.
Here’s something I don’t understand about the above. The “Defining Hadoop” wiki page dedicates a complete paragraph to compatibility. The most important and relevant part of it is:
Other entities may claim that other products (including derivative works) are compatible with Apache Hadoop. The Apache Hadoop development team is not a standards body, and cannot confirm or deny such assertions. All that we can say is “there is no official certification that a product is compatible with Hadoop, other than when a release of the Apache source tree is declared a new release of Apache Hadoop itself”.
Going back to MapR’s post, my question is: if the Apache Hadoop project doesn’t offer a certification toolkit and the project team doesn’t validate compatibility, what exactly does it mean to be “100% compatible” with something that can change at any time and is completely out of your control?
via: http://www.mapr.com/blog/hadoop-what-matters-are-open-and-standardized-interfaces
Open Source “Purity,” Hadoop, and Market Realities
Merv Adrian (Gartner):
The question is whether it is somehow inappropriate, even “evil,” for EMC to enter the market without having “enough” committers to open source Apache projects. More broadly, it’s about whether other people can use, incorporate, add to and profit from Apache Hadoop.
After reading a lot of reactions to EMC’s announcement, the question floating in my head was: how many similar complaints have I read about IBM, Amazon, and all the other companies that either distribute Hadoop or offer services around it without contributing directly to the Apache Hadoop project? None.
I love open source and I would love it if every business using an open source project found a way to contribute back. But the reality today is different. There are many businesses making use of open source and contributing nothing back. There are also numerous companies making money from open source and contributing back almost nothing. There are very few companies making money directly from their open source projects. And there are very few open source projects that receive any sort of funds to support their communities. Maybe things will change. Or maybe we should take another look at how the open source market works and come up with a different, more sustainable approach.
via: http://blogs.gartner.com/merv-adrian/2013/03/09/open-source-purity-hadoop-and-market-realities/
Hadoop Distributions: If There Is a Problem in the Hadoop JARs, How Are You Going to Fix It?
A long post by Steve Loughran about the implications of forking Hadoop and the different evolution paths. There’s no clear conclusion, except the advice to include the following question in discussions with the various vendors:
“if there is a problem in the Hadoop JARs — how are you going to fix it?”
via: http://steveloughran.blogspot.co.uk/2013/03/apache-hadoop-yes-but-how-are-you-going.html
Thursday, 7 March 2013
Separating Open Source Signal From Enterprise Hadoop Noise
Shaun Connolly for Hortonworks’s blog:
We also believe that any company that thinks they are “all in” on making open source Apache Hadoop into an enterprise-viable platform needs to have key committers working on the open source technologies (Hortonworks has 50+ committers) or partner with a company like Hortonworks who is focused on working with the ecosystem on ensuring Hadoop integrates and interoperates well with existing enterprise systems and tools.
via: http://hortonworks.com/blog/separating-open-source-signal-from-enterprise-hadoop-noise/
Tuesday, 5 March 2013
The Hadoop Ecosystem Infographic
A very timely infographic from GigaOm about the Hadoop ecosystem, complementing my earlier “How Many Hadoops?” post:

Big Data Done Cheap
Quentin Hardy for NYTimes about Violin Memory data cards.
The Violin Memory data cards, produced in conjunction with Toshiba, offer 1.4 terabytes in “flash” memory, which can be accessed quickly. Cards for higher-end servers hold up to 11 terabytes.
Everything sounds great, right?
The low-end Violin Memory card has a list price of $4,000, and is intended for use in a server costing even less than that. A bigger card, for the kind of server that costs $5,000 or more, lists for about $60,000.
Cheap can mean so many things.
via: http://bits.blogs.nytimes.com/2013/03/04/big-data-done-cheap/?ref=business
Monday, 4 March 2013
How Many Hadoops?
The short answer is there is only one Apache Hadoop distribution.
The long answer is that there are many distributions that include Apache Hadoop or are claiming compatibility with Apache Hadoop.
- The oldest and probably most popular: Cloudera’s Distribution of Hadoop (CDH).
- The 100% open source: Hortonworks Data Platform.
- The proprietary: MapR.
- The blue one: IBM InfoSphere BigInsights.
- The latest: WANdisco Hadoop WDD, Intel Distribution of Hadoop and Pivotal HD from EMC Greenplum.
There’s also the version Facebook’s running on their cluster which includes Facebook Corona: a different approach to job scheduling and resource management.
But this list is not complete as it doesn’t include appliances featuring Hadoop. In this category we have:
- Oracle’s Big Data appliance featuring Cloudera’s Distribution of Hadoop
- NetApp’s Hadooplers
- EMC Greenplum DCA
- Teradata Aster Discovery Platform featuring Hortonworks’s Hadoop Data Platform
- Data Direct Networks (DDN)
I hope I didn’t miss any important ones [1]. As a conclusion for this list, my question is: who is actually benefiting from all these distributions?
[1] I have left Hadoop-as-a-Service aside for now.
The History of Hadoop
Nice article and interviews with important people in the history of Hadoop by Derrick Harris for GigaOm.
Depending on how one defines its birth, Hadoop is now 10 years old. In that decade, Hadoop has gone from being the hopeful answer to Yahoo’s search-engine woes to a general-purpose computing platform that’s poised to be the foundation for the next generation of data-based applications.
This is the first article of a four-part series that promises to explain everything about Hadoop.
via: http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/
Use Cases for Hadoop's New Pluggable Sort
What is the big deal about Sort? Sort is fundamental to the MapReduce framework, the data is sorted between the Map and Reduce phases (see below). Syncsort’s contribution allows native Hadoop sort to be replaced by an alternative sort implementation, for both Map and Reduce sides, i.e. it makes Sort phase pluggable.
Tendu Yogurtcu describes several new use cases opened up by the pluggable sort implementation that Syncsort contributed to Apache Hadoop:
- Optimized sort implementations and full joins
- Hash-based aggregations with no sort requirements
- Reducers that can start before all Mappers complete
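To make the idea of a pluggable sort phase concrete, here is a toy, in-memory sketch in Python. It is not Hadoop’s API (Hadoop itself is written in Java), and every name in it (`map_phase`, `default_sort`, `hash_aggregate`, `run_job`) is hypothetical. It only illustrates that the step between Map and Reduce can be swapped, for example for a hash-based aggregation that never sorts.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit (word, 1) pairs, as in the classic word-count job."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def default_sort(pairs):
    """Stand-in for a built-in sort step: order all pairs by key."""
    return sorted(pairs, key=itemgetter(0))


def hash_aggregate(pairs):
    """Hypothetical plug-in: aggregate by hashing, with no ordering step."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def reduce_phase(pairs):
    """Sum values per key; only adjacent pairs with equal keys are merged."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(value for _, value in group))


def run_job(lines, shuffle=default_sort):
    """Run map -> shuffle -> reduce, where shuffle is the pluggable step."""
    return dict(reduce_phase(shuffle(map_phase(lines))))


lines = ["to sort or not to sort"]
sorted_result = run_job(lines)
hashed_result = run_job(lines, shuffle=hash_aggregate)
```

Both runs produce the same word counts; the only difference is whether the intermediate pairs were sorted or grouped by hash before reaching the reducer.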
via: http://blog.syncsort.com/2013/02/hadoop-mapreduce-to-sort-or-not-to-sort/
A Brief Guide to Pig Latin for the SQL Guy
Cat Miller from Mortar Data offers a quick intro to Pig Latin from a SQLish perspective:
Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.
The similarities between Pig and SQL lie in the operations they both support, but the underlying model is different: Pig is an imperative data manipulation tool, while SQL is a declarative query language.
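A rough way to see the difference, sketched in Python rather than in Pig itself: the first function hands a single declarative SQL statement to SQLite (via the standard `sqlite3` module), while the second spells out the same computation as a sequence of named steps, the way a Pig Latin script would. The `orders` data, the column names, and both function names are made up for illustration; the Pig statements in the comments are approximate.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, country, amount)
ORDERS = [("alice", "us", 30), ("bob", "us", 20), ("carol", "de", 50)]


def declarative_totals(rows):
    """SQL style: describe the result and let the engine pick the steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer TEXT, country TEXT, amount INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT country, SUM(amount) FROM orders "
        "WHERE amount >= 25 GROUP BY country ORDER BY country"
    ).fetchall()
    conn.close()
    return result


def stepwise_totals(rows):
    """Pig style: each statement names an intermediate relation."""
    # Roughly: filtered = FILTER orders BY amount >= 25;
    filtered = [row for row in rows if row[2] >= 25]
    # Roughly: grouped = GROUP filtered BY country;
    grouped = defaultdict(list)
    for _customer, country, amount in filtered:
        grouped[country].append(amount)
    # Roughly: totals = FOREACH grouped GENERATE group, SUM(filtered.amount);
    totals = [(country, sum(amounts)) for country, amounts in grouped.items()]
    # Roughly: ordered = ORDER totals BY group;
    return sorted(totals)


sql_result = declarative_totals(ORDERS)
pig_style_result = stepwise_totals(ORDERS)
```

The two functions return the same rows; what changes is who decides the order of operations, the query engine or the script author.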
BigData Top100: A Big Data Benchmark
I assume this initiative is an attempt to create a TPC-H-like benchmark for data analysis. The website is up, but the main paper is behind a paywall so I haven’t had a chance to read it:
A new benchmarking initiative called BigData Top100 was announced at the O’Reilly Strata Conference 2013 in a joint presentation by Chaitan Baru (SDSC) and Milind Bhandarkar (Greenplum). Other members of the BigData Top100 List steering group, include Dhruba Borthakur (Facebook), Eyal Gutkind (Mellanox), Jian Li (IBM), Raghunath Nambiar (Cisco), Ken Osterberg (Seagate), Scott Pearson (Brocade), Meikel Poess (Oracle), Tilmann Rabl (University of Toronto), Richard Treadway (NetApp), and Jerry Zhao (Google).
There are some notable missing names from the list, but that doesn’t mean anything for now.
via: http://www.hadoopsphere.com/2013/03/let-benchmarking-contests-begin.html