MapReduce: All content tagged as MapReduce in NoSQL databases and polyglot persistence
Monday, 11 March 2013
Hadoop Distributions: If There Is a Problem in the Hadoop JARs, How Are You Going to Fix It?
A long post by Steve Loughran about the implications of forking Hadoop and the different evolution paths. There’s no clear conclusion, except the advice to include the following question in discussions with the various vendors:
“if there is a problem in the Hadoop JARs — how are you going to fix it?”
Original title and link: Hadoop Distributions: If There Is a Problem in the Hadoop JARs, How Are You Going to Fix It? (©myNoSQL)
via: http://steveloughran.blogspot.co.uk/2013/03/apache-hadoop-yes-but-how-are-you-going.html
Thursday, 7 March 2013
Separating Open Source Signal From Enterprise Hadoop Noise
Shaun Connolly for Hortonworks’s blog:
We also believe that any company that thinks they are “all in” on making open source Apache Hadoop into an enterprise-viable platform needs to have key committers working on the open source technologies (Hortonworks has 50+ committers) or partner with a company like Hortonworks who is focused on working with the ecosystem on ensuring Hadoop integrates and interoperates well with existing enterprise systems and tools.
via: http://hortonworks.com/blog/separating-open-source-signal-from-enterprise-hadoop-noise/
Tuesday, 5 March 2013
The Hadoop Ecosystem Infographic
A very timely infographic from GigaOm about the Hadoop ecosystem, augmenting my earlier “How Many Hadoops?” post:

Monday, 4 March 2013
How Many Hadoops?
The short answer is there is only one Apache Hadoop distribution.
The long answer is that there are many distributions that include Apache Hadoop or are claiming compatibility with Apache Hadoop.
- The oldest and probably most popular: Cloudera’s Distribution of Hadoop (CDH).
- The 100% open source: Hortonworks Data Platform.
- The proprietary: MapR.
- The blue one: IBM InfoSphere BigInsights.
- The latest: WANdisco Hadoop WDD, Intel Distribution of Hadoop, and Pivotal HD from EMC Greenplum.
There’s also the version Facebook is running on its clusters, which includes Facebook Corona, a different approach to job scheduling and resource management.
But this list is not complete as it doesn’t include appliances featuring Hadoop. In this category we have:
- Oracle’s Big Data Appliance featuring Cloudera’s Distribution of Hadoop
- NetApp’s Hadooplers
- EMC Greenplum DCA
- Teradata Aster Discovery Platform featuring Hortonworks Data Platform
- Data Direct Networks (DDN)
I hope I didn’t miss any important ones [1]. To conclude this list, my question is: who actually benefits from all these distributions?
[1] I left Hadoop-as-a-Service aside for now.
The History of Hadoop
Nice article and interviews with important people in the history of Hadoop by Derrick Harris for GigaOm.
Depending on how one defines its birth, Hadoop is now 10 years old. In that decade, Hadoop has gone from being the hopeful answer to Yahoo’s search-engine woes to a general-purpose computing platform that’s poised to be the foundation for the next generation of data-based applications.
This is the first article of a four-part series that promises to explain everything about Hadoop.
via: http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/
Use Cases for Hadoop's New Pluggable Sort
What is the big deal about Sort? Sort is fundamental to the MapReduce framework, the data is sorted between the Map and Reduce phases (see below). Syncsort’s contribution allows native Hadoop sort to be replaced by an alternative sort implementation, for both Map and Reduce sides, i.e. it makes Sort phase pluggable.
Tendu Yogurtcu describes a few new use cases opened up by the pluggable sort implementation that Syncsort contributed to Apache Hadoop:
- Optimized sort implementations and full joins
- Hash-based aggregations with no sort requirements
- Reducers that can start before all Mappers complete
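To make the idea of a pluggable sort phase more concrete, here is a minimal in-memory sketch in Python. It is not Hadoop's API and not Syncsort's implementation: the function names (`map_reduce`, `sort_shuffle`, `hash_shuffle`) and the word-count job are hypothetical, chosen only to show how the step between Map and Reduce could be swapped, for example for a hash-based grouping that never sorts.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_shuffle(pairs):
    """Default-style shuffle: sort by key, then group adjacent equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def hash_shuffle(pairs):
    """Alternative shuffle: group by key in a hash table, with no sorting."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def map_reduce(records, mapper, reducer, shuffle=sort_shuffle):
    """Toy MapReduce driver; `shuffle` is the pluggable step (hypothetical)."""
    pairs = [pair for record in records for pair in mapper(record)]
    return {key: reducer(key, values) for key, values in shuffle(pairs)}


def word_mapper(line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def count_reducer(key, values):
    """Sum the counts collected for one key."""
    return sum(values)


lines = ["pig hive pig", "hive hadoop pig"]
sorted_counts = map_reduce(lines, word_mapper, count_reducer)
hashed_counts = map_reduce(lines, word_mapper, count_reducer,
                           shuffle=hash_shuffle)
```

Both runs produce the same counts; only the order in which keys reach the reducer differs, which is the trade-off behind the hash-based aggregation use case.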
via: http://blog.syncsort.com/2013/02/hadoop-mapreduce-to-sort-or-not-to-sort/
A Brief Guide to Pig Latin for the SQL Guy
Cat Miller from Mortar Data offers a quick intro to Pig Latin from a SQLish perspective:
Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.
Pig and SQL are similar in the operations they support, but the underlying model is different: Pig is an imperative, step-by-step data manipulation tool, while SQL is a declarative query language.
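The difference in models can be illustrated with a small Python sketch. It uses neither Pig nor the guide's examples: the `users` table, its rows, and the query are made up for illustration. The first half states the result declaratively in SQL (run through the standard-library `sqlite3` module); the second half builds the same result as a sequence of named steps, roughly the way a Pig script chains filter, group, and per-group operations.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (name, country, age)
rows = [("alice", "us", 30), ("bob", "uk", 25),
        ("carol", "us", 41), ("dave", "uk", 17)]

# Declarative style: describe the result, let the engine plan the steps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, country TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
sql_result = conn.execute(
    "SELECT country, COUNT(*) FROM users "
    "WHERE age >= 18 GROUP BY country ORDER BY country"
).fetchall()
conn.close()

# Step-by-step style: each intermediate relation gets a name.
adults = [row for row in rows if row[2] >= 18]           # filter
by_country = defaultdict(list)                           # group
for row in adults:
    by_country[row[1]].append(row)
counts = [(country, len(group))                          # per-group count
          for country, group in by_country.items()]
stepwise_result = sorted(counts)                         # order
```

Both approaches end with the same list of `(country, count)` pairs; what changes is whether the author writes the plan or the engine derives it.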
Thursday, 28 February 2013
Intel Distribution of H* in 21 Links
I don’t think anyone besides the PR department at Intel had the time to read through all the media coverage Intel Distribution of H* got in the last couple of days. Here’s a collection of links for your reference. Pick wisely.
Intel Announcements
Media Coverage
- NYTimes Bits: Intel’s Big Data Push
- Wired: Intel Leaps on Software Elephant for Trip to Hardware Heaven
- ZDNet: Intel baking Apache Hadoop into silicon for big data, security uses
- The Register: Intel takes on all Hadoop disties to rule big data munching
- Forbes: Intel Drops a Big Data Shocker
- GigaOm: Cloudera who? Intel announces its own Hadoop distribution
- SiliconANGLE: Intel Gets Inside Big Data Chips With Hadoop
- InformationWeek: Intel Unveils New Distribution For Apache Hadoop
- Computerworld: Intel releases Hadoop software primed for its own chips
- PCMag: Intel Tackles Big Data With Release of Apache Hadoop Platform (http://www.pcmag.com/article2/0,2817,2415931,00.asp)
- DataInformed: Intel Jumps into Big Data Pool with Hadoop Distribution
- Slashdot: Intel’s New Hadoop Distribution Could Benefit Its Hardware Bottom Line
- VentureBeat: Intel moves into ‘big data’ software with Apache Hadoop distribution
- DatacenterKnowledge: Intel Enters the Hadoop Software Market
- Datacenter Dynamics: Intel launches own Hadoop distribution
Intel Distribution Partners
If, like me, you’re interested in archiving these, I’ve put this list together in a format that is easier to read and archive.
Wednesday, 27 February 2013
Some Interesting Facts, Sorry FUD About Hadoop MapReduce
If you feel like reading a bit of bullet-point-style FUD about Hadoop, check Dr. David F. Rico’s PDF.
Tuesday, 26 February 2013
The History of Hadoop Changed the World
Over the next few years, Hadoop reinvented data analysis not only at Facebook and Yahoo but so many other web services. And then an army of commercial software vendors started selling the thing to the rest of the world. Soon, even the likes of Oracle and Greenplum were hawking Hadoop. These companies still treated Hadoop as an adjunct to the traditional database — as a tool suited only to certain types of data analysis. But now, that’s changing too.
I found the fragment above, which fully describes the impact Hadoop has had on the data world, in Cade Metz’s article for Wired about Greenplum’s Pivotal HD announcement: “Why Hadoop Is the Future of the Database.”
Apache Pig Goes 0.11
Almost lost among the tons of Hadoopy releases was the announcement of Apache Pig 0.11, which, like any serious open source project, packs nice new features into a point release:
- DateTime data type
- RANK, CUBE, ROLLUP operators
- Groovy UDFs
Plus tons of improvements.
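For readers unfamiliar with CUBE and ROLLUP, the following Python sketch models their general semantics as known from SQL-style grouping sets: CUBE aggregates over every subset of the chosen dimensions, ROLLUP only over hierarchical prefixes. It is an assumption-laden illustration, not Pig code; the helper names, the sample rows, and the use of `None` to mark an aggregated-away dimension are all hypothetical, and Pig's own syntax and null handling are not shown.

```python
from itertools import combinations


def cube_sets(dims):
    """All subsets of dims, largest first (2**n grouping sets)."""
    return [subset
            for size in range(len(dims), -1, -1)
            for subset in combinations(dims, size)]


def rollup_sets(dims):
    """Hierarchical prefixes of dims, longest first (n + 1 grouping sets)."""
    return [tuple(dims[:size]) for size in range(len(dims), -1, -1)]


def aggregate(rows, dims, grouping_sets, measure):
    """Sum `measure` per grouping set; None marks a dimension summed over."""
    totals = {}
    for grouping_set in grouping_sets:
        for row in rows:
            key = tuple(row[d] if d in grouping_set else None for d in dims)
            totals[key] = totals.get(key, 0) + row[measure]
    return totals


# Hypothetical sample data.
rows = [
    {"region": "eu", "product": "pig", "sales": 10},
    {"region": "eu", "product": "hive", "sales": 5},
    {"region": "us", "product": "pig", "sales": 7},
]
dims = ("region", "product")

cube = aggregate(rows, dims, cube_sets(dims), "sales")
rollup = aggregate(rows, dims, rollup_sets(dims), "sales")
```

Here `cube` contains per-product totals across all regions (for example the key `(None, "pig")`), while `rollup` stops at per-region subtotals and the grand total.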
via: https://blogs.apache.org/pig/entry/apache_pig_it_goes_to
Spring for Apache Hadoop 1.0 Goes GA: Wrapping Hadoop in XML
Costin Leau announcing the GA of Spring for Apache Hadoop:
What we have observed is that using the standard out of the box tools that come with Hadoop, you can easily end up with Hadoop applications that are poorly structured collection of command line utilities, scripts and pieces of code stitched together.
Leaving aside the jokes and the fact that I don’t fully understand the purpose of this project (and here and here), congrats on the release!
via: http://blog.springsource.org/2013/02/26/shdp-1-0-goes-ga/
