Another great diagram explaining the complicated tree of Hadoop versions.
Credit: Konstantin I. Boudnik & Cos
Compared with the other diagram of Apache Hadoop versions, this one contains some very interesting details about the versions of Hadoop used by third-party distributions from vendors such as EMC, IBM, MapR, and even Azure:
The diagram above clearly shows a few important gaps in the other commercial offerings:
- none of them (EMC, IBM, and MapR) supports Kerberos security
- HBase is unavailable because their systems lack HDFS append (EMC, IBM). With MapR, you end up using a custom HBase build distributed by MapR. I don’t want to speculate about the latter in this article.
If I were in a position to choose which version of Hadoop to use for a project, here is where I’d start:
- if the project had a budget for prototyping and experimentation, my first choice would be the latest official Apache distribution. This would give access to the latest and greatest features (not always bug free), but more importantly it would let the team tap into the Hadoop community’s know-how
- if the project required getting up to speed as fast as possible (and I could get some budget for training), I’d start my investigation with Cloudera Distribution of Hadoop. Even with no budget for Cloudera support, the advantage would be having everything well packaged together.
Original title and link: Hadoop Versions Take 2: What You Wanted to Know About Hadoop, but Were Too Afraid to Ask: Genealogy of Elephants (©myNoSQL)