Hadoop and HBase Status Updates after Hadoop Summit
As you can expect after such a large summit, there are tons of updates coming in.
For now I’ve selected two, but if you find others just as interesting, please share them with us.
James Hamilton, drawing on a colleague’s report:
Key Takeaways
- Yahoo and Facebook operate the world’s largest Hadoop clusters, with 4,000 and 2,300 nodes holding 70 and 40 petabytes respectively. They run full cluster replicas to assure availability and data durability.
- Yahoo released Hadoop security features with Kerberos integration, which is most useful for long-running multitenant Hadoop clusters.
- Cloudera released a paid enterprise version of Hadoop with cluster management tools and several DB connectors, and announced support for Hadoop security.
- Amazon Elastic MapReduce announced expand/shrink cluster functionality and paid support.
- Many Hadoop users use the service in conjunction with NoSQL DBs like HBase or Cassandra.
Tim Sells has an extensive report on HBase status:
The next version will be 0.90. It will be a reliability release, but it also includes performance gains. The version numbering will break from Hadoop’s version numbers. 0.90 was chosen because there’s a belief that HBase is maturing towards a 1.0 release.
The main points I picked up are:
- New batch importing allows writing HFiles directly and then just telling HBase where they are.
- Taking advantage of appends in HDFS for genuine durability.
- The NameNode single point of failure is being addressed; Facebook is planning to release their HA NameNode.
- Replication between clusters, which allows cross-data-center replication. It is eventually consistent.
- Tighter integration with ZooKeeper through a master rewrite.
- Significant work to have less temperamental behaviour during compaction and splits.
- Facebook are planning to release their distribution of Hadoop and their highly available NameNode.