HBase and Data Locality
When Lars George (@larsgeorge) posts about HBase, you know it will be a hardcore technical article. He did it again this time, providing a lot of detail about how Hadoop and HBase deal with data locality (i.e. the property of a system to place data close to where it is needed).
The article looks at both HBase data access scenarios, direct random access and MapReduce scans of tables, and shows how HDFS[1] (the underlying file storage used by HBase) is smart enough to deal with data locality:
The most important factor is that HBase is not restarted frequently and that it performs house keeping on a regular basis. These so called compactions rewrite files as new data is added over time. All files in HDFS once written are immutable (for all sorts of reasons). Because of that, data is written into new files and as their number grows HBase compacts them into another set of new, consolidated files. And here is the kicker: HDFS is smart enough to put the data where it is needed!
[…]
So this means for HBase that as the region server stays up for long enough (which is the default) that after a major compaction on all tables - which can be invoked manually or is triggered by a configuration setting - it has the files local on the same host. The data node that shares the same physical host has a copy of all data the region server requires. If you are running a scan or get or any other use-case you can be sure to get the best performance.
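To make the quoted reasoning concrete, here is a minimal Python sketch of the idea. It is a toy model, not HBase or HDFS code: the function names (`place_replicas`, `locality`, `major_compact`), the host names, and the replication factor of 3 are all assumptions made for illustration. The only behavior it models is that the first replica of a newly written block lands on the writer's own host, which is what lets a major compaction pull a region's data back to the region server. Rack awareness and the fact that a real compaction may drop deleted cells are ignored.

```python
import random

REPLICATION = 3  # assumed replication factor for this sketch


def place_replicas(writer_host, hosts, rng):
    """Toy placement: first replica on the writer's host, the rest elsewhere."""
    others = [h for h in hosts if h != writer_host]
    return [writer_host] + rng.sample(others, REPLICATION - 1)


def locality(region_host, files):
    """Fraction of blocks that have a replica on region_host.

    `files` is a list of files; each file is a list of blocks;
    each block is the list of hosts holding a replica of it.
    """
    blocks = [replicas for f in files for replicas in f]
    if not blocks:
        return 1.0
    local = sum(1 for replicas in blocks if region_host in replicas)
    return local / len(blocks)


def major_compact(region_host, files, hosts, rng):
    """Rewrite all files into one new file, written by region_host.

    Simplification: the block count is preserved.
    """
    n_blocks = sum(len(f) for f in files)
    return [[place_replicas(region_host, hosts, rng) for _ in range(n_blocks)]]


if __name__ == "__main__":
    hosts = ["host1", "host2", "host3", "host4", "host5", "host6"]
    rng = random.Random(0)

    # Store files written while the region lived on host1; the region has
    # since moved to host4, which holds a replica of only one block.
    old_files = [
        [["host1", "host2", "host3"], ["host1", "host3", "host5"]],
        [["host1", "host4", "host6"]],
    ]
    print("before compaction:", locality("host4", old_files))

    new_files = major_compact("host4", old_files, hosts, rng)
    print("after compaction: ", locality("host4", new_files))
```

In this model, every block of the compacted file has a replica on the region server's host, so reads after the compaction never need to leave the machine. Before the compaction, only the blocks that happened to have a replica on that host are local.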
References
- [1] The article links to a very detailed document on the ☞ HDFS architecture and to an article explaining ☞ HBase storage.
via: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html