Joydeep Sen Sarma looks at the possibility of separating the I/O paths for real-time requests and analytic queries while continuing to use the same HBase storage:
- Applications that require up-to-date data go through the RegionServer API (a RegionServer is a tablet server in BigTable parlance)
- Applications that do not need the very latest updates read compacted files directly from HDFS
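The split between the two paths can be sketched as a simple routing decision. This is only an illustration under stated assumptions: `ReadPath`, `ReadRequest`, `choose_read_path`, and the `compaction_lag` parameter are hypothetical names invented here, not part of HBase, HDFS, or the original article.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class ReadPath(Enum):
    """Hypothetical labels for the two I/O paths described above."""
    REGION_SERVER = "region_server"    # online path: sees the latest writes
    HDFS_COMPACTED = "hdfs_compacted"  # batch path: reads compacted files only


@dataclass(frozen=True)
class ReadRequest:
    """Hypothetical request descriptor; not an HBase type."""
    table: str
    max_staleness: timedelta  # how old the data may be; zero means "latest"


def choose_read_path(request: ReadRequest, compaction_lag: timedelta) -> ReadPath:
    """Route a read to the batch path only if compacted files are fresh enough.

    compaction_lag is an assumed input: the age of the newest data that is
    guaranteed to be present in the compacted files.
    """
    if request.max_staleness >= compaction_lag:
        return ReadPath.HDFS_COMPACTED
    return ReadPath.REGION_SERVER


# A reporting job that tolerates day-old data vs. an online lookup.
report = ReadRequest(table="events", max_staleness=timedelta(days=1))
lookup = ReadRequest(table="events", max_staleness=timedelta(0))
lag = timedelta(hours=6)

print(choose_read_path(report, lag).value)
print(choose_read_path(lookup, lag).value)
```

With a six-hour lag, the reporting job is sent to the compacted files and the online lookup stays on the RegionServer path. How the lag would actually be measured is left open here.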
As regards the advantages of such an approach, he writes:
Many analytical and reporting applications are perfectly fine with data that’s about a day old or so. Issuing large IO against HDFS will always be more performant than streaming data out of region servers. And it allows one to size the HBase clusters based on the (more steady) real-time traffic - not based on large bursty batch jobs (alternately put - partially isolates the online traffic from resource contention by batch jobs).
As a commenter pointed out, the article doesn’t cover a possible third scenario, stream analysis, and I’m not sure there is any Hadoop/HBase solution for that. For the cases it does cover, the solution sounds interesting, as long as you can avoid data collisions and make sure that it doesn’t lead to inconsistencies in the analyzed data.