A fantastic post by Nicolas Liochon and Devaraj Das looking into possible HBase failure scenarios and configurations to reduce the Mean Time to Recover:
There are no global failures in HBase: if a region server fails, all the
other regions are still available. For a given data subset, the MTTR was
often considered to be around ten minutes. This rule of thumb actually
came from a common case where the recovery took a long time because it
kept trying to use replicas on a dead datanode. Ten minutes is the time
taken by HDFS to declare a node dead. With the new stale mode in HDFS,
that's no longer the case, and the recovery is now bounded by HBase alone.
If you care about MTTR, with the settings mentioned here, most cases will
take less than 2 minutes between the actual failure and the data being
available again in another region server.
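The stale mode mentioned above is an HDFS-side setting: instead of waiting the full dead-node timeout, the namenode marks a datanode "stale" after a short heartbeat gap and steers reads and writes away from it. A sketch of the relevant hdfs-site.xml properties follows; the interval value is illustrative, not a recommendation from the post:

```xml
<!-- hdfs-site.xml: enable stale-datanode handling so clients avoid
     datanodes that have recently missed heartbeats, rather than
     waiting the ~10-minute dead-node declaration. -->
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <!-- mark a datanode stale after 30s without a heartbeat (illustrative) -->
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>
</property>
```

With these set, a region server recovery no longer blocks on HDFS read attempts against replicas hosted on a failed node, which is what makes the sub-two-minute figure achievable.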
Stepping back a bit, it looks like the overall complexity comes from the various components involved in HBase (ZooKeeper, HBase, HDFS), each with its own failure detection mechanism. If they are not correctly configured and ordered, things can get pretty ugly; ugly as in a longer MTTR than one would expect.
Original title and link: Introduction to HBase Mean Time to Recover (MTTR) - HBase Resiliency ( ©myNoSQL)