Four Golden Rules of High Availability. Is Self-Healing a Requirement of Highly Available Systems?

Jared Wray enumerates the following four rules for high availability:

  • No Single Point of failure
  • Self-healing is Required
  • It will go down so plan on it
  • It is going to cost more: […] The discussion instead should be what downtime is acceptable for the business.

I’m not sure there’s a very specific definition of high availability, but the always-correct Wikipedia says:

High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.

This got me wondering whether self-healing is actually a requirement. Could I translate this into asking: is it possible to control the MTTF? Control, that is, in the sense of planning operations that would push the MTTF into a range that is not considered to break the SLA.
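To make the question a bit more concrete, here is a rough sketch using the usual steady-state relation availability = MTTF / (MTTF + MTTR). The function names and all the numbers (a 1000-hour MTTF, a 4-hour manual repair, a 3-minute automated repair, a 99.9% target) are assumptions made up for illustration, not figures from any of the sources quoted here.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def max_mttr_for_target(mttf_hours: float, target: float) -> float:
    """Largest mean time to repair that still meets the target availability."""
    return mttf_hours * (1.0 - target) / target


# Hypothetical numbers, chosen only to illustrate the trade-off.
MTTF_HOURS = 1000.0
TARGET = 0.999  # an assumed SLA of "three nines"

manual_repair = availability(MTTF_HOURS, mttr_hours=4.0)      # a human is paged
automated_repair = availability(MTTF_HOURS, mttr_hours=0.05)  # about 3 minutes
budget = max_mttr_for_target(MTTF_HOURS, TARGET)

print(f"manual repair:    {manual_repair:.5f}")
print(f"automated repair: {automated_repair:.5f}")
print(f"repair budget for {TARGET:.1%}: {budget:.2f} hours")
```

Under these assumed numbers the SLA only fixes the ratio between the two quantities: a longer MTTF buys a larger repair budget, and a shorter repair time tolerates a shorter MTTF. Whether the repair has to be automatic then depends on whether people can reliably stay inside that budget.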

Jim Gray and Daniel P. Siewiorek wrote in their “High Availability Computer Systems”:

The key concepts and techniques used to build high availability computer systems are (1) modularity, (2) fail-fast modules, (3) independent failure modes, (4) redundancy, and (5) repair. These ideas apply to hardware, to design, and to software. They also apply to tolerating operations faults and environmental faults.

Notice the lack of the “self” part. So is self-healing a requirement of highly available systems?

