A slide deck by Alexander Shraer providing a shorter version of the Dynamic Reconfiguration of Primary/Backup Clusters in Apache ZooKeeper paper.
There are still some scenarios in which the proposed algorithm will not work, but I cannot tell how often these occur in practice:
- A quorum of the new ensemble is not yet in sync
- Another reconfig is already in progress
- The version condition check fails
One of the most interesting slides in the deck is the one explaining the failure-free flow.
A paper by Alexander Shraer, Benjamin Reed, Flavio Junqueira (Yahoo) and Dahlia Malkhi (Microsoft):
Dynamically changing (reconfiguring) the membership of a replicated distributed system while preserving data consistency and system availability is a challenging problem. In this paper, we show that reconfiguration can be simplified by taking advantage of certain properties commonly provided by Primary/Backup systems. We describe a new reconfiguration protocol, recently implemented in Apache Zookeeper. It fully automates configuration changes and minimizes any interruption in service to clients while maintaining data consistency. By leveraging the properties already provided by Zookeeper our protocol is considerably simpler than state of the art.
The corresponding ZooKeeper issue was created in 2008, and the new protocol should be part of ZooKeeper 3.5.0.
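To get a feel for how this surfaces to clients, here is a minimal sketch of an incremental reconfiguration from Java, assuming the ZooKeeperAdmin#reconfigure call and the reconfig-specific error codes that the 3.5 line exposes; the host names, ports, and server ids are invented for illustration. The three scenarios listed above map onto the exceptions handled below.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.admin.ZooKeeperAdmin;
import org.apache.zookeeper.data.Stat;

public class ReconfigSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the current ensemble (addresses are made up).
        ZooKeeperAdmin admin =
            new ZooKeeperAdmin("zk1:2181,zk2:2181,zk3:2181", 30000, event -> { });

        Stat stat = new Stat();
        try {
            // Incremental change: add server 4, remove server 3.
            // Passing -1 as fromConfig disables the version precondition;
            // passing a concrete config version makes the request conditional.
            byte[] newConfig = admin.reconfigure(
                "server.4=zk4:2888:3888:participant;2181", // joining server
                "3",                                       // leaving server id
                null,                                      // no full membership list
                -1,                                        // fromConfig: no version check
                stat);
            System.out.println("New config: " + new String(newConfig));
        } catch (KeeperException.NewConfigNoQuorumException e) {
            // A quorum of the new ensemble is not yet in sync.
        } catch (KeeperException.ReconfigInProgressException e) {
            // Another reconfiguration is still being processed.
        } catch (KeeperException.BadVersionException e) {
            // The version condition check failed.
        } finally {
            admin.close();
        }
    }
}
```

Later 3.5.x releases also require reconfiguration to be explicitly enabled on the servers before such a request is accepted.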
Google’s paper about their large-scale distributed systems tracing solution Dapper, which inspired Twitter’s Zipkin:
Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie and X-Trace, but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.
Download or read the paper after the break.
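To make the sampling point concrete, here is a small conceptual sketch, not Dapper’s actual API and not Zipkin’s: a trace tree in which the sampling decision is made once at the root span and inherited by every child, so instrumented libraries pay almost nothing for unsampled requests. All names here are invented.

```java
import java.util.concurrent.ThreadLocalRandom;

// A conceptual span in the spirit of Dapper's trace trees: every span carries
// the trace id, its own id, and its parent's id; the sampling decision is made
// once at the root and propagated, so unsampled requests skip all recording.
final class Span {
    final long traceId;
    final long spanId;
    final Long parentId;   // null for the root span
    final boolean sampled;

    private Span(long traceId, long spanId, Long parentId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    // Root span: decide once whether this whole trace is sampled (e.g. 1 in 1024).
    static Span newRoot(double sampleRate) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        boolean sampled = rnd.nextDouble() < sampleRate;
        return new Span(rnd.nextLong(), rnd.nextLong(), null, sampled);
    }

    // Child span: inherits the trace id and the sampling decision.
    Span newChild() {
        return new Span(traceId, ThreadLocalRandom.current().nextLong(), spanId, sampled);
    }

    // Annotations are dropped immediately for unsampled traces, which is what
    // keeps the per-request overhead low.
    void annotate(String message) {
        if (sampled) {
            System.out.printf("trace=%d span=%d parent=%s: %s%n",
                    traceId, spanId, parentId, message);
        }
    }
}
```

Restricting that instrumentation to a small number of shared libraries is the other half of how Dapper stays transparent to application code.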
Jared Wray enumerates the following four rules for high availability:
- No single point of failure
- Self-healing is required
- It will go down, so plan on it
- It is going to cost more: […] The discussion instead should be what downtime is acceptable for the business.
I’m not sure there’s a very specific definition of high availability, but the always correct Wikipedia says:
High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.
This got me thinking: is self-healing actually a requirement? Could I translate this into asking whether it is possible to control the MTTF? Control in the sense of planning operations that would push the MTTF into a range that is not considered to break the SLA.
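As a rough back-of-the-envelope check (my numbers, not from either article), steady-state availability is usually written as MTTF / (MTTF + MTTR), so a given SLA can be met either by stretching MTTF or by shrinking MTTR; self-healing is one way to shrink MTTR without waiting for an operator:

```java
public class AvailabilitySketch {
    public static void main(String[] args) {
        // Hypothetical numbers, for illustration only.
        double mttfHours = 1000.0; // mean time to failure
        double mttrHours = 1.0;    // mean time to repair

        // Steady-state availability: the fraction of time the system is up.
        double availability = mttfHours / (mttfHours + mttrHours);

        // Expected downtime over a year of continuous operation.
        double downtimeHoursPerYear = (1 - availability) * 24 * 365;

        System.out.printf("availability = %.4f%%, ~%.1f hours of downtime per year%n",
                availability * 100, downtimeHoursPerYear);
        // With MTTF 1000 h and MTTR 1 h this comes out to ~99.9%,
        // i.e. roughly 8.8 hours of downtime a year.
    }
}
```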
Jim Gray and Daniel P. Siewiorek wrote in their “High Availability Computer Systems”:
The key concepts and techniques used to build high availability computer systems are (1) modularity, (2) fail-fast modules, (3) independent failure modes, (4) redundancy, and (5) repair. These ideas apply to hardware, to design, and to software. They also apply to tolerating operations faults and environmental faults.
Notice the lack of the “self” part. So is self-healing a requirement of highly available systems?
Original title and link: Four Golden Rules of High Availability. Is Self-Healing a Requirement of Highly Available Systems? (©myNoSQL)