Distributed System Reliability: It's About Operations, Not Architecture or Design

Jay Kreps[1]:

I have come around to the view that the real core difficulty of these systems is operations, not architecture or design. Both are important but good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations. […] I really think there is really only one thing to talk about with respect to reliability: continuous hours of successful production operations.

  1. Jay Kreps works at LinkedIn, where he is the technical lead for the SNA team, which handles search, the social graph, data infrastructure, and recommendation systems.
