CAP: All content tagged as CAP in NoSQL databases and polyglot persistence
Just to expand on this, the “C” in CAP corresponds (roughly) to the “A” and “I” in ACID. Atomicity across multiple nodes requires consensus. According to FLP Impossibility Result (CAP is a very elegant and intuitive re-statement of FLP), consensus is impossible in a network that may drop or deliver packets. Serializable isolation level requires that operations are totally ordered: total ordering on multiple nodes, requires solving the “atomic multicast” problem which is a private instance of the general consensus problem.
In practice, you can achieve consensus across multiple nodes with a reasonable amount of fault tolerance if you are willing to accept high (as in, hundreds of milliseconds) latency bounds. That’s a loss of availability that’s not acceptable to many applications.
This means, that you can’t build a low-latency multi-master system that achieves the “A” and “I” guarantees. Thus, distributed systems that wish to achieve a greater form of consistency typically (Megastore from Google being a notable exception, at the cost of 140ms latency) choose master slave systems (with “floating masters” for fault tolerance). In these systems availability is lost for a short period of time in case the master fails. BigTable (or HBase) is an example of this: (grand simplification follows) when a tablet master (RegionServer in HBase) for a specific token range fails, availability is lost until other nodes take over the “master-less” token range.
These are not binary “on/off” switches: see Yahoo’s PNUTS for a great “middle of the road” system. The paper has an intuitive example explaining the various consistency models.
Note: in a partitioned system, the scope of consistency guarantees (that is, any consistency guarantees: eventual or not) is typically limited to (at best) a single partition of a “table group”/”entity group” (in Microsoft Azure Cloud SQL Server and Google Megastore, respectively), a single partition of a table (usual sharded MySQL setups) or just a single row in a table (BigTable) or document in a document oriented store. Atomic and isolated cross row transactions are impractical on commodity hardware (and are limited even in systems that mandate the use of infiband interconnect and high-performance SSDs).
The problem with relational databases is that they conflate the notions of data and views
Alex: @nathanmarz That’s reason for confusion between C in ACID and C in CAP: C in ACID means consistent view of data which can be done w/ quorums
Sergio: @strlen That’s a common misconception: ACID C just means your write operations do not break data constraints. It’s not about the view.
Alex: @sbtourist It also refers to not allowing reads of intermediate states i.e., serializability. W/o a quorum, an EC system could allow such.
Alex: @sbtourist On the other hand, an async system where node B is behind node A is still C in the ACID sense without being C in the CAP sense.
Sergio: @strlen Nope, that’s the isolation level (ACID I). Again, ACID C has a precise meaning and it’s about constraints.
Alex: @sbtourist Yeah, I think you are right: serializability would be “I”, with consensus (strongest form of CAP “C”) being about “A” (atomicity)
Sergio: @strlen That said, I strongly agree with you about ACID C being different than CAP C.
Alex: @sbtourist Yes. Both “consistent” and “atomic” mean diff things in DBs than they do elsewhere in systems (e.g., way that “ln -s” is atomic)
There have been many discussions about the loose definitions of the terms in the CAP theorem. Daniel Abadi exposed an interesting perspective on the subject proposing instead PACELC:
To me, CAP should really be PACELC – if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?
In the light of publicly announcing customers, I wanted to read a bit about Clustrix Clustered Database Systems.
The company homepage is describing the product:
- scalable database appliaces for Internet-scale work loads
- Linearly Scalable: fully distributed, parallel architecture provides unlimited scale
- SQL functionality: full SQL relational and data consistency (ACID) functionality
- Fault-Tolerant: highly available providing fail-over, recovery, and self-healing
- MySQL Compatible: seamless deployment without application changes.
All these sounded pretty (
too) good. And I’ve seen a very similar presentation for Xeround: Elastic, Always-on Storage Engine for MySQL.
So, I’ve continued my reading with the Sierra Clustered Database Engine whitepaper (PDF).
Here are my notes:
- Sierra is composed of:
- database personality module: translates queries into internal representation
- distributed query planner and compiler
- distributed shared-nothing execution engine
- persistent storage
- NVRAM transactional storage for journal changes
- inter-node Infiniband
- queries are decomposed into query fragments which are the unit of work. Query fragments are sent for execution to nodes containing the data.
- query fragments are atomic operations that can:
- insert, read, update data
- execute functions and modify control flow
- perform synchronization
- send data to other nodes
- format output
- query fragments can be executed in parallel
- query fragments can be cached with parameterized constants at the node level
- determining where to sent the query fragments for execution is done using either range-based rules or hash function
- tables are partitioned into slices, each slice having redundancy replicas
- size of slices can be automatically determined or configured
- adding new nodes to the cluster results in rebalancing slices
- slices contained on a failed device are reconstructed using their replicas
- one of the slices is considered primary
- writes go to all replicas and are transactional
- all reads fo the the slice primary
The paper also exemplifies the execution of 4 different queries:
SELECT * FROM T1 WHERE uid=10 SELECT uid, name FROM T1 JOIN T2 on T1.gid = T2.gid WHERE uid=10 SELECT * FROM T1 WHERE uid<100 and gid>10 ORDER BY uid LIMIT 5 INSERT INTO T1 VALUES (10,20)
- who is coordinating transactions that may be executed on different nodes?
- who is maintains the topology of the slices? In case of a node failure, you’d need to determine:
- what slices where on the failing node
- where are the replicas for each of these slices
- where new replicas will be created
- when will new replicas become available for writes
- who elects the slice primary?