


CAP: All content tagged as CAP in NoSQL databases and polyglot persistence

MongoDB Journaling and Replication Interaction

How do we know our data won’t be rolled back? The answer is that a write is truly committed in a replica set when it has been written at a majority of set members. We can confirm this with the getLastError command. For example, if our write has made it to the journal on two out of three total set members, we know the data is committed even if nodes fail in a cascading sequence, and even if a minority of nodes are permanently lost.

Journaling was added in MongoDB 1.8 for crash safety and recovery. But the way I read this post about how MongoDB journaling and replication interact makes me think that MongoDB data is not safe unless you always use getLastError. And this approach both decreases MongoDB's speed and might lead to unavailability scenarios.
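The majority-commit rule quoted above can be captured in a toy model (this is an illustration of the rule, not MongoDB's actual API; the function name is made up):

```python
# Toy model of majority acknowledgement in a replica set:
# a write is considered committed once a majority of members
# have journaled it, so no minority of failed nodes can roll it back.

def is_committed(acks: int, set_size: int) -> bool:
    """True once the write has been journaled on a majority of members."""
    return acks > set_size // 2

# 3-member replica set: 2 acknowledgements are enough.
print(is_committed(2, 3))  # True
print(is_committed(1, 3))  # False
```

In MongoDB itself, this is the check that getLastError with a majority write concern performs before returning to the client.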

Original title and link: MongoDB Journaling and Replication Interaction (NoSQL databases © myNoSQL)


C in CAP != C in ACID

Alex Feinberg explains it again:

Just to expand on this, the “C” in CAP corresponds (roughly) to the “A” and “I” in ACID. Atomicity across multiple nodes requires consensus. According to the FLP impossibility result (CAP is a very elegant and intuitive re-statement of FLP), consensus is impossible in a network that may drop or delay packets. The serializable isolation level requires that operations are totally ordered: total ordering across multiple nodes requires solving the “atomic multicast” problem, which is a special case of the general consensus problem.

In practice, you can achieve consensus across multiple nodes with a reasonable amount of fault tolerance if you are willing to accept high (as in, hundreds of milliseconds) latency bounds. That’s a loss of availability that’s not acceptable to many applications.

This means that you can’t build a low-latency multi-master system that achieves the “A” and “I” guarantees. Thus, distributed systems that wish to achieve a stronger form of consistency typically choose master-slave systems (with “floating masters” for fault tolerance), Megastore from Google being a notable exception at the cost of 140ms latency. In these systems availability is lost for a short period of time when the master fails. BigTable (or HBase) is an example of this: (grand simplification follows) when the tablet server (RegionServer in HBase) for a specific token range fails, availability is lost until other nodes take over the “master-less” token range.

These are not binary “on/off” switches: see Yahoo’s PNUTS for a great “middle of the road” system. The paper has an intuitive example explaining the various consistency models.

Note: in a partitioned system, the scope of consistency guarantees (that is, any consistency guarantees: eventual or not) is typically limited to (at best) a single partition of a “table group”/“entity group” (in Microsoft Azure Cloud SQL Server and Google Megastore, respectively), a single partition of a table (usual sharded MySQL setups), a single row in a table (BigTable), or a single document in a document-oriented store. Atomic and isolated cross-row transactions are impractical on commodity hardware (and are limited even in systems that mandate the use of InfiniBand interconnects and high-performance SSDs).

Alex and Sergio Bossa have previously had an interesting conversation on the topic of consistency from the ACID and CAP perspectives.

Original title and link: C in CAP != C in ACID (NoSQL databases © myNoSQL)

Consistency in the ACID and CAP Perspectives

Following a tweet from Nathan Marz:

The problem with relational databases is that they conflate the notions of data and views

Sergio Bossa and Alex Feinberg had a very interesting exchange about the meaning of consistency in the context of ACID and consistency in CAP theorem perspective.

Alex: @nathanmarz That’s reason for confusion between C in ACID and C in CAP: C in ACID means consistent view of data which can be done w/ quorums

Sergio: @strlen That’s a common misconception: ACID C just means your write operations do not break data constraints. It’s not about the view.

Alex: @sbtourist It also refers to not allowing reads of intermediate states i.e., serializability. W/o a quorum, an EC system could allow such.

Alex: @sbtourist On the other hand, an async system where node B is behind node A is still C in the ACID sense without being C in the CAP sense.

Sergio: @strlen Nope, that’s the isolation level (ACID I). Again, ACID C has a precise meaning and it’s about constraints.

Alex: @sbtourist Yeah, I think you are right: serializability would be “I”, with consensus (strongest form of CAP “C”) being about “A” (atomicity)

Sergio: @strlen That said, I strongly agree with you about ACID C being different than CAP C.

Alex: @sbtourist Yes. Both “consistent” and “atomic” mean diff things in DBs than they do elsewhere in systems (e.g., way that “ln -s” is atomic)
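Alex’s point that a “consistent view of data … can be done w/ quorums” reduces to the standard R + W > N overlap condition; a generic sketch, not tied to any particular store:

```python
# Quorum intersection: if the read quorum R plus the write quorum W
# exceed the replica count N, every read quorum overlaps every write
# quorum, so a reader always contacts at least one replica that has
# the latest acknowledged write.

def quorums_intersect(n: int, r: int, w: int) -> bool:
    """True when reads are guaranteed to observe the latest write."""
    return r + w > n

print(quorums_intersect(3, 2, 2))  # True: read-your-writes holds
print(quorums_intersect(3, 1, 1))  # False: a read may miss the latest write
```

Note that quorum overlap gives a consistent *view*; it does not by itself give ACID atomicity across nodes, which is exactly the distinction the exchange above is drawing.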

There have been many discussions about the loose definitions of the terms in the CAP theorem. Daniel Abadi offered an interesting perspective on the subject, proposing PACELC instead:

To me, CAP should really be PACELC – if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?

Original title and link: Consistency in the ACID and CAP Perspectives (NoSQL databases © myNoSQL)

How Scalable is VoltDB?

The Percona guys[1] have benchmarked VoltDB, analyzed the results, and drawn a conclusion about its scalability:

VoltDB is very scalable; it should scale to 120 partitions, 39 servers, and 1.6 million complex transactions per second at over 300 CPU cores

Considering the definition: “A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system.”, the conclusion should be slightly updated:

VoltDB can scale up to 120 partitions on 39 servers with 300 CPU cores and 1.6 million TPS.
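A quick back-of-envelope check of the figures quoted above:

```python
# Back-of-envelope throughput per server and per core,
# using the numbers from the Percona benchmark quoted above.

total_tps = 1_600_000
servers = 39
cores = 300

tps_per_server = total_tps / servers
tps_per_core = total_tps / cores

print(round(tps_per_server))  # 41026 transactions/s per server
print(round(tps_per_core))    # 5333 transactions/s per core
```

So “linearly scalable” here means each added server contributed roughly 41,000 TPS over the measured range; whether that holds past 39 servers is exactly what the benchmark cannot tell us.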

Bottom line:

  • if you can fit your data into 40 servers’ memory
  • you need ACID and SQL
  • you are OK with precompiled Java-based stored procedures
  • you don’t need multi-data-center deployments

now you can estimate how far you can go with VoltDB.

  1. The company specializing in MySQL services and behind the MySQL Performance Blog  

Original title and link: How Scalable is VoltDB? (NoSQL databases © myNoSQL)


Dealing With Distributed State

Jeff Darcy[1]:

The general rule to avoid these kinds of unresolvable conflicts is: don’t pass around references to values that might be inconsistent across systems. It’s like passing a pointer from one address space to a process in another; you just shouldn’t expect it to work. Either pass around the actual values or do calculations involving those values and replicate the result.

When dealing with distributed state think about the actor model.
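Jeff’s advice (pass values or ship the computation, never references) is exactly what message-passing actors enforce: each actor owns its state and only exchanges immutable messages. A minimal queue-backed sketch (illustrative only, not any particular actor library):

```python
import queue
import threading

# Minimal actor: owns its state, processes messages one at a time,
# and replies with *values* rather than references to mutable state.

class CounterActor:
    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg, reply = self._mailbox.get()
            if msg == "stop":
                break
            elif msg == "incr":
                self._count += 1
            elif msg == "get":
                reply.put(self._count)  # send a copy of the value, not a reference

    def send(self, msg):
        self._mailbox.put((msg, None))

    def ask(self, msg):
        reply = queue.Queue()
        self._mailbox.put((msg, reply))
        return reply.get()

actor = CounterActor()
for _ in range(3):
    actor.send("incr")
print(actor.ask("get"))  # 3
actor.send("stop")
```

Because all mutations are serialized through the mailbox, no other node or thread ever holds a pointer into the actor’s state, which is precisely the failure mode the quote warns against.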

  1. Jeff Darcy: @Obdurodon  

Original title and link: Dealing With Distributed State (NoSQL databases © myNoSQL)


9 Things to Acknowledge about NoSQL Databases

Excellent list:

  1. Understand how ACID compares with BASE (Basically Available, Soft-state, Eventually Consistent)
  2. Understand persistence vs non-persistence, i.e., some NoSQL technologies are entirely in-memory data stores
  3. Recognize there are entirely different data models from traditional normalized tabular formats: Columnar (Cassandra) vs key/value (Memcached) vs document-oriented (CouchDB) vs graph oriented (Neo4j)
  4. Be ready to deal with no standard interface like JDBC/ODBC or standardized query language like SQL; every NoSQL tool has a different interface
  5. Architects: rewire your brain to the fact that web-scale/large-scale NoSQL systems are distributed across dozens to hundreds of servers and networks as opposed to a shared database system
  6. Get used to the possibly uncomfortable realization that you won’t know where data lives (most of the time)
  7. Get used to the fact that data may not always be consistent; ‘eventually consistent’ is one of the key elements of the BASE model
  8. Get used to the fact that data may not always be available
  9. Understand that some solutions are partition-tolerant and some are not

Print it out and distribute it among your colleagues.

Original title and link: 9 Things to Acknowledge about NoSQL Databases (NoSQL databases © myNoSQL)


Amazon SimpleDB, Google Megastore & CAP

Nati Shalom (Gigaspaces) pulls out a couple of references from James Hamilton’s posts[1] on Amazon SimpleDB and Google Megastore consistency model concluding:

It is interesting to see that the reality is that even Google and Amazon - which I would consider the extreme cases for big data - realized the limitation behind eventual consistency and came up with models that can deal with scaling without forcing a compromise on consistency as I also noted in one of my recent NoCAP series

But he leaves out small details like these:

Update rates within an entity group are seriously limited by:

  • When there is log contention, one wins and the rest fail and must be retried
  • Paxos only accepts a very limited update rate (order 10^2 updates per second)


Cross entity group updates are supported by:

  • two-phase commit with the fragility that it brings
  • queuing and asynchronously applying the changes
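The “one wins and the rest fail and must be retried” behavior in the first list is plain optimistic concurrency on the entity group’s log position; a generic sketch (not Megastore’s implementation, class and method names are made up):

```python
# Optimistic append to a per-entity-group log: concurrent writers
# propose a value for the same log position; the first to claim it
# wins, and the losers must re-read the log position and retry.

class EntityGroupLog:
    def __init__(self):
        self.entries = []

    def try_append(self, expected_position: int, entry) -> bool:
        """Succeed only if no other writer claimed this position first."""
        if len(self.entries) != expected_position:
            return False  # log contention: lost the race, caller retries
        self.entries.append(entry)
        return True

log = EntityGroupLog()
pos = len(log.entries)          # both writers read position 0
print(log.try_append(pos, "write-A"))  # True: first writer wins
print(log.try_append(pos, "write-B"))  # False: second writer must retry
```

This retry loop, combined with Paxos round-trips per accepted entry, is why per-entity-group update rates sit in the low hundreds per second.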

Original title and link: Amazon SimpleDB, Google Megastore & CAP (NoSQL databases © myNoSQL)


Clustrix Sierra Clustered Database Engine

In light of Clustrix publicly announcing customers, I wanted to read a bit about their Clustered Database Systems.

The company homepage describes the product as:

  • scalable database appliances for Internet-scale workloads
  • Linearly Scalable: fully distributed, parallel architecture provides unlimited scale
  • SQL functionality: full SQL relational and data consistency (ACID) functionality
  • Fault-Tolerant: highly available providing fail-over, recovery, and self-healing
  • MySQL Compatible: seamless deployment without application changes.

All these sounded pretty (too) good. And I’ve seen a very similar presentation for Xeround: Elastic, Always-on Storage Engine for MySQL.

So, I’ve continued my reading with the Sierra Clustered Database Engine whitepaper (PDF).

Here are my notes:

  • Sierra is composed of:
    • database personality module: translates queries into internal representation
    • distributed query planner and compiler
    • distributed shared-nothing execution engine
    • persistent storage
    • NVRAM transactional storage for journal changes
    • inter-node InfiniBand interconnect
  • queries are decomposed into query fragments which are the unit of work. Query fragments are sent for execution to nodes containing the data.
  • query fragments are atomic operations that can:
    • insert, read, update data
    • execute functions and modify control flow
    • perform synchronization
    • send data to other nodes
    • format output
  • query fragments can be executed in parallel
  • query fragments can be cached with parameterized constants at the node level
  • determining where to send the query fragments for execution is done using either range-based rules or a hash function
  • tables are partitioned into slices, each slice having redundancy replicas
    • size of slices can be automatically determined or configured
    • adding new nodes to the cluster results in rebalancing slices
    • slices contained on a failed device are reconstructed using their replicas
  • for each slice, one of the replicas is considered the primary
  • writes go to all replicas and are transactional
  • all reads go to the slice primary
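The range-or-hash routing of query fragments mentioned above can be sketched generically (node names, ranges, and functions here are made up for illustration):

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]

def route_by_hash(key: str) -> str:
    """Hash-based routing: a stable mapping of key -> owning node."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# Range-based rules: (low, high, node) with half-open intervals.
RANGES = [(0, 1000, "node-0"), (1000, 2000, "node-1"), (2000, 10**9, "node-2")]

def route_by_range(uid: int) -> str:
    """Range-based routing: a rule table maps key ranges -> nodes."""
    for lo, hi, node in RANGES:
        if lo <= uid < hi:
            return node
    raise KeyError(uid)

print(route_by_range(1500))              # node-1
print(route_by_hash("gid:42") in NODES)  # True
```

Either way, the planner only needs the routing function to decide which node receives each query fragment; the fragment then executes next to the data it touches.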

The paper also exemplifies the execution of several queries, among them:


SELECT uid, name FROM T1 JOIN T2 on T1.gid = T2.gid WHERE uid=10

SELECT * FROM T1 WHERE uid<100 and gid>10 ORDER BY uid LIMIT 5



Questions the whitepaper left me with:

  • who is coordinating transactions that may be executed on different nodes?
  • who maintains the topology of the slices? In case of a node failure, you’d need to determine:
    1. what slices were on the failing node
    2. where the replicas for each of these slices are
    3. where new replicas will be created
    4. when the new replicas will become available for writes
  • who elects the slice primary?


Original title and link: Clustrix Sierra Clustered Database Engine (NoSQL databases © myNoSQL)

Data Consistency and Synchronization in Scalable NoSQL Solutions

On Xeround blog:

What would be sacrificed if synchronization and transaction mechanisms were added on top of NoSQL foundations?  The answer is performance. Distributed and consistent data services will always be slower than pure raw distributed data service.

… and availability.

Original title and link: Data Consistency and Synchronization in Scalable NoSQL Solutions (NoSQL databases © myNoSQL)


Google Megastore: Scalable, Highly Available Storage for Interactive Services

A new paper from Google:

Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability.
We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters.
This paper describes Megastore’s semantics and replication algorithm.

Megastore seems to be the solution behind the Google App Engine high replication datastore.

Emphases are mine.

Original title and link: Google Megastore: Scalable, Highly Available Storage for Interactive Services (NoSQL databases © myNoSQL)


Google App Engine High Replication Datastore

The High Replication Datastore provides the highest level of availability for your reads and writes, at the cost of increased latency for writes and changes in consistency guarantees in the API. The High Replication Datastore increases the number of data centers that maintain replicas of your data by using the Paxos algorithm to synchronize that data across datacenters in real time.

Still not completely decentralized a la Amazon Dynamo.

Original title and link: Google App Engine High Replication Datastore (NoSQL databases © myNoSQL)


Errors in Database Systems Still Must Consider Network Partitions

Dan Weinreb on Michael Stonebraker’s network partition scenarios from the much-discussed paper about ☞ Errors in Database Systems, Eventual Consistency, and the CAP Theorem:

If you mean a few computers connected together by an Ethernet, with redundant hardware all around, then the chance of a failure of the network itself is relatively low. However, real-world data centers with a relatively large number of servers rarely work this way. The problem is that a real network is very complicated. It depends on switches at both level 3 (routers) and level 2 (hubs). Situations can arise in which pieces of the network are mis-configured by accident; these can be hard to find due, ironically, to the very redundancy that was added to avoid failures. In particular, there is no way to make any kind of guarantee about the latency within the network, nor the likelihood that a packet will make it from its source to its destination.

John Hugg (VoltDB) made an interesting comment:

Step 1: Figure out how much the network partitions that will cause your system to be unavailable will cost.

Step 2: Figure out how much giving up consistency will cost.

Step 3: Make your CAP decision based on these estimated values.


Original title and link: Errors in Database Systems Still Must Consider Network Partitions (NoSQL databases © myNoSQL)