


NoSQL theory: All content tagged as NoSQL theory in NoSQL databases and polyglot persistence

Migrating databases with zero downtime

Every article I’ve read and linked to that includes a data migration phase from one database to another tells the same story:

  1. forklift
  2. incremental replication
  3. consistency checking
  4. shadow writes
  5. shadow writes and shadow reads for validation
  6. end of life of the original data store

The same story for Netflix’s migration from SimpleDB to Cassandra and’s migration from MongoDB/Titan to Cassandra. And once again, the same appears in FullContact’s migration from MongoDB to Cassandra. This last post also provides a nice diagram of the process:

Migrating data with no down time

The key part of these stories is that the migration was performed with zero downtime.
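The forklift, shadow-write and shadow-read steps listed above can be sketched in a few lines. This is only an illustration under loud assumptions: `DictStore` and `MigratingStore` are hypothetical names, the in-memory dictionaries stand in for real database clients, and incremental replication and the final cut-over are left out.

```python
class DictStore:
    """Hypothetical stand-in for a real database client."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class MigratingStore:
    """Keeps serving from the old store while shadowing traffic to the new one."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.mismatches = []  # keys where the two stores disagreed on a read

    def forklift(self):
        # Step 1: bulk-copy everything that already exists in the old store.
        for key, value in list(self.old.data.items()):
            self.new.put(key, value)

    def put(self, key, value):
        # Step 4: the old store stays authoritative, the new one gets a shadow write.
        self.old.put(key, value)
        self.new.put(key, value)

    def get(self, key):
        # Step 5: answer from the old store and validate the new one on the side.
        expected = self.old.get(key)
        if self.new.get(key) != expected:
            self.mismatches.append(key)
        return expected
```

Callers only ever see answers from the old store, so a bug in the new one shows up in `mismatches` rather than in production responses. Once that list stays empty for long enough, reads can be switched over and the old store retired.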

Original title and link: Migrating databases with zero downtime (NoSQL database©myNoSQL)

Anti-patterns for developing with NoSQL databases

Basho, the makers of Riak, recently published an article about the most common patterns to avoid when developing with Riak. Unsurprisingly, most of these rules must be applied to the majority of NoSQL databases.

Writing an application that can take full advantage of Riak’s robust scaling properties requires a different way of looking at data storage and retrieval. Developers who bring a relational mindset to Riak may create applications that work well with a small data set but start to show strain in production, particularly as the cluster grows.

What I’ve learned after experimenting and building apps with different NoSQL databases can be summarized in just a couple of short generic rules:

  1. if you have the “disadvantage” of being experienced with relational databases and are working on an app that will use a NoSQL database, forget everything you know about the relational world. Take out that part of your brain and put it in a jar. Use the other side of your brain. Avoid any temptation to make comparisons or to ask yourself “how would I do this in a relational database?”. You’ll fail.
  2. when using relational databases, most often we start with the data model. “What’s the best way to organize and store our data?” is one of the first questions we address. Only afterwards do we figure out, in the application, how to retrieve data in the format needed by the app.
  3. when using a NoSQL database, focus on your application. “How do I use data in my application?” must be the driving question. Then your NoSQL database API will tell you exactly how to store the data.

    This might make it sound too simple. Indeed, it’s not that simple. Some of the complexity you’ll face comes from figuring out how to keep multiple copies of the data to fit the different ways you need to access it, updating and deleting those copies, dealing with the consistency requirements of your app, and deciding which availability versus consistency trade-offs your app is OK with.

  4. take the time to learn the most common usage patterns and anti-patterns for the NoSQL database you have picked. If you cannot find the ones that fit your application, talk to the community and build a prototype. Do not ignore point 3 above at any stage.

    Now go over the list of the anti-patterns when developing with Riak.
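To make point 3 a bit more concrete, here is a minimal sketch of storing the same data under more than one key so that each access path becomes a single lookup. Everything in it is an assumption made for illustration: the dictionary stands in for a key-value database, and the key names and the `save_order`, `delete_order` and `orders_for` helpers are hypothetical.

```python
store = {}  # stand-in for a key-value database


def save_order(order):
    # Primary copy, used when the app looks up an order by its id.
    store[f"order:{order['id']}"] = order
    # Second copy of the relationship, used when the app lists a customer's orders.
    index = store.setdefault(f"orders_by_customer:{order['customer']}", [])
    if order["id"] not in index:
        index.append(order["id"])


def delete_order(order_id):
    # Every copy has to be cleaned up by the application, not by the database.
    order = store.pop(f"order:{order_id}", None)
    if order is not None:
        store[f"orders_by_customer:{order['customer']}"].remove(order_id)


def orders_for(customer):
    ids = store.get(f"orders_by_customer:{customer}", [])
    return [store[f"order:{order_id}"] for order_id in ids]
```

The two writes in `save_order` are not atomic here, which is exactly the kind of consistency question the paragraph above warns about.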


The dual sense of consistency

Michael Nygard, in an article that looks at the two completely unrelated definitions of consistency, the one in ACID and the one from CAP:

So it turns out that “consistency (predicate)” and “consistency (history)” are two distinct ideas that happen to share a word. It is always an error to substitute the distributed systems definition of “consistency” for the C in ACID.
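A small sketch may help keep the two senses apart. It is my own simplification, not something taken from the article: the predicate is a made-up “no negative balances” invariant, and the history check handles only a single sequential client, which is far weaker than the real distributed systems definition.

```python
# Consistency as a predicate (the C in ACID): a rule the data must always satisfy.
def no_negative_balances(accounts):
    return all(balance >= 0 for balance in accounts.values())


def transfer(accounts, src, dst, amount):
    updated = dict(accounts)
    updated[src] -= amount
    updated[dst] += amount
    # Commit only if the predicate still holds, otherwise keep the old state.
    return updated if no_negative_balances(updated) else accounts


# Consistency as a history (the C in CAP): do the observed operations look
# like they ran against one up-to-date copy of the data?
def history_is_consistent(history):
    latest = None
    for op, value in history:
        if op == "write":
            latest = value
        elif value != latest:  # a read returned something other than the last write
            return False
    return True
```

A stale read from a lagging replica fails the second check even though no invariant on the data was ever violated, and a transfer that overdraws an account fails the first check even on a single machine.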



Watching a presentation on Byzantine fault tolerance is similar to watching a foreign film

James Mickens in “The saddest moment”:

In conclusion, I think that humanity should stop publishing papers about Byzantine fault tolerance. I do not blame my fellow researchers for trying to publish in this area, in the same limited sense that I do not blame crackheads for wanting to acquire and then consume cocaine. The desire to make systems more reliable is a powerful one; unfortunately, this addiction, if left unchecked, will inescapably lead to madness and/or tech reports that contain 167 pages of diagrams and proofs. Even if we break the will of the machines with formalism and cryptography, we will never be able to put Ted inside of an encrypted, nested log, and while the datacenter burns and we frantically call Ted’s pager, we will realize that Ted has already left for the cafeteria.

One of the shortest and most delightful articles about the complexity of distributed systems.

Choosing the right database - A basic checklist

If you’ve never had to choose a framework or a database, the post from Denish Patel can be helpful in providing you with an initial checklist. If you have, you can definitely skip it.


How SQLite is tested

Speaking about the complexity of testing databases, the “How SQLite is Tested” page should give you an idea:

The reliability and robustness of SQLite is achieved in part by thorough and careful testing.



Paxos serialization, serializability, and proactive serialization

Professor Murat Demirbas has a (short) post looking at Paxos serialization, comparing it with serializability, and then introducing the notion of proactive serialization:

In fact Paxos serialization is overkill, it is too strong. Paxos will serialize operations in a total order, which is not necessarily needed for sequential consistency. Today in many applications where knowing the total order and replicated-logging of that order is not important, Paxos is still (ab)used.

The post doesn’t offer many details about proactive serialization, but while thinking about it, these were my first questions:

  1. what would be the behavior of the system for the cases where the prediction for locks is incorrect? Somehow the behavior of the system would need to account for both false positives and false negatives.
  2. would a system using proactive serialization still need a coordinator? A master-service? (nb: if I’m reading the post correctly, it seems that the system would rely on a lock-service master)
  3. if there isn’t a coordinator, who would make sure the locks are released when failures occur?



Conflict Resolution Using Rev Trees and a Comparison With Vector Clocks

Damien Katz has posted on GitHub a design document for the data structures, called rev trees, used to support conflict management in Couchbase. The doc also references the way conflict resolution is done in CouchDB and compares rev trees with vector clocks.

When this happens [nb the edits are in conflict] Couchbase will store both edits, pick an interim winner (the same winner will be selected on all nodes) and “hide” the losing conflict(s) and mark the document as being in conflict so that it can found, using views and other searches, by an external agents who can potentially resolve the conflicts.
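The property that matters in the quote is that every node picks the same interim winner without talking to the others. A minimal sketch of that idea follows, assuming CouchDB-style revision ids of the form `generation-digest` and a “highest generation, then highest digest” ordering. Both are assumptions for illustration; the actual rule in the design document may differ, and `rev_key` and `resolve` are hypothetical names.

```python
def rev_key(rev):
    # "3-abc" -> (3, "abc"): compare the generation numerically, then the digest.
    generation, _, digest = rev.partition("-")
    return (int(generation), digest)


def resolve(revisions):
    # Sorting by a total order means every node computes the same result,
    # regardless of the order in which it received the conflicting edits.
    ordered = sorted(revisions, key=rev_key, reverse=True)
    return {
        "winner": ordered[0],
        "conflicts": ordered[1:],  # kept but "hidden" from normal reads
        "in_conflict": len(ordered) > 1,
    }
```

Because the choice depends only on the revisions themselves, `resolve(["2-abc", "2-def"])` and `resolve(["2-def", "2-abc"])` agree, and the losing edit is kept so that an external agent can still resolve the conflict later.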



Bloom Filters by Example

Bloom filters are present in a lot of NoSQL systems. Take for example HBase and Bloom Filters. Last month I linked to creating a simple Bloom filter in Python, and today it is time for Bloom Filters by Example.
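For readers who have not met the data structure yet, here is a minimal sketch, not taken from either linked article. The bit-array size, the number of hashes and the trick of deriving several hash functions by salting SHA-256 are arbitrary choices made for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several hash functions by prefixing the item with a seed.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item):
        # False means "definitely not added"; True only means "possibly added".
        return all(self.bits[position] for position in self._positions(item))
```

A `False` answer lets a database skip reading a file or block that cannot contain the key, while a `True` answer still has to be confirmed by an actual read, since false positives are possible.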

Sid Anand



Is Eventual Consistency Useful?

As a continuation to The NoSQL Partition Tolerance Myth, Jeff Darcy:

Every once in a while, somebody comes up with the “new” idea that eventually consistent systems (or AP in CAP terminology) are useless. Of course, it’s not really new at all; the SQL RDBMS neanderthals have been making this claim-without-proof ever since NoSQL databases brought other models back into the spotlight. In the usual formulation, banks must have immediate consistency and would never rely on resolving conflicts after the fact … except that they do and have for centuries.



The NoSQL Partition Tolerance Myth

Emin Gün Sirer:

What the NoSQL industry is peddling under the guise of partition tolerance is not the ability to build applications that can survive partitions. They’re repackaging and marketing a very specific, and odd, behavior known as partition obliviousness.

The post presents a very dogmatic and radical perspective on what the requirements of both applications and distributed databases must be. I cannot agree with most of it, if only because it relies on the “bank example”.

Dealing with data conflicts is an added complexity for systems where write availability is more important than other requirements. Many NoSQL databases provide the knobs to tune the availability and consistency to the levels required by many applications. Applications can define more fine grained knobs on top of that.

Generalizing a scenario that might require consistent transactional data access to be the canonical example for all distributed systems, while ignoring features that are present in (some of) the NoSQL databases to help applications deal with different scenarios, is never going to lead to correct conclusions.
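One common form of the knobs mentioned above is the replica count N together with the number of replicas a read (R) and a write (W) must reach, as found in Dynamo-style stores. The sketch below only illustrates the arithmetic behind them; the function names are hypothetical and it does not use any database's API.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    # With R + W > N, every read set shares at least one replica with every write set.
    return r + w > n


def overlap_by_enumeration(n, r, w):
    # Brute-force check of the same claim over all possible replica choices.
    replicas = range(n)
    return all(
        set(reads) & set(writes)
        for reads in combinations(replicas, r)
        for writes in combinations(replicas, w)
    )
```

With N=3, choosing R=2 and W=2 guarantees that a read contacts at least one replica that saw the latest acknowledged write, while R=1 and W=1 favors availability and latency and leaves conflict handling to the application.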



Using Apache ZooKeeper to Build Distributed Apps (And Why)

Great intro to ZooKeeper and the problems it can help solve by Sean Mackrory:

Done wrong, distributed software can be the most difficult to debug. Done right, however, it can allow you to process more data, more reliably and in less time. Using Apache ZooKeeper allows you to confidently reason about the state of your data, and coordinate your cluster the right way. You’ve seen how easy it is to get a ZooKeeper server up and running. (In fact, if you’re a CDH user, you may already have an ensemble running!) Think about how ZooKeeper could help you build more robust systems

Leaving aside for a second the main topic of the post, another important lesson here is that the NIH syndrome in distributed systems is very expensive.
