NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



NoSQL theory: All content tagged as NoSQL theory in NoSQL databases and polyglot persistence

To SQL or to NoSQL?

Bob Lambert’s thoughts about a post about migrating from a NoSQL database back to a relational database:

I really liked this post, but particularly for these two points:

  • All data is relational, but NoSQL is useful because sometimes it isn’t practical to treat it as such for volume/complexity reasons.
  • In the comments, Jonathon Fisher’s remarked that NoSQL is really old technology, not new. (Of course you have to like any commenter that uses the word “defenestrated”).

I have lost count of how many times I’ve read exactly these arguments. But let’s take a different look at this post and the original article:

  1. the first thing that strikes me is that there’s no mention of the NoSQL database that was used; adding to that there’s no explanation of what led to using that database in the first place. What if it was just an experiment? What if the initial implementation was just a fashionable decision?

  2. all data is relational

    A more accurate statement would be “all data is connected“. The way we represent these connections can take many different forms and many times depends on the ways we use the data. This is exactly the core principle when doing data modeling in the NoSQL world too.

    The relational model is the most common as a consequence of the popularity of relational databases. One quick example of connected but not relational is hierarchical data, an area where relational databases are still not excelling (even if some have built custom solutions).

  3. “data in a relational model is optimized for the set of all possible operations”.

    Actually, the relational model is optimizing for space efficiency and set operations. There’s no such thing that optimizes for everything. Take graph data and traversal operations as an obvious counter example for relations and operations that are outside the capabilities of a relational database. And there are quite a few other examples: massive sparse matrices, etc.

  4. “Todd Homa recounts one horror story that shows how NoSQL data modelers must be aware of the corners into which they paint themselves as they optimize one access path at the expense of others.”

    This is like saying that everyone using a relational database has shut all their chances to grow their application to more than one server. Both of these are quite inaccurate.

Last, I think we should change the “choose the right tool for the job” advise to something that is a bit more clear: “understand and choose the trade-offs that correspond to your requirements”. Doesn’t sound as nice, but I think it’s better.

Original title and link: To SQL or to NoSQL? (NoSQL database©myNoSQL)


The era of the No-Design DataBase

Holger Mueller:

So could be the common thread of the new database boom the absence of a design component, the disposition of schema design step that was and is key for the success of any relational database?


Original title and link: The era of the No-Design DataBase (NoSQL database©myNoSQL)


Migrating databases with zero downtime

Every article I’ve read and linked to that includes a data migration phase from one database to another tells the same story:

  1. forklift
  2. incremental replication
  3. consistency checking
  4. shadow writes
  5. shadow writes and shadow reads for validation
  6. end of life of the original data store

The same story for Netflix’s migration from SimpleDB to Cassandra and’s migration from MongoDB/Titan to Cassandra. And once again, the same appears in FullContact’s migration from MongoDB to Cassandra. This last post also provides a nice diagram of the process:

Migrating data with no down time

The key part of these stories is that the migration was performed with zero downtime.

Original title and link: Migrating databases with zero downtime (NoSQL database©myNoSQL)

Anti-patterns for developing with NoSQL databases

Basho, makers of Riak, published recently an article about the most common patterns that have to be avoided when developing with Riak. Unsurprisingly, most of these rules can must be applied to the majority of NoSQL databases.

Writing an application that can take full advantage of Riak’s robust scaling properties requires a different way of looking at data storage and retrieval. Developers who bring a relational mindset to Riak may create applications that work well with a small data set but start to show strain in production, particularly as the cluster grows.

What I’ve learned after experimenting and building apps with different NoSQL databases can be summarized in just a couple of short generic rules:

  1. if you have the “disadvantage” of being experienced with relational databases and working on an app that will use a NoSQL database, forget everything you know about the relational world. Take out that part of your brain and put it in the jar. Use the other side of your brain. Avoid any temptations of doing comparisons or asking yourself “how would I do this in a relational database?”. You’ll fail.
  2. when using relational databases, most often we start with the data model. “What’s the best way to organize and store our data?” is one of the first questions we’re addressing. Only afterwards we’re figuring out, in the application, how to retrieve data in the format needed by the app.
  3. when using a NoSQL database, focus on your application. “How do I use data in my application?” must be the driving question. Then your NoSQL database API will tell you exactly how to store the data.

    This might make it sound too simple. Indeed, it’s not that simple. Some of the complexity you’ll face comes from figuring out how to keep multiple copies of the data to fit the different ways you need to access it, updating and deleting multiple copies, dealing with the consistency requirements of your app, what availability versus consistency trade-offs your app is OK with.

  4. take the time to learn the most common usage patterns and anti-patterns for the NoSQL database you have picked. If you cannot find the ones that fit your application, talk to the community and build a prototype. Do not ignore point 3 above at any stage.

    Now go over the list of the anti-patterns when developing with Riak.

Original title and link: Anti-patterns for developing with NoSQL databases (NoSQL database©myNoSQL)

The dual sense of consistency

Michael Nygard in an article that looks at the 2, completely unrelated, definitions of consistency; the one in ACID and the one from CAP:

So it turns out that “consistency (predicate)” and “consistency (history)” are two distinct ideas that happen to share a word. It is always an error to substitute the distributed systems definition of “consistency” for the C in ACID.

Original title and link: The dual sense of consistency (NoSQL database©myNoSQL)


Watching a presentation on Byzantine fault tolerance is similar to watching a foreign film

James Mickens in “The saddest moment“:

In conclusion, I think that humanity should stop publishing papers about Byzantine fault tolerance. I do not blame my fellow researchers for trying to publish in this area, in the same limited sense that I do not blame crackheads for wanting to acquire and then consume cocaine. The desire to make systems more reliable is a powerful one; unfortunately, this addiction, if left unchecked, will inescapably lead to madness and/or tech reports that contain 167 pages of diagrams and proofs. Even if we break the will of the machines with formalism and cryptography, we will never be able to put Ted inside of an encrypted, nested log, and while the datacenter burns and we frantically call Ted’s pager, we will realize that Ted has already left for the cafeteria.

One of the shortest and delightful articles about the complexity of distributed systems.

Choosing the right database - A basic checklist

If you’ve never had to choose a framework or a database, the post from Denish Patel can be helpful in providing you with an initial checklist. If you did, you can definitely skip it.

Original title and link: Choosing the right database - A basic checklist (NoSQL database©myNoSQL)

How SQLite is tested

Speaking about the complexity of testing databases, the “How SQLite is Tested” page should give you an idea:

The reliability and robustness of SQLite is achieved in part by thorough and careful testing.

Original title and link: How SQLite is tested (NoSQL database©myNoSQL)


Paxos serialization, serializability, and proactive serialization

Professor Murat Demirbas has a (short) post looking at the Paxos serialization, comparing it with serializability and then introducing the notion of proactive serialization:

In fact Paxos serialization is overkill, it is too strong. Paxos will serialize operations in a total order, which is not necessarily needed for sequential consistency. Today in many applications where knowing the total order and replicated-logging of that order is not important, Paxos is still (ab)used.

Indeed the post doesn’t offer too many details about proactive serialization, but while thinking about it here were my first questions:

  1. what would be the behavior of the system for the cases where the prediction for locks is incorrect? Somehow the behavior of the system would need to account for both false positives and false negatives.
  2. would a system using proactive serialization still need a coordinator? A master-service? (nb: if I’m reading the post correctly, it seems that the system would rely on a lock-service master)
  3. if there isn’t a coordinator who would make sure the locks are released when failures occur?

Original title and link: Paxos serialization, serializability, and proactive serialization (NoSQL database©myNoSQL)


Conflict Resolution Using Rev Trees and a Comparison With Vector Clocks

Damien Katz has posted on GitHub a design document for the data structures, called rev trees, used to support conflict management in Couchbase. The doc also includes references to the way conflict resolution is done in CouchDB and also compares rev trees with the vector clocks.

When this happens [nb the edits are in conflict] Couchbase will store both edits, pick an interim winner (the same winner will be selected on all nodes) and “hide” the losing conflict(s) and mark the document as being in conflict so that it can found, using views and other searches, by an external agents who can potentially resolve the conflicts.

Original title and link: Conflict Resolution Using Rev Trees and a Comparison With Vector Clocks (NoSQL database©myNoSQL)


Bloom Filters by Example

Bloom filters are present in a lot of NoSQL systems. Take for example HBase and Bloom Filters. Last month I’ve linked to creating a simple Bloom filter in Python and today is time for Bloom Filters by Example.

Sid Anand

Original title and link: Bloom Filters by Example (NoSQL database©myNoSQL)


Is Eventual Consistency Useful?

As a continuation to The NoSQL Partition Tolerance Myth, Jeff Darcy:

Every once in a while, somebody comes up with the “new” idea that eventually consistent systems (or AP in CAP terminology) are useless. Of course, it’s not really new at all; the SQL RDBMS neanderthals have been making this claim-without-proof ever since NoSQL databases brought other models back into the spotlight. In the usual formulation, banks must have immediate consistency and would never rely on resolving conflicts after the fact … except that they do and have for centuries.

Original title and link: Is Eventual Consistency Useful? (NoSQL database©myNoSQL)