NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



nosql theory: All content tagged as nosql theory in NoSQL databases and polyglot persistence

Algorithm for Automatic Cache Invalidation

Jakub Łopuszański describes in much detail and with examples an algorithm for cache invalidation:

Imagine a bipartite graph which on the left hand side has one vertex per each possible subspace of a write query, and on the right side has vertices corresponding to subspaces of read queries. Actually both sets are equal, but we will focus on edges.

Edge goes from left to right, if a query on the left side affects results of a query on the right side. As said before, both sets are infinite, but that’s not the problem. There are infinitely many edges, but it’s also not bad. What’s bad is that there are nodes on the left side with the infinite degree, which means, we need to invalidate infinitely many queries. What the above tricky algorithm does, is adding a third layer to the graph, in the middle between the two, such that the transitive closure of the resulting graph is still the same (in other words: you can still get by using two edges anywhere you could by one edge in the original graph), yet each node on the left, and each node on the right, have finite (actually constant) degree. This middle layer corresponds to the artificial subspaces with “?” marks, and serves as a connecting hub for all the mess. Now, when a query on the left executes, it needs to inform only its (small number of) neighbours about the change, moving the burden of reading this information to the right. That is, a query on the right side needs to check if there is a message in the “inbox” in the middle layer. So you can think about it as a cooperation where the left query makes one step forward, and the right query does a one step back, to meet at the central place, and pass the important information about the invalidation of cache.

I’m still in front of a piece of paper understanding how it works.

Original title and link: Algorithm for Automatic Cache Invalidation (NoSQL database©myNoSQL)


Paper: Principles of Distributed Data Management in 2020?

Patrick Valduriez, co-author of the “Principles of Distributed Database Systems” book, has published a paper Principles of Distributed Data Management in 2020? (pdf) translating the main topic into the following 3 questions:

  1. What are the fundamental principles behind the emerging solutions?
  2. Is there any generic architectural model, to explain those principles?
  3. Do we need new foundations to look at data distribution?

Wrt (2), I showed that emerging solutions can still be explained along the three main dimensions of distributed data management (distribution, autonomy, heterogeneity), yet pushing the scales of the dimensions high up. However, I raised the question of how generic should distributed data management be, without hampering application-specific optimizations. Emerging NOSQL solutions tend to rely on a specific data model (e.g. Bigtable, MapReduce) with a simple set of operators easy to use from or with a programming language. It is also interesting to witness the development of algebras, with specific operators, to raise the level of abstraction in a way that enables optimization [9]. What is missing to explain the principles of emerging solutions is one or more dimensions on generic/specific data model and data processing.

What I think this paper does is actually looking at two different questions, a bit less generic but still useful in proving that the new generation of distributed database systems was clearly triggered by the new requirements and the evolution of the current applications:

  1. Is there a need for new approaches in distributed data management systems?
  2. What are some of the approaches used by the emerging solution to deal with the challenges posed by today’s data-intensive applications?

You can read or download Patrick Valduriez’s paper here:

Lock-Free Algorithms: How Intel X86_64 Processors and Their Memory Model Works

Download the slides, set aside 1 hour and 10 minutes of uncontended time, click the Maximize button, and watch this great presentation by Martin Thompson and Michael Barker diving into the Intel x86_64 processors and memory models for implementing lock-free algorithms. Once you’re done make sure to also read The Single Writer Principle by the same Martin Thompson.

Original title and link: Lock-Free Algorithms: How Intel X86_64 Processors and Their Memory Model Works (NoSQL database©myNoSQL)

The Single Writer Principle

Martin Thompson:

When trying to build a highly scalable system the single biggest limitation on scalability is having multiple writers contend for any item of data or resource.  Sure, algorithms can be bad, but let’s assume they have a reasonable Big O notation so we’ll focus on the scalability limitations of the systems design. 

I keep seeing people just accept having multiple writers as the norm.  There is a lot of research in computer science for managing this contention that boils down to 2 basic approaches.  One is to provide mutual exclusion to the contended resource while the mutation takes place; the other is to take an optimistic strategy and swap in the changes if the underlying resource has not changed while you created the new copy. 

The Single Writer Principle is that for any item of data, or resource, that item of data should be owned by a single execution context for all mutations.

Original title and link: The Single Writer Principle (NoSQL database©myNoSQL)


Should I Expose Asynchronous Wrappers for Synchronous Methods?

There are two primary benefits I see to asynchrony: scalability and offloading (e.g. responsiveness, parallelism).  Which of these benefits matters to you is typically dictated by the kind of application you’re writing.  Most client apps care about asynchrony for offloading reasons, such as maintaining responsiveness of the UI thread, though there are certainly cases where scalability matters to a client as well (often in more technical computing / agent-based simulation workloads).  Most server apps care about asynchrony for scalability reasons, though there are cases where offloading matters, such as in achieving parallelism in back-end compute servers.

There might be a 3rd scenario (or at least a sub-category of the responsiveness benefit): adding timeout capabilitites to non-critical remote invocations. What I have in mind is simulating the actor-based approach in environments with no native support for it.

Original title and link: Should I Expose Asynchronous Wrappers for Synchronous Methods? (NoSQL database©myNoSQL)


Networks Never Fail

A reminder to those thinking that networks never fail and automation can solve everything. Christina Ilvento, on behalf of the App Engine team:

The root cause of the outage was a combination of two factors during a scheduled network maintenance in one of our datacenters. As part of the scheduled maintenance, network capacity to and from this datacenter was reduced. This alone was expected, and was not a problem. However, this maintenance exposed a previously existing misconfiguration in the system that manages network bandwidth capacity.

Ordinarily, the bandwidth management system helps isolate and prioritize traffic. When capacity is reduced because of maintenance, network failure, or due to an excess of normal traffic, the bandwidth management system keeps things running smoothly by throttling back the rate of low priority traffic. However, as mentioned, the bandwidth management system had a latent misconfiguration which did not show up until capacity was reduced due to the scheduled maintenance. This misconfiguration under-reported the available network capacity to and from the datacenter, causing the network modeler to believe that there was less overall capacity than actually existed.

The configuration error in the bandwidth management system, when combined with an expected reduction in capacity due to the scheduled maintenance, led the system to conclude that there was insufficient bandwidth available for current traffic demand to and from this datacenter. (In reality, there was more than sufficient excess capacity, as otherwise the maintenance would not have been allowed to go forward.) Because of this combination of misconfiguration and scheduled maintenance, a number of services were automatically blocked from sending network traffic. […]

The outage occurred because two independent systems failed at the same time, which resulted in mistakes in our usual escalation procedures which significantly impacted the duration of the outage.

Original title and link: Networks Never Fail (NoSQL database©myNoSQL)

A Different Approach to Data Modeling: Thinking of Data Service Requirements

Pedro Visintin:

We can distinguish several objects: Session, Cart, Item, User, Order, Product and Payment. Usually we use ActiveRecord to store all of them. But this time let’s think about it differently.

For sessions, we don’t need durable data at all — Redis can be a good option, and of course will be faster than any RDBMS. For Cart and Item,we will need high availability across different locations. Riak can fit well for this use case. For User Order Product and Payment, a relational database can fit well, focusing on Transactions and Reporting about our application.

This is a very good exercise for understanding the requirements for your data service layer. As much as I write about polyglot persistence, when architecting an application never leave aside or ignore the operational requirements for your service.

Original title and link: A Different Approach to Data Modeling: Thinking of Data Service Requirements (NoSQL database©myNoSQL)


The Database Nirvana

Scroll to minute 16:55 of this video to watch Jim Webber explain the benefits of polyglot persistence and how starting (again) the winner-takes-it-all war is just sending us back at least 10 years from the database Nirvana.

We’ve just come from the place where one-size-fits-all and we don’t want to go back there. There is a huge wonderful ecosystem of stores. Pick the right one. Don’t just assume that the one you find the easiest or the one that shouts the loudest is the one you’re going to use. Pick the one that suits your data model.

It doesn’t matter what flavor of relational or NoSQL database you prefer or have experience with or if a small or large database vendor is paying your bills. You really need to get this right as otherwise we’re just going to destroy a lot of valuable options we’ve added to our toolboxes.

Original title and link: The Database Nirvana (NoSQL database©myNoSQL)

Cardinality Estimation Algorithms: Memory Efficient Solutions for Counting 1 Billion Distinct Objects

Matt Abrams from Clearspring:

Cardinality estimation algorithms trade space for accuracy. To illustrate this point we counted the number of distinct words in all of Shakespeare’s works using three different counting techniques. Note that our input dataset has extra data in it so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a Hyper LogLog Counter. Here are the results:

Cardinality estimation algorithms

Original title and link: Cardinality Estimation Algorithms: Memory Efficient Solutions for Counting 1 Billion Distinct Objects (NoSQL database©myNoSQL)


Cloud Computing Lets Us Rethink How We Use Data

But not everything we do in a database needs guaranteed transactional consistency.

Imagine you are charged with designing a system to collect data on temperature, air flow and electricity use in a building every few minutes from hundreds of locations. The system will be used to make the building more energy efficient. Now imagine you lose a few data points every day.  The cause isn’t important but it could be a glitch with a sensor, a dropped packet, or an incomplete write operation in the database.

Do you care?

It depends from what angle I’m looking at this question. If I’m the producer of the sensor, I do care if it has a glitch. If I’m a network administrator I do care there are dropped packets. And if I am a database system I do care if I’m dropping write operations. And I also have to tell whoever is using me if I am able to receive operations—am I available when I’m needed?

Original title and link: Cloud Computing Lets Us Rethink How We Use Data (NoSQL database©myNoSQL)


Design Your Database Schema

Three paterns of making a relational database behave like a document database. Useful in the times there were no document databases around.

If we were to use a relational database we might end up with a single table with an ungodly amount of columns so that each event has all its specific columns available. But we will never use all columns for one event of course. Maybe try to re-use columns, and call them names like column1, column2 etc. Hmm… sounds like fun to maintain and develop against.

The other pattern would be to start creating a normalized schema with multiple tables – probably one per game, and one per even type etc. So then we end up with a complex schema that needs to be maintained and versioned. Inserts and selects will be spread across tables and for sure we need to change the schema when new games or events are introduced.

There is also a third pattern out there which is to store a binary blob in the database… lets not even talk about that one.

Original title and link: Design Your Database Schema (NoSQL database©myNoSQL)


IBM: Behind the Buzz About NoSQL

Mature database management systems like DB2 also offer advantages like high availability and data compression that the newer NoSQL systems have not had time to develop.

Misinform your customers to save them the trouble of discovering alternative solutions.

Original title and link: IBM: Behind the Buzz About NoSQL (NoSQL database©myNoSQL)