


nosql theory: All content tagged as nosql theory in NoSQL databases and polyglot persistence

Lock-Free Algorithms: How Intel X86_64 Processors and Their Memory Model Works

Download the slides, set aside 1 hour and 10 minutes of uncontended time, click the Maximize button, and watch this great presentation by Martin Thompson and Michael Barker diving into the Intel x86_64 processors and memory models for implementing lock-free algorithms. Once you’re done make sure to also read The Single Writer Principle by the same Martin Thompson.

Original title and link: Lock-Free Algorithms: How Intel X86_64 Processors and Their Memory Model Works (NoSQL database©myNoSQL)

The Single Writer Principle

Martin Thompson:

When trying to build a highly scalable system the single biggest limitation on scalability is having multiple writers contend for any item of data or resource.  Sure, algorithms can be bad, but let’s assume they have a reasonable Big O notation so we’ll focus on the scalability limitations of the systems design. 

I keep seeing people just accept having multiple writers as the norm.  There is a lot of research in computer science for managing this contention that boils down to 2 basic approaches.  One is to provide mutual exclusion to the contended resource while the mutation takes place; the other is to take an optimistic strategy and swap in the changes if the underlying resource has not changed while you created the new copy. 

The Single Writer Principle is that for any item of data, or resource, that item of data should be owned by a single execution context for all mutations.
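A minimal sketch of the idea in Python, assuming a shared counter as the "item of data". The names `run_single_writer`, `inbox`, and `state` are made up for illustration. Only the writer thread ever mutates the counter; everyone else sends it messages. Note that Python's `queue.Queue` uses a lock internally, so this shows the ownership model, not a lock-free implementation.

```python
import queue
import threading


def run_single_writer(num_producers, increments_each):
    """Count increments where exactly one thread owns and mutates the state."""
    inbox = queue.Queue()   # messages from producers to the single writer
    state = {"total": 0}    # mutated by the writer thread only

    def writer():
        # The single execution context that performs every mutation.
        while True:
            delta = inbox.get()
            if delta is None:   # sentinel: no more messages will arrive
                break
            state["total"] += delta

    def producer():
        # Producers never touch `state`; they only describe the change.
        for _ in range(increments_each):
            inbox.put(1)

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()

    producers = [threading.Thread(target=producer) for _ in range(num_producers)]
    for thread in producers:
        thread.start()
    for thread in producers:
        thread.join()

    inbox.put(None)         # all producers are done, stop the writer
    writer_thread.join()
    return state["total"]


if __name__ == "__main__":
    print(run_single_writer(num_producers=4, increments_each=1000))
```

Because no two threads ever write `state["total"]`, the update itself needs neither a mutex nor a compare-and-swap retry loop.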



Should I Expose Asynchronous Wrappers for Synchronous Methods?

There are two primary benefits I see to asynchrony: scalability and offloading (e.g. responsiveness, parallelism).  Which of these benefits matters to you is typically dictated by the kind of application you’re writing.  Most client apps care about asynchrony for offloading reasons, such as maintaining responsiveness of the UI thread, though there are certainly cases where scalability matters to a client as well (often in more technical computing / agent-based simulation workloads).  Most server apps care about asynchrony for scalability reasons, though there are cases where offloading matters, such as in achieving parallelism in back-end compute servers.

There might be a third scenario (or at least a sub-category of the responsiveness benefit): adding timeout capabilities to non-critical remote invocations. What I have in mind is simulating the actor-based approach in environments with no native support for it.
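A rough sketch of that timeout scenario, written in Python rather than the .NET the original article discusses. `slow_remote_call` and `call_with_timeout` are hypothetical names, and `asyncio.to_thread` assumes Python 3.9 or later. The synchronous call is offloaded to a worker thread and the caller gives up waiting after `timeout` seconds.

```python
import asyncio
import time


def slow_remote_call(delay):
    """Stand-in for a blocking, non-critical remote invocation."""
    time.sleep(delay)
    return "done"


async def call_with_timeout(delay, timeout, default="timed out"):
    """Run the blocking call in a worker thread; fall back to `default` on timeout."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_remote_call, delay), timeout
        )
    except asyncio.TimeoutError:
        # The caller stops waiting, but the worker thread is not interrupted:
        # the blocking call still runs to completion in the background.
        return default


if __name__ == "__main__":
    print(asyncio.run(call_with_timeout(delay=0.01, timeout=1.0)))
    print(asyncio.run(call_with_timeout(delay=0.3, timeout=0.05)))
```

This is offloading, not scalability: a thread stays blocked for the full duration of the call, which is exactly the trade-off the quoted article is concerned with.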



Networks Never Fail

A reminder to those thinking that networks never fail and automation can solve everything. Christina Ilvento, on behalf of the App Engine team:

The root cause of the outage was a combination of two factors during a scheduled network maintenance in one of our datacenters. As part of the scheduled maintenance, network capacity to and from this datacenter was reduced. This alone was expected, and was not a problem. However, this maintenance exposed a previously existing misconfiguration in the system that manages network bandwidth capacity.

Ordinarily, the bandwidth management system helps isolate and prioritize traffic. When capacity is reduced because of maintenance, network failure, or due to an excess of normal traffic, the bandwidth management system keeps things running smoothly by throttling back the rate of low priority traffic. However, as mentioned, the bandwidth management system had a latent misconfiguration which did not show up until capacity was reduced due to the scheduled maintenance. This misconfiguration under-reported the available network capacity to and from the datacenter, causing the network modeler to believe that there was less overall capacity than actually existed.

The configuration error in the bandwidth management system, when combined with an expected reduction in capacity due to the scheduled maintenance, led the system to conclude that there was insufficient bandwidth available for current traffic demand to and from this datacenter. (In reality, there was more than sufficient excess capacity, as otherwise the maintenance would not have been allowed to go forward.) Because of this combination of misconfiguration and scheduled maintenance, a number of services were automatically blocked from sending network traffic. […]

The outage occurred because two independent systems failed at the same time, which resulted in mistakes in our usual escalation procedures which significantly impacted the duration of the outage.


A Different Approach to Data Modeling: Thinking of Data Service Requirements

Pedro Visintin:

We can distinguish several objects: Session, Cart, Item, User, Order, Product and Payment. Usually we use ActiveRecord to store all of them. But this time let’s think about it differently.

For sessions, we don’t need durable data at all: Redis can be a good option, and of course it will be faster than any RDBMS. For Cart and Item, we will need high availability across different locations. Riak can fit well for this use case. For User, Order, Product, and Payment, a relational database can fit well, focusing on transactions and reporting about our application.

This is a very good exercise for understanding the requirements for your data service layer. As much as I write about polyglot persistence, when architecting an application never leave aside or ignore the operational requirements for your service.
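One way to make the exercise concrete is to write the requirements down per object and derive the store from them. The sketch below is only that: the `Requirements` fields and the flags assigned to each object are my reading of the quote, not something Visintin specifies, and the decision rules are deliberately simplistic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    durable: bool                  # must the data survive a restart?
    multi_site_availability: bool  # must it stay available across locations?
    transactions: bool             # does it need transactions and reporting?


def pick_store(req):
    """Map requirements to a store, following the reasoning in the quote."""
    if req.transactions:
        return "relational database"
    if req.multi_site_availability:
        return "Riak"
    if not req.durable:
        return "Redis"
    return "relational database"   # conservative default


# Assumed requirements for each object mentioned in the quote.
ENTITIES = {
    "Session": Requirements(durable=False, multi_site_availability=False, transactions=False),
    "Cart": Requirements(durable=True, multi_site_availability=True, transactions=False),
    "Item": Requirements(durable=True, multi_site_availability=True, transactions=False),
    "User": Requirements(durable=True, multi_site_availability=False, transactions=True),
    "Order": Requirements(durable=True, multi_site_availability=False, transactions=True),
    "Product": Requirements(durable=True, multi_site_availability=False, transactions=True),
    "Payment": Requirements(durable=True, multi_site_availability=False, transactions=True),
}

if __name__ == "__main__":
    for name, req in ENTITIES.items():
        print(f"{name}: {pick_store(req)}")
```

A real decision would also weigh the operational requirements (monitoring, backups, team experience), which a table like this does not capture.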



The Database Nirvana

Scroll to minute 16:55 of this video to watch Jim Webber explain the benefits of polyglot persistence and how starting (again) the winner-takes-it-all war is just sending us back at least 10 years from the database Nirvana.

We’ve just come from the place where one-size-fits-all and we don’t want to go back there. There is a huge wonderful ecosystem of stores. Pick the right one. Don’t just assume that the one you find the easiest or the one that shouts the loudest is the one you’re going to use. Pick the one that suits your data model.

It doesn’t matter what flavor of relational or NoSQL database you prefer or have experience with or if a small or large database vendor is paying your bills. You really need to get this right as otherwise we’re just going to destroy a lot of valuable options we’ve added to our toolboxes.


Cardinality Estimation Algorithms: Memory Efficient Solutions for Counting 1 Billion Distinct Objects

Matt Abrams from Clearspring:

Cardinality estimation algorithms trade space for accuracy. To illustrate this point we counted the number of distinct words in all of Shakespeare’s works using three different counting techniques. Note that our input dataset has extra data in it so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a Hyper LogLog Counter. Here are the results:

[Image: Cardinality estimation algorithms]
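To get a feel for the space-for-accuracy trade-off, here is a small Python sketch of a linear probabilistic counter next to an exact set-based count. It is not Clearspring's implementation: the function name, the SHA-1 based hashing, and the default bitmap size are my own choices. The estimate is the standard linear counting formula, n ≈ -m · ln(V), where m is the number of bits and V is the fraction of bits still zero.

```python
import hashlib
import math


def linear_count(items, num_bits=1 << 16):
    """Estimate the number of distinct items with a fixed-size bitmap.

    `num_bits` must be a multiple of 8. Memory use is num_bits / 8 bytes,
    no matter how many items are seen.
    """
    bitmap = bytearray(num_bits // 8)
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[index // 8] |= 1 << (index % 8)   # duplicates hit the same bit

    set_bits = sum(bin(byte).count("1") for byte in bitmap)
    zero_bits = num_bits - set_bits
    if zero_bits == 0:
        raise ValueError("bitmap saturated; use more bits")
    return -num_bits * math.log(zero_bits / num_bits)


if __name__ == "__main__":
    words = [f"word{i % 5000}" for i in range(20000)]   # 5000 distinct values
    print("exact:", len(set(words)))                    # keeps every distinct item
    print("estimate:", round(linear_count(words)))      # keeps an 8 KB bitmap
```

The exact count has to hold every distinct item in memory, while the bitmap stays at 8 KB here. The price is an approximate answer whose error grows as the bitmap fills up.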



Cloud Computing Lets Us Rethink How We Use Data

But not everything we do in a database needs guaranteed transactional consistency.

Imagine you are charged with designing a system to collect data on temperature, air flow and electricity use in a building every few minutes from hundreds of locations. The system will be used to make the building more energy efficient. Now imagine you lose a few data points every day.  The cause isn’t important but it could be a glitch with a sensor, a dropped packet, or an incomplete write operation in the database.

Do you care?

It depends on the angle from which I look at this question. If I’m the producer of the sensor, I do care if it has a glitch. If I’m a network administrator, I do care that packets are dropped. And if I am a database system, I do care if I’m dropping write operations. I also have to tell whoever is using me whether I am able to receive operations: am I available when I’m needed?



Design Your Database Schema

Three patterns for making a relational database behave like a document database. Useful back when there were no document databases around.

If we were to use a relational database we might end up with a single table with an ungodly amount of columns so that each event has all its specific columns available. But we will never use all columns for one event of course. Maybe try to re-use columns, and call them names like column1, column2 etc. Hmm… sounds like fun to maintain and develop against.

The other pattern would be to start creating a normalized schema with multiple tables – probably one per game, and one per event type etc. So then we end up with a complex schema that needs to be maintained and versioned. Inserts and selects will be spread across tables and for sure we need to change the schema when new games or events are introduced.

There is also a third pattern out there which is to store a binary blob in the database… let’s not even talk about that one.
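To see why the first pattern hurts, here is a tiny illustration using Python's built-in `sqlite3`. The `events` table, its columns, and the two event types are invented for the example. Every event type brings its own columns, so each row leaves most of the table empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One wide table: every event type adds its own columns (hypothetical schema).
conn.execute(
    """
    CREATE TABLE events (
        id          INTEGER PRIMARY KEY,
        event_type  TEXT NOT NULL,
        level       INTEGER,   -- used by 'level_completed' only
        score       INTEGER,   -- used by 'level_completed' only
        item_name   TEXT,      -- used by 'item_purchased' only
        price_cents INTEGER    -- used by 'item_purchased' only
    )
    """
)
conn.execute(
    "INSERT INTO events (event_type, level, score) VALUES ('level_completed', 3, 1200)"
)
conn.execute(
    "INSERT INTO events (event_type, item_name, price_cents) "
    "VALUES ('item_purchased', 'shield', 199)"
)


def null_fraction(connection):
    """Fraction of event-specific cells that are NULL."""
    rows = connection.execute(
        "SELECT level, score, item_name, price_cents FROM events"
    ).fetchall()
    cells = [cell for row in rows for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


if __name__ == "__main__":
    print(null_fraction(conn))
```

With just two event types half of the event-specific cells are already NULL, and every new event type means another `ALTER TABLE`.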



IBM: Behind the Buzz About NoSQL

Mature database management systems like DB2 also offer advantages like high availability and data compression that the newer NoSQL systems have not had time to develop.

Misinform your customers to save them the trouble of discovering alternative solutions.



Visualizing System Latency

Besides the many practical lessons in Jack Clark’s interview with Adrian Cockcroft on ZDNet (luckily, I’ve had the chance to see some of Cockcroft’s presentations about the Netflix architecture and to talk to him directly), one thing that stuck with me was the closing paragraph:

The thing I’ve been publicly asking for has been better IO in the cloud. Obviously I want SSDs in there. We’ve been asking cloud vendors to do that for a while. With Cassandra, we’ve had to go onto horizontal scale and use the internal disks and triple replicate across availability zones, so you end up with a triple-redundant data store that is careful not to overload the disks.

That reminded me of this old ACM article authored by Brendan Gregg:

When I/O latency is presented as a visual heat map, some intriguing and beautiful patterns can emerge. These patterns provide insight into how a system is actually performing and what kinds of latency end-user applications experience. Many characteristics seen in these patterns are still not understood, but so far their analysis is revealing systemic behaviors that were previously unknown.

[Figure: Visualizing system latency. Sequential disk reads, stepping disk count]
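The data behind such a heat map is simple to produce: bucket each I/O sample by time (x axis) and latency (y axis) and count the samples per cell. The sketch below is a toy text renderer, not Gregg's tooling. The function names, bucket sizes, and shading characters are all assumptions.

```python
from collections import Counter


def latency_heatmap(samples, time_step_s=1.0, latency_step_ms=1.0):
    """Count samples per (time bucket, latency bucket) cell.

    `samples` is an iterable of (timestamp_seconds, latency_milliseconds).
    """
    cells = Counter()
    for timestamp_s, latency_ms in samples:
        column = int(timestamp_s // time_step_s)
        row = int(latency_ms // latency_step_ms)
        cells[(column, row)] += 1
    return cells


def render(cells, shades=" .:*#"):
    """Draw the grid as text: time runs left to right, latency bottom to top."""
    if not cells:
        return ""
    max_column = max(column for column, _ in cells)
    max_row = max(row for _, row in cells)
    peak = max(cells.values())
    lines = []
    for row in range(max_row, -1, -1):       # highest latency on the top line
        line = ""
        for column in range(max_column + 1):
            count = cells.get((column, row), 0)
            if count == 0:
                shade = 0                     # empty cell stays blank
            else:
                # Scale 1..peak onto the non-blank shades.
                shade = 1 + (count - 1) * (len(shades) - 2) // max(peak - 1, 1)
            line += shades[shade]
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    samples = [(0.2, 0.5), (0.7, 0.6), (1.1, 2.3)]
    print(render(latency_heatmap(samples)))
```

Unlike an average or a percentile line, the grid keeps the whole distribution per time slice, which is what lets the patterns in the article show up at all.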

I wonder whether any monitoring tools in the NoSQL database space (and the data storage space in general) provide such advanced visualization of latency data. Do you know of any?


Hadoop Terms and Components Index Card

A quick description of the most important terms and components of Hadoop—HDFS, NameNode, DataNode, MapReduce, JobTracker, TaskTracker—and its high level design principles:

  1. The system must properly distribute data across a system evenly and safely.
  2. The system must support partial failure of a node in the system. This means, if a node goes down, the operations within the cluster continue without change in the final outcome.
  3. If there is a failure, the system should be able to recover the data through the existence of backup (later referred to as replicated blocks).
  4. When a node is brought back online, it should be able to rejoin the system immediately.
  5. The system shall maintain linear scalability, meaning addition of resources will increase performance linearly, just as removal of resources would decrease performance linearly.
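Principles 1 to 3 can be illustrated with a toy model in Python: spread replicated blocks evenly over the nodes, then check that every block is still readable after a node fails. The round-robin placement and the function names are assumptions made for the example. This is not HDFS's actual placement policy, which is rack-aware.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Requires replication <= len(nodes) so that replicas land on distinct nodes.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def readable_blocks(placement, live_nodes):
    """Blocks that still have at least one replica on a live node."""
    return {
        block
        for block, replicas in placement.items()
        if any(node in live_nodes for node in replicas)
    }


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_blocks(num_blocks=8, nodes=nodes)

    # Principle 1: replicas are spread evenly across nodes.
    for node in nodes:
        held = sum(node in replicas for replicas in placement.values())
        print(node, "holds", held, "replicas")

    # Principles 2 and 3: losing a node leaves every block readable.
    survivors = set(nodes) - {"n2"}
    print(len(readable_blocks(placement, survivors)), "of 8 blocks readable")
```

With three replicas per block, a block only becomes unreadable when all three nodes holding it are down at the same time.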
