NoSQL theory: All content tagged as NoSQL theory in NoSQL databases and polyglot persistence
Thursday, 10 May 2012
Neo4j Data Modeling: What Question Do You Want to Answer?
Mark Needham:
Over the past few weeks I’ve been modelling ThoughtWorks project data in neo4j and I realised that the way that I’ve been doing this is by considering what question I want to answer and then building a graph to answer it.
This same principle should be applied to modeling with any NoSQL database. Thinking in terms of access patterns is one of the major differences between data modeling in the NoSQL space and in the relational world, where modeling is driven, at least in theory and in the early phases, by the normalization rules.
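As a rough illustration of question-first modeling, here is a minimal sketch in plain Python (not the Neo4j API). The question "who has worked on a project with this person?" and the names `ProjectGraph`, `add_worked_on`, and `colleagues_of` are assumptions made up for this example; they do not come from Mark Needham's post.

```python
from collections import defaultdict


class ProjectGraph:
    """Tiny in-memory graph shaped around one hypothetical question:
    who has worked on a project with a given person?"""

    def __init__(self):
        # Two adjacency maps, one per direction of the WORKED_ON relationship,
        # so the question is answered by two short hops.
        self.projects_of = defaultdict(set)  # person -> projects
        self.people_on = defaultdict(set)    # project -> people

    def add_worked_on(self, person, project):
        self.projects_of[person].add(project)
        self.people_on[project].add(person)

    def colleagues_of(self, person):
        found = set()
        # .get avoids creating an empty entry for unknown people
        for project in self.projects_of.get(person, ()):
            found |= self.people_on[project]
        found.discard(person)
        return found


graph = ProjectGraph()
graph.add_worked_on("alice", "p1")
graph.add_worked_on("bob", "p1")
graph.add_worked_on("alice", "p2")
graph.add_worked_on("carol", "p2")
colleagues = graph.colleagues_of("alice")
```

A different question (say, "which projects share the most people?") would likely push the structure in a different direction, which is the point of starting from the access pattern.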
Original title and link: Neo4j Data Modeling: What Question Do You Want to Answer? (©myNoSQL)
via: http://www.markhneedham.com/blog/2012/05/05/neo4j-what-question-do-you-want-to-answer/
Wednesday, 9 May 2012
In Defense of ORMs
Martin Fowler in a post defending the role of ORMs:
As you might have gathered, I think NoSQL is technology to be taken very seriously. If you have an application problem that maps well to a NoSQL data model - such as aggregates or graphs - then you can avoid the nastiness of mapping completely. Indeed this is often a reason I’ve heard teams go with a NoSQL solution.
Truth is that some NoSQL databases are getting more mapping libraries and frameworks than I’ve ever seen for relational databases. Even worse, there are attempts to hide all NoSQL databases behind the same libraries, specs, or APIs. I think the old principle of decoupling the application from the underlying database is so ingrained in the software community that almost nobody is asking the real questions: will I really need to change the data storage, will it actually work as planned, and what will I lose if I add yet another indirection layer?
As Martin Fowler correctly emphasizes in his post, the real benefit of ORMs comes from eliminating the plumbing:
A framework that allows me to avoid 80% of that [boiler-plate code] is worthwhile even if it is only 80%. The problem is in me for pretending it’s 100% when it isn’t.
But that doesn’t mean that everyone should create their own mapping library or framework for each and every project.
Tuesday, 8 May 2012
The Future of NoSQL With Java EE
Markus Eisele:
We already have a lot in place for the so-called “NoSQL” DBs. And the groundwork for integrating this into new Java EE standards is promising. Control of embedded NoSQL instances should be done via JSR 322 (Java EE Connector Architecture), with this being the only place allowed to spawn threads and open files directly from a filesystem. I’m not a big supporter of having a more general data abstraction JSR for the platform comparable to what Spring is doing with Spring Data. To me the concepts of the different NoSQL categories are too different to have a one-size-fits-all approach.
Eureka!
via: http://blog.eisele.net/2012/05/future-of-nosql-with-java-ee.html
Algorithm for Automatic Cache Invalidation
Jakub Łopuszański describes in great detail, and with examples, an algorithm for cache invalidation:
Imagine a bipartite graph which on the left hand side has one vertex per each possible subspace of a write query, and on the right side has vertices corresponding to subspaces of read queries. Actually both sets are equal, but we will focus on edges.
Edge goes from left to right, if a query on the left side affects results of a query on the right side. As said before, both sets are infinite, but that’s not the problem. There are infinitely many edges, but it’s also not bad. What’s bad is that there are nodes on the left side with the infinite degree, which means, we need to invalidate infinitely many queries. What the above tricky algorithm does, is adding a third layer to the graph, in the middle between the two, such that the transitive closure of the resulting graph is still the same (in other words: you can still get by using two edges anywhere you could by one edge in the original graph), yet each node on the left, and each node on the right, have finite (actually constant) degree. This middle layer corresponds to the artificial subspaces with “?” marks, and serves as a connecting hub for all the mess. Now, when a query on the left executes, it needs to inform only its (small number of) neighbours about the change, moving the burden of reading this information to the right. That is, a query on the right side needs to check if there is a message in the “inbox” in the middle layer. So you can think about it as a cooperation where the left query makes one step forward, and the right query does a one step back, to meet at the central place, and pass the important information about the invalidation of cache.
I’m still in front of a piece of paper trying to understand how it works.
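To make the middle-layer idea easier to follow, here is a simplified sketch in Python. It is written in the spirit of the quoted description and is not Jakub Łopuszański's exact construction. It assumes queries over a table with a fixed number of columns, where each query either pins a column to a value or leaves it as a wildcard. The class `VersionedCache`, the `WILD` marker, the hub key layout, and the plain dictionaries standing in for memcached are all assumptions of this example, and concurrency is ignored.

```python
from itertools import product

WILD = "*"  # hypothetical marker: the query does not constrain this column


class VersionedCache:
    """Writers bump version counters on a few middle-layer hubs;
    readers compare hub versions before trusting a cached result."""

    def __init__(self, num_columns):
        self.k = num_columns
        self.versions = {}  # hub key -> version counter
        self.cache = {}     # read subspace -> (hub version snapshot, result)

    def _hubs(self, subspace, is_write):
        # One hub per guess about which columns the other side pins,
        # so every query touches 2**k hubs: constant for a fixed table.
        hubs = []
        for other_pins in product((False, True), repeat=self.k):
            key = []
            for value, other in zip(subspace, other_pins):
                mine = value != WILD
                if mine and other:
                    # Both sides pin this column: they must agree on the value.
                    key.append(("=", value))
                else:
                    # The "?" slot records only who pinned the column.
                    w_pins, r_pins = (mine, other) if is_write else (other, mine)
                    key.append(("?", w_pins, r_pins))
            hubs.append(tuple(key))
        return hubs

    def write(self, subspace):
        # The write takes one step forward: it notifies its hubs.
        for hub in self._hubs(subspace, is_write=True):
            self.versions[hub] = self.versions.get(hub, 0) + 1

    def read(self, subspace, compute):
        # The read takes one step back: it checks its hubs' "inbox".
        snapshot = tuple(
            self.versions.get(hub, 0)
            for hub in self._hubs(subspace, is_write=False)
        )
        cached = self.cache.get(subspace)
        if cached is not None and cached[0] == snapshot:
            return cached[1]
        result = compute()
        self.cache[subspace] = (snapshot, result)
        return result


calls = []


def compute():
    # Stand-in for running the real read query.
    calls.append(1)
    return len(calls)


cache = VersionedCache(num_columns=2)
first = cache.read((1, WILD), compute)       # miss: computes
second = cache.read((1, WILD), compute)      # hit: no write in between
cache.write((2, 5))                          # disjoint from column 0 == 1
unaffected = cache.read((1, WILD), compute)  # still a hit
cache.write((WILD, 7))                       # overlaps every row with column 1 == 7
refreshed = cache.read((1, WILD), compute)   # miss: recomputed
```

In this sketch a writer hub and a reader hub coincide exactly when the two subspaces overlap, so a read is invalidated only by writes that could have changed its result.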
via: https://groups.google.com/d/topic/memcached/OiScvRbGaU8/discussion
Monday, 7 May 2012
Paper: Principles of Distributed Data Management in 2020?
Patrick Valduriez, co-author of the “Principles of Distributed Database Systems” book, has published a paper Principles of Distributed Data Management in 2020? (pdf) translating the main topic into the following 3 questions:
- What are the fundamental principles behind the emerging solutions?
- Is there any generic architectural model, to explain those principles?
- Do we need new foundations to look at data distribution?
Wrt (2), I showed that emerging solutions can still be explained along the three main dimensions of distributed data management (distribution, autonomy, heterogeneity), yet pushing the scales of the dimensions high up. However, I raised the question of how generic should distributed data management be, without hampering application-specific optimizations. Emerging NOSQL solutions tend to rely on a specific data model (e.g. Bigtable, MapReduce) with a simple set of operators easy to use from or with a programming language. It is also interesting to witness the development of algebras, with specific operators, to raise the level of abstraction in a way that enables optimization [9]. What is missing to explain the principles of emerging solutions is one or more dimensions on generic/specific data model and data processing.
What I think this paper actually does is look at two different questions, a bit less generic but still useful in showing that the new generation of distributed database systems was clearly triggered by new requirements and by the evolution of today’s applications:
- Is there a need for new approaches in distributed data management systems?
- What are some of the approaches used by the emerging solutions to deal with the challenges posed by today’s data-intensive applications?
Thursday, 3 May 2012
Lock-Free Algorithms: How Intel X86_64 Processors and Their Memory Model Works
Download the slides, set aside 1 hour and 10 minutes of uncontended time, click the Maximize button, and watch this great presentation by Martin Thompson and Michael Barker diving into the Intel x86_64 processors and memory models for implementing lock-free algorithms. Once you’re done, make sure to also read The Single Writer Principle by the same Martin Thompson.
Friday, 27 April 2012
The Single Writer Principle
Martin Thompson:
When trying to build a highly scalable system the single biggest limitation on scalability is having multiple writers contend for any item of data or resource. Sure, algorithms can be bad, but let’s assume they have a reasonable Big O notation so we’ll focus on the scalability limitations of the systems design.
I keep seeing people just accept having multiple writers as the norm. There is a lot of research in computer science for managing this contention that boils down to 2 basic approaches. One is to provide mutual exclusion to the contended resource while the mutation takes place; the other is to take an optimistic strategy and swap in the changes if the underlying resource has not changed while you created the new copy.
The Single Writer Principle is that for any item of data, or resource, that item of data should be owned by a single execution context for all mutations.
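A minimal sketch of the principle in Python, assuming a simple shared counter: producer threads never touch the state, they only send messages, and one writer thread owns every mutation. The function name `run_single_writer` and the message format are made up for this example. Note that `queue.Queue` uses locks internally, so this shows the ownership model only, not the lock-free performance Martin Thompson is after.

```python
import queue
import threading


def run_single_writer(num_producers, increments_each):
    inbox = queue.Queue()
    totals = {"count": 0}  # mutated only by the writer thread below

    def writer():
        # The single execution context that owns all mutations.
        while True:
            delta = inbox.get()
            if delta is None:  # sentinel: no more messages
                break
            totals["count"] += delta

    def producer():
        # Producers request changes instead of making them.
        for _ in range(increments_each):
            inbox.put(1)

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    producers = [threading.Thread(target=producer) for _ in range(num_producers)]
    for thread in producers:
        thread.start()
    for thread in producers:
        thread.join()
    inbox.put(None)
    writer_thread.join()
    return totals["count"]


total = run_single_writer(num_producers=4, increments_each=1000)
```

Because only one thread ever writes `totals`, the update itself needs neither mutual exclusion nor an optimistic retry loop; the contention moves into the hand-off queue.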
via: http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html
Tuesday, 24 April 2012
Should I Expose Asynchronous Wrappers for Synchronous Methods?
There are two primary benefits I see to asynchrony: scalability and offloading (e.g. responsiveness, parallelism). Which of these benefits matters to you is typically dictated by the kind of application you’re writing. Most client apps care about asynchrony for offloading reasons, such as maintaining responsiveness of the UI thread, though there are certainly cases where scalability matters to a client as well (often in more technical computing / agent-based simulation workloads). Most server apps care about asynchrony for scalability reasons, though there are cases where offloading matters, such as in achieving parallelism in back-end compute servers.
There might be a third scenario (or at least a sub-category of the responsiveness benefit): adding timeout capabilities to non-critical remote invocations. What I have in mind is simulating the actor-based approach in environments with no native support for it.
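The timeout scenario could look roughly like this in Python, using the standard `concurrent.futures` module. The helper `call_with_timeout` and the stand-in functions `slow_lookup` and `fast_lookup` are hypothetical names for this sketch; a real remote call would replace them.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Shared worker pool; the size is an arbitrary choice for this sketch.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_seconds, default, *args):
    """Run a synchronous call on a worker thread and give up waiting
    after timeout_seconds, returning default instead."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        # The worker keeps running in the background; only the caller
        # stops waiting for it.
        return default


def slow_lookup():
    # Stand-in for a non-critical remote invocation that is too slow.
    time.sleep(0.3)
    return "late answer"


def fast_lookup(x):
    return x * 2


on_time = call_with_timeout(fast_lookup, 1.0, None, 21)
timed_out = call_with_timeout(slow_lookup, 0.05, "fallback")
```

This wrapper offloads the wait rather than cancelling the work, so a slow backend can still tie up pool threads; that trade-off is part of why wrapping synchronous methods deserves some thought.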
via: http://blogs.msdn.com/b/pfxteam/archive/2012/03/24/10287244.aspx
Friday, 20 April 2012
Networks Never Fail
A reminder to those thinking that networks never fail and automation can solve everything. Christina Ilvento, on behalf of the App Engine team:
The root cause of the outage was a combination of two factors during a scheduled network maintenance in one of our datacenters. As part of the scheduled maintenance, network capacity to and from this datacenter was reduced. This alone was expected, and was not a problem. However, this maintenance exposed a previously existing misconfiguration in the system that manages network bandwidth capacity.
Ordinarily, the bandwidth management system helps isolate and prioritize traffic. When capacity is reduced because of maintenance, network failure, or due to an excess of normal traffic, the bandwidth management system keeps things running smoothly by throttling back the rate of low priority traffic. However, as mentioned, the bandwidth management system had a latent misconfiguration which did not show up until capacity was reduced due to the scheduled maintenance. This misconfiguration under-reported the available network capacity to and from the datacenter, causing the network modeler to believe that there was less overall capacity than actually existed.
The configuration error in the bandwidth management system, when combined with an expected reduction in capacity due to the scheduled maintenance, led the system to conclude that there was insufficient bandwidth available for current traffic demand to and from this datacenter. (In reality, there was more than sufficient excess capacity, as otherwise the maintenance would not have been allowed to go forward.) Because of this combination of misconfiguration and scheduled maintenance, a number of services were automatically blocked from sending network traffic. […]
The outage occurred because two independent systems failed at the same time, which resulted in mistakes in our usual escalation procedures which significantly impacted the duration of the outage.
Thursday, 19 April 2012
A Different Approach to Data Modeling: Thinking of Data Service Requirements
Pedro Visintin:
We can distinguish several objects: Session, Cart, Item, User, Order, Product and Payment. Usually we use ActiveRecord to store all of them. But this time let’s think about it differently.
For sessions, we don’t need durable data at all; Redis can be a good option, and of course will be faster than any RDBMS. For Cart and Item, we will need high availability across different locations. Riak can fit well for this use case. For User, Order, Product and Payment, a relational database can fit well, focusing on Transactions and Reporting about our application.
This is a very good exercise for understanding the requirements for your data service layer. As much as I write about polyglot persistence, when architecting an application never leave aside or ignore the operational requirements for your service.
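One way to run this exercise is to write the requirements down per entity and derive the store from them. The sketch below is only an illustration: the three requirement flags, their values, and the `pick_store` rules are my reading of the quote above, not something Pedro Visintin specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    durable: bool                  # must the data survive a restart?
    multi_site_availability: bool  # must it stay available across locations?
    transactions: bool             # does it need transactions and reporting?


def pick_store(req: Requirements) -> str:
    # Hypothetical decision rules mirroring the reasoning in the quote.
    if req.transactions:
        return "relational database"
    if req.multi_site_availability:
        return "Riak"
    if not req.durable:
        return "Redis"
    return "relational database"


# Assumed requirement values for the objects named in the quote.
ENTITY_REQUIREMENTS = {
    "Session": Requirements(durable=False, multi_site_availability=False, transactions=False),
    "Cart": Requirements(durable=True, multi_site_availability=True, transactions=False),
    "Item": Requirements(durable=True, multi_site_availability=True, transactions=False),
    "User": Requirements(durable=True, multi_site_availability=False, transactions=True),
    "Order": Requirements(durable=True, multi_site_availability=False, transactions=True),
    "Product": Requirements(durable=True, multi_site_availability=False, transactions=True),
    "Payment": Requirements(durable=True, multi_site_availability=False, transactions=True),
}

STORE_FOR = {name: pick_store(req) for name, req in ENTITY_REQUIREMENTS.items()}
```

Operational requirements (backups, monitoring, the team's experience running each store) are not modeled here and would need their own columns in a real version of this table.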
Friday, 6 April 2012
The Database Nirvana
Scroll to minute 16:55 of this video to watch Jim Webber explain the benefits of polyglot persistence and how starting (again) the winner-takes-it-all war is just sending us back at least 10 years from the database Nirvana.
We’ve just come from the place where one-size-fits-all and we don’t want to go back there. There is a huge wonderful ecosystem of stores. Pick the right one. Don’t just assume that the one you find the easiest or the one that shouts the loudest is the one you’re going to use. Pick the one that suits your data model.
It doesn’t matter what flavor of relational or NoSQL database you prefer or have experience with or if a small or large database vendor is paying your bills. You really need to get this right as otherwise we’re just going to destroy a lot of valuable options we’ve added to our toolboxes.
Cardinality Estimation Algorithms: Memory Efficient Solutions for Counting 1 Billion Distinct Objects
Matt Abrams from Clearspring:
Cardinality estimation algorithms trade space for accuracy. To illustrate this point we counted the number of distinct words in all of Shakespeare’s works using three different counting techniques. Note that our input dataset has extra data in it so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a Hyper LogLog Counter. Here are the results:

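To make the space-for-accuracy trade concrete, here is a small Python sketch of a linear probabilistic counter next to an exact set-based count. It is a generic textbook version of linear counting, not Clearspring's implementation; the function name `linear_count`, the bitmap size, and the use of SHA-1 as the hash are assumptions of this example, and the input is synthetic rather than the Shakespeare dataset.

```python
import hashlib
import math


def linear_count(items, num_bits=1 << 16):
    """Estimate the number of distinct items with a fixed-size bitmap."""
    bitmap = bytearray(num_bits // 8)
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        pos = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[pos >> 3] |= 1 << (pos & 7)  # set the bit for this hash
    set_bits = sum(bin(byte).count("1") for byte in bitmap)
    zero_bits = num_bits - set_bits
    if zero_bits == 0:
        raise ValueError("bitmap saturated; use more bits")
    # Linear counting estimator: n is roughly -m * ln(fraction of zero bits)
    return -num_bits * math.log(zero_bits / num_bits)


# Synthetic input: 10,000 distinct values, each seen twice.
words = [f"word-{i}" for i in range(10_000)] * 2

exact = len(set(words))         # memory grows with the number of distinct items
estimate = linear_count(words)  # memory fixed at num_bits / 8 bytes (8 KiB here)
relative_error = abs(estimate - exact) / exact
```

The exact count has to keep every distinct value in memory, while the bitmap stays the same size no matter how many items pass through it, at the cost of a small estimation error that grows as the bitmap fills up.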