
Virtual Nodes Strategies

Interesting post by Acunu on 3 strategies for virtual nodes:

At Acunu we wanted to bring virtual nodes support to Cassandra to alleviate some of these operations headaches and I spent some time looking at some of the different approaches we could take. The three main variants that I found could be broadly divided into the following three categories: 1) random token assignment; 2) fixed partition assignment; 3) automatic sharding
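A minimal sketch of strategy (1), random token assignment, may make the idea concrete. All names here are hypothetical; this is not Acunu's or Cassandra's implementation. Each physical node claims several tokens on a hash ring, so load evens out and a new node takes small ranges from every existing node instead of one large range from a single neighbor:

```python
import bisect
import hashlib

RING = 2**32

def token(value):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING

class VNodeRing:
    def __init__(self, vnodes_per_node=8):
        self.vnodes = vnodes_per_node
        self._tokens = []          # sorted ring positions
        self._owner = {}           # token -> physical node

    def add_node(self, node):
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")   # deterministic stand-in for random tokens
            self._owner[t] = node
            bisect.insort(self._tokens, t)

    def owner_of(self, key):
        """The node owning the first token at or after hash(key), wrapping around."""
        i = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[i]]

ring = VNodeRing()
for n in ("node-a", "node-b", "node-c"):
    ring.add_node(n)
print(ring.owner_of("user:42"))   # stable until ring membership changes
```

Fixed partition assignment (strategy 2) would replace the random tokens with a predetermined, evenly spaced set; automatic sharding (strategy 3) splits and moves ranges based on observed load.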

Original title and link: Virtual Nodes Strategies (NoSQL database©myNoSQL)


NoSQL Database Virtual Panel: Architectural Patterns, Use Cases, Limitations and Constraints

In a sort of follow-up to the NoSQL panel I hosted at QCon, InfoQ published a virtual panel covering the following topics:

  1. What is the current state of NoSQL databases in terms of enterprise adoption?
  2. Can you discuss the core architectural patterns your NoSQL database product supports in the areas of data persistence, retrieval, and other database related concerns?
  3. What are the use cases or applications that are best suited to use your product?
  4. What are the limitations or constraints of using the product?
  5. Cross-Store Persistence concept is getting a lot of attention lately. This can be used to persist the data into different databases including NoSQL and Relational databases. Can you talk about this approach and how it can help the application developers who need to connect to multiple databases from a single application?
  6. What do you see as the role of In-memory data grids (IMDG) in the polyglot persistence space?
  7. What is the current state of security in this space and what’s coming?
  8. What is the future road map for your product in terms of new features and enhancements?

I’d suggest reading at least the answers to questions 2 to 6.

Original title and link: NoSQL Database Virtual Panel: Architectural Patterns, Use Cases, Limitations and Constraints (NoSQL database©myNoSQL)


The Behavior of EC2/EBS Metadata Replicated Datastore

The Amazon post about the service disruption that happened late last month provides an interesting description of the behavior of the Amazon EC2 and EBS metadata datastores:

The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. These datastores are used to store metadata for resources such as instances, volumes, and snapshots. To protect against datastore corruption, currently when the primary copy loses power, the system automatically flips to a read-only mode in the other Availability Zones until power is restored to the affected Availability Zone or until we determine it is safe to promote another copy to primary.
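The flip described above can be modeled as a simple read-only switch. This is a toy sketch with illustrative names, not Amazon's actual design: reads stay available from every zone, while writes stop the moment the primary copy is lost, until power returns or an operator deliberately promotes another copy:

```python
class ReplicatedMetadataStore:
    """Toy model of a multi-zone replicated metadata store."""

    def __init__(self, zones):
        self.replicas = {z: {} for z in zones}   # zone -> key/value metadata
        self.primary = zones[0]
        self.read_only = False

    def put(self, key, value):
        if self.read_only:
            raise RuntimeError("read-only: primary copy unavailable")
        for data in self.replicas.values():      # replicate to every zone
            data[key] = value

    def get(self, zone, key):
        return self.replicas[zone].get(key)      # reads keep working everywhere

    def primary_power_lost(self):
        self.read_only = True                    # automatic; protects against corruption

    def promote(self, zone):
        self.primary = zone                      # an explicit operator decision
        self.read_only = False

store = ReplicatedMetadataStore(["zone-a", "zone-b"])
store.put("vol-123", {"state": "available"})
store.primary_power_lost()
print(store.get("zone-b", "vol-123"))   # still readable from another zone
```

The interesting design choice is that promotion is not automatic: trading write availability for safety is exactly the behavior the Amazon post describes.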

Original title and link: The Behavior of EC2/EBS Metadata Replicated Datastore (NoSQL database©myNoSQL)


NoSQL Data Models and Adoption

In an interview for the DataStax blog, Philippe Modard, engineer and CTO at V2i:

The big difference over relational databases is the data model. Once we understood how things needed to be modeled and defined, everything else was a piece of cake.

Indeed, the new data models are the first obstacle developers encounter when considering a NoSQL database. Some might think it’s about using new APIs, lacking a query language like SQL, or having to learn a different one. But I don’t think these are the real causes.

The first time I experienced the unfamiliarity of a new data model was back in 2005, when I started using Jackrabbit, a JCR implementation with a hierarchical model. A couple of years later I had the same feeling when first using the Google App Engine datastore.

It wasn’t about the new APIs though. And it wasn’t about the query languages either. For me it was about rethinking how I store and access data. It was striking to realize how used I was to thinking in terms of the relational model, even if not everything I had implemented before was purely relational.

Looking at the various NoSQL databases around, you can see that those that started with a data model closer to the relational one have seen faster adoption. And I don’t think the main reason is better data models per se, but simply familiarity.

Original title and link: NoSQL Data Models and Adoption (NoSQL database©myNoSQL)

The Myth of Auto Scaling as a Capacity Planning Approach

A quite old but very instructive post by James Golick dissecting the myth of just-in-time extra server capacity:

There’s this idea floating around that we can scale out our data services “just in time”. Proponents of cloud computing frequently tout this as an advantage of such a platform. Got a load spike? No problem, just spin up a few new instances to handle the demand. It’s a great sounding story, but sadly, things don’t quite work that way.

This is the Mythical Man-Month of the IT department.

John Allspaw

Original title and link: The Myth of Auto Scaling as a Capacity Planning Approach (NoSQL database©myNoSQL)


NO DB - the Center of Your Application Is Not the Database

Uncle Bob:

The center of your application is not the database. Nor is it one or more of the frameworks you may be using. The center of your application are the use cases of your application. […] If you get the database involved early, then it will warp your design. It’ll fight to gain control of the center, and once there it will hold onto the center like a scruffy terrier. You have to work hard to keep the database out of the center of your systems. You have to continuously say “No” to the temptation to get the database working early.
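One way to read this advice, sketched here under assumed names: the use case depends on an abstract gateway, and the database becomes a plug-in detail at the edge of the system rather than its center:

```python
from dataclasses import dataclass
from typing import Optional, Protocol

@dataclass
class Order:
    id: str
    total: float

class OrderGateway(Protocol):
    """Boundary interface: the use case knows this, not the database."""
    def save(self, order: Order) -> None: ...
    def find(self, order_id: str) -> Optional[Order]: ...

def place_order(gateway: OrderGateway, order_id: str, total: float) -> Order:
    """The use case: pure application rules, no database imports."""
    if total <= 0:
        raise ValueError("order total must be positive")
    order = Order(order_id, total)
    gateway.save(order)
    return order

class InMemoryOrders:
    """A detail, swappable later for MongoDB, Postgres, or anything else."""
    def __init__(self):
        self._rows = {}
    def save(self, order):
        self._rows[order.id] = order
    def find(self, order_id):
        return self._rows.get(order_id)

orders = InMemoryOrders()
place_order(orders, "o-1", 25.0)
print(orders.find("o-1"))   # Order(id='o-1', total=25.0)
```

Because the use case is testable against the in-memory detail, saying “No” to the database early costs nothing.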

Original title and link: NO DB - the Center of Your Application Is Not the Database (NoSQL database©myNoSQL)


The Grand Picture of Big Data and the Impact on the Architecture of Systems

In a recent interview for AllThingsD, Mike Rhodin, the senior vice president of IBM’s Software Solutions Group gave a very realistic description of what the future of data looks like:

[…] it comes out of the digitization of the physical world, the instrumentation of physical processes that’s going to generate huge amounts of new data, which is going to drive issues around storage, and what to do with all the data, how to analyze it. That pushes you toward real-time analytics and streaming technologies, because with real time, you don’t have to save the data — you want to look for anomalies as they occur.

This is indeed the grand picture of Big Data.

Now think for a second how many companies have such systems in place. Not many. Think now how many companies can offer as-complete-as-possible integrated systems to address these challenges. Very few.

These two answers reveal an interesting perspective on the future of the Big Data market.

On one side we have vendors building top-notch solutions: consider the new features in relational databases, NoSQL databases, Hadoop, etc. Looking at this space, you’ll have to agree that these are all excellent solutions for tackling a sub-space of the overall problem. They are getting closer and closer to offering locally optimal solutions.

On the other side there are the system integrators and platform vendors. Their systems may not be the best at solving every aspect of a problem, but their focus is on addressing and solving the complete problem. Their sales pitch is integration and/or specialization.

As someone writing about polyglot persistence, the 1001 NoSQL and NewSQL databases, and the evolution of relational databases, I could be tempted to think that every company has the budget, the know-how, and the time to take top-notch sub-systems and craft a solution to its own problem. But looking back in time and applying the lessons of other markets, I think it is safe to say that integrated solutions are preferred.

The lesson to be learned by both NoSQL and relational database vendors, and in fact by every (sub)system vendor playing in the Big Data market, is to design products with openness and integration in mind. Very few, if any, sub-systems will be part of the grand solution if they are architected as silos. They can continue to provide the ultimate locally optimal solutions, but as long as they are not architected to be part of a collaborative, integrated platform, they will lose important segments of the market. Many products I’m writing about already follow this principle, many are making steps toward friendlier integration, and many are still taking the silver-bullet approach.

Original title and link: The Grand Picture of Big Data and the Impact on the Architecture of Systems (NoSQL database©myNoSQL)

Neo4j Data Modeling: What Question Do You Want to Answer?

Mark Needham:

Over the past few weeks I’ve been modelling ThoughtWorks project data in neo4j and I realised that the way that I’ve been doing this is by considering what question I want to answer and then building a graph to answer it.

This same principle should be applied to modeling with any NoSQL database. Thinking in terms of access patterns is one of the major differences between data modeling in the NoSQL space and in the relational world, which, at least in its first phases and in theory, is driven by normalization rules.
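The query-first habit works outside graph databases too. A tiny sketch, with made-up project and people names, that shapes the data around the single question “who has this person worked with?”:

```python
# Data shaped to answer one question directly, no joins needed.
projects = {
    "project-x": {"members": {"alice", "bob"}},
    "project-y": {"members": {"bob", "carol"}},
}

def worked_with(person):
    """Walk the projects as edges of an implicit person-project graph."""
    colleagues = set()
    for project in projects.values():
        if person in project["members"]:
            colleagues |= project["members"]
    colleagues.discard(person)
    return colleagues

print(worked_with("bob"))   # alice and carol, in some order
```

Start from a different question (say, “which projects is a person on?”) and you would denormalize the other way, keyed by person. That inversion is the access-pattern thinking in miniature.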

Original title and link: Neo4j Data Modeling: What Question Do You Want to Answer? (NoSQL database©myNoSQL)


In Defense of ORMs

Martin Fowler in a post defending the role of ORMs:

As you might have gathered, I think NoSQL is technology to be taken very seriously. If you have an application problem that maps well to a NoSQL data model - such as aggregates or graphs - then you can avoid the nastiness of mapping completely. Indeed this is often a reason I’ve heard teams go with a NoSQL solution.

The truth is that some NoSQL databases are attracting more mapping libraries and frameworks than I’ve ever seen built for relational databases. Even worse, there are attempts to hide all NoSQL databases behind the same libraries, specs, or APIs. I think the old principle of decoupling the application from the underlying database is so ingrained in the software community that almost nobody asks the real questions: will I really need to change the data storage, will it actually work as planned, and what will I lose by adding just another indirection layer?

As Martin Fowler correctly emphasizes in his post, the real benefit of ORMs comes from eliminating the plumbing:

A framework that allows me to avoid 80% of that [boiler-plate code] is worthwhile even if it is only 80%. The problem is in me for pretending it’s 100% when it isn’t.

But that doesn’t mean everyone should create their own mapping library or framework for each and every project.
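For a sense of scale, the boiler-plate Fowler mentions looks roughly like this generic save/load pair. This is a deliberately naive sketch over SQLite, with made-up helper names, of exactly the kind of plumbing nobody should keep rewriting per project:

```python
import sqlite3
from dataclasses import astuple, dataclass, fields

@dataclass
class User:
    id: int
    name: str

def save(conn, obj):
    """Generic insert for any flat dataclass: the repetitive 80%."""
    table = type(obj).__name__.lower()
    cols = ", ".join(f.name for f in fields(obj))
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    marks = ", ".join("?" for _ in fields(obj))
    conn.execute(f"INSERT INTO {table} VALUES ({marks})", astuple(obj))

def load(conn, cls, obj_id):
    """Generic select-by-id, rebuilding the dataclass from the row."""
    table = cls.__name__.lower()
    row = conn.execute(f"SELECT * FROM {table} WHERE id = ?", (obj_id,)).fetchone()
    return cls(*row) if row else None

conn = sqlite3.connect(":memory:")
save(conn, User(1, "ada"))
print(load(conn, User, 1))   # User(id=1, name='ada')
```

The remaining 20% (relationships, lazy loading, migrations, caching) is where a real ORM earns its keep, and also where the leaks appear.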

Original title and link: In Defense of ORMs (NoSQL database©myNoSQL)


The Future of NoSQL With Java EE

Markus Eisele:

We already have a lot in place for the so-called “NoSQL” DBs. And the groundwork for integrating this into new Java EE standards is promising. Control of embedded NoSQL instances should be done via JSR 322 (Java EE Connector Architecture), this being the only allowed place to spawn threads and open files directly from a filesystem. I’m not a big supporter of having a more general data abstraction JSR for the platform comparable to what Spring is doing with Spring Data. To me the concepts of the different NoSQL categories are too different than to have a one-size-fits-all approach.


Original title and link: The Future of NoSQL With Java EE (NoSQL database©myNoSQL)


Algorithm for Automatic Cache Invalidation

Jakub Łopuszański describes in much detail and with examples an algorithm for cache invalidation:

Imagine a bipartite graph which on the left hand side has one vertex per each possible subspace of a write query, and on the right side has vertices corresponding to subspaces of read queries. Actually both sets are equal, but we will focus on edges.

Edge goes from left to right, if a query on the left side affects results of a query on the right side. As said before, both sets are infinite, but that’s not the problem. There are infinitely many edges, but it’s also not bad. What’s bad is that there are nodes on the left side with the infinite degree, which means, we need to invalidate infinitely many queries. What the above tricky algorithm does, is adding a third layer to the graph, in the middle between the two, such that the transitive closure of the resulting graph is still the same (in other words: you can still get by using two edges anywhere you could by one edge in the original graph), yet each node on the left, and each node on the right, have finite (actually constant) degree. This middle layer corresponds to the artificial subspaces with “?” marks, and serves as a connecting hub for all the mess. Now, when a query on the left executes, it needs to inform only its (small number of) neighbours about the change, moving the burden of reading this information to the right. That is, a query on the right side needs to check if there is a message in the “inbox” in the middle layer. So you can think about it as a cooperation where the left query makes one step forward, and the right query does a one step back, to meet at the central place, and pass the important information about the invalidation of cache.

I’m still sitting in front of a piece of paper, working out how it all fits together.
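A toy version of the idea, with two fixed attributes and every name hypothetical: a write bumps a version counter on each of its 2^n “?”-generalizations (its constant number of hub neighbors, one step forward), and a read checks only the hub key matching its own subspace pattern (one step back). A cached result is fresh only if it is newer than its hub:

```python
import itertools

ATTRS = ("city", "age")              # the query attributes, fixed for the sketch

def write_hubs(point):
    """All 2^n '?'-generalizations of a fully specified write."""
    keys = set()
    for mask in itertools.product((False, True), repeat=len(ATTRS)):
        keys.add(tuple("?" if m else point[a] for a, m in zip(ATTRS, mask)))
    return keys

def read_hub(subspace):
    """A read subspace maps to one hub key: fixed attrs keep values, free get '?'."""
    return tuple(subspace.get(a, "?") for a in ATTRS)

versions = {}                        # hub key -> logical time of last write
clock = itertools.count(1)

def on_write(point):
    t = next(clock)
    for h in write_hubs(point):      # inform a constant number of hubs
        versions[h] = t

def is_fresh(subspace, cached_at):
    return versions.get(read_hub(subspace), 0) <= cached_at

cached_at = next(clock)                        # a result cached "now"
on_write({"city": "NYC", "age": 20})           # a later write
print(is_fresh({"city": "NYC"}, cached_at))    # False: the NYC cache is stale
print(is_fresh({"city": "LA"}, cached_at))     # True: untouched subspace
```

Both sides do constant work (for a fixed set of attributes), which is exactly the finite-degree property the middle layer buys.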

Original title and link: Algorithm for Automatic Cache Invalidation (NoSQL database©myNoSQL)


Paper: Principles of Distributed Data Management in 2020?

Patrick Valduriez, co-author of the “Principles of Distributed Database Systems” book, has published a paper Principles of Distributed Data Management in 2020? (pdf) translating the main topic into the following 3 questions:

  1. What are the fundamental principles behind the emerging solutions?
  2. Is there any generic architectural model, to explain those principles?
  3. Do we need new foundations to look at data distribution?

Wrt (2), I showed that emerging solutions can still be explained along the three main dimensions of distributed data management (distribution, autonomy, heterogeneity), yet pushing the scales of the dimensions high up. However, I raised the question of how generic should distributed data management be, without hampering application-specific optimizations. Emerging NOSQL solutions tend to rely on a specific data model (e.g. Bigtable, MapReduce) with a simple set of operators easy to use from or with a programming language. It is also interesting to witness the development of algebras, with specific operators, to raise the level of abstraction in a way that enables optimization [9]. What is missing to explain the principles of emerging solutions is one or more dimensions on generic/specific data model and data processing.

What I think this paper actually does is look at two different questions, a bit less generic but still useful in proving that the new generation of distributed database systems was clearly triggered by new requirements and the evolution of today’s applications:

  1. Is there a need for new approaches in distributed data management systems?
  2. What are some of the approaches used by the emerging solution to deal with the challenges posed by today’s data-intensive applications?

You can read or download Patrick Valduriez’s paper here: