



From CouchDB to Riak at Linkfluence

We were already aware of Riak before we started using CouchDB, but we weren’t sure about trusting a new product at that point, so we decided, after some benchmarking, to go with CouchDB.

After the first couple of months, it was obvious that this was a bad choice.

Our main problems with CouchDB were scalability, versioning, and stability.

I am wondering how using BigCouch would have addressed Linkfluence’s requirements and the stability/maintenance issues.

The article also gives an overview of Linkfluence’s polyglot persistence architecture:

  • PostgreSQL: some indexes on documents’ ID
  • MongoDB: store tweets relationships and some indexes
  • Riak for content and metadata (replacing CouchDB)
  • Redis for caching
  • Solr for search indexes
  • ElasticSearch for secondary indexes

You might also enjoy some of the comments on the Hacker News thread.

Original title and link: From CouchDB to Riak at Linkfluence (NoSQL databases © myNoSQL)


Redis: Le système de cache parfait

I love how this sounds in French:

Après 3 ans d’une histoire d’amour fidèle avec Memcached; le serveur de cache notamment utilisé par Facebook, Youtube ou Twitter; je suis au bord de la rupture après avoir rencontré redis.

(Roughly: “After 3 years of a faithful love affair with Memcached, the cache server used by Facebook, YouTube, and Twitter, I am on the verge of a breakup after meeting Redis.”)

The author, Julien Crouzet, mentions three key features of Redis:

  • non-volatile data
  • performance
  • support for data types

On these points:

But Redis’ support for data types (lists, sets, sorted sets, and hashes) is not up for debate.
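To illustrate that last point, here is a plain-Python mimic of what Redis sorted sets give you through ZADD and ZRANGE; this is a sketch of the semantics only, not the Redis client API, and the `leaderboard` example is made up for illustration:

```python
# A plain-Python mimic of Redis sorted-set semantics (ZADD / ZRANGE).
# In Redis, each member of a sorted set carries a score, and range
# queries return members ordered by that score.
leaderboard = {}                        # member -> score, like a zset

def zadd(zset, score, member):
    zset[member] = score                # re-adding a member updates its score

def zrange(zset, start, stop):
    ordered = sorted(zset, key=zset.get)
    end = None if stop == -1 else stop + 1
    return ordered[start:end]

zadd(leaderboard, 300, "alice")
zadd(leaderboard, 100, "bob")
zadd(leaderboard, 200, "carol")

print(zrange(leaderboard, 0, -1))       # members ordered by score
```

With Memcached, keeping such a ranking would mean serializing the whole structure into one value and rewriting it on every update; Redis maintains the ordering server-side.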

Original title and link: redis : Le système de cache parfait (NoSQL databases © myNoSQL)


Adku's Choice: Cassandra or HBase

The 6 (originally 8) reasons[1] Adku prefers Cassandra over HBase:

  1. Reliability
  2. Performance
  3. Consistency
  4. Single point of failure
  5. Hot spot problem
  6. MapReduce
  7. Simpler, Hackable
  8. Community support

Before jumping to any conclusions make sure you read the disclaimer:

While these decisions apply to Adku, they might not apply to your situation. Always do your own investigation and experimentation before choosing any large part of your system.

Update: JD Cryans[2] commented on the points listed above (thanks JD):

This comparison reminds me of the pain we went through in late 2009, when lots of similar comparisons came out from all sides: the “NoSQL war”. Unfortunately, as we all found out, no one wins.

But let’s look at the points mentioned in this post.

  • Reliability: As far as I can tell that’s not a reliability test. The first thing that raises questions is the large number of crashes of the region servers. Considering the data set used (1 million rows of the full “Alice in Wonderland” text) is small compared to the ones other HBase users (StumbleUpon, Mozilla) are handling, that would point to a configuration problem that wasn’t taken care of.

    One could say it’s because HBase is hard to configure or that the default configurations aren’t good, and to some extent I agree, but you don’t quantify reliability based on these.

  • Hot Spot Problem: This point is an interesting one, and more likely falls into the disclaimer.

    Distribution based on timestamp row keys will be better with Cassandra. But usually when using timestamps you also want range scans which is impossible with hashing. For example OpenTSDB provides a very efficient way to store time series by using a clever row key design. A design that you’ll probably also use if you need scans in Cassandra.

    Not to mention that using MapReduce will require sorted row keys anyways.

  • Community Support: Comparing communities only based on the number of IRC users is too much of a simplification. Someone looking to use an open source project should spend some time getting to know and interact with the users before stating that “one community is more helpful” than the other — a message that could also be perceived as disrespectful.
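The row-key design mentioned under “Hot Spot Problem” can be sketched in a few lines. The layout below (metric id first, then an hour-aligned base timestamp) is a simplification of OpenTSDB’s actual binary schema; the field widths and function name are chosen here for illustration:

```python
import struct

def row_key(metric_id: int, timestamp: int) -> bytes:
    """Build a simplified OpenTSDB-style row key: metric id first, then
    the timestamp rounded down to the hour, both big-endian. All rows for
    one metric are contiguous and sort chronologically, so a time-window
    query is a cheap key-range scan rather than a full table scan."""
    base = timestamp - (timestamp % 3600)   # align to the hour
    return struct.pack(">HI", metric_id, base)

k1 = row_key(7, 1_300_000_000)
k2 = row_key(7, 1_300_003_600)
print(k1 < k2)   # keys compare in time order, so range scans work
```

Hashing the keys (the default Cassandra random partitioner at the time) destroys exactly this ordering, which is why the hot-spot fix and range scans pull in opposite directions.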

There are also a couple of points that are mentioned in the post even if HBase is the “winner” (MapReduce) or the feature is not a hard requirement (consistency).

I left performance for last, as the post mentions similar write performance results. But there is too little information about the benchmark to comment on it. At first glance those results look odd, considering they weren’t using a Hadoop version that supports append, which, as shown by the original YCSB paper, would make quite a difference.

After the Adku blog came out, Edward Capriolo wrote this response (rant?) to all who try to do the same as them and I think it’s worth the read:

  1. From the original list I have crossed out MapReduce, as the author considers HBase the “winner”. Also, commenters on the original post have clarified the confusion about HBase’s single point of failure.  

  2. Jean-Daniel Cryans: Apache HBase committer and DB Engineer at StumbleUpon, @jdcryans

Original title and link: Adku’s Choice: Cassandra or HBase (NoSQL databases © myNoSQL)


Benchmarking MongoDB

The code is purposely a naive implementation, to test how fast each back end is without resorting to optimizations, hacks or tricks. There are probably ways of making it much faster. And even though the production code will be very different to this early experiment, it is not an evil, synthetic micro-benchmark: on the contrary, it is a real application!

You could say that, being a benchmark for a specific scenario, the results are relevant in that context. But I’d also include the following two checks:

  • insert some rogue data and try to recover
  • run a kill -9 midway through the import
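The kill -9 check can be automated with a small probe. The script below is a generic sketch (not the article’s benchmark code): it starts a writer process, SIGKILLs it mid-import, then counts how many complete records survived on disk. Against a real database you would replace the file writer with the import client and the line count with a post-restart record count:

```python
import os, signal, subprocess, sys, tempfile, time

# The importer: an endless loop writing records, fsync'ing every 1000.
# It is run as a child process so we can kill -9 it from outside.
importer = (
    "import os, sys\n"
    "f = open(sys.argv[1], 'w')\n"
    "i = 0\n"
    "while True:\n"
    "    f.write('record-%d\\n' % i)\n"
    "    i += 1\n"
    "    if i % 1000 == 0:\n"
    "        f.flush(); os.fsync(f.fileno())\n"
)

path = os.path.join(tempfile.mkdtemp(), "import.log")
proc = subprocess.Popen([sys.executable, "-c", importer, path])
time.sleep(0.5)                      # let the import make some progress
proc.send_signal(signal.SIGKILL)     # simulate the crash mid-import
proc.wait()

with open(path) as f:
    records = f.readlines()
# A record only counts if it made it to disk completely.
complete = [r for r in records if r.startswith("record-") and r.endswith("\n")]
print(f"{len(complete)} complete records survived kill -9")
```

The interesting part is not the surviving count itself but whether the database restarts cleanly afterwards and whether the tail of the import is recoverable or silently lost.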

Original title and link: Benchmarking MongoDB (NoSQL databases © myNoSQL)


Project Voldemort and Terrastore: Key-Value vs Document Stores

It is an apples-to-oranges comparison, but it underlines, from a beginner’s perspective, the major differences between a pure key-value store (Project Voldemort) and a document database (Terrastore):

Being a simpler KV store than Terrastore, to my understanding Project Voldemort offers no ability to leverage the server to evaluate the Values. In order to, for example, produce a list of documents whose “publish date” is in the past, it is necessary to either fetch all documents and evaluate the publish date each time this operation is needed — or — manage a lookup list of document IDs that were “published” when the lookup list was created.
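The two access patterns described in the quote can be mimicked with a plain dict standing in for the key-value store; this is an illustration of the trade-off, not the Project Voldemort API, and the key names are made up:

```python
# Stand-in for a pure key-value store: values are opaque to the server,
# so any filtering has to happen client-side.
kv = {
    "doc:1": {"title": "A", "publish_date": "2010-01-01"},
    "doc:2": {"title": "B", "publish_date": "2099-01-01"},
    "doc:3": {"title": "C", "publish_date": "2010-06-15"},
}

now = "2011-01-01"

# Option 1: fetch every document and evaluate the publish date each time.
published = [k for k, doc in kv.items() if doc["publish_date"] < now]

# Option 2: maintain a lookup list of published IDs as a value itself,
# which must be rewritten whenever a document changes.
kv["index:published"] = sorted(published)

print(kv["index:published"])
```

A document store like Terrastore moves option 1’s predicate to the server, so the client neither fetches everything nor maintains the index by hand.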

In the end, the author also emphasizes how important the first impression is: clean documentation, simple installation, etc. Put differently, an end user judges a project by how fast he can start using it.

Original title and link: Project Voldermort and Terrastore: Key-Value vs Document Stores (NoSQL databases © myNoSQL)


NoSQL Comparison: Cassandra, CouchDB, HBase, MongoDB, Redis, Riak

Just before the end of year, a brief comparison — bullet style — of Cassandra, CouchDB, HBase, MongoDB, Redis, and Riak:

But the differences between “NoSQL” databases are much bigger than they ever were between one SQL database and another. This places a bigger responsibility on software architects to choose the appropriate one for a project right at the beginning.

Original title and link: NoSQL Comparison: Cassandra, CouchDB, HBase, MongoDB, Redis, Riak (NoSQL databases © myNoSQL)


Planning for Data Migration

From the Amazon ☞ Migrating your Existing Applications to the AWS Cloud paper (PDF):

  • What are the different storage options available in the cloud today?
  • What are the different RDBMS (commercial and open source) options available in the cloud today?
  • What is my data segmentation strategy? What trade-offs do I have to make?
  • How much effort (in terms of new development, one-off scripts) is required to migrate all my data to the cloud?

When choosing the appropriate storage option, one size does not fit all (nb: my emphasis). There are several dimensions that you might have to consider so that your application can scale to your needs appropriately with minimal effort. You have to make the right tradeoffs among various dimensions - cost, durability, query-ability, availability, latency, performance (response time), relational (SQL joins), size of object stored (large, small), accessibility, read heavy vs. write heavy, update frequency, cache-ability, consistency (strict, eventual) and transience (short-lived).

Just replace the words “cloud” and “AWS” with “NoSQL database” and you get a good base for your migration plan.

Original title and link: Planning for Data Migration (NoSQL databases © myNoSQL)

Why NoSQL … Why Not

Interesting article from Xeround’s Avi Kapuya ☞ NoSQL: The Sequel. A couple of comments though:


In other words, in SQL, the data model does not enforce a specific way to work with the data — it is built with an emphasis on data integrity, simplicity, data normalization and abstraction, which are all extremely important for large complex applications.

I’d say that data normalization is not a goal per se, but a solution to a problem (data duplication, frequent updates to common entities). But what if this solution introduces another, bigger problem (read: JOINs)?
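The trade-off can be shown with a toy example, plain Python dicts standing in for tables and documents (the names and data are made up):

```python
# Normalized (relational-style): authors and posts live in separate
# "tables" and are joined at read time.
authors = {1: {"name": "Ana"}}
posts = [{"id": 10, "author_id": 1, "title": "Hello"}]

def post_with_author(post_id):
    post = next(p for p in posts if p["id"] == post_id)
    return {**post, "author": authors[post["author_id"]]}   # the "JOIN"

# Denormalized (document-style): the author is duplicated into each post,
# so a read is a single lookup, but renaming Ana means touching every copy.
doc_posts = [{"id": 10, "title": "Hello", "author": {"name": "Ana"}}]

print(post_with_author(10)["author"]["name"])
print(doc_posts[0]["author"]["name"])
```

Normalization trades cheap updates for expensive reads; denormalization trades the other way, which is exactly the bet most NoSQL data models make.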

The NoSQL approach presents huge advantages over SQL databases because it allows one to scale an application to new levels

Plus it may give you more flexibility in your data model, plus it may be a better (as in operational, complexity, performance, etc.) storage for different formats of data.

Why not NoSQL

At the system level, data models are key. Not having a skilled authority to design a single, well-defined data model, regardless of the technology used, has its drawbacks.

Actually, I think the reality might be a bit different. First, because NoSQL imposes a “narrow predefined access pattern”, it requires one to spend more time understanding and organizing data. Second, the final model will reflect and be based on the reality of the application, not only on pure theory (as is the case with most initial relational model designs).

At the architecture level, two major issues are interfaces and interoperability. Interfaces for the NoSQL data services are yet to be standardized.

The interface limitation is a temporary issue in terms of getting more/better/quicker tooling support and probably a longer term issue for developers needing to learn different models. But as we’ve agreed, NoSQL has a small, predefined access mode and so we are not talking about learning completely new languages.

Personally, I think the real issue is the steep learning curve of understanding each NoSQL database’s semantics and operational behavior, rather than the lack of a common API.

Interoperability is an important point, especially when data needs to be accessed by multiple services.

I’m not seeing the problem here. As far as I know, each relational database comes with its own per-language drivers. On the NoSQL side, there are already quite a few products using standard protocols.

Moving to the operational realm, here, from my experience, lies the toughest resistance, and rightfully so… The operational environment requires a set of tools that is not only scalable but also manageable and stable, be it on the cloud or on a fixed set of servers. […] Operation needs to be systematic and self contained.

Now, this is completely the other way around. If you read any large-scale application story, you’ll notice the pattern: operational costs were a significant factor in deciding to use NoSQL. Just check the stories of Twitter, Adobe, Adobe products, and Facebook. Complexity is a fundamental dimension of scalability, and right now the balance tips towards NoSQL databases.

It is my opinion that a SQL database built on NoSQL foundations can provide the highest value to customers who wish to be both agile and efficient while they grow.

Unfortunately, I don’t think that’s actually possible, or at least not for all solutions. But if we just want some common access language, we will probably get it.

If, on the other hand, what we want is more tunable and scenario-specific engines, we will probably get these too. (nb: as far as I’ve heard, the PostgreSQL community is learning a lot from the various NoSQL databases and trying to bring in as many of the good ideas as it can).


My conclusion is simple. As with programming languages where we are not stuck with COBOL, polyglot persistence is here to stay and it’ll only get better.

Original title and link: Why NoSQL … Why Not (NoSQL databases © myNoSQL)

Another NoSQL Comparison: Evaluation Guide

The requirements were clear:

  • Fast data insertion.
  • Extremely fast random reads on large datasets.
  • Consistent read/write speed across the whole data set.
  • Efficient data storage.
  • Scale well.
  • Easy to maintain.
  • Have a network interface.
  • Stable, of course.

The list of NoSQL databases to be compared (Tokyo Cabinet, BerkeleyDB, MemcacheDB, Project Voldemort, Redis, and MongoDB): not so clear.

The methodology used to evaluate them and the results: definitely not clear at all.

NoSQL Comparison Guide / A review of Tokyo Cabinet, Tokyo Tyrant, Berkeley DB, MemcacheDB, Voldemort, Redis, MongoDB

And the conclusion is quite wrong:

Although MongoDB is the solution for most NoSQL use cases, it’s not the only solution for all NoSQL needs.

There were a couple of people asking for more details about my comments on this NoSQL comparison, so here they are:

  1. the initial list of NoSQL databases to be evaluated looks, at first glance, a bit random. It includes some not-so-widely-used solutions (MemcacheDB), some that are not, while leaving aside others that, at least at a high level, would correspond to the characteristics of those in the list (Riak, Membase)
  2. another reason for considering the initial choice a bit random is that, while scaling is listed as one of the requirements, the only truly scalable option in the list would be Project Voldemort. The recently added auto-sharding and replica sets would make MongoDB a candidate too, but a search on the MongoDB group would show that the solution is still young
  3. even if the set of requirements is clear, there’s no indication of what kind of evaluation was performed or how. Without knowing what and how, it is impossible to consider the results relevant.
  4. as Janl wrote about benchmarks, most of the time you are doing it wrong. Creating good, trustworthy, useful, relevant benchmarks is very difficult
  5. the matrix lists characteristics that are difficult to measure, and there are no comments on how the thumbs up were given. Examples: what is manageability and how was it measured? Same questions for stability and feature set.
  6. because most of it sounds speculative, here are a couple of speculations:
    1. judging by the thumbs up MongoDB received for insertion/random reads on a large data set, I can assume the data hasn’t exceeded the available memory. On the other hand, Redis was dismissed and received fewer votes due to its “more” in-memory character
    2. Tokyo Cabinet and Redis project activity and community were ranked the same. When was the last release of Tokyo Cabinet?
  7. I’m leaving it up to you to decide why the conclusion (“Although MongoDB is the solution for most NoSQL use cases”) is wrong.

Original title and link: Another NoSQL Comparison: Evaluation Guide (NoSQL databases © myNoSQL)


RavenDB and CouchDB Compared

A fair emphasis on what differentiates RavenDB from CouchDB (nb: coming from RavenDB’s creator). Just to mention the most interesting ones:

  • transactions: support for single document, document batch, multi request, multi node transactions […]
  • set based operations: update active = false where last_login < '2010-10-01'
  • includes and live projections (local data only)
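The set-based update above, mimicked over an in-memory document collection (plain Python, not RavenDB’s actual API; the documents are made up):

```python
docs = [
    {"id": 1, "active": True, "last_login": "2009-05-01"},
    {"id": 2, "active": True, "last_login": "2010-11-20"},
]

# Equivalent of: update active = false where last_login < '2010-10-01'
# The point of a server-side set-based operation is that the database
# applies this in one request, instead of the client fetching, editing,
# and re-saving each document individually.
for doc in docs:
    if doc["last_login"] < "2010-10-01":
        doc["active"] = False

print([d["id"] for d in docs if not d["active"]])   # deactivated accounts
```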

Original title and link: RavenDB and CouchDB Compared (NoSQL databases © myNoSQL)


Railo Cache Benchmark - CouchDB, MongoDB, RAM

They’re all fast, but what amazes me is how little difference there is between RAM vs MongoDB performance!

Not sure why that would be amazing, considering MongoDB will keep all that data in memory. In fact, I’d say the interesting part is CouchDB’s performance, considering it goes to disk for each read.

Original title and link: Railo Cache Benchmark - CouchDB, MongoDB, RAM (NoSQL databases © myNoSQL)