
NoSQL and RDBMS: Learn from Others’ Experience

At first I thought that Innostore[1], the embedded InnoDB distribution from Basho, was just another cool project they had made available to the community. Only after a couple of days did I realize that Innostore is in fact one of the pluggable backend storage engines for Riak. That made me think more about this decision.

Luckily enough, David Smith from Basho has already taken the time to explain ☞ the reasons that led Riak to use InnoDB as one of its storage engines:

1. predictability and 2. stability. […] we need something that is going to have predictable latency under significant loads. After evaluating TokyoCabinet (TC), BerkeleyDB-C (BDB) and Embedded Inno, it was quite clear that Inno won this aspect hands down.

You’ll notice pretty much the same arguments in this post about ☞ MySQL usage at Flickr:

  • it is a very well known component. When you’re scaling a complex app everything that can go wrong, will. Anything which cuts down on your debugging time is gold. All of MySQL’s flags and stats can be a bit overwhelming at times, but they’ve accumulated over time to solve real problems.
  • it’s pretty darn fast and stable. Speed is usually one of the key appeals of the new NoSQL architectures, but MySQL isn’t exactly slow (if you’re doing it right). I’ve seen two large, commercial “NoSQL” services flounder, stall and eventually get rewritten on top of MySQL. (and you’ve used services backed by both of them)

As a side note, that last sentence reminded me of the migration the Hashrocket team completed for a pharma company.

Last, but not least, you can also take a look at this ☞ Yahoo! benchmark that includes MySQL and, if I’m not misinterpreting those results, you’ll notice that in some of them MySQL performed quite well.

I guess what we can learn from all of this is:

  • not all traditional storage engines are as bad as we sometimes like to think
  • it is probably the complete feature set of an RDBMS that makes it overkill for some projects
  • there are still a lot of scenarios in which an RDBMS makes sense

Strange post for a NoSQL centric blog, isn’t it?


MongoDB Durability: A Tradeoff to Be Aware Of

The MongoDB team’s post about MongoDB’s durability made some waves last week. While I’d still recommend reading the original post, here is a summary of its most important points:

First, there are many scenarios in which that server loses all its data no matter what.

[…]

In the real world, traditional durability often isn’t even done correctly.

[…]

Given all this, we’re not saying durability isn’t important, we just think that single server durability isn’t the best way to get true durability. We think the right path to durability is replication (local and remote) and snapshotting. […] We are currently planning many more enhancements to replication to make it better.

[…]

Now - there are definitely some cases where single server durability is the best option. It is on our road map, it’s just not on the short list right now.

I have no intention of judging the decisions the MongoDB team made in designing their tool. But I do feel that the above arguments are inaccurate and that MongoDB’s durability should be seen as a tradeoff for the performance you are getting from it.

Firstly, the fact that there are scenarios in which you can lose all your data, despite every precaution, is not a good reason to drop durability features. The same applies to dismissing a feature because others might not implement it correctly. Both of these arguments are somewhat childish, and I suppose everyone reading the post has already dismissed them.

So, I’d like to focus on the important part: “We think the right path to durability is replication”. That’s definitely a more relevant and possibly more valid argument. But there are a couple of aspects that I’d like to cover:

  • while the probability of data loss is significantly reduced by using 2 machines instead of 1, you also have to take into account the probability of a network partition. As a side note, even if MongoDB supported replica sets instead of the currently supported replica pairs, the argument would remain valid. I don’t have any statistics on hardware failure rates vs. network failure rates to compute the impact on the probability of data loss, but the plot below should give you an idea of what I mean:
[plot: impact of hardware vs. network failure rates on the probability of data loss]
  • with an eventually consistent distributed system there is an uncontrollable window in which data loss may still occur. Remember that, for the time being, MongoDB’s replication is asynchronous, so even replicated data can be lost during the sync window. Keep in mind also that synchronization speed depends on the reliability of the network.
  • there is a cost impact you should be aware of: I’d argue that a battery-backed RAID controller costs much less than an additional server
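To make the first bullet concrete, here is a back-of-the-envelope model in Python. All of the probabilities are invented for illustration; only the structure of the comparison matters, not the numbers:

```python
# Toy model: single durable server vs. master + asynchronous slave.
# All probabilities below are made up for illustration.
p_disk = 0.01       # chance a given server loses its disk in some period
p_partition = 0.05  # chance the replication link is down when the disk fails
p_window = 0.02     # chance a write is still unreplicated when the disk fails

# Single server without replication: any disk failure means data loss.
p_loss_single = p_disk

# Master plus async slave: data is lost if the master's disk fails while
# the slave is unreachable (partition), or while the write is still
# sitting in the asynchronous replication window.
p_loss_replicated = p_disk * (p_partition + (1 - p_partition) * p_window)

# Replication helps a lot, but the loss probability is not zero, and it
# grows with the partition rate and with the size of the sync window.
```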

I’d also like to note that, as a consequence of MongoDB’s approach to durability, any benchmark comparing MongoDB with more durable stores will produce somewhat misleading results[1].

Summarizing, while the MongoDB team is working on improving the replication mechanisms:

  • pseudo real-time replication, with optional blocking of writes until they reach multiple servers
  • replica sets instead of replica pairs
  • making it easier to create new slaves for large data sets

I still think that everyone should be aware of MongoDB’s approach to durability and accept it as a tradeoff for other MongoDB features.

References

via: http://blog.mongodb.org/post/381927266/what-about-durability


A Very Specific Benchmark: Files vs MySQL vs Memcached vs Redis vs MongoDB

This sort of very specific benchmark is valid/interesting if and only if:

  • it simulates, as closely as possible, the real-life scenario that will be supported by the final application
  • it is not generalized to compare the overall performance of the NoSQL stores
  • the NoSQL store is correctly configured to fulfill the app requirements (e.g. durability)
  • it is understood that the driver has an impact on the results

In this case the benchmark measured requests/s for a use case of session storage in a Tornado-based web app. You can see the results below:


Store      Throughput
Reference  1626 req/s
MySQL      1353 req/s
Memcached  1473 req/s
MongoDB    1582 req/s
Redis      1418 req/s

Note: The benchmark doesn’t provide enough details about the drivers used.

via: http://milancermak.posterous.com/benchmarking-tornados-sessions-0


Redis Ecosystem Updates

The Redis 1.2.0 release (shortly followed by a small bugfix release[1]) has introduced a new persistence option: Append Only File.

On the mailing list [2], Salvatore Sanfilippo (@antirez) detailed the process of migrating an existing Redis store to the append-only file:

  1. Create the initial append-only file from your dataset by issuing: redis-cli bgrewriteaof
  2. When the rewrite is done (you can see it in the INFO command output), stop the server
  3. Edit redis.conf in order to enable append only
  4. Restart the server
  5. Profit!
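The steps above can be sketched as a small shell script. Note this is my own sketch, not from the mailing list: the INFO field name, the config path, and the sed edit are assumptions for a Redis 1.2.x setup, and the script only prints a message when redis-cli is not available:

```shell
#!/bin/sh
# Sketch of the AOF migration steps; assumes redis-cli/redis-server on the
# PATH and a redis.conf in the current directory.
if command -v redis-cli >/dev/null 2>&1; then
    # 1. Create the initial append-only file from the current dataset.
    redis-cli bgrewriteaof
    # 2. Wait for the rewrite to finish (visible in INFO), then stop the server.
    while redis-cli info | grep -q 'bgrewriteaof_in_progress:1'; do
        sleep 1
    done
    redis-cli shutdown
    # 3. Enable append-only mode in redis.conf.
    sed -i.bak 's/^appendonly no/appendonly yes/' redis.conf
    # 4. Restart the server with the updated configuration.
    redis-server ./redis.conf
    result="migrated"
else
    result="redis-cli not found; nothing to do"
fi
echo "$result"
```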

Just a few days after Redis 1.2.0 was released, Rediska, a PHP client for Redis that fully integrates with the popular Zend framework (which is itself looking at CouchDB and MongoDB integrations), announced its 0.3.0 release [3], featuring:

  • Full support for the Redis 1.2.0 API
  • Pipelining
  • Operating on keys on a specified server (selected by alias)
  • Specifying the DB index in the server config
  • Easy extension of Rediska by adding your own commands or overriding the standard ones
  • Lazy loading
  • Full documentation

Last, but not least, Chris Streeter has published a server throughput benchmark [4] of Redis used with another PHP Redis driver, Predis:

[chart: Redis server throughput; credit Chris Streeter]


Redis Benchmarks Updated

The Redis benchmarks have been updated to include results for a SpeedyRails host. I should note that Redis ships with its own benchmarking tool, ☞ redis-benchmark.

The other day, after announcing the completion of the first phase of the Redis Virtual Memory implementation, I was talking to Salvatore and he underlined the need to update this tool:

The main problem I have now is: how to measure performance? Benchmarking with VM is *very* hard, as what’s needed is to simulate different access patterns with biases, and the right amount of RAM and VM.

And I cannot stop wondering whether there are any volunteers among the smart MyNoSQL readers to help with this!

via: http://porteightyeight.com/2009/11/09/redis-benchmarking-on-amazon-ec2-flexiscale-and-slicehost/


Basic Benchmark: CouchDB vs MongoDB vs MySQL

my benchmark script ☞ http://gist.github.com/268512, couchdb is 5x(read)~10x(write) slower than mongodb :(

I think the conclusion is wrong, as it is based on comparing real-time figures (wall-clock time elapsed between invocation and termination). I’d say comparing total CPU times (user + sys) would be more correct.

@ihower

Update: @codemonkeyism has pointed out yet another reason for this benchmark being wrong: “As far as I know CouchDB data is durable, but MongoDB is primarily memory and then stored and corruptable - are those comparable?”.
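The real vs. user+sys distinction is easy to demonstrate. The sketch below (my own illustration, not from the benchmark) shows how an I/O-bound workload inflates wall time while barely using any CPU:

```python
# Compare wall-clock time against CPU (user + sys) time for a workload
# that mostly waits, the way a benchmark client waiting on a database does.
import os
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent running fn()."""
    w0, t0 = time.time(), os.times()
    fn()
    w1, t1 = time.time(), os.times()
    wall = w1 - w0
    cpu = (t1[0] - t0[0]) + (t1[1] - t0[1])  # user + sys
    return wall, cpu

# Simulate an I/O-bound client: lots of waiting, almost no CPU work.
wall, cpu = measure(lambda: time.sleep(0.2))
# wall is ~0.2s while cpu stays near zero, so ranking stores by wall time
# partly measures the waiting, not the work.
```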


CouchDB vs MongoDB: An Attempt at a More Informed Comparison

After posting about Scott Motte’s comparison of MongoDB and CouchDB, I thought there should be some more informative sources out there, so I started to dig.

The first one I came upon (thanks to Debasish Ghosh, @debasishg) is an article about ☞ the Raindrop requirements, the issues faced while attacking them with CouchDB, and the pros and cons of possibly replacing CouchDB with MongoDB:

[Pros]

  • Uses update-in-place, so the file system impact/need for compaction is less; if we store our schemas in one document, updates are likely to work better.
  • Queries are done at runtime. Some indexes are still helpful to set up ahead of time though.
  • Has a binary format for passing data around. One of the issues we have seen is the JSON encode/decode times as data passes around through couch and to our API layer. This may be improving though.
  • Uses language-specific drivers. While the simplicity of REST with CouchDB sounds nice, due to our data model, the megaview and now needing a server API layer means that querying the raw couch with REST calls is actually not that useful. The harder issue is trying to figure out the right queries to do and how to do the “joins” effectively in our API app code.

[Cons]

  • no easy master-master replication. However, for me personally, this is not so important. […] So while we need backups, we probably are fine with master-slave. To support the sometimes-offline case, I think it is more likely that using HTML5 local storage is the path there. But again, that is just my opinion.
  • ad-hoc query cost may still be too high. It is nice to be able to pass back a JavaScript function to do the query work. However, it is not clear how expensive that really is. On the other hand, at least it is a formalized query language — right now we are on the path to inventing our own with the server API with a “query language” made up of other API calls.

Anyway, while some of the points above are generic, you should definitely try to consider them from the perspective of the Raindrop requirements, about which you can read more here.

Another article comparing MongoDB and CouchDB is hosted by the ☞ MongoDB docs. I find it well balanced and you should read it all, as it covers a lot of different aspects: horizontal scalability, query expressions, atomicity, durability, MapReduce support, JavaScript, performance, etc.

I’d also mention this ☞ benchmark comparing the performance of MongoDB, CouchDB, and Tokyo Cabinet/Tyrant (note: the author of the benchmark categorizes Tokyo Cabinet as a document database, while it is a key-value store), using MySQL results as a reference.

In case you have other resources you think would be worth including, do not hesitate to send them over.

Update: Just found a nice comparison matrix [1].

As a teaser, very soon I will introduce you to a new solution available in this space, so make sure to check MyNoSQL regularly.

Update: The main article about this new document store has been published: Terrastore: A Consistent, Partitioned and Elastic Document Database. I would strongly encourage you to check it, as Terrastore is looking quite promising.


Redis Benchmarks

Back when I was writing the ☞ Quick Reference to Alternative data storages, I searched the internet for benchmark results probably more thoroughly than Google does. And I couldn’t find much.

Things seem to be changing lately, and I have started to gather quite a few results (see the NoSQL benchmark articles).

Redis Benchmarking on Amazon EC2, Flexiscale, and Slicehost

The author of the article has managed to run the Redis benchmarks on a set of different cloud hosting providers:

  • small-remote (Amazon EC2, 32b)
  • small (Amazon EC2, 32b)
  • slicehost-256 (Slicehost, 64b)
  • quadruple-extra-large (Amazon EC2, 64b)
  • large (Amazon EC2, 64b)
  • high-cpu-medium (Amazon EC2, 64b)
  • high-cpu-extra-large-32b-os (Amazon EC2, 32b)
  • high-cpu-extra-large (Amazon EC2, 64b)
  • flexiscale-2gb-4core (Flexiscale, 64b)
  • flexiscale-2gb-2core (Flexiscale, 64b)
  • extra-large (Amazon EC2, 64b)
  • double-extra-large (Amazon EC2, 64b)

You can read the results ☞ here (there is also a ☞ spreadsheet available).

Redis Benchmarks on FusionIO

It looks like the “MySQL Performance” guys are growing their passion for NoSQL systems. They have now published the results of benchmarking Redis on FusionIO in 5 modes:

  • In-Memory (save 900000000 900000000)
  • Semi-Persistent Mode 1 (save 1 1)
  • Fully persistent (appendonly yes, appendfsync always)
  • Semi-Persistent Mode 2 (appendonly yes, appendfsync no)
  • Semi-Persistent Mode 3 (appendonly yes, appendfsync everysec)
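For reference, the five modes correspond to the following redis.conf directives, one configuration per run (the directive values are the ones quoted in the list above; the grouping comments are mine):

```
# In-memory: effectively never snapshot
save 900000000 900000000
appendonly no

# Semi-Persistent Mode 1: snapshot after every single change
save 1 1
appendonly no

# Fully persistent: AOF with an fsync on every write
appendonly yes
appendfsync always

# Semi-Persistent Mode 2: AOF, flushing left to the OS
appendonly yes
appendfsync no

# Semi-Persistent Mode 3: AOF, fsync once per second
appendonly yes
appendfsync everysec
```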

You might find it useful to read the ☞ RAID vs SSD vs FusionIO setup to better understand the environment.

Update: there is an updated version of these Redis benchmarks


Memcached-in-the-Cloud by Gear6

Memcached is used as a reference in the NoSQL world, both for its API and for performance comparisons. Some NoSQL key-value stores offer a Memcached-compatible API, and some even speak the same protocol.

Startup Gear6 today launched the availability of its memcached appliance on Amazon’s Web Services platform, bringing a widely used distributed memory caching system for web companies to the cloud.

What seems to be missing from the announcement is any mention of automatic Memcached scaling. Wouldn’t that be an interesting feature?

via: http://gigaom.com/2009/12/08/gear6-brings-memcached-to-amazons-cloud/


Non-relational data stores for OpenSQL Camp

Igal Koshevoy has made available through GitHub his presentation on “Non-relational data stores for OpenSQL Camp: Overview, coding and assessment: MongoDB, Tokyo Tyrant & CouchDB”.

After a short presentation of the relational and non-relational worlds, Igal moves on to the pros and cons of each of MongoDB, Tokyo Tyrant, and CouchDB, includes code snippets for all basic operations, and concludes with some benchmarking results. You can read the presentation embedded below (update: it looks like the Google embed doesn’t work with this document, or GitHub is not allowing access to it), so for the moment you can access it in PDF format ☞ here.


A Benchmark for NoSQL Solutions

Sooner or later every piece of software or programming language gets benchmarked. Some benchmarks are interesting, while others tend to be created to prove that a particular solution is better than all the others (vendor benchmarks). Coming up with a fair benchmark is a hard job, and trying to analyze a set of heterogeneous systems is even more difficult.

K.S. Bhaskar has published a benchmark proposal called the ‘3n+1 NoSQL/Key-Value/Schema-Free/Schema-Less Database Benchmark’ which, in his words,

is designed to allow apples-to-apples comparisons of NoSQL databases, using features that should allow many, if not most, NoSQL engines to be benchmarked
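As far as I can tell, the workload computes the lengths of 3n+1 (Collatz) sequences, memoizing the results in the store. Here is a minimal Python sketch of the idea, with a plain dict standing in for the key-value store (the names are mine, not from Bhaskar’s spec):

```python
# Minimal sketch of the 3n+1 workload. The dict stands in for the NoSQL
# store; in a real run, reads/writes of store[n] would be the database's
# get/put calls, which is what the benchmark actually exercises.
store = {}

def cycle_length(n):
    """Number of steps for n to reach 1 under n -> n/2 (even), 3n+1 (odd)."""
    if n in store:
        return store[n]
    if n == 1:
        steps = 0
    elif n % 2 == 0:
        steps = 1 + cycle_length(n // 2)
    else:
        steps = 1 + cycle_length(3 * n + 1)
    store[n] = steps  # persist the intermediate result
    return steps

# A benchmark-style query: which starting value below 1000 has the
# longest chain?
longest = max(range(1, 1000), key=cycle_length)
```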

There have been such proposals before and most probably there will be many more to come.

While writing the quick reference to alternative storages, I tried to put together as many performance results as I could find. Unfortunately that attempt was far from a success, so seeing this proposal, and ☞ people starting to publish their results, is a major step forward.

What are your thoughts about NoSQL benchmarks?