



Here Is Why in Cassandra vs. HBase, Riak, CouchDB, MongoDB, It's Cassandra FTW

Brian O'Neill:

Now, since choosing Cassandra, I can say there are a few other really important, less tangible considerations. The first is the code base. Cassandra has an extremely clean and well maintained code base. Jonathan and team do a fantastic job managing the community and the code. As we adopted NoSQL, the ability to extend the code base and incorporate our own features has proven invaluable. (e.g. triggers, a REST interface, and server-side wide-row indexing)

Secondly, the community is phenomenal. That results in timely support, and solid releases on a regular schedule. They do a great job prioritizing features, accepting contributions, and cranking out features. (They are now releasing ~quarterly) We’ve all probably been part of other open source projects where the leadership is lacking, and features and releases are unpredictable, which makes your own release planning difficult. Kudos to the Cassandra team.

Everything sounds reasonable except for Riak being the “new kid on the block” and not finding support for it. Basho, where were you hiding?

Original title and link: Here Is Why in Cassandra vs. HBase, Riak, CouchDB, MongoDB, It’s Cassandra FTW (NoSQL database©myNoSQL)


What HBase Learned From the Hypertable vs HBase Benchmark

Every decent benchmark can reveal not only performance or stability problems, but oftentimes more subtle issues like less known or undocumented options, common misconfigurations or misunderstandings. Sometimes it can reveal scenarios that a product hasn’t considered before or for which it has different solutions.

So even if I don’t agree with the purpose of the Hypertable vs HBase benchmark, I think the benchmark is well designed and there were no intentions to favor one product over the other.

I went back to two long-time HBase committers and users, Michael Stack and Jean-Daniel Cryans, to find out what the HBase community could learn from this benchmark.

What can be learned from the Hypertable vs HBase benchmark from the HBase perspective?

Michael Stack: That we need to work on our usability; even a smart fellow like Doug Judd can get it really wrong.

We haven’t done his sustained upload in a good while. Our defaults need some tweaking.

We need to do more documentation around JVM tuning; you’d think fellas would have grok’d by now that big java apps need their JVM’s tweaked but it looks like the message still hasn’t gotten out there.

That we need a well-funded PR dept. to work on responses to the likes of Doug’s article (well-funded because Doug claims he spent four months on his comparison).

Jean-Daniel Cryans: I already opened a few jiras after using HT’s test on a cluster I have here with almost the same hardware and node count, it’s mostly about usability and performance for that type of use case:

  • Automagically tweak global memstore and block cache sizes based on workload

    Hypertable does a neat thing where it changes the size given to the CellCache (our MemStores) and Block Cache based on the workload. If you need an image, scroll down at the bottom of this link:

    Hypertable adaptive memory allocation

  • Soft limit for eager region splitting of young tables

    Coming out of HBASE-2375, we need a new functionality much like hypertable’s where we would have a lower split size for new tables and it would grow up to a certain hard limit. This helps usability in different ways:

    • With that we can set the default split size much higher and users will still have good data distribution
    • No more messing with force splits
    • Not mandatory to pre-split your table in order to get good out of the box performance

    The way Doug Judd described how it works for them, they start with a low value and then double it every time it splits. For example if we started with a soft size of 32MB and a hard size of 2GB, it wouldn’t be until you have 64 regions that you hit the ceiling.

    On the implementation side, we could add a new qualifier in .META. that has that soft limit. When that field doesn’t exist, this feature doesn’t kick in. It would be written by the region servers after a split and by the master when the table is created with 1 region.

  • Consider splitting after flushing

    Spawning this from HBASE-2375, I saw that it was much more efficient compaction-wise to check if we can split right after flushing. Much like the ideas that Jon spelled out in the description of that jira, the window is smaller because you don’t have to compact and then split right away to only compact again when the daughters open.
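The doubling soft-limit scheme Doug Judd describes is easy to check with a few lines. This is an illustrative sketch (the function name is mine, not HBase's or Hypertable's): starting from a 32MB soft size and doubling at every split generation, the 2GB ceiling is reached only once there are 64 regions, just as described.

```python
def regions_until_hard_limit(soft_mb=32, hard_mb=2048):
    """Count regions at the point the doubling soft split size reaches the hard limit."""
    size, regions = soft_mb, 1
    while size < hard_mb:
        size *= 2      # split threshold doubles after each generation of splits
        regions *= 2   # every region splits into two daughters
    return regions

print(regions_until_hard_limit())  # 64
```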

If someone is faced with similar scenarios are there workarounds or different solutions?

Michael Stack: There are tunings of HBase configs over in our reference guide for the sustained upload both in hbase and in jvm.

Then there is our bulk load facility, which bypasses this scenario altogether and is what we’d encourage folks to use, because it’s 10x to 100x faster getting your data in there.

Jean-Daniel Cryans: You can import 5TB in HBase with sane configs, I’ve done it a few times already since I started using his test. The second time he ran his test he just fixed mslab but still kept the crazy ass other settings like 80% of the memory dedicated to memstores. My testing also shows that you need to keep the eden space under control, 64MB seems a good value in my testing (he didn’t set any in his test, the first time I ran mine without setting it I got the concurrent mode failure too).

The answer he gave this week to Todd’s email on the hadoop mailing list is about a constant stream of updates and that’s what he’s trying to test. Considering that the test imports 5TB in ~16h (on my cluster), you run out of disk space in about 3 days. I seriously don’t know what he’s aiming for here.

Quoting him: “Bulk loading isn’t always an option when data is streaming in from a live application. Many big data use cases involve massive amounts of smaller items in the size range of 10-100 bytes, for example URLs, sensor readings, genome sequence reads, network traffic logs, etc.”

What are the most common places to look for improving the performance of a HBase cluster?

Michael Stack: This is what we point folks at when they ask the likes of the above question: HBase Performance Tuning

If that chapter doesn’t have it, it’s a bug and we need to fix up our documentation more.

Jean-Daniel Cryans: What Stack said. Also if you run into GC issues like he did then you’re doing it wrong.

Michael Stack also pointed me to a comment by Andrew Purtell (nb: you need to be logged in on LinkedIn and member of the group to see it):

I think HBase should find all of this challenging and flattering. Challenging because we know how we can do better along the dimensions of your testing and you are kicking us pretty hard. Flattering because by inference we seem to be worth kicking.

But this misses the point, and reduces what should be a serious discussion of the tradeoffs between Java and C++ to a caricature. Furthermore, nobody sells HBase. (Not in the Hypertable or Datastax sense. Commercial companies bundle HBase but they do so by including a totally free and zero cost software distribution.) Instead it is voluntarily chosen for hundreds of large installations all over the world, some of them built and run by the smartest guys I have ever encountered in my life. Hypertable would have us believe we are all making foolish choices. While it is true that we all on some level have to deal with the Java heap, only Hypertable seems to not be able to make it work. I find that unsurprising. After all, until you can find some way to break it, you don’t have any kind of marketing story.

This reminded me of the quote from Jonathan Ellis’s Dealing With JVM Limitations in Apache Cassandra:

Cliff Click: Many concurrent algorithms are very easy to write with a GC and totally hard (to down right impossible) using explicit free.

As I was expecting, there are quite a few good things that will come out of this benchmark, both for long-time HBase users and for new adopters.

Original title and link: What HBase Learned From the Hypertable vs HBase Benchmark (NoSQL database©myNoSQL)

Data Grid or NoSQL? What are the common points? The main differences?

A great post by Olivier Mallassi on a topic that comes up very often: how do data grids and NoSQL databases compare?

  • Data grids let you control the way data is stored. They all have default implementations (Gigaspaces offers an RDBMS by default, Gemfire offers file- and disk-based storage by default…), but in all cases you can choose the one that fits your needs: do you need to store data, do you need to relieve the existing databases…
  • In order to minimize latency, data grids let you store data on disk synchronously (write-through) or asynchronously (write-behind). You can also define overflow strategies; in that case, data is stored in memory up to a threshold, at which point it is flushed to disk (following algorithms like LRU…). NoSQL solutions have not been designed to provide these features.
  • Data grids enable you to develop event-driven architectures.
  • Querying is maybe the point on which pure NoSQL solutions and data grids are merging.
  • Data grids enable near-cache topologies.
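The write-through vs. write-behind distinction in the second bullet can be sketched in a few lines of Python. This is a toy model, not any grid's actual API: write-through persists on the caller's thread before returning, while write-behind queues the write and lets a background thread drain it to the backing store.

```python
import queue
import threading

class WriteBehindCache:
    """Toy sketch: an in-memory dict whose disk writes can be synchronous or deferred."""
    def __init__(self, persist):
        self.mem = {}
        self.persist = persist            # callable that writes to the backing store
        self.q = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def put_through(self, key, value):
        """Write-through: persist synchronously, then return."""
        self.mem[key] = value
        self.persist(key, value)

    def put_behind(self, key, value):
        """Write-behind: update memory, enqueue the write, return immediately."""
        self.mem[key] = value
        self.q.put((key, value))

    def _drain(self):
        while True:
            key, value = self.q.get()
            self.persist(key, value)
            self.q.task_done()
```

A real grid adds batching, coalescing of repeated writes to the same key, and failure handling on the drain path; the shape of the trade-off (latency vs. durability window) is the same.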

Taking a step back, you’ll notice that there are actually more similarities than differences. While Olivier Mallassi lists the above points as features that show data grids to be more configurable and thus more adaptable, some of these also exist in NoSQL databases, taking different forms:

  1. pluggable storage backends. Not many of the NoSQL databases have this feature, but Riak and Project Voldemort offer different backends optimized for specific scenarios.
  2. replicated and durable writes. Not the same as synchronous vs asynchronous writes, but a different perspective on writes.
  3. Notification mechanisms. Once again, not all of the NoSQL databases support notification mechanisms, but a couple of them offer some interesting approaches:
    1. CouchDB: _changes feed with filters
    2. Riak: pre-commit and post-commit hooks
    3. HBase coprocessors
  4. Most of the NoSQL database have local per-node caches.
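The hook mechanisms in item 3 all share the same basic shape, which a toy sketch makes concrete (a hypothetical class, not Riak's or CouchDB's actual API): pre-commit hooks run before the write and may transform or veto it; post-commit hooks fire after the write, which is where notifications hang off.

```python
class HookedStore:
    """Toy sketch of pre-/post-commit hooks, in the spirit of Riak's commit hooks."""
    def __init__(self):
        self.data = {}
        self.pre = []     # each hook: (key, value) -> value, may raise to veto
        self.post = []    # each hook: (key, value) -> None, fired after the write

    def put(self, key, value):
        for hook in self.pre:      # pre-commit: transform or reject the write
            value = hook(key, value)
        self.data[key] = value
        for hook in self.post:     # post-commit: notify observers of the new value
            hook(key, value)
```

CouchDB's `_changes` feed inverts this: instead of the store calling your code, clients pull (or long-poll) a sequence of change events, optionally through a filter function.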

With these, I’ve probably made things even blurrier. But let me try to draw a line between data grids and NoSQL databases. Data grids are optimized for handling data in memory. Everything that spills over is secondary. On the other hand, NoSQL databases are for storing data. Thus they focus on how they organize data (on disk or in memory) and optimize access to it. Data grids are a processing/architectural model. NoSQL databases are storage solutions.

Original title and link: Data Grid or NoSQL? What are the common points? The main differences? (NoSQL database©myNoSQL)

Cassandra and Amazon DynamoDB Comparison

Maybe a few of the words are too strong, but it is definitely a great comparison of Cassandra and Amazon DynamoDB by Jonathan Ellis (Cassandra chair and founder of DataStax):

As an engineer, it’s nice to see so many of Cassandra’s design decisions imitated by Amazon’s next-gen NoSQL product. I feel like a proud uncle! But in many important ways, Cassandra retains a firm lead in power and flexibility.

Cassandra vs Amazon DynamoDB

Update: this is the updated version of the comparison.

Original title and link: Cassandra and Amazon DynamoDB Comparison (NoSQL database©myNoSQL)


Enterprise Caches Versus Data Grids Versus NoSQL Databases

RedHat/JBoss Manik Surtani:

[…] If you want to compare distributed systems, both data grids and NoSQL have kind of come from different starting points, if you will. They solve different problems, but where they stand today they’ve kind of converged. Data grids have been primarily in-memory but now they spill off onto disk and so on and so forth and they’ve added in-query and mapreduce onto it while NoSQL have primarily been on disk, but now cache stuff in-memory anyway for performance. They are starting to look the same now, or are very similar.

One big difference though that I see between data grids and NoSQL, something that still exists today, is how you actually interact with these systems. Data grids tend to be in VM, they tend to be embedded, you tend to launch a Java or JVM program, you tend to connect to a data grid API and you work with it whereas NoSQL tends to be a little bit more client server, a bit more like old-fashion databases where you open a socket to your NoSQL database or your NoSQL grid, if you will, and start talking to it. That’s the biggest difference I see today, but even that will eventually go away.

They seem to converge, but:

  • spilling off to disk is not equivalent to optimized disk access
  • distributed, sometimes even transactional, caches are not equivalent to single-node caches

Original title and link: Enterprise Caches Versus Data Grids Versus NoSQL Databases (NoSQL database©myNoSQL)


Rails Caching Benchmarked: MongoDB, Redis, Memcached

A few Rails caching solutions—file, memcached, MongoDB, and Redis—benchmarked first here by Steph Skardal and then here by Thomas W. Devol. Thomas W. Devol concludes:

Though it looks like mongo-store demonstrates the best overall performance, it should be noted that a mongo server is unlikely to be used solely for caching (the same applies to redis), it is likely that non-caching related queries will be running concurrently on a mongo/redis server which could affect the suitability of these benchmarks.

I’m not a Rails user, so please take these with a grain of salt:

  • without knowing the size of the cached objects, at 20,000 iterations most probably neither MongoDB nor Redis had to persist to disk.

    This means that all three of memcached, MongoDB, and Redis stored data in memory only[1]

  • if no custom object serialization is used by any of the memcached, MongoDB, Redis caches, then the performance difference is mostly caused by the performance of the driver

  • it should not be a surprise to anyone that the size of the cached objects can and will influence the results of such benchmarks

  • there doesn’t seem to be any concurrent access to caches. Concurrent access and concurrent updates of caches are real-life scenarios and not including them in a benchmark greatly reduces the value of the results

  • none of these benchmarks seems to contain code that measures the performance of cache eviction

  1. Except the case where any of these forces a disk write  

Original title and link: Rails Caching Benchmarked: MongoDB, Redis, Memcached (NoSQL database©myNoSQL)

CouchDB and Redis: Strengths and Weaknesses

Knowing the strengths and weaknesses of each of them can help in making a decision. But don’t fall into the trap of comparing them head to head:

I’ve compiled a simpler item-by-item comparison of CouchDB and Redis, and it appears to be that CouchDB is strong precisely where Redis is weak (storing large amounts of rarely-changing but heavily indexed data), and Redis is strong precisely where CouchDB is weak (storing moderate amounts of fast-changing data).

Original title and link: CouchDB and Redis: Strengths and Weaknesses (NoSQL database©myNoSQL)


LevelDB and Kyoto Cabinet Benchmark

I’ve been pretty excited about Google’s LevelDB, not to mention there are some really old tanks already in the battle field like BDB, Tokyo Cabinet (Kyoto Cabinet as new one), HamsterDB etc. Fortunately I’ve already worked with Kyoto Cabinet and when I looked at the benchmarks I was totally blown away.

His benchmark results are radically different from the ones published in the LevelDB benchmark.

Original title and link: LevelDB and Kyoto Cabinet Benchmark (NoSQL database©myNoSQL)


Brief MongoDB and Riak Comparison

The advantage of Riak over Mongo is that Riak automatically replicates and rebalances.

The advantage of MongoDB over Riak is that Mongo supports secondary indexes and a more robust query language.

Both Riak and MongoDB support MapReduce via JavaScript, and both use the SpiderMonkey JavaScript engine. However, Riak’s MapReduce framework is more powerful than MongoDB’s framework because Riak allows you to run MapReduce jobs on a filtered set of keys. By contrast, in Mongo, you have to run MapReduce jobs across an entire database.

All true.
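The difference called out above, running MapReduce over a filtered key set versus the entire database, is easy to see in a toy sketch (illustrative Python, not Riak's actual job format, which is JSON submitted over HTTP or Protocol Buffers):

```python
def mapreduce(docs, keys, map_fn, reduce_fn):
    """Run the map phase over only the selected keys, then reduce the results."""
    mapped = [map_fn(k, docs[k]) for k in keys]
    return reduce_fn(mapped)

docs = {"a": 1, "b": 2, "c": 3}
# Riak-style: the job runs only over a filtered set of keys.
filtered = mapreduce(docs, ["a", "c"], lambda k, v: v, sum)  # touches 2 of 3 docs
# Mongo-style (at the time): the job runs over the whole collection.
full = mapreduce(docs, list(docs), lambda k, v: v, sum)
```

The cost difference is proportional to the number of documents the map phase has to touch, which is why key filtering matters for large datasets.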

Original title and link: Brief MongoDB and Riak Comparison (NoSQL database©myNoSQL)


Redis vs H2 Performance in Grails 1.4

I wondered just how much faster read/write operations could be with Redis (if at all) over the H2 database so I set out to write a little test app to see for myself.

It is an apples-to-apples comparison—both Redis and H2 are in-memory databases. But it is not a comparison of Redis vs H2 performance, but rather a comparison of Grails integration for Redis and H2, of Grails object-to-Redis vs object-to-relational mapping, of the Redis and H2 drivers, and only lastly of Redis and H2 performance.

You could argue that for real applications that’s what matters and that would be correct. But then the title should be the one I used.

Redis vs H2 in Grails 1.4

Original title and link: Redis vs H2 Performance in Grails 1.4 (NoSQL database©myNoSQL)


The 11 Commandments of Benchmarking

Mark Nottingham has a great post about benchmarking HTTP servers. All 11 rules laid out in the post apply as they are to NoSQL benchmarks and, generally, to storage benchmarks:

  1. Consistency
  2. One machine, one job
  3. Check the network
  4. Remove OS limitations
  5. Don’t test the client
  6. Overload is not capacity
  7. Thirty Seconds isn’t a test
  8. Do more than Hello world
  9. Not just averages
  10. Publish it all
  11. Try different tools
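Commandment 9, "Not just averages," deserves a tiny illustration. With even one slow outlier the mean is badly misleading, while percentiles tell the real story. This is a minimal nearest-rank percentile sketch (the function is mine, not from the post):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value below which p% of samples fall."""
    s = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[idx]

latencies = [1] * 9 + [100]               # nine fast requests, one outlier
mean = sum(latencies) / len(latencies)    # 10.9, dominated by the outlier
p50 = percentile(latencies, 50)           # 1
p99 = percentile(latencies, 99)           # 100
```

Publishing p50/p95/p99 alongside the mean (commandment 10: publish it all) is what makes a latency benchmark interpretable.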

Go read the post now before creating yet another irrelevant benchmark.

Original title and link: The 11 Commandments of Benchmarking (NoSQL databases © myNoSQL)


InterSystems Globals and GT.M Compared

InterSystems, producers of the Caché database, launched Globals, a “fast, proven, simple, flexible and free” database, two months ago. But after the initial announcement, I couldn’t find or hear much about it. That is, until Rob Tweed[1] and K.S. Bhaskar[2] took the time to explain some of the differences between InterSystems Globals and GT.M, both systems being implemented on top of MUMPS Global Persistent Variables.

Rob Tweed: I’m not an InterSystems person — simply a long-term user and advocate of Global-storage based technologies of which GT.M, Cache and now InterSystem Globals are members, and someone who has long believed that it’s a significantly under-valued database technology, and unfortunately and sadly not known about or understood sufficiently in the wider database/IT world. However, the rise of NoSQL has provided some renewed chance of rediscovery by a wider community of developers, which I’m keen to encourage.

With respect to a comparison with BigTable etc, I guess all of us in the Global-storage technology user communities have looked at many of the new NoSQL technologies and thought it’s deja vu all over again :-) Perhaps this paper[3] that I co-authored might help to at least provide a comparative positioning against the “mainstream” NoSQL databases.

As we note in the paper, full-blown Cache and GT.M provide many of the mechanisms needed for high-end scalability, though, as you point out, many of these appear to be lacking in InterSystem Globals, at least in its current (and relatively early) incarnation.

Regarding a comparison of InterSystem Globals and GT.M, at the core data storage level, there’s little difference: they both use Globals for data storage, so the use cases will be similar. Both GT.M and Globals are implemented in C (instead of M/Mumps), with some small bits of GT.M glue code in assembler. In terms of licensing, InterSystem Globals is free but proprietary, GT.M is free open source.

I guess the biggest differences are:

  • InterSystem Globals is essentially the core database engine from Cache, but with many of the features of Cache, in particular its native language (M) turned off. The concept in InterSystem Globals is that it will be accessed via APIs from other mainstream languages, instead of being primarily accessed via the M language as is the norm in, say, GT.M

    K.S.Bhaskar: Although the majority of GT.M users do indeed program in M, the fact is that the GT.M database is just as accessible from a C main() program.

  • the InterSystem Globals APIs are in-process rather than via, say, a TCP interface. As such it should be significantly faster to access from a non-M scripting language such as Java or Javascript/Node.js than would be GT.M. I’ve not tried it out yet myself, as I’m very much a Javascript person these days and keen to try out their planned Node.js APIs. I should be able to report back some performance comparisons when I get my hands on the Node.js-compatible version.

    (KSB) As discussed above, GT.M does not restrict a user to TCP access. The primary restriction (which results from the fact that the database engine is daemonless and processes cooperate to manage the database - so there is a real time database engine linked into each processes’ address space) is that a GT.M process can have only one thread. If you can’t live with this, then you have to use TCP through a client such as Rob’s.

    Another option is to use the GT.CM “database service” that GT.M includes (GNP - the GT.M Network Protocol is layered on TCP). A client is coded within GT.M itself, or you can use/adapt other clients for GNP such as Dave H’s PHP gtcmclient.

(RT): I suspect one way things will pan out over coming months will be:

  • if you want a fully open source Global-based database technology with all the bells and whistles, then GT.M is probably the answer, but its interfacing to other languages will be bottlenecked by TCP networking limits and indirection (the M equivalent of, eg Javascript’s eval() function)
  • If you want the ultimate in performance and willing to sacrifice open source and the high-end scalability options, but remain free, then InterSystem Globals will be a good choice
  • If you want the former but are willing to pay for the extra high-end scalability technologies, then full-blown Cache will be your choice.

    (KSB) This is a false choice. GT.M gives you high end scalability with a free / open source license (and support with assured service levels on commercial terms for those who want it).

(RT) The nice thing is that it will be straightforward to engineer applications that can be easily migrated between these three options with a minimum of change being needed at the application level.

Personally, I think InterSystem Globals is a great thing and nice to see InterSystems venturing into a new direction: I think that’s only to be encouraged and can only help the NoSQL community.

The text of this post has been adapted and edited based on this conversation.

  1. Rob Tweed: Web/Ajax app/Cloud consultant and product developer. M/DB, M/DB:X and EWD Areas of expertise: Node, Sencha Touch, Mobile Web Apps, NoSQL databases  

  2. K.S. Bhaskar: Development Director at FIS  

  3. A Universal NoSQL Engine, Using a Tried and Tested Technology  

Original title and link: InterSystems Globals and GT.M Compared (NoSQL databases © myNoSQL)