NoSQL benchmarks: All content tagged as NoSQL benchmarks in NoSQL databases and polyglot persistence

5 Steps to Benchmarking Managed NoSQL - DynamoDB Vs Cassandra

Ben Bromhead (instaclustr) for High Scalability:

To determine the suitability of a provider, your first port of call is to benchmark. Choosing a service provider is often done in a number of stages. First is to shortlist providers based on capabilities and claimed performance, ruling out those that do not meet your application requirements. Second is to look for benchmarks conducted by third parties, if any. The final stage is to benchmark the service yourself.

Peter Bailis asks a very valid question: “if it’s the default YCSB and it’s a benchmark, where are the results?”

✚ instaclustr offers a totally managed hosting solution for Cassandra. (Disclaimer: they’ve sponsored myNoSQL in the past)

Original title and link: 5 Steps to Benchmarking Managed NoSQL - DynamoDB Vs Cassandra (NoSQL database©myNoSQL)


PUMA: A MapReduce Benchmarks Suite From Purdue

Purdue MapReduce benchmarks and data sets are available here:

During our work on MapReduce, we developed a benchmark suite which represents a broad range of MapReduce applications exhibiting application characteristics with high/low computation and high/low shuffle volumes. There are a total of 13 benchmarks, out of which Tera-Sort, Word-Count, and Grep are from Hadoop distribution. The rest of the benchmarks were developed in-house and are currently not part of the Hadoop distribution. The three benchmarks from Hadoop distribution are also slightly modified to take number of reduce tasks as input from the user and generate final time completion statistics of jobs.

I couldn’t find any references to this set of benchmarks being used anywhere though.

Original title and link: PUMA: A MapReduce Benchmarks Suite From Purdue (NoSQL database©myNoSQL)


YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak

Put together by the team at Altoros Systems Inc., this time run on Amazon EC2 and including Cassandra, HBase, MongoDB, MySQL Cluster, sharded MySQL, and Riak:

After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.

Teaser: HBase got the best results in most of the benchmarks (with flush turned off though). And I’m not sure the setup included the latest HBase read improvements from Facebook.

Original title and link: YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak (NoSQL database©myNoSQL)


Redis and Memcached Benchmark on Amazon Cloud

Garantia Data, providers of Redis and Memcached as-a-Service-in-the-Amazon-Cloud, published the results of a throughput and latency benchmark for different AWS deployment models:

The first thing we looked at when putting together our benchmark was the various architectural alternatives we wanted to compare. Users typically choose the most economical AWS instance based on the initial size estimate of their dataset, however, it’s crucial to also keep in mind that other AWS users might share the same physical server that runs your data (as nicely explained by Adrian Cockcroft here). This is especially true if you have a small-to-medium dataset, because instances between m1.small and m1.large are much more likely to be shared on a physical server than large instances like m2.2xlarge and m2.4xlarge, which typically run on a dedicated physical server. Your “neighbours” may become “noisy” once they start consuming excess I/O and CPU resources from your physical server. In addition, small-to-medium instances are by nature weaker in processing power than large instances.

Only two comments:

  1. it’s not clear if multiple instances of Redis were used per machine when the chosen instances had multiple cores
  2. I would have really liked to also have a pricing comparison in the conclusion section

Original title and link: Redis and Memcached Benchmark on Amazon Cloud (NoSQL database©myNoSQL)


Benchmarking High Performance I/O With SSD for Cassandra on AWS

Adrian Cockcroft:

The SSD based system running the same workload had plenty of IOPS left over and could also run compaction operations under full load without affecting response times. The overall throughput of the 12-instance SSD based system was CPU limited to about 20% less than the existing system, but with much lower mean and 99th percentile latency. This sizing exercise indicated that we could replace the 48 m2.4xlarge and 36 m2.xlarge with 15 hi1.4xlarge to get the same throughput, but with much lower latency.

Tons of details and data about the benchmarks Netflix ran against the new high I/O SSD-backed EC2 instances. Results are even more impressive than the IOPS numbers in Werner Vogels’s High performance I/O instances for EC2.
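The jump from 12 tested instances to 15 recommended ones follows from the "about 20% less" throughput figure in the quote. A minimal sketch of that arithmetic, assuming throughput scales linearly with instance count (a simplification; the function name and the linear model are mine, not Netflix's):

```python
def instances_for_same_throughput(tested_instances: int, relative_throughput_pct: int) -> int:
    """Scale a tested cluster up until it matches the baseline throughput.

    Assumption: throughput grows linearly with the number of instances.
    Integer ceiling division avoids floating-point rounding surprises.
    """
    return -(-tested_instances * 100 // relative_throughput_pct)


# The 12-instance SSD system delivered about 80% of the existing throughput.
ssd_needed = instances_for_same_throughput(12, 80)

# Existing fleet named in the quote: 48 m2.4xlarge plus 36 m2.xlarge.
existing_instances = 48 + 36

print(f"hi1.4xlarge instances needed: {ssd_needed}")
print(f"instances replaced: {existing_instances}")
```

Under that assumption the result matches the 15 hi1.4xlarge instances in the quote, standing in for 84 existing instances.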

Original title and link: Benchmarking High Performance I/O With SSD for Cassandra on AWS (NoSQL database©myNoSQL)


Performance Evaluation of HBase and How Hardware Changes Results

Two posts by Oliver Meyn on measuring the performance of two HBase clusters (first results on the original cluster and results on the upgraded cluster) using org.apache.hadoop.hbase.PerformanceEvaluation, with the resulting performance charts, Ganglia charts, and some thoughts and feedback from the HBase community.

Original title and link: Performance Evaluation of HBase and How Hardware Changes Results (NoSQL database©myNoSQL)

Hypertable Revival. Still the wrong strategy

After a very long silence (my last post about Hypertable dates back to Oct. 2010: NoSQL database architectures and Hypertable), there seems to be a bit of a revival in the Hypertable space:

  1. there are new packages of (commercial) services (PR announcement):
    1. Uptime support subscription
    2. Training and certification
    3. Commercial license
  2. it seems like Hypertable has a customer in India
  3. it is taking yet another stab at HBase performance

While I’m somewhat glad that Hypertable didn’t hit the deadpool, it’s quite disappointing that they are still trying to use this old and completely useless strategy of attacking another product in the market.

There are probably many marketers out there encouraging companies to use this old trick of getting attention by attacking the market leader[1]. And one of the simplest ways of doing that is by saying “mine is bigger than yours”.

But these days this strategy isn’t working anymore for quite a few reasons:

  1. Benchmarks are most of the time incorrect, thus the attention will be pointed in the wrong direction.

    In the case of the Hypertable vs HBase benchmark, JD Cryans (HBase veteran) is disputing the results.

  2. For existing users, performance issues are already known. Performance issues are also known by core developers that are always working to address them. So nothing new, just some angry users of the attacked product.

  3. For new users, performance is just one aspect of the decision. Most of the time, it’s one of the last considered. Community, support, adoption, and well-known case studies are much more important.

Attacking competitors based on feature checklists might be slightly effective in attracting a bit of attention, but it’s not the strategy to get users and customers and grow a community.

  1. HBase might not be a market leader, but it is definitely one of the NoSQL databases that have seen a few very large deployments.

Original title and link: Hypertable Revival. Still the wrong strategy (NoSQL database©myNoSQL)

Hadoop, HPCC, MapR and the TeraSort Benchmark

Just in, from LexisNexis:

HPCC Systems 4 nodes cluster sorts 100 gigabytes in 98 seconds and is 25% faster than a 20 nodes Hadoop cluster.

Results achieved in December 2011 show that an HPCC Systems four node Thor cluster took only 98 seconds to complete a Terasort with a job size of 100 gigabytes (GB) on a cluster five times smaller than Hadoop. The HPCC Systems four node cluster was comprised of one (1) Dell PowerEdge C6100 2U server with Intel® Xeon® processors E5675 series, 48GB of memory, and 6 x 146GB SAS HDD’s. The Dell C6100 houses four nodes inside the 2U enclosure. The previous leader ran the same Terasort benchmark in 130 seconds on a 20-node Hadoop cluster using equivalent node hardware. HPCC Systems is an Open Source, enterprise-proven Big Data analytics-processing platform.

Thus Armando Escalante (SVP and CTO of LexisNexis Risk Solutions and head of HPCC Systems) concludes:

These results demonstrate that HPCC Systems is a leader in Big Data processing

Now switching to a post on MapR’s blog:

Recently a world record was claimed for a Hadoop benchmark. […] We were surprised to see that this world record was for a TeraSort benchmark on a 100GB of data. TeraSort is a standard benchmark and the name is derived from “sorting a terabyte”.  Any record claims for sorting a 100GB dataset across a 20 node cluster with 10 times as much memory is comical. The test is named TeraSort not GigaSort.
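To put the two claims side by side, here is a small sketch that works only from the figures quoted above: 100 GB in 98 seconds on 4 nodes versus 130 seconds on 20 nodes. The class and variable names are mine, and the sketch ignores the memory difference MapR points out, so treat it as back-of-the-envelope arithmetic rather than a fair comparison:

```python
from dataclasses import dataclass


@dataclass
class SortRun:
    name: str
    nodes: int
    seconds: float
    gigabytes: float = 100.0  # both quoted runs sorted 100 GB

    def gb_per_second_per_node(self) -> float:
        return self.gigabytes / self.seconds / self.nodes


hpcc = SortRun("HPCC Systems Thor", nodes=4, seconds=98)
hadoop = SortRun("Hadoop", nodes=20, seconds=130)

# Fraction of elapsed time saved relative to the 130-second run.
time_saved = (hadoop.seconds - hpcc.seconds) / hadoop.seconds

# How much more data each HPCC node sorted per second.
per_node_ratio = hpcc.gb_per_second_per_node() / hadoop.gb_per_second_per_node()

# Size of the data set relative to the terabyte the benchmark is named after
# (decimal terabyte assumed: 1 TB = 1000 GB).
share_of_terabyte = hpcc.gigabytes / 1000

print(f"elapsed time saved: {time_saved:.1%}")
print(f"per-node throughput ratio: {per_node_ratio:.2f}x")
print(f"data set size vs 1 TB: {share_of_terabyte:.0%}")
```

The "25% faster" headline corresponds to roughly a quarter less elapsed time, while the data set is a tenth of the terabyte in the benchmark's name, which is the core of MapR's objection.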

Original title and link: Hadoop, HPCC, MapR and the TeraSort Benchmark (NoSQL database©myNoSQL)

Rails Caching Benchmarked: MongoDB, Redis, Memcached

A couple of Rails caching solutions (file, memcached, MongoDB, and Redis) benchmarked first here by Steph Skardal and then here by Thomas W. Devol. Thomas W. Devol concludes:

Though it looks like mongo-store demonstrates the best overall performance, it should be noted that a mongo server is unlikely to be used solely for caching (the same applies to redis), it is likely that non-caching related queries will be running concurrently on a mongo/redis server which could affect the suitability of these benchmarks.

I’m not a Rails user, so please take these with a grain of salt:

  • without knowing the size of the cached objects, at 20000 iterations most probably neither MongoDB nor Redis has had to persist to disk.

    This means that all three of memcached, MongoDB, and Redis stored data in memory only[1]

  • if no custom object serialization is used by any of the memcached, MongoDB, or Redis caches, then the performance difference is mostly caused by the performance of the driver

  • it should not be a surprise to anyone that the size of the cached objects can and will influence the results of such benchmarks

  • there doesn’t seem to be any concurrent access to caches. Concurrent access and concurrent updates of caches are real-life scenarios and not including them in a benchmark greatly reduces the value of the results

  • none of these benchmarks seems to contain code that measures the performance of cache eviction

  1. Except in the case where any of these forces a disk write
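To make the points about object size, concurrent access, and eviction concrete, here is a minimal sketch of what such a benchmark loop could look like. It is written in Python, not Rails, and the `LRUCache` class is a toy in-process stand-in of my own, not any of the stores benchmarked in the linked posts; the worker count, capacity, and payload size are arbitrary assumptions:

```python
import threading
import time
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class LRUCache:
    """Toy in-process LRU cache (hypothetical stand-in for a real cache store)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.evictions = 0
        self._data = OrderedDict()
        self._lock = threading.Lock()

    def write(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self.capacity:
                self._data.popitem(last=False)  # drop the least recently used entry
                self.evictions += 1

    def read(self, key):
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)
                return self._data[key]
            return None


def run_benchmark(cache, workers, iterations, payload_bytes):
    """Run concurrent write+read loops and return the elapsed seconds."""
    payload = b"x" * payload_bytes  # object size is an explicit parameter

    def worker(worker_id):
        for i in range(iterations):
            key = f"{worker_id}:{i}"
            cache.write(key, payload)
            cache.read(key)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(worker, range(workers)))  # wait for all workers
    return time.perf_counter() - start


# 4 workers x 5000 iterations = 20000 operations, with a cache far smaller
# than the key space so that eviction is actually exercised.
cache = LRUCache(capacity=1000)
elapsed = run_benchmark(cache, workers=4, iterations=5000, payload_bytes=1024)
print(f"{elapsed:.3f}s elapsed, {cache.evictions} evictions")
```

Rerunning with different `payload_bytes`, `workers`, and `capacity` values shows how much each of these knobs moves the numbers, which is exactly the information missing from the published results.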

Original title and link: Rails Caching Benchmarked: MongoDB, Redis, Memcached (NoSQL database©myNoSQL)