


performance: All content tagged as performance in NoSQL databases and polyglot persistence

MySQL slow query collection sources

Morgan Tocker:

The other day it struck me that MySQL applications have no fewer than four sources from which to collect potentially slow queries for analysis, and that I actually find myself using three of the four methods available.

The sources listed:

  1. application logging/monitoring
  2. performance schema
  3. slow query log file
  4. slow query log table — I didn’t know about this one.

For the details of each of these, read the post.
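As a quick illustration of source 3, the slow query log file, here is a rough sketch in Python of pulling the slow statements out of it. The sample log and threshold are invented, but the `# Query_time:` statistics line follows the format mysqld writes to the text log:

```python
import re

# A rough sketch of mining the slow query log file. The sample below is
# made up, but the "# Query_time:" statistics line follows the format
# mysqld writes to the text log.
SAMPLE_LOG = """\
# Time: 2013-06-15T10:00:01
# User@Host: app[app] @ localhost []
# Query_time: 2.104233  Lock_time: 0.000012 Rows_sent: 1  Rows_examined: 954221
SELECT * FROM orders WHERE status = 'open';
# Time: 2013-06-15T10:00:07
# User@Host: app[app] @ localhost []
# Query_time: 0.003201  Lock_time: 0.000002 Rows_sent: 10  Rows_examined: 10
SELECT id FROM users LIMIT 10;
"""

STATS_RE = re.compile(
    r"# Query_time: (?P<qt>[\d.]+)\s+Lock_time: [\d.]+"
    r"\s+Rows_sent: \d+\s+Rows_examined: (?P<rows>\d+)"
)

def slow_queries(log_text, threshold=1.0):
    """Return (query_time, rows_examined, sql) for entries over `threshold` seconds."""
    results, stats = [], None
    for line in log_text.splitlines():
        match = STATS_RE.match(line)
        if match:
            stats = (float(match.group("qt")), int(match.group("rows")))
        elif stats and not line.startswith("#"):
            if stats[0] >= threshold:
                results.append((stats[0], stats[1], line.strip()))
            stats = None
    return results

print(slow_queries(SAMPLE_LOG))
```

In practice a tool like pt-query-digest does this job far better; the point is only how little structure the log has.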

Original title and link: MySQL slow query collection sources (NoSQL databases © myNoSQL)


Beyond averages

A short (6′25″) talk by Dan Kuebrich about the importance of using the right abstractions and visualizations when analyzing performance:

Speaking about performance analysis and visualization, Brendan Gregg’s “Systems Performance” book is available now.

Original title and link: Beyond averages (NoSQL databases © myNoSQL)

When More Machines Equals Worse Results

Galileo observed how things broke if they were naively scaled up.

Google found the larger the scale the greater the impact of latency variability. When a request is implemented by work done in parallel, as is common with today’s service oriented systems, the overall response time is dominated by the long tail distribution of the parallel operations. Every response must have a consistent and low latency or the overall operation response time will be tragically slow.

Fantastic post from Todd Hoff on the (hopefully) well-known truth: “the response time of a distributed parallel system is the time of its slowest component”.
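The tail effect is simple arithmetic: a request fanned out to n backends waits for the slowest one, so even a rare per-component slowdown dominates at scale. A tiny sketch (the 1% slow-tail figure is illustrative):

```python
# Tail-at-scale arithmetic: a request fanned out to n backends waits for
# the slowest call, so the chance the whole request is slow is the chance
# that at least one backend hits its slow tail.
def p_request_slow(n_backends, p_component_slow=0.01):
    """Probability that at least one of n parallel calls is slow."""
    return 1 - (1 - p_component_slow) ** n_backends

for n in (1, 10, 100):
    print(f"{n} backends -> {p_request_slow(n):.1%} of requests are slow")
```

With a 1% component tail, a 100-way fan-out makes roughly 63% of requests hit it — which is why consistent low latency per component matters more than a good average.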

Original title and link: When More Machines Equals Worse Results (NoSQL databases © myNoSQL)


Possible 100-fold increase in data storage speed

European researchers may have found a way to speed up data storage 100-fold, breaking one barrier holding back how fast data can be transferred. […] The researchers at York University in the U.K. and Nijmegen University in the Netherlands accomplished the feat by heating a magnetic material with laser bursts that alter what is called the magnetic spin of the material at the atomic level, according to an explanation by York University. There are two possible spins, parallel and anti-parallel, and in storage, these binary states would represent the ones and zeros that designate bit types.

I still find the salmon storage more mouthwatering.

Original title and link: Possible 100-fold increase in data storage speed (NoSQL databases © myNoSQL)


MapReduce With Hadoop: What Happens During Mapping

An interesting look at what happens during the map phase in Hadoop and the impact of emitting key-value pairs:

  • a direct negative impact on the map time and CPU usage, due to more serialization
  • an indirect negative impact on CPU due to more spilling and additional deserialization in the combine step
  • a direct impact on the map task, due to more intermediate files, which makes the final merge more expensive

[Diagram: MapReduce combine]

The main point of the dynaTrace blog post is that even if Hadoop makes it easy to throw more hardware at a problem, wasting resources with bad code in MapReduce tasks comes with a noticeable and measurable cost.
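The cost of emitted pairs can be illustrated with a toy word count plus a map-side combine — a plain-Python stand-in for the concept, not the Hadoop API:

```python
from collections import Counter

# A toy word count showing why the number of emitted pairs matters: a
# map-side combine collapses repeated keys before they are serialized,
# spilled, and merged. Plain-Python stand-in, not the Hadoop API.
def map_phase(lines):
    """Naive mapper: emits one (word, 1) pair per token."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Map-side combine: pre-aggregates counts so far fewer pairs move on."""
    return list(Counter(word for word, _ in pairs).items())

lines = ["to be or not to be"] * 1000
raw = map_phase(lines)
combined = combine(raw)
print(len(raw), "->", len(combined))  # 6000 pairs shrink to 4
```

Every pair avoided here is serialization, spill, and merge work avoided on a real cluster.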

Original title and link: MapReduce With Hadoop: What Happens During Mapping (NoSQL databases © myNoSQL)


Asking for Performance and Scalability Advice on StackOverflow

How many times have you gotten an answer that applies to your specific scenario when providing a short list of performance and scalability requirements? MySQL/InnoDB can do 750k qps, Cassandra scales linearly, MongoDB can do 8 million ops/sec. Are any of these the answer for your application?


  • How many times did you get all the requirements right at spec time?

  • How many times did the requirements remain the same during the development cycle?

  • How many times did production reality confirm your bullet-list requirements?

Original title and link: Asking for Performance and Scalability Advice on StackOverflow (NoSQL databases © myNoSQL)

Optimizing MongoDB: Lessons Learned at Localytics

These slides have generated quite a reaction on Twitter. I’ll let you decide for yourself the reasons:

While there have been lots of retweets, here’s just a glimpse of what type of reactions I’m referring to:

VoltDB: 3 Concepts that Make it Fast

John Hugg lists the 3 concepts that make VoltDB fast:

  1. Exploit repeatable workloads: VoltDB exclusively uses a stored procedure interface.
  2. Partition data to horizontally scale: VoltDB divides data among a set of machines (or nodes) in a cluster to achieve parallelization of work and near linear scale-out.
  3. Build a SQL executor that’s specialized for the problem you’re trying to solve: If stored procedures take microseconds, why interleave their execution with a complex system of row and table locks and thread synchronization? It’s much faster and simpler just to execute work serially.

Let’s take a quick look at these.

Using stored procedures — instead of allowing free form queries — would allow the system:

  1. to completely skip query parsing and the runtime creation and optimization of execution plans
  2. to generate, where possible, the appropriate indexes by analyzing the set of stored procedures at deploy time

The benefits of horizontally partitioned data are well understood: parallelization and also easier and cost effective hardware usage.
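A minimal sketch of the partitioning idea: each row is owned by exactly one node, so single-partition work parallelizes across the cluster. The node names, keys, and hash choice below are invented for illustration, not VoltDB's actual scheme:

```python
import hashlib

# A minimal sketch of hash partitioning: every row is owned by exactly one
# node, so single-partition work runs in parallel across the cluster.
# Node names and keys are invented for the illustration.
NODES = ["node-0", "node-1", "node-2"]

def owner(partition_key: str) -> str:
    # md5 gives a stable hash across processes (unlike Python's hash())
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# The same key always routes to the same node; different keys spread out.
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", owner(key))
```

Adding nodes grows capacity near linearly as long as most transactions touch a single partition.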

Single threaded execution can also help by removing the need for locking and reducing data access contention.
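The serial-execution idea can be sketched as a single dedicated thread draining a queue of procedures: since only that thread ever touches the data, the data structure itself needs no row or table locks. Class and method names here are illustrative, not VoltDB's API:

```python
from queue import Queue
from threading import Thread

# A sketch of "execute work serially": producers enqueue procedures and a
# single dedicated thread applies them in order, so the data structure
# itself needs no locks. Names are illustrative, not VoltDB's actual API.
class SerialExecutor:
    def __init__(self):
        self.data = {}                  # touched by the worker thread only
        self._queue = Queue()
        self._thread = Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            proc, done = self._queue.get()
            if proc is None:            # shutdown sentinel
                return
            proc(self.data)             # serial execution: no locking needed
            done.put(True)

    def submit(self, proc):
        done = Queue()
        self._queue.put((proc, done))
        done.get()                      # block until the procedure has run

    def stop(self):
        self._queue.put((None, None))
        self._thread.join()

executor = SerialExecutor()
for _ in range(100):
    # a toy "stored procedure": increment a counter without any lock
    executor.submit(lambda d: d.__setitem__("count", d.get("count", 0) + 1))
executor.stop()
print(executor.data["count"])  # 100
```

The trade-off is that one slow procedure stalls everything behind it — which is exactly why the model assumes microsecond stored procedures.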

While these 3 techniques make a lot of sense and can definitely make a system faster, there’s one major aspect of VoltDB missing from the above list which I think is critical to explaining its speed: VoltDB is an in-memory storage solution.

Here are a couple of examples of other NoSQL databases that benefit from being in memory (or as close as possible to it). MongoDB, while being a lot more liberal with the queries it accepts, can deliver very fast results by keeping as much data in memory as possible — remember what happened when it had to hit the disk more often? — and using appropriate indexes where needed. Redis and Memcached can deliver amazingly fast results because they keep all data in-memory. And Redis is single threaded while Memcached is not.

Original title and link: VoltDB: 3 Concepts that Make it Fast (NoSQL databases © myNoSQL)


High Rate insertion with MySQL and Innodb

On an 8-core Opteron box we were able to achieve 275K inserts/sec, at which point the load started to get IO-bound because of log writes and flushing dirty buffers. I’m confident you can get to 400K+ inserts/sec on faster hardware and disks (say, better RAID or flash), which is a very cool number. Of course, mind you, this is in-memory insertion into a simple table; a table with long rows and a bunch of indexes will see lower numbers.

There are more caveats in the article. Not sure though how to compare this number with the 750k qps on the NoSQLish MySQL.

Original title and link: High Rate insertion with MySQL and Innodb (NoSQL databases © myNoSQL)


Deferring Processing Updates to Increase HBase Write Performance

Alex Baranau:

The idea behind deferred updates processing is to postpone updating the existing record and to store incoming deltas as new records. Thus, record update operations become simple write operations with the corresponding performance. The deferred updates technique elaborated here fits well when the system handles a lot of updates of stored data and write performance is the main concern, while reading speed requirements are not as strict. The following cases (each of them separately or in any combination) may indicate that one can benefit from using the technique (the list is not complete):

  • updates are well spread over the whole of a large dataset
  • a lower rate of “true updates” among write operations (i.e. a low percentage of writes modify existing data; most writes are for completely new data)
  • a good portion of the data stored/updated may never be accessed
  • the system should be able to handle high write peaks without major performance degradation

Sounds like CQRS event stores.
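The pattern can be sketched in a few lines: writes blindly append deltas instead of doing read-modify-write, and a periodic compaction merges the deltas into the base record. The store layout below is invented for the sketch, not taken from the HBase implementation:

```python
# A toy version of the deferred-update scheme: writes append deltas instead
# of read-modify-write; a periodic compaction merges the deltas into the
# base record. Store layout is invented for the sketch.
class DeferredStore:
    def __init__(self):
        self.base = {}      # key -> merged value
        self.deltas = {}    # key -> list of pending deltas

    def update(self, key, delta):
        # blind append: no read of the existing record, so writes stay cheap
        self.deltas.setdefault(key, []).append(delta)

    def compact(self):
        # the deferred processing: merge pending deltas into base records
        for key, pending in self.deltas.items():
            self.base[key] = self.base.get(key, 0) + sum(pending)
        self.deltas.clear()

    def read(self, key):
        # reads pay the merge cost instead of writes
        return self.base.get(key, 0) + sum(self.deltas.get(key, []))

store = DeferredStore()
for _ in range(5):
    store.update("pageviews", 1)
print(store.read("pageviews"))   # 5, merged on read
store.compact()
print(store.base["pageviews"])   # 5, merged into the base record
```

As in a CQRS event store, the write path records what happened and a separate step folds it into the current state.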

Project available on ☞ GitHub.

Original title and link: Deferring Processing Updates to Increase HBase Write Performance (NoSQL databases © myNoSQL)


Project Voldemort Performance Tool

Project Voldemort gets a performance tool from Roshan Sumbaly:

  • Run using bin/
  • Has a warmup phase to insert records (--record-count)
  • Various record selection distributions
  • Can fix client throughput to measure latency under certain load