


voltdb: All content tagged as voltdb in NoSQL databases and polyglot persistence

How Scalable is VoltDB?

The Percona guys[1] have benchmarked VoltDB, analyzed the results, and drawn a conclusion about its scalability:

VoltDB is very scalable; it should scale to 120 partitions, 39 servers, and 1.6 million complex transactions per second at over 300 CPU cores

Considering the definition: “A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system.”, the conclusion should be slightly updated:

VoltDB can scale up to 120 partitions on 39 servers with 300 CPU cores and 1.6 million TPS.
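To put these figures in perspective, here is a small back-of-the-envelope sketch that divides the quoted throughput by each hardware dimension. The `scaling_efficiency` helper is my own illustration of the definition above and needs a baseline measurement, which the quoted figures do not include. The post says "over 300" cores in one place, so 300 is a rounded value.

```python
# Figures quoted above for the largest VoltDB configuration.
TOTAL_TPS = 1_600_000
SERVERS = 39
PARTITIONS = 120
CPU_CORES = 300  # the benchmark summary says "over 300", so this is a floor


def per_unit(total_tps, units):
    """Average throughput contributed by each unit of hardware."""
    return total_tps / units


def scaling_efficiency(tps_before, units_before, tps_after, units_after):
    """Observed speedup divided by hardware growth; 1.0 means linear scaling."""
    return (tps_after / tps_before) / (units_after / units_before)


if __name__ == "__main__":
    for label, units in (("server", SERVERS),
                         ("partition", PARTITIONS),
                         ("core", CPU_CORES)):
        print(f"TPS per {label}: {per_unit(TOTAL_TPS, units):,.0f}")
```

With two measurements at different cluster sizes, `scaling_efficiency` tells you how close to "proportionally to the capacity added" the system really is.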

Bottom line:

  • if you can fit your data into 40 servers’ memory
  • you need ACID and SQL
  • you are OK with precompiled Java-based stored procedures
  • you don’t need multi data center deployments

now you can estimate how far you can go with VoltDB.

  1. The company specializing in MySQL services and behind the MySQL Performance Blog

Original title and link: How Scalable is VoltDB? (NoSQL databases © myNoSQL)


MySQL Fork Drizzle Released

Drizzle aims to be different from MySQL, stripping out “unnecessary” features loved by enterprise and OEMs in the name of greater speed and simplicity and for reduced management overhead.

Drizzle has no stored procedures, triggers, or views […]

Aiming to provide a database for the cloud, with support for massive concurrency and optimized for increased performance, the Drizzle team started by removing “non-essential” code and features. Michael Stonebraker’s VoltDB is focusing on a different set of optimizations for achieving performance: removing logging, locking, latching, and buffer management[1].

Anyway, it is not about whose approach is better, but about which scenarios are covered by a simplified MySQL-compatible database and which by an in-memory database with predefined queries.

  1. The “NoSQL” Discussion has Nothing to Do With SQL:

    If one eliminates any one of the above overhead components, one speeds up a DBMS by 25%. Eliminate three and your speedup is limited by a factor of two. You must get rid of all four to run a lot faster.
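The arithmetic behind this quote can be sketched with an Amdahl's-law style calculation. The split below is an assumption of mine, not a number from the quote: each of the four overhead components is taken to cost 20% of the execution time, leaving 20% for useful work.

```python
# Hypothetical time split (an assumption, not measured data): the four
# overhead components named above take 20% each, useful work the last 20%.
OVERHEAD = {
    "logging": 0.2,
    "locking": 0.2,
    "latching": 0.2,
    "buffer management": 0.2,
}


def speedup(fractions, removed):
    """Speedup when the named components are eliminated entirely."""
    remaining = 1.0 - sum(fractions[name] for name in removed)
    return 1.0 / remaining


if __name__ == "__main__":
    names = list(OVERHEAD)
    for count in (1, 3, 4):
        factor = speedup(OVERHEAD, names[:count])
        print(f"eliminate {count} component(s): {factor:.2f}x")
```

Under this assumed split, removing one component gives roughly 1.25x, removing three roughly 2.5x, and removing all four roughly 5x. That is the shape of the argument: the big win only shows up once everything is gone.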



VoltDB: 3 Concepts that Make it Fast

John Hugg lists the 3 concepts that make VoltDB fast:

  1. Exploit repeatable workloads: VoltDB exclusively uses a stored procedure interface.
  2. Partition data to horizontally scale: VoltDB divides data among a set of machines (or nodes) in a cluster to achieve parallelization of work and near linear scale-out.
  3. Build a SQL executor that’s specialized for the problem you’re trying to solve: If stored procedures take microseconds, why interleave their execution with a complex system of row and table locks and thread synchronization? It’s much faster and simpler just to execute work serially.

Let’s take a quick look at these.

Using stored procedures instead of allowing free-form queries would allow the system:

  1. to completely skip query parsing and the creation and optimization of execution plans at runtime
  2. to analyze the set of stored procedures at deploy time and possibly generate the appropriate indexes
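A toy sketch of the idea, in Python rather than VoltDB's Java: procedures are registered once at deploy time and then invoked by name, so nothing is parsed or planned per request. `ProcedureCatalog`, `transfer`, and the `balances` table are all hypothetical names of mine, not VoltDB's API.

```python
class ProcedureCatalog:
    """Toy stand-in for a database that only accepts predefined procedures."""

    def __init__(self):
        self._procedures = {}

    def register(self, name, func):
        # Deploy time: any expensive preparation would happen once, here.
        if name in self._procedures:
            raise ValueError(f"procedure {name!r} is already registered")
        self._procedures[name] = func

    def call(self, name, *args):
        # Run time: a dictionary lookup replaces parsing and planning.
        if name not in self._procedures:
            raise LookupError(
                f"unknown procedure {name!r}: ad hoc queries are not accepted"
            )
        return self._procedures[name](*args)


# Hypothetical in-memory "table" and one procedure working on it.
balances = {"alice": 100, "bob": 50}


def transfer(source, target, amount):
    """Move funds between two accounts and return both new balances."""
    if balances[source] < amount:
        raise ValueError("insufficient funds")
    balances[source] -= amount
    balances[target] += amount
    return balances[source], balances[target]


catalog = ProcedureCatalog()
catalog.register("Transfer", transfer)
```

A client would then run `catalog.call("Transfer", "alice", "bob", 30)`. Anything not registered upfront is rejected, which is the trade-off being discussed.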

The benefits of horizontally partitioned data are well understood: parallelization and easier, more cost-effective hardware usage.

Single threaded execution can also help by removing the need for locking and reducing data access contention.
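The last two ideas, partitioning and serial execution, fit together nicely. Here is a minimal sketch under my own assumptions (CRC32 hashing of the key, one work queue per partition, everything in a single process). It is not how VoltDB is implemented; it only shows why no locks are needed when each partition drains its own queue one item at a time.

```python
import zlib
from collections import deque


class Partition:
    """Owns a slice of the data and runs its work strictly one item at a time."""

    def __init__(self):
        self.rows = {}
        self.queue = deque()

    def run(self):
        results = []
        while self.queue:
            work = self.queue.popleft()
            # Only this loop touches self.rows, so no locking is needed.
            results.append(work(self.rows))
        return results


class Cluster:
    def __init__(self, partition_count):
        self.partitions = [Partition() for _ in range(partition_count)]

    def partition_for(self, key):
        # Deterministic hash, so a key always lands on the same partition.
        index = zlib.crc32(str(key).encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def submit(self, key, work):
        self.partition_for(key).queue.append(work)

    def run_all(self):
        # Partitions share nothing; a real system would run them in parallel.
        return [partition.run() for partition in self.partitions]


def put(key, value):
    """Build a unit of work that stores one row in whatever partition runs it."""
    def work(rows):
        rows[key] = value
        return key
    return work
```

Usage is `cluster.submit(key, put(key, value))` followed by `cluster.run_all()`. Work that spans several partitions is exactly the case this sketch leaves out.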

While these 3 concepts make a lot of sense and can definitely make a system faster, there’s one major aspect of VoltDB that’s missing from the above list and which I think is critical to explaining its speed: VoltDB is an in-memory storage solution.

Here are a couple of examples of other NoSQL databases that benefit from being in memory (or as close as possible to it). MongoDB, while being a lot more liberal with the queries it accepts, can deliver very fast results by keeping as much data in memory as possible (remember what happened when it had to hit the disk more often?) and using appropriate indexes where needed. Redis and Memcached can deliver amazingly fast results because they keep all data in memory. And Redis is single-threaded while Memcached is not.



Integrating VoltDB and Hadoop

A paper on integrating VoltDB and Hadoop. From what I read, for now it works in a single direction (exporting data from VoltDB to Hadoop):

It is possible to design and develop a complete business solution utilizing both VoltDB and Hadoop from scratch. But you do not need to. VoltDB simplifies the process by providing an export facility that lets you automatically archive selected data from the VoltDB database. And you can use this export functionality with Hadoop.



VoltDB Release: Version 1.2 Featuring Data Availability Enhancements

VoltDB 1.2 released earlier this month:

New data availability features. Version 1.2 introduces two important data availability enhancements. The first is network partition tolerance, which allows VoltDB to automatically detect, isolate and manage network failures. This is a critical feature for distributed database infrastructures including those deployed into public clouds such as Amazon’s EC2. The second availability feature, node rejoin, allows VoltDB database nodes that have been taken offline (e.g., for maintenance or repair) to “rejoin” the cluster while the database is live. Node rejoin dynamically resynchronizes all node data.

I’d love to read more about the mechanisms used for automatically detecting, isolating and managing network failures. If I remember correctly, the topic of reliably determining partitions in a distributed system is a central part of Seth Gilbert and Nancy Lynch’s paper on the CAP theorem. It would also be interesting to understand how VoltDB deals with its strong consistency promise in these situations.

And some management tools (nb: from the announcement text I cannot tell if they are available only in the Enterprise version):

New consoles for provisioning, management and monitoring. New in the Enterprise Edition of version 1.2, the VoltDB Enterprise Manager (VEM) provides database and systems administrators with browser-based tools for managing production VoltDB databases. VEM offers a flexible suite of consoles for performing many common administrative and diagnostic activities.



Using MySQL as NoSQL: A Story for exceeding 750k qps

How many times do you need to run PK lookups per second? […] These are “SQL” overhead. It’s obvious that performance drops were caused by mostly SQL layer, not by “InnoDB(storage)” layer. MySQL has to do a lot of things like below while memcached/NoSQL do not need to do.

  • Parsing SQL statements
  • Opening, locking tables
  • Making SQL execution plans
  • Unlocking, closing tables

MySQL also has to do lots of concurrency controls.

The story has been out for a couple of weeks already, so I’ll not get into the details. But I felt like adding a couple of comments to the subject:

  • existing RDBMS storage engines are, most of the time, very well thought out and tested over a long time
  • some NoSQL databases have realized that and allow plugging such storage engines into their systems:
    • Project Voldemort supports Berkeley DB (and MySQL, though I’m not sure it bypasses the SQL engine)
    • Riak comes with Innostore, an InnoDB-based storage backend
  • many of the findings in this article sound very close to the rationale behind VoltDB, including the pre-compiled, cluster deployed stored procedures



VoltDB: An SQL Developer’s Perspective

Two hours of VoltDB. I’m planning to watch it over the weekend.


NoSQL benchmarks and performance evaluations

Some say it is the right time to start having these around. Others say it’s way too early to start the “battle”. Users do want to see them, and when they are lacking, users create their own, most of the time using incomplete or wrong approaches.

But what am I talking about? As some of you might have guessed already:

NoSQL benchmarks and performance evaluations!

With their recent release of Riak 0.11.0, the Basho guys have also published their internal ☞ benchmarking code. Similar internal benchmark code is ☞ available for MongoDB.

But users are more interested in seeing cross-product benchmarks, even if most of the time constructing these is extremely complicated and they end up comparing apples with oranges.

All this being said, and accepting that most of the time someone will figure out a way to invalidate the results, let’s see what cross-product benchmarks we have in the NoSQL space.

Yahoo! Cloud Serving Benchmark

The Yahoo! Cloud Serving Benchmark’s goal is to facilitate performance comparisons of the new generation of cloud data serving systems. The source code is available on ☞ GitHub and Yahoo! has also published ☞ the results of running this benchmark against Cassandra, HBase, Yahoo!’s PNUTS, and a simple sharded MySQL implementation.

VoltDB Benchmark

VoltDB, a new storage solution that calls itself the next-generation SQL RDBMS with ACID for fast-scaling OLTP applications, has recently ☞ published the results of their benchmark comparing VoltDB and Cassandra.

It is worth noting that while this is one of those apples-to-oranges comparisons (nb: the authors are well aware of it), there are still a couple of interesting and useful things to be learned from it (e.g. the benchmarking procedure and the tested scenarios).

Unfortunately at this time the source code is not yet available, but hopefully we will see it soon:

Going forward, we’re planning to release the code we used to do these benchmarks. We’d also like to try a few other storage layers

Hypertable and HBase Performance Evaluation

The guys behind Hypertable ☞ have published the results of comparing Hypertable with HBase using a benchmark based on the Google BigTable paper[1], from which both HBase and Hypertable inherit their architecture. Unfortunately, the benchmark code is not available at this moment.

Update: Thanks to Stu Hood, I now know that the code for this benchmark is included in the Hypertable distribution, available ☞ here (tar.gz), and that the configuration files are also available ☞ here (tar.gz).

So, as far as I could gather, we have:

  • Yahoo! Cloud Serving Benchmark
  • VoltDB Benchmark
  • Hypertable and HBase Performance Evaluation

Did I miss any?

  1. The BigTable paper is available ☞ here

VoltDB Don’ts Validating NoSQL Assumptions

Interesting to note that some VoltDB don’ts from the paper ☞ Do’s and Don’ts (pdf) are validating some major assumptions in the NoSQL space:

Don’t create tables with very large rows (that is, lots of columns or large VARCHAR columns). Several smaller tables with a common partitioning key are better.

Basically, both wide-column stores (e.g. Cassandra, HBase, Hypertable) with their column families and document databases (e.g. CouchDB, MongoDB, RavenDB, Terrastore) with their schema-less approach are addressing this issue.

Don’t use ad hoc SQL queries as part of a production application.

Firstly this points to the mindset change required by the NoSQL space when doing data modeling: think about data access patterns.

Secondly, it pretty much validates the CouchDB and RavenDB approach of defining queries upfront, which makes their reads extremely fast.
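To make the "queries defined upfront" point concrete, here is a toy sketch of a store whose views are declared before they are queried. `ViewStore` and its methods are hypothetical names, and the index is maintained eagerly on every write, which is a simplification of mine and not a description of how CouchDB or RavenDB build their indexes.

```python
from collections import defaultdict


class ViewStore:
    """Toy document store: reads go through views declared ahead of time."""

    def __init__(self):
        self._docs = {}
        self._views = {}    # view name -> function extracting the key
        self._indexes = {}  # view name -> {key: set of document ids}

    def define_view(self, name, key_func):
        # Index whatever is already stored, then keep it current on writes.
        index = defaultdict(set)
        for doc_id, doc in self._docs.items():
            index[key_func(doc)].add(doc_id)
        self._views[name] = key_func
        self._indexes[name] = index

    def put(self, doc_id, doc):
        old = self._docs.get(doc_id)
        for name, key_func in self._views.items():
            index = self._indexes[name]
            if old is not None:
                index[key_func(old)].discard(doc_id)
            index[key_func(doc)].add(doc_id)
        self._docs[doc_id] = doc

    def query(self, name, key):
        # A read is an index lookup: no scan, no query planning.
        doc_ids = self._indexes[name].get(key, set())
        return [self._docs[doc_id] for doc_id in sorted(doc_ids)]
```

For example, after `store.define_view("by_city", lambda doc: doc["city"])`, a call to `store.query("by_city", "Rome")` only touches the index. The price is paid on writes and in having to know the access patterns in advance.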