NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



PNUTS: All content tagged as PNUTS in NoSQL databases and polyglot persistence

Yahoo! Sherpa: Status and Advances

Information about Yahoo! PNUTS/Sherpa is so rare. Except the original PNUTS architecture paper (PDF) and the Sherpa: Cloud Computing of the Third Kind slides (PDF), it’s difficult to find something else. But in September, the Yahoo! Developer blog posted an update about Sherpa.

Sherpa Status

  • 500+ Sherpa tables
  • 50,000+ Sherpa tablets(shards) in operation
  • tablets can be copied, moved, or split dynamically
  • multi-data center
  • multi-tenant: it must support different ranges of read/write ratios
  • 75000 requests per second
  • supports heterogeneous servers
    • some are SSD exclusively
  • Sherpa users have access to monitoring tools to examine their app latency and throughput SLA. This implies each application could negotiate different SLAs with the infrastructure team.
  • Sherpa default storage engine is MySQL/InnoDB
    • storage access have been abstracted
    • Yahoo! tested BDB, BDB-Java, and Log-Structured Merge (LSM) Tree developed by Yahoo Labs backends
  • Sherpa relies on a reliable messaging system
    • can guarantee reliable in-colo and cross-colo transactions

Sherpa Advances

Selective Record Replication

  • previous versions supported Table-level replication
  • new version support Record-level replication
  • designed for efficiency (minimize costs of transfer and storage) and legality (ensure copyright regulations, other legal concerns)
  • replication locations are declarative/static
  • Sherpa maintains a ‘stub’ of the record in locations that do not have a full copy
    • the stub is updated only when a record is created, deleted, or when replication rules change
    • the stub is used for routing requests


  • Support for full table backups has been added
  • Point-in-time recovery planned
  • Also planned full, cross-colo, and automatic table restoration

Task Manager using Sherpa Tables for Task State

  • Sherpa has added a general workflow manager to execute long-running tasks
  • It is used for backup and restore operations

The complete post can be read here.

Considering Yahoo! has always been a big proponent of open source projects, it is a pitty that we don’t have the chance to hear more often and more details about PNUTS/Sherpa.

Original title and link: Yahoo! Sherpa: Status and Advances (NoSQL database©myNoSQL)

Memcached and Sherpa for Yahoo! News Activity Data Service

Mixer, the recently announced Yahoo’s new data service for news activities, uses Memcached and Sherpa for its data backend. Plus a combination of asynchronous libraries and task execution tools:

Mixer - Memcached Sherpa Yahoo News Activity

The data processing model and the clear separation between read and write data solutions is not only compelling, but essential for maintaining the SLA (max. 250ms/response):

Memcache maintains two types of materialized views: 1) Consumer-pivoted, and 2) Producer-pivoted. Consumer-pivoted views (e.g. user’s friends’ latest read activity) are refreshed at query time by refresh tasks. Producer-pivoted views (e.g. user’s latest read activity) are refreshed at update time (i.e. when “read” event is posted). And producer-pivoted views are used to refresh consumer-pivoted views.

Sherpa is Yahoo!’s cloud-based NoSql data store that provides low-latency reads and writes of key-value records and short range scans. Efficient range scans are particular important for the Mixer use cases. The “read” event is stored in the Updates table. The Updates table is a Sherpa Distributed Ordered Table that is ordered by “user,timestamp desc”. This provides efficient scans through a user’s latest read activity. A reference to the “read” record is stored in the UpdatesIndex table to support efficient point lookups. UpdatesIndex is a Sherpa Distributed Hash Table

Original title and link: Memcached and Sherpa for Yahoo! News Activity Data Service (NoSQL database©myNoSQL)


An Introduction to MongoDB

Good slidedeck from Chris Westin (10gen engineer)—I particularly liked the slides summarizing some of the limitations in the relational databases:

Schema Evolution

  • Applications are evolving all the time
    • Applications need new fields
    • Applications need new indexes
    • Data is growing — sometimes very fast
  • Users need to be abelt o alter their schemas without making their data unavailable

Write Rates

  • Replication is a solution for high read loads
  • Sooner or lager, writing becomes a bottleneck
  • Sharding
    • Joins and aggregation become a problem
    • Distributed transactions are too slow for the web
    • Manual management of shards
      • Choosing shard partitions
      • Rebalancing shards

Reading through these reminded me of the PNUTS paper, the datastorage solution used by Yahoo!, which seems to have good answers to many of these limitations. The BigTable paper led to creation of HBase and Hypertable and its data model is used in Cassandra too. The Dynamo paper led to the creation of Riak, Project Voldemort, and is used as the distribution model for Cassandra. But I don’t think there’s anything out there taking inspiration from the PNUTS paper.

Original title and link: An Introduction to MongoDB (NoSQL database©myNoSQL)

Hadoop and NoSQL Databases at Twitter

Three presentations covering the various NoSQL usages at Twitter:

  1. Kevin Weil talking about data analysis using Scribe for logging, base analysis with Pig/Hadoop, and specialized data analysis with HBase, Cassandra, and FlockDB on InfoQ

  2. Ryan King’s presentation from last year’s QCon SF NoSQL track on Gizzard, Cassandra, Hadoop, and Redis on InfoQ

  3. Dmitriy Ryaboy on Hadoop from Devoxx 2010:

By looking at the powered by NoSQL page and my records, Twitter seems to be the largest adopter of NoSQL solutions. Here is an updated version of who is using Cassandra and HBase

  • Twitter: Cassandra, HBase, Hadoop, Scribe, FlockDB, Redis
  • Facebook: Cassandra, HBase, Hadoop, Scribe, Hive
  • Netflix: Amazon SimpleDB, Cassandra
  • Digg: Cassandra
  • SimpleGeo: Cassandra
  • StumbleUpon: HBase, OpenTSDB
  • Yahoo!: Hadoop, HBase, PNUTS
  • Rackspace: Cassandra

And probably many more missing from the list. But that could change if you leave a comment.

Original title and link: Hadoop and NoSQL Databases at Twitter (NoSQL databases © myNoSQL)

NoSQL benchmarks and performance evaluations

Some say it is the right time to start having these around. Others are saying it’s way to early to start the “battle”. Users do want to see them and in case they’re lacking they create their own, most of the time using incomplete or wrong approaches.

But what am I talking about? As some of you might have guessed already:

NoSQL benchmarks and performance evaluations!

With their recent release of Riak 0.11.0, Basho guys have also published their internal ☞ benchmarking code. Similar internal benchmark code is ☞ available for MongoDB.

But users are more interested in seeing cross product benchmarks, even if most of the time constructing these is extremely complicated and they end up comparing apples with oranges.

All these being said and accepting that most of the time someone will figure out a way to invalidate the results, lets see what cross product benchmarks do we have in the NoSQL space.

Yahoo! Cloud Serving Benchmark

The Yahoo! Cloud Serving Benchmark’s goal is to facilitate performance comparisons of the new generation of cloud data serving systems. The source code is available on ☞ GitHub and Yahoo! has also published ☞ the results of running this benchmark against Cassandra, HBase, Yahoo!’s PNUTS, and a simple sharded MySQL implementation.

VoltDB Benchmark

VoltDB a new storage solution that calls itself the next-generation SQL RDBMS with ACID for fast-scaling OLTP applications has recently ☞ published the results of their benchmark comparing VoltDB and Cassandra.

It is worth noting that while being one of those apples to oranges comparisons (nb and the authors are well aware of it), there are still a couple of interesting and useful things to be learned from it (i.e. benchmarking procedure, tested scenarios, etc.)

Unfortunately at this time the source code is not yet available, but hopefully we will see it soon:

Going forward, we’re planning to release the code we used to do these benchmarks. We’d also like to try a few other storage layers

Hypertable and HBase Performance Evaluation

The guys behind Hypertable ☞ have published their results of comparing Hypertable with HBase using a benchmark based on the Google BigTable paper[1] from which both HBase and Hypertable are inheriting their architecture. Unfortunately, the benchmark code is not available at this moment.

Thanks to Stu Hood, now I know the code for this benchmark is available in the Hypertable distribution available ☞ here (tar.gz) and the configuration files are also available ☞ here (tar.gz)

So, as far as I could gather we have:

Did I miss any?

  1. The BigTable paper is available ☞ here  ()

Cassandra, HBase, and PNUTS Compared

From ☞ Amandeep Khurana a nice matrix comparing characteristics of Cassandra, HBase and PNUTS:

Recently, in a course that I’m taking at UC Santa Cruz, I got a chance to present the PNUTS paper and compare the system with Bigtable and Dynamo.

The Cassandra, HBase, and PNUTS matrix includes details about:

  • data model
  • consistency model
  • ACID semantics
  • storage
  • replication
  • fault tolerance
  • scalability
  • aplicability

These systems are also part of the ☞ Yahoo! Cloud Serving Benchmark, a benchmark that has been open sourced recently and is available on ☞ GitHub

You should also check Cassandra and HBase Compared