


Cassandra: All content tagged as Cassandra in NoSQL databases and polyglot persistence

11 Interesting Releases From the First Weeks of January

The list of releases I wanted to post about has been growing fast these last couple of weeks, so instead of waiting any longer, here it is (in no particular order1):

  1. (Jan.2nd) Cassandra 1.2 — announcement on DataStax’s blog. I’m currently learning and working on a post looking at what’s new in Cassandra 1.2.
  2. (Jan.10th) Apache Pig 0.10.1 — Hortonworks wrote about it
  3. (Jan.10th) DataStax Community Edition 1.2 and OpsCenter 2.1.3 — DataStax announcement
  4. (Jan.10th) CouchDB 1.0.4, 1.1.2, and 1.2.1 — releases fixing some security vulnerabilities
  5. (Jan.11th) MongoDB 2.3.2 unstable — announcement. This dev release includes support for full text indexing. For more details you can check:

    […] an open source project extending Hadoop and Hive with a collection of useful user-defined-functions. Its aim is to make the Hive Big Data developer more productive, and to enable scalable and robust dataflows.

  1. I’ve tried to order them chronologically, but I’ve most probably failed. 

Original title and link: 11 Interesting Releases From the First Weeks of January (NoSQL database©myNoSQL)

CCM: A Tool for Creating Local Cassandra Clusters

This useful little gem for creating local Cassandra test clusters was mentioned in Peter Bailis’s post Using Probabilistically Bounded Staleness in Cassandra 1.2.0, but I didn’t catch it until today when the DataStax guys blogged about it:

CCM (Cassandra Cluster Manager) is a tool written by Sylvain Lebresne that creates multi-node cassandra clusters on the local machine. It is great for quickly setting up clusters for development and testing, and is the foundation that the cassandra distributed tests (dtests) are built on. In this post I will give an introduction to installing and using ccm.

Original title and link: CCM: A Tool for Creating Local Cassandra Clusters (NoSQL database©myNoSQL)


Using Probabilistically Bounded Staleness in Cassandra 1.2.0

Peter Bailis:

With the help of the Cassandra community, we recently released PBS consistency predictions as a feature in the official Cassandra 1.2.0 stable release. In case you aren’t familiar, PBS (Probabilistically Bounded Staleness) predictions help answer questions like: how eventual is eventual consistency? how consistent is eventual consistency? These predictions help you profile your existing Cassandra cluster and determine which configuration of N,R, and W are the best fit for your application, expressed quantitatively in terms of latency, consistency, and durability (see output below).

If I get this right, this tool should become a must-run step before going into production, and also a good starting point for investigating WTFs like “what am I supposed to do to avoid getting stale data?”
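To get a feel for the kind of question PBS answers, here’s a minimal sketch of the static quorum arithmetic it generalizes (my own illustration, not the PBS code): with N replicas, a write acknowledged by W of them, and a read contacting R replicas chosen uniformly, the chance the read misses every up-to-date replica is C(N−W, R)/C(N, R). PBS goes further and models replication lag over time, but the static case already shows why partial quorums can return stale data:

```python
from math import comb

def p_stale_read(n: int, r: int, w: int) -> float:
    """Probability that a read quorum of R nodes misses every one of the
    W replicas holding the latest write, with quorums chosen uniformly.
    This is the static best case; PBS additionally models replication lag."""
    if r + w > n:   # overlapping quorums: the read always sees the write
        return 0.0
    return comb(n - w, r) / comb(n, r)

# N=3, R=1, W=1: the classic "fast but eventually consistent" setting
print(p_stale_read(3, 1, 1))   # C(2,1)/C(3,1) = 2/3
print(p_stale_read(3, 2, 2))   # quorum reads and writes: never stale
```

So with R=W=1 on a 3-replica cluster, two reads out of three can miss the latest write in the worst case; PBS tells you how much replication lag shrinks that window in practice.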

Original title and link: Using Probabilistically Bounded Staleness in Cassandra 1.2.0 (NoSQL database©myNoSQL)


Cassandra at MetricsHub for Cloud Monitoring

Charles Lamanna (CEO MetricsHub):

We use Cassandra for recording time series information (e.g. metrics) as well as special events (e.g. server failure) for our customers. We have a multi-tenant Cassandra cluster for this. We record over 16 data points per server per second, 24 hours a day, 7 days a week. We use Cassandra to store and crunch this data.

Many NoSQL databases can be used for monitoring. For example, for small-scale self-monitoring you could use Redis.

Original title and link: Cassandra at MetricsHub for Cloud Monitoring (NoSQL database©myNoSQL)


Cassandra Application Performance Management With Request Tracing

In two posts (here and here), Jonathan Ellis introduces a new feature in Cassandra 1.2: request tracing. Essentially, such a feature is an improvement over more generic APM tools like AppDynamics or NewRelic.

Be judicious with this: tracing a request will usually require at least 10 rows to be inserted, so it is far from free. Unless you are under very light load, tracing all requests (probability 1.0) will probably overwhelm your system. I recommend starting with a small fraction, e.g. 0.001, and increasing that only if necessary.

Years ago I had to implement a tracing layer myself1, after trying to get information out of that system with some commercial tools (I’m sure these have gotten better since then). There were a few goals I planned for, and there were many things I learned after deploying it live:

  1. granularity of the probes is critical to understanding how the system behaves. Use too-coarse-grained probes and you’ll miss important details; use too-fine-grained probes and you’ll be flooded with unusable data
  2. deciding whether traces are persistent or volatile, and the impact on system performance. Should you be able to retrieve older traces? If persistent, do they contain enough information to help explain a specific behavior? Can they be used to replay a scenario?
  3. deciding what requests should be traced, and when. Tracing comes with a cost, and you must try to minimize its impact on the system. The most important data is needed when the system misbehaves or is under high load, but that’s exactly when additional work could bring it down
  4. probabilistic vs. pattern vs. behavioral tracing. Generic solutions have no knowledge of the system, but a custom one can be built around that knowledge
  5. trace ordering. Can historical tracing information be ordered?

And there are probably many other things I don’t remember anymore.
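To make a couple of these points concrete, here’s a small sketch of a tracing probe (all names are hypothetical, loosely modeled on the list above): the sampling ratio (point 3) and the granularity threshold (point 1) are both runtime-configurable, so trace points never need to change when you tune them.

```python
import random
import time

# Granularity levels for probes; only probes at or below the current
# threshold are recorded (point 1 above).
COARSE, MEDIUM, FINE = 1, 2, 3

class Tracer:
    """Hypothetical tracing layer with a configurable sampling ratio
    (point 3) and probe granularity (point 1)."""

    def __init__(self, sample_ratio: float = 0.001, granularity: int = COARSE):
        self.sample_ratio = sample_ratio
        self.granularity = granularity
        self.traces = []   # volatile, in-memory storage (point 2)

    def should_trace(self) -> bool:
        # Decided once per request, so a traced request keeps all its probes
        return random.random() < self.sample_ratio

    def probe(self, traced: bool, level: int, name: str, **data) -> None:
        if traced and level <= self.granularity:
            # Timestamps keep probes orderable within a request (point 5)
            self.traces.append((time.monotonic(), name, data))

tracer = Tracer(sample_ratio=1.0, granularity=MEDIUM)
traced = tracer.should_trace()
tracer.probe(traced, COARSE, "request.start", path="/users/42")
tracer.probe(traced, FINE, "cache.lookup", hit=False)   # dropped: too fine
tracer.probe(traced, COARSE, "request.end", status=200)
print([name for _, name, _ in tracer.traces])   # ['request.start', 'request.end']
```

Incidentally, Cassandra 1.2 exposes the same knob as `nodetool settraceprobability`, which is exactly the “small fraction, e.g. 0.001” recommendation from the quote above.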

  1. My implementation was specific to the system (in the sense that it had different tracing capabilities based on request types), but it was generic enough to allow us to change the granularity of collected probes, introduce new trace points, and also change the ratio of the requests to be traced.  

Original title and link: Cassandra Application Performance Management With Request Tracing (NoSQL database©myNoSQL)

Cassandra Query Language CQL3 Explained

CQL3 (the Cassandra Query Language) provides a new API to work with Cassandra. Where the legacy thrift API exposes the internal storage structure of Cassandra pretty much directly, CQL3 provides a thin abstraction layer over this internal structure. This is A Good Thing as it allows hiding from the API a number of distracting and useless implementation details (such as range ghosts) and allows to provide native syntaxes for common encodings/idioms (like the CQL3 collections as we’ll discuss below), instead of letting each client or client library reimplement them in their own, different and thus incompatible, way.
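As an illustration of the collections the quote mentions, here’s a small CQL3 sketch (table and column names are made up for the example):

```sql
-- A users table using the three CQL3 collection types
CREATE TABLE users (
    user_id       text PRIMARY KEY,
    name          text,
    emails        set<text>,
    recent_logins list<timestamp>,
    preferences   map<text, text>
);

-- Collections support targeted updates, instead of each client
-- re-encoding the whole structure its own way
UPDATE users
SET emails = emails + {'alex@example.com'},
    preferences['theme'] = 'dark'
WHERE user_id = 'alex';
```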

CQL seems to be Cassandra’s solution for addressing its sometimes confusing or complex data model. I also think CQL is an attempt to bring Cassandra closer to SQL-based tools, which might enable more integrations in the future.

Original title and link: Cassandra Query Language CQL3 Explained (NoSQL database©myNoSQL)


YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak

Put together by the team at Altoros Systems Inc., this time run on Amazon EC2 and covering Cassandra, HBase, MongoDB, MySQL Cluster, sharded MySQL, and Riak:

After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.

Teaser: HBase got the best results in most of the benchmarks (though with flush turned off). And I’m not sure the setup included the latest HBase read improvements from Facebook.

Original title and link: YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak (NoSQL database©myNoSQL)


Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra

A three part article from Hortonworks showing how Pig can be used with MongoDB, HBase, and Cassandra:

Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.

Original title and link: Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra (NoSQL database©myNoSQL)

$25 Million in C Round for DataStax

I’d say that raising another $25 million from Meritech Capital Partners, with participation from existing investors Lightspeed Venture Partners and Crosslink Capital, is a good enough reason for DataStax to party.

DataStax will use the funds to further enhance its Big Data platform and increase the value for current customers while driving global customer acquisition.

Congrats to DataStax and Cassandra community!

Original title and link: $25 Million in C Round for DataStax (NoSQL database©myNoSQL)

Doing Redundant Work to Speed Up Distributed Queries

Great post by Peter Bailis looking at how some systems are reducing tail latency by distributing reads across nodes:

Open-source Dynamo-style stores have different answers. Apache Cassandra originally sent reads to all replicas, but CASSANDRA-930 and CASSANDRA-982 changed this: one commenter argued that “in IO overloaded situations” it was better to send read requests only to the minimum number of replicas. By default, Cassandra now sends reads to the minimum number of replicas 90% of the time and to all replicas 10% of the time, primarily for consistency purposes. (Surprisingly, the relevant JIRA issues don’t even mention the latency impact.) LinkedIn’s Voldemort also uses a send-to-minimum strategy (and has evidently done so since it was open-sourced). In contrast, Basho Riak chooses the “true” Dynamo-style send-to-all read policy.
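A quick way to see why send-to-all helps tail latency: simulate per-replica response times and compare waiting on one replica picked up front vs. taking the fastest answer out of all N. This is a toy model (the latency distribution and outlier rate are invented), and it deliberately ignores the extra load that send-to-all generates, which is exactly the “IO overloaded” concern from the Cassandra tickets:

```python
import random

random.seed(42)

def replica_latency() -> float:
    # Toy model: mostly fast responses with an occasional slow outlier
    # (GC pause, compaction, overloaded node, ...)
    base = random.expovariate(1 / 5.0)   # ~5 ms typical
    return base + (100.0 if random.random() < 0.05 else 0.0)

N, TRIALS = 3, 10_000
send_to_one = [replica_latency() for _ in range(TRIALS)]
send_to_all = [min(replica_latency() for _ in range(N)) for _ in range(TRIALS)]

def p99(xs):
    return sorted(xs)[int(0.99 * len(xs))]

print(f"p99, one replica : {p99(send_to_one):7.1f} ms")
print(f"p99, all replicas: {p99(send_to_all):7.1f} ms")
```

With a 5% outlier rate per replica, a single read hits an outlier at roughly the 99th percentile, while the fastest-of-three only stalls when all three replicas are slow at once, so its p99 stays close to the baseline.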

Original title and link: Doing Redundant Work to Speed Up Distributed Queries (NoSQL database©myNoSQL)


Reddit’s Database Has Two Tables

Considering the fast evolution of NoSQL databases, the topic is now very old (from 2010). But read the comments on the original post, Hacker News, and Reddit to see what people think today about extreme denormalization, schemas, relational and NoSQL databases.

Original title and link: Reddit’s Database Has Two Tables (NoSQL database©myNoSQL)


Latency-Consistency Analysis

A very interesting proposal and patch for enhancing nodetool to provide cluster latency-consistency analysis. From JIRA:

We’ve implemented Probabilistically Bounded Staleness, a new technique for predicting consistency-latency trade-offs within Cassandra. Our paper will appear in VLDB 2012, and, in it, we’ve used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than profiling and manually testing each possible replication configuration (especially in production!).

This analysis is important for the many users we’ve talked to and heard about who use “partial quorum” operation (e.g., non-QUORUM ConsistencyLevel). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there’s no existing way to answer these questions.

Original title and link: Latency-Consistency Analysis (NoSQL database©myNoSQL)