Cassandra Reads Performance Explained

After explaining Cassandra write performance, Mike Perham ☞ continues his series, now explaining “reads and […] why they are slow”.

So what happens with a Cassandra read?

  • a client makes a read request to a random node
  • the node acts as a proxy, determining which nodes hold copies of the data
  • the node requests the corresponding data from each of those nodes
  • the client can select the strength of the read consistency:

    • single read => the request returns once it gets the first response, but data can be stale
    • quorum read => the request returns only after the majority responded with the same value

      Mike mentions a couple of corner cases that make this behavior more complicated.

  • the node also performs read repair of any inconsistent response
  • each node reading data uses either Memtable (in-memory) or SSTables (disk)

    Mike and Jonathan provide a very detailed explanation of the read performance:

    Mike: To scan the SSTable, Cassandra uses a row-level column index and bloom filter to find the necessary blocks on disk, deserializes them and determines the actual data to return. There’s a lot of disk IO here which ultimately makes the read latency higher than a similar DBMS.

    Jonathan: The reason uncached reads are slower in Cassandra is not because the SSTable is inherently io-intensive (it’s actually better than b-tree based storage on a 1:1 basis) but because in the average case you’ll have to merge row fragments from 2-4 SSTables to complete the request, since SSTables are not update-in-place.

    It is also important to note that Cassandra employs row caching to reduce read latency.
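
Jonathan's point about merging row fragments can be illustrated with a small sketch. It is not Cassandra's actual code: the `Cell`, `SSTable` and `read_row` names are hypothetical, the "bloom filter" is an exact key set standing in for a probabilistic one, and the rule that the newest timestamp wins per column is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # write time; the newest timestamp wins on merge


class SSTable:
    """Hypothetical immutable table: row key -> {column name: Cell}."""

    def __init__(self, rows):
        self._rows = dict(rows)
        # Stand-in for a bloom filter. A real one is probabilistic: it can
        # report false positives but never false negatives.
        self._keys = frozenset(self._rows)

    def might_contain(self, key):
        return key in self._keys

    def read(self, key):
        return self._rows.get(key, {})


def read_row(key, memtable, sstables):
    """Merge every fragment of a row; return (columns, sstables_touched)."""
    fragments = [memtable.get(key, {})]
    touched = 0
    for table in sstables:
        if table.might_contain(key):  # skip tables that cannot hold the row
            fragments.append(table.read(key))
            touched += 1

    merged = {}
    for fragment in fragments:
        for column, cell in fragment.items():
            current = merged.get(column)
            if current is None or cell.timestamp > current.timestamp:
                merged[column] = cell
    return {column: cell.value for column, cell in merged.items()}, touched


# The row "user:1" was written at three different times, so its columns are
# spread over the memtable and two SSTables (nothing is updated in place).
old = SSTable({"user:1": {"name": Cell("Ann", 1), "city": Cell("Oslo", 1)}})
new = SSTable({"user:1": {"city": Cell("Rome", 5)}})
other = SSTable({"user:2": {"name": Cell("Bob", 2)}})
memtable = {"user:1": {"email": Cell("ann@example.com", 9)}}

row, touched = read_row("user:1", memtable, [old, new, other])
print(row, touched)
```

In this toy run the read has to consult two of the three SSTables to assemble one row, which is the extra work an uncached read pays for; a row cache would serve the already merged row and skip it.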

Mike’s post also covers Cassandra range scans and explains the role of Cassandra partitioning strategies. ☞ Great read!
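
To tie the read steps listed earlier to the partitioning mentioned here, the following is a rough sketch of a coordinator read. Everything in it is illustrative: `Node`, `replicas_for`, `coordinator_read` and `seed` are made-up names, the MD5 ring only loosely imitates a random-partitioner style placement, replicas are queried sequentially instead of in parallel, and conflicts are resolved by newest timestamp, which is an assumption of this sketch and not something taken from Mike's post.

```python
import hashlib
from bisect import bisect_right


def token_of(text):
    """Map a string onto the ring (MD5 is used only as a convenient hash)."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


class Node:
    """Hypothetical replica storing key -> (value, timestamp)."""

    def __init__(self, name):
        self.name = name
        self.token = token_of(name)
        self.data = {}

    def read(self, key):
        return self.data.get(key, (None, -1))  # -1 marks "no data"


def replicas_for(key, ring, replication_factor=3):
    """Walk the ring clockwise from the key's token and pick the replicas."""
    ordered = sorted(ring, key=lambda node: node.token)
    tokens = [node.token for node in ordered]
    start = bisect_right(tokens, token_of(key)) % len(ordered)
    count = min(replication_factor, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(count)]


def coordinator_read(key, ring, consistency="QUORUM"):
    replicas = replicas_for(key, ring)
    needed = 1 if consistency == "ONE" else len(replicas) // 2 + 1
    # The "first" responders are simply the first replicas in ring order.
    responses = [(node, node.read(key)) for node in replicas]
    value, _ = max((r for _, r in responses[:needed]), key=lambda r: r[1])
    # Read repair: push the newest known version to any stale replica.
    newest = max((r for _, r in responses), key=lambda r: r[1])
    for node, (_, timestamp) in responses:
        if timestamp < newest[1]:
            node.data[key] = newest
    return value


def seed(key, ring):
    """Leave the first replica stale and the other two up to date."""
    stale, fresh_a, fresh_b = replicas_for(key, ring)
    stale.data[key] = ("old", 1)
    fresh_a.data[key] = ("new", 2)
    fresh_b.data[key] = ("new", 2)


ring = [Node(f"node{i}") for i in range(5)]

seed("user:1", ring)
quorum_value = coordinator_read("user:1", ring, "QUORUM")

seed("user:2", ring)
one_value = coordinator_read("user:2", ring, "ONE")       # may be stale
repaired_value = coordinator_read("user:2", ring, "ONE")  # after read repair
print(quorum_value, one_value, repaired_value)
```

The example is arranged so that the single read hits the stale replica first and returns the old value, while the quorum read overlaps with at least one up-to-date replica. The second single read returns the new value only because the first one triggered the repair.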