Presentation: Gary Dusbabek (Rackspace) on Cassandra

A presentation about Cassandra given by Rackspace's Gary Dusbabek (@gdusbabek).

My notes:

What problems does it solve?

  • Reliability at scale
    • No single point of failure (all nodes are identical)
  • Simple scaling
    • linear
  • High write throughput
  • Large data sets

What problems can’t it solve?

  • No flexible indices
  • No querying on non-primary-key (PK) values
  • Not good for large binary data (>64 MB) unless you chunk it
  • Row contents must fit in available memory

Concepts: CAP

  • Cassandra favors A (availability) and P (partition tolerance), but lets you tune the trade-off toward more C (consistency)
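
To make the tunable part concrete, here is a minimal sketch of the usual quorum arithmetic: with N replicas, a read of R replicas is guaranteed to overlap a write of W replicas when R + W > N. The function names and the replication factor of 3 are assumptions for illustration, not taken from the slides; ONE, QUORUM and ALL are Cassandra consistency levels.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True when any r replicas read must overlap any w replicas written."""
    return r + w > n


N = 3  # replication factor, assumed for this example
LEVELS = {"ONE": 1, "QUORUM": quorum(N), "ALL": N}

# Print which read/write level combinations guarantee overlapping replicas.
for w_name, w in LEVELS.items():
    for r_name, r in LEVELS.items():
        overlap = reads_see_latest_write(N, r, w)
        verdict = "overlapping (more C)" if overlap else "eventual (more A)"
        print(f"write={w_name:<6} read={r_name:<6} -> {verdict}")
```

Lower levels keep requests succeeding when replicas are down, while higher levels give up some of that availability in exchange for reads that are guaranteed to touch a replica holding the latest write.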

Data Model

  • Keyspace contains column families
  • ColumnFamily:
    • Standard or Super
    • Two levels of indexes (key and column names)
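
One way to picture this model is as nested maps. The sketch below is only an analogy built from plain Python dicts: the keyspace, column family, row and column names are all made up for illustration, and get_column is a hypothetical helper, not a Cassandra API.

```python
# keyspace -> column family -> row key -> column name -> value
keyspace = {
    "Users": {  # standard column family: row key -> columns
        "gdusbabek": {"name": "Gary Dusbabek", "company": "Rackspace"},
    },
    "Timeline": {  # super column family: row key -> super column -> subcolumns
        "gdusbabek": {
            "2010-03": {"talk": "Cassandra overview"},
        },
    },
}


def get_column(keyspace, column_family, row_key, column_name, super_column=None):
    """Look a value up by row key first, then by column name."""
    row = keyspace[column_family][row_key]   # first level: the row key
    if super_column is not None:
        row = row[super_column]              # extra level in a super column family
    return row[column_name]                  # second level: the column name


print(get_column(keyspace, "Users", "gdusbabek", "company"))
print(get_column(keyspace, "Timeline", "gdusbabek", "talk", super_column="2010-03"))
```

The two lookups in get_column mirror the two levels of indexes: one on the row key and one on the column names inside the row.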

Data Model (continued)

  • Column and subcolumn sorting
  • Specify your own comparator:
    • TimeUUID
    • LexicalUUID
    • UTF8
    • Bytes
    • CreateYourOwn
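
The comparator decides the order in which columns come back within a row. The sketch below shows how the same two UUID column names sort differently under a time-based and a lexical ordering. It is a simplification under stated assumptions: make_time_uuid is a hypothetical helper, "lexical" is approximated by plain byte order, and NewestFirst stands in for a comparator you would write yourself.

```python
import uuid


def make_time_uuid(timestamp: int, node: int = 0x1234) -> uuid.UUID:
    """Build a version-1 UUID from a raw 60-bit timestamp (hypothetical helper)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # set version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, node))


# Each comparator is modelled as a sort key over the column name.
COMPARATORS = {
    "TimeUUID": lambda name: name.time,       # order by embedded timestamp
    "LexicalUUID": lambda name: name.bytes,   # order by raw bytes (simplified)
    "NewestFirst": lambda name: -name.time,   # a "create your own" example
}


def sorted_columns(columns: dict, comparator: str) -> list:
    """Return (name, value) pairs in the order the chosen comparator defines."""
    key = COMPARATORS[comparator]
    return sorted(columns.items(), key=lambda item: key(item[0]))


earlier = make_time_uuid(0x0FFFFFFFF)
later = make_time_uuid(0x100000000)
columns = {later: "second event", earlier: "first event"}

for name in COMPARATORS:
    print(name, [value for _, value in sorted_columns(columns, name)])
```

The two timestamps are chosen so that the earlier UUID has the larger leading bytes, which is why a byte-wise ordering disagrees with the chronological one.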

Inserting: Writes

  • Commit log for durability
  • Memtable: in memory, so no disk access (no reads or seeks)
  • SSTables are final (read-only once written)
    • Index
    • Bloom filter
    • Raw data
  • Atomic within a ColumnFamily
  • Bottom line: FAST!!
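
The write steps above can be sketched as a toy, single-node store. This is an illustration under stated assumptions, not Cassandra's implementation: ToyStore, its memtable_limit parameter and the frozenset used in place of a real Bloom filter are all made up, and the commit log is a list instead of an append-only file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    """Immutable result of a memtable flush: index, Bloom filter, raw data."""
    index: tuple      # sorted row keys
    bloom: frozenset  # stand-in for a Bloom filter (exact membership here)
    data: tuple       # per-row tuples of (column, value), in index order


class ToyStore:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # stand-in for an append-only file on disk
        self.memtable = {}     # in-memory rows: row key -> {column: value}
        self.sstables = []     # flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, row_key, column, value):
        # 1. Append to the commit log so the write survives a crash.
        self.commit_log.append((row_key, column, value))
        # 2. Update the memtable; nothing on disk is read or sought.
        self.memtable.setdefault(row_key, {})[column] = value
        # 3. Flush once the memtable holds enough rows.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        keys = tuple(sorted(self.memtable))
        data = tuple(tuple(sorted(self.memtable[k].items())) for k in keys)
        self.sstables.append(SSTable(index=keys, bloom=frozenset(keys), data=data))
        # The flushed writes are now durable in the SSTable, so start afresh.
        self.memtable = {}
        self.commit_log.clear()


store = ToyStore(memtable_limit=2)
store.write("alice", "email", "alice@example.com")
store.write("bob", "email", "bob@example.com")    # triggers a flush
store.write("carol", "email", "carol@example.com")
print(len(store.sstables), list(store.memtable))
```

A write only appends to the log and touches memory, which is where the speed comes from. The sketch does not model atomicity within a ColumnFamily or replication.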

Note: make sure to check the slide for a nice visual description of the Cassandra write operation. See also "Cassandra Write operation performance explained" for more details.

Querying: Overview

Querying: Reads

  • Not as fast as writes
  • Read repair when out of sync
  • New in 0.6:
    • Row cache (avoid sstable lookup)
    • Key cache (avoid index scan)
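
A rough sketch of why reads cost more than writes, and what the two caches save. Everything here is a simplified assumption for illustration: the class names, the tiny Bloom filter, the index_scans counter, and the absence of cache invalidation on writes are not taken from the slides or from Cassandra's code. Read repair is not modelled.

```python
import bisect
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size: int = 64, hashes: int = 3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(key))


class SSTable:
    def __init__(self, rows: dict):
        self.keys = sorted(rows)                   # index
        self.data = [rows[k] for k in self.keys]   # raw data, in index order
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)
        self.index_scans = 0                       # counts index lookups

    def position_of(self, key: str):
        self.index_scans += 1
        i = bisect.bisect_left(self.keys, key)
        return i if i < len(self.keys) and self.keys[i] == key else None


class Reader:
    def __init__(self, memtable: dict, sstables: list):
        self.memtable, self.sstables = memtable, sstables  # sstables oldest first
        self.row_cache = {}   # row key -> merged row (skips SSTables entirely)
        self.key_cache = {}   # (table number, row key) -> position (skips index)

    def read(self, key: str) -> dict:
        if key in self.row_cache:
            return self.row_cache[key]
        merged = {}
        for n, table in enumerate(self.sstables):
            if not table.bloom.might_contain(key):
                continue                           # definitely not in this table
            pos = self.key_cache.get((n, key))
            if pos is None:
                pos = table.position_of(key)
                if pos is None:
                    continue                       # Bloom filter false positive
                self.key_cache[(n, key)] = pos
            merged.update(table.data[pos])         # newer tables overwrite older
        merged.update(self.memtable.get(key, {}))  # memtable is newest of all
        self.row_cache[key] = merged
        return merged


old = SSTable({"a": {"x": 1}})
new = SSTable({"a": {"x": 2, "y": 3}})
reader = Reader({"a": {"z": 4}}, [old, new])
print(reader.read("a"))
```

A single read may have to consult several SSTables and merge the results, whereas a write touches only the log and the memtable. The row cache short-circuits the whole loop, and the key cache skips only the index lookup.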

Note: make sure to check the slide for a visual description of the Cassandra read operation. See also "Cassandra Reads performance explained" for more details.

Future Direction

  • Range delete (delete these cols from those keys)
  • Vector clocks (including server-side conflict resolution)
  • Altering keyspace/column family definitions on a live cluster
  • Byte[] keys
  • Compression
  • Multi-tenant support
  • Fewer memory restrictions