Presentation: Gary Dusbabek (Rackspace) on Cassandra
A presentation about Cassandra given by Rackspace’s Gary Dusbabek (@gdusbabek):
My notes:
What problems does it solve?
- Reliability at scale
- No single point of failure (all nodes are identical)
- Simple, linear scaling
- High write throughput
- Large data sets
What problems can’t it solve?
- No flexible indices
- No querying on non-PK values
- Not good for large binary data (>64 MB) unless you chunk it
- Row contents must fit in available memory
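The chunking workaround mentioned above could look roughly like this. It is a minimal sketch, not something from the slides: the helper names `chunk_blob` and `join_chunks`, the 1 MB default chunk size, and the `chunk-00000000` column naming are all assumptions made for illustration. The idea is to store each chunk under its own column name so that no single value gets too large.

```python
from typing import Dict


def chunk_blob(blob: bytes, chunk_size: int = 1024 * 1024) -> Dict[str, bytes]:
    """Split a blob into fixed-size chunks keyed by a zero-padded index.

    Zero-padding keeps the column names in the right order when they are
    sorted as plain strings.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return {
        f"chunk-{offset // chunk_size:08d}": blob[offset:offset + chunk_size]
        for offset in range(0, len(blob), chunk_size)
    }


def join_chunks(columns: Dict[str, bytes]) -> bytes:
    """Reassemble the original blob from its sorted chunk columns."""
    return b"".join(columns[name] for name in sorted(columns))
```

Each entry of the returned dictionary would be written as one column of the row that represents the blob.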
Concepts: CAP
- Cassandra chooses A and P but allows them to be tunable to have more C
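One common way to reason about "more C" is the quorum overlap rule: with N replicas, a read that contacts R of them is guaranteed to see the latest write acknowledged by W of them when R + W > N. The sketch below only illustrates that arithmetic; the function names are made up for this note and are not part of any Cassandra API.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of a replica set."""
    return replicas // 2 + 1


def read_sees_latest_write(replicas: int, read_count: int, write_count: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_count + write_count > replicas


n = 3
# Quorum reads and quorum writes overlap in at least one replica.
strong = read_sees_latest_write(n, quorum(n), quorum(n))
# Reading and writing a single replica favours availability and latency.
weak = read_sees_latest_write(n, 1, 1)
```

Choosing smaller R and W keeps the system available when replicas are down, at the cost of possibly reading stale data.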
Data Model
- Keyspace contains column families
- ColumnFamily:
- Standard or Super
- Two levels of indexes (key and column names)
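As a rough mental model (not how Cassandra stores anything on disk), the structure above can be pictured as nested dictionaries: a keyspace maps column family names to rows, and each row maps column names to values. A Super column family adds one more level of nesting. All names and values below are invented for illustration.

```python
# keyspace -> column family -> row key -> column name -> value
keyspace = {
    # Standard column family: row key -> {column name: value}
    "Users": {
        "alice": {"email": "alice@example.com", "name": "Alice"},
    },
    # Super column family: row key -> {super column: {subcolumn: value}}
    "Timeline": {
        "alice": {
            "2010-03": {"post-1": "hello", "post-2": "world"},
        },
    },
}


def get_column(ks, column_family, row_key, column_name):
    """Look up one value using the two index levels: row key, then column name."""
    return ks[column_family][row_key][column_name]


email = get_column(keyspace, "Users", "alice", "email")
march_posts = get_column(keyspace, "Timeline", "alice", "2010-03")
```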


Data Model
- Column and subcolumn sorting
- Specify your own comparator:
- TimeUUID
- Lexical UUID
- UTF8
- Bytes
- CreateYourOwn
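Why the comparator matters can be shown with two version-1 (time-based) UUIDs. The sketch below uses only Python's standard `uuid` module; `make_time_uuid` is a hypothetical helper written for this note, and sorting with a `key` function merely stands in for a comparator. Sorting the same two column names by raw bytes and by embedded timestamp gives opposite orders.

```python
import uuid


def make_time_uuid(timestamp: int, node: int = 1) -> uuid.UUID:
    """Build a version-1 UUID from a 60-bit timestamp (illustrative helper)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, node))


earlier = make_time_uuid(0xFFFFFFFF)
later = make_time_uuid(0x100000001)

# Bytes-style ordering compares the raw 16 bytes, which start with time_low.
by_bytes = sorted([earlier, later], key=lambda u: u.bytes)
# TimeUUID-style ordering compares the timestamp embedded in the UUID.
by_time = sorted([earlier, later], key=lambda u: u.time)
```

Here `by_bytes` puts `later` first because its low timestamp bits happen to be smaller, while `by_time` keeps the chronological order.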
Inserting: Writes
- Commit log for durability
- Memtable - no disk access (no reads or seeks)
- SSTables are final (become read-only)
- Index
- Bloom filter
- Raw data
- Atomic within a ColumnFamily
- Bottom line: FAST!!
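The write path listed above can be sketched as a toy in-memory model. This is a simplification under stated assumptions, not Cassandra's implementation: `ToyNode` and `flush_threshold` are invented names, the "commit log" is just a list, and an SSTable is represented only by its raw data (no index or Bloom filter). It shows the order of operations: append to the log, update the memtable, and flush to an immutable table once the memtable is full.

```python
from types import MappingProxyType


class ToyNode:
    """Toy model of the write path: commit log, memtable, immutable SSTables."""

    def __init__(self, flush_threshold: int = 2):
        self.commit_log = []   # append-only record of writes, for durability
        self.memtable = {}     # in-memory rows: key -> {column: value}
        self.sstables = []     # flushed, read-only tables (oldest first)
        self.flush_threshold = flush_threshold

    def write(self, key, column, value):
        # 1. Append to the commit log first so the write survives a crash.
        self.commit_log.append((key, column, value))
        # 2. Update the memtable; no reads or seeks are needed.
        self.memtable.setdefault(key, {})[column] = value
        # 3. Flush once the memtable holds enough rows.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable, sorted by key, into a read-only mapping.
        frozen = MappingProxyType(
            {key: dict(sorted(cols.items()))
             for key, cols in sorted(self.memtable.items())}
        )
        self.sstables.append(frozen)
        self.memtable = {}
        # The logged writes are now covered by the flushed table.
        self.commit_log.clear()
```

Because a write never has to read existing data, its cost is one sequential append plus one in-memory update, which is the reason writes are fast.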
Note: make sure to check the slide for a nice visual description of the Cassandra write operation. You should also check "Cassandra Write operation performance explained" for more details.
Querying: Overview
- But secondary indices are being worked on (see ☞ CASSANDRA-749)
Querying: Reads
- Not as fast as writes
- Read repair when out of sync
- New in 0.6:
- Row cache (avoid SSTable lookup)
- Key cache (avoid index scan)
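To illustrate why reads cost more than writes, here is a toy read path. It is a sketch under assumptions, not Cassandra's code: the `BloomFilter`, `SSTable`, and `read` names are invented, the key cache is left out, and columns are merged "newest table wins" rather than by timestamp. A read may have to consult several SSTables and merge the results; the Bloom filter lets it skip tables that cannot contain the key, and the row cache skips the SSTable lookups entirely.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size: int = 64, hashes: int = 3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Read-only table with a Bloom filter over its row keys."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)
        self.lookups = 0  # counts how often the data is actually consulted

    def get(self, key):
        self.lookups += 1
        return self.rows.get(key)


def read(key, row_cache, memtable, sstables):
    """Return the merged columns for key, or None if the row is unknown."""
    if key in row_cache:                      # row cache hit: no SSTable lookup
        return row_cache[key]
    merged = {}
    for table in sstables:                    # oldest first, newer values win
        if table.bloom.might_contain(key):    # skip tables that cannot match
            merged.update(table.get(key) or {})
    merged.update(memtable.get(key, {}))      # memtable holds the newest data
    if merged:
        row_cache[key] = merged
    return merged or None
```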
Note: make sure you check the slide for a visual description of the Cassandra read operation. You can also read "Cassandra Reads performance explained" for more details.
Future Direction
- Range delete (delete these cols from those keys)
- Vector clocks (including server-side conflict resolution)
- Altering keyspace/column family definitions on a live cluster
- Byte[] keys
- Compression
- Multi-tenant support
- Fewer memory restrictions