tutorial: All content tagged as tutorial in NoSQL databases and polyglot persistence
Nice addition to the getting started with Cassandra tutorial:
A very informative presentation by Benjamin Black on Cassandra indexing:
There is a lot to learn from these slides. Benjamin briefly introduces the main Cassandra terms (if you are not familiar with them, you can read more in this Cassandra tutorial) and then explains how column sorting and partitioning strategies should be used. Some really quotable fragments from the deck:
Relational stores are schema oriented. Start from your schema & work forwards
Column stores are query oriented. Start from your queries & work backwards
Cassandra is an index construction kit
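To make the "start from your queries and work backwards" idea concrete, here is a minimal sketch in plain Python. It is not a Cassandra client and does not come from the slides. It assumes a hypothetical query, "list the users in a given city", and builds a second structure whose row keys are the value the query looks up. The names `users`, `users_by_city`, and `build_index` are illustrative only.

```python
from collections import defaultdict

# Primary data: row key -> {column name: value}.
users = {
    "alice": {"city": "Paris", "email": "alice@example.com"},
    "bob": {"city": "Berlin", "email": "bob@example.com"},
    "carol": {"city": "Paris", "email": "carol@example.com"},
}


def build_index(rows, column):
    """Build an index whose row keys are the values found in `column`.

    Each index row maps the original row key to an empty value, so the
    column names themselves are the answer to the query.
    """
    index = defaultdict(dict)
    for row_key, columns in rows.items():
        if column in columns:
            index[columns[column]][row_key] = ""
    return dict(index)


# The query "users in Paris" becomes a single row lookup in the index.
users_by_city = build_index(users, "city")
paris_users = sorted(users_by_city["Paris"])
```

The design choice here is that the index is just another set of rows maintained by the application, which is one way to read "Cassandra is an index construction kit".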
Based on Ronald Mathies’ intro articles to Cassandra and a few other resources I’ve been gathering, I thought I should put together a detailed guide to getting started with Cassandra. As one would expect, the ☞ first post briefly introduces Cassandra and covers the distribution details and installation steps. Note that Windows may not be the best environment for installing Cassandra. If after the brief intro you’d like to see more details, you should check Gary Dusbabek’s presentation on Cassandra or watch Eric Evans’ Cassandra presentation at FOSDEM.
The ☞ second article focuses on the Cassandra data model. If you are not familiar with it, this is the part you’ll want to read closely.
A column is also referred to as a tuple (triplet) that contains a name, value and a timestamp. This is the smallest data container there is.
A SuperColumn is a tuple with a name and a value; it doesn’t have a timestamp like the Column tuple. Notice that the value in this case is not a binary value but more of a Map-style container. The map contains key/column combinations. What is important here is that the key has the same value as the name of the Column it refers to. So, to put it simply, a SuperColumn is a container for one or more Columns. You will see that it also makes a big difference later on when we discuss the ColumnFamily and SuperColumnFamily.
ColumnFamily is a structure that can keep an infinite number of rows; for most people with an RDBMS background, this is the structure that most resembles a Table. When you look at the diagram you can see that a ColumnFamily has a name (comparable to the name of a Table) and a map with a key (comparable to a row identifier) and a value (which is a Map containing Columns). The map with the columns follows the same rule as the SuperColumn: the key has the same value as the name of the Column it refers to.
Finally we have the largest container, the SuperColumnFamily. If you understand the ColumnFamily then this construction isn’t much harder: instead of having Columns in the innermost Map we have SuperColumns, so it just adds an extra dimension. As displayed in the image, the key of the Map which contains the SuperColumns must be the same as the name of the SuperColumn (just like with the ColumnFamily).
Keyspaces are quite simple again: from an RDBMS point of view you can compare this to your schema, and normally you have one per application. A keyspace contains the ColumnFamilies. Note, however, that there is no relationship between the ColumnFamilies; they are just separate containers.
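The containers described above nest inside each other, which is easier to see in code. The following is a minimal in-memory sketch using Python dataclasses, not Cassandra's actual implementation or client API. The class names mirror the terms in the text; the `insert` helpers and the sample names (`Users`, `MyApp`, `alice`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Column:
    """Smallest container: a name, a value and a timestamp."""
    name: str
    value: bytes
    timestamp: int


@dataclass
class SuperColumn:
    """A name plus a map of Columns; it has no timestamp of its own."""
    name: str
    columns: Dict[str, Column] = field(default_factory=dict)

    def insert(self, column: Column) -> None:
        # The map key is always the name of the Column it refers to.
        self.columns[column.name] = column


@dataclass
class ColumnFamily:
    """Row key -> map of Columns (roughly a table)."""
    name: str
    rows: Dict[str, Dict[str, Column]] = field(default_factory=dict)

    def insert(self, row_key: str, column: Column) -> None:
        self.rows.setdefault(row_key, {})[column.name] = column


@dataclass
class SuperColumnFamily:
    """Row key -> map of SuperColumns (one extra dimension)."""
    name: str
    rows: Dict[str, Dict[str, SuperColumn]] = field(default_factory=dict)

    def insert(self, row_key: str, super_column: SuperColumn) -> None:
        self.rows.setdefault(row_key, {})[super_column.name] = super_column


@dataclass
class Keyspace:
    """Holds unrelated column families, usually one keyspace per application."""
    name: str
    column_families: Dict[str, object] = field(default_factory=dict)


users = ColumnFamily("Users")
users.insert("alice", Column("email", b"alice@example.com", 1))

address = SuperColumn("home")
address.insert(Column("city", b"Paris", 1))
addresses = SuperColumnFamily("Addresses")
addresses.insert("alice", address)

keyspace = Keyspace("MyApp", {"Users": users, "Addresses": addresses})
```

Reading a value then means walking the maps in order: keyspace, column family, row key, (super column,) column name.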
Probably the best explanation of the Cassandra data model can be found in Arin Sarkissian’s ☞ WTF is a SuperColumn?. There are other recommended resources about Cassandra, and Jonathan Ellis, the Cassandra project chair, has a suggested Cassandra reading list.
The ☞ third article in the series focuses on Cassandra sorting capabilities:
By default Cassandra sorts the data as soon as you store it in the database and it remains sorted. This gives you an enormous performance boost, however you need to think before you start storing data.
Sorting can be specified on the ColumnFamily CompareWith attribute, these are the options you can choose from (it is possible to create custom sorting behavior but we will cover that later):
There is also a way to define your own custom Cassandra sorting types, described in this ☞ post.
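Sorting at write time and choosing how column names compare can be illustrated with a small sketch. This is plain Python, not Cassandra code: the `SortedRow` class and its `sort_key` parameter are hypothetical stand-ins for a row whose comparison rule is fixed when the column family is defined. It shows why you need to think before storing data: the same column names come back in a different order depending on the rule.

```python
import bisect


class SortedRow:
    """Keeps column names in sorted order at insert time."""

    def __init__(self, sort_key):
        self._sort_key = sort_key  # how column names are compared
        self._keys = []            # sort keys, always ascending
        self._names = []           # column names, parallel to _keys
        self._values = {}

    def insert(self, name, value):
        if name in self._values:
            # Existing column: overwrite the value, position is unchanged.
            self._values[name] = value
            return
        key = self._sort_key(name)
        pos = bisect.bisect_left(self._keys, key)
        self._keys.insert(pos, key)
        self._names.insert(pos, name)
        self._values[name] = value

    def names(self):
        # Already sorted, so reads do no sorting work.
        return list(self._names)


as_text = SortedRow(sort_key=str)    # compare names as text
as_number = SortedRow(sort_key=int)  # compare names as integers
for column_name in ["10", "9", "100"]:
    as_text.insert(column_name, b"")
    as_number.insert(column_name, b"")
```

With text comparison the names come back as `10`, `100`, `9`; with integer comparison they come back as `9`, `10`, `100`. A custom sorting type plays the role of a custom `sort_key` in this sketch.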
By now you should be ready to start using Cassandra, and this is exactly the subject of ☞ part 4 and ☞ part 5 of the series, which cover the Thrift Cassandra client. Understanding how writes and reads are performed might be useful, so you should check Cassandra write operation and Cassandra read operation, which also talk about the performance of these operations.
While initially you might not have enough data to need to decide how to partition a Cassandra cluster, once you get to that point I’m pretty sure you’ll appreciate some more details on Cassandra partitioning strategies.
Last but not least, here is a list of known Cassandra use cases that might give you a good idea of where Cassandra will fit in your next app. After that you should be absolutely ready to experiment with Cassandra.
This is the longest NoSQL presentation I’ve ever posted here: 209 slides! If you’re planning to beat Kevin Smith’s (@kevsmith) record please do let me know in advance so I can reserve enough time to go through it.
My notes below:
What is Riak?
- A flexible storage engine…
- … with a REST API …
- … and map/reduce capability …
- … designed to be fault-tolerant …
- … distributed …
- … and ops friendly
The Riak Way for CAP
- Pick Two
- For each operation
Riak Improvements on Amazon Dynamo N, R, W
- N can vary per bucket
- R and W can vary per operation
- Choose your own fault tolerance/performance tradeoff
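The tradeoff behind N, R and W can be sketched with a little arithmetic. The code below is a simplified Dynamo-style model, not Riak's API: it assumes N replicas per key, a read that waits for R replies and a write that waits for W acknowledgements, and it ignores details such as hinted handoff. The function names are made up for this example.

```python
def read_sees_latest_write(n, r, w):
    """True when every read quorum overlaps every write quorum (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(n, quorum):
    """Replicas that can be unavailable while the operation still succeeds."""
    return n - quorum


# One bucket with N = 3, and two operations that pick different R and W.
n = 3
fast = {"r": 1, "w": 1}  # lowest latency, reads may return stale data
safe = {"r": 2, "w": 2}  # quorums overlap, one replica may be down

fast_is_consistent = read_sees_latest_write(n, fast["r"], fast["w"])
safe_is_consistent = read_sees_latest_write(n, safe["r"], safe["w"])
```

Because R and W are chosen per operation, the same bucket can serve a latency-sensitive read with `r=1` and an important write with `w=2`.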
Conflict resolution: Client Resolution
- Can be set per-bucket or server-wide
- Conflicting data is “bubbled up” to the client
- Client picks the winner
Conflict resolution: Server Resolution
- “Last write wins”
- Enabled by default
- What most apps need 80% of the time
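The two resolution modes in these notes can be contrasted in a short sketch. This is plain Python under simplifying assumptions, not Riak's client API: conflicting versions are modelled as a list of `Version` objects carrying a timestamp, and `last_write_wins`, `client_resolve`, and the shopping-cart values are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: int


def last_write_wins(siblings: List[Version]) -> Version:
    """Server-side resolution: keep the version with the newest timestamp."""
    return max(siblings, key=lambda v: v.timestamp)


def client_resolve(siblings: List[Version],
                   pick: Callable[[List[Version]], Version]) -> Version:
    """Client-side resolution: all conflicting versions are handed to the
    caller, whose `pick` function decides the winner."""
    return pick(siblings)


siblings = [
    Version("cart: apple", timestamp=100),
    Version("cart: apple, pear", timestamp=90),
]

server_choice = last_write_wins(siblings)
# A client with domain knowledge might prefer the fuller cart instead.
client_choice = client_resolve(
    siblings, lambda versions: max(versions, key=lambda v: len(v.value))
)
```

Here last write wins keeps the newer but smaller cart, while the client-supplied rule keeps the older cart with more items, which is the kind of case where bubbling conflicts up to the client pays off.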
The presentation also covers:
- Linking objects (slide 78)
- Map/Reduce (slide 99)