cassandra: All content on NoSQL databases and projects about cassandra, featuring the best daily NoSQL articles, news, and links on cassandra
Friday, 3 September 2010
Cassandra: Tuning Garbage Collection ☞
Mikio L. Braun shares a set of experiments he ran configuring the garbage collection for Cassandra:
In summary, a bit of garbage collection tuning can help to make Cassandra run in a stable manner. In particular, you should set the CMS thresholds a bit lower, and probably also experiment with incremental CMS if you have enough cores. Setting the CMS threshold to 75%, I got Cassandra to run well in 8GB without any GC induced glitches, which is a big progress from the previous post.
Jonathan Ellis recently mentioned a valuable resource for garbage collection tuning, a presentation by Tony Printezis, Charlie Hunt and Ludovic Poitou: “Garbage Collection Tuning in the Java HotSpot Virtual Machine” (nb: unfortunately the link is no longer available, but if you can find it somewhere make sure you get a copy). Also worth noting: the latest Cassandra release goes outside the VM and deals directly with the OS to address a combination of GC behavior and swapping.
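To make the settings in the quote above concrete, here is a minimal sketch that assembles the corresponding HotSpot flags. The 75% threshold, the 8GB heap and the optional incremental mode come from the quote; the helper name `cms_jvm_flags`, the choice to pin the heap with equal `-Xms`/`-Xmx`, and the addition of `-XX:+UseCMSInitiatingOccupancyOnly` (so the JVM sticks to the configured threshold) are assumptions made for illustration, not Mikio's exact configuration. These flags apply to JVMs that still ship the CMS collector.

```python
def cms_jvm_flags(heap_gb=8, cms_threshold_pct=75, incremental=False):
    """Build a list of HotSpot flags for a CMS-based setup (illustrative helper)."""
    if not 0 < cms_threshold_pct <= 100:
        raise ValueError("cms_threshold_pct must be in (0, 100]")
    flags = [
        f"-Xms{heap_gb}G",   # assumption: fixed-size heap
        f"-Xmx{heap_gb}G",
        "-XX:+UseConcMarkSweepGC",
        # Start a CMS cycle once the old generation is this full.
        f"-XX:CMSInitiatingOccupancyFraction={cms_threshold_pct}",
        # Assumption: always honor the threshold instead of JVM heuristics.
        "-XX:+UseCMSInitiatingOccupancyOnly",
    ]
    if incremental:
        # Incremental CMS, worth trying only with enough cores (per the quote).
        flags.append("-XX:+CMSIncrementalMode")
    return flags


print(" ".join(cms_jvm_flags()))
```

The resulting string would go wherever your installation passes options to the JVM; treat the numbers as a starting point for your own experiments.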
Original title and link for this post: Cassandra: Tuning Garbage Collection (published on the NoSQL blog: myNoSQL)
Wednesday, 1 September 2010
Cassandra 0.6.5: What is New? ☞
Jonathan Ellis details what’s new in Cassandra 0.6.5:
- Dynamic Snitch
- Use mlockall via JNA, if present, to prevent Linux from swapping out parts of the JVM
- Page within a single row during hinted handoff
- Faster UUIDType, LongType comparisons
- Log summary of dropped messages instead of spamming log
It’s interesting to hear that Cassandra had to step outside the VM to optimize its behavior, calling mlockall on OSes that support it. The dynamic snitch is also worth learning about:
Cassandra has always been good at dealing with cluster members who are all the way dead, thanks to its failure detector. The dynamic snitch lets us also handle members who are only mostly dead, that is, are still responding but with impaired performance.
[…]
The dynamic Snitch incorporates real-time request latency into its closeness metric, and routes requests to nodes that respond the fastest, no matter where they are actually located.
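The idea in the quote, ranking replicas by observed latency rather than by location, can be sketched in a few lines. This is not Cassandra's implementation: the class name `LatencyRanker`, the sliding window of samples, the plain average used as a score and the choice to try unmeasured nodes first are all assumptions for illustration.

```python
from collections import defaultdict, deque


class LatencyRanker:
    """Rank replicas by recently observed request latency (illustrative)."""

    def __init__(self, window=100):
        # Keep only the most recent `window` samples per node.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, node, latency_ms):
        self._samples[node].append(latency_ms)

    def score(self, node):
        samples = self._samples.get(node)
        if not samples:
            return 0.0  # assumption: nodes without samples are tried first
        return sum(samples) / len(samples)

    def order(self, replicas):
        # Lowest average latency first, regardless of where the node lives.
        return sorted(replicas, key=self.score)


ranker = LatencyRanker()
for ms in (2.0, 3.0, 2.5):
    ranker.record("node-a", ms)
for ms in (40.0, 55.0):
    ranker.record("node-b", ms)  # "mostly dead": still responding, but slowly
print(ranker.order(["node-b", "node-a"]))  # ['node-a', 'node-b']
```

A slow node is not excluded here, only pushed to the back of the list, which matches the "mostly dead" case: it still answers, so it remains a fallback.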
Friday, 20 August 2010
Cassandra Summit Through The Eyes of an HBase Emeriti Committer ☞
Bryan Duxbury summarizes the Cassandra Summit (videos and slides are available here):
Jonathan Ellis’s “state of the union” talk was interesting for a variety of reasons. The struggles they’re having with hinted handoff seems to be one of the classic symptoms of trying to build a project around someone else’s whitepaper – it’s good to hear that they’re starting to overcome the difficulties and actually get a good feature into play. I was also really pleased to see that we’d managed to take care of two out of three of Cassandra’s chief complaints about Thrift. (Cassandra’s near-switch to Avro has me nervous.)
Thursday, 19 August 2010
Riptano Publishes Videos and Slides from Cassandra Summit ☞
Riptano, the company offering services for Cassandra, has posted links to videos and slide decks from the Cassandra Summit: 8 videos and 9 slide decks from speakers like Jonathan Ellis (Riptano, Cassandra), Stu Hood (Rackspace), Gary Dusbabek (Rackspace), Kelvin Kakugawa (Digg), Noah Silas and John Watson (Mahalo). They probably represent the best-known Cassandra users.
I haven’t had the time to watch them myself, so please do let us know which ones are must-sees.
Thursday, 12 August 2010
Hector, Main Java Client for Cassandra Improves API ☞
Hector is probably the best-known and most widely used client for Cassandra. Now it is getting a new API focused on getting rid of Thrift details:
When writing the first version of hector the premise was that users are comfortable with the current level of the thrift API so hector should maintain an API similar in spirit. […] I was wrong. As it turns out, users don’t learn the thrift API and then go use hector. Most users tend to just skip the thrift API and start with hector. Fair enough. But then I’m asked why did I make such a funny API… They are right, users of hector should not suffer from the limitations of the thrift API. Add to that the complexity of dealing with failover, which clients need not care about at the API level (and in the v1 API they did) and some complex anonymous classes and the Command pattern users need to understand (if only we could have closures in java…) then we get a less than ideal API.
That sounds like a very sane process: launch a first version and see what the real users are saying.
Update: Riptano, the company offering support for Cassandra, has made available a PDF detailing the Hector API:
You can download it from ☞ here.
Friday, 6 August 2010
From Cassandra to Riak at inagist.com ☞
A couple of confusing things in this post:
The nice thing about Cassandra was the data model. Super columns allowed us to store metadata for a resource as needed. […] Concurrency issues were also not a bother. We could do simultaneous updates to columns and super columns and not worry about data consistency issues. […] When looking for alternatives Riak was our first choice primarily because of it being in Erlang and since it had a map-reduce option which looked seriously promising.
I don’t see any connection between these. Going from a granular data model supporting column-level operations to a key-value store with opaque values doesn’t really add up.
Of the back-ends available this has worked best for us giving a consistent performance along with being reasonable on the resource usage.
This seems to contradict what was said about Bitcask, the new default Riak storage, in the Innostore and Bitcask comparison.
Anyone able to clarify these? (nb: I’m not saying something is wrong, but I’d like to better understand the details). For now, the Mozilla story, Cassandra, HBase, Riak: Choosing the Right Solution, seems to be much better documented.
Update: Thanks to Jebu Ittiachen things are clearer now:
My issues with Cassandra and with Bitcask under Riak were with how they behaved in terms of their memory consumption. In the presence of ever increasing number of keys like the tweets which keep coming in both of them would eat up all the memory available on my servers. Cassandra I guess because of its per SSTable cache of keys and Bitcask because it maintains all keys in memory. This initially being the reason for me looking out for a different store than Cassandra. I should mention that in addition to tweets other data is also managed in Cassandra / Riak.
What I was trying to convey is how something that was easily modeled in Cassandra could still be mapped into Riak and possibly be to an advantage given the map-reduce infrastructure.
My preference of innostore over bitcask has purely been seeing how they behave in real use. Bitcask is definitely faster but high in memory usage on the servers. Innostore on the other hand is steady on the memory usage over time.
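To see why a store that "maintains all keys in memory" behaves the way Jebu describes, here is a generic sketch of that design: values go to an append-only file, while a dictionary in memory maps every key to where its latest value lives. It is an illustration only, not Bitcask's code or on-disk format; the class name `TinyKeydirStore` and its methods are hypothetical.

```python
import os
import tempfile


class TinyKeydirStore:
    """Append-only value log on disk plus an in-memory key index (illustrative)."""

    def __init__(self, path):
        self._file = open(path, "a+b")
        # key -> (offset, size). One entry per distinct key, always in memory.
        self._keydir = {}

    def put(self, key, value):
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(value)  # value must be bytes
        self._file.flush()
        self._keydir[key] = (offset, len(value))

    def get(self, key):
        # A single seek and read, which is what makes lookups fast.
        offset, size = self._keydir[key]
        self._file.seek(offset)
        return self._file.read(size)

    def key_count(self):
        return len(self._keydir)

    def close(self):
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "values.log")
store = TinyKeydirStore(path)
store.put("tweet:1", b"hello")
store.put("tweet:2", b"world!")
store.put("tweet:1", b"hello again")  # old bytes stay on disk, the index moves
print(store.get("tweet:1"), store.key_count())
store.close()
```

The trade-off is visible in the two data structures: reads are cheap, but the index grows with the number of distinct keys, so an ever-increasing key set such as incoming tweets keeps pushing memory usage up.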
Thursday, 5 August 2010
Cassandra has (Async) Triggers ☞
Similar to Riak post-commit hooks:
Like traditional database triggers, Cassandra Async trigger is a procedure that is automatically executed by the database in response to certain events on a particular database object (e.g. table or view). The distinguishing feature of Async trigger is that the database responds to the client on successful update execution without waiting for triggers to be executed, thus reducing response latency.
In case you wonder how to use them this might give you an idea.
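Separately from the linked example, the acknowledge-first behavior described in the quote can be sketched generically: the write is applied and the call returns, while the trigger runs on a background worker. This is not Cassandra's trigger API; `AsyncTriggerStore`, `on_update` and the single-worker pool are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class AsyncTriggerStore:
    """In-memory store that runs triggers after acknowledging the write."""

    def __init__(self):
        self._data = {}
        self._triggers = []
        self._pool = ThreadPoolExecutor(max_workers=1)

    def on_update(self, trigger):
        # Register a callable taking (key, value).
        self._triggers.append(trigger)

    def put(self, key, value):
        self._data[key] = value
        # Hand triggers to the pool and return without waiting for them.
        return [self._pool.submit(t, key, value) for t in self._triggers]

    def shutdown(self):
        self._pool.shutdown(wait=True)


audit_log = []
store = AsyncTriggerStore()
store.on_update(lambda key, value: audit_log.append((key, value)))
pending = store.put("user:1", "alice")  # returns before the trigger has to run
for future in pending:
    future.result()  # only the demo waits, to observe the trigger's effect
print(audit_log)  # [('user:1', 'alice')]
store.shutdown()
```

The flip side of the lower response latency is that a client cannot assume a trigger's side effects are visible as soon as its write is acknowledged.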
Wednesday, 28 July 2010
Canonical, Ubuntu and NoSQL ☞
Separately, sources close to Canonical have told The Reg that the company is in talks with Cassandra and CouchDB on NoSQL, and start-up PuppetLabs for data-center automation and provisioning.
[…]
Canonical is targeting Hadoop and NoSQL – used by hyperscale providers like Yahoo! and Facebook – believing ordinary businesses are now ready to start use them for data processing and analytics.
Bearing in mind that both Hadoop and Cassandra are meant to be used in distributed systems, I’m wondering what exactly Canonical will offer by including these in Ubuntu (note: the secret sauce may be Puppet).
Monday, 26 July 2010
Migrating from Cassandra to MongoDB or What Can be Learned Here? ☞
The story here is simple: they hit a crazy ☞ bug in Cassandra (remember all these tools are really young), they needed their data before a fix was available (keep in mind that some are using these NoSQL solutions in production), and so they migrated to MongoDB (while in production, you’ll do whatever it takes to minimize downtime):
At some point we started to have some stability issues with Cassandra. All nodes would go into an infinite loop, running GC and trying to compact the data files – occasionally falling off the cluster. We were unable to solve the problem, except that restarting and then compacting a node usually settled it down for a while. Other people had reported similar problems. Last couple of weeks our Cassandra nodes always ate all the resources they were given, slowing down Flowdock.
Anyways, I think there’s a lesson here: all NoSQL databases should start providing a data export tool.
Thursday, 22 July 2010
Presentation: Introduction to Cassandra
Nice addition to the getting started with Cassandra tutorial:
Wednesday, 21 July 2010
BigData: A Common Problem for Web 2.0 and Enterprise Worlds
Matt Pfeil, co-founder of Riptano, the company offering services for Cassandra:
The thing that enterprises have in common with the web 2.0 companies like Digg, Twitter, Reddit, etc is that it might be different types of data, but both have a large amount of it.
The rest of the video focused mostly on business-related topics:
Tuesday, 20 July 2010
Heroku Encourages Polyglot Persistence ☞
Heroku published an article preaching polyglot persistence through a Database-as-a-Service approach:
Database-as-a-service is one of the coming decade’s most promising business models. […] DaaS also goes hand-in-glove with polyglot persistence. Thanks to database services, you won’t need to learn how to sysadmin/DBA for every datastore you use – you can instead outsource that job to a service provider specializing in each database.
While it definitely sounds exciting to be able to use all these NoSQL databases, we should always keep in mind the cost of complexity, even if DaaS helps alleviate some of the complexity of heterogeneous systems.
The article also includes some interesting use cases for a couple of NoSQL databases:
- Frequently-written, rarely read statistical data (for example, a web hit counter) should use an in-memory key/value store like Redis, or an update-in-place document store like MongoDB.
- Big Data (like weather stats or business analytics) will work best in a freeform, distributed db system like Hadoop.
- Binary assets (such as MP3s and PDFs) find a good home in a datastore that can serve directly to the user’s browser, like Amazon S3.
- Transient data (like web sessions, locks, or short-term stats) should be kept in a transient datastore like Memcache. (Traditionally we haven’t grouped memcached into the database family, but NoSQL has broadened our thinking on this subject.)
- If you need to be able to replicate your data set to multiple locations (such as syncing a music database between a web app and a mobile device), you’ll want the replication features of CouchDB.
- High availability apps, where minimizing downtime is critical, will find great utility in the automatically clustered, redundant setup of datastores like Cassandra and Riak.
These are good examples, but you can find many more in our coverage of NoSQL use cases and the per-product case studies: CouchDB case studies or MongoDB case studies, etc.

