Friday, 3 September 2010
Cassandra: Tuning Garbage Collection ☞
Mikio L. Braun shares a set of experiments he ran configuring the garbage collection for Cassandra:
In summary, a bit of garbage collection tuning can help to make Cassandra run in a stable manner. In particular, you should set the CMS thresholds a bit lower, and probably also experiment with incremental CMS if you have enough cores. Setting the CMS threshold to 75%, I got Cassandra to run well in 8GB without any GC induced glitches, which is a big progress from the previous post.
Jonathan Ellis recently mentioned a valuable resource for garbage collection tuning, a presentation by Tony Printezis, Charlie Hunt and Ludovic Poitou: “Garbage Collection Tuning in the Java HotSpot Virtual Machine” (NB: unfortunately the link is no longer available, but if you can find the presentation somewhere, make sure you get a copy). Also note that the latest Cassandra release reaches outside the VM and deals directly with the OS to address a combination of GC behavior and swapping.
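To make the settings from the quote concrete, here is a minimal sketch that assembles the corresponding HotSpot options. The helper name `cms_jvm_flags` and its parameters are my own illustration, not something from Mikio's post; the flags themselves are the standard CMS options of HotSpot JVMs from that period, and where exactly you put them depends on your Cassandra startup script.

```python
def cms_jvm_flags(heap_gb=8, occupancy_percent=75, incremental=False):
    """Build a list of HotSpot options for a CMS-tuned JVM (hypothetical helper)."""
    if not 0 < occupancy_percent <= 100:
        raise ValueError("occupancy_percent must be in (0, 100]")
    flags = [
        f"-Xms{heap_gb}G",  # initial heap size
        f"-Xmx{heap_gb}G",  # maximum heap size (same value: fixed-size heap)
        "-XX:+UseConcMarkSweepGC",  # enable the CMS collector
        # start a CMS cycle once the old generation is this full
        f"-XX:CMSInitiatingOccupancyFraction={occupancy_percent}",
        # always use the threshold above instead of the JVM's own heuristics
        "-XX:+UseCMSInitiatingOccupancyOnly",
    ]
    if incremental:
        # incremental CMS, the variant the quote suggests trying
        flags += ["-XX:+CMSIncrementalMode", "-XX:+CMSIncrementalPacing"]
    return flags


if __name__ == "__main__":
    print(" ".join(cms_jvm_flags(incremental=True)))
```

The defaults mirror the numbers in the quote (8GB heap, 75% threshold); treat them as a starting point for your own experiments rather than a recommendation.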
Original title and link for this post: Cassandra: Tuning Garbage Collection (published on the NoSQL blog: myNoSQL)
Scaling Out or Scaling Up? ☞
Nati Shalom (Gigaspaces):
Today, with the availability of large multi-core and large memory systems, there are more cases where you might have a single machine that can cover your scalability and performance goals. And yet, there are several other factors to consider when choosing between the two options:
- Continuous Availability/Redundancy
- Cost/Performance Flexibility
- Continuous Upgrades
- Geographical Distribution
Very informative.
Hadoop/HBase Capacity Planning ☞
After some Hadoop hardware recommendations and using Amdahl’s law for Hadoop provisioning, Cloudera shares its know-how on Hadoop/HBase capacity planning covering aspects like network, memory, disk, and CPU:
Since we are talking about data, the first crucial parameter is how much disk space we need on all of the Hadoop nodes to store all of your data and what compression algorithm you are going to use to store the data. For the MapReduce components an important consideration is how much computational power you need to process the data and whether the jobs you are going to run on the cluster is CPU or I/O intensive. […] Finally, HBase is mainly memory driven and we need to consider the data access pattern in your application and how much memory you need so that the HBase nodes do not swap the data too often to the disk. Most of the written data end up in memstores before they finally end up on disk, so you should plan for more memory in write-intensive workloads like web crawling.
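The disk part of the quote boils down to simple arithmetic, sketched below. Everything here is an assumption for illustration: the function name, the compression ratio, the per-node disk size and the headroom are made up, and only the replication factor of 3 matches the usual HDFS default.

```python
import math


def hdfs_nodes_needed(raw_tb, compression_ratio, replication=3,
                      disk_per_node_tb=8.0, headroom=0.25):
    """Estimate stored size and node count for a data set (back-of-envelope).

    raw_tb            -- uncompressed data size in TB
    compression_ratio -- e.g. 4 means the data shrinks to a quarter (assumed)
    replication       -- number of copies HDFS keeps of every block
    disk_per_node_tb  -- raw disk capacity per node (hypothetical)
    headroom          -- fraction of each disk kept free for temp/shuffle data
    """
    if compression_ratio <= 0 or not 0 <= headroom < 1:
        raise ValueError("invalid compression_ratio or headroom")
    stored_tb = raw_tb / compression_ratio * replication
    usable_per_node_tb = disk_per_node_tb * (1 - headroom)
    return stored_tb, math.ceil(stored_tb / usable_per_node_tb)


if __name__ == "__main__":
    stored, nodes = hdfs_nodes_needed(raw_tb=100, compression_ratio=4)
    print(f"{stored} TB on disk, at least {nodes} nodes")
```

This covers disk only; the CPU, network and memory considerations from the quote need their own estimates.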
Thursday, 2 September 2010
CouchDB: A CouchApp Project Structure ☞
If you have already learned about CouchApp, it is now time to see the most detailed description of a CouchApp project structure:
This is my current (3 day) understanding of the way it works. I could be totally/probably wrong but I am sure someone will help and point that out. […] The order of the document is based on the layout of the files in TextMate while examining the Pages project.
A good reference point for those building CouchApp based apps.
MongoDB safe and fsync Explained ☞
Kristina Chodorow explains the MongoDB safe and fsync options:
safe and fsync are not the same, here’s a rundown of the options:
- safe=>false: do not wait for a db response
- safe=>true: wait for a db response
- safe=>num: wait for that many servers to have the write before returning
- fsync=>true: fsync the write to disk before returning. fsync=>true implies safe=>true, but not vice versa.
- If fsync=>false and safe=>true, the write could be successfully applied to a mmapped file but not yet written to disk.
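The rules in the rundown can be written down as a tiny model. This is only a sketch of the semantics as quoted; `describe_write` and the keys of the returned dictionary are names I made up, and this is not any MongoDB driver's API.

```python
def describe_write(safe=False, fsync=False):
    """Summarize what a write waits for under the safe/fsync rules quoted above."""
    if fsync and not safe:
        safe = True  # fsync=>true implies safe=>true
    if not safe:
        # fire and forget: the client does not wait for any response
        return {"waits_for_response": False, "servers": 0,
                "fsynced_before_return": False}
    # safe=>true means one server answered; safe=>num means that many servers
    servers = 1 if safe is True else int(safe)
    return {"waits_for_response": True, "servers": servers,
            "fsynced_before_return": bool(fsync)}


if __name__ == "__main__":
    for options in ({}, {"safe": True}, {"safe": 3}, {"fsync": True}):
        print(options, "->", describe_write(**options))
```

The case worth remembering is `safe=True` without `fsync`: the client gets a response, but the write may still live only in the memory-mapped file.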
Related to MongoDB durability.
Riak: Sort by with MapReduce ☞
Alexander Sicular:
The focus of this post is to show you how to do the equivalent of the sql “SORT BY date DESC” using Riak’s map/reduce interface. Due to Riak’s schemaless, document focused nature Riak lacks internal indexing and by extension, native sorting capabilities.
Complete code is included in the original post.
A couple of links you’ll probably find useful before/after reading the article:
- Riak has improved the fetching of keys in a bucket, making MapReduce directly on buckets less expensive
- Even though some say MapReduce is complicated, take a look at how to translate SQL to MapReduce or this MapReduce explanation in simple terms
- A complete guide to MapReduce with Riak
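To show the shape of the idea without a running cluster, here is a small local simulation of sorting by date in descending order with a map phase and a reduce phase. It is not Alexander's code and not Riak's API: the bucket is a plain dictionary, and the function names and the `date` field are assumptions of mine.

```python
def map_phase(key, doc):
    """Emit only what the sort needs: the object's key and its date."""
    return [{"key": key, "date": doc["date"]}]


def reduce_sort_desc(values):
    """Sort by date, newest first. ISO dates (YYYY-MM-DD) compare correctly
    as strings. Sorting already-sorted input is harmless, so this function
    stays correct if it is applied again to partial results."""
    return sorted(values, key=lambda v: v["date"], reverse=True)


def run_job(bucket):
    """Run the map phase over every object, then reduce once over all results."""
    mapped = []
    for key, doc in bucket.items():
        mapped.extend(map_phase(key, doc))
    return reduce_sort_desc(mapped)


if __name__ == "__main__":
    posts = {
        "post-a": {"date": "2010-08-30", "title": "first"},
        "post-b": {"date": "2010-09-02", "title": "third"},
        "post-c": {"date": "2010-09-01", "title": "second"},
    }
    for row in run_job(posts):
        print(row["date"], row["key"])
```

The point to notice is that the sort happens in the reduce step at query time, which is exactly the work an index would otherwise save.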
CouchDB and MongoDB: Querying ☞
Andrew Glover:
Both MongoDB and CouchDB are document-oriented datastores. They both work with JSON documents. They both are usually thrown into the NoSQL bucket. They’re both hip. But that’s where the similarities, for the most part, stop.
When it comes to queries, both couldn’t be any more different.
They differ even in the implementation and behavior of MapReduce.
Original title and link for this post: CouchDB and MongoDB: Quering (published on the NoSQL blog: myNoSQL)
Fixing ACID without going NoSQL ☞
Daniel Abadi and Alexander Thomson:
In our opinion, the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues. Responsibility for atomicity, consistency and isolation is simply being pushed onto the developer. What is really needed is a way for ACID systems to scale on shared-nothing architectures, and that is what we address in the research paper that we will present at VLDB this month. Our view (and yes, this may seem counterintuitive at first), is that the problem with ACID is not that its guarantees are too strong (and that therefore scaling these guarantees in a shared-nothing cluster of machines is too hard), but rather that its guarantees are too weak, and that this weakness is hindering scalability.
No comments until I think this through.
Paper available ☞ here.
CouchDB Tips & Tricks: Loading DNS data into CouchDB
Not sure how many need it, but still an interesting trick:
Probably this one is a nicer trick though.
Too Much Redis?
Ben Curtis ☞ thinks that using Redis for managing friends lists, as described in the ☞ EngineYard post, is overly complicated:
Yesterday I read a post over at the EngineYard blog about a use case for Redis (in the name of being a polyglot, trying new things, etc.), and I just had to scratch my head. I love Redis — it rocks my world — but that example was too much for me. If you just want to store a set of ids somewhere to avoid normalization headaches, introducing Redis is overkill… just do it in MySQL!
He goes on to propose a MySQL solution in which friend IDs are serialized as a comma-separated list. Frankly, I see quite a few advantages Redis has over this approach:
- Redis knows how to handle sets
- you don’t have to deal with de-duplication
- (most probably) the storage is optimized
- with manual serialization you’ll have to deal with all concurrency issues occurring when updating these lists
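The de-duplication and concurrency points can be illustrated with a few lines of plain Python. This is a sketch under my own assumptions: the function names are invented, the "column" is just a string, and a Python set stands in for a Redis set (where a command such as SADD adds a member atomically on the server).

```python
def csv_add(serialized, friend_id):
    """Add an id to a comma-separated list, de-duplicating by hand."""
    ids = [int(x) for x in serialized.split(",") if x]
    if friend_id not in ids:
        ids.append(friend_id)
    return ",".join(str(i) for i in ids)


def lost_update_demo():
    """Two clients read the same value, then each writes back its own copy."""
    column = "1,2"
    snapshot_a = column               # client A reads the column
    snapshot_b = column               # client B reads the same value
    column = csv_add(snapshot_a, 3)   # A writes "1,2,3"
    column = csv_add(snapshot_b, 4)   # B overwrites it, friend 3 is lost
    return column


def set_add_demo():
    """With a set, each add is a single operation and duplicates are ignored."""
    friends = {1, 2}
    for friend_id in (3, 4, 3):
        friends.add(friend_id)
    return friends


if __name__ == "__main__":
    print(lost_update_demo())
    print(sorted(set_add_demo()))
```

The serialized list can of course be made safe with row locking or a transaction around the read and the write, but that is exactly the extra work the last bullet refers to.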
So what is the advantage of Ben’s suggested solution?
Wednesday, 1 September 2010
MongoDB: Stable MongoDB 1.6.2 Released, Recommended Upgrade
New stable version from MongoDB fixing a couple of bugs:
- database isolation issue with concurrency and deletion of objects http://jira.mongodb.org/browse/SERVER-1710
- concurrency issue when doing in-memory sort
- current operation tracking could cause segfault while being accessed
- replica set initialization segfault fixed
- map reduce sort option working again
- administrative enhancements for using replica sets as shards
Full announcement ☞ here.
Original title and link for this post: MongoDB: Stable 1.6.2 Released, Recommended Upgrade (published on the NoSQL blog: myNoSQL)
MongoDB or Hadoop? ☞
Posted on the MongoDB mailing list:
I have about 500M log file entries each representing an “ad impression” (we are an advertising company). Each “hit” has about 50 attributes to it (example: Country, State, City, Adsize, Browser, OS, etc) .. I want to load all 500M into some form of database and then run queries against this set.
As you might expect, MongoDB is being considered as a possibility. But I’d call that biased vendor advice. I’ll be blunt: invest in your future by using Hadoop and Pig. Hive may fit too.

