dynamo: All content tagged as dynamo in NoSQL databases and polyglot persistence
Thursday, 12 January 2012
Eventual and Strong Consistency, Sloppy and Strict Quorums, and Other Lessons and Thoughts on Distributed Systems
Anything I could write would only take time away from reading and thinking about the email Joseph Blomstedt posted to the Riak list.
Original title and link: Eventual and Strong Consistency, Sloppy and Strict Quorums, and Other Lessons and Thoughts on Distributed Systems (©myNoSQL)
Friday, 15 April 2011
riak_zab: ZooKeeper’s Zab Protocol for Riak
Last night, Mark Phillips[1] sent me the following message:
Yesterday a Riak community member, Joseph Blomstedt, released a pair of repos: riak_zab and riak_zab_example. In short, riak_zab is an extension for riak_core that provides totally ordered atomic broadcast capabilities.
If your first reaction was something along the lines of “so what?”, then you are not alone. Anyway, I hope I’ve done my homework and now understand why this is both interesting and important.
ZooKeeper is a coordination service created by Yahoo! to allow large scale applications to perform coordination tasks such as leader election, status propagation, and rendezvous. Embedded into ZooKeeper is a totally ordered broadcast protocol: Zab[2].
ZooKeeper makes the following requirements on the Zab broadcast protocol:
- Reliable delivery: If a message, m, is delivered by one server, then it will be eventually delivered by all correct servers.
- Total order: If message a is delivered before message b by one server, then every server that delivers a and b delivers a before b.
- Causal order: If message a causally precedes message b and both messages are delivered, then a must be ordered before b.
- Prefix property: If m is the last message delivered for a leader L, any message proposed before m by L must also be delivered.
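To make two of these requirements concrete, here is a minimal sketch that checks them against recorded delivery logs. It is not part of ZooKeeper or riak_zab: the function names and the "one list of message ids per server" representation are assumptions made for illustration, and message ids are assumed to be unique within a log.

```python
from itertools import combinations


def violates_total_order(logs):
    """Return True if two servers delivered a shared pair of messages in opposite orders.

    `logs` is a list of per-server delivery logs (hypothetical representation),
    each a list of unique message ids in delivery order.
    """
    for log_a, log_b in combinations(logs, 2):
        pos_b = {msg: i for i, msg in enumerate(log_b)}
        # Messages both servers delivered, in the order server A delivered them.
        shared = [msg for msg in log_a if msg in pos_b]
        # If that sequence is not increasing in server B's log, the orders disagree.
        for earlier, later in zip(shared, shared[1:]):
            if pos_b[earlier] > pos_b[later]:
                return True
    return False


def satisfies_prefix(proposed, delivered):
    """Check the prefix property for a single leader.

    `proposed` is the leader's proposal order; `delivered` is what one server
    delivered from that leader. Everything proposed before the last delivered
    message must have been delivered too.
    """
    if not delivered:
        return True
    last_index = proposed.index(delivered[-1])
    return set(proposed[:last_index]) <= set(delivered)
```

A server that has delivered only some of the messages is fine as long as it agrees on the order of the ones it has, which is why the first check looks only at shared messages.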
Putting it together:
riak_zab is an extension for riak_core that provides totally ordered atomic broadcast capabilities. This is accomplished through a pure Erlang implementation of Zab, the Zookeeper Atomic Broadcast protocol invented by Yahoo! Research.
Basically this means that many operations that were typically hard to do well using the Dynamo eventual consistency model, which Riak implements with riak_core, are now theoretically possible. In case you’d like to read more about riak_core, I’d suggest these posts: Building blocks of Dynamo-like distributed systems, riak_core: building distributed applications without shared state, and Where to start with riak_core.
Indeed, there are other Dynamo aspects, like synchronization, scaling, ring rebalancing, and handoff, that riak_zab had to address, and its approach is documented on the project page.
Technically, riak_zab isn’t focused on message delivery. Rather, riak_zab provides consistent replicated state.
So why is this important? While parts of an application can deal with eventual consistency, there are typically a few aspects of an app where stronger consistency is needed for various reasons. A few such examples:
- Distributed Counters: just ask the Digg people how they implemented distributed counters in Cassandra, or check how Twitter built a ZooKeeper and Cassandra based solution for realtime analytics.
- Ordered Messaging
- Compare-and-Swap Operations
- Set Operations
- Sub-document updates
- Multi-key transactions
riak_zab will theoretically allow users to reap the availability of the standard eventually consistent model Riak provides and opt in to stronger consistency for the parts of the app that require it. In other words, it expands the number of CAP trade-offs that a developer can make beyond the R, W and N model and makes possible functionality that relies on consistency guarantees.
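For readers new to the R, W and N model: N is the number of replicas of a value, W the number of replicas that must acknowledge a write, and R the number that must answer a read. The sketch below shows the usual overlap rule, R + W > N, and checks it by brute force. It is an illustration under the assumption of strict quorums; the function names are mine and nothing here is Riak's API. With sloppy quorums and hinted handoff, as in Dynamo, the overlap is not guaranteed in the same way.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """The usual rule of thumb: reads intersect the latest write when R + W > N."""
    return r + w > n


def every_read_meets_every_write(n, r, w):
    """Brute-force check: does every possible read set share a replica with every write set?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

With N=3, choosing R=2 and W=2 gives overlapping quorums, while R=1 and W=1 trades that guarantee for lower latency and higher availability. That dial is the only one the model offers, which is why an opt-in totally ordered broadcast widens the set of choices.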
After connecting all these dots, I’ve finally got it (nb: thanks Mark for sending this out!). And it definitely sounds like an exciting addition to riak_core. The caveat here is that, even if this code has been tested in an academic environment, the project is still an alpha version. Hopefully other members of the Riak community will pick it up and, together with the Basho guys, make it production ready.
- Mark Phillips: Community Manager at Basho Technologies, @pharkmillups (↩)
Wednesday, 9 March 2011
Redis and HBase for Mozilla Grouperfish Storage
About Mozilla Grouperfish architecture and choosing a scalable storage solution:
Given our access patterns (insert documents, update clusters, re-process entire collections, fetch lists of clusters), efficient sequential access to selected parts of the data is very important. Sorted, column oriented storage seems to be the way to go. There are other pros and cons (single point of failure, write throughput, hardware requirements), but if we don’t cater to our use case, those won’t ever matter.
And this is what the planned solution is going to look like:
- service layer: node.js
- data layer: Redis + HBase
- processing layer: RabbitMQ, Mahout, Jetty
- batching layer: Hadoop

via: http://blog.mozilla.com/data/2011/03/08/scalable-text-clustering-for-the-web/
Thursday, 6 January 2011
Google App Engine High Replication Datastore
The High Replication Datastore provides the highest level of availability for your reads and writes, at the cost of increased latency for writes and changes in consistency guarantees in the API. The High Replication Datastore increases the number of data centers that maintain replicas of your data by using the Paxos algorithm to synchronize that data across datacenters in real time.
Still not completely decentralized a la Amazon Dynamo.
via: http://googleappengine.blogspot.com/2011/01/announcing-high-replication-datastore.html
Friday, 12 November 2010
Why Every Node in a Cassandra Cluster is the Same
An additional benefit, besides elasticity and fault tolerance, of having a single type of node in your cluster:
Having all nodes share the same role also streamlines operations and systems administrations tasks as well. Because Cassandra has a single node type, it has only a single set of requirements for hardware, for monitoring, and deployment.
As far as I can tell, Cassandra, Riak, Project Voldemort, Membase, and Terrastore[1] are the ones following this philosophy.
- Not 100% sure, so please correct me if I’m wrong. Update: Sergio Bossa clarified this in a comment. (↩)
via: http://www.riptano.com/blog/spof-0-why-every-node-cassandra-cluster-same
Thursday, 21 October 2010
Riak Search: A Proof of the Riak's Core Building Blocks
Justin Sheehy[1] offers a different perspective on how Riak Search can also be seen as a proof of correctness of Riak’s core Dynamo-like building blocks:
Riak Search is the first public demonstration that Riak Core is a meaningful base on which to build distributed systems beyond just a key/value store. By using the same central code base for distribution, dispatch, ownership, failure management, and node administration, we are able to confidently make many of the same guarantees for Search that we have made all along for Riak’s key/value storage. Even if Search itself wasn’t such a compelling product, it is exciting as a proof of the value of Riak Core.
- Justin Sheehy, Basho CTO, @justinsheehy (↩)
via: http://blog.basho.com/2010/10/20/why-i-am-excited-about-riak-search/
Thursday, 19 August 2010
CouchDB: Horizontal Scalability from Cloudant
Even though CouchDB benefits from probably one of the most sophisticated and cool replication mechanisms, that doesn’t make it horizontally scalable. I’ve already covered the different solutions for scaling CouchDB, but what Cloudant promises seems to be the missing part:
All of these features — distributed, horizontally scalable, durable, consistent — happen with little or no change required in applications that have been written for CouchDB. A cluster looks just like a stand-alone CouchDB, and API compliance has been our goal from the beginning. Granted, there are a few extra options like overriding quorum constant defaults and there are a few vagaries, like views always performing rereduce due to the views being distributed. But on the whole, the extras in Cloudant are transparent to the application.
Now I’m wondering how Cloudant’s CouchDB scaling compares with running CouchDB with a Riak backend, since Riak also offers a Dynamo-like distributed system.
CouchDB: Horizontal Scalability from Cloudant originally posted on the NoSQL blog: myNoSQL
Thursday, 5 August 2010
Building Blocks of Dynamo-like Distributed Systems
The Basho guys have started to talk about their experience building Riak, the Dynamo-like distributed key-value store, and about the common building blocks of distributed systems.
Justin Sheehy interviewed by Sadek Drobi over ☞ InfoQ.com:
Even just the Dynamo specific parts are very dramatic in differences. There have been a number of Dynamo-like systems developed over the past few years, each of which has had to design and implement large portions of even just the Dynamo-like sections on their own. Because Dynamo tells you what some very good design decisions are but it doesn’t show you how to implement the system. Even just the Dynamo portion you have to do a lot of design work, just to implement that.
Justin on choosing Erlang for implementing Riak:
There was a really natural choice because especially when you look at the Dynamo model, where they talk about all these operations where to get a value you’ll send messages to multiple other parties, then you’ll wait through various phases for responses of different classes to come back and the basic building blocks to do that kind of messaging and to do that kind of more complex state machine are there for you out of the box for you in Erlang.
Kevin Smith promises a series of posts covering the details of ☞ riak_core, the refactored core of the Riak system that can be used for building Dynamo-like distributed systems:
Distributed systems are complex and some of that complexity shows in the amount of features available in riak_core. Rather than dive deeply into code, I’m going to separate the features into broad categories and give an overview of each.
The ☞ first part covers aspects like:
- node liveness & membership (it is interesting to note that improvements to the failure recovery mechanism are part of the latest Riak release)
- partitioning & distributing work
- cluster state
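As a rough picture of what “partitioning & distributing work” means in a Dynamo-like system, here is a simplified consistent hashing sketch. It is not riak_core’s code: the function names, the round-robin assignment of partitions to nodes and the example node names are assumptions for illustration. The general idea is that a key is hashed onto a fixed ring (here the 160-bit SHA-1 space), the ring is cut into equal partitions, and the key’s replicas go to the owners of N consecutive partitions.

```python
import hashlib

RING_SIZE = 2 ** 160  # size of the SHA-1 output space


def partition_for(key, num_partitions):
    """Map a key to one of `num_partitions` equal slices of the ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    position = int.from_bytes(digest, "big")
    # Scale the ring position down to a partition index in [0, num_partitions).
    return position * num_partitions // RING_SIZE


def preference_list(key, owners, n):
    """Return the owners of the N consecutive partitions starting at the key's partition.

    `owners[i]` is the node that currently owns partition i (hypothetical layout).
    """
    first = partition_for(key, len(owners))
    return [owners[(first + i) % len(owners)] for i in range(n)]


# Hypothetical cluster: 64 partitions spread round-robin over 4 nodes.
nodes = ["node_a", "node_b", "node_c", "node_d"]
owners = [nodes[i % len(nodes)] for i in range(64)]
replicas = preference_list("users/alice", owners, n=3)
```

Because the number of partitions is fixed, adding a node only changes who owns some partitions; keys never move between partitions, which is what makes rebalancing and handoff tractable.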
Definitely a series I’ll keep an eye on, as I’m pretty sure there are many things to be learned from their experience. (Shameless plug) If you happen to be in the Bay Area in November, come check the NoSQL track at QCon where, even if not yet published, among others, Andy Gross, VP of engineering at Basho, will be speaking about how to build Dynamo-style systems using Riak’s core.
Monday, 2 August 2010
Overview of the Amazon Dynamo Paper
This is becoming a habit of the NOSQL summer meeting in ☞ Tokyo:
They did it before for the Google BigTable paper. If you speak Japanese and happen to be around in Tokyo I’d say you shouldn’t miss such an event.
Note: In case you missed it, you can get all NoSQL papers using this little “hack”
Thursday, 29 July 2010
Single Server vs Sharding vs Dynamo Style Systems
Funny video from the Basho guys (physically) demonstrating what happens to your data when a server crashes.
Next time, send the mugs to people and use real servers for such demos!
Thursday, 28 January 2010
Characterizing Enterprise Systems using the CAP theorem
When building your next distributed system, you will have to make sure that all subsystems are able to deliver the combination of consistency-availability-partition tolerance that you are looking for.
Taylor’s article is a great start for categorizing according to the CAP theorem some of the (enterprise) systems out there: Terracotta, Oracle Coherence, GigaSpaces, but also RDBMS and a couple of NoSQL solutions like Amazon Dynamo, BigTable, Cassandra, CouchDB and Project Voldemort.
Another interesting aspect of the article is that it tries to identify how these systems are coping with the missing CAP dimension. Unfortunately, there are a couple of things in the RDBMS analysis that I do not agree with.
An RDBMS provides availability, but only when there is connectivity between the client accessing the RDBMS and the RDBMS itself.
[…] there are several well-known approaches that can be employed to compensate for the lack of Partition tolerance. One of these approaches is commonly referred to as master/slave replication.
RDBMS are not available by themselves. Leaving aside the connectivity issue, RDBMS can become busy performing complex operations or run out of resources and so they can be unavailable.
What the article identifies as a solution for dealing with partition tolerance, master/slave replication, is in fact meant to provide some level of availability. But with master/slave replication, consistency becomes only “eventual consistency”.
The other approach mentioned, sharding, is indeed a solution meant to provide some level of partition tolerance. But without replication it gives up availability.
As side notes:
- it was interesting to learn that GigaSpaces can behave as either a CA or an AP system, depending on the configurable replication scheme (sync vs async).
- I am wondering if there are any CP solutions out there. I’d speculate that financial services would probably be required to be CP (if distributed).
via: http://javathink.blogspot.com/2010/01/characterizing-enterprise-systems-using.html
Tuesday, 8 December 2009
Understanding Amazon Dynamo by Building it in Erlang
This is what I expect to become a great article series on the Amazon Dynamo paper. The author, Will Larson, suggests a different path of understanding the inner workings of this system:
I decided that a good way to record the ideas (as well as solidify them in my mind) was to go through the process of writing a distributed key-value store, and then incrementally add the enhancements discussed in the Dynamo paper. By the end of this series we’ll have re-implemented most of the interesting ideas from Dynamo in a distributed Erlang system.
He sounds really excited to deal with the concepts introduced in the Dynamo paper (consistent hashing, Merkle trees, vector clocks, gossip protocols, sloppy quorums) and so far has published the first two parts: Hands On Review of the Dynamo Paper and Durable Writes & Consistent Reads.
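Of the concepts listed, vector clocks are probably the easiest to show in a few lines. The sketch below is a generic illustration in Python, not code from Will Larson’s Erlang series or from Dynamo itself; the dict-based representation and the function names are assumptions. Each node increments its own counter when it coordinates a write, and two versions conflict when neither clock has seen everything the other has.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter bumped by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock `a` has seen every event recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify the relationship between two clocks."""
    a_sees_b, b_sees_a = descends(a, b), descends(b, a)
    if a_sees_b and b_sees_a:
        return "equal"
    if a_sees_b:
        return "after"
    if b_sees_a:
        return "before"
    return "concurrent"  # siblings: the application has to reconcile them


def merge(a, b):
    """Clock for a reconciled value: the per-node maximum of both clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}
```

Two writes that both start from the same version but go through different nodes come out as "concurrent", which is the case where a Dynamo-style store keeps both values and hands them back to the client to resolve.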