


Riak: All content tagged as Riak in NoSQL databases and polyglot persistence

Merge and serialization functions for Riak

Tom Crayford (Yeller) describes how to test the merge and serialization functions used to resolve potential conflicts in Riak:

The way I prefer using riak, is with allow_mult=true. This means that whenever you have causally conflicting writes to a key, riak will store all of them, and upon your next read of that key you have to resolve them yourself. Designing your datatypes such that you can merge them is a huge topic, and an area of active research. However, even once you have a merge strategy worked out, how can you be sure that your reasoning is good? The merge functions you use have to obey a few properties: they have to be commutative, idempotent and associative, or you’ll mess things up when you have conflicts
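To make the three properties concrete, here is a minimal sketch in Python. It is not the testing approach from the linked post: the helper name `check_merge_laws` and the sample values are assumptions made up for illustration, and the check is exhaustive over a tiny sample rather than a proper property-based test.

```python
from itertools import product


def check_merge_laws(merge, samples):
    """Return the sorted names of the laws that `merge` violates on `samples`."""
    violated = set()
    for a, b, c in product(samples, repeat=3):
        if merge(a, b) != merge(b, a):
            violated.add("commutative")
        if merge(a, a) != a:
            violated.add("idempotent")
        if merge(merge(a, b), c) != merge(a, merge(b, c)):
            violated.add("associative")
    return sorted(violated)


def union_merge(a, b):
    # Set union: order and repetition of merges do not matter.
    return a | b


def concat_merge(a, b):
    # Tuple concatenation: order matters and merging twice duplicates items.
    return a + b


set_samples = [frozenset(), frozenset({1}), frozenset({1, 2}), frozenset({3})]
tuple_samples = [(), (1,), (1, 2), (3,)]

print(check_merge_laws(union_merge, set_samples))
print(check_merge_laws(concat_merge, tuple_samples))
```

Set union passes all three checks on these samples, while concatenation fails the commutativity and idempotence checks, which is exactly the kind of merge that would corrupt data when siblings are resolved in a different order or more than once.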

Original title and link: Merge and serialization functions for Riak (NoSQL database©myNoSQL)


Riak: Entropy detection, correction, and conflict resolution

John Daily covers Riak’s mechanisms for bringing data in sync across the nodes:

Riak’s overarching design goal is simple: be maximally available. […] In order to make sure your data can survive server failures, Riak retains multiple copies (replicas) and allows lock-free, uncoordinated updates. […] This then opens up the possibility that data will be out of sync across a cluster. Riak manages this issue in three distinct stages: entropy detection, correction, and conflict resolution.

You’ll read pitches from products promising both maximal availability and no out-of-date data. Those are just that: promises.

Original title and link: Riak: Entropy detection, correction, and conflict resolution (NoSQL database©myNoSQL)


NoSQL meets Bitcoin and brings down two exchanges

Most of Emin Gün Sirer’s posts end up linked here, as I usually enjoy the way he combines a real-life story with something technical, all that ending with a pitch for HyperDex.

The problem here stemmed from the broken-by-design interface and semantics offered by MongoDB. And the situation would not have been any different if we had used Cassandra or Riak. All of these first-generation NoSQL datastores were early because they are easy to build. When the datastore does not provide any tangible guarantees besides “best effort,” building it is simple. Any masters student in a top school can build an eventually consistent datastore over a weekend, and students in our courses at Cornell routinely do. What they don’t do is go from door to door in the valley, peddling the resulting code as if it could or should be deployed.

Unfortunately, in this case, the jump from the real problem, which was caused only by pure incompetence, to declaring “first-generation NoSQL databases” bad and pitching HyperDex’s features is both too quick and incorrect [1].

  1. 1) ACID guarantees wouldn’t have solved the issue; 2) All 3 NoSQL databases mentioned actually offer a solution for this particular scenario.

Original title and link: NoSQL meets Bitcoin and brings down two exchanges (NoSQL database©myNoSQL)


Quick guide to CRDTs in Riak 2.0

Joel Jacobson provides a quick intro to using the new CRDT counters, sets, and maps in the Riak 2.0 preview:

Riak Data Types (also referred to as CRDTs) adds counters, sets, and maps to Riak – allowing for better conflict resolution. They enable developers to spend less time thinking about the complexities of vector clocks and sibling resolution and, instead, focusing on using familiar, distributed data types to support their applications’ data access patterns.
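For readers who have not met CRDTs before, the following is a minimal sketch of a textbook grow-only counter (G-Counter) in Python. It is a generic illustration of why such a type needs no manual sibling resolution; it is not Riak's implementation and not the Riak client API, and the class and actor names are assumptions.

```python
class GCounter:
    """Grow-only counter: one non-negative tally per actor (node)."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, actor, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[actor] = self.counts.get(actor, 0) + amount

    def value(self):
        # The counter's value is the sum of every actor's tally.
        return sum(self.counts.values())

    def merge(self, other):
        # Per-actor maximum: commutative, associative, and idempotent.
        actors = self.counts.keys() | other.counts.keys()
        return GCounter(
            {a: max(self.counts.get(a, 0), other.counts.get(a, 0)) for a in actors}
        )


# Two replicas accept increments independently, then converge on merge.
replica_a = GCounter()
replica_b = GCounter()
replica_a.increment("node_a", 2)
replica_b.increment("node_b", 3)

merged = replica_a.merge(replica_b)
print(merged.value())
```

Because the merge takes the per-actor maximum, merging the same replica again or merging in the opposite order gives the same result, so the application never has to pick a winner.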

✚ An extra point for everyone recognizing the data sample used in the post.

Original title and link: Quick guide to CRDTs in Riak 2.0 (NoSQL database©myNoSQL)


Riak vs. Cassandra – How does Riak compare to Cassandra?

Basho’s side on Riak vs. Cassandra:

Cassandra looks the most like Riak out of any other widely-deployed data storage technology in existence. Cassandra and Riak have architectural roots in Amazon’s Dynamo, the system Amazon engineered to handle their highly available shopping cart service. Both Riak and Cassandra are masterless, highly available stores that persist replicas and handle failure scenarios through concepts such as hinted handoff and read-repair. However, there are certain key differences between the two that should be considered when evaluating them.

Publishing such comparisons is always an extremely difficult task as long as you want to stay objective; I know this first hand:

  1. you must stay with technical facts — no rumours, no speculations. Technical facts rarely come in many shades of grey. Everything needs to be accurate. For an extra point, each aspect presented should allow the reader to dig deeper into it;
  2. you must be clear what aspects you’ll cover in the comparison. And for each category you must make sure you are not leaving things out;
  3. you must remove all corporate messaging. If you want to express opinion, be clear about it. Or do it separately. Corporate messaging and opinion don’t mix well (or at all) with good technical comparisons;
  4. be open to answering any questions. Be ready to accept you’ve made mistakes.

Then work hard to get facts right.

Original title and link: Riak vs. Cassandra – How does Riak compare to Cassandra? (NoSQL database©myNoSQL)


Quick links for how to backup different NoSQL databases

After re-reading HyperDex’s comparison of Cassandra, MongoDB, and Riak backups, I’ve realized there are no links to the corresponding docs. So here they are:

Cassandra backups

Cassandra backs up data by taking a snapshot of all on-disk data files (SSTable files) stored in the data directory.

You can take a snapshot of all keyspaces, a single keyspace, or a single table while the system is online. Using a parallel ssh tool (such as pssh), you can snapshot an entire cluster. This provides an eventually consistent backup. Although no one node is guaranteed to be consistent with its replica nodes at the time a snapshot is taken, a restored snapshot resumes consistency using Cassandra’s built-in consistency mechanisms.

After a system-wide snapshot is performed, you can enable incremental backups on each node to backup data that has changed since the last snapshot: each time an SSTable is flushed, a hard link is copied into a /backups subdirectory of the data directory (provided JNA is enabled).
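The hard-link detail is what makes this cheap: no data is copied, and the backup entry keeps the bytes alive even after the original file is removed. Here is a small Python sketch of that filesystem behavior only. It does not call Cassandra in any way; the file name and directory layout are made up for illustration, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    backups_dir = os.path.join(data_dir, "backups")
    os.mkdir(backups_dir)

    # Stand-in for a freshly flushed data file (hypothetical name).
    sstable = os.path.join(data_dir, "example-Data.db")
    with open(sstable, "wb") as f:
        f.write(b"flushed table contents")

    # A hard link is a second directory entry for the same inode: no bytes are copied.
    backup = os.path.join(backups_dir, "example-Data.db")
    os.link(sstable, backup)
    same_inode = os.stat(sstable).st_ino == os.stat(backup).st_ino

    # Removing the original entry leaves the data reachable through the link.
    os.remove(sstable)
    with open(backup, "rb") as f:
        surviving = f.read()

print(same_inode, surviving)
```

The flip side is that hard-linked backups live on the same disk as the data, so they still need to be shipped elsewhere to protect against hardware failure.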

MongoDB backups

Basically, there are three ways to back up MongoDB:

  1. Using MMS
  2. Copying underlying files
  3. Using mongodump

Riak backups

Riak’s backup operations are pretty different for the two main storage backends, Bitcask and LevelDB, used by Riak:

Choosing your Riak backup strategy will largely depend on the backend configuration of your nodes. In many cases, Riak will conform to your already established backup methodologies. When backing up a node, it is important to backup both the ring and data directories that pertain to your configured backend.

Note: I’d be happy to update this entry with links to docs on what tools and solutions other NoSQL databases (HBase, Redis, Neo4j, CouchDB, Couchbase, RethinkDB) are providing.

✚ Considering that backups are only as useful as your ability to restore from them, I’m wondering why there are no tools that can validate a backup without forcing a complete restore. The two mechanisms are not equivalent, but for large databases this might simplify the process a bit and increase the confidence of the users.

Original title and link: Quick links for how to backup different NoSQL databases (NoSQL database©myNoSQL)

Comparing NoSQL backup solutions

In a post introducing HyperDex backups, Robert Escriva compares the different backup solutions available in Cassandra, MongoDB, and Riak:

Cassandra: Cassandra’s backups are inconsistent, as they are taken at each server independently without coordination. Further, “Restoring from snapshots and incremental backups temporarily causes intensive CPU and I/O activity on the node being restored.”

MongoDB: MongoDB provides two backup strategies. The first strategy copies the data on backup, and re-inserts it on restore. This approach introduces high overhead because it copies the entire data set without opportunity for incremental backup.

The second approach is to use filesystem-provided snapshots to quickly backup the data of a mongod instance. This approach requires operating system support and will produce larger backup sizes.

Riak: Riak backups are inconsistent, as they are taken at each server independently without coordination, and require care when migrating between IP addresses. Further, Riak requires that each server be shut down before backing up LevelDB-powered backends.

How HyperDex’s new backup is described:

The HyperDex backup/restore process is strongly consistent, doesn’t require shutting down servers, and enables incremental backup support. Further, the process is quite efficient; it completes quickly, and does not consume CPU or I/O for extended periods of time.

The caveat is that HyperDex puts the cluster in read-only mode for backing up. That’s loss of availability. Considering that what both Cassandra and Riak promise is high availability, their choice was clear.

Update: This comment from Emin Gün Sirer makes me wonder if I missed something:

HyperDex quiesces the network, takes a snapshot, resumes. Whole operation takes sub-second latency.

The key point is that the system is online, available while the data copying is taking place.

Original title and link: Comparing NoSQL backup solutions (NoSQL database©myNoSQL)


Anti-patterns for developing with NoSQL databases

Basho, makers of Riak, recently published an article about the most common patterns that have to be avoided when developing with Riak. Unsurprisingly, most of these rules can (and must) be applied to the majority of NoSQL databases.

Writing an application that can take full advantage of Riak’s robust scaling properties requires a different way of looking at data storage and retrieval. Developers who bring a relational mindset to Riak may create applications that work well with a small data set but start to show strain in production, particularly as the cluster grows.

What I’ve learned after experimenting and building apps with different NoSQL databases can be summarized in just a couple of short generic rules:

  1. if you have the “disadvantage” of being experienced with relational databases and working on an app that will use a NoSQL database, forget everything you know about the relational world. Take out that part of your brain and put it in a jar. Use the other side of your brain. Avoid any temptation to make comparisons or to ask yourself “how would I do this in a relational database?”. You’ll fail.
  2. when using relational databases, most often we start with the data model. “What’s the best way to organize and store our data?” is one of the first questions we’re addressing. Only afterwards do we figure out, in the application, how to retrieve data in the format needed by the app.
  3. when using a NoSQL database, focus on your application. “How do I use data in my application?” must be the driving question. Then your NoSQL database API will tell you exactly how to store the data.

    This might make it sound too simple. Indeed, it’s not that simple. Some of the complexity you’ll face comes from figuring out how to keep multiple copies of the data to fit the different ways you need to access it, updating and deleting multiple copies, dealing with the consistency requirements of your app, and deciding which availability versus consistency trade-offs your app is OK with.

  4. take the time to learn the most common usage patterns and anti-patterns for the NoSQL database you have picked. If you cannot find the ones that fit your application, talk to the community and build a prototype. Do not ignore point 3 above at any stage.

    Now go over the list of the anti-patterns when developing with Riak.
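To illustrate point 3 above (letting the access patterns drive storage, and paying for it by maintaining multiple copies), here is a minimal Python sketch. The in-memory `KeyValueStore` class, the bucket names, and the user fields are all hypothetical; this is not any particular database's API.

```python
class KeyValueStore:
    """Toy in-memory stand-in for a key-value database (hypothetical API)."""

    def __init__(self):
        self._data = {}

    def put(self, bucket, key, value):
        self._data[(bucket, key)] = value

    def get(self, bucket, key):
        return self._data.get((bucket, key))

    def delete(self, bucket, key):
        self._data.pop((bucket, key), None)


def save_user(store, user):
    # One write per access pattern: lookup by id and lookup by email.
    store.put("users_by_id", user["id"], user)
    store.put("users_by_email", user["email"], user["id"])


def change_email(store, user_id, new_email):
    # Every copy has to be updated; forgetting one leaves a stale entry behind.
    user = dict(store.get("users_by_id", user_id))
    store.delete("users_by_email", user["email"])
    user["email"] = new_email
    save_user(store, user)


def find_by_email(store, email):
    user_id = store.get("users_by_email", email)
    return store.get("users_by_id", user_id) if user_id is not None else None


store = KeyValueStore()
save_user(store, {"id": "u1", "email": "old@example.com", "name": "Ada"})
change_email(store, "u1", "new@example.com")
print(find_by_email(store, "new@example.com"))
```

In a real distributed store the two writes in `save_user` are not atomic, which is precisely where the consistency and availability trade-offs mentioned above start to matter.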

Original title and link: Anti-patterns for developing with NoSQL databases (NoSQL database©myNoSQL)

MySQL is a great Open Source project. How about open source NoSQL databases?

In a post titled Some myths on Open Source, the way I see it, Anders Karlsson writes about MySQL:

As far as code, adoption and reaching out to create an SQL-based RDBMS that anyone can afford, MySQL / MariaDB has been immensely successful. But as an Open Source project, something being developed together with the community where everyone work on their end with their skills to create a great combined piece of work, MySQL has failed. This is sad, but on the other hand I’m not so sure that it would have as much influence and as wide adoption if the project would have been a “clean” Open Source project.

The article offers a very black-and-white perspective on open source versus commercial code. But that’s not why I’m linking to it.

The above paragraph made me think about how many of the most popular open source NoSQL databases would die without the companies (or people) that created them.

Here’s my list: MongoDB, Riak, Neo4j, Redis, Couchbase, etc. And I could continue for quite a while considering how many there are out there: RavenDB, RethinkDB, Voldemort, Tokyo, Titan.

Actually, if you reverse the question, the list would get extremely short: Cassandra, CouchDB (still struggling though), HBase. All these were at some point driven by the community. Probably the only special case could be LevelDB.

✚ As a follow-up to Anders Karlsson’s post, Robert Hodges posted The Scale-Out Blog: Why I Love Open Source.

Original title and link: MySQL is a great Open Source project. How about open source NoSQL databases? (NoSQL database©myNoSQL)


Relational to Riak

A 3-part article, a bit too high-level for me, about what is to be gained (and lost) when using Riak instead of a relational database:

  1. High Availability
  2. Cost of Scale
  3. Tradeoffs

What I always like about Basho’s posts is that they don’t shy away from covering the tradeoffs.

Original title and link: Relational to Riak (NoSQL database©myNoSQL)

Concurrent updates, distributed systems and clocks, vector clocks, last-write-win, and CRDT

Great post by John Daily from Basho about concurrent updates in the world of distributed systems and the implications of using clocks, vector clocks, last-write-wins, and distributed data types (Commutative Replicated Data Types):

The problem is simple: there is no reliable definition of “last write”; because system clocks across multiple servers are going to drift.
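A small Python sketch of that problem, under made-up numbers: two servers whose wall clocks disagree by a few seconds, a last-write-wins resolver that trusts timestamps, and a simplified vector clock comparison that does not. The function names, server names, and timestamps are assumptions for illustration, not code from the linked post or from Riak.

```python
def last_write_wins(writes):
    """Pick the write with the highest wall-clock timestamp."""
    return max(writes, key=lambda w: w["timestamp"])


def descends(a, b):
    """True if vector clock `a` has seen every event recorded in `b`."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    """Classify the causal relationship between two vector clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"


# server_b's wall clock runs a few seconds behind server_a's, so the write that
# really happened second (it had already seen the first) gets the lower timestamp.
write_1 = {"value": "first", "timestamp": 100.0, "clock": {"server_a": 1}}
write_2 = {"value": "second", "timestamp": 97.0,
           "clock": {"server_a": 1, "server_b": 1}}

print(last_write_wins([write_1, write_2])["value"])
print(compare(write_2["clock"], write_1["clock"]))
print(compare({"server_a": 1}, {"server_b": 1}))
```

With these inputs last-write-wins keeps the older value, while the vector clocks correctly report that the second write supersedes the first. When neither clock descends from the other, the writes are concurrent and something (the application or a CRDT) has to merge them.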

Original title and link: Concurrent updates, distributed systems and clocks, vector clocks, last-write-win, and CRDT (NoSQL database©myNoSQL)


Riak: Secondary Indexes or G-Set Term-based Inverted Indexes

Comparing the pros and cons of 2 different approaches for indexing data in Riak: secondary indexes and G-Set based inverted indexes:

A G-Set Term-Based Inverted Index has the following benefits over a Secondary Index:

  • Better read performance at the sacrifice of some write performance
  • Less resource intensive for the Riak cluster
  • Excellent resistance to cluster partition since CRDTs have defined sibling merge behavior
  • Can be implemented on any Riak backend including Bitcask, Memory, and of course LevelDB
  • Tunable via read and write parameters to improve performance
  • Ideal when the exact index term is known

✚ The Grow-Only Set (G-Set) is one of the convergent and commutative replicated data types defined in the paper A comprehensive study of Convergent and Commutative Replicated Data Types (pdf).
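A rough sketch of the idea in Python, assuming a toy in-memory model: each index term maps to one or more sibling G-Sets of object keys, and an exact-term read merges the siblings with set union. The `TermIndex` class and the term and key formats are hypothetical; this is not the implementation discussed in the linked post.

```python
class TermIndex:
    """Toy term-based inverted index: term -> sibling G-Sets of object keys."""

    def __init__(self):
        self._siblings = {}

    def write(self, term, keys):
        # Each uncoordinated write may leave behind its own sibling G-Set.
        self._siblings.setdefault(term, []).append(frozenset(keys))

    def lookup(self, term):
        # Exact-term read: a G-Set only grows, so siblings merge by set union.
        merged = frozenset().union(*self._siblings.get(term, []))
        if merged:
            # Store the merged value back, collapsing the siblings into one.
            self._siblings[term] = [merged]
        return merged


index = TermIndex()
# Two clients index different objects under the same term without coordinating.
index.write("email:ada@example.com", {"user/1"})
index.write("email:ada@example.com", {"user/7"})

print(sorted(index.lookup("email:ada@example.com")))
```

The read does the extra merge work, and the exact term has to be known up front, which matches the trade-offs listed above. Note that a grow-only set cannot remove an entry, so deleting from such an index needs a different data type.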

Original title and link: Riak: Secondary Indexes or G-Set Term-based Inverted Indexes (NoSQL database©myNoSQL)