Dynamo: All content tagged as Dynamo in NoSQL databases and polyglot persistence

The NoSQL Family Tree

[Image: The NoSQL Family Tree]

Even if it includes just a handful of NoSQL databases, it’s still a nice visualization.

Original title and link: The NoSQL Family Tree (NoSQL database©myNoSQL)

via: https://cloudant.com/blog/the-nosql-family-tree/


Cassandra hits 1 million writes per second on Google Compute Engine

Google using Cassandra to show the performance and cost efficiency of the Google Compute Engine:

  • sustain one million writes per second to Cassandra with a median latency of 10.3 ms and 95% completing under 23 ms
  • sustain a loss of 1/3 of the instances and volumes and still maintain the 1 million writes per second (though with higher latency)
  • scale up and down linearly so that the configuration described can be used to create a cost effective solution
  • go from nothing in existence to a fully configured and deployed instances hitting 1 million writes per second took just 70 minutes. A configured environment can achieve the same throughput in 20 minutes.
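To make the latency figures in the list above concrete, here is a minimal sketch of how a median and a 95th percentile are read off a set of samples. The `percentile` helper uses the nearest-rank method, which is an assumption on my part (the benchmark post does not say how its percentiles were computed), and the sample latencies are made up for illustration; they are not data from the benchmark.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at
    least p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical write latencies in milliseconds (illustrative only).
latencies_ms = [8.1, 9.4, 9.9, 10.2, 10.3, 10.4, 11.0, 12.5, 17.8, 22.6]

median_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
print(f"median={median_ms} ms, p95={p95_ms} ms")
```

The gap between the two numbers is the interesting part: a median alone hides the slow tail, which is why the post reports both.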

Make sure you check the charts and read through to the conclusion. The other conclusion I’d suggest is: based on the real benchmarks I’ve seen over the years, Cassandra is the only system that has been proven to scale linearly and provide top performance [1].


  1. Before saying that I’m biased, make sure you read at least this story and Netflix’s post.

Original title and link: Cassandra hits 1 million writes per second on Google Compute Engine (NoSQL database©myNoSQL)

via: http://googlecloudplatform.blogspot.co.uk/2014/03/cassandra-hits-one-million-writes-per-second-on-google-compute-engine.html


The Hadoop as ETL part in migrating from MongoDB to Cassandra at FullContact

While I’ve found the whole post very educational, and very balanced considering the topic, the part that I’m linking to is about integrating MongoDB with Hadoop. After reading the story of integrating MongoDB and Hadoop at Foursquare, I was left with quite a few questions. This post doesn’t answer any of them, but it brings in some more details about existing tools, a completely different solution, and what seems to be an overarching theme when using Hadoop and MongoDB in the same sentence:

We’re big users of Hadoop MapReduce and tend to lean on it whenever we need to make large scale migrations, especially ones with lots of transformation. That fact along with our existing conversion project from before, we used 10gen’s mongo-hadoop project which has input and output formats for Hadoop. We immediately realized that the InputFormat which connected to a MongoDB cluster was ill-suited to our usage. We had 3TB of partially-overlapping data across 2 clusters. After calculating input splits for a few hours, it began pulling documents at an uncomfortably slow pace. It was slow enough, in fact, that we developed an alternative plan.

You’ll have to read the post to learn how they accomplished their goal, but as a spoiler, it was once again more of an ETL process than an integration.

✚ The corresponding HN thread focuses mostly on the MongoDB-to-Cassandra parts.

Original title and link: The Hadoop as ETL part in migrating from MongoDB to Cassandra at FullContact (NoSQL database©myNoSQL)

via: http://www.fullcontact.com/blog/mongo-to-cassandra-migration/


Riak vs. Cassandra – How does Riak compare to Cassandra?

Basho’s side on Riak vs. Cassandra:

Cassandra looks the most like Riak out of any other widely-deployed data storage technology in existence. Cassandra and Riak have architectural roots in Amazon’s Dynamo, the system Amazon engineered to handle their highly available shopping cart service. Both Riak and Cassandra are masterless, highly available stores that persist replicas and handle failure scenarios through concepts such as hinted handoff and read-repair. However, there are certain key differences between the two that should be considered when evaluating them.

Publishing such comparisons is always an extremely difficult task as long as you want to stay objective; I know this first hand:

  1. you must stay with technical facts — no rumours, no speculations. Technical facts rarely come in many shades of grey. Everything needs to be accurate. For an extra point, each aspect presented should allow the reader to dig deeper into it;
  2. you must be clear what aspects you’ll cover in the comparison. And for each category you must make sure you are not leaving things out;
  3. you must remove all corporate messaging. If you want to express opinion, be clear about it. Or do it separately. Corporate messaging and opinion don’t mix well (or at all) with good technical comparisons;
  4. you must be open to answering any questions and ready to accept you’ve made mistakes.

Then work hard to get facts right.

Original title and link: Riak vs. Cassandra – How does Riak compare to Cassandra? (NoSQL database©myNoSQL)

via: http://basho.com/riak-vs-cassandra/


How not to benchmark Cassandra

The emphasis is on the not:

As Cassandra continues to increase in popularity, it’s natural that more people will benchmark it against systems they’re familiar with as part of the evaluation process. Unfortunately, many of these results are less valuable than one would hope, due to preventable errors.

While I bet every core database developer has seen a lot of irrelevant [1] benchmarks (do not miss the last paragraph of the post), I still find microbenchmarks the most useless (i.e. 100 data points, no concurrency, no tuning => mine is bigger than yours).


  1. That’s the most polite term I could come up with. 

Original title and link: How not to benchmark Cassandra (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/how-not-to-benchmark-cassandra


Quick links for how to backup different NoSQL databases

After re-reading HyperDex’s comparison of Cassandra, MongoDB, and Riak backups, I’ve realized there are no links to the corresponding docs. So here they are:

Cassandra backups

Cassandra backs up data by taking a snapshot of all on-disk data files (SSTable files) stored in the data directory.

You can take a snapshot of all keyspaces, a single keyspace, or a single table while the system is online. Using a parallel ssh tool (such as pssh), you can snapshot an entire cluster. This provides an eventually consistent backup. Although no one node is guaranteed to be consistent with its replica nodes at the time a snapshot is taken, a restored snapshot resumes consistency using Cassandra’s built-in consistency mechanisms.

After a system-wide snapshot is performed, you can enable incremental backups on each node to backup data that has changed since the last snapshot: each time an SSTable is flushed, a hard link is copied into a /backups subdirectory of the data directory (provided JNA is enabled).
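As a rough illustration of the snapshot step described above, the sketch below only builds the command lines; it does not run anything. It assumes `nodetool snapshot` with the `-t` tag option and a `pssh` binary that takes a hosts file via `-h` and prints output inline via `-i` (on some distributions the binary is named `parallel-ssh`). The function names, hosts file, tag, and keyspace are hypothetical.

```python
import shlex


def snapshot_command(tag, keyspace=None):
    """Build a `nodetool snapshot` command for a single node.

    With no keyspace, nodetool snapshots all keyspaces.
    """
    cmd = ["nodetool", "snapshot", "-t", tag]
    if keyspace:
        cmd.append(keyspace)
    return cmd


def cluster_snapshot_command(hosts_file, tag, keyspace=None):
    """Wrap the per-node command in pssh so that it would run on
    every host listed in hosts_file (one host per line)."""
    remote = shlex.join(snapshot_command(tag, keyspace))
    return ["pssh", "-h", hosts_file, "-i", remote]


# Hypothetical values, for illustration only.
print(cluster_snapshot_command("hosts.txt", "pre-upgrade", "my_keyspace"))
```

Each node still snapshots on its own schedule when the command arrives, which is exactly why the docs call the result an eventually consistent backup.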

MongoDB backups

Basically there are three ways to back up MongoDB:

  1. Using MMS
  2. Copying underlying files
  3. Using mongodump

Riak backups

Riak’s backup operations are pretty different for the two main storage backends, Bitcask and LevelDB, used by Riak:

Choosing your Riak backup strategy will largely depend on the backend configuration of your nodes. In many cases, Riak will conform to your already established backup methodologies. When backing up a node, it is important to backup both the ring and data directories that pertain to your configured backend.

Note: I’d be happy to update this entry with links to docs on what tools and solutions other NoSQL databases (HBase, Redis, Neo4j, CouchDB, Couchbase, RethinkDB) are providing.

✚ Considering that creating backups is only as useful as being sure they will actually work when you try to restore, I’m wondering why there are no tools that can validate a backup without forcing a complete restore. The two mechanisms are not equivalent, but for large databases this might simplify the process a bit and increase users’ confidence.

Original title and link: Quick links for how to backup different NoSQL databases (NoSQL database©myNoSQL)


Comparing NoSQL backup solutions

In a post introducing HyperDex backups, Robert Escriva compares the different backup solutions available in Cassandra, MongoDB, and Riak:

Cassandra: Cassandra’s backups are inconsistent, as they are taken at each server independently without coordination. Further, “Restoring from snapshots and incremental backups temporarily causes intensive CPU and I/O activity on the node being restored.”

MongoDB: MongoDB provides two backup strategies. The first strategy copies the data on backup, and re-inserts it on restore. This approach introduces high overhead because it copies the entire data set without opportunity for incremental backup.

The second approach is to use filesystem-provided snapshots to quickly backup the data of a mongod instance. This approach requires operating system support and will produce larger backup sizes.

Riak: Riak backups are inconsistent, as they are taken at each server independently without coordination, and require care when migrating between IP addresses. Further, Riak requires that each server be shut down before backing up LevelDB-powered backends.

How HyperDex’s new backup is described:

The HyperDex backup/restore process is strongly consistent, doesn’t require shutting down servers, and enables incremental backup support. Further, the process is quite efficient; it completes quickly, and does not consume CPU or I/O for extended periods of time.

The caveat is that HyperDex puts the cluster in read-only mode for backing up. That’s a loss of availability. Considering that what both Cassandra and Riak promise is high availability, their choice was clear.

Update: This comment from Emin Gün Sirer makes me wonder if I missed something:

HyperDex quiesces the network, takes a snapshot, resumes. Whole operation takes sub-second latency.

The key point is that the system is online, available while the data copying is taking place.

Original title and link: Comparing NoSQL backup solutions (NoSQL database©myNoSQL)

via: http://hackingdistributed.com/2014/01/14/back-that-nosql-up/


Cassandra CQL and the IN operator

The most succinct description of where IN can be used in CQL:

  1. The last column in the partition key, assuming the = operator is used on the first N-1 columns of the partition key
  2. The last clustering column, assuming the = operator is used on the first N-1 clustering columns and all partition keys are restricted
  3. The last clustering column, assuming the = operator is used on the first N-1 clustering columns and ALLOW FILTERING is specified
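The three rules above can be sketched as a small checker. This is only my reading of the list, not Cassandra’s actual validation logic: it models a query with a single IN restriction, treats “restricted” as “restricted with =”, and uses a hypothetical table with partition key (sensor_id, day) and clustering columns (hour, minute).

```python
def in_allowed(partition_key, clustering, eq_columns, in_column,
               allow_filtering=False):
    """Return True if IN on in_column fits one of the three rules.

    partition_key / clustering: column names in declared order.
    eq_columns: columns the query restricts with the = operator.
    """
    eq = set(eq_columns)

    # Rule 1: IN on the last partition key column, = on the others.
    if partition_key and in_column == partition_key[-1]:
        return all(col in eq for col in partition_key[:-1])

    # Rules 2 and 3: IN on the last clustering column, = on the first
    # N-1 clustering columns, plus either a fully restricted partition
    # key (rule 2) or ALLOW FILTERING (rule 3).
    if clustering and in_column == clustering[-1]:
        prefix_ok = all(col in eq for col in clustering[:-1])
        partition_ok = all(col in eq for col in partition_key)
        return prefix_ok and (partition_ok or allow_filtering)

    return False


# Hypothetical schema, for illustration only.
PK = ["sensor_id", "day"]
CK = ["hour", "minute"]

print(in_allowed(PK, CK, {"sensor_id"}, "day"))                 # rule 1
print(in_allowed(PK, CK, {"sensor_id", "day", "hour"}, "minute"))  # rule 2
print(in_allowed(PK, CK, {"hour"}, "minute", allow_filtering=True))  # rule 3
```

Anything else, such as IN on `hour` in this schema, falls through to `False` because `hour` is not the last column of its group.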

I like clear rules.

Original title and link: Cassandra CQL and the IN operator (NoSQL database©myNoSQL)

via: http://planetcassandra.org/blog/post/the-in-operator-in-cassandra-cql


MySQL is a great Open Source project. How about open source NoSQL databases?

In a post titled Some myths on Open Source, the way I see it, Anders Karlsson writes about MySQL:

As far as code, adoption and reaching out to create an SQL-based RDBMS that anyone can afford, MySQL / MariaDB has been immensely successful. But as an Open Source project, something being developed together with the community where everyone work on their end with their skills to create a great combined piece of work, MySQL has failed. This is sad, but on the other hand I’m not so sure that it would have as much influence and as wide adoption if the project would have been a “clean” Open Source project.

The article offers a very black-and-white perspective on open source versus commercial code. But that’s not why I’m linking to it.

The above paragraph made me think about how many of the most popular open source NoSQL databases would die without the companies (or people) that created them.

Here’s my list: MongoDB, Riak, Neo4j, Redis, Couchbase, etc. And I could continue for quite a while considering how many there are out there: RavenDB, RethinkDB, Voldemort, Tokyo, Titan.

Actually if you reverse the question, the list would get extremely short: Cassandra, CouchDB (still struggling though), HBase. All these were at some point driven by community. Probably the only special case could be LevelDB.

✚ As a follow-up to Anders Karlsson’s post, Robert Hodges posted The Scale-Out Blog: Why I Love Open Source.

Original title and link: MySQL is a great Open Source project. How about open source NoSQL databases? (NoSQL database©myNoSQL)

via: http://karlssonondatabases.blogspot.com/2014/01/some-myths-on-open-source-way-i-see-it.html


Google Compute Engine and Data

Since the GA announcement a couple of weeks ago, I’ve noticed quite a few data-related posts on the Google Compute Engine blog:

If you look at these, you’ll notice a theme: covering data from every angle. Cassandra/DSE from DataStax for OLTP, DataTorrent for stream processing, Qubole for Hadoop, MapR for their Hadoop-like solution. I can see this continuing for a while and making Google Compute Engine a strong competitor to Amazon Web Services.

One question remains though: will they be able to come up with a good integration strategy for all these 3rd party tools?

Original title and link: Google Compute Engine and Data (NoSQL database©myNoSQL)


Why NoSQL Can Be Safer than an RDBMS

Robin Schumacher1:

That said, I disagree with many of the article’s statements, the most important being that companies should not consider NoSQL databases as a first choice for critical data. In this article, I’ll show first how a NoSQL database like Cassandra is indeed being used today as a primary datastore for key data and, second, that Cassandra can actually end up being safer than an RDBMS for important information.

You already know how this goes: “First they ignore you, then they laugh at you, then they fight you, then you win”. I’ll let you decide where major NoSQL databases are today.


  1. Robin Schumacher is VP of Products at DataStax. He’s also my boss.

Original title and link: Why NoSQL Can Be Safer than an RDBMS (NoSQL database©myNoSQL)

via: http://www.datastax.com/2013/10/why-nosql-can-be-safer-than-an-rdbms


Quick intro to Apache Cassandra… comic style

You can find it here. Nice job by Alberto Diego Prieto Löfkrantz.

Original title and link: Quick intro to Apache Cassandra… comic style (NoSQL database©myNoSQL)