column store: All content tagged as column store in NoSQL databases and polyglot persistence

The evolution of Cassandra

Robbie Strickland (Weather Channel’s software development manager):

“It used to be that Cassandra was something of a beast,” recalled Robbie Strickland, the Weather Channel’s software development manager. “It was like an old space shuttle cockpit with a million knobs.” In fact, prior to joining the Weather Channel, Strickland says using Cassandra required spending “most of the day in the Cassandra IRC channel talking to developers.”

Some databases evolve into more powerful and friendlier tools, while others are just “familiar giant hairballs growing bigger”.

Original title and link: The evolution of Cassandra (NoSQL database©myNoSQL)

via: http://data-informed.com/weather-channel-manages-data-deluge-combination-platforms/


Consensus-based replication in HBase

Konstantin Boudnik (WANdisco):

The idea behind consensus-based replication is pretty simple: instead of trying to guarantee that all replicas of a node in the system are synced post-factum to an operation, such a system will coordinate the intent of an operation. If a consensus on the feasibility of an operation is reached, it will be applied by each node independently. If consensus is not reached, the operation simply won’t happen. That’s pretty much the whole philosophy.

Not enough details, but doesn’t this sound like Paxos applied earlier?
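For illustration only (my own sketch, not WANdisco's code), the coordinate-the-intent idea boils down to something like this: replicas vote on whether an operation is feasible before it happens, and only a quorum "yes" lets each replica apply it independently:

```python
# Toy consensus-gated replication. All names here are illustrative.

class Node:
    def __init__(self, name):
        self.name = name
        self.log = []  # operations applied on this replica

    def feasible(self, op):
        # e.g. check the op doesn't conflict with local state
        return op.get("valid", True)

    def apply(self, op):
        self.log.append(op["name"])


def replicate(nodes, op):
    """Apply op on every node only if a majority votes it feasible."""
    votes = sum(1 for n in nodes if n.feasible(op))
    if votes <= len(nodes) // 2:
        return False  # no consensus: the operation simply won't happen
    for n in nodes:
        n.apply(op)   # each node applies it independently
    return True


nodes = [Node(f"n{i}") for i in range(3)]
assert replicate(nodes, {"name": "put:row1"}) is True
assert replicate(nodes, {"name": "bad-op", "valid": False}) is False
assert all(n.log == ["put:row1"] for n in nodes)
```

Note there is no post-factum sync step: either all replicas agreed up front and apply the same operation, or none do.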

Original title and link: Consensus-based replication in HBase (NoSQL database©myNoSQL)

via: http://blogs.wandisco.com/2014/06/16/consunsus-based-replication-hbase/


Choice of NoSQL databases from Cloudera

Adam Fowler1 looks at the potential confusion for Cloudera’s customers when talking about NoSQL databases:

As for Cloudera customers I’m not too sure. It may confuse people asking Cloudera about NoSQL. Below is a potential conversation that, as a sales engineer for NoSQL vendor MarkLogic, I can see easily happening:

This announcement struck me as over-publicized. It’s normal for companies with similar interests to partner, but a fair amount of care should go into clearing up all possible confusion, and I don’t think that happened here.

Just to summarize: Cloudera provides support for HBase and Accumulo, and it has deals with MongoDB and Oracle. I assume that in the sales process Cloudera will go with: “we work with whatever you already have in place”. As for recommending a NoSQL solution to their customers, it will probably go as in Adam Fowler’s post. To which we could probably add Oracle too.


  1. Adam Fowler works for MarkLogic. 

Original title and link: Choice of NoSQL databases from Cloudera (NoSQL database©myNoSQL)

via: http://adamfowlerml.wordpress.com/2014/05/05/choice-of-nosql-databases-from-cloudera/


OhmData C5: an improved HBase

You’ll probably recognize the names behind OhmData and their improved HBase product C5. In their own HN words:

  • We say we can do failover in a couple of seconds. We want to make it subsecond, but we can’t do that reliably yet. In HBase this story is much more mixed.
  • We wanted to really reduce complexity; as a result, you can just apt-get install c5 on each node and you are done. It’s one daemon, one log file, and that’s it. No xmx nonsense, and almost no tuning or config files. I don’t know if you have dealt with Hadoop before, but the complexity is high.
  • Finally, we have a much more advanced wire format; in fact, it’s advanced by being simple (protobufs + HTTP). As a result, clients in languages other than Java become very easy, without a Thrift client.

Are we in a new stage of NoSQL databases: “X that doesn’t suck”?

Original title and link: OhmData C5: an improved HBase (NoSQL database©myNoSQL)


Hadoop and big data: Where Apache Slider slots in and why it matters

Arun Murthy for ZDNet about Apache Slider:

Slider is a framework that allows you to bridge existing always-on services and makes sure they work really well on top of YARN without having to modify the application itself. That’s really important.

Right now it’s HBase and Accumulo but it could be Cassandra, it could be MongoDB, it could be anything in the world. That’s the key part.

I couldn’t find the project on the Incubator page.

Original title and link: Hadoop and big data: Where Apache Slider slots in and why it matters (NoSQL database©myNoSQL)

via: http://www.zdnet.com/hadoop-and-big-data-where-apache-slider-slots-in-and-why-it-matters-7000028073/


NoSQL meets Bitcoin and brings down two exchanges

Most of Emin Gün Sirer’s posts end up linked here, as I usually enjoy the way he combines a real-life story with something technical, all that ending with a pitch for HyperDex.

The problem here stemmed from the broken-by-design interface and semantics offered by MongoDB. And the situation would not have been any different if we had used Cassandra or Riak. All of these first-generation NoSQL datastores were early because they are easy to build. When the datastore does not provide any tangible guarantees besides “best effort,” building it is simple. Any masters student in a top school can build an eventually consistent datastore over a weekend, and students in our courses at Cornell routinely do. What they don’t do is go from door to door in the valley, peddling the resulting code as if it could or should be deployed.
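The failure mode behind the exchange collapses was a classic lost-update race on an account balance. Here is a deterministic replay of it (my own sketch; no particular datastore API is implied):

```python
# Two concurrent withdrawals read the same balance, both pass the check,
# and both blindly write back. The interleaving is replayed sequentially
# so the outcome is deterministic.

store = {"balance": 100}

def withdraw_read(store):
    return store["balance"]               # step 1: read

def withdraw_write(store, seen, amount):
    if seen >= amount:                    # step 2: check the *stale* read
        store["balance"] = seen - amount  # step 3: blind write-back
        return True
    return False

# Interleaving: both requests read before either writes.
a = withdraw_read(store)
b = withdraw_read(store)
withdraw_write(store, a, 100)
withdraw_write(store, b, 100)

# 200 units were handed out, but the balance only dropped by 100.
assert store["balance"] == 0
```

A conditional update (write only if the balance is still what you read) closes this hole; whether the blame lies with the datastore or the application depends on whether such a primitive was available and ignored.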

Unfortunately, in this case the jump from the real problem, which was caused purely by incompetence, to declaring “first-generation NoSQL databases” bad and pitching HyperDex’s features is both too quick and incorrect1.


  1. 1) ACID guarantees wouldn’t have solved the issue; 2) all three NoSQL databases mentioned actually offer a solution for this particular scenario. 

Original title and link: NoSQL meets Bitcoin and brings down two exchanges (NoSQL database©myNoSQL)

via: http://hackingdistributed.com/2014/04/06/another-one-bites-the-dust-flexcoin/


Cassandra hits 1 million writes per second on Google Compute Engine

Google using Cassandra to show the performance and cost efficiency of the Google Compute Engine:

  • sustain one million writes per second to Cassandra, with a median latency of 10.3 ms and 95% of writes completing in under 23 ms
  • sustain the loss of 1/3 of the instances and volumes while still maintaining the 1 million writes per second (though with higher latency)
  • scale up and down linearly, so the configuration described can be used to create a cost-effective solution
  • go from nothing to fully configured and deployed instances hitting 1 million writes per second in just 70 minutes; a previously configured environment can achieve the same throughput in 20 minutes

Make sure you check the charts and read through to the conclusion. The other conclusion I’d suggest: based on the real benchmarks I’ve seen over the years, Cassandra is the only system proven to scale linearly while providing top performance1.


  1. Before saying that I’m biased, make sure you read at least this story and Netflix’s post.

Original title and link: Cassandra hits 1 million writes per second on Google Compute Engine (NoSQL database©myNoSQL)

via: http://googlecloudplatform.blogspot.co.uk/2014/03/cassandra-hits-one-million-writes-per-second-on-google-compute-engine.html


The Hadoop as ETL part in migrating from MongoDB to Cassandra at FullContact

While I found the whole post very educational (and very balanced, considering the topic), the part I’m linking to is about integrating MongoDB with Hadoop. After reading the story of integrating MongoDB and Hadoop at Foursquare, there were quite a few questions bugging me. This post doesn’t answer any of them, but it brings in some more details about existing tools, a completely different solution, and what seems to be an overarching theme whenever Hadoop and MongoDB appear in the same phrase:

We’re big users of Hadoop MapReduce and tend to lean on it whenever we need to make large scale migrations, especially ones with lots of transformation. Given that fact, along with our existing conversion project from before, we used 10gen’s mongo-hadoop project, which has input and output formats for Hadoop. We immediately realized that the InputFormat which connected to a MongoDB cluster was ill-suited to our usage. We had 3TB of partially-overlapping data across 2 clusters. After calculating input splits for a few hours, it began pulling documents at an uncomfortably slow pace. It was slow enough, in fact, that we developed an alternative plan.

You’ll have to read the post to learn how they’ve accomplished their goal, but as a spoiler, it was once again more of an ETL process rather than an integration.
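As a rough sketch of the “Hadoop as ETL” shape (my own minimal example, not FullContact’s actual pipeline): dump the source data to flat files first, then run the map/reduce transformation over the files instead of pulling documents from a live cluster through an InputFormat. The field names and merge logic below are invented for illustration:

```python
import json
from collections import defaultdict

# Stand-in for dumps exported from MongoDB; in a real pipeline these
# would be files on HDFS, not an in-memory list.
dump_lines = [
    '{"email": "a@x.com", "source": "cluster1"}',
    '{"email": "a@x.com", "source": "cluster2"}',   # overlapping record
    '{"email": "b@x.com", "source": "cluster1"}',
]

def map_phase(line):
    doc = json.loads(line)
    yield doc["email"], doc          # key by the field used for dedup/merge

def reduce_phase(key, docs):
    # Merge partially-overlapping documents into one output record.
    return {"email": key, "sources": sorted(d["source"] for d in docs)}

groups = defaultdict(list)
for line in dump_lines:
    for k, v in map_phase(line):
        groups[k].append(v)

out = [reduce_phase(k, v) for k, v in sorted(groups.items())]
assert out[0] == {"email": "a@x.com", "sources": ["cluster1", "cluster2"]}
```

The point of the shape is that the expensive, slow part (reading from the operational store) happens exactly once, as an export, rather than inside every MapReduce job.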

✚ The corresponding HN thread; it’s focused mostly on the MongoDB-to-Cassandra migration parts.

Original title and link: The Hadoop as ETL part in migrating from MongoDB to Cassandra at FullContact (NoSQL database©myNoSQL)

via: http://www.fullcontact.com/blog/mongo-to-cassandra-migration/


HBase block caches - Optimizing for random reads

Great post by Nick Dimiduk1 covering the whats, whys, and hows of caching data blocks in HBase, the mechanism through which HBase is optimizing random reads2:

There is a single BlockCache instance in a region server, which means all data from all regions hosted by that server share the same cache pool. The BlockCache is instantiated at region server startup and is retained for the entire lifetime of the process. Traditionally, HBase provided only a single BlockCache implementation: the LruBlockCache. The 0.92 release introduced the first alternative in HBASE-4027: the SlabCache. HBase 0.96 introduced another option via HBASE-7404, called the BucketCache.
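Conceptually, the LruBlockCache is just a shared, bounded, least-recently-used pool of data blocks. A toy version (my sketch; the real implementation also has priority tiers for in-memory column families, which this skips):

```python
from collections import OrderedDict

class LruBlockCache:
    """One shared pool per region server; least-recently-read blocks go first."""

    def __init__(self, max_blocks):
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()  # block key -> block bytes

    def get(self, key):
        if key not in self.blocks:
            return None              # cache miss: caller reads from HDFS
        self.blocks.move_to_end(key) # mark as most recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)  # evict least recently used

cache = LruBlockCache(max_blocks=2)
cache.put("hfile1:0", b"...")
cache.put("hfile1:64k", b"...")
cache.get("hfile1:0")            # touch: now most recently used
cache.put("hfile2:0", b"...")    # evicts hfile1:64k, not hfile1:0
assert cache.get("hfile1:64k") is None
assert cache.get("hfile1:0") is not None
```

SlabCache and BucketCache change where the block bytes live (off-heap slabs, files, etc.), not this basic get/put/evict contract.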


  1. Nick Dimiduk works at Hortonworks and is the co-author of HBase in Action

  2. For optimizing recent edits, HBase has another mechanism, the MemStore

Original title and link: HBase block caches - Optimizing for random reads (NoSQL database©myNoSQL)

via: http://www.n10k.com/blog/blockcache-101/


Riak vs. Cassandra – How does Riak compare to Cassandra?

Basho’s side on Riak vs. Cassandra:

Cassandra looks the most like Riak out of any other widely-deployed data storage technology in existence. Cassandra and Riak have architectural roots in Amazon’s Dynamo, the system Amazon engineered to handle their highly available shopping cart service. Both Riak and Cassandra are masterless, highly available stores that persist replicas and handle failure scenarios through concepts such as hinted handoff and read-repair. However, there are certain key differences between the two that should be considered when evaluating them.

Publishing such comparisons is always an extremely difficult task as long as you want to stay objective; I know this first hand:

  1. you must stay with technical facts: no rumors, no speculation. Technical facts rarely come in many shades of grey, and everything needs to be accurate. For extra credit, each aspect presented should let the reader dig deeper into it;
  2. you must be clear about which aspects the comparison covers, and for each category you must make sure you are not leaving things out;
  3. you must remove all corporate messaging. If you want to express an opinion, be clear about it, or do it separately; corporate messaging and opinion don’t mix well (or at all) with good technical comparisons;
  4. you must be open to answering any questions, and be ready to accept that you’ve made mistakes.

Then work hard to get facts right.

Original title and link: Riak vs. Cassandra – How does Riak compare to Cassandra? (NoSQL database©myNoSQL)

via: http://basho.com/riak-vs-cassandra/


How not to benchmark Cassandra

The emphasis is on the not:

As Cassandra continues to increase in popularity, it’s natural that more people will benchmark it against systems they’re familiar with as part of the evaluation process. Unfortunately, many of these results are less valuable than one would hope, due to preventable errors.

While I bet every core database developer has seen plenty of irrelevant1 benchmarks (do not miss the last paragraph of the post), I still find microbenchmarks the most useless (i.e. 100 data points, no concurrency, no tuning => mine is bigger than yours).


  1. That’s the most polite term I could come up with. 
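To make the 100-data-points complaint concrete, here’s a purely synthetic example (my numbers, not DataStax’s) of how a tiny sample barely observes tail latency at all:

```python
import random

random.seed(42)  # reproducible synthetic "benchmark"

def sample_latencies(n):
    # 2% of requests hit a slow path (GC pause, compaction, ...)
    return [100.0 if random.random() < 0.02 else 1.0 for _ in range(n)]

def p99(xs):
    xs = sorted(xs)
    return xs[int(0.99 * len(xs)) - 1]

# With 100,000 samples the tail is observed reliably:
big = p99(sample_latencies(100_000))
assert big == 100.0

# With 100 samples, rerunning the "same" benchmark flips between
# "everything takes 1 ms" and "p99 is 100 ms":
trials = [p99(sample_latencies(100)) for _ in range(50)]
assert set(trials) == {1.0, 100.0}
```

A benchmark whose headline number changes by two orders of magnitude between identical runs isn’t measuring the database.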

Original title and link: How not to benchmark Cassandra (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/how-not-to-benchmark-cassandra


Quick links for how to backup different NoSQL databases

After re-reading HyperDex’s comparison of Cassandra, MongoDB, and Riak backups, I’ve realized there are no links to the corresponding docs. So here they are:

Cassandra backups

Cassandra backs up data by taking a snapshot of all on-disk data files (SSTable files) stored in the data directory.

You can take a snapshot of all keyspaces, a single keyspace, or a single table while the system is online. Using a parallel ssh tool (such as pssh), you can snapshot an entire cluster. This provides an eventually consistent backup. Although no one node is guaranteed to be consistent with its replica nodes at the time a snapshot is taken, a restored snapshot resumes consistency using Cassandra’s built-in consistency mechanisms.

After a system-wide snapshot is performed, you can enable incremental backups on each node to back up data that has changed since the last snapshot: each time an SSTable is flushed, a hard link is copied into a /backups subdirectory of the data directory (provided JNA is enabled).
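The hard-link trick is worth seeing in miniature (illustrative only; Cassandra does this internally, not via user code): a hard link costs no extra disk space, and the backup copy survives even after compaction deletes the original SSTable:

```python
import os
import tempfile

data_dir = tempfile.mkdtemp()
backup_dir = os.path.join(data_dir, "backups")
os.makedirs(backup_dir)

def flush_sstable(name, contents):
    """Write an 'SSTable' to disk and hard-link it into backups/."""
    path = os.path.join(data_dir, name)
    with open(path, "wb") as f:
        f.write(contents)
    os.link(path, os.path.join(backup_dir, name))  # link, not a copy
    return path

path = flush_sstable("ks-table-1-Data.db", b"rows...")
os.remove(path)  # say compaction later deletes the original SSTable

# The backup link still holds the data:
with open(os.path.join(backup_dir, "ks-table-1-Data.db"), "rb") as f:
    assert f.read() == b"rows..."
```

This works because SSTables are immutable once flushed, so a link to one is as good as a copy.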

MongoDB backups

Basically, there are three ways to back up MongoDB:

  1. Using MMS
  2. Copying underlying files
  3. Using mongodump
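For the mongodump route, the shape of the command is roughly the following. This sketch only builds the argument list (run it with subprocess if you like); the flag names are the standard mongodump options, but check the docs for your version:

```python
def mongodump_cmd(host, out_dir, db=None):
    """Build a mongodump invocation; pass db=None to dump all databases."""
    cmd = ["mongodump", "--host", host, "--out", out_dir]
    if db:
        cmd += ["--db", db]   # restrict the dump to one database
    return cmd

cmd = mongodump_cmd("localhost:27017", "/backups/2014-03-01", db="prod")
assert cmd[0] == "mongodump"
assert "--db" in cmd
```

The corresponding restore goes through mongorestore pointed at the same output directory.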

Riak backups

Riak’s backup operations are pretty different for the two main storage backends, Bitcask and LevelDB, used by Riak:

Choosing your Riak backup strategy will largely depend on the backend configuration of your nodes. In many cases, Riak will conform to your already established backup methodologies. When backing up a node, it is important to backup both the ring and data directories that pertain to your configured backend.

Note: I’d be happy to update this entry with links to docs on what tools and solutions other NoSQL databases (HBase, Redis, Neo4j, CouchDB, Couchbase, RethinkDB) are providing.

✚ Considering that a backup is only as useful as the ability to actually restore it, I’m wondering why there are no tools that can validate a backup without forcing a complete restore. The two mechanisms are not equivalent, but for large databases this could simplify the process a bit and increase users’ confidence.

Original title and link: Quick links for how to backup different NoSQL databases (NoSQL database©myNoSQL)