column store: All content tagged as column store in NoSQL databases and polyglot persistence

An intro to HBase’s Thrift interface

If you’ve never used Thrift (with or without HBase), the two articles authored by Jesse Anderson and posted on Cloudera’s blog will give you both a quick intro and a hands-on walkthrough:

  1. How-to: Use the HBase Thrift Interface, Part 1: setting up, getting the language bindings, and connecting;
  2. How-to: Use the HBase Thrift Interface, Part 2: Inserting/Getting Rows: using HBase’s Thrift API from Python
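
To give you a taste of the pattern the articles walk through, here’s a minimal sketch: generate the Python bindings from `Hbase.thrift`, open a buffered transport to the Thrift gateway, and use the generated client for puts and gets. The table name, column name, and host/port below are my own assumptions, not taken from the articles:

```python
# Sketch of talking to HBase's Thrift gateway from Python. Assumes the
# bindings were generated with `thrift --gen py Hbase.thrift` and that a
# Thrift server is listening on localhost:9090 (both assumptions).

def cell(family, qualifier):
    # HBase's Thrift API addresses a cell as "family:qualifier"
    return "%s:%s" % (family, qualifier)

def main():
    # imports kept inside main() so the helper above works without
    # the thrift package or the generated hbase module installed
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase  # the generated module

    transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Hbase.Client(protocol)
    transport.open()

    # insert one row, then read it back
    mutations = [Hbase.Mutation(column=cell("cf", "name"), value="alice")]
    client.mutateRow("users", "row1", mutations, None)
    for result in client.getRow("users", "row1", None):
        print(result.row, result.columns)

    transport.close()

# call main() against a live cluster with the Thrift server started
```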

Original title and link: An intro to HBase’s Thrift interface (NoSQL database©myNoSQL)

Approaches to Backup and Disaster Recovery in HBase

This must be part of your HBase operational manual:

Let’s start with the least disruptive, smallest data footprint, least performance-impactful mechanism and work our way up to the most disruptive, forklift-style tool:

  • Snapshots
  • Replication
  • Export
  • CopyTable
  • HTable API
  • Offline backup of HDFS data
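
For the two MapReduce-based mechanisms in the list, the tools ship with HBase itself. Here’s a hedged sketch of how their command lines are assembled; the class names are HBase’s standard MapReduce tools, while the table name, HDFS directory, and ZooKeeper address are made up for illustration:

```python
# Sketch: building the command lines for HBase's Export and CopyTable
# MapReduce tools. Run the returned lists with subprocess against a
# live cluster; nothing here talks to HBase by itself.

def export_cmd(table, hdfs_dir):
    # full-table export to an HDFS directory, executed as one MapReduce job
    return ["hbase", "org.apache.hadoop.hbase.mapreduce.Export", table, hdfs_dir]

def copytable_cmd(table, peer_zk):
    # live copy of a table to another cluster, addressed by its ZooKeeper
    # quorum; "/hbase" is the default znode parent (an assumption here)
    return ["hbase", "org.apache.hadoop.hbase.mapreduce.CopyTable",
            "--peer.adr=%s:/hbase" % peer_zk, table]

# e.g. subprocess.check_call(export_cmd("users", "/backups/users"))
```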

HBase backup strategies

When you return to the office after the winter holiday, make sure you take a copy of this with you and pass it around.

Original title and link: Approaches to Backup and Disaster Recovery in HBase (NoSQL database©myNoSQL)


Dropbox: Challenges in mirroring large MySQL systems to HBase

A presentation by Todd Eisenberger about the archival system used by Dropbox based on MySQL and HBase:

MySQL benefits:

  • fast queries for known keys over a (relatively) small dataset
  • high read throughput

HBase benefits:

  • high write throughput
  • large suite of pre-existing tools for distributed computation
  • easier to perform large processing tasks

✚ Both are consistent

✚ Most of the benefits in HBase’s section point in the direction of data processing benefits (and not data storage benefits)

Apache HBase 0.96.0 released after more than 2000 issues resolved

This is an important release for HBase. Both Hortonworks and Cloudera have posts covering it:

HBase 0.94 was released over a year and a half ago.

Original title and link: Apache HBase 0.96.0 released after more than 2000 issues resolved (NoSQL database©myNoSQL)

Results of collaboration on improving the Mean Time to Recovery in HBase

Hortonworks, eBay, and Scaled Risk have been collaborating on improving the mean time to recovery in HBase, and after extensive testing performed at eBay, some results are now available for two scenarios:

  • Node/RegionServer failures while writing
  • Node/RegionServer failures while reading

Original title and link: Results of collaboration on improving the Mean Time to Recovery in HBase (NoSQL database©myNoSQL)

A prolific season for Hadoop and its ecosystem

In four years of writing this blog I haven’t seen such a prolific month:

  • Apache Hadoop 2.2.0 (more links here)
  • Apache HBase 0.96 (here and here)
  • Apache Hive 0.12 (more links here)
  • Apache Ambari 1.4.1
  • Apache Pig 0.12
  • Apache Oozie 4.0.0
  • Plus Presto.

Actually, I don’t think I’ve ever seen an ecosystem like the one created around Hadoop.

Original title and link: A prolific season for Hadoop and its ecosystem (NoSQL database©myNoSQL)

Why NoSQL Can Be Safer than an RDBMS

Robin Schumacher1:

That said, I disagree with many of the article’s statements, the most important being that companies should not consider NoSQL databases as a first choice for critical data. In this article, I’ll show first how a NoSQL database like Cassandra is indeed being used today as a primary datastore for key data and, second, that Cassandra can actually end up being safer than an RDBMS for important information.

You already know how this goes: “First they ignore you, then they laugh at you, then they fight you, then you win”. I’ll let you decide where major NoSQL databases are today.

  1. Robin Schumacher is VP of Products at DataStax. He’s also my boss.

Original title and link: Why NoSQL Can Be Safer than an RDBMS (NoSQL database©myNoSQL)


Quick intro to Apache Cassandra… comic style

You can find it here. Nice job by Alberto Diego Prieto Löfkrantz.

Original title and link: Quick intro to Apache Cassandra… comic style (NoSQL database©myNoSQL)

Cloudera Announces Support for Apache Accumulo - what, how, why

Cloudera, the leader in enterprise analytic data management powered by Apache Hadoop™, today announced its formal support for, and integration with, Apache Accumulo, a highly distributed, massively parallel processing database that is capable of analyzing structured and unstructured data and delivers fine-grained user access control and authentication. Accumulo uniquely enables system administrators to assign data access at the cell level, ensuring that only authorized users can view and manipulate individual data points. This increased control allows a database to be accessed by a maximum number of users, while remaining compliant with data privacy and security regulations.

What about HBase?

Mike Olson:

It offers a strong complement to HBase, which has been part of our CDH offering since 2010, and remains the dominant high-performance delivery engine for NoSQL workloads running on Hadoop. However, Accumulo was expressly built to augment sensitive data workloads with fine-grained user access and authentication controls that are of mission-critical importance for federal and highly regulated industries.

The way I read this is: if you don’t need security, go with HBase; if you need advanced security features, go with Accumulo.
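
To make the cell-level access control concrete, here’s a toy illustration (not Accumulo’s actual implementation) of the idea: every cell carries a visibility expression, and a scan returns only the cells whose expression is satisfied by the reader’s authorizations. Real Accumulo expressions also support parentheses; this sketch handles only flat `&`/`|` combinations:

```python
# Toy model of cell-level visibility: "a&b" requires both tokens,
# "a|b" requires either, a bare token requires itself.

def visible(expression, authorizations):
    auths = set(authorizations)
    return any(all(tok in auths for tok in clause.split("&"))
               for clause in expression.split("|"))

def scan(cells, authorizations):
    # cells: list of (key, value, visibility) triples; only cells the
    # reader is authorized for make it into the result
    return [(k, v) for k, v, vis in cells if visible(vis, authorizations)]
```

The point of the design is that filtering happens inside the database at read time, so a single table can safely serve users with different clearances.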


While there aren’t any details about what formal support means, I assume Cloudera will start offering Accumulo as an alternative to HBase.

I might be wrong though about Accumulo being a replacement for HBase. I’d love to learn how and why the two would coexist.

The obvious reason is that Cloudera wants to win contracts in government and highly regulated markets, where security is a top requirement.

Another reason might be that Cloudera is continuing to expand its portfolio to catch as many customers as possible, something à la Oracle or IBM. The alternative would be to stay focused, like Teradata.
Original title and link: Cloudera Announces Support for Apache Accumulo (NoSQL database©myNoSQL)


Facebook’s Cassandra paper, annotated and compared to Apache Cassandra 2.0

The evolution from the original paper to Cassandra 2.0 in an interesting format:

The release of Apache Cassandra 2.0 is a good point to look back at the past five years of progress after Cassandra’s release as open source. Here, we annotate the Cassandra paper from LADIS 2009 with the new features and improvements that have been added since.

Original title and link: Facebook’s Cassandra paper, annotated and compared to Apache Cassandra 2.0 (NoSQL database©myNoSQL)


Hoya, HBase on YARN, Architecture

The architecture of HBase on top of YARN, a project named Hoya:


The main question I had about what YARN would bring to HBase is answered in the post. But I’m still not sure I get the whole picture of how YARN improves HBase’s availability (if it does):

YARN keeps an eye on the health of the containers, telling the AM when there is a problem. It also monitors the Hoya AM itself. When the AM fails, YARN allocates a new container for it, and restarts it. This provides an availability solution to Hoya without it having to code it in itself.
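
The restart behavior described in the quote can be modeled with a toy supervisor loop; this is an illustration of the idea, not YARN’s code, and all names in it are made up:

```python
# Toy model of the quoted behavior: a monitor checks container health and
# re-allocates a container for any that failed, which is what gives Hoya
# availability without implementing failover itself.

def supervise(containers, allocate, max_restarts=3):
    # containers: dict name -> {"healthy": bool, "restarts": int}
    for name, state in containers.items():
        if not state["healthy"] and state["restarts"] < max_restarts:
            allocate(name)            # ask for a fresh container
            state["healthy"] = True
            state["restarts"] += 1
    return containers
```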

Original title and link: Hoya, HBase on YARN, Architecture (NoSQL database©myNoSQL)


Considering TokuDB as an engine for timeseries data... or Cassandra or OpenTSDB

Vadim Tkachenko:

  • Provide high insertion rate
  • Provide a good compression rate to store more data on expensive SSDs
  • Engine should be SSD friendly (less writes per timeperiod to help with SSD wear)
  • Provide a reasonable response time (within ~50 ms) on SELECT queries on hot recently inserted data

Looking at these requirements, I actually think that TokuDB might be a good fit for this task.

There are solutions in the NoSQL space that are optimized for this scenario: Cassandra or OpenTSDB. Indeed, using one of these will have an impact on the application side.
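
To see why these systems fit timeseries workloads, here’s a simplified sketch of the OpenTSDB-style row-key design (real OpenTSDB packs numeric IDs and time offsets into binary keys; the string format below is only illustrative): bucketing all points for a metric into hourly rows keeps recently inserted hot data contiguous and concentrates writes on few rows.

```python
# Sketch of an OpenTSDB-style timeseries row key: metric name, hour-aligned
# base timestamp, and sorted tags, so one row holds an hour of data points.

def row_key(metric, timestamp, tags):
    base = timestamp - (timestamp % 3600)   # align to the hour
    tag_part = ",".join("%s=%s" % kv for kv in sorted(tags.items()))
    return "%s:%d:%s" % (metric, base, tag_part)
```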

Most of the time, when the requirements dictate looking into different solutions, the easiest costs to estimate are the initial ones: development (nb: this includes not only pure development, but also learning costs, etc.) and hardware.

Unfortunately, many times we fail to take into consideration the long-term costs:

  • maintenance costs (hardware, operations, enhancements)
  • opportunity costs (features that the current architecture won’t be able to support as being either impossible or too expensive)
  • accounting for the risks of failed initial designs (the technical debt costs)

Way too many times we optimize for the initial costs (the general excuse being that familiarity delivers faster, with the more scientific forms being “time to market is essential” and “premature optimization is the root of all evil”), while almost completely ignoring the ongoing costs.

Original title and link: Considering TokuDB as an engine for timeseries data… or Cassandra or OpenTSDB (NoSQL database©myNoSQL)