NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



hbase: All content about hbase in NoSQL databases and polyglot persistence

Cloudera Announces Support for Apache Accumulo - what, how, why

Cloudera, the leader in enterprise analytic data management powered byApache Hadoop™, today announced its formal support for, and integration with, Apache Accumulo, a highly distributed, massively parallel processing database that is capable of analyzing structured and unstructured data and delivers fine-grained user access control and authentication. Accumulo uniquely enables system administrators to assign data access at the cell- level, ensuring that only authorized users can view and manipulate individual data points. This increased control allows a database to be accessed by a maximum number of users, while remaining compliant with data privacy and security regulations.

What about HBase?

Mike Olson:

It offers a strong complement to HBase, which has been part of our CDH offering since 2010, and remains the dominant high-performance delivery engine for NoSQL workloads running on Hadoop. However, Accumulo was expressly built to augment sensitive data workloads with fine-grained user access and authentication controls that are of mission-critical importance for federal and highly regulated industries.

The way I read this is: if you don’t need security go with HBase. If you need advanced security features you go with Accumulo.


While there aren’t any details about what formal support means, I assume Cloudera will start offering Accumulo as an alternative to HBase.


I might be wrong though about Accumulo being a replacement for HBase. I’d love to learn how and why the 2 would co-exist.


The obvious reason is that Cloudera wants to get into government and super-regulated markets contracts where security is a top requirement.

Another reason might be that Cloudera is continuing to expand its portfolio to catch as many customers as possible. Something à la Oracle or IBM. The alternative would be to stay focused. Like Teradata.

Original title and link: Cloudera Announces Support for Apache Accumulo (NoSQL database©myNoSQL)


Hoya, HBase on YARN, Architecture

The architecture of HBase on top of YARN, a project named Hoya:


The main question I had about what YARN would bring to HBase is answered in the post. But I’m still not sure I get the whole picture of how YARN improves HBase’s availability (if it does it):

YARN keeps an eye on the health of the containers, telling the AM when there is a problem. It also monitors the Hoya AM itself. When the AM fails, YARN allocates a new container for it, and restarts it. This provides an availability solution to Hoya without it having to code it in itself.

Original title and link: Hoya, HBase on YARN, Architecture (NoSQL database©myNoSQL)


Big Data Debate: HBase or Cassandra

This debate about the pros and cons of HBase and Cassandra set up by Doug Henschen for InformationWeek and featuring Jonathan Ellis (Cassandra, DataStax) and Michael Hausenbias (MapR) will stir some strong feelings:

Michael Hausenbias: An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for their internal use.

Jonathan Ellis: The technical shortcomings driving HBase’s lackluster adoption fall into two major categories: engineering problems that can be addressed given enough time and manpower, and architectural flaws that are inherent to the design and cannot be fixed.

✚ One question I couldn’t answer about this dialog is why HBase-side wasn’t covered by either a HBase community member or a user. Indeed MapR has interest in HBase, but their product is not HBase.

Original title and link: Big Data Debate: HBase or Cassandra (NoSQL database©myNoSQL)


HBase migration to the new Hadoop Metrics2 system

Elliott Clarke explains a bit the work that his doing in migrating the HBase metrics to Hadoop’s Metrics2 system:

As HBase’s metrics system grew organically, Hadoop developers were making a new version of the Metrics system called Metrics2. In HADOOP-6728 and subsequent JIRAs, a new version of the metrics system was created. This new subsystem has a new name space, different sinks, different sources, more features, and is more complete than the old metrics. When the Metrics2 system was completed, the old system (aka Metrics1) was deprecated. With all of these things in mind, it was time to update HBase’s metrics system so HBASE-4050 was started. I also wanted to clean up the implementation cruft that had accumulated.

The post is more about the specific implementation details than the wide range of metrics HBase already supports and how this new system would unify and allow extending it.

Original title and link: HBase migration to the new Hadoop Metrics2 system (NoSQL database©myNoSQL)


Introduction to HBase Mean Time to Recover (MTTR) - HBase Resiliency

A fantastic post by Nicolas Liochon and Devaraj Das looking into possible HBase failure scenarios and configurations to reduce the Mean Time to Recover:

There are no global failures in HBase: if a region server fails, all the other regions are still available. For a given data-subset, the MTTR was often considered as around ten minutes. This rule of thumb was actually coming from a common case where the recovery was taking time because it was trying to use replicas on a dead datanode. Ten minutes would be the time taken by HDFS to declare a node as dead. With the new stale mode in HDFS, it’s not the case anymore, and the recovery is now bounded by HBase alone. If you care about MTTR, with the settings mentioned here, most cases will take less than 2 minutes between the actual failure and the data being available again in another region server.

Stepping away for a bit, it looks like the overall complexity comes from the various components involved in HBase (ZooKeeper, HBase, HDFS) and their own failure detection mechanisms. If they are not correctly configured and ordered, things can get pretty ugly; ugly as in longer MTTR than one would expect.

Original title and link: Introduction to HBase Mean Time to Recover (MTTR) - HBase Resiliency (NoSQL database©myNoSQL)


The Master-Slave Architecture of HBase

Fantastic post by Matteo Bertozzi looking at HBase’s master-slave architecture:

At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.

Original title and link: The Master-Slave Architecture of HBase (NoSQL database©myNoSQL)


HBase Data Modeling Tips & Tricks - Timeshifting

Jeff Kolesky describing the data model they are using with HBase and one (strange) trick to reduce the roundtrips to the database:

The idea is to put all of the data about a single entity into a single row in HBase. When you need to run a computation that involves that entity’s data, you have quick access to it by the row key, and all of the data is stored close together on disk.

Additionally, against many suggestions from the HBase community, and general confusion about how timestamps work, we are using timestamps with logical values. Instead of just letting the region server assign a timestamp version to each cell, we are explicitly setting those values so that we can use timestamp as a true queryable dimension in our gets and scans.

In addition to the real timeseries data that is indexed using the cell timestamp, we also have other columns that store metadata about the entity.

It’s amazing how many smart and weird tricks engineers put in their production systems when having to deal with real requirements and SLAs.

Original title and link: HBase Data Modeling Tips & Tricks - Timeshifting (NoSQL database©myNoSQL)


Kairosdb - Fast Scalable Time Series Database

kairosdb is introduced as a rewrite of the OpenTSDB written primarily for Cassandra (nb: OpenTSDB was based on HBase). In terms of what it brings new, this page lists:

  • Uses Guice to load modules.
  • Incorporates Jetty for Rest API and serving up UI.
  • Pure Java build tool (Tablesaw)
  • UI uses Flot and is client side rendered.
  • Ability to customize UI.
  • Relative time now includes month and supports leap years.
  • Modular data store interface supports:
    • HBase
    • Cassandra
    • H2 (For development)
  • Milliseconds data support when using Cassandra.
  • Rest API for querying and submitting data.
  • Build produces deployable tar, rpm and deb packages.
  • Linux start/stop service scripts.
  • Faster.
  • Made aggregations optional (easier to get raw data).
  • Added abilities to import and export data.
  • Aggregators can aggregate data for a specified period.
  • Aggregators can be stacked or “piped” together.

Source code lives on GitHub. Let’s see where it goes.

Original title and link: Kairosdb - Fast Scalable Time Series Database (NoSQL database©myNoSQL)

HBase Compactions Q&A

Ted Yu summarizes some of the most frequent questions related to compactions in HBase:

On user mailing list, questions about compaction are probably the most frequently asked.

Original title and link: HBase Compactions Q&A (NoSQL database©myNoSQL)


Introduction to Apache HBase Snapshots

Matteo Bertozzi introduces HBase snapshots:

Prior to CDH 4.2, the only way to back-up or clone a table was to use Copy/Export Table, or after disabling the table, copy all the hfiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table but with a direct impact on Region Server performance. Disabling the table stops all reads and writes, which will almost always be unacceptable.

In contrast, HBase snapshots allow an admin to clone a table without data copies and with minimal impact on Region Servers. Exporting the snapshot to another cluster does not directly affect any of the Region Servers; export is just a distcp with an extra bit of logic.

The part that made me really curious and that didn’t make too much sense when first reading the post is “clone a table without data copies”. But the post clarifies what the snapshot is:

A snapshot is a set of metadata information that allows an admin to get back to a previous state of the table. A snapshot is not a copy of the table; it’s just a list of file names and doesn’t copy the data. A full snapshot restore means that you get back to the previous “table schema” and you get back your previous data losing any changes made since the snapshot was taken.

What I still don’t understand is how snapshots are working after a major compaction (which drops deletes and expired cells).

Original title and link: Introduction to Apache HBase Snapshots (NoSQL database©myNoSQL)


Adding Value Through Graph Analysis Using Titan and Faunus

Interesting slidedeck by Matthias Broecheler introducing 3 graph-related tools developed by Vadas Gintautas, Marko Rodriguez, Stephen Mallette and Daniel LaRocque:

  1. Titan: a massive scale property graph allowing real-time traversals and updates
  2. Faunus: for batch processing of large graphs using Hadoop
  3. Fulgora: for global running graph algorithms on large, compressed, in-memory graphs

The first couple of slides are also showing some possible use cases where these tools would prove their usefulness:

Original title and link: Adding Value Through Graph Analysis Using Titan and Faunus (NoSQL database©myNoSQL)

Simplifying HBase Schema Development With KijiSchema

Jon Natkins from WibiData:

When building an HBase application, you need to be aware of the intricacies and quirks of HBase. For example, your choice of names for column families, or columns themselves can have a drastic effect on the amount of disk space necessary to store your data. In this article, we’ll see how building HBase applications with KijiSchema can help you avoid inefficient disk utilization.

The recommendations related to the length of column names is a one of those subtle signs of how young the NoSQL space is1.

  1. This is not specific only to HBase, but also MongoDB, RethinkDB, etc. 

Original title and link: Simplifying HBase Schema Development With KijiSchema (NoSQL database©myNoSQL)