HBase: All content tagged as HBase in NoSQL databases and polyglot persistence

Introduction to HBase Mean Time to Recover (MTTR) - HBase Resiliency

A fantastic post by Nicolas Liochon and Devaraj Das looking into possible HBase failure scenarios and configurations to reduce the Mean Time to Recover:

There are no global failures in HBase: if a region server fails, all the other regions are still available. For a given data subset, the MTTR was often considered to be around ten minutes. That rule of thumb came from a common case in which recovery took a long time because it tried to use replicas on a dead DataNode; ten minutes was how long HDFS took to declare a node dead. With the new stale mode in HDFS, that's no longer the case, and recovery is now bounded by HBase alone. If you care about MTTR, with the settings mentioned here, most cases will take less than 2 minutes between the actual failure and the data being available again on another region server.

Stepping back for a bit, it looks like the overall complexity comes from the various components involved in HBase (ZooKeeper, HBase itself, HDFS), each with its own failure detection mechanism. If these are not correctly configured and ordered, things can get pretty ugly; ugly as in a longer MTTR than one would expect.
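
For orientation, here is a minimal Java sketch of the kind of knobs the post is about, set through Hadoop's `Configuration` API (in a real deployment these live in hdfs-site.xml and hbase-site.xml; the property names are the standard HDFS stale-DataNode and HBase ZooKeeper session settings, and the values are only illustrative, not the post's recommendations):

```java
import org.apache.hadoop.conf.Configuration;

public class MttrSettingsSketch {
    public static Configuration failureDetectionSettings() {
        Configuration conf = new Configuration();

        // HDFS "stale" mode: stop reading from / writing to DataNodes that have
        // missed recent heartbeats instead of waiting the ~10 minutes it takes
        // to declare them dead.
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
        // How long without a heartbeat before a DataNode is marked stale (ms).
        conf.setLong("dfs.namenode.stale.datanode.interval", 30000L);

        // The ZooKeeper session timeout drives how quickly HBase notices a dead
        // region server: lower means faster detection, but more false positives.
        conf.setInt("zookeeper.session.timeout", 90000);

        return conf;
    }
}
```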

Original title and link: Introduction to HBase Mean Time to Recover (MTTR) - HBase Resiliency (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/


The Master-Slave Architecture of HBase

Fantastic post by Matteo Bertozzi looking at HBase’s master-slave architecture:

At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.

Original title and link: The Master-Slave Architecture of HBase (NoSQL database©myNoSQL)

via: https://blogs.apache.org/hbase/entry/hbase_who_needs_a_master


HBase Data Modeling Tips & Tricks - Timeshifting

Jeff Kolesky describing the data model they are using with HBase and one (strange) trick to reduce the roundtrips to the database:

The idea is to put all of the data about a single entity into a single row in HBase. When you need to run a computation that involves that entity’s data, you have quick access to it by the row key, and all of the data is stored close together on disk.

Additionally, against many suggestions from the HBase community, and general confusion about how timestamps work, we are using timestamps with logical values. Instead of just letting the region server assign a timestamp version to each cell, we are explicitly setting those values so that we can use timestamp as a true queryable dimension in our gets and scans.

In addition to the real timeseries data that is indexed using the cell timestamp, we also have other columns that store metadata about the entity.
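
A minimal sketch of the technique using the current HBase client API (table, family, and qualifier names are invented here; the post's actual schema is more involved):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeshiftingSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("entities"))) {

            byte[] row = Bytes.toBytes("entity-42");
            byte[] family = Bytes.toBytes("d");

            // Instead of letting the region server pick the cell timestamp,
            // set it explicitly to a logical value (here, the observation time).
            long logicalTs = 1364000000000L;
            Put put = new Put(row);
            put.addColumn(family, Bytes.toBytes("metric"), logicalTs, Bytes.toBytes(12.5d));
            table.put(put);

            // The timestamp is now a queryable dimension: fetch only the cells
            // whose logical timestamps fall inside a window.
            Get get = new Get(row);
            get.setTimeRange(1363990000000L, 1364010000000L);
            get.readAllVersions(); // keep every version inside the window
            Result result = table.get(get);
            System.out.println(result);
        }
    }
}
```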

It’s amazing how many smart and weird tricks engineers put into their production systems when they have to deal with real requirements and SLAs.

Original title and link: HBase Data Modeling Tips & Tricks - Timeshifting (NoSQL database©myNoSQL)

via: http://www.heyitsopower.com/code/timeshifting-in-hbase/


Kairosdb - Fast Scalable Time Series Database

KairosDB is introduced as a rewrite of OpenTSDB, targeting primarily Cassandra (nb: OpenTSDB was built on HBase). As for what it brings that's new, the project page lists:

  • Uses Guice to load modules.
  • Incorporates Jetty for REST API and serving up UI.
  • Pure Java build tool (Tablesaw)
  • UI uses Flot and is client side rendered.
  • Ability to customize UI.
  • Relative time now includes month and supports leap years.
  • Modular data store interface supports:
    • HBase
    • Cassandra
    • H2 (For development)
  • Millisecond data support when using Cassandra.
  • REST API for querying and submitting data.
  • Build produces deployable tar, rpm and deb packages.
  • Linux start/stop service scripts.
  • Faster.
  • Made aggregations optional (easier to get raw data).
  • Added abilities to import and export data.
  • Aggregators can aggregate data for a specified period.
  • Aggregators can be stacked or “piped” together.

Source code lives on GitHub. Let’s see where it goes.

Original title and link: Kairosdb - Fast Scalable Time Series Database (NoSQL database©myNoSQL)


HBase Compactions Q&A

Ted Yu summarizes some of the most frequent questions related to compactions in HBase:

On the user mailing list, questions about compaction are probably the most frequently asked.

Original title and link: HBase Compactions Q&A (NoSQL database©myNoSQL)

via: http://zhihongyu.blogspot.com/2013/03/compactions-q.html


Introduction to Apache HBase Snapshots

Matteo Bertozzi introduces HBase snapshots:

Prior to CDH 4.2, the only way to back up or clone a table was to use Copy/Export Table, or, after disabling the table, to copy all the hfiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table, but with a direct impact on Region Server performance. Disabling the table stops all reads and writes, which will almost always be unacceptable.

In contrast, HBase snapshots allow an admin to clone a table without data copies and with minimal impact on Region Servers. Exporting the snapshot to another cluster does not directly affect any of the Region Servers; export is just a distcp with an extra bit of logic.

The part that made me really curious, and that didn’t make much sense when I first read the post, is “clone a table without data copies”. But the post clarifies what a snapshot is:

A snapshot is a set of metadata information that allows an admin to get back to a previous state of the table. A snapshot is not a copy of the table; it’s just a list of file names and doesn’t copy the data. A full snapshot restore means that you get back to the previous “table schema” and you get back your previous data, losing any changes made since the snapshot was taken.
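
For context, the basic snapshot operations look roughly like this through the Java `Admin` API (table and snapshot names are invented for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            TableName table = TableName.valueOf("mytable");

            // Take a snapshot: only metadata and a list of hfile names are
            // recorded; no table data is copied.
            admin.snapshot("mytable_snap_1", table);

            // Clone it into a new table; the clone initially references the
            // same hfiles as the original.
            admin.cloneSnapshot("mytable_snap_1", TableName.valueOf("mytable_clone"));

            // Restoring rolls the (disabled) table back to the snapshot's
            // schema and data, discarding changes made since it was taken.
            admin.disableTable(table);
            admin.restoreSnapshot("mytable_snap_1");
            admin.enableTable(table);

            // Exporting to another cluster is a separate MapReduce/distcp job
            // (the ExportSnapshot tool), not an Admin call.
        }
    }
}
```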

What I still don’t understand is how snapshots keep working after a major compaction (which drops deletes and expired cells).

Original title and link: Introduction to Apache HBase Snapshots (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/03/introduction-to-apache-hbase-snapshots/


Adding Value Through Graph Analysis Using Titan and Faunus

Interesting slide deck by Matthias Broecheler introducing three graph-related tools developed by Vadas Gintautas, Marko Rodriguez, Stephen Mallette, and Daniel LaRocque:

  1. Titan: a massive scale property graph allowing real-time traversals and updates
  2. Faunus: for batch processing of large graphs using Hadoop
  3. Fulgora: for running global graph algorithms on large, compressed, in-memory graphs

The first couple of slides also show some possible use cases where these tools would prove useful.

Original title and link: Adding Value Through Graph Analysis Using Titan and Faunus (NoSQL database©myNoSQL)


Simplifying HBase Schema Development With KijiSchema

Jon Natkins from WibiData:

When building an HBase application, you need to be aware of the intricacies and quirks of HBase. For example, your choice of names for column families, or for the columns themselves, can have a drastic effect on the amount of disk space necessary to store your data. In this article, we’ll see how building HBase applications with KijiSchema can help you avoid inefficient disk utilization.
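
The disk-space effect is easy to see with the raw HBase client: every cell HBase persists repeats the row key, column family, and qualifier next to the value, so verbose names are paid for on every single cell. A small illustration with invented names:

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnNameSizeSketch {
    public static void main(String[] args) {
        byte[] row = Bytes.toBytes("user#1234");

        // Every KeyValue carries the row key, the column family name, and the
        // column qualifier, so verbose names cost bytes on every cell.
        Put verbose = new Put(row);
        verbose.addColumn(Bytes.toBytes("user_profile_information"),
                          Bytes.toBytes("email_address_primary"),
                          Bytes.toBytes("a@example.com"));

        // The same logical column with one-letter names is tens of bytes
        // smaller per cell; multiplied by billions of cells, that adds up.
        Put terse = new Put(row);
        terse.addColumn(Bytes.toBytes("i"),
                        Bytes.toBytes("e"),
                        Bytes.toBytes("a@example.com"));

        // In-memory sizes of the two Puts, as a rough proxy for the
        // per-cell overhead of long names.
        System.out.println("verbose heap size: " + verbose.heapSize());
        System.out.println("terse heap size:   " + terse.heapSize());
    }
}
```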

The recommendations related to the length of column names are one of those subtle signs of how young the NoSQL space is¹.


  1. This is not specific to HBase; the same advice applies to MongoDB, RethinkDB, etc.

Original title and link: Simplifying HBase Schema Development With KijiSchema (NoSQL database©myNoSQL)

via: http://www.kiji.org/2012/03/01/using-disk-space-efficiently-with-kiji-schema


Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem

Avik Dey announced Intel's new open source project on the Hadoop mailing list:

As the Apache Hadoop ecosystem extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with Hadoop must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.

Project Rhino targets security at all levels, from encryption and key management to cell-level ACLs to audit logging.

Original title and link: Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem (NoSQL database©myNoSQL)

via: http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201302.mbox/%3cCD5137E5.15610%25avik.dey@intel.com%3e


Flatten Entire HBase Column Families With Pig and Python UDFs

Chase Seibert:

Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode dimensions, such as date and counter type.
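
The post solves this with Pig and a Python UDF; purely to illustrate the underlying idea, here is a plain Java sketch (with an invented qualifier convention) that walks an entire column family and splits each qualifier into the dimensions it encodes:

```java
import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class FlattenFamilySketch {
    // Assumes qualifiers of the form "<yyyyMMdd>:<counterType>", e.g.
    // "20130210:clicks" -- an invented convention for illustration.
    public static void flatten(Result row) {
        byte[] family = Bytes.toBytes("counters");
        NavigableMap<byte[], byte[]> cells = row.getFamilyMap(family);
        for (Map.Entry<byte[], byte[]> cell : cells.entrySet()) {
            String qualifier = Bytes.toString(cell.getKey());
            String[] dims = qualifier.split(":", 2);
            String date = dims[0];
            String counterType = dims.length > 1 ? dims[1] : "";
            long value = Bytes.toLong(cell.getValue());
            // One flat record per cell: (rowKey, date, counterType, value).
            System.out.printf("%s\t%s\t%s\t%d%n",
                    Bytes.toString(row.getRow()), date, counterType, value);
        }
    }
}
```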

Original title and link: Flatten Entire HBase Column Families With Pig and Python UDFs (NoSQL database©myNoSQL)

via: http://chase-seibert.github.com/blog/2013/02/10/pig-hbase-flatten-column-family.html


Apache HBase Internals: Locking and Multiversion Concurrency Control

Gregory Chanan explains the per-row ACID semantics of HBase and how row-level locks and MVCC are used to ensure them (a minimal sketch of the read-point bookkeeping follows the two lists below):

For writes:

  1. (w1) After acquiring the RowLock, each write operation is immediately assigned a write number.
  2. (w2) Each data cell in the write stores its write number.
  3. (w3) A write operation completes by declaring it is finished with the write number.

For reads:

  1. (r1) Each read operation is first assigned a read timestamp, called a read point.
  2. (r2) The read point is assigned to be the highest integer x such that all writes with write number <= x have been completed.
  3. (r3) A read r for a certain (row, column) combination returns the data cell with the matching (row, column) whose write number is the largest value that is less than or equal to the read point of r.
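
This is not HBase's actual code, but a toy model of the bookkeeping described above: writes receive monotonically increasing write numbers, and a reader's read point is the highest number x such that every write numbered <= x has completed:

```java
import java.util.TreeSet;

/** A toy model of the read/write-number protocol described in the post,
 *  not HBase's real implementation. */
public class MvccSketch {
    private long nextWriteNumber = 0;
    // Write numbers handed out but not yet completed, in ascending order.
    private final TreeSet<Long> pendingWrites = new TreeSet<>();
    private long readPoint = 0;

    // (w1) Each write is assigned a write number as soon as it starts
    // (after the row lock has been acquired).
    public synchronized long beginWrite() {
        long writeNumber = ++nextWriteNumber;
        pendingWrites.add(writeNumber);
        return writeNumber;
    }

    // (w3) The write declares itself finished; the read point advances to
    // the highest x such that every write <= x has completed (r2).
    public synchronized void completeWrite(long writeNumber) {
        pendingWrites.remove(writeNumber);
        readPoint = pendingWrites.isEmpty() ? nextWriteNumber
                                            : pendingWrites.first() - 1;
    }

    // (r1) A read takes the current read point; it then only returns cells
    // whose write number is <= this point (r3).
    public synchronized long beginRead() {
        return readPoint;
    }
}
```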

It probably goes without saying that you should read and save this article if HBase is already in your data center, or at least on the horizon.

Original title and link: Apache HBase Internals: Locking and Multiversion Concurrency Control (NoSQL database©myNoSQL)

via: https://blogs.apache.org/hbase/entry/apache_hbase_internals_locking_and


SQL Over HBase With Phoenix

Released by the Salesforce team, Phoenix adds a SQL layer on top of HBase and an almost complete JDBC driver.

Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
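
Because Phoenix exposes itself through JDBC, using it looks like ordinary Java database code. A minimal sketch, assuming a local ZooKeeper quorum and an invented table (Phoenix connection URLs take the form `jdbc:phoenix:<zookeeper quorum>`; UPSERT is Phoenix's combined insert/update statement):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSketch {
    public static void main(String[] args) throws Exception {
        // The Phoenix JDBC URL points at the HBase cluster's ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {

            // DDL is plain SQL; Phoenix maps the table onto an HBase table.
            stmt.execute("CREATE TABLE IF NOT EXISTS metrics ("
                    + " host VARCHAR NOT NULL,"
                    + " event_time BIGINT NOT NULL,"
                    + " metric_value DOUBLE"
                    + " CONSTRAINT pk PRIMARY KEY (host, event_time))");

            // Phoenix uses UPSERT rather than separate INSERT/UPDATE statements.
            stmt.executeUpdate("UPSERT INTO metrics VALUES ('web01', 1364000000000, 12.5)");
            conn.commit(); // Phoenix connections are not auto-commit by default

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT host, MAX(metric_value) FROM metrics GROUP BY host")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
                }
            }
        }
    }
}
```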

The project already has a page about performance, and the results look great. For a bullet-list summary, check out James Taylor's post.

Original title and link: SQL Over HBase With Phoenix (NoSQL database©myNoSQL)