ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

HBase: All content tagged as HBase in NoSQL databases and polyglot persistence

Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0

Let’s start the year with a quick review of the latest releases that happened in December. Make sure that you scroll to the end as there are quite a few important ones.

MongoDB 2.0.2

Announced on Dec.15th, MongoDB 2.0.2 is a bug fix release:

  • Hit config server only once per mongos on meta data change to not overwhelm
  • Removed unnecessary connection close and open between mongos and mongod after getLastError
  • Replica set primaries close all sockets on stepDown()
  • Do not require authentication for the buildInfo command
  • scons option for using system libraries

Apache Hive 0.8.0

Apache Hive 0.8.0 came out on Dec.19th. The list of new features, improvements, and bug fixes is extremely long.

Just as a side note, who came out with the idea of having a Hive fans’ page on Facebook?

Apache ZooKeeper 3.4.2

ZooKeeper 3.4.0 has been followed up shortly by two new minor version updates fixing some critical bugs. The list of issues fixed in ZooKeeper 3.4.1 can be found here and for ZooKeeper 3.4.2 the 2 fixed bugs are listed here.

As with ZooKeeper 3.4.0, these versions are not yet production ready.

Apache Whirr 0.7.0

Apache Whirr 0.7.0 has been released on Dec.21st featuring 56 improvements and bug fixes including support for Puppet & Chef, and Mahout and Ganglia as a service. The complete list can be found here.

Some more details about Whirr 0.7.0 can be found here.

Apache HBase 0.90.5

Released Dec.23rd, HBase 0.90.5 packs 81 bug fixes. The complete list can be found here.

Redis 2.4.5

Redis 2.4.5 was released on Dec.23rd and provides 4 bug fixes:

  • [BUGFIX] Fixed a ZUNIONSTORE/ZINTERSTORE bug that can cause a NaN to be inserted as a sorted set element score. This happens when one of the elements has +inf/-inf score and the weight used is 0.
  • [BUGFIX] Fixed memory leak in CLIENT INFO.
  • [BUGFIX] Fixed a non critical SORT bug (Issue 224).
  • [BUGFIX] Fixed a replication bug: now the timeout configuration is respected during the connection with the master.
  • --quiet option implemented in the Redis test.

Last but definitely one of the most important announcements that came in December:

Hadoop 1.0.0

Based on the 0.20-security code line, Hadoop 1.0.0 was announced on Dec.29. This release includes support for:

  • HBase (append/hsynch/hflush) and Security
  • Webhdfs (with full support for security)
  • Performance enhanced access to local files for HBase
  • Other performance enhancements, bug fixes, and features
  • All version 0.20.205 and prior 0.20.2xx features

Complete release notes are available here.

Stéphane Fréchette, Ryan Slobojan, Duane Moore, Arun C. Murthy

And with this we are ready for 2012.

Original title and link: Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0 (NoSQL database©myNoSQL)


Why We Chose HBase for AppFirst APM

Its performance had a significant impact on our decision making as well. It sustains an enormous number of writes and the read cycle times were much better than we had anticipated. Further, it gives us the option to interact with the Hadoop Ecosystem, including HDFS, Mapreduce, and Zookeeper frameworks. Our enthusiasm for HBase skyrocketed when we discovered how to create map-reduce apps to do a number of management tasks. While Cassandra also has these capabilities, its data model was fundamentally more complex.

What if the whole post would have said: we chose HBase because of

  1. its seamless integration in the Hadoop ecosystem
  2. the scalable time series OpenTSDB is built on top of HBase?

Original title and link: Why We Chose HBase for AppFirst APM (NoSQL database©myNoSQL)

via: http://blog.appfirst.com/2011/12/22/why-we-chose-hbase/


NoSQL Screencast: HBase Schema Design

In this O’Reilly webcast, long time HBase developer and Cloudera HBase/Hadoop architect Lars George discusses the underlying concepts of the storage layer in HBase and how to do model data in HBase for best possible performance.


Facebook: There Are No Published Cases of NoSQL Databases Operating at the Scale of Facebook’s MySQL Database

Joe Maguire referring to the Facebook talk embedded below MySQL and HBase:

if Facebook doesn’t need NoSQL, who does?

My answer: many of those that cannot employ a specialized team to hack the hell out of MySQL to make it work at that scale.

On the flipside, many other companies don’t have the time or engineering power to grow their product together with a NoSQL database.

via: http://josephmaguire.blogspot.com/2011/12/facebook-there-are-no-published-cases.html


Backing Up HBase to Amazon S3

This is a guest post by Bizosys Team creators of HSearch, an opensource, NoSQL, distributed, real-time search engine built on Hadoop and HBase.

We have evaluated various options to backup data inside HBase and built a solution. This post will explain the options and also provide the solution for anyone to download and implement it for their own HBase installations.

Option Pros Cons
Backup the Hadoop DFS Block data files are backed up quickly.

Even if there is no visible external load on HBase, HBase internal processes such as region balancing, compaction goes on updating the HDFS blocks. So a raw copy may result in an inconsistence state.

Secondly, Hadoop, HBase as well as Hadoop HDFS keeps data in memory and flush at periodic intervals. So raw copy may result in an inconsistent state.

HBase Import and Export tool The Map-Reduce Job downloads data to the given output path. Providing a path like s3://backupbucket/ the program fails with exceptions like: Jets3tFileSystemStore failed with AWSCredentials.
HBase Table Copy tools Another parallel replicated setup to switch. Huge investment to keep running another parallel environment to replicate production data.

After considering these options we developed a simple tool, which backs up  data to Amazon S3 and restores it when needed. Another requirement is to take a full backup over weekend and a daily incremental backup.

In a recovery scenario, it should firstly initiate a clean environment with all tables created and populated with latest full backup data. Then it should apply all incremental backups sequentially. However, with this method, deletes are not captured and this may lead to some unnecessary data in tables. This is a known disadvantage for this method of backup and restore.

This backup program uses internally the HBase Import and Export tools to execute the programs in a Map-Reduce way.

Top 10 Features of the backup tool

  1. Export complete data for the given set of tables to S3 bucket.
  2. Export incrementally data for the given set of tables to S3 bucket.
  3. List all complete as well as incremental backup repositories.
  4. Restore a table from backup based on the given backup repository.
  5. Runs in Map-Reduce
  6. In case of connection failure, retries with increasing delays
  7. Handles special characters like _ which creates the export and import activities.
  8. Enhancement of existing Export and Import tool with detail logging to report a failure than just exiting with a program status of 1.
  9. Works in human readable time format for taking, listing and restoring of backup than using system tick time or unix EPOCH time (Time represented as a Number than readabale format as YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE)
  10. All parameters are taken from command line which allows the cron job to run this at regular interval.

Setting up the tool

  1. Download the package from hbackup.install.tar
    This package includes the necessary jar files and the source code.
  2. Setup a configuration file. Download the hbase-site.xml file. Add to this fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey, fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties
  3. Setup the class path with all jars existing inside the hbase/lib directory, hbase.jar file, java-xmlbuilder-0.4.jar, jets3t-0.8.1a.jar and hbackup-1.0-core.jar file bundled inside the downloaded hbackup.install.tar. Make sure hbackup-1.0-core.jar at the beginning of the classpath. In addition to this add the configuration directory to CLASSPATH which has kept hbase-site.xml file.

Running the tool

Usage: It runs in 4 modes as [backup.full], [backup.incremental], [backup.history] and [restore].

[backup.full]

mode=backup.full tables="comma separated tables" backup.folder=S3-Path  date="YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE"

Example:

mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/ date="2011.12.01 17:03:38:546 IST"
mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/

[backup.incremental]

mode=backup.incremental tables="comma separated tables" backup.folder=S3-Path duration.mins=Minutes

Example of backup of changes occurred in the last 30 minutes:

mode=backup.incremental backup.folder=s3://S3BucketABC/ duration.mins=30 tables=tab1,tab2,tab3

backup.history

mode=backup.history backup.folder=S3-Path

Example of listing past archives. Incremental ones end with .incr

mode=backup.history backup.folder=s3://S3BucketABC/

[restore]

mode=restore  backup.folder=S3-Path/ArchieveDate tables="comma separated tables"

Example of adding the rows archived during that date. First apply a full backup and then apply incremental backups.

mode=backup.history backup.folder=s3://S3-Path/DAY_MON_HH_MI_SS_SSS_ZZZ_YYYY tables=tab1,tab2,tab3

Sample scripts to run the backup tool

Setup:

$ cat setenv.sh
 for file in `ls /mnt/hbase/lib`
 do
 export CLASSPATH=$CLASSPATH:/mnt/hbase/lib/$file;
 done

 export CLASSPATH=/mnt/hbase/hbase-0.90.4.jar:$CLASSPATH

 export CLASSPATH=/mnt/hbackup/hbackup-1.0-core.jar:/mnt/hbackup/java-xmlbuilder-0.4.jar:/mnt/hbackup/jets3t-0.8.1a.jar:/mnt/hbackup/conf:$CLASSPATH

Full backup:

 $ cat backup_full.sh
 . /mnt/hbackup/bin/setenv.sh

 dd=`date "+%Y.%m.%d %H:%M:%S:000 %Z"`
 echo Backing up for date $dd
 for table in `echo table1 table2 table3`
 do
 /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.full backup.folder=s3://mybucket/ tables=$table "date=$dd"
 sleep 10
 done

List of backups:

 $ cat list.sh
 . /mnt/hbackup/bin/setenv.sh
 /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.history backup.folder=s3://mybucket

Original title and link: Backin Up HBase to Amazon S3 (NoSQL database©myNoSQL)


DataSift Using MySQL, HBase, Memcached to Deal With Twitter Firehose

A new great article from Todd Hoff dissecting the DataSift architecture:

DataSift architecture

Click for a larger image

In terms of data store, DataSift architecture includes:

  • MySQL (Percona server) on SSD drives
  • HBase cluster (currently, ~30 hadoop nodes, 400TB of storage)
  • Memcached (cache)
  • Redis (still used for some internal queues, but probably going to be dismissed soon)

Leave whatever you were doing and go read it now.

Original title and link: DataSift Using MySQL, HBase, Memcached to Deal With Twitter Firehose (NoSQL database©myNoSQL)


How to Implement an IMAP Server on Top of a CouchDB/NoSQL Data Store?

Interesting question on SO:

To summarize my objective here, I am really just looking for a simple, opensource method which allows me to create and maintain a (preferably noSQL db) backup/archieve of one/more remote IMAP email accounts on a per user basis and sync each individual users email accounts using a simple, low cost solution which easily scales out, consumes server resources in an efficient maner with the ADDED ABILITY that each user needs to be able to connect to his central email archive by simply addingba new imap account to his existing email client using an imap server, username and password provided through this archive server/setup.

This reminded me of a GSOC project to design and implement a distributed mailbox on top of Hadoop HDFS as part of the Apache James project. The project description can be found on this JIRA ticket and more details here:

We need to implement mailbox storage as a distributed system on top of Hadoop HDFS. The James mailbox API will be used. A first step is to design how to interact with Hadoop (native api, gora incubator at apache,…) and deal with specific performance questions related to mail loading/parsing in a distributed system (use map/reduce or not, use existing local lucene indexes for search,…). The second step is to implement the HDFS mailbox (maildir mailbox is similar because is stores mails as a file and can be an inspiration). A single James server will still be deployed because we don’t have any distributed UID generation.

According to the last comments on the ticket, this project was completed Ioan Eugen Stan under Eric Charles’ mentorship.

Original title and link: How to Implement an IMAP Server on Top of a CouchDB/NoSQL Data Store? (NoSQL database©myNoSQL)

via: http://stackoverflow.com/questions/8276110/how-to-implement-an-imap-server-on-top-of-a-couchdb-nosql-data-store


Odiago WibiData: Analytics Startup Powered by HBase and Hadoop

A new startup powered by HBase and Hadoop, founded by one of Cloudera’s founders Christophe Bisciglia and Hadoop developer and ex-Cloudera Aaron Kimball, focusing on investigative and operational analytics on consumer Internet data:

  • ALL data pertaining to a single user (or mobile device) is kept in a single, possibly very long, HBase row.
  • There are two primary operators in WibiData, Produce and Gather.
  • Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.
  • It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.
  • Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.
  • HBase schema management is done at the WibiData system level, not directly in applications. There’s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.

One aspect that I’m not familiar with is how HBase can handle multitenancy, a requirement for services like WibiData.

As a side note, I assume this is the type of startups Accel’s $100m fund for Big Data, Hadoop, and NoSQL Databases is targetting.

Original title and link: Odiago WibiData: Analytics Startup Powered by HBase and Hadoop (NoSQL database©myNoSQL)

via: http://www.dbms2.com/2011/11/02/5576/


NoSQL: A Three-Horse Race

James Philips (Couchbase) quoted by Curt Monash:

NoSQL is simply a three-horse race between Couchbase, MongoDB, and Cassandra.

Off the top of my head I could name at least two other projects that are either having numerous deployments or are already managing huge amounts of data. And I’d bet every regular reader would figure out that I’m referring to Redis and HBase.

Original title and link: NoSQL: A Three-Horse Race (NoSQL database©myNoSQL)

via: http://www.dbms2.com/2011/10/23/nosql-notes/


How Does Google MegaStore Compare Against HDFS/HBase?

Alex Feinberg answering the question in the title:

This is like saying “how does a General Motors bus compare against a Ford engine”. MegaStore is built on of Google’s BigTable/GFS. HBase/HDFS are BigTable/HDFS work-alikes.

BigTable and HBase give up availability (in the CAP Theorem sense) in favour of consistency: when a tablet master node (HRegionServer in HBase) goes down, the portion of the keyspace the failed node is responsible for becomes (briefly) unavailable until another node takes over the portion of the key space. This is efficient, as the data/write-ahead-log is stored GFS (or HDFS): in a way serializing writes to GFS/HDFS (a file system with relaxed consistency semantics) through a single node ensures serializable consistency.

Make sure you read it all.

Original title and link: How Does Google MegaStore Compare Against HDFS/HBase? (NoSQL database©myNoSQL)

via: http://www.quora.com/How-does-Google-MegaStore-compare-against-HDFS-HBase


State of HBase With Michael Stack

Michael Stack (StumbleUpon & Hadoop PMC) presents on some of the more interesting HBase deployments, HBase scenario usages, HBase and HDFS, and near-future of HBase: