HBase: All content tagged as HBase in NoSQL databases and polyglot persistence
Tuesday, 3 January 2012
Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0
Let’s start the year with a quick review of the latest releases that happened in December. Make sure that you scroll to the end as there are quite a few important ones.
MongoDB 2.0.2
Announced on Dec.15th, MongoDB 2.0.2 is a bug fix release:
- Hit config server only once per mongos on meta data change to not overwhelm
- Removed unnecessary connection close and open between mongos and mongod after getLastError
- Replica set primaries close all sockets on stepDown()
- Do not require authentication for the buildInfo command
- scons option for using system libraries
Apache Hive 0.8.0
Apache Hive 0.8.0 came out on Dec.19th. The list of new features, improvements, and bug fixes is extremely long.
Just as a side note, who came up with the idea of having a Hive fans’ page on Facebook?
Apache ZooKeeper 3.4.2
ZooKeeper 3.4.0 was shortly followed by two minor updates fixing some critical bugs. The list of issues fixed in ZooKeeper 3.4.1 can be found here and for ZooKeeper 3.4.2 the 2 fixed bugs are listed here.
As with ZooKeeper 3.4.0, these versions are not yet production ready.
Apache Whirr 0.7.0
Apache Whirr 0.7.0 was released on Dec.21st, featuring 56 improvements and bug fixes, including support for Puppet & Chef, and Mahout and Ganglia as a service. The complete list can be found here.
Some more details about Whirr 0.7.0 can be found here.
Apache HBase 0.90.5
Released Dec.23rd, HBase 0.90.5 packs 81 bug fixes. The complete list can be found here.
Redis 2.4.5
Redis 2.4.5 was released on Dec.23rd and provides 4 bug fixes:
- [BUGFIX] Fixed a ZUNIONSTORE/ZINTERSTORE bug that can cause a NaN to be inserted as a sorted set element score. This happens when one of the elements has +inf/-inf score and the weight used is 0.
- [BUGFIX] Fixed memory leak in CLIENT INFO.
- [BUGFIX] Fixed a non critical SORT bug (Issue 224).
- [BUGFIX] Fixed a replication bug: now the timeout configuration is respected during the connection with the master.
- --quiet option implemented in the Redis test.
Last but definitely one of the most important announcements that came in December:
Hadoop 1.0.0
Based on the 0.20-security code line, Hadoop 1.0.0 was announced on Dec.29. This release includes support for:
- HBase (append/hsync/hflush) and Security
- Webhdfs (with full support for security)
- Performance enhanced access to local files for HBase
- Other performance enhancements, bug fixes, and features
- All version 0.20.205 and prior 0.20.2xx features
Complete release notes are available here.
Stéphane Fréchette, Ryan Slobojan, Duane Moore, Arun C. Murthy
And with this we are ready for 2012.
Original title and link: Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0 (©myNoSQL)
Thursday, 22 December 2011
Why We Chose HBase for AppFirst APM
Its performance had a significant impact on our decision making as well. It sustains an enormous number of writes and the read cycle times were much better than we had anticipated. Further, it gives us the option to interact with the Hadoop Ecosystem, including HDFS, Mapreduce, and Zookeeper frameworks. Our enthusiasm for HBase skyrocketed when we discovered how to create map-reduce apps to do a number of management tasks. While Cassandra also has these capabilities, its data model was fundamentally more complex.
What if the whole post had said: we chose HBase because of
- its seamless integration in the Hadoop ecosystem
- OpenTSDB, the scalable time series database, being built on top of HBase?
Original title and link: Why We Chose HBase for AppFirst APM (©myNoSQL)
via: http://blog.appfirst.com/2011/12/22/why-we-chose-hbase/
Saturday, 17 December 2011
NoSQL Screencast: HBase Schema Design
In this O’Reilly webcast, long time HBase developer and Cloudera HBase/Hadoop architect Lars George discusses the underlying concepts of the storage layer in HBase and how to model data in HBase for the best possible performance.
Sunday, 11 December 2011
Facebook: There Are No Published Cases of NoSQL Databases Operating at the Scale of Facebook’s MySQL Database
Joe Maguire, referring to the Facebook talk on MySQL and HBase:
if Facebook doesn’t need NoSQL, who does?
My answer: many of those that cannot employ a specialized team to hack the hell out of MySQL to make it work at that scale.
On the flipside, many other companies don’t have the time or engineering power to grow their product together with a NoSQL database.
via: http://josephmaguire.blogspot.com/2011/12/facebook-there-are-no-published-cases.html
Wednesday, 7 December 2011
Backing Up HBase to Amazon S3
This is a guest post by the Bizosys team, creators of HSearch, an open source, NoSQL, distributed, real-time search engine built on Hadoop and HBase.
We evaluated various options for backing up data inside HBase and built a solution. This post will explain the options and also provide the solution for anyone to download and implement for their own HBase installations.
After considering these options we developed a simple tool, which backs up data to Amazon S3 and restores it when needed. Another requirement was to take a full backup over the weekend and a daily incremental backup.
In a recovery scenario, the tool should first set up a clean environment with all tables created and populated with the latest full backup data. Then it should apply all incremental backups sequentially. However, with this method deletes are not captured, which may leave some unnecessary data in the tables. This is a known disadvantage of this method of backup and restore.
Internally, the backup program uses the HBase Import and Export tools, which run as Map-Reduce jobs.
Top 10 Features of the backup tool
- Export complete data for the given set of tables to an S3 bucket.
- Export data incrementally for the given set of tables to an S3 bucket.
- List all complete as well as incremental backup repositories.
- Restore a table from backup based on the given backup repository.
- Runs in Map-Reduce.
- In case of connection failure, retries with increasing delays.
- Handles special characters such as _, which otherwise cause problems for the export and import activities.
- Enhances the existing Export and Import tools with detailed logging that reports a failure, rather than just exiting with a program status of 1.
- Uses a human-readable time format (YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE) for taking, listing, and restoring backups, rather than system tick time or Unix epoch time (time represented as a number).
- All parameters are taken from the command line, which allows a cron job to run the tool at regular intervals.
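The post does not show how the retries with increasing delays are done. A minimal sketch of the idea as a shell wrapper, assuming the delay doubles after each failure (the post only says the delays increase). The function name retry_with_backoff and its arguments are hypothetical and not taken from the tool:

```shell
#!/bin/sh
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS
# attempts have failed. The doubling policy is an assumption.
retry_with_backoff() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        # Increase the wait before the next attempt.
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

For example, calling retry_with_backoff 5 10 followed by a command would try it up to five times, waiting 10, 20, 40, and 80 seconds between attempts.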
Setting up the tool
- Download the package from hbackup.install.tar. This package includes the necessary jar files and the source code.
- Set up a configuration file. Download the hbase-site.xml file. Add to this the fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey, fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties.
- Set up the classpath with all jars inside the hbase/lib directory, the hbase.jar file, java-xmlbuilder-0.4.jar, jets3t-0.8.1a.jar and the hbackup-1.0-core.jar file bundled inside the downloaded hbackup.install.tar. Make sure hbackup-1.0-core.jar is at the beginning of the classpath. In addition, add the configuration directory that holds the hbase-site.xml file to CLASSPATH.
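The four S3 credential properties from the configuration step follow the usual Hadoop property layout (a name and a value inside a property element). A minimal sketch that prints the entries to paste inside the configuration element of hbase-site.xml. The helper names s3_property and s3_properties and the placeholder credentials are assumptions for illustration, not part of the tool:

```shell
#!/bin/sh
# Print one Hadoop-style <property> entry (hypothetical helper).
# Values are written as-is, so they must not contain &, < or >.
s3_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Placeholder credentials (assumptions): replace with real keys.
ACCESS_KEY=${ACCESS_KEY:-YOUR_ACCESS_KEY_ID}
SECRET_KEY=${SECRET_KEY:-YOUR_SECRET_ACCESS_KEY}

# Emit the four properties named in the setup step,
# once for the s3 scheme and once for the s3n scheme.
s3_properties() {
    for scheme in s3 s3n; do
        s3_property "fs.$scheme.awsAccessKeyId" "$ACCESS_KEY"
        s3_property "fs.$scheme.awsSecretAccessKey" "$SECRET_KEY"
    done
}
```

Running s3_properties prints four property entries, which can then be reviewed and copied into the configuration file by hand.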
Running the tool
Usage: The tool runs in 4 modes: [backup.full], [backup.incremental], [backup.history], and [restore].
[backup.full]
mode=backup.full tables="comma separated tables" backup.folder=S3-Path date="YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE"
Example:
mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/ date="2011.12.01 17:03:38:546 IST"
mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/
[backup.incremental]
mode=backup.incremental tables="comma separated tables" backup.folder=S3-Path duration.mins=Minutes
Example of backing up the changes that occurred in the last 30 minutes:
mode=backup.incremental backup.folder=s3://S3BucketABC/ duration.mins=30 tables=tab1,tab2,tab3
[backup.history]
mode=backup.history backup.folder=S3-Path
Example of listing past archives. Incremental ones end with .incr
mode=backup.history backup.folder=s3://S3BucketABC/
[restore]
mode=restore backup.folder=S3-Path/ArchiveDate tables="comma separated tables"
Example of adding the rows archived on that date. First apply a full backup and then apply incremental backups.
mode=restore backup.folder=s3://S3-Path/DAY_MON_HH_MI_SS_SSS_ZZZ_YYYY tables=tab1,tab2,tab3
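The restore order described earlier (latest full backup first, then each incremental backup in sequence) can be sketched as a small helper that prints the argument lines to run. This is only an illustration: the function name restore_plan is hypothetical, and it assumes the caller passes the archive names, as listed by backup.history, in chronological order, with incremental archives ending in .incr:

```shell
#!/bin/sh
# restore_plan BACKUP_FOLDER TABLES FULL_ARCHIVE [INCR_ARCHIVE...]
# Hypothetical helper: prints one set of restore arguments per archive,
# full backup first, then the incrementals in the order given.
restore_plan() {
    if [ "$#" -lt 3 ]; then
        echo "usage: restore_plan FOLDER TABLES FULL_ARCHIVE [INCR_ARCHIVE...]" >&2
        return 2
    fi
    folder=${1%/}   # drop one trailing slash, if present
    tables=$2
    full=$3
    shift 3
    # The first archive must be a full backup, not an incremental one.
    case $full in
        *.incr) echo "first archive must be a full backup: $full" >&2; return 1 ;;
    esac
    echo "mode=restore backup.folder=$folder/$full tables=$tables"
    # Every remaining archive must be an incremental backup.
    for incr in "$@"; do
        case $incr in
            *.incr) echo "mode=restore backup.folder=$folder/$incr tables=$tables" ;;
            *) echo "expected an incremental archive (.incr): $incr" >&2; return 1 ;;
        esac
    done
}
```

Each printed line can then be passed to the backup tool in order. The helper only checks the .incr suffix; it does not verify that the incremental archives really are in chronological order.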
Sample scripts to run the backup tool
Setup:
$ cat setenv.sh
for file in /mnt/hbase/lib/*
do
    export CLASSPATH=$CLASSPATH:$file
done
export CLASSPATH=/mnt/hbase/hbase-0.90.4.jar:$CLASSPATH
export CLASSPATH=/mnt/hbackup/hbackup-1.0-core.jar:/mnt/hbackup/java-xmlbuilder-0.4.jar:/mnt/hbackup/jets3t-0.8.1a.jar:/mnt/hbackup/conf:$CLASSPATH
Full backup:
$ cat backup_full.sh
. /mnt/hbackup/bin/setenv.sh
dd=$(date "+%Y.%m.%d %H:%M:%S:000 %Z")
echo "Backing up for date $dd"
for table in table1 table2 table3
do
    /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.full backup.folder=s3://mybucket/ tables=$table "date=$dd"
    sleep 10
done
List of backups:
$ cat list.sh
. /mnt/hbackup/bin/setenv.sh
/usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.history backup.folder=s3://mybucket
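The post shows no script for the daily incremental backup. A minimal sketch modelled on backup_full.sh, assuming a window given in hours that is converted to the minutes expected by duration.mins (24 hours is 1440 minutes). The function names, the JAVA and BACKUP_FOLDER variables, and the bucket are placeholders, not part of the tool:

```shell
#!/bin/sh
# Sketch of an incremental backup script (hypothetical names throughout).
JAVA=${JAVA:-/usr/lib/jdk/bin/java}
BACKUP_FOLDER=${BACKUP_FOLDER:-s3://mybucket/}

# Print the tool arguments for an incremental backup of one table
# covering the last HOURS hours; duration.mins expects minutes.
incremental_args() {
    table=$1
    hours=$2
    echo "mode=backup.incremental backup.folder=$BACKUP_FOLDER duration.mins=$((hours * 60)) tables=$table"
}

# Back up each table in turn, as backup_full.sh does.
# run_incremental HOURS TABLE [TABLE...]
run_incremental() {
    hours=$1
    shift
    for table in "$@"; do
        # Word splitting of the argument string is intentional here.
        $JAVA com.bizosys.oneline.maintenance.HBaseBackup $(incremental_args "$table" "$hours")
        sleep 10
    done
}
```

To use it, source /mnt/hbackup/bin/setenv.sh first so that CLASSPATH is set, then call run_incremental 24 table1 table2 table3 from a daily cron job, with the full backup script scheduled for the weekend.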
Original title and link: Backing Up HBase to Amazon S3 (©myNoSQL)
Wednesday, 30 November 2011
DataSift Using MySQL, HBase, Memcached to Deal With Twitter Firehose
A great new article from Todd Hoff dissecting the DataSift architecture:

In terms of data store, DataSift architecture includes:
- MySQL (Percona server) on SSD drives
- HBase cluster (currently, ~30 hadoop nodes, 400TB of storage)
- Memcached (cache)
- Redis (still used for some internal queues, but probably going to be dismissed soon)
Leave whatever you were doing and go read it now.
Original title and link: DataSift Using MySQL, HBase, Memcached to Deal With Twitter Firehose (©myNoSQL)
Monday, 28 November 2011
How to Implement an IMAP Server on Top of a CouchDB/NoSQL Data Store?
Interesting question on SO:
To summarize my objective here, I am really just looking for a simple, opensource method which allows me to create and maintain a (preferably noSQL db) backup/archive of one/more remote IMAP email accounts on a per user basis and sync each individual users email accounts using a simple, low cost solution which easily scales out, consumes server resources in an efficient manner with the ADDED ABILITY that each user needs to be able to connect to his central email archive by simply adding a new imap account to his existing email client using an imap server, username and password provided through this archive server/setup.
This reminded me of a GSOC project to design and implement a distributed mailbox on top of Hadoop HDFS as part of the Apache James project. The project description can be found on this JIRA ticket and more details here:
We need to implement mailbox storage as a distributed system on top of Hadoop HDFS. The James mailbox API will be used. A first step is to design how to interact with Hadoop (native api, gora incubator at apache,…) and deal with specific performance questions related to mail loading/parsing in a distributed system (use map/reduce or not, use existing local lucene indexes for search,…). The second step is to implement the HDFS mailbox (maildir mailbox is similar because it stores mails as a file and can be an inspiration). A single James server will still be deployed because we don’t have any distributed UID generation.
According to the last comments on the ticket, this project was completed by Ioan Eugen Stan under Eric Charles’ mentorship.
Original title and link: How to Implement an IMAP Server on Top of a CouchDB/NoSQL Data Store? (©myNoSQL)
Odiago WibiData: Analytics Startup Powered by HBase and Hadoop
A new startup powered by HBase and Hadoop, founded by Cloudera co-founder Christophe Bisciglia and Hadoop developer Aaron Kimball, formerly of Cloudera, focusing on investigative and operational analytics on consumer Internet data:
- ALL data pertaining to a single user (or mobile device) is kept in a single, possibly very long, HBase row.
- There are two primary operators in WibiData, Produce and Gather.
- Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.
- It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.
- Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.
- HBase schema management is done at the WibiData system level, not directly in applications. There’s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.
One aspect that I’m not familiar with is how HBase can handle multitenancy, a requirement for services like WibiData.
As a side note, I assume this is the type of startup Accel’s $100m fund for Big Data, Hadoop, and NoSQL Databases is targeting.
Original title and link: Odiago WibiData: Analytics Startup Powered by HBase and Hadoop (©myNoSQL)
Monday, 31 October 2011
NoSQL: A Three-Horse Race
James Philips (Couchbase) quoted by Curt Monash:
NoSQL is simply a three-horse race between Couchbase, MongoDB, and Cassandra.
Off the top of my head I could name at least two other projects that either have numerous deployments or are already managing huge amounts of data. And I’d bet every regular reader would figure out that I’m referring to Redis and HBase.
Original title and link: NoSQL: A Three-Horse Race (©myNoSQL)
Thursday, 6 October 2011
How Does Google MegaStore Compare Against HDFS/HBase?
Alex Feinberg answering the question in the title:
This is like saying “how does a General Motors bus compare against a Ford engine”. MegaStore is built on top of Google’s BigTable/GFS. HBase/HDFS are BigTable/HDFS work-alikes.
BigTable and HBase give up availability (in the CAP Theorem sense) in favour of consistency: when a tablet master node (HRegionServer in HBase) goes down, the portion of the keyspace the failed node is responsible for becomes (briefly) unavailable until another node takes over the portion of the key space. This is efficient, as the data/write-ahead-log is stored in GFS (or HDFS): in a way serializing writes to GFS/HDFS (a file system with relaxed consistency semantics) through a single node ensures serializable consistency.
Make sure you read it all.
Original title and link: How Does Google MegaStore Compare Against HDFS/HBase? (©myNoSQL)
via: http://www.quora.com/How-does-Google-MegaStore-compare-against-HDFS-HBase
Saturday, 10 September 2011
State of HBase With Michael Stack
Michael Stack (StumbleUpon & Hadoop PMC) presents some of the more interesting HBase deployments, HBase usage scenarios, HBase and HDFS, and the near future of HBase: