In this O’Reilly webcast, longtime HBase developer and Cloudera HBase/Hadoop architect Lars George discusses the underlying concepts of the storage layer in HBase and how to model data in HBase for the best possible performance.
We have evaluated various options to back up data inside HBase and built a solution. This post explains the options and also provides the solution for anyone to download and use with their own HBase installations.
After considering these options we developed a simple tool which backs up data to Amazon S3 and restores it when needed. Another requirement was to take a full backup over the weekend and a daily incremental backup.
In a recovery scenario, the tool should first initialize a clean environment with all tables created and populated with the latest full backup data, and then apply all incremental backups sequentially. However, with this method deletes are not captured, which may leave some unnecessary data in the tables. This is a known disadvantage of this backup and restore method.
Internally, this backup program uses the HBase Export and Import tools and runs them as MapReduce jobs.
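For context, the stock Export and Import tools are typically invoked along the following lines. This is only a rough sketch: the table name and directories are placeholders, and Export's optional versions/start-time/end-time arguments are what make time-ranged (i.e. incremental) exports possible.
# full export of a table
$ hbase org.apache.hadoop.hbase.mapreduce.Export mytable /backup/mytable
# time-ranged export: latest version of rows changed between two epoch-millisecond timestamps
$ hbase org.apache.hadoop.hbase.mapreduce.Export mytable /backup/mytable_incr 1 <starttime-ms> <endtime-ms>
# import the exported data back into the table
$ hbase org.apache.hadoop.hbase.mapreduce.Import mytable /backup/mytable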
Top 10 Features of the backup tool
- Export complete data for the given set of tables to an S3 bucket.
- Incrementally export data for the given set of tables to an S3 bucket.
- List all complete as well as incremental backup repositories.
- Restore a table from backup based on the given backup repository.
- Runs in Map-Reduce
- In case of connection failure, retries with increasing delays
- Handles special characters such as _ that can cause issues in the export and import activities.
- Enhances the existing Export and Import tools with detailed logging that reports the cause of a failure rather than just exiting with a program status of 1.
- Uses a human-readable time format (YYYY.MM.DD 24HH:MINUTE:SECOND:MILLISECOND TIMEZONE) for taking, listing, and restoring backups, rather than system tick time or Unix epoch time (time represented as a number instead of a readable format).
- All parameters are taken from the command line, which allows a cron job to run the tool at regular intervals.
Setting up the tool
- Download the hbackup.install.tar package.
This package includes the necessary JAR files and the source code.
- Set up a configuration file. Download the hbase-site.xml file and add it to the configuration directory (a hypothetical example of its contents follows after this list).
- Set up the classpath with all the JAR files bundled inside the downloaded hbackup.install.tar, such as the hbackup-1.0-core.jar file. Make sure hbackup-1.0-core.jar is at the beginning of the classpath. In addition, add to the CLASSPATH the configuration directory that holds the hbase-site.xml file.
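The original post does not spell out which properties to add to hbase-site.xml. Purely as a hypothetical illustration, a minimal client configuration pointing at the cluster's ZooKeeper quorum could look like this (host names are placeholders):
$ cat /mnt/hbackup/conf/hbase-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- hypothetical example: standard HBase client property, placeholder hosts -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
</configuration>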
Running the tool
Usage: the tool runs in 4 modes: backup.full, backup.incremental, backup.history, and restore.
mode=backup.full tables="comma separated tables" backup.folder=S3-Path date="YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE"
Example with an explicit backup date:
mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/ date="2011.12.01 17:03:38:546 IST"
Example without an explicit date:
mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/
mode=backup.incremental tables="comma separated tables" backup.folder=S3-Path duration.mins=Minutes
Example of backup of changes occurred in the last 30 minutes:
mode=backup.incremental backup.folder=s3://S3BucketABC/ duration.mins=30 tables=tab1,tab2,tab3
Example of listing past full and incremental archives:
mode=backup.history backup.folder=s3://S3BucketABC/
Usage for restoring a table from a given archive:
mode=restore backup.folder=S3-Path/ArchiveDate tables="comma separated tables"
Example of restoring the rows archived on a given date (the full backup is applied first, followed by the incremental backups in sequence):
mode=restore backup.folder=s3://S3-Path/DAY_MON_HH_MI_SS_SSS_ZZZ_YYYY tables=tab1,tab2,tab3
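Putting this together with the recovery procedure described earlier, a full recovery is simply a sequence of restore runs. A rough sketch follows; the archive folder names are hypothetical placeholders that would be taken from a backup.history listing (latest full backup first, then each incremental backup in chronological order):
mode=restore backup.folder=s3://S3BucketABC/<full-backup-archive> tables=tab1,tab2,tab3
mode=restore backup.folder=s3://S3BucketABC/<incremental-archive-1> tables=tab1,tab2,tab3
mode=restore backup.folder=s3://S3BucketABC/<incremental-archive-2> tables=tab1,tab2,tab3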
Sample scripts to run the backup tool
$ cat setenv.sh
for file in `ls /mnt/hbase/lib`
do
  export CLASSPATH=$CLASSPATH:/mnt/hbase/lib/$file;
done
export CLASSPATH=/mnt/hbase/hbase-0.90.4.jar:$CLASSPATH
export CLASSPATH=/mnt/hbackup/hbackup-1.0-core.jar:/mnt/hbackup/java-xmlbuilder-0.4.jar:/mnt/hbackup/jets3t-0.8.1a.jar:/mnt/hbackup/conf:$CLASSPATH
$ cat backup_full.sh
. /mnt/hbackup/bin/setenv.sh
dd=`date "+%Y.%m.%d %H:%M:%S:000 %Z"`
echo Backing up for date $dd
for table in `echo table1 table2 table3`
do
  /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.full backup.folder=s3://mybucket/ tables=$table "date=$dd"
  sleep 10
done
List of backups:
$ cat list.sh
. /mnt/hbackup/bin/setenv.sh
/usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.history backup.folder=s3://mybucket
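Since all parameters are taken from the command line, cron can drive both the full and the incremental modes. A hypothetical backup_incremental.sh and crontab, following the same pattern as the scripts above (script names, tables, bucket, and schedule are placeholders):
$ cat backup_incremental.sh
. /mnt/hbackup/bin/setenv.sh
# export the changes of the last 60 minutes for each table
for table in `echo table1 table2 table3`
do
  /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.incremental backup.folder=s3://mybucket/ duration.mins=60 tables=$table
  sleep 10
done
$ crontab -l
# hourly incremental backup, full backup every Sunday night (placeholder schedule)
0 * * * * /mnt/hbackup/bin/backup_incremental.sh >> /var/log/hbackup.log 2>&1
0 2 * * 0 /mnt/hbackup/bin/backup_full.sh >> /var/log/hbackup.log 2>&1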
Original title and link: Backing Up HBase to Amazon S3 ( ©myNoSQL)
A great new article from Todd Hoff dissecting the DataSift architecture:
In terms of data store, DataSift architecture includes:
- MySQL (Percona server) on SSD drives
- HBase cluster (currently ~30 Hadoop nodes, 400TB of storage)
- Memcached (cache)
- Redis (still used for some internal queues, but probably going to be phased out soon)
Leave whatever you were doing and go read it now.
Original title and link: DataSift Using MySQL, HBase, Memcached to Deal With Twitter Firehose ( ©myNoSQL)
Michael Stack (StumbleUpon & Hadoop PMC) presents some of the more interesting HBase deployments, HBase usage scenarios, HBase and HDFS, and the near future of HBase:
The National Security Agency has submitted to the Apache Incubator a proposal to open source Accumulo, a BigTable-inspired key-value store that it has been building since 2008. The project proposal page provides more details about Accumulo's history and building blocks, and how it compares to the other open source BigTable implementation, HBase:
Access Labels: Accumulo has an additional portion of its key that sorts after the column qualifier and before the timestamp. It is called column visibility and enables expressive cell-level access control. Authorizations are passed with each query to control what data is returned to the user.
Iterators: Accumulo has a novel server-side programming mechanism that can modify the data written to disk or returned to the user. This mechanism can be configured for any of the scopes where data is read from or written to disk. It can be used to perform joins on data within a single tablet.
Flexibility: Accumulo places no restrictions on column families. In HBase, each column family is stored separately on disk; Accumulo allows column families to be grouped together on disk, as does BigTable.
Logging: HBase uses a write-ahead log on the Hadoop Distributed File System. Accumulo has its own logging service that does not depend on communication with the HDFS NameNode.
Storage: Accumulo has a relative key file format that improves compression.
Michael Stack has commented on the HBase mailing list:
The cell based ‘access labels’ seem like a matter of adding an extra field to KV and their Iterators seem like a specialization on Coprocessors. The ability to add column families on the fly seems too minor a difference to call out especially if online schema edits are now (soon) supported. They talk of locality group like functionality too — that could be a significant difference. We would have to see the code but at first blush, differences look small.
Original title and link: Accumulo: A New BigTable Inspired Distributed Key/Value Store by NSA ( ©myNoSQL)
A while ago I read a PR announcement about Infolinks, an in-text advertising company, using Hadoop and HBase. Lior Schachter, Infolinks Software Architect, has been kind enough to answer my questions.
What was Infolinks using before going the NoSQL route with HBase and Hadoop?
Our architecture uses a proprietary solution for log collection and aggregation (similar to Scribe/Flume). It is OSGi-based and currently handles billions of log lines per day from our distributed ad network. On top of this we have 2 kinds of statistics:
- Well-defined real-time statistics: MySQL based.
- Reports: data-mining on the raw logs.
In the past we used a single machine with an OSGi server to process these kinds of statistics.
The bottlenecks we identified were in the report generation:
- The amount of data we needed to process constantly grew (longer time to process).
- The number of reports requested by our marketing/support/R&D teams also increased.
We reached the point where generating a single report took hours. We needed to find a solution to the scalability problem that would allow us to run hundreds of reports per day on fine-grained data (URL, sentence). The Hadoop/HBase framework fits these needs perfectly.
Could you describe how Infolinks is using HBase and Hadoop?
We use Hadoop as our report generation engine, running M/R jobs on raw logs. We have a dedicated GUI for defining reports, which are then translated into M/R configuration. We also use M/R to insert data into HBase.
We use HBase to hold aggregation data per day on various aspects of our system (e.g. URLs, sentences). This data is used as feedback to our front-end servers (which analyze the text and serve the advertisements). This feedback has proven to be very effective. We also allow business experts to query HBase in order to get deeper insights into the system.
In contrast to the Hadoop reports, which are asynchronous (you order a report and get it by mail), the HBase dashboard is very fast and supports synchronous results. Our business team loves it. No waiting!
Have you evaluated other solutions? Why pick Hadoop and HBase?
We started exploring NoSQL solutions more than a year ago. We did some research on the available solutions and chose Hadoop/HBase for a few reasons:
- Java based
- Open source
- Hadoop - quite mature compared to other Java based solutions. Hadoop is also used by many web companies.
- HBase - built on Hadoop (so you get Hadoop's stability, APIs, etc. for free) and BigTable-like
We tested this solution for 6 months on a small cluster and were very happy with it.
Thanks a lot Lior!
Lior Schachter is a Software Architect with extensive hands-on experience in building and evolving large-scale software systems in the Telco and Internet industries. Lior joined Infolinks in 2009 and serves as an R&D team manager with the following responsibilities: 1) real-time advertising engine; 2) Hadoop and HBase infrastructures. Lior holds a BSc degree, with honors, in Computer Science and Electrical Engineering from Tel-Aviv University.
Original title and link: HBase and Hadoop at Infolinks ( ©myNoSQL)