hbase: All content tagged as hbase in NoSQL databases and polyglot persistence
Saturday, 17 December 2011
NoSQL Screencast: HBase Schema Design
In this O’Reilly webcast, long-time HBase developer and Cloudera HBase/Hadoop architect Lars George discusses the underlying concepts of the storage layer in HBase and how to model data in HBase for the best possible performance.
Sunday, 11 December 2011
Facebook: There Are No Published Cases of NoSQL Databases Operating at the Scale of Facebook’s MySQL Database
Joe Maguire, referring to the Facebook talk on MySQL and HBase:
if Facebook doesn’t need NoSQL, who does?
My answer: many of those that cannot employ a specialized team to hack the hell out of MySQL to make it work at that scale.
On the flipside, many other companies don’t have the time or engineering power to grow their product together with a NoSQL database.
via: http://josephmaguire.blogspot.com/2011/12/facebook-there-are-no-published-cases.html
Wednesday, 7 December 2011
Backing Up HBase to Amazon S3
This is a guest post by the Bizosys team, creators of HSearch, an open source, NoSQL, distributed, real-time search engine built on Hadoop and HBase.
We have evaluated various options for backing up data inside HBase and built a solution. This post will explain the options and also provide the solution for anyone to download and implement for their own HBase installations.
After considering these options we developed a simple tool, which backs up data to Amazon S3 and restores it when needed. Another requirement is to take a full backup over the weekend and a daily incremental backup.
In a recovery scenario, the tool should first set up a clean environment with all tables created and populated with the latest full backup data. Then it should apply all incremental backups sequentially. However, with this method, deletes are not captured, and this may leave some unnecessary data in the tables. This is a known disadvantage of this method of backup and restore.
Internally, the backup program uses the HBase Import and Export tools, which run as MapReduce jobs.
Top 10 Features of the backup tool
- Export complete data for the given set of tables to S3 bucket.
- Export incrementally data for the given set of tables to S3 bucket.
- List all complete as well as incremental backup repositories.
- Restore a table from backup based on the given backup repository.
- Runs in Map-Reduce
- In case of connection failure, retries with increasing delays
- Handles special characters such as _ that cause problems in the export and import activities.
- Enhances the existing Export and Import tools with detailed logging to report a failure, rather than just exiting with a program status of 1.
- Works with a human-readable time format (YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE) for taking, listing and restoring backups, rather than system tick time or Unix epoch time (time represented as a number).
- All parameters are taken from the command line, which allows a cron job to run the tool at regular intervals.
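The "retries with increasing delays" behaviour from the list above can be pictured with a small shell wrapper. This is only a sketch of the general idea, not the tool's own code (that lives inside the jar). The function name `retry_with_backoff`, its arguments and the linear delay schedule are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of hbackup: run a command and, if it fails,
# retry it with a delay that grows with every attempt.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_with_backoff() {
  local max_attempts=$1 base_delay=$2
  shift 2
  local attempt=1
  while true; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    # Linear growth: base, 2 * base, 3 * base, ...
    local delay=$((base_delay * attempt))
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

A call such as `retry_with_backoff 5 10 some_upload_command` would wait 10, 20, 30 and 40 seconds between its five attempts; `some_upload_command` stands in for whatever step talks to S3.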
Setting up the tool
- Download the package from hbackup.install.tar. This package includes the necessary jar files and the source code.
- Set up a configuration file. Download the hbase-site.xml file. Add to it the fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey, fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties.
- Set up the class path with all jars inside the hbase/lib directory, the hbase.jar file, java-xmlbuilder-0.4.jar, jets3t-0.8.1a.jar and the hbackup-1.0-core.jar file bundled inside the downloaded hbackup.install.tar. Make sure hbackup-1.0-core.jar is at the beginning of the classpath. In addition, add the configuration directory that holds the hbase-site.xml file to CLASSPATH.
Running the tool
Usage: It runs in 4 modes: [backup.full], [backup.incremental], [backup.history] and [restore].
[backup.full]
mode=backup.full tables="comma separated tables" backup.folder=S3-Path date="YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE"
Example:
mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/ date="2011.12.01 17:03:38:546 IST"
mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/
[backup.incremental]
mode=backup.incremental tables="comma separated tables" backup.folder=S3-Path duration.mins=Minutes
Example of backing up the changes that occurred in the last 30 minutes:
mode=backup.incremental backup.folder=s3://S3BucketABC/ duration.mins=30 tables=tab1,tab2,tab3
[backup.history]
mode=backup.history backup.folder=S3-Path
Example of listing past archives. Incremental ones end with .incr
mode=backup.history backup.folder=s3://S3BucketABC/
[restore]
mode=restore backup.folder=S3-Path/ArchiveDate tables="comma separated tables"
Example of adding the rows archived during that date. First apply a full backup and then apply incremental backups.
mode=restore backup.folder=s3://S3-Path/DAY_MON_HH_MI_SS_SSS_ZZZ_YYYY tables=tab1,tab2,tab3
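The restore order described above (a full backup first, then the incremental backups) can be sketched as a small planning function. It only prints the argument lines and runs nothing. It assumes the archive names are passed oldest first, contain no whitespace, and that incremental archives end with .incr as in the history listing. The function name `plan_restore` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of hbackup: print the restore arguments in the
# order they should be applied, i.e. the latest full backup followed by every
# incremental archive taken after it.
# Usage: plan_restore BUCKET TABLES archive [archive...]   (oldest first)
plan_restore() {
  local bucket=$1 tables=$2
  shift 2
  local plan="" archive
  for archive in "$@"; do
    case "$archive" in
      *.incr)
        # An incremental is only useful on top of a full backup.
        if [ -n "$plan" ]; then plan="$plan $archive"; fi
        ;;
      *)
        # A newer full backup supersedes everything collected so far.
        plan="$archive"
        ;;
    esac
  done
  for archive in $plan; do
    echo "mode=restore backup.folder=$bucket/$archive tables=$tables"
  done
}
```

Each printed line can then be passed to the backup class, one run per archive, in the order shown.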
Sample scripts to run the backup tool
Setup:
$ cat setenv.sh
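# Add every file in the HBase lib directory to the classpath.
for file in /mnt/hbase/lib/*
do
export CLASSPATH=$CLASSPATH:$file
done
# Put the HBase jar ahead of the library jars.
export CLASSPATH=/mnt/hbase/hbase-0.90.4.jar:$CLASSPATH
# hbackup-1.0-core.jar must come first, followed by its dependencies and the conf directory.
export CLASSPATH=/mnt/hbackup/hbackup-1.0-core.jar:/mnt/hbackup/java-xmlbuilder-0.4.jar:/mnt/hbackup/jets3t-0.8.1a.jar:/mnt/hbackup/conf:$CLASSPATH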
Full backup:
$ cat backup_full.sh
. /mnt/hbackup/bin/setenv.sh
# Timestamp in the format the tool expects; milliseconds are fixed at 000.
dd=$(date "+%Y.%m.%d %H:%M:%S:000 %Z")
echo "Backing up for date $dd"
# Back up one table per run, pausing between runs.
for table in table1 table2 table3
do
/usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.full backup.folder=s3://mybucket/ tables=$table "date=$dd"
sleep 10
done
List of backups:
$ cat list.sh
. /mnt/hbackup/bin/setenv.sh
/usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.history backup.folder=s3://mybucket
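The sample scripts cover full backups and listing. A daily incremental run could follow the same pattern as backup_full.sh. The sketch below reuses the java path, class name and bucket from the scripts above. The function name `backup_incremental` and the `DRY_RUN` switch are made up for this example and are not part of the package.

```shell
#!/usr/bin/env bash
# Sketch of a daily incremental backup, modelled on backup_full.sh.
# JAVA and BACKUP_CLASS are taken from the sample scripts; everything else
# (function name, DRY_RUN switch) is an assumption made for illustration.
JAVA=/usr/lib/jdk/bin/java
BACKUP_CLASS=com.bizosys.oneline.maintenance.HBaseBackup

backup_incremental() {
  local folder=$1 minutes=$2 tables=$3
  # With DRY_RUN set to a non-empty value the command is printed, not executed.
  ${DRY_RUN:+echo} "$JAVA" "$BACKUP_CLASS" mode=backup.incremental \
    "backup.folder=$folder" "duration.mins=$minutes" "tables=$tables"
}

# A daily run covers the last 24 * 60 = 1440 minutes:
# . /mnt/hbackup/bin/setenv.sh
# backup_incremental s3://mybucket/ 1440 table1,table2,table3
```

Running it once with `DRY_RUN=1` shows the exact command line before it is handed over to cron.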
Original title and link: Backing Up HBase to Amazon S3 (©myNoSQL)
Wednesday, 30 November 2011
DataSift Using MySQL, HBase, Memcached to Deal With Twitter Firehose
A great new article from Todd Hoff dissecting the DataSift architecture:

In terms of data store, DataSift architecture includes:
- MySQL (Percona server) on SSD drives
- HBase cluster (currently, ~30 hadoop nodes, 400TB of storage)
- Memcached (cache)
- Redis (still used for some internal queues, but probably going to be dismissed soon)
Leave whatever you were doing and go read it now.
Original title and link: DataSift Using MySQL, HBase, Memcached to Deal With Twitter Firehose (©myNoSQL)
Monday, 28 November 2011
How to Implement an IMAP Server on Top of a CouchDB/NoSQL Data Store?
Interesting question on SO:
To summarize my objective here, I am really just looking for a simple, opensource method which allows me to create and maintain a (preferably noSQL db) backup/archive of one/more remote IMAP email accounts on a per user basis and sync each individual user's email accounts using a simple, low cost solution which easily scales out, consumes server resources in an efficient manner with the ADDED ABILITY that each user needs to be able to connect to his central email archive by simply adding a new imap account to his existing email client using an imap server, username and password provided through this archive server/setup.
This reminded me of a GSOC project to design and implement a distributed mailbox on top of Hadoop HDFS as part of the Apache James project. The project description can be found on this JIRA ticket and more details here:
We need to implement mailbox storage as a distributed system on top of Hadoop HDFS. The James mailbox API will be used. A first step is to design how to interact with Hadoop (native api, gora incubator at apache,…) and deal with specific performance questions related to mail loading/parsing in a distributed system (use map/reduce or not, use existing local lucene indexes for search,…). The second step is to implement the HDFS mailbox (maildir mailbox is similar because it stores mails as a file and can be an inspiration). A single James server will still be deployed because we don’t have any distributed UID generation.
According to the last comments on the ticket, this project was completed by Ioan Eugen Stan under Eric Charles’ mentorship.
Original title and link: How to Implement an IMAP Server on Top of a CouchDB/NoSQL Data Store? (©myNoSQL)
Odiago WibiData: Analytics Startup Powered by HBase and Hadoop
A new startup powered by HBase and Hadoop, founded by one of Cloudera’s founders Christophe Bisciglia and Hadoop developer and ex-Cloudera Aaron Kimball, focusing on investigative and operational analytics on consumer Internet data:
- ALL data pertaining to a single user (or mobile device) is kept in a single, possibly very long, HBase row.
- There are two primary operators in WibiData, Produce and Gather.
- Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.
- It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.
- Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.
- HBase schema management is done at the WibiData system level, not directly in applications. There’s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.
One aspect that I’m not familiar with is how HBase can handle multitenancy, a requirement for services like WibiData.
As a side note, I assume this is the type of startup Accel’s $100m fund for Big Data, Hadoop, and NoSQL Databases is targeting.
Original title and link: Odiago WibiData: Analytics Startup Powered by HBase and Hadoop (©myNoSQL)
Monday, 31 October 2011
NoSQL: A Three-Horse Race
James Phillips (Couchbase) quoted by Curt Monash:
NoSQL is simply a three-horse race between Couchbase, MongoDB, and Cassandra.
Off the top of my head I could name at least two other projects that are either having numerous deployments or are already managing huge amounts of data. And I’d bet every regular reader would figure out that I’m referring to Redis and HBase.
Original title and link: NoSQL: A Three-Horse Race (©myNoSQL)
Thursday, 6 October 2011
How Does Google MegaStore Compare Against HDFS/HBase?
Alex Feinberg answering the question in the title:
This is like saying “how does a General Motors bus compare against a Ford engine”. MegaStore is built on top of Google’s BigTable/GFS. HBase/HDFS are BigTable/GFS work-alikes.
BigTable and HBase give up availability (in the CAP Theorem sense) in favour of consistency: when a tablet master node (HRegionServer in HBase) goes down, the portion of the keyspace the failed node is responsible for becomes (briefly) unavailable until another node takes over the portion of the key space. This is efficient, as the data/write-ahead-log is stored in GFS (or HDFS): in a way serializing writes to GFS/HDFS (a file system with relaxed consistency semantics) through a single node ensures serializable consistency.
Make sure you read it all.
Original title and link: How Does Google MegaStore Compare Against HDFS/HBase? (©myNoSQL)
via: http://www.quora.com/How-does-Google-MegaStore-compare-against-HDFS-HBase
Saturday, 10 September 2011
State of HBase With Michael Stack
Michael Stack (StumbleUpon & Hadoop PMC) presents on some of the more interesting HBase deployments, HBase usage scenarios, HBase and HDFS, and the near future of HBase:
Monday, 5 September 2011
Accumulo: A New BigTable Inspired Distributed Key/Value by NSA
The National Security Agency has submitted to the Apache Incubator a proposal to open source Accumulo, a BigTable-inspired key-value store that they have been building since 2008. The project proposal page provides more details about Accumulo's history, building blocks, and how it compares to the other open source BigTable implementation, HBase:
- Access Labels: Accumulo has an additional portion of its key that sorts after the column qualifier and before the timestamp. It is called column visibility and enables expressive cell-level access control. Authorizations are passed with each query to control what data is returned to the user.
- Iterators: Accumulo has a novel server-side programming mechanism that can modify the data written to disk or returned to the user. This mechanism can be configured for any of the scopes where data is read from or written to disk. It can be used to perform joins on data within a single tablet.
- Flexibility: Accumulo places no restrictions on the column families. Also, each column family in HBase is stored separately on disk. Accumulo allows column families to be grouped together on disk, as does BigTable.
- Logging: HBase uses a write-ahead log on the Hadoop Distributed File System. Accumulo has its own logging service that does not depend on communication with the HDFS NameNode.
- Storage: Accumulo has a relative key file format that improves compression.
You can read more about Accumulo here and check the Hacker News and Reddit discussions.
Michael Stack has commented on the HBase mailing list:
The cell based ‘access labels’ seem like a matter of adding an extra field to KV and their Iterators seem like a specialization on Coprocessors. The ability to add column families on the fly seems too minor a difference to call out especially if online schema edits are now (soon) supported. They talk of locality group like functionality too — that could be a significant difference. We would have to see the code but at first blush, differences look small.
Original title and link: Accumulo: A New BigTable Inspired Distributed Key/Value by NSA (©myNoSQL)
Wednesday, 24 August 2011
HBase and Hadoop at Infolinks
A while ago I read a PR announcement about the use of Hadoop and HBase at Infolinks, an in-text advertising company. Lior Schachter[1], Infolinks Software Architect, has been kind enough to answer my questions.
What was Infolinks using before going the NoSQL route with HBase and Hadoop?
Our architecture uses a proprietary solution for logs collection and aggregation (similar to Scribe/Flume). This architecture is OSGI based and currently handles billions of lines of logs (per day) from our distributed ad network. On top of this we have 2 kinds of statistics:
- Well-defined real-time statistics: MySQL based.
- Reports: data-mining on the raw logs.
In the past we used a single machine with OSGI server to process these kinds of statistics.
The bottlenecks we identified were in the report generation:
- The amount of data we needed to process constantly grew (longer time to process).
- The number of reports requested by our marketing/support/R&D teams also increased.
We reached the point where generating a single report took hours. We needed to find a solution for the scalability problem which would allow us to run hundreds of reports per day on fine-grained data (URL, sentence). The Hadoop/HBase framework fits these needs perfectly.
Could you describe how Infolinks is using HBase and Hadoop?
We use Hadoop as our report generation engine - running M/R jobs on raw logs. We have a dedicated GUI for defining reports which are then translated to M/R configuration. We also use M/R to insert data into HBase.
We use HBase to hold aggregation data per day on various aspects of our system (e.g. URLs, sentences). This data is used as feedback to our front-end servers (which analyze the text and serve the advertisements). This feedback has proven to be very effective. We also allow business experts to query HBase in order to get deeper insights on the system.
In contrast to the Hadoop reports which are asynchronous (you order a report and get it by mail), the HBase dashboard is very fast and supports synchronous results. Our business team loves it. No waiting!
Have you evaluated other solutions? Why pick Hadoop and HBase?
We started exploring the NoSQL solutions more than a year ago. We did some research on the available solutions and chose Hadoop/HBase for a few reasons:
- Java based
- Open source
- Hadoop - quite mature compared to other Java based solutions. Hadoop is also used by many web companies.
- HBase - uses Hadoop (so you get Hadoop stability, APIs etc. for free), like BigTable
We tested this solution for 6 months (as a small cluster) and were very happy with it.
Thanks a lot Lior!
[1] Lior Schachter is a Software Architect with extensive hands-on experience in building and evolving large-scale software systems in the Telco and Internet industries. Lior joined Infolinks in 2009 and serves as an R&D team manager, with the following responsibilities: 1) real-time advertising engine; 2) Hadoop and HBase infrastructure. Lior holds a BSc degree in Computer Science and Electrical Engineering, with honors, from Tel-Aviv University.
Original title and link: HBase and Hadoop at Infolinks (©myNoSQL)