hbase: All content tagged as hbase in NoSQL databases and polyglot persistence
Monday, 13 February 2012
The components and their functions in the Hadoop ecosystem
Edd Dumbill enumerates the various components of the Hadoop ecosystem:

My quick reference of the Hadoop ecosystem includes a couple of other tools that are not in this list, with the exception of Ambari and HCatalog, which were released later.
Original title and link: The components and their functions in the Hadoop ecosystem (©myNoSQL)
Wednesday, 8 February 2012
Hadoop, HBase and R: Will Open Source Software Challenge BI & Analytics Software Vendors?
Harish Kotadia:
Predictive Analytics has been billed as the next big thing for almost fifteen years, but hasn’t gained mass acceptance so far the way ERP and CRM solutions have. One of the main reasons for this is the high upfront investment required in Software, Hardware and Talent for implementing a Predictive Analytics solution.
Well, this is about to change – […] Using R, HBase and Hadoop, it is possible to build cost-effective and scalable Big Data Analytics solutions that match or even exceed the functionality offered by costly proprietary solutions from leading BI/Analytics software vendors at a fraction of the cost.
Vendors will argue that software licensing represents just a small fraction of the costs of implementing BI or data analytics. What they’ll leave out are the costs of acquiring know-how and, more importantly, the costs of maintaining and modernizing their solutions.
Hypertable Revival. Still the wrong strategy
After a very long silence (my last post about Hypertable dates back to Oct. 2010: NoSQL database architectures and Hypertable), there seems to be a bit of a revival in the Hypertable space:
- there are new packages of (commercial) services (PR announcement):
  - Uptime support subscription
  - Training and certification
  - Commercial license
- it seems like Hypertable has a customer in Rediff.com (India)
- it is taking yet another stab at HBase performance
While I’m somewhat glad that Hypertable didn’t hit the deadpool, it’s quite disappointing that they are still using the old and completely useless strategy of attacking another product in the market.
There are probably many marketers out there encouraging companies to use this old trick of getting attention by attacking the market leader[1]. And one of the simplest ways of doing that is by saying “mine is bigger than yours”.
But these days this strategy isn’t working anymore for quite a few reasons:
- Benchmarks are most of the time incorrect, so the attention ends up pointed in the wrong direction. In the case of the Hypertable vs HBase benchmark, JD Cryans (HBase veteran) is disputing the results.
- For existing users, performance issues are already known. They are also known to the core developers, who are always working to address them. So nothing new, just some angry users of the attacked product.
- For new users, performance is just one aspect of the decision. Most of the time, it’s one of the last considered. Community, support, adoption, and well-known case studies are much more important.
Attacking competitors based on feature checklists might be slightly effective in attracting a bit of attention, but it’s not the strategy to get users and customers and grow a community.
[1] HBase might not be a market leader, but it is definitely one of the NoSQL databases that have seen a few very large deployments.
Friday, 3 February 2012
Designing HBase Schema to Best Support Specific Queries
Real scenario, very good analysis of different data access requirements, and three possible solutions. What’s your pick?
The problem is fairly simple - I am storing “notifications” in HBase, each of which has a status (“new”, “seen”, and “read”). Here are the APIs I need to provide:
- Get all notifications for a user
- Get all “new” notifications for a user
- Get the count of all “new” notifications for a user
- Update status for a notification
- Update status for all of a user’s notifications
- Get all “new” notifications across the database
- Notifications should be scannable in reverse chronological order and allow pagination.
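The last requirement is often approached with a reversed-timestamp row key, since HBase returns rows in lexicographic key order. The sketch below only illustrates that idea on a sorted Python list standing in for a table. It is not necessarily one of the three solutions from the linked post, and the key layout, the separator byte, and the helper names (`notification_row_key`, `scan_user`, `created_ms_of`) are assumptions made for illustration. Status handling ("new", "seen", "read") is left out.

```python
import struct
from bisect import bisect_left, bisect_right

MAX_TS = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def notification_row_key(user_id, created_ms):
    """Build a key that sorts one user's notifications newest-first."""
    reversed_ts = MAX_TS - created_ms
    # Big-endian encoding keeps byte order identical to numeric order.
    return user_id.encode("utf-8") + b"\x00" + struct.pack(">Q", reversed_ts)


def created_ms_of(key):
    """Recover the creation time from the last 8 bytes of a key."""
    return MAX_TS - struct.unpack(">Q", key[-8:])[0]


def scan_user(sorted_keys, user_id, start_after=None, limit=2):
    """Return up to `limit` keys for one user, resuming after `start_after`."""
    prefix = user_id.encode("utf-8") + b"\x00"
    if start_after is None:
        start = bisect_left(sorted_keys, prefix)
    else:
        start = bisect_right(sorted_keys, start_after)
    page = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix) or len(page) == limit:
            break
        page.append(key)
    return page


# A sorted list plays the role of the table's key space.
keys = sorted(
    notification_row_key(user, ts)
    for user, ts in [("alice", 1000), ("alice", 3000), ("bob", 2000), ("alice", 2000)]
)
first_page = scan_user(keys, "alice")
second_page = scan_user(keys, "alice", start_after=first_page[-1])
```

Because the newest notification has the smallest reversed timestamp, a plain forward scan over the user's prefix yields reverse chronological order, and the last key of a page doubles as the cursor for the next one.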
Wednesday, 1 February 2012
HBase Coprocessors Explained
A thorough post from Trend Micro Hadoop Group (Mingjie Lai, Eugene Koontz, Andrew Purtell) explaining all details of HBase coprocessors included in the latest HBase release 0.92.0:
Why HBase Coprocessors?
HBase has very effective MapReduce integration for distributed computation over data stored within its tables, but in many cases – for example simple additive or aggregating operations like summing, counting, and the like – pushing the computation up to the server where it can operate on the data directly without communication overheads can give a dramatic performance improvement over HBase’s already good scanning performance.
Also, before 0.92, it was not possible to extend HBase with custom functionality except by extending the base classes.
What are HBase Coprocessors?
In order to support sufficient flexibility for potential coprocessor behaviors, two different aspects of extension are provided by the framework. One is the observer, which is like a trigger in conventional databases, and the other is the endpoint, a dynamic RPC endpoint that resembles a stored procedure.
What can HBase Coprocessors be used for?
exciting new features can be built on top of it, for example secondary indexing, complex filtering (push down predicates), and access control.
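To make the observer/endpoint distinction more concrete, here is a toy in-memory model. It is not the HBase API (real coprocessors are Java classes loaded by the region servers); every class and method name below (`Region`, `ClampNegative`, `pre_put`, `endpoint_sum`, `cluster_sum`) is hypothetical and exists only to show where each kind of code runs.

```python
class ClampNegative:
    """Hypothetical observer: adjusts a value before it is stored."""

    def pre_put(self, row_key, value):
        return max(value, 0)


class Region:
    """Toy stand-in for one region: a slice of rows held by one server."""

    def __init__(self, observers=None):
        self.rows = {}
        self.observers = observers or []

    def put(self, row_key, value):
        # Observer hook: runs on the server before the write, like a trigger.
        for observer in self.observers:
            value = observer.pre_put(row_key, value)
        self.rows[row_key] = value

    def endpoint_sum(self):
        # Endpoint: runs next to the data and returns only a small result,
        # much like calling a stored procedure.
        return sum(self.rows.values())


def cluster_sum(regions):
    # Client side: merge one partial result per region instead of
    # pulling every row over the network.
    return sum(region.endpoint_sum() for region in regions)


regions = [Region([ClampNegative()]), Region([ClampNegative()])]
regions[0].put("a", 5)
regions[0].put("b", -3)  # the observer clamps this to 0
regions[1].put("c", 7)
total = cluster_sum(regions)
```

The point of the endpoint half is the one made in the excerpt: for additive operations, shipping one number per region is far cheaper than scanning the rows from the client.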
These are just a couple of interesting points from this excellent article. I strongly suggest reading it.
via: https://blogs.apache.org/hbase/entry/coprocessor_introduction
Tuesday, 24 January 2012
More Details About Apache HBase 0.92.0
Jonathan Hsieh provides a summary of the new features in HBase 0.92.0 by splitting them into user features:
- HFile v2, a new more efficient storage format
- Faster recovery via distributed log splitting
- Lower latency region-server operations via new multi-threaded and asynchronous implementations.
operator features:
- An enhanced web UI that exposes more internal state
- Improved logging for identifying slow queries
- Improved corruption detection and repair tools
and developer features:
- Coprocessors
- Build support for Hadoop 0.20.20x, 0.22, 0.23.
- Experimental: offheap slab cache and online table schema change
Earlier today when covering the HBase 0.92.0 release, I wrote that coprocessors are the highlight of this release. I’ll take that back. Way too many interesting features in HBase 0.92.0 to highlight just one of them.
via: http://www.cloudera.com/blog/2012/01/apache-hbase-0-92-0-has-been-released/
Monday, 23 January 2012
Latest NoSQL Releases: HBase 0.92, DataStax Community Server, Hortonworks Data Platform, SolrCloud
Just a quick roundup of the latest releases and announcements.
Hortonworks Data Platform (HDP) version 2
HDP v2 will include:
- NextGen MapReduce architecture
- HDFS NameNode HA
- HDFS Federation
- up-to-date HCatalog, HBase, Hive, Pig
According to the announcement:
In order to avoid confusion, let me explain the two versions of HDP:
- HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It is the most stable, production-ready version of Hadoop that is currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
- HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available later in Q1.
SolrCloud Completes Phase 2
Mark Miller on the completion of phase 2:
The second phase of SolrCloud has been in full swing for a couple of months now and it looks like we are going to be able to commit this work to trunk very soon! In Phase 1 we built on top of Solr’s distributed search capabilities and added cluster state, central config, and built-in read side fault tolerance. Phase 2 is even more ambitious and focuses on the write side. We are talking full-blown fault tolerance for reads and writes, near real-time support, real-time GET, true single node durability, optimistic locking, cluster elasticity, improvements to the Phase 1 features, and more.
Not there yet, but it’s coming.
DataStax Community Server 1.0.7
A new release of DataStax’s distribution of Cassandra incorporating Cassandra 1.0.7
HBase 0.92
Don’t let the version number trick you. This is an important release for HBase featuring:
- coprocessors
- security
- new (self-migrating) file format
- AWS improvements: EBS support, building a HA cluster
The list of new features, improvements, and bug fixes in HBase 0.92 is impressive. But the highlight of this release is, in my opinion, HBase coprocessors (Jira entry HBASE-2000).
I’m leaving you with Andrew Purtell’s slides about HBase Coprocessors:
Thursday, 19 January 2012
Pros and Cons of Using MapReduce With Distributed Key-Value Stores: HBase, Cassandra, Riak
Old Quora question with very good answers.
- (pro) can (potentially) query live data
- (pro) can (conceptually) be highly efficient at joining data sets that are identically sharded on the join key (the joins can be pushed down into the key-value store itself)
- (con) full scans (the most common pattern for map-reduce) are most likely to be much faster with raw file system access
- (con) because of the better decoupling of computation and storage in the GFS+Map-Reduce model, tolerating hot spots (resulting from MR jobs) is much easier
- (con) key-value stores are rarely arranged to have schemas optimized for analytics
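The second "pro" in the list above can be pictured with a small sketch: when two data sets are sharded identically on the join key, every shard can be joined locally and no rows need to move between shards. The sharding function and the sample data are assumptions made for illustration, not something taken from the Quora answers.

```python
import zlib


def shard_of(key, num_shards):
    # Deterministic hash, so both data sets place a key on the same shard.
    return zlib.crc32(key.encode("utf-8")) % num_shards


def shard(records, num_shards):
    """Split (key, value) pairs into per-shard dictionaries."""
    shards = [dict() for _ in range(num_shards)]
    for key, value in records:
        shards[shard_of(key, num_shards)][key] = value
    return shards


def local_join(left_shard, right_shard):
    """Inner join of two co-located shards; touches only local data."""
    common = left_shard.keys() & right_shard.keys()
    return {key: (left_shard[key], right_shard[key]) for key in common}


NUM_SHARDS = 3
users = shard([("u1", "Ann"), ("u2", "Bo"), ("u3", "Cy")], NUM_SHARDS)
orders = shard([("u1", 3), ("u3", 1), ("u4", 9)], NUM_SHARDS)

joined = {}
for user_shard, order_shard in zip(users, orders):
    joined.update(local_join(user_shard, order_shard))
```

Each `local_join` call only ever looks at one pair of shards, which is what would let such a join be pushed down into the store itself.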
Tuesday, 17 January 2012
Setting Up, Modeling and Loading Data in HBase With Hadoop and Clojure: NoSQL Tutorials
Even if you are not familiar with Clojure, you’ll still enjoy this fantastic HBase tutorial:
And that’s the thing: if you are loading literally gajigabytes of data into HBase you need to be pretty sure that it’s going to be able to answer your questions in a reasonable amount of time. Simply cramming it in there probably won’t work (indeed, that approach probably won’t work great for anything). I loaded and re-loaded a test set of twenty thousand rows until I had something that worked.
via: http://twitch.nervestaple.com/2012/01/12/clojure-hbase/
Wednesday, 11 January 2012
HBase and Hadoop in OpenLogic Top 5 Trending Open Source Projects
It’s nice to see HBase and Hadoop in the Top 5 gainers[1] of OpenLogic’s open source adoption trending report, but the list of contenders in the database and big data category is way too short: HBase, Hadoop, MongoDB, MySQL, PostgreSQL, CouchDB.
[1] The top 5: HBase, Node.js, nginx, Hadoop, Rails
Tuesday, 3 January 2012
Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0
Let’s start the year with a quick review of the latest releases that happened in December. Make sure that you scroll to the end as there are quite a few important ones.
MongoDB 2.0.2
Announced on Dec.15th, MongoDB 2.0.2 is a bug fix release:
- Hit config server only once per mongos on meta data change to not overwhelm
- Removed unnecessary connection close and open between mongos and mongod after getLastError
- Replica set primaries close all sockets on stepDown()
- Do not require authentication for the buildInfo command
- scons option for using system libraries
Apache Hive 0.8.0
Apache Hive 0.8.0 came out on Dec.19th. The list of new features, improvements, and bug fixes is extremely long.
Just as a side note, who came up with the idea of having a Hive fans’ page on Facebook?
Apache ZooKeeper 3.4.2
ZooKeeper 3.4.0 was quickly followed by two minor version updates fixing some critical bugs. The list of issues fixed in ZooKeeper 3.4.1 can be found here and for ZooKeeper 3.4.2 the 2 fixed bugs are listed here.
As with ZooKeeper 3.4.0, these versions are not yet production ready.
Apache Whirr 0.7.0
Apache Whirr 0.7.0 was released on Dec. 21st, featuring 56 improvements and bug fixes, including support for Puppet & Chef, and Mahout and Ganglia as a service. The complete list can be found here.
Some more details about Whirr 0.7.0 can be found here.
Apache HBase 0.90.5
Released Dec.23rd, HBase 0.90.5 packs 81 bug fixes. The complete list can be found here.
Redis 2.4.5
Redis 2.4.5 was released on Dec.23rd and provides 4 bug fixes:
- [BUGFIX] Fixed a ZUNIONSTORE/ZINTERSTORE bug that can cause a NaN to be inserted as a sorted set element score. This happens when one of the elements has +inf/-inf score and the weight used is 0.
- [BUGFIX] Fixed memory leak in CLIENT INFO.
- [BUGFIX] Fixed a non critical SORT bug (Issue 224).
- [BUGFIX] Fixed a replication bug: now the timeout configuration is respected during the connection with the master.
- --quiet option implemented in the Redis test.
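The ZUNIONSTORE/ZINTERSTORE bug comes down to floating point arithmetic: under IEEE 754, infinity multiplied by zero is NaN. The snippet below only reproduces that arithmetic in Python. The guard in `safe_weighted_score` is a hypothetical illustration and is not claimed to be how Redis fixed the issue.

```python
import math


def weighted_score(score, weight):
    # Arithmetic only: a member's score multiplied by the set's weight.
    return score * weight


def safe_weighted_score(score, weight):
    # Hypothetical guard: map the inf * 0 case to 0.0 instead of NaN.
    product = score * weight
    return 0.0 if math.isnan(product) else product


bad = weighted_score(float("inf"), 0)
good = safe_weighted_score(float("-inf"), 0)
```

A NaN score is a problem for a sorted set in particular because NaN does not compare as less than, equal to, or greater than any other number, so the element has no well-defined position.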
Last but definitely one of the most important announcements that came in December:
Hadoop 1.0.0
Based on the 0.20-security code line, Hadoop 1.0.0 was announced on Dec.29. This release includes support for:
- HBase (append/hsync/hflush) and Security
- WebHDFS (with full support for security)
- Performance enhanced access to local files for HBase
- Other performance enhancements, bug fixes, and features
- All version 0.20.205 and prior 0.20.2xx features
Complete release notes are available here.
Stéphane Fréchette, Ryan Slobojan, Duane Moore, Arun C. Murthy
And with this we are ready for 2012.
Thursday, 22 December 2011
Why We Chose HBase for AppFirst APM
Its performance had a significant impact on our decision making as well. It sustains an enormous number of writes and the read cycle times were much better than we had anticipated. Further, it gives us the option to interact with the Hadoop Ecosystem, including HDFS, MapReduce, and ZooKeeper frameworks. Our enthusiasm for HBase skyrocketed when we discovered how to create map-reduce apps to do a number of management tasks. While Cassandra also has these capabilities, its data model was fundamentally more complex.
What if the whole post had simply said: we chose HBase because of
- its seamless integration in the Hadoop ecosystem
- OpenTSDB, the scalable time series database, being built on top of HBase?
via: http://blog.appfirst.com/2011/12/22/why-we-chose-hbase/