


HBase: All content tagged as HBase in NoSQL databases and polyglot persistence

Hortonworks Data Platform 1.0

Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.

Some info points:

  1. Hortonworks Data Platform is meant to simplify the installation, integration, management, and use of Apache Hadoop


    1. HDP 1.0 is based on Apache Hadoop 1.0
    2. Apache Ambari is used for installation and provisioning
    3. The same Apache Ambari powers the Hortonworks Management Console
    4. For data integration, HDP offers WebHDFS, the HCatalog APIs, and Talend Open Studio
    5. Apache HCatalog is the solution offering metadata and table management
  2. Hortonworks Data Platform is 100% open source—I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community

  3. HDP comes with three levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster

One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP relies on running the NameNode and JobTracker inside VMware-powered virtual machines. My first thought was that this approach was chosen to strengthen the partnership with VMware. On the other hand, Hadoop 2.0 already includes a highly available NameNode (the Cloudera Hadoop Distribution uses this solution), and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.

You can read a lot of posts about this announcement, but you’ll find all the details in Hortonworks’s John Kreisa’s post here and the PR announcement.

Original title and link: Hortonworks Data Platform 1.0 (NoSQL database©myNoSQL)

Performance Evaluation of HBase and How Hardware Changes Results

Two posts by Oliver Meyn on measuring the performance of two HBase clusters—first results on the original cluster and results on the upgraded cluster— using org.apache.hadoop.hbase.PerformanceEvaluation, the resulting performance charts, Ganglia charts, and some thoughts and feedback from the HBase community.

Original title and link: Performance Evaluation of HBase and How Hardware Changes Results (NoSQL database©myNoSQL)

HBase 0.94 Released: What’s New

With over 350 enhancements and bug fixes, 0.94 is the new major release of HBase. This Cloudera blog post gives a good summary of the most interesting improvements:

  • Read caching improvements
  • Seek optimizations
  • WAL write optimizations
  • New hbck functionality: fixing orphaned regions, region holes, and overlapping regions
  • Simplified region sizing
  • Atomic Put and Delete in a single transaction

Original title and link: HBase 0.94 Released: What’s New (NoSQL database©myNoSQL)

Notes on the Hadoop and HBase Markets

Curt Monash shares what he heard from his customers:

  • Over half of Cloudera’s customers (nb: 100 subscription customers) use HBase
  • Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
  • There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.

Original title and link: Notes on the Hadoop and HBase Markets (NoSQL database©myNoSQL)


Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop

Apache Bigtop:

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

Currently packaging:

  • Apache Hadoop 1.0.x
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. Practically speaking, though, its goal is to ensure that all the components of the wider Hadoop ecosystem remain interoperable.

Original title and link: Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop (NoSQL database©myNoSQL)

Here Is Why in Cassandra vs. HBase, Riak, CouchDB, MongoDB, It's Cassandra FTW

Brian O'Neill:

Now, since choosing Cassandra, I can say there are a few other really important, less tangible considerations. The first is the code base. Cassandra has an extremely clean and well-maintained code base. Jonathan and team do a fantastic job managing the community and the code. As we adopted NoSQL, the ability to extend the code base and incorporate our own features has proven invaluable. (e.g. triggers, a REST interface, and server-side wide-row indexing)

Secondly, the community is phenomenal. That results in timely support, and solid releases on a regular schedule. They do a great job prioritizing features, accepting contributions, and cranking out features. (They are now releasing ~quarterly) We’ve all probably been part of other open source projects where the leadership is lacking, and features and releases are unpredictable, which makes your own release planning difficult. Kudos to the Cassandra team.

Everything sounds reasonable except for Riak being the “new kid on the block” and not finding support for it. Basho, where were you hiding?

Original title and link: Here Is Why in Cassandra vs. HBase, Riak, CouchDB, MongoDB, It’s Cassandra FTW (NoSQL database©myNoSQL)


The HBase Roadmap: Where Do We Want HBase to Be in Two Years?

The HBase project management committee:

After further banter, we arrived at this list: reliability, operability (insight into the running application, dynamic config changes, usability improvements that make it easier on clueful ops), and performance (in this order). It was offered that we are not too bad on performance — especially in 0.94 — and that use cases will drive the performance improvements, so focus should be on the first two items in the list. […] To improve reliability, testing has to be better. This has been said repeatedly in the past.

EMC has announced a 1,000+ node cluster for Apache Hadoop testing, so maybe a similar initiative is needed for HBase too. Considering how many large organizations are using HBase, it shouldn’t be difficult to get these resources, as long as someone assumes ownership of and leadership for the effort.

Original title and link: The HBase Roadmap: Where Do We Want HBase to Be in Two Years? (NoSQL database©myNoSQL)


Architecture of HBase-based Lucene Implementation

Boris Lublinsky and Mike Segel:

The implementation tries to balance two conflicting requirements. Performance: an in-memory cache can drastically improve performance by minimizing the number of HBase reads for search and document retrieval. Scalability: the ability to run as many Lucene instances as required to support a growing search client population. The latter requires minimizing the cache lifetime to keep content synchronized with the HBase instance (the single copy of truth). A compromise is achieved by implementing a configurable cache time-to-live parameter, limiting cache presence in each Lucene instance.

Architecture of HBase-based Lucene implementation
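The compromise the authors describe, a cache whose entries expire after a configurable time-to-live so that each Lucene instance periodically re-reads from HBase, can be sketched like this. This is a toy model for illustration, not the authors' implementation; the clock is injected so that expiry can be exercised deterministically:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/**
 * Toy sketch of a configurable-TTL cache: entries older than the TTL are
 * treated as absent, forcing the caller back to HBase (the single copy
 * of truth) for a fresh read.
 */
class TtlCacheSketch<K, V> {
    private final long ttlMillis;
    private final LongSupplier clock; // injectable clock, for testing
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Long> writtenAt = new HashMap<>();

    TtlCacheSketch(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        values.put(key, value);
        writtenAt.put(key, clock.getAsLong());
    }

    /** Returns null when the entry is absent or older than the TTL. */
    V get(K key) {
        Long t = writtenAt.get(key);
        if (t == null || clock.getAsLong() - t > ttlMillis) {
            values.remove(key);
            writtenAt.remove(key);
            return null;
        }
        return values.get(key);
    }
}
```

Shortening the TTL trades cache hit rate for freshness, which is exactly the performance/scalability tension the quote describes.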

Besides existing Solr scaling approaches and the work to make Solr scalable, there’s also the recently released DataStax Enterprise which integrates Solr on top of Cassandra.

Original title and link: Architecture of HBase-based Lucene Implementation (NoSQL database©myNoSQL)


ACID in HBase: Row Level Operations Explained. Plus Something New

Lars Hofhansl:

HBase employs a kind of MVCC. And HBase has no mixed read/write transactions. […] When a write transaction (a set of puts or deletes) starts it retrieves the next highest transaction number. In HBase this is called a WriteNumber. When a read transaction (a Scan or Get) starts it retrieves the transaction number of the last committed transaction. HBase calls this the ReadPoint.
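The WriteNumber/ReadPoint scheme Lars describes can be sketched in a few lines of Java. This is a simplified toy model for illustration, not HBase's actual MVCC code, and it assumes writes commit in order:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of HBase-style MVCC numbering. */
class MvccSketch {
    private final AtomicLong nextWriteNumber = new AtomicLong(0);
    private volatile long readPoint = 0; // last committed transaction

    /** A write transaction (puts/deletes) starts: take the next WriteNumber. */
    long beginWrite() {
        return nextWriteNumber.incrementAndGet();
    }

    /** The write commits: advance the ReadPoint (simplified: in-order commits). */
    void commitWrite(long writeNumber) {
        readPoint = writeNumber;
    }

    /** A read transaction (Scan or Get) starts: snapshot the current ReadPoint. */
    long beginRead() {
        return readPoint;
    }

    /** A reader sees a cell only if it was committed at or before its ReadPoint. */
    boolean isVisible(long cellWriteNumber, long readerReadPoint) {
        return cellWriteNumber <= readerReadPoint;
    }
}
```

The key property: a Scan started before a write commits never sees that write's cells, even if the write finishes mid-scan.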

Understanding the behavior of read and write operations in HBase is definitely useful. Learning that an upcoming HBase version will support atomic multi operations (HBASE-3584) and even multi-row local transactions (HBASE-5229) is priceless.

For HBase atomic multi-operations:

 Delete d = new Delete(ROW);
 Put p = new Put(ROW);
 AtomicRowMutation arm = new AtomicRowMutation(ROW);
 // both mutations are added to arm and applied atomically to the single row
 // (in the released API this shipped as RowMutations plus HTable.mutateRow)
 arm.add(p);
 arm.add(d);

HBase's multi-row local transactions are implemented as a mutateRowsWithLocks method on HRegion and can be used only by coprocessors (there is no client API).

Original title and link: ACID in HBase: Row Level Operations Explained. Plus Something New (NoSQL database©myNoSQL)


Hadoop and HBase: Configuring the Number of Server Side Threads (Xceivers)

Prepare yourself for a very long and detailed article by Lars George explaining how to correctly configure the number of server-side threads (and the sockets used for data connections)—the HDFS dfs.datanode.max.xcievers configuration option. Even though you'll be given the final secret formula, I strongly encourage you to read the details:

Hadoop HBase Xcievers
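For reference, the option lives in hdfs-site.xml on every DataNode and requires a DataNode restart to take effect; note that the property name preserves HDFS's historical misspelling. The value below is only a commonly suggested starting point, not the output of Lars's formula:

```xml
<!-- hdfs-site.xml (on every DataNode) -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <!-- 4096 is a frequently recommended floor for HBase workloads;
       derive the exact value for your cluster from the article's formula -->
  <value>4096</value>
</property>
```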

Original title and link: Hadoop and HBase: Configuring the Number of Server Side Threads (Xceivers) (NoSQL database©myNoSQL)


What HBase Learned From the Hypertable vs HBase Benchmark

Every decent benchmark reveals not only performance or stability problems, but often subtler issues: little-known or undocumented options, common misconfigurations, or plain misunderstandings. Sometimes it reveals scenarios a product hasn't considered before, or for which it offers different solutions.

So even if I don't agree with the purpose of the Hypertable vs HBase benchmark, I think the benchmark is well designed and there was no intention to favor one product over the other.

I went back to two long-time HBase committers and users, Michael Stack and Jean-Daniel Cryans, to find out what the HBase community could learn from this benchmark.

What can be learned from the Hypertable vs HBase benchmark from the HBase perspective?

Michael Stack: That we need to work on our usability; even a smart fellow like Doug Judd can get it really wrong.

We haven’t done his sustained upload in a good while. Our defaults need some tweaking.

We need to do more documentation around JVM tuning; you’d think fellas would have grok’d by now that big java apps need their JVM’s tweaked but it looks like the message still hasn’t gotten out there.

That we need a well-funded PR dept. to work on responses to the likes of Doug’s article (well-funded because Doug claims he spent four months on his comparison).

Jean-Daniel Cryans: I already opened a few jiras after using HT’s test on a cluster I have here with almost the same hardware and node count; it’s mostly about usability and performance for that type of use case:

  • Automagically tweak global memstore and block cache sizes based on workload

    Hypertable does a neat thing where it changes the size given to the CellCache (their equivalent of our MemStores) and the block cache based on the workload. If you need an image, scroll to the bottom of this link:

    Hypertable adaptive memory allocation

  • Soft limit for eager region splitting of young tables

    Coming out of HBASE-2375, we need new functionality much like Hypertable’s, where we would have a lower split size for new tables that grows up to a certain hard limit. This helps usability in different ways:

    • With that we can set the default split size much higher and users will still have good data distribution
    • No more messing with force splits
    • Not mandatory to pre-split your table in order to get good out of the box performance

    The way Doug Judd described how it works for them, they start with a low value and then double it every time it splits. For example if we started with a soft size of 32MB and a hard size of 2GB, it wouldn’t be until you have 64 regions that you hit the ceiling.

    On the implementation side, we could add a new qualifier in .META. that has that soft limit. When that field doesn’t exist, this feature doesn’t kick in. It would be written by the region servers after a split and by the master when the table is created with 1 region.

  • Consider splitting after flushing

    Spawning this from HBASE-2375, I saw that it was much more efficient compaction-wise to check if we can split right after flushing. Much like the ideas that Jon spelled out in the description of that jira, the window is smaller because you don’t have to compact and then split right away to only compact again when the daughters open.
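The split-size doubling Doug Judd describes can be sketched as follows. This is a toy illustration, not HBase or Hypertable code; the 32 MB soft / 2 GB hard numbers are the ones from the example above:

```java
/** Toy sketch of soft/hard region split-size doubling. */
class SplitSizeSketch {
    /**
     * Split threshold for a table that currently has `regions` regions:
     * start at the soft limit and double with every split generation,
     * capped at the hard limit.
     */
    static long splitSize(long softLimit, long hardLimit, int regions) {
        long size = softLimit;
        // each generation of splits doubles the region count and the threshold
        for (int r = 1; r < regions; r *= 2) {
            size = Math.min(size * 2, hardLimit);
        }
        return size;
    }
}
```

With those numbers the threshold sequence is 32, 64, 128, … MB, reaching the 2 GB hard limit only once the table has split into 64 regions, matching Doug Judd's description.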

If someone is faced with similar scenarios are there workarounds or different solutions?

Michael Stack: There are tunings of HBase configs over in our reference guide for the sustained upload, both in HBase and in the JVM.

Then there is our bulk load facility, which bypasses this scenario altogether and which we’d encourage folks to use because it’s 10x to 100x faster at getting your data in there.

Jean-Daniel Cryans: You can import 5TB in HBase with sane configs, I’ve done it a few times already since I started using his test. The second time he ran his test he just fixed mslab but still kept the crazy ass other settings like 80% of the memory dedicated to memstores. My testing also shows that you need to keep the eden space under control, 64MB seems a good value in my testing (he didn’t set any in his test, the first time I ran mine without setting it I got the concurrent mode failure too).

The answer he gave this week to Todd’s email on the hadoop mailing list is about a constant stream of updates and that’s what he’s trying to test. Considering that the test imports 5TB in ~16h (on my cluster), you run out of disk space in about 3 days. I seriously don’t know what he’s aiming for here.

Quoting him: “Bulk loading isn’t always an option when data is streaming in from a live application. Many big data use cases involve massive amounts of smaller items in the size range of 10-100 bytes, for example URLs, sensor readings, genome sequence reads, network traffic logs, etc.”

What are the most common places to look when improving the performance of an HBase cluster?

Michael Stack: This is what we point folks at when they ask the likes of the above question: HBase Performance Tuning

If that chapter doesn’t have it, it’s a bug and we need to fix up our documentation more.

Jean-Daniel Cryans: What Stack said. Also if you run into GC issues like he did then you’re doing it wrong.

Michael Stack also pointed me to a comment by Andrew Purtell (nb: you need to be logged in on LinkedIn and member of the group to see it):

I think HBase should find all of this challenging and flattering. Challenging because we know how we can do better along the dimensions of your testing and you are kicking us pretty hard. Flattering because by inference we seem to be worth kicking.

But this misses the point, and reduces what should be a serious discussion of the tradeoffs between Java and C++ to a caricature. Furthermore, nobody sells HBase. (Not in the Hypertable or DataStax sense. Commercial companies bundle HBase but they do so by including a totally free and zero cost software distribution.) Instead it is voluntarily chosen for hundreds of large installations all over the world, some of them built and run by the smartest guys I have ever encountered in my life. Hypertable would have us believe we are all making foolish choices. While it is true that we all on some level have to deal with the Java heap, only Hypertable seems to not be able to make it work. I find that unsurprising. After all, until you can find some way to break it, you don’t have any kind of marketing story.

This reminded me of the quote from Jonathan Ellis’s Dealing With JVM Limitations in Apache Cassandra:

Cliff Click: Many concurrent algorithms are very easy to write with a GC and totally hard (to downright impossible) using explicit free.

As I expected, quite a few good things will come out of this benchmark, both for long-time HBase users and for new adopters.

Original title and link: What HBase Learned From the Hypertable vs HBase Benchmark (NoSQL database©myNoSQL)

HBase Filters Explained: Let HBase Do the Data Selection Job for You

The naive approach of fetching all the required data to the client in order to apply some processing locally should, in a distributed setting, be limited to trivial tasks operating on a tiny subset. There are two fundamental reasons for this. First, it generates a lot of network exchanges, needlessly consuming resources and sometimes leading to unacceptable response times. Second, centralizing all the information and then processing it simply misses all the advantages brought by a powerful cluster of hundreds or even thousands of machines. The lesson is simple: when you deal with Big Data, the data center is your computer.

Great and concise explanation of the pre-packaged HBase filters and their advantages by Philippe Rigaux:

Compare with the well-known SQL world. When you express a SELECT-FROM-WHERE query, you restrict the number of rows (with the “WHERE” clause) and the number of columns for each row (with the “SELECT” clause). Filters in HBase let you do both: fully ignore some rows and, for those rows that pass, restrict the families, columns, or timestamps. This relates to the underlying motivation: limit as much as possible the network bandwidth used to communicate with the client application.
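To make the analogy concrete, here is a toy simulation (plain Java, not the HBase client API) of what a server-side filter buys you: rows failing a predicate are dropped entirely, and surviving rows keep only the requested columns, so only the filtered cells ever cross the network:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

/** Toy model of server-side row filtering plus column projection. */
class FilterSketch {
    /** table: rowKey -> (column -> value) */
    static Map<String, Map<String, String>> scan(
            Map<String, Map<String, String>> table,
            Predicate<Map<String, String>> rowFilter, // like a WHERE clause
            String columnPrefix) {                    // like a SELECT list
        Map<String, Map<String, String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> row : table.entrySet()) {
            if (!rowFilter.test(row.getValue())) continue; // row skipped "server-side"
            Map<String, String> kept = new TreeMap<>();
            for (Map.Entry<String, String> col : row.getValue().entrySet()) {
                if (col.getKey().startsWith(columnPrefix)) {
                    kept.put(col.getKey(), col.getValue());
                }
            }
            out.put(row.getKey(), kept); // only filtered cells are "sent back"
        }
        return out;
    }
}
```

In real HBase you would attach the equivalent logic to a Scan via the pre-packaged Filter classes, so the region servers do this work instead of the client.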

Original title and link: HBase Filters Explained: Let HBase Do the Data Selection Job for You (NoSQL database©myNoSQL)