hypertable: All content tagged as hypertable in NoSQL databases and polyglot persistence
Every decent benchmark can reveal not only performance or stability problems, but often more subtle issues like lesser-known or undocumented options, common misconfigurations, or misunderstandings. Sometimes it can reveal scenarios that a product hasn’t considered before or for which it has different solutions.
What can be learned from the Hypertable vs HBase benchmark from the HBase perspective?
Michael Stack: That we need to work on our usability; even a smart fellow like Doug Judd can get it really wrong.
We haven’t done his sustained upload in a good while. Our defaults need some tweaking.
We need to do more documentation around JVM tuning; you’d think fellas would have grok’d by now that big Java apps need their JVMs tweaked, but it looks like the message still hasn’t gotten out there.
That we need a well-funded PR dept. to work on responses to the likes of Doug’s article (well-funded because Doug claims he spent four months on his comparison).
Jean-Daniel Cryans: I already opened a few jiras after using HT’s test on a cluster I have here with almost the same hardware and node count, it’s mostly about usability and performance for that type of use case:
Hypertable does a neat thing where it changes the size given to the CellCache (our MemStores) and Block Cache based on the workload. If you need an image, scroll down at the bottom of this link:
Coming out of HBASE-2375, we need a new functionality much like hypertable’s where we would have a lower split size for new tables and it would grow up to a certain hard limit. This helps usability in different ways:
- With that we can set the default split size much higher and users will still have good data distribution
- No more messing with force splits
- Not mandatory to pre-split your table in order to get good out of the box performance
The way Doug Judd described how it works for them, they start with a low value and then double it every time it splits. For example if we started with a soft size of 32MB and a hard size of 2GB, it wouldn’t be until you have 64 regions that you hit the ceiling.
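To make the arithmetic concrete, here is a minimal sketch of that doubling schedule. It assumes the simplest reading of the description above: every region splits once per generation, and the daughters get twice the parent's split size, capped at the hard limit. The function name and the megabyte units are mine for illustration, not anything taken from HBase or Hypertable.

```python
def split_size_schedule(soft_mb, hard_mb):
    """Return (region_count, split_size_mb) for each split generation.

    Assumption: all regions split once per generation and the split size
    doubles each time until it reaches the hard limit.
    """
    regions, size = 1, soft_mb
    schedule = [(regions, size)]
    while size < hard_mb:
        size = min(size * 2, hard_mb)  # double, but never exceed the ceiling
        regions *= 2                   # each region splits into two daughters
        schedule.append((regions, size))
    return schedule


if __name__ == "__main__":
    for regions, size in split_size_schedule(32, 2048):
        print(f"{regions:>3} regions -> split size {size} MB")
```

With a 32MB soft size and a 2GB (2048MB) hard size, the last entry of the schedule is the generation with 64 regions, which matches the figure quoted above.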
On the implementation side, we could add a new qualifier in .META. that has that soft limit. When that field doesn’t exist, this feature doesn’t kick in. It would be written by the region servers after a split and by the master when the table is created with 1 region.
Spawning this from HBASE-2375, I saw that it was much more efficient compaction-wise to check if we can split right after flushing. Much like the ideas that Jon spelled out in the description of that jira, the window is smaller because you don’t have to compact and then split right away to only compact again when the daughters open.
If someone is faced with similar scenarios, are there workarounds or different solutions?
Michael Stack: There are tunings of HBase configs over in our reference guide for the sustained upload both in hbase and in jvm.
Then there is our bulk load facility, which bypasses this scenario altogether and which is what we’d encourage folks to use because it’s 10x to 100x faster at getting your data in there.
Jean-Daniel Cryans: You can import 5TB in HBase with sane configs, I’ve done it a few times already since I started using his test. The second time he ran his test he just fixed mslab but still kept the crazy ass other settings like 80% of the memory dedicated to memstores. My testing also shows that you need to keep the eden space under control, 64MB seems a good value in my testing (he didn’t set any in his test, the first time I ran mine without setting it I got the concurrent mode failure too).
The answer he gave this week to Todd’s email on the hadoop mailing list is about a constant stream of updates and that’s what he’s trying to test. Considering that the test imports 5TB in ~16h (on my cluster), you run out of disk space in about 3 days. I seriously don’t know what he’s aiming for here.
Quoting him: “Bulk loading isn’t always an option when data is streaming in from a live application. Many big data use cases involve massive amounts of smaller items in the size range of 10-100 bytes, for example URLs, sensor readings, genome sequence reads, network traffic logs, etc.”
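As a back-of-the-envelope check of the disk-space remark above, here is a small sketch. The 5TB and ~16h figures come from the answer above; the replication factor of 3 is my assumption (it is the HDFS default), and since the cluster's actual disk capacity isn't given, the script only shows how much data piles up over 3 days.

```python
def ingest_projection(total_tb, hours, days, replication=3):
    """Project sustained ingest over a number of days.

    Returns (rate in TB/hour, logical TB written, raw TB on disk).
    The replication factor is an assumption, not a figure from the test.
    """
    rate_tb_per_hour = total_tb / hours
    logical_tb = rate_tb_per_hour * 24 * days
    raw_tb = logical_tb * replication
    return rate_tb_per_hour, logical_tb, raw_tb


if __name__ == "__main__":
    rate, logical, raw = ingest_projection(total_tb=5, hours=16, days=3)
    print(f"rate: {rate} TB/h, logical: {logical} TB, raw on disk: {raw} TB")
```

Under these assumptions a constant stream at that rate writes more than 20TB of logical data in 3 days, before replication is even counted.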
What are the most common places to look for improving the performance of an HBase cluster?
Michael Stack: This is what we point folks at when they ask the likes of the above question: HBase Performance Tuning
If that chapter doesn’t have it, it’s a bug and we need to fix up our documentation more.
Jean-Daniel Cryans: What Stack said. Also if you run into GC issues like he did then you’re doing it wrong.
I think HBase should find all of this challenging and flattering. Challenging because we know how we can do better along the dimensions of your testing and you are kicking us pretty hard. Flattering because by inference we seem to be worth kicking.
But this misses the point, and reduces what should be a serious discussion of the tradeoffs between Java and C++ to a caricature. Furthermore, nobody sells HBase. (Not in the Hypertable or Datastax sense. Commercial companies bundle HBase but they do so by including a totally free and zero cost software distribution.) Instead it is voluntarily chosen for hundreds of large installations all over the world, some of them built and run by the smartest guys I have ever encountered in my life. Hypertable would have us believe we are all making foolish choices. While it is true that we all on some level have to deal with the Java heap, only Hypertable seems to not be able to make it work. I find that unsurprising. After all, until you can find some way to break it, you don’t have any kind of marketing story.
This reminded me of the quote from Jonathan Ellis’s Dealing With JVM Limitations in Apache Cassandra:
Cliff Click: Many concurrent algorithms are very easy to write with a GC and totally hard (to downright impossible) using explicit free.
As I was expecting, quite a few good things will come out of this benchmark, both for long-time HBase users and for new adopters.
Original title and link: What HBase Learned From the Hypertable vs HBase Benchmark ( ©myNoSQL)
After a very long silence (my last post about Hypertable dates back to Oct. 2010: NoSQL database architectures and Hypertable), there seems to be a bit of a revival in the Hypertable space:
- there are new packages of (commercial) services (PR announcement):
  - Uptime support subscription
  - Training and certification
  - Commercial license
- it seems like Hypertable has a customer in Rediff.com (India)
- it is taking yet another stab at HBase performance
While I’m somewhat glad that Hypertable didn’t hit the deadpool, it’s quite disappointing that they are still trying to use this old and completely useless strategy of attacking another product in the market.
There are probably many marketers out there encouraging companies to use this old trick of getting attention by attacking the market leader[1]. And one of the simplest ways of doing that is by saying “mine is bigger than yours”.
But these days this strategy doesn’t work anymore, for quite a few reasons:
- Benchmarks are incorrect most of the time, so the attention ends up pointed in the wrong direction.
- For existing users, performance issues are already known. They are also known to the core developers, who are always working to address them. So there is nothing new, just some angry users of the attacked product.
- For new users, performance is just one aspect of the decision. Most of the time, it’s one of the last considered. Community, support, adoption, and well-known case studies are much more important.
Attacking competitors based on feature checklists might be slightly effective in attracting a bit of attention, but it’s not the strategy to get users and customers and grow a community.
[1] HBase might not be a market leader, but it is definitely one of the NoSQL databases that have seen a few very large deployments. ↩
Original title and link: Hypertable Revival. Still the wrong strategy ( ©myNoSQL)
Lorenzo Alberton with an overview of the NoSQL landscape:
NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them, as in which situations they work better than a Relational Database, and how to choose one over another. This talk will give an overview of the NoSQL landscape and a classification for the different architectural categories, clarifying the base concepts and the terminology, and will provide a comparison of the features, the strengths and the drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).
Martin Schneider (Basho) trying to answer the question in the title:
Riak can be a data store to a purpose-built enterprise app; a caching layer for an Internet app, or part of the distributed fabric and DNA of a Global app. Those are of course highly arbitrary and vague examples, but it shows how flexible Riak is as a platform.
“Can be” is not quite the same as being the right solution, and even less so the best solution. And Martin’s answer to this is:
For super scalable enterprise and global apps — those where the data inside is inherently valuable and dependability of the system to capture, process and store data/writes is imperative — well I see Riak outperforming any perceived competitor in the space in providing value here.
But even for these scenarios, there’s competition from solutions like Cassandra, HBase, and Hypertable — the whole spectrum of scalable storage solutions based on Google BigTable and Amazon Dynamo being covered: HBase (a BigTable implementation), Cassandra (a solution using the BigTable data model and the Dynamo distributed model), and Riak (a solution based mainly on the Amazon Dynamo paper).
While Riak presents itself as the cleanest Dynamo-based solution, I would venture to say that both Cassandra and HBase come to the table with some interesting characteristics that cannot be ignored:
- Strong communities and community-driven development processes: both HBase and Cassandra are top-level Apache Software Foundation projects
- Excellent integration with Hadoop, the leading batch processing solution. DataStax, the company offering services for Cassandra, went the extra mile of creating a custom Hadoop solution, Brisk, making this integration even better.
Bottom line, I don’t think we can declare a winner in this space and I believe all three solutions will stay around for a while competing for every scenario requiring dependability of the system to capture, process and store data.
Cloudata is the third open source implementation of Google’s BigTable paper, after HBase and Hypertable. There’s already a 1.0 version, even if the GitHub project page lists just a couple of commits.
From the home page, Cloudata’s current features:
- Basic data service
- Single row operation(get, put)
- Multi row operation(like, between, scanner)
- Data uploader(DirectUploader)
- Simple cloudata query and supports JDBC driver
- Table Management
- Web based Monitor
- CLI Shell
- Master failover
- TabletServer failover
- Change log Server
- Reliable fast appendable change log server
- Support language
- Java, RESTful API, Thrift
I couldn’t figure out if this is just an experiment or if it actually plans to be a real project.
Update: Cloudata’s author, Jsjangg, mentions in the comment thread that Cloudata has been in use at www.searcus.com for 2 years already, running on a 20-machine cluster.
See why I haven’t included Cassandra in this list in the comment thread. ↩
Michael Stonebraker has published an article on the Vertica blog presenting 6 criteria for characterizing the completeness of a column store implementation:
- IO-1 (basic column store): Every storage block contains data from only ONE column.
- IO-2: Aggressive compression
- IO-3: No record-ids
- CPU-4: A column executor
- CPU-5: Executor runs on compressed data
- CPU-6: Executor can process columns that are key sequence or entry sequence
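To get a feel for what a few of these criteria mean in practice, here is a toy sketch covering IO-1 (one column per storage unit), IO-2 (compression, here plain run-length encoding), and CPU-5 (aggregating directly on the compressed form). It is only my illustration of the ideas under those assumptions: the function names and sample data are made up, and it is not how Vertica or any of the products mentioned here implement them.

```python
from itertools import groupby


def to_columns(rows):
    """IO-1: keep each column's values together instead of whole rows."""
    names = list(rows[0].keys())
    return {name: [row[name] for row in rows] for name in names}


def rle_encode(values):
    """IO-2 (toy version): run-length encode a column as (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def sum_rle(runs):
    """CPU-5: compute SUM() on the runs without decompressing them."""
    return sum(value * count for value, count in runs)


if __name__ == "__main__":
    rows = [
        {"country": "DE", "clicks": 2},
        {"country": "DE", "clicks": 2},
        {"country": "US", "clicks": 5},
    ]
    columns = to_columns(rows)
    clicks_runs = rle_encode(columns["clicks"])
    print(clicks_runs, sum_rle(clicks_runs))
```

The point of the last function is that the executor touches one entry per run rather than one entry per row, which is where sorted, highly repetitive columns pay off.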
Michael’s post is going after the big fish in the ocean (SybaseIQ, EMC Greenplum, Aster Data, Oracle), and in case this is the area that interests you, you should also check Curt Monash’s follow-up.
But getting back to these 6 criteria for column stores, I confess that this time they seem to make a lot of sense. So, I’m wondering how NoSQL column stores (Cassandra, HBase, and Hypertable) are doing from this perspective. I’d really appreciate some expert comments so we can have a follow-up with the status of NoSQL column stores according to these criteria.
While I don’t remember this article exactly, I’ve continued to maintain this separation, and my post’s intention is to make sure the separation is kept, but also to get experts’ feedback on the following questions:
- do any of these criteria apply to NoSQL column stores?
- if a criterion applies, then how do NoSQL column stores score on it?
- if a criterion doesn’t apply, why doesn’t it apply?
Yesterday was the NoSQL Frankfurt conference and today we have the chance to review some of the slide decks presented.
Beyond NoSQL with MarkLogic and The Universal Index
The GraphDB Landscape and sones
Achim Friedland (@ahzf) has provided a very interesting overview of graph database products, the goals and some scenarios for graph databases, a brief comparison of property graphs with other models (relational databases, object-oriented, semantic web/RDF), and many other interesting aspects.
Data Modeling with Cassandra Column Families
Neo4j Spatial - GIS for the rest of us
Cassandra vs Redis
Tim Lossen’s (@tlossen) slides compare Cassandra and Redis from the perspective of a Facebook game’s requirements. All I can say is that the conclusion is definitely interesting, but you’ll have to check the slides yourselves.
Mastering Massive Data Volumes with Hypertable
Doug Judd, who impressed me with his fantastic Hypertable: The Ultimate Scaling Machine at the Berlin Buzzwords NoSQL conference, gave a talk on Hypertable, its architecture, and its performance. The presentation also mentioned two Hypertable case studies: Zvents (an analytics platform) and Rediff.com (spam classification).
More presentations will be added as I receive them.
Original title and link: Hypertable 0.9.4.0 Released, Over 40 Improvements and Bug Fixes (NoSQL databases © myNoSQL)
Fantastic presentation by Doug Judd covering not only Hypertable but also other really scalable NoSQL databases: