HBase: All content tagged as HBase in NoSQL databases and polyglot persistence
Several of the most notable NoSQL databases targeting large scalable systems are written in Java: Cassandra, HBase, BigCouch. Then there’s also Hadoop, plus a series of caching and data grid solutions like Terracotta and Gigaspaces. They all face the same challenge: tuning the JVM garbage collector for predictable latency and throughput.
Jonathan Ellis’s slides presented at Fosdem 2012 cover some of the problems with GC and the way Cassandra tackles them. While this is one of those presentations where the slides are not enough to understand the full picture, going through them will still give you a couple of good hints.
For those saying that Java and the JVM are not the platform for writing large concurrent systems, here’s the quote Ellis finishes his slides with:
Cliff Click: Many concurrent algorithms are very easy to write with a GC and totally hard (to down right impossible) using explicit free.
Enjoy the slides after the break.
Edd Dumbill enumerates the various components of the Hadoop ecosystem:
Original title and link: The components and their functions in the Hadoop ecosystem ( ©myNoSQL)
After a very long silence (my last post about Hypertable dates back to Oct. 2010: NoSQL database architectures and Hypertable), there seems to be a bit of a revival in the Hypertable space:
- there are new packages of (commercial) services (PR announcement):
- Uptime support subscription
- Training and certification
- Commercial license
- it seems like Hypertable has a customer in Rediff.com (India)
- it is taking yet another stab at HBase performance
While I’m somewhat glad that Hypertable didn’t hit the deadpool, it’s quite disappointing that they are still trying to use this old and completely useless strategy of attacking another product in the market.
There are probably many marketers out there encouraging companies to use this old trick of getting attention by attacking the market leader[1]. And one of the simplest ways of doing that is by saying “mine is bigger than yours”.
But these days this strategy isn’t working anymore for quite a few reasons:
- Benchmarks are incorrect most of the time, so the attention ends up pointed in the wrong direction.
- For existing users, performance issues are already known. They are also known to the core developers, who are always working to address them. So nothing new, just some angry users of the attacked product.
- For new users, performance is just one aspect of the decision. Most of the time, it’s one of the last considered. Community, support, adoption, and well-known case studies are much more important.
Attacking competitors based on feature checklists might be slightly effective in attracting a bit of attention, but it’s not the strategy to get users and customers and grow a community.
[1] HBase might not be a market leader, but it is definitely one of the NoSQL databases that have seen a few very large deployments.
Original title and link: Hypertable Revival. Still the wrong strategy ( ©myNoSQL)
Just a quick roundup of the latest releases and announcements.
Hortonworks Data Platform (HDP) version 2
HDP v2 will include:
- NextGen MapReduce architecture
- HDFS NameNode HA
- HDFS Federation
- up-to-date HCatalog, HBase, Hive, Pig
According to the announcement:
In order to avoid confusion, let me explain the two versions of HDP:
- HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It is the most stable, production-ready version of Hadoop that is currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
- HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available later in Q1.
SolrCloud Completes Phase 2
Mark Miller about the completion of phase 2:
The second phase of SolrCloud has been in full swing for a couple of months now and it looks like we are going to be able to commit this work to trunk very soon! In Phase1 we built on top of Solr’s distributed search capabilities and added cluster state, central config, and built-in read side fault tolerance. Phase 2 is even more ambitious and focuses on the write side. We are talking full-blown fault tolerance for reads and writes, near real-time support, real-time GET, true single node durability, optimistic locking, cluster elasticity, improvements to the Phase 1 features, and more.
Not there yet, but it’s coming.
DataStax Community Server 1.0.7
A new release of DataStax’s distribution of Cassandra incorporating Cassandra 1.0.7
Don’t let the version number trick you. This is an important release for HBase featuring:
- new (self-migrating) file format
- AWS improvements: EBS support, building a HA cluster
I’m leaving you with Andrew Purtell’s slides about HBase Coprocessors:
Old Quora question with very good answers.
- (pro) can (potentially) query live data
- (pro) can (conceptually) be highly efficient at joining data sets that are identically sharded on the join key (the joins can be pushed down into the key-value store itself)
- (con) full scans (the most common pattern for map-reduce) are most likely to be much faster with raw file system access
- (con) because of the better decoupling of computation and storage in the GFS+Map-Reduce model - tolerating hot spots (resulting from MR jobs) is much easier
- (con) key-value stores are rarely arranged to have schemas optimized for analytics
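The second "pro" above is easier to see with a toy example. The following is a minimal sketch in plain Python, not the API of HBase, Cassandra, or Riak: the shard count, the `shard_of` function, and the sample data are all hypothetical. It only illustrates why two data sets sharded identically on the join key can be joined shard by shard, without moving rows between shards.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count, chosen for illustration


def shard_of(key, num_shards=NUM_SHARDS):
    """Map a join key to a shard; both data sets must use the same function."""
    return sum(key.encode("utf-8")) % num_shards


def partition(rows, num_shards=NUM_SHARDS):
    """Group (key, value) rows by shard, as an identically sharded store would."""
    shards = defaultdict(list)
    for key, value in rows:
        shards[shard_of(key, num_shards)].append((key, value))
    return shards


def local_join(left_rows, right_rows):
    """Hash join performed entirely inside one shard."""
    index = defaultdict(list)
    for key, value in left_rows:
        index[key].append(value)
    return [(key, lv, rv) for key, rv in right_rows for lv in index.get(key, [])]


def sharded_join(left, right, num_shards=NUM_SHARDS):
    """Join two data sets shard by shard; no row crosses a shard boundary."""
    left_shards = partition(left, num_shards)
    right_shards = partition(right, num_shards)
    joined = []
    for shard in range(num_shards):
        joined.extend(
            local_join(left_shards.get(shard, []), right_shards.get(shard, []))
        )
    return sorted(joined)


# Hypothetical sample data: (user, country) joined with (user, order amount).
users = [("alice", "NL"), ("bob", "US"), ("carol", "IN")]
orders = [("alice", 30), ("carol", 12), ("alice", 7), ("dave", 99)]
result = sharded_join(users, orders)
```

Because matching keys always land in the same shard, each `local_join` call could in principle run next to the data it reads. That is the sense in which the join can be pushed down into the store.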
Original title and link: Pros and Cons of Using MapReduce With Distributed Key-Value Stores: HBase, Cassandra, Riak ( ©myNoSQL)
It’s nice to see HBase and Hadoop in the Top 5 gainers[1] of OpenLogic’s open source adoption trending report, but the list of contenders in the database and big data category is way too short: HBase, Hadoop, MongoDB, MySQL, PostgreSQL, CouchDB.
[1] The top 5: HBase, Node.js, nginx, Hadoop, Rails
Original title and link: HBase and Hadoop in OpenLogic Top 5 Trending Open Source Projects ( ©myNoSQL)
Let’s start the year with a quick review of the latest releases that happened in December. Make sure that you scroll to the end as there are quite a few important ones.
Announced on Dec. 15th, MongoDB 2.0.2 is a bug fix release:
- Hit config server only once per mongos on meta data change to not overwhelm
- Removed unnecessary connection close and open between mongos and mongod after getLastError
- Replica set primaries close all sockets on stepDown()
- Do not require authentication for the buildInfo command
- scons option for using system libraries
Apache Hive 0.8.0
Just as a side note, who came out with the idea of having a Hive fans’ page on Facebook?
Apache ZooKeeper 3.4.2
ZooKeeper 3.4.0 was quickly followed by two new minor version updates fixing some critical bugs. The list of issues fixed in ZooKeeper 3.4.1 can be found here and for ZooKeeper 3.4.2 the 2 fixed bugs are listed here.
As with ZooKeeper 3.4.0, these versions are not yet production ready.
Apache Whirr 0.7.0
Apache Whirr 0.7.0 was released on Dec. 21st, featuring 56 improvements and bug fixes, including support for Puppet & Chef, and Mahout and Ganglia as a service. The complete list can be found here.
Some more details about Whirr 0.7.0 can be found here.
Apache HBase 0.90.5
Redis 2.4.5 was released on Dec. 23rd and provides 4 bug fixes:
- [BUGFIX] Fixed a ZUNIONSTORE/ZINTERSTORE bug that can cause a NaN to be inserted as a sorted set element score. This happens when one of the elements has -inf score and the weight used is 0.
- [BUGFIX] Fixed memory leak in
- [BUGFIX] Fixed a non critical SORT bug (Issue 224).
- [BUGFIX] Fixed a replication bug: now the timeout configuration is respected during the connection with the master.
- --quiet option implemented in the Redis test.
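The ZUNIONSTORE/ZINTERSTORE fix in the list above comes down to plain IEEE 754 arithmetic: an infinite score multiplied by a weight of 0 is NaN. Here is a small sketch in Python, not Redis code. The function names are hypothetical, and the guard shown is only one possible way to handle the case, not necessarily what the Redis fix does.

```python
import math


def weighted_score(score, weight):
    """Naive weighted score: the element score multiplied by the weight."""
    return score * weight


def safe_weighted_score(score, weight):
    """Hypothetical guard: map a NaN product (infinity times zero) to 0.0."""
    product = score * weight
    return 0.0 if math.isnan(product) else product


naive = weighted_score(float("-inf"), 0)         # NaN under IEEE 754
guarded = safe_weighted_score(float("-inf"), 0)  # 0.0 with the guard
```

A NaN score is troublesome in a sorted structure because NaN does not compare as less than, equal to, or greater than any other value, so the element has no well-defined position.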
Last but definitely one of the most important announcements that came in December:
Based on the 0.20-security code line, Hadoop 1.0.0 was announced on Dec.29. This release includes support for:
- HBase (append/hsynch/hflush) and Security
- Webhdfs (with full support for security)
- Performance enhanced access to local files for HBase
- Other performance enhancements, bug fixes, and features
- All version 0.20.205 and prior 0.20.2xx features
Complete release notes are available here.
And with this we are ready for 2012.
Original title and link: Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0 ( ©myNoSQL)