


HBase: All content tagged as HBase in NoSQL databases and polyglot persistence

A Tour of Amazon DynamoDB Features and API

Mathias Meyer’s walk through the DynamoDB features and API with commentary:

Sorted range keys, conditional updates, atomic counters, structured data and multi-valued data types, fetching and updating single attributes, strong consistency, and no explicit way to handle and resolve conflicts other than conditions. A lot of features DynamoDB has to offer remind me of everything that’s great about wide column stores like Cassandra, but even more so of HBase. This is great in my opinion, as Dynamo would probably not be well-suited for a customer-facing system. And indeed, Werner Vogels’ post on DynamoDB seems to suggest DynamoDB is a bastard child of Dynamo and SimpleDB, though with lots of sugar sprinkled on top.

Think of it as an extended, better articulated, and closer-to-the-API version of my notes about Amazon DynamoDB.
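Of the features Meyer lists, conditional updates and atomic counters are the most API-visible. Here is a minimal in-memory sketch of their semantics, assuming nothing about DynamoDB’s actual client API; the class, method, and item names are all invented for illustration:

```python
# Toy model of DynamoDB-style conditional updates and atomic counters.
# All names here are illustrative, not part of any real SDK.

class MiniTable:
    def __init__(self):
        self.items = {}  # key -> dict of attributes

    def update(self, key, attr, value, expected=None):
        """Apply the write only if `expected` matches the current value
        (None means unconditional), mirroring an Expected/condition clause."""
        item = self.items.setdefault(key, {})
        if expected is not None and item.get(attr) != expected:
            raise ValueError("ConditionalCheckFailed")
        item[attr] = value

    def add(self, key, attr, delta):
        """Atomic counter: increment server-side instead of doing a racy
        read-modify-write from the client (trivially atomic in this toy)."""
        item = self.items.setdefault(key, {})
        item[attr] = item.get(attr, 0) + delta
        return item[attr]

t = MiniTable()
t.update("user#1", "status", "active")
t.add("user#1", "logins", 1)
t.add("user#1", "logins", 1)
print(t.items["user#1"]["logins"])  # 2
try:
    # Condition fails: status is "active", not "inactive".
    t.update("user#1", "status", "banned", expected="inactive")
except ValueError as e:
    print(e)  # ConditionalCheckFailed
```

The point of the condition mechanism is that, absent explicit conflict resolution, it is the only tool the API gives you for safe concurrent writes.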

Original title and link: A Tour of Amazon DynamoDB Features and API (NoSQL database©myNoSQL)


Dealing With JVM Limitations in Apache Cassandra

Some of the most notable NoSQL databases targeting large scalable systems are written in Java: Cassandra, HBase, BigCouch. Then there’s also Hadoop, plus a series of caching and data grid solutions like Terracotta and GigaSpaces. They are all facing the same challenge: tuning the JVM garbage collector for predictable latency and throughput.

Jonathan Ellis’s slides, presented at FOSDEM 2012, cover some of the problems with GC and the way Cassandra tackles them. While this is one of those presentations where the slides alone are not enough to understand the full picture, going through them will still give you a couple of good hints.
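As a taste of the knobs involved, here is a hedged sketch of the CMS-style settings of the kind `cassandra-env.sh` exposed at the time. The flag names are real HotSpot options; the values are examples, not Cassandra’s exact defaults:

```shell
# Illustrative CMS garbage-collector options for a latency-sensitive JVM
# service; values should be tuned per workload, not copied blindly.
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"                      # parallel young-gen collector
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"               # concurrent old-gen collector
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"         # shorten the remark pause
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"                   # eden vs survivor sizing
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"            # promote long-lived objects fast
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75" # start CMS before the heap fills
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"    # don't let ergonomics override it
```

The recurring theme in the slides is exactly this trade-off: starting collections early and promoting conservatively to avoid long stop-the-world pauses.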

For those saying that Java and the JVM are not the platform for writing large concurrent systems, here’s the quote Ellis closes his slides with:

Cliff Click: Many concurrent algorithms are very easy to write with a GC and totally hard (to downright impossible) using explicit free.

Enjoy the slides after the break.

The components and their functions in the Hadoop ecosystem

Edd Dumbill enumerates the various components of the Hadoop ecosystem:

Hadoop ecosystem

My quick reference of the Hadoop ecosystem includes a couple of other tools that are not in this list, with the exception of Ambari and HCatalog, which were released later.

Original title and link: The components and their functions in the Hadoop ecosystem (NoSQL database©myNoSQL)

Hadoop, HBase and R: Will Open Source Software Challenge BI & Analytics Software Vendors?

Harish Kotadia:

Predictive Analytics has been billed as the next big thing for almost fifteen years, but hasn’t gained mass acceptance so far the way ERP and CRM solutions have. One of the main reasons for this is the high upfront investment required in software, hardware, and talent for implementing a Predictive Analytics solution.

Well, this is about to change – […] Using R, HBase and Hadoop, it is possible to build cost-effective and scalable Big Data Analytics solutions that match or even exceed the functionality offered by costly proprietary solutions from leading BI/Analytics software vendors at a fraction of the cost.

Vendors will argue that software licensing represents just a small fraction of the costs of implementing BI or data analytics. What they’ll leave out is the cost of acquiring know-how and, more importantly, the costs of maintaining and modernizing their solutions.

Original title and link: Hadoop, HBase and R: Will Open Source Software Challenge BI & Analytics Software Vendors? (NoSQL database©myNoSQL)


Hypertable Revival. Still the wrong strategy

After a very long silence (my last post about Hypertable dates back to Oct. 2010: NoSQL database architectures and Hypertable), there seems to be a bit of a revival in the Hypertable space:

  1. there are new packages of (commercial) services (PR announcement):
    1. Uptime support subscription
    2. Training and certification
    3. Commercial license
  2. it seems like Hypertable has a customer in India
  3. it is taking yet another stab at HBase performance

While I’m somewhat glad that Hypertable didn’t hit the deadpool, it’s quite disappointing that they are still trying to use this old and completely useless strategy of attacking another product in the market.

There are probably many marketers out there encouraging companies to use this old trick of getting attention by attacking the market leader1. And one of the simplest ways of doing that is by saying “mine is bigger than yours“.

But these days this strategy isn’t working anymore for quite a few reasons:

  1. benchmarks are most of the time incorrect, thus the attention will be pointed in the wrong direction.

    In the case of the Hypertable vs HBase benchmark, JD Cryans (HBase veteran) is disputing the results.

  2. For existing users, performance issues are already known. Performance issues are also known by core developers that are always working to address them. So nothing new, just some angry users of the attacked product.

  3. For new users, performance is just one aspect of the decision. Most of the time, it’s one of the last considered. Community, support, adoption, and well-known case studies are much more important.

Attacking competitors based on feature checklists might be slightly effective in attracting a bit of attention, but it’s not the strategy to get users and customers and grow a community.

  1. HBase might not be a market leader, but it is definitely one of the NoSQL databases that has seen a few very large deployments. 

Original title and link: Hypertable Revival. Still the wrong strategy (NoSQL database©myNoSQL)

Designing HBase Schema to Best Support Specific Queries

Real scenario, very good analysis of different data access requirements, and three possible solutions. What’s your pick?

The problem is fairly simple - I am storing “notifications” in HBase, each of which has a status (“new”, “seen”, and “read”). Here are the APIs I need to provide:

  • Get all notifications for a user
  • Get all “new” notifications for a user
  • Get the count of all “new” notifications for a user
  • Update status for a notification
  • Update status for all of a user’s notifications
  • Get all “new” notifications across the database
  • Notifications should be scannable in reverse chronological order and allow pagination.
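One common family of solutions starts from the row key. As a minimal sketch of the idea, assuming a made-up key layout (user id plus a reversed, zero-padded timestamp) rather than any of the three proposed designs, reverse chronological scanning falls out of HBase’s lexicographic row ordering:

```python
# Hypothetical row-key design: "<user>#<MAX_TS - timestamp>" so that a
# prefix scan over one user's rows returns notifications newest-first.
# MAX_TS and the key format are illustrative assumptions.

MAX_TS = 10**13  # larger than any epoch-millis timestamp we expect

def row_key(user_id, ts_millis):
    # Zero-pad the reversed timestamp so lexicographic order == reverse time.
    return "%s#%013d" % (user_id, MAX_TS - ts_millis)

# Three notifications written out of order; sorting the keys (as HBase
# does physically) yields them newest-first: ts=3000, then 2000, then 1000.
keys = sorted(row_key("u42", ts) for ts in [1000, 3000, 2000])
print(keys[0] < keys[1] < keys[2])  # True
```

Pagination then becomes "scan from the last key seen", and the status dimension ("new" vs "read") is what forces the choice between encoding status in the key, in a column, or in a second index table.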

Original title and link: Designing HBase Schema to Best Support Specific Queries (NoSQL database©myNoSQL)


HBase Coprocessors Explained

A thorough post from Trend Micro Hadoop Group (Mingjie Lai, Eugene Koontz, Andrew Purtell) explaining all details of HBase coprocessors included in the latest HBase release 0.92.0:

Why HBase Coprocessors?

HBase has very effective MapReduce integration for distributed computation over data stored within its tables, but in many cases – for example simple additive or aggregating operations like summing, counting, and the like – pushing the computation up to the server where it can operate on the data directly without communication overheads can give a dramatic performance improvement over HBase’s already good scanning performance.

Also, before 0.92, it was not possible to extend HBase with custom functionality except by extending the base classes.

What are HBase Coprocessors?

In order to support sufficient flexibility for potential coprocessor behaviors, two different aspects of extension are provided by the framework. One is the observer, which is like a trigger in a conventional database; the other is the endpoint, a dynamic RPC endpoint that resembles a stored procedure.
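The observer side of this split can be sketched as ordinary hooks that fire around table operations, in the spirit of HBase’s RegionObserver pre/post callbacks. The class and method names below are invented for illustration, not HBase’s API:

```python
# Toy model of the observer pattern: hooks registered on a table fire
# before and after each write, and pre-hooks may rewrite the value.

class Table:
    def __init__(self):
        self.rows = {}
        self.observers = []

    def put(self, key, value):
        for obs in self.observers:
            value = obs.pre_put(key, value)  # may rewrite (or veto) the write
        self.rows[key] = value
        for obs in self.observers:
            obs.post_put(key, value)

class AuditObserver:
    """Records every committed write, like a trigger-based audit log."""
    def __init__(self):
        self.log = []
    def pre_put(self, key, value):
        return value  # pass the value through unchanged
    def post_put(self, key, value):
        self.log.append((key, value))

t = Table()
audit = AuditObserver()
t.observers.append(audit)
t.put("row1", "v1")
print(audit.log)  # [('row1', 'v1')]
```

Endpoints are the other half: instead of intercepting existing operations, they add new callable server-side operations, which this toy model does not cover.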

What can HBase Coprocessors be used for?

exciting new features can be built on top of it, for example secondary indexing, complex filtering (push down predicates), and access control.

These are just a couple of interesting points from this excellent article. I strongly suggest reading it.

Original title and link: HBase Coprocessors Explained (NoSQL database©myNoSQL)


More Details About Apache HBase 0.92.0

Jonathan Hsieh provides a summary of the new features in HBase 0.92.0 by splitting them into user features:

  • HFile v2, a new more efficient storage format
  • Faster recovery via distributed log splitting
  • Lower latency region-server operations via new multi-threaded and asynchronous implementations.

operator features:

  • An enhanced web UI that exposes more internal state
  • Improved logging for identifying slow queries
  • Improved corruption detection and repair tools

and developer features:

  • Coprocessors
  • Build support for Hadoop 0.20.20x, 0.22, 0.23.
  • Experimental: offheap slab cache and online table schema change

Earlier today, when covering the HBase 0.92.0 release, I wrote that coprocessors are the highlight of this release. I take that back: there are way too many interesting features in HBase 0.92.0 to highlight just one of them.

Original title and link: More Details About Apache HBase 0.92.0 (NoSQL database©myNoSQL)


Latest NoSQL Releases: HBase 0.92, DataStax Community Server, Hortonworks Data Platform, SolrCloud

Just a quick roundup of the latest releases and announcements.

Hortonworks Data Platform (HDP) version 2

HDP v2 will include:

  • NextGen MapReduce architecture
  • HDFS NameNode HA
  • HDFS Federation
  • up-to-date HCatalog, HBase, Hive, Pig

According to the announcement:

In order to avoid confusion, let me explain the two versions of HDP:

  • HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It is the most stable, production-ready version of Hadoop, currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
  • HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available later in Q1.

SolrCloud Completes Phase 2

Mark Miller about the completion of phase 2:

The second phase of SolrCloud has been in full swing for a couple of months now and it looks like we are going to be able to commit this work to trunk very soon! In Phase1 we built on top of Solr’s distributed search capabilities and added cluster state, central config, and built-in read side fault tolerance. Phase 2 is even more ambitious and focuses on the write side. We are talking full-blown fault tolerance for reads and writes, near real-time support, real-time GET, true single node durability, optimistic locking, cluster elasticity, improvements to the Phase 1 features, and more.

Not there yet, but it’s coming.

DataStax Community Server 1.0.7

A new release of DataStax’s distribution of Cassandra, incorporating Cassandra 1.0.7.

HBase 0.92

Don’t let the version number trick you. This is an important release for HBase featuring:

  • coprocessors
  • security
  • new (self-migrating) file format
  • AWS improvements: EBS support, building an HA cluster

The list of new features, improvements, and bug fixes in HBase 0.92 is impressive. But the highlight of this release is in my opinion HBase coprocessors (Jira entry HBASE-200).

I’m leaving you with Andrew Purtell’s slides about HBase Coprocessors:

Pros and Cons of Using MapReduce With Distributed Key-Value Stores: HBase, Cassandra, Riak

Old Quora question with very good answers.

  • (pro) can (potentially) query live data
  • (pro) can (conceptually) be highly efficient at joining data sets that are identically sharded on the join key (the joins can be pushed down into the key-value store itself)
  • (con) full scans (the most common pattern for map-reduce) are most likely to be much faster with raw file system access
  • (con) the GFS+MapReduce model better decouples computation and storage, which makes it much easier there to tolerate hot spots (resulting from MR jobs)
  • (con) key-value stores are rarely arranged to have schemas optimized for analytics

Naoki Yanai
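The "identically sharded on the join key" pro deserves a concrete illustration: when two datasets use the same placement function, a join can run shard-by-shard with no data shuffle. This sketch uses invented data and a trivial hash-based sharding function as assumptions:

```python
# Co-sharded join sketch: both datasets are placed by the same function,
# so matching keys always land on the same shard and the join is local.

def shard(key, n=2):
    return hash(key) % n  # identical placement for both datasets

users  = {"u1": "alice", "u2": "bob"}
orders = {"u1": [101], "u2": [202, 203]}

def by_shard(d, n=2):
    """Group a dataset by shard, as the store would physically place it."""
    shards = [dict() for _ in range(n)]
    for k, v in d.items():
        shards[shard(k, n)][k] = v
    return shards

joined = {}
for us, os_ in zip(by_shard(users), by_shard(orders)):
    for k in us:                 # join entirely within one shard,
        if k in os_:             # no cross-shard data movement needed
            joined[k] = (us[k], os_[k])

print(sorted(joined))  # ['u1', 'u2']
```

This is the pushdown the answer alludes to: the key-value store’s own partitioning does the work a MapReduce shuffle would otherwise do.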

Original title and link: Pros and Cons of Using MapReduce With Distributed Key-Value Stores: HBase, Cassandra, Riak (NoSQL database©myNoSQL)

Setting Up, Modeling and Loading Data in HBase With Hadoop and Clojure: NoSQL Tutorials

Even if you are not familiar with Clojure, you’ll still enjoy this fantastic HBase tutorial:

And that’s the thing: if you are loading literally gajigabytes of data into HBase you need to be pretty sure that it’s going to be able to answer your questions in a reasonable amount of time. Simply cramming it in there probably won’t work (indeed, that approach probably won’t work great for anything). I loaded and re-loaded a test set of twenty thousand rows until I had something that worked.

Original title and link: Setting Up, Modeling and Loading Data in HBase With Hadoop and Clojure: NoSQL Tutorials (NoSQL database©myNoSQL)


HBase and Hadoop in OpenLogic Top 5 Trending Open Source Projects

It’s nice to see HBase and Hadoop in the Top 5 gainers[1] of OpenLogic’s open source adoption trending report, but the list of contenders in the database and big data category is way too short: HBase, Hadoop, MongoDB, MySQL, PostgreSQL, CouchDB.

  1. The top 5: HBase, Node.js, nginx, Hadoop, Rails  

Original title and link: HBase and Hadoop in OpenLogic Top 5 Trending Open Source Projects (NoSQL database©myNoSQL)