
BigTable: All content tagged as BigTable in NoSQL databases and polyglot persistence

Hypertable Revival. Still the wrong strategy

After a very long silence (my last post about Hypertable dates back to Oct. 2010: NoSQL database architectures and Hypertable), there seems to be a bit of a revival in the Hypertable space:

  1. there are new packages of (commercial) services (PR announcement):
    1. Uptime support subscription
    2. Training and certification
    3. Commercial license
  2. it seems like Hypertable has a customer in Rediff.com (India)
  3. it is taking yet another stab at HBase performance

While I’m somewhat glad that Hypertable didn’t hit the deadpool, it’s quite disappointing that they are still trying to use this old and completely useless strategy of attacking another product in the market.

There are probably many marketers out there encouraging companies to use this old trick of getting attention by attacking the market leader[1]. And one of the simplest ways of doing that is by saying “mine is bigger than yours”.

But these days this strategy isn’t working anymore for quite a few reasons:

  1. benchmarks are most of the time incorrect, thus the attention will be pointed in the wrong direction.

    In the case of the Hypertable vs HBase benchmark, JD Cryans (HBase veteran) is disputing the results.

  2. For existing users, performance issues are already known. Performance issues are also known by core developers that are always working to address them. So nothing new, just some angry users of the attacked product.

  3. For new users, performance is just one aspect of the decision. Most of the time, it’s one of the last considered. Community, support, adoption, and well-known case studies are much more important.

Attacking competitors based on feature checklists might be slightly effective in attracting a bit of attention, but it’s not the strategy to get users and customers and grow a community.


  1. HBase might not be a market leader, but it is definitely one of the NoSQL databases that have seen a few very large deployments.

Original title and link: Hypertable Revival. Still the wrong strategy (NoSQL database©myNoSQL)


How Does Google MegaStore Compare Against HDFS/HBase?

Alex Feinberg answering the question in the title:

This is like saying “how does a General Motors bus compare against a Ford engine”. MegaStore is built on top of Google’s BigTable/GFS. HBase/HDFS are BigTable/GFS work-alikes.

BigTable and HBase give up availability (in the CAP Theorem sense) in favour of consistency: when a tablet master node (HRegionServer in HBase) goes down, the portion of the keyspace the failed node is responsible for becomes (briefly) unavailable until another node takes over that portion of the key space. This is efficient, as the data/write-ahead-log is stored in GFS (or HDFS): in a way, serializing writes to GFS/HDFS (a file system with relaxed consistency semantics) through a single node ensures serializable consistency.

Make sure you read it all.
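
The failover behaviour described in the quote can be illustrated with a toy model. This is only a sketch under stated assumptions, not HBase or BigTable code: the `ToyCluster` class, its method names, and the server names are all hypothetical, and a plain dictionary stands in for the shared file system (GFS/HDFS). It shows a key range becoming unavailable when its single owner fails, and the data being readable again once another server takes the range over.

```python
from bisect import bisect_right


class ToyCluster:
    """Toy model: every key range is served by exactly one server.

    All names here are hypothetical; this is not the HBase API.
    """

    def __init__(self, split_points, servers):
        # N split points divide the sorted keyspace into N + 1 ranges,
        # so we expect len(servers) == len(split_points) + 1.
        self.split_points = list(split_points)
        self.owners = list(servers)   # owners[i] serves range i
        self.alive = set(servers)
        self.store = {}               # stands in for shared storage (GFS/HDFS)

    def _owner_of(self, key):
        return self.owners[bisect_right(self.split_points, key)]

    def _check_available(self, key):
        # A single owner per range keeps writes serialized, but a dead
        # owner makes the whole range unavailable.
        if self._owner_of(key) not in self.alive:
            raise RuntimeError(f"range holding {key!r} is unavailable")

    def put(self, key, value):
        self._check_available(key)
        self.store[key] = value

    def get(self, key):
        self._check_available(key)
        return self.store.get(key)

    def fail(self, server):
        self.alive.discard(server)

    def reassign(self, dead, replacement):
        # The data survives the failure because it lives in shared storage;
        # only the ownership of the range has to move.
        self.owners = [replacement if s == dead else s for s in self.owners]
```

With `ToyCluster(["m"], ["rs1", "rs2"])`, failing `rs1` makes keys below `"m"` raise an error while keys served by `rs2` stay readable, and `reassign("rs1", "rs2")` restores access without copying any data.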

Original title and link: How Does Google MegaStore Compare Against HDFS/HBase? (NoSQL database©myNoSQL)

via: http://www.quora.com/How-does-Google-MegaStore-compare-against-HDFS-HBase


Accumulo: A New BigTable Inspired Distributed Key/Value by NSA

The National Security Agency has submitted a proposal to the Apache Incubator to open source Accumulo, a BigTable-inspired key-value store that it has been building since 2008. The project proposal page provides more details about Accumulo’s history, its building blocks, and how it compares to the other open source BigTable implementation, HBase:

  • Access Labels: Accumulo has an additional portion of its key that sorts after the column qualifier and before the timestamp. It is called column visibility and enables expressive cell-level access control. Authorizations are passed with each query to control what data is returned to the user.

  • Iterators: Accumulo has a novel server-side programming mechanism that can modify the data written to disk or returned to the user. This mechanism can be configured for any of the scopes where data is read from or written to disk. It can be used to perform joins on data within a single tablet.

  • Flexibility: Accumulo places no restrictions on the column families. Also, each column family in HBase is stored separately on disk. Accumulo allows column families to be grouped together on disk, as does BigTable.

  • Logging: HBase uses a write-ahead log on the Hadoop Distributed File System. Accumulo has its own logging service that does not depend on communication with the HDFS NameNode.

  • Storage: Accumulo has a relative key file format that improves compression.
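
To make the access labels point more concrete, here is a minimal sketch of a key that carries a column visibility between the column qualifier and the timestamp, with authorizations passed on each scan. It is not Accumulo’s API: the `Key` type and the `scan` function are hypothetical, the visibility is simplified to a set of labels that must all be present (no boolean expressions), and ordering timestamps newest-first is an assumption made for the sketch.

```python
from typing import FrozenSet, List, NamedTuple, Set, Tuple


class Key(NamedTuple):
    """Hypothetical cell key; not Accumulo's actual Key class."""
    row: str
    family: str
    qualifier: str
    visibility: FrozenSet[str]  # simplification: every label is required
    timestamp: int


def sort_key(k: Key):
    # The visibility sorts after the column qualifier and before the timestamp.
    # Newest-first timestamps are an assumption of this sketch.
    return (k.row, k.family, k.qualifier, sorted(k.visibility), -k.timestamp)


def scan(cells: List[Tuple[Key, str]], authorizations: Set[str]) -> List[Tuple[Key, str]]:
    """Return cells in key order, keeping only those the caller may see."""
    ordered = sorted(cells, key=lambda cell: sort_key(cell[0]))
    # A cell is visible when all of its labels are in the caller's authorizations.
    return [(k, v) for k, v in ordered if k.visibility <= authorizations]
```

A cell with an empty visibility is returned to everyone, while a cell labelled `{"hr", "admin"}` is filtered out of a scan that only presents the `hr` authorization.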

You can read more about Accumulo here and check the Hacker News and Reddit discussions.

Michael Stack has commented on the HBase mailing list:

The cell based ‘access labels’ seem like a matter of adding an extra field to KV and their Iterators seem like a specialization on Coprocessors. The ability to add column families on the fly seems too minor a difference to call out especially if online schema edits are now (soon) supported. They talk of locality group like functionality too — that could be a significant difference. We would have to see the code but at first blush, differences look small.

Original title and link: Accumulo: A New BigTable Inspired Distributed Key/Value by NSA (NoSQL database©myNoSQL)


Paper: Google Fusion Tables: Data Management, Integration and Collaboration in the Cloud

This paper from Google talks extensively about the usage of BigTable and Megastore, the data model, query processing, and transaction handling in the implementation of Google Fusion Tables.

Google Fusion Tables is a cloud-based service for data management and integration. Fusion Tables enables users to upload tabular data files (spreadsheets, CSV, KML), currently of up to 100MB. The system provides several ways of visualizing the data (e.g., charts, maps, and timelines) and the ability to filter and aggregate the data. It supports the integration of data from multiple sources by performing joins across tables that may belong to different users. […] This paper describes the inner workings of Fusion Tables, including the storage of data in the system and the tight integration with the Google Maps infrastructure.

Download the paper or read it after the break.


Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB

Dhanji R. Prasanna leaving Google:

Here is something you may have heard but never quite believed before: Google’s vaunted scalable software infrastructure is obsolete. Don’t get me wrong, their hardware and datacenters are the best in the world, and as far as I know, nobody is close to matching it. But the software stack on top of it is 10 years old, aging and designed for building search engines and crawlers. And it is well and truly obsolete.

Protocol Buffers, BigTable and MapReduce are ancient, creaking dinosaurs compared to MessagePack, JSON, and Hadoop. And new projects like GWT, Closure and MegaStore are sluggish, overengineered Leviathans compared to fast, elegant tools like jQuery and mongoDB. Designed by engineers in a vacuum, rather than by developers who have need of tools.

Maybe it is just the disappointment of someone whose main project was killed. Or maybe it is true. Or maybe it is just another magic triangle:

[Figure: the Agility / Scalability / Coolness factor triangle]

Edward Ribeiro mentioned a post from another ex-Googler which points out similar issues with Google’s philosophy.

Original title and link: Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB (NoSQL databases © myNoSQL)

via: http://rethrick.com/#waving-goodbye


Cloudata: New Open Source BigTable Implementation

Cloudata is the third open source implementation of Google’s BigTable paper, after HBase and Hypertable[1]. There’s already a 1.0 version even if the GitHub project page is listing just a couple of commits.

From the home page, Cloudata’s current features:

  • Basic data service
    • Single row operation(get, put)
    • Multi row operation(like, between, scanner)
    • Data uploader(DirectUploader)
    • MapReduce(TabletInputFormat)
    • Simple cloudata query and supports JDBC driver
  • Table Management
    • split
    • distribution
    • compaction
  • Utility
    • Web based Monitor
    • CLI Shell
  • Failover
    • Master failover
    • TabletServer failover
  • Change log Server
    • Reliable fast appendable change log server
  • Support language
    • Java, RESTful API, Thrift

I couldn’t figure out if this is just an experiment or if it actually plans to be a real project.

Update: Cloudata’s author, Jsjangg, mentions in the comment thread that Cloudata has been used at www.searcus.com for 2 years already, running on a 20-machine cluster.


  1. See why I haven’t included Cassandra in this list in the comment thread.  

Original title and link: Cloudata: New Open Source BigTable Implementation (NoSQL databases © myNoSQL)


Google App Engine High Replication Datastore

The High Replication Datastore provides the highest level of availability for your reads and writes, at the cost of increased latency for writes and changes in consistency guarantees in the API. The High Replication Datastore increases the number of data centers that maintain replicas of your data by using the Paxos algorithm to synchronize that data across datacenters in real time.

Still not completely decentralized a la Amazon Dynamo.

Original title and link: Google App Engine High Replication Datastore (NoSQL databases © myNoSQL)

via: http://googleappengine.blogspot.com/2011/01/announcing-high-replication-datastore.html


Google BigTable Paper Summarized

The slides below summarizing the Google BigTable paper are the result of a NOSQLSummer meeting in Tokyo. Nice!

Update: I just realized that the company that hosted this meeting, Gemini Mobile Technologies, is the same one that announced the new key-value store Hibari yesterday.


NoSQL News & Links 2010-03-29

  1. Kirk McKusick, Sean Quinlan: ☞ GFS: Evolution on Fast-Forward

    Sounds pretty hackish here and there. But it might be very similar to the whole philosophy behind NoSQL: solve your own problem, come up with a cost effective approach and deliver value to your company.

  2. blog.katipo.co.nz: ☞ Installing MongoDB on Mac OS X using Homebrew.

    As a side note, last week I went through installing CouchDB master on my Mac and discovered that while Homebrew works pretty well, you’ll still need to hack it from time to time. Plus, having MacPorts installed on the same machine made this process almost impossible.

  3. Loraine Lawson: ☞ The Business Pros and Cons of NoSQL

Characterizing Enterprise Systems using the CAP theorem

When building your next distributed system, you will have to make sure that all subsystems are able to deliver the combination of consistency-availability-partition tolerance that you are looking for.

Taylor’s article is a great start for categorizing some of the (enterprise) systems out there according to the CAP theorem: Terracotta, Oracle Coherence, GigaSpaces, but also RDBMS and a couple of NoSQL solutions like Amazon Dynamo, BigTable, Cassandra, CouchDB and Project Voldemort.

Another interesting aspect of the article is that it tries to identify how these systems are coping with the missing CAP dimension. Unfortunately, there are a couple of things in the RDBMS analysis that I do not agree with.

An RDBMS provides availability, but only when there is connectivity between the client accessing the RDBMS and the RDBMS itself.

[…] there are several well-known approaches that can be employed to compensate for the lack of Partition tolerance. One of these approaches is commonly referred to as master/slave replication.

RDBMS are not available by themselves. Leaving aside the connectivity issue, RDBMS can become busy performing complex operations or run out of resources and so they can be unavailable.

Master/slave setups, which the article identifies as a solution for dealing with partition tolerance, are in fact meant to provide some level of availability. But with master/slave replication, consistency becomes only “eventual consistency”.

The other approach mentioned, sharding, is indeed a solution meant to provide some level of partition tolerance. But without replication it gives up availability.
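
The point about master/slave setups and eventual consistency can be sketched with a toy example. It assumes asynchronous replication, which is one common configuration rather than the only one, and the `Master`, `Replica`, and `replicate` names are hypothetical and do not belong to any real database. The master acknowledges a write before shipping it, so a read from the replica can return stale data until the replication step runs.

```python
from collections import deque


class Master:
    """Hypothetical master that acknowledges writes before replicating them."""

    def __init__(self):
        self.data = {}
        self.pending = deque()  # writes not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))  # replication happens later (async)


class Replica:
    """Hypothetical read-only slave."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


def replicate(master, replica):
    """Ship every pending write to the replica, in the order it was made."""
    while master.pending:
        key, value = master.pending.popleft()
        replica.data[key] = value
```

Between `master.write("x", 1)` and the next call to `replicate`, `replica.read("x")` still returns the old value; the replica stays available for reads, but only catches up eventually.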

As side notes:

  • it was interesting to learn that GigaSpaces can behave as either a CA or an AP system, depending on the configurable replication scheme (sync vs async).
  • I am wondering if there are any CP solutions out there. I’d speculate that financial services would probably be required to be CP (if distributed).

via: http://javathink.blogspot.com/2010/01/characterizing-enterprise-systems-using.html


HBase vs. BigTable Comparison

Even though information about BigTable is scarce (basically everything known so far comes from either the original paper [1] or Jeff Dean’s presentation [2]), Lars George manages to compare over 40 features and concludes:

If HBase as an open-source project with just a handful of committers, most of whom have full-time day jobs, can achieve something even remotely comparable I think this is a huge success. And looking at the 0.21 and 0.22 road map, the already small gap is going to shrink even further!

For an excellent reference to HBase architecture, I’d recommend another article authored by Lars George ☞ HBase Architecture 101 - Storage.

Now based on this HBase vs BigTable Comparison and this other one on HBase vs Cassandra, you can conduct your own three-way comparison and see if you agree that Cassandra is winning the NoSQL race.

via: http://www.larsgeorge.com/2009/11/hbase-vs-bigtable-comparison.html


Cassandra Winning the NoSQL Race… Is It Really?

Tony Bain was probably ☞ tricked into thinking so based on news that Cassandra is used by Digg [1], Twitter [2] etc. To me those are just signs that:

  • Cassandra has finally gathered a community behind it [3]
  • they have identified good or common use cases

Secondly, the NoSQL world is quite wide. Cassandra is a column-oriented store (in the same category: BigTable, Hypertable, HBase), but we also have key-value stores, document stores, and graph stores (see [4], [5] and [6] for more details), so saying that it is winning the race is incorrect. At best, it should be compared with the other column-oriented solutions.

Thinking of HBase, we recently learnt [7] that it is doing well too, that there are real-life production applications running on it, and that it has seen good performance improvements over the last couple of releases. And as far as I know there is a larger community behind it.

You should also check the HBase vs. Cassandra: NoSQL Battle! article to better understand how they compare and where they differ and also Cassandra Gets (Better) Documentation for some very good references.