bigtable: All content tagged as bigtable in NoSQL databases and polyglot persistence

Introduction to Apache HBase Snapshots

Matteo Bertozzi introduces HBase snapshots:

Prior to CDH 4.2, the only way to back up or clone a table was to use Copy/Export Table, or, after disabling the table, copy all the hfiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table, but with a direct impact on Region Server performance. Disabling the table stops all reads and writes, which will almost always be unacceptable.

In contrast, HBase snapshots allow an admin to clone a table without data copies and with minimal impact on Region Servers. Exporting the snapshot to another cluster does not directly affect any of the Region Servers; export is just a distcp with an extra bit of logic.

The part that made me really curious, and that didn’t make much sense when I first read the post, is “clone a table without data copies”. But the post clarifies what a snapshot is:

A snapshot is a set of metadata information that allows an admin to get back to a previous state of the table. A snapshot is not a copy of the table; it’s just a list of file names and doesn’t copy the data. A full snapshot restore means that you get back to the previous “table schema” and you get back your previous data losing any changes made since the snapshot was taken.
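
To make that concrete, here’s a minimal sketch of the admin-side workflow using the HBaseAdmin client API (the table and snapshot names are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    // Minimal sketch of the snapshot workflow; table and snapshot
    // names are hypothetical.
    public class SnapshotExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            try {
                // Snapshot a live table: only metadata and a list of
                // hfile names are recorded; no data is copied.
                admin.snapshot("users-snapshot-20130301", "users");

                // Clone the snapshot into a new table that initially
                // shares the same hfiles.
                admin.cloneSnapshot("users-snapshot-20130301", "users_clone");

                // Or roll the original table back to the snapshot's
                // state (the table must be disabled first).
                admin.disableTable("users");
                admin.restoreSnapshot("users-snapshot-20130301");
                admin.enableTable("users");
            } finally {
                admin.close();
            }
        }
    }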

What I still don’t understand is how snapshots work after a major compaction (which drops deletes and expired cells).

Original title and link: Introduction to Apache HBase Snapshots (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/03/introduction-to-apache-hbase-snapshots/


Adding Value Through Graph Analysis Using Titan and Faunus

Interesting slidedeck by Matthias Broecheler introducing three graph-related tools developed by Vadas Gintautas, Marko Rodriguez, Stephen Mallette, and Daniel LaRocque:

  1. Titan: a massive scale property graph allowing real-time traversals and updates
  2. Faunus: for batch processing of large graphs using Hadoop
  3. Fulgora: for running global graph algorithms on large, compressed, in-memory graphs

The first couple of slides also show some possible use cases where these tools would prove their usefulness.
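
To get a feel for what a “property graph allowing real-time traversals and updates” looks like in code, here’s a minimal, hypothetical sketch against Titan’s Blueprints API (the storage directory, property names, and edge label are all made up):

    import com.thinkaurelius.titan.core.TitanFactory;
    import com.thinkaurelius.titan.core.TitanGraph;
    import com.tinkerpop.blueprints.Direction;
    import com.tinkerpop.blueprints.Vertex;

    public class TitanSketch {
        public static void main(String[] args) {
            // Open a local graph (the directory is made up); in production
            // Titan typically runs on Cassandra or HBase.
            TitanGraph graph = TitanFactory.open("/tmp/titan-demo");

            // Real-time updates: add two vertices and an edge between them.
            Vertex alice = graph.addVertex(null);
            alice.setProperty("name", "alice");
            Vertex bob = graph.addVertex(null);
            bob.setProperty("name", "bob");
            graph.addEdge(null, alice, bob, "knows");

            // Real-time traversal: who does alice know?
            for (Vertex friend : alice.getVertices(Direction.OUT, "knows")) {
                System.out.println(friend.getProperty("name"));
            }

            graph.commit();
            graph.shutdown();
        }
    }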

Original title and link: Adding Value Through Graph Analysis Using Titan and Faunus (NoSQL database©myNoSQL)


Simplifying HBase Schema Development With KijiSchema

Jon Natkins from WibiData:

When building an HBase application, you need to be aware of the intricacies and quirks of HBase. For example, your choice of names for column families, or for the columns themselves, can have a drastic effect on the amount of disk space necessary to store your data. In this article, we’ll see how building HBase applications with KijiSchema can help you avoid inefficient disk utilization.
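
The reason names matter so much is that HBase physically stores the row key, column family, and qualifier alongside every single cell, so verbose names are repeated on disk once per value. A small, hypothetical illustration with the plain HBase client (all names are made up):

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ColumnNameOverhead {
        public static void main(String[] args) {
            // Every cell written below carries its row key, family, and
            // qualifier on disk, so the verbose names are stored once
            // per value, not once per table.
            Put verbose = new Put(Bytes.toBytes("user#0000000042"));
            verbose.add(Bytes.toBytes("user_profile_information"), // family
                        Bytes.toBytes("date_of_registration"),     // qualifier
                        Bytes.toBytes("2013-03-01"));

            // The same data with a fraction of the per-cell overhead.
            Put terse = new Put(Bytes.toBytes("u#42"));
            terse.add(Bytes.toBytes("p"),   // family
                      Bytes.toBytes("reg"), // qualifier
                      Bytes.toBytes("2013-03-01"));
        }
    }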

The recommendations related to the length of column names are one of those subtle signs of how young the NoSQL space is¹.


  1. This is not specific to HBase; it also applies to MongoDB, RethinkDB, etc. 

Original title and link: Simplifying HBase Schema Development With KijiSchema (NoSQL database©myNoSQL)

via: http://www.kiji.org/2012/03/01/using-disk-space-efficiently-with-kiji-schema


Brief Intro to Cassandra in 27 Slides

If you’ve never looked into Apache Cassandra, Michaël Figuière’s slidedeck will give you a quick intro to Cassandra’s main concepts.

Apache Cassandra 1.2 introduces some new features, such as a binary protocol and collection datatypes, that together with the now finalized CQL3 query language provide a new interface to communicate with Cassandra, one that dramatically shrinks its learning curve and simplifies its daily use while still relying on its highly scalable architecture and storage engine. This presentation will iterate over all these new features, including an overview of the CQL3 query language, a look at the new client architecture, and an update on data modeling best practices. Then we’ll see how to implement an enterprise application using this new interface, so that the audience can see that a number of design principles are inspired by those commonly used with relational databases, while some others are entirely different due to Cassandra’s partitioning approach.
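
As a taste of that new interface, here’s a minimal, hypothetical example combining a CQL3 collection type with the binary-protocol Java driver (contact point, keyspace, and schema are made up):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class Cql3CollectionsExample {
        public static void main(String[] args) {
            // The Java driver speaks the new binary protocol instead of Thrift.
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            session.execute("CREATE KEYSPACE demo WITH replication = " +
                "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            // set<text> is one of the new CQL3 collection types.
            session.execute("CREATE TABLE demo.users " +
                "(id text PRIMARY KEY, emails set<text>)");

            session.execute("UPDATE demo.users SET emails = emails + " +
                "{'alice@example.com'} WHERE id = 'alice'");

            for (Row row : session.execute("SELECT id, emails FROM demo.users")) {
                System.out.println(row.getString("id") + " -> " +
                    row.getSet("emails", String.class));
            }
            cluster.shutdown();
        }
    }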

Original title and link: Brief Intro to Cassandra in 27 Slides (NoSQL database©myNoSQL)


A Quick Tour of Internal Authentication and Authorization Security in DataStax Enterprise and Apache Cassandra

Robin Schumacher describes the new security features added to Apache Cassandra and DataStax Enterprise:

This article will concentrate on the new internal authentication and authorization (or permission management) features that are part of both open source Cassandra as well as DataStax Enterprise. Authentication deals with validating incoming user connections to a database cluster, whereas authorization concerns itself with what a logged in user can do inside a database.
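
As a minimal, hypothetical illustration (user name, password, and keyspace are made up): assuming the cluster has PasswordAuthenticator and CassandraAuthorizer enabled in cassandra.yaml, both sides are managed through CQL:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class InternalAuthExample {
        public static void main(String[] args) {
            // Authentication: connect as the default superuser
            // (cassandra/cassandra) provided by PasswordAuthenticator.
            Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withCredentials("cassandra", "cassandra")
                .build();
            Session session = cluster.connect();

            // Create a new, non-superuser login.
            session.execute("CREATE USER analyst WITH PASSWORD 'secret' NOSUPERUSER");

            // Authorization: the new user may read, but not modify,
            // a single keyspace.
            session.execute("GRANT SELECT ON KEYSPACE metrics TO analyst");

            cluster.shutdown();
        }
    }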

I’m happy to see NoSQL databases entering the security space, as this should ease their way into enterprises. But I slightly fear the moment when the marketing message changes from “it’s too early to provide security features” to “the first enterprise-grade NoSQL database”.

Original title and link: A Quick Tour of Internal Authentication and Authorization Security in DataStax Enterprise and Apache Cassandra (NoSQL database©myNoSQL)

via: http://www.planetcassandra.org/blog/post/a-quick-tour-of-internal-authentication-and-authorization-security-in-datastax-enterprise-and-apache-cassandra


Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem

Avik Dey (Intel) sent the announcement of the new open source project from Intel to the Hadoop mailing list:

As the Apache Hadoop ecosystem extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with Hadoop must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.

Project Rhino targets security at all levels: from encryption and key management to cell-level ACLs and audit logging.

Original title and link: Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem (NoSQL database©myNoSQL)

via: http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201302.mbox/%3cCD5137E5.15610%25avik.dey@intel.com%3e


From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix

Prasanna Padmanabhan and Shashi Madappa posted an article on the Netflix blog describing the process used to migrate data from Amazon SimpleDB to Cassandra:

There will come a time in the life of most systems serving data, when there is a need to migrate data to a more reliable, scalable and high performance data store while maintaining or improving data consistency, latency and efficiency. This document explains the data migration technique we used at Netflix to migrate the user’s queue data between two different distributed NoSQL storage systems.

The steps involved are what you’d expect for a large data set migration:

  1. forklift
  2. incremental replication
  3. consistency checking
  4. shadow writes
  5. shadow writes and shadow reads for validation (see the sketch after this list)
  6. end of life of the original data store (SimpleDB)
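
The shadow write/read phases are the interesting ones. Here’s a minimal sketch of the idea; the DataStore interface and the inconsistency handling are hypothetical, not Netflix’s actual code:

    // Hypothetical sketch of shadow writes and shadow reads during a
    // migration; not Netflix's actual code.
    interface DataStore {
        void put(String key, String value);
        String get(String key);
    }

    public class ShadowingStore implements DataStore {
        private final DataStore legacy; // e.g. SimpleDB, still the source of truth
        private final DataStore target; // e.g. Cassandra, being validated

        public ShadowingStore(DataStore legacy, DataStore target) {
            this.legacy = legacy;
            this.target = target;
        }

        @Override
        public void put(String key, String value) {
            legacy.put(key, value); // the authoritative write
            target.put(key, value); // the shadow write
        }

        @Override
        public String get(String key) {
            String expected = legacy.get(key);
            String shadow = target.get(key); // the shadow read, for validation
            if (shadow == null ? expected != null : !shadow.equals(expected)) {
                System.err.println("Inconsistency for key " + key);
            }
            return expected; // callers still see the legacy store's answer
        }
    }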

If you think about it, this is how a distributed, eventually consistent storage engine works (at least in broad strokes) when replicating data across the cluster. The main difference is that inside a storage engine you deal with a homogeneous system with a single set of constraints, while a data migration has to deal with heterogeneous systems, most often characterized by different limitations and behavior.

In 2009, Netflix performed a similar massive data migration operation. At that time it involved moving data from its own hosted Oracle and MySQL databases to SimpleDB. The challenges of operating this hybrid solution were described in the paper Netflix’s Transition to High-Availability Storage Systems, authored by Sid Anand.

Sid Anand now works at LinkedIn, where they use Databus for low-latency data transfer; Databus’s approach is very similar.

Original title and link: From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix (NoSQL database©myNoSQL)

via: http://techblog.netflix.com/2013/02/netflix-queue-data-migration-for-high.html?m=1


DataStax's Reaction to MySQL 5.6: Oracle’s MySQL Misses the NoSQL Mark

Jonathan Ellis in a post about MySQL 5.6 and how Oracle got the whole NoSQL thing wrong, considering that NoSQL is, in this exact order, about scaling, continuous availability, flexibility, performance, and queryability:

The big news for MySQL 5.6 was the inclusion of “NoSQL” features in the form of a memcached api for get and put operations.

In cases like this, it’s tough to tell whether Oracle got this so wrong deliberately to sow confusion in the market, or because they really think that’s what NoSQL is about.

I know Jonathan Ellis has always had very strong opinions about the technical superiority of Cassandra, and Cassandra is indeed a very solid solution, but I’m always reluctant to call a competitor stupid and to use the myopic argument “if I’m good at X and suck at Y, then what everyone is looking for is only X”.

Original title and link: DataStax’s Reaction to MySQL 5.6: Oracle’s MySQL Misses the NoSQL Mark (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/oracles-mysql-misses-the-nosql-mark


Flatten Entire HBase Column Families With Pig and Python UDFs

Chase Seibert:

Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode two dimensions, such as date and counter type.
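
Chase’s post implements this with Python UDFs, which aren’t reproduced here; as a rough sketch of the same flattening idea expressed as a Java EvalFunc (class name and example column names are made up), a UDF can turn a map of composite column names into a bag of (name, value) tuples:

    import java.io.IOException;
    import java.util.Map;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Flattens a Pig map of HBase columns (qualifier -> value) into a
    // bag of (qualifier, value) tuples, so that data encoded in column
    // names becomes plain data.
    public class FlattenColumnFamily extends EvalFunc<DataBag> {
        private static final TupleFactory TUPLES = TupleFactory.getInstance();
        private static final BagFactory BAGS = BagFactory.getInstance();

        @Override
        public DataBag exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            @SuppressWarnings("unchecked")
            Map<String, Object> columns = (Map<String, Object>) input.get(0);
            DataBag bag = BAGS.newDefaultBag();
            for (Map.Entry<String, Object> entry : columns.entrySet()) {
                Tuple t = TUPLES.newTuple(2);
                t.set(0, entry.getKey()); // e.g. "2013-02-10_pageviews"
                t.set(1, entry.getValue());
                bag.add(t);
            }
            return bag;
        }
    }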

Original title and link: Flatten Entire HBase Column Families With Pig and Python UDFs (NoSQL database©myNoSQL)

via: http://chase-seibert.github.com/blog/2013/02/10/pig-hbase-flatten-column-family.html


Apache HBase Internals: Locking and Multiversion Concurrency Control

Gregory Chanan explains HBase’s per-row ACID semantics and the use of row-level locks and MVCC to ensure them:

For writes:

  1. (w1) After acquiring the RowLock, each write operation is immediately assigned a write number.
  2. (w2) Each data cell in the write stores its write number.
  3. (w3) A write operation completes by declaring it is finished with the write number.

For reads:

  1. (r1) Each read operation is first assigned a read timestamp, called a read point.
  2. (r2) The read point is assigned to be the highest integer x such that all writes with write number <= x have been completed.
  3. (r3) A read r for a certain (row, column) combination returns the data cell with the matching (row, column) whose write number is the largest value that is less than or equal to the read point of r.
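
A toy sketch of that read-point bookkeeping, just to illustrate the rules above (this is not HBase’s actual MultiVersionConsistencyControl implementation):

    import java.util.TreeSet;

    // Toy model of the write-number/read-point rules; not HBase's code.
    public class MultiVersionClock {
        private long nextWriteNumber = 0;
        private final TreeSet<Long> pendingWrites = new TreeSet<Long>();
        private volatile long readPoint = 0;

        // (w1) a write is assigned a write number as it begins
        public synchronized long beginWrite() {
            long w = ++nextWriteNumber;
            pendingWrites.add(w);
            return w;
        }

        // (w3) the write declares itself finished with its write number
        public synchronized void completeWrite(long w) {
            pendingWrites.remove(w);
            // (r2) the read point is the highest x such that every write
            // numbered <= x has completed
            readPoint = pendingWrites.isEmpty()
                ? nextWriteNumber
                : pendingWrites.first() - 1;
        }

        // (r1) a read starts by taking the current read point; (r3) it then
        // ignores any cell whose write number exceeds this value
        public long beginRead() {
            return readPoint;
        }
    }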

It probably goes without saying that you should read and save this article if HBase is already in your datacenter, or at least on the horizon.

Original title and link: Apache HBase Internals: Locking and Multiversion Concurrency Control (NoSQL database©myNoSQL)

via: https://blogs.apache.org/hbase/entry/apache_hbase_internals_locking_and


Cassandra Performance in Review

Jonathan Ellis:

I honestly think Cassandra is one to two years ahead of the competition, but I’m under no illusions that Cassandra itself is perfect.

You cannot start the year without taking a stab at your competitors, at least from the performance point of view, and even if they’re not really competitors: MongoDB, Riak, HBase.

The NoSQL market is ant-sized compared to the database market, and while it’s easier to convince people to switch from one NoSQL product to another, the products that will thrive are those able to constantly convert people from outside this small universe.

Original title and link: Cassandra Performance in Review (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/2012-in-review-performance


System Level and Functional Requirements for the Backend Database of a User Engagement Platform

Very good and practical analysis of what the requirements of a user engagement platform are for the backend database, from both the system-level and functional points of view. The ideal case is also spelled out, but I don’t think there’s one product out there that could do all of these:

So, today’s and tomorrow’s engagement services should accommodate heavy write loads, heavy read loads, and heavy aggregate (counter), modify, and read loads. What becomes apparent if we look at user engagement services in this way is that aggregation needs to be a first-class function of engagement services that is near real time, scalable, and highly available.

Original title and link: System Level and Functional Requirements for the Backend Database of a User Engagement Platform (NoSQL database©myNoSQL)

via: http://tech-blog.flipkart.net/2013/01/nosql-for-a-user-engagement-platform/