
Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem

Avik Dey (Intel) sent the announcement of the new open source project from Intel to the Hadoop mailing list:

As the Apache Hadoop ecosystem extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with Hadoop must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.

Project Rhino targets security at all levels, from encryption and key management to cell-level ACLs and audit logging.

Original title and link: Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem (NoSQL database©myNoSQL)

via: http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201302.mbox/%3cCD5137E5.15610%25avik.dey@intel.com%3e


From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix

Prasanna Padmanabhan and Shashi Madappa posted an article on the Netflix blog describing the process used to migrate data from Amazon SimpleDB to Cassandra:

There will come a time in the life of most systems serving data, when there is a need to migrate data to a more reliable, scalable and high performance data store while maintaining or improving data consistency, latency and efficiency. This document explains the data migration technique we used at Netflix to migrate the user’s queue data between two different distributed NoSQL storage systems.

The steps involved are what you’d expect for a large data set migration (a sketch of the shadow write/read phases follows the list):

  1. forklift
  2. incremental replication
  3. consistency checking
  4. shadow writes
  5. shadow writes and shadow reads for validation
  6. end of life of the original data store (SimpleDB)
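
Steps 4 and 5 are where most of the migration risk lies, so here is a minimal Python sketch of the shadow write/read pattern. Everything here is hypothetical (the class name and the primary/shadow store objects are stand-ins); Netflix has not published its implementation in this form.

    import logging

    log = logging.getLogger("migration")

    class ShadowWritingQueueStore:
        """During migration, `primary` (SimpleDB) remains the source of truth,
        while `shadow` (Cassandra) receives mirrored traffic for validation."""

        def __init__(self, primary, shadow):
            self.primary = primary
            self.shadow = shadow

        def put(self, user_id, item):
            self.primary.put(user_id, item)       # step 4: primary write first
            try:
                self.shadow.put(user_id, item)    # mirrored write, best effort
            except Exception:
                log.exception("shadow write failed for user %s", user_id)

        def get(self, user_id):
            value = self.primary.get(user_id)
            try:
                # step 5: shadow read, used only to validate the new store
                if self.shadow.get(user_id) != value:
                    log.warning("shadow mismatch for user %s", user_id)
            except Exception:
                log.exception("shadow read failed for user %s", user_id)
            return value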

If you think about it, this is how a distributed, eventually consistent store works (at least in broad strokes) when replicating data across the cluster. The main difference is that inside a storage engine you deal with a homogeneous system with a single set of constraints, while data migration has to deal with heterogeneous systems, most often characterized by different limitations and behavior.

In 2009, Netflix performed a similar massive data migration, that time moving data from its own hosted Oracle and MySQL databases to SimpleDB. The challenges of operating this hybrid solution were described in the paper Netflix’s Transition to High-Availability Storage Systems, authored by Sid Anand.

Sid Anand now works at LinkedIn, where they use Databus for low latency data transfer; Databus’s approach is very similar.

Original title and link: From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix (NoSQL database©myNoSQL)

via: http://techblog.netflix.com/2013/02/netflix-queue-data-migration-for-high.html?m=1


DataStax's Reaction to MySQL 5.6: Oracle’s MySQL Misses the NoSQL Mark

Jonathan Ellis, in a post about MySQL 5.6 and how Oracle got the whole NoSQL thing wrong, considering NoSQL is, in this exact order, about scaling, continuous availability, flexibility, performance, and queryability:

The big news for MySQL 5.6 was the inclusion of “NoSQL” features in the form of a memcached API for get and put operations.

In cases like this, it’s tough to tell whether Oracle got this so wrong deliberately to sow confusion in the market, or because they really think that’s what NoSQL is about.

I know Jonathan Ellis has always had very strong opinions about the technical superiority of Cassandra, and Cassandra is indeed a very solid solution, but I’m always reluctant to call a competitor stupid and to use the myopic argument “if I’m good at X and suck at Y, then what everyone is looking for is only X”.

Original title and link: DataStax’s Reaction to MySQL 5.6: Oracle’s MySQL Misses the NoSQL Mark (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/oracles-mysql-misses-the-nosql-mark


Flatten Entire HBase Column Families With Pig and Python UDFs

Chase Seibert:

Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode two dimensions, such as date and counter type.
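
A hypothetical Pig (Jython) UDF in the spirit of the post: it flattens a map of composite column names, e.g. “20130210:views”, from an HBase column family into (day, counter, value) tuples. The column-name format and all the names here are my assumptions, not Chase Seibert’s code.

    from pig_util import outputSchema   # shipped with Pig's Python UDF support

    @outputSchema("cols:bag{t:(day:chararray, counter:chararray, value:long)}")
    def flatten_family(column_map):
        # column_map arrives as a Pig map: composite column name -> cell value
        if column_map is None:
            return []
        out = []
        for name, value in column_map.items():
            day, counter = name.split(":", 1)   # the name encodes two dimensions
            out.append((day, counter, long(value)))
        return out

    # In Pig Latin (hypothetical script and alias names):
    #   REGISTER 'udfs.py' USING jython AS udfs;
    #   flat = FOREACH rows GENERATE FLATTEN(udfs.flatten_family(columns));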

Original title and link: Flatten Entire HBase Column Families With Pig and Python UDFs (NoSQL database©myNoSQL)

via: http://chase-seibert.github.com/blog/2013/02/10/pig-hbase-flatten-column-family.html


Apache HBase Internals: Locking and Multiversion Concurrency Control

Gregory Chanan explains HBase’s per-row ACID semantics and the row-level locks and MVCC used to ensure them (a toy model follows the two lists):

For writes:

  1. (w1) After acquiring the RowLock, each write operation is immediately assigned a write number.
  2. (w2) Each data cell in the write stores its write number.
  3. (w3) A write operation completes by declaring it is finished with the write number.

For reads:

  1. (r1) Each read operation is first assigned a read timestamp, called a read point.
  2. (r2) The read point is assigned to be the highest integer x such that all writes with write number <= x have been completed.
  3. (r3) A read r for a certain (row, column) combination returns the data cell with the matching (row, column) whose write number is the largest value that is less than or equal to the read point of r.
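
To make the scheme concrete, here is a toy Python model of it (my illustration, not HBase code): write numbers are handed out under a lock, and the read point only advances past a write number once every write up to it has completed.

    import threading

    class ToyMvcc:
        def __init__(self):
            self.lock = threading.Lock()
            self.next_write = 0
            self.completed = set()
            self.read_point = 0

        def begin_write(self):
            with self.lock:              # (w1) assign the next write number
                self.next_write += 1
                return self.next_write   # (w2) the writer tags its cells with it

        def complete_write(self, write_number):
            with self.lock:              # (w3) declare the write finished
                self.completed.add(write_number)
                # (r2) advance the read point over contiguous completed writes
                while self.read_point + 1 in self.completed:
                    self.read_point += 1

        def is_visible(self, cell_write_number):
            # (r3) a cell is visible to readers once its write number is at or
            # below the current read point
            with self.lock:
                return cell_write_number <= self.read_point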

It probably goes without saying that you should read and save this article if HBase is already in your datacenter, or at least on the horizon.

Original title and link: Apache HBase Internals: Locking and Multiversion Concurrency Control (NoSQL database©myNoSQL)

via: https://blogs.apache.org/hbase/entry/apache_hbase_internals_locking_and


Cassandra Performance in Review

Jonathan Ellis:

I honestly think Cassandra is one to two years ahead of the competition, but I’m under no illusions that Cassandra itself is perfect.

You cannot start the year without taking a stab at your competitors, at least from the performance point of view, even if they’re not really competitors: MongoDB, Riak, HBase.

The NoSQL market is ant-sized compared to the overall database market, and while it’s easier to convince people to switch from one NoSQL product to another, the products that will thrive are those able to constantly convert people from outside this small universe.

Original title and link: Cassandra Performance in Review (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/2012-in-review-performance


System Level and Functional Requirements for the Backend Database of a User Engagement Platform

A very good and practical analysis of the requirements a user engagement platform puts on its backend database, from both the system-level and functional points of view. The ideal case is also spelled out, but I don’t think there’s one product out there that could do all of these:

So, today’s and tomorrow’s engagement services should accommodate heavy write loads, heavy read loads, and heavy aggregate (counter), modify, and read loads. What becomes apparent if we look at user engagement services in this way is that aggregation needs to be a first class function of engagement services that is near real time, scalable and highly available.
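
As a tiny illustration of that last point, here is a hypothetical sketch of aggregation as a first-class, write-time function: counters are maintained on the write path, so reads stay cheap and near real time.

    from collections import defaultdict

    class EngagementStore:
        def __init__(self):
            self.events = []                      # raw events (heavy write load)
            self.counters = defaultdict(int)      # aggregates maintained on write

        def record(self, user, action):
            self.events.append((user, action))    # write path
            self.counters[(user, action)] += 1    # aggregate updated in-line

        def count(self, user, action):
            return self.counters[(user, action)]  # cheap, near-real-time read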

Original title and link: System Level and Functional Requirements for the Backend Database of a User Engagement Platform (NoSQL database©myNoSQL)

via: http://tech-blog.flipkart.net/2013/01/nosql-for-a-user-engagement-platform/


11 Interesting Releases From the First Weeks of January

The list of releases I wanted to post about has been growing fast these last couple of weeks, so instead of waiting any longer, here it is (in no particular order1):

  1. (Jan.2nd) Cassandra 1.2 — announcement on DataStax’s blog. I’m currently learning it and working on a post looking at what’s new in Cassandra 1.2.
  2. (Jan.10th) Apache Pig 0.10.1 — Hortonworks wrote about it
  3. (Jan.10th) DataStax Community Edition 1.2 and OpsCenter 2.1.3 — DataStax announcement
  4. (Jan.10th) CouchDB 1.0.4, 1.1.2, and 1.2.1 — releases fixing some security vulnerabilities
  5. (Jan.11th) MongoDB 2.3.2 unstable — announcement. This dev release includes support for full text indexing. For more details you can check:

    […] an open source project extending Hadoop and Hive with a collection of useful user-defined-functions. Its aim is to make the Hive Big Data developer more productive, and to enable scalable and robust dataflows.


  1. I’ve tried to order it chronologically, but most probably I’ve failed. 

Original title and link: 11 Interesting Releases From the First Weeks of January (NoSQL database©myNoSQL)


CCM: A Tool for Creating Local Cassandra Clusters

This useful little gem for creating local Cassandra test clusters was mentioned in Peter Bailis’s post Using Probabilistically Bounded Staleness in Cassandra 1.2.0, but I didn’t catch it until today, when the DataStax guys blogged about it:

CCM (Cassandra Cluster Manager) is a tool written by Sylvain Lebresne that creates multi-node cassandra clusters on the local machine. It is great for quickly setting up clusters for development and testing, and is the foundation that the cassandra distributed tests (dtests) are built on. In this post I will give an introduction to installing and using ccm.
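
Besides the command line, CCM exposes a Python library (ccmlib), which the Cassandra dtests build on. A sketch from memory of the ccmlib API at the time; the exact signatures may differ between versions:

    from ccmlib.cluster import Cluster

    # create a 3-node local cluster running Cassandra 1.2.0, then start it
    cluster = Cluster("/tmp/ccm", "test", cassandra_version="1.2.0")
    cluster.populate(3).start()

    print([node.name for node in cluster.nodelist()])   # node1, node2, node3

    cluster.stop()   # shut the local nodes down again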

Original title and link: CCM: A Tool for Creating Local Cassandra Clusters (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/ccm-a-development-tool-for-creating-local-cassandra-clusters


Using Probabilistically Bounded Staleness in Cassandra 1.2.0

Peter Bailis:

With the help of the Cassandra community, we recently released PBS consistency predictions as a feature in the official Cassandra 1.2.0 stable release. In case you aren’t familiar, PBS (Probabilistically Bounded Staleness) predictions help answer questions like: how eventual is eventual consistency? how consistent is eventual consistency? These predictions help you profile your existing Cassandra cluster and determine which configuration of N, R, and W are the best fit for your application, expressed quantitatively in terms of latency, consistency, and durability (see output below).
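
To get an intuition for what PBS computes, here is a toy Monte Carlo sketch of the idea (my illustration, much simpler than the real PBS model): given N replicas, write quorum W, read quorum R, and an assumed per-replica propagation latency distribution, estimate the probability that a read issued t milliseconds after a write acknowledges returns the latest value.

    import random
    from math import comb

    def p_consistent(n, r, w, t_ms, trials=100000, mean_latency_ms=10.0):
        total = 0.0
        for _ in range(trials):
            # when each replica applies the write (exponential is an assumption)
            applied = sorted(random.expovariate(1.0 / mean_latency_ms)
                             for _ in range(n))
            ack = applied[w - 1]          # the write returns after W replicas ack
            fresh = sum(1 for a in applied if a <= ack + t_ms)
            stale = n - fresh
            # probability that a read of R uniformly chosen replicas misses
            # every fresh replica
            miss = comb(stale, r) / comb(n, r) if stale >= r else 0.0
            total += 1.0 - miss
        return total / trials

    # e.g. how consistent are reads 5 ms after the write, with N=3, R=W=1?
    print(p_consistent(n=3, r=1, w=1, t_ms=5.0))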

If I get this right, this tool should become a must-run before going into production, and also a good starting point for investigating WTFs like “what am I supposed to do to avoid getting stale data?”

Original title and link: Using Probabilistically Bounded Staleness in Cassandra 1.2.0 (NoSQL database©myNoSQL)

via: http://www.bailis.org/blog/using-pbs-in-cassandra-1.2.0/


Cassandra at MetricsHub for Cloud Monitoring

Charles Lamanna (CEO of MetricsHub):

We use Cassandra for recording time series information (e.g. metrics) as well as special events (e.g. server failure) for our customers. We have a multi-tenant Cassandra cluster for this. We record over 16 data points per server per second, 24 hours a day, 7 days a week. We use Cassandra to store and crunch this data.

Many of the NoSQL databases can be used for monitoring. For small-scale self-monitoring, for example, you could use Redis.
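
For instance, a hypothetical small-scale sketch with redis-py, aggregating counters into one-second buckets; the key layout and the one-day TTL are my choices:

    import time
    import redis

    r = redis.Redis()   # assumes a local Redis server

    def record(server, metric, value=1):
        bucket = int(time.time())                      # one bucket per second
        key = "metrics:%s:%s:%d" % (server, metric, bucket)
        r.incrby(key, value)                           # aggregate on write
        r.expire(key, 24 * 60 * 60)                    # keep one day of data

    def last_minute(server, metric):
        now = int(time.time())
        keys = ["metrics:%s:%s:%d" % (server, metric, t)
                for t in range(now - 60, now + 1)]
        return sum(int(v) for v in r.mget(keys) if v is not None)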

Original title and link: Cassandra at MetricsHub for Cloud Monitoring (NoSQL database©myNoSQL)

via: http://www.planetcassandra.org/blog/post/5-minute-interview-metricshub


Cassandra Application Performance Management With Request Tracing

Jonathan Ellis introduces, in two posts—here and here—a new feature in Cassandra 1.2: request tracing. Such a feature is basically an improved, Cassandra-aware alternative to more generic APM tools like AppDynamics or New Relic.

Be judicious with this: tracing a request will usually require at least 10 rows to be inserted, so it is far from free. Unless you are under very light load, tracing all requests (probability 1.0) will probably overwhelm your system. I recommend starting with a small fraction, e.g. 0.001, and increasing that only if necessary.

Years ago I had to implement a tracing layer1 myself, after trying to get information from that system using some commercial tools—I’m sure these have gotten better since then. There were a few goals I planned for, and many things I learned after deploying it live (a minimal sampling sketch follows the list):

  1. granularity of the probes is critical to understanding how the system behaves. Use probes that are too coarse-grained and you’ll miss important details; use probes that are too fine-grained and you’ll be flooded with unusable data
  2. deciding whether traces are persistent or volatile, and the impact on system performance. Should you be able to retrieve older traces? If persistent, do they contain enough information to help explain a specific behavior? Can they be used to replay a scenario?
  3. deciding which requests should be traced, and when. Tracing comes with a cost and you must try to minimize its impact on the system. The most important data is needed when the system misbehaves or is under high load, but that’s exactly when additional work could bring it down
  4. probabilistic vs. pattern vs. behavioral tracing. Generic solutions have no knowledge of the system, but a custom one can be built with that knowledge
  5. trace ordering. Can historical tracing information be ordered?
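
For the probabilistic case (point 4, and Ellis’s 0.001 recommendation above), the core mechanism is small enough to sketch; all names here are hypothetical:

    import random
    import time

    TRACE_PROBABILITY = 0.001   # start small and raise it only if necessary

    def maybe_trace(handler):
        def wrapper(request):
            if random.random() >= TRACE_PROBABILITY:
                return handler(request)      # fast path: no tracing overhead
            start = time.monotonic()
            try:
                return handler(request)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000.0
                # persist the trace somewhere queryable; print stands in here
                print("traced %r in %.2f ms" % (request, elapsed_ms))
        return wrapper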

And there are probably many other things that I no longer remember.


  1. My implementation was specific to the system (in the sense that it had different tracing capabilities based on request types), but it was generic enough to allow us to change the granularity of collected probes, introduce new trace points, and change the ratio of requests to be traced.

Original title and link: Cassandra Application Performance Management With Request Tracing (NoSQL database©myNoSQL)