column store: All content tagged as column store in NoSQL databases and polyglot persistence
Wednesday, 6 March 2013
Simplifying HBase Schema Development With KijiSchema
Jon Natkins from WibiData:
When building an HBase application, you need to be aware of the intricacies and quirks of HBase. For example, your choice of names for column families, or columns themselves can have a drastic effect on the amount of disk space necessary to store your data. In this article, we’ll see how building HBase applications with KijiSchema can help you avoid inefficient disk utilization.
The recommendations related to the length of column names are one of those subtle signs of how young the NoSQL space is [1].
[1] This is not specific to HBase; it also applies to MongoDB, RethinkDB, etc.
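To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python. It assumes the classic uncompressed HBase KeyValue layout, in which every cell repeats its row key, column family, and qualifier. The function name, the row key, and the column names are made up for illustration, and the estimate ignores compression, block encoding, and any per-cell extras newer HBase versions add.

```python
def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate bytes used by one uncompressed HBase cell (KeyValue)."""
    key_length = (
        2 + len(row)         # row-length prefix + row key
        + 1 + len(family)    # family-length prefix + family name
        + len(qualifier)     # column qualifier (no length prefix)
        + 8                  # timestamp
        + 1                  # key type (Put, Delete, ...)
    )
    # Two 4-byte ints (key length, value length) precede the key and the value.
    return 4 + 4 + key_length + len(value)


# Hypothetical cell: same row key and value, only the names differ.
row = b"user:0000000042"
value = b"\x00" * 8

verbose = keyvalue_size(row, b"user_profile_information", b"last_login_timestamp", value)
terse = keyvalue_size(row, b"u", b"ll", value)

print(f"verbose names: {verbose} bytes per cell")
print(f"terse names:   {terse} bytes per cell")
```

Because the names are stored with every single cell, the difference is multiplied by the number of cells in the table, which is why short family and qualifier names are usually recommended.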
Original title and link: Simplifying HBase Schema Development With KijiSchema (©myNoSQL)
via: http://www.kiji.org/2012/03/01/using-disk-space-efficiently-with-kiji-schema
Tuesday, 5 March 2013
A Quick Tour of Internal Authentication and Authorization Security in DataStax Enterprise and Apache Cassandra
Robin Schumacher describes the new security features added to Apache Cassandra and DataStax Enterprise:
This article will concentrate on the new internal authentication and authorization (or permission management) features that are part of both open source Cassandra as well as DataStax Enterprise. Authentication deals with validating incoming user connections to a database cluster, whereas authorization concerns itself with what a logged in user can do inside a database.
I’m happy to see NoSQL databases taking security seriously, as this should ease their way into enterprises. But I’m a bit wary of the moment when the marketing message changes from “it’s too early to provide security features” to “the first enterprise grade NoSQL database”.
Original title and link: A Quick Tour of Internal Authentication and Authorization Security in DataStax Enterprise and Apache Cassandra (©myNoSQL)
Monday, 25 February 2013
Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem
Avik Dey (Intel) sent the announcement of the new open source project from Intel to the Hadoop mailing list:
As the Apache Hadoop ecosystem extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with Hadoop must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.
Project Rhino targets security at all levels: from encryption and key management to cell-level ACLs and audit logging.
Original title and link: Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem (©myNoSQL)
Monday, 18 February 2013
From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix
Prasanna Padmanabhan and Shashi Madappa posted an article on the Netflix blog describing the process used to migrate data from Amazon SimpleDB to Cassandra:
There will come a time in the life of most systems serving data, when there is a need to migrate data to a more reliable, scalable and high performance data store while maintaining or improving data consistency, latency and efficiency. This document explains the data migration technique we used at Netflix to migrate the user’s queue data between two different distributed NoSQL storage systems.
The steps involved are what you’d expect for a large data set migration:
- forklift
- incremental replication
- consistency checking
- shadow writes
- shadow writes and shadow reads for validation
- end of life of the original data store (SimpleDB)
If you think about it, this is how a distributed, eventually consistent storage system works (at least in broad strokes) when replicating data across the cluster. The main difference is that inside a storage engine you deal with a homogeneous system with a single set of constraints, while a data migration has to deal with heterogeneous systems, most often characterized by different limitations and behavior.
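The migration steps listed above can be sketched with a toy model in Python. This is only an illustration, not the Netflix implementation: two plain dicts stand in for SimpleDB and Cassandra, and the class and method names are hypothetical.

```python
class MigratingQueueStore:
    """Toy model of a migration; `old` and `new` are dicts standing in for two stores."""

    def __init__(self, old, new):
        self.old, self.new = old, new
        self.mismatches = []  # keys whose shadow read disagreed

    def forklift(self):
        # One-off bulk copy of everything currently in the old store.
        self.new.update(self.old)

    def inconsistent_keys(self):
        # Consistency check: keys whose values differ between the two stores.
        keys = self.old.keys() | self.new.keys()
        return sorted(k for k in keys if self.old.get(k) != self.new.get(k))

    def replicate(self, keys):
        # Incremental replication: bring the given keys in line with the old store.
        for key in keys:
            if key in self.old:
                self.new[key] = self.old[key]
            else:
                self.new.pop(key, None)

    def write(self, key, value):
        # Shadow write: the old store stays authoritative, the new one mirrors it.
        self.old[key] = value
        self.new[key] = value

    def read(self, key):
        # Shadow read: serve from the old store, compare against the new one.
        value = self.old.get(key)
        if self.new.get(key) != value:
            self.mismatches.append(key)
        return value


old = {"alice": ("Movie A",), "bob": ("Movie B",)}
queue = MigratingQueueStore(old, {})
queue.forklift()
old["carol"] = ("Movie C",)          # a write that landed only in the old store
stale = queue.inconsistent_keys()    # found by the consistency check
queue.replicate(stale)               # repaired by incremental replication
queue.write("alice", ("Movie A", "Movie D"))
result = queue.read("alice")
print(stale, result, queue.mismatches)
```

Once shadow reads stop reporting mismatches for long enough, the roles can be swapped and the original store retired.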
In 2009, Netflix performed a similar massive data migration operation. At that time it involved moving data from its own hosted Oracle and MySQL databases to SimpleDB. The challenges of operating this hybrid solution were described in the paper Netflix’s Transition to High-Availability Storage Systems, authored by Sid Anand.
Sid Anand is now working at LinkedIn, where they use Databus for low-latency data transfer. Databus’s approach is very similar.
Original title and link: From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix (©myNoSQL)
via: http://techblog.netflix.com/2013/02/netflix-queue-data-migration-for-high.html?m=1
Wednesday, 13 February 2013
DataStax's Reaction to MySQL 5.6: Oracle’s MySQL Misses the NoSQL Mark
Jonathan Ellis wrote a post about MySQL 5.6 and how Oracle got NoSQL all wrong, given that NoSQL is, in this exact order, about scaling, continuous availability, flexibility, performance, and queryability:
The big news for MySQL 5.6 was the inclusion of “NoSQL” features in the form of a memcached api for get and put operations.
In cases like this, it’s tough to tell whether Oracle got this so wrong deliberately to sow confusion in the market, or because they really think that’s what NoSQL is about.
I know Jonathan Ellis has always had very strong opinions about the technical superiority of Cassandra, and Cassandra is indeed a very solid solution, but I’m always reluctant to call a competitor stupid or to use the myopic argument “if I’m good at X and suck at Y, then what everyone is looking for is only X”.
Original title and link: DataStax’s Reaction to MySQL 5.6: Oracle’s MySQL Misses the NoSQL Mark (©myNoSQL)
via: http://www.datastax.com/dev/blog/oracles-mysql-misses-the-nosql-mark
Monday, 11 February 2013
Flatten Entire HBase Column Families With Pig and Python UDFs
Chase Seibert:
Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode two dimensions, such as date and counter type.
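A minimal Python sketch of the idea of flattening such a row, outside of Pig. It assumes column names of the form `<date>:<counter type>`; that format, the separator, and the function name are assumptions for illustration and are not taken from the linked article.

```python
def flatten_row(row_key, columns, separator=":"):
    """Turn one wide row into one (row_key, date, counter, value) record per column.

    `columns` maps composite column names such as "20130210:clicks" to values.
    """
    records = []
    for name, value in sorted(columns.items()):
        # The column name itself carries data: split it into its two dimensions.
        date, counter = name.split(separator, 1)
        records.append((row_key, date, counter, value))
    return records


# Hypothetical wide row with data encoded in the column names.
row = {"20130209:clicks": 12, "20130209:views": 340, "20130210:clicks": 7}
flat = flatten_row("page-1", row)
for record in flat:
    print(record)
```

After flattening, the column names no longer need to be known ahead of time: every record has the same fixed shape and can be grouped or filtered by date and counter type.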
Original title and link: Flatten Entire HBase Column Families With Pig and Python UDFs (©myNoSQL)
via: http://chase-seibert.github.com/blog/2013/02/10/pig-hbase-flatten-column-family.html
Wednesday, 6 February 2013
Apache HBase Internals: Locking and Multiversion Concurrency Control
Gregory Chanan explains the ACID per-row semantics of HBase and the usage of row-level locks and MVCC to ensure them:
For writes:
- (w1) After acquiring the RowLock, each write operation is immediately assigned a write number
- (w2) Each data cell in the write stores its write number.
- (w3) A write operation completes by declaring it is finished with the write number.
For reads:
- (r1) Each read operation is first assigned a read timestamp, called a read point.
- (r2) The read point is assigned to be the highest integer x such that all writes with write number <= x have been completed.
- (r3) A read r for a certain (row, column) combination returns the data cell with the matching (row, column) whose write number is the largest value that is less than or equal to the read point of r.
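The rules above can be sketched as a toy, single-row store in Python. This is only an illustration of (w1)-(w3) and (r1)-(r3), not HBase's actual implementation; the class and method names are hypothetical, and real row locking is left out.

```python
class MvccRow:
    """Toy single-row store following rules (w1)-(w3) and (r1)-(r3)."""

    def __init__(self):
        self.next_write_number = 1
        self.completed = set()   # write numbers that were declared finished
        self.cells = {}          # column -> list of (write_number, value)

    def begin_write(self):
        # (w1) Assign a write number (in HBase, right after taking the RowLock).
        number = self.next_write_number
        self.next_write_number += 1
        return number

    def put(self, write_number, column, value):
        # (w2) Each data cell stores the write number that produced it.
        self.cells.setdefault(column, []).append((write_number, value))

    def complete_write(self, write_number):
        # (w3) The write declares it is finished with its write number.
        self.completed.add(write_number)

    def read_point(self):
        # (r2) Highest x such that every write numbered <= x has completed.
        x = 0
        while x + 1 in self.completed:
            x += 1
        return x

    def get(self, column):
        # (r1) The read is assigned a read point first.
        point = self.read_point()
        # (r3) Return the cell with the largest write number <= the read point.
        visible = [c for c in self.cells.get(column, []) if c[0] <= point]
        return max(visible, key=lambda c: c[0])[1] if visible else None


row = MvccRow()
first, second = row.begin_write(), row.begin_write()
row.put(first, "name", "alice")
row.put(second, "name", "bob")
row.complete_write(second)
before = row.get("name")   # write 1 is still in flight, so nothing is visible yet
row.complete_write(first)
after = row.get("name")    # both writes completed, the later one wins
print(before, after)
```

Note how the second write stays invisible until the first one completes: the read point never skips over an unfinished write, which is what keeps readers from seeing a partially applied row.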
It probably goes without saying that you should read and save this article if HBase is already in your datacenter, or at least on the horizon.
Original title and link: Apache HBase Internals: Locking and Multiversion Concurrency Control (©myNoSQL)
via: https://blogs.apache.org/hbase/entry/apache_hbase_internals_locking_and
Wednesday, 23 January 2013
Cassandra Performance in Review
Jonathan Ellis:
I honestly think Cassandra is one to two years ahead of the competition, but I’m under no illusions that Cassandra itself is perfect.
You cannot start the year without taking a stab at your competitors. At least from the performance point of view and even if they’re not really competitors—MongoDB, Riak, HBase.
The NoSQL market is ant-size compared to the overall database market, and while it is easier to convince people to switch from one NoSQL database to another, the products that will thrive are those that can constantly convert people from outside of this small universe.
Original title and link: Cassandra Performance in Review (©myNoSQL)
via: http://www.datastax.com/dev/blog/2012-in-review-performance
System Level and Functional Requirements for the Backend Database of a User Engagement Platform
A very good and practical analysis of what a user engagement platform requires from its backend database, from both the system-level and functional points of view. The ideal case is also spelled out, but I don’t think there’s one product out there that could do all of these:
So, today’s and tomorrow’s engagement services should accommodate, heavy write loads, heavy read loads, heavy aggregate(counter), modify and read loads. What becomes apparent if we look at user engagement services in this way is that aggregation needs to be a first class function of engagement services that is near real time, scalable and highly available.
Original title and link: System Level and Functional Requirements for the Backend Database of a User Engagement Platform (©myNoSQL)
via: http://tech-blog.flipkart.net/2013/01/nosql-for-a-user-engagement-platform/
Monday, 21 January 2013
11 Interesting Releases From the First Weeks of January
The list of releases I wanted to post about has been growing fast these last couple of weeks, so instead of waiting, here it is (in no particular order [1]):
- (Jan.2nd) Cassandra 1.2 — announcement on DataStax’s blog. I’m currently learning and working on a post looking at what’s new in Cassandra 1.2.
- (Jan.10th) Apache Pig 0.10.1 — Hortonworks wrote about it
- (Jan.10th) DataStax Community Edition 1.2 and OpsCenter 2.1.3 — DataStax announcement
- (Jan.10th) CouchDB 1.0.4, 1.1.2, and 1.2.1 — releases fixing some security vulnerabilities
- (Jan.11th) MongoDB 2.3.2 unstable: announcement. This dev release includes support for full text indexing. For more details you can check:
  - MongoDB Full Text Search Explained and MongoDB Text Search Tutorial
  - Full text search in MongoDB: details about supported languages and queries
  - Indexing a Markdown blog using MongoDB full text indexing
  - Short demo of MongoDB text search and hashed shard keys
- (Jan.12th) Apache HBase 0.94.4 — announcement and release notes
- (Jan.14th) Apache Hive 0.10.0: Hortonworks’s post about it
- (Jan.15th) Hortonworks Data Platform 1.2 featuring Apache Ambari: official PR announcement
- (Jan.16th) Redis 2.6.9 — release notes
- (Jan.16th) HyperDex 1.0RC1 — no docs
- (Jan.16th) Klout’s Brickhouse — announcement:
[…] an open source project extending Hadoop and Hive with a collection of useful user-defined-functions. Its aim is to make the Hive Big Data developer more productive, and to enable scalable and robust dataflows.
[1] I’ve tried to order it chronologically, but most probably I’ve failed.
Original title and link: 11 Interesting Releases From the First Weeks of January (©myNoSQL)
Wednesday, 16 January 2013
CCM: A Tool for Creating Local Cassandra Clusters
This useful little gem for creating local Cassandra test clusters was mentioned in Peter Bailis’s post Using Probabilistically Bounded Staleness in Cassandra 1.2.0, but I didn’t catch it until today, when the DataStax guys blogged about it:
CCM (Cassandra Cluster Manager) is a tool written by Sylvain Lebresne that creates multi-node cassandra clusters on the local machine. It is great for quickly setting up clusters for development and testing, and is the foundation that the cassandra distributed tests (dtests) are built on. In this post I will give an introduction to installing and using ccm.
Original title and link: CCM: A Tool for Creating Local Cassandra Clusters (©myNoSQL)
via: http://www.datastax.com/dev/blog/ccm-a-development-tool-for-creating-local-cassandra-clusters
Tuesday, 15 January 2013
Using Probabilistically Bounded Staleness in Cassandra 1.2.0
Peter Bailis:
With the help of the Cassandra community, we recently released PBS consistency predictions as a feature in the official Cassandra 1.2.0 stable release. In case you aren’t familiar, PBS (Probabilistically Bounded Staleness) predictions help answer questions like: how eventual is eventual consistency? how consistent is eventual consistency? These predictions help you profile your existing Cassandra cluster and determine which configuration of N,R, and W are the best fit for your application, expressed quantitatively in terms of latency, consistency, and durability (see output below).
If I get this right, this tool should become a must-run before going into production, and also a good starting point for investigating WTFs like “what am I supposed to do to avoid getting stale data?”.
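To make the idea of such a prediction concrete, here is a small Monte Carlo sketch in Python. It is not the code that ships with Cassandra: it is a heavily simplified model that assumes every message delay is exponentially distributed with the same mean, that a read is sent to all N replicas, and that the first R responses are used. The function name and all parameters are hypothetical.

```python
import random


def stale_read_probability(n, r, w, t_ms, trials=10_000, mean_delay_ms=5.0, seed=0):
    """Estimate the chance that a read started t_ms after a write commits is stale."""
    rng = random.Random(seed)

    def delay():
        # Assumed one-way message delay (exponential, same mean for every hop).
        return rng.expovariate(1.0 / mean_delay_ms)

    stale = 0
    for _ in range(trials):
        arrive = [delay() for _ in range(n)]         # write reaches each replica
        acks = sorted(a + delay() for a in arrive)   # acks reach the coordinator
        commit = acks[w - 1]                         # the w-th ack commits the write
        read_start = commit + t_ms

        responses = []
        for a in arrive:
            at_replica = read_start + delay()        # read request reaches replica
            has_new_value = a <= at_replica
            responses.append((at_replica + delay(), has_new_value))
        responses.sort()                             # order by response arrival

        # The read is stale if none of the first r responses carries the new value.
        if not any(fresh for _, fresh in responses[:r]):
            stale += 1
    return stale / trials


quorum = stale_read_probability(n=3, r=2, w=2, t_ms=0.0)
weak_now = stale_read_probability(n=3, r=1, w=1, t_ms=0.0)
weak_later = stale_read_probability(n=3, r=1, w=1, t_ms=1000.0)
print(f"R=2,W=2 immediately: {quorum:.3f}")
print(f"R=1,W=1 immediately: {weak_now:.3f}")
print(f"R=1,W=1 after 1s:    {weak_later:.3f}")
```

With R + W > N the read set always overlaps the replicas that acknowledged the write, so the estimate is zero. With R = W = 1 staleness is possible right after the write and fades as t grows, which is the "how eventual is eventual consistency?" question in miniature.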
Original title and link: Using Probabilistically Bounded Staleness in Cassandra 1.2.0 (©myNoSQL)
via: http://www.bailis.org/blog/using-pbs-in-cassandra-1.2.0/