nosql theory: All content tagged as nosql theory in NoSQL databases and polyglot persistence

Column vs Row Stores: How do they compare?

Yesterday I asked on Twitter about technical papers comparing column stores with row stores. Most of the answers I got point to the research done by Daniel Abadi: Papers and Technical Reports. I’ll start with:


Why NoSQL Databases Are Not Just For Google and Amazon?

Oren Eini [1]:

Why the history lesson, you ask? Why, to give you some perspective on the design choices that led to the victory of the relational databases. Space was at a premium, the interaction between the user and the application closely modeled the physical layout of the data in the database. That made sense, because there really were no other alternatives given the environment that existed at the time.

In my company, we are using RavenDB as the backend database for everything from a blog, our ordering and purchasing systems, the daily build server and many more. The major advantages that we found weren’t the ability to scale (although that exists), it is the freedom that it gives us in terms of modeling our data and changing our minds.

The Googles, Facebooks, and Amazons told the story of “this was not our relational database vendors’ fault”. Jan Lehnardt [2] said a while back that NoSQL is about choice. I said that NoSQL databases are a departure from having just good enough solutions. And Oren Eini is emphasizing the benefits of other data models.


  1. Oren Eini is the creator and main developer of the RavenDB document database. 

  2. Jan Lehnardt: Apache CouchDB committer, Couchbase engineer 

Original title and link: Why NoSQL Databases Are Not Just For Google and Amazon? (NoSQL database©myNoSQL)

via: http://java.dzone.com/articles/why-nosql-not-just-google-and


5 Key Elements for a Firehose Data System

The 5 key elements for a firehose data system as per a presentation by Josh Berkus, CEO of PostgreSQL Experts Inc. summarized by Brian Proffitt on ITworld:

  1. Queuing software to manage out-of-sequence data
  2. Buffering techniques to deal with component outages
  3. Materialized views that update data into aggregate tables
  4. Configuration management for all the systems in the solution
  5. Comprehensive monitoring to look for failures

Basically, firehose data systems are the perfect showcase of the 4 V’s in Big Data. To get an idea of the complexity involved in such systems, check the DataSift architecture, which relies on MySQL, HBase, Memcached, Redis, and Kafka just to deal with the Twitter firehose.

Original title and link: 5 Key Elements for a Firehose Data System (NoSQL database©myNoSQL)


Hybrid Word Aligned Bitmaps: Why are column oriented databases so much faster than row oriented databases?

Terence Siganakis:

I have been playing around with Hybrid Word Aligned Bitmaps for a few weeks now, and they turn out to be a rather remarkable data structure.  I believe that they are utilized extensively in modern column oriented databases such as Vertica and MonetDB. Essentially HWABs are a data structure that allows you to represent a sparse bitmap (series of 0’s and 1’s) really efficiently in memory.  The key trick here is the use of run length encoding to compress the bitmap into fewer bits while still allowing for lightening fast operations.  

The comment thread discusses a couple of reasons why column databases are faster than row-oriented databases, and some scenarios where they are not.
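To make the run-length encoding trick more concrete, here is a minimal sketch in Python. It is not the actual HWAB or WAH word-aligned format (those pack runs into machine words); it only illustrates the underlying idea, assuming a bitmap can be stored as (bit, run_length) pairs. The function names are mine, not from the original post.

```python
from itertools import groupby


def rle_encode(bits):
    """Compress a sequence of 0/1 values into (bit, run_length) pairs."""
    return [(bit, sum(1 for _ in run)) for bit, run in groupby(bits)]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into a flat list of bits."""
    return [bit for bit, length in runs for _ in range(length)]


def rle_and(a, b):
    """AND two encoded bitmaps of equal length without decompressing them."""
    result = []
    i = j = 0
    rem_a = a[0][1] if a else 0  # bits left in the current run of a
    rem_b = b[0][1] if b else 0  # bits left in the current run of b
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)  # overlap of the two current runs
        bit = a[i][0] & b[j][0]
        if result and result[-1][0] == bit:
            # Extend the previous run instead of starting a new one.
            result[-1] = (bit, result[-1][1] + step)
        else:
            result.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return result


# Two sparse 2000-bit bitmaps, each stored as just three runs.
x = rle_encode([0] * 1000 + [1] * 5 + [0] * 995)
y = rle_encode([0] * 1002 + [1] * 10 + [0] * 988)
both = rle_and(x, y)
```

The AND walks the runs rather than the individual bits, so its cost depends on the number of runs, not on the length of the bitmap. That is the property that makes compressed bitmaps attractive for sparse column data.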

Terence Siganakis links to FastBit: An efficient compressed Bitmap index technology:

FastBit is an open-source data processing library following the spirit of NoSQL movement. It offers a set of searching functions supported by compressed bitmap indexes. It treats user data in the column-oriented manner similar to well-known database management systems such as Sybase IQ, MonetDB, and Vertica.

Original title and link: Hybrid Word Aligned Bitmaps: Why are column oriented databases so much faster than row oriented databases? - (NoSQL database©myNoSQL)

via: http://siganakis.com/using-bitmap-indexes-in-query-processing


The History of NoSQL: This Was Not Our Technology Vendors’ Fault

Werner Vogels in the post about Amazon DynamoDB:

We had been pushing the scalability of commercially available technologies to their limits and finally reached a point where these third party technologies could no longer be used without significant risk. This was not our technology vendors’ fault; Amazon’s scaling needs were beyond the specs for their technologies and we were using them in ways that most of their customers were not. A number of outages at the height of the 2004 holiday shopping season can be traced back to scaling commercial technologies beyond their boundaries.

Here is what I wrote about the history behind NoSQL databases:

Providing decent solutions, up to a point, to a wide range of problems and covering more scenarios than alternative storage solutions existing at that time, made relational databases the de facto storage for the last 30 years. But during the last years, more and more problems crossed the boundaries of what could have been considered decent solutions leading to the need for specialized, better than good enough alternative solutions. And thus NoSQL databases.

It feels rewarding to get such confirmation from people that are at the forefront of NoSQL.

Original title and link: The History of NoSQL: This Was Not Our Technology Vendors’ Fault (NoSQL database©myNoSQL)


Google Research: Let's Make TCP Faster

Google is actively researching ways to improve TCP:

Our research shows that the key to reducing latency is saving round trips. We’re experimenting with several improvements to TCP. Here’s a summary of some of our recommendations to make TCP faster:

  1. Increase TCP initial congestion window to 10 (IW10). The amount of data sent at the beginning of a TCP connection is currently 3 packets, implying 3 round trips (RTT) to deliver a tiny 15KB-sized content.
  2. Reduce the initial timeout from 3 seconds to 1 second.
  3. Use TCP Fast Open (TFO).
  4. Use Proportional Rate Reduction for TCP (PRR).
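The arithmetic behind the first recommendation can be sketched in a few lines of Python. This is an idealized model, not how a real TCP stack behaves: it assumes a 1460-byte segment size, a congestion window that doubles every round trip (slow start), no packet loss, and it ignores the connection handshake. The function name and parameters are mine.

```python
import math


def round_trips(content_bytes, initial_window, mss=1460):
    """Count the round trips needed to deliver content under idealized slow start.

    initial_window is in segments; mss (bytes per segment) is an assumed value.
    """
    segments = math.ceil(content_bytes / mss)
    window = initial_window
    sent = 0
    trips = 0
    while sent < segments:
        sent += window   # send a full window in this round trip
        window *= 2      # slow start: the window doubles each round trip
        trips += 1
    return trips


content = 15 * 1024  # the 15KB example from the quote
trips_iw3 = round_trips(content, initial_window=3)
trips_iw10 = round_trips(content, initial_window=10)
```

Under these assumptions, 15KB needs 11 segments: an initial window of 3 delivers them in 3 round trips (3, then 6, then the rest), while an initial window of 10 needs 2.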

The database world attacked network latency with connection pools and pipelining. To reduce network round trips, we’ve used JOINs or denormalized data. But all software architectures will benefit from a faster TCP.

Andrei Savu

Original title and link: Google Research: Let’s Make TCP Faster (NoSQL database©myNoSQL)

via: http://googlecode.blogspot.com/2012/01/lets-make-tcp-faster.html


NoSQL Databases Configuration Management

After reading about the MarkLogic Packaging feature, I wondered whether managing configurations would be better done with tools like Puppet or Chef than with a custom-built solution, even one that comes packaged with your NoSQL database.

  • You’ve been working on an application on your development machine. Now it’s time to move your application to the staging or testing servers. What follows is a tedious process of reviewing the settings on your development machine and applying them to the staging machine. How sure are you that you got all the indexes just right?
  • You’ve got a certified configuration that you want to deploy onto a new cluster. Getting the hardware setup and installing the server itself isn’t too hard, but now you have to make sure that all the application servers and databases are setup. Can you see another tedious process coming?

If you’ve been involved in or responsible for managing the configuration of a NoSQL database deployment, I’d really love to learn what solutions and tools you used.

Original title and link: NoSQL Databases Configuration Management (NoSQL database©myNoSQL)


Key-Value Stores, Document Databases, and Column Stores as Aggregate Oriented Databases

A different, unified look at the data model of the key-value stores, document databases, and column-family stores from Martin Fowler:

there’s a big similarity between the first three - all have a fundamental unit of storage which is a rich structure of closely related data: for key-value stores it’s the value, for document stores it’s the document, and for column-family stores it’s the column family. In DDD terms, this group of data is an aggregate.

The aggregate approach has been present in the relational database world for quite a while. It came in two flavors: views and denormalization. The first worked well for non-distributed deployments, while the second was used wherever speed mattered or joins were not an option.
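A small Python sketch may help to see the difference. The order data, table layout, and key naming below are hypothetical, and a plain dict stands in for a key-value or document store; the point is only that the aggregate is assembled once and then read back as a single unit.

```python
import json

# Hypothetical order, modeled relationally as rows spread over three tables.
customers = {1: {"name": "Ann"}}
orders = {100: {"customer_id": 1}}
order_lines = [
    {"order_id": 100, "product": "book", "qty": 2},
    {"order_id": 100, "product": "pen", "qty": 5},
]


def build_order_aggregate(order_id):
    """Join the rows of one order into a single nested structure (the aggregate)."""
    order = orders[order_id]
    return {
        "order_id": order_id,
        "customer": customers[order["customer_id"]],
        "lines": [
            {"product": line["product"], "qty": line["qty"]}
            for line in order_lines
            if line["order_id"] == order_id
        ],
    }


# A dict stands in for the store: the whole aggregate is the value (or document).
store = {}
store["order:100"] = json.dumps(build_order_aggregate(100))

# Reading the order back is one lookup by key, with no joins.
loaded = json.loads(store["order:100"])
```

Denormalization in a relational database pays the same cost up front: the joins happen at write time so that reads touch a single row.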

Original title and link: Key-Value Stores, Document Databases, and Column Stores as Aggregate Oriented Databases (NoSQL database©myNoSQL)

via: http://java.dzone.com/articles/aggregate-oriented-database


The State of NoSQL in 2012

Wise words from Sid Anand:

Many of the NoSQL vendors view the “battle of NoSQL” to be akin to the RDBMS battle of the 80s, a winner-take-all battle. In the NoSQL world, it is by no means a winner-take-all battle. Distributed Systems are about compromises.

While there might be some who would like to see a NoSQL battle, and at some point money will talk, I hope the real battle will remain centered on the technical aspects and on which data solutions solve each specific problem better. The sort of battle in which everyone learns something.

Original title and link: The State of NoSQL in 2012 (NoSQL database©myNoSQL)

via: http://practicalcloudcomputing.com/post/16109041412/the-state-of-nosql-in-2012


Asking for Performance and Scalability Advice on StackOverflow

How many times have you got an answer that applies to your specific scenario when providing a short list of performance and scalability requirements? MySQL/InnoDB can do 750k qps, Cassandra is scaling linearly, MongoDB can do 8 million ops/s. Are any of these the answer for your application?

Actually:

  • How many times did you get all the requirements right at the spec time?

  • How many times did requirements remain the same during the development cycle?

  • How many times did production reality confirm your bullet list requirements?

Original title and link: Asking for Performance and Scalability Advice on StackOverflow (NoSQL database©myNoSQL)


Neuron Based Data Structure – an Implementation

Alexander Bresk:

The neuron based data structure (called NBDS) follows the idea, to keep an information as an atomic part. The model contains three parts. The first part is the Neuron, which acts like a container for data. The second part is the Axon. This axon connects two neurons together and it can still contain information (data about the connection or relation). The last part is the Space. In a Space you put neurons and axons together and run some operations on it. You can imagine the space as a component, that brings the order into the set of neurons and axons.

You’ll find all these features in any graph database.
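To show the mapping, here is a minimal property graph sketch in Python. It is not the NBDS implementation from the linked post, and the class and method names are mine; it only assumes that a neuron corresponds to a node, an axon to an edge with its own properties, and the space to the graph that holds both.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """The "neuron": a container for data."""
    id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """The "axon": connects two nodes and can carry data about the relation."""
    source: str
    target: str
    properties: dict = field(default_factory=dict)


class Graph:
    """The "space": holds nodes and edges and runs operations on them."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def connect(self, source, target, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before connecting them")
        self.edges.append(Edge(source, target, properties))

    def neighbors(self, node_id):
        """Return the ids of nodes reachable over one outgoing edge."""
        return [e.target for e in self.edges if e.source == node_id]


space = Graph()
space.add_node("alice", kind="person")
space.add_node("bob", kind="person")
space.connect("alice", "bob", relation="knows", since=2011)
friends = space.neighbors("alice")
```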

Original title and link: Neuron Based Data Structure – an Implementation (NoSQL database©myNoSQL)

via: http://www.cip-labs.net/2011/12/14/neuron-based-data-structure-an-implementation/


Card Payment Systems and the CAP Theorem

On the surface it would appear that building such a system would be easy since the card vault can be implemented in a data store (either RDBMS or noSQL store) and the data stores schema could be simple, containing just the PAN, token and perhaps some timestamp information. There are plenty of companies that have attempted to build their own card vaults and many vendors offering commercial products. However we shall see later in this article that designing a card vault it requires a distributed data store and a decision is needed on which compromises of the CAP Theorem your system is willing to accept.

First, a small correction to the original post: instead of “partition tolerance is not an option”, read “partition tolerance is not optional”.

One of the most frequently asked questions about NoSQL databases is “How do they handle transactions? Like in a banking system”. I’ve never developed a banking system, so I don’t know how those work. But I’d bet most of those asking haven’t worked on one either. So why not ask about the solution a NoSQL database would require for the system you are actually working on?

Original title and link: Card Payment Systems and the CAP Theorem (NoSQL database©myNoSQL)

via: http://superconductor.voltage.com/2011/12/tokenization-of-credit-card-numbers-and-the-cap-theorem.html