nosql theory: All content tagged as nosql theory in NoSQL databases and polyglot persistence

Conflict Resolution Using Rev Trees and a Comparison With Vector Clocks

Damien Katz has posted on GitHub a design document for rev trees, the data structures used to support conflict management in Couchbase. The doc also references the way conflict resolution is done in CouchDB and compares rev trees with vector clocks.

When this happens [nb the edits are in conflict] Couchbase will store both edits, pick an interim winner (the same winner will be selected on all nodes) and “hide” the losing conflict(s) and mark the document as being in conflict so that it can found, using views and other searches, by an external agents who can potentially resolve the conflicts.
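To make the "same winner on all nodes" part concrete, here is a minimal sketch of a deterministic winner pick. It is not the algorithm from the design doc: the ordering is an assumption modeled loosely on the rule CouchDB documents (live revisions beat deleted ones, the longer history wins, the higher revision id breaks ties), and `Leaf`, `pick_winner`, and `in_conflict` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    """A leaf revision of a document's revision tree (hypothetical model)."""
    depth: int             # number of edits on the branch ending at this leaf
    rev_hash: str          # hash identifying the revision
    deleted: bool = False  # True when the branch ends in a deletion


def pick_winner(leaves):
    """Return (winner, losers) using only data that every node already holds.

    Assumed ordering: live revisions beat deleted ones, then the deeper
    branch wins, then the lexicographically larger hash breaks the tie.
    """
    ranked = sorted(
        leaves,
        key=lambda leaf: (not leaf.deleted, leaf.depth, leaf.rev_hash),
        reverse=True,
    )
    return ranked[0], ranked[1:]


def in_conflict(leaves):
    """A document is in conflict when more than one live leaf exists."""
    return sum(1 for leaf in leaves if not leaf.deleted) > 1
```

Because the sort key depends only on the revisions themselves, nodes that hold the same set of leaves pick the same winner without talking to each other, whatever order the edits arrived in.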

Original title and link: Conflict Resolution Using Rev Trees and a Comparison With Vector Clocks (NoSQL database©myNoSQL)

via: https://github.com/couchbaselabs/cbconflictmgmt/blob/master/revtrees.md


Bloom Filters by Example

Bloom filters are present in a lot of NoSQL systems; take, for example, HBase and Bloom Filters. Last month I linked to Creating a Simple Bloom Filter in Python, and today it is time for Bloom Filters by Example.

Sid Anand

via: http://billmill.org/bloomfilter-tutorial/


Is Eventual Consistency Useful?

As a follow-up to The NoSQL Partition Tolerance Myth, Jeff Darcy:

Every once in a while, somebody comes up with the “new” idea that eventually consistent systems (or AP in CAP terminology) are useless. Of course, it’s not really new at all; the SQL RDBMS neanderthals have been making this claim-without-proof ever since NoSQL databases brought other models back into the spotlight. In the usual formulation, banks must have immediate consistency and would never rely on resolving conflicts after the fact … except that they do and have for centuries.
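A toy sketch of what resolving conflicts after the fact can look like. Everything here is an assumption made for illustration (the transaction ids, the amounts, the `merge` and `balance` helpers); it is not taken from Jeff Darcy's post. Each replica keeps a log of transactions instead of a single balance, so reconciliation becomes a union of logs.

```python
def merge(log_a, log_b):
    """Union two replicas' transaction logs, keyed by transaction id."""
    merged = dict(log_a)
    merged.update(log_b)
    return merged


def balance(log):
    """Derive the balance from the log instead of storing it directly."""
    return sum(log.values())


# Two branches accept transactions while disconnected from each other.
branch_a = {"tx1": 100, "tx2": -30}
branch_b = {"tx1": 100, "tx3": -90}

# Once they can talk again, the logs are merged and the overdraft shows up.
reconciled = merge(branch_a, branch_b)
overdrawn = balance(reconciled) < 0
```

Neither branch refused a withdrawal while partitioned; the overdraft is detected at reconciliation time and handled with a compensating action such as a fee.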

via: http://pl.atyp.us/wordpress/index.php/2013/03/is-eventual-consistency-useful/


The NoSQL Partition Tolerance Myth

Emin Gün Sirer:

What the NoSQL industry is peddling under the guise of partition tolerance is not the ability to build applications that can survive partitions. They’re repackaging and marketing a very specific, and odd, behavior known as partition obliviousness.

The post presents a very dogmatic and radical perspective on what the requirements of both applications and distributed databases must be. I cannot agree with most of it, if only because it relies on the “bank example”.

Dealing with data conflicts is an added complexity in systems where write availability matters more than other requirements. Many NoSQL databases provide knobs to tune availability and consistency to the levels an application requires, and applications can define finer-grained knobs on top of those.

Taking one scenario that might require consistent transactional data access as the canonical example for all distributed systems, while ignoring the features that (some of) the NoSQL databases offer to help applications deal with different scenarios, is never going to lead to correct conclusions.
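One familiar example of such a knob is the quorum setting in Dynamo-style stores, where N is the number of replicas and R and W are how many of them must answer a read or a write. The sketch below only does the arithmetic; the function names are hypothetical, and it assumes strict quorums (no sloppy quorums or hinted handoff).

```python
def quorums_overlap(n, r, w):
    """True when every read quorum shares a replica with every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(n, r, w):
    """Replicas that can be down while both reads and writes still succeed."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return n - max(r, w)
```

With N=3, choosing R=W=2 makes reads overlap the latest acknowledged write while surviving one failed replica; R=W=1 survives two failures but a read may return stale data. The application gets to pick the trade-off per use case.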

via: http://hackingdistributed.com/2013/03/07/partition-tolerance-myth/


Using Apache ZooKeeper to Build Distributed Apps (And Why)

A great intro by Sean Mackrory to ZooKeeper and the problems it can help solve:

Done wrong, distributed software can be the most difficult to debug. Done right, however, it can allow you to process more data, more reliably and in less time. Using Apache ZooKeeper allows you to confidently reason about the state of your data, and coordinate your cluster the right way. You’ve seen how easy it is to get a ZooKeeper server up and running. (In fact, if you’re a CDH user, you may already have an ensemble running!) Think about how ZooKeeper could help you build more robust systems

Leaving the main topic of the post aside for a second, another important lesson here is that NIH (not-invented-here) syndrome is very expensive in distributed systems.

via: http://blog.cloudera.com/blog/2013/02/how-to-use-apache-zookeeper-to-build-distributed-apps-and-why/


Creating a Simple Bloom Filter in Python

Max Burstein:

Bloom filters are super efficient data structures that allow us to tell if an object is most likely in a data set or not by checking a few bits. Bloom filters return some false positives but no false negatives. Luckily we can control the amount of false positives we receive with a trade off of time and memory.

Explanations and code included.
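For readers who don't want to click through, here is a minimal sketch of the idea. It is not Max Burstein's code: the class, its parameters, and the SHA-256 double-hashing scheme are my own assumptions, and the sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        # More bits per item (memory) buys a lower false positive rate.
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Any unset bit proves absence; all bits set means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Usage is `bf = BloomFilter(1000, 0.01)`, then `bf.add("some-key")` and `"some-key" in bf`. An added key is always reported as present, while a key that was never added is occasionally reported as present too, at roughly the configured rate.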

via: http://maxburstein.com/blog/creating-a-simple-bloom-filter/


A Data Store Independent of Consistency Models, Upfront Data Modeling and Access Algorithms

Tina Groves [1] in “Where Does Hadoop Fit in a Business Intelligence Data Strategy?”:

For example, the decision to move and transform operational data to an operational data store (ODS), to an enterprise data warehouses (EDW) or to some variation of OLAP is often made to improve performance or enhance broad consumability by business people, particularly for interactive analysis. Business rules are needed to interpret data and to enable BI capabilities such as drill up/drill down. The more business rules built into the data stores, the less modelling effort needed between the curated data and the BI deliverable.

That’s why Chirag Mehta’s ideal database featuring “an ubiquitous interface independent of consistency models, upfront data modeling, and access algorithms” is never going to be efficient. Actually, I’m not even sure it would make sense to build it.


  1. Tina Groves: Product Strategist, IBM Business Intelligence 

via: http://www.ibmbigdatahub.com/blog/where-does-hadoop-fit-business-intelligence-data-strategy


Why Can't RDBM Cluster the Way NoSQL Does? Distributed Database Architecture 101

A fabulous answer on StackExchange to a question that is on the minds of relational database users who have heard of NoSQL:

Distributed database systems are complex critters and come in a number of different flavours. If I dig deep in to the depths of my dimly remembered distributed systems papers I did at university (roughly 15 years ago) I’ll try to explain some of the key engineering problems to building a distributed database system.

via: http://dba.stackexchange.com/a/34896


Summary and Links for CAP Articles on IEEE Computer Issue

Daniel Abadi has posted a quick summary of the articles written by Eric Brewer, Seth Gilbert and Nancy Lynch, Daniel Abadi, Raghu Ramakrishnan, Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell for the IEEE Computer issue dedicated to the CAP theorem, plus links to most of them:

  1. Eric Brewer’s article republished by InfoQ
  2. Seth Gilbert and Nancy A. Lynch: Perspectives on the CAP theorem (PDF)
  3. Daniel Abadi: Consistency Tradeoffs in Modern Distributed Database System Design (PDF)
  4. Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell: Overcoming CAP with Consistent Soft-State Replication (PDF)

via: http://dbmsmusings.blogspot.co.il/2012/10/ieee-computer-issue-on-cap-theorem.html


Why Is It Hard to Scale a Database, in Layman's Terms?

Quite enjoyable.

via: http://www.quora.com/Database-Systems/Why-is-it-hard-to-scale-a-database-in-layman%E2%80%99s-terms


JSONiq: The JSON Query Language

Longtime reader William Candillon of 28msec sent me a link to JSONiq - The JSON Query Language, a group initiative to bring XQuery-like queryability to JSON:

Our goal in the JSONiq group, is to put the maturity of XQuery to work with JSON data. JSONiq is an open extension of the XQuery data model and syntax to support JSON.

After reading about JSONiq and experimenting with it a bit, my initial thought is that while it looks interesting, it feels like a complicated, XML-ish query language that doesn’t really reflect the simplicity and philosophy of JSON.

let $stats := db:find("stats")
for $access in $stats
group by $url := $access("url")
return {
  "url": $url,
  "avg": avg($access("response_time")),
  "hits": count($access)
}
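For comparison, here is roughly what the query above computes, written as plain Python over a list of dicts. This is only my reading of the query: `summarize` and the sample records are hypothetical, standing in for whatever the "stats" collection holds.

```python
from collections import defaultdict
from statistics import mean


def summarize(stats):
    """Group access records by URL; report average response time and hits."""
    by_url = defaultdict(list)
    for access in stats:
        by_url[access["url"]].append(access["response_time"])
    return [
        {"url": url, "avg": mean(times), "hits": len(times)}
        for url, times in by_url.items()
    ]


# Hypothetical sample records standing in for the "stats" collection.
sample = [
    {"url": "/", "response_time": 120},
    {"url": "/", "response_time": 80},
    {"url": "/about", "response_time": 50},
]
```

Calling `summarize(sample)` yields one object per URL with the same three keys the JSONiq query returns.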

What do you think?


Levels of Abstractions in Big Data

Mikio L. Braun:

Many of the tools like Hadoop or NoSQL data bases are quite new and are still exploring concepts and ways to describe operations well. It’s not like the interface has been honed and polished for years to converge to a sweet spot. For example, secondary indices have been missing from Cassandra for quite some time. Likewise, whether features are added or not is more driven by whether it’s technically feasible than whether it’d make sense or not. But this often means that you are forced to model your problems in ways which might be inflexible and not suited to the problem at hand. (Of course, this is not special to Big Data. Implementing neural networks on a SQL database might feasible, but is probably also not the most practical way to do it.)

While it is an interesting read, I’m not sure I really got it. My understanding is that the author’s advice is that, regardless of your backend storage or Big Data architecture, you should always think of your data and processing tools in terms of higher-level concepts such as data structures, operations on data structures, and processing algorithms.

via: http://blog.mikiobraun.de/2012/09/big-data-abstraction-dsl.html