NoSQL theory: All content tagged as NoSQL theory in NoSQL databases and polyglot persistence

Creating a Simple Bloom Filter in Python

Max Burstein:

Bloom filters are super efficient data structures that allow us to tell if an object is most likely in a data set or not by checking a few bits. Bloom filters return some false positives but no false negatives. Luckily we can control the amount of false positives we receive with a trade off of time and memory.

Explanations and code included.
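As a rough illustration of the idea (this is not the code from the linked post), a Bloom filter can be sketched in a few lines of Python. The class name `BloomFilter`, its parameters, and the SHA-256 double-hashing trick are all assumptions chosen for this sketch. The sizing uses the textbook formulas for the number of bits and hash functions, given an expected item count and a target false positive rate.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sketch (hypothetical API, for illustration only)."""

    def __init__(self, expected_items, false_positive_rate):
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest
        # (double hashing); a real implementation might use faster hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter(expected_items=1000, false_positive_rate=0.01)
bf.add("alice")
print("alice" in bf)  # True: an added item is never reported as missing
```

Lowering `false_positive_rate` grows both the bit array and the number of hash functions, which is the time and memory trade-off the quote mentions.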

Original title and link: Creating a Simple Bloom Filter in Python (NoSQL database©myNoSQL)


A Data Store Independent of Consistency Models, Upfront Data Modeling and Access Algorithms

Tina Groves [1] in “Where Does Hadoop Fit in a Business Intelligence Data Strategy?”:

For example, the decision to move and transform operational data to an operational data store (ODS), to an enterprise data warehouse (EDW) or to some variation of OLAP is often made to improve performance or enhance broad consumability by business people, particularly for interactive analysis. Business rules are needed to interpret data and to enable BI capabilities such as drill up/drill down. The more business rules built into the data stores, the less modelling effort needed between the curated data and the BI deliverable.

That’s why Chirag Mehta’s ideal database featuring “an ubiquitous interface independent of consistency models, upfront data modeling, and access algorithms” is never going to be efficient. Actually, I’m not even sure it would make sense to build it.

  1. Tina Groves: Product Strategist, IBM Business Intelligence 



Why Can't RDBM Cluster the Way NoSQL Does? Distributed Database Architecture 101

A fabulous answer on StackExchange to a question that is on the minds of relational database users who have heard of NoSQL.

Distributed database systems are complex critters and come in a number of different flavours. If I dig deep in to the depths of my dimly remembered distributed systems papers I did at university (roughly 15 years ago) I’ll try to explain some of the key engineering problems to building a distributed database system.



Summary and Links for CAP Articles on IEEE Computer Issue

Daniel Abadi has posted a quick summary of the articles signed by Eric Brewer, Seth Gilbert and Nancy Lynch, Daniel Abadi, Raghu Ramakrishnan, Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell for the IEEE Computer issue dedicated to the CAP theorem. Plus links to most of them:

  1. Eric Brewer’s article republished by InfoQ
  2. Seth Gilbert and Nancy A. Lynch: Perspectives on the CAP theorem (PDF)
  3. Daniel Abadi: Consistency Tradeoffs in Modern Distributed Database System Design (PDF)
  4. Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell: Overcoming CAP with Consistent Soft-State Replication (PDF)



Why Is It Hard to Scale a Database, in Layman's Terms?

Quite enjoyable.



JSONiq: The JSON Query Language

Longtime reader William Candillon of 28msec sent me a link to JSONiq - The JSON Query Language, a group initiative to bring XQuery-like queryability to JSON:

Our goal in the JSONiq group, is to put the maturity of XQuery to work with JSON data. JSONiq is an open extension of the XQuery data model and syntax to support JSON.

After reading and experimenting a bit with JSONiq my initial thought is that while it looks interesting, it feels like an XMLish complicated query language that doesn’t really reflect the simplicity and philosophy of JSON.

let $stats := db:find("stats")
for $access in $stats
group by $url := $access("url")
return {
  "url": $url,
  "avg": avg($access("response_time")),
  "hits": count($access)
}

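For comparison, here is a sketch of the same aggregation in plain Python, assuming the `stats` collection has already been loaded as a list of dictionaries with `url` and `response_time` fields. The function name `summarize` and the sample data are made up for illustration; this is not part of JSONiq.

```python
from collections import defaultdict


def summarize(stats):
    """Group access records by URL and report average response time and hit count."""
    times_by_url = defaultdict(list)
    for access in stats:
        times_by_url[access["url"]].append(access["response_time"])
    return [
        {"url": url, "avg": sum(times) / len(times), "hits": len(times)}
        for url, times in times_by_url.items()
    ]


# Hypothetical sample records standing in for db:find("stats").
sample = [
    {"url": "/a", "response_time": 100},
    {"url": "/a", "response_time": 300},
    {"url": "/b", "response_time": 50},
]
print(summarize(sample))
```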
What do you think?


Levels of Abstractions in Big Data

Mikio L. Braun:

Many of the tools like Hadoop or NoSQL data bases are quite new and are still exploring concepts and ways to describe operations well. It’s not like the interface has been honed and polished for years to converge to a sweet spot. For example, secondary indices have been missing from Cassandra for quite some time. Likewise, whether features are added or not is more driven by whether it’s technically feasible than whether it’d make sense or not. But this often means that you are forced to model your problems in ways which might be inflexible and not suited to the problem at hand. (Of course, this is not special to Big Data. Implementing neural networks on a SQL database might be feasible, but is probably also not the most practical way to do it.)

While it is an interesting read, I’m not sure I really got it. My understanding is that the author’s advice is that, regardless of your backend storage or Big Data architecture, you should always think of your data and processing tools in terms of higher-level concepts such as data structures, operations on data structures, and processing algorithms.



Migrating Between Two Different Types of NoSQL Databases

Teacher asking a student:

After the Presentation the team leader asked me how it is, to migrate the db’s under various types. […] Can I migrate from a, key-value store db, like dynamo, to a, document store db, like mongoDB?

I’m not sure if this would have been reflected in the final grade, but I would have asked how many times the teacher had to, really had to, migrate data between multiple relational databases. And how many times did it work automatically? If allowed, I’d have followed up with a very brief dialogue about the complexity of migrating applications to different programming languages (even when they use the same programming paradigm) and brought up examples of important differences in how data structures are accessed and mutated. In the end I might have failed the exam though.

On a more serious note, there are so many aspects of migrating data that it is very difficult to give a good short answer to this question. A sign of this problem’s complexity is the wide range of companies and products trying to solve ETL.



Addiction to Familiar Systems

Marco Arment about switching from familiar systems or programming languages to better ones:

The fear of making the “wrong” choice actually makes the familiar, mastered PHP more attractive. […] If you can get PHP programmers to agree that they need to stop using it, the first question that comes up is what to use instead, and they’re met with a barrage of difficult choices and wildly different opinions and recommendations.

The same problem plagues anyone interested in switching to Linux (Which distro? Which desktop environment? Which package manager?), and the paralysis of choice-overload usually leads people to abandon the choice and just stick with Windows or OS X. But the switching costs of choosing the “wrong” programming language for a project are much larger.

Once you master a programming language or system you start seeing the other options from a different perspective. It doesn’t mean you have a better or an objective perspective though. What you’ve got is a new dimension you are considering in all decisions: familiarity. Every other option you have will go through your familiarity filter: does it feel familiar? does it allow me to do what I’ve been doing all this time? does it work in a similar way?

You might think that using a familiar system is all about productivity. I think that is only partially true. A familiar system doesn’t come with a learning curve and so in the early stages it feels productive. But many times you’ll just have to write the same things over and over again, avoid the same traps, and make the tweaks you’ve learned. In a way this part of being productive feels like repetition.

But what does all this have to do with databases? The answer is probably obvious.

Familiarity is in so many cases the main reason new systems start with a relational database. It feels familiar. It is familiar. As your application grows and new features are needed there will be cases when the relational database would become a less optimal solution. But in the name of familiarity, you’ll be tempted to stick with it. Make a change here and there, declare a feature too complicated, tweak it, optimize it. Repeat.

After a while, taking a step back might make you realize that what you’ve built is no longer familiar. Or maybe it’s still familiar to you, but to a new project team member it will feel different and new. Or maybe it is very similar to a different database that you could have started with.

The costs of sticking with familiar programming languages, systems, or databases could be much larger than you’d think.


The Benefits of Virtual Nodes and Performance Results

Sam Overton and Tom Wilkie of Acunu explain the advantages of using virtual nodes in distributed data storage engines and the performance they’ve measured after introducing virtual nodes in the Acunu platform, compared with Apache Cassandra:

One of the factors that limits the amount of data that can be stored on each node is the amount of time it takes to re-replicate that data when a node fails. That time matters, because it is a period during which the cluster is more vulnerable than normal to data loss. The challenge is that the more data stored on a node, the longer it takes to re-replicate it. Therefore, to store more data per node safely, we want to reduce the time taken to return to normal. This was one of our aims with virtual nodes.

Virtual Nodes reduces the time taken to re-replicate data as it involves every node in the cluster in the operation. In contrast, Apache Cassandra v1.1 will only involve a number of nodes equal to the Replication Factor (RF) of your keyspace. What’s more, with Virtual Nodes, the cluster remains balanced after this operation - you do not need to shuffle the tokens on the other nodes to compensate for the loss!
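To make the claim about re-replication concrete, here is a simplified model of a token ring, written as a sketch and not taken from Acunu or Cassandra. It assumes replicas are placed on the next distinct nodes walking clockwise around the ring, and it counts how many surviving nodes share data with a failed node and so can help rebuild it. The function names `build_ring`, `replicas`, and `rebuild_peers` are hypothetical.

```python
import hashlib


def token(label):
    """Map a label to a position on the ring using MD5 (an arbitrary choice here)."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16)


def build_ring(num_nodes, tokens_per_node):
    """Return a sorted list of (token, node) pairs; each pair owns one range."""
    return sorted(
        (token(f"node{n}-vnode{t}"), n)
        for n in range(num_nodes)
        for t in range(tokens_per_node)
    )


def replicas(ring, index, rf):
    """Walk clockwise from ring[index], collecting the first rf distinct nodes."""
    owners = []
    i = index
    while len(owners) < rf:  # assumes rf <= number of nodes
        node = ring[i % len(ring)][1]
        if node not in owners:
            owners.append(node)
        i += 1
    return owners


def rebuild_peers(ring, failed, rf):
    """Nodes that hold a copy of some range the failed node also held."""
    peers = set()
    for index in range(len(ring)):
        owners = replicas(ring, index, rf)
        if failed in owners:
            peers.update(o for o in owners if o != failed)
    return peers


single = rebuild_peers(build_ring(20, 1), failed=0, rf=3)
virtual = rebuild_peers(build_ring(20, 64), failed=0, rf=3)
print(len(single), len(virtual))
```

In this model, one token per node means only the failed node's immediate ring neighbours share its data, while many tokens per node scatter its ranges so that far more of the cluster can take part in the rebuild.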



Unstructured Data: What Is It?

Paige Roberts writes in a post about integrating predictive analytics with Hadoop:

Unstructured is really a misnomer. I think it was Curt Monash who coined the term polystructured. That makes a lot more sense, since if data was truly without structure, even humans wouldn’t be able to make sense of it. In every seemingly unstructured dataset, there is some form of structure. An email has structure. A web page has structure. A Twitter stream has structure. Facebook interactions have structure. Machine generated log files have structure. But none of those structures are remotely alike. Nor are they remotely similar to the structure of a standard transactional record.

I don’t think many people think of unstructured data as data with completely random structure. As I understand it, the term unstructured refers to three dimensions:

  1. variability: data representing the same entities can take different forms and contain different details. The simplest example I could think of is the information about a video shared on two different platforms.
  2. multi-purpose: data is not representing a single entity, but rather a set of related entities in an aggregated or compo
  3. data closer to natural language than mathematical structure: take for example some normal English text—according to the grammar rules it has structure, but it’s not easily understandable by machines (nb: maybe machine descriptiveness would be a better way to name this dimension)


How to Organize Your HBase Keys

The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation I find it easiest to think of your key like a funnel. Start with the piece of data you always need to partition on, and narrow it down to the more specific data that you don’t often need to distinguish.[…]

As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.

This sounds similar to how Amazon DynamoDB’s hash-and-range primary keys or Oracle NoSQL’s major-minor keys work.
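The funnel idea can be sketched without HBase at all, using a sorted list of byte strings to stand in for lexicographically ordered row keys. The key layout (customer, date, event), the `\x00` separator, and the helper names `make_key` and `prefix_scan` are assumptions made for this example; they are not an HBase API.

```python
from bisect import bisect_left

SEP = b"\x00"  # separator assumed never to appear inside a key component


def make_key(*parts):
    """Serialize components in funnel order: broadest first, most specific last."""
    return SEP.join(p.encode() for p in parts)


def prefix_scan(sorted_keys, *known_parts):
    """Return keys whose leading components match, using the sort order to seek."""
    prefix = make_key(*known_parts) + SEP
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(key)
    return matches


rows = sorted(
    make_key(customer, day, event)
    for customer, day, event in [
        ("acme", "2012-07-01", "login"),
        ("acme", "2012-07-01", "purchase"),
        ("acme", "2012-07-02", "login"),
        ("zenith", "2012-07-01", "login"),
    ]
)

print(len(prefix_scan(rows, "acme")))                # 3
print(len(prefix_scan(rows, "acme", "2012-07-01")))  # 2
```

Finding every row for a given day across all customers has no such shortcut in this layout: the date is not a leading component, so it would take a full scan. The same ordering also explains the caveat above, since a handful of very busy customers would concentrate traffic in one contiguous key range.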

Original title and link: How to Organize Your HBase Keys (NoSQL database©myNoSQL)