


A map to reviewing RavenDB code base

Ayende answers a series of questions about specific details of the RavenDB code base. This could be a very good starting point for those interested in how RavenDB is implemented.

Original title and link: A map to reviewing RavenDB code base (NoSQL database©myNoSQL)


Paper: Parallel Graph Partitioning for Complex Networks

Authored by a team from Karlsruhe Institute of Technology, the paper “Parallel graph partitioning for complex networks” presents a parallelized and adapted label propagation technique for partitioning graphs:

The graph partitioning problem is NP-complete [3], [4] and there is no approximation algorithm with a constant ratio factor for general graphs [5]. Hence, heuristic algorithms are used in practice.

A successful heuristic for partitioning large graphs is the multilevel graph partitioning (MGP) approach depicted in Figure 1, where the graph is recursively contracted to achieve smaller graphs which should reflect the same basic structure as the input graph.
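
To make the label propagation idea more concrete, here is a minimal sequential sketch in Python. It is not the paper's parallel algorithm: the cluster size cap (`max_size`), the adjacency-list input format, and the smallest-label tie-break are assumptions chosen for illustration. Each node repeatedly adopts the most frequent label among its neighbours, as long as the target cluster still has room; the resulting clusters are the groups a multilevel scheme would then contract into single nodes.

```python
from collections import Counter

def label_propagation(adj, max_size, rounds=10):
    """Cluster an undirected graph given as {node: [neighbours]}.

    max_size caps how many nodes a cluster may hold (illustrative assumption).
    Returns a dict mapping each node to its cluster label.
    """
    labels = {v: v for v in adj}          # every node starts in its own cluster
    sizes = Counter(labels.values())      # current size of each cluster
    for _ in range(rounds):
        changed = False
        for v in sorted(adj):             # fixed order keeps the run deterministic
            counts = Counter(labels[u] for u in adj[v])
            # A label is eligible if it is v's own or its cluster still has room.
            eligible = {l: c for l, c in counts.items()
                        if l == labels[v] or sizes[l] < max_size}
            if not eligible:
                continue
            best = max(eligible.values())
            # Tie-break: smallest label among the most frequent ones.
            new_label = min(l for l, c in eligible.items() if c == best)
            if new_label != labels[v]:
                sizes[labels[v]] -= 1
                sizes[new_label] += 1
                labels[v] = new_label
                changed = True
        if not changed:                   # converged early
            break
    return labels

# Two triangles joined by the single edge 2-3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
clusters = label_propagation(graph, max_size=3)
```

With `max_size=3` the two triangles end up in two separate clusters. Without a cap, the same tie-break would let one label spread across the bridging edge and swallow the whole graph, which is why some kind of size constraint matters when the clusters are meant to feed a balanced partition.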

To SQL or to NoSQL?

Bob Lambert’s thoughts on a post about migrating from a NoSQL database back to a relational database:

I really liked this post, but particularly for these two points:

  • All data is relational, but NoSQL is useful because sometimes it isn’t practical to treat it as such for volume/complexity reasons.
  • In the comments, Jonathon Fisher’s remarked that NoSQL is really old technology, not new. (Of course you have to like any commenter that uses the word “defenestrated”).

I have lost count of how many times I’ve read exactly these arguments. But let’s take a different look at this post and the original article:

  1. the first thing that strikes me is that there’s no mention of the NoSQL database that was used; adding to that there’s no explanation of what led to using that database in the first place. What if it was just an experiment? What if the initial implementation was just a fashionable decision?

  2. all data is relational

    A more accurate statement would be “all data is connected”. The way we represent these connections can take many different forms and often depends on how we use the data. This is exactly the core principle when doing data modeling in the NoSQL world too.

    The relational model is the most common as a consequence of the popularity of relational databases. One quick example of connected but not relational is hierarchical data, an area where relational databases are still not excelling (even if some have built custom solutions).

  3. “data in a relational model is optimized for the set of all possible operations”.

    Actually, the relational model optimizes for space efficiency and set operations. No model optimizes for everything. Take graph data and traversal operations as an obvious counterexample of relations and operations that are outside the capabilities of a relational database. And there are quite a few other examples: massive sparse matrices, etc.

  4. “Todd Homa recounts one horror story that shows how NoSQL data modelers must be aware of the corners into which they paint themselves as they optimize one access path at the expense of others.”

    This is like saying that everyone using a relational database has given up any chance of growing their application beyond a single server. Both statements are quite inaccurate.

Last, I think we should change the “choose the right tool for the job” advice to something a bit clearer: “understand and choose the trade-offs that correspond to your requirements”. Doesn’t sound as nice, but I think it’s better.



The data analytics handbook

A free book based on interviews with data scientists, data analysts, researchers. Available here.


MMS and the state of backups in MongoDB land

So just to be clear, if you are doing it yourself, you are probably settling for something other than a consistent snapshot. Even then, it’s not simple.

I’m always fascinated by companies introducing products by calling out how shitty and complicated their other products are. Axion. Now cleans 10 times better than before.



Four Easy Steps to Achieve 1 Million TPS on 1 Server using YCSB Benchmark [sponsor]

Words from myNoSQL’s supporters, Aerospike:

Last year, Aerospike published a ‘recipe’ describing how a database can be tuned to deliver 1 million TPS on a $5k server. This year, we simplified the recipe, applied it to Aerospike, and doubled performance using YCSB tests.

Find out how we did it in four easy steps:


A proposal for more reliable locks using Redis

Salvatore Sanfilippo:

Can we have a fast and reliable system at the same time based on Redis? This blog post is an exploration in this area. I’ll try to describe a proposal for a simple algorithm to use N Redis instances for distributed and reliable locks, in the hope that the community may help me analyze and comment the algorithm to see if this is a valid candidate.

As much as I like Redis, use this post as an exercise on how to reason about distributed locks and stick with ZooKeeper for the implementation.
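
As part of that exercise, here is a minimal Python sketch of the core idea of locking against N instances: a client holds the lock only if a majority of instances granted it. Everything here is a simplification and an assumption, not the algorithm from the post: `FakeInstance` is an in-memory stand-in rather than a Redis client, time is passed in explicitly as `now`, and timing concerns such as how long acquisition took or clock drift between machines are ignored, although any real design has to address them.

```python
import uuid

class FakeInstance:
    """In-memory stand-in for one Redis instance (hypothetical, no network)."""

    def __init__(self, up=True):
        self.up = up
        self.store = {}  # key -> (token, expires_at)

    def set_if_absent(self, key, token, ttl, now):
        """Mimic 'set only if the key does not exist, with an expiry'."""
        if not self.up:
            return False
        current = self.store.get(key)
        if current is not None and current[1] > now:
            return False  # held by another client and not yet expired
        self.store[key] = (token, now + ttl)
        return True

    def delete_if_owner(self, key, token):
        """Release the key only if this token still owns it."""
        if self.up and self.store.get(key, (None, 0))[0] == token:
            del self.store[key]

def release(instances, key, token):
    """Release the lock on every instance where this token holds it."""
    for inst in instances:
        inst.delete_if_owner(key, token)

def acquire(instances, key, ttl, now):
    """Return a token if a majority of instances granted the lock, else None."""
    token = uuid.uuid4().hex  # unique per attempt, so release is owner-safe
    granted = sum(inst.set_if_absent(key, token, ttl, now) for inst in instances)
    if granted >= len(instances) // 2 + 1:
        return token
    release(instances, key, token)  # roll back a partial acquisition
    return None

# Five instances, two of them down: a majority (3 of 5) is still reachable.
nodes = [FakeInstance() for _ in range(3)] + [FakeInstance(up=False) for _ in range(2)]
first = acquire(nodes, "job", ttl=10, now=0)    # succeeds
second = acquire(nodes, "job", ttl=10, now=1)   # fails: lock is held
```

The random token is what keeps a client from releasing a lock it no longer owns, for example after its own lock expired and another client took over. The hard questions start where this sketch stops: what happens when a client pauses after acquiring, or when instances restart and forget keys.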



Cascading components for Big Data applications

Jules S. Damji in a quick intro to Cascading:

At the core of most data-driven applications is a data pipeline through which data flows, originating from Taps and Sources (ingestion) and ending in a Sink (retention) while undergoing transformation along a pipeline (Pipes, Traps, and Flows). And should something fail, a Trap (exception) must handle it. In the big data parlance, these are aspects of ETL operations.

You have to agree that, compared with the MapReduce model, these components could bring a lot of readability to your code. On the other hand, at first glance the Cascading API still feels verbose.
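
To illustrate the shape of such a pipeline, here is a minimal Python sketch. It only borrows the vocabulary from the quote (source, pipes, sink, trap); it is not Cascading's Java API, and `run_flow` and its parameters are hypothetical names chosen for this example.

```python
def run_flow(source, pipes, sink, trap):
    """Push each record from source through every pipe and into sink.

    Records that raise an exception in any pipe are diverted to trap
    together with a description of the error, and the flow keeps going.
    """
    for original in source:
        record = original
        try:
            for pipe in pipes:
                record = pipe(record)
            sink.append(record)
        except Exception as exc:
            trap.append((original, repr(exc)))

source = ["1", "2", "x", "4"]        # ingestion: raw string records
pipes = [int, lambda n: n * 10]      # transformation steps, applied in order
sink, trap = [], []                  # retention and failure handling

run_flow(source, pipes, sink, trap)
# sink now holds [10, 20, 40]; trap holds the bad record "x" and its error
```

The readability gain comes from naming the stages: the flow reads as "parse, then scale, and set aside whatever fails", rather than as a pair of map and reduce functions with the error handling buried inside them.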



Play with data: Kinetica

Just wow!

Kinetica is a new app for visualizing and exploring data on tablets. Instead of forcing you to use a boring old spreadsheet, Kinetica lets you touch, sift, and play with your data in a physical environment. Each row of data becomes a circle that can be pulled like a magnet into charts, filtered through screens, and selectively highlighted.

Created by a team from Carnegie Mellon, Kinetica is an iPad app. The future of Tableau Software.



The era of the No-Design DataBase

Holger Mueller:

So could be the common thread of the new database boom the absence of a design component, the disposition of schema design step that was and is key for the success of any relational database?




Merge and serialization functions for Riak

Tom Crayford (Yeller) describes how to test the merge and serialization functions used to resolve potential conflicts in Riak:

The way I prefer using riak, is with allow_mult=true. This means that whenever you have causally conflicting writes to a key, riak will store all of them, and upon your next read of that key you have to resolve them yourself. Designing your datatypes such that you can merge them is a huge topic, and an area of active research. However, even once you have a merge strategy worked out, how can you be sure that your reasoning is good? The merge functions you use have to obey a few properties: they have to be commutative, idempotent and associative, or you’ll mess things up when you have conflicts
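
The three properties from the quote can be checked mechanically. Below is a minimal Python sketch, assuming sibling values are modeled as frozensets and merged with set union; both the `merge` function and the checker are hypothetical illustrations, not Yeller's code or a Riak client API.

```python
from itertools import product

def merge(a: frozenset, b: frozenset) -> frozenset:
    """Hypothetical merge function: the union of two sibling values."""
    return a | b

def check_merge_properties(merge_fn, samples):
    """Exhaustively check the three properties over a small set of samples."""
    for a, b, c in product(samples, repeat=3):
        assert merge_fn(a, b) == merge_fn(b, a), "not commutative"
        assert merge_fn(a, a) == a, "not idempotent"
        assert merge_fn(merge_fn(a, b), c) == merge_fn(a, merge_fn(b, c)), "not associative"
    return True

samples = [frozenset(), frozenset({1}), frozenset({1, 2}), frozenset({3})]
ok = check_merge_properties(merge, samples)  # set union passes all three checks
```

A merge that concatenates lists would fail this check: concatenation is associative, but it is neither commutative nor idempotent, so the resolved value would depend on the order and number of times siblings are merged. A property-based testing library can generate far more inputs than a hand-written sample list, at the cost of an extra dependency.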



Cloudera, Hadoop, Data warehouses and SLR camera

Amr Awadallah in an interview with Dan Woods for Forbes:

Our advantage is that we can encompass more data and run more workloads with less friction than any other platform. The analogy I use most often is the difference between the SLR camera and the camera on your smart phone. Almost everyone takes more pictures on their smart phone than on their SLR.

The SLR camera is like the enterprise data warehouse. The SLR camera is really, really good at taking pictures, in the same sense that an enterprise data warehouse is really, really good at running queries. But that’s the only thing it does. The data it picks is only exposed to that workload. The system we provide, the enterprise data hub, is more like the smartphone. It can take decent pictures—they won’t be as good as the SLR camera, and in this I’m referring to the Impala system. So Impala will run queries. The queries won’t run at the same interactive OLAP speeds that you get from a high-end data warehouse. However, for many use cases, that performance might be good enough, given that the cost is 10 times lower.

I’ve linked in the past to Ben Thompson’s visualizations of the innovator’s dilemma:

[Image: Ben Thompson’s innovator’s dilemma chart]

The explanation goes like this: incumbents’ products usually over-serve consumer needs, leaving room for new entrants’ good-enough, lower-priced products.
