


NoSQL debate: All content tagged as NoSQL debate in NoSQL databases and polyglot persistence

Use NoSQL, but Keep an RDBMS in Your Back Pocket

A very entertaining article getting most of the things right:

So clever programmers looked at this ridiculous edifice and realized the real problem: the data store and the use-case were mismatched. So they threw away ORM, SQL, and RDBMS, and wrote lovely new key-value stores, or object stores, or document stores, or searchable indexes, or any of a half-dozen other data structures that more closely matched what they were trying to do.


Is your data really just a giant hash lookup? Then a key-value store is what you want. Do you primarily access your related data via a single key? Then a document store is for you. Do you need full-text searching? Then, dear god, use a text-indexing engine, not an RDBMS. Do you need to answer questions about your data that you can’t predict in advance? Then make sure your data also ends up in an RDBMS. Maybe not in real-time, maybe summarized rather than in raw form, but somehow.

The only part I don’t agree with is the claim that ORM was created to deal with SQL. The reason behind ORMs is the object-relational impedance mismatch:

I want to be very, very clear about this: ORM is a stupid idea.

The birth of ORM lies in the fact that SQL is ugly and intimidating (because relational algebra is pretty hard, and very different to most other types of programming). Our programs already have an object-oriented model, and we already know one programming language — why learn a second language, and a second model? Let’s just throw an abstraction layer on top of this baby and forget there’s even an RDBMS down there. […]

You’ve stored your data in a way that doesn’t match your primary use-case, accessible via a language that you are not willing to learn. Your solution is to keep the store and the language and just wrap them in abstraction?

And the end of the post is fantastic:

So go forth, use your OMADS, keep an RDBMS in your back pocket, and stop being so mean to poor old SQL.


NoSQL Pioneers, Web's Manifest Destiny... What?

GigaOm clueless again?

The result is not a steady movement to non-relational databases or other methods of storing data, but a back-and-forth as programmers and businesses figure out what kind of architecture they need and what problems they want to solve.


Why Would Quora Use a NoSQL Database?

Before jumping into Adam D’Angelo’s explanation of why Quora uses MySQL instead of a NoSQL database, I think we should first ask ourselves: why would Quora need to look beyond a relational database?

As far as I can tell, and please take it with a grain of salt, Quora is just a fancy new form of forum: questions and non-hierarchical answers. So I’d speculate that, based only on the amount of traffic and the volume and frequency of posts, scalability could be the only concern for Quora. But now let’s see what Adam says:

  1. If you partition your data at the application level, MySQL scalability isn’t an issue.
  2. These distributed databases like Cassandra, MongoDB, and CouchDB aren’t actually very scalable or stable.
  3. The primary online data store for an application is the worst place to take a risk with new technology.
  4. You can actually get pretty far on a single MySQL database and not even have to worry about partitioning at the application level.
  5. Many of the problems created by manually partitioning the data over a large number of MySQL machines can be mitigated by creating a layer below the application and above MySQL that automatically distributes data.
  6. Personally, I believe the relational data model is the “right” way to structure most of the data for an application like Quora

These confirm my initial thoughts, and they also provide a decent and correct perspective on choosing your storage solution.
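To make the first and fifth points more concrete, here is a minimal Python sketch of application-level partitioning with a thin routing layer. Everything in it is an assumption made for illustration: the shard names, the `shard_for` helper, and the dictionaries standing in for MySQL servers are hypothetical and say nothing about how Quora actually does it.

```python
import hashlib

# Hypothetical shard names; in a real deployment each would be a MySQL server.
SHARDS = ["mysql-0", "mysql-1", "mysql-2", "mysql-3"]


def shard_for(key, shards=SHARDS):
    """Map a key to a shard using a stable hash (md5 rather than Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


class ShardedStore:
    """Thin routing layer that sits below the application and above the databases."""

    def __init__(self, shards=SHARDS):
        self.shards = list(shards)
        # Plain dictionaries stand in for per-shard database connections.
        self.backends = {name: {} for name in self.shards}

    def put(self, key, row):
        self.backends[shard_for(key, self.shards)][key] = row

    def get(self, key):
        return self.backends[shard_for(key, self.shards)].get(key)


store = ShardedStore()
store.put("question:42", {"title": "Why would Quora use a NoSQL database?"})
print(shard_for("question:42"), store.get("question:42"))
```

The application only ever talks to `ShardedStore`, so the partitioning scheme can change without touching application code. Note that plain modulo hashing moves most keys when the number of shards changes; consistent hashing or an explicit lookup table are common ways around that.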


It’s the End of the World As We Know It (NoSQL Edition)

Some good comments on Michael Stonebraker’s paper ☞ The End of an Architectural Era (pdf):

The provocatively named paper is simply a description of a system, designed from scratch for modern OLTP requirements and the demonstration that this system gives better performance than traditional RDBMS on OLTP type load. The conclusion is that since RDBMS can’t even excel at OLTP – it must be destined for the garbage pile. I’ll ignore the fact that hybrid systems are far from extinct and look at the paper itself.

While it’s kind of difficult to disagree with Michael Stonebraker, I still believe that some of the design considerations in that paper are not really requirements, but are simply based on his work on H-Store and VoltDB. Take for example “the OLTP database should fit entirely in-memory” or “the OLTP database should be single threaded”: these sound like implementation details, not functional requirements.


Two Answers to Why NoSQL

Long post about building a prototype application with two nice answers to the question: why NoSQL?

Well, cost for one. If I could afford Oracle I’d sooner use that than go NoSQL in all likelihood. I can’t afford it. Not even close. Oracle might as well charge me a small planet for their product. It’s great stuff, but out of reach. And what about sharding? Sharding a relational database sucks, and to try to hide the fact that it sucks requires you to pile on all kinds of other crap like query proxies, pools, and replication engines, all in an effort to make this beast do something it wasn’t meant to do: scale beyond a single box. All this stuff also attempts to mask the reality that you’ve also thrown your hands in the air with respect to at least 2 letters that make up the ACID acronym. What’s an RDBMS buying you at that point? Complexity.

And there’s another cost, by the way: no startup I know has the kind of enormous hardware that an enterprise has. They have access to commodity hardware. Pizza boxes. Don’t even get me started on storage. I’ve yet to see SSD or flash storage at a startup.

And yes, it is once again about complexity and operational costs.


Goodbye Tokyo Cabinet

A while ago I started to sound like a broken record, saying that even if data modeling in NoSQL seems too simple to be true, it is in fact once again an art we need to master, as otherwise sooner or later we will start complaining about it. Well, it looks like that time is now.

Brian on why he stopped using Tokyo Cabinet:

I have no idea how to use a key/value store database properly. TC will take anything you dump into it, which is both a strength and a weakness.

[…] Some kinds of queries were still awkward in TC.

So if you still think schema-less means no data modeling, then you are doing it wrong!
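As a small illustration of why modeling still matters, consider answering "which users live in a given city?" on top of a key-value store. The sketch below is only an assumption-laden example: a plain Python dict stands in for the store, and the key naming scheme (`user:<id>`, `city:<name>`) and the `save_user` / `users_in_city` helpers are made up for this post, not taken from Tokyo Cabinet or from Brian's code.

```python
# A plain dict stands in for a key-value store; it only knows get and put by key.
kv = {}


def save_user(user_id, name, city):
    """Store the record and keep a hand-built secondary index on city up to date."""
    old = kv.get(f"user:{user_id}")
    if old is not None:
        # The store will not do this for us: drop the stale index entry ourselves.
        kv[f"city:{old['city']}"].discard(user_id)
    kv[f"user:{user_id}"] = {"name": name, "city": city}
    kv.setdefault(f"city:{city}", set()).add(user_id)


def users_in_city(city):
    """Answer a non-key query through the index instead of scanning every record."""
    return sorted(kv.get(f"city:{city}", set()))


save_user(1, "Ann", "Oslo")
save_user(2, "Bob", "Oslo")
save_user(1, "Ann", "Rome")  # Ann moves; the index has to follow
print(users_in_city("Oslo"), users_in_city("Rome"))
```

The store happily accepts whatever is written to it, so the key layout and the index maintenance are design decisions the application has to make and keep consistent. That is data modeling, just without a schema enforcing it.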


MarkLogic: “We are NoSQL too”

Lately Dave Kellogg, CEO of Mark Logic Co, has been posting a series of articles in his attempt to associate the MarkLogic XML server with the NoSQL space.

We should start by looking at what MarkLogic is offering, and I’ll be using ☞ Dave Kellogg’s list as a reference:

  1. Unstructured data. This means not only dealing with data in odd structures (e.g., sparse and/or semi-structured data), but also handling words and all the challenges that go with them.
  2. Scaling on cheap hardware. In effect, scaling like Google, using racks of inexpensive pizza boxes instead of big, expensive computers with expensive SANs attached. This is accomplished via shared-nothing clustering.
  3. A non-relational data model. MarkLogic Server uses the XML data model.
  4. Document-orientation. MarkLogic is a document-oriented system, meaning that the fundamental modeling unit is the (XML) document and that the system includes search functionality, in the same way that a smartphone includes a GPS.
  5. Ad hoc queries. A reductionist mission statement for MarkLogic Server is “to perform database-style queries on unstructured information.” (See diagram below.)
  6. Standard interfaces. We believe in standard interfaces, in part because it’s in our self-interest to do so. Standards help de-risk the purchase of new technologies from high-growth vendors. We support a number of W3C standards XQuery, XPath, XML, xHTML, XPointer, and coming soon, XSLT.
  7. ACID transactions. We’re database guys. While we’ll let you turn off the transaction system and are in the midst of implementing replication with a consistency dial, by default we do ACID.

While doing my share of research, I couldn’t find any technical references on how MarkLogic works in distributed environments[1] or how it provides ACID guarantees in such environments. Hopefully we will see more details about these sooner rather than later.

Now, the part I cannot agree with is ☞ Dave’s conclusion that:

MarkLogic provides a best-of-both-worlds option between open source NoSQL systems and traditional DBMSs.

Like open source NoSQL systems, MarkLogic provides shared-nothing clustering on inexpensive hardware, superior support for unstructured data, document-orientation, and high-performance. But like traditional databases, MarkLogic speaks a high-level query language, implements industry standards, and is commercial-grade, supported software.

I would even say that this conclusion invalidates most (if not all) of the other points in his post.

1. NoSQL systems come in many flavors

This statement is correct as the fundamental philosophy behind NoSQL systems is having the option to use the best tool for your scenario. On the other hand, at a logical level it contradicts the above conclusion.

2. NoSQL is part of a broader trend in database systems: specialization.

That is correct too. But again it contradicts the conclusion: a system that is specialized cannot be the “best-of-both-worlds”, as that would imply the existence of “silver bullet” solutions.

3. NoSQL is largely orthogonal to specialization.

Unfortunately this one is incorrect. Most (if not all) existing “core”[2] NoSQL solutions have been created to solve very specific problems. And while some make the mistake of treating them as jacks-of-all-trades, hopefully that is not the trend.

4. NoSQL isn’t about open source.

Indeed, NoSQL is not about open source. It is about operational costs, complexity costs, integration, extensibility, etc. None of these implies open source per se, but there must be a reason why users are finding that open source solutions address these requirements better than others.

5. Most open source NoSQL systems have proprietary interfaces.

That’s correct too, and I’d say one of the reasons is specialization, which is another contradiction with the other points. On the other hand, there are clear signs that each of the NoSQL projects is working on offering friendly protocols and integrating nicely with other tools.

Summarizing: while I do understand why it makes a lot of sense to associate MarkLogic with the NoSQL space (and there are too many reasons for doing it that do not fit well on myNoSQL), I’d definitely appreciate it if things remained as objective as possible and were based on facts only. In the end it will be the users who decide whether they want to call MarkLogic NoSQL or not.

  1. The only references I’ve found are to database failover, hot host add/delete, and fast host restart, with no other details. Putting MarkLogic on the map of distributed storage system classification would be really useful.
  2. When saying “core” NoSQL systems, I’m referring to all systems that have been associated with NoSQL since the term came up.

Is NoSQL known in the Microsoft world?

Kevin Kline (strategy manager for SQL Server at Quest Software) and Brent Ozar (SQL Server DBA expert at Quest Software):

Ozar: There are two common scenarios for why you would consider using something other than your typical relational database. One is data that is not worth very much money.


Kline: We’re going to look at other ways to look up data: key value stores, what’s the other one called?

Ozar: XML columnar storage, XML property bags.

Sounds pretty uninformed and makes me wonder what is known about NoSQL in the Microsoft world.


3 Differences between RDF Databases and Other NoSQL Solutions

RDF database systems form the largest subset of this last NoSQL category. RDF data can be thought of in terms of a decentralized directed labeled graph wherein the arcs start with subject URIs, are labeled with predicate URIs, and end up pointing to object URIs or scalar values.

Bottom line: it sounds like there’s only one difference, standardization.

  • A simple and uniform standard data model: all RDF database systems share the same well-specified and W3C-standardized data model at their base.
  • A powerful standard query language: SPARQL is a very big win for RDF databases here, providing a standardized and interoperable query language that even non-programmers can make use of, and one which meets or exceeds SQL in its capabilities and power while retaining much of the familiar syntax.
  • Standardized data interchange formats: RDF databases, by contrast, all have import/export capability based on well-defined, standardized, entirely implementation-agnostic serialization formats such as N-Triples and N-Quads.
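To get a feel for the data model described above, here is a minimal Python sketch that treats RDF data as a set of (subject, predicate, object) triples, matches simple patterns against it, and writes one triple as an N-Triples line. The `example.org` URIs, the `match` and `to_ntriples_line` helpers, and the crude URI-versus-literal check are assumptions made for illustration; this is not SPARQL and not a full N-Triples serializer (no escaping, datatypes, or blank nodes).

```python
# Each triple is one labeled arc of the graph: (subject, predicate, object).
# The example.org URIs are hypothetical; the predicates come from the FOAF vocabulary.
ALICE = "http://example.org/alice"
BOB = "http://example.org/bob"
KNOWS = "http://xmlns.com/foaf/0.1/knows"
NAME = "http://xmlns.com/foaf/0.1/name"

TRIPLES = {
    (ALICE, KNOWS, BOB),    # object is another URI
    (ALICE, NAME, "Alice"),  # object is a scalar value
    (BOB, NAME, "Bob"),
}


def match(s=None, p=None, o=None, triples=TRIPLES):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def to_ntriples_line(s, p, o):
    """Serialize one triple; anything not starting with http:// is treated as a plain literal."""
    obj = f"<{o}>" if o.startswith("http://") else f'"{o}"'
    return f"<{s}> <{p}> {obj} ."


for triple in match(s=ALICE):
    print(to_ntriples_line(*triple))
```

Because every system agrees on this triple shape and on the line-oriented serialization, data written by one RDF store can be loaded by another, which is the standardization point made in the list above.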


The future belongs to the companies and people that turn data into products

An article on the next generation of apps built on top of data intelligence, which also talks about the NoSQL space and big data processing.

Why do we suddenly care about statistics and about data?

In this post, I examine the many sides of data science — the technologies, the companies and the unique skill sets.

An attempt to summarize the core ideas:

  • I keep saying that the sexy job in the next 10 years will be statisticians.

    ☞ Hal Varian, Chief Economist at Google

  • Data is the next Intel Inside

    — Tim O’Reilly

  • user generated data does contain intelligence. It is just a matter of us making sense of it

  • data comes from everywhere and in various formats
  • Google, Amazon, Facebook, LinkedIn, etc. are the first doing it in different areas
  • Most of the organizations that have built data platforms have found it necessary to go beyond the relational database model. Traditional relational database systems stop being effective at this scale. Managing sharding and replication across a horde of database servers is difficult and slow. The need to define a schema in advance conflicts with reality of multiple, unstructured data sources, in which you may not know what’s important until after you’ve analyzed the data.

    Simply put, this is about complexity: the new dimension of scalability and operational costs, as seen in Twitter migrating to Cassandra.

  • Storing data is only part of building a data platform, though. Data is only useful if you can do something with it, and enormous datasets present computational problems.

    We are closely following Hadoop, Pig, Hive, and Cascalog, as well as new approaches to a common NoSQL query language, like Toad for Cloud, as alternatives for putting NoSQL data to work.


A Common NoSQL Query Language

A couple of days ago I posted about the pros and cons of working on a (new) common query language for document databases. Hans Marggraff has since generalized the question, ☞ writing:

NoSQL databases lack a common query language, that can provide the basis for a vendor independent tool ecosystem.

I should probably confess that over a year ago I was asking for the same thing when I published the alternative data storage status quo.

Meanwhile I have understood that there are probably better ways to deal with the NoSQL custom query space:

  1. avoiding, as much as possible, running reports on live servers, and using specialized/dedicated solutions for them instead (Tekpub uses both MongoDB and MySQL to deal with this common scenario, and they feel very strongly about this separation)
  2. building high-level languages or tools to work with your reporting and data warehouse. I’m referring here to Hadoop, Pig, and Cascalog. To get an idea of what I mean, check these awesome presentations on Hadoop, Pig, and Cascalog from a Hadoop meet-up showcasing their usage at Twitter, BackType, and others.

As something of a confirmation of these approaches, Quest Software launched Toad for Cloud[1] yesterday, a tool that supports querying data across different NoSQL solutions by providing an indirection layer that interfaces with their native querying capabilities. You can see more about this tool in the videos posted on their website.

So, I’d say there’s no need for a common (artificial) NoSQL query language. We are already seeing tools dealing with the different APIs and I’m pretty sure more will come.
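To show what such an indirection layer might look like in principle, here is a small Python sketch. It is purely hypothetical and is not how Toad for Cloud is implemented: the `Backend` interface, the two toy backends, and `query_all` are names invented for this example, and in-memory data stands in for real stores.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical common interface; each adapter translates it to a store's native API."""

    @abstractmethod
    def find(self, field, value):
        ...


class KeyValueBackend(Backend):
    """Store with no native querying: the adapter has to scan every value."""

    def __init__(self, data):
        self.data = data  # key -> record (dict)

    def find(self, field, value):
        return [rec for rec in self.data.values() if rec.get(field) == value]


class IndexedBackend(Backend):
    """Store with one native secondary index: use it when the query allows."""

    def __init__(self, rows, indexed_field):
        self.rows = list(rows)
        self.indexed_field = indexed_field
        self.index = {}
        for row in self.rows:
            self.index.setdefault(row.get(indexed_field), []).append(row)

    def find(self, field, value):
        if field == self.indexed_field:
            return list(self.index.get(value, []))
        return [row for row in self.rows if row.get(field) == value]


def query_all(backends, field, value):
    """The tool-facing call: one query, fanned out to every backend."""
    results = []
    for backend in backends:
        results.extend(backend.find(field, value))
    return results


kv = KeyValueBackend({"u1": {"name": "Ann", "city": "Oslo"}})
docs = IndexedBackend([{"name": "Bob", "city": "Oslo"}, {"name": "Eve", "city": "Rome"}], "city")
print(query_all([kv, docs], "city", "Oslo"))
```

Each adapter is free to use whatever its store does best, which is why a layer like this does not require the stores themselves to agree on a common query language.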

Modeling Life and Data

Michael Will:

The proper representation of life is not tabular, but associative. The structure of life is not relational, but hierarchical. Relation is a poor term that falls far short of capturing dynamic connections. […] Shoehorning life science into relational databases is a very lossy process.

☞ Cassandra for Life Science (pdf)

I wholeheartedly agree! But from this perspective it looks like graph databases come closest to modeling real life.