nosql debate: All content on NoSQL databases and projects about nosql debate, featuring the best daily NoSQL articles, news, and links on nosql debate

Big Data and the Need for New Approaches to Data Integration

by Alex Popescu

I’d say Dave Linthicum got some things wrong:

First is the ability to manage large data sets more efficiently than with traditional relational technology as done in the past. The methodology is to leverage an approach called MapReduce.

MapReduce is about processing data, but you have to store that data first.

The “Map” portion of MapReduce is the master node that accepts the request and divides it among any number of worker nodes. The “Reduce” portion means that the master node considers the results from the worker nodes and combines them to determine the answer to the request. The power of this architecture is the simplistic nature of MapReduce, meaning it’s both easy to understand and to implement.

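??? Map and Reduce are not roles played by a master node. Map is a function applied to each input record that emits key-value pairs; reduce is a function that combines all the values emitted for the same key.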
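
For reference, here is a minimal single-process sketch of the programming model that the name MapReduce usually refers to, using the classic word count. The function names are illustrative and do not come from any framework; a real system would run the map and reduce calls on many worker nodes and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different worker; here it is a loop.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big clusters process big data"])
```

Note that nothing in the sketch says where the input documents live, which is the point made above: the data has to be stored somewhere before it can be processed.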

It is clear to me that using the cloud’s ability to provide massive amounts of commodity computing power, on-demand, when combined with a database architecture that will exploit that power means data processing power on scales we have never seen at these low price points.

I’m not yet convinced of this. Processing in the cloud is indeed a good option, but the data must be available in the cloud. And in the case of big data, neither storing it in the cloud nor moving it there seems to be the best alternative.

Big Data and the Need for New Approaches to Data Integration originally posted on the NoSQL blog: myNoSQL


NoSQL Databases Aren't Hierarchical

by Alex Popescu

Unfortunately, this is based on a wrong hypothesis:

However most of the NoSQL tools seem to be NoRelational. As I see it, many of these tools map closely to the model that the relational model replaced.. the hierarchical model. Some describe themselves as hierarchical.

While I’m not sure which NoSQL databases the author is referring to, from my point of view the common denominator of column stores, document databases, and key-value stores is the key-value model, which is not hierarchical. Graph databases, on the other hand, use the graph model at their core, which is again different from the hierarchical model. The Java Content Repository implementations (e.g. Jackrabbit) are the only hierarchical systems I’m aware of, so the hypothesis doesn’t apply.
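
To make the distinction concrete, here is a rough sketch of the three data models using plain Python structures. The records and the `get_by_path` helper are made up for illustration and do not reflect the API of any particular database.

```python
# Key-value model: a flat namespace, every record is reached directly by its key.
kv_store = {
    "user:1": {"name": "Ann"},
    "user:2": {"name": "Bob"},
}

# Hierarchical model: every record has exactly one parent and is reached
# by walking a path down from the root.
tree = {"users": {"ann": {"posts": {"first": "Hello"}}}}


def get_by_path(root, path):
    """Walk the tree one path segment at a time."""
    node = root
    for part in path.split("/"):
        node = node[part]
    return node


# Graph model: nodes connected by arbitrary edges; a node can be reached
# from many other nodes and there is no single root.
graph = {"ann": {"bob", "carol"}, "bob": {"ann"}, "carol": set()}

bob = kv_store["user:2"]
first_post = get_by_path(tree, "users/ann/posts/first")
```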

NoSQL Databases Aren’t Hierarchical originally posted on the NoSQL blog: myNoSQL


Heroku Encourages Polyglot Persistence

by Alex Popescu

Heroku published an article preaching polyglot persistence through a Database-as-a-Service approach:

Database-as-a-service is one of the coming decade’s most promising business models. […] DaaS also goes hand-in-glove with polyglot persistence. Thanks to database services, you won’t need to learn how to sysadmin/DBA for every datastore you use – you can instead outsource that job to a service provider specializing in each database.

While it definitely sounds exciting to be able to use all these NoSQL databases, we should always keep in mind the cost of complexity, even if DaaS helps alleviate some of the complexity of heterogeneous systems.

The article also includes some interesting use cases for a couple of NoSQL databases:

  • Frequently-written, rarely read statistical data (for example, a web hit counter) should use an in-memory key/value store like Redis, or an update-in-place document store like MongoDB.
  • Big Data (like weather stats or business analytics) will work best in a freeform, distributed db system like Hadoop.
  • Binary assets (such as MP3s and PDFs) find a good home in a datastore that can serve directly to the user’s browser, like Amazon S3.
  • Transient data (like web sessions, locks, or short-term stats) should be kept in a transient datastore like Memcache. (Traditionally we haven’t grouped memcached into the database family, but NoSQL has broadened our thinking on this subject.)
  • If you need to be able to replicate your data set to multiple locations (such as syncing a music database between a web app and a mobile device), you’ll want the replication features of CouchDB.
  • High availability apps, where minimizing downtime is critical, will find great utility in the automatically clustered, redundant setup of datastores like Cassandra and Riak.

These are good examples, but you can find many more in our coverage of NoSQL use cases and the per-product case studies, such as the CouchDB case studies or the MongoDB case studies.

Heroku Encourages Polyglot Persistence originally posted on the NoSQL blog: myNoSQL


NoSQL Databases Should Support SQL Queries

by Alex Popescu

Nati Shalom uses the old CS saying “Any software problem can be solved by adding another layer of indirection” to suggest that NoSQL databases could support SQL queries (among other things):

The key is the decoupling of the query semantics from the underlying data-store as illustrated in the diagram below:

[Diagram: SQL engine indirection]

While it’s difficult to strongly argue against it, the real question is: how difficult will it be for such a layer to calculate the costs of such queries? Or, put differently:

The two software problems that can never be solved by adding another layer of indirection are that of providing adequate performance or minimal resource usage.

— Jeff Kesselman
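
A toy sketch of why the cost question matters. `CountingKVStore` and `select_where` are hypothetical names invented for this illustration; they stand in for a key-value store and a thin SQL-like layer on top of it, and do not represent any real product.

```python
class CountingKVStore:
    """Hypothetical key-value store that counts how many records are read."""

    def __init__(self, records):
        self._records = dict(records)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._records.get(key)

    def scan(self):
        for key, value in self._records.items():
            self.reads += 1
            yield key, value


def select_where(store, field, value):
    """Toy query layer for: SELECT * WHERE field = value.

    A lookup on the key is a single read; any other field forces a full scan,
    because the underlying store only knows how to find records by key.
    """
    if field == "id":
        row = store.get(value)
        return [row] if row is not None else []
    return [row for _, row in store.scan() if row.get(field) == value]


store = CountingKVStore(
    {i: {"id": i, "city": "Paris" if i % 2 else "Rome"} for i in range(1000)}
)

by_key = select_where(store, "id", 42)
reads_by_key = store.reads

by_city = select_where(store, "city", "Paris")
reads_by_city = store.reads - reads_by_key
```

Both calls look equally innocent at the query level, but the first touches one record and the second touches all of them. A generic layer has to know this about every underlying store before it can estimate what a query will cost.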


Why Should Rubyists be Interested in NoSQL?

by Alex Popescu

Jesse Wolgamott answers the question: why should Rubyists be interested in NoSQL?

Once you reach the point in transaction system where the database is the scalability cause of your scalability problems, there’s no going back. You’ve taken the red pill. Table-based transaction databases are constrained by memory and there’s a hard maximum until your app crawls to a halt. The dream of true replication and easy sharding is built in.

Also: migrations just suck, even in Rails.

It is interesting to note that Jesse’s talk is about MongoDB, CouchDB, RavenDB, and Amazon SDB, the first three of which are not known for built-in scalability features. That is not to say they cannot scale (see, for example, scaling CouchDB), and each of them has an attractive feature set, but there are already other NoSQL databases that provide better and easier scalability: Cassandra, HBase, Riak, Project Voldemort.


Just say NoSQL

by Alex Popescu

An article making quite a few strong statements. Some I agree with, some I don’t.

New waves of application development technology are often incompatible with old ways of thinking. Typically, when a brave new world opens to programmers, a healthy portion of them will cast aside the old ways in favor of the new. But the NoSQL movement is not about throwing out your SQL databases to be replaced by key-value stores. NoSQL, ironically, has nothing to do with avoiding SQL, and everything to do with the judicious use of relational databases.

Take for example:

He said (nb Mike Gualtieri, senior analyst Forrester Research) that saving actual customer purchasing information is better suited to a relational database, while storing more ephemeral information, such as customer product ratings and comments, is more appropriate for a NoSQL database.

Saying that NoSQL is fit for “ephemeral information” is a mistake. It amounts to saying: put your “cheap” data into NoSQL and your “important” data into relational databases. You don’t pick one programming language for a product that is not so important and a different language for an important one. You always take a lot of aspects into consideration before making that decision. The same applies to choosing the storage backend.

Or:

Ellis (nb Jonathan Ellis, Cassandra lead and founder of Riptano) said that the developers at Digg invented a rule of thumb for deciding whether or not an environment necessitates a NoSQL database like Cassandra: “If you’re layering memcached on top of MySQL, you’re inventing an ad hoc NoSQL database by doing that,” said Ellis.

Well, that pretty much sounds like: “if you are using a dict/hash/map then you are inventing an ad hoc NoSQL database”. Personally, I think that using a caching mechanism accessible through simple get/set operations just means that 1) memory access is faster than anything else, and 2) most of the time we like accessing our data in different ways.
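
For readers who haven’t seen the pattern Ellis describes, this is roughly what layering memcached on top of MySQL looks like. It is a minimal sketch: `CountingDatabase` and `CacheAside` are made-up names, a plain dict stands in for memcached, and another dict stands in for MySQL.

```python
class CountingDatabase:
    """Stand-in for MySQL that counts how many queries reach it."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.queries = 0

    def select(self, key):
        self.queries += 1
        return self._rows.get(key)

    def update(self, key, value):
        self._rows[key] = value


class CacheAside:
    """Get/set access, with an in-memory dict standing in for memcached."""

    def __init__(self, database):
        self._database = database
        self._cache = {}

    def get(self, key):
        # Serve from memory when possible, fall back to the database otherwise.
        if key in self._cache:
            return self._cache[key]
        value = self._database.select(key)
        if value is not None:
            self._cache[key] = value
        return value

    def set(self, key, value):
        # Write to the database and drop the cached copy so it cannot go stale.
        self._database.update(key, value)
        self._cache.pop(key, None)


database = CountingDatabase({"user:1": "Ann"})
store = CacheAside(database)
first = store.get("user:1")   # goes to the database
second = store.get("user:1")  # served from the cache
```

Whether that counts as an ad hoc NoSQL database or just a cache is exactly the disagreement above.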

All in all, a good read built around quotes from different people involved in or watching the NoSQL market.


NoSQL and The Future of CMS

by Alex Popescu

It is interesting to check whether the requirements of a CMS are a good fit for NoSQL solutions:

  1. Richly structured content types
  2. Unstructured binary objects
  3. Relationships / references / associations
  4. The ability to evolve content models over time (what I call “schema evolution”)
  5. Branch / merge (in the Source Code Management (SCM) sense of the term)
  6. Snapshot based versioning
  7. ACID transactions
  8. Scalability to large content sets
  9. Geographic distribution

contentcurmudgeon.wordpress.com

The only requirement that doesn’t seem to be satisfied by most NoSQL solutions is “ACID transactions”. But if this can be translated into atomic and durable operations, I think most NoSQL solutions will pass this test too.

The guys from Outerthought, builders of the Daisy CMS, have been publishing a lot recently about their decision to build their next-generation CMS (Lily) on top of HBase. Their presentation at Berlin Buzzwords was titled “Learning Lessons: Building a CMS on top of NoSQL technologies”.

Another useful resource for understanding the needs behind a CMS is ☞ Outerthought’s technology choices.


Use NoSQL, but Keep an RDBMS in Your Back Pocket

by Alex Popescu

A very entertaining article that gets most things right:

So clever programmers looked at this ridiculous edifice and realized the real problem: the data store and the use-case were mismatched. So they threw away ORM, SQL, and RDBMS, and wrote lovely new key-value stores, or object stores, or document stores, or searchable indexes, or any of a half-dozen other data structures that more closely matched what they were trying to do.

[…]

Is your data really just a giant hash lookup? Then a key-value store is what you want. Do you primarily access your related data via a single key? Then a document store is for you. Do you need full-text searching? Then, dear god, use a text-indexing engine, not an RDBMS. Do you need to answer questions about your data that you can’t predict in advance? Then make sure your data also ends up in an RDBMS. Maybe not in real-time, maybe summarized rather than in raw form, but somehow.

The only part I don’t agree with is the claim that ORM was created to deal with SQL. The reason behind ORMs is the object-relational impedance mismatch:

I want to be very, very clear about this: ORM is a stupid idea.

The birth of ORM lies in the fact that SQL is ugly and intimidating (because relational algebra is pretty hard, and very different to most other types of programming). Our programs already have an object-oriented model, and we already know one programming language — why learn a second language, and a second model? Let’s just throw an abstraction layer on top of this baby and forget there’s even an RDBMS down there. […]

You’ve stored your data in a way that doesn’t match your primary use-case, accessible via a language that you are not willing to learn. Your solution is to keep the store and the language and just wrap them in abstraction?

And the end of the post is fantastic:

So go forth, use your OMADS, keep an RDBMS in your back pocket, and stop being so mean to poor old SQL.


NoSQL Pioneers, Web's Manifest Destiny... What?

by Alex Popescu

GigaOm clueless again?

The result is not a steady movement to non-relational databases or other methods of storing data, but a back-and-forth as programmers and businesses figure out what kind of architecture they need and what problems they want to solve.


Why Would Quora Use a NoSQL Database?

by Alex Popescu

Before jumping into Adam D’Angelo’s details about the decision to use MySQL at Quora instead of a NoSQL database, I think we should first ask ourselves: why would Quora need to look beyond a relational database?

As far as I can tell, and please take this with a grain of salt, Quora is just a fancy new form of a forum: questions and non-hierarchical answers. So I’d speculate that, based only on the amount of traffic and the amount and frequency of posts, scalability could be the only concern for Quora. But now let’s see what Adam says:

  1. If you partition your data at the application level, MySQL scalability isn’t an issue.
  2. These distributed databases like Cassandra, MongoDB, and CouchDB aren’t actually very scalable or stable.
  3. The primary online data store for an application is the worst place to take a risk with new technology.
  4. You can actually get pretty far on a single MySQL database and not even have to worry about partitioning at the application level.
  5. Many of the problems created by manually partitioning the data over a large number of MySQL machines can be mitigated by creating a layer below the application and above MySQL that automatically distributes data.
  6. Personally, I believe the relational data model is the “right” way to structure most of the data for an application like Quora

These confirm my initial thoughts, and they also provide a sound perspective on choosing your storage solution.
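
As a rough illustration of what partitioning at the application level means, here is a minimal sketch that routes each key to one of several stores by hashing it. `ShardedStore` is a made-up name, plain dicts stand in for separate MySQL servers, and nothing here reflects how Quora actually does it.

```python
import hashlib


class ShardedStore:
    """Routes each key to one of several shards by hashing the key."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate database server.
        self._shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash, so the same key always lands on the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self._shards[int(digest, 16) % len(self._shards)]

    def put(self, key, row):
        self._shard_for(key)[key] = row

    def get(self, key):
        return self._shard_for(key).get(key)

    def shard_sizes(self):
        return [len(shard) for shard in self._shards]


store = ShardedStore(shard_count=4)
for i in range(100):
    store.put(f"question:{i}", {"id": i})
```

The routing itself is the easy part. Changing the number of shards moves most keys under this simple modulo scheme, and queries that span shards have to be assembled by the application, which is the kind of problem the layer mentioned in point 5 is meant to hide.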


It’s the End of the World As We Know It (NoSQL Edition)

by Alex Popescu

Some good comments on Michael Stonebraker’s paper ☞ The End of an Architectural Era (pdf):

The provocatively named paper is simply a description of a system, designed from scratch for modern OLTP requirements and the demonstration that this system gives better performance than traditional RDBMS on OLTP type load. The conclusion is that since RDBMS can’t even excel at OLTP – it must be destined for the garbage pile. I’ll ignore the fact that hybrid systems are far from extinct and look at the paper itself.

While it’s kind of difficult to disagree with Michael Stonebraker, I still believe that some of the design considerations in that paper are not really requirements, but are just based on his work on H-Store and VoltDB. Take, for example, “the OLTP database should fit entirely in-memory” or “the OLTP database should be single threaded”: these sound like implementation details, not functional requirements.


Two Answers to Why NoSQL

by Alex Popescu

A long post about building a prototype application, with two nice answers to the question: why NoSQL?

Well, cost for one. If I could afford Oracle I’d sooner use that than go NoSQL in all likelihood. I can’t afford it. Not even close. Oracle might as well charge me a small planet for their product. It’s great stuff, but out of reach. And what about sharding? Sharding a relational database sucks, and to try to hide the fact that it sucks requires you to pile on all kinds of other crap like query proxies, pools, and replication engines, all in an effort to make this beast do something it wasn’t meant to do: scale beyond a single box. All this stuff also attempts to mask the reality that you’ve also thrown your hands in the air with respect to at least 2 letters that make up the ACID acronym. What’s an RDBMS buying you at that point? Complexity.

And there’s another cost, by the way: no startup I know has the kind of enormous hardware that an enterprise has. They have access to commodity hardware. Pizza boxes. Don’t even get me started on storage. I’ve yet to see SSD or flash storage at a startup.

And yes, it is once again about complexity and operational costs.