performance: All content tagged as performance in NoSQL databases and polyglot persistence

Release: Production Ready MongoDB 1.4 Released

Judging by the number of posts I’ve seen around, I’d guess you’ve already heard about the MongoDB 1.4 release[1]. Anyway, I definitely had to include it here, as myNoSQL covers all major NoSQL projects and closely follows all things related to the NoSQL ecosystem.

While some MongoDB users seemed quite excited about the addition of ☞ geospatial indexing, others about some ☞ query language improvements, the things that caught my attention were:

  • background indexing and indexing improvements
  • concurrency improvements
  • the lack of autosharding (still alpha, still pushing, still…)
  • the lack of improvements or alternatives for the MongoDB durability tradeoff

Speaking of performance, the 10gen people[2] have run some benchmarks comparing MongoDB 1.2 with MongoDB 1.4. With a couple of exceptions, performance hasn’t improved radically, so I’d speculate that there is still a lot of locking involved. The benchmark source code was made available[3] so you can dig deeper into it.

All in all, good and exciting news for the NoSQL world!


CouchDB vs MongoDB: An attempt for a More Informed Comparison

After posting about Scott Motte’s comparison of MongoDB and CouchDB, I thought there should be some more informative sources out there, so I’ve started to dig.

The first I came upon (thanks to Debasish Ghosh @debasishg) is an article about ☞ Raindrop requirements and the issues faced while attacking them with CouchDB and the pros and cons of possibly replacing CouchDB with MongoDB:

[Pros]

  • Uses update-in-place, so the file system impact/need for compaction is less if we store our schemas in one document are likely to work better.
  • Queries are done at runtime. Some indexes are still helpful to set up ahead of time though.
  • Has a binary format for passing data around. One of the issues we have seen is the JSON encode/decode times as data passes around through couch and to our API layer. This may be improving though.
  • Uses language-specific drivers. While the simplicity of REST with CouchDB sounds nice, due to our data model, the megaview and now needing a server API layer means that querying the raw couch with REST calls is actually not that useful. The harder issue is trying to figure out the right queries to do and how to do the “joins” effectively in our API app code.

[Cons]

  • easy master-master replication. However, for me personally, this is not so important. […] So while we need backups, we probably are fine with master-slave. To support the sometimes-offline case, I think it is more likely that using HTML5 local storage is the path there. But again, that is just my opinion.
  • ad-hoc query cost may still be too high. It is nice to be able to pass back a JavaScript function to do the query work. However, it is not clear how expensive that really is. On the other hand, at least it is a formalized query language — right now we are on the path to inventing our own with the server API with a “query language” made up of other API calls.

Anyway, while some of the points above are generic, you should definitely consider them from the perspective of the Raindrop requirements, about which you can read more here.

Another article comparing MongoDB and CouchDB is hosted by ☞ MongoDB docs. I find it well balanced and you should read it all as it covers a lot of different aspects: horizontal scalability, query expressions, atomicity, durability, mapreduce support, javascript, performance, etc.

I’d also mention this ☞ benchmark, which compares the performance of MongoDB, CouchDB, and Tokyo Cabinet/Tyrant and uses MySQL results as a reference (note: the author of the benchmark categorizes Tokyo Cabinet as a document database, while Tokyo is a key-value store).

In case you have other resources that you think would be worth including do not hesitate to send them over.

Update: Just found a nice comparison matrix [1].

As a teaser, very soon I will introduce you to a new solution available in this space, so make sure to check MyNoSQL regularly.

Update: The main article about this new document store has been published: Terrastore: A Consistent, Partitioned and Elastic Document Database. I would strongly encourage you to check it, as Terrastore is looking quite promising.


Thoughts on NoSQL vs SQL Articles

There have been a couple of articles lately about NoSQL vs SQL that seemed to have caught a lot of attention. I finally had the time to go through them and jot down some of my thoughts.

Michael Driscoll in ☞ sql is dead. long live sql! identifies three aspects of the NoSQL environment:

  1. A dislike for SQL’s syntax, which is ill-fitted to programming patterns.
  2. A rejection of the strong typing of relational schemas
  3. A critique of performance, which in turn relates to how concurrency and partitioning of computation is handled

These are quite similar to the NoSQL-ness criteria I wrote about.

Now, I don’t think there is anything in the NoSQL world against SQL as a language; the objection rather falls on it by association with the systems behind it. The software engineering world has long discussed the object-relational paradigm mismatch and has come up with a set of different patterns to overcome it (active record, ORM, etc.).

Michael builds his pro-SQL case on the following points, with which I do agree:

But SQL lives on for a deeper reason: it is a simple yet powerful language for set operations. SQL captures the essential patterns of data manipulation, such as:

  • intersections (JOINs)
  • filters (WHEREs)
  • reductions or aggregations (GROUP BYs)

Considering that most NoSQL systems are moving the “intersection” operation to a different level (either to the storage level through denormalization or to the application level), the two operations left are “filtering” and “reduction”, which sound extremely close to the basic principles of MapReduce. The interesting fact is that MapReduce was designed to allow parallelization while SQL was not (it is also known that imperative code is more difficult to parallelize). And I am not aware of any RDBMS that has implemented parallel queries, in the sense of distributing the execution.
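To make the “filtering plus reduction” parallel concrete, here is a minimal single-process sketch in Python. It is only an illustration under my own assumptions: the sample rows, the field names, and the function names are all hypothetical, and a real MapReduce framework would distribute the map and reduce phases across machines instead of running them in one loop. The map phase plays the role of a WHERE clause, and the shuffle and reduce phases together play the role of a GROUP BY with a SUM.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample rows; the field names are illustrative only.
ORDERS = [
    {"customer": "alice", "status": "paid", "amount": 30},
    {"customer": "bob", "status": "paid", "amount": 20},
    {"customer": "alice", "status": "refunded", "amount": 15},
    {"customer": "alice", "status": "paid", "amount": 5},
]


def map_phase(rows):
    """Filter rows (like WHERE status = 'paid') and emit (key, value) pairs."""
    for row in rows:
        if row["status"] == "paid":
            yield row["customer"], row["amount"]


def shuffle(pairs):
    """Group the emitted values by key, as a MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group (like SUM(amount) ... GROUP BY customer)."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


def total_paid_by_customer(rows):
    """Run the three phases in sequence on a single machine."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(total_paid_by_customer(ORDERS))
```

Note that nothing here performs a join: the customer name is already stored on every row, which is the kind of denormalization mentioned above.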

So leaving this aside, I tend to agree with his conclusion (and I think that solutions like Yahoo! PIG and Facebook HIVE show that people might still prefer something simpler than raw MapReduce):

I can’t imagine the programmer pain and suffering that went into building one, unified, global database. But once it’s there, I’d much prefer to access it with SQL statements than MapReduce code.

On the other hand, I tend to disagree with the points Curt Monash is making in his article ☞ The legit part of the NoSQL idea:

Relational database management systems were invented to let you use one set of data in multiple ways

[…]

RDBMS are more mature than most competing technologies

Unfortunately ☞ Ben Scofield’s NoSQL Misconceptions article doesn’t cover any of these, so I’ll try to address them myself.

Firstly, I think it is a mistake to assume that the maturity of a technology makes it the right tool for the job. While I do agree that “not all of us are Google” (as Justin Sheehy of Riak put it) and I do hate the “not invented here” syndrome, I do think that as an industry we should always try to use and provide the best tools for the job.

Secondly, I disagree that it is easy to use relational databases to get multiple perspectives on the same set of data. I think data warehouses and BI tools prove my point by their very existence: they are expensive and difficult to maintain and use. As in Michael’s quote above: “I can’t imagine the programmer pain and suffering that went into building one, unified, global database”.

Last, but not least, going back to Ben’s article, I completely disagree with his “I can do NoSQL just as well in a relational database” argument. I have written about this approach before in the post ☞ A Schema-less relational database and I do think that there are scenarios that can benefit from such a solution.
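For readers who have not seen the schema-less-on-relational idea, the following is a rough sketch of what it can look like, using Python’s built-in `sqlite3` and `json` modules. It is my own illustration and not the design from the linked post: the table name `documents` and the helpers `open_store`, `put`, `get`, and `find` are all hypothetical. Documents are stored as opaque JSON text, so any filtering on their contents has to happen in application code.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory database with a single two-column table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def put(conn, doc_id, doc):
    """Insert or overwrite a document; its shape is never declared."""
    conn.execute(
        "INSERT OR REPLACE INTO documents (id, body) VALUES (?, ?)",
        (doc_id, json.dumps(doc)),
    )


def get(conn, doc_id):
    """Fetch one document by id, or None if it does not exist."""
    row = conn.execute(
        "SELECT body FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def find(conn, predicate):
    """Scan all documents and filter in application code.

    The database only sees a text blob, so it cannot index or filter
    on the fields inside it.
    """
    matches = []
    for (body,) in conn.execute("SELECT body FROM documents ORDER BY id"):
        doc = json.loads(body)
        if predicate(doc):
            matches.append(doc)
    return matches


if __name__ == "__main__":
    store = open_store()
    put(store, "u1", {"name": "Ann", "tags": ["admin"]})
    put(store, "u2", {"name": "Bob"})  # different shape, same table
    print(get(store, "u1"))
    print(find(store, lambda d: "tags" in d))
```

The full scan in `find` is the price of this approach: lookups by id stay cheap, while any other query needs either application-side filtering or extra index tables maintained by hand.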


Memcached-in-the-Cloud by Gear6

Memcached is used as a reference in the NoSQL world both for its API and for performance comparisons. Some NoSQL key-value stores offer a Memcached-compatible API and some even support the same protocol.
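As a rough idea of how small that API surface is, here is a toy in-memory cache in Python exposing memcached-style `set`, `get`, and `delete` operations with an optional time-to-live. This is a sketch under my own assumptions, not the memcached protocol or any real client library: the class name `TinyCache` and its `clock` parameter are hypothetical, and real memcached also handles eviction, networking, and more commands.

```python
import time


class TinyCache:
    """Toy in-memory cache with a memcached-style get/set/delete surface."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so that expiry can be tested without sleeping.
        self._clock = clock
        self._items = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=0):
        """Store a value; ttl is in seconds, and 0 means no expiry."""
        expires_at = self._clock() + ttl if ttl > 0 else None
        self._items[key] = (value, expires_at)
        return True

    def get(self, key):
        """Return the stored value, or None on a miss or after expiry."""
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]  # lazily drop the expired entry
            return None
        return value

    def delete(self, key):
        """Remove a key; return True if it was present."""
        return self._items.pop(key, None) is not None


if __name__ == "__main__":
    cache = TinyCache()
    cache.set("session:42", {"user": "ann"}, ttl=60)
    print(cache.get("session:42"))
    print(cache.delete("session:42"))
    print(cache.get("session:42"))
```

A store that keeps this same surface can be swapped in behind existing caching code, which is part of why the API works as a common reference point.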

Startup Gear6 today launched the availability of its memcached appliance on Amazon’s Web Services platform, bringing a widely used distributed memory caching system for web companies to the cloud.

What seems to be missing from the announcement is any mention of automatic Memcached scaling. Wouldn’t that be an interesting feature?

via: http://gigaom.com/2009/12/08/gear6-brings-memcached-to-amazons-cloud/


The “NoSQL” dispute: A performance argument

In summary, blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management. To go wildly faster, one must remove all four sources of overhead, discussed above. This is possible in either a SQL context or some other context. …

But as far as I know there is no easy way to tweak any of these “features” of existing RDBMS.

via: http://sillybits.wordpress.com/2009/12/10/the-nosql-dispute-a-performance-argument/