Judging by the number of posts I’ve seen around, I’d guess you’ve already heard about the MongoDB 1.4 release. Anyway, I definitely had to include it here, as myNoSQL covers all major NoSQL projects and closely follows everything related to the NoSQL ecosystem. Here is my quick take on the release:
- background indexing and indexing improvements
- concurrency improvements
- the lack of autosharding (still alpha, still pushing, still…)
- the lack of improvements or alternatives for the MongoDB durability tradeoff
Speaking of performance, the 10gen people have run some benchmarks comparing MongoDB 1.2 with MongoDB 1.4. With a couple of exceptions, performance hasn’t improved radically, so I’d speculate that there is still a lot of locking involved. The benchmark source code was made available, so you can dig deeper into it yourself.
All in all, good and exciting news for the NoSQL world!
After posting about Scott Motte’s comparison of MongoDB and CouchDB, I thought there should be some more informative sources out there, so I’ve started to dig.
The first one I came upon (thanks to Debasish Ghosh @debasishg) is an article about ☞ the Raindrop requirements, the issues faced while attacking them with CouchDB, and the pros and cons of possibly replacing CouchDB with MongoDB:
- Uses update-in-place, so the file system impact/need for compaction is less, and storing our schemas in one document is likely to work better.
- Queries are done at runtime. Some indexes are still helpful to set up ahead of time though.
- Has a binary format for passing data around. One of the issues we have seen is the JSON encode/decode times as data passes around through couch and to our API layer. This may be improving though.
- Uses language-specific drivers. While the simplicity of REST with CouchDB sounds nice, due to our data model, the megaview and now needing a server API layer means that querying the raw couch with REST calls is actually not that useful. The harder issue is trying to figure out the right queries to do and how to do the “joins” effectively in our API app code.
- easy master-master replication. However, for me personally, this is not so important. […] So while we need backups, we probably are fine with master-slave. To support the sometimes-offline case, I think it is more likely that using HTML5 local storage is the path there. But again, that is just my opinion.
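The “joins in API app code” issue mentioned above is worth making concrete. When the document store does not join for you, the application typically indexes one result set by key and looks up into it while walking the other. A minimal Python sketch, with hypothetical documents standing in for what two separate queries against CouchDB or MongoDB might return (field and function names are illustrative, not from Raindrop’s code):

```python
# Hypothetical results of two separate queries against a document store.
contacts = [
    {"_id": "c1", "name": "Alice"},
    {"_id": "c2", "name": "Bob"},
]
messages = [
    {"_id": "m1", "from": "c1", "subject": "Hi"},
    {"_id": "m2", "from": "c2", "subject": "Re: Hi"},
    {"_id": "m3", "from": "c1", "subject": "Ping"},
]

def join_messages_with_contacts(messages, contacts):
    """Application-level 'join': index one side by key, then look up."""
    by_id = {c["_id"]: c for c in contacts}
    return [
        {**m, "sender": by_id[m["from"]]["name"]}
        for m in messages
        if m["from"] in by_id  # drop messages with no matching contact
    ]

joined = join_messages_with_contacts(messages, contacts)
```

The hard part, as the Raindrop author notes, is not this mechanical step but deciding which queries to issue so that the data needed for the lookup is available in the first place.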
Anyway, while some of the points above are generic, you should definitely consider them from the perspective of the Raindrop requirements, about which you can read more here.
I’d also mention this ☞ benchmark comparing the performance of MongoDB, CouchDB, and Tokyo Cabinet/Tyrant (note: the author of the benchmark categorizes Tokyo Cabinet as a document database, while it is in fact a key-value store), using MySQL results as a reference.
In case you have other resources that you think would be worth including do not hesitate to send them over.
Update: Just found a nice comparison matrix.
As a teaser, very soon I will introduce you to a new solution available in this space, so make sure to check MyNoSQL regularly.
Update: The main article about this new document store has been published: Terrastore: A Consistent, Partitioned and Elastic Document Database. I would strongly encourage you to check it, as Terrastore is looking quite promising.
There have been a couple of articles lately about NoSQL vs SQL that seem to have caught a lot of attention. I finally had the time to go through them and jot down some of my thoughts.
Michael Driscoll, in ☞ sql is dead. long live sql!, identifies three threads of criticism coming from the NoSQL environment:
- A dislike for SQL’s syntax, which is ill-fitted to programming patterns.
- A rejection of the strong typing of relational schemas
- A critique of performance, which in turn relates to how concurrency and partitioning of computation is handled
These are quite similar to the NoSQL-ness criteria I wrote about.
Now, I don’t think the NoSQL world has anything against SQL as a language per se; the objection is, by transitivity, to the systems behind it. The software engineering world has long discussed the object-relational paradigm mismatch and has come up with a set of different patterns to overcome it (Active Record, ORMs, etc.).
Michael builds his pro-SQL case on the following argument, with which I do agree:
But SQL lives on for a deeper reason: it is a simple yet powerful language for set operations. SQL captures the essential patterns of data manipulation, such as:
- intersections (JOINs)
- filters (WHEREs)
- reductions or aggregations (GROUP BYs)
Considering that most NoSQL systems are moving the “intersection” operation to a different level (either to the storage level, through denormalization, or to the programmatic level), the two operations left are “filtering” and “reductions”, which sound extremely close to the basic principles of MapReduce. The interesting fact is that MapReduce was designed to allow parallelization, while SQL was not (it is also known that imperative code is more difficult to parallelize). And I am not aware of any RDBMS that has implemented parallel queries, in the sense of distributing the execution.
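The correspondence between SQL’s “filters” and “reductions” and the map/reduce pair can be sketched in a few lines of plain Python. The toy rows below are my own illustration, not from either article; the filter plays the role of a WHERE clause, and the keyed fold plays the role of GROUP BY with SUM:

```python
from itertools import groupby
from operator import itemgetter

# Toy rows standing in for a table of page-view timings.
rows = [
    {"page": "/home", "ms": 120},
    {"page": "/about", "ms": 90},
    {"page": "/home", "ms": 80},
    {"page": "/about", "ms": 300},
]

# Filter ~ SQL WHERE: keep only the slow requests.
slow = [r for r in rows if r["ms"] >= 90]

# Map: emit one (key, value) pair per surviving row.
pairs = [(r["page"], r["ms"]) for r in slow]

# Reduce ~ SQL GROUP BY + SUM: fold together values sharing a key.
# groupby only groups adjacent items, so sort by the key first.
pairs.sort(key=itemgetter(0))
totals = {key: sum(v for _, v in group)
          for key, group in groupby(pairs, key=itemgetter(0))}
```

Each of the three steps is independently parallelizable across partitions of the input, which is exactly the property MapReduce was designed around.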
So, leaving this aside, I tend to agree with his conclusion (and I think that solutions like Yahoo!’s Pig and Facebook’s Hive show that people might still prefer something simpler than raw MapReduce):
I can’t imagine the programmer pain and suffering that went into building one, unified, global database. But once it’s there, I’d much prefer to access it with SQL statements than MapReduce code.
On the other hand, I tend to disagree with the points Curt Monash is making in his article ☞ The legit part of the NoSQL idea:
Relational database management systems were invented to let you use one set of data in multiple ways
RDBMS are more mature than most competing technologies
Unfortunately ☞ Ben Scofield’s NoSQL Misconceptions article doesn’t cover any of these, so I’ll try to address them myself.
Firstly, I think it is a mistake to consider that the maturity of a technology makes it the right tool for the job. While I do agree that “not all of us are Google” (as Justin Sheehy of Riak put it) and I do hate the “not invented here” syndrome, I do think that, as an industry, we should always try to use and provide the best tool for the job.
Secondly, I disagree that it is easy to use relational databases to get multiple perspectives on the same set of data. Data warehouses and BI tools prove my point by their very existence: they are expensive and difficult to maintain and use. And, as in Michael’s quote above: “I can’t imagine the programmer pain and suffering that went into building one, unified, global database”.
Last, but not least, going back to Ben’s article, I completely disagree with his “I can do NoSQL just as well in a relational database” argument. I have written about this approach before, in the post ☞ A Schema-less relational database, and I do think that there are scenarios that can benefit from such a solution.
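For readers curious what the “schema-less on top of a relational database” approach looks like in practice, here is a minimal sketch using SQLite and JSON-serialized documents. The table layout and function names are my own illustration, not taken from Ben’s article or from my earlier post:

```python
import json
import sqlite3

# One table: a key column plus a column holding the serialized document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT NOT NULL)")

def put(doc_id, doc):
    """Store or overwrite a document under the given key."""
    conn.execute(
        "INSERT OR REPLACE INTO docs (id, body) VALUES (?, ?)",
        (doc_id, json.dumps(doc)),
    )

def get(doc_id):
    """Fetch a document by key, or None if absent."""
    row = conn.execute(
        "SELECT body FROM docs WHERE id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0]) if row is not None else None

put("u1", {"name": "Ada", "tags": ["nosql", "sql"]})
put("u1", {"name": "Ada", "tags": ["nosql"]})  # overwrite: no schema change needed
doc = get("u1")
```

This buys you schema flexibility and key lookups, but note what it gives up: the database can no longer index or query inside the documents, which is precisely where the purpose-built document stores earn their keep.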