
Document Databases Compared: CouchDB, MongoDB, RavenDB

Brian Ritchie has two posts (☞ here and ☞ here) covering three document databases (CouchDB, MongoDB, and RavenDB) and concluding with the matrix below:

They have some special characteristics that make them kick some serious SQL.

  • Objects can be stored as documents: the relational database impedance mismatch is gone. Just serialize the object model to a document and go.
  • Documents can be complex: entire object models can be read & written at once. No need to perform a series of insert statements or create complex stored procs.
  • Documents are independent: improves performance and decreases concurrency side effects.
  • Open formats: documents are described using JSON or XML or derivatives. Clean & self-describing.
  • Schema free: strict schemas are great, until they change. Schema free gives flexibility for evolving a system without forcing the existing data to be restructured.
  • Built-in versioning: most document databases support versioning of documents with the flip of a switch.

But before using this as reference material, there are a couple of corrections needed:

  1. Judging by the growing number of document database mapping tools, I’m not sure the impedance mismatch is really gone (related to the 1st point above).
  2. Using an embedded format is not always the best solution for mapping relationships and other more complex data structures (related to the 2nd and 3rd points above).
  3. Versioning is an extra feature that is not fundamental to document databases. MongoDB and CouchDB do not support it by default, but there are different solutions available.
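For the record, the "serialize the object model and go" point looks roughly like this in practice. A minimal Python sketch; the Customer/Address classes are invented for illustration, not from any particular mapper:

```python
import json

class Address:
    def __init__(self, city, zip_code):
        self.city = city
        self.zip_code = zip_code

class Customer:
    def __init__(self, name, addresses):
        self.name = name
        self.addresses = addresses

# Serialize the whole object graph into one JSON document, no ORM in between
customer = Customer("Ada", [Address("London", "N1"), Address("Leeds", "LS1")])
doc = json.dumps(customer, default=lambda o: o.__dict__)

# Reading it back yields the complete nested structure in one shot
restored = json.loads(doc)
```

The nested list of addresses travels inside the one document, which is exactly the "documents can be complex" point above.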

Related to the matrix comparison:

  1. Versioning is not supported by either MongoDB or CouchDB. MVCC should not be confused with document versioning.
  2. Sharding: CouchDB doesn’t support sharding out of the box. There are different solutions for scaling CouchDB: Cloudant’s Dynamo-like scaling solution for CouchDB, or even running CouchDB with a Riak backend.
  3. Replication: both MongoDB and CouchDB support master/master and master/slave.
  4. Security: check the NoSQL databases and security post first and decide for yourself whether the “basic” level is enough for your app.

Original title and link for this post: Document Databases Compared: CouchDB, MongoDB, RavenDB (published on the NoSQL blog: myNoSQL)


Microsoft Azure and NoSQL Databases: MongoDB, sones GraphDB, and RavenDB

Looks like today is the day of NoSQL databases in the Microsoft cloud. After covering how to run MongoDB on Azure and today’s guide to running sones GraphDB on Azure, the third one joining the party is RavenDB:

The short answer was, with the current build, no. RavenDB uses the .NET HttpListener class internally, and apparently that class will not work on worker roles, which are restricted to listening on TCP only.

[…]

I have to sign a contribution agreement, and do some more extensive testing, but I hope that Ayende is going to pull my TCP changes into the RavenDB trunk so that this deployment model is supported by the official releases.

So, two document stores and a graph database are already available for Microsoft Azure. Which one is next?

Microsoft Azure and NoSQL Databases: MongoDB, sones GraphDB, and RavenDB originally posted on the NoSQL blog: myNoSQL

via: http://blog.markrendle.net/2010/08/running-ravendb-on-azure.html


Document Databases and Mapping Tools

A lot of people complained about the object-relational impedance mismatch and the issues raised by using object-relational mapping frameworks. Some of them advocated the transition to NoSQL databases. But now I’m seeing a lot of object-document mapping. So where are we going?

Update: Looks like we also have schema migration tools for document stores (see ☞ this).


MongoDB: The Size of the Document and Why it Matters

Kyle Banker (@Hwaet) explains some of the possible implications of using very large documents in MongoDB:

  1. If you’re doing a full-document, replace-style update, that entire 500k needs to be serialized and sent across the wire. This could get expensive on an update-heavy deployment.
  2. Same goes for queries. If you’re pulling back 500k at a time, that has to go across the network and be deserialized on the driver side.
  3. While most atomic updates happen in-place, the document usually has to be rewritten in-place on the server, as this is dictated by the BSON format. If you’re doing lots of $push operations on a very large document, that document will have to be rewritten server-side, which, again, on a heavy deployment, could get expensive.
  4. If an inner-document is frequently manipulated on its own, it can be less computationally expensive both client-side and server-side simply to store that “many” relationship in its own collection. It’s also frequently easier to manipulate the “many” side of a relationship when it’s in its own collection.

If going embedded all the way works for your use case, then there’s probably no problem with it. But with these extra-large documents, and a heavy load, you may start to see consequences in terms of performance and/or manipulability.

I’d say that these probably apply to most of the document databases out there.

via: http://groups.google.com/group/mongoid/msg/c82267e8e7a1df12


Document Database Query Language

Recently I have noticed that Doctrine[1], a PHP library focused on persistence services, has been working on ☞ defining a new query language for document databases.

So, I couldn’t stop asking myself: is there a need for a (new) document query language?

To be able to answer this question, I thought I should first review the existing solutions/approaches.

  1. CouchDB doesn’t allow running dynamic queries against the store, but you can define views with the help of JavaScript-based map/reduce functions.
  2. MongoDB allows dynamic and pretty complex queries, but it is using a custom query API.
  3. RavenDB, the latest addition to the document database space, has chosen the route of Linq[2] for defining indexes.
  4. Terrastore supports predicate (XPath currently) and range queries, offering a mapreduce-like solution. You can read more about these in the Terrastore 0.5.0 article.
  5. Last, but not least, XML databases are using XPath for querying.
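The difference between the first two approaches can be simulated in a few lines: a CouchDB-style view is a map function materialized into an index ahead of time, while a MongoDB-style query is evaluated on demand. This is a toy Python model of the behavior, not either database’s actual API:

```python
docs = [
    {"_id": "a", "type": "post", "tags": ["nosql", "couchdb"]},
    {"_id": "b", "type": "post", "tags": ["mongodb"]},
]

# CouchDB-style: a predefined map function emits (key, value) pairs,
# which the server materializes into a view queried by key
def map_by_tag(doc):
    for tag in doc.get("tags", []):
        yield (tag, doc["_id"])

view = {}
for d in docs:
    for key, value in map_by_tag(d):
        view.setdefault(key, []).append(value)

# MongoDB-style: an ad-hoc query document evaluated at request time
# (array fields match by containment, as MongoDB does)
def find(collection, query):
    return [d for d in collection
            if all(v in d.get(k, []) or d.get(k) == v
                   for k, v in query.items())]

by_view = view["mongodb"]                    # precomputed lookup
by_query = find(docs, {"tags": "mongodb"})   # computed on demand
```

Both return the same answer; the difference is entirely in when the work happens, which is the axis a unified query language would have to abstract over.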

Simply put, it looks like each solution comes with its own approach. While it will probably make sense to create a unified query language for document databases, I see only two possible solutions:

  • either make all document databases sign up to use this query language (note: this might be quite difficult)
  • or provide it through a framework that works with all of the existing document stores (note: this might not be possible)

But do not create a new query language in a framework that works only with a single document store.


  1. ☞ Doctrine project website
  2. ☞ LINQ: a set of extensions to the .NET framework that encompass language-integrated query, set, and transform operations.

NoSQL Ecosystem News & Links 2010-06-14


Comparing Document Databases to Key-Value Stores

Oren Eini has an interesting ☞ post emphasizing the main differences between document databases (e.g. CouchDB, MongoDB) and key-value stores (e.g. Redis, Project Voldemort, Tokyo Cabinet):

The major benefit of using a document database comes from the fact that while it has all the benefits of a key/value store, you aren’t limited to just querying by key.

One of the main advantages of data transparency (as opposed to opaque data) is that the engine can perform additional work without having to translate the data into an intermediary format it understands. Querying by a non-primary key is such an example. The various document stores provide different implementation flavors depending on index creation time, index update strategy, etc. Oren goes on to cover the query behavior of CouchDB, Raven, and MongoDB:

In the first case (nb: indexes prepared ahead of time), you define an indexing function (in Raven’s case, a Linq query; in CouchDB’s case, a JavaScript function) and the server will run it to prepare the results; once the results are prepared, they can be served to the client with minimal computation. CouchDB and Raven differ in the method they use to update those indexes: Raven will update the index immediately on document change, and queries to indexes will never wait. […] With CouchDB, a view is updated at view query time, which may lead to a long wait the first time a view is accessed if there were a lot of changes in the meanwhile. […]

Note that in both CouchDB’s and Raven’s cases, indexes do not affect write speed, since in both cases this is done as a background task.

MongoDB, on the other hand, allows ad-hoc querying, and relies on indexes defined on the document values to help it achieve reasonable performance when the data size grows large enough. MongoDB’s indexes behave in much the same way RDBMS indexes behave; that is, they are updated as part of the insert process, so a large number of indexes is going to affect write performance.
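The two index-update strategies Oren describes (update eagerly on write vs. catch up lazily at query time) can be mimicked with a toy sketch. EagerIndex and LazyIndex are invented names for illustration, not any database’s real API:

```python
class EagerIndex:
    """Raven-style: the index is updated on every write; reads never wait."""
    def __init__(self, key_fn):
        self.key_fn, self.entries = key_fn, {}
    def on_write(self, doc):
        self.entries.setdefault(self.key_fn(doc), []).append(doc["_id"])
    def query(self, key):
        return self.entries.get(key, [])

class LazyIndex:
    """CouchDB-style: writes just accumulate; the first query pays to catch up."""
    def __init__(self, key_fn):
        self.key_fn, self.entries, self.pending = key_fn, {}, []
    def on_write(self, doc):
        self.pending.append(doc)            # write path stays cheap
    def query(self, key):
        while self.pending:                 # the first reader does the work
            doc = self.pending.pop(0)
            self.entries.setdefault(self.key_fn(doc), []).append(doc["_id"])
        return self.entries.get(key, [])

eager, lazy = EagerIndex(lambda d: d["type"]), LazyIndex(lambda d: d["type"])
for doc in ({"_id": 1, "type": "post"}, {"_id": 2, "type": "page"}):
    eager.on_write(doc)
    lazy.on_write(doc)
```

Both answer the same queries; they simply shift where the indexing cost lands, which is why CouchDB’s first view access after many changes can be slow while Raven’s never waits.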

Another good resource explaining the differences between MongoDB and CouchDB queries is Rick Osbourne’s ☞ article.

After RavenDB made its appearance in the NoSQL space, we’ll probably have to compare it to the existing CouchDB and MongoDB features.

This is not to say that some of this functionality cannot be achieved with pure key-value stores, but these seem to be focused mainly on single/multi-key lookups, and most probably you’ll have to build this additional layer yourself.
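The “additional layer” you would have to build over a pure key-value store is essentially a hand-maintained secondary index. A minimal sketch of the idea, with all names invented:

```python
kv = {}          # the opaque key/value store: only get/put by key
tag_index = {}   # the extra layer you must build and keep consistent yourself

def put(key, doc):
    kv[key] = doc
    for tag in doc.get("tags", []):       # manually maintain the secondary index
        tag_index.setdefault(tag, set()).add(key)

def find_by_tag(tag):
    """Non-primary-key lookup: possible only because we built the index."""
    return [kv[k] for k in tag_index.get(tag, set())]

put("p1", {"title": "Intro", "tags": ["redis"]})
put("p2", {"title": "Views", "tags": ["couchdb"]})
```

A document database does this bookkeeping for you (and keeps it consistent under deletes, updates, and crashes, which this sketch conveniently ignores).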


New Projects in NoSQL Space

Lately I’ve been hearing about a couple of newcomers in the NoSQL space, so here is your chance to find out about them and, why not, provide other readers with some additional feedback about each of them:

kumofs

Kumofs is a distributed key-value store built on top of Tokyo Cabinet and using the memcached protocol.

According to the ☞ project homepage:

  • Data is partitioned and replicated over multiple servers.
  • Extreme single node performance, comparable with memcached.
  • Both read and write performance improve as servers are added.
  • Servers can be added without stopping the system.
  • Servers can be added without modifying any configuration files.
  • The system does not stop even if one or two servers crash.
  • The system does not stop while recovering crashed servers.
  • Automatic rebalancing support with a consistency control algorithm.
  • Safe CAS operation support.
  • memcached protocol support.
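The “add servers without stopping or reconfiguring anything” property typically comes from consistent hashing; the project page above doesn’t spell out kumofs’s exact algorithm, so treat this as a generic sketch of the technique rather than kumofs internals (Ring and all names here are invented):

```python
import hashlib
from bisect import bisect, insort

class Ring:
    """Toy consistent-hash ring: adding a node moves only its share of keys."""
    def __init__(self, nodes=()):
        self.points = []                     # sorted (hash, node) pairs
        for n in nodes:
            self.add(n)

    def _h(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node, vnodes=64):
        for i in range(vnodes):              # virtual nodes smooth the balance
            insort(self.points, (self._h(f"{node}:{i}"), node))

    def node_for(self, key):
        # First ring point at or after the key's hash, wrapping around
        i = bisect(self.points, (self._h(key), "")) % len(self.points)
        return self.points[i][1]

ring = Ring(["s1", "s2"])
keys = [f"user:{i}" for i in range(200)]
before = {k: ring.node_for(k) for k in keys}
ring.add("s3")                               # new server, no config files touched
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
```

Clients keep using the same hashing rule, and only roughly a third of the keys relocate to the new node; the rest stay where they were, which is what makes online rebalancing practical.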

There already seems to be a ☞ success story for kumofs, but you’ll have to take it with a grain of salt, as the same company was announcing a success story with Redis for the same problem.

Raven DB

Raven DB is a dual-licensed document database for the .NET platform with a RESTful API.

According to the ☞ project homepage:

  • Scalable infrastructure: Raven builds on top of existing, proven and scalable infrastructure
  • Simple Windows configuration: Raven is simple to set up and run on Windows as either a service or an IIS7 website
  • Transactional: Raven supports System.Transactions with ACID transactions. If you put data in it, that data is going to stay there
  • Map/Reduce: Easily define map/reduce indexes with Linq queries
  • .NET Client API: Raven comes with a fully functional .NET client API which implements Unit of Work and much more
  • RESTful: Raven is built around a RESTful API

Orient

Orient comes in two flavors: OrientDB, a document database, and Orient K/V, a key-value store, both running on the Java platform.

According to the ☞ project homepage, OrientDB is a

scalable Document based DBMS that uses the features of the Graph Databases to handle links. It’s the basic engine of all the Orient products. It can work in schema-less mode, schema-full or a mix of both. Supports advanced features, such as indexing, fluent and SQL-like queries. It handles natively JSON and XML documents.

Orient K/V is a:

Key/Value server based on the document database technology and accessible as an embedded repository via Java APIs or via HTTP using a RESTful API. Orient K/V uses a new algorithm called RB+Tree, derived from the Red-Black Tree and from the B+Tree. The Orient Key/Value server can run in high availability mode using a cluster of multiple partitioned nodes.

Pincaster

According to the GitHub page, Pincaster is a persistent NoSQL database for storing geographic data and key/value pairs, with an HTTP/JSON interface. Unfortunately there’s almost no documentation about the project, so I don’t think there’s anything else I can add for now.

Update: hint received from @fs111

Have you tried any of these?


Tutorial: Riak Schema Design

Just a few days after posting about the “art” of and need for data modeling in the NoSQL world, the Basho guys have started a series of articles on Riak schema design.

While the ☞ first post was a bit more philosophical (or call it high-level), the ☞ second one is more hands-on and presents various approaches to modeling relationships with a key-value or document store. Personally, I’ve liked the Riak links approach ever since I ☞ read about it.
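For readers who haven’t seen them: Riak links are tagged pointers from one object to another (identified by bucket and key) that can be walked at query time. Here is a toy Python simulation of the idea; the data and the walk helper are invented for illustration and are not Riak’s API:

```python
# Each object carries a list of (bucket, key, tag) links to other objects
store = {
    ("people", "tim"): {"name": "Tim",
                        "links": [("people", "ana", "friend"),
                                  ("posts", "p1", "author_of")]},
    ("people", "ana"): {"name": "Ana", "links": []},
    ("posts", "p1"):   {"title": "Hello", "links": []},
}

def walk(bucket, key, tag):
    """Follow all links with the given tag from one object."""
    return [store[(b, k)]
            for b, k, t in store[(bucket, key)]["links"] if t == tag]

friends = walk("people", "tim", "friend")
```

Relationships live as lightweight pointers on the objects themselves, so you get traversal without either embedding the related data or building a separate index.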

PS: Guys, I hope you’ve already prepared the beers ;-).



Look Ma’, I’ve just got an N+1 with NoSQL Flavor

In a previous post, I was arguing that data modeling will remain an “art”, whether we are talking about NoSQL systems or not. Recently I’ve noticed a couple of posts that have resurfaced this idea in the context of document databases and parent-child models.

Both CouchDB and MongoDB spend some time in their documentation[1] explaining the different approaches to mapping one-to-many and many-to-many relationships, and also cover some of the pros and cons.

Unfortunately, there are tons of posts out there showing just one of the possible solutions and forgetting to detail the pros and cons, or at least to ask the reader to investigate the topic further. One of the most used examples is representing child collections as IDs in the parent entity. Another is representing child entities as an embedded collection on the parent. But what I couldn’t find in such posts was a discussion of the pros and cons. For example, IDs on the parent lead to the well-known N+1 access issue, while embedded collections can lead to ever-growing documents that are expensive to manipulate, or to child entities that cannot be reached on their own, and so on.
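The N+1 issue with the IDs-on-the-parent model is easy to see with a toy store that counts round-trips. CountingStore and the sample data are invented for illustration:

```python
class CountingStore:
    """Toy document store that counts round-trips to the server."""
    def __init__(self, docs):
        self.docs, self.queries = docs, 0
    def get(self, _id):
        self.queries += 1
        return self.docs[_id]

# Children as a list of IDs on the parent: 1 + N fetches
store = CountingStore({
    "order:1": {"item_ids": ["item:1", "item:2", "item:3"]},
    "item:1": {"sku": "A"}, "item:2": {"sku": "B"}, "item:3": {"sku": "C"},
})
order = store.get("order:1")                        # 1 query
items = [store.get(i) for i in order["item_ids"]]   # + N queries

# Children embedded in the parent: a single fetch
embedded = CountingStore({
    "order:1": {"items": [{"sku": "A"}, {"sku": "B"}, {"sku": "C"}]},
})
one_trip = embedded.get("order:1")                  # 1 query total
```

Neither model is wrong; the point is that each has a cost (round-trips here, document growth and unreachable children in the embedded case) that those one-sided posts tend to leave out.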

So, my advice is that before you start just dumping your data into your favorite document store:

  • spend some extended time understanding how to model your data and relationships with your storage solution
  • think a lot about what data access patterns will be needed in your application
  • don’t just trust every “look ma’, this solution is so cool” post. Dig into the topic a bit more.

Otherwise you might just end up with a cool NoSQL system performing rather badly because you have (mis)modeled your data.

References

  • [1] CouchDB documentation about entity relationships can be found ☞ here. MongoDB docs on entity relationship design can be found ☞ here.

FleetDB: An Interview with Mark McGranaghan

FleetDB is an MIT-licensed schema-free database implemented primarily in Clojure that provides a combination of schema-free records, declarative queries, an optimizing query planner, and a few more interesting features. While not exactly targeting scenarios that involve tons of data and require massive scalability, FleetDB seems to be a nice tool to have around when prototyping your next app. Mark McGranaghan, the project creator, has been kind enough to answer a couple of questions for us.

MyNoSQL: What made you create FleetDB? Why FleetDB? What is its ‘selling point’?

Mark McGranaghan: FleetDB is a solution to the problems that I encountered when trying to use existing relational and NoSQL databases to rapidly develop applications. In particular, FleetDB offers a unique combination of schema-free data modeling, expressive and composable queries, automatically maintained indexes, excellent consistency and concurrency characteristics, in-memory operation, simple append-only durability, and a universal client API that cannot be found in existing databases.

FleetDB is also a great example of the power of functional programming in general and of Clojure’s persistent data structures in particular. Many of the features of FleetDB - its ACID guarantees, concurrent performance, and powerful query language in particular - are due largely to it having been implemented in Clojure.

MyNoSQL: Where would you position FleetDB? (according to CAP, data model, etc.)

Mark McGranaghan: FleetDB is a document-oriented database. It also offers dynamic queries that are aided by indexes and an optimizing query planner. FleetDB currently operates as a single process and is therefore not subject to CAP. Indeed, the database provides uncommonly strong consistency guarantees: multi-document, multi-collection, and even multi-query. In terms of memory versus disk, FleetDB answers all queries out of memory but keeps an append-only log on disk for full durability.
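The in-memory-plus-append-only-log design Mark describes can be sketched generically. TinyStore, its one-JSON-line-per-operation log format, and all names here are made up for illustration; they are not FleetDB’s actual implementation:

```python
import json
import os
import tempfile

class TinyStore:
    """In-memory store with an append-only log for durability."""
    def __init__(self, log_path):
        self.log_path, self.data = log_path, {}
        if os.path.exists(log_path):          # recovery: replay the log
            with open(log_path) as f:
                for line in f:
                    op = json.loads(line)
                    self.data[op["id"]] = op["doc"]

    def insert(self, _id, doc):
        with open(self.log_path, "a") as f:   # append to disk before acking
            f.write(json.dumps({"id": _id, "doc": doc}) + "\n")
        self.data[_id] = doc                  # queries are answered from memory

    def get(self, _id):
        return self.data.get(_id)

path = os.path.join(tempfile.mkdtemp(), "store.log")
db = TinyStore(path)
db.insert("u1", {"name": "mark"})
recovered = TinyStore(path)                   # simulate a process restart
```

Reads never touch the disk, writes are a single sequential append, and a restart rebuilds the in-memory state by replaying the log from the top.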

In comparison to other databases, FleetDB combines the optimizing query planner of relational databases, the document orientation of MongoDB, the main memory operation of Redis, the functional data model of CouchDB, the embedability and single-file durability of SQLite, and adds an original composable query interface that allows for increased consistency guarantees and general expressiveness.

One thing that I am not trying to do with FleetDB, at least right now, is build a massively scalable database. A lot of apps have a relatively modest amount of core data, especially as they are being prototyped and iteratively developed. In these cases the ease-of-use, flexibility, consistency guarantees, and performance of a well-designed single node database may be more desirable than a fully distributed database and its associated complexity and decreased durability guarantees. With FleetDB I’ve tried to get the single node use case right, where a lot of other NoSQL stores are great at massive scale but not much use for the single node case.

MyNoSQL: Are you aware of any usage of FleetDB in production?

Mark McGranaghan: I’ve used FleetDB for several personal projects and prototypes, though the only one of these that is public now is ☞ GitCred. I’ve also heard from a few startups that they are considering FleetDB for use in their applications. That said, I would be surprised to see public-facing production use while the product is still in alpha; at this point I’m still working with users to ensure that the interface is good, that performance is high, and that we catch any bugs. I hope to see more production use after an 0.1.0 release in the Spring.

MyNoSQL: Anything you’d like to add?

Mark McGranaghan: FleetDB is implemented in about 1300 lines of Clojure and 100 lines of Java, much less code than any of the other database systems that I have considered. I was able to keep the code base so small because of the expressiveness of the Clojure language, the power of its persistent data structures, the availability of a variety of Java libraries on which to build, and a judicious choice of features to implement in FleetDB. Having a small code base helps me rapidly develop features and contain bugs, but it’s also nice for contributors and end-users who are curious about the internals of the database.

I’m also working on a database performance evaluation suite called ☞ db-compare. That project started as a means to test FleetDB as I was developing it, but it has since evolved into a tool to evaluate the performance of a dozen open source databases under a variety of workloads and client concurrency levels. Furthermore, the evaluations produced by the suite will be rigorous, repeatable, and properly statistically analyzed. I’ll be sure to ping you when I release the first set of db-compare benchmark results. In the meantime, if you have any thoughts about what you or the community would like to see from database benchmarks feel free to let me know.



Using Google's V8 JavaScript Engine with MongoDB

Purely geeky: learn how to use the Google V8 JavaScript engine with MongoDB.

Currently the JavaScript engine used is SpiderMonkey (developed by Mozilla). If you check out the latest version of MongoDB you will be able to build it with Google’s V8 JavaScript engine support.

The installation details are for Ubuntu, but who knows, maybe somebody can get this to work on Mac OS too.

via: http://www.howsthe.com/blog/2010/feb/22/mongodb-and-v8/