


document store: All content tagged as document store in NoSQL databases and polyglot persistence

Document Databases and the Impedance Mismatch with the Object World

One of the issues most often reported by software engineers working with relational databases from object-oriented languages is the object-relational impedance mismatch. Adopters of document databases say that one benefit of document stores is that there is no impedance mismatch between the object and document worlds.

I don’t think this is entirely true.

Firstly, the numerous object-document mapping frameworks are proof that people are still using tools to convert between objects and documents. CouchDB and MongoDB already have many mapping frameworks available in the most popular languages.

Secondly, if you consider a highly connected hierarchical object model, you’ll realize that there is no single consistent way to map it into the document model. The mapping requires specific domain knowledge and involves different strategies depending on usage scenarios.
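To make the "different strategies" point concrete, here is a minimal sketch of one object graph mapped to documents in two ways: embedding the related object, or storing only a reference to it. The `Author` and `Post` classes and the field names are hypothetical, invented for this illustration, and neither strategy is tied to a specific database.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    id: str
    name: str


@dataclass
class Post:
    id: str
    title: str
    author: Author
    tags: list = field(default_factory=list)


def to_embedded(post):
    """Strategy A: copy the author into the post document."""
    return {
        "_id": post.id,
        "title": post.title,
        "author": {"id": post.author.id, "name": post.author.name},
        "tags": list(post.tags),
    }


def to_referenced(post):
    """Strategy B: keep only the author's id; the author is its own document."""
    post_doc = {
        "_id": post.id,
        "title": post.title,
        "author_id": post.author.id,
        "tags": list(post.tags),
    }
    author_doc = {"_id": post.author.id, "name": post.author.name}
    return post_doc, author_doc


alice = Author("a1", "Alice")
post = Post("p1", "Hello", alice, ["intro"])

embedded = to_embedded(post)
post_doc, author_doc = to_referenced(post)
```

Embedding makes reading a post a single lookup but duplicates the author in every post; referencing avoids the duplication but needs a second lookup. Which one is right depends on how the application reads and updates the data, which is exactly the domain knowledge the mapping requires.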

So, what is my point? Document databases are not solving the impedance mismatch with the object world. All they do is offer more flexibility in converting from one to the other.

Original title and link: Document Databases and the Impedance Mismatch with the Object World (NoSQL databases © myNoSQL)

Document Databases: A “new” definition

A new, very bad definition for document databases:

When people talk about document-oriented NoSQL or some similar term, they usually mean something like:

Database management that uses a JSON model and gives you reasonably robust access to individual field values inside a JSON (JavaScript Object Notation) object.


Let’s try to see what’s wrong with it. The major problem with this definition is that it tries to tie a wide range of products to a very specific data format, which is completely irrelevant.

Storage format

While important for aspects like:

  • optimized access to data (either disk or memory or even both)
  • real space usage

the internal storage format is usually not important and/or completely opaque to end users. All that matters is that the engine knows how to handle it.

Very generally, you can have two types of engines:

  • the ones for which the data they store is completely opaque, i.e. the engine doesn’t know how to interpret/slice it
  • the ones that know the exact format and can interpret every bit of it. For these engines, data types are important.

A couple of examples:

  • Each MySQL storage engine uses its own internal data format, but a client accessing it will always get the same data
  • Redis uses highly optimized internal data formats that allow it to offer per-data-type operations on top of them
  • MongoDB uses a binary JSON-like format (BSON)

External format or Protocols

I’ve already written why protocols are important. But to summarize, the external protocol is important for a couple of reasons:

  • how easy it is to connect to the engine and create new clients that know how to produce and consume that data
  • whether it is optimized for over-the-wire transfers
  • whether it is easy to debug

Nonetheless, you could easily create a database engine that would be able to serve data in different formats. Actually, these already exist:

  • MySQL (and probably all other relational databases) can spit out data in their custom format, CSV, or XML
  • memcached speaks both a text protocol and a binary protocol
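As a small illustration of why the external format says little about the engine, here is a sketch in which the same in-memory records are served as either JSON or CSV. The `rows` data and the two function names are made up for the example; only Python standard library modules are used.

```python
import csv
import io
import json

# Hypothetical records held by an engine in whatever internal form it likes.
rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]


def as_json(records):
    """Serialize the records as a JSON array."""
    return json.dumps(records)


def as_csv(records):
    """Serialize the same records as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "name"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


json_payload = as_json(rows)
csv_payload = as_csv(rows)
```

The client sees two different wire formats, yet nothing about how the data is stored or modeled has changed.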

But, what is a document database?

  1. a data engine using a non-relational data model
  2. a storage engine with knowledge about the data it is storing. Basically the engine will be able to operate on inner values of the “records”
  3. an engine that can define secondary indexes on non-key fields and allows querying data based on these.

If document databases were characterized only by 1) and 2) above, then we could say that almost all databases are document databases: there are just a few databases (NoSQL or not) out there that cannot look inside the “records” they are storing. Thus it is all three characteristics together that identify document databases.
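The third characteristic is the easiest to show in code. Below is a toy in-memory sketch, not modeled on any real product: the `TinyDocumentStore` class and its methods are hypothetical, and it assumes indexed field values are hashable. It stores schemaless dicts, looks inside them, and maintains a secondary index on a non-key field so that documents can be queried by that field.

```python
from collections import defaultdict


class TinyDocumentStore:
    """Toy document store: key -> dict, plus secondary indexes on fields."""

    def __init__(self):
        self._docs = {}
        self._indexes = {}  # field name -> {field value -> set of keys}

    def create_index(self, field_name):
        """Build an index over the existing documents for one field."""
        index = defaultdict(set)
        for key, doc in self._docs.items():
            if field_name in doc:
                index[doc[field_name]].add(key)
        self._indexes[field_name] = index

    def put(self, key, doc):
        """Store a document and keep every index in sync."""
        old = self._docs.get(key)
        for field_name, index in self._indexes.items():
            if old is not None and field_name in old:
                index[old[field_name]].discard(key)
            if field_name in doc:
                index[doc[field_name]].add(key)
        self._docs[key] = doc

    def find(self, field_name, value):
        """Query by a non-key field; requires an index on that field."""
        if field_name not in self._indexes:
            raise KeyError(f"no index on {field_name!r}")
        keys = self._indexes[field_name].get(value, ())
        return [self._docs[k] for k in sorted(keys)]


store = TinyDocumentStore()
store.create_index("city")
store.put("u1", {"name": "Ann", "city": "Rome"})
store.put("u2", {"name": "Bob", "city": "Oslo"})
store.put("u3", {"name": "Cy"})                   # no "city" field at all
store.put("u2", {"name": "Bob", "city": "Rome"})  # Bob moves; index follows

in_rome = [d["name"] for d in store.find("city", "Rome")]
in_oslo = store.find("city", "Oslo")
```

A pure key-value engine could offer `put` and a lookup by key while treating the value as opaque; it is `find` on a field inside the record that needs the engine to understand what it stores.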

Original title and link: Document Databases: A “new” definition (NoSQL databases © myNoSQL)

On Document Databases or Is MongoDB Lacking a Sweet Spot?

Rob Ashton published an article comparing the major NoSQL document databases: CouchDB, MongoDB, and RavenDB, from the perspective that “it’s all about sweet spots”. His conclusions about MongoDB are very interesting:

Mongo on the other hand, is very similar to our traditional databases – with the core difference that data is stored as documents instead of being split out (normalised) across multiple tables. The rest of it all looks very similar – except we lose the ability to do joins across our tables (so complex reporting is out). Reads are fast, but only because Mongo has been micro-optimised to the hilt (just like most database engines), writes are fast, but only because the system doesn’t wait to find out if writes have been successful before carrying on with the next job.

I don’t see it, I don’t see the sweet spot – even if the durability issues are sorted out it’s still just a traditional database with a few less features in the name of gaining a few milliseconds that most of us don’t need.

It achieves fast query speed by storing indexes in memory, which means what Mongo actually gives us is a really slow way to query in memory objects – and heaven forbid you run out of enough RAM to store these indexes on your production servers (ala Foursquare a few months ago). If you’re going to go for an in-memory query store then you’re probably going to use Redis because that’s its sweet spot…

Taking a step back, the two most visible differences between MongoDB and a relational database are:

  • data model (relational vs document-based, non-relational)
  • a different querying language

In reality there are more than just these two — think in terms of ACID — but they are not that obvious upfront.

So, as I’ve said before, I think MongoDB would be much better positioned if it used SQL. That would give it a “sweet spot”: applications that haven’t worked well with relational databases because of the data model. Think about (prototype) applications whose data model is evolving rapidly, or applications where JOINs have become too complicated or too slow.

Original title and link: On Document Databases or Is MongoDB Lacking a Sweet Spot? (NoSQL databases © myNoSQL)


Terrastore 0.8.0 Released, Featuring Map/Reduce

After a short break, Terrastore has published a new version, 0.8.0, which brings quite a few interesting features, plus some performance, scalability, and stability enhancements:

  • map/reduce processing
  • active event listeners
  • adaptive ensemble scheduling
  • document and communication compression

Sergio Bossa, Terrastore lead developer, has shared more about this release ☞ here:

Terrastore map/reduce implementation targets all documents, or just a subset of documents specified by range, belonging to a single bucket, and is based on three phases: mapper, combiner and reducer. The mapper phase is initiated by the node which received the map/reduce request, the originator node: it locates the target documents and the nodes that hold them, then sends the map function to those nodes so that it can be applied in parallel on each node; the map function will take each target document as input argument, and return, for each document, a map of pairs as output. Then, each remote node runs the combiner phase, aggregating its local map results and returning a partial map of pairs. Finally, the originator node runs the reducer phase, aggregating all partial results.
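The three phases described above can be sketched in a few lines of Python. This is only an illustration of the mapper/combiner/reducer flow, not Terrastore's API: the function names, the tag-counting job, and the `nodes` dictionary standing in for a cluster are all assumptions, and the per-node work runs sequentially here rather than in parallel.

```python
from collections import Counter


def mapper(doc):
    """Map one document to a map of pairs (here: tag -> 1)."""
    return {tag: 1 for tag in doc.get("tags", [])}


def combiner(map_results):
    """Runs on each node: aggregate that node's local map results."""
    partial = Counter()
    for pairs in map_results:
        partial.update(pairs)
    return dict(partial)


def reducer(partials):
    """Runs on the originator node: aggregate all partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def map_reduce(nodes):
    """Originator: apply mapper+combiner per node, then reduce."""
    partials = [combiner(mapper(doc) for doc in docs) for docs in nodes.values()]
    return reducer(partials)


# Hypothetical cluster: node name -> documents of one bucket held by that node.
nodes = {
    "node-a": [{"tags": ["nosql", "db"]}, {"tags": ["nosql"]}],
    "node-b": [{"tags": ["db"]}, {}],
}
tag_counts = map_reduce(nodes)
```

The combiner is what keeps network traffic down: each node ships one small partial map to the originator instead of one map per document.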

You can download the new Terrastore from ☞ here.

Original title and link: Terrastore 0.8.0 Released, Featuring Map/Reduce (NoSQL databases © myNoSQL)

OrientDB New Release Featuring Sync and Async Replication

OrientDB, the document or graph store, has announced a new release, 0.9.24, featuring synchronous and asynchronous replication along with a few SQL support improvements.

The complete list of changes can be found ☞ here. The ☞ official announcement lists the following new features:

  • Support for Clustering with synchronous and asynchronous replication
  • New SQL RANGE keyword: SELECT FROM ... WHERE ... RANGE <from> [,<to>]
  • New SQL LIMIT keyword: SELECT FROM ... WHERE ... LIMIT 20
  • Improved CREATE INDEX command
  • New REMOVE INDEX command
  • New console command INFO CLASS
  • New console command TRUNCATE CLASS and TRUNCATE CLUSTER
  • MRB+Tree now is faster and stable
  • Improved import/export commands
  • Improved JSON compliance
  • Improved TRAVERSE operator with the optional field list to traverse

I’ve contacted Luca Garulli, OrientDB main developer, for more details about the OrientDB replication.

Original title and link: OrientDB New Release Featuring Sync and Async Replication (NoSQL databases © myNoSQL)

How to Document Document Databases?

Interesting question about documenting the “schema” of document databases on the ☞ MongoDB group. Suggested solutions:

I am not aware of any tools to help you keep in sync the UML model and your JSON/BSON structures, so I’d probably say that’s not a good long term solution (at least not in an evolving project).

Another way of documenting the structure of data stored in document databases would be your model. What I’m not sure about, though, is how you would maintain historical versions and mark the differences as your data structure evolves.

Original title and link: How to Document Document Databases? (NoSQL databases © myNoSQL)

How to prepare for integrating new social media into proprietary software?

How does one prepare for and integrate the inclusion of new social media outlets into proprietary software when the shape and substance of those platform is not immediately available and has not yet been designed??


The second lies in abandoning the old traditional relational database structure, where appropriate, and embracing a more flexible and more adaptable document oriented database format, commonly referred to as NoSQL databases.  […]

The advantage of the NoSQL format is that the data model does not need to be rigidly defined.  The DBMS assumes that the data is unstructured, and allows for a wide diversity of formats: media, pictures, text, documents, numbers, arrays of undetermined length, etc.  Data retrieval is lightening fast, and supported by a javascript query language (which interfaces with C++ drivers).  The genius and the beauty of such a system is that it accommodates very well additions and mutations to the data structure.  A field in any record, for instance could be an array of any type, and that array need not have the same size or even type as an array in the following record.  (To be precise, NoSQL DBMS refer to records as ‘documents’ and tables as ‘collections’.)

The key part is: “where appropriate”.

Plus, I’m still not convinced that much of the enterprise world will really have to integrate with social media. But this is indeed one important advantage of using schemaless NoSQL databases.

Original title and link: How to prepare for integrating new social media into proprietary software? (NoSQL databases © myNoSQL)


Correction: OrientDB is a Document and Graph Store

Luca Garulli, ☞ OrientDB project lead, contacted me a couple of days ago offering some clarifications about OrientDB.

Luca Garulli: OrientDB is a document-graph dbms with schema-less, schema-full or mixed modes. Why also graph? Because the relationships are all direct links between documents. No “JOIN” is used. This allow to load entire graph of interconnected documents in few ms!

The Graph interface is documented ☞ here and starting from v. 0.9.22 OrientDB is compliant with Tinkerpop stack of Graph tools such as the Gremlin language. ☞ This is the link that shows the OrientDB usage from Gremlin.

Alex: Couple of questions:

  1. what is the format in which data is stored?
  2. how do you query data?

Luca: The document is stored in a compressed JSON-like format. Documents are contained in clusters. Clusters can be physical, logical or in-memory. A cluster is something close to the Collection of MongoDB and its aim is to group documents all together. The first use of a cluster is to group documents of the same type, as a sort of TABLE in the Relational world. But you can create a cluster “UrgentInvoices” and put all the urgent invoices close to be expired.

A cluster can be browsed and queried using Native queries and SQL queries. The SQL support is good enough and has extension to handle the schema-free features such as add/remove items in collections and maps. This example add the String ‘Luca’ to the collection “names”.

update Account add names = 'Luca'

And special operators to treat Trees and Graphs. This cross all the relationships avoiding costly JOINs:

select from Profile where = 'Rome'

This one is much more powerful and complex:

select from Profile where any() traverse( 0,3 ) ( 
    any().toUpperCase().indexOf( 'NAVONA' ) > -1 )

any() means any fields because each documents can have different fields (is schema-less). the traverse operator goes recursively from the current document (0) to maximum the 3rd level of nesting (3) checking the condition on the right.
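To make the semantics Luca describes easier to follow, here is a rough Python sketch of a depth-bounded traversal that checks a condition against any string field. It is not OrientDB code: documents are plain dicts, nested dicts stand in for linked documents, and the function names and sample profiles are invented for the illustration.

```python
def any_field_matches(doc, predicate):
    """any(): true if the predicate holds for any string field of the doc."""
    return any(predicate(v) for v in doc.values() if isinstance(v, str))


def traverse(doc, min_depth, max_depth, predicate, depth=0, seen=None):
    """Walk linked documents from min_depth to max_depth, testing each one."""
    seen = set() if seen is None else seen
    if depth > max_depth or id(doc) in seen:
        return False
    seen.add(id(doc))  # guard against cycles in the document graph
    if depth >= min_depth and any_field_matches(doc, predicate):
        return True
    for value in doc.values():
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, dict) and traverse(
                child, min_depth, max_depth, predicate, depth + 1, seen
            ):
                return True
    return False


def contains_navona(value):
    """Mirror of toUpperCase().indexOf('NAVONA') > -1."""
    return value.upper().find("NAVONA") > -1


luca = {"name": "Luca", "address": {"street": "Piazza Navona", "city": {"name": "Rome"}}}
ann = {"name": "Ann", "address": {"street": "Main St"}}
too_deep = {"a": {"b": {"c": {"d": {"note": "navona"}}}}}  # match sits at level 4

matches = [p["name"] for p in (luca, ann) if traverse(p, 0, 3, contains_navona)]
deep_match = traverse(too_deep, 0, 3, contains_navona)
```

The point of the operator is that the query follows the links between documents directly, so no join condition has to be spelled out for each level.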

Then you have native queries:

new ONativeAsynchQuery<ODocument, OQueryContextNativeSchema<ODocument>>(
        new OQueryContextNativeSchema<ODocument>(), this) {

      public boolean filter(OQueryContextNativeSchema<ODocument> iRecord) {
        return iRecord.column("id").toInt().minor(10).go();
      }
    };

Alex: Thanks a lot!

Update: It looks like OrientDB is also seeing some speed improvements these days. You can read about it ☞ here.

Original title and link: Correction: OrientDB is a Document and Graph Store (NoSQL databases © myNoSQL)

Document Databases and RavenDB

Brian Ritchie talked about document databases and RavenDB at the ☞ Jacksonville Software Architecture Group:

Original title and link: Document Databases and RavenDB (NoSQL databases © myNoSQL)

MongoDB Use Case: Archiving

Document-oriented databases, with their flexible schemas, provide a nice solution. We can have older documents which vary a bit from the newer ones in the archive. The lack of homogeneity over time may mean that querying the archive is a little harder. However, keeping the data is potentially much easier.

I think this is pushing the schema migration issue from data to code, which might actually be a good idea.
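Here is a minimal sketch of what "migration in code" can look like: old documents stay untouched in the archive and are upgraded when they are read. The `schema_version` field, the name-splitting change, and the function names are all hypothetical, chosen only to illustrate the idea.

```python
CURRENT_VERSION = 2


def upgrade_v1_to_v2(doc):
    """Hypothetical change: v1 had a single "name", v2 splits it in two."""
    first, _, last = doc["name"].partition(" ")
    upgraded = {k: v for k, v in doc.items() if k != "name"}
    upgraded.update({"first_name": first, "last_name": last, "schema_version": 2})
    return upgraded


# One upgrade step per old version; documents without a version count as v1.
UPGRADES = {1: upgrade_v1_to_v2}


def load(stored_doc):
    """Return the document in the current shape without touching the archive."""
    doc = dict(stored_doc)
    version = doc.get("schema_version", 1)
    while version < CURRENT_VERSION:
        doc = UPGRADES[version](doc)
        version = doc["schema_version"]
    return doc


archive = [
    {"name": "Ada Lovelace"},  # written by an old version of the app
    {"first_name": "Alan", "last_name": "Turing", "schema_version": 2},
]
people = [load(doc) for doc in archive]
```

The cost is that the upgrade functions have to live in the codebase for as long as old documents exist, and any query that runs inside the database still sees the old shapes.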

Original title and link: MongoDB Use Case: Archiving (NoSQL databases © myNoSQL)


Document databases: 11 Document-oriented Applications

From Zef Hemel:

Some examples of document-oriented applications:

  • CRM
  • Contact Address/Phone Book
  • Forum/Discussion
  • Bug Tracking
  • Document Collaboration/Wiki
  • Customer Call Tracking
  • Expense Reporting
  • To-Dos
  • Time Sheets
  • E-mail
  • Help/Reference Desk

Looking at this list I’m like, what application is not document-oriented?

A partial answer to the last question is simple: all those that require highly connected data.

Original title and link: Document databases: 11 Document-oriented Applications (NoSQL databases © myNoSQL)


Normalization is from the devil

Based on his experience building a document database, RavenDB, Ayende writes about data normalization:

If you think about it, normalization in RDBMS had such a major role because storage was expensive. It made sense to try to optimize this with normalization. In essence, normalization is compressing the data, by taking the repeated patterns and substituting them with a marker. There is also another issue, when normalization came out, the applications being built were far different than the type of applications we build today. In terms of number of users, time that you had to process a single request, concurrent requests, amount of data that you had to deal with, etc.

While I do think he’s wrong about the rationale for normalization (in short, the main reason for normalizing data is to guarantee data integrity), working with a document database will offer you a different perspective on organizing data. Just like with programming languages: even if you don’t use every programming language you know or learn, each of them will hopefully give you a different perspective on how to deal with problems.
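The integrity argument is easy to see in a small sketch. The data below is made up: the same orders are stored once in normalized form, with the customer referenced by id, and once with the customer embedded in every order, as a document model might do it.

```python
# Normalized: the email lives in exactly one place.
customers = {"c1": {"email": "old@example.com"}}
orders_normalized = [
    {"id": "o1", "customer_id": "c1"},
    {"id": "o2", "customer_id": "c1"},
]

# Denormalized: every order carries its own copy of the customer.
orders_embedded = [
    {"id": "o1", "customer": {"id": "c1", "email": "old@example.com"}},
    {"id": "o2", "customer": {"id": "c1", "email": "old@example.com"}},
]

# The customer changes email. In the normalized form this is one write...
customers["c1"]["email"] = "new@example.com"
normalized_emails = {customers[o["customer_id"]]["email"] for o in orders_normalized}

# ...while in the embedded form every copy must be found and updated.
# Here the application (by mistake) updates only the first order.
orders_embedded[0]["customer"]["email"] = "new@example.com"
embedded_emails = {o["customer"]["email"] for o in orders_embedded}
```

After the update, `normalized_emails` holds a single value while `embedded_emails` holds two, which is the kind of update anomaly normalization is meant to rule out. Whether that risk matters depends on the application, for example on whether an order should keep the email that was valid at the time it was placed.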

Original title and link for this post: Normalization is from the devil (published on the NoSQL blog: myNoSQL)