Document Databases: A “new” definition

Here is a new, and very bad, definition of document databases:

When people talk about document-oriented NoSQL or some similar term, they usually mean something like:

Database management that uses a JSON model and gives you reasonably robust access to individual field values inside a JSON (JavaScript Object Notation) object.


Let’s see what’s wrong with it. The major problem with this definition is that it tries to tie a wide range of products to one very specific data format, which is completely irrelevant.

Storage format

While important for aspects like:

  • optimized access to data (either disk or memory or even both)
  • real space usage

the internal storage format is usually not important and/or completely opaque to end users. All that matters is that the engine knows how to handle it.

Very generally, you can have two types of engines:

  • the ones for which the data they store is completely opaque, i.e. the engine doesn’t know how to interpret or slice it
  • the ones that know the exact format and can interpret every bit of it. For these engines, data types are important.

A couple of examples:

  • Each MySQL storage engine uses its own internal data format, but a client accessing it will always get the same data
  • Redis uses highly optimized internal data formats that allow it to offer per-data-type operations on top of them
  • MongoDB uses a binary JSON-like format (BSON)
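
The difference between the two types of engines can be illustrated with a minimal sketch. The class names `OpaqueStore` and `FormatAwareStore` are hypothetical and exist only for this illustration; they do not model how any of the products above are actually implemented.

```python
import json


class OpaqueStore:
    """Hypothetical engine that treats every value as an uninterpreted blob."""

    def __init__(self):
        self._data = {}

    def put(self, key, blob: bytes):
        self._data[key] = blob

    def get(self, key) -> bytes:
        # The only thing this engine can do is hand the whole blob back.
        return self._data[key]


class FormatAwareStore(OpaqueStore):
    """Hypothetical engine that knows its values are JSON-encoded objects."""

    def get_field(self, key, field):
        # Because the engine understands the format, it can slice a record.
        record = json.loads(self._data[key])
        return record[field]


blob = json.dumps({"name": "Ada", "city": "London"}).encode("utf-8")

opaque = OpaqueStore()
opaque.put("user:1", blob)

aware = FormatAwareStore()
aware.put("user:1", blob)
city = aware.get_field("user:1", "city")
```

Both stores hold exactly the same bytes. Only the second one can answer a question about a single field, and it could do so with any other format it understood just as well as with JSON.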

External format or Protocols

I’ve already written about why protocols are important. To summarize, the external protocol matters for a few reasons:

  • how easy it is to connect to the engine and to create new clients that know how to produce and consume that data
  • whether it is optimized for over-the-wire transfers
  • whether it is easy to debug

Nonetheless, you could easily create a database engine that is able to serve data in different formats. Actually, such engines already exist:

  • MySQL (and probably all other relational databases) can spit out data in its custom format, as CSV, or as XML
  • memcached can speak both a text protocol and a binary protocol
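
To show how little the external format says about the engine, here is a small sketch in which the same stored rows are served as CSV, JSON, or XML. The `serve` function, the `SERIALIZERS` table, and the sample rows are all made up for this illustration and are not taken from any real database.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Sample rows, invented for this example.
ROWS = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Linus"}]


def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "name"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    return json.dumps(rows)


def to_xml(rows):
    root = ET.Element("rows")
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for column, value in row.items():
            ET.SubElement(row_el, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


# One engine, several external formats: the stored data never changes.
SERIALIZERS = {"csv": to_csv, "json": to_json, "xml": to_xml}


def serve(rows, fmt):
    return SERIALIZERS[fmt](rows)


csv_out = serve(ROWS, "csv")
json_out = serve(ROWS, "json")
xml_out = serve(ROWS, "xml")
```

Adding a fourth output format here means adding one function and one dictionary entry. Nothing about how the rows are stored or queried has to change.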

But, what is a document database?

  1. a data engine using a non-relational data model
  2. a storage engine with knowledge about the data it is storing. Basically the engine will be able to operate on inner values of the “records”
  3. an engine that can define secondary indexes on non-key fields and allows querying data based on these.

If document databases were characterized only by 1) and 2) above, then we could say that almost all databases are document databases. There are just a few databases (NoSQL or not) out there which cannot look inside the “records” they are storing. Thus it is all three characteristics taken together that identify document databases.
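
The three characteristics can be put together in a toy sketch. `TinyDocumentStore` and its methods are hypothetical names invented for this example; the sketch keeps everything in memory and assumes indexed values are hashable. Note that JSON appears nowhere in it: the records are simply nested dictionaries.

```python
from collections import defaultdict


class TinyDocumentStore:
    """Toy in-memory store: nested records plus secondary indexes."""

    def __init__(self):
        self._docs = {}     # primary key -> document (a nested dict)
        self._indexes = {}  # field path -> {field value -> set of primary keys}

    @staticmethod
    def _lookup(doc, path):
        """Follow a dotted path such as 'address.city' into a nested dict."""
        current = doc
        for part in path.split("."):
            if not isinstance(current, dict) or part not in current:
                return None
            current = current[part]
        return current

    def _index_doc(self, key, doc, add):
        # Add or remove one document from every secondary index.
        for path, index in self._indexes.items():
            value = self._lookup(doc, path)
            if value is not None:
                if add:
                    index[value].add(key)
                else:
                    index[value].discard(key)

    def create_index(self, path):
        """Define a secondary index on a non-key field."""
        self._indexes[path] = defaultdict(set)
        for key, doc in self._docs.items():
            value = self._lookup(doc, path)
            if value is not None:
                self._indexes[path][value].add(key)

    def put(self, key, doc):
        if key in self._docs:
            self._index_doc(key, self._docs[key], add=False)
        self._docs[key] = doc
        self._index_doc(key, doc, add=True)

    def find(self, path, value):
        """Query by an indexed inner field instead of by primary key."""
        keys = self._indexes[path].get(value, set())
        return [self._docs[k] for k in sorted(keys)]


store = TinyDocumentStore()
store.create_index("address.city")
store.put("u1", {"name": "Ada", "address": {"city": "London"}})
store.put("u2", {"name": "Linus", "address": {"city": "Helsinki"}})
store.put("u3", {"name": "Alan", "address": {"city": "London"}})

londoners = [doc["name"] for doc in store.find("address.city", "London")]
```

The nested dictionaries stand for the non-relational model, `_lookup` for the engine's knowledge of inner values, and `create_index` together with `find` for secondary indexes on non-key fields.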

Original title and link: Document Databases: A “new” definition (NoSQL databases © myNoSQL)