A new, very bad definition of document databases:
When people talk about document-oriented NoSQL or some similar term, they usually mean something like:
Let’s try to see what’s wrong with it. The major problem with this definition is that it ties a wide range of products to one very specific data format, and that format is completely irrelevant.
While it matters for aspects like:
- optimized access to data (on disk, in memory, or both)
- actual space usage
the internal storage format is usually unimportant to end users, and often completely opaque to them. All that matters is that the engine knows how to handle it.
Very generally, you can have two types of engines:
- the ones for which the data they store is completely opaque, i.e. the engine doesn’t know how to interpret or slice it
- the ones that know the exact format and can interpret every bit of it. For these engines, data types are important.
A couple of examples:
- Each MySQL storage engine uses its own internal data format, but a client accessing it will always get the same data
- Redis uses highly optimized internal data formats, which allow it to offer per-data-type operations on top of them
- MongoDB uses a binary JSON-like format (BSON)
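To make the difference between the two types of engines concrete, here is a minimal sketch in Python. The `OpaqueStore` and `TypedStore` classes are hypothetical names invented for this illustration and do not correspond to any real product; the sketch assumes values are JSON objects encoded as bytes.

```python
import json


class OpaqueStore:
    """Hypothetical engine that treats every value as an uninterpreted blob."""

    def __init__(self):
        self._data = {}

    def put(self, key, blob: bytes):
        self._data[key] = blob

    def get(self, key) -> bytes:
        return self._data[key]


class TypedStore(OpaqueStore):
    """Hypothetical engine that knows its values are JSON objects,
    so it can operate on a field inside a stored record."""

    def increment(self, key, field, amount=1):
        record = json.loads(self._data[key])
        record[field] = record.get(field, 0) + amount
        self._data[key] = json.dumps(record).encode("utf-8")
        return record[field]


opaque = OpaqueStore()
opaque.put("user:1", b'{"name": "Ann", "visits": 41}')
# With the opaque engine, the client has to fetch the blob, decode it,
# change it, and write the whole thing back by itself.

typed = TypedStore()
typed.put("user:1", b'{"name": "Ann", "visits": 41}')
new_visits = typed.increment("user:1", "visits")  # done inside the engine
print(new_visits)
```

In both cases the client never needs to know how the engine lays the bytes out internally; what differs is whether the engine can offer operations on the inner values.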
External format or Protocols
I’ve already written why protocols are important. But to summarize, the external protocol is important for a couple of reasons:
- how easy it is to connect to the engine and to create new clients that know how to produce and consume that data
- whether it is optimized for over-the-wire transfers
- whether it is easy to debug
Nonetheless, you could easily create a database engine that is able to serve data in different formats. In fact, such engines already exist:
- MySQL (and probably all other relational databases) can spit out data in their custom format, CSV, or XML
- memcached can talk both a text protocol and a binary protocol
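As a rough illustration of how the external format can be decoupled from the stored data, the sketch below serves the same records as either CSV or JSON. The `serialize` function and the sample `ROWS` are assumptions made up for this example, not the API of any of the engines mentioned above.

```python
import csv
import io
import json

# Sample records as the (hypothetical) engine holds them internally.
ROWS = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
]


def serialize(rows, fmt):
    """Render the same rows in the external format the client asked for."""
    if fmt == "json":
        return json.dumps(rows)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["id", "name"], lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


print(serialize(ROWS, "csv"))
print(serialize(ROWS, "json"))
```

The data is identical in both cases; only the representation handed to the client changes.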
But, what is a document database?
- a data engine using a non-relational data model
- a storage engine with knowledge about the data it is storing. Basically the engine will be able to operate on inner values of the “records”
- an engine that can define secondary indexes on non-key fields and allow querying data based on these.
If document databases were characterized only by the first two points above, then we could say that almost all databases are document databases. There are just a few databases (NoSQL or not) out there which cannot look inside the “records” they are storing. Thus it is all three characteristics taken together that identify document databases.
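The third characteristic, secondary indexes on non-key fields, can be sketched as a toy in-memory store. `TinyDocumentStore` and its methods are hypothetical and written only for this illustration; the sketch assumes documents are flat Python dicts with hashable field values and string keys, and it is not how any particular product implements indexing.

```python
from collections import defaultdict


class TinyDocumentStore:
    """Toy document store: non-relational records, engine-visible fields,
    and secondary indexes on non-key fields."""

    def __init__(self):
        self._docs = {}     # primary key -> document
        self._indexes = {}  # field name -> {field value -> set of primary keys}

    def create_index(self, field):
        index = defaultdict(set)
        for key, doc in self._docs.items():
            if field in doc:
                index[doc[field]].add(key)
        self._indexes[field] = index

    def put(self, key, doc):
        old = self._docs.get(key)
        if old is not None:
            # Remove stale index entries before overwriting the document.
            for field, index in self._indexes.items():
                if field in old:
                    index[old[field]].discard(key)
        self._docs[key] = doc
        for field, index in self._indexes.items():
            if field in doc:
                index[doc[field]].add(key)

    def find(self, field, value):
        if field in self._indexes:
            keys = self._indexes[field].get(value, set())
            return [self._docs[k] for k in sorted(keys)]
        # No index on this field: fall back to scanning every document.
        return [doc for _, doc in sorted(self._docs.items())
                if doc.get(field) == value]


store = TinyDocumentStore()
store.create_index("city")
store.put("u1", {"name": "Ann", "city": "Oslo"})
store.put("u2", {"name": "Bob", "city": "Rome"})
store.put("u3", {"name": "Cid", "city": "Oslo"})
names = [doc["name"] for doc in store.find("city", "Oslo")]
print(names)
```

A plain key-value engine could only answer "give me `u1`"; here the query goes through a non-key field, which is what the third point requires.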