Two great posts from MongoLab covering the structure of MongoDB’s data on disk, how it is reflected in the results returned by the dbStats API, and finally some approaches to reclaiming disk space:
Original title and link: MongoDB data storage structure, dbStats, and managing disk space ( ©myNoSQL)
After re-reading HyperDex’s comparison of Cassandra, MongoDB, and Riak backups, I realized there are no links to the corresponding docs. So here they are:
Cassandra backs up data by taking a snapshot of all on-disk data files (SSTable files) stored in the data directory.
You can take a snapshot of all keyspaces, a single keyspace, or a single table while the system is online. Using a parallel ssh tool (such as pssh), you can snapshot an entire cluster. This provides an eventually consistent backup. Although no one node is guaranteed to be consistent with its replica nodes at the time a snapshot is taken, a restored snapshot resumes consistency using Cassandra’s built-in consistency mechanisms.
After a system-wide snapshot is performed, you can enable incremental backups on each node to backup data that has changed since the last snapshot: each time an SSTable is flushed, a hard link is copied into a /backups subdirectory of the data directory (provided JNA is enabled).
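In practice the snapshot workflow described above maps onto a couple of `nodetool` invocations. The keyspace name, tag, and host file below are hypothetical, and flags can vary between Cassandra versions:

```shell
# Snapshot a single keyspace on this node while it is online
# (the tag makes later cleanup easier):
nodetool snapshot -t backup-2013-07-01 my_keyspace

# Snapshot every node of the cluster in parallel with pssh:
pssh -h cluster-hosts.txt "nodetool snapshot -t backup-2013-07-01 my_keyspace"

# Incremental backups are a cassandra.yaml setting; once enabled,
# flushed SSTables are hard-linked into the backups subdirectory:
#   incremental_backups: true

# Remove old snapshots when they are no longer needed:
nodetool clearsnapshot -t backup-2013-07-01
```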
Basically there are three ways to back up MongoDB:
- Using mongodump
- Using MMS
- Copying underlying files
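For the dump-based approach, the standard tools are `mongodump` and `mongorestore`. The host and output paths below are placeholders:

```shell
# mongodump reads through the normal client protocol, so it works
# against a live mongod or mongos:
mongodump --host db1.example.com --port 27017 --out /backups/mongo-2013-07-01

# --oplog additionally captures oplog entries written during the dump,
# giving a point-in-time snapshot of a replica set member:
mongodump --host db1.example.com --oplog --out /backups/mongo-2013-07-01

# Restore, replaying the captured oplog:
mongorestore --host db1.example.com --oplogReplay /backups/mongo-2013-07-01
```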
Riak’s backup operations differ quite a bit between its two main storage backends, Bitcask and LevelDB:
Choosing your Riak backup strategy will largely depend on the backend configuration of your nodes. In many cases, Riak will conform to your already established backup methodologies. When backing up a node, it is important to backup both the ring and data directories that pertain to your configured backend.
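Concretely, backing up both directories usually comes down to archiving them with standard tools. The paths below assume a default Linux install and are illustrative only:

```shell
# Bitcask data files are append-only, so a live copy is generally safe:
tar -czf /backups/riak-node1-bitcask.tgz \
    /var/lib/riak/bitcask /var/lib/riak/ring

# For the LevelDB backend, stop the node first to get a consistent copy:
riak stop
tar -czf /backups/riak-node1-leveldb.tgz \
    /var/lib/riak/leveldb /var/lib/riak/ring
riak start
```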
Note: I’d be happy to update this entry with links to docs on what tools and solutions other NoSQL databases (HBase, Redis, Neo4j, CouchDB, Couchbase, RethinkDB) are providing.
✚ Considering that a backup is only as useful as your ability to actually restore from it, I’m wondering why there are no tools that can validate a backup without forcing a complete restore. The two mechanisms are not equivalent, but for large databases this would simplify the process a bit and increase users’ confidence.
Original title and link: Quick links for how to backup different NoSQL databases ( ©myNoSQL)
This is how it goes:
- someone declares a solution to be fast. It’s usually a micro-benchmark presented with almost no context.
- then someone else shows better numbers from a competing product. It’s a similar micro-benchmark performed on completely different hardware. An apples-to-oranges comparison.
- the first person revisits the topic and says that actually performance doesn’t matter.
What’s wrong with this?
- most readers will only ever see the first post. The attraction of numbers is irresistible.
- the very few people who see the second type of post will already have picked a side and will dismiss the other camp’s results.
The bottom line is that we end up with two posts of irrelevant numbers that each camp can use to claim theirs is bigger than the other’s. And very few people actually learn what’s so (completely) wrong with them.
Original title and link: Look how fast it is… actually it’s not, but who cares ( ©myNoSQL)
I have been trying to avoid graph “intro” slides and presentations.
There are only so many times you can stand to hear “…all the world is a graph…” as though that’s news. To anyone.
This presentation by Luca is different from the usual introduction-to-graphs presentation.
Original title and link: Why relationships are cool… Relationship in RDBMS vs graph databases ( ©myNoSQL)
In two posts, the Tokutek guys explain how transactions work in TokuMX, the replacement engine they are proposing to MongoDB users. Remember that Vadim Tkachenko (MySQL Performance Blog) called TokuMX the InnoDB for MongoDB:
- For each statement that tries to modify a TokuMX collection, either the entire statement is applied, or none of the statement is applied. A statement is never partially applied.
- `beginTransaction`, `commitTransaction`, and `rollbackTransaction` commands have been added to allow users to perform multi-statement transactions.
- TokuMX queries use multi-version concurrency control (MVCC). That is, queries operate on a snapshot of the system that does not change for the duration of the query. Concurrent inserts, updates, and deletes do not affect query results (note this does not include file operations like removing a collection).
- cursors represent a true snapshot of the system
- simpler to batch inserts together for performance
- simpler for applications to update multiple documents with a single statement
- no need to combine documents together for the purpose of atomicity
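The snapshot semantics described above can be illustrated with a toy multi-version store. This is my own minimal sketch of the MVCC idea, not TokuMX’s implementation: every write creates a new version stamped with a transaction id, and a query only sees versions committed before it began:

```python
import itertools

class MVCCStore:
    """Toy MVCC key-value store: readers see a fixed snapshot."""

    def __init__(self):
        self._versions = {}          # key -> list of (txn_id, value)
        self._clock = itertools.count(1)

    def write(self, key, value):
        txn_id = next(self._clock)   # one auto-committed write per txn
        self._versions.setdefault(key, []).append((txn_id, value))

    def snapshot(self):
        """Begin a query: capture the current commit horizon."""
        return Snapshot(self, next(self._clock))

class Snapshot:
    def __init__(self, store, horizon):
        self._store, self._horizon = store, horizon

    def read(self, key):
        # Latest version committed before this snapshot began;
        # later concurrent writes are invisible to this reader.
        visible = [v for txn, v in self._store._versions.get(key, [])
                   if txn < self._horizon]
        return visible[-1] if visible else None
```

With this, a cursor opened via `snapshot()` keeps returning the value as of its start even if a concurrent `write()` lands mid-query, which is exactly the property the bullet points above describe.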
✚ I’d find TokuMX’s transactions even more interesting if they worked by default at the shard level instead of the cluster level. Users would then need to manually configure cluster-wide transactions, thus remaining in control of performance and availability.
✚ I still have my doubts about TokuMX’s positioning, but that’s a business & marketing story.
Original title and link: TokuMX transactions for MongoDB ( ©myNoSQL)