Every article I’ve read and linked to that covers migrating data from one database to another tells the same story:
- incremental replication
- consistency checking
- shadow writes
- shadow writes and shadow reads for validation
- end of life of the original data store
The same story for Netflix’s migration from SimpleDB to Cassandra and Shift.com’s migration from MongoDB/Titan to Cassandra. And once again, the same pattern appears in FullContact’s migration from MongoDB to Cassandra. This last post also includes a nice diagram of the process.
The key part of these stories is that the migration was performed with zero downtime.
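To make the shadow writes/reads phase concrete, here is a minimal Python sketch of the pattern; the `old_store` and `new_store` clients and their `get`/`put` methods are generic stand-ins, not any particular driver API:

```python
import logging

log = logging.getLogger("migration")

class ShadowStore:
    """Dual-writes to the old and new stores, serves reads from the old
    store, and uses shadow reads against the new store for validation."""

    def __init__(self, old_store, new_store):
        self.old = old_store
        self.new = new_store

    def put(self, key, value):
        self.old.put(key, value)      # the old store stays authoritative
        try:
            self.new.put(key, value)  # shadow write; failures are logged, never user-facing
        except Exception:
            log.exception("shadow write failed for %r", key)

    def get(self, key):
        value = self.old.get(key)     # traffic is still served from the old store
        try:
            if self.new.get(key) != value:  # shadow read for validation
                log.warning("mismatch between stores for key %r", key)
        except Exception:
            log.exception("shadow read failed for %r", key)
        return value
```

Once the mismatch rate observed by the shadow reads drops to zero, reads can be flipped to the new store and the original data store retired.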
Original title and link: Migrating databases with zero downtime ( ©myNoSQL)
In my post about in-memory databases vs Aster Data and Greenplum vs Hadoop market share, I proposed a scenario in which Aster Data and Greenplum could expand into the space of in-memory databases by employing hybrid storage.
What I didn’t cover in that post is the possibility of Hadoop, or more precisely HDFS, expanding into hybrid storage.
But that’s already happening: Hortonworks is working on introducing support for heterogeneous storages in HDFS:
We plan to introduce the idea of Storage Preferences for files. A Storage Preference is a hint to HDFS specifying how the application would like block replicas for the given file to be placed. Initially the Storage Preference will include:
- The desired number of file replicas (also called the replication factor) and;
- The target storage type for the replicas.
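Since at the time of the post this was only a proposal, there is no real API to show; the Python sketch below is entirely hypothetical (both `client` and `set_storage_preference` are made-up names) and only illustrates the shape of such a hint:

```python
from dataclasses import dataclass

@dataclass
class StoragePreference:
    # Hypothetical hint mirroring the two fields from the proposal:
    replication: int   # desired number of block replicas
    storage_type: str  # target medium for the replicas, e.g. "DISK", "SSD", "RAM"

def set_storage_preference(client, path, pref):
    """Hypothetical call asking HDFS to place block replicas of `path`
    according to `pref`. Per the proposal, this is a hint, not a guarantee."""
    client.set_preference(path, replication=pref.replication,
                          storage_type=pref.storage_type)

# e.g. ask for a hot dataset to be kept in memory with two replicas:
# set_storage_preference(client, "/data/hot/events", StoragePreference(2, "RAM"))
```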
Even if memory costs resume decreasing at the rate seen before 2012, when they flat-lined, a cost-effective architecture will almost always rely on hybrid storage.
Original title and link: Heterogeneous storages in HDFS ( ©myNoSQL)
After re-reading HyperDex’s comparison of Cassandra, MongoDB, and Riak backups, I realized it contains no links to the corresponding docs. So here they are:
Cassandra backs up data by taking a snapshot of all on-disk data files (SSTable files) stored in the data directory.
You can take a snapshot of all keyspaces, a single keyspace, or a single table while the system is online. Using a parallel ssh tool (such as pssh), you can snapshot an entire cluster. This provides an eventually consistent backup. Although no one node is guaranteed to be consistent with its replica nodes at the time a snapshot is taken, a restored snapshot resumes consistency using Cassandra’s built-in consistency mechanisms.
After a system-wide snapshot is performed, you can enable incremental backups on each node to back up data that has changed since the last snapshot: each time an SSTable is flushed, a hard link to it is created in a /backups subdirectory of the data directory (provided JNA is enabled).
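The snapshots themselves are taken with `nodetool snapshot`. Here is a minimal Python sketch of the cluster-wide, pssh-style variant; the hostnames are placeholders, and it assumes SSH access to each node with `nodetool` on its PATH:

```python
import subprocess

NODES = ["cass-1.example.com", "cass-2.example.com", "cass-3.example.com"]  # placeholders

def snapshot_cluster(keyspace, tag):
    """Run `nodetool snapshot` on every node, roughly what pssh would do."""
    procs = [
        subprocess.Popen(["ssh", node, "nodetool", "snapshot", "-t", tag, keyspace])
        for node in NODES  # launched in parallel, like pssh
    ]
    for proc in procs:
        proc.wait()  # the result is an eventually consistent, cluster-wide snapshot

snapshot_cluster("my_keyspace", "backup-2013-09")
```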
Basically, there are three ways to back up MongoDB:
- Using MMS
- Copying the underlying files
- Using mongodump
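For the `mongodump` route, a minimal Python sketch (host, port, and output directory are placeholder values):

```python
import subprocess

# Dump all databases from a mongod instance into a BSON backup directory.
subprocess.run(
    ["mongodump", "--host", "db.example.com", "--port", "27017",
     "--out", "/backups/mongodb/2013-09-20"],
    check=True,  # raise if mongodump exits non-zero
)
```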
Riak’s backup operations are quite different for its two main storage backends, Bitcask and LevelDB:
Choosing your Riak backup strategy will largely depend on the backend configuration of your nodes. In many cases, Riak will conform to your already established backup methodologies. When backing up a node, it is important to backup both the ring and data directories that pertain to your configured backend.
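For a Bitcask-backed node, the backup really is just archiving those two directories. A minimal Python sketch, assuming the default directory layout of the Linux packages (adjust the paths for your install):

```python
import subprocess

# Default locations for Riak's Linux packages; adjust for your install.
RING_DIR = "/var/lib/riak/ring"
BITCASK_DIR = "/var/lib/riak/bitcask"

# Archive both the ring and the backend data directories, as the docs advise.
subprocess.run(
    ["tar", "-czf", "/backups/riak-node1-2013-09-20.tgz", RING_DIR, BITCASK_DIR],
    check=True,
)
```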
Note: I’d be happy to update this entry with links to docs covering the backup tools and solutions provided by other NoSQL databases (HBase, Redis, Neo4j, CouchDB, Couchbase, RethinkDB).
✚ Considering that a backup is only as useful as the ability to actually restore it, I’m wondering why there are no tools that can validate a backup without forcing a complete restore. The two mechanisms are not equivalent, but for large databases such a tool could simplify the process a bit and increase users’ confidence.
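A trivial version of such a tool could at least catch truncated or bit-rotten files by verifying checksums recorded at backup time. A minimal sketch:

```python
import hashlib
import json
import os
import sys

def file_sha256(path):
    """Stream a file through SHA-256 without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def make_manifest(backup_dir, manifest_path):
    """Record a checksum for every file in the backup."""
    manifest = {}
    for root, _, files in os.walk(backup_dir):
        for name in files:
            path = os.path.join(root, name)
            manifest[os.path.relpath(path, backup_dir)] = file_sha256(path)
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)

def verify(backup_dir, manifest_path):
    """Re-hash the backup files and compare against the recorded manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    ok = True
    for rel, digest in manifest.items():
        path = os.path.join(backup_dir, rel)
        if not os.path.exists(path) or file_sha256(path) != digest:
            print("corrupt or missing:", rel, file=sys.stderr)
            ok = False
    return ok
```

This doesn’t prove the backup is restorable, which still requires an actual restore, but it’s cheap enough to run against every backup.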
Original title and link: Quick links for how to backup different NoSQL databases ( ©myNoSQL)