Prasanna Padmanabhan and Shashi Madappa posted an article on the Netflix blog describing the process used to migrate data from Amazon SimpleDB to Cassandra:
There will come a time in the life of most systems serving data,
when there is a need to migrate data to a more reliable, scalable
and high performance data store while maintaining or improving data
consistency, latency and efficiency. This document explains the data
migration technique we used at Netflix to migrate the user’s queue
data between two different distributed NoSQL storage systems.
The steps involved are what you’d expect for a large data set migration:
- incremental replication
- consistency checking
- shadow writes
- shadow writes and shadow reads for validation
- end of life of the original data store (SimpleDB)
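The shadow write/shadow read phase is the heart of the validation story: every mutation goes to both stores, reads are still served from the old store, and a shadow read against the new store is compared to catch divergence before cutover. A minimal sketch of that idea, using plain dicts as stand-ins for the SimpleDB and Cassandra clients (the `ShadowStore` class and its names are illustrative, not Netflix's actual code):

```python
class ShadowStore:
    """Dual-write wrapper: the old store remains the system of record,
    while the new store receives shadow writes and shadow reads."""

    def __init__(self, old_store, new_store):
        self.old = old_store      # serves live traffic during migration
        self.new = new_store      # migration target, validated in shadow
        self.mismatches = []      # discrepancies logged by shadow reads

    def put(self, key, value):
        # Shadow write: apply every mutation to both stores.
        self.old[key] = value
        self.new[key] = value

    def get(self, key):
        value = self.old.get(key)    # the old store still answers reads
        shadow = self.new.get(key)   # shadow read, for validation only
        if shadow != value:
            # In production this would be logged and fed back into
            # the incremental-replication / repair pipeline.
            self.mismatches.append((key, value, shadow))
        return value


# Example: a key present only in the old store shows up as a mismatch.
store = ShadowStore(old_store={"user:1": ["movie-a"]}, new_store={})
store.get("user:1")        # shadow read detects the missing row
store.put("user:2", [])    # new writes land in both stores
```

Once the mismatch rate stays at zero for long enough, reads can be flipped to the new store and the old one retired, which is the "end of life" step above.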
If you think about it, this is roughly how a distributed, eventually consistent store works (at least in broad strokes) when replicating data across the cluster. The main difference is that inside a storage engine you deal with a homogeneous system with a single set of constraints, while data migration has to deal with heterogeneous systems, most often characterized by different limitations and behavior.
In 2009, Netflix performed a similar massive data migration, at that time moving data from its own hosted Oracle and MySQL databases to SimpleDB. The challenges of operating this hybrid solution were described in the paper Netflix’s Transition to High-Availability Storage Systems, authored by Sid Anand.
Sid Anand now works at LinkedIn, where Databus is used for low-latency data transfer; Databus’s approach is very similar.
Original title and link: From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix ( ©myNoSQL)