Todd Hoff puts together some good references on data deduplication:
With BigData comes BigStorage costs. One way to store less is simply not to store the same data twice. That’s the radically simple and powerful notion behind data deduplication. […] A parallel idea in programming is the once-and-only-once principle of never duplicating code.
Using deduplication technology, for some upfront CPU usage, which is a plentiful resource in many systems that are I/O-bound anyway, it's possible to reduce storage requirements by up to 20:1, depending on your data, which saves both money and disk write overhead.
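To make the idea concrete, here is a minimal sketch of content-addressed, fixed-size-chunk deduplication in Python. All names (DedupStore, CHUNK_SIZE) are hypothetical, and real systems typically use content-defined chunking rather than fixed-size blocks; this only illustrates how storing each unique chunk once yields the kind of ratios quoted above.

    import hashlib

    CHUNK_SIZE = 4096  # fixed-size chunking for simplicity

    class DedupStore:
        """Toy content-addressed store: each unique chunk is kept once, keyed by its hash."""

        def __init__(self):
            self.chunks = {}   # SHA-256 digest -> chunk bytes (stored once)
            self.logical = 0   # bytes written by callers
            self.physical = 0  # bytes actually stored

        def write(self, data: bytes) -> list:
            """Split data into chunks, store each unique chunk once,
            and return the list of digests (the 'recipe' for the data)."""
            recipe = []
            for i in range(0, len(data), CHUNK_SIZE):
                chunk = data[i:i + CHUNK_SIZE]
                digest = hashlib.sha256(chunk).hexdigest()
                self.logical += len(chunk)
                if digest not in self.chunks:  # duplicate chunks cost no extra storage
                    self.chunks[digest] = chunk
                    self.physical += len(chunk)
                recipe.append(digest)
            return recipe

        def read(self, recipe: list) -> bytes:
            """Reassemble the original data from its chunk digests."""
            return b"".join(self.chunks[d] for d in recipe)

        def ratio(self) -> float:
            return self.logical / self.physical if self.physical else 1.0

    if __name__ == "__main__":
        store = DedupStore()
        block = b"x" * CHUNK_SIZE
        for _ in range(10):       # writing the same content ten times...
            store.write(block)
        print("dedup ratio: %.0f:1" % store.ratio())  # ...stores it once: 10:1

The CPU cost lives in the hashing step, which is exactly the trade mentioned above: spend cycles computing digests to avoid writing duplicate bytes to disk.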
Data deduplication and data compression are must-haves for big data systems.
Original title and link: Paper: A Study of Practical Deduplication (NoSQL databases © myNoSQL)