Paper: A Study of Practical Deduplication

Todd Hoff puts together some good references on data deduplication:

With BigData comes BigStorage costs. One way to store less is simply not to store the same data twice. That’s the radically simple and powerful notion behind data deduplication. […] A parallel idea in programming is the once-and-only-once principle of never duplicating code.

Using deduplication technology, for some upfront CPU usage, which is a plentiful resource in many systems that are IO bound anyway, it’s possible to reduce storage requirements by up to 20:1, depending on your data, which saves both money and disk write overhead.

Data deduplication and data compression are must-haves for big data systems.
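The core idea of "never store the same data twice" can be sketched as a content-addressed chunk store: data is split into fixed-size chunks, each chunk is keyed by its hash, and duplicate chunks are stored only once. This is a minimal illustration, not the paper's implementation; the `DedupStore` class, its chunk size, and the use of SHA-256 are all assumptions for the example.

```python
import hashlib

class DedupStore:
    """Toy content-addressed block store: identical chunks are stored once.

    Hypothetical sketch -- real deduplicating systems add variable-size
    chunking, persistence, and reference counting on top of this idea.
    """
    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # SHA-256 digest -> chunk bytes (stored once)
        self.files = {}    # file name -> ordered list of chunk digests

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # The CPU cost is the hashing here; the storage saving is that
            # a duplicate chunk adds only a reference, not a second copy.
            self.chunks.setdefault(digest, chunk)
            refs.append(digest)
        self.files[name] = refs

    def read(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])

store = DedupStore()
payload = b"A" * 8192              # two identical 4 KiB chunks
store.write("a.bin", payload)
store.write("b.bin", payload)      # a duplicate file costs no extra chunk storage
assert store.read("a.bin") == payload
print(len(store.chunks))           # → 1: all four logical chunks share one stored copy
```

This also makes the CPU-for-IO trade-off in the quote concrete: every write pays for a hash computation, but duplicate chunks never hit the disk again.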

Original title and link: Paper: A Study of Practical Deduplication (NoSQL databases © myNoSQL)

via: http://highscalability.com/blog/2011/5/5/paper-a-study-of-practical-deduplication.html