From the paper by Daniel Peng and Frank Dabek (☞ PDF):
Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.
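To make the contrast concrete, here is a minimal sketch of the incremental, notification-driven model the abstract contrasts with batch rebuilds. All names (write_document, on_document_written, the in-memory repository) are hypothetical and only illustrative; Percolator itself runs on Bigtable with distributed transactions across thousands of machines.

```python
# Hypothetical sketch: incremental, per-document index updates versus a
# full batch rebuild. The in-memory "repository" stands in for a large
# document store; none of these names come from the paper.
from typing import Callable, Dict, List, Set

repository: Dict[str, str] = {}   # document id -> contents
index: Dict[str, Set[str]] = {}   # term -> set of doc ids

def batch_rebuild() -> None:
    """MapReduce-style processing: recompute the whole index from scratch,
    even if only a handful of documents changed."""
    index.clear()
    for doc_id, text in repository.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

# Observers are callbacks triggered when a document is written, mirroring
# the "notifications" side of the paper's title.
observers: List[Callable[[str], None]] = []

def on_document_written(doc_id: str) -> None:
    """Incremental processing: apply a small, independent mutation to the
    index for just this one document."""
    for term in repository[doc_id].split():
        index.setdefault(term, set()).add(doc_id)

observers.append(on_document_written)

def write_document(doc_id: str, text: str) -> None:
    """The crawler delivers one document; observers run immediately instead
    of waiting for a large batch to accumulate."""
    repository[doc_id] = text
    for observer in observers:
        observer(doc_id)

if __name__ == "__main__":
    write_document("doc1", "large scale incremental processing")
    write_document("doc2", "distributed transactions and notifications")
    print(sorted(index["incremental"]))   # ['doc1']
    print(sorted(index["distributed"]))   # ['doc2']
```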
Is this the paper behind Google Caffeine?
Original title and link: Large-scale Incremental Processing Using Distributed Transactions and Notifications