Krishna Sankar has a ☞ great summary of a recent talk of Google’s Jeff Dean on Google’s systems and infrastructure:
An interesting set of statistics of MapReduce over time:
- MapReduce at Google, now at 4 million jobs; processing ~1000 PB with 130 PB intermediate data and 45 PB output
- Data has doubled while the number of machines has remained constant from ‘07 to ‘10.
- Machine usage has quadrupled while job completions have doubled from ‘07 to ‘10.
- Trivia: Jeff shared an anecdote where the network engineers were rewiring the network while Jeff & Co were running MapReduce. They lost machines in a strange pattern and wondered what was going on; but the job succeeded, albeit a little slower than normal, and of course the machines came back up! Only after the fact did they hear about the network rewiring!
The talk generated some interesting comments on ☞ Greg Linden’s blog about the number of machines Google is using for running MapReduce:
Well, on the one hand, the machines probably have four cores (so 1/4 the machines), but the average utilization rate is probably a lot lower than 100%, probably more like 20-30%. So, I’d guess that 500k+ machines is a decent rough estimate for machines dedicated to MapReduce in the Google data centers based on the data they released.
What do others think? Roughly 500k physical machines a good estimate?
Could anyone confirm these numbers?
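To make Greg Linden's back-of-envelope reasoning concrete, here is a minimal sketch of the adjustment he describes. The starting figure is a hypothetical placeholder; only the two adjustments (roughly 4 cores per machine, and utilization around 20–30% rather than 100%) come from the quote above.

```python
# A sketch of the back-of-envelope logic in Greg Linden's comment.
# The naive input count is hypothetical -- Google published no machine count.

def machine_estimate(naive_single_core_machines: float,
                     cores_per_machine: int = 4,
                     utilization: float = 0.25) -> float:
    """Adjust a naive single-core, 100%-utilization machine count."""
    # More cores per box means proportionally fewer boxes...
    boxes = naive_single_core_machines / cores_per_machine
    # ...but real utilization well below 100% pushes the count back up.
    return boxes / utilization

# With 4 cores and ~25% utilization the two adjustments roughly cancel,
# which is why the estimate lands near the naive figure.
print(machine_estimate(500_000))
```

Note how, at these particular (assumed) values, the core-count and utilization corrections offset each other, so the final estimate stays in the same 500k+ ballpark Linden suggests.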
- Krishna Sankar: ☞ Google – A Study In Scalability And A Little Systems Horse Sense
- ☞ Jeff Dean’s talk at Stanford (Windows video)
- Greg Linden: ☞ An update on Google’s infrastructure
- Jeff Dean’s talk at WSDM 2009 ☞ video and ☞ slides (PDF)
- Greg Linden notes on the above talk: ☞ here, ☞ here, and ☞ here
Original title and link: Google: A Study in Scalability, MapReduce Evolution (NoSQL databases © myNoSQL)