Krishna Sankar has a ☞ great summary of a recent talk by Google’s Jeff Dean on Google’s systems and infrastructure:
An interesting set of statistics about MapReduce over time (a quick per-job breakdown in code follows the list):
- MapReduce at Google, now at 4 million jobs; processing ~1000 PB with 130 PB intermediate data and 45 PB output
- Data has doubled while the number of machines has remained constant from ‘07 to ‘10.
- Machine usage has quadrupled while job completions have doubled from ‘07 to ‘10.
- Trivia: Jeff shared an anecdote about network engineers rewiring the network while he and his colleagues were running a MapReduce job. They lost machines in a strange pattern and wondered what was going on, but the job succeeded, only a little slower than normal, and of course the machines came back up! Only after the fact did they hear about the network rewiring!
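A quick back-of-envelope on those aggregate numbers, plain arithmetic on the figures above (units assumed to be decimal petabytes):

```python
# Rough per-job averages implied by the quoted aggregate statistics
jobs = 4_000_000
input_pb, intermediate_pb, output_pb = 1000, 130, 45

pb_to_gb = 1_000_000  # 1 PB = 1,000,000 GB in decimal units
print(f"~{input_pb * pb_to_gb / jobs:,.0f} GB input per job")            # ~250 GB
print(f"~{intermediate_pb / input_pb:.0%} intermediate-to-input ratio")  # 13%
print(f"~{output_pb / input_pb:.1%} output-to-input ratio")              # 4.5%
```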
The talk generated some interesting comments on ☞ Greg Linden’s blog about the number of machines Google is using for running MapReduce:
Well, on the one hand, the machines probably have four cores (so 1/4 the machines), but the average utilization rate is probably a lot lower than 100%, probably more like 20-30%. So, I’d guess that 500k+ machines is a decent rough estimate for machines dedicated to MapReduce in the Google data centers based on the data they released.
What do others think? Is roughly 500k physical machines a good estimate?
Could anyone confirm these numbers?
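Greg Linden’s back-of-envelope can be written out explicitly. A minimal sketch in Python; the naive single-core, 100%-utilization machine count is a hypothetical placeholder standing in for whatever the published job statistics imply, not a number from the talk:

```python
# Back-of-envelope following Greg Linden's reasoning.
# naive_machines is a HYPOTHETICAL input: the machine count the published
# MapReduce statistics would imply for single-core, fully utilized machines.
naive_machines = 500_000

cores_per_machine = 4   # "the machines probably have four cores"
avg_utilization = 0.25  # "probably more like 20-30%"

# Four cores cut the count to a quarter; ~25% utilization inflates it
# back by a factor of four, so the two corrections roughly cancel.
estimate = naive_machines / cores_per_machine / avg_utilization
print(f"~{estimate:,.0f} machines dedicated to MapReduce")  # ~500,000
```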
- Krishna Sankar: ☞ Google – A Study In Scalability And A Little Systems Horse Sense
- ☞ Jeff Dean’s talk at Stanford (Windows video)
- Greg Linden: ☞ An update on Google’s infrastructure
- Jeff Dean’s talk at WSDM 2009 ☞ video and ☞ slides (PDF)
- Greg Linden notes on the above talk: ☞ here, ☞ here, and ☞ here
Original title and link: Google: A Study in Scalability, MapReduce Evolution (NoSQL databases © myNoSQL)
Good preso about Pregel:
The slides talk about:
- Pregel compute model
- Pregel C++ API
- Implementation details
- Fault tolerance
- Workers, master, and aggregators
Chad DePue writes about Phoebus, the first (?) open source implementation of Google’s Pregel model:
Essentially, Phoebus makes calculating data for each vertex and edge in parallel possible on a cluster of nodes. Makes me wish I had a massively large graph to test it with.
Phoebus was developed by Arun Suresh (Yahoo!); the project ☞ page includes a bullet-point description of the Pregel computational model (a minimal sketch in code follows the list):
- A Graph is partitioned into groups of Records.
- A Record consists of a Vertex and its outgoing Edges (an Edge is a Tuple consisting of the edge weight and the target vertex name).
- A User specifies a ‘Compute’ function that is applied to each Record.
- Computation on the graph happens in a sequence of incremental SuperSteps.
- At each SuperStep, the Compute function is applied to all ‘active’ vertices of the graph.
- Vertices communicate with each other via Message Passing.
- The Compute function is provided with the Vertex record and all Messages sent to the Vertex in the previous SuperStep.
- A Compute function can:
  - Mutate the value associated with a Vertex
  - Add/remove outgoing Edges
  - Mutate Edge weights
  - Send a Message to any other Vertex in the graph
  - Change the state of the Vertex from ‘active’ to ‘hold’
- At the beginning of each SuperStep, if there are no more active Vertices and there are no Messages to be sent to any Vertex, the algorithm terminates.
- A User may additionally specify a ‘MaxSteps’ to stop the algorithm after a given number of SuperSteps.
- A User may additionally specify a ‘Combine’ function that is applied to all the Messages targeted at a Vertex before the Compute function is applied to it.
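To make the model concrete, here is a minimal single-process sketch of the SuperStep loop in Python. It illustrates the computational model above, not Phoebus’s actual Erlang implementation; the names (Vertex, run_supersteps, compute) are invented for the example, and the Combine step is omitted:

```python
from collections import defaultdict

class Vertex:
    """A Record: a Vertex value plus its outgoing Edges."""
    def __init__(self, name, value, edges):
        self.name = name      # vertex name, used as the message address
        self.value = value    # mutable per-vertex value
        self.edges = edges    # outgoing Edges: (weight, target_name) tuples
        self.active = True    # 'active' vs 'hold' state
        self.outbox = []      # (target_name, message) pairs to deliver

def run_supersteps(vertices, compute, max_steps=30):
    """Apply compute to all 'active' vertices, one SuperStep at a time,
    until no vertex is active and no messages are in flight
    (or until the MaxSteps bound is reached)."""
    inbox = defaultdict(list)
    for step in range(max_steps):
        if not inbox and not any(v.active for v in vertices.values()):
            break  # the termination condition described above
        outbox = defaultdict(list)
        for v in vertices.values():
            msgs = inbox.get(v.name, [])
            if v.active or msgs:  # an incoming message reactivates a held vertex
                compute(v, msgs, step)
                for target, message in v.outbox:
                    outbox[target].append(message)
                v.outbox.clear()
        inbox = outbox  # messages are delivered in the next SuperStep

# Example Compute function: propagate the maximum value to every vertex.
def compute(v, messages, step):
    new_value = max([v.value] + messages)
    if step == 0 or new_value > v.value:
        v.value = new_value
        for _weight, target in v.edges:  # send the value along outgoing edges
            v.outbox.append((target, v.value))
    v.active = False  # 'hold' until a new Message arrives

vertices = {
    "a": Vertex("a", 3, [(1, "b")]),
    "b": Vertex("b", 6, [(1, "a"), (1, "c")]),
    "c": Vertex("c", 2, [(1, "b")]),
}
run_supersteps(vertices, compute)
print({name: v.value for name, v in vertices.items()})
# {'a': 6, 'b': 6, 'c': 6}
```

After a few SuperSteps every vertex holds the maximum value, all vertices are on ‘hold’, no messages are in flight, and the loop terminates.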
While it sounds similar to MapReduce, Pregel is optimized for graph operations: it reduces I/O, ensures data locality, and preserves processing state between SuperSteps.
Original title and link: Phoebus: Erlang-based Implementation of Google’s Pregel (NoSQL databases © myNoSQL)
In the light of the ☞ Google Caffeine announcement (a summary of a summary: Google replaced MapReduce-based index updates with a new engine that provides more timely updates), ☞ Tony Bain is wondering if Michael Stonebraker and DeWitt’s paper ☞ MapReduce: a major step backwards hasn’t thus been proven correct:
Firstly, were Stonebraker and DeWitt right? Is it red-faced time for those who came out and aggressively defended the Map/Reduce architecture?
And secondly, what impact does this have on the future of Map/Reduce now that those responsible for its popularity seem to have migrated their key use case? Is the proposition for Map/Reduce today still just as good now that Google doesn’t do it? (Yes, I am sure Google still uses Map/Reduce extensively and this is a bit tongue in cheek. But the primary quoted example relates to building the search index, which is what has reportedly been moved away from MR.)
While all these questions seem appropriate, I think a few details can help in finding the correct answers.
Firstly, I think Google’s decision to “drop” MapReduce-based index updates was determined by their particular implementation and storage strategy. Simply put, Google’s MapReduce-based index updates required reprocessing all the data, so providing timely updates was more or less impossible. But as CouchDB’s MapReduce implementation proves, this approach is not the only one possible. CouchDB views are built by running a pair of map and reduce functions and storing the results in B-trees. For updates, CouchDB doesn’t need to reprocess all the initial data and rebuild the index from scratch; it only applies the changes from the updates. In this regard, Stonebraker seems to have been right when saying that it is “a sub-optimal implementation, in that it uses brute force instead of indexing”.
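Here is a minimal sketch of that incremental idea in Python. This is not CouchDB’s actual implementation (a real view stores map output and partial reduce values in a B-tree; a couple of dicts stand in for it here), but it shows why an update only touches the changed documents:

```python
# Sketch of incremental map/reduce view maintenance, CouchDB-style.
from collections import defaultdict

class IncrementalView:
    def __init__(self, map_fn, reduce_fn):
        self.map_fn = map_fn              # doc -> iterable of (key, value)
        self.reduce_fn = reduce_fn        # (key, [values]) -> value
        self.by_doc = {}                  # doc_id -> map output for that doc
        self.index = defaultdict(list)    # key -> [value, ...]

    def update(self, doc_id, doc):
        """Re-index a single changed document; untouched documents keep
        their previously computed map output."""
        for key, value in self.by_doc.pop(doc_id, []):
            self.index[key].remove(value)  # retract the old contributions
        rows = list(self.map_fn(doc))
        self.by_doc[doc_id] = rows
        for key, value in rows:
            self.index[key].append(value)

    def query(self, key):
        return self.reduce_fn(key, self.index[key])

# Example: count words across documents.
view = IncrementalView(
    map_fn=lambda doc: ((word, 1) for word in doc.split()),
    reduce_fn=lambda key, values: sum(values),
)
view.update("doc1", "nosql nosql mapreduce")
view.update("doc2", "mapreduce")
print(view.query("nosql"))        # 2
view.update("doc1", "mapreduce")  # only doc1 is re-mapped
print(view.query("nosql"))        # 0
print(view.query("mapreduce"))    # 2
```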
While Hadoop, the best-known MapReduce implementation, closely follows Google’s design, that doesn’t mean there is no work being done to improve its behavior for special scenarios like real-time stream processing, cascading, etc.
As regards the questions about the impact of Google’s announcement on MapReduce adoption, I’d say that, judging by the reports from the Hadoop Summit, we would all agree that for quite some time the biggest proponents of MapReduce (in its Hadoop incarnation) have been Yahoo!, Facebook, Twitter, and other such companies. And, as I said before, it sounds like Hadoop is actually processing more data than Google’s MapReduce.
Last but not least, as with any NoSQL technology, none of this means that MapReduce or Hadoop will fit all scenarios.