Google: All content tagged as Google in NoSQL databases and polyglot persistence
Krishna Sankar has a ☞ great summary of a recent talk by Google’s Jeff Dean on Google’s systems and infrastructure:
An interesting set of statistics of MapReduce over time:
- MapReduce at Google, now at 4 million jobs; processing ~1000 PB with 130 PB intermediate data and 45 PB output
- Data has doubled while the number of machines has stayed constant from ‘07 to ‘10.
- Machine usage has quadrupled while job completions have doubled from ‘07 to ‘10
- Trivia: Jeff shared an anecdote where the network engineers were rewiring the network while Jeff & Co. were running a MapReduce job. They lost machines in a strange pattern and wondered what was going on; but the job succeeded, only a little slower than normal, and of course the machines came back up! Only after the fact did they hear about the network rewiring!
The talk generated some interesting comments on ☞ Greg Linden’s blog about the number of machines Google is using for running MapReduce:
Well, on the one hand, the machines probably have four cores (so 1/4 the machines), but the average utilization rate is probably a lot lower than 100%, probably more like 20-30%. So, I’d guess that 500k+ machines is a decent rough estimate for machines dedicated to MapReduce in the Google data centers based on the data they released.
What do others think? Roughly 500k physical machines a good estimate?
Could anyone confirm these numbers?
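The comment’s back-of-envelope logic can be written out explicitly. Every number below is an assumption: the cores-per-machine and utilization figures come from the comment itself, and the single-core baseline is purely hypothetical, not a figure Google released:

```python
# Sketch of the estimate in the comment above; all inputs are assumptions.
single_core_equivalents = 500_000  # hypothetical single-core machine-equivalents
                                   # implied by the released job statistics
cores_per_machine = 4              # "the machines probably have four cores"
utilization = 0.25                 # "more like 20-30%" average utilization

# Four cores per box cuts the machine count to 1/4, but low utilization
# means roughly 4x more physical machines deliver the same useful work:
physical_machines = single_core_equivalents / cores_per_machine / utilization
print(round(physical_machines))  # → 500000
```

With these assumptions the two corrections cancel almost exactly, which is why the comment lands back on the ~500k figure.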
- Krishna Sankar: ☞ Google – A Study In Scalability And A Little Systems Horse Sense
- ☞ Jeff Dean’s talk at Stanford (Windows video)
- Greg Linden: ☞ An update on Google’s infrastructure
- Jeff Dean’s talk at WSDM 2009 ☞ video and ☞ slides (PDF)
- Greg Linden notes on the above talk: ☞ here, ☞ here, and ☞ here
Original title and link: Google: A Study in Scalability, MapReduce Evolution (NoSQL databases © myNoSQL)
Good preso about Pregel:
The slides talk about:
- Pregel compute model
- Pregel C++ API
- implementation details
- fault tolerance
- workers, master, and aggregators
Chad DePue about Phoebus, the first (?) open source implementation of Google’s Pregel algorithm:
Essentially, Phoebus makes it possible to calculate data for each vertex and edge in parallel on a cluster of nodes. Makes me wish I had a massively large graph to test it with.
Developed by Arun Suresh (Yahoo!), the project ☞ page includes a bullet description of the Pregel computational model:
- A Graph is partitioned into groups of Records.
- A Record consists of a Vertex and its outgoing Edges (An Edge is a Tuple consisting of the edge weight and the target vertex name).
- A User specifies a ‘Compute’ function that is applied to each Record.
- Computation on the graph happens in a sequence of incremental Super Steps.
- At each Super step, the Compute function is applied to all ‘active’ vertices of the graph.
- Vertices communicate with each other via Message Passing.
- The Compute function is provided with the Vertex record and all Messages sent to the Vertex in the previous SuperStep.
- A Compute function can:
- Mutate the value associated with a vertex
- Add/Remove outgoing edges.
- Mutate Edge weight
- Send a Message to any other vertex in the graph.
- Change the state of the vertex from ‘active’ to ‘hold’.
- At the beginning of each SuperStep, if there are no more active vertices -and- there are no messages to be sent to any vertex, the algorithm terminates.
- A User may additionally specify a ‘MaxSteps’ to stop the algorithm after a given number of super steps.
- A User may additionally specify a ‘Combine’ function that is applied to all the Messages targeted at a Vertex before the Compute function is applied to it.
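The model above can be sketched in a few lines of Python. This is a minimal single-process illustration of the superstep loop, using max-value propagation as the Compute function; the names (`Vertex`, `compute`, `superstep`) are illustrative and do not reflect Phoebus’s actual Erlang API:

```python
from collections import defaultdict

class Vertex:
    def __init__(self, name, value, edges):
        self.name = name
        self.value = value
        self.edges = edges      # outgoing edges: (weight, target name) tuples
        self.active = True      # 'active' vs 'hold' state

def compute(vertex, messages, send):
    """Applied to each active vertex at every SuperStep, together with
    the Messages sent to it in the previous SuperStep."""
    new_value = max([vertex.value] + messages)
    # Send on the first SuperStep (no messages yet) or when the value grows.
    if not messages or new_value > vertex.value:
        vertex.value = new_value
        for _weight, target in vertex.edges:
            send(target, new_value)
    vertex.active = False       # go on 'hold'; an incoming message reactivates

def superstep(vertices, inbox):
    """One SuperStep: run compute on every vertex that is active or has
    pending messages, collecting the messages for the next SuperStep."""
    outbox = defaultdict(list)
    for v in vertices.values():
        messages = inbox.get(v.name, [])
        if v.active or messages:
            compute(v, messages, lambda target, m: outbox[target].append(m))
    return outbox

def run(vertices, max_steps=30):
    inbox = {}
    for _ in range(max_steps):  # 'MaxSteps' safety limit
        # Terminate when no vertex is active and no messages are pending.
        if not inbox and not any(v.active for v in vertices.values()):
            break
        inbox = superstep(vertices, inbox)

# Three vertices in a cycle; every vertex converges to the maximum value.
graph = {
    "a": Vertex("a", 1, [(1.0, "b")]),
    "b": Vertex("b", 5, [(1.0, "c")]),
    "c": Vertex("c", 3, [(1.0, "a")]),
}
run(graph)
print({name: v.value for name, v in graph.items()})  # all values become 5
```

A real implementation would partition the vertices across workers and exchange the outbox over the network each superstep, but the control flow (compute, message passing, hold/reactivate, termination check) follows the bullets above.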
While it sounds similar to MapReduce, Pregel is optimized for graph operations: it reduces I/O, ensures data locality, and preserves processing state between phases.
Original title and link: Phoebus: Erlang-based Implementation of Google’s Pregel (NoSQL databases © myNoSQL)