Krishna Sankar has a ☞ great summary of a recent talk by Google’s Jeff Dean on Google’s systems and infrastructure:
An interesting set of statistics of MapReduce over time:
- MapReduce at Google, now at 4 million jobs; processing ~1000 PB with 130 PB intermediate data and 45 PB output
- Data has doubled while the number of machines has stayed constant from ‘07 to ‘10.
- Machine usage has quadrupled while job completion has doubled from ‘07 to ‘10.
- Trivia: Jeff shared an anecdote about network engineers rewiring the network while Jeff & Co. were running a MapReduce job. They lost machines in a strange pattern and wondered what was going on, but the job succeeded, a little slower than normal, and of course the machines came back up! Only after the fact did they hear about the network rewiring!
The talk generated some interesting comments on ☞ Greg Linden’s blog about the number of machines Google is using for running MapReduce:
Well, on the one hand, the machines probably have four cores (so 1/4 the machines), but the average utilization rate is probably a lot lower than 100%, probably more like 20-30%. So, I’d guess that 500k+ machines is a decent rough estimate for machines dedicated to MapReduce in the Google data centers based on the data they released.
What do others think? Roughly 500k physical machines a good estimate?
Could anyone confirm these numbers?
- Krishna Sankar: ☞ Google – A Study In Scalability And A Little Systems Horse Sense
- ☞ Jeff Dean’s talk at Stanford (windows video)
- Greg Linden: ☞ An update on Google’s infrastructure
- Jeff Dean’s talk at WSDM 2009 ☞ video and ☞ slides (PDF)
- Greg Linden notes on the above talk: ☞ here, ☞ here, and ☞ here
Original title and link: Google: A Study in Scalability, MapReduce Evolution (NoSQL databases © myNoSQL)
Good preso about Pregel:
The slides talk about:
- Pregel compute model
- Pregel C++ API
- implementation details
- fault tolerance
- workers, master, and aggregators
Chad DePue on Phoebus, the first (?) open source implementation of Google’s Pregel algorithm:
Essentially, Phoebus makes calculating data for each vertex and edge in parallel possible on a cluster of nodes. Makes me wish I had a massively large graph to test it with.
Developed by Arun Suresh (Yahoo!), the project ☞ page includes a bullet description of the Pregel computational model:
- A Graph is partitioned into groups of Records.
- A Record consists of a Vertex and its outgoing Edges (An Edge is a Tuple consisting of the edge weight and the target vertex name).
- A User specifies a ‘Compute’ function that is applied to each Record.
- Computation on the graph happens in a sequence of incremental SuperSteps.
- At each SuperStep, the Compute function is applied to all ‘active’ vertices of the graph.
- Vertices communicate with each other via Message Passing.
- The Compute function is provided with the Vertex record and all Messages sent to the Vertex in the previous SuperStep.
- A Compute function can:
- Mutate the value associated with a vertex
- Add/Remove outgoing edges.
- Mutate Edge weight
- Send a Message to any other vertex in the graph.
- Change state of the vertex from ‘active’ to ‘hold’.
- At the beginning of each SuperStep, if there are no more active vertices and no messages to be sent to any vertex, the algorithm terminates.
- A User may additionally specify a ‘MaxSteps’ to stop the algorithm after a given number of SuperSteps.
- A User may additionally specify a ‘Combine’ function that is applied to all the Messages targeted at a Vertex before the Compute function is applied to it.
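To make the model above concrete, here is a minimal, single-process sketch of the SuperStep loop in Python. Phoebus itself is written in Erlang and distributes Records across workers, so the record shapes and function signatures below are purely illustrative assumptions, not Phoebus’s API:

```python
from collections import defaultdict

# Illustrative sketch of the SuperStep loop described above (not Phoebus's API).
def run(records, compute, combine=None, max_steps=30):
    """records: {vertex: {"value": v, "edges": [(weight, target), ...]}}
    compute(vertex, record, messages) may mutate the record and returns
    (outgoing, still_active), where outgoing is a list of (target, message)."""
    active = set(records)                # all vertices start 'active'
    inbox = defaultdict(list)            # messages delivered this SuperStep

    for _ in range(max_steps):
        # Termination condition from the list above: no active vertices, no messages.
        if not active and not inbox:
            break
        outbox = defaultdict(list)
        # A vertex runs if it is active or has messages waiting for it.
        for v in set(active) | set(inbox):
            outgoing, still_active = compute(v, records[v], inbox.get(v, []))
            for target, msg in outgoing:
                outbox[target].append(msg)
            if still_active:
                active.add(v)
            else:
                active.discard(v)        # vertex votes to go on 'hold'
        # Optional 'Combine' step: collapse messages per target before delivery.
        if combine is not None:
            inbox = defaultdict(list, {t: [combine(msgs)] for t, msgs in outbox.items()})
        else:
            inbox = outbox
    return records
```

A ‘Compute’ function for single-source shortest paths, for example, would keep the minimum of its incoming distance messages, forward improved distances along its outgoing Edges, and go to ‘hold’ once its value stops changing.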
While it sounds similar to MapReduce, Pregel is optimized for graph operations: it reduces I/O, ensures data locality, and preserves processing state between phases.
Original title and link: Phoebus: Erlang-based Implementation of Google’s Pregel (NoSQL databases © myNoSQL)
In light of the ☞ Google Caffeine announcement (a summary of a summary: Google replaced MapReduce-based index updates with a new engine that provides more timely updates), ☞ Tony Bain is wondering whether Michael Stonebraker and DeWitt’s paper ☞ MapReduce: a major step backwards hasn’t thus been proved correct:
Firstly, were Stonebraker and DeWitt right? Is it red-faced time for those who came out and aggressively defended the Map/Reduce architecture?
And secondly, what impact does this have on the future of Map/Reduce now that those responsible for its popularity seem to have migrated their key use case? Is the proposition for Map/Reduce today still just as good now that Google doesn’t do it? (Yes, I am sure Google still uses Map/Reduce extensively and this is a bit tongue in cheek. But the primary quoted example relates to building the search index, which is what, reportedly, has been moved away from MR.)
While all these questions seem to be appropriate, I think some details could help with finding the correct answers.
Firstly, I think Google’s decision to “drop” MapReduce-based index updates was determined by their particular implementation and their storage strategy. Simply put, Google’s MapReduce-based index updates required reprocessing all the data, so providing timely updates was more or less impossible. But as proved by CouchDB’s MapReduce implementation, this approach is not the only one possible. CouchDB views are built by running a pair of map and reduce functions and storing the results in B-trees. For updates, CouchDB doesn’t need to reprocess all the initial data and rebuild the index from scratch; it only applies the changes from the updated documents. In this regard, Stonebraker seems to have been right when saying that it is “a sub-optimal implementation, in that it uses brute force instead of indexing”.
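To illustrate the difference, here is a rough sketch, in Python rather than CouchDB’s JavaScript, of what incremental view maintenance looks like: only the changed document is re-mapped, and only the affected keys are re-reduced. The class and its plain dicts stand in for CouchDB’s B-tree-backed views and are assumptions for illustration only:

```python
# Incremental map/reduce view maintenance sketch; dicts stand in for B-trees.
class View:
    def __init__(self, map_fn, reduce_fn):
        self.map_fn, self.reduce_fn = map_fn, reduce_fn
        self.rows_by_doc = {}        # doc_id -> [(key, value), ...]
        self.rows_by_key = {}        # key -> [value, ...]

    def update(self, doc_id, doc):
        # Remove the rows previously emitted by this document...
        for key, value in self.rows_by_doc.pop(doc_id, []):
            self.rows_by_key[key].remove(value)
        # ...and re-map only the changed document (doc=None means deletion).
        new_rows = list(self.map_fn(doc)) if doc is not None else []
        self.rows_by_doc[doc_id] = new_rows
        for key, value in new_rows:
            self.rows_by_key.setdefault(key, []).append(value)

    def query(self, key):
        # Re-reduce just the values stored under one key.
        return self.reduce_fn(self.rows_by_key.get(key, []))

# Example: counting posts per tag, updated one document at a time.
view = View(map_fn=lambda doc: [(tag, 1) for tag in doc["tags"]],
            reduce_fn=sum)
view.update("post-1", {"tags": ["nosql", "mapreduce"]})
view.update("post-2", {"tags": ["mapreduce"]})
print(view.query("mapreduce"))   # 2
```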
While Hadoop, the most well-known MapReduce implementation, closely follows Google’s design, that doesn’t mean there isn’t work being done to improve its behavior for special scenarios like real-time stream processing, cascading, etc.
As regards the questions about the impact of Google’s announcement on MapReduce adoption, a look at the reports from the Hadoop Summit shows that for quite some time the biggest proponents of MapReduce (in its Hadoop incarnation) have been Yahoo!, Facebook, Twitter, and other such companies. And, as I’ve said before, it sounds like Hadoop is actually processing more data than Google’s MapReduce.
Last but not least, as with any NoSQL technology, none of this means that MapReduce or Hadoop will fit all scenarios.
At Google I/O 2010, Google announced, but has not yet launched, a new API for ad-hoc analysis, reporting, and data exploration of massively large datasets: ☞ BigQuery. What I find interesting is that BigQuery uses ☞ an SQL flavor, instead of MapReduce or Hive or Pig.
It still strikes me that Google hasn’t yet figured out a way to expose access to their MapReduce implementation. Judging by the numbers in the industry, I’d say that by now Hadoop is probably handling the largest volumes of data.
One of the problems mentioned when discussing relational database scalability is that storage-enforced relationships, ACID, and scale do not play well together. In the NoSQL space there is a category of storage solutions built for highly interconnected data: graph databases. (Note that some of these graph databases are also transactional.)
Lately there have been quite a few interesting discussions related to scaling graph databases. Alex Averbuch is working on a thesis on sharding Neo4j, and his recent post presents some of the possible solutions. Alex’s article is a very good starting point for anyone interested in scaling graph databases.
Then there is also an article on InfoGrid‘s blog presenting a different, web-like solution based on a custom protocol: XPRISO (eXtensible Protocol for the Replication, Integration and Synchronization of distributed Objects). While I haven’t had the chance to dig deeper into InfoGrid’s suggested approach, one thing caught my attention right away: the association with web scale is definitely an interesting idea, but requiring specific knowledge of a node’s location and a custom API to reach it doesn’t seem to be the best solution. The web addressed this by giving every reachable resource a URI; InfoGrid could try a similar idea and get rid of the separate APIs for accessing local vs. remote nodes.
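To make the URI point more concrete, here is a small hypothetical sketch of what uniform, URI-based node addressing could look like. It does not reflect InfoGrid’s or XPRISO’s actual API; every name in it is made up for illustration:

```python
from urllib.parse import urlparse

# Hypothetical store: callers always look nodes up by URI, and the store
# decides whether the node is local or must be fetched from a peer.
class GraphStore:
    def __init__(self, local_authority, fetch_remote):
        self.local_authority = local_authority   # e.g. "graph-a.example.com"
        self.local_nodes = {}                    # uri -> node dict
        self.fetch_remote = fetch_remote         # fn(uri) -> node dict

    def get(self, uri):
        # Same call regardless of where the node lives.
        if urlparse(uri).netloc == self.local_authority:
            return self.local_nodes[uri]
        return self.fetch_remote(uri)

# Usage: traversal code never branches on local vs. remote.
store = GraphStore("graph-a.example.com",
                   fetch_remote=lambda uri: {"uri": uri, "edges": []})
store.local_nodes["http://graph-a.example.com/nodes/1"] = {
    "uri": "http://graph-a.example.com/nodes/1",
    "edges": ["http://graph-b.example.com/nodes/7"],
}
node = store.get("http://graph-a.example.com/nodes/1")
neighbor = store.get(node["edges"][0])   # transparently resolved remotely
```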
Update: make sure you check the comment thread for more details about InfoGrid’s perspective on scaling graph databases.
Oren Eini concludes in his post:
After spending some time thinking about it, I came to the conclusion that I can’t envision any general way to solve the problem. Oh, I can think of several ways to reduce the problem:
- Batching cross machine queries so we only perform them at the close of each breadth first step.
- Storing multiple levels of associations (so “users/ayende” would store not only its own relations but also the relations of its relations, e.g. “users/arik”’s relations).
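The first idea lends itself to a simple sketch: instead of resolving each remote edge as it is discovered, collect the frontier per shard and issue one batched request per shard per breadth-first level. The shard_of and fetch_neighbors_batch functions below are assumptions for illustration, not any particular database’s API:

```python
from collections import defaultdict

def bfs_batched(start, max_depth, shard_of, fetch_neighbors_batch):
    """shard_of(vertex) -> shard id; fetch_neighbors_batch(shard, vertices)
    -> {vertex: [neighbor, ...]} for all requested vertices on that shard."""
    visited, frontier = {start}, {start}
    for _ in range(max_depth):
        # Group the frontier by owning shard so each shard gets one request.
        by_shard = defaultdict(list)
        for vertex in frontier:
            by_shard[shard_of(vertex)].append(vertex)
        next_frontier = set()
        for shard, vertices in by_shard.items():
            # One round trip per shard per level instead of one per vertex.
            for neighbors in fetch_neighbors_batch(shard, vertices).values():
                next_frontier.update(neighbors)
        frontier = next_frontier - visited
        if not frontier:
            break
        visited |= frontier
    return visited
```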
While I haven’t had enough time to think about this topic, my gut feeling is that possible solutions are to be found in the space of a combination of using unique identifiers for distributed nodes and a mapreduce-like approach. I cannot stop wondering if this is not what Google’s Pregel is doing (nb I should have read the paper (pdf) firstly).