ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

Google: All content tagged as Google in NoSQL databases and polyglot persistence

Google Megastore Paper Summarized

In case you didn’t read the Google Megastore paper[1], James Hamilton has published his notes on the paper:

Overall, an excellent paper with lots of detail on a nicely executed storage system. Supporting consistent read and full ACID update semantics is impressive although the limitation of not being able to update an entity group at more than a “few per second” is limiting.

Original title and link: Google Megastore Paper Summarized (NoSQL databases © myNoSQL)

via: http://perspectives.mvdirona.com/2011/01/09/GoogleMegastoreTheDataEngineBehindGAE.aspx


Google Megastore: Scalable, Highly Available Storage for Interactive Services

A new paper from Google:

Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability.
We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters.
This paper describes Megastore’s semantics and replication algorithm.

Megastore seems to be the solution behind the Google App Engine high replication datastore.

Emphases are mine.

Original title and link: Google Megastore: Scalable, Highly Available Storage for Interactive Services (NoSQL databases © myNoSQL)

via: http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf


Google: A Study in Scalability, MapReduce Evolution

Krishna Sankar has a ☞ great summary of a recent talk of Google’s Jeff Dean on Google’s systems and infrastructure:

An interesting set of statistics of MapReduce over time:

  • MapReduce at Google, now at 4 million jobs; processing ~1000 PB with 130 PB intermediate data and 45 PB output
  • Data has doubled while the number of machines have been constant from ‘07 to ‘10.
  • Machine usage has quadrupled while the job completion has doubled ‘07 to ‘10
  • Trivia : Jeff shared an anecdote where the network engineers were rewiring the network while Jeff & Co were running MapReduce. They did lose machines in a strange pattern and were wondering what is going on; but the job succeeded, a little slower than normal and of course, the machines came back up ! Only after the fact did they hear about the network rewiring !
Google MapReduce stats

The talk generated some interesting comments on ☞ Greg Linden’s blog about the number of machines Google is using for running MapReduce:

Well, on the one hand, the machines probably have four cores (so 1/4 the machines), but the average utilization rate is probably a lot lower than 100%, probably more like 20-30%. So, I’d guess that 500k+ machines is a decent rough estimate for machines dedicated to MapReduce in the Google data centers based on the data they released.

What do others think? Roughly 500k physical machines a good estimate?

Could anyone confirm these numbers?

@evan

Resources

Original title and link: Google: A Study in Scalability, MapReduce Evolution (NoSQL databases © myNoSQL)


Goodbye Google App Engine (GAE)

[…] developing on GAE introduced such a design complexity that working around it pushes us 5 months behind schedule. Now that we had developed tools and workarounds for all the problems we found in GAE, we were starting being fast with the development of features eventually, and at that point we found the cloud was totally unstable, doggy.

Carlos Ble’s post lists 13 issues they’ve had with Google App Engine. Now, I know Google App Engine is not offering the best experience or performance, has a couple of (interesting?) constraints and Google support is (how to put it gently?) completely missing, but some of the points listed in there make me think they’ve thrown themselves in heads first without trying to understand:

  • what kind of persistency is Google App Engine offering
  • how to model and access your data in Google App Engine
  • data migration paths
  • architecting with SLAs in mind

These should sound kind of familiar to everyone building distributed systems using NoSQL databases. Going live after prototyping a “Hello world” with a NoSQL database will not gonna “scale” :-).

Original title and link: Goodbye Google App Engine (GAE) (NoSQL databases © myNoSQL)

via: http://www.carlosble.com/?p=719


Szl: A Compiler and Runtime for the Sawzall Language

Google open sourced szl an implementation of ☞ Sawzall:

Szl is a compiler and runtime for the Sawzall language. It includes support for statistical aggregation of values read or computed from the input. Google uses Sawzall to process log data generated by Google’s servers.

Since a Sawzall program processes one record of input at a time and does not preserve any state (values of variables) between records, it is well suited for execution as the map phase of a map-reduce. The library also includes support for the statistical aggregation that would be done in the reduce phase of a map-reduce.

You can probably think of it as Flume and Pig without Hadoop.

Original title and link: Szl: A Compiler and Runtime for the Sawzall Language (NoSQL databases © myNoSQL)

via: http://code.google.com/p/szl/


Pregel: Graph Processing at Large-Scale

Good preso about Pregel:

The slides talk about:

  • Pregel compute model
  • Pregel C++ API
  • implementation details
  • fault tolerance
  • workers, master, and aggregators

As mentioned before Pregel is MapReduce for graphs. And besides Google’s implementation we’ll probably never see, there’s Phoebus, an Erlang implementation of Pregel.

Original title and link: Pregel: Graph Processing at Large-Scale (NoSQL databases © myNoSQL)


Google's Dremel: Can MapReduce Handle Fast, Interactive Querying?

A bit old, but thought that in the light of latest posts about MapReduce and Hadoop future, it made sense to have this piece of the puzzle too.

Tasso Argyros (AsterData):

Native MapReduce execution is not fundamentally slow; however Google’s MapReduce and Hadoop happen to be oriented more towards batch processing. Dremel tries to overcome that by building a completely different system that speeds interactive querying.

The Dremel: Interactive Analysis of Web-Scale Datasets paper can be downloaded from ☞ here (PDF).

Original title and link: Google’s Dremel: Can MapReduce Handle Fast, Interactive Querying? (NoSQL databases © myNoSQL)

via: http://www.asterdata.com/blog/index.php/2010/07/19/google%E2%80%99s-dremel-%E2%80%93-or-can-mapreduce-itself-handle-fast-interactive-querying/


Phoebus: Erlang-based Implementation of Google’s Pregel

Chad DePue about Phoebus, the first (?) open source implementation of Google’s Pregel algorithm:

Essentially, Phoebus makes calculating data for each vertex and edge in parallel possible on a cluster of nodes. Makes me wish I had a massively large graph to test it with.

Developed by Arun Suresh (Yahoo!), the project ☞ page includes a bullet description of the Pregel computational model:

  • A Graph is partitioned into a groups of Records.
  • A Record consists of a Vertex and its outgoing Edges (An Edge is a Tuple consisting of the edge weight and the target vertex name).
  • A User specifies a ‘Compute’ function that is applied to each Record.
  • Computation on the graph happens in a sequence of incremental Super Steps.
  • At each Super step, the Compute function is applied to all ‘active’ vertices of the graph.
  • Vertices communicate with each other via Message Passing.
  • The Compute function is provided with the Vertex record and all Messages sent to the Vertex in the previous SuperStep.
  • A Compute funtion can:
    • Mutate the value associated to a vertex
    • Add/Remove outgoing edges.
    • Mutate Edge weight
    • Send a Message to any other vertex in the graph.
    • Change state of the vertex from ‘active’ to ‘hold’.
  • At the begining of each SuperStep, if there are no more active vertices -and- if there are no messages to be sent to any vertex, the algorithm terminates.
  • A User may additionally specify a ‘MaxSteps’ to stop the algorithm after a some number of super steps.
  • A User may additionally specify a ‘Combine’ funtion that is applied to the all the Messages targetted at a Vertex before the Compute function is applied to it.

While it sounds similar to mapreduce, Pregel is optimized for graph operations, by reducing I/O, ensuring data locality, but also preserving processing state between phases.

Original title and link: Phoebus: Erlang-based Implementation of Google’s Pregel (NoSQL databases © myNoSQL)


MapReduce and Hadoop Future

In the light of ☞ Google Caffeine announcement — a summary of a summary would be that Google replaced MapReduce-based index updates with a new engine that would provide more timely updates — ☞ Tony Bain is wondering if Michael Stonebraker and DeWitt’ paper ☞ MapReduce: a major step backwards hasn’t thus been proved to be correct:

Firstly, was Stonebraker and Dewitt right? It is red faced time for those who came out and aggressively defended the Map/Reduce architecture?

And secondly what impact does this have on the future of Map/Reduce now those responsible for its popularity seem to have migrated their key use case? Is the proposition for Map/Reduce today still just as good now the Google don’t do it? (Yes I am sure Google still use Map/Reduce extensively and this is a bit tongue in cheek. But the primary quoted example relates to building the search index which is what, reportedly, has been moved away from MR).

While all these questions seem to be appropriate, I think some details could help with finding the correct answers.

Firstly, I think Google’s decission to “drop” MapReduce-based index updates was determined by their particular implementation and their storage strategy. Simply put, Google’s MapReduce-based index updates required reprocessing of data, so providing timely updates was more or less impossible. But as proved by CouchDB mapreduce implementation this approach is not the only one possible. CouchDB views are built as a result of running a pair of map and reduce functions and storing it in btrees. As for updates, CouchDB doesn’t need to reprocess all initial data and rebuild the index from scratch, but only apply changes from the updates. In this regard, Stonebraker seem to have been right when saying that it is “a sub-optimal implementation, in that it uses brute force instead of indexing”.

While Hadoop, the most well know mapreduce implementation, is following closely Google’s design, that doesn’t mean that there isn’t work done to improve its behavior for special scenarios like real-time stream processing, cascading, etc.

As regards the questions related to the impact of Google’s announcement on MapReduce adoption, I’d say that taking a look at the reports from the Hadoop Summit we all would agree that for quite some time the biggest proponents of MapReduce (in its Hadoop incarnation) have been Yahoo!, Facebook, Twitter, and other such companies. And, as I said it before, it sounds like Hadoop is actually processing more data than Google’s MapReduce .

Last, but not least, as with any NoSQL technology all these do not mean that MapReduce or Hadoop will fit all scenarios.

Original title and link: MapReduce Future (NoSQL databases © myNoSQL)


Google BigTable Paper Summarized

The slides below summarizing the Google BigTable paper are the result of a NOSQLSummer meeting in Tokyo. Nice!

Update: I just realized that the company that hosted this meeting, Gemini Mobile Technologies, is the same that announced yesterday the new key-value store Hibari


Google BigQuery SQL-like API

Google has announced at GoogleIO 2010, but didn’t launch yet, a new API for ad-hoc analysis, reporting, data exploration of massively large datasets: ☞ BigQuery. What I find interesting is that, BigQuery is using ☞ an SQL flavor, instead of MapReduce or Hive or PIG.

It still strikes me that Google hasn’t figured out yet a way to expose access to their MapReduce implementation. Judging by the numbers in the industry, I’d say that by now Hadoop is probably handling the largest volumes of data.


An Interesting Problem: Scaling Graph Databases

One of the problems mentioned when discussing relational databases scalability is that handling storage enforced relationships, ACID and scale do not play well together. In the NoSQL space there is a category of storage solutions that uses highly interconnected data: graph databases. (note also that some of these graph databases are also transactional).

Lately there have been quite a few interesting discussions related to scaling graph databases. Alex Averbuch is working on a sharding Neo4j thesis and his recent post presents some of the possible solutions. Alex’s article is a very good starting point for anyone interesting in scaling graph databases.

Then there is also this article on InfoGrid‘s blog that is presenting a different web-like solution based on a custom protocol: XPRISO: eXtensible Protocol for the Replication, Integration and Synchronization of distributed Objects. While I haven’t had the chance to dig deeper into InfoGrid suggested approach there was one thing that caught my attention right away: while the association with web-scale is definitely an interesting idea, having specific knowledge of the nodes location and having to use custom API for it doesn’t seem to be the best solution. Basically the web addressed this by having URIs for each reachable resource (InfoGrid should try a similar idea, get rid of the different API for accessing local vs remote nodes, etc.)

Update: make sure you check the comment thread for more details about InfoGrid perspective on scaling graph databases.

Oren Eini concludes in his post:

After spending some time thinking about it, I came to the conclusion that I can’t envision any general way to solve the problem. Oh, I can think of several ways of reduce the problem:

  • Batching cross machine queries so we only perform them at the close of each breadth first step.
  • Storing multiple levels of associations (So “users/ayende” would store its relations but also “users/ayende”’s relation and “users/arik”’s relations).

While I haven’t had enough time to think about this topic, my gut feeling is that possible solutions are to be found in the space of a combination of using unique identifiers for distributed nodes and a mapreduce-like approach. I cannot stop wondering if this is not what Google’s Pregel is doing (nb I should have read the paper (pdf) firstly).