


Google: All content tagged as Google in NoSQL databases and polyglot persistence

MongoDB and Google Megastore

The Register quoting Dwight Merriman, 10gen founder, in a post titled “MongoDB daddy: My baby beats Google BigTable”:

We read [Google’s Megastore research paper] and we were almost laughing at the similarities

I hope both the title and the quote are not really Dwight’s.

Original title and link: MongoDB and Google Megastore (NoSQL databases © myNoSQL)


Ravel Hopes to Open-Source Graph Databases

Ravel, an Austin, Texas-based company, wants to provide a supported, open-source version of Google’s Pregel software called GoldenOrb to handle large-scale graph analytics.

Is it a new graph database or a Pregel implementation? Watch the interview for yourself and tell me what you think it is.


Google Paper: Availability in Globally Distributed Storage Systems

Google paper presented at Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, 2010:

Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services. We characterize the availability properties of cloud storage systems based on an extensive one year study of Google’s main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.
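As a back-of-the-envelope illustration of why replication strategy matters (my own sketch, not a model from the paper), assuming independent replica failures:

```python
# Back-of-the-envelope sketch, not from the paper: probability that at
# least one of r replicas of a chunk is available, assuming each replica
# fails independently with probability p. One of the paper's findings is
# that real failures are correlated (racks, power domains), so this
# naive model overestimates availability.

def availability(p: float, r: int) -> float:
    """P(at least one of r independent replicas is up)."""
    return 1.0 - p ** r

# With 1% per-replica unavailability, 3-way replication yields
# "six nines" -- but only under the independence assumption.
for r in (1, 2, 3):
    print(r, availability(0.01, r))
```

The gap between this naive estimate and observed availability is exactly what the paper's correlated-failure models are meant to capture.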

Original title and link: Google Paper: Availability in Globally Distributed Storage Systems (NoSQL databases © myNoSQL)


Amazon SimpleDB, Google Megastore & CAP

Nati Shalom (Gigaspaces) pulls out a couple of references from James Hamilton’s posts[1] on the Amazon SimpleDB and Google Megastore consistency models, concluding:

It is interesting to see that the reality is that even Google and Amazon - which I would consider the extreme cases for big data - realized the limitation behind eventual consistency and came up with models that can deal with scaling without forcing a compromise on consistency as I also noted in one of my recent NoCAP series

But he leaves out small details like these:

Update rates within an entity group are seriously limited by:

  • When there is log contention, one wins and the rest fail and must be retried
  • Paxos only accepts a very limited update rate (order 10^2 updates per second)


Cross entity group updates are supported by:

  • two-phase commit with the fragility that it brings
  • queueing and asynchronously applying the changes
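The “one wins and the rest fail and must be retried” behavior can be sketched as an optimistic append to a per-entity-group log (hypothetical names, not Megastore’s actual API):

```python
# Hypothetical sketch of the "one wins, the rest fail and must be
# retried" behavior of a Megastore-style per-entity-group write log;
# class and method names are illustrative, not Megastore's actual API.

class EntityGroupLog:
    def __init__(self):
        self.head = 0      # next log position up for grabs
        self.entries = []

    def try_append(self, expected_head: int, mutation) -> bool:
        """Optimistic append, standing in for a Paxos round: succeeds
        only if no competing writer claimed this position first."""
        if expected_head != self.head:
            return False   # lost the race: log contention
        self.entries.append(mutation)
        self.head += 1
        return True

def write(log: EntityGroupLog, mutation, max_retries: int = 5) -> bool:
    """Losers re-read the log head and retry; this serialization is why
    per-group update rates stay on the order of 10^2 per second."""
    for _ in range(max_retries):
        if log.try_append(log.head, mutation):
            return True
    return False
```

Every writer to the same entity group contends for the same log position, which is the serialization bottleneck both bullets above describe.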

Original title and link: Amazon SimpleDB, Google Megastore & CAP (NoSQL databases © myNoSQL)


Google Megastore Paper Summarized

In case you didn’t read the Google Megastore paper[1], James Hamilton has published his notes on the paper:

Overall, an excellent paper with lots of detail on a nicely executed storage system. Supporting consistent read and full ACID update semantics is impressive although the limitation of not being able to update an entity group at more than a “few per second” is limiting.

Original title and link: Google Megastore Paper Summarized (NoSQL databases © myNoSQL)


Google Megastore: Scalable, Highly Available Storage for Interactive Services

A new paper from Google:

Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability.
We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters.
This paper describes Megastore’s semantics and replication algorithm.

Megastore seems to be the solution behind the Google App Engine high replication datastore.

Emphases are mine.

Original title and link: Google Megastore: Scalable, Highly Available Storage for Interactive Services (NoSQL databases © myNoSQL)


Google: A Study in Scalability, MapReduce Evolution

Krishna Sankar has a ☞ great summary of a recent talk by Google’s Jeff Dean on Google’s systems and infrastructure:

An interesting set of statistics of MapReduce over time:

  • MapReduce at Google, now at 4 million jobs; processing ~1000 PB with 130 PB intermediate data and 45 PB output
  • Data has doubled while the number of machines has been constant from ‘07 to ‘10.
  • Machine usage has quadrupled while job completion has doubled from ‘07 to ‘10.
  • Trivia: Jeff shared an anecdote where the network engineers were rewiring the network while Jeff & Co. were running MapReduce. They lost machines in a strange pattern and were wondering what was going on; but the job succeeded, a little slower than normal, and of course the machines came back up! Only after the fact did they hear about the network rewiring!
Google MapReduce stats

The talk generated some interesting comments on ☞ Greg Linden’s blog about the number of machines Google is using for running MapReduce:

Well, on the one hand, the machines probably have four cores (so 1/4 the machines), but the average utilization rate is probably a lot lower than 100%, probably more like 20-30%. So, I’d guess that 500k+ machines is a decent rough estimate for machines dedicated to MapReduce in the Google data centers based on the data they released.

What do others think? Is roughly 500k physical machines a good estimate?

Could anyone confirm these numbers?



Original title and link: Google: A Study in Scalability, MapReduce Evolution (NoSQL databases © myNoSQL)

Goodbye Google App Engine (GAE)

[…] developing on GAE introduced such design complexity that working around it pushed us 5 months behind schedule. Now that we had developed tools and workarounds for all the problems we found in GAE, we were eventually starting to be fast with the development of features, and at that point we found the cloud was totally unstable, dodgy.

Carlos Ble’s post lists 13 issues they’ve had with Google App Engine. Now, I know Google App Engine is not offering the best experience or performance, has a couple of (interesting?) constraints, and Google support is (how to put it gently?) completely missing, but some of the points listed there make me think they’ve thrown themselves in head first without trying to understand:

  • what kind of persistence Google App Engine is offering
  • how to model and access your data in Google App Engine
  • data migration paths
  • architecting with SLAs in mind

These should sound kind of familiar to everyone building distributed systems using NoSQL databases. Going live after prototyping a “Hello world” with a NoSQL database is not gonna “scale” :-).

Original title and link: Goodbye Google App Engine (GAE) (NoSQL databases © myNoSQL)


Szl: A Compiler and Runtime for the Sawzall Language

Google open sourced szl, an implementation of ☞ Sawzall:

Szl is a compiler and runtime for the Sawzall language. It includes support for statistical aggregation of values read or computed from the input. Google uses Sawzall to process log data generated by Google’s servers.

Since a Sawzall program processes one record of input at a time and does not preserve any state (values of variables) between records, it is well suited for execution as the map phase of a map-reduce. The library also includes support for the statistical aggregation that would be done in the reduce phase of a map-reduce.

You can probably think of it as Flume and Pig without Hadoop.
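The record-at-a-time, no-state-between-records model translates roughly to this (a plain-Python sketch of the processing model, not Sawzall syntax):

```python
# Plain-Python sketch of the Sawzall processing model (not Sawzall
# syntax): each record is processed independently, with no state
# carried between records, and results are emitted to aggregators
# ("tables" in Sawzall) that play the role of the reduce phase.
from collections import Counter

def process_record(record: str, emit):
    """Map phase: stateless per-record computation over a log line."""
    status = record.split()[-1]       # e.g. the status code field
    emit("requests_total", 1)
    emit("status_" + status, 1)

def run(records):
    tables = Counter()                # stand-in for sum-aggregator tables
    for record in records:
        process_record(record, lambda key, value: tables.update({key: value}))
    return tables

log_lines = [
    "GET /index 200",
    "GET /missing 404",
    "GET /index 200",
]
print(run(log_lines))
```

Because `process_record` never keeps state between records, each record can be dispatched to any map worker, which is what makes the model trivially parallel.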

Original title and link: Szl: A Compiler and Runtime for the Sawzall Language (NoSQL databases © myNoSQL)


Pregel: Graph Processing at Large-Scale

Good preso about Pregel:

The slides talk about:

  • Pregel compute model
  • Pregel C++ API
  • implementation details
  • fault tolerance
  • workers, master, and aggregators

As mentioned before, Pregel is MapReduce for graphs. And besides Google’s implementation, which we’ll probably never see, there’s Phoebus, an Erlang implementation of Pregel.

Original title and link: Pregel: Graph Processing at Large-Scale (NoSQL databases © myNoSQL)

Google's Dremel: Can MapReduce Handle Fast, Interactive Querying?

A bit old, but in light of the latest posts about MapReduce and Hadoop’s future, it made sense to have this piece of the puzzle too.

Tasso Argyros (AsterData):

Native MapReduce execution is not fundamentally slow; however Google’s MapReduce and Hadoop happen to be oriented more towards batch processing. Dremel tries to overcome that by building a completely different system that speeds interactive querying.

The Dremel: Interactive Analysis of Web-Scale Datasets paper can be downloaded from ☞ here (PDF).

Original title and link: Google’s Dremel: Can MapReduce Handle Fast, Interactive Querying? (NoSQL databases © myNoSQL)


Phoebus: Erlang-based Implementation of Google’s Pregel

Chad DePue about Phoebus, the first (?) open source implementation of Google’s Pregel algorithm:

Essentially, Phoebus makes calculating data for each vertex and edge in parallel possible on a cluster of nodes. Makes me wish I had a massively large graph to test it with.

Phoebus is developed by Arun Suresh (Yahoo!); the project ☞ page includes a bullet-point description of the Pregel computational model:

  • A Graph is partitioned into groups of Records.
  • A Record consists of a Vertex and its outgoing Edges (an Edge is a Tuple consisting of the edge weight and the target vertex name).
  • A User specifies a ‘Compute’ function that is applied to each Record.
  • Computation on the graph happens in a sequence of incremental Super Steps.
  • At each Super Step, the Compute function is applied to all ‘active’ vertices of the graph.
  • Vertices communicate with each other via Message Passing.
  • The Compute function is provided with the Vertex record and all Messages sent to the Vertex in the previous Super Step.
  • A Compute function can:
    • Mutate the value associated with a vertex
    • Add/Remove outgoing edges
    • Mutate Edge weight
    • Send a Message to any other vertex in the graph
    • Change state of the vertex from ‘active’ to ‘hold’
  • At the beginning of each Super Step, if there are no more active vertices and there are no messages to be sent to any vertex, the algorithm terminates.
  • A User may additionally specify a ‘MaxSteps’ to stop the algorithm after some number of Super Steps.
  • A User may additionally specify a ‘Combine’ function that is applied to all the Messages targeted at a Vertex before the Compute function is applied to it.
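The superstep loop described above can be sketched in a few lines (my single-process illustration of the model, not Phoebus’s actual Erlang API):

```python
# Minimal single-process sketch of the Pregel superstep loop described
# above (an illustration, not Phoebus's actual API); a real
# implementation partitions vertices across workers and passes
# messages over the network.

def pregel(vertices, compute, edges, max_steps=30):
    """vertices: {name: value}. compute(name, value, messages, edges)
    returns (new_value, {target: message}, still_active)."""
    active = set(vertices)
    inbox = {name: [] for name in vertices}
    for _ in range(max_steps):
        # terminate when no vertex is active and no messages are pending
        if not active and not any(inbox.values()):
            break
        outbox = {name: [] for name in vertices}
        # a halted vertex with pending messages is reactivated
        step_active = active | {v for v, msgs in inbox.items() if msgs}
        next_active = set()
        for name in step_active:
            value, outgoing, still_active = compute(
                name, vertices[name], inbox[name], edges)
            vertices[name] = value
            for target, message in outgoing.items():
                outbox[target].append(message)
            if still_active:
                next_active.add(name)
        inbox, active = outbox, next_active
    return vertices

# Example: propagate the maximum value through a small cyclic graph.
def max_compute(name, value, messages, edges):
    new_value = max([value] + messages)
    if messages and new_value == value:
        return value, {}, False  # nothing new: vote to halt
    # first superstep, or value changed: tell the neighbours, then halt
    return new_value, {t: new_value for t in edges[name]}, False

edges = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(pregel({"a": 3, "b": 6, "c": 1}, max_compute, edges))
```

Note how termination matches the bullet above: the loop exits as soon as every vertex has voted to halt and no messages remain in flight.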

While it sounds similar to MapReduce, Pregel is optimized for graph operations: it reduces I/O, ensures data locality, and preserves processing state between supersteps.

Original title and link: Phoebus: Erlang-based Implementation of Google’s Pregel (NoSQL databases © myNoSQL)