


Google: All content tagged as Google in NoSQL databases and polyglot persistence

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Google’s paper about Dapper, its large-scale distributed systems tracing solution, which inspired Twitter’s Zipkin:

Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie [3] and X-Trace [12], but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.
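The sampling design the paper highlights can be sketched in a few lines. This is a hypothetical, simplified model (not Dapper's actual code): the sampling decision is made once at the root span and inherited by every child span, so traces stay complete while overhead stays low.

```python
import random

class Sampler:
    """Sketch of Dapper-style trace sampling: decide once at the root
    span of a trace; child spans inherit the decision."""

    def __init__(self, rate=0.001):
        # The rate is illustrative; production systems use very low rates.
        self.rate = rate

    def should_sample(self, parent_sampled=None):
        # Child spans inherit the root's decision so a trace is either
        # collected in full or not at all.
        if parent_sampled is not None:
            return parent_sampled
        return random.random() < self.rate
```

Because only a small fraction of traces is ever recorded, the instrumentation cost on the hot path is a single cheap check.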

Download or read the paper after the break.

Google BigQuery: Running SQL-like Queries Against Very Large Datasets

Announced at the GigaOm Structure Data event, BigQuery is Google’s new Big Data service:

BigQuery enables businesses and developers to gain real-time business insights from massive amounts of data without any upfront hardware or software investments.

A quick bullet point list of BigQuery features and limitations:

  • BigQuery is ideal for running queries over vast amounts of data—up to billions of rows—with great speed.
  • BigQuery is good for analyzing vast quantities of data quickly, but not for modifying it. In data analysis terms, BigQuery is an OLAP (online analytical processing) system.
  • You can import data into BigQuery as CSV data, where it is stored in the cloud in a relatively small number of tables with no explicit relationship to each other.
  • BigQuery isn’t a database system:
    • It doesn’t support table indexes or other database management features.
    • BigQuery supports a specialized subset of SQL; it doesn’t support update or delete requests.
    • BigQuery supports joins only when one side of the join is much smaller than the other.
  • BigQuery can be used by any client able to send REST commands over the Internet.
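Since any REST-capable client can use BigQuery, a query call can be sketched as below. The endpoint path and request fields are illustrative, not the exact API of the time; the point is that the client only sends a SQL-like string (read-only, so SELECT-style statements, never UPDATE or DELETE) over HTTPS.

```python
import json
import urllib.request

def build_bigquery_request(project_id, sql):
    """Build (but don't send) a hypothetical BigQuery REST query request."""
    # BigQuery is queried with a SQL-like string; there is no client-side
    # driver requirement beyond the ability to speak HTTP/JSON.
    body = json.dumps({"query": sql}).encode("utf-8")
    url = (f"https://www.googleapis.com/bigquery/v2/"
           f"projects/{project_id}/queries")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})

req = build_bigquery_request(
    "my-project",
    "SELECT word, COUNT(*) AS n FROM corpus GROUP BY word "
    "ORDER BY n DESC LIMIT 10",
)
```

A typical query, as above, is an OLAP-style aggregation over billions of rows rather than a point lookup or a row modification.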

After the break you can watch the 15-minute video recorded at the GigaOm event.

How Web giants store big data

A not-very-technical Ars Technica overview of the storage engines developed and used by Google (Google File System, BigTable), Amazon (Dynamo), and Microsoft (Azure DFS), plus the Hadoop Distributed File System (HDFS).

Original title and link: How Web giants store big data (NoSQL database©myNoSQL)


Research in the MapReduce Space

Over the weekend I’ve read two papers presenting products or research related to improving or adding new capabilities to the MapReduce data processing approach. The first of them comes from a team at Microsoft and describes TiMR, a time-oriented data processing system built on MapReduce. The second, from a team at Google, presents Tenzing, a SQL implementation on the MapReduce framework. It’s great to see that while the Hadoop community is eliminating some of the initial limitations and hardening the technical details of the platform, there are already ideas and systems out there that augment the capabilities of the MapReduce data processing model.

Original title and link: Research in the MapReduce Space (NoSQL database©myNoSQL)

Paper: Tenzing A SQL Implementation on the MapReduce Framework

This recent paper from a team at Google presents details about Tenzing, a system currently in use at Google:

Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility.

A couple of things I’ve highlighted when reading it:

  • Tenzing is in production, but doesn’t yet serve a huge number of queries
  • the backend storage can be a mix of various data stores, such as ColumnIO, Bigtable, GFS files, and MySQL databases
  • when compared with other similar solutions (Sawzall, FlumeJava, Pig, Hive, HadoopDB), Tenzing’s advantage is low latency
  • the paper acknowledges Aster Data, Greenplum, Paraccel, and Vertica for using a MapReduce execution model in their engines
  • to perform query optimizations, Tenzing enhances queries with information from a metadata server
    • there is no information about what kind of metadata Tenzing needs. I assume it refers to details about the data sources and data source metadata (indexes, access patterns, etc.)
  • to reduce query latency, worker processes are kept running between queries
  • Tenzing supports almost all of the SQL92 standard and some extensions from SQL99
    • projection and filtering (for some of these and depending on the data source Tenzing can do some optimizations)
    • set operations (implemented in the reduce phase)
    • nested queries and subqueries
    • aggregation and statistical functions
    • analytic functions (syntax similar to PostgreSQL/Oracle)
    • OLAP extensions
    • JOINs:

      Tenzing supports efficient joins across data sources, such as ColumnIO to Bigtable; inner, left, right, cross, and full outer joins; and equi, semi-equi, non-equi and function based joins. Cross joins are only supported for tables small enough to fit in memory, and right outer joins are supported only with sort/merge joins. Non-equi correlated subqueries are currently not supported. We include distributed implementations for nested loop, sort/merge and hash joins.
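The hash-join strategy mentioned in the quote can be illustrated with a minimal single-machine sketch (plain Python, not Tenzing's actual code). The idea, used when one side of the join is small enough to fit in memory, is to build a hash table from the small side and stream the large side past it:

```python
def hash_join(small_table, large_table, key_small, key_large):
    """Sketch of an in-memory (broadcast) hash join: index the small
    side, then probe it once per row of the large side."""
    index = {}
    for row in small_table:
        index.setdefault(row[key_small], []).append(row)
    for row in large_table:
        for match in index.get(row[key_large], []):
            # Merge the two matching rows into one joined row.
            yield {**match, **row}
```

In a distributed engine the small table would be shipped to every worker, turning the join into an embarrassingly parallel scan of the large table; that is why cross joins are restricted to tables that fit in memory.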

Read and download the “Tenzing A SQL Implementation on the MapReduce framework” after the break.

Google Launches Google Cloud SQL a Relational Database as a Service

Google has just announced a new (labs) product: Google Cloud SQL, its Database-as-a-Service offering. Based on the initial information, Google Cloud SQL could be characterized as a very basic version of Amazon RDS.

Main features listed in the announcement:

  • Managed environment
  • High reliability and availability: your data is replicated synchronously to multiple data centers. Machine, rack, and data center failures are handled automatically to minimize end-user impact. It also supports asynchronous replication
  • Familiar MySQL database environment with JDBC support (for Java-based App Engine applications) and DB-API support (for Python-based App Engine applications). It even supports data import and export using mysqldump
  • Simple and powerful integration with Google App Engine.
  • Command line tool
  • SQL prompt in the Google APIs Console
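The DB-API support mentioned above means Python App Engine code talks to Cloud SQL through the standard cursor/execute pattern. The sketch below uses the stdlib `sqlite3` module purely to illustrate that DB-API 2.0 call pattern; the actual Cloud SQL driver and connection setup differ.

```python
import sqlite3

# sqlite3 stands in here for a DB-API 2.0 connection; with Cloud SQL
# only the connect() call would change, not the query pattern.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE guestbook (name TEXT, message TEXT)")
cur.execute("INSERT INTO guestbook VALUES (?, ?)", ("alice", "hello"))
conn.commit()
cur.execute("SELECT message FROM guestbook WHERE name = ?", ("alice",))
row = cur.fetchone()
```

This is precisely the appeal of the service: existing MySQL/DB-API code should carry over with minimal changes.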

The service is free for now; Google promises 30 days’ notice before it starts charging, though without giving any hints about the pricing model.

Original title and link: Google Launches Google Cloud SQL a Relational Database as a Service (NoSQL database©myNoSQL)

How Does Google MegaStore Compare Against HDFS/HBase?

Alex Feinberg answering the question in the title:

This is like saying “how does a General Motors bus compare against a Ford engine”. MegaStore is built on top of Google’s BigTable/GFS. HBase/HDFS are BigTable/GFS work-alikes.

BigTable and HBase give up availability (in the CAP Theorem sense) in favour of consistency: when a tablet master node (HRegionServer in HBase) goes down, the portion of the keyspace the failed node is responsible for becomes (briefly) unavailable until another node takes over that portion of the key space. This is efficient, as the data/write-ahead log is stored in GFS (or HDFS): in a way, serializing writes to GFS/HDFS (a file system with relaxed consistency semantics) through a single node ensures serializable consistency.
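The failover behavior described above can be modeled with a toy sketch (hypothetical, not HBase code): each key range is owned by exactly one server, and when that server fails, the range is simply unavailable until reassigned. Consistency is preserved; availability is not.

```python
class Cluster:
    """Toy model of single-master-per-range assignment (BigTable/HBase style)."""

    def __init__(self):
        self.assignments = {}  # key_range -> serving node (or None)

    def assign(self, key_range, server):
        self.assignments[key_range] = server

    def fail(self, server):
        # Ranges owned by the failed node become unavailable until a
        # new node picks them up; no second writer ever exists.
        for key_range, owner in self.assignments.items():
            if owner == server:
                self.assignments[key_range] = None

    def read(self, key_range):
        server = self.assignments.get(key_range)
        if server is None:
            raise RuntimeError("key range temporarily unavailable")
        return server
```

Because there is never more than one writer per range, all writes to the underlying relaxed-consistency file system are serialized through that single owner.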

Make sure you read it all.

Original title and link: How Does Google MegaStore Compare Against HDFS/HBase? (NoSQL database©myNoSQL)


Amazon Is More Interesting Than Google

Google has been doing these sort of blog posts for years. Some engineer writes up an entry about how they are doing research using terabytes or petabytes of data. And then they end by saying you should work at Google. So nowadays, I don’t care about any of what Google does. […] MapReduce? Great, they’ve been sitting on this technology for a decade. Good for them. It doesn’t matter to me.

But the world has changed, and Google can’t seem to keep up. Amazon has become the polar opposite of Google, empowering every developer on the planet to make incredible technology. Want MapReduce? Amazon has you covered. Want to play with terabytes of data like it ain’t no thing? Check. Want to launch thousands of servers to handle a tough computation? Check, check, and check. Want to launch thousands of human brains to solve otherwise unassailable problems? No problem. Heck, want to simply send email to your users? They have that too.

I read this just hours after expressing my concerns about the awesome future of Big Data and data analytics. For now we’re lucky there’s still an Amazon out there.

Original title and link: Amazon Is More Interesting Than Google (NoSQL database©myNoSQL)


LevelDB: Google’s Fast Persistent Key-Value Store Library

A while ago Google open sourced LevelDB, a C++ library that provides an ordered key-value store. LevelDB’s performance convinced the Basho guys to experiment with adding LevelDB as a storage engine for Riak. And there’s also a benchmark comparing LevelDB with SQLite and Kyoto Cabinet.

The LevelDB project lists the following key features:

  • Keys and values are arbitrary byte arrays.
  • Data is stored sorted by key.
  • Callers can provide a custom comparison function to override the sort order.
  • The basic operations are Put(key,value), Get(key), Delete(key).
  • Multiple changes can be made in one atomic batch.
  • Users can create a transient snapshot to get a consistent view of data.
  • Forward and backward iteration is supported over the data.
  • Data is automatically compressed using the Snappy compression library.
  • External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.
  • Detailed documentation about how to use the library is included with the source code.
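To make the feature list above concrete, here is a toy, in-memory model of LevelDB's interface. The real library is C++ and persists to disk; this sketch only mirrors the call pattern of its API (Put/Get/Delete, sorted iteration in both directions, atomic batches):

```python
class ToyLevelDB:
    """In-memory sketch of LevelDB's basic operations; not the real thing."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        self.data.pop(key, None)

    def write_batch(self, ops):
        # All-or-nothing: stage every change, then swap in the result,
        # mimicking LevelDB's atomic WriteBatch.
        staged = dict(self.data)
        for op, key, *value in ops:
            if op == "put":
                staged[key] = value[0]
            elif op == "delete":
                staged.pop(key, None)
            else:
                raise ValueError(op)
        self.data = staged

    def iterate(self, reverse=False):
        # Data is stored sorted by key; forward and backward iteration.
        for key in sorted(self.data, reverse=reverse):
            yield key, self.data[key]
```

Keys here, as in LevelDB, can be arbitrary byte strings; the sort order in the real library can additionally be overridden with a custom comparator.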

You can also check out the old Hacker News thread about LevelDB.

Original title and link: LevelDB: Google’s Fast Persistent Key-Value Store Library (NoSQL database©myNoSQL)

Paper: Google Fusion Tables: Data Management, Integration and Collaboration in the Cloud

This paper from Google talks extensively about the usage of BigTable and Megastore, the data model, query processing, and transaction handling in the implementation of Google Fusion Tables.

Google Fusion Tables is a cloud-based service for data management and integration. Fusion Tables enables users to upload tabular data files (spreadsheets, CSV, KML), currently of up to 100MB. The system provides several ways of visualizing the data (e.g., charts, maps, and timelines) and the ability to filter and aggregate the data. It supports the integration of data from multiple sources by performing joins across tables that may belong to different users. […] This paper describes the inner workings of Fusion Tables, including the storage of data in the system and the tight integration with the Google Maps infrastructure.

Download the paper or read it after the break.

GoldenOrb: Ravel Google Pregel Implementation Released

Ravel has finally released GoldenOrb, its implementation of the Google Pregel paper announced back in March. If you are not familiar with Google Pregel, check Pregel: Graph Processing at Large-Scale and Ricky Ho’s comparison of Pregel and MapReduce.

Until Ravel’s GoldenOrb, the only experimental implementation of Pregel was the Erlang-based Phoebus. GoldenOrb was released under the Apache License v2.0 and is available on GitHub.

GoldenOrb is a cloud-based open source project for massive-scale graph analysis, built upon best-of-breed software from the Apache Hadoop project modeled after Google’s Pregel architecture.
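Pregel's vertex-centric, superstep-based model can be sketched in a few lines (a hypothetical toy, not GoldenOrb's API). Each superstep, every vertex processes its incoming messages, may update its value, and sends messages to its neighbors; the computation halts when no messages remain in flight. The classic example is propagating the maximum value through a graph:

```python
def pregel_max(graph, values):
    """Toy Pregel-style computation: every vertex ends up holding the
    maximum value in its connected component.

    graph: dict mapping vertex -> list of neighbor vertices
    values: dict mapping vertex -> initial value (mutated in place)
    """
    # Superstep 0: every vertex sends its value to its neighbors.
    messages = {v: [] for v in graph}
    for v in graph:
        for n in graph[v]:
            messages[n].append(values[v])
    # Later supersteps: a vertex updates (and re-broadcasts) only when
    # it learns a larger value; halt when no messages are in flight.
    while any(messages.values()):
        new_messages = {v: [] for v in graph}
        for v, inbox in messages.items():
            if inbox and max(inbox) > values[v]:
                values[v] = max(inbox)
                for n in graph[v]:
                    new_messages[n].append(values[v])
        messages = new_messages
    return values
```

Hadoop-based systems like GoldenOrb implement the same model at scale, partitioning the vertices across workers and exchanging messages between supersteps.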

Original title and link: GoldenOrb: Ravel Google Pregel Implementation Released (NoSQL database©myNoSQL)

Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB

Dhanji R. Prasanna, on leaving Google:

Here is something you may have heard but never quite believed before: Google’s vaunted scalable software infrastructure is obsolete. Don’t get me wrong, their hardware and datacenters are the best in the world, and as far as I know, nobody is close to matching it. But the software stack on top of it is 10 years old, aging and designed for building search engines and crawlers. And it is well and truly obsolete.

Protocol Buffers, BigTable and MapReduce are ancient, creaking dinosaurs compared to MessagePack, JSON, and Hadoop. And new projects like GWT, Closure and MegaStore are sluggish, overengineered Leviathans compared to fast, elegant tools like jQuery and mongoDB. Designed by engineers in a vacuum, rather than by developers who have need of tools.

Maybe it is just the disappointment of someone whose main project was killed. Or maybe it is true. Or maybe it is just another magic triangle:

[Figure: the Agility / Scalability / Coolness factor triangle]

Edward Ribeiro mentioned a post from another ex-Googler which points out similar issues with Google’s philosophy.

Original title and link: Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB (NoSQL databases © myNoSQL)