google: All content tagged as google in NoSQL databases and polyglot persistence
Google’s paper about their large-scale distributed systems tracing solution Dapper which inspired Twitter’s Zipkin:
Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie  and X-Trace , but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.
Download or read the paper after the break.
Announced at GigaOm Structure Data event, Google launches a new BigData service named BigQuery:
BigQuery enables businesses and developers to gain real-time business insights from massive amounts of data without any upfront hardware or software investments.
A quick bullet point list of BigQuery features and limitations:
- BigQuery is ideal for running queries over vast amounts of data—up to billions of rows—with great speed.
- BigQuery is good for analyzing vast quantities of data quickly, but not for modifying it. In data analysis terms, BigQuery is an OLAP (online analytical processing) system.
- You can import data into BigQuery as CSV data, where it is stored in the cloud in a relatively small number of tables with no explicit relationship to each other.
- BigQuery isn’t a database system:
- It doesn’t support table indexes or other database management features.
- BigQuery supports a specialized subset of SQL; it doesn’t support update or delete requests.
- BigQuery supports joins only when one side of the join is much smaller than the other.
- BigQuery can be used by any client able to send REST commands over the Internet.
After the break you can watch the 15 minutes video recorded at the GigaOm event.
Over the weekend I’ve read two papers presenting products or research related to improving or adding new capabilities to the MapReduce data processing approach. The first of them comes from a team at Microsoft and is describing TiMR a time-oriented data processing system in MapReduce. The second, from a team at Google, presents Tenzin - a SQL implementation on the MapReduce framework. It’s great to learn that while the Hadoop community is eliminating some of the initial limitations and hardening the technical details of the platform, there are already ideas and systems out there that augment the capabilities of the MapReduce data processing model.
Original title and link: Research in the MapReduce Space ( ©myNoSQL)
This recent paper from a team at Google is presenting details about Tenzing a system that is currently in use at Google:
Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility.
A couple of things I’ve highlighted when reading it:
- Tenzing is in production, but doesn’t serve yet a huge amount of queries
- the backend storage can be a mix of various data stores, such as ColumnIO, Bigtable, GFS files, MySQL databases
- when compared with other similar solutions (Sawzall, Flume-Java, Pig, Hive„ HadoopDB), Tenzing’s advantage is low latency
- the paper acknowledges AsterData, GreenPlum, Paraccel, Vertica for using a MapReduce execution model in their engines
- to perform query optimizations, Tenzing is enhancing queries with information from a metadata server
- there is no information about what kind of metadata is needed in Tenzing. I assume it might refer to details about the data sources and data source metadata (indexes, access patterns, etc)
- to reduce query latency, processes are kept running
- Tenzing supports almost all SQL92 standard and some extensions from SQL99
- projection and filtering (for some of these and depending on the data source Tenzing can do some optimizations)
- set operations (implemented in the reduce phase)
- nested queries and subqueries
- aggregation and statistical functions
- analytic functions (syntax similar to PostgreSQL/Oracle)
- OLAP extensions
Tenzing supports efficient joins across data sources, such as ColumnIO to Bigtable; inner, left, right, cross, and full outer joins; and equi semi-equi, non-equi and function based joins. Cross joins are only supported for tables small enough to fit in memory, and right outer joins are supported only with sort/merge joins. Non-equi correlated subqueries are currently not supported. We include distributed implementations for nested loop, sort/merge and hash joins.
Read and download the “Tenzing A SQL Implementation on the MapReduce framework” after the break.
Google has just announced a new (lab) product: Google Cloud SQL which is Google’s Database-as-a-Service version of Amazon RDS—based on initial information, Google Cloud SQL could be characterized as a very basic/intro version of Amazon RDS.
Main features listed in the announcement:
- Managed environment
- High reliability and availability - your data is replicated synchronously to multiple data centers. Machine, rack and data center failures are handled automatically to minimize end-user impact. It also support asynchronous replication
- Familiar MySQL database environment with JDBC support (for Java-based App Engine applications) and DB-API support (for Python-based App Engine applications). It even support data import and export using
- Simple and powerful integration with Google App Engine.
- Command line tool
- SQL prompt in the Google APIs Console
The service is free for now and Google promises a 30 days notice without giving any hints on the pricing model though.
Original title and link: Google Launches Google Cloud SQL a Relational Database as a Service ( ©myNoSQL)
Google open sourced a while ago LevelDB , a C++ library that provides an ordered mapping key-value storage. LevelDB performance convinced Basho guys to experiment with adding LevelDB as a storage engine for Riak. And there’s also a benchmark comparing LevelDB with SQLite and Kyoto Cabinet.
The LevelDB project lists the following key features:
- Keys and values are arbitrary byte arrays.
- Data is stored sorted by key.
- Callers can provide a custom comparison function to override the sort order.
- The basic operations are
- Multiple changes can be made in one atomic batch.
- Users can create a transient snapshot to get a consistent view of data.
- Forward and backward iteration is supported over the data.
- Data is automatically compressed using the Snappy compression library.
- External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.
- Detailed documentation about how to use the library is included with the source code.
You can check out also the old thread on Hacker News about LevelDB..
Original title and link: LevelDB: Google’s Fast Persistent Key-Value Store Library ( ©myNoSQL)
This paper from Google talks extensively about the usage of BigTable and Megastore, the data model, query processing, and transaction handling in the implementation of Google Fusion Tables.
Google Fusion Tables is a cloud-based service for data management and integration. Fusion Tables enables users to upload tabular data files (spreadsheets, CSV, KML), currently of up to 100MB. The system provides several ways of visualizing the data (e.g., charts, maps, and timelines) and the ability to filter and aggregate the data. It supports the integration of data from multiple sources by performing joins across tables that may belong to different users. […] This paper describes the inner workings of Fusion Tables, including the storage of data in the system and the tight integration with the Google Maps infrastructure.
Download the paper or read it after the break.
Announced back in March, Ravel has finally released GoldenOrb an implementation of the Google Pregel paper—if you are not familiar with Google Pregel check the Pregel: Graph Processing at Large-Scale and Ricky Ho’s comparison of Pregel and MapReduce.
GoldenOrb is a cloud-based open source project for massive-scale graph analysis, built upon best-of-breed software from the Apache Hadoop project modeled after Google’s Pregel architecture.
Original title and link: GoldenOrb: Ravel Google Pregel Implementation Released ( ©myNoSQL)