This recent paper from a team at Google presents details about Tenzing, a system currently in use at Google:
Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility.
A few things I highlighted while reading it:
- Tenzing is in production, but doesn’t yet serve a huge number of queries
- the backend storage can be a mix of various data stores, such as ColumnIO, Bigtable, GFS files, MySQL databases
- compared with similar solutions (Sawzall, FlumeJava, Pig, Hive, HadoopDB), Tenzing’s advantage is low latency
- the paper acknowledges Aster Data, Greenplum, ParAccel, and Vertica for using a MapReduce execution model in their engines
- to optimize queries, Tenzing enriches them with information from a metadata server
- the paper gives no information about what kind of metadata Tenzing needs. I assume it refers to details about the data sources and their metadata (indexes, access patterns, etc.)
- to reduce query latency, worker processes are kept running instead of being started for each query
- Tenzing supports almost all of the SQL92 standard and some extensions from SQL99:
- projection and filtering (for some of these and depending on the data source Tenzing can do some optimizations)
- set operations (implemented in the reduce phase)
- nested queries and subqueries
- aggregation and statistical functions
- analytic functions (syntax similar to PostgreSQL/Oracle)
- OLAP extensions
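
To make the "set operations in the reduce phase" point more concrete, here is a minimal single-process sketch of the general MapReduce pattern. It is not Tenzing's code: the function names, the `"left"`/`"right"` tags, and the in-memory shuffle are all assumptions made for illustration. The idea is that the map side emits each whole row as the key, tagged with its input, and the reducer decides per row whether it belongs in the result.

```python
from collections import defaultdict

def map_phase(tagged_inputs):
    """Emit (row, tag) pairs; the whole row acts as the shuffle key."""
    for tag, rows in tagged_inputs.items():
        for row in rows:
            yield row, tag

def shuffle(pairs):
    """Group the input tags by row, standing in for the MapReduce shuffle."""
    groups = defaultdict(set)
    for row, tag in pairs:
        groups[row].add(tag)
    return groups

def reduce_phase(groups, op):
    """Decide, for each distinct row, whether it belongs in the result."""
    if op not in ("union", "intersect", "except"):
        raise ValueError(f"unsupported set operation: {op}")
    for row, tags in groups.items():
        if op == "union":
            yield row                      # any input is enough
        elif op == "intersect" and tags == {"left", "right"}:
            yield row                      # must appear in both inputs
        elif op == "except" and tags == {"left"}:
            yield row                      # only in the left input

def set_operation(left, right, op):
    """Run the hypothetical map -> shuffle -> reduce pipeline (rows are tuples)."""
    pairs = map_phase({"left": left, "right": right})
    return sorted(reduce_phase(shuffle(pairs), op))

if __name__ == "__main__":
    a = [(1, "a"), (2, "b"), (2, "b")]
    b = [(2, "b"), (3, "c")]
    for op in ("union", "intersect", "except"):
        print(op, set_operation(a, b, op))
```

Because grouping by row collapses duplicates, this sketch gives the DISTINCT flavor of each operation. An ALL variant would have to count occurrences per input instead of collecting tags in a set.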
Tenzing supports efficient joins across data sources, such as ColumnIO to Bigtable; inner, left, right, cross, and full outer joins; and equi, semi-equi, non-equi and function based joins. Cross joins are only supported for tables small enough to fit in memory, and right outer joins are supported only with sort/merge joins. Non-equi correlated subqueries are currently not supported. We include distributed implementations for nested loop, sort/merge and hash joins.
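
As a rough illustration of one of the join strategies the paper names, here is a small in-memory hash join for equi joins. It is only a sketch under my own assumptions: rows are Python dicts, the `hash_join` name and its `how` parameter are made up for this example, and nothing here is distributed or taken from Tenzing itself.

```python
from collections import defaultdict

def hash_join(left, right, key, how="inner"):
    """Equi-join two lists of dicts on `key` by hashing the right side.

    Supports how="inner" and how="left" (left outer). On a column name
    clash, the value from the right row wins.
    """
    if how not in ("inner", "left"):
        raise ValueError(f"unsupported join type: {how}")

    # Build phase: index the right-hand rows by join key.
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)

    # Columns used to pad unmatched left rows with NULLs (None).
    right_cols = {c for r in right for c in r if c != key}

    # Probe phase: look up each left row in the index.
    out = []
    for l in left:
        matches = index.get(l[key], [])
        if matches:
            for r in matches:
                out.append({**l, **r})
        elif how == "left":
            out.append({**l, **{c: None for c in right_cols}})
    return out

if __name__ == "__main__":
    users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    clicks = [{"id": 1, "clicks": 10}, {"id": 1, "clicks": 5}]
    print(hash_join(users, clicks, "id"))
    print(hash_join(users, clicks, "id", how="left"))
```

The build side has to fit in memory, which mirrors the kind of restriction the quote mentions for cross joins. In a distributed setting, one would presumably either ship the small table to every worker or partition both inputs by the join key first.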
Read and download “Tenzing: A SQL Implementation on the MapReduce Framework” after the break.
Original title and link: Paper: Tenzing A SQL Implementation on the MapReduce Framework (©myNoSQL)