Tenzing: All content tagged as Tenzing in NoSQL databases and polyglot persistence

Research in the MapReduce Space

Over the weekend I read two papers presenting products or research that improve the MapReduce data processing approach or add new capabilities to it. The first comes from a team at Microsoft and describes TiMR, a time-oriented data processing system built on MapReduce. The second, from a team at Google, presents Tenzing, a SQL implementation on the MapReduce framework. It’s great to see that while the Hadoop community is removing some of the platform's initial limitations and hardening its technical details, there are already ideas and systems out there that extend the capabilities of the MapReduce data processing model.


Paper: Tenzing A SQL Implementation on the MapReduce Framework

This recent paper from a team at Google presents details about Tenzing, a system that is currently in use at Google:

Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility.

A few things I highlighted while reading it:

  • Tenzing is in production, but doesn’t yet serve a huge number of queries
  • the backend storage can be a mix of various data stores, such as ColumnIO, Bigtable, GFS files, MySQL databases
  • compared with other similar solutions (Sawzall, FlumeJava, Pig, Hive, HadoopDB), Tenzing’s advantage is low latency
  • the paper acknowledges that Aster Data, Greenplum, ParAccel, and Vertica use a MapReduce execution model in their engines
  • to optimize queries, Tenzing enriches them with information from a metadata server
    • there is no information about what kind of metadata Tenzing needs. I assume it refers to details about the data sources and their metadata (indexes, access patterns, etc.)
  • to reduce query latency, processes are kept running
  • Tenzing supports almost all of the SQL92 standard and some extensions from SQL99
    • projection and filtering (for some of these and depending on the data source Tenzing can do some optimizations)
    • set operations (implemented in the reduce phase)
    • nested queries and subqueries
    • aggregation and statistical functions
    • analytic functions (syntax similar to PostgreSQL/Oracle)
    • OLAP extensions
    • JOINs:

      Tenzing supports efficient joins across data sources, such as ColumnIO to Bigtable; inner, left, right, cross, and full outer joins; and equi, semi-equi, non-equi and function based joins. Cross joins are only supported for tables small enough to fit in memory, and right outer joins are supported only with sort/merge joins. Non-equi correlated subqueries are currently not supported. We include distributed implementations for nested loop, sort/merge and hash joins.
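To make the reduce-phase set operations and joins mentioned above a bit more concrete, here is a minimal single-process Python sketch of how an inner equi-join and an INTERSECT can be expressed as map, shuffle, and reduce steps. This is not Tenzing's code, and the paper does not describe its implementation at this level. The function names (`shuffle`, `reduce_side_inner_join`, `reduce_side_intersect`) and the sample tables are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import product


def shuffle(pairs):
    """Group (key, value) pairs by key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_side_inner_join(left, right, key):
    """Inner equi-join of two lists of dicts on the column named `key`."""
    # Map phase: tag every row with its source table and emit it under the join key.
    mapped = [(row[key], ("L", row)) for row in left]
    mapped += [(row[key], ("R", row)) for row in right]

    joined = []
    # Reduce phase: for each join key, pair every left row with every right row.
    for _, tagged in sorted(shuffle(mapped).items(), key=lambda kv: kv[0]):
        lefts = [row for tag, row in tagged if tag == "L"]
        rights = [row for tag, row in tagged if tag == "R"]
        for l_row, r_row in product(lefts, rights):
            joined.append({**l_row, **r_row})
    return joined


def reduce_side_intersect(left, right):
    """Distinct INTERSECT of two lists of hashable rows (tuples here)."""
    # Map phase: the whole row is the key; the value records which input it came from.
    mapped = [(row, "L") for row in left] + [(row, "R") for row in right]
    # Reduce phase: keep a row only if both inputs contributed at least one copy.
    groups = shuffle(mapped)
    return sorted(row for row, tags in groups.items() if set(tags) == {"L", "R"})


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the paper.
    users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
    clicks = [{"uid": 1, "url": "/a"}, {"uid": 1, "url": "/b"}, {"uid": 3, "url": "/c"}]
    print(reduce_side_inner_join(users, clicks, "uid"))
    print(reduce_side_intersect([(1, "a"), (2, "b"), (2, "b")], [(2, "b"), (3, "c")]))
```

In a real cluster the shuffle is what moves rows with the same key to the same reducer, which is why both operations fit naturally in the reduce phase. A production engine would also need to handle skewed keys and inputs that do not fit in memory, which this sketch ignores.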
