MapReduce: All content tagged as MapReduce in NoSQL databases and polyglot persistence
Another weekend read, this time from Facebook and The Ohio State University, and closer to the hot topic of the last two weeks (SQL, MapReduce, Hadoop):
MapReduce has become an effective approach to big data analytics in large cluster systems, where SQL-like queries play important roles to interface between users and systems. However, based on our Facebook daily operation results, certain types of queries are executed at an unacceptable low speed by Hive (a production SQL-to-MapReduce translator). In this paper, we demonstrate that existing SQL-to-MapReduce translators that operate in a one-operation-to-one-job mode and do not consider query correlations cannot generate high-performance MapReduce programs for certain queries, due to the mismatch between complex SQL structures and simple MapReduce framework. We propose and develop a system called YSmart, a correlation aware SQL-to-MapReduce translator. YSmart applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators. We have implemented YSmart with intensive evaluation for complex queries on two Amazon EC2 clusters and one Facebook production cluster. The results show that YSmart can outperform Hive and Pig, two widely used SQL-to-MapReduce translators, by more than four times for query execution.
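To make the "one-operation-to-one-job" problem concrete, here is a toy sketch in Python. It is not YSmart's code or its rule set: the `run_job` helper, the `clicks` table, and both plan functions are hypothetical names invented for illustration. The query it imitates computes a per-user COUNT and a per-user SUM over the same table and then joins the two results on the user. Because all three operations partition on the same key, they can share a single job.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Toy in-memory MapReduce job: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Hypothetical input table: (user_id, seconds_on_page)
clicks = [("alice", 30), ("bob", 10), ("alice", 50), ("carol", 5), ("bob", 20)]

def naive_plan(rows):
    """One job per SQL operation: two aggregations, then a join."""
    # Job 1: COUNT(*) GROUP BY user_id (first scan of the table)
    counts = run_job(rows,
                     lambda r: [(r[0], 1)],
                     lambda k, vs: [(k, ("cnt", sum(vs)))])
    # Job 2: SUM(seconds) GROUP BY user_id (second scan of the same table)
    totals = run_job(rows,
                     lambda r: [(r[0], r[1])],
                     lambda k, vs: [(k, ("total", sum(vs)))])
    # Job 3: join the two intermediate results on user_id
    def join_reduce(key, tagged_values):
        by_tag = dict(tagged_values)
        return [(key, by_tag["cnt"], by_tag["total"])]
    joined = run_job(counts + totals, lambda r: [(r[0], r[1])], join_reduce)
    return joined, 3  # result and number of jobs used

def merged_plan(rows):
    """All three operations share the partition key, so one job is enough."""
    def reduce_all(key, seconds):
        return [(key, len(seconds), sum(seconds))]
    result = run_job(rows, lambda r: [(r[0], r[1])], reduce_all)
    return result, 1  # result and number of jobs used

if __name__ == "__main__":
    print(naive_plan(clicks))
    print(merged_plan(clicks))
```

Both plans return the same rows, but the merged one scans the input once and writes no intermediate results, which is the kind of redundant computation, I/O, and network transfer the abstract says a correlation-aware translator can avoid.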
For the weekend reads, a paper authored by a research team from IBM:
Main Memory Map Reduce (M3R) is a new implementation of the Hadoop Map Reduce (HMR) API targeted at online analytics on high mean-time-to-failure clusters. It does not support resilience, and supports only those workloads which can fit into cluster memory. In return, it can run HMR jobs unchanged, including jobs produced by compilers for higher-level languages such as Pig, Jaql, and SystemML and interactive front-ends like IBM BigSheets, while providing significantly better performance than the Hadoop engine on several workloads (e.g. 45x on some input sizes for sparse matrix vector multiply). M3R also supports extensions to the HMR API which can enable Map Reduce jobs to run faster on the M3R engine, while not affecting their performance under the Hadoop engine.
Based on ESG’s modeling of a medium-sized Hadoop-oriented big data project, the preconfigured Oracle Big Data Appliance is 39% less costly than a “build” equivalent do-it-yourself infrastructure. And using Oracle Big Data Appliance will cut the project length by about one-third. For most enterprises planning to take big data beyond experimentation and proof-of-concept, ESG suggests skipping the idea of in-house development, on-going management, and expansion of your own big data infrastructure, to instead look to purpose-built infrastructure solutions such as Oracle Big Data Appliance.
This is an extract from Oracle’s whitepaper “Getting Real about Big Data: Build Versus Buy”. It’s a nice reading exercise to better understand how the database leader is positioning its Oracle Big Data Appliance against a Hadoop cluster built on commodity hardware.
I’d love to see the equivalent paper from Hortonworks [1].
[1] The only reason I’m referring directly to Hortonworks and not also Cloudera is that the Hadoop part of Oracle Big Data Appliance is offered by Cloudera.
Original title and link: Oracle Paper: The Cost of Do-It-Yourself Hadoop vs Oracle Big Data Appliance ( ©myNoSQL)