
mapreduce: All content tagged as mapreduce in NoSQL databases and polyglot persistence

Hadoop + Terracotta BigMemory: Run, Elephant, Run!

While Hadoop is great for batch processing and storage of very large data sets, it can take hours to produce results. […] To address this challenge, Terracotta recently announced the BigMemory-Hadoop Connector, a game-changing solution that lets Hadoop jobs write data directly into BigMemory, Terracotta’s in-memory data management platform. This enables downstream applications to get instant access to Hadoop results by reading from BigMemory. Hadoop jobs also execute faster, as they can now write to memory instead of disk (HDFS). The result can be a significant boost in competitive advantage and enterprise profitability.

Think about online applications. When the database gets slow you add a caching layer. It looks like a similar direction is very tempting for the majority of in-memory data grid-like solutions.

✚ The top speed of an African bush elephant is 24.9 mph (40 km/h), according to this.

Original title and link: Hadoop + Terracotta BigMemory: Run, Elephant, Run! (NoSQL database©myNoSQL)

via: http://blog.terracotta.org/2013/04/02/hadoop-bigmemory-run-elephant-run/


Field-Level Encryption for Apache Hadoop From Dataguise

Dataguise says the latest version of its data-protection product enables users to encrypt sensitive data right down to specific fields within an open source Apache Hadoop database.

DG for Hadoop 4.3 also makes use of the traditional Dataguise “masking” capability across single or multiple Hadoop clusters to camouflage sensitive data.

$25,000 a piece (hopefully not a piece of encrypted data though).

Apache Accumulo is known to offer a BigTable-inspired open source implementation with cell-based access control.

Original title and link: Field-Level Encryption for Apache Hadoop From Dataguise (NoSQL database©myNoSQL)

via: http://news.techworld.com/security/3437999/dataguise-introduces-field-level-encryption-for-apache-hadoop-database/


Happy Birthday Hadoop!

On this special April 1 – the seven-year anniversary of the Apache Hadoop project’s first release – Hadoop founder Doug Cutting (also Cloudera’s chief architect and the Apache Software Foundation chair) offers seven thoughts on Hadoop.

Happy Birthday Hadoop! And thank you Doug Cutting and the armies of people that put tons of effort behind Apache Hadoop to make it what it is today and what it’ll become tomorrow!

Original title and link: Happy Birthday Hadoop! (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/


Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop

Tajo:

  • Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
  • Rudimentary ETL that transforms one data format to another data format.
  • Support various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
  • Command line interface to allow users to submit SQL queries
  • Java API to enable clients to submit SQL queries to Tajo

Just another example of the way of the future.

Original title and link: Apache Incubator: Tajo - a Relational and Distributed Data Warehouse for Hadoop (NoSQL database©myNoSQL)

via: http://tajo.incubator.apache.org/


Recognizing the Power of Hadoop: Platfora BI Is Better on Hadoop

Ben Werther announcing the general availability of the Platfora BI:

At Platfora, we made a bet that Hadoop’s destiny wasn’t simply to be a cheaper, slower cousin of the relational data warehouse. […] Hadoop is superb at two things — it provides a near-infinite data reservoir where data of all kinds can be landed without needing to figure out how it will be used ahead of time, and it is a slow lumbering freight-train of an engine for crunching and aggregating batches of millions or billions of rows.

They are neither the first nor the last to understand and bet on Hadoop. But in some cases this bet originates only in the financial potential of the Hadoop market and less in its technological potential.

Indeed, it’s rarely the case that these two can live apart. When they do, it leads to either a smaller market segment or a shorter lifetime. Looking around at what’s happening in the Hadoop space, technologically and business-wise, I assume many economists would recognize the signs of a long-lived opportunity.

As a side note, I find it interesting that very few articles look at two other fundamental aspects of the Hadoop platform which, in my opinion, were, are, and will remain critical to the growth of this market: open source and extensibility. Without either of these, what we would see is tons of copycats wasting resources on small, indistinguishable clones, plus countless and endless negotiations to extend and integrate the platform. Hadoop is open source, and the open source developers working on it have built it with extensibility in mind. The proof is out there and is clear: look at the breadth and depth of the tools around Hadoop.

That’s the power of open source. The way of the future.

Original title and link: Recognizing the Power of Hadoop: Platfora BI Is Better on Hadoop (NoSQL database©myNoSQL)

via: http://www.platfora.com/bi-is-better-on-hadoop/


GIS Tools for Hadoop by Esri

Interesting project, GIS Tools for Hadoop:

GIS Tools for Hadoop is an open source toolkit intended for Big Spatial Data Analytics. The toolkit provides different libraries:

  • Esri Geometry API for Java: A generic geometry library, can be used to extend Hadoop core with vector geometry types and operations, and enables developers to build MapReduce applications for spatial data.
  • Spatial Framework for Hadoop: Extends Hive and is based on the Esri Geometry API, to enable Hive Query Language users to leverage a set of analytical functions and geometry types. In addition to some utilities for JSON used in ArcGIS.
  • Geoprocessing Tools for Hadoop: Contains a set of ready to use ArcGIS Geoprocessing tools, based on the Esri Geometry API and Spatial Framework for Hadoop. Developers can download the source code of the tools and customize it; they can also create new tools and contribute it to the open source project. Through these tools ArcGIS users can move their spatial data and execute a pre-defined workflow inside Hadoop.

I recently learned about GeoJSON — JSON Geometry and Feature Description, but the two don’t seem to be related.

Original title and link: GIS Tools for Hadoop by Esri (NoSQL database©myNoSQL)

via: http://esri.github.com/gis-tools-for-hadoop/


White Elephant: Task Statistics for Hadoop

From LinkedIn’s engineering team:

While tools like Ganglia provide system-level metrics, we wanted to be able to understand what resources were being used by each user and at what times. White Elephant parses Hadoop logs to provide visual drill downs and rollups of task statistics for your Hadoop cluster, including total task time, slots used, CPU time, and failed job counts.

Isn’t this a form of resource usage auditing? Based on this, next you could build support for resource quotas and then start enforcing them.
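To make the auditing-then-quotas idea a bit more tangible, here is a minimal Python sketch. It is not White Elephant's code: the record layout, the `rollup_by_user` and `users_over_quota` helpers, and the quota value are all hypothetical, and real Hadoop logs would need an actual parser before reaching this step.

```python
from collections import defaultdict

# Hypothetical, already-parsed task records. Real Hadoop job history logs
# have a different format and would need a proper parser first.
TASKS = [
    {"user": "alice", "task_seconds": 1200, "failed": False},
    {"user": "bob", "task_seconds": 300, "failed": True},
    {"user": "alice", "task_seconds": 2400, "failed": False},
    {"user": "bob", "task_seconds": 900, "failed": False},
]


def rollup_by_user(tasks):
    """Aggregate total task time and failed task count per user."""
    usage = defaultdict(lambda: {"task_seconds": 0, "failed_tasks": 0})
    for task in tasks:
        entry = usage[task["user"]]
        entry["task_seconds"] += task["task_seconds"]
        entry["failed_tasks"] += int(task["failed"])
    return dict(usage)


def users_over_quota(usage, quota_seconds):
    """Return the users whose total task time exceeds an assumed quota."""
    return sorted(
        user for user, entry in usage.items()
        if entry["task_seconds"] > quota_seconds
    )


usage = rollup_by_user(TASKS)
print(usage)
print(users_over_quota(usage, quota_seconds=2000))
```

Once the per-user rollup exists, reporting and enforcement are just different consumers of the same numbers.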

Original title and link: White Elephant: Task Statistics for Hadoop (NoSQL database©myNoSQL)

via: http://engineering.linkedin.com/hadoop/white-elephant-hadoop-tool-you-never-knew-you-needed


How Does MapR Compare to Cloudera?

Staying in MapR land, the question of comparing MapR to Cloudera is answered by people from all sides (MapR, Cloudera and Hortonworks). My summary: “cool proprietary technology addressing some of the current limitations of Hadoop, but also missing some of the features the Hadoop community has come up with”.

Original title and link: How Does MapR Compare to Cloudera? (NoSQL database©myNoSQL)

via: http://www.quora.com/How-does-MapR-plan-to-compete-with-Cloudera


Paper: YSmart - Yet Another SQL-to-MapReduce Translator

Another weekend read, this time from Facebook and The Ohio State University and closer to the hot topic of the last two weeks: SQL, MapReduce, Hadoop:

MapReduce has become an effective approach to big data analytics in large cluster systems, where SQL-like queries play important roles to interface between users and systems. However, based on our Facebook daily operation results, certain types of queries are executed at an unacceptable low speed by Hive (a production SQL-to-MapReduce translator). In this paper, we demonstrate that existing SQL-to-MapReduce translators that operate in a one-operation-to-one-job mode and do not consider query correlations cannot generate high-performance MapReduce programs for certain queries, due to the mismatch between complex SQL structures and simple MapReduce framework. We propose and develop a system called YSmart, a correlation aware SQL-to-MapReduce translator. YSmart applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators. We have implemented YSmart with intensive evaluation for complex queries on two Amazon EC2 clusters and one Facebook production cluster. The results show that YSmart can outperform Hive and Pig, two widely used SQL-to-MapReduce translators, by more than four times for query execution.
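The abstract's contrast between one-operation-to-one-job translation and merging correlated operations can be illustrated with a toy, single-process Python sketch. This is neither YSmart's code nor the Hadoop API: `run_job`, the `CLICKS` rows, and the two aggregations are hypothetical, chosen only to show that two operations sharing the same input and grouping key can be computed in one job instead of two.

```python
from collections import defaultdict

# Hypothetical input table: (user, page, seconds_on_page)
CLICKS = [
    ("ann", "home", 12),
    ("bob", "home", 7),
    ("ann", "docs", 30),
    ("bob", "docs", 5),
    ("ann", "home", 3),
]


def run_job(rows, mapper, reducer):
    """Toy stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}


def one_job_per_operation(rows):
    """Two GROUP BY user aggregations as separate jobs: input scanned twice."""
    visits = run_job(rows, lambda r: [(r[0], 1)], sum)
    seconds = run_job(rows, lambda r: [(r[0], r[2])], sum)
    return {user: (visits[user], seconds[user]) for user in visits}


def merged_job(rows):
    """Same input and same grouping key, so a single job computes both."""
    def reducer(values):
        return (sum(v[0] for v in values), sum(v[1] for v in values))
    return run_job(rows, lambda r: [(r[0], (1, r[2]))], reducer)


print(one_job_per_operation(CLICKS))
print(merged_job(CLICKS))
```

Both functions return the same per-user result; the merged version reads and shuffles the input once, which is the kind of saving the paper attributes to exploiting query correlations.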


Paper: M3R - Increased Performance for In-Memory Hadoop Jobs

For the weekend reads, a paper authored by a research team from IBM:

Main Memory Map Reduce (M3R) is a new implementation of the Hadoop Map Reduce (HMR) API targeted at online analytics on high mean-time-to-failure clusters. It does not support resilience, and supports only those workloads which can fit into cluster memory. In return, it can run HMR jobs unchanged — including jobs produced by compilers for higher-level languages such as Pig, Jaql, and SystemML and interactive front-ends like IBM BigSheets — while providing significantly better performance than the Hadoop engine on several workloads (e.g. 45x on some input sizes for sparse matrix vector multiply). M3R also supports extensions to the HMR API which can enable Map Reduce jobs to run faster on the M3R engine, while not affecting their performance under the Hadoop engine.


Hadoop and Splunk Use Cases

Good post from Splunk about the use cases where Hadoop and Splunk coexist and cooperate:

The Splunk and Hadoop communities can benefit from each other’s strengths. Below are several examples of customers that use both environments.

  1. Splunk then Hadoop
    • Splunk: collects, visualizes and analyzes the data
    • Hadoop: ETL and other batch processing
  2. Hadoop then Splunk
    • Hadoop: collects the data
    • Splunk: visualization
  3. Bi-directional: Splunk and Hadoop collect different artifacts and share the data that Hadoop needs for ETL or batch analytics and Splunk needs for real-time analysis and visualization
  4. Splunk monitors Hadoop

Original title and link: Hadoop and Splunk Use Cases (NoSQL database©myNoSQL)

via: http://blogs.splunk.com/2012/11/28/hadoop-and-splunk-use-cases/


Lingual - a SQL DSL for Hadoop From Concurrent

Concurrent, the company behind Cascading, the quite popular Java application framework for Hadoop, has released Lingual, a SQL-based DSL (ANSI SQL parser) with an execution engine optimizer built on top of Cascading:

Lingual is not going to provide sub-second response times on a petabyte of data on a Hadoop cluster. Rather, the company’s goal is to provide the ability to easily move applications onto Hadoop—the challenge there is really around moving from a relational or MPP database over to Hadoop.

So it looks like not every new SQL tool around Hadoop is commercial.

Original title and link: Lingual - a SQL DSL for Hadoop From Concurrent (NoSQL database©myNoSQL)

via: http://www.infoq.com/news/2013/02/Lingual