


mapreduce: All content tagged as mapreduce in NoSQL databases and polyglot persistence

Hadoop in Fedora 20

Being included in the default Fedora distro is yet another big step for Hadoop.

The hardest part about getting Hadoop into Fedora? “Dependencies, dependencies, dependencies!” says Farrellee. […]

For Hadoop? It was more difficult than usual. “There were some dependencies that were just missing and we had to work through those as you’d expect - there were a lot of these. Then there were dependencies that were older than what upstream was using - rare, I know, for Fedora, which aims to be on the bleeding edge. The hardest to deal with were dependencies that were newer than what upstream was using. We tried to write patches for these, but we weren’t always successful. […]”

On the other hand, one thing that continues to puzzle me is: how many different people coming from different backgrounds need to say that Hadoop is crazy complex?

Original title and link: Hadoop in Fedora 20 (NoSQL database©myNoSQL)


Minimal MapReduce Algorithms - Paper

Abstract of the paper authored by a team from universities in Hong Kong, Korea, and Singapore:

MapReduce has become a dominant parallel computing paradigm for big data, i.e., colossal datasets at the scale of tera-bytes or higher. Ideally, a MapReduce system should achieve a high degree of load balancing among the participating machines, and minimize the space usage, CPU and I/O time, and network transfer at each machine. Although these principles have guided the development of MapReduce algorithms, limited emphasis has been placed on enforcing serious constraints on the aforementioned metrics simultaneously. This paper presents the notion of minimal algorithm, that is, an algorithm that guarantees the best parallelization in multiple aspects at the same time, up to a small constant factor. We show the existence of elegant minimal algorithms for a set of fundamental database problems, and demonstrate their excellent performance with extensive experiments.

Start with the definition of the minimal MapReduce algorithms and you’ll find yourself diving into the paper (even if the proof parts are complex).
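To make the load-balancing goal from the abstract more concrete, here is a small single-process sketch of a sample-based range-partitioning sort, in the spirit of TeraSort. It is an illustration only and not the paper's algorithm or proof: the function name `range_partition_sort`, the `sample_rate` and `seed` parameters, and the use of plain lists to stand in for machines are all assumptions made for this example.

```python
import bisect
import random


def range_partition_sort(chunks, sample_rate=0.1, seed=0):
    """Sort data spread over t simulated machines (one list per machine).

    Hypothetical illustration: two "rounds" run inside one process.
    """
    rng = random.Random(seed)
    t = len(chunks)

    # Round 1: every machine samples its local data and the samples are
    # gathered in one place.
    sample = sorted(
        x for chunk in chunks for x in chunk if rng.random() < sample_rate
    )

    # Pick t-1 boundaries that split the sample into roughly equal ranges.
    if sample:
        boundaries = [sample[(i * len(sample)) // t] for i in range(1, t)]
    else:
        boundaries = []

    # Round 2: route each element to the machine that owns its range,
    # then sort locally on that machine.
    parts = [[] for _ in range(t)]
    for chunk in chunks:
        for x in chunk:
            parts[bisect.bisect_right(boundaries, x)].append(x)
    return [sorted(p) for p in parts]


data = list(range(1000))
random.Random(1).shuffle(data)
chunks = [data[i::4] for i in range(4)]      # 4 simulated machines
parts = range_partition_sort(chunks)
loads = [len(p) for p in parts]              # how evenly the work was spread
```

Concatenating `parts` in order gives the fully sorted data, and `loads` shows how much each simulated machine ended up holding. How close those loads stay to n/t depends on the sample, which is the kind of guarantee the paper sets out to bound.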

HAWK: Performance monitoring tool for Hive

JunHo Cho’s slides introducing HAWK, a performance monitoring tool for Hive:

✚ I couldn’t find a link for HAWK. The slides point to NexR.


Apache Ambari is now an Apache Top Level Project


We are very excited to announce that Apache Ambari has graduated out of Incubator and is now an Apache Top Level Project!

Ambari is a framework for provisioning, managing, and monitoring Hadoop clusters.

✚ Tools like this usually ship as part of a Hadoop distribution, and in some cases they are proprietary.



A quick guide to using Sentry authorization in Hive

A guide to Apache Sentry:

Sentry brings in fine-grained authorization support for both data and metadata in a Hadoop cluster. It is already being used in production systems to secure the data and provide fine-grained access to its users. It is also integrated with the version of Hive shipping in CDH (upstream contribution is pending), Cloudera Impala, and Cloudera Search.



Hadoop on SAN? Never, ever do this to Hadoop

Andrew C. Oliver in an article for InfoWorld:

I’ve done this myself, figuring we’d kick off the project and show how we could “optimize” to local disks later. Let me say this unequivocally: You absolutely should not use a SAN or NAS with Hadoop.

As simple as that.



Using Spark for fast in-memory computing

Justin Kestelyn from Databricks describes the differences between the Hadoop and Spark processing models in a post on Cloudera’s blog:

At its core, Spark provides a general programming model that enables developers to write application by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. […] In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory.


✚ This looks somewhat similar to the Cascading programming model, combined with the ability to keep the working dataset of the current computation in memory.
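The idea in the quote, composing operators and keeping an intermediate result in memory, can be sketched in plain Python. This is a toy model and not Spark's API: the `Dataset` class and its `from_list`, `map`, `filter`, `group_by`, `cache`, and `collect` methods are hypothetical names chosen for this example, and everything runs in a single process.

```python
class Dataset:
    """A lazily evaluated dataset built by composing operators.

    Hypothetical toy class for illustration; it is not Spark's RDD API.
    """

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning a list
        self._keep = False        # set by cache()
        self._cached = None

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def map(self, fn):
        return Dataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return Dataset(lambda: [x for x in self.collect() if pred(x)])

    def group_by(self, key):
        def run():
            groups = {}
            for x in self.collect():
                groups.setdefault(key(x), []).append(x)
            return sorted(groups.items())
        return Dataset(run)

    def cache(self):
        # Ask for the result to be kept in memory after the first evaluation.
        self._keep = True
        return self

    def collect(self):
        # Nothing is computed until collect() is called.
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._keep:
            self._cached = result
        return result


calls = []

def parse(line):
    calls.append(line)            # record how often parsing actually runs
    return int(line)

numbers = Dataset.from_list(["1", "2", "3", "4"]).map(parse).cache()
evens = numbers.filter(lambda n: n % 2 == 0).collect()
total = sum(numbers.collect())
```

Because `numbers` is cached, the second pass over it reuses the in-memory result and `parse` runs only once per input line. Without the `cache()` call, each downstream computation would re-run the whole chain of operators.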



jumboDB - a data store for low-latency Big Data apps

From jumboDB’s homepage:

Working on Big Data projects with Telefonica Digital, Carsten Hufe and the comSysto-Team started looking for an efficient and affordable way to store and query large amounts of data being delivered in large batches through Apache Hadoop. Our goal was to build a data visualization app for end users issuing different kinds of selective queries on already processed data. Some of the queries were returning large result sets of up to 800.000 JSON documents representing data points for browser visualisation.

Why not use HBase if you already have Hadoop?



Essential migration steps for a Hadoop cluster to Hortonworks Data Platform 2.0

Ulf Sandberg:

A Hadoop distribution has multiple Apache components, and possibly some vendor-specific components. This graphic shows best practice for the order in which to migrate the various components. The Hortonworks services team has automated some of the migration steps to simplify the process.

It’s been only a few years since Hadoop emerged from the collaboration of people who believed in open source and community. Now we are already talking about vendor-specific components. I’m afraid that in just a couple of years we might be talking only about vendor-based, proprietary distributions of Hadoop.



A prolific season for Hadoop and its ecosystem

In 4 years of writing this blog I haven’t seen such a prolific month:

  • Apache Hadoop 2.2.0
  • Apache HBase 0.96
  • Apache Hive 0.12
  • Apache Ambari 1.4.1
  • Apache Pig 0.12
  • Apache Oozie 4.0.0
  • Plus Presto.

Actually, I don’t think I’ve ever seen an ecosystem like the one created around Hadoop.


Apache Hadoop Compatibility Guide

I’ve learned that there’s an Apache Hadoop compatibility guide that covers API, wire, and Java binary compatibility, and many other such aspects.

✚ Karthik Kambatla posted “Writing Hadoop programs that work across releases” on Cloudera’s blog, a piece that looks at the Hadoop API annotations and compatibility policies.


To everybody who uses MapReduce: what problems do you solve?

At the time I’m reading the Ask HN thread “To everybody who uses MapReduce: what problems do you solve?”, there aren’t many interesting answers.

✚ Compare it with AskReddit: What is an invention that the human race is fully capable of making, but hasn’t been made yet?
