NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



hadoop: All content tagged as hadoop in NoSQL databases and polyglot persistence

Apache Ambari is now an Apache Top Level Project


We are very excited to announce that Apache Ambari has graduated out of Incubator and is now an Apache Top Level Project!

Ambari is a framework for provisioning, managing, and monitoring Hadoop clusters.

✚ Such a tool is usually part of the distributions of Hadoop and in some cases it comes in a proprietary form.

Original title and link: Apache Ambari is now an Apache Top Level Project (NoSQL database©myNoSQL)


A quick guide to using Sentry authorization in Hive

A guide to Apache Sentry:

Sentry brings in fine-grained authorization support for both data and metadata in a Hadoop cluster. It is already being used in production systems to secure the data and provide fine-grained access to its users. It is also integrated with the version of Hive shipping in CDH (upstream contribution is pending), Cloudera Impala, and Cloudera Search.

Original title and link: A quick guide to using Sentry authorization in Hive (NoSQL database©myNoSQL)


Hadoop on SAN? Never, ever do this to Hadoop

Andrew C. Oliver in an article for InfoWorld:

I’ve done this myself, figuring we’d kick off the project and show how we could “optimize” to local disks later. Let me say this unequivocally: You absolutely should not use a SAN or NAS with Hadoop.

As simple as that.

Original title and link: Hadoop on SAN? Never, ever do this to Hadoop (NoSQL database©myNoSQL)


Using Spark for fast in-memory computing

Justin Kestelyn from Databricks describes the differences between Hadoop and Spark processing models in a post on “Cloudera’s blog“:

At its core, Spark provides a general programming model that enables developers to write application by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. […] In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory.


✚ This looks in a way similar to the Cascading programming model combined with the capability of storing in memory the working dataset for the current computations.

Original title and link: Using Spark for fast in-memory computing (NoSQL database©myNoSQL)


jumboDB - a data store for low-latency Big Data apps

From jumboDB’s homepage:

Working on Big Data projects with Telefonica Digital, Carsten Hufe and the comSysto-Team started looking for an efficient and affordable way to store and query large amounts of data being delivered in large batches through Apache Hadoop. Our goal was to build a data visualization app for end users issuing different kinds of selective queries on already processed data. Some of the queries were returning large result sets of up to 800.000 JSON documents representing data points for browser visualisation.

Why not using HBase if you already have Hadoop?

Original title and link: jumboDB - a data store for low-latency Big Data apps (NoSQL database©myNoSQL)


Essential migration steps for a Hadoop cluster to Hortonworks Data Platform 2.0

Ulf Sandberg:

A Hadoop distribution has multiple Apache components, and possibly some vendor-specific components. This graphic shows best practice for the order in which to migrate the various components. The Hortonworks services team has automated some of the migration steps to simplify the process.

It’s been only a few years since the inception of the Hadoop platform as a result of the collaboration of people that believed in open source and community. Now we are already talking about vendor-specific components. I’m afraid to think that in just a couple of years, we might be talking only about vendor-based, proprietary distributions of Hadoop.

Original title and link: Essential migration steps for a Hadoop cluster to Hortonworks Data Platform 2.0 (NoSQL database©myNoSQL)


A prolific season for Hadoop and its ecosystem

In 4 years of writing this blog I haven’t seen such a prolific month:

  • Apache Hadoop 2.2.0 (more links here)
  • Apache HBase 0.96 (here and here)
  • Apache Hive 0.12 (more links here)
  • Apache Ambari 1.4.1
  • Apache Pig 0.12
  • Apache Oozie 4.0.0
  • Plus Presto.

Actually I don’t think I’ve ever seen such an ecosystem like the one created around Hadoop.

Original title and link: A prolific season for Hadoop and its ecosystem (NoSQL database©myNoSQL)

Status update on Project Stinger, the interactive query for Apache Hive

Cloudera is investing in Impala. Pivotal in HAWQ. Facebook, who created Hive, has announced Presto.

Hortonworks continues to work on Hive with project Stinger and Apache Tez. Mid-October, they announced Hive 0.12:


And at the end of October, Hortonworks has shared a new set of results:

Historically, even simple Hive queries could not run in less than 30 seconds, yet many of these queries are running in less than 10 seconds. How did that happen? The answer mainly boils down to Apache Tez and Apache Hadoop YARN, which proves that Hadoop is more than just batch. Tez features such as container pre-launch and re-use overcome Hadoop’s traditional latency barriers, and are available to any data processing framework running in Hadoop.


Pretty impressive.

Original title and link: Status update on Project Stinger, the interactive query for Apache Hive (NoSQL database©myNoSQL)

Apache Hadoop Compatibility Guide

I’ve learned that there’s an Apache Hadoop compatibility guide that covers API, wire, Java binary compatibility, any many other such aspects.

✚ Karthik Kambatla posted on Cloudera’s blog Writing Hadoop programs that work across releases that looks at the Hadoop API annotations and compatibility policies.

Original title and link: Apache Hadoop Compatibility Guide (NoSQL database©myNoSQL)

To everybody who uses MapReduce: what problems do you solve?

At the time I’m reading this Ask HN: To everybody who uses MapReduce: what problems do you solve?, there aren’t many interesting answers.

✚ Compare it with AskReddit: What is an invention that the human race is fully capable of making, but hasn’t been made yet?

Original title and link: To everybody who uses MapReduce: what problems do you solve? (NoSQL database©myNoSQL)

Apache Hadoop 2 - YARN is GA

Even if there’s been almost 3 weeks since the announcement, Apache Hadoop 2 is too big of a news not to mention it here. If you want to read something about it, here are a couple of links:

  • The Apache Software Foundation Announces Apache™ Hadoop™ 2 (a bit PRish)

    Doug Cutting:

    What started out a few years ago as a scalable batch processing system for Java programmers has now emerged as the kernel of the operating system for big data.

  • A short interview with Rohit Bakhshi (product manager at Hortonwork) YARN Brings New Capabilities To Hadoop:

    By turning Apache Hadoop 2.0 into a multi- application data system, YARN enables the Hadoop community to address a generation of new requirements IN Hadoop. YARN responds to these enterprise challenges by addressing the actual requirements at a foundational level rather than being commercial bolt-ons that complicate the environment for customers.

  • Mike Miller’s post on GigaOm: Why the world should care about Hadoop 2:

    This might be surprising, because Hadoop 2 is not a blow-your-socks-off release. It is not packed with revolutionary new features from a user perspective. Instead, its greatest innovation is a glorious refactoring of some internal plumbing. But that plumbing grants the community of Hadoop developers the pathways they need to address some of Hadoops greatest shortcomings in comparison to both the commercial and the internal Google tools that Hadoop was derived from.

  • Last but not least, any article you can find about YARN and signed Aarun C. Murthy will be well worth reading (e.g. Apache Hadoop YARN – Background and an Overview, old but very very details series about YARN’s objectives, or Moving Hadoop Beyond Batch with Apache YARN

Original title and link: Apache Hadoop 2 - YARN is GA (NoSQL database©myNoSQL)

Hadoop Buyer's Guide

Alan Gardner reads a marketing material about Hadoop choices:

…this guide is specifically designed to be incorporated into your RFP when it comes to evaluating Hadoop platforms. - Hadoop Buyer’s Guide, page 1

The Guide makes some bold promises right from page one. Not only will it literally write your RFP, but it will also explain “… why selecting a Hadoop platform is so vital”. Ostensibly the alternative, a Hadoop quantum superposition, is difficult and costly to maintain at room temperature.

I have always wondered who’s the target audience of these pseudo-technical marketing materials. Moreover, I’ve always wondered if there’s a single person that made a decision based on such a thing1.

  1. I really cannot call this a (white)paper

Original title and link: Hadoop Buyer’s Guide (NoSQL database©myNoSQL)