Hadoop: All content tagged as Hadoop in NoSQL databases and polyglot persistence

Using Spark for fast in-memory computing

Justin Kestelyn from Databricks describes the differences between the Hadoop and Spark processing models in a post on Cloudera’s blog:

At its core, Spark provides a general programming model that enables developers to write applications by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. […] In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory.


✚ In a way, this looks similar to the Cascading programming model, combined with the ability to keep the working dataset for the current computation in memory.
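
✚ To make the operator-composition part concrete, here is a minimal sketch using Spark’s Java API. The input path and the parsing logic are made up for illustration, and the lambda syntax assumes a Java 8 / newer Spark combination rather than the exact releases discussed in the post.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class OperatorComposition {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("operator-composition").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Compose arbitrary operators (a filter, a mapper, a reduce-by-key)
            // instead of fitting everything into a single map/reduce pair.
            JavaRDD<String> lines = sc.textFile("hdfs:///logs/access.log");  // hypothetical input
            JavaPairRDD<String, Integer> hitsPerClient = lines
                    .filter(line -> !line.isEmpty())
                    .mapToPair(line -> new Tuple2<>(line.split(" ")[0], 1))
                    .reduceByKey(Integer::sum);

            // Keep the working dataset in memory so subsequent computations over it
            // don't need to re-read and re-parse the input from HDFS.
            hitsPerClient.cache();

            System.out.println(hitsPerClient.count());
            sc.stop();
        }
    }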

Original title and link: Using Spark for fast in-memory computing (NoSQL database©myNoSQL)


jumboDB - a data store for low-latency Big Data apps

From jumboDB’s homepage:

Working on Big Data projects with Telefonica Digital, Carsten Hufe and the comSysto-Team started looking for an efficient and affordable way to store and query large amounts of data being delivered in large batches through Apache Hadoop. Our goal was to build a data visualization app for end users issuing different kinds of selective queries on already processed data. Some of the queries were returning large result sets of up to 800.000 JSON documents representing data points for browser visualisation.

Why not use HBase if you already have Hadoop?

Original title and link: jumboDB - a data store for low-latency Big Data apps (NoSQL database©myNoSQL)


Essential migration steps for a Hadoop cluster to Hortonworks Data Platform 2.0

Ulf Sandberg:

A Hadoop distribution has multiple Apache components, and possibly some vendor-specific components. This graphic shows best practice for the order in which to migrate the various components. The Hortonworks services team has automated some of the migration steps to simplify the process.

It’s been only a few years since the inception of the Hadoop platform as the result of a collaboration of people who believed in open source and community. Now we are already talking about vendor-specific components. I’m afraid that in just a couple of years we might be talking only about vendor-based, proprietary distributions of Hadoop.

Original title and link: Essential migration steps for a Hadoop cluster to Hortonworks Data Platform 2.0 (NoSQL database©myNoSQL)


A prolific season for Hadoop and its ecosystem

In 4 years of writing this blog I haven’t seen such a prolific month:

  • Apache Hadoop 2.2.0 (more links here)
  • Apache HBase 0.96 (here and here)
  • Apache Hive 0.12 (more links here)
  • Apache Ambari 1.4.1
  • Apache Pig 0.12
  • Apache Oozie 4.0.0
  • Plus Presto.

Actually, I don’t think I’ve ever seen an ecosystem like the one created around Hadoop.

Original title and link: A prolific season for Hadoop and its ecosystem (NoSQL database©myNoSQL)

Status update on Project Stinger, the interactive query for Apache Hive

Cloudera is investing in Impala. Pivotal in HAWQ. Facebook, which created Hive, has announced Presto.

Hortonworks continues to work on Hive with project Stinger and Apache Tez. In mid-October they announced Hive 0.12.


And at the end of October, Hortonworks shared a new set of results:

Historically, even simple Hive queries could not run in less than 30 seconds, yet many of these queries are running in less than 10 seconds. How did that happen? The answer mainly boils down to Apache Tez and Apache Hadoop YARN, which proves that Hadoop is more than just batch. Tez features such as container pre-launch and re-use overcome Hadoop’s traditional latency barriers, and are available to any data processing framework running in Hadoop.


Pretty impressive.

Original title and link: Status update on Project Stinger, the interactive query for Apache Hive (NoSQL database©myNoSQL)

Apache Hadoop Compatibility Guide

I’ve learned that there’s an Apache Hadoop compatibility guide that covers API, wire, and Java binary compatibility, and many other such aspects.

✚ Karthik Kambatla posted Writing Hadoop programs that work across releases on Cloudera’s blog, which looks at the Hadoop API annotations and compatibility policies.
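
✚ For the curious, those compatibility policies are expressed in the source tree through the org.apache.hadoop.classification annotations, InterfaceAudience and InterfaceStability. A minimal sketch of how they are applied; the annotated classes below are hypothetical, only the annotations are Hadoop’s:

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;

    @InterfaceAudience.Public        // any downstream project may depend on this
    @InterfaceStability.Stable       // incompatible changes only at major releases
    public class PublicClientApi {
    }

    @InterfaceAudience.LimitedPrivate({"HDFS", "MapReduce"})  // shared only with the named projects
    @InterfaceStability.Evolving                              // may change between minor releases
    class InternalHelper {
    }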

Original title and link: Apache Hadoop Compatibility Guide (NoSQL database©myNoSQL)

To everybody who uses MapReduce: what problems do you solve?

At the time I’m reading this Ask HN (“To everybody who uses MapReduce: what problems do you solve?”), there aren’t many interesting answers.

✚ Compare it with AskReddit: What is an invention that the human race is fully capable of making, but hasn’t been made yet?

Original title and link: To everybody who uses MapReduce: what problems do you solve? (NoSQL database©myNoSQL)

Apache Hadoop 2 - YARN is GA

Even if it’s been almost 3 weeks since the announcement, Apache Hadoop 2 is too big a piece of news not to mention here. If you want to read something about it, here are a couple of links:

  • The Apache Software Foundation Announces Apache™ Hadoop™ 2 (a bit PRish)

    Doug Cutting:

    What started out a few years ago as a scalable batch processing system for Java programmers has now emerged as the kernel of the operating system for big data.

  • A short interview with Rohit Bakhshi (product manager at Hortonworks), YARN Brings New Capabilities To Hadoop:

    By turning Apache Hadoop 2.0 into a multi-application data system, YARN enables the Hadoop community to address a generation of new requirements IN Hadoop. YARN responds to these enterprise challenges by addressing the actual requirements at a foundational level rather than being commercial bolt-ons that complicate the environment for customers.

  • Mike Miller’s post on GigaOm: Why the world should care about Hadoop 2:

    This might be surprising, because Hadoop 2 is not a blow-your-socks-off release. It is not packed with revolutionary new features from a user perspective. Instead, its greatest innovation is a glorious refactoring of some internal plumbing. But that plumbing grants the community of Hadoop developers the pathways they need to address some of Hadoop’s greatest shortcomings in comparison to both the commercial and the internal Google tools that Hadoop was derived from.

  • Last but not least, any article about YARN signed by Arun C. Murthy will be well worth reading (e.g. Apache Hadoop YARN – Background and an Overview, an old but very, very detailed series about YARN’s objectives, or Moving Hadoop Beyond Batch with Apache YARN).
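
✚ To put the “multi-application data system” quote above in more concrete terms, here is a minimal, hypothetical sketch of how any framework, not just MapReduce, asks YARN to launch its ApplicationMaster. The classes come from Hadoop 2’s YARN client API; the application name, command, and resource numbers are made up, and error handling is skipped.

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitToYarn {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new Configuration());
            yarnClient.start();

            // Ask the ResourceManager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("my-framework");  // any engine, not just MapReduce

            // Describe the container that runs this framework's ApplicationMaster.
            ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList("java -jar my-am.jar"));
            ctx.setAMContainerSpec(amContainer);

            Resource capability = Records.newRecord(Resource.class);
            capability.setMemory(512);
            capability.setVirtualCores(1);
            ctx.setResource(capability);

            // From here on, the ApplicationMaster negotiates containers for its own
            // execution model; MapReduce is just one such ApplicationMaster.
            yarnClient.submitApplication(ctx);
        }
    }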

Original title and link: Apache Hadoop 2 - YARN is GA (NoSQL database©myNoSQL)

Hadoop Buyer's Guide

Alan Gardner reads a piece of marketing material about Hadoop choices:

…this guide is specifically designed to be incorporated into your RFP when it comes to evaluating Hadoop platforms. - Hadoop Buyer’s Guide, page 1

The Guide makes some bold promises right from page one. Not only will it literally write your RFP, but it will also explain “… why selecting a Hadoop platform is so vital”. Ostensibly the alternative, a Hadoop quantum superposition, is difficult and costly to maintain at room temperature.

I have always wondered who the target audience of these pseudo-technical marketing materials is. Moreover, I’ve always wondered if there’s a single person who made a decision based on such a thing¹.

  1. I really cannot call this a (white)paper

Original title and link: Hadoop Buyer’s Guide (NoSQL database©myNoSQL)


Teradata: Hadoop, big data technologies 'small factor' in our slowdown

Larry Dignan for ZDNet, reporting from Teradata’s quarterly earnings call:

Teradata on Thursday moved to shoot down the theory that Hadoop and open source big data technologies are putting the kibosh on data warehouse rollouts.

The explanation offered for the slowdown:

The major contributor to our reduced revenue guidance for 2013 was the number of data warehouse opportunities that have moved out into 2014 with a large amount of that happening in the US where the pent-up demand in our user base that we expected to see in the second half has not materialized yet.

I’m wondering what the real reason for not closing these deals is. Maybe, just maybe, it’s those customers deciding to spend a bit more time learning about new technologies before writing the big checks.

Original title and link: Teradata: Hadoop, big data technologies ‘small factor’ in our slowdown (NoSQL database©myNoSQL)


How to Escape the Dark Valley of Your Hadoop Journey

The conclusion of the post is more balanced than the beginning, which reads like it’s doomsday¹:

The power of big data has been established, but our understanding of how to exploit it in the most productive way is still maturing. The initial toolset that came with Hadoop didn’t anticipate the kinds of enterprise applications and powerful analyses that businesses would want to build on it. Thus, many have fallen into the Dark Valley. But a new breed of middleware (APIs and DSLs) has arrived. They keep track of all the variables and peculiarities of Hadoop, abstract them away from development, and offer better reliability, sustainability and operational characteristics so that enterprises can find their way back out into the light.

Everyone who doesn’t have extensive experience with Hadoop will realize its complexity right away. But…

Is this complexity insurmountable? No. Does addressing Hadoop’s complexity really require huge budgets? No. Is it the fault of Hadoop that other tools aren’t working well with it? Definitely not. Can Hadoop and vendors offer a better experience? The answer is a resounding Yes.
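
✚ One example of the kind of middleware (APIs and DSLs) the post has in mind is Cascading, made by Concurrent (see the footnote below). A minimal word-count-style sketch, with hypothetical input and output paths, of what “abstracting Hadoop away from development” looks like in practice:

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCountSketch {
        public static void main(String[] args) {
            // Source and sink taps over HDFS (paths are placeholders).
            Tap docTap = new Hfs(new TextLine(new Fields("line")), "hdfs:///input/docs");
            Tap wcTap = new Hfs(new TextDelimited(true, "\t"), "hdfs:///output/wc");

            // Split each line into tokens, group by token, and count per group;
            // Cascading plans the underlying MapReduce jobs from this pipe assembly.
            Pipe docPipe = new Each("tokens", new Fields("line"),
                    new RegexSplitGenerator(new Fields("token"), "\\s+"), Fields.RESULTS);
            Pipe wcPipe = new Pipe("wc", docPipe);
            wcPipe = new GroupBy(wcPipe, new Fields("token"));
            wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

            FlowDef flowDef = FlowDef.flowDef()
                    .setName("wc")
                    .addSource(docPipe, docTap)
                    .addTailSink(wcPipe, wcTap);

            new HadoopFlowConnector().connect(flowDef).complete();
        }
    }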

  1. Keep in mind that the article was written by the CEO of Concurrent, a company that promotes better tools for Hadoop. 

Original title and link: How to Escape the Dark Valley of Your Hadoop Journey (NoSQL database©myNoSQL)


Spark and Shark company Databricks raises $14M from Andreessen Horowitz

Spark and Shark getting wings:

A team of professors who has created the in-memory Spark and Shark platforms for analyzing big data has raised nearly $13.9 million to commercialize those products. The company is still in stealth mode, but it’s called Databricks and Andreessen Horowitz led the round. […] It also lists Databricks’ very impressive board of directors: Co-founder and CEO Ion Stoica (University of California, Berkeley professor and former co-founder and CEO of Conviva); Co-founder and CTO Matei Zaharia (MIT professor); Ben Horowitz (general partner at Andreessen Horowitz and former Opsware co-founder and CEO); and Scott Shenker (University of California, Berkeley professor and former Nicira co-founder and CEO).

You have probably already heard of all these guys.

Original title and link: Spark and Shark company Databricks raises $14M from Andreessen Horowitz (NoSQL database©myNoSQL)