


Hadoop: all content about Hadoop in NoSQL databases and polyglot persistence

SQL-on-Hadoop: Pivotal HAWQ benchmark.

The results bore out Pivotal’s statement that HAWQ is the world’s fastest SQL query engine on Hadoop […] The paper, titled “Orca: A Modular Query Optimizer Architecture for Big Data,” includes benchmark results based on the TPC-DS, a well-known decision support benchmark that models several generally applicable aspects of a decision support system.

Pivotal’s SQL-on-Hadoop solution is based on a cost-based query optimizer.
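The core idea of a cost-based optimizer — enumerate candidate plans, estimate each one's cost from table statistics, and keep the cheapest — can be sketched roughly like this (a toy illustration only, not Orca's actual design; the strategies and cost formulas are invented for the example):

```python
# Toy sketch of cost-based plan selection: estimate the cost of each
# candidate join strategy from table statistics and keep the cheapest.
# The strategies and the cost formulas are illustrative, not Orca's.

def estimate_cost(plan, stats):
    left_rows = stats[plan["left"]]
    right_rows = stats[plan["right"]]
    if plan["strategy"] == "broadcast_join":
        # Ship the small side to every node, then stream the large side.
        return right_rows * stats["nodes"] + left_rows
    if plan["strategy"] == "hash_join":
        # Redistribute both sides by join key, then build/probe a hash table.
        return left_rows + right_rows * 2
    raise ValueError("unknown strategy")

def choose_plan(plans, stats):
    return min(plans, key=lambda p: estimate_cost(p, stats))

stats = {"orders": 1_000_000, "customers": 10_000, "nodes": 16}
plans = [
    {"strategy": "broadcast_join", "left": "orders", "right": "customers"},
    {"strategy": "hash_join", "left": "orders", "right": "customers"},
]
best = choose_plan(plans, stats)
print(best["strategy"])
```

The point of the benchmark discussion above is exactly this: with a cost model and statistics, the optimizer — not the query author — picks the execution strategy.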


The expanding alternative universe of Hadoop

Merv Adrian:

Hadoop has moved from a coarse-grained blunt instrument for largely ETL-style workloads to an expanding stack for virtually any IT task big data professionals will want to undertake. What is Hadoop now? It’s a candidate to be the alternative universe for data processing, with over 20 components that span a wide array of functions.

As the Hadoop alternative universe expands, its complexity continues to grow too. The whole purpose of the “big data platforms” from Cloudera and Hortonworks is to make this universe navigable, but it feels like the majority of travelers still need a lot of patience and courage to discover it.

Original title and link: The expanding alternative universe of Hadoop (NoSQL database©myNoSQL)


Dell and Cloudera and Intel join forces for appliances

Me in Intel kills a Hadoop and feeds another:

As for Intel, what if this investment also sealed an exclusive deal for a Hadoop-centric, Cloudera-supported, Intel-powered appliance?

I didn’t know about the existing Dell-Cloudera-Intel partnership, but it is reinforced by the recent announcement of an in-memory appliance.

Since 2011, Cloudera, Dell and Intel have built pre-validated reference architectures for Hadoop. […]

The Dell In-Memory Appliances for Cloudera Enterprise is yet another proof point of the collaboration and synergies between the three companies. As the first of a family of appliances, it includes leading Dell hardware, Cloudera’s enterprise data hub, based on Cloudera Enterprise, Intel architecture for fast processing, and ScaleMP’s Versatile SMP (vSMP) architecture to aggregate multiple x86 servers into a single virtual machine to create large memory pools for in-memory processing.

Original title and link: Dell and Cloudera and Intel join forces for appliances (NoSQL database©myNoSQL)

Using Elastic MapReduce as a generic Hadoop cluster manager

Steve McPherson for the AWS Blog:

Despite the name Elastic MapReduce, the service goes far beyond batch-oriented processing. Clusters in EMR have a flexible and rich cluster-management framework that users can customize to run any Hadoop ecosystem application, such as low-latency query engines like HBase (with Phoenix), Impala, and Spark/Shark, and machine learning frameworks like Mahout. These additional components can be installed using Bootstrap Actions or Steps.

Operational simplicity is critical in the early days of many companies, when large hardware investments and time matter most. Amazon is building a huge data ecosystem to convince its users to stay even afterwards (the more data you put in, the more difficult it is to move it out later).
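The Bootstrap Actions mechanism mentioned in the quote amounts to attaching startup scripts to the cluster request, which EMR then runs on every node before Hadoop starts. A minimal sketch of what such a request might look like (the field names follow the general shape of the EMR API, but the S3 path, cluster name, and instance sizing are all hypothetical, and no API call is made here):

```python
# Sketch of an EMR-style cluster request that installs extra ecosystem
# components via bootstrap actions. The S3 paths and instance details
# are hypothetical; a real request would be sent through an AWS SDK.

def make_cluster_request(name, install_scripts):
    return {
        "Name": name,
        "Instances": {
            "MasterInstanceType": "m1.large",  # hypothetical sizing
            "SlaveInstanceType": "m1.large",
            "InstanceCount": 4,
        },
        # Each bootstrap action runs on every node before Hadoop starts,
        # which is how components like HBase or Spark get installed.
        "BootstrapActions": [
            {
                "Name": script.rsplit("/", 1)[-1],
                "ScriptBootstrapAction": {"Path": script},
            }
            for script in install_scripts
        ],
    }

request = make_cluster_request(
    "analytics-cluster",  # hypothetical cluster name
    ["s3://my-bucket/bootstrap/install-spark.sh"],  # hypothetical path
)
print(request["BootstrapActions"][0]["Name"])
```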

Original title and link: Using Elastic MapReduce as a generic Hadoop cluster manager (NoSQL database©myNoSQL)


Three questions about MapR and their products.

There are three things that I’d really appreciate some help understanding:

  1. MapR says it is an Apache Hadoop distribution. Do any of the MapR products include the actual Apache Hadoop code?

    While I know there’s no official definition of such a thing, self-claimed API compatibility is by no means the same thing as Apache Hadoop.

    I’m also not aware of any action from ASF on this matter.

  2. MapR says it’s the most complete distribution of Hadoop. The matrix below, from Kirill Grigorchuk’s summary of Altoros’s Hadoop Distributions: Cloudera vs. Hortonworks vs. MapR paper, doesn’t seem to confirm this.

    Hadoop distros compared: Cloudera vs Hortonworks vs MapR

  3. MapR says it is committed to open source. I’ve checked the lists of committers for Apache Hadoop, Apache HBase, Apache Pig, and Apache ZooKeeper, and, except for Ted Dunning’s PMC role in Apache ZooKeeper, I couldn’t find any MapR employees listed.

Original title and link: Three questions about MapR and their products. (NoSQL database©myNoSQL)

Big Data benchmark: Redshift, Hive, Impala, Shark, Stinger/Tez

Hosted on AMPLab, the origin of Spark, this benchmark compares Redshift, Hive, Shark, Impala, and Stinger/Tez:

Several analytic frameworks have been announced in the last year. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ) and systems which optimize MapReduce to improve performance on analytical workloads (Shark, Stinger/Tez). This benchmark provides quantitative and qualitative comparisons of five systems. It is entirely hosted on EC2 and can be reproduced directly from your computer.

More important than the results:

  1. the clear methodology
  2. and its reproducibility

Original title and link: Big Data benchmark: Redshift, Hive, Impala, Shark, Stinger/Tez (NoSQL database©myNoSQL)


Moving product recommendations from Hadoop to Redshift saves us time and money

Our old relational data warehousing solution, Hive, was not performant enough for us to generate product recommendations in SQL in our configuration.

This right here describes the common theme across all the “Redshift is so much faster and cheaper than Hive” stories: expecting a relational data warehouse from Hadoop and Hive. You tell me if that’s the right expectation.

Here are other similar “revelations”:

Original title and link: Moving product recommendations from Hadoop to Redshift saves us time and money (NoSQL database©myNoSQL)


What does comprehensive security mean for Hadoop?

Hortonworks and their new security team explain the current status and their plans for a “holistic and comprehensive” security solution for Hadoop:

A comprehensive security approach means that irrespective of how the data is stored and accessed, there should be an integrated framework for securing data. Enterprises may adopt any use case (batch, real time, interactive), but data should be secured through the same standards, and security should be administered centrally and in one place.

✚ If you have only a couple of seconds, focus on the diagram under the section “HDP + XA - Current offering” and skim the following four sections: Authentication, Authorization, Auditing, Data protection.


✚ It’s safe to assume this post was meant to introduce Hortonworks’ position on Hadoop security as compared to Cloudera’s (and their collaboration with Intel on security aspects):

Original title and link: What does comprehensive security mean for Hadoop? (NoSQL database©myNoSQL)


Project Rhino goal: at-rest encryption for Apache Hadoop

Although network encryption has been provided in the Apache Hadoop platform for some time (since Hadoop 2.0.2-alpha/CDH 4.1), at-rest encryption, the encryption of data stored on persistent storage such as disk, is not. To meet that requirement in the platform, Cloudera and Intel are working with the rest of the Hadoop community under the umbrella of Project Rhino — an effort to bring a comprehensive security framework for data protection to Hadoop, which also now includes Apache Sentry (incubating) — to implement at-rest encryption for HDFS (HDFS-6134 and HADOOP-10150).

Looks like I got this wrong: Apache Sentry will become part of Project Rhino.

Original title and link: Project Rhino goal: at-rest encryption for Apache Hadoop (NoSQL database©myNoSQL)


Hadoop security: unifying Project Rhino and Sentry

One result of Intel’s investment in Cloudera is putting together the teams to work on the same projects:

As the goals of Project Rhino and Sentry to develop more robust authorization mechanisms in Apache Hadoop are in complete alignment, the efforts of the engineers and security experts from both companies have merged, and their work now contributes to both projects. The specific goal is “unified authorization”, which goes beyond setting up authorization policies for multiple Hadoop components in a single administrative tool; it means setting an access policy once (typically tied to a “group” defined in an external user directory) and having it enforced across all of the different tools that this group of people uses to access data in Hadoop – for example access through Hive, Impala, search, as well as access from tools that execute MapReduce, Pig, and beyond.

A great first step.

You know what would be even better? A single security framework for Hadoop instead of two.
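The “unified authorization” idea from the quote — set an access policy once, tied to a group, and have every access path enforce it — can be sketched as a single policy store consulted by every tool (a minimal illustration; the group names, tools, and policy format are invented for this example):

```python
# Minimal sketch of unified authorization: one policy, tied to a group,
# enforced identically no matter which tool accesses the data.

POLICIES = {
    # (group, resource) -> allowed actions; defined once, centrally.
    ("analysts", "sales_db.orders"): {"read"},
    ("etl", "sales_db.orders"): {"read", "write"},
}

# Group membership would typically come from an external user directory
# such as LDAP; hard-coded here for the sketch.
USER_GROUPS = {"alice": "analysts", "bob": "etl"}

def is_allowed(user, resource, action):
    group = USER_GROUPS.get(user)
    return action in POLICIES.get((group, resource), set())

# Every access path (Hive, Impala, MapReduce jobs, ...) calls the same
# check, so the policy is written once and enforced consistently.
for tool in ("hive", "impala", "mapreduce"):
    assert is_allowed("alice", "sales_db.orders", "read")
    assert not is_allowed("alice", "sales_db.orders", "write")
print("policy enforced uniformly")
```

The alternative — per-tool authorization rules — is exactly the duplication the two projects are trying to eliminate.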

Original title and link: Hadoop security: unifying Project Rhino and Sentry (NoSQL database©myNoSQL)


Hortonworks’ Hadoop secret weapon is... Yahoo

Derrick Harris:

Hortonworks was working right alongside Yahoo all through that process. They’ve also worked together on things like rolling upgrades so Hadoop users can upgrade software without taking down a cluster.

  1. who didn’t know about Hortonworks and Yahoo’s collaboration?
  2. what company and product management team would choose not to work with one of the largest users of the technology it is working on?

    This is the perfect example of testing and validating new ideas and learning about the pain your customers face in real life. Basically, by-the-book product/market fit.

Original title and link: Hortonworks’ Hadoop secret weapon is… Yahoo (NoSQL database©myNoSQL)


Where to look for Hadoop reliability problems

Dan Woods (Forbes) gets a list of 10 possible Hadoop problems from Raymie Stata (CEO of Altiscale), which can be summarized as:

  1. using default configuration options
  2. doing no tuning
  3. not understanding Amazon Elastic MapReduce’s behavior
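The first two items usually come down to overriding defaults in files like mapred-site.xml. The property names below are real Hadoop 2.x settings, but the values are purely illustrative — the right numbers depend entirely on the cluster and the workload:

```xml
<!-- mapred-site.xml: illustrative overrides of commonly tuned defaults -->
<configuration>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value> <!-- default container size is often too small -->
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>256</value> <!-- map-side spill buffer; default is 100 MB -->
  </property>
</configuration>
```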

Original title and link: Where to look for Hadoop reliability problems (NoSQL database©myNoSQL)