


hadoop: All content tagged as hadoop in NoSQL databases and polyglot persistence

Pigs can build graphs too for graph analytics

An intriguing use and extension of Pig at Intel:

Pigs eat everything and Pig can ingest many data formats and data types from many data sources, in line with our objectives for Graph Builder. Also, Pig has native support for local file systems, HDFS, and HBase, and tools like Sqoop can be used upstream of Pig to transfer data into HDFS from relational databases. One of the most fascinating things about Pig is that it only takes a few lines of code to define a complex workflow comprised of a long chain of data operations. These operations can map to multiple MapReduce jobs and Pig compiles the logical plan into an optimized workflow of MapReduce jobs. With all of these advantages, Pig seemed like the right tool for graph ETL, so we re-architected Graph Builder 2.0 as a library of User-Defined Functions (UDF’s) and macros in Pig Latin.
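To make "graph ETL" concrete, here is a minimal sketch in plain Python of the kind of operator chain the quote describes: filter raw rows, derive a vertex set, and aggregate duplicates into weighted edges. It is not Graph Builder's code or Pig Latin; the `records` data and the `build_graph` helper are hypothetical, and the comments only name the Pig-style operation each step roughly corresponds to.

```python
from collections import Counter

# Hypothetical input: (user, page) click records, standing in for rows
# that a real pipeline would load from HDFS or HBase.
records = [
    ("alice", "home"),
    ("alice", "docs"),
    ("bob", "home"),
    ("alice", "home"),
    ("", "docs"),  # malformed row, dropped by the filter step
]


def build_graph(rows):
    """Turn tabular rows into a vertex list and a weighted edge list."""
    # Roughly a FILTER: keep only rows with both endpoints present.
    clean = [(src, dst) for src, dst in rows if src and dst]
    # Roughly a FLATTEN followed by DISTINCT: every endpoint is a vertex.
    vertices = sorted({v for edge in clean for v in edge})
    # Roughly a GROUP BY with COUNT: repeated edges become a weight.
    weights = Counter(clean)
    edge_list = sorted((s, d, w) for (s, d), w in weights.items())
    return vertices, edge_list


vertices, edge_list = build_graph(records)
print(vertices)
print(edge_list)
```

In Pig, each of these steps would be one line of a script, and the planner would decide how many MapReduce jobs the chain needs.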

Original title and link: Pigs can build graphs too for graph analytics (NoSQL database©myNoSQL)


Datameer raises $19 Million

Announced yesterday:

“This funding is entirely about allowing us to meet the nonstop global demand for our product. Across every industry, companies are moving past Hadoop science projects and realizing they need a proven big data analytics tool that finally frees them from schemas and ETL,” said Stefan Groschupf, CEO of Datameer.

Funding in the Hadoop space is at a higher level than in the pure NoSQL database market. In the Big Data/BI market it’s easier to grasp the competitors and the market potential they’re fighting for. In the NoSQL market, many are still afraid to think that some of these players will actually make (big) dents in incumbents’ market segments.



Challenges and Opportunities for Big Data - an interview with Actian's CTO Mike Hoskins

Roberto V. Zicari interviews Actian’s CTO Mike Hoskins:

Until recently, most data projects were solely focused on preparation. Seminal developments in the big data landscape, including Hortonworks Data Platform (HDP) 2.0 and the arrival of YARN (Yet Another Resource Negotiator) – which takes Hadoop’s capabilities in data processing beyond the limitations of the highly regimented and restrictive MapReduce programming model – provides an opportunity to move beyond the initial hype of big data and instead towards the more high-value work of predictive analytics.



Picking the Right Platform: Big Data or Traditional Warehouse?

Stephen Swoyer (TDWI) summarizes Richard Winter’s research into the cost efficiency of Hadoop vs. data warehouses:

“Under what circumstances, in fact, does Hadoop save you a lot of money, and under what circumstances does a data warehouse save you a lot of money?”

The conversation happened at a Teradata event, so you might already guess some of the findings. Anyway, without seeing the data, it’s difficult to agree or disagree:

In fact, he argued that misusing Hadoop for some types of decision support workloads could cost up to 2.8x more than a data warehouse.



The three most common ways data junkies are using Hadoop

Shaun Connolly (Hortonworks) lists the three most common uses of Hadoop in a guest post on GigaOm:

  1. Data refinery
  2. Data exploration
  3. Application enrichment

Nothing new here, except the new buzzwords used to describe Hadoop use cases that have been slowly but steadily establishing themselves as patterns. And even if they sound nicer than ETL, analytics, etc., I doubt anyone needed new terms.


Hadoop in Fedora 20

Being included in the default Fedora distro is yet another big step for Hadoop.

The hardest part about getting Hadoop into Fedora? “Dependencies, dependencies, dependencies!” says Farrellee. […]

For Hadoop? It was more difficult than usual. “There were some dependencies that were just missing and we had to work through those as you’d expect - there were a lot of these. Then there were dependencies that were older than what upstream was using - rare, I know, for Fedora, which aims to be on the bleeding edge. The hardest to deal with were dependencies that were newer than what upstream was using. We tried to write patches for these, but we weren’t always successful. […]”

On the other hand, one thing that continues to puzzle me is: how many different people coming from different backgrounds need to say that Hadoop is crazy complex?



HAWK: Performance monitoring tool for Hive

JunHo Cho’s slides introducing HAWK, a performance monitoring tool for Hive:

✚ I couldn’t find a link for HAWK. The slides point to NexR.


Apache Ambari is now an Apache Top Level Project


We are very excited to announce that Apache Ambari has graduated out of Incubator and is now an Apache Top Level Project!

Ambari is a framework for provisioning, managing, and monitoring Hadoop clusters.

✚ Such a tool is usually part of a Hadoop distribution, and in some cases it comes in a proprietary form.



A quick guide to using Sentry authorization in Hive

A guide to Apache Sentry:

Sentry brings in fine-grained authorization support for both data and metadata in a Hadoop cluster. It is already being used in production systems to secure the data and provide fine-grained access to its users. It is also integrated with the version of Hive shipping in CDH (upstream contribution is pending), Cloudera Impala, and Cloudera Search.



Hadoop on SAN? Never, ever do this to Hadoop

Andrew C. Oliver in an article for InfoWorld:

I’ve done this myself, figuring we’d kick off the project and show how we could “optimize” to local disks later. Let me say this unequivocally: You absolutely should not use a SAN or NAS with Hadoop.

As simple as that.



Using Spark for fast in-memory computing

Justin Kestelyn from Databricks describes the differences between the Hadoop and Spark processing models in a post on Cloudera’s blog:

At its core, Spark provides a general programming model that enables developers to write application by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. […] In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory.


✚ In a way, this looks similar to the Cascading programming model, combined with the ability to keep the working dataset for the current computations in memory.
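The two ideas in the quote, composing operators and keeping an intermediate result in memory, can be illustrated with a toy in plain Python. This is not Spark's API: the `Dataset` class, its methods, and the `expensive_square` function are hypothetical names made up for this sketch, and everything runs in a single process.

```python
class Dataset:
    """Toy lazy dataset: operators are recorded and run only on collect()."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing the rows
        self._keep = False       # set by cache(): keep rows after first run
        self._cached = None      # materialized rows, once computed and kept

    @classmethod
    def from_list(cls, rows):
        return cls(lambda: list(rows))

    def map(self, fn):
        return Dataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return Dataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        """Ask for this dataset's rows to be kept in memory once computed."""
        self._keep = True
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)
        rows = self._compute()
        if self._keep:
            self._cached = rows
        return list(rows)


call_log = []  # records every invocation of the expensive operator


def expensive_square(x):
    call_log.append(x)
    return x * x


# Compose operators, mark the intermediate result as cached, then reuse it twice.
squares = Dataset.from_list([1, 2, 3, 4]).map(expensive_square).cache()
evens = squares.filter(lambda x: x % 2 == 0).collect()
total = sum(squares.collect())
print(evens, total, len(call_log))
```

Because `squares` is cached, the second computation reuses the stored rows and `expensive_square` runs only once per input element. Without the `cache()` call, each downstream computation would recompute the whole chain.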



jumboDB - a data store for low-latency Big Data apps

From jumboDB’s homepage:

Working on Big Data projects with Telefonica Digital, Carsten Hufe and the comSysto-Team started looking for an efficient and affordable way to store and query large amounts of data being delivered in large batches through Apache Hadoop. Our goal was to build a data visualization app for end users issuing different kinds of selective queries on already processed data. Some of the queries were returning large result sets of up to 800.000 JSON documents representing data points for browser visualisation.

Why not use HBase if you already have Hadoop?
