
Cascading: All content tagged as Cascading in NoSQL databases and polyglot persistence

Cascading components for a Big Data application

Jules S. Damji in a quick intro to Cascading:

At the core of most data-driven applications is a data pipeline through which data flows, originating from Taps and Sources (ingestion) and ending in a Sink (retention) while undergoing transformation along a pipeline (Pipes, Traps, and Flows). And should something fail, a Trap (exception) must handle it. In the big data parlance, these are aspects of ETL operations.

You have to agree that, compared with the raw MapReduce model, these components can bring a lot of readability to your code. On the other hand, at first glance the Cascading API still feels verbose.
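The flow of Taps, Pipes, Traps, and Sinks described above can be sketched in plain Java. This is a toy model of the concepts only, not the actual Cascading API; every name in it is made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class PipelineSketch {
    // Pipe: transform each record; failed records land in the trap
    // instead of killing the whole flow.
    static List<Integer> run(List<String> source, List<String> trap) {
        Function<String, Integer> pipe = Integer::parseInt;
        List<Integer> sink = new ArrayList<>();
        for (String record : source) {
            try {
                sink.add(pipe.apply(record));   // Sink: retention
            } catch (NumberFormatException e) {
                trap.add(record);               // Trap: exception handling
            }
        }
        return sink;
    }

    public static void main(String[] args) {
        List<String> trap = new ArrayList<>();
        // Source/Tap: ingestion
        List<Integer> sink = run(List.of("10", "20", "oops", "30"), trap);
        System.out.println("sink: " + sink); // sink: [10, 20, 30]
        System.out.println("trap: " + trap); // trap: [oops]
    }
}
```

The point of the abstraction is visible even in the toy: ingestion, transformation, retention, and failure handling are named, separate pieces rather than logic buried inside map and reduce functions.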

Original title and link: Cascading components for a Big Data applications (NoSQL database©myNoSQL)


BloomJoin: BloomFilter + CoGroup for Cascading

Ben Podgursky:

We recently open-sourced a number of internal tools we’ve built to help our engineers write high-performance Cascading code as the cascading_ext project. Today I’m going to talk about a tool we use to improve the performance of asymmetric joins—joins where one data set in the join contains significantly more records than the other, or where many of the records in the larger set don’t share a common key with the smaller set.

In the relational world, there’s the hash join.
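The idea behind a Bloom-filter-assisted join is easy to sketch in plain Java. This is a toy model, not cascading_ext’s BloomJoin, and the two-hash Bloom filter is deliberately minimal: build a filter over the small side’s keys and use it to discard most non-matching records from the large side before the exact join.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collection;
import java.util.List;
import java.util.Map;

public class BloomJoinSketch {
    static final int BITS = 1 << 16;

    // Two cheap hash functions; a real Bloom filter would use several
    // independent ones sized to the expected key count.
    static int h1(String k) { return Math.floorMod(k.hashCode(), BITS); }
    static int h2(String k) { return Math.floorMod(k.hashCode() * 31 + 17, BITS); }

    static BitSet buildFilter(Collection<String> keys) {
        BitSet bits = new BitSet(BITS);
        for (String k : keys) { bits.set(h1(k)); bits.set(h2(k)); }
        return bits;
    }

    static boolean mightContain(BitSet bits, String k) {
        return bits.get(h1(k)) && bits.get(h2(k));
    }

    static List<String> join(Map<String, String> small, List<String> largeKeys) {
        BitSet filter = buildFilter(small.keySet());
        List<String> out = new ArrayList<>();
        for (String k : largeKeys) {
            if (!mightContain(filter, k)) continue; // most non-matches dropped here
            String v = small.get(k);  // exact check: Bloom filters can false-positive
            if (v != null) out.add(k + "=" + v);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> small = Map.of("a", "1", "b", "2");
        List<String> large = List.of("a", "x", "y", "b", "z");
        System.out.println(join(small, large)); // [a=1, b=2]
    }
}
```

In a distributed setting the payoff is that the filter is tiny enough to ship to every mapper, so records from the large side that can never match are dropped before they are shuffled for the join.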

Original title and link: BloomJoin: BloomFilter + CoGroup for Cascading (NoSQL database©myNoSQL)


Lingual - a SQL DSL for Hadoop From Concurrent

Concurrent, the company behind Cascading, the quite popular Java application framework for Hadoop, has released Lingual, a SQL-based DSL (with an ANSI SQL parser) and an optimizing execution engine built on top of Cascading:

Lingual is not going to provide sub-second response times on a petabyte of data on a Hadoop cluster. Rather, the company’s goal is to provide the ability to easily move applications onto Hadoop—the challenge there is really around moving from a relational or MPP database over to Hadoop.

So it looks like not every new SQL tool around Hadoop is commercial.

Original title and link: Lingual - a SQL DSL for Hadoop From Concurrent (NoSQL database©myNoSQL)


An Overview of Cascading

Earlier today I posted Dean Wampler’s video Overview of Scalding. Scalding is a Scala API on top of Cascading1. Below you can find the video and slides from Paco Nathan’s Cascading presentation at the Chicago Hadoop User Group:

In this video, he introduces Cascading, then examines the concept of a “workflow” as an abstraction for integrating Hadoop with other systems, and shows new features including support for SQL-92, PMML, plus an application manager.

✚ Leaving aside the Java vs. Scala part, I’m still not sure I see any major advantage of these libraries over one another, besides tighter integration with an existing environment.

  1. Cascading: an application framework for Java developers to quickly and easily develop robust data analytics and data management applications on Apache Hadoop. 

Original title and link: An Overview of Cascading (NoSQL database©myNoSQL)

An Overview of Scalding

An intro to Scalding1, Twitter’s Scala API for Cascading, by Dean Wampler2:

“There’s no better way to write general-purpose Hadoop MapReduce programs when specialized tools like Hive and Pig aren’t quite what you need.”

Watch the video and slides below.

✚ At Twitter, the creators of Scalding, different teams use different libraries for dealing with different scenarios.

✚ Dean Wampler is the co-author of the Programming Scala book so his preference for Scala is understandable.

✚ Do you know any other teams or companies using Scalding instead of Cascading or Cascalog?

  1. Scalding 

  2. Dean Wampler: Principal Consultant at Think Big Analytics 

Original title and link: An Overview of Scalding (NoSQL database©myNoSQL)

Twitter and Their Cascading Libraries for Dealing With Different Scenarios

This is the only interesting paragraph from InfoWorld’s article “Twitter’s programmers speed Hadoop development”:

Three Twitter teams are using Cascading in combination with programming languages: The revenue team uses Scala, the publisher analytics team uses Clojure, and the analytics team uses Jython.

Each of these combinations led to new projects.

An interesting question I couldn’t answer is why each team prefers a different language. My hypothesis:

  1. Scala, with its strong typing, for clear models generating numbers that must always be correct.
  2. Clojure for designing new analysis models.
  3. Jython for quick experimentation with data.

Your thoughts?

Original title and link: Twitter and Their Cascading Libraries for Dealing With Different Scenarios (NoSQL database©myNoSQL)


Twitter's Scalding and Algebird: Matrix and Lightweight Algebra Library

The new release of Twitter’s Scalding brings quite a few interesting features:

  1. Scalding now includes a type-safe Matrix API
  2. In the familiar Fields API, we’ve added the ability to attach type information to fields, which allows Scalding to pick up Ordering instances so that grouping on almost any Scala collection becomes easy.
  3. Algebird is our lightweight abstract algebra library for Scala and is targeted for building aggregation systems (such as Storm).
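The abstract-algebra idea behind a library like Algebird can be sketched in plain Java (a toy model, not Algebird’s API): a monoid—an identity element plus an associative combine operation—is all an aggregation system needs to merge partial results in any order.

```java
import java.util.List;
import java.util.function.BinaryOperator;

public class MonoidSketch {
    // Fold a list with a monoid: a zero element and an associative "plus".
    // Associativity is what lets a distributed system combine partial
    // aggregates from different workers in any grouping.
    static <T> T aggregate(List<T> items, T zero, BinaryOperator<T> plus) {
        T acc = zero;
        for (T item : items) acc = plus.apply(acc, item);
        return acc;
    }

    public static void main(String[] args) {
        // Sum monoid: zero = 0, plus = +
        int sum = aggregate(List.of(1, 2, 3, 4), 0, Integer::sum);
        // Max monoid: zero = MIN_VALUE, plus = max
        int max = aggregate(List.of(1, 7, 3), Integer.MIN_VALUE, Integer::max);
        System.out.println(sum + " " + max); // 10 7
    }
}
```

Swapping in a different monoid changes the aggregation without touching the plumbing, which is why the same machinery works for counters, maxima, or more exotic structures.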

Original title and link: Twitter’s Scalding and Algebird: Matrix and Lightweight Algebra Library (NoSQL database©myNoSQL)


Cascalog and Cascading: Productivity Solutions for Data Scientists

A good explanation of why Cascading, Cascalog, and other frameworks hiding away the details of MapReduce are making things easier for non-programmers:

Data scientists at The Climate Corporation chose to create their algorithms in Cascalog, which is a high-level Clojure-based machine learning language built on Cascading. Cascading is an advanced Java application framework that abstracts the MapReduce APIs in Apache Hadoop and provides developers with a simplified way to create powerful data processing workflows. Programming in Cascalog, data scientists create compact expressions that represent complex batch-oriented AI and machine learning workflows. This results in improved productivity for the data scientists, many of whom are mathematicians rather than computer scientists. It also gives them the ability to quickly analyze complex data sets without having to create large complicated programs in MapReduce. Furthermore, programmers at The Climate Corporation also use Cascading directly for creating jobs inside Hadoop streaming to process additional batch-oriented data workflows.

Original title and link: Cascalog and Cascading: Productivity Solutions for Data Scientists (NoSQL database©myNoSQL)


Cascading 2.0 Released

Cascading, the Java framework offering data processing, data flow, data integration, and process scheduling APIs for Hadoop, has reached version 2.0. The most interesting points in this release, summarized on the Cascading blog:

  • Apache 2.0 Licensing
  • Support for Hadoop 1.0.2
  • Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
  • HashJoin pipe for “map side joins”
  • Merge pipe for “map side merges”
  • Simple Checkpointing for capturing intermediate data as a file
  • Improved Tap and Scheme APIs


Original title and link: Cascading 2.0 Released (NoSQL database©myNoSQL)

Cascalog-Checkpoint: Fault-Tolerant MapReduce Topologies

A brief but very clear explanation of the benefits of using Cascalog-checkpoints by Paul Lam:

Building Cascading/Cascalog queries can be visualised as assembling pipes to connect a flow of data. Imagine that you have Flow A and B. Flow B uses the result from A along with other bits. Thus, Flow B is dependent on A. Typically, if a MapReduce job fails for whatever reason, you simply fix what’s wrong and start the job all over again. But what if Flow A takes hours to run (which is common for a MR job) and the error happened in Flow B? Why re-do all that processing for Flow A if we know that it finished successfully?
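The checkpoint idea can be sketched in plain Java (a toy model, not Cascalog-checkpoint itself): persist Flow A’s output, and on a rerun read it back instead of recomputing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckpointSketch {
    static int flowARuns = 0;

    // Stands in for the expensive, hours-long Flow A.
    static String flowA() { flowARuns++; return "expensive-result"; }

    // Run Flow A through a checkpoint: if a prior run already wrote its
    // output, reuse it; otherwise compute and persist it.
    static String checkpointed(Path file) throws IOException {
        if (Files.exists(file)) return Files.readString(file); // reuse prior run
        String result = flowA();
        Files.writeString(file, result);                       // checkpoint it
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path ckpt = Files.createTempDirectory("ckpt").resolve("flowA.txt");
        checkpointed(ckpt); // first run: Flow A executes and is checkpointed
        checkpointed(ckpt); // rerun after Flow B failed: read from checkpoint
        System.out.println("Flow A ran " + flowARuns + " time(s)"); // 1 time(s)
    }
}
```

After fixing whatever broke in Flow B, restarting the whole job only re-executes the flows downstream of the last successful checkpoint.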

Original title and link: Cascalog-Checkpoint: Fault-Tolerant MapReduce Topologies (NoSQL database©myNoSQL)


An Introduction to Scalding, the Scala and Cascading MapReduce Framework From Twitter

A fantastic guide to Twitter’s Scala and Cascading MapReduce framework Scalding from Edwin Chen1:

In 140 characters: instead of forcing you to write raw map and reduce functions, Scalding allows you to write natural code like

// Create a histogram of tweet lengths.
tweets.map('tweet -> 'length) { tweet : String => tweet.size }.groupBy('length) { _.size }

Looking at the code samples, this looks a lot like Apache Pig. But the Scalding documentation compares it to Scrunch/Scoobi and points to the answers in this Quora thread:

The main difference between Scalding (and Cascading) and Scrunch/Scoobi is that Cascading has a record model where each element in your distributed list/table is a tuple with some named fields. This is nice because the most common case is to have a few primitive columns (ints, strings, etc.).

  1. Edwin Chen is data scientist at Twitter 

Original title and link: An Introduction to Scalding, the Scala and Cascading MapReduce Framework From Twitter (NoSQL database©myNoSQL)


Looking for a Map Reduce Language

Java, Cascading, Pipes (C++), Hive, Pig, Rhipe, Dumbo, Cascalog… which of these should you use for writing MapReduce code?

Antonio Piccolboni puts them to the test:

At the end of this by necessity incomplete and unscientific language and library comparison, there is a winner and there isn’t. There isn’t because language comparison is always multidimensional and subjective but also because the intended applications are very different. On the other hand, looking for a general purpose, moderately elegant, not necessarily most efficient, not necessarily mature language for exploration purposes, Rhipe seems to fit the bill pretty nicely.

Original title and link: Looking for a Map Reduce Language (NoSQL database©myNoSQL)