cascading: All content tagged as cascading in NoSQL databases and polyglot persistence
Thursday, 4 April 2013
BloomJoin: BloomFilter + CoGroup for Cascading
Ben Podgursky:
We recently open-sourced a number of internal tools we’ve built to help our engineers write high-performance Cascading code as the cascading_ext project. Today I’m going to talk about a tool we use to improve the performance of asymmetric joins: joins where one data set in the join contains significantly more records than the other, or where many of the records in the larger set don’t share a common key with the smaller set.
In the relational world there’s the Hash join.
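To make the idea concrete, here is a minimal in-memory sketch in plain Scala. It is not the cascading_ext implementation. It assumes the pattern the title suggests: build a Bloom filter over the keys of the small side, use it to drop large-side records that cannot match, then run an exact join on the survivors. The names `BloomFilter`, `bloomJoin`, `numBits`, and `numHashes` are invented for this illustration, and a hash map stands in for the CoGroup step.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object BloomJoinSketch {
  // Tiny Bloom filter: each key sets `numHashes` positions in a bit set.
  final class BloomFilter(numBits: Int, numHashes: Int) {
    private val bits = new mutable.BitSet(numBits)

    // One position per seed, mapped into [0, numBits).
    private def positions(key: String): Seq[Int] =
      (0 until numHashes).map { seed =>
        val h = MurmurHash3.stringHash(key, seed)
        ((h % numBits) + numBits) % numBits
      }

    def add(key: String): Unit = positions(key).foreach(bits += _)

    // May return true for a key that was never added, never false for one that was.
    def mightContain(key: String): Boolean = positions(key).forall(bits.contains)
  }

  def bloomJoin[A, B](
      large: Seq[(String, A)],
      small: Seq[(String, B)],
      numBits: Int = 1 << 16,
      numHashes: Int = 3
  ): Seq[(String, A, B)] = {
    // 1. Summarize the keys of the small side.
    val filter = new BloomFilter(numBits, numHashes)
    small.foreach { case (k, _) => filter.add(k) }

    // 2. Cheaply discard large-side records whose key cannot be in the small side.
    val candidates = large.filter { case (k, _) => filter.mightContain(k) }

    // 3. Exact join on the survivors; this also removes Bloom false positives.
    val smallByKey: Map[String, Seq[B]] =
      small.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    for {
      (k, a) <- candidates
      b <- smallByKey.getOrElse(k, Seq.empty)
    } yield (k, a, b)
  }
}
```

In a distributed setting the payoff would come from step 2: records filtered out there never need to be shuffled to the reducers that perform the exact join.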
Original title and link: BloomJoin: BloomFilter + CoGroup for Cascading (©myNoSQL)
via: http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/
Thursday, 14 March 2013
Lingual - a SQL DSL for Hadoop From Concurrent
Concurrent, the company behind Cascading, the quite popular Java application framework for Hadoop, has released Lingual, a SQL-based DSL (ANSI SQL parser) with an execution engine optimizer built on top of Cascading:
Lingual is not going to provide sub-second response times on a petabyte of data on a Hadoop cluster. Rather, the company’s goal is to provide the ability to easily move applications onto Hadoop—the challenge there is really around moving from a relational or MPP database over to Hadoop.
So it looks like not every new SQL tool around Hadoop is commercial.
Original title and link: Lingual - a SQL DSL for Hadoop From Concurrent (©myNoSQL)
Saturday, 23 February 2013
An Overview of Cascading
Earlier today I posted Dean Wampler’s video Overview of Scalding. Scalding is a Scala API on top of Cascading. Below you can find the video and slides from Paco Nathan’s Cascading presentation at the Chicago Hadoop User Group:
In this video he will introduce Cascading, then examine the concept of a “workflow” as an abstraction for integrating Hadoop with other systems. We’ll show new features including support for SQL-92, PMML, plus an application manager.
✚ Leaving aside the Java vs. Scala part, I’m still not sure I see any major advantage of any of these libraries over the others, besides tighter integration with an existing environment.
Original title and link: An Overview of Cascading (©myNoSQL)
An Overview of Scalding
An intro to Scalding, Twitter’s Scala API for Cascading, by Dean Wampler:
“There’s no better way to write general-purpose Hadoop MapReduce programs when specialized tools like Hive and Pig aren’t quite what you need.”
Watch the video and slides below.
✚ At Twitter, the creators of Scalding, different teams use different libraries for dealing with different scenarios.
✚ Dean Wampler is the co-author of the Programming Scala book, so his preference for Scala is understandable.
✚ Do you know any other teams or companies using Scalding instead of Cascading or Cascalog?
Original title and link: An Overview of Scalding (©myNoSQL)
Tuesday, 5 February 2013
Twitter and Their Cascading Libraries for Dealing With Different Scenarios
This is the only interesting paragraph from InfoWorld’s article “Twitter’s programmers speed Hadoop development”:
Three Twitter teams are using Cascading in combination with programming languages: The revenue team uses Scala, the publisher analytics team uses Clojure, and the analytics team uses Jython.
Each of these combinations led to new projects:
- Scala + Cascading => Scalding
- Clojure + Cascading => Cascalog
- Jython + Cascading => PyCascading
An interesting question I couldn’t answer is why each team prefers a different language. My hypothesis:
- Scala: strong typing suits well-defined models that produce numbers that must always be correct.
- Clojure: a good fit for designing new analysis models.
- Jython: enables quick experimentation with data.
Your thoughts?
Original title and link: Twitter and Their Cascading Libraries for Dealing With Different Scenarios (©myNoSQL)
Monday, 24 September 2012
Twitter's Scalding and Algebird: Matrix and Lightweight Algebra Library
The new release of Twitter’s Scalding brings quite a few interesting features:
- Scalding now includes a type-safe Matrix API
- In the familiar Fields API, we’ve added the ability to add type information to fields which allows scalding to pick up Ordering instances so that grouping on almost any scala collection becomes easy.
- Algebird is our lightweight abstract algebra library for Scala and is targeted for building aggregation systems (such as Storm).
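The appeal of an abstract algebra library for aggregation is easier to see with a toy example. The sketch below is hand-rolled and does not use Algebird; the `Monoid` trait, its two instances, and `sumAll` are assumptions made for illustration. The point is that any type with an associative `plus` and a `zero` can be summed by the same generic code, so partial aggregates are easy to merge.

```scala
object MonoidSketch {
  // A monoid: an associative combine operation plus an identity element.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  object Monoid {
    // Integers under addition.
    implicit val intSum: Monoid[Int] = new Monoid[Int] {
      val zero = 0
      def plus(a: Int, b: Int): Int = a + b
    }

    // Maps merge key by key, combining values with the value monoid.
    implicit def mapMonoid[K, V](implicit m: Monoid[V]): Monoid[Map[K, V]] =
      new Monoid[Map[K, V]] {
        val zero: Map[K, V] = Map.empty[K, V]
        def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
          b.foldLeft(a) { case (acc, (k, v)) =>
            acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
          }
      }
  }

  // Generic aggregation: works for any type that has a Monoid instance.
  def sumAll[T](xs: Seq[T])(implicit m: Monoid[T]): T =
    xs.foldLeft(m.zero)(m.plus)
}
```

With these definitions, `MonoidSketch.sumAll(Seq(Map("a" -> 1), Map("a" -> 2, "b" -> 3)))` merges per-key counts without any counting-specific code.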
Original title and link: Twitter’s Scalding and Algebird: Matrix and Lightweight Algebra Library (©myNoSQL)
via: http://engineering.twitter.com/2012/09/scalding-080-and-algebird.html
Monday, 23 July 2012
Cascalog and Cascading: Productivity Solutions for Data Scientists
A good explanation of why Cascading, Cascalog, and other frameworks hiding away the details of MapReduce are making things easier for non-programmers:
Data scientists at The Climate Corporation chose to create their algorithms in Cascalog, which is a high-level Clojure-based machine learning language built on Cascading. Cascading is an advanced Java application framework that abstracts the MapReduce APIs in Apache Hadoop and provides developers with a simplified way to create powerful data processing workflows. Programming in Cascalog, data scientists create compact expressions that represent complex batch-oriented AI and machine learning workflows. This results in improved productivity for the data scientists, many of whom are mathematicians rather than computer scientists. It also gives them the ability to quickly analyze complex data sets without having to create large complicated programs in MapReduce. Furthermore, programmers at The Climate Corporation also use Cascading directly for creating jobs inside Hadoop streaming to process additional batch-oriented data workflows.
Original title and link: Cascalog and Cascading: Productivity Solutions for Data Scientists (©myNoSQL)
via: http://www.concurrentinc.com/case-studies/climate-corp/
Tuesday, 5 June 2012
Cascading 2.0 Released
Cascading, the Java framework offering data processing, data flow, data integration, and process scheduling APIs for Hadoop, has reached version 2.0. The most interesting points in this release, as summarized on the Cascading blog:
- Apache 2.0 Licensing
- Support for Hadoop 1.0.2
- Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
- HashJoin pipe for “map side joins”
- Merge pipe for “map side merges”
- Simple Checkpointing for capturing intermediate data as a file
- Improved Tap and Scheme APIs
Congrats!
Original title and link: Cascading 2.0 Released (©myNoSQL)
Monday, 27 February 2012
Cascalog-Checkpoint: Fault-Tolerant MapReduce Topologies
A brief but very clear explanation of the benefits of using Cascalog-checkpoints by Paul Lam:
Building Cascading/Cascalog queries can be visualised as assembling pipes to connect a flow of data. Imagine that you have Flow A and B. Flow B uses the result from A along with other bits. Thus, Flow B is dependent on A. Typically, if a MapReduce job fails for whatever reason, you simply fix what’s wrong and start the job all over again. But what if Flow A takes hours to run (which is common for a MR job) and the error happened in Flow B? Why re-do all that processing for Flow A if we know that it finished successfully?
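The underlying idea can be sketched without Cascalog at all. The following is a minimal local illustration in plain Scala, not the cascalog-checkpoint API: `checkpointed` is a hypothetical helper that runs a step only when its output file does not exist yet and otherwise reuses the saved result. It assumes a step's result fits in a string and that an existing file means the step finished successfully.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object CheckpointSketch {
  // Runs `compute` only if `output` is missing; otherwise reads the saved result.
  // Assumption: the presence of `output` means the step completed successfully.
  def checkpointed(output: Path)(compute: => String): String =
    if (Files.exists(output)) {
      new String(Files.readAllBytes(output), StandardCharsets.UTF_8)
    } else {
      val result = compute // the expensive part, e.g. Flow A
      Files.write(output, result.getBytes(StandardCharsets.UTF_8))
      result
    }
}
```

If Flow A is wrapped in `checkpointed` and Flow B then fails, a re-run finds A's output on disk and goes straight to B.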
Original title and link: Cascalog-Checkpoint: Fault-Tolerant MapReduce Topologies (©myNoSQL)
via: http://www.quantisan.com/cascalog-checkpoint-fault-tolerant-mapreduce-topologies/
Tuesday, 21 February 2012
An Introduction to Scalding, the Scala and Cascading MapReduce Framework From Twitter
A fantastic guide to Twitter’s Scala and Cascading MapReduce framework Scalding from Edwin Chen [1]:
In 140: instead of forcing you to write raw map and reduce functions, Scalding allows you to write natural code like
    // Create a histogram of tweet lengths.
    tweets.map('tweet -> 'length) { tweet : String => tweet.size }
      .groupBy('length) { _.size }
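For anyone without a Scalding setup, roughly the same histogram can be computed on plain Scala collections. This is only a local analogy, assuming the tweets are an in-memory `Seq[String]`; `TweetHistogram` and `lengthHistogram` are names invented here, and nothing below touches Cascading or Hadoop.

```scala
object TweetHistogram {
  // Maps each tweet to its length, groups equal lengths, and counts each group.
  def lengthHistogram(tweets: Seq[String]): Map[Int, Int] =
    tweets
      .map(_.length)
      .groupBy(identity)
      .map { case (length, group) => length -> group.size }
}
```

The shape is the same as in the Scalding snippet (a map step followed by a group-and-count step), which is why the distributed version reads so naturally.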
Looking at the code samples, this looks a lot like Apache Pig. But the Scalding documentation compares it to Scrunch/Scoobi and points to the answers in this Quora thread:
The main difference between Scalding (and Cascading) and Scrunch/Scoobi is that Cascading has a record model where each element in your distributed list/table is a table with some named fields. This is nice because most common cases are to have a few primitive columns (ints, strings, etc…).
[1] Edwin Chen is a data scientist at Twitter.
Original title and link: An Introduction to Scalding, the Scala and Cascading MapReduce Framework From Twitter (©myNoSQL)
via: http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
Sunday, 4 December 2011
Looking for a Map Reduce Language
Java, Cascading, Pipes - C++, Hive, Pig, Rhipe, Dumbo, Cascalog… which one of these should you use for writing Map Reduce code?
Antonio Piccolboni takes them up for a test:
At the end of this by necessity incomplete and unscientific language and library comparison, there is a winner and there isn’t. There isn’t because language comparison is always multidimensional and subjective but also because the intended applications are very different. On the other hand, looking for a general purpose, moderately elegant, not necessarily most efficient, not necessarily mature language for exploration purposes, Rhipe seems to fit the bill pretty nicely.
Original title and link: Looking for a Map Reduce Language (©myNoSQL)
via: http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html
Wednesday, 11 May 2011
Simhashing in Hadoop with MapReduce, Cascalog and Cascading
Simhashing in MapReduce is a quick way to find clusters in a huge amount of data. By using Cascading and Cascalog we’re able to work with MapReduce jobs at the level of functions rather than individual map-reduce phases.
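For readers unfamiliar with simhash, here is a minimal single-machine sketch in plain Scala. It is not the code from the linked article and involves no MapReduce. It assumes whitespace and punctuation tokenization with equal token weights, and builds a 64-bit token hash from two seeded MurmurHash3 calls; `SimhashSketch`, `hash64`, `simhash`, and `hammingDistance` are illustrative names.

```scala
import scala.util.hashing.MurmurHash3

object SimhashSketch {
  // 64-bit token hash assembled from two 32-bit hashes with different seeds.
  private def hash64(token: String): Long = {
    val hi = MurmurHash3.stringHash(token, 1).toLong
    val lo = MurmurHash3.stringHash(token, 2).toLong & 0xFFFFFFFFL
    (hi << 32) | lo
  }

  // Each token votes +1 or -1 on every bit position; the fingerprint keeps
  // the bits whose total vote is positive.
  def simhash(text: String): Long = {
    val votes = new Array[Int](64)
    val tokens = text.toLowerCase.split("\\W+").filter(_.nonEmpty)
    for (token <- tokens) {
      val h = hash64(token)
      for (bit <- 0 until 64)
        votes(bit) += (if (((h >>> bit) & 1L) == 1L) 1 else -1)
    }
    (0 until 64).foldLeft(0L) { (acc, bit) =>
      if (votes(bit) > 0) acc | (1L << bit) else acc
    }
  }

  // Number of differing bits between two fingerprints.
  def hammingDistance(a: Long, b: Long): Int = java.lang.Long.bitCount(a ^ b)
}
```

Texts that share most of their tokens tend to produce fingerprints with a small Hamming distance, so candidate clusters can be found by comparing 64-bit values instead of full documents.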
Original title and link: Simhashing in Hadoop with MapReduce, Cascalog and Cascading (NoSQL databases © myNoSQL)