Scalding: All content tagged as Scalding in NoSQL databases and polyglot persistence
Saturday, 23 February 2013
An Overview of Cascading
Earlier today I posted Dean Wampler’s video Overview of Scalding. Scalding is a Scala API on top of Cascading. Below you can find the video and slides from Paco Nathan’s Cascading presentation at the Chicago Hadoop User Group:
In this video he introduces Cascading, then examines the concept of a “workflow” as an abstraction for integrating Hadoop with other systems. He also shows new features, including support for SQL-92 and PMML, plus an application manager.
✚ Leaving aside the Java vs. Scala part, I’m still not sure I see any major advantage of any of these libraries over the others, besides tighter integration with an existing environment.
Original title and link: An Overview of Cascading (©myNoSQL)
An Overview of Scalding
An intro to Scalding, Twitter’s Scala API for Cascading, by Dean Wampler:
“There’s no better way to write general-purpose Hadoop MapReduce programs when specialized tools like Hive and Pig aren’t quite what you need.”
Watch the video and slides below.
✚ At Twitter, the creators of Scalding, different teams use different libraries for dealing with different scenarios.
✚ Dean Wampler is the co-author of the book Programming Scala, so his preference for Scala is understandable.
✚ Do you know any other teams or companies using Scalding instead of Cascading or Cascalog?
Tuesday, 5 February 2013
Twitter and Their Cascading Libraries for Dealing With Different Scenarios
This is the only interesting paragraph from InfoWorld’s article “Twitter’s programmers speed Hadoop development”:
Three Twitter teams are using Cascading in combination with programming languages: The revenue team uses Scala, the publisher analytics team uses Clojure, and the analytics team uses Jython.
Each of these combinations led to new projects:
- Scala + Cascading => Scalding
- Clojure + Cascading => Cascalog
- Jython + Cascading => PyCascading
An interesting question I couldn’t answer is why each team prefers a different language. My hypothesis:
- Scala, with its strong typing, for well-defined models that generate numbers that must always be correct.
- Clojure for designing new analysis models.
- Jython for quick experimentation with data.
Your thoughts?
Monday, 24 September 2012
Twitter's Scalding and Algebird: Matrix and Lightweight Algebra Library
The new release of Twitter’s Scalding brings quite a few interesting features:
- Scalding now includes a type-safe Matrix API
- In the familiar Fields API, we’ve added the ability to add type information to fields, which allows Scalding to pick up Ordering instances so that grouping on almost any Scala collection becomes easy.
- Algebird is our lightweight abstract algebra library for Scala and is targeted for building aggregation systems (such as Storm).
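To make the "abstract algebra for aggregation" idea a bit more concrete, here is a minimal sketch in plain Python. It does not use Algebird or Scalding; the names `Monoid`, `map_monoid`, and `aggregate_in_partitions` are made up for this illustration. The assumption behind it is that an aggregation defined by an associative combine operation with an identity element (a monoid) can be computed per partition and then merged, which is what makes this style a good fit for MapReduce and streaming systems.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """An associative combine operation plus its identity element."""
    zero: T
    plus: Callable[[T, T], T]

    def sum(self, items: Iterable[T]) -> T:
        # Fold all items together, starting from the identity element.
        return reduce(self.plus, items, self.zero)


# Integers under addition form a monoid with identity 0.
int_sum = Monoid(0, lambda a, b: a + b)


def map_monoid(value_monoid):
    """Monoid over dicts: keys are merged, values combined with value_monoid."""
    def plus(left, right):
        merged = dict(left)  # copy so neither input is mutated
        for key, value in right.items():
            current = merged.get(key, value_monoid.zero)
            merged[key] = value_monoid.plus(current, value)
        return merged
    return Monoid({}, plus)


def aggregate_in_partitions(monoid, partitions):
    """Sum each partition on its own, then merge the partial results."""
    partials = [monoid.sum(partition) for partition in partitions]
    return monoid.sum(partials)


# Hypothetical data: per-record counts spread over two partitions.
counts = map_monoid(int_sum)
partitions = [
    [{"scala": 1}, {"clojure": 1}, {"scala": 1}],
    [{"jython": 1}, {"scala": 1}],
]
total = aggregate_in_partitions(counts, partitions)
print(total)
```

Because the combine operation is associative, the result does not depend on how the records are split across partitions, so the same definition works for a batch job and for incremental updates.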
via: http://engineering.twitter.com/2012/09/scalding-080-and-algebird.html
Tuesday, 27 March 2012
Introducing Scoobi and Scalding: Scala DSLs for Hadoop MapReduce
After posting Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark, I found myself wondering why so many of these libraries are built in Scala and what their main purpose is. A day later, I found Age Mooij’s presentation about Scoobi and Scalding, which provides an answer to my question, plus a quick intro to Scoobi and Scalding. Check the slides after the break.
Monday, 26 March 2012
Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark
Sami Badawi enumerates the issues he encountered while trying all these tools (Pig, Scalding, Scoobi, Hive, Spark, Scrunch, Cascalog) for a simple experiment with Hadoop:
The task was to read log files, join them with other data, and do some statistics on arrays of doubles. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data.
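For readers who want a feel for the shape of that task, here is a small in-memory sketch in plain Python. It is not taken from Sami Badawi's post and uses none of the tools he compared; the log format, the `users` lookup table, and the function names are all assumptions made for the illustration.

```python
from statistics import mean, pstdev

# Hypothetical log lines: "<user_id> <comma-separated doubles>".
log_lines = [
    "u1 0.5,1.5,2.5",
    "u2 4.0,6.0",
    "u1 3.5",
    "u3 9.0",  # no matching user record, so the inner join drops it
]

# Hypothetical side data to join with: user_id -> country.
users = {"u1": "US", "u2": "DE"}


def parse(line):
    """Split a log line into a user id and its array of doubles."""
    user_id, raw_values = line.split(" ", 1)
    return user_id, [float(x) for x in raw_values.split(",")]


def join_and_group(lines, lookup):
    """Inner-join parsed lines with the lookup table and group values by country."""
    grouped = {}
    for user_id, values in map(parse, lines):
        if user_id in lookup:
            grouped.setdefault(lookup[user_id], []).extend(values)
    return grouped


def summarize(grouped):
    """Compute simple statistics for each group's array of doubles."""
    return {
        key: {"count": len(vals), "mean": mean(vals), "stdev": pstdev(vals)}
        for key, vals in grouped.items()
    }


stats = summarize(join_and_group(log_lines, users))
for country, summary in sorted(stats.items()):
    print(country, summary)
```

On a cluster, each of these three steps (parse, join, aggregate) becomes a stage in the pipeline, and the tools in the list differ mostly in how concisely they let you express those stages.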
- Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
- Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
- Spark: an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write.
- Cascalog: a fully-featured Clojure-based data processing and querying library for Hadoop.
via: http://blog.samibadawi.com/2012/03/hive-pig-scalding-scoobi-scrunch-and.html