ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

An Introduction to Scalding, the Scala and Cascading MapReduce Framework From Twitter

A fantastic guide to Twitter’s Scala and Cascading MapReduce framework Scalding from Edwin Chen1:

In 140: instead of forcing you to write raw map and reduce functions, Scalding allows you to write natural code like

// Create a histogram of tweet lengths.
tweets.map('tweet -> 'length) { tweet : String => tweet.size }.groupBy('length) { _.size }

Looking at the code samples, this looks a lot like Apache Pig. But the Scalding documentation compares it to Scrunch/Scoobi and points to the answers in this Quora thread:

The main difference between Scalding (and Cascading) and Scrunch/Scoobi is that Cascading has a record model where each element in your distributed list/table is a table with some named fields. This is nice because most common cases are to have a few primitive columns (ints, strings, etc…).


  1. Edwin Chen is data scientist at Twitter 

Original title and link: An Introduction to Scalding, the Scala and Cascading MapReduce Framework From Twitter (NoSQL database©myNoSQL)

via: http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/