Scalding: All content tagged as Scalding in NoSQL databases and polyglot persistence

An Overview of Cascading

Earlier today I posted Dean Wampler’s video Overview of Scalding. Scalding is a Scala API on top of Cascading[1]. Below you can find the video and slides from Paco Nathan’s Cascading presentation at the Chicago Hadoop User Group:

In this video he will introduce Cascading, then examine the concept of a “workflow” as an abstraction for integrating Hadoop with other systems. We’ll show new features including support for SQL-92, PMML, plus an application manager.

✚ Leaving aside the Java vs. Scala part, I’m still not sure I see any major advantage of any of these libraries over the others, besides tighter integration with an existing environment.

  1. Cascading: an application framework for Java developers to quickly and easily develop robust data analytics and data management applications on Apache Hadoop. 

Original title and link: An Overview of Cascading (NoSQL database©myNoSQL)

An Overview of Scalding

An intro to Scalding[1], Twitter’s Scala API for Cascading, by Dean Wampler[2]:

“There’s no better way to write general-purpose Hadoop MapReduce programs when specialized tools like Hive and Pig aren’t quite what you need.”

Watch the video and slides below.

✚ At Twitter, the creators of Scalding, different teams use different libraries for dealing with different scenarios.

✚ Dean Wampler is the co-author of the Programming Scala book so his preference for Scala is understandable.

✚ Do you know any other teams or companies using Scalding instead of Cascading or Cascalog?

  1. Scalding 

  2. Dean Wampler: Principal Consultant at Think Big Analytics 


Twitter and Their Cascading Libraries for Dealing With Different Scenarios

This is the only interesting paragraph from InfoWorld’s article “Twitter’s programmers speed Hadoop development”:

Three Twitter teams are using Cascading in combination with programming languages: The revenue team uses Scala, the publisher analytics team uses Clojure, and the analytics team uses Jython.

Each of these combinations led to new projects:

An interesting question I couldn’t answer is why each team prefers a different language. My hypothesis:

  1. Scala, with its strong typing, for handling well-defined models that generate numbers that must always be correct.
  2. Clojure for designing new analysis models.
  3. Jython for quick experimentation with data.

Your thoughts?



Twitter's Scalding and Algebird: Matrix and Lightweight Algebra Library

The new release of Twitter’s Scalding brings quite a few interesting features:

  1. Scalding now includes a type-safe Matrix API
  2. In the familiar Fields API, we’ve added the ability to add type information to fields, which allows Scalding to pick up Ordering instances so that grouping on almost any Scala collection becomes easy.
  3. Algebird is our lightweight abstract algebra library for Scala and is targeted for building aggregation systems (such as Storm).
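
To make the third item more concrete: the general idea behind using abstract algebra for aggregation is that values with an associative combine operation and an identity element (a monoid) can be summed in any grouping, so partial results computed on different workers can be merged safely. The sketch below illustrates that idea in plain Python. It is not Algebird's API; the names `Monoid`, `monoid_sum`, `int_add`, and `count_map` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """Hypothetical helper: an identity value plus an associative combine."""
    zero: T
    plus: Callable[[T, T], T]


def monoid_sum(m: Monoid, values: Iterable) -> object:
    """Fold all values with the monoid's combine, starting from its identity."""
    return reduce(m.plus, values, m.zero)


def merge_counts(a: dict, b: dict) -> dict:
    """Combine two count maps without mutating either input."""
    out = dict(a)
    for key, value in b.items():
        out[key] = out.get(key, 0) + value
    return out


int_add = Monoid(0, lambda a, b: a + b)
count_map = Monoid({}, merge_counts)

# Two "partitions" of the same data, as two workers might see them.
part1 = [{"a": 1}, {"b": 2}]
part2 = [{"a": 3}]

# Aggregate each partition separately, then merge the partial results.
total_split = count_map.plus(
    monoid_sum(count_map, part1), monoid_sum(count_map, part2)
)
# Aggregate everything in one pass.
total_all = monoid_sum(count_map, part1 + part2)
```

Because the combine operation is associative, `total_split` and `total_all` hold the same counts, which is the property an aggregation system relies on when it merges partial results.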



Introducing Scoobi and Scalding: Scala DSLs for Hadoop MapReduce

After posting Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark, I found myself wondering why so many of these libraries are built in Scala and what their main purpose is. A day later I found Age Mooij’s presentation about Scoobi and Scalding, which provides an answer to my question, plus a quick intro to both libraries. Check the slides after the break.

Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark

Sami Badawi enumerates the issues he encountered while trying all these tools (Pig[1], Scalding[2], Scoobi[3], Hive[4], Spark[5], Scrunch[6], Cascalog[7]) for a simple experiment with Hadoop:

The task was to read log files, join them with other data, and do some statistics on arrays of doubles. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data.
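
For readers who want a feel for the shape of that task, here is a minimal single-machine sketch in plain Python, not in any of the tools listed. The log format, the `user_region` lookup table, and the function names are all assumptions made up for illustration; the original experiment's data and code are not shown in the post.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log lines: "<user id> <comma-separated doubles>".
log_lines = [
    "u1 0.5,1.5,2.5",
    "u2 4.0",
    "u1 3.5",
    "u3 9.0",
]

# Hypothetical side data to join against (user id -> region).
user_region = {"u1": "eu", "u2": "us"}


def parse(line):
    """Split one log line into a user id and a list of floats."""
    user, raw = line.split(" ", 1)
    return user, [float(x) for x in raw.split(",")]


def stats_by_region(lines, regions):
    """Inner-join log lines with the region table, then summarize per region."""
    grouped = defaultdict(list)
    for line in lines:
        user, values = parse(line)
        region = regions.get(user)
        if region is None:
            # No matching row in the side data: drop it (inner join).
            continue
        grouped[region].extend(values)
    return {
        region: {"count": len(vals), "mean": mean(vals), "max": max(vals)}
        for region, vals in grouped.items()
    }


result = stats_by_region(log_lines, user_region)
```

The libraries compared in the post express the same three steps (parse, join, aggregate) as a pipeline that Hadoop can distribute, which is what makes them attractive compared with hand-written MapReduce classes.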

  1. Pig : a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. 

  2. Scalding: A Scala API for Cascading 

  3. Scoobi: a Scala productivity framework for Hadoop 

  4. Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. 

  5. Spark: open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write 

  6. Scrunch: a Scala wrapper for Crunch 

  7. Cascalog: a fully-featured Clojure-based data processing and querying library for Hadoop  
