NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



twitter: All content tagged as twitter in NoSQL databases and polyglot persistence

Hadoop Map Reduce jobs run time data and stats with Twitter's hRaven

Announced in the first day of the Hadoop Summit, hRaven is a new open source tool from the data team at Twitter meant to collect data and help analyse the usage of a Hadoop cluster:

hRaven collects run time data and statistics from map reduce jobs running on Hadoop clusters and stores the collected job history in an easily queryable format. For the jobs that are run through frameworks (Pig or Scalding/Cascading) that decompose a script or application into a DAG of map reduce jobs for actual execution, hRaven groups job history data together by an application construct. This allows for easier visualization of all of the component jobs’ execution for an application and more comprehensive trending and analysis over time.

Original title and link: Hadoop Map Reduce jobs run time data and stats with Twitter’s hRaven (NoSQL database©myNoSQL)


Parquet - Columnar Storage Format for Hadooop by Twitter and Cloudera

Announced 2 hours ago, by Twitter’s analytics infrastructure engineer Dmitriy Ryaboy, here comes Parquet:

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

The Parquet format page describes the details of the Apache Thrift metadata encoding, supported types, Thrift definitions, etc.

Original title and link: Parquet - Columnar Storage Format for Hadooop by Twitter and Cloudera (NoSQL database©myNoSQL)


An Overview of Scalding

An intro to Scalding1, Twitter’s Scala API for Cascading, by Dean Wampler2:

“There’s not better way to write general-purpose Hadoop MapReduce programs when specialized tools like Hive and Pig aren’t quite what you need.”

Watch the video and slides after below.

✚ At Twitter, the creators of Scalding, different teams use different libraries for dealing with different scenarios.

✚ Dean Wampler is the co-author of the Programming Scala book so his preference for Scala is understandable.

✚ Do you know any other teams or companies using Scalding instead of Cascading or Cascalog?

  1. Scalding 

  2. Dean Wampler: Principal Consultant at Think Big Analytics 

Original title and link: An Overview of Scalding (NoSQL database©myNoSQL)

Twitter and Their Cascading Libraries for Dealing With Different Scenarios

This is the only interesting paragraph from InfoWorld’s article “Twitter’s programmers speed Hadoop development“:

Three Twitter teams are using Cascading in combination with programming languages: The revenue team uses Scala, the publisher analytics team uses Clojure, and the analytics team uses Jython.

Each of these combinations led to new projects:

An interesting question I couldn’t answer is why each team prefers a different language. My hypothesis:

  1. Scala with its strong typing for handling clear models generating numbers that must always be correct.
  2. Clojure for designing new analysis models
  3. Jython enables quick experimentation with data.

Your thoughts?

Original title and link: Twitter and Their Cascading Libraries for Dealing With Different Scenarios (NoSQL database©myNoSQL)


Twitter's Scalding and Algebird: Matrix and Lighweight Algebra Library

The new release of Twitter’s Scalding brings quite a few interesting features:

  1. Scalding now includes a type-safe Matrix API
  2. In the familiar Fields API, we’ve added the ability to add type information to fields which allows scalding to pick up Ordering instances so that grouping on almost any scala collection becomes easy.
  3. Algebird is our lightweight abstract algebra library for Scala and is targeted for building aggregation systems (such as Storm).

Original title and link: Twitter’s Scalding and Algebird: Matrix and Lighweight Algebra Library (NoSQL database©myNoSQL)


Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Google’s paper about their large-scale distributed systems tracing solution Dapper which inspired Twitter’s Zipkin:

Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie [3] and X-Trace [12], but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.

Download or read the paper after the break.

Tracing Distributed Systems With Twitter Zipkin

Whenever a request reaches Twitter, we decide if the request should be sampled. We attach a few lightweight trace identifiers and pass them along to all the services used in that request. By only sampling a portion of all the requests we reduce the overhead of tracing, allowing us to always have it enabled in production.

The Zipkin collector receives the data via Scribe and stores it in Cassandra along with a few indexes. The indexes are used by the Zipkin query daemon to find interesting traces to display in the web UI.

There a many APM solutions out there, but sometimes the overhead or the lack of support for specific components may lead to the need of custom solutions like this one. While in a different field, this is just another example of why products and services should be designed with openness and integration in mind.

Original title and link: Tracing Distributed Systems With Twitter Zipkin (NoSQL database©myNoSQL)


Scala Client for Cassandra From Twitter: Cassie

Staying in the land of recent open source data-related projects from Twitter, Ryan King:

Cassie is a Finagle and Scala-based client originally based on Coda Hale’s library.

While it is certainly stable— we use it in production to talk to a dozen clusters and over a thousand Cassandra machines— it is currently limited to the features we use in production and has a few rough edges.

For the JVM there’s also Netflix’s Cassandra client (Astyanax) available on GitHub.

Derrick Harris

Original title and link: Scala Client for Cassandra From Twitter: Cassie (NoSQL database©myNoSQL)


Big Graph-Processing Library From Twitter: Cassovary

Cassovary is designed from the ground up to efficiently handle graphs with billions of edges. It comes with some common node and graph data structures and traversal algorithms. A typical usage is to do large-scale graph mining and analysis.

If you are reading this you’ve most probably heard of Pregel—if you didn’t then you should check out the Pregel: a system for large-scale graph processing paper and then how Pregel and MapReduce compare—and also the 6 Pregel inspired frameworks.

The Cassovary project page introduces it as:

Cassovary is a simple “big graph” processing library for the JVM. Most JVM-hosted graph libraries are flexible but not space efficient. Cassovary is designed from the ground up to first be able to efficiently handle graphs with billions of nodes and edges. A typical example usage is to do large scale graph mining and analysis of a big network. Cassovary is written in Scala and can be used with any JVM-hosted language. It comes with some common data structures and algorithms.

I’m not sure yet if:

  1. Cassovary works with any graphy data source or requires FlockDB—which is more of a persisted graph than a graph database
  2. Cassovary is inspired by Pregel in any ways or if it’s addressing a limited problem space (similarly to FlockDB)

Update: Pankaj Gupta helped clarify the first question (and probably part of the second too):

At Twitter we use flockdb as our real-time graphdb, and export daily for use in cassovary, but any store could be used.

Original title and link: Big Graph-Processing Library From Twitter: Cassovary (NoSQL database©myNoSQL)


Cascalog-Checkpoint: Fault-Tolerant MapReduce Topologies

A brief but very clear explanation of the benefits of using Cascalog-checkpoints by Paul Lam:

Building Cascading/Cascalog queries can be visualised as assembling pipes to connect a flow of data. Imagine that you have Flow A and B. Flow B uses the result from A along with other bits. Thus, Flow B is dependent on A. Typically, if a MapReduce job fail for whatever reason, you simply fix what’s wrong and start the job all over again. But what if Flow A takes hours to run (which is common for a MR job) and the error happened in Flow B? Why re-do all that processing for Flow A if we know that it finished successfully?

Original title and link: Cascalog-Checkpoint: Fault-Tolerant MapReduce Topologies (NoSQL database©myNoSQL)


An Introduction to Scalding, the Scala and Cascading MapReduce Framework From Twitter

A fantastic guide to Twitter’s Scala and Cascading MapReduce framework Scalding from Edwin Chen1:

In 140: instead of forcing you to write raw map and reduce functions, Scalding allows you to write natural code like

// Create a histogram of tweet lengths.'tweet -> 'length) { tweet : String => tweet.size }.groupBy('length) { _.size }

Looking at the code samples, this looks a lot like Apache Pig. But the Scalding documentation compares it to Scrunch/Scoobi and points to the answers in this Quora thread:

The main difference between Scalding (and Cascading) and Scrunch/Scoobi is that Cascading has a record model where each element in your distributed list/table is a table with some named fields. This is nice because most common cases are to have a few primitive columns (ints, strings, etc…).

  1. Edwin Chen is data scientist at Twitter 

Original title and link: An Introduction to Scalding, the Scala and Cascading MapReduce Framework From Twitter (NoSQL database©myNoSQL)


Flake: An Erlang Decentralized, K-Ordered Unique ID Generator by Boundary

After posting about the proposal of adding a 128-bit K-Order unique id generator to Redis, I’ve realized I haven’t linked to Boundary’s post about Flake, their decentralized k-ordered unique id generator project:

All that is required to construct such an id is a monotonically increasing clock and a location 3. K-ordering dictates that the most-significant bits of the id be the timestamp. UUID-1 contains this information, but arranges the pieces in such a way that k-ordering is lost. Still other schemes offer k-ordering with either a questionable representation of ‘location’ or one that requires coordination among nodes.

Original title and link: Flake: An Erlang Decentralized, K-Ordered Unique ID Generator by Boundary (NoSQL database©myNoSQL)