


linkedin: All content tagged as linkedin in NoSQL databases and polyglot persistence

LinkedIn's Hourglass: Incremental data processing in Hadoop

Matthew Hayes introduces a very interesting new framework from LinkedIn

Hourglass is designed to make computations over sliding windows more efficient. For these types of computations, the input data is partitioned in some way, usually according to time, and the range of input data to process is adjusted as new data arrives. Hourglass works with input data that is partitioned by day, as this is a common scheme for partitioning temporal data.

Hourglass is available on GitHub.

We have found that two types of sliding window computations are extremely common in practice:

  • Fixed-length: the length of the window is set to some constant number of days and the entire window moves forward as new data becomes available. Example: a daily report summarizing the number of visitors to a site from the past 30 days.
  • Fixed-start: the beginning of the window stays constant, but the end slides forward as new input data becomes available. Example: a daily report summarizing all visitors to a site since the site launched.
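The two window types above can be sketched in a few lines of Python. This is only a toy illustration of the incremental idea, not Hourglass's actual API (Hourglass is a Java framework operating over day-partitioned data in HDFS):

```python
from collections import deque

def fixed_length_window(daily_counts, window=30):
    """Rolling total over the last `window` days, updated incrementally:
    add the newest day's count, subtract the day that fell out of the window."""
    buf = deque()
    total = 0
    for day, count in daily_counts:
        buf.append(count)
        total += count
        if len(buf) > window:
            total -= buf.popleft()   # oldest day slides out of the window
        yield day, total

def fixed_start_window(daily_counts):
    """Running total since launch: each day only the newest partition is added."""
    total = 0
    for day, count in daily_counts:
        total += count
        yield day, total
```

The point Hourglass makes at scale is the same one the first function makes in miniature: when a new day arrives, you reuse the previous result instead of reprocessing all 30 input partitions.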

Original title and link: LinkedIn’s Hourglass: Incremental data processing in Hadoop (NoSQL database©myNoSQL)


White Elephant: Task Statistics for Hadoop

From LinkedIn’s engineering team:

While tools like Ganglia provide system-level metrics, we wanted to be able to understand what resources were being used by each user and at what times. White Elephant parses Hadoop logs to provide visual drill downs and rollups of task statistics for your Hadoop cluster, including total task time, slots used, CPU time, and failed job counts.
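As a rough illustration of the kind of per-user rollup described above (the record fields here are hypothetical; the real tool parses Hadoop job history logs):

```python
from collections import defaultdict

# Hypothetical pre-parsed log records; White Elephant itself extracts
# these figures from Hadoop's job logs.
records = [
    {"user": "alice", "task_time_ms": 120_000, "slots": 2, "failed": False},
    {"user": "bob",   "task_time_ms": 30_000,  "slots": 1, "failed": True},
    {"user": "alice", "task_time_ms": 60_000,  "slots": 1, "failed": False},
]

def rollup_by_user(records):
    """Aggregate per-user totals: task time, slots used, failed job count."""
    stats = defaultdict(lambda: {"task_time_ms": 0, "slots": 0, "failed_jobs": 0})
    for r in records:
        s = stats[r["user"]]
        s["task_time_ms"] += r["task_time_ms"]
        s["slots"] += r["slots"]
        s["failed_jobs"] += int(r["failed"])
    return dict(stats)
```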

Isn’t this a form of resource usage auditing? Based on this, next you could build support for resource quotas and then start enforcing them.

Original title and link: White Elephant: Task Statistics for Hadoop (NoSQL database©myNoSQL)


Which Big Data Company Has the World's Biggest Hadoop Cluster?

Jimmy Wong:

Which companies use Hadoop for analyzing big data? How big are their clusters? I thought it would be fun to compare companies by the size of their Hadoop installations. The size would indicate the company’s investment in Hadoop, and subsequently their appetite to buy big data products and services from vendors, as well as their hiring needs to support their analytics infrastructure.


Unfortunately, the available data is very scarce and quite old.

Original title and link: Which Big Data Company Has the World’s Biggest Hadoop Cluster? (NoSQL database©myNoSQL)


From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix

Prasanna Padmanabhan and Shashi Madapp posted an article on the Netflix blog describing the process used to migrate data from Amazon SimpleDB to Cassandra:

There will come a time in the life of most systems serving data, when there is a need to migrate data to a more reliable, scalable and high performance data store while maintaining or improving data consistency, latency and efficiency. This document explains the data migration technique we used at Netflix to migrate the user’s queue data between two different distributed NoSQL storage systems.

The steps involved are what you’d expect for a large data set migration:

  1. forklift
  2. incremental replication
  3. consistency checking
  4. shadow writes
  5. shadow writes and shadow reads for validation
  6. end of life of the original data store (SimpleDB)
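The shadow-write and shadow-read phases (steps 4 and 5) can be sketched as a thin proxy in front of the two stores. This is a minimal illustration of the pattern, not Netflix's actual implementation:

```python
class MigrationProxy:
    """Sketch of the shadow phases of a store migration: every write goes to
    both stores; reads are served from the old store (still the source of
    truth) and compared against the new one to validate consistency."""

    def __init__(self, old_store, new_store):
        self.old, self.new = old_store, new_store
        self.mismatches = 0

    def write(self, key, value):
        self.old[key] = value   # source of truth during migration
        self.new[key] = value   # shadow write to the target store

    def read(self, key):
        value = self.old.get(key)
        if self.new.get(key) != value:   # shadow read for validation
            self.mismatches += 1
        return value
```

Once the mismatch rate stays at zero for long enough, reads can be flipped to the new store and the old one retired (step 6).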

If you think about it, this is how a distributed, eventually consistent store works (at least in broad strokes) when replicating data across the cluster. The main difference is that inside a storage engine you deal with a homogeneous system with a single set of constraints, while data migration has to deal with heterogeneous systems, most often characterized by different limitations and behaviors.

In 2009, Netflix performed a similar massive data migration, that time moving data from its own hosted Oracle and MySQL databases to SimpleDB. The challenges of operating this hybrid solution were described in the paper Netflix’s Transition to High-Availability Storage Systems, authored by Sid Anand.

Sid Anand is now working at LinkedIn, where they use Databus for low latency data transfer. Databus’s approach is very similar.

Original title and link: From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix (NoSQL database©myNoSQL)


DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn

Sam Shah in a guest post on Hortonworks blog:

If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something. […] Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.”

“A penetrating oil and water-displacing spray”? “littlepiggy”? Seriously?

How could one come up with these names for such a useful library of statistical functions, PageRank, set and bag operations?

Original title and link: DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn (NoSQL database©myNoSQL)


What Is Unique About LinkedIn’s Databus

After learning about LinkedIn’s Databus low latency data transfer system, I’ve had a short chat with Sid Anand focused on understanding what makes Databus unique.

As I’ve mentioned in my post about Databus, at first Databus looks like a data-oriented ESB. What is innovative about it comes from decoupling the data source from the consumers/clients: it can serve a large number of up-to-date subscribers with low latency, but it can also help clients that have fallen behind or are just bootstrapping, all without adding load on the source database.

Databus clients are smart enough to:

  1. ask for Consolidated Deltas since time T if they fall behind
  2. ask for a Consistent Snapshot and then for a Consolidated Delta if they bootstrap

and Databus is built so it can serve both Consolidated Deltas and Consistent Snapshots without any impact on the original data source.
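The client-side decision can be sketched as follows. This is a toy in-memory model of the relay/bootstrap split, using invented names, not the real Databus API:

```python
class Relay:
    """Toy stand-in for a Databus relay plus bootstrap service.
    `log` is the full change stream of (key, value) events, in commit order;
    the relay only retains events from offset `oldest` onward."""

    def __init__(self, log, oldest):
        self.log, self.oldest = log, oldest

    def consolidated_delta(self, since):
        return self.log[since:]

    def consistent_snapshot(self):
        # Full state folded from the log, plus the offset it is consistent at.
        state = {}
        for key, value in self.log:
            state[key] = value
        return state, len(self.log)

def catch_up(state, offset, relay):
    """If the relay still holds our position, consume the Consolidated Delta;
    otherwise bootstrap from a Consistent Snapshot. The source database is
    never touched in either path."""
    if offset >= relay.oldest:
        changes = relay.consolidated_delta(offset)
        state = dict(state)
        state.update(changes)
        return state, offset + len(changes)
    return relay.consistent_snapshot()
```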

Databus Bootstrapping


The “catching-up” and bootstrapping processes are described in much more detail in Sid Anand’s article.

Databus is the one and only way data is replicated from LinkedIn’s databases to search indexes, the graph, Memcached, Voldemort, etc.

Original title and link: What Is Unique About LinkedIn’s Databus (NoSQL database©myNoSQL)

Introducing Databus: LinkedIn's Low Latency Change Data Capture Tool

Great article by Siddharth Anand1 introducing LinkedIn’s Databus: a low latency system used for transferring data between data stores (change data capture system):

Databus offers the following features:

  • Pub-sub semantics
  • In-commit-order delivery guarantees
  • Commits at the source are grouped by transaction
    • ACID semantics are preserved through the entire pipeline
  • Supports partitioning of streams
    • Ordering guarantees are then per partition
  • Like other messaging systems, offers very low latency consumption for recently-published messages
  • Unlike other messaging systems, offers arbitrarily-long look-back with no impact to the source
  • High Availability and Reliability

The ESB model is well-known, but like NoSQL databases, Databus is specialized in handling specific requirements related to distributed systems and high volume data processing architectures.

  1. Siddharth Anand: senior member of LinkedIn’s Distributed Data Systems team 

Original title and link: Introducing Databus: LinkedIn’s Low Latency Change Data Capture Tool (NoSQL database©myNoSQL)


LinkedIn NoSQL Paper: Serving Large-Scale Batch Computed Data With Project Voldemort

The abstract of a new paper from a team at LinkedIn (Roshan Sumbaly, Jay Kreps, Lei Gao, Alex Feinberg, Chinmay Soman, Sam Shah):

Current serving systems lack the ability to bulk load massive immutable data sets without affecting serving performance. The performance degradation is largely due to index creation and modification as CPU and memory resources are shared with request serving. We have extended Project Voldemort, a general-purpose distributed storage and serving system inspired by Amazon’s Dynamo, to support bulk loading terabytes of read-only data. This extension constructs the index offline, by leveraging the fault tolerance and parallelism of Hadoop. Compared to MySQL, our compact storage format and data deployment pipeline scales to twice the request throughput while maintaining sub 5 ms median latency. At LinkedIn, the largest professional social network, this system has been running in production for more than 2 years and serves many of the data-intensive social features on the site.

Read or download the paper after the break.
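The core idea of the paper, build the index offline so that serving never mutates it, can be sketched as a toy in-memory analogue (the real system constructs the stores as Hadoop jobs and ships immutable files to Voldemort nodes):

```python
import bisect

def build_readonly_store(records):
    """Offline phase: sort the data once into parallel key/value arrays.
    In the paper this is done in Hadoop; serving then only reads the result."""
    items = sorted(records.items())
    keys = [k for k, _ in items]
    values = [v for _, v in items]
    return keys, values

def lookup(store, key):
    """Online phase: binary search over the immutable index; no writes,
    so serving never competes with index creation for CPU or memory."""
    keys, values = store
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None
```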

DataFu: Open Source Apache Pig UDFs by LinkedIn

Here’s a taste of what you can do with DataFu:

  • Run PageRank on a large number of independent graphs.
  • Perform set operations such as intersect and union.
  • Compute the haversine distance between two points on the globe.
  • Create an assertion on input data which will cause the script to fail if the condition is not met.
  • Perform various operations on bags such as append a tuple, prepend a tuple, concatenate bags, generate unordered pairs, etc.
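The haversine computation from the list above, for example, is a self-contained formula. Here it is as a plain Python sketch (DataFu ships it as a Pig UDF; the function name and units here are my own choices):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```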

I’m starting to notice a pattern here. Twitter is open sourcing pretty much everything they are doing related to data storage. Yahoo (now Hortonworks) and Cloudera are the forces behind the open source Hadoop and the tools to bring data to Hadoop. And LinkedIn is starting to open source the tools they are using on top of Hadoop to analyze big data.

What is interesting about this is that you might not get the most polished tools, but they are definitely battle-tested.

Original title and link: DataFu: Open Source Apache Pig UDFs by LinkedIn (NoSQL database©myNoSQL)


Data Jujitsu and Data Karate

David F. Carr in an article about DJ Patil and his work on Big Data at LinkedIn:

That is what he means by data jujitsu, where jujitsu is the art of using an opponent’s leverage and momentum against him. In data jujitsu, you try to use the scope of the problem to create the solution—without investing disproportionate resources at the early experimental stage. That’s as opposed to data karate, which would be a direct frontal assault to hack your way through the problem.

Original title and link: Data Jujitsu and Data Karate (NoSQL database©myNoSQL)


State of HBase With Michael Stack

Michael Stack (StumbleUpon & Hadoop PMC) presents on some of the more interesting HBase deployments, HBase scenario usages, HBase and HDFS, and near-future of HBase:

Kafka: LinkedIn's Distributed Publish/Subscribe Messaging System

Another open source project from LinkedIn:

Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:

  • Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
  • High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
  • Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
  • Support for parallel data load into Hadoop.
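The first and third points can be illustrated with a toy in-memory commit log (Kafka itself appends to segment files on disk and addresses messages by offset; this Python sketch only mirrors the shape of the idea):

```python
class CommitLog:
    """Toy per-partition append-only log: appends and offset-based reads are
    constant-time list operations, and ordering is preserved per partition."""

    def __init__(self):
        self.partitions = {}

    def append(self, partition, message):
        log = self.partitions.setdefault(partition, [])
        log.append(message)            # append-only; nothing is rewritten
        return len(log) - 1            # offset of the appended message

    def read(self, partition, offset, max_messages=100):
        """Consumers track their own offset and pull from it, which is also
        what makes arbitrarily-long look-back cheap."""
        return self.partitions.get(partition, [])[offset:offset + max_messages]
```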

LinkedIn has open sourced a couple of exciting projects, but so far they haven’t been able to attract enough attention and grow a community around them.

Original title and link: Kafka: LinkedIn’s Distributed Publish/Subscribe Messaging System (NoSQL databases © myNoSQL)