LinkedIn: All content tagged as LinkedIn in NoSQL databases and polyglot persistence
Wednesday, 20 March 2013
White Elephant: Task Statistics for Hadoop
From LinkedIn’s engineering team:
While tools like Ganglia provide system-level metrics, we wanted to be able to understand what resources were being used by each user and at what times. White Elephant parses Hadoop logs to provide visual drill downs and rollups of task statistics for your Hadoop cluster, including total task time, slots used, CPU time, and failed job counts.
Isn’t this a form of resource usage auditing? Based on this, next you could build support for resource quotas and then start enforcing them.
Original title and link: White Elephant: Task Statistics for Hadoop (©myNoSQL)
via: http://engineering.linkedin.com/hadoop/white-elephant-hadoop-tool-you-never-knew-you-needed
Friday, 22 February 2013
Which Big Data Company Has the World's Biggest Hadoop Cluster?
Jimmy Wong:
Which companies use Hadoop for analyzing big data? How big are their clusters? I thought it would be fun to compare companies by the size of their Hadoop installations. The size would indicate the company’s investment in Hadoop, and subsequently their appetite to buy big data products and services from vendors, as well as their hiring needs to support their analytics infrastructure.
Unfortunately the data available is sooo little and soooo old.
Original title and link: Which Big Data Company Has the World’s Biggest Hadoop Cluster? (©myNoSQL)
via: http://www.hadoopwizard.com/which-big-data-company-has-the-worlds-biggest-hadoop-cluster/
Monday, 18 February 2013
From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix
Prasanna Padmanabhan and Shashi Madapp posted an article on the Netflix blog describing the process used to migrate data from Amazon SimpleDB to Cassandra:
There will come a time in the life of most systems serving data, when there is a need to migrate data to a more reliable, scalable and high performance data store while maintaining or improving data consistency, latency and efficiency. This document explains the data migration technique we used at Netflix to migrate the user’s queue data between two different distributed NoSQL storage systems.
The steps involved are what you’d expect for a large data set migration:
- forklift
- incremental replication
- consistency checking
- shadow writes
- shadow writes and shadow reads for validation
- end of life of the original data store (SimpleDB)
If you think of it, this is how a distributed, eventually consistent storage works (at least in big lines) when replicating data across the cluster. The main difference is that inside a storage engine you deal with a homogeneous system with a single set of constraints, while data migration has to deal with heterogenous systems most often characterized by different limitations and behavior.
In 2009, Netflix performed a similar massive data migration operation. At that time it involved moving data from its own hosted Oracle and MySQL databases to SimpleDB. The challenges of operating this hybrid solution were described in a the paper Netflix’s Transition to High-Availability Storage Systems authored by Sid Anand.
Sid Anand is now working at LinkedIn where they use Databus for low latency data transfer. But Databus’s approach is very similar.
Original title and link: From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix (©myNoSQL)
via: http://techblog.netflix.com/2013/02/netflix-queue-data-migration-for-high.html?m=1
Wednesday, 30 January 2013
DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn
Sam Shah in a guest post on Hortonworks blog:
If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something. […] Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.”
“a penetrating oil and water-displacing spray“? “littlepiggy”? Seriously?
How could one come up with these names for such a useful library of statistical functions, PageRank, set and bag operations?
Original title and link: DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn (©myNoSQL)
Wednesday, 21 March 2012
What Is Unique About LinkedIn’s Databus
After learning about LinkedIn’s Databus low latency data transfer system, I’ve had a short chat with Sid Anand focused on understanding what makes Databus unique.
As I’ve mentioned in my post about Databus, Databus looks at first as a data-oriented ESB. But what is innovative about Databus comes from decoupling the data source from the consumers/clients thus being able to offer speed to a large number of subscribers that are up-to-date, but also help clients that fall behind or are just bootstrapping without adding load on the source database.
Databus clients are smart enough to:
- ask for Consolidated Deltas since time T if they fall behind
- ask for a Consistent Snapshot and then for a Consolidated Delta if they bootstrap
and Databus is build so it can serve both Consolidate Deltas and Consistent Snapshots without any impact on the original data source.

Diagram from Highscalability.com
The “catching-up” and boostrapping processes are described in much more details in Sid Anand’s article.
Databus is the single and only way that data is replicated from LinkedIn’s databases to search indexes, the graph, Memcached, Voldemort, etc.
Original title and link: What Is Unique About LinkedIn’s Databus (©myNoSQL)
Monday, 19 March 2012
Introducing Databus: LinkedIn's Low Latency Change Data Capture Tool
Great article by Siddharth Anand1 introducing LinkedIn’s Databus: a low latency system used for transferring data between data stores (change data capture system):
Databus offers the following feature:
- Pub-sub semantics
- In-commit-order delivery guarantees
- Commits at the source are grouped by transaction
- ACID semantics are preserved through the entire pipeline
- Supports partitioning of streams
- Ordering guarantees are then per partition
- Like other messaging systems, offers very low latency consumption for recently-published messages
- Unlike other messaging systems, offers arbitrarily-long look-back with no impact to the source
- High Availability and Reliability
The ESB model is well-known, but like NoSQL databases, Databus is specialized in handling specific requirements related to distributed systems and high volume data processing architectures.
-
Siddharth Anand: senior member of LinkedIn’s Distributed Data Systems team ↩
Original title and link: Introducing Databus: LinkedIn’s Low Latency Change Data Capture Tool (©myNoSQL)
Monday, 27 February 2012
LinkedIn NoSQL Paper: Serving Large-Scale Batch Computed Data With Project Voldemort
The abstract of a new paper from a team at LinkedIn (Roshan Sumbaly, Jay Kreps, Lei Gao, Alex Feinberg, Chinmay Soman, Sam Shah):
Current serving systems lack the ability to bulk load massive immutable data sets without affecting serving performance. The performance degradation is largely due to index creation and modification as CPU and memory resources are shared with request serving. We have ex- tended Project Voldemort, a general-purpose distributed storage and serving system inspired by Amazon’s Dy- namo, to support bulk loading terabytes of read-only data. This extension constructs the index offline, by leveraging the fault tolerance and parallelism of Hadoop. Compared to MySQL, our compact storage format and data deploy- ment pipeline scales to twice the request throughput while maintaining sub 5 ms median latency. At LinkedIn, the largest professional social network, this system has been running in production for more than 2 years and serves many of the data-intensive social features on the site.
Read or download the paper after the break.
Thursday, 12 January 2012
DataFu: Open Source Apache Pig UDFs by LinkedIn
Here’s a taste of what you can do with DataFu:
- Run PageRank on a large number of independent graphs.
- Perform set operations such as intersect and union.
- Compute the haversine distance between two points on the globe.
- Create an assertion on input data which will cause the script to fail if the condition is not met.
- Perform various operations on bags such as append a tuple, prepend a tuple, concatenate bags, generate unordered pairs, etc.
I’m starting to notice a pattern here. Twitter is open sourcing pretty much everything they are doing related to data storage. Yahoo (now Hortonworks) and Cloudera are the forces behind the open source Hadoop and the tools to bring data to Hadoop. And LinkedIn is starting to open source the tools they are using on top of Hadoop to analyze big data.
What is interesting about this is that you might not get the most polished tools, but they definitely are battle tested.
Original title and link: DataFu: Open Source Apache Pig UDFs by LinkedIn (©myNoSQL)
Thursday, 3 November 2011
Data Jujitsu and Data Karate
David F. Carr in an article about DJ Patil and his work on Big Data at LinkedIn:
That is what he means by data jujitsu, where jujitsu is the art of using an opponent’s leverage and momentum against him. In data jujitsu, you try to use the scope of the problem to create the solution—without investing disproportionate resources at the early experimental stage. That’s as opposed to data karate, which would be a direct frontal assault to hack your way through the problem.
Original title and link: Data Jujitsu and Data Karate (©myNoSQL)
Saturday, 10 September 2011
State of HBase With Michael Stack
Michael Stack (StumbleUpon & Hadoop PMC) presents on some of the more interesting HBase deployments, HBase scenario usages, HBase and HDFS, and near-future of HBase:
Wednesday, 29 December 2010
Kafka: LinkedIn's Distributed Publish/Subscribe Messaging System
Another open source project from LinkedIn:
Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:
- Persistent messaging with
O(1)disk structures that provide constant time performance even with many TB of stored messages.- High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
- Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
- Support for parallel data load into Hadoop.
LinkedIn has open sourced a couple of exciting projects, but they haven’t been able to get enough attention and grow so far a community around these.
Original title and link: Kafka: LinkedIn’s Distributed Publish/Subscribe Messaging System (NoSQL databases © myNoSQL)
Most Popular Articles
- Translate SQL to MongoDB MapReduce
- Tutorial: Getting Started With Cassandra
- CouchDB vs MongoDB: An attempt for a More Informed Comparison
- Cassandra @ Twitter: An Interview with Ryan King
- A Couple of Nice GUI Tools for MongoDB
- NoSQL benchmarks and performance evaluations
- Ehcache: Distributed Cache or NoSQL Store?
- Document Databases Compared: CouchDB, MongoDB, RavenDB
- Quick Review of Existing Graph Databases
- NoSQL Data Modeling
