


Databus: All content tagged as Databus in NoSQL databases and polyglot persistence

From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix

Prasanna Padmanabhan and Shashi Madappa posted an article on the Netflix blog describing the process used to migrate data from Amazon SimpleDB to Cassandra:

There will come a time in the life of most systems serving data, when there is a need to migrate data to a more reliable, scalable and high performance data store while maintaining or improving data consistency, latency and efficiency. This document explains the data migration technique we used at Netflix to migrate the user’s queue data between two different distributed NoSQL storage systems.

The steps involved are what you’d expect for a large data set migration:

  1. forklift
  2. incremental replication
  3. consistency checking
  4. shadow writes
  5. shadow writes and shadow reads for validation
  6. end of life of the original data store (SimpleDB)
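The phases above can be sketched in a few lines. This is an illustrative toy, using plain dicts to stand in for SimpleDB (source) and Cassandra (target); none of the function names come from Netflix's actual code.

```python
# Hypothetical sketch of the migration phases, with dicts standing in for
# the two data stores. Illustrative only, not Netflix's implementation.

def forklift(source, target):
    """Phase 1: bulk-copy every existing record into the new store."""
    for key, value in source.items():
        target[key] = value

def replicate_incremental(changelog, target):
    """Phase 2: apply writes captured at the source since the forklift began."""
    for key, value in changelog:
        target[key] = value

def check_consistency(source, target):
    """Phase 3: report keys whose values differ between the two stores."""
    return [k for k in source if target.get(k) != source[k]]

def shadow_write(source, target, key, value):
    """Phases 4-5: the application writes to both stores; reads can then be
    compared (shadow reads) before the old store is decommissioned."""
    source[key] = value
    target[key] = value

# Usage: simulate the pipeline end to end.
simpledb = {"u1": "queue-a", "u2": "queue-b"}
cassandra = {}
forklift(simpledb, cassandra)
replicate_incremental([("u3", "queue-c")], cassandra)
shadow_write(simpledb, cassandra, "u1", "queue-a2")
assert check_consistency(simpledb, cassandra) == []
```

The key property the real system has to guarantee, and that this sketch glosses over, is that incremental replication and shadow writes overlap in time without losing or reordering updates.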

If you think about it, this is how a distributed, eventually consistent storage system works (at least in broad strokes) when replicating data across the cluster. The main difference is that inside a storage engine you deal with a homogeneous system with a single set of constraints, while data migration has to deal with heterogeneous systems, most often characterized by different limitations and behavior.

In 2009, Netflix performed a similar massive data migration operation. At that time it involved moving data from its own hosted Oracle and MySQL databases to SimpleDB. The challenges of operating this hybrid solution were described in the paper Netflix’s Transition to High-Availability Storage Systems authored by Sid Anand.

Sid Anand is now working at LinkedIn, where they use Databus for low latency data transfer, and Databus’s approach is very similar.

Original title and link: From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix (NoSQL database©myNoSQL)


Apache Flume Performance Tuning

A lot of apps need to ship logs, and while there are probably numerous tools to help with this, Apache Flume1 is the one I’d look at first (even if only for taking inspiration on how to do things):

An important decision to make when designing your Flume flow is what type of channel you want to use. At the time of this writing, the two recommended channels are the file channel and the memory channel. The file channel is a durable channel, as it persists all events that are stored in it to disk. So, even if the Java virtual machine is killed, or the operating system crashes or reboots, events that were not successfully transferred to the next agent in the pipeline will still be there when the Flume agent is restarted. The memory channel is a volatile channel, as it buffers events in memory only: if the Java process dies, any events stored in the memory channel are lost. Naturally, the memory channel also exhibits very low put/take latencies compared to the file channel, even for a batch size of 1. Since the number of events that can be stored is limited by available RAM, its ability to buffer events in the case of temporary downstream failure is quite limited. The file channel, on the other hand, has far superior buffering capability due to utilizing cheap, abundant hard disk space.
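For concreteness, here is a hedged sketch of what choosing between the two channel types looks like in a Flume NG agent configuration. The agent, source, and sink names are illustrative; the channel properties shown (`checkpointDir`, `dataDirs`, `capacity`) are the standard ones for the file and memory channel types.

```properties
# Hypothetical Flume NG agent showing both channel types discussed above.
agent1.sources = src1
agent1.channels = durable fast
agent1.sinks = hdfs1 logger1

# Durable file channel: events survive a JVM or OS crash.
agent1.channels.durable.type = file
agent1.channels.durable.checkpointDir = /var/flume/checkpoint
agent1.channels.durable.dataDirs = /var/flume/data

# Volatile memory channel: lower latency, bounded by RAM, lost on crash.
agent1.channels.fast.type = memory
agent1.channels.fast.capacity = 10000
```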

Just a couple of extra-thoughts:

  1. Flume NG seems to offer 3 types of channels: file, jdbc, memory.
  2. For the memory channel, I’d be adding an option to start dropping events if the memory consumption goes above a configurable threshold (this might already be implemented, but I couldn’t find it).
  3. Would it be worth investigating a channel based on LinkedIn’s low latency transfer Databus tool?

  1. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. 

Original title and link: Apache Flume Performance Tuning (NoSQL database©myNoSQL)


What Is Unique About LinkedIn’s Databus

After learning about LinkedIn’s Databus low latency data transfer system, I’ve had a short chat with Sid Anand focused on understanding what makes Databus unique.

As I’ve mentioned in my post about Databus, Databus looks at first like a data-oriented ESB. But what is innovative about Databus comes from decoupling the data source from the consumers/clients, thus being able to offer low latency to a large number of up-to-date subscribers, while also helping clients that fall behind or are just bootstrapping, without adding load on the source database.

Databus clients are smart enough to:

  1. ask for Consolidated Deltas since time T if they fall behind
  2. ask for a Consistent Snapshot and then for a Consolidated Delta if they bootstrap

and Databus is built so it can serve both Consolidated Deltas and Consistent Snapshots without any impact on the original data source.
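The client-side decision above can be sketched as follows. A client compares its last-applied sequence number against what the relay still retains: recent clients stream directly, lagging ones ask for a Consolidated Delta, and brand-new ones bootstrap from a Consistent Snapshot first. All names and the sequence-number mechanics here are my illustrative assumptions, not Databus’s actual API.

```python
# Hedged sketch of a Databus-style client's catch-up decision.
# Names and mechanics are illustrative, not the real Databus API.

def catch_up(client_seq, relay_oldest_seq, snapshot_seq):
    """Return the plan a client would follow to get up to date."""
    if client_seq is None:
        # Bootstrapping: snapshot, then the consolidated delta since it.
        return ["consistent_snapshot",
                "consolidated_delta_since(%d)" % snapshot_seq]
    if client_seq < relay_oldest_seq:
        # Fell behind the relay's retention window.
        return ["consolidated_delta_since(%d)" % client_seq]
    # Recent enough to stream from the relay directly.
    return ["stream_from_relay(%d)" % client_seq]

# Usage: a new client, a lagging client, and an up-to-date client.
assert catch_up(None, 100, 90) == ["consistent_snapshot",
                                   "consolidated_delta_since(90)"]
assert catch_up(50, 100, 90) == ["consolidated_delta_since(50)"]
assert catch_up(150, 100, 90) == ["stream_from_relay(150)"]
```

The point of the bootstrap path is that both the snapshot and the delta are served from Databus itself, which is how the source database stays unaffected.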

Databus Bootstrapping


The “catching-up” and bootstrapping processes are described in much more detail in Sid Anand’s article.

Databus is the one and only way that data is replicated from LinkedIn’s databases to search indexes, the graph, Memcached, Voldemort, etc.

Original title and link: What Is Unique About LinkedIn’s Databus (NoSQL database©myNoSQL)

Introducing Databus: LinkedIn's Low Latency Change Data Capture Tool

Great article by Siddharth Anand1 introducing LinkedIn’s Databus: a low latency system used for transferring data between data stores (change data capture system):

Databus offers the following features:

  • Pub-sub semantics
  • In-commit-order delivery guarantees
  • Commits at the source are grouped by transaction
    • ACID semantics are preserved through the entire pipeline
  • Supports partitioning of streams
    • Ordering guarantees are then per partition
  • Like other messaging systems, offers very low latency consumption for recently-published messages
  • Unlike other messaging systems, offers arbitrarily-long look-back with no impact to the source
  • High Availability and Reliability

The ESB model is well-known, but like NoSQL databases, Databus is specialized in handling specific requirements related to distributed systems and high volume data processing architectures.
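The most consequential item in the feature list is that ordering is guaranteed per partition, not globally, so a consumer needs one watermark per partition. A minimal sketch of what that implies for a consumer (event shape and names are my illustrative assumptions):

```python
from collections import defaultdict

# Illustrative consumer of a partitioned change stream: events are only
# totally ordered within a partition, so track one watermark per partition.
def consume(events):
    """Apply events, asserting in-commit-order delivery per partition."""
    last_seq = defaultdict(int)
    applied = []
    for partition, seq, payload in events:
        assert seq > last_seq[partition], "out-of-order delivery in partition"
        last_seq[partition] = seq
        applied.append(payload)
    return applied

# Usage: interleaving across partitions is fine; order within each is not.
events = [
    ("p0", 1, "insert member#7"),
    ("p1", 1, "insert member#9"),
    ("p0", 2, "update member#7"),
]
assert consume(events) == ["insert member#7", "insert member#9",
                           "update member#7"]
```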

  1. Siddharth Anand: senior member of LinkedIn’s Distributed Data Systems team 

Original title and link: Introducing Databus: LinkedIn’s Low Latency Change Data Capture Tool (NoSQL database©myNoSQL)