Apache Flume Performance Tuning

A lot of apps need to ship logs, and while there are probably numerous tools to help with this, Apache Flume[1] is the one I’d look at first (even if only to take inspiration on how to do things):

An important decision to make when designing your Flume flow is what type of channel you want to use. At the time of this writing, the two recommended channels are the file channel and the memory channel. The file channel is a durable channel, as it persists all events that are stored in it to disk. So, even if the Java virtual machine is killed, or the operating system crashes or reboots, events that were not successfully transferred to the next agent in the pipeline will still be there when the Flume agent is restarted. The memory channel is a volatile channel, as it buffers events in memory only: if the Java process dies, any events stored in the memory channel are lost. Naturally, the memory channel also exhibits very low put/take latencies compared to the file channel, even for a batch size of 1. Since the number of events that can be stored is limited by available RAM, its ability to buffer events in the case of temporary downstream failure is quite limited. The file channel, on the other hand, has far superior buffering capability due to utilizing cheap, abundant hard disk space.
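To make the trade-off described in the quote a bit more concrete, here is a rough sketch of how the two channels might be declared in a Flume NG agent properties file. The agent name `agent`, the channel names `memCh` and `fileCh`, the directory paths, and all the sizes are placeholders I made up for illustration, not recommended values. Sources and sinks are left out, so this is a fragment rather than a complete agent configuration.

```properties
# Hypothetical agent named "agent" with two channels.
# Names, paths, and sizes below are placeholders, not tuned values.
agent.channels = memCh fileCh

# Memory channel: low put/take latency, but events are lost if the JVM dies.
agent.channels.memCh.type = memory
agent.channels.memCh.capacity = 10000
agent.channels.memCh.transactionCapacity = 100

# File channel: events are persisted to disk and survive a restart.
agent.channels.fileCh.type = file
agent.channels.fileCh.checkpointDir = /var/lib/flume/checkpoint
agent.channels.fileCh.dataDirs = /var/lib/flume/data
agent.channels.fileCh.capacity = 1000000
agent.channels.fileCh.transactionCapacity = 1000
```

The much larger `capacity` on the file channel mirrors the point in the quote: disk is cheap, so the file channel can absorb a far longer downstream outage than a RAM-bound memory channel.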

Just a couple of extra thoughts:

  1. Flume NG seems to offer three types of channels: file, jdbc, and memory.
  2. For the memory channel, I’d add an option to start dropping events once memory consumption goes above a configurable threshold (this might already be implemented, but I couldn’t find it).
  3. Would it be worth investigating a channel based on Databus, LinkedIn’s low-latency data transfer tool?
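As a minimal sketch of what I mean in the second point, here is a toy in-memory channel that starts dropping events once a byte budget is exceeded. This is not Flume code and does not use Flume's API: the class `DroppingMemoryChannel`, the `max_bytes` parameter, and the choice to drop the newest event are all assumptions made for illustration.

```python
from collections import deque


class DroppingMemoryChannel:
    """Hypothetical in-memory channel that sheds load once a byte budget is hit."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # configurable threshold, assumed to be in bytes
        self.used_bytes = 0         # bytes currently buffered
        self.dropped = 0            # number of events rejected so far
        self._events = deque()

    def put(self, event: bytes) -> bool:
        """Buffer the event, or drop it if it would push usage over the threshold."""
        if self.used_bytes + len(event) > self.max_bytes:
            self.dropped += 1
            return False
        self._events.append(event)
        self.used_bytes += len(event)
        return True

    def take(self):
        """Return the oldest buffered event, or None if the channel is empty."""
        if not self._events:
            return None
        event = self._events.popleft()
        self.used_bytes -= len(event)
        return event
```

This version drops the incoming event and keeps what is already buffered. Dropping the oldest event instead would favor fresh data, and which of the two is preferable probably depends on the kind of logs being shipped.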

[1] Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Original title and link: Apache Flume Performance Tuning (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/