NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



flume: All content tagged as flume in NoSQL databases and polyglot persistence

Apache Flume Performance Tuning

A lot of apps get to ship logs and while there are probably numerous tools to help with this, Apache Flume1 is the one I’d look first (even if for taking inpiration on how to do things):

An important decision to make when designing your Flume flow is what type of channel you want to use. At the time of this writing, the two recommended channels are the file channel and the memory channel. The file channel is a durable channel, as it persists all events that are stored in it to disk. So, even if the Java virtual machine is killed, or the operating system crashes or reboots, events that were not successfully transferred to the next agent in the pipeline will still be there when the Flume agent is restarted. The memory channel is a volatile channel, as it buffers events in memory only: if the Java process dies, any events stored in the memory channel are lost. Naturally, the memory channel also exhibits very low put/take latencies compared to the file channel, even for a batch size of 1. Since the number of events that can be stored is limited by available RAM, its ability to buffer events in the case of temporary downstream failure is quite limited. The file channel, on the other hand, has far superior buffering capability due to utilizing cheap, abundant hard disk space.

Just a couple of extra-thoughts:

  1. Flume NG seems to offer 3 types of channels: file, jdbc, memory.
  2. For the memory channel, I’d be adding an option to start dropping events if the memory consumption goes above a configurable threshold (this might already be implemented, but I couldn’t find it)
  3. Would it be worth investigating a channel based on LinkedIn’s low latency transfer Databus tool?

  1. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. 

Original title and link: Apache Flume Performance Tuning (NoSQL database©myNoSQL)


Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop

Apache Bigtop:

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

Currently packaging:

  • Apache Hadoop 1.0.x
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. Practically though it’s goal is to ensure that all the components of the wide Hadoop ecosystem remain interoperable.

Original title and link: Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop (NoSQL database©myNoSQL)

Real-Time Analytics With Kafka and IronCount

Edward Capriolo suggesting an alternative approach to real-time analytics backed by solutions like Rainbird, Flume, Scribe, or Storm:

Distributed processing is RTA requirement #2 which is where IronCount comes in. It is great that we can throw tons of messages into Kafka, but we do not have a system to process these messages. We could pick say 4 servers on our network and write a program implementing a Kafka Consumer interface to process messages, write init scripts, write nagios check, manage it. How to stop it start it upgrade it? How should the code even be written? What if we need to run two programs, or five or ten?

IronCount gives an simple answer for this questions. It starts by abstracting users from many of the questions mentioned above. Users need to only implement a single interface.

In a way this post reminded me of Ted Dziuba’s Taco Bell Programming:

The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set. After all, functionality is an asset, but code is a liability.

Original title and link: Real-Time Analytics With Kafka and IronCount (NoSQL database©myNoSQL)


The components and their functions in the Hadoop ecosystem

Edd Dumbill enumerates the various components of the Hadoop ecosystem:

Hadoop ecosystem

My quick reference of the Hadoop ecosystem is including a couple of other tools that are not in this list, with the exception of Ambari and HCatalog which were released later.

Original title and link: The components and their functions in the Hadoop ecosystem (NoSQL database©myNoSQL)

Project Isotope Will Bring Together Hadoop Toolchain With Microsoft’s Data Products

There’s a series of events lately that makes me think Microsoft is nowhere near accepting defeat in the cloud services area. As regards Microsoft’s Project Isotop, things are much simpler than ZDNet article make them sound[1]: Microsoft is working on integrating Hadoop and its toolchain with their own products (SQL Server Analysis Services, PowerPivot).

Microsoft Project Isotop

A picture worth more than the 626 words.

  1. I bet the details of integration are fascinating and far from being simple, but the article is not focusing on those  

Original title and link: Project Isotope Will Bring Together Hadoop Toolchain With Microsoft’s Data Products (NoSQL database©myNoSQL)

Flume and Hadoop on OS X

Arbo v. Monkiewitsch:

For a kick-ass webscale big-data setup on your local mac you’ll want to have Hadoop and Flume place.

No seriously, this setup is especially useful, if you want to route your syslog output from your nodes to HDFS in order to process it later using Map/Reduce jobs.

My emphasis.

Original title and link: Flume and Hadoop on OS X (NoSQL database©myNoSQL)


Experimenting with Hadoop using Cloudera VirtualBox Demo

CDH Mac OS X VirtualBox VM

If you don’t count the download, you’ll get this up and running in 5 minutes tops. At the end you’ll have Hadoop, Sqoop, Pig, Hive, HBase, ZooKeeper, Oozie, Hume, Flume, and Whirr all configured and ready to experiment with.

Making it easy for users to experiment with these tools increases the chances for adoption. Adoption means business.

Original title and link: Experimenting with Hadoop using Cloudera VirtualBox Demo (NoSQL databases © myNoSQL)


Cloudera’s Distribution for Apache Hadoop version 3 Beta 4

New version of Cloudera’s Hadoop distro — complete release notes available here:

CDH3 Beta 4 also includes new versions of many components. Highlights include:

  • HBase 0.90.1, including much improved stability and operability.
  • Hive 0.7.0rc0, including the beginnings of authorization support, support for multiple databases, and many other new features.
  • Pig 0.8.0, including many new features like scalar types, custom partitioners, and improved UDF language support.
  • Flume 0.9.3, including support for Windows and improved monitoring capabilities.
  • Sqoop 1.2, including improvements to usability and Oracle integration.
  • Whirr 0.3, including support for starting HBase clusters on popular cloud platforms.

Plus many scalability improvements contributed by Yahoo!.

Cloudera’s CDH is the most popular Hadoop distro bringing together many components of the Hadoop ecosystem. Yahoo remains the main innovator behind Hadoop.

Original title and link: Cloudera’s Distribution for Apache Hadoop version 3 Beta 4 (NoSQL databases © myNoSQL)


Why the Cloudera - Membase partnership?

For those scenarios that require both scalable low latency data access and batch analytics to complete the application’s mission. This kind of hybrid, bidirectional data integration is the topological requirement of new applications – AOL Advertising and ShareThis are joint customers with these requirements. A Flume interface provides a streaming interface from Membase to Hadoop;  a Sqoop utility can be used for batch transfers between the two. Both of these utilities will be familiar to Hadoop watchers.

Basically OLTP (Membase) and OLAP (Cloudera/Hadoop). And I told you everybody Flumes.

Original title and link: Why the Cloudera - Membase partnership? (NoSQL databases © myNoSQL)


Everybody Flumes

Everyone is integrating with ☞ Flume, the distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data.

MongoDB: mongo-hadoop


Still experimental, but it has a Flume sync for storing log data into MongoDB.

Cassandra: flume-cassandra-plugin


A new Cassandra plugin for Flume is designed to require very little configuration and produce good results out of the box. The plugin offers two sinks: a simple sink that indexes entries by date, and a sink designed to take syslog events and store for use with the Logsandra search engine, a tool for searching and analyzing logs stored in Cassandra.


Voldemort: flume-voldemort-plugin


Voldemort sink for Flume is supposed to store events which are received from different Flume sources into Voldemort. It looks like this is pretty much straightforward task initially. But there were several issues required more consideration.


Dunith Dhanushka’s article has also a section on design consideration for the Voldemort sink.

What other Flume integrations are out there?

Udpate: Since posting this I’ve heard about more Flume integrations coming from ElasticSearch, HBase, Akka, Membase, Hive, AMQP. I’ll update the post when I find more details.

Original title and link: Everybody Flumes (NoSQL databases © myNoSQL)

Quick Reference: Hadoop Tools Ecosystem

Just a quick reference of the continuously growing Hadoop tools ecosystem.


The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.


A distributed file system that provides high throughput access to application data.


A software framework for distributed processing of large data sets on compute clusters.

Amazon Elastic MapReduce

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

Cloudera Distribution for Hadoop (CDH)

Cloudera’s Distribution for Hadoop (CDH) sets a new standard for Hadoop-based data management platforms.


A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.


A scalable, distributed database that supports structured data storage for large tables.


A data serialization system. Similar to ☞ Thrift and ☞ Protocol Buffers.


Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:

  • Imports individual tables or entire databases to files in HDFS
  • Generates Java classes to allow you to interact with your imported data
  • Provides the ability to import from SQL databases straight into your Hive data warehouse


Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.


Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called Hive QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language.


A high-level data-flow language and execution framework for parallel computation. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.


Oozie is a workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce).


Cascading is a Query API and Query Planner used for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.


Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.


Hue is a graphical user interface to operate and develop applications for Hadoop. Hue applications are collected into a desktop-style environment and delivered as a Web application, requiring no additional installation for individual users.

You can read more about HUE on ☞ Cloudera blog.


Chukwa is a data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.


A Scalable machine learning and data mining library.

Integration with Relational databases

Integration with Data Warehouses

The only list I have is MapReduce, RDBMS, and Data Warehouse, but I’m afraid it is quite a bit old. So maybe someone could help me update it.

Anything else? Once we validate this list, I’ll probably have to move it on the NoSQL reference

Original title and link: Quick Reference: Hadoop Tools Ecosystem (NoSQL databases © myNoSQL)

Search Analytics with Flume and HBase

In the last week, I’ve seen 3 articles or presentations on using Hadoop-based searches:

and then embedded below sematext’s Search Analytics with Flume and HBase.

Meanwhile, Google went Caffeine to deal with more timely index updates.

Original title and link: Search Analytics with Flume and HBase (NoSQL databases © myNoSQL)