


kafka: All content tagged as kafka in NoSQL databases and polyglot persistence

Apache Kafka: Next generation distributed messaging system

Abhishek Sharma in a 3,000-word article on InfoQ:

Its architecture consists of the following components:

  • A stream of messages of a particular type is defined as a topic. A Message is defined as a payload of bytes and a Topic is a category or feed name to which messages are published.
  • A Producer can be anyone who can publish messages to a Topic.
  • The published messages are then stored at a set of servers called Brokers or Kafka Cluster.
  • A Consumer can subscribe to one or more Topics and consume the published Messages by pulling data from the Brokers.

Producers can choose their preferred serialization method to encode the message content. For efficiency, a producer can send a set of messages in a single publish request. The following code example shows how to create a Producer and send messages.
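The roles described above can be sketched with a minimal in-memory model (plain Python, not the real Kafka client API; the class names and JSON serialization are illustrative assumptions): a producer serializes records and publishes them as one batch to a broker, and a consumer pulls from the broker at an offset it tracks itself.

```python
# Minimal in-memory sketch of the Kafka roles described above:
# a Broker stores messages per Topic, a Producer publishes batches,
# and a Consumer pulls from an offset it tracks itself.
# Illustrative only -- a real client would use the Kafka API.
import json


class Broker:
    """Stores published messages in an append-only list per topic."""
    def __init__(self):
        self.topics = {}  # topic name -> list of message payloads (bytes)

    def append(self, topic, messages):
        self.topics.setdefault(topic, []).extend(messages)

    def fetch(self, topic, offset, max_messages=100):
        return self.topics.get(topic, [])[offset:offset + max_messages]


class Producer:
    """Serializes messages (here: JSON) and publishes them in one batch."""
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, records):
        payloads = [json.dumps(r).encode("utf-8") for r in records]
        self.broker.append(topic, payloads)  # one publish request for the set


class Consumer:
    """Pulls messages from the broker, tracking its own offset."""
    def __init__(self, broker, topic):
        self.broker, self.topic, self.offset = broker, topic, 0

    def poll(self):
        batch = self.broker.fetch(self.topic, self.offset)
        self.offset += len(batch)
        return [json.loads(m) for m in batch]


broker = Broker()
Producer(broker).send("clicks", [{"user": "a"}, {"user": "b"}])
consumer = Consumer(broker, "clicks")
print(consumer.poll())  # -> [{'user': 'a'}, {'user': 'b'}]
print(consumer.poll())  # -> [] (offset is already past the end of the log)
```

Note how the pull model puts the consumer in control of its own pace and position, which is the detail the later posts on this page keep coming back to.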

Kafka is an amazing system. I just wish the article had actually looked into what makes it unique and how it compares to systems like RabbitMQ or ActiveMQ.

✚ Cameron Purdy in one of the comments:

If you carefully read the article, you’ll note that Kafka is not actually a message queue. It’s just a specialized database with some messaging semantics in its API. That means if you need the behaviors that you would associate with a message queue, you can’t get them with Kafka (or if you can, the performance will plummet.)

Original title and link: Apache Kafka: Next generation distributed messaging system (NoSQL databases © myNoSQL)


Project Secor: Long-term S3 storage for Kafka logs

A new project open sourced by Pinterest, Secor:

Project Secor was born from the need to persist messages logged to Kafka to S3 for long-term storage. Data lost or corrupted at this stage isn’t recoverable, so the greatest design objective for Secor is data integrity.

Original title and link: Project Secor: Long-term S3 storage for Kafka logs (NoSQL databases © myNoSQL)


Grape - a realtime processing pipeline

From the project page:

The main goal is data availability and data persistence. We created Grape for those who cannot afford to lose data.

Instead of following Storm’s approach, we dramatically changed Grape’s logic.

Unlike Kafka, we cannot lose your data if a ‘data-file’ has not been read for a long time or its size overflows under constant write load.

Original title and link: Grape - a realtime processing pipeline (NoSQL databases © myNoSQL)


Takeaways From the Kafka Talk at AirBnB: The Power of Structured Data and the Myth of “Exactly Once”

Sadayuki Furuhashi distils the lessons learned from Jay Kreps’s talk about Kafka:

The holy grail of messaging systems is “exactly once”, meaning that every message is always delivered (“at least once”) and never duplicated (“at most once”). And just like any other “holy grail”, it’s pretty unrealistic without major drawbacks.
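The common compromise is at-least-once delivery paired with idempotent consumption: the consumer remembers the IDs of messages it has already processed and drops redelivered duplicates, yielding "effectively once" processing. A minimal sketch (the `(msg_id, payload)` message shape is an assumption for illustration):

```python
# At-least-once delivery turned into "effectively once" processing
# by deduplicating on a message ID the consumer remembers.
# The (msg_id, payload) tuple format is an assumption for illustration.

def process_at_least_once(deliveries, seen_ids, results):
    """Apply each delivery once, skipping IDs processed before."""
    for msg_id, payload in deliveries:
        if msg_id in seen_ids:   # duplicate redelivery: drop it
            continue
        seen_ids.add(msg_id)
        results.append(payload)  # apply the message exactly once


seen, out = set(), []
# The broker redelivers message 2 because an acknowledgement was lost:
process_at_least_once([(1, "a"), (2, "b"), (2, "b"), (3, "c")], seen, out)
print(out)  # -> ['a', 'b', 'c']  (each message applied once)
```

The drawback hinted at in the quote shows up here too: the `seen_ids` set must itself be stored durably and atomically with the results, which is exactly where the complexity hides.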

Original title and link: Takeaways From the Kafka Talk at AirBnB: The Power of Structured Data and the Myth of “Exactly Once” (NoSQL databases © myNoSQL)


A Big Data Trifecta: Storm, Kafka and Cassandra

Brian O’Neill details his first experiments migrating from JMS to Kafka in a very interesting architecture involving Storm, Kafka, and Cassandra:

Now, Kafka is fast. When running the Kafka Spout by itself, I easily reproduced Kafka’s claim that you can consume “hundreds of thousands of messages per second”. When I first fired up the topology, things went well for the first minute, but then it quickly crashed as the Kafka Spout emitted too fast for the Cassandra Bolt to keep up. Even though Cassandra is fast as well, it is still orders of magnitude slower than Kafka.
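The crash Brian describes is the classic fast-producer/slow-consumer problem. One standard remedy (not what the post itself used; Storm has its own throttling mechanisms) is a bounded buffer that blocks the producer whenever the consumer falls behind. A stdlib sketch of that backpressure idea:

```python
# A bounded queue makes a fast producer block instead of overwhelming
# a slow consumer -- simple backpressure, sketched with stdlib threads.
import queue
import threading

buffer = queue.Queue(maxsize=10)  # small bound forces backpressure
consumed = []

def producer():
    for i in range(100):
        buffer.put(i)      # blocks while the queue is full
    buffer.put(None)       # sentinel: no more messages

def consumer():
    while True:
        item = buffer.get()
        if item is None:
            break
        consumed.append(item)  # stand-in for a slower Cassandra write

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(consumed))  # -> 100: nothing dropped despite the speed mismatch
```

The trade-off is that backpressure slows the whole pipeline to the speed of its slowest stage, which is why the post’s topology needed tuning rather than just a bigger buffer.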

Original title and link: A Big Data Trifecta: Storm, Kafka and Cassandra (NoSQL databases © myNoSQL)


Kafka: Major Design Elements

A very interesting read for the weekend, the rationale and major design elements behind Kafka, LinkedIn’s open source messaging system:

There is a small number of major design decisions that make Kafka different from most other messaging systems:

  1. Kafka is designed for persistent messages as the common case.
  2. Throughput, rather than features, is the primary design constraint.
  3. State about what has been consumed is maintained as part of the consumer, not the server.
  4. Kafka is explicitly distributed: it is assumed that producers, brokers, and consumers are all spread over multiple machines.
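Design point 3 is worth a sketch: the broker is nothing more than an append-only log with no per-consumer bookkeeping, and each consumer carries its own read position (a toy model, not Kafka's actual protocol):

```python
# Design point 3 sketched: the broker keeps only the log itself,
# no per-consumer bookkeeping; every consumer owns its read position.

log = []  # the broker's entire state: just the messages

def publish(msg):
    log.append(msg)

class LogConsumer:
    def __init__(self):
        self.position = 0  # "what has been consumed" lives on the consumer

    def consume(self, n=10):
        batch = log[self.position:self.position + n]
        self.position += len(batch)
        return batch

for m in ("m1", "m2", "m3"):
    publish(m)

fast, slow = LogConsumer(), LogConsumer()
print(fast.consume())     # -> ['m1', 'm2', 'm3']
print(slow.consume(n=1))  # -> ['m1']  (same log, independent position)
```

Keeping the position on the consumer is what lets the broker stay simple and fast: it never tracks acknowledgements, and a lagging consumer costs it nothing.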

Original title and link: Kafka: Major Design Elements (NoSQL databases © myNoSQL)


Real-Time Analytics With Kafka and IronCount

Edward Capriolo suggests an alternative to real-time analytics approaches backed by solutions like Rainbird, Flume, Scribe, or Storm:

Distributed processing is RTA requirement #2, which is where IronCount comes in. It is great that we can throw tons of messages into Kafka, but we do not have a system to process these messages. We could pick, say, 4 servers on our network and write a program implementing a Kafka Consumer interface to process messages, write init scripts, write Nagios checks, and manage it all. How do you stop it, start it, upgrade it? How should the code even be written? What if we need to run two programs, or five, or ten?

IronCount gives a simple answer to these questions. It starts by abstracting users from many of the questions mentioned above. Users only need to implement a single interface.
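The single-interface idea can be sketched generically (this is not IronCount's actual API, only the shape of the abstraction): the framework owns the consume loop, distribution, and lifecycle, while the user supplies one handler method.

```python
# Sketch of the "implement one interface" idea (NOT IronCount's real API):
# the framework handles consuming and distribution; users write handle().
from abc import ABC, abstractmethod


class MessageHandler(ABC):
    @abstractmethod
    def handle(self, message: str) -> None:
        """The only method a user has to implement."""


class WordCounter(MessageHandler):
    """Example user code: count words across all messages."""
    def __init__(self):
        self.counts = {}

    def handle(self, message):
        for word in message.split():
            self.counts[word] = self.counts.get(word, 0) + 1


def run_framework(handler, messages):
    """Stand-in for the framework's consume loop over a Kafka topic."""
    for m in messages:
        handler.handle(m)


counter = WordCounter()
run_framework(counter, ["kafka is fast", "kafka scales"])
print(counter.counts)  # -> {'kafka': 2, 'is': 1, 'fast': 1, 'scales': 1}
```

Everything operational (start, stop, upgrade, spreading handlers over servers) stays on the framework's side of the interface, which is exactly the point Edward makes.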

In a way this post reminded me of Ted Dziuba’s Taco Bell Programming:

The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set. After all, functionality is an asset, but code is a liability.

Original title and link: Real-Time Analytics With Kafka and IronCount (NoSQL databases © myNoSQL)


DataSift PubSub: From Redis to Kafka and 0mq

DataSift moves from Redis PubSub to Kafka and 0mq:

Kafka is still a young project, but it’s maturing fast, and we’re confident enough to use it in production (as a matter of fact, we’ve been using it for months now) in front of our HBase cluster and to collect monitoring events sent from all our internal services. We chose Kafka especially for its persistent storage (which is essentially a partitioned binary log), but we plan to do some analytics via its support for Hadoop soon. And its distributed nature (coordination between consumers and brokers is done via ZooKeeper) makes it very appealing too.

Original title and link: DataSift PubSub: From Redis to Kafka and 0mq (NoSQL databases © myNoSQL)


Recipe for a Distributed Realtime Tweet Search System



  1. Place Voldemort, Kafka, and Sensei on a couple of servers.
  2. Arrange them with taste:

    [Figure: Chirp Architecture]

  3. Spray a large quantity of tweets on the system.

Preparation time:

24 hours


For more servings, add the appropriate number of servers.


Chirper on GitHub


  • One design choice was letting the process that writes to Voldemort also be a Kafka consumer. Although this would be cleaner, we would risk a data race where search may return hits before they have been added to Voldemort. By making sure a tweet is first added to Voldemort, we can rely on it being the authoritative storage for our tweets.
  • You may have already realized that Kafka is acting as a proxy for the Twitter stream, and we could have streamed tweets directly into the search system, bypassing the Kafka layer. What we would be missing is the ability to play back tweet events from a specific checkpoint. One really nice feature of Kafka is that you can keep a consumption point and have data replayed. This makes reindexing possible for cases such as data corruption, schema changes, etc. Furthermore, to scale search, we would have a growing number of search nodes consume from the same Kafka stream. The fact that Kafka is written in a way where adding consumers does not affect the throughput of the system really helps in scaling the entire system.
  • Another important design decision was using Voldemort for storage. One alternative would be to store tweets in the search index instead, e.g. as Lucene stored fields. The benefits of this approach would be stronger consistency between search and store, and the stored data would follow the retention policy defined by the search system. However, other than the fact that Lucene stored fields are nowhere near as efficient as a Voldemort cluster (an implementation issue), there are more convincing reasons:
    • First, the consistency benefit of having search and store together is negligible. If we follow our assumption of tweets being append-only and always write to Voldemort first, we really wouldn’t have consistency issues. Yet having data storage reside on the same search system would introduce disproportionate contention for I/O bandwidth and OS cache; as data volume increases, search performance can be negatively impacted.
    • The point about retention is rather valid. While the search index guarantees that older tweets expire, the Voldemort store would continue to grow. Our decision ultimately came down to two points: 1) Voldemort’s growth profile is very different, e.g. adding new records into the system is much cheaper, so it is feasible to have a much longer data retention policy. 2) Having a cluster of tweet storage allows us to integrate with other systems if desired, for analytics, display, etc.
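The replay-for-reindexing point deserves a sketch: because the log is retained and the consumption point belongs to the consumer, rebuilding a corrupted or re-schemed index is just rewinding that point and consuming again (a toy model, not the Kafka consumer API):

```python
# Replay sketch: a retained log plus a consumer-owned checkpoint means
# a derived index can be rebuilt by rewinding the checkpoint to zero.

log = ["tweet-1", "tweet-2", "tweet-3"]  # Kafka retains the raw events
checkpoint = 0
index = set()

def consume_and_index():
    """Consume from the checkpoint onward, building the search index."""
    global checkpoint
    while checkpoint < len(log):
        index.add(log[checkpoint])
        checkpoint += 1

consume_and_index()
index.discard("tweet-2")   # simulate index corruption or a schema change

checkpoint = 0             # rewind the consumption point...
index.clear()
consume_and_index()        # ...and replay the log to rebuild the index
print(sorted(index))       # -> ['tweet-1', 'tweet-2', 'tweet-3']
```

The index is disposable precisely because the log is the source of truth; the same property is what lets new search nodes bootstrap from the stream.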

Original title and link: Recipe for a Distributed Realtime Tweet Search System (NoSQL databases © myNoSQL)


Kafka: LinkedIn's Distributed Publish/Subscribe Messaging System

Another open source project from LinkedIn:

Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:

  • Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
  • High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
  • Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
  • Support for parallel data load into Hadoop.
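The partitioning bullet can be sketched: each message is routed to a partition by its key, trading global ordering for per-partition ordering so that consumption can be spread over a cluster of machines (a toy model; key-hash routing is one common partitioning strategy, and the partition count here is arbitrary):

```python
# Partitioning sketch: route each message to a partition by key hash.
# Ordering is preserved within a partition, not across partitions,
# which lets different machines consume different partitions in parallel.
NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

def publish(key: str, message: str):
    p = hash(key) % NUM_PARTITIONS  # same key -> same partition, always
    partitions[p].append(message)

for i in range(5):
    publish("user-42", f"event-{i}")  # one key, so one partition

target = hash("user-42") % NUM_PARTITIONS
print(partitions[target])
# -> ['event-0', 'event-1', 'event-2', 'event-3', 'event-4'], in publish order
```

Because all of one key's messages land in one partition, a single consumer sees them in order even while other partitions are consumed elsewhere, which is the "per-partition ordering semantics" the bullet refers to.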

LinkedIn has open sourced a couple of exciting projects, but so far they haven’t been able to get enough attention and grow communities around them.

Original title and link: Kafka: LinkedIn’s Distributed Publish/Subscribe Messaging System (NoSQL databases © myNoSQL)