Kafka: All content tagged as Kafka in NoSQL databases and polyglot persistence
Monday, 3 September 2012
Takeaways From the Kafka Talk at AirBnB: The Power of Structured Data and the Myth of “Exactly Once”
Sadayuki Furuhashi distils the lessons learned from Jay Krep’s talk about Kafka:
The holy grail of messaging systems is “exactly once”, meaning that every message is always delivered (“at least once”) and never duplicated (“at most once”). And just like any other thing “holy grail”, it’s pretty unrealistic without major drawbacks.
Original title and link: Takeaways From the Kafka Talk at AirBnB: The Power of Structured Data and the Myth of “Exactly Once” (©myNoSQL)
via: http://java.dzone.com/articles/akeaways-kafka-talk-airbnb
Monday, 6 August 2012
A Big Data Trifecta: Storm, Kafka and Cassandra
Brain O’Neill details his first experiments of migrating from using JMS to Kafka in a very interesting architecture involving:
Now, Kafka is fast. When running the Kafka Spout by itself, I easily reproduced Kafka’s claim that you can consume “hundreds of thousands of messages per second”. When I first fired up the topology, things went well for the first minute, but then quickly crashed as the Kafka spout emitted too fast for the Cassandra Bolt to keep up. Even though Cassandra is fast as well, it is still orders of magnitude slower than Kafka.
Original title and link: A Big Data Trifecta: Storm, Kafka and Cassandra (©myNoSQL)
via: http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
Saturday, 9 June 2012
Kafka: Major Design Elements
A very interesting read for the weekend, the rationale and major design elements behind Kafka, LinkedIn’s open source messaging system:
There is a small number of major design decisions that make Kafka different from most other messaging systems:
- Kafka is designed for persistent messages as the common case
- Throughput rather than features are the primary design constraint
- State about what has been consumed is maintained as part of the consumer not the server
- Kafka is explicitly distributed. It is assumed that producers, brokers, and consumers are all spread over multiple machines.
Original title and link: Kafka: Major Design Elements (©myNoSQL)
Wednesday, 7 March 2012
Real-Time Analytics With Kafka and IronCount
Edward Capriolo suggesting an alternative approach to real-time analytics backed by solutions like Rainbird, Flume, Scribe, or Storm:
Distributed processing is RTA requirement #2 which is where IronCount comes in. It is great that we can throw tons of messages into Kafka, but we do not have a system to process these messages. We could pick say 4 servers on our network and write a program implementing a Kafka Consumer interface to process messages, write init scripts, write nagios check, manage it. How to stop it start it upgrade it? How should the code even be written? What if we need to run two programs, or five or ten?
IronCount gives an simple answer for this questions. It starts by abstracting users from many of the questions mentioned above. Users need to only implement a single interface.
In a way this post reminded me of Ted Dziuba’s Taco Bell Programming:
The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set. After all, functionality is an asset, but code is a liability.
Original title and link: Real-Time Analytics With Kafka and IronCount (©myNoSQL)
via: http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/what_is_the_deal_with
Wednesday, 13 July 2011
DataSift PubSub: From Redis to Kafka and 0mq
DataSift moves from Redis PubSub to Kafka and 0mq :
Kafka is still a young project, but it’s maturing fast, and we’re confident enough to use it in production (as a matter of fact, we’ve been using it for months now) in front of our HBase cluster and to collect monitoring events sent from all our internal services. We chose Kafka especially for its persistent storage (which is essentially a partitioned binary log), but we plan to do some analytics via its support for Hadoop soon. And its distributed nature (coordination beetween consumers and brokers is done via Zookeeper) makes it very appealing too.
Original title and link: DataSift PubSub: From Redis to Kafka and 0mq (©myNoSQL)
Thursday, 24 February 2011
Recipe for a Distributed Realtime Tweet Search System
Ingredients:
Method:
- Place Voldemort, Kafka, and Sensei on a couple of servers.
-
Arrange them with taste:

-
Spray a large quantity of tweets on the system
Preparation time:
24 hours
Notes:
For more servings, add the appropriate number of servers.
Result:
Reviews
- One design choice was letting the process that writes to Voldemort also be a Kafka consumer. Although this would be cleaner, we would risk a data-race where search may return hit array before they are yet added to Voldemort. By making sure it is first added to Voldemort, we can rely on it being an authoritative storage for our tweets.
- You may have already realized Kafka is acting as a proxy for twitter stream, and we could have also streamed tweets directly into the search systems, bypassing the Kafka layer. What we would be missing is the ability to play back tweet events from a specific check-point. One really nice feature about Kafka is that you can keep a consumption point to have data replayed. This makes reindexing for cases such as data corruption and schema changes, etc., possible. Furthermore, to scale search, we would have a growing number of search nodes consume from the same Kafka stream. Kafka is written in a way where adding consumers does not affect through-put of the system really helps in scaling the entire system.
- Another important design decision was on using Voldemort for storage. One solution would be instead store tweets in the search index, e.g. Lucene stored fields. The benefits with this approach would be stronger consistency between search and store, and also the stored data would follow the retention policy of that’s defined by the search system. However, other than the fact that Lucene stored field is no-where near as optimal comparing to a Voldemort cluster (an implementation issue), there are more convincing reasons:
- We can first see the consistency benefit for having search and store be together is negligible. Actually, if we follow our assumption of tweets being append-only and we always write to Voldemort first, we really wouldn’t have consistency issues. Yet, having data storage reside on the same search system would disproportionally introduce contention for IO bandwidth and OS cache, as data volume increases, search performance can be negatively impacted.
- The point about retention is rather valid. As search index guarantees older tweets to be expired, Voldemort store would continue to grow. Our decision ultimately came down to two points: 1) Voldemort’s growth factor is very different, e.g. adding new records into the system is much cheaper, so it is feasible to have a much longer data retention policy. 2) Having have cluster of tweet storage allows us to integrate with other systems if desired for analytics, display etc.
Original title and link: Recipe for a Distributed Realtime Tweet Search System (NoSQL databases © myNoSQL)
Wednesday, 29 December 2010
Kafka: LinkedIn's Distributed Publish/Subscribe Messaging System
Another open source project from LinkedIn:
Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:
- Persistent messaging with
O(1)disk structures that provide constant time performance even with many TB of stored messages.- High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
- Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
- Support for parallel data load into Hadoop.
LinkedIn has open sourced a couple of exciting projects, but they haven’t been able to get enough attention and grow so far a community around these.
Original title and link: Kafka: LinkedIn’s Distributed Publish/Subscribe Messaging System (NoSQL databases © myNoSQL)