Storm: All content tagged as Storm in NoSQL databases and polyglot persistence
Wednesday, 24 April 2013
Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo!
Andy Feng wrote a post on the YDN blog about the data processing architecture Yahoo! uses to deliver personalized content. It analyzes billions of events for 700 million users and 2.2 billion content pieces every day, using a combination of batch processing (Hadoop) and stream processing (Storm):
Enabling low-latency big-data processing is one of the primary design goals of Yahoo!’s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of the design patterns applicable to many Yahoo! use cases. In Q1 2013, we added Storm as a new service to our big-data platform. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for stream/micro-batch processing.
✚ I don’t think I’ve seen the term micro-batch processing used before. Any ideas why it is used as an alternative to the well-established term stream processing?
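One possible reading of the term, offered as a guess rather than as Yahoo!'s definition: instead of handling each event on its own, the stream is cut into small groups and each group is processed as a unit. A minimal Python sketch of that idea follows; the helper names `micro_batches` and `count_clicks_per_user` and the sample events are hypothetical and are not taken from the post or from Storm.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Cut an event stream into small fixed-size batches (hypothetical helper)."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_clicks_per_user(batch: List[dict]) -> dict:
    """Process one micro-batch as a unit instead of one event at a time."""
    counts: dict = {}
    for event in batch:
        counts[event["user"]] = counts.get(event["user"], 0) + 1
    return counts


# Made-up sample stream of click events.
events = [{"user": u} for u in ["a", "b", "a", "c", "a", "b", "c"]]
results = [count_clicks_per_user(b) for b in micro_batches(events, batch_size=3)]
```

Each entry in `results` is the outcome for one small batch. Latency stays low because batches are tiny, while per-batch work (for example, one database write per batch) can be amortized.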
Original title and link: Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo! (©myNoSQL)
Tuesday, 8 January 2013
Distributed Stream Processing Showdown: S4 vs Storm
Fantastic post by Gianmarco De Francisci Morales describing the similarities and major differences between the two Apache-licensed, JVM-based stream processing platforms S4 and Storm.
There are many other differences, but for sake of brevity I just present a short summary of the pros of each platform that the other one lacks.
S4 pros
- Clean programming model.
- State recovery.
- Inter-app communication.
- Classpath isolation.
- Tools for packaging and deployment.
- Apache incubation.
Storm pros
- Pull model.
- Guaranteed processing.
- More mature, more traction, larger community.
- High performance.
- Thread programming support.
- Advanced features (transactional topologies, Trident).
via: http://gdfm.me/2013/01/02/distributed-stream-processing-showdown-s4-vs-storm/
Monday, 6 August 2012
A Big Data Trifecta: Storm, Kafka and Cassandra
Brian O’Neill details his first experiments in migrating from JMS to Kafka, in a very interesting architecture involving Storm, Kafka, and Cassandra:
Now, Kafka is fast. When running the Kafka Spout by itself, I easily reproduced Kafka’s claim that you can consume “hundreds of thousands of messages per second”. When I first fired up the topology, things went well for the first minute, but then quickly crashed as the Kafka spout emitted too fast for the Cassandra Bolt to keep up. Even though Cassandra is fast as well, it is still orders of magnitude slower than Kafka.
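The failure described here is a fast producer overrunning a slow consumer. One common remedy is to cap the number of messages that are in flight (emitted but not yet acknowledged); Storm exposes a `topology.max.spout.pending` setting for this, though whether the post used it is not stated in the excerpt. The following is a plain-Python simulation of the idea under made-up rates, not Storm code; `run_topology` and all of its parameters are hypothetical.

```python
from collections import deque


def run_topology(messages, max_pending, bolt_rate, spout_rate):
    """Simulate a fast spout feeding a slow bolt, one time step at a time.

    Each step the spout emits up to `spout_rate` messages, but only while
    fewer than `max_pending` are un-acked; the bolt then acks up to
    `bolt_rate` messages. Returns the largest in-flight count observed.
    """
    source = deque(messages)
    in_flight = deque()
    peak = 0
    while source or in_flight:
        emitted = 0
        while source and emitted < spout_rate and len(in_flight) < max_pending:
            in_flight.append(source.popleft())
            emitted += 1
        peak = max(peak, len(in_flight))
        for _ in range(min(bolt_rate, len(in_flight))):
            in_flight.popleft()  # the bolt finishes (acks) one message
    return peak


# Made-up rates: the spout is 100x faster than the bolt.
unbounded_peak = run_topology(range(1000), max_pending=10**9, bolt_rate=1, spout_rate=100)
bounded_peak = run_topology(range(1000), max_pending=50, bolt_rate=1, spout_rate=100)
```

Without a cap, nearly the whole input piles up between spout and bolt; with a cap of 50, the backlog never exceeds 50 and the spout simply waits for acknowledgements.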
via: http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
Wednesday, 6 June 2012
Storm 0.8.0: The Most Significant Release of Storm Yet
In Nathan Marz’s words, Storm 0.8.0 is “a major step forward in the evolution of the project”:
- Executors: Storm 0.8.0 has a new model where a worker is a process and an executor is a thread.
- Pluggable scheduler
- Throughput improvements
- Decreased ZooKeeper load/increased Storm UI performance
- Abstractions for shared resources
- Tick tuples
and a lot more changes and improvements detailed in the announcement.
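To make the tick tuples item concrete: a tick tuple is a system-generated tuple delivered to a bolt at a regular interval, so the bolt can do periodic work such as flushing a window. The sketch below imitates that behaviour in plain Python rather than with the Storm API; the `TICK` marker and the `RollingCountBolt` class are hypothetical stand-ins, and ticks are placed by hand in the input instead of being driven by a timer.

```python
TICK = object()  # stand-in for Storm's system tick tuple (hypothetical marker)


class RollingCountBolt:
    """Counts words and flushes the totals whenever a tick arrives."""

    def __init__(self):
        self.counts = {}
        self.emitted = []

    def execute(self, tup):
        if tup is TICK:
            # Periodic work: emit a snapshot, then start a fresh window.
            self.emitted.append(dict(self.counts))
            self.counts.clear()
        else:
            self.counts[tup] = self.counts.get(tup, 0) + 1


bolt = RollingCountBolt()
for tup in ["storm", "kafka", "storm", TICK, "kafka", TICK]:
    bolt.execute(tup)
```

After the loop, `bolt.emitted` holds one snapshot per tick. The bolt needs no timer thread of its own, because the periodic signal arrives through the same `execute` path as ordinary data.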
Friday, 16 March 2012
Real-Time Analytics With Storm and Esper
Thomas Dudziak:
At work, we recently started using Esper for realtime analytics, and so far we quite like Esper. It is a great tool at what it does – running queries continuously over data. The problem however then becomes how to get data into Esper. The recently released Storm could be one way to do that, so I got curious and started playing around with it to see if it could be made to work with Esper. And it turns out, the integration is straightforward.
Wednesday, 7 March 2012
Real-Time Analytics With Kafka and IronCount
Edward Capriolo suggests an alternative approach to real-time analytics backed by solutions like Rainbird, Flume, Scribe, or Storm:
Distributed processing is RTA requirement #2 which is where IronCount comes in. It is great that we can throw tons of messages into Kafka, but we do not have a system to process these messages. We could pick say 4 servers on our network and write a program implementing a Kafka Consumer interface to process messages, write init scripts, write nagios check, manage it. How to stop it start it upgrade it? How should the code even be written? What if we need to run two programs, or five or ten?
IronCount gives a simple answer to these questions. It starts by abstracting users from many of the questions mentioned above. Users need to only implement a single interface.
In a way this post reminded me of Ted Dziuba’s Taco Bell Programming:
The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set. After all, functionality is an asset, but code is a liability.
via: http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/what_is_the_deal_with
Thursday, 4 August 2011
Twitter Open Sourcing Storm at Strange Loop
Ask and you’ll be answered. Nathan Marz announces that Twitter will open source Storm, the Hadoop-like real-time data processing tool developed at BackType:
I’m pleased to announce that I will be releasing Storm at Strange Loop on September 19th!
Here’s a recap of the three broad use cases for Storm:
- Stream processing: Storm can be used to process a stream of new data and update databases in realtime. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
- Continuous computation: Storm can do a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers. The browsers will have a realtime view on what the trending topics are as they happen.
- Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.
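The Distributed RPC use case above boils down to fanning one invocation out to many workers and combining their partial answers. The sketch below shows that shape in plain Python with a thread pool standing in for a cluster; it is not Storm's DRPC API, and `PARTITIONS`, `count_matches`, and `drpc_query` are hypothetical names with made-up data.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each partition holds part of one large set of ids.
PARTITIONS = [
    {1, 2, 3, 4},
    {5, 6, 7},
    {8, 9, 10, 11, 12},
]


def count_matches(partition, wanted):
    """Partial result: how many of the wanted ids live in this partition."""
    return len(partition & wanted)


def drpc_query(wanted):
    """Fan the invocation out to every partition, then sum the partial counts."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        partials = pool.map(count_matches, PARTITIONS, [wanted] * len(PARTITIONS))
        return sum(partials)


# One invocation: how many of these ids exist anywhere in the data?
total = drpc_query({2, 3, 7, 12, 99})
```

The caller sees an ordinary function call, while the set intersection itself runs once per partition in parallel. In a real topology the partitions would sit on different machines.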
via: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Tuesday, 5 July 2011
ElephantDB and Storm Join the Twitter Flock
That’s to say BackType, creators of Cascalog, ElephantDB, and Storm, has been acquired by Twitter (which, in case you didn’t know, names most of its open source libraries and storage solutions using bird names).
The announcement is here. Looking forward to seeing Storm open sourced.