Storm: All content tagged as Storm in NoSQL databases and polyglot persistence

Grape - a realtime processing pipeline

From the project page:

The main goals are data availability and data persistence. We created Grape for those who cannot afford to lose data.

Instead of retracing Storm's steps, we dramatically changed Grape's logic.

Unlike Kafka, we cannot lose your data if a data file was not read for a long time or its size overflows under constant write load.

Original title and link: Grape - a realtime processing pipeline (NoSQL database©myNoSQL)

via: http://reverbrain.com/grape/


Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo!

Andy Feng wrote a blog post on the YDN blog about the data processing architecture at Yahoo! for delivering personalized content by analyzing billions of events for 700 million users and 2.2 billion content pieces every day, using a combination of batch processing (Hadoop) and stream processing (Storm):

Enabling low-latency big-data processing is one of the primary design goals of Yahoo!’s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of design patterns applicable to many Yahoo! use cases. In Q1 2013, we added Storm as a new service to our big-data platform. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for stream/micro-batch processing.

✚ I don’t think I’ve seen the term micro-batch processing used before. Any ideas why it’s being used as an alternative to the well-established stream processing?
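For what the term likely means in practice: a micro-batch system groups an incoming stream into small fixed-size (or fixed-interval) batches and processes each batch as a unit, trading a little latency for throughput. A minimal illustrative sketch (not Yahoo!'s implementation):

```python
from typing import Iterable, Iterator, List

def micro_batches(events: Iterable, batch_size: int) -> Iterator[List]:
    """Group an unbounded event stream into small fixed-size batches.

    Pure stream processing hands each event to the consumer one at a
    time; micro-batching amortizes per-call overhead (network round
    trips, commits) across `batch_size` events.
    """
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch when the stream ends
        yield batch

# Per-event processing would make 6 calls here; micro-batching makes 2.
batches = list(micro_batches(range(6), batch_size=3))
# batches == [[0, 1, 2], [3, 4, 5]]
```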

Original title and link: Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing at Yahoo! (NoSQL database©myNoSQL)

via: http://developer.yahoo.com/blogs/ydn/storm-hadoop-convergence-big-data-low-latency-processing-54503.html


Distributed Stream Processing Showdown: S4 vs Storm

Fantastic post by Gianmarco De Francisci Morales describing the similarities and major differences between the two Apache-licensed, JVM-based stream processing platforms S4 and Storm.

There are many other differences, but for the sake of brevity I just present a short summary of the pros of each platform that the other one lacks.

S4 pros

  • Clean programming model.
  • State recovery.
  • Inter-app communication.
  • Classpath isolation.
  • Tools for packaging and deployment.
  • Apache incubation.

Storm pros

  • Pull model.
  • Guaranteed processing.
  • More mature, more traction, larger community.
  • High performance.
  • Thread programming support.
  • Advanced features (transactional topologies, Trident).
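Guaranteed processing is arguably Storm's signature feature in this list: a spout tracks every emitted tuple until the topology acknowledges it, and replays tuples that fail or time out. A toy stdlib sketch of the idea (not Storm's actual XOR-based acker algorithm):

```python
import itertools

class ReliableSpout:
    """Toy model of guaranteed processing: every emitted tuple is
    tracked as pending until acked; failed tuples are handed back
    for replay. Illustrative only, not Storm's implementation."""

    def __init__(self, source):
        self.source = list(source)
        self.pending = {}             # msg_id -> tuple, awaiting ack
        self.ids = itertools.count()

    def emit_all(self):
        for tup in self.source:
            msg_id = next(self.ids)
            self.pending[msg_id] = tup
            yield msg_id, tup

    def ack(self, msg_id):
        del self.pending[msg_id]      # fully processed; forget it

    def fail(self, msg_id):
        return self.pending[msg_id]   # tuple stays pending; replay it

spout = ReliableSpout(["a", "b"])
emitted = dict(spout.emit_all())
spout.ack(0)              # "a" processed successfully
replayed = spout.fail(1)  # "b" failed downstream -> replay
# replayed == "b"; it stays in spout.pending until acked
```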

Original title and link: Distributed Stream Processing Showdown: S4 vs Storm (NoSQL database©myNoSQL)

via: http://gdfm.me/2013/01/02/distributed-stream-processing-showdown-s4-vs-storm/


A Big Data Trifecta: Storm, Kafka and Cassandra

Brian O’Neill details his first experiment with migrating from JMS to Kafka in a very interesting architecture involving Storm, Kafka, and Cassandra:

Now, Kafka is fast. When running the Kafka Spout by itself, I easily reproduced Kafka’s claim that you can consume “hundreds of thousands of messages per second”. When I first fired up the topology, things went well for the first minute, but then quickly crashed as the Kafka spout emitted too fast for the Cassandra Bolt to keep up. Even though Cassandra is fast as well, it is still orders of magnitude slower than Kafka.

Original title and link: A Big Data Trifecta: Storm, Kafka and Cassandra (NoSQL database©myNoSQL)

via: http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html


Storm 0.8.0: The Most Significant Release of Storm Yet

In Nathan Marz’s words, Storm1 0.8.0 is “a major step forward in the evolution of the project”:

  1. Executors: Storm 0.8.0 has a new model where a worker is a process and an executor is a thread.
  2. Pluggable scheduler
  3. Throughput improvements
  4. Decreased ZooKeeper load/increased Storm UI performance
  5. Abstractions for shared resources
  6. Tick tuples

and a lot more changes and improvements detailed in the announcement.
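Tick tuples (item 6) deserve a word: a bolt can ask Storm to deliver a system-generated tuple at a fixed interval, so time-based logic like flushing aggregates needs no hand-rolled timer thread. A toy sketch of the pattern, with ticks injected synthetically instead of by the framework:

```python
# Illustrative sketch of "tick tuples": periodic markers interleaved
# with data let a bolt flush time-based aggregates. In real Storm the
# interval is configured per bolt (topology.tick.tuple.freq.secs).
TICK = object()

def with_ticks(events, every):
    """Yield a TICK after every `every` data events (a stand-in for
    the passage of wall-clock time)."""
    for i, event in enumerate(events, start=1):
        yield event
        if i % every == 0:
            yield TICK

class CountingBolt:
    def __init__(self):
        self.count = 0
        self.flushed = []

    def execute(self, tup):
        if tup is TICK:
            self.flushed.append(self.count)  # flush aggregate on tick
            self.count = 0
        else:
            self.count += 1

bolt = CountingBolt()
for tup in with_ticks(["a", "b", "c", "d", "e"], every=2):
    bolt.execute(tup)
# bolt.flushed == [2, 2]; one event ("e") is still buffered
```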


  1. Storm: Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC 

Original title and link: Storm 0.8.0: The Most Significant Release of Storm Yet (NoSQL database©myNoSQL)


Real-Time Analytics With Storm and Esper

Thomas Dudziak:

At work, we recently started using Esper1 for realtime analytics, and so far we quite like Esper. It is a great tool at what it does — running queries continuously over data. The problem however then becomes how to get data into Esper. The recently released Storm2 could be one way to do that, so I got curious and started playing around with it to see if it could be made to work with Esper. And it turns out, the integration is straightforward.

Dmitriy Ryaboy
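“Running queries continuously over data” means re-evaluating a standing query on every event arrival, typically over a sliding window — roughly what an Esper EPL statement like `select avg(price) from Trade.win:length(3)` expresses. A stdlib sketch of that idea (not Esper's engine):

```python
from collections import deque

class ContinuousAvg:
    """Toy continuous query: maintain the average over the last `n`
    events and re-evaluate on every arrival. Illustrative only."""

    def __init__(self, n):
        self.window = deque(maxlen=n)  # oldest event falls out

    def on_event(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

q = ContinuousAvg(n=3)
results = [q.on_event(v) for v in [10, 20, 30, 40]]
# results == [10.0, 15.0, 20.0, 30.0]  (window slid to 20, 30, 40)
```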


  1. Esper: complex event processing framework 

  2. Storm: distributed and fault-tolerant real-time computation: stream processing, continuous computation, distributed RPC. 

Original title and link: Real-Time Analytics With Storm and Esper (NoSQL database©myNoSQL)

via: http://tomdzk.wordpress.com/2011/09/28/storm-esper/


Real-Time Analytics With Kafka and IronCount

Edward Capriolo suggests an alternative approach to real-time analytics backed by solutions like Rainbird, Flume, Scribe, or Storm:

Distributed processing is RTA requirement #2, which is where IronCount comes in. It is great that we can throw tons of messages into Kafka, but we do not have a system to process these messages. We could pick, say, 4 servers on our network and write a program implementing a Kafka Consumer interface to process messages, write init scripts, write a Nagios check, manage it. How do we stop it, start it, upgrade it? How should the code even be written? What if we need to run two programs, or five, or ten?

IronCount gives a simple answer to these questions. It starts by abstracting users from many of the questions mentioned above. Users need only implement a single interface.
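The shape of that “single interface” idea, sketched with illustrative names (not IronCount's actual API): user code supplies only a message handler, while the framework owns the consumer loop, distribution, and lifecycle.

```python
class MessageHandler:
    """The single user-facing hook: handle one message."""
    def handle(self, message):
        raise NotImplementedError

class WordCountHandler(MessageHandler):
    def __init__(self):
        self.counts = {}

    def handle(self, message):
        for word in message.split():
            self.counts[word] = self.counts.get(word, 0) + 1

def run(handler, messages):
    """Stand-in for the framework loop that would normally pull
    messages from Kafka and dispatch them to the handler; starting,
    stopping, and monitoring that loop is the framework's problem."""
    for m in messages:
        handler.handle(m)

h = WordCountHandler()
run(h, ["storm storm kafka", "kafka"])
# h.counts == {"storm": 2, "kafka": 2}
```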

In a way this post reminded me of Ted Dziuba’s Taco Bell Programming:

The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set. After all, functionality is an asset, but code is a liability.

Original title and link: Real-Time Analytics With Kafka and IronCount (NoSQL database©myNoSQL)

via: http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/what_is_the_deal_with


Twitter Open Sourcing Storm at Strange Loop

Ask and you’ll be answered. Nathan Marz announces that Twitter will open source Storm, the Hadoop-like real-time data processing tool developed at BackType:

I’m pleased to announce that I will be releasing Storm at Strange Loop on September 19th!

Here’s a recap of the three broad use cases for Storm:

  • Stream processing: Storm can be used to process a stream of new data and update databases in realtime. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
  • Continuous computation: Storm can do a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers. The browsers will have a realtime view on what the trending topics are as they happen.
  • Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.
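The distributed RPC pattern described above — a pool of workers waiting on invocation messages, computing in parallel, and routing each result back by request id — can be sketched with stdlib threads and queues (illustrative only, not Storm's DRPC server):

```python
import queue
import threading

requests = queue.Queue()
results = {}
done = threading.Event()

def worker():
    # Wait for invocation messages; compute; file the result by id.
    while not done.is_set() or not requests.empty():
        try:
            req_id, arg = requests.get(timeout=0.1)
        except queue.Empty:
            continue
        results[req_id] = arg * arg   # the "intense query", in parallel
        requests.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for i in range(10):
    requests.put((i, i))              # invocation messages
requests.join()                       # block until all results are in
done.set()
for w in workers:
    w.join()
# results == {i: i * i for i in range(10)}
```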

Original title and link: Twitter Open Sourcing Storm at Strange Loop (NoSQL database©myNoSQL)

via: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html


ElephantDB and Storm Join the Twitter Flock

That’s to say BackType, creators of Cascalog, ElephantDB, and Storm, has been acquired by Twitter (which, in case you didn’t know, names most of its open source libraries and storage solutions after birds).

The announcement is here. Looking forward to seeing Storm open sourced.

Original title and link: ElephantDB and Storm Join the Twitter Flock (NoSQL database©myNoSQL)