Spark Streaming vs. Storm Streaming

Overview – Both Storm and Spark Streaming are open-source frameworks for distributed stream processing, but there are important differences between them.

Processing Model, Latency – Although both frameworks provide scalability and fault tolerance, they differ fundamentally in their processing model. Storm processes incoming events one at a time, whereas Spark Streaming batches up events that arrive within a short time window before processing them. Thus, Storm can achieve sub-second latency, while Spark Streaming has a latency of several seconds.

Fault Tolerance, Data Guarantees – The tradeoff for Storm's lower latency is in the data guarantees it offers under faults. Spark Streaming provides better support for fault-tolerant stateful computation. In Storm, each individual record has to be tracked as it moves through the system, so Storm only guarantees that each record will be processed at least once, and duplicates may appear during recovery from a fault. That means mutable state may be incorrectly updated twice. Spark Streaming, on the other hand, need only track processing at the batch level, so it can efficiently guarantee that each mini-batch will be processed exactly once, even if a fault such as a node failure occurs. [Storm’s Trident library also provides exactly-once processing, but it relies on transactions to update state, which is slower and often has to be implemented by the user.]
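To make the difference between the two guarantees concrete, here is a minimal pure-Python sketch. It is a toy simulation, not the internals of either Storm or Spark Streaming: the function names (`micro_batches`, `apply_at_least_once`, `apply_batches_exactly_once`), the event data, and the idea of remembering applied batch ids are all assumptions made for illustration. It shows how replaying a single record after a fault can double-count it in mutable state, while tracking work at the batch level lets a replayed batch be recognized and skipped.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, key) events into batches by arrival window."""
    batches = defaultdict(list)
    for ts, key in events:
        batches[int(ts // window_seconds)].append(key)
    return dict(sorted(batches.items()))


def apply_at_least_once(counts, deliveries):
    """Per-record updates with no deduplication: a replayed record counts twice."""
    for key in deliveries:
        counts[key] = counts.get(key, 0) + 1
    return counts


def apply_batches_exactly_once(counts, applied_batch_ids, batch_deliveries):
    """Batch-level tracking: a batch whose id was already applied is skipped."""
    for batch_id, keys in batch_deliveries:
        if batch_id in applied_batch_ids:
            continue  # replay of a batch that already updated the state
        for key in keys:
            counts[key] = counts.get(key, 0) + 1
        applied_batch_ids.add(batch_id)
    return counts


# Hypothetical events: (arrival time in seconds, key)
events = [(0.1, "a"), (0.4, "b"), (1.2, "a"), (1.9, "a")]

# Record-at-a-time delivery where record "b" is replayed after a simulated fault.
record_counts = apply_at_least_once({}, ["a", "b", "b", "a", "a"])

# Micro-batch delivery where the whole first batch is replayed after a simulated fault.
batches = micro_batches(events, window_seconds=1.0)
deliveries = [(0, batches[0]), (0, batches[0]), (1, batches[1])]
batch_counts = apply_batches_exactly_once({}, set(), deliveries)

print(record_counts)  # "b" is counted twice
print(batch_counts)   # each event is counted once
```

The cost of the batch-level approach is visible in `micro_batches`: an event is not processed until its window closes, which is where the extra latency comes from.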

In short, Storm is a good choice if you need sub-second latency and no data loss. Spark Streaming is better if you need stateful computation, with the guarantee that each event is processed exactly once.

Implementation – Storm is primarily implemented in Clojure, while Spark Streaming is implemented in Scala. Storm was developed at BackType and Twitter; Spark Streaming was developed at UC Berkeley.

Batch Framework Integration – One nice feature of Spark Streaming is that it runs on Spark. Thus, you can use the same (or very similar) code that you write for batch processing and/or interactive queries in Spark on Spark Streaming. This reduces the need to write separate code to process streaming data and historical data.
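The code-reuse point can be illustrated with a small sketch. This is plain Python rather than the Spark API, and the names (`count_words`, `historical`, `incoming_batches`) are hypothetical; it only shows the shape of the idea, namely that one piece of logic is written once and then applied both to a complete historical dataset and to each micro-batch of a stream.

```python
from collections import Counter


def count_words(lines):
    """Shared logic: count words in any iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


# Batch use: run the logic once over historical data.
historical = ["storm spark", "spark streaming"]
batch_result = count_words(historical)

# Streaming use: run the same logic on each micro-batch and merge into running state.
incoming_batches = [["spark storm"], ["streaming spark spark"]]
running = Counter()
for batch in incoming_batches:
    running += count_words(batch)

print(batch_result)
print(running)
```

With a record-at-a-time system that has no batch counterpart, the historical job and the streaming job would typically be written and maintained separately.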

Production Use – Storm has been around for several years and has run in production at Twitter since 2011, as well as at many other companies. Meanwhile, Spark Streaming is a newer project; its only production deployment has been at Sharethrough since 2013.

Hadoop Distribution, Support – Storm is the streaming solution in the Hortonworks Hadoop data platform, whereas Spark Streaming is in both MapR’s distribution and Cloudera’s Enterprise data platform. In addition, Databricks is a company that provides support for the Spark stack, including Spark Streaming.

Cluster Manager Integration – Although both systems can run on their own clusters, Storm also runs on Mesos, while Spark Streaming runs on both YARN and Mesos.

Further Reading – For an overview of Storm, see these slides. For a good overview of Spark Streaming, see the slides to a Strata Conference talk. A more detailed description can be found in this research paper. There is also a comparison of Storm and Spark Streaming from Hortonworks, written in defense of Storm’s features and performance.

