Apache Storm

From Bauman National Library
This page was last modified on 3 December 2018, at 22:35.
Revision as of 22:35, 3 December 2018 by kirill tarasov (Talk | contribs)

Developer(s): BackType, Twitter
Stable release: 1.2.2 (4 June 2018)
Development status: Active
Written in: Clojure and Java
Operating system: Cross-platform
Website: storm.apache.org

Apache Storm is a free and open source distributed realtime computation system.

Characteristics

Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing [Reference 1].

Application

Storm has many use cases:

  • Realtime analytics
  • Online machine learning
  • Continuous computation
  • Distributed RPC
  • ETL (Extract, Transform, Load)

Features

Apache Storm has several key features.

Integrates

Storm integrates with any queueing system and any database system [Reference 2]. Storm's spout abstraction makes it easy to integrate a new queueing system.

Example queue integrations include:

  • Kestrel
  • RabbitMQ / AMQP
  • Kafka
  • JMS
  • Amazon Kinesis

Likewise, integrating Storm with database systems is easy. Simply open a connection to your database and read and write as you normally would. Storm handles the parallelization, partitioning, and retrying on failures when necessary.
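
As a rough illustration of "open a connection and read and write as usual", the sketch below models a bolt-like class in plain Python with the standard sqlite3 module. It is not Storm's API: the class name SqliteCountBolt, its prepare/execute/cleanup methods, and the word_counts table are assumptions made for this example, loosely mirroring the lifecycle of a bolt.

```python
import sqlite3


class SqliteCountBolt:
    """Toy bolt-like class (hypothetical, not Storm's API) that stores word counts."""

    def prepare(self, db_path=":memory:"):
        # Open the connection once, before any tuple arrives.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS word_counts "
            "(word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
        )

    def execute(self, word):
        # Ordinary SQL: bump the counter, or create the row on first sight.
        cur = self.conn.execute(
            "UPDATE word_counts SET n = n + 1 WHERE word = ?", (word,)
        )
        if cur.rowcount == 0:
            self.conn.execute(
                "INSERT INTO word_counts (word, n) VALUES (?, 1)", (word,)
            )
        self.conn.commit()

    def count(self, word):
        row = self.conn.execute(
            "SELECT n FROM word_counts WHERE word = ?", (word,)
        ).fetchone()
        return row[0] if row else 0

    def cleanup(self):
        self.conn.close()


bolt = SqliteCountBolt()
bolt.prepare()
for w in ["storm", "bolt", "storm"]:
    bolt.execute(w)
storm_count = bolt.count("storm")
bolt_count = bolt.count("bolt")
missing_count = bolt.count("spout")
bolt.cleanup()
```

Because a tuple may be replayed after a failure, an increment like this one can be applied more than once; a write keyed by a message id would be one way to make it idempotent.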

Scalable

Storm topologies are inherently parallel and run across a cluster of machines [Reference 3]. Different parts of the topology can be scaled individually by tweaking their parallelism. The "rebalance" command of the "storm" command line client can adjust the parallelism of running topologies on the fly. Storm's inherent parallelism means it can sustain very high message throughput with very low latency.
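
To make the effect of parallelism concrete, the following sketch spreads a stream of keys over a configurable number of tasks. It is a toy model under stated assumptions: the functions route and distribute are hypothetical, and routing by a stable hash modulo the task count is only one possible strategy, not a description of how Storm assigns tuples or performs a rebalance.

```python
import zlib
from collections import Counter


def route(key, num_tasks):
    """Pick a task index for a key using a stable hash (illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def distribute(keys, num_tasks):
    """Count how many tuples each task would receive."""
    load = Counter({task: 0 for task in range(num_tasks)})
    for key in keys:
        load[route(key, num_tasks)] += 1
    return dict(load)


keys = [f"user-{i}" for i in range(1000)]
load_2 = distribute(keys, 2)   # component running with 2 tasks
load_8 = distribute(keys, 8)   # same stream after raising parallelism to 8
```

With more tasks, the same 1000 tuples are split into smaller shares, so the busiest task never has more work than it had at the lower parallelism.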

Fault tolerant

Storm is fault-tolerant: when workers die, Storm automatically restarts them [Reference 4]. If a node dies, its workers are restarted on another node. The Storm daemons, Nimbus and the Supervisors, are designed to be stateless and fail-fast, so if they die, they restart as if nothing had happened. This means you can kill -9 the Storm daemons without affecting the health of the cluster or your topologies.
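
The idea of a stateless, fail-fast process that is simply restarted can be sketched in a few lines. This is a toy model, not Storm code: the names WorkerCrash, worker, and supervise are hypothetical, and a plain dictionary stands in for whatever external storage holds the progress that must survive a crash.

```python
class WorkerCrash(Exception):
    """Raised by the toy worker to simulate a process dying."""


def worker(external_state, items, crash_at):
    """Fail-fast worker: keeps no state of its own, resumes from external_state."""
    while external_state["next_index"] < len(items):
        i = external_state["next_index"]
        if i in crash_at:
            crash_at.discard(i)          # crash only once at this position
            raise WorkerCrash(f"died before item {i}")
        external_state["done"].append(items[i].upper())
        external_state["next_index"] = i + 1


def supervise(items, crash_at):
    """Restart the worker after every crash until the work is finished."""
    external_state = {"next_index": 0, "done": []}
    restarts = 0
    while True:
        try:
            worker(external_state, items, crash_at)
            return external_state["done"], restarts
        except WorkerCrash:
            restarts += 1                # a real supervisor would start a fresh process


done, restarts = supervise(["a", "b", "c", "d"], crash_at={1, 3})
```

The worker crashes twice, yet every item is processed exactly once in this model, because progress lives outside the process that dies.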

Guarantees data processing

Storm guarantees that every tuple will be fully processed [Reference 5]. One of Storm's core mechanisms is the ability to track, extremely efficiently, the lineage of a tuple as it makes its way through the topology.
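
One way to see how lineage can be tracked cheaply is the XOR technique that Storm's documentation on guaranteed message processing describes for its acker tasks: a single integer per root tuple absorbs the id of every tuple that is emitted and, later, acked. The sketch below is a simplified single-process model of that idea; the AckTracker class and its method names are hypothetical.

```python
import random


class AckTracker:
    """Tracks one tuple tree with a single integer (simplified model)."""

    def __init__(self):
        self.ack_val = 0

    def emitted(self, tuple_id):
        # A new tuple joins the tree: XOR its id in.
        self.ack_val ^= tuple_id

    def acked(self, tuple_id):
        # The tuple was processed: XOR the same id again, cancelling it out.
        self.ack_val ^= tuple_id

    def fully_processed(self):
        return self.ack_val == 0


rng = random.Random(42)


def new_id():
    """Random 64-bit tuple id."""
    return rng.getrandbits(64)


tracker = AckTracker()
root = new_id()
tracker.emitted(root)                 # the spout emits the root tuple

children = [new_id() for _ in range(3)]
for child in children:                # a bolt emits three tuples anchored to the root
    tracker.emitted(child)
tracker.acked(root)                   # ... and then acks the root
pending_after_root = not tracker.fully_processed()

for child in children:                # downstream bolts ack each child
    tracker.acked(child)
complete = tracker.fully_processed()
```

The memory needed per root tuple stays constant no matter how large the tree grows, which is what makes the tracking efficient.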

Storm's basic abstractions provide an at-least-once processing guarantee, the same guarantee you get when using a queueing system. Messages are only replayed when there are failures. Using Trident, a higher level abstraction over Storm's basic abstractions, you can achieve exactly-once processing semantics.
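
The at-least-once guarantee can be illustrated with a toy spout that keeps every message pending until it is acknowledged and puts it back in the queue when processing fails. The ReplayingSpout class, the run function, and the flaky_ids parameter are assumptions for this sketch and do not reproduce Storm's API.

```python
from collections import deque


class ReplayingSpout:
    """Toy spout: holds each message as pending until acked, re-queues it on failure."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))   # (msg_id, payload) pairs
        self.pending = {}

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        del self.pending[msg_id]

    def fail(self, msg_id):
        # Replay: move the message from pending back to the end of the queue.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(spout, flaky_ids):
    """Process every message; ids in flaky_ids fail on their first attempt."""
    attempts, delivered = {}, []
    while True:
        item = spout.next_tuple()
        if item is None:
            return attempts, delivered
        msg_id, payload = item
        attempts[msg_id] = attempts.get(msg_id, 0) + 1
        delivered.append(payload)                 # side effect happens on every attempt
        if msg_id in flaky_ids and attempts[msg_id] == 1:
            spout.fail(msg_id)
        else:
            spout.ack(msg_id)


spout = ReplayingSpout(["a", "b", "c"])
attempts, delivered = run(spout, flaky_ids={1})
```

Message "b" is delivered twice and the others once: nothing is lost, but downstream code has to tolerate duplicates unless a higher-level abstraction provides exactly-once semantics.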

Easy to deploy and operate

Storm clusters are easy to deploy, requiring a minimum of setup and configuration to get up and running [Reference 6]. Storm's out-of-the-box configurations are suitable for production. If you're on EC2, the storm-deploy project can provision, configure, and install a Storm cluster from scratch at just the click of a button.

Additionally, Storm is easy to operate once deployed. Storm has been designed to be extremely robust – the cluster will just keep on running, month after month.

System requirements

Storm can be installed on any Linux distribution [Reference 7].

Storm needs ample memory and adequate processing power. The recommended machine configurations are listed below.

For development systems:

  • Minimum of 2 GB RAM
  • 1 CPU for Storm
  • 1 TB hard disk

For production systems:

  • Minimum of 16 GB RAM
  • Up to 32 GB of RAM per machine (recommended)
  • CPUs with at least 6 cores (recommended)
  • Processors running at 2 GHz or faster
  • 4 x 2 TB hard disks
  • 1 Gigabit Ethernet

Concepts

Apache Storm reads a raw stream of real-time data at one end, passes it through a sequence of small processing units, and outputs the processed, useful information at the other end. Image 1 depicts the core concepts of Apache Storm.

Image 1 – core concepts of Apache Storm

API

Storm has a simple, easy-to-use API [Reference 8]. When programming on Storm, you manipulate and transform streams of tuples, where a tuple is a named list of values. Tuples can contain objects of any type; if you want to use a type Storm doesn't know about, it's very easy to register a serializer for that type.
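
The phrase "named list of values" can be pictured with a small data structure: values are stored in order, and each position also has a field name. The ToyTuple class below is a hypothetical stand-in written for this article, not Storm's Tuple class.

```python
class ToyTuple:
    """A named list of values: positional like a list, addressable by field name."""

    def __init__(self, fields, values):
        if len(fields) != len(values):
            raise ValueError("each declared field needs exactly one value")
        self.fields = list(fields)
        self.values = list(values)

    def get_value(self, index):
        return self.values[index]

    def get_value_by_field(self, name):
        return self.values[self.fields.index(name)]


# Values can be of any type, including containers.
t = ToyTuple(["word", "count", "tags"], ["storm", 3, {"stream", "realtime"}])
first = t.get_value(0)
count = t.get_value_by_field("count")
tags = t.get_value_by_field("tags")
```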

There are just three abstractions in Storm: spouts, bolts, and topologies. A spout is a source of streams in a computation. Typically a spout reads from a queueing broker such as Kestrel, RabbitMQ, or Kafka, but a spout can also generate its own stream or read from somewhere like the Twitter streaming API. Spout implementations already exist for most queueing systems.

A bolt processes any number of input streams and produces any number of new output streams. Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.

A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt. A topology is an arbitrarily complex multi-stage stream computation. Topologies run indefinitely when deployed.
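
The three abstractions can be modelled with Python generators to show how a topology wires them together. This is a minimal single-process sketch, not Storm's API: the function names sentence_spout, split_bolt, count_bolt, and build_topology are hypothetical, and unlike a deployed topology this one stops when its finite input is exhausted.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream. Here it replays a fixed list."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes one stream and emits a new stream of single words."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream):
    """Bolt: streaming aggregation that emits a running count per word."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def build_topology(sentences):
    """Wire the edges: sentence_spout -> split_bolt -> count_bolt."""
    return count_bolt(split_bolt(sentence_spout(sentences)))


emitted = list(build_topology(["the storm", "The bolt and the spout"]))
final_counts = dict(emitted)   # later tuples overwrite earlier ones per word
```

Each edge is a bolt subscribing to the output of the previous component; in a real cluster every component would additionally run as many parallel tasks.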

Storm has a "local mode" in which a Storm cluster is simulated in-process. This is useful for development and testing. When a topology is ready to run on an actual cluster, it is submitted with the "storm" command line client.

Installation

To install Apache Storm, follow these steps:

  1. Install ZooKeeper.
  2. Download the archive from the official site, unpack it, and register the host settings in conf/storm.yaml (see video).
  3. Start the Nimbus, Supervisor, and UI daemons.
  4. Open the web interface on port 8080 of the selected host.
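
The host settings from step 2 can be sketched as a short helper that renders a minimal conf/storm.yaml. The configuration keys are the ones used by Storm 1.x as far as this article's author-independent sketch assumes (storm.zookeeper.servers, nimbus.seeds, storm.local.dir, supervisor.slots.ports); the function render_storm_yaml, the hostnames, the directory, and the port list are placeholders chosen for illustration and should be checked against the documentation of the installed version.

```python
def render_storm_yaml(zookeeper_hosts, nimbus_hosts, local_dir, slot_ports):
    """Return the text of a minimal storm.yaml (hostnames and paths are placeholders)."""
    lines = ["storm.zookeeper.servers:"]
    lines += [f'  - "{host}"' for host in zookeeper_hosts]
    quoted = ", ".join(f'"{host}"' for host in nimbus_hosts)
    lines.append(f"nimbus.seeds: [{quoted}]")
    lines.append(f'storm.local.dir: "{local_dir}"')
    lines.append("supervisor.slots.ports:")
    lines += [f"  - {port}" for port in slot_ports]
    return "\n".join(lines) + "\n"


config_text = render_storm_yaml(
    zookeeper_hosts=["zk1.example.com"],      # hypothetical ZooKeeper host
    nimbus_hosts=["nimbus1.example.com"],     # hypothetical Nimbus host
    local_dir="/var/storm",                   # hypothetical working directory
    slot_ports=[6700, 6701],                  # one worker slot per port
)
```

With such a file in place, the daemons from step 3 are typically started from the unpacked distribution with the "storm" client, using the subcommands nimbus, supervisor, and ui.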

References

  1. Apache Storm // Welcome to The Apache Software Foundation. [2018]. Updated: 30.11.2018. URL: https://storm.apache.org/ (accessed: 30.11.2018).
  2. Integrates // Apache Storm. [2018]. Updated: 30.11.2018. URL: https://storm.apache.org/about/integrates.html (accessed: 30.11.2018).
  3. Scalable // Apache Storm. [2018]. Updated: 30.11.2018. URL: https://storm.apache.org/about/scalable.html (accessed: 30.11.2018).
  4. Fault Tolerant // Apache Storm. [2018]. Updated: 30.11.2018. URL: https://storm.apache.org/about/fault-tolerant.html (accessed: 30.11.2018).
  5. Guarantees Data Processing // Apache Storm. [2018]. Updated: 30.11.2018. URL: https://storm.apache.org/about/guarantees-data-processing.html (accessed: 30.11.2018).
  6. Easy to deploy and operate // Apache Storm. [2018]. Updated: 30.11.2018. URL: https://storm.apache.org/about/deployment.html (accessed: 30.11.2018).
  7. Apache Storm - Installation and Configuration Tutorial // Online Certification Training Courses for Professionals | Simplilearn. [2009 – 2018]. Updated: 30.11.2018. URL: https://www.simplilearn.com/apache-storm-installation-and-configuration-tutorial-video (accessed: 30.11.2018).
  8. Simple API // Apache Storm. [2018]. Updated: 30.11.2018. URL: https://storm.apache.org/about/simple-api.html (accessed: 30.11.2018).