Apache Flink

From Bauman National Library
This page was last modified on 6 December 2016, at 11:51.
Developer(s) Apache Software Foundation
Initial release 13 October 2016
Development status Active
Written in Java and Scala
Operating system Cross-platform
Available in English
License Apache License 2.0
Website https://flink.apache.org/

Apache Flink is a distributed data processing platform for big data applications, primarily those that analyze data stored in Hadoop clusters. It combines in-memory and disk-based processing and handles both batch and stream processing jobs: streaming is the default execution model, and batch jobs run as a special case of streaming applications.

Development

Flink was designed as an alternative to MapReduce, the batch-only processing engine that was paired with the Hadoop Distributed File System (HDFS) in Hadoop's initial incarnation. The Flink software is open source and adheres to The Apache Software Foundation's licensing provisions. Its development is primarily being driven by DataArtisans GmbH, a startup vendor based in Berlin.

Flink streaming applications are programmed via a DataStream API using either Java or Scala. These languages, as well as Python, can also be used to program against a complementary DataSet API for processing static data. Flink can be deployed on a single Java virtual machine (JVM), in standalone mode, on YARN-based Hadoop clusters, or on cloud systems.

The core Flink runtime supports a pipelined streaming architecture; it also offers a built-in method to support iterative data processing for machine learning and other analytics applications. Dedicated APIs and libraries are provided for development of machine learning programs, as well as string handling, graph processing and other uses. Another API is focused on Hadoop application integration.
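The idea behind iterative processing in which only changed elements are revisited can be illustrated outside Flink. The following is a minimal plain-Python sketch, not Flink's API: the function name `connected_components` and the "workset" and "solution set" variable names are chosen here for illustration. It labels every vertex of a graph with the smallest vertex id in its connected component, and in each round it reprocesses only the vertices whose label changed in the previous round.

```python
from collections import defaultdict


def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component.

    Returns (labels, rounds), where rounds is the number of iterations run.
    """
    # Build an undirected adjacency map.
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Solution set: the current best label for every vertex.
    labels = {v: v for v in vertices}
    # Workset: vertices whose label changed in the previous round.
    workset = set(vertices)
    rounds = 0

    while workset:
        rounds += 1
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                # Propagate a smaller label to the neighbor and remember
                # that the neighbor has to be revisited.
                if labels[v] < labels[n]:
                    labels[n] = labels[v]
                    next_workset.add(n)
        workset = next_workset

    return labels, rounds
```

Because the workset shrinks as labels stabilize, later rounds touch fewer vertices than a version that recomputes every label in every round.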

Flink arose as an offshoot of Stratosphere, a project begun in 2009 at three universities in Germany: TU Berlin, Humboldt University of Berlin and the Hasso Plattner Institute. The Flink technology subsequently became an Apache incubator project in April 2014 and a top-level project late that year; after nine earlier releases, Apache Flink 1.0.0 was released in March 2016. With that, Flink officially joined other Hadoop ecosystem frameworks such as Spark, Storm and Samza in the competition to provide big data streaming capabilities.

Architecture

Flink is a data processing engine that can operate in both batch and streaming modes. It takes data as input, processes it with user-written programs, and emits results in real time. It is regarded as one of the fastest big data processing engines currently available. At Flink’s core is a streaming data flow engine that “provides data distribution, communication, and fault tolerance for distributed computations.” Input data streams typically come from message queues such as Kafka. Users write Flink programs in Java or Scala; these programs are automatically compiled and optimized, then executed in a data-parallel and pipelined manner to process the incoming data.
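To make the pipelined model more concrete, here is a minimal plain-Python sketch of a streaming word count. It does not use Flink's API; the stage names `source`, `split_words`, and `running_count` are hypothetical. Each stage is a generator, so a record flows through the whole pipeline before the next one is read, and an updated count is emitted for every incoming word. The data-parallel aspect (running several copies of each stage on different machines) is not modeled.

```python
from collections import defaultdict


def source(lines):
    """Stand-in for a message queue: yields one record at a time."""
    for line in lines:
        yield line


def split_words(lines):
    """Turn each incoming line into a stream of lowercase words."""
    for line in lines:
        for word in line.lower().split():
            yield word


def running_count(words):
    """Keep per-word state and emit the updated count for every word."""
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
        yield word, counts[word]


def word_count_pipeline(lines):
    """Chain the stages; nothing runs until the result is consumed."""
    return running_count(split_words(source(lines)))
```

Consuming the pipeline with `for word, count in word_count_pipeline(["to be", "or not to be"])` yields one `(word, count)` pair per input word, in arrival order.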

Other features

  • High performance & low latency
  • Support for event time & out-of-order events
  • Exactly-once semantics for stateful computations
  • Highly flexible streaming windows
  • Continuous streaming model with backpressure
  • Fault-tolerance via lightweight distributed snapshots
  • One runtime for streaming and batch processing
  • Memory management
  • Iterations & delta iterations
    • Dedicated support for iterative computations (machine learning, graph analysis)
    • Delta iterations use computational dependencies for faster convergence
  • Program optimizer
  • Streaming data applications (DataStream API)
  • Batch processing applications (DataSet API)
  • Library ecosystem for machine learning, graph analytics, and relational data processing
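Event time, out-of-order events, and windows from the list above can be sketched in a few lines. The following plain-Python model is an illustration under simplifying assumptions, not Flink's API or its exact firing rules: the function name `tumbling_window_sums` and the parameters `window_size` and `max_delay` are hypothetical. Events carry their own timestamp, a watermark trails the largest timestamp seen by `max_delay`, a window fires once the watermark reaches its end, and events arriving after their window has fired are dropped.

```python
def tumbling_window_sums(events, window_size, max_delay):
    """Sum values per tumbling event-time window.

    events: iterable of (event_time, value) pairs in arrival order.
    Returns a list of (window_start, total) in the order windows fire.
    """
    open_windows = {}            # window_start -> running sum
    fired = []
    max_seen = float("-inf")     # largest event time seen so far
    watermark = float("-inf")    # "no older events expected" marker

    for event_time, value in events:
        start = event_time - (event_time % window_size)
        if start + window_size <= watermark:
            # The window this event belongs to has already fired.
            continue
        open_windows[start] = open_windows.get(start, 0) + value

        # Advance the watermark, then fire every window it has passed.
        max_seen = max(max_seen, event_time)
        watermark = max_seen - max_delay
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                fired.append((s, open_windows.pop(s)))

    # End of input: flush the windows that are still open.
    for s in sorted(open_windows):
        fired.append((s, open_windows.pop(s)))
    return fired
```

With `window_size=10` and `max_delay=5`, an event with timestamp 3 that arrives after an event with timestamp 12 is still counted in the window starting at 0, because the watermark (7) has not yet reached that window's end (10).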

Alternatives and Competitors

Flink competes with several other big data processing frameworks:

  • Hadoop MapReduce
  • Apache Spark
  • Apache Storm
  • Apache Tez
  • Apache Apex

Conclusion

Pros

  • Able to execute in both batch and stream modes
  • Real time data processing & analytics
  • High performance and low latency
  • Support for event time and out-of-order events
  • Exactly-once semantics
  • Highly flexible streaming windows
  • Continuous streaming model with backpressure
  • Fault tolerance
  • One runtime for streaming and batch processing
  • Has own memory management system
  • Iterative computation
  • Automatically compiles & optimizes programs
  • Compatible with:
    • Runs on YARN
    • Works with HDFS
    • Streams data from Kafka
    • Can execute Hadoop program code
    • Apache HBase
    • Google Cloud Platform
    • Tachyon
    • Storm compatibility package
  • Offers APIs in Java and Scala that are “very easy to use”
  • Actively maintained (last stable release was March 16, 2016)
  • Active programming community providing support

Cons

  • Does not provide own data storage system
    • Data must be stored in distributed storage systems like HDFS or HBase
    • Input data is taken from message queues like Kafka
  • Just recently released from incubator mode
    • Limited production deployment
    • Libraries still in beta mode
  • No Python API for stream processing at this time (Python is supported only by the DataSet API)
  • Does not currently have REPL (read-eval-print-loop)


Apache Flink is useful for real-time analytics because of its streaming data processing features. It is also one of the more flexible options for big data analytics, since it offers both distributed streaming and batch data processing. For a company that wants to make timely business decisions based on real-time analytics, Apache Flink is a strong candidate. It has many competitors, however, and Flink itself is still relatively new, so the choice of framework ultimately depends on the data project, personal preference, and IT capabilities.

Installation

Install Java

Apache Flink runs on the JVM, so Java must be installed first.

  • Install Python Software Properties
$ sudo apt-get install python-software-properties
  • Add Repository
$ sudo add-apt-repository ppa:webupd8team/java
  • Update the source list
$ sudo apt-get update
  • Install Java
$ sudo apt-get install oracle-java7-installer
  • Verify Java Installation

To check whether the installation completed successfully and to see which Java version is installed, run the following command:

$ java -version

Install Apache Flink

  • Download Apache Flink

You can download Flink from the official Apache website.

  • Untar the setup file

Move the downloaded archive to your home directory and run the following command to extract Flink:

$ tar xzf flink-1.1.3-bin-hadoop1-scala_2.10.tgz
  • Rename the installation Directory
$ mv flink-1.1.3/ flink 

To start the Flink services and run the sample programs, change to the flink directory with the following command:

$ cd flink
  • Start Flink

To start Apache Flink in local mode, use this command:

/flink$ bin/start-local.sh
  • Check status

Check the status of running services

/flink$ jps
The output should look similar to this:
6740 Jps
6725 JobManager

To open the Apache Flink web dashboard, go to localhost:8081 in a browser.
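Whether anything is listening on port 8081 can also be checked from a script. The following is a minimal Python sketch; the helper name `port_is_open` is hypothetical, and the check only confirms that a TCP connection can be opened, not that the process behind the port is Flink.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable.
        return False


if __name__ == "__main__":
    # 8081 is the port named in the step above.
    print("port 8081 reachable:", port_is_open("localhost", 8081))
```

If the script reports that the port is not reachable, check the output of jps to confirm that the JobManager process is running.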
