Apache Flink
| Apache Flink | |
|---|---|
| Developer(s) | Apache Software Foundation |
| Stable release | 13 October 2016 |
| Repository | |
| Development status | Active |
| Written in | Java and Scala |
| Operating system | Cross-platform |
| Available in | English |
| License | Apache License 2.0 |
| Website | https://flink.apache.org/ |
Apache Flink is a distributed data processing platform for use in big data applications, primarily involving analysis of data stored in Hadoop clusters. Supporting a combination of in-memory and disk-based processing, Flink handles both batch and stream processing jobs, with data streaming the default implementation and batch jobs running as special-case versions of streaming applications.
Development
Flink was designed as an alternative to MapReduce, the batch-only processing engine that was paired with the Hadoop Distributed File System (HDFS) in Hadoop's initial incarnation. The Flink software is open source and adheres to the Apache Software Foundation's licensing provisions. Its development is driven primarily by data Artisans GmbH, a startup based in Berlin.
Flink streaming applications are programmed via the DataStream API using either Java or Scala. These languages, as well as Python, can also be used to program against the complementary DataSet API for processing static data. Flink can be deployed on a single Java virtual machine (JVM) in standalone mode, on YARN-based Hadoop clusters, or on cloud systems.
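As a rough illustration of the DataStream API, the sketch below is a minimal streaming word count in Java. The socket source on localhost:9999 is an assumed stand-in for a real stream (it could be fed, for example, with `nc -lk 9999`); class and job names are illustrative.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {

    public static void main(String[] args) throws Exception {
        // Entry point for a streaming program: the execution environment.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read lines of text from a local socket (assumed source for this sketch).
        DataStream<String> text = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = text
                // Split each line into (word, 1) pairs.
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.toLowerCase().split("\\W+")) {
                            if (!word.isEmpty()) {
                                out.collect(new Tuple2<>(word, 1));
                            }
                        }
                    }
                })
                // Group by the word (field 0) and keep a running sum of the counts (field 1).
                .keyBy(0)
                .sum(1);

        counts.print();

        // Nothing runs until execute() is called.
        env.execute("Streaming WordCount");
    }
}
```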
The core Flink runtime supports a pipelined streaming architecture; it also offers built-in support for iterative data processing for machine learning and other analytics applications. Dedicated APIs and libraries are provided for developing machine learning programs, as well as for graph processing, relational data processing and other uses. Another API is focused on Hadoop application integration.
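The built-in iteration support can be sketched with the DataSet API's bulk iterations. The following Java fragment, adapted in spirit from the Monte-Carlo Pi-estimation example in the Flink documentation, repeats a map step a fixed number of times; the iteration count of 10,000 is arbitrary.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class IterativePiEstimation {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        final int iterations = 10000;

        // Start the iterative part of the program with an initial count of 0.
        IterativeDataSet<Integer> initial = env.fromElements(0).iterate(iterations);

        // One iteration step: throw a random dart and increment the count
        // if it lands inside the unit circle.
        DataSet<Integer> step = initial.map(new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer count) {
                double x = Math.random();
                double y = Math.random();
                return count + ((x * x + y * y < 1) ? 1 : 0);
            }
        });

        // Close the iteration: feed the result of each step back as the next input.
        DataSet<Integer> count = initial.closeWith(step);

        // Convert the hit count into an estimate of Pi; print() triggers execution.
        count.map(new MapFunction<Integer, Double>() {
            @Override
            public Double map(Integer hits) {
                return hits / (double) iterations * 4;
            }
        }).print();
    }
}
```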
Flink arose as an offshoot of Stratosphere, a project begun in 2009 at three universities in Germany: TU Berlin, Humboldt University of Berlin and the Hasso Plattner Institute. The Flink technology subsequently became an Apache incubator project in April 2014 and a top-level project late that year; after nine earlier releases, Apache Flink 1.0.0 was released in March 2016. With that, Flink officially joined other Hadoop ecosystem frameworks such as Spark, Storm and Samza in the competition to provide big data streaming capabilities.
Architecture
Flink is a batch and streaming data processing engine, meaning it can operate in both batch and streaming modes. It takes data as input, processes it using programs written by the user, and emits results in real time. It is among the fastest big data processing engines currently available. At Flink's core is a streaming dataflow engine that "provides data distribution, communication, and fault tolerance for distributed computations." Input (the data stream) typically comes from message queues such as Kafka. Users write Flink programs in Java or Scala; these programs are automatically compiled and optimized, then executed in a data-parallel and pipelined manner to process the incoming data.
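For example, a stream can be consumed from a Kafka topic along the lines of the following Java sketch. It assumes the Kafka 0.9 connector (flink-connector-kafka-0.9) is on the classpath; the broker address, consumer group and topic name are placeholders.

```java
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaSourceExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Connection settings for the Kafka cluster (placeholder values).
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-demo");

        // Consume the "events" topic as a stream of strings.
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer09<String>("events", new SimpleStringSchema(), props));

        // Placeholder processing: just echo each record.
        events.print();

        env.execute("Kafka source example");
    }
}
```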
Other features
- High performance & low latency
- Support for event time & out-of-order events (illustrated, together with windows and checkpointing, in the sketch after this list)
- Exactly-once semantics for stateful computations
- Highly flexible streaming windows
- Continuous streaming model with backpressure
- Fault-tolerance via lightweight distributed snapshots
- One runtime for streaming and batch processing
- Memory management
- Iterations & delta iterations
  - Dedicated support for iterative computations (machine learning, graph analysis)
  - Delta iterations use computational dependencies for faster convergence
- Program optimizer
- Streaming data applications (DataStream API)
- Batch processing applications (DataSet API)
- Library ecosystem for machine learning, graph analytics, and relational data processing
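Several of these features can be seen together in one short Java sketch. The program below is an assumed, self-contained illustration (the sensor ids, values and timestamps are made up): it switches to event time, enables checkpointing (the mechanism behind the exactly-once guarantees for stateful computations), tolerates out-of-order events via a watermark assigner, and sums values per key in 10-second tumbling event-time windows.

```java
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWindowExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Use event time (timestamps carried in the records) instead of processing time.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Checkpoint every 5 seconds; checkpoints back the exactly-once state guarantees.
        env.enableCheckpointing(5000);

        // A tiny in-memory stream of (key, value, eventTimestampMillis) records (made-up data).
        DataStream<Tuple3<String, Integer, Long>> events = env.fromElements(
                new Tuple3<>("sensor-1", 10, 1000L),
                new Tuple3<>("sensor-1", 20, 4000L),
                new Tuple3<>("sensor-2", 5, 2000L),
                new Tuple3<>("sensor-1", 30, 12000L));

        events
            // Extract timestamps and allow events to arrive up to 2 seconds out of order.
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, Integer, Long>>(Time.seconds(2)) {
                    @Override
                    public long extractTimestamp(Tuple3<String, Integer, Long> element) {
                        return element.f2;
                    }
                })
            .keyBy(0)                        // group by the sensor id
            .timeWindow(Time.seconds(10))    // 10-second tumbling event-time windows
            .sum(1)                          // sum the value field per key and window
            .print();

        env.execute("Event-time window example");
    }
}
```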
Alternatives and Competitors
Flink competes with a number of other big data processing frameworks, including:
- Hadoop MapReduce
- Apache Spark
- Apache Storm
- Apache Tez
- Apache Apex
Conclusion
Pros
- Able to execute in both batch and stream modes
- Real time data processing & analytics
- High performance and low latency
- Support for event time and out-of-order events
- Exactly-once semantics
- Highly flexible streaming windows
- Continuous streaming model with backpressure
- Fault tolerance
- One runtime for streaming and batch processing
- Has own memory management system
- Iterative computation
- Automatically compiles & optimizes programs
- Compatible with:
  - Runs on YARN
  - Works with HDFS
  - Streams data from Kafka
  - Can execute Hadoop program code
  - Apache HBase
  - Google Cloud Platform
  - Tachyon
  - Storm compatibility package
- Offers APIs in Java and Scala that are “very easy to use”
- Actively maintained (last stable release was March 16, 2016)
- Active programming community providing support
Cons
- Does not provide its own data storage system
  - Data must be stored in distributed storage systems like HDFS or HBase
  - Input data is taken from message queues like Kafka
- Just recently graduated from incubator mode
  - Limited production deployment
  - Libraries still in beta mode
- No full Python API at this time (the Python interface to the DataSet API is still beta)
- Does not currently have a REPL (read-eval-print loop)
Apache Flink is useful for real-time analytics because of its streaming data processing features. It is also one of the more flexible options for big data analytics, since it offers both distributed streaming and batch data processing. If a company wants to make timely business decisions based on real-time analytics, then Apache Flink is definitely the way to go. It has a lot of competitors, however, and Flink itself is still relatively new, so the choice of which to use really comes down to the data project, personal preferences, and IT capabilities.
Installation
Install Java
Apache Flink requires Java to be installed, as it runs on the JVM. So, let's begin by installing Java.
- Install Python Software Properties
$ sudo apt-get install python-software-properties
- Add Repository
$ sudo add-apt-repository ppa:webupd8team/java
- Update the source list
$ sudo apt-get update
- Install Java
$ sudo apt-get install oracle-java7-installer
- Verify Java Installation
To verify that the installation completed successfully and to check the installed Java version, use the command below:
$ java -version
Install Apache Flink
- Download the Apache Flink
You can download Flink from the official Apache website.
- Untar the setup file
Move the downloaded archive to your home directory and run the command below to extract Flink:
$ tar xzf flink-1.1.3-bin-hadoop1-scala_2.10.tgz
- Rename the installation directory
$ mv flink-1.1.3/ flink
To start the Flink services and run the sample programs, change to the flink directory using the command below:
$ cd flink
- Start Flink
To start Apache Flink in local mode, use this command:
/flink$ bin/start-local.sh
- Check status
Check the status of running services
/flink$ jps
Output should be:
6740 Jps
6725 JobManager
To access the Apache Flink web dashboard, open the following URL: localhost:8081
License
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License, and code samples are licensed under the Apache 2.0 License. See Terms of Use for details.