XenonStack Recommends

Apache Flink Architecture and Use Cases

Acknowledging Data Management
          Best Practices with DataOps Image

Subscription

XenonStack White Arrow Image

What is Apache Flink?

Before having a look at Apache Flink lets have a look at some basic concepts in stream processing Keys to an efficient Big Data Stream Processing engines -
  • Keep the data moving (Streaming architecture, How to treat stream event)
  • Declarative access eg. Stream SQL, CQL
  • Handle imperfections eg. late event, missing events, unordered events.
  • Integrate batch and streaming data.
  • Data safety and availability. (Fault tolerance, Durable state)
  • Automatic partitioning and scaling.
Apache Flink is a community-driven open source framework for shared Big Data Analytics. Apache Flink engine exploits in-memory processing and data streaming and iteration operators to improve performance. The Apache Flink Framework is written in Java which provides -
  • Several APIs in Java/Scala/Python
  • DataSet API - Batch processing
  • DataStream API - Real-Time streaming analytics
  • Table API - Relational Queries
  • DSL ( Domain-Specific Libraries)
  • CEP - Complex Event Processing
  • FlinkML - Machine Learning Library for Flink
  • Gelly - Graph Library for Flink
  • Shell for interactive data analysis
Some unique features in Apache Flink -
  • True Streaming Capabilities - Execute everything as streams
  • Native iterative execution - Allow some cyclic dataflows
  • Handling of mutable state
  • Custom memory manager - Operate on managed memory
  • Cost-Based Optimizer - For both stream and batch processing

How does Flink fulfill the requirements of a Streaming Analytics System?

Keep the data moving

Flink treats data streams in the form of a data stream. Flink has a data stream using which we can manipulate the streaming data. Flink can handle -
  • Bounded data
  • Unbounded data
  • Real-time streams
  • Recorded streams

Declarative access

Apache Flink has Table API and SQL API, which is unified for both streaming and batch data, which implies same semantics can be used on all types of data.SQL and Table api are built upon Apache Calcite and leverage the features such as parsing, validations, and query optimizations.

Handle imperfections

Using the process functions in Apache Flink, we can handle imperfections in data and manipulate event, time and state of streaming data. Time-related features in Flink -
    • Event time mode
    • Watermark support
    • Late data handling
    • Process-Time mode

Integrate Batch and Streaming data

Apache Flink has Dataset API available for batch processing, and the SQL and Table API would work on batch data as well.

Data Safety and Availability

  Fault tolerance Apache Flink is made to handle failures. Durable state Flink maintains strong state via checkpoint the state from time to time. Checkpoints allow Flink to recover state and positions in streams to recover from a failed state. Flink interacts with the persistent storages to store checkpoints. We can configure the various back-end such as Message Queue's - Kafka, Google Pub/Sub, AWS Kinesis, RabbitMQ Filesystems - HDFS, GFS, S3, NFS, Ceph

Automatic partitioning and scaling

Flink has excellent support for partitioning and scaling. What makes flink great is the support for both stateless as well as stateful streaming. Internals of Apache Flink

Benefits of Apache Flink

True Low latency Streaming engine

Flink is a low latency streaming engine that unifies batch and streaming in a single Big Data processing framework.

Custom Memory Manager

Flink contains its memory management stack. Flink includes its serialization and type extraction components. Flink uses C++ style memory management and User data stored in serialized byte arrays in JVM. Memory is allocated, de-allocated, and used strictly using an internal buffer pool implementation.

Apache Flink Advantages

  • Flink will not throw an OOM(out of memory )exception
  • Reduction of Garbage Collection
  • Very efficient disk spilling and network transfers
  • No Need for runtime tuning
  • More reliable and stable performance
  • Built-in Cost-Based Optimizer
  • Custom state maintenance
Native closed-loop iteration operators Flink support iterative computation. Flink iterates data by using streaming architecture. It's pipelined architecture allows processing the streaming data faster with lower latency. Flink used an iterative algorithm which is tightly bounded into flink query optimizer. Unified Framework Flink is a unified framework which allows building a single data workflow that holds streaming, batch, SQL and Machine learning. Analyze real-time streaming data Process graphs Machine Learning algorithms

Why Apache Flink matters in Big data Ecosystem?

Apache Flink Apache Spark Samza Apache Storm
Native streaming means Processing every record as it arrives Fast Batching, means it Processing records in batches of some seconds. Supports native streaming using spark structured streaming API. Native streaming means Processing every record as it arrives Native streaming means Processing every record as it arrives
Exactly once guarantee Exactly once At least once guarantee At least once guarantee Exactly once guarantee using Trident as an abstraction
Supports advanced streaming features like Watermarks, triggers, Sessions, etc. Supports advanced streaming features like Watermarks, Sessions, triggers, etc. Lacks advanced streaming features like Watermarks, Sessions, triggers, etc. Supports advanced streaming features like Watermarks, Sessions, triggers, etc.
Scala, Java, Python Scala, Java, Python Java Scala, Java, Python
Hybrid framework( batch + stream processing ) Hybrid framework( batch + stream processing ) Stream only framework Stream only framework

Apache Flink in Production

Apache Flink Architecture and Tools In production Apache Flink can be integrated with familiar cluster managers We can deploy flink in the resource-manager specific deployment mode, and Flink interacts with the resource manager in their specific appropriate way. Flink communicates with the resource managers to ask for the resources required by the application from its parallelism configuration. In the case of a failover situation where a job fails, flink automatically requests a new resource accordingly. It has been reported that flink can support -
  • Multiple trillions of events per day.
  • Multiple terabytes of state.
  • Running on thousands of cores.

Apache Flink Best Practices

Parsing command line arguments and passing them around in Flink application Getting configuration values into the ParameterTool Using the parameters in Flink program Naming large TupleX types - Used POJO (Plain old java object ) instead of TupleX for data types with many fields. Used POJOs to give large Tuple-types a name. Instead of using -
 
Tuplell<String,String,String> var = new ...;
Use -

CustomType var=new . . . ;
public static class CustomType extends Tuplell<String, String, String>{

}
Using Logback instead of Log4j
Use Logback when running Flink out of the IDE/java application
Use Logback when running Flink on a cluster

Best Tools for Enabling Apache Flink

Flink has the following useful tools -
  • Command Line Interface (CLI)- It is used for operating Flink's utilities directly from a command prompt.
  • Job Manager - It is a management interface which is used to track jobs, status, failure, etc.
  • Job Client - It is a client interface which is used to submit, execute, debug and inspect jobs.
  • Zeppelin - It is an interactive web-based computational platform along with visualization tools and analytics.
  • Interactive Scala Shell/REPL - It is used for interactive queries.

How Can XenonStack Help You?

XenonStack offers Real-Time Data Analytics and Big Data Engineering Services for Enterprises and Startups Real-Time Data Analytics Services Big Data Managed Solutions for Enterprises

Related blogs and Articles