Introduction to Apache Kafka
In a data-driven approach, business decisions are based on data rather than intuition. Data-driven organizations use this strategy to extract insights from their data, identify business opportunities, improve the way they serve customers, and increase sales.
As everything revolves around data, moving it from one place to another is challenging, and doing so with real-time data is even harder. Apache Kafka, a distributed data streaming platform, was built to solve exactly this problem. So what is Kafka, and how does it work? Let's look into it.
What is Apache Kafka?
Apache Kafka is an open-source distributed data streaming platform originally developed at LinkedIn. It can handle both batch and real-time data with very low latency. Kafka lets you publish and subscribe to events and can store those events for as long as you want. Beyond that, you can process these events and run analytics on them. The "event" we are talking about here is simply a record in Kafka. We will discuss this more, but first, let's study the architecture.
What is Kafka's architecture?
At a high level, producers are the clients that send data to Kafka, and consumers are the clients that read this data back.
Kafka broker: This is where the data resides; it acts as a bridge between producers and consumers. All partitions and replicas of the data are managed here.
Zookeeper: Zookeeper is mainly responsible for managing the Kafka brokers (or clusters). It handles a set of responsibilities, some of which are:
- Assigning the controller broker.
- Maintaining the list of all working brokers in the cluster.
- Managing the access control list (ACL), i.e., who can read from and write to which topics, among many other permissions.
What are Kafka producers?
A producer is a Kafka client responsible for publishing records to Kafka. Producers are thread-safe and maintain a pool of buffers that holds to-be-sent records. These buffered records are sent as fast as the brokers can handle them. Records are serialized into byte arrays before being sent. When a producer sends a message, the broker supports three types of acknowledgment, which you can configure; by default, acks = 'all'.
- ACKS 0 (NONE)
The producer doesn't wait for acknowledgment from the broker; messages added to the buffer are considered sent.
- ACKS 1 (Leader)
The producer sends the record to Kafka and waits for acknowledgment from the leader partition, but not from the follower partitions.
- ACKS -1 (ALL)
The producer sends the record to Kafka and waits for acknowledgment from the leader partition and all in-sync follower partitions, providing full fault tolerance.
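These acknowledgment modes map directly onto producer configuration. A minimal sketch in the style of the kafka-python client (the broker address localhost:9092 is an assumption):

```python
# The three acknowledgment modes expressed as kafka-python style
# keyword arguments. "localhost:9092" is an assumed broker address;
# these dicts would be passed as KafkaProducer(**config).

fire_and_forget = {"bootstrap_servers": "localhost:9092", "acks": 0}      # don't wait at all
leader_only     = {"bootstrap_servers": "localhost:9092", "acks": 1}      # leader ack only
full_durability = {"bootstrap_servers": "localhost:9092", "acks": "all"}  # leader + in-sync followers

print(full_durability["acks"])
```

Choosing `acks=0` maximizes throughput at the cost of possible data loss, while `acks='all'` trades latency for durability.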
What are Kafka consumers?
A consumer is a Kafka client that subscribes to a specific topic and receives records from it. When producers produce messages faster than a single consumer can consume them, the concept of a consumer group comes in.
A consumer group is a set of consumers subscribed to the same topic. Just as multiple producers can send data to a single topic, multiple consumers can receive data from the same topic by dividing its partitions among themselves.
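Partition sharing inside a group can be sketched in plain Python. This is an illustrative round-robin assignment, not the Kafka client's actual assignor:

```python
# Illustrative sketch: round-robin assignment of a topic's partitions
# across the consumers of one group (the real client supports several
# assignment strategies, e.g. range and sticky).
def assign_partitions(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions shared by 2 consumers in the same group:
print(assign_partitions(list(range(6)), ["consumer-a", "consumer-b"]))
# {'consumer-a': [0, 2, 4], 'consumer-b': [1, 3, 5]}
```

Each partition is owned by exactly one consumer in the group, which is why adding consumers beyond the partition count gains nothing.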
Partitions and Replications
In Kafka, records (messages) are stored in topics, and topics are further divided into partitions. The actual data lives in these partitions, and their main advantage is parallelism. Internally, partition selection depends on the record key. When the key is null, the record can be sent to any partition; when a key is provided, its hash determines the specific partition the record goes to.
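The key-based routing can be sketched as below. Kafka actually hashes the serialized key with murmur2; md5 is used here only to keep the example deterministic and dependency-free:

```python
import hashlib
import random

def choose_partition(key, num_partitions):
    """Illustrative sketch of Kafka's key-based partitioning.
    Kafka really uses murmur2 on the serialized key; md5 here is
    only for a deterministic, stdlib-only example."""
    if key is None:
        # Null key: the record may land in any partition.
        return random.randrange(num_partitions)
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return digest % num_partitions

# Records with the same key always land in the same partition:
assert choose_partition(b"user-42", 3) == choose_partition(b"user-42", 3)
```

Keeping all records for one key in one partition is what gives Kafka per-key ordering.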
Replication ensures reliability and availability if a Kafka node fails. As the name suggests, replication is the practice of keeping multiple copies of data. In Kafka, replication happens at partition granularity, so every partition has multiple replicas. Among these replicas, one is the leader partition and the rest are follower partitions.
- Leader Partition: The leader partition receives messages from producers and serves them to consumers on request. Each partition has exactly one leader at a time.
- Follower Partition: The replicas of a partition that are not the leader are follower partitions. They do not handle client requests; their only job is to replicate data from the leader as soon as it arrives there.
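The failover behavior can be sketched as a toy model. In real Kafka the controller elects a new leader from the in-sync replica set; this simplified version just promotes the first surviving replica:

```python
# Toy model of one partition with replication factor 3: one leader and
# two followers. If the leader fails, another replica takes over.
replicas = ["broker-1", "broker-2", "broker-3"]  # broker-1 starts as leader

def elect_leader(replicas, failed):
    # Simplification: first surviving replica becomes leader. Real Kafka
    # elects from the in-sync replica (ISR) set via the controller.
    alive = [r for r in replicas if r not in failed]
    return alive[0] if alive else None

assert elect_leader(replicas, failed=set()) == "broker-1"
assert elect_leader(replicas, failed={"broker-1"}) == "broker-2"
```

As long as at least one replica survives, the partition stays available, which is the point of replication.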
How is Apache Kafka used for real-time data streaming?
Real-time streaming means processing data quickly enough that firms can respond to changing conditions as they happen, and Kafka is widely used for it. As described above, Kafka has producer APIs, and a producer can be a web host, a web server, IoT devices, or any other source sending big data continuously. Consumers, or stream processors such as Spark Streaming, then subscribe to these topics and consume the data reliably. Let's walk through how real-time data streams through Kafka.
Extraction of data into Kafka
The first step is extracting data into Kafka; the primary goal here is to get data into Kafka for our application. To copy data from multiple sources into Kafka, Kafka provides connectors that run on the Kafka Connect framework. Kafka Connect streams data between Kafka and other data systems, acting as a data hub that can ingest entire databases into Kafka and make the data available for streaming at low latency. Ready-made source connectors exist, for example the JDBC connector. Suppose our data source is a SQLite database: the JDBC connector can ingest it into Kafka, processing the records one by one and thereby creating the stream.
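A JDBC source connector is configured declaratively. The sketch below shows such a configuration as a Python dict (the connector name, database file, and column name are illustrative; exact properties depend on the connector version); serialized as JSON, this body would be POSTed to the Kafka Connect REST API:

```python
# Sketch of a Kafka Connect JDBC source connector configuration for a
# SQLite database. All names and paths here are illustrative.
jdbc_source = {
    "name": "sqlite-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:sqlite:orders.db",  # assumed database file
        "mode": "incrementing",                     # stream newly inserted rows
        "incrementing.column.name": "id",           # assumed incrementing column
        "topic.prefix": "sqlite-",                  # one topic per table
    },
}
```

With `mode=incrementing`, the connector polls the table and emits each new row as a record into a topic named after the table.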
Transformation of the data using the Kafka Streams API
In the previous phase, we wrote raw, unenriched data from the source into a Kafka topic, from which multiple applications can now read. Using the Kafka Streams API for real-time processing, we can apply transformations and aggregations to this data. We can transform one message at a time or filter messages based on conditions. We can also apply join, aggregation, or windowing operations to the records. Once done, we push the enriched data to another Kafka topic so that other applications can consume it.
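Kafka Streams is a Java library; the plain-Python sketch below only mimics the kind of per-record filter-and-map transformation the Streams API applies before writing to an output topic (the record fields are invented for illustration):

```python
# Plain-Python stand-in for a Streams-style filter + map over records
# read from an input topic. Field names are illustrative.
records = [
    {"user": "a", "amount": 120},
    {"user": "b", "amount": 15},
    {"user": "c", "amount": 300},
]

# filter: keep only large orders; map: enrich each kept record with a flag
enriched = [
    {**r, "large_order": True}
    for r in records
    if r["amount"] >= 100
]
print(enriched)  # records for users "a" and "c"
```

In a real Streams topology, the equivalent would be a `filter` followed by a `mapValues`, with the result written to the output topic via `to()`.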
Sinking the data downstream
So far, we have fetched data from sources and applied transformations to it; the last step is sinking the data downstream. There can be multiple target sinks, and Kafka Connect with the appropriate sink connectors lets any number of systems receive the data. For example, an Amazon S3 bucket can act as a sink: connect the output topic to the S3 sink connector, run it with Kafka Connect, and the data arrives in the required form.
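An S3 sink connector is configured the same way as the source above. This sketch uses illustrative bucket and topic names, and exact properties depend on the connector version:

```python
# Sketch of a Kafka Connect S3 sink connector configuration.
# Bucket and topic names are illustrative assumptions.
s3_sink = {
    "name": "s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "enriched-orders",       # output topic from the transform step
        "s3.bucket.name": "my-data-lake",  # assumed S3 bucket
        "flush.size": "1000",              # records written per S3 object
    },
}
```

The sink connector consumes the topic like any other consumer group and flushes batches of records into S3 objects.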
What are the Use Cases for Apache Kafka?
- Messaging: Kafka can be used as a messaging system. Compared to other messaging systems, it offers better fault tolerance, higher throughput, and built-in partitioning and replication. All these factors make it an excellent choice for large-scale message processing applications.
- Log Aggregation: Log aggregation means collecting physical log files from servers and putting them in a central place, such as a file server or HDFS, for processing. These logs can be sent to Kafka instead, which returns a cleaner abstraction of the log or event data as a stream of messages, applying transformations at very low latency.
- Metrics: Kafka is often used for operational monitoring data, as it can aggregate statistics from distributed applications.
- Commit Logs: Kafka can serve as an external commit log for a distributed system. Logs stored in Kafka are replicated, providing fault tolerance, which helps preserve vital records such as bank transactions.
Optimizations for Real-time Data with Kafka
Kafka is already well optimized out of the box. Still, there are ways to improve the performance of a Kafka cluster. The two main performance metrics to consider are:
- Throughput: The total number of messages arriving in a given time.
- Latency: The amount of time it takes to process each message.
Kafka aims to optimize both throughput and latency by default, but some tuning is still possible depending on the type of job or application you are running.
Tuning the Brokers
As we know, a topic is divided into partitions, and increasing the number of partitions increases parallelism when producing and consuming messages. Hence, improved throughput can be achieved.
Tuning the Producers
A producer runs in two different modes:
- Synchronous Mode
- Asynchronous Mode
In synchronous mode, every published message triggers a request to the broker: 1,000 messages per second means 1,000 requests, which lowers throughput. So when sending many messages, make sure to choose asynchronous mode, which batches messages together.
Compressing the messages
Compressing messages improves latency because smaller payloads can be transmitted faster. By default, Kafka messages are not compressed, so compression must be configured explicitly.
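Batching and compression are both producer-side settings. A sketch in kafka-python keyword-argument style, with illustrative starting values rather than recommendations (the broker address is assumed):

```python
# Producer tuning knobs in kafka-python style. Values are
# illustrative starting points, not recommendations.
tuned_producer = {
    "bootstrap_servers": "localhost:9092",  # assumed broker address
    "compression_type": "gzip",  # compression is off by default
    "batch_size": 32_768,        # bytes accumulated per partition before a send
    "linger_ms": 10,             # wait up to 10 ms to fill a batch
}
```

Raising `linger_ms` and `batch_size` trades a little per-message latency for fewer, larger requests and higher overall throughput.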
Tuning the Consumers
- Consumers receive records in batches. If the batch size for pulling messages is set too high, each batch may take a long time to process, decreasing throughput. Conversely, polling the broker for a single message at a time creates too many broker requests, which also decreases throughput.
- It's true that having more partitions and consumers increases parallelism, but as the number of consumers grows, the number of offset commit requests grows with it. These requests increase the load on the brokers, reducing throughput.
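These consumer-side trade-offs also map onto configuration. A sketch in kafka-python keyword-argument style, with illustrative values that depend on message size and per-record processing cost (the broker address is assumed):

```python
# Consumer tuning knobs in kafka-python style. Values are
# illustrative; the right settings depend on the workload.
tuned_consumer = {
    "bootstrap_servers": "localhost:9092",  # assumed broker address
    "max_poll_records": 500,   # cap per poll: too high slows each processing cycle
    "fetch_min_bytes": 1024,   # avoid one-message-per-request polling
    "fetch_max_wait_ms": 500,  # cap the wait when fetch_min_bytes isn't reached
}
```

`fetch_min_bytes` and `fetch_max_wait_ms` together control how aggressively the consumer batches fetches instead of hammering the broker with tiny requests.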
In real-world scenarios with multiple sources and targets and evolving schemas, ingesting data carries a lot of overhead: it is a multi-step process involving complex transformations that demand durability and fault tolerance. Kafka provides the right architecture for this, letting you flexibly build streaming ETL pipelines and avoid messy processes. As we have seen, Kafka Connect and its connectors help load data into Kafka and sink it out to any system or database, and with the Kafka Streams API you can easily apply transformations, aggregations, joins across multiple data sources, and filtering.