Stream Data Pipeline with Apache Kafka and Spark

Introduction to Real-Time Streaming

  • Real-time streaming here means a data pipeline that ingests data from different sources using Apache Nifi, Apache Kafka, Apache Spark, and Cassandra.
  • Apache Nifi provides a web UI dashboard and helps automate the workflow.


Real-Time Streaming Architecture: Data Pipeline Components

  • Automate Data Workflow – Apache Nifi
  • Messaging System – Apache Kafka
  • Stream Processing Engine – Apache Spark Streaming
  • REST API and Twitter Dashboard for Real-Time Tweets
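
To make the flow between these components concrete, here is a minimal in-memory sketch in Python. It is not the actual deployment: `collect`, `InMemoryTopic`, `count_hashtags`, and `run_pipeline` are hypothetical names standing in for the roles played by Nifi, a Kafka topic, Spark Streaming, and the store behind the dashboard. Counting hashtags is an assumed example workload.

```python
from collections import Counter, deque
from typing import Deque, Dict, Iterable, Iterator, List


def collect(raw_tweets: Iterable[str]) -> Iterator[str]:
    """Collector stage (the role Nifi plays): normalize whitespace, drop blanks."""
    for tweet in raw_tweets:
        text = " ".join(tweet.split())
        if text:
            yield text


class InMemoryTopic:
    """Stand-in for a Kafka topic: a simple FIFO buffer.

    A real Kafka topic retains messages and tracks consumer offsets;
    this sketch just hands each message out once.
    """

    def __init__(self) -> None:
        self._buffer: Deque[str] = deque()

    def publish(self, message: str) -> None:
        self._buffer.append(message)

    def poll(self, max_records: int) -> List[str]:
        batch: List[str] = []
        while self._buffer and len(batch) < max_records:
            batch.append(self._buffer.popleft())
        return batch


def count_hashtags(batch: List[str]) -> Counter:
    """Processing stage (the role Spark Streaming plays) for one micro-batch."""
    return Counter(
        word.lower()
        for text in batch
        for word in text.split()
        if word.startswith("#")
    )


def run_pipeline(raw_tweets: Iterable[str], batch_size: int = 2) -> Dict[str, int]:
    """Wire the stages together and return the totals a dashboard could show."""
    topic = InMemoryTopic()
    for message in collect(raw_tweets):
        topic.publish(message)

    totals: Counter = Counter()  # stand-in for the store the dashboard queries
    while True:
        batch = topic.poll(batch_size)
        if not batch:
            break
        totals.update(count_hashtags(batch))
    return dict(totals)
```

For example, `run_pipeline(["#Kafka rocks", "spark and #kafka"])` counts the same hashtag twice regardless of case. In a real deployment each stage runs as a separate service and the broker decouples their speeds, which this single-process sketch does not model.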


Business Challenges in Building the Data Pipeline

  • Benchmarking the data pipeline built on Nifi and Kafka across different message sizes and durations.
  • Handling real-time streaming, memory management, scalability, and concurrency.
  • Implementing an interactive dashboard with real-time data analytics and visualization using D3.js charts and React.js.
  • Providing an end-to-end delivery guarantee and error handling for data moving from the Twitter agent to the processing engine.
  • Using Apache Hadoop cluster logs and the Twitter Streaming API as test data.
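
The benchmarking item above can be illustrated with a small throughput harness. This is a sketch under stated assumptions, not the benchmark actually used: `run_benchmark` and `BenchmarkResult` are hypothetical names, and `send` is any callable that accepts a bytes payload (in practice it would wrap a real producer call). The clock is injectable so the harness can be checked deterministically.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkResult:
    message_size: int    # bytes per message
    messages_sent: int
    elapsed_s: float

    @property
    def messages_per_s(self) -> float:
        return self.messages_sent / self.elapsed_s if self.elapsed_s > 0 else 0.0

    @property
    def megabytes_per_s(self) -> float:
        return self.messages_per_s * self.message_size / 1_000_000


def run_benchmark(
    send: Callable[[bytes], object],
    message_size: int,
    duration_s: float,
    clock: Callable[[], float] = time.perf_counter,
) -> BenchmarkResult:
    """Send fixed-size payloads until duration_s has elapsed on the given clock."""
    payload = b"x" * message_size
    start = clock()
    now = start
    sent = 0
    while now - start < duration_s:
        send(payload)
        sent += 1
        now = clock()
    return BenchmarkResult(message_size, sent, now - start)
```

A sweep over message sizes then reduces to calling `run_benchmark` once per size with the same duration and comparing `messages_per_s` and `megabytes_per_s` across the results.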


Solution Offered for Building a Real-Time Streaming Data Pipeline

  • A real-time streaming platform in which Apache Nifi acts as both collector and producer for data ingestion.
  • Apache Nifi as collector and Apache Kafka as producer, feeding Apache Spark Streaming and Apache Spark Structured Streaming.
  • Apache Cassandra deployed in a microservices architecture on Kubernetes, as well as on EC2 instances as a cluster, for scaling and guaranteed delivery of data across the data pipeline.
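
One common way to approach guaranteed delivery is at-least-once delivery with retries, combined with a sink that tolerates duplicates. The sketch below assumes that approach; the source does not say which mechanism the solution uses. `deliver` and `IdempotentSink` are hypothetical names, and treating `ConnectionError` as the retryable failure is an assumption. The keyed sink mirrors the fact that a Cassandra write to an existing primary key overwrites the row rather than creating a second one.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Message = Tuple[str, str]  # (message_id, payload)


class IdempotentSink:
    """Stand-in for a keyed store: writing the same id twice keeps one row."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}

    def write(self, message_id: str, payload: str) -> None:
        self.rows[message_id] = payload


def deliver(
    messages: Iterable[Message],
    write: Callable[[str, str], None],
    max_attempts: int = 3,
) -> List[Message]:
    """Try each message up to max_attempts times.

    Returns the messages that never succeeded, so the caller can route
    them to a dead-letter store instead of silently dropping them.
    """
    dead_letters: List[Message] = []
    for message_id, payload in messages:
        for _attempt in range(max_attempts):
            try:
                write(message_id, payload)
                break  # delivered, move on to the next message
            except ConnectionError:
                continue  # assumed transient failure, retry
        else:
            # Loop finished without a successful write.
            dead_letters.append((message_id, payload))
    return dead_letters
```

Because retries and replays can deliver the same message more than once, the design relies on the sink being idempotent: a redelivered message overwrites its own row rather than producing a duplicate.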