Real-Time Data Ingestion and CDC using StreamSets

Introduction to StreamSets Architecture

 

A StreamSets implementation for Data Ingestion and CDC: streaming Real-Time tweets from the Twitter API and migrating data from MySQL to Amazon Redshift through a Kafka-based Data Pipeline.

 

StreamSets Working Framework

  • Platform to build, execute, and manage data flows for Batch and Streaming data.
  • StreamSets Data Collector provides drag-and-drop connectors for Batch and Streaming sources and destinations.
  • Supports monitoring of the Data Pipeline and error detection from source to destination.
  • Real-Time Data Ingestion and processing with Change Data Capture (CDC) for extraction, transformation, and loading in ETL applications.
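To make the CDC idea above concrete, here is an illustrative sketch only: StreamSets performs CDC internally via source-specific readers (for MySQL, the binary log), but this toy diff of two table snapshots, keyed by primary key, shows the kind of INSERT/UPDATE/DELETE events a CDC stage emits downstream. The function and record shapes are assumptions for illustration, not StreamSets APIs.

```python
def capture_changes(before, after):
    """Compare two table snapshots {pk: row} and emit CDC-style events.

    Toy stand-in for a real CDC reader: a real implementation tails a
    change log instead of diffing full snapshots.
    """
    events = []
    for pk, row in after.items():
        if pk not in before:
            events.append({"op": "INSERT", "pk": pk, "row": row})
        elif before[pk] != row:
            events.append({"op": "UPDATE", "pk": pk, "row": row})
    for pk in before:
        if pk not in after:
            events.append({"op": "DELETE", "pk": pk})
    return events


before = {1: {"name": "alice"}, 2: {"name": "bob"}}
after = {1: {"name": "alice"}, 2: {"name": "robert"}, 3: {"name": "carol"}}
print(capture_changes(before, after))
```

Each emitted event carries the operation type plus the changed row, which is exactly the shape a downstream destination (Kafka topic or Redshift table) needs to replay the change.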

 

Business Challenge for Building the Data Pipeline

 

  • To create a Real-Time Twitter Stream into Amazon Redshift Cluster.
  • Build a Data Pipeline for MySQL to migrate its data to Amazon Redshift via Kafka.
  • Implement Change Data Capture Mechanism to capture changes in any data source.
  • Build a Data Pipeline to fetch Google Analytics Data and send the stream to Amazon Redshift.
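The first challenge, streaming tweets into a Redshift Cluster, implies flattening nested tweet JSON into tabular rows before loading. A minimal sketch, assuming standard Twitter API payload fields (`id_str`, `text`, `created_at`, `user.screen_name`); the specific column mapping is an assumption, not the pipeline's actual schema:

```python
import json


def flatten_tweet(raw_json):
    """Flatten a raw tweet JSON string into a flat record for a Redshift row.

    Hypothetical field mapping: mirrors the kind of field-flattening
    processor a StreamSets pipeline would run before the destination stage.
    """
    t = json.loads(raw_json)
    return {
        "tweet_id": t["id_str"],
        "text": t["text"],
        "screen_name": t["user"]["screen_name"],
        "created_at": t["created_at"],
    }


raw = json.dumps({
    "id_str": "123",
    "text": "hello",
    "created_at": "Mon May 06 20:01:29 +0000 2019",
    "user": {"screen_name": "example"},
})
print(flatten_tweet(raw))
```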

 

Solution Offered for Building the Ingestion Platform

 

  • Perform Data Ingestion using StreamSets Data Collector, which streams data in Real-Time.

  • For streaming data to Amazon Redshift, there are two ways –

    • Using Connection Pool – Use the JDBC Producer as the destination, with the Redshift connection string to connect to the cluster.
    • Using Kinesis Firehose Stream – Configure a Kinesis Firehose delivery stream that uses an Amazon S3 bucket as intermediate storage and a COPY command to load the staged data into the Amazon Redshift Cluster.
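The two paths above boil down to two pieces of configuration: a JDBC connection string for the Connection Pool route, and the COPY statement Firehose issues against Redshift for the S3 staging route. A sketch of both, assuming placeholder cluster endpoint, bucket, and IAM role (none of these are real resources):

```python
def redshift_jdbc_url(host, port, database):
    """JDBC connection string used when the JDBC Producer is the destination."""
    return f"jdbc:redshift://{host}:{port}/{database}"


def redshift_copy_sql(table, s3_path, iam_role):
    """COPY statement that loads S3-staged JSON files into a Redshift table."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS JSON 'auto';"
    )


# Placeholder endpoint, bucket, and role ARN for illustration only.
url = redshift_jdbc_url(
    "my-cluster.abc123.us-east-1.redshift.amazonaws.com", 5439, "analytics"
)
sql = redshift_copy_sql(
    "tweets", "s3://my-bucket/staging/",
    "arn:aws:iam::123456789012:role/firehose-redshift"
)
print(url)
print(sql)
```

In the Firehose path, Redshift pulls the files itself via COPY, so throughput scales with the cluster rather than with a single JDBC connection; the JDBC path is simpler but writes row by row through one connection pool.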