StreamSets - Real-Time Data Ingestion and CDC

Chandan Gaur | 20 November 2017

Introduction to StreamSets Architecture

A StreamSets implementation for Data Ingestion and Change Data Capture (CDC) of Real-Time Tweets from the Twitter APIs, and for Data Migration from MySQL into a Data Pipeline using Kafka and Amazon Redshift.

StreamSets Working Framework

  • A platform to build, execute, and manage data flows for Batch and Streaming data.
  • StreamSets Data Collector provides Drag-and-Drop connectors for Batch and Streaming sources.
  • Destinations for Data Ingestion, along with monitoring of the Data Pipeline and error detection.
  • Real-Time Data Ingestion and Processing with Change Data Capture (CDC) for extraction, transformation, and loading in ETL applications (a conceptual CDC sketch follows).
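
A conceptual sketch of what CDC captures, using the python-mysql-replication library to tail the MySQL binlog; this stands in for the StreamSets CDC origin only as an illustration, and the connection settings and server_id are hypothetical:

```python
# Conceptual CDC illustration: read MySQL binlog change events (inserts,
# updates, deletes), the kind of change stream a CDC origin consumes.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

# Hypothetical connection settings for a MySQL instance with binlog enabled.
MYSQL_SETTINGS = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}

stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=100,                 # must be unique among replication clients
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,                 # keep tailing the binlog in real time
)

for event in stream:
    for row in event.rows:
        # Each row carries the changed values, e.g. row["values"] for inserts.
        print(event.schema, event.table, type(event).__name__, row)

stream.close()
```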
 

Business Challenge for Building the Data Pipeline

  • To create a Real-Time Twitter Stream into Amazon Redshift Cluster.
  • Build a Data Pipeline to migrate data from MySQL to Amazon Redshift using Kafka.
  • Implement Change Data Capture Mechanism to capture changes in any data source.
  • Build a Data Pipeline to fetch Google Analytics Data and send the stream to Amazon Redshift.

Solution Offered for Building the Ingestion Platform

  • Perform Data Ingestion using StreamSets Data Collector, which streams data in Real-Time.

  • For streaming data to Amazon Redshift, there are two ways (a minimal sketch follows this list) -

    • Using Connection Pool - Use the JDBC Producer as the destination, with the Redshift connection string to connect to the Redshift cluster.
    • Using Kinesis Firehose Stream - Configure a Kinesis Firehose stream first, which uses an Amazon S3 bucket as intermediate storage and a COPY command to transfer the data into the Amazon Redshift Cluster.
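
A minimal Python sketch of the Kinesis Firehose route, with the JDBC Producer connection string noted in a comment; the delivery stream name, region, and record fields are assumptions, not values from the use case:

```python
import json

import boto3

# Assumption: a Firehose delivery stream named "tweets-to-redshift" already
# exists, configured with an S3 bucket as intermediate storage and a Redshift
# COPY command on the Redshift destination.
firehose = boto3.client("firehose", region_name="us-east-1")

record = {"tweet_id": "931234567890", "text": "hello", "created_at": "2017-11-20T10:00:00Z"}

# Firehose buffers records into the S3 bucket, then issues the configured COPY
# command to load them into the Amazon Redshift Cluster.
firehose.put_record(
    DeliveryStreamName="tweets-to-redshift",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)

# For the Connection Pool route, the JDBC Producer destination instead takes a
# Redshift connection string of the form
#   jdbc:redshift://<cluster-endpoint>:5439/<database>
# together with the database credentials configured in the stage.
```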

Building Data Flow Pipeline

StreamSets Data Collector contains connectors to many systems acting as origins or destinations, covering not only traditional sources such as relational databases and files but also Kafka, HDFS, and cloud tools. It also provides a graphical interface for building pipelines, divided into four sections (an illustrative outline follows this list):

  • Data Acquisition
  • Data Transformation
  • Data Storage
  • Data Flow Triggers
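
For orientation, an illustrative Python outline of how these four sections could map onto Data Collector stages in this use case; this is not the StreamSets pipeline export format, and the stage choices are assumptions drawn from the steps listed below:

```python
# Illustrative only: not the StreamSets pipeline export schema. Stage labels
# follow the Data Collector UI; the exact selection depends on the pipeline.
pipeline_outline = {
    "data_acquisition": ["Kafka Consumer"],        # or an HTTP Client origin for the Twitter APIs
    "data_transformation": [
        "Field Remover",         # discard unwanted fields
        "Expression Evaluator",  # modify or derive field values
        "Stream Selector",       # route records to different streams
    ],
    "data_storage": ["JDBC Producer"],             # e.g. the Amazon Redshift destination
    "data_flow_triggers": ["Pipeline Finisher Executor"],  # events that stop or hand off the flow
}
```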

Steps to Build Data Flow Pipeline using StreamSets

  • StreamSets Data Collector Installation
  • Creation of a Java Database Connectivity (JDBC) connection
  • Create the Data Flow Pipeline
  • Discard Unwanted Fields from the Pipeline
  • Modification of Fields through the Expression Evaluator
  • Stream Selector to route data to different streams
  • View Data Pipeline States and Statistics
  • Automate through Data Collector Logs and Pipeline History (see the sketch below)
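
As a sketch of the last two steps, pipeline states, statistics, and metrics can also be read over the Data Collector REST API for automation. The port, credentials, endpoint paths, and response field names below are assumptions for a default Data Collector 3.x install and may differ by version:

```python
import requests

SDC = "http://localhost:18630/rest/v1"   # default Data Collector port (assumption)
AUTH = ("admin", "admin")                # default credentials (assumption)
HEADERS = {"X-Requested-By": "sdc"}      # header SDC expects on state-changing calls

# View Data Pipeline states: list pipelines and read the status of the first one.
pipelines = requests.get(f"{SDC}/pipelines", auth=AUTH, headers=HEADERS).json()
pipeline_id = pipelines[0]["pipelineId"]  # field name may vary by version
status = requests.get(f"{SDC}/pipeline/{pipeline_id}/status", auth=AUTH, headers=HEADERS).json()
print(status.get("status"))               # e.g. RUNNING, STOPPED

# Automate: start the pipeline and pull its metrics for monitoring.
requests.post(f"{SDC}/pipeline/{pipeline_id}/start", auth=AUTH, headers=HEADERS)
metrics = requests.get(f"{SDC}/pipeline/{pipeline_id}/metrics", auth=AUTH, headers=HEADERS).json()
```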

Advantages of StreamSets

  • Efficient Pipeline Development
  • Pipeline-based Data Ingestion
  • Change Data Capture
  • Continuous Data Integration
  • Timely Data Delivery
  • Detection of Anomalies at every stage throughout the pipeline

Download the Use Case

Download Now and Get Access to the detailed Use Case


Request for Services

Find out more about how your Enterprise can streamline Data Operations and enable effective Management.
