Unified Data Ingestion Solution - Apache Gobblin

What is Apache Gobblin?

Apache Gobblin is a unified data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources. It can ingest data from different sources in the same execution framework and manages the metadata of those sources in one place.

What are the components of Apache Gobblin?

Gobblin provides six different component interfaces, so it's easy to scale and customize development:

Source

The Source is primarily responsible for partitioning the source data into a series of work units and for indicating which extractor corresponds to each.

Extractor

The Extractor reads the data that a work unit describes. For a Kafka source, for example, the work unit indicates the starting offset of each partition in the topic for this extraction. Gobblin uses the concept of a watermark to record the starting position of each extraction.
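
The watermark idea can be pictured with a small, self-contained sketch. It is an illustration under stated assumptions, not Gobblin's Extractor API: the class WatermarkSketch, the method extract, and the in-memory list standing in for one Kafka partition are all hypothetical. Each run starts at the low watermark left by the previous run and reports the offset where the next run should resume.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Gobblin's Extractor API.
final class WatermarkSketch {

    /** Result of one extraction run: the records pulled and the new high watermark. */
    static final class Extraction {
        final List<String> records;
        final long highWatermark; // offset at which the next run should start

        Extraction(List<String> records, long highWatermark) {
            this.records = records;
            this.highWatermark = highWatermark;
        }
    }

    /**
     * Reads up to maxRecords entries from the partition, starting at lowWatermark
     * (assumed non-negative), and reports the offset the next run should resume from.
     */
    static Extraction extract(List<String> partition, long lowWatermark, int maxRecords) {
        int start = (int) Math.min(lowWatermark, partition.size());
        int end = Math.min(start + maxRecords, partition.size());
        List<String> pulled = new ArrayList<>(partition.subList(start, end));
        return new Extraction(pulled, end);
    }

    public static void main(String[] args) {
        List<String> partition = Arrays.asList("a", "b", "c", "d", "e");
        Extraction first = extract(partition, 0, 2);
        // The second run resumes where the first one stopped.
        Extraction second = extract(partition, first.highWatermark, 2);
        System.out.println(first.records + " -> next offset " + first.highWatermark);
        System.out.println(second.records + " -> next offset " + second.highWatermark);
    }
}
```

Persisting the high watermark between runs is what lets a later job continue without re-reading data; where that state is stored is outside the scope of this sketch.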

Converter

The Converter performs filtering and conversion operations on the extracted data, such as converting byte arrays or JSON data into the required output format. A conversion can also map one record to zero or more records.
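
A minimal sketch of that one-to-zero-or-more mapping follows. It does not use Gobblin's Converter classes; the RecordConverter interface, the two sample converters, and applyChain are hypothetical names chosen for illustration. Records are plain strings, and the converters are applied in list order, each one consuming the output of the previous one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; this is not Gobblin's Converter API.
final class ConverterSketch {

    /** A converter maps one input record to zero or more output records. */
    interface RecordConverter {
        List<String> convert(String record);
    }

    /** Drops blank records (one-to-zero) and passes the rest through trimmed (one-to-one). */
    static final RecordConverter DROP_BLANK =
        record -> record.trim().isEmpty()
            ? Collections.<String>emptyList()
            : Collections.singletonList(record.trim());

    /** Splits a comma-separated record into one record per field (one-to-many). */
    static final RecordConverter SPLIT_ON_COMMA =
        record -> Arrays.asList(record.split(","));

    /** Applies the converters in the given order, feeding each output into the next. */
    static List<String> applyChain(List<RecordConverter> chain, List<String> input) {
        List<String> current = input;
        for (RecordConverter converter : chain) {
            List<String> next = new ArrayList<>();
            for (String record : current) {
                next.addAll(converter.convert(record));
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> out = applyChain(
            Arrays.asList(DROP_BLANK, SPLIT_ON_COMMA),
            Arrays.asList("a,b", "   ", "c"));
        System.out.println(out); // prints [a, b, c]
    }
}
```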

Quality Checker

The Quality Checker applies two kinds of policies: record-level and task-level. Depending on whether a policy is standard or optional, data that fails a check is written to an external file or triggers a warning.

Writer

The Writer does not write the converted data directly to the final output location; it writes to a staging directory first. Once all the data has been written, the Publisher moves it to the output path. The sink can be HDFS, Kafka, or Amazon S3, and the format can be Avro, Parquet, or CSV. The Writer can also place output files in directories named by hour or by day, according to the record timestamp.
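
The hour or day partitioning can be sketched as a function from a record timestamp to a sub-directory. The directory patterns below (hourly/yyyy/MM/dd/HH and daily/yyyy/MM/dd), the UTC time zone, and the names TimePartitionSketch and partitionPath are assumptions made for this example, not Gobblin defaults.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only; the directory patterns are assumptions, not Gobblin defaults.
final class TimePartitionSketch {

    enum Granularity { HOUR, DAY }

    private static final DateTimeFormatter HOURLY =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);
    private static final DateTimeFormatter DAILY =
        DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

    /** Returns the sub-directory that a record with the given timestamp would be written to. */
    static String partitionPath(long epochMillis, Granularity granularity) {
        Instant timestamp = Instant.ofEpochMilli(epochMillis);
        return granularity == Granularity.HOUR
            ? "hourly/" + HOURLY.format(timestamp)
            : "daily/" + DAILY.format(timestamp);
    }

    public static void main(String[] args) {
        long ts = 1_700_000_000_000L; // 2023-11-14T22:13:20Z
        System.out.println(partitionPath(ts, Granularity.HOUR)); // hourly/2023/11/14/22
        System.out.println(partitionPath(ts, Granularity.DAY));  // daily/2023/11/14
    }
}
```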

Publisher

The Publisher moves the data from the path written by the Writer to the final output path. It provides two commit mechanisms: full commit and partial commit. With a full commit, data is published only after the tasks succeed. With a partial commit, some data may already have been published to the output directory when a task fails.

Why Apache Gobblin?

Apache Gobblin is a generic data ingestion framework that is easily configured to ingest data from many different types of sources and easily extended to support new ones. Gobblin handles the routine tasks common to all data ingestion ETLs, including job and task scheduling, task partitioning, error handling, state management, data quality checking, and data publishing. It ingests data from different sources in the same execution framework and manages the metadata of those sources in one place. Gobblin's features include:

Auto scalability
Fault tolerance
Data quality assurance
Extensibility
Handling data model evolution

These features make Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.

Some challenges Gobblin addresses

Source integration - Gobblin provides out-of-the-box adaptors for commonly accessed data sources such as S3, Kafka, Google Analytics, MySQL, and Salesforce.
Processing paradigm - Gobblin supports both standalone and scalable platforms, including Hadoop and YARN. Running on YARN makes continuous ingestion possible in addition to scheduled batches.
Extensibility - Developers can integrate their own adaptors with the Gobblin framework and make them available to other developers in the community.
Self-service - Data ingestion and transformation flows can be composed in a self-service manner, tested locally in standalone mode, and deployed to production in scale-out mode without code changes.

How Does Apache Gobblin Work?

A Gobblin job ingests data from a data source into a sink. A job may consist of multiple tasks, or work units, each of which represents a unit of work to be done.

Guide to Computation Model

Gobblin Standalone

Single process, multi-threading
Testing, small data, sampling
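
The "single process, multi-threading" model can be pictured as a thread pool that runs one task per work unit. The sketch below uses only java.util.concurrent and is not Gobblin code: StandaloneSketch and runAll are hypothetical names, work units are modeled as lists of strings, and counting records stands in for the real extract, convert, and write steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; this is not how Gobblin's standalone launcher is implemented.
final class StandaloneSketch {

    /** Runs every work unit as a task on a fixed thread pool and returns per-unit record counts. */
    static List<Integer> runAll(List<List<String>> workUnits, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (List<String> unit : workUnits) {
                tasks.add(() -> unit.size()); // stand-in for extract -> convert -> write
            }
            List<Integer> counts = new ArrayList<>();
            // invokeAll returns futures in the same order as the submitted tasks.
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                counts.add(future.get());
            }
            return counts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> workUnits = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("c"),
            new ArrayList<String>());
        System.out.println(runAll(workUnits, 2)); // prints [2, 1, 0]
    }
}
```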

Gobblin on MapReduce

Large datasets, horizontally scalable

Gobblin on YARN

Better resource utilization
More scheduling flexibility

Sources Overview

Determines how to partition work
Partitioning algorithm can leverage source sharding
Group partitions intelligently for performance
Creates work units to be scheduled
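
One simple way to "group partitions intelligently" is greedy packing: take the largest partition first and always add it to the currently lightest group. This is a generic heuristic shown for illustration, and it is an assumption that a given Gobblin source groups partitions this way; WorkUnitPackingSketch and pack are hypothetical names, and partitions are represented only by their sizes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; a real Gobblin source may group partitions differently.
final class WorkUnitPackingSketch {

    /** Greedy packing: largest partition first, always into the currently lightest group. */
    static List<List<Long>> pack(List<Long> partitionSizes, int groups) {
        if (groups < 1) {
            throw new IllegalArgumentException("groups must be at least 1");
        }
        List<List<Long>> result = new ArrayList<>();
        long[] load = new long[groups];
        for (int i = 0; i < groups; i++) {
            result.add(new ArrayList<>());
        }
        List<Long> sorted = new ArrayList<>(partitionSizes);
        sorted.sort(Comparator.reverseOrder());
        for (long size : sorted) {
            int lightest = 0;
            for (int i = 1; i < groups; i++) {
                if (load[i] < load[lightest]) {
                    lightest = i;
                }
            }
            result.get(lightest).add(size);
            load[lightest] += size;
        }
        return result;
    }

    public static void main(String[] args) {
        // Five partitions packed into two work-unit groups.
        System.out.println(pack(Arrays.asList(50L, 10L, 40L, 20L, 30L), 2));
        // prints [[50, 20, 10], [40, 30]]
    }
}
```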

Guide to Job Management

Job execution states
Watermark
Job state, Task state, quality checker output, error code
Job synchronization
Job failure handling: policy-driven

Extractors Overview

Specifies how to pull data from the source and get the schema
Returns a ResultSet iterator
Tracks the high watermark
Tracks extraction metrics

Guide to Converters

Converters allow for schema and data transformation:

Filtering
Projection
Type conversion
Structural change

Composable - converters can be specified as a list and are applied in the given order.

Quality Checkers Process

Ensure the quality of any data produced by Gobblin
Can be run on a per-task, per-record, or per-job basis
Can define a list of quality checkers to be used, for example:
  Schema compatibility
  Audit check
  Sensitive fields
  Unique key
Policy-driven:
  FAIL - when the check fails, so does the job
  OPTIONAL - when the check fails, the job continues
  ERR_FILE - the offending row is written to an error file
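
The three policy outcomes can be sketched as a small row-level check. This is an illustration under stated assumptions rather than Gobblin's quality-checker classes: QualityPolicySketch, Outcome, and check are hypothetical names, rows are plain strings, and it is assumed here that a row failing an OPTIONAL check still continues downstream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; this is not Gobblin's quality-checker API.
final class QualityPolicySketch {

    enum PolicyType { FAIL, OPTIONAL, ERR_FILE }

    /** Rows that continue downstream, plus rows diverted to the error file. */
    static final class Outcome {
        final List<String> accepted = new ArrayList<>();
        final List<String> errorFile = new ArrayList<>();
    }

    /** Applies one rule to every row and handles failures according to the policy type. */
    static Outcome check(List<String> rows, Predicate<String> rule, PolicyType type) {
        Outcome outcome = new Outcome();
        for (String row : rows) {
            if (rule.test(row)) {
                outcome.accepted.add(row);
            } else if (type == PolicyType.FAIL) {
                // FAIL: the failed check fails the whole job.
                throw new IllegalStateException("Quality check failed for row: " + row);
            } else if (type == PolicyType.ERR_FILE) {
                // ERR_FILE: the offending row is set aside instead of being written.
                outcome.errorFile.add(row);
            } else {
                // OPTIONAL: the failure is tolerated (assumption: the row is kept).
                outcome.accepted.add(row);
            }
        }
        return outcome;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1", "x", "2");
        Predicate<String> isNumeric = row -> row.matches("\\d+");
        Outcome outcome = check(rows, isNumeric, PolicyType.ERR_FILE);
        System.out.println(outcome.accepted + " / " + outcome.errorFile); // prints [1, 2] / [x]
    }
}
```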

Writer Architecture

Writing data in Avro format onto HDFS
One writer per task
Flexibility
Configurable compression codec
Configurable buffer size

Publisher Overview

Determines job success based on policy:

COMMIT_ON_FULL_SUCCESS
COMMIT_ON_PARTIAL_SUCCESS

When the job succeeds, the publisher commits the data to the final directories.
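
The difference between the two commit policies can be sketched as a decision over per-task results. The enum constants mirror the policy names listed above; everything else (CommitPolicySketch, outputsToPublish, and the map from staging path to success flag) is a hypothetical model, not Gobblin's publisher implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Gobblin's publisher implementation.
final class CommitPolicySketch {

    enum CommitPolicy { COMMIT_ON_FULL_SUCCESS, COMMIT_ON_PARTIAL_SUCCESS }

    /**
     * Given each task's staging output and whether the task succeeded,
     * returns the staging outputs that would be moved to the final directory.
     */
    static List<String> outputsToPublish(Map<String, Boolean> taskSucceeded, CommitPolicy policy) {
        boolean allSucceeded = !taskSucceeded.containsValue(false);
        List<String> publish = new ArrayList<>();
        if (policy == CommitPolicy.COMMIT_ON_FULL_SUCCESS && !allSucceeded) {
            return publish; // one failed task blocks the whole commit
        }
        for (Map.Entry<String, Boolean> entry : taskSucceeded.entrySet()) {
            if (entry.getValue()) {
                publish.add(entry.getKey());
            }
        }
        return publish;
    }

    public static void main(String[] args) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        results.put("staging/task-0", true);
        results.put("staging/task-1", false);
        results.put("staging/task-2", true);
        System.out.println(outputsToPublish(results, CommitPolicy.COMMIT_ON_FULL_SUCCESS));
        // prints []
        System.out.println(outputsToPublish(results, CommitPolicy.COMMIT_ON_PARTIAL_SUCCESS));
        // prints [staging/task-0, staging/task-2]
    }
}
```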

How to use Apache Gobblin?

Self-serve - Users can create jobs programmatically through REST APIs or via a UI on any Gobblin deployment, leaving operations teams to focus only on deployment and upgrades.
Optimal resource usage - Users can submit jobs and leave it to Gobblin-as-a-Service (GaaS) to choose the executor instance and to compile a logical job or a single-tenant job based on resource and SLA constraints.
Failover and upgrades - The technology executing a job behind GaaS can be swapped out transparently during failover or upgrades, without user intervention.
Global state - Because GaaS unifies hybrid technology deployments, operations teams can easily monitor and manage the global state of the data landscape and lineage in their organization.

What are the best practices of Apache Gobblin?

When a subclass of EmbeddedGobblin is based on a template, its constructor should call setTemplate(myTemplate) so that the template is loaded automatically on construction. All configuration required for the job should be parsed from the constructor arguments, so that a user can run new MyEmbeddedGobblinExtension(params...).run() and get a sensible job run. Convenience methods should be added for the configurations users most commonly want to change. For example:

// Convenience method: sets CopySource.SIMULATE to "true" and returns this instance for chaining.
public EmbeddedGobblinDistcp simulate() {
    this.setConfiguration(CopySource.SIMULATE, Boolean.toString(true));
    return this;
}

If the job requires additional jars in the workers that are not part of the minimal Gobblin ingestion classpath, then the constructor should call distributeJar(myJar) for the additional jars.

What are the benefits of Apache Gobblin?

The benefits of Apache Gobblin are listed below:

Auto scalability
Fault tolerance
Data quality assurance
Extensibility
Handling data model evolution

Conclusion

Gobblin combines features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability to handle data model evolution. It is an easy-to-use, self-serving, and efficient data ingestion framework.
