Apache Gobblin Best Practices, Tools and Benefits

November 02, 2018 

What is Apache Gobblin?

Apache Gobblin is a unified data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources. It can ingest data from different data sources in the same execution framework and manages the metadata of those sources in one place. Combined with features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability to handle data model evolution, this makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.

Apache Gobblin Components

Gobblin provides six different component interfaces, which make it easy to scale the framework and customize development -

  • Source
  • Extractor
  • Converter
  • Quality checker
  • Writer
  • Publisher

Source - The source is primarily responsible for partitioning the ingestion work into a series of work units and indicating what the corresponding extractor for each of them is.
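
For illustration, a minimal custom source might sketch Gobblin's Source interface as follows; MySource, MyExtractor, and the namespace/table names are placeholders, not part of Gobblin itself.

import java.util.Collections;
import java.util.List;

import org.apache.gobblin.configuration.SourceState;
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.source.Source;
import org.apache.gobblin.source.extractor.Extractor;
import org.apache.gobblin.source.workunit.Extract;
import org.apache.gobblin.source.workunit.WorkUnit;

public class MySource implements Source<String, String> {

  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    // Partition the ingestion work; a real source would create one
    // work unit per table, topic partition, date range, etc.
    Extract extract = state.createExtract(
        Extract.TableType.SNAPSHOT_ONLY, "example_namespace", "example_table");
    return Collections.singletonList(state.createWorkUnit(extract));
  }

  @Override
  public Extractor<String, String> getExtractor(WorkUnitState state) {
    // Indicate which extractor handles this work unit.
    return new MyExtractor(state);
  }

  @Override
  public void shutdown(SourceState state) {
    // Release any connections the source holds.
  }
}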

Extractor - The extractor reads the data source information specified through the work unit; for Kafka, for example, the work unit indicates the starting offset of each partition in the topic to be used for this extraction. Gobblin uses the concept of a watermark to record the starting position of each extraction.
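
A matching extractor could implement the Extractor interface roughly like this; MyExtractor and the single hard-coded record are illustrative assumptions.

import java.io.IOException;

import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.source.extractor.DataRecordException;
import org.apache.gobblin.source.extractor.Extractor;

public class MyExtractor implements Extractor<String, String> {

  private boolean done = false;

  public MyExtractor(WorkUnitState state) {
    // A real extractor would read the low watermark (e.g. a Kafka
    // offset) from the work unit state here and seek to it.
  }

  @Override
  public String getSchema() {
    return "string"; // schema describing the pulled records
  }

  @Override
  public String readRecord(String reuse) throws DataRecordException, IOException {
    // Return records one at a time; null signals the end of the data.
    if (done) {
      return null;
    }
    done = true;
    return "hello, gobblin";
  }

  @Override
  public long getExpectedRecordCount() {
    return 1;
  }

  @Override
  public long getHighWatermark() {
    // Report how far extraction got so the next run can resume.
    return 0;
  }

  @Override
  public void close() throws IOException {
    // Release the connection to the data source.
  }
}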

Converter - The converter performs filtering and conversion operations on the extracted data, such as converting byte arrays or JSON data into the format that needs to be written out. A conversion operation can also map one input record to zero or more output records.
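
As a sketch, a converter that upper-cases string records while leaving the schema untouched could look like this; UpperCaseConverter is an illustrative name, and returning an empty Iterable from convertRecord is how a converter filters a record out.

import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.SingleRecordIterable;

public class UpperCaseConverter extends Converter<String, String, String, String> {

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit) {
    // This conversion does not change the schema.
    return inputSchema;
  }

  @Override
  public Iterable<String> convertRecord(String outputSchema, String inputRecord,
      WorkUnitState workUnit) {
    // Map one input record to exactly one output record.
    return new SingleRecordIterable<>(inputRecord.toUpperCase());
  }
}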

Quality Checker - The quality checker validates data using two types of checkers: record-level and task-level policies. Depending on whether a policy is mandatory or optional, records that fail a check are written to an external error file or only reported as a warning.
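
In the job configuration this is expressed with policy lists; the sketch below uses Gobblin's bundled RowCountPolicy at the task level, while com.example.MyRowPolicy and the error-file path are placeholders.

# Task-level policies, evaluated once per task
qualitychecker.task.policies=org.apache.gobblin.policies.count.RowCountPolicy
qualitychecker.task.policy.types=OPTIONAL

# Row-level policies, evaluated per record; failing rows go to an error file
qualitychecker.row.policies=com.example.MyRowPolicy
qualitychecker.row.policy.types=ERR_FILE
qualitychecker.row.err.file=/gobblin/err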

Writer - The writer writes out the converted data, but not directly to the output path: records first go to a staging directory, and once all the data has been written they are moved to the output path for publishing by the publisher. The sink can be HDFS, Kafka, or S3, and the format can be Avro, Parquet, or CSV. The writer can also place output files in directories named by "hour" or "day" according to the record timestamp.
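
A typical writer configuration might look like the following sketch; the staging and output directories are placeholders.

writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=/gobblin/task-staging
writer.output.dir=/gobblin/task-output
# Optional compression codec for the Avro output
writer.codec.type=deflate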

Publisher - The publisher takes the data from the paths written by the writer and outputs it to the final path. It provides two commit mechanisms: full commit and partial commit. With a full commit, publishing waits until the whole job succeeds; in partial commit mode, some data in the directory may already have been published when a task fails.
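
The corresponding publisher settings might be sketched as follows; the final directory is a placeholder, and "full" and "partial" select the two commit policies described above.

data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=/gobblin/job-output
# "full" = COMMIT_ON_FULL_SUCCESS, "partial" = COMMIT_ON_PARTIAL_SUCCESS
job.commit.policy=full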


Why Apache Gobblin?

Apache Gobblin is a generic data ingestion framework, which is easily configurable to ingest data from several different types of sources and easily extensible to new data sources. Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, etc. It ingests data from different data sources in the same execution framework and manages the metadata of various data sources all in one place. Gobblin features -

  • Auto scalability
  • Fault tolerance
  • Data quality assurance
  • Extensibility
  • Handling data model evolution

These features make Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.

Some challenges Gobblin addresses

  • Source integration - Gobblin provides out-of-the-box adaptors for commonly accessed data sources such as S3, Kafka, Google Analytics, MySQL, and Salesforce.
  • Processing paradigm - It supports both standalone and scalable platforms, including YARN and Hadoop. YARN gives the capability to run continuous ingestion in addition to scheduled batches.
  • Extensibility - Developers can integrate their own adaptors with the Gobblin framework and make them leverageable for other developers in the community.
  • Self-service - Gobblin supports a standalone mode, so a data ingestion and transformation flow can be composed in a self-service manner, tested locally in standalone mode, and deployed in production in scale-out mode without code changes.

How Apache Gobblin Works?

A Gobblin job ingests data from a data source into a sink. A job may consist of multiple tasks, or work units, each of which represents a unit of work to be done.
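
A minimal job configuration that wires these pieces together might look like the sketch below, modeled on the Wikipedia example that ships with Gobblin; the job name and group are arbitrary.

job.name=PullFromWikipedia
job.group=Wikipedia
source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter
extract.namespace=wikipedia
writer.builder.class=org.apache.gobblin.writer.SimpleDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=txt
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher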

Guide to Computation Model

Gobblin Standalone

  • Single process, multi-threading
  • Testing, small data, sampling

Gobblin on Map/Reduce

  • Large datasets, horizontally scalable

Gobblin on Yarn

  • Better resource utilization
  • More scheduling flexibilities

Sources Overview

  • Determines how to partition work
  • Partitioning algorithm can leverage source sharding
  • Group partitions intelligently for performance
  • Creates work units to be scheduled

Guide to Job Management

  • Job execution states
  • Watermark
  • Job state, Task state, quality checker output, error code
  • Job synchronization
  • Job failure handling: policy-driven

Extractors Overview

  • Specifies how to pull data from the source and how to get the schema
  • Returns a ResultSet iterator
  • Tracks the high watermark
  • Tracks extraction metrics

Guide to Converters

Allow for schema and data transformation

  • Filtering
  • Projection
  • Type conversion
  • Structural change

Composable - Can be specified as a list of converters to be applied in the given order.
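
In the job file this is simply a comma-separated, ordered list; for example (the second class refers to the illustrative converter sketched earlier):

converter.classes=org.apache.gobblin.converter.avro.JsonIntermediateToAvroConverter,com.example.UpperCaseConverter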

Quality Checkers Process

  • Ensure the quality of any data produced by Gobblin
  • Can be run on a per task, per record, or per job basis
  • Can define a list of quality checkers to be used
  • Schema compatibility
  • Audit check
  • Sensitive fields
  • Unique key
  • Policy-driven
  • FAIL - if the check fails, then so does the job
  • OPTIONAL - if the check fails, the job continues
  • ERR_FILE - the offending row is written to an error file

Writer Architecture

  • Writing data in Avro format onto HDFS
  • One writer per task
  • Flexibility
  • Configurable compression codec
  • Configurable buffer size

Publisher Overview

Determines job success based on the commit policy -

  • COMMIT_ON_FULL_SUCCESS
  • COMMIT_ON_PARTIAL_SUCCESS

When the job succeeds, the publisher commits data to the final directories.


How to Use Apache Gobblin?

  • Self-serve - Users can create jobs programmatically through REST APIs or via a UI on any Gobblin deployment, leaving operations to focus only on deployment and upgrades.
  • Optimal resource usage - Users can submit jobs and leave it to Gobblin-as-a-Service (GaaS) to optimally choose an executor instance and compile the logical job into single-tenant jobs based on resource and SLA constraints.
  • Failover and upgrades - The technology executing the job behind GaaS can be transparently swapped out in case of failover or upgrades, without user intervention.
  • Global state - The unifying role of GaaS across hybrid technology deployments enables operations teams to easily monitor and manage the global state of the data landscape and lineage in their organization.

Apache Gobblin Best Practices

A child class of EmbeddedGobblin should be based on a template. The constructor should call setTemplate(myTemplate), so that the template is automatically loaded on construction.

All required configurations for a job should be derived from the constructor arguments.

A user should be able to run new MyEmbeddedGobblinExtension(params...).run() and get a sensible job run.

Convenience methods should be added for the most common configurations users would want to change. For example -


public EmbeddedGobblinDistcp simulate() {
  this.setConfiguration(CopySource.SIMULATE, Boolean.toString(true));
  return this;
}

If the job requires additional jars in the workers that are not part of the minimal Gobblin ingestion classpath, then the constructor should call distributeJar(myJar) for the additional jars.
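
Putting the pieces together, running an embedded distcp job end-to-end could look like this sketch; the HDFS paths are placeholders.

import org.apache.gobblin.runtime.api.JobExecutionResult;
import org.apache.gobblin.runtime.embedded.EmbeddedGobblinDistcp;
import org.apache.hadoop.fs.Path;

public class DistcpExample {
  public static void main(String[] args) throws Exception {
    // Copy from one HDFS path to another; simulate() makes the job
    // only report what it would copy instead of copying it.
    EmbeddedGobblinDistcp distcp =
        new EmbeddedGobblinDistcp(
            new Path("hdfs://source/data"), new Path("hdfs://target/data"))
        .simulate();
    JobExecutionResult result = distcp.run();
    System.out.println("Succeeded: " + result.isSuccessful());
  }
}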


Apache Gobblin Benefits

  • Auto scalability
  • Fault tolerance
  • Data quality assurance
  • Extensibility
  • Handling data model evolution

Key Apache Gobblin Tools