Apache Gobblin Best Practices, Tools and Benefits

November 02, 2018 

What is Apache Gobblin?

Apache Gobblin is a unified data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources. It can ingest data from different data sources in the same execution framework and manages the metadata of those sources in one place. Combined with features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability to handle data model evolution, this makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.

Apache Gobblin Components

Gobblin provides six different component interfaces, which make it easy to scale the framework and customize development -

  • Source
  • Extractor
  • Converter
  • Quality checker
  • Writer
  • Publisher

Source - The source is primarily responsible for partitioning the ingestion work into a series of work units and indicating what the corresponding extractor for each of them is.
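
For illustration, a minimal custom source might sketch Gobblin's Source interface as follows; MySource, MyExtractor, and the namespace/table names are placeholders, not part of Gobblin itself.

import java.util.Collections;
import java.util.List;

import org.apache.gobblin.configuration.SourceState;
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.source.Source;
import org.apache.gobblin.source.extractor.Extractor;
import org.apache.gobblin.source.workunit.Extract;
import org.apache.gobblin.source.workunit.WorkUnit;

public class MySource implements Source<String, String> {

  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    // Partition the ingestion work; a real source would create one
    // work unit per table, topic partition, date range, etc.
    Extract extract = state.createExtract(
        Extract.TableType.SNAPSHOT_ONLY, "example_namespace", "example_table");
    return Collections.singletonList(state.createWorkUnit(extract));
  }

  @Override
  public Extractor<String, String> getExtractor(WorkUnitState state) {
    // Indicate which extractor handles this work unit.
    return new MyExtractor(state);
  }

  @Override
  public void shutdown(SourceState state) {
    // Release any connections the source holds.
  }
}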

Extractor - The extractor reads the data source information specified through the work unit; for Kafka, for example, the work unit indicates the starting offset of each partition in the topic to be used for this extraction. Gobblin uses the concept of a watermark to record the starting position of each extraction.
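
A matching extractor could implement the Extractor interface roughly like this; MyExtractor and the single hard-coded record are illustrative assumptions.

import java.io.IOException;

import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.source.extractor.DataRecordException;
import org.apache.gobblin.source.extractor.Extractor;

public class MyExtractor implements Extractor<String, String> {

  private boolean done = false;

  public MyExtractor(WorkUnitState state) {
    // A real extractor would read the low watermark (e.g. a Kafka
    // offset) from the work unit state here and seek to it.
  }

  @Override
  public String getSchema() {
    return "string"; // schema describing the pulled records
  }

  @Override
  public String readRecord(String reuse) throws DataRecordException, IOException {
    // Return records one at a time; null signals the end of the data.
    if (done) {
      return null;
    }
    done = true;
    return "hello, gobblin";
  }

  @Override
  public long getExpectedRecordCount() {
    return 1;
  }

  @Override
  public long getHighWatermark() {
    // Report how far extraction got so the next run can resume.
    return 0;
  }

  @Override
  public void close() throws IOException {
    // Release the connection to the data source.
  }
}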

Converter - The converter performs filtering and conversion operations on the extracted data, such as converting byte arrays or JSON data into the format that needs to be written out. A conversion operation can also map one input record to zero or more output records.
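
As a sketch, a converter that upper-cases string records while leaving the schema untouched could look like this; UpperCaseConverter is an illustrative name, and returning an empty Iterable from convertRecord is how a converter filters a record out.

import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.SingleRecordIterable;

public class UpperCaseConverter extends Converter<String, String, String, String> {

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit) {
    // This conversion does not change the schema.
    return inputSchema;
  }

  @Override
  public Iterable<String> convertRecord(String outputSchema, String inputRecord,
      WorkUnitState workUnit) {
    // Map one input record to exactly one output record.
    return new SingleRecordIterable<>(inputRecord.toUpperCase());
  }
}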

Quality Checker - The quality checker validates data using two types of checkers: record-level and task-level policies. Depending on whether a policy is mandatory or optional, records that fail a check are written to an external error file or only reported as a warning.
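
In the job configuration this is expressed with policy lists; the sketch below uses Gobblin's bundled RowCountPolicy at the task level, while com.example.MyRowPolicy and the error-file path are placeholders.

# Task-level policies, evaluated once per task
qualitychecker.task.policies=org.apache.gobblin.policies.count.RowCountPolicy
qualitychecker.task.policy.types=OPTIONAL

# Row-level policies, evaluated per record; failing rows go to an error file
qualitychecker.row.policies=com.example.MyRowPolicy
qualitychecker.row.policy.types=ERR_FILE
qualitychecker.row.err.file=/gobblin/err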

Writer - The writer writes out the converted data, but not directly to the output path: records first go to a staging directory, and once all the data has been written they are moved to the output path for publishing by the publisher. The sink can be HDFS, Kafka, or S3, and the format can be Avro, Parquet, or CSV. The writer can also place output files in directories named by "hour" or "day" according to the record timestamp.
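
A typical writer configuration might look like the following sketch; the staging and output directories are placeholders.

writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=/gobblin/task-staging
writer.output.dir=/gobblin/task-output
# Optional compression codec for the Avro output
writer.codec.type=deflate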

Publisher - The publisher takes the data from the paths written by the writer and outputs it to the final path. It provides two commit mechanisms: full commit and partial commit. With a full commit, publishing waits until the whole job succeeds; in partial commit mode, some data in the directory may already have been published when a task fails.
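
The corresponding publisher settings might be sketched as follows; the final directory is a placeholder, and "full" and "partial" select the two commit policies described above.

data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=/gobblin/job-output
# "full" = COMMIT_ON_FULL_SUCCESS, "partial" = COMMIT_ON_PARTIAL_SUCCESS
job.commit.policy=full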


Why Apache Gobblin?

Apache Gobblin is a generic data ingestion framework, which is easily configurable to ingest data from several different types of sources and easily extensible to new data sources. Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, etc. It ingests data from different data sources in the same execution framework and manages the metadata of various data sources all in one place. Gobblin features -

  • Auto scalability
  • Fault tolerance
  • Data quality assurance
  • Extensibility
  • Handling data model evolution

These features make Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.

Some challenges Gobblin addresses

  • Source integration - Gobblin provides out-of-the-box adaptors for commonly accessed data sources such as S3, Kafka, Google Analytics, MySQL, and Salesforce.
  • Processing paradigm - It supports both standalone and scalable platforms, including YARN and Hadoop. YARN gives the capability to run continuous ingestion in addition to scheduled batches.
  • Extensibility - Developers can integrate their own adaptors with the Gobblin framework and make them leverageable for other developers in the community.
  • Self-service - Gobblin supports a standalone mode, so a data ingestion and transformation flow can be composed in a self-service manner, tested locally in standalone mode, and deployed in production in scale-out mode without code changes.

How Apache Gobblin Works?

A Gobblin job ingests data from a data source into a sink. A job may consist of multiple tasks, or work units, each of which represents a unit of work to be done.
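
A minimal job configuration that wires these pieces together might look like the sketch below, modeled on the Wikipedia example that ships with Gobblin; the job name and group are arbitrary.

job.name=PullFromWikipedia
job.group=Wikipedia
source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter
extract.namespace=wikipedia
writer.builder.class=org.apache.gobblin.writer.SimpleDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=txt
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher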

Guide to Computation Model

Gobblin Standalone

  • Single process, multi-threading
  • Testing, small data, sampling

Gobblin on Map/Reduce

  • Large datasets, horizontally scalable

Gobblin on Yarn

  • Better resource utilization
  • More scheduling flexibilities

Sources Overview

  • Determines how to partition work
  • Partitioning algorithm can leverage source sharding
  • Group partitions intelligently for performance
  • Creates work units to be scheduled

Guide to Job Management

  • Job execution states
  • Watermark
  • Job state, Task state, quality checker output, error code
  • Job synchronization
  • Job failure handling: policy-driven

Extractors Overview

  • Specifies how to pull data from the source and how to get the schema
  • Returns a ResultSet iterator
  • Tracks the high watermark
  • Tracks extraction metrics

Guide to Converters

Allow for schema and data transformation

  • Filtering
  • Projection
  • Type conversion
  • Structural change

Composable - Can be specified as a list of converters to be applied in the given order.
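
In the job file this is simply a comma-separated, ordered list; for example (the second class refers to the illustrative converter sketched earlier):

converter.classes=org.apache.gobblin.converter.avro.JsonIntermediateToAvroConverter,com.example.UpperCaseConverter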

Quality Checkers Process

  • Ensure the quality of any data produced by Gobblin
  • Can be run on a per task, per record, or per job basis
  • Can define a list of quality checkers to be used
  • Schema compatibility
  • Audit check
  • Sensitive fields
  • Unique key
  • Policy-driven
  • FAIL - if the check fails, then so does the job
  • OPTIONAL - if the check fails, the job continues
  • ERR_FILE - the offending row is written to an error file

Writer Architecture

  • Writing data in Avro format onto HDFS
  • One writer per task
  • Flexibility
  • Configurable compression codec
  • Configurable buffer size

Publisher Overview

Determines job success based on the commit policy -

  • COMMIT_ON_FULL_SUCCESS
  • COMMIT_ON_PARTIAL_SUCCESS

When the job succeeds, the publisher commits data to the final directories.


How to Use Apache Gobblin?

  • Self-serve - Users can create jobs programmatically through REST APIs or via a UI on any Gobblin deployment, leaving operations to focus only on deployment and upgrades.
  • Optimal resource usage - Users can submit jobs and leave it to Gobblin-as-a-Service (GaaS) to optimally choose an executor instance and compile the logical job into single-tenant jobs based on resource and SLA constraints.
  • Failover and upgrades - The technology executing the job behind GaaS can be transparently swapped out in case of failover or upgrades, without user intervention.
  • Global state - The unifying role of GaaS across hybrid technology deployments enables operations teams to easily monitor and manage the global state of the data landscape and lineage in their organization.

Apache Gobblin Best Practices

A child class of EmbeddedGobblin should be based on a template. The constructor should call setTemplate(myTemplate), so that the template is automatically loaded on construction.

All required configurations for a job should be derived from the constructor arguments.

A user should be able to run new MyEmbeddedGobblinExtension(params...).run() and get a sensible job run.

Convenience methods should be added for the most common configurations users would want to change. For example -


public EmbeddedGobblinDistcp simulate() {
  this.setConfiguration(CopySource.SIMULATE, Boolean.toString(true));
  return this;
}

If the job requires additional jars in the workers that are not part of the minimal Gobblin ingestion classpath, then the constructor should call distributeJar(myJar) for the additional jars.
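
Putting the pieces together, running an embedded distcp job end-to-end could look like this sketch; the HDFS paths are placeholders.

import org.apache.gobblin.runtime.api.JobExecutionResult;
import org.apache.gobblin.runtime.embedded.EmbeddedGobblinDistcp;
import org.apache.hadoop.fs.Path;

public class DistcpExample {
  public static void main(String[] args) throws Exception {
    // Copy from one HDFS path to another; simulate() makes the job
    // only report what it would copy instead of copying it.
    EmbeddedGobblinDistcp distcp =
        new EmbeddedGobblinDistcp(
            new Path("hdfs://source/data"), new Path("hdfs://target/data"))
        .simulate();
    JobExecutionResult result = distcp.run();
    System.out.println("Succeeded: " + result.isSuccessful());
  }
}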


Apache Gobblin Benefits

  • Auto scalability
  • Fault tolerance
  • Data quality assurance
  • Extensibility
  • Handling data model evolution

Key Apache Gobblin Tools