Big Data Engineering

Azure Data Factory vs. Apache Airflow | In-Depth Case Study

Chandan Gaur | 18 December 2022

Overview Azure Data Factory and Apache Airflow

Data-driven decision-making allows organizations to take strategic decisions and actions that align with their objectives and goals at the right time. Organizations are undoubtedly generating petabytes of data, yet they still struggle to collect it, create pipelines, and manage and monitor those pipelines before they can extract the data patterns and insights needed to make decisions. Azure Data Factory and Apache Airflow automate these jobs and make them easy to monitor.

What is Azure Data Factory?

Azure Data Factory (ADF) is a data integration and migration service: a fully managed, serverless data ingestion solution for ingesting, preparing, and transforming all data at scale. Microsoft offers ADF within Azure for constructing ETL and ELT pipelines. It creates automated data pipelines that perform these processes, reducing manual work.

What are the advantages of Azure Data Factory?

The advantages of Azure Data Factory are given below:

  1. Easy to use: It rehosts and extends SSIS in a few clicks, helping to modernize SSIS and making it easy to move SSIS packages to the cloud. It also builds ETL and ELT pipelines code-free, with built-in Git and CI/CD support.
  2. Cost-effective: ADF is cost-effective by nature, with pay-as-you-go pricing. It is a fully managed serverless cloud service that scales on demand.
  3. Powerful: Its 90 built-in connectors ingest data from on-premises and software-as-a-service (SaaS) sources, and data pipelines can be prepared and monitored code-free at scale.
  4. Intelligent: Autonomous ETL unlocks operational efficiencies and enables citizen integrators.

What is Apache Airflow?

Apache Airflow is a solution that builds, runs, and manages workflows. It represents a workflow as a directed acyclic graph (DAG) of operations called tasks, where an edge represents a logical dependency between operations.
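
Conceptually, a DAG is just a dependency graph that constrains execution order. As a minimal illustration in plain Python (not the Airflow API itself), a hypothetical extract, transform, load, report workflow can be ordered and run like this:

```python
# Sketch of Airflow's core idea: a workflow as a directed acyclic
# graph (DAG) whose edges are dependencies between tasks.
# Plain Python for illustration only, not the Airflow API.
from graphlib import TopologicalSorter

# Hypothetical workflow: each key depends on the tasks in its set.
dag = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

def run(task_id):
    # stand-in for actually executing a task
    return task_id

# Any valid schedule must respect every edge; a topological
# ordering of the graph gives one such schedule.
order = list(TopologicalSorter(dag).static_order())
results = [run(t) for t in order]
```

Airflow's scheduler effectively performs this ordering continuously, while also handling retries, schedules, and parallelism.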

An Airflow installation consists of the following components:

  1. Scheduler: It handles triggering scheduled workflows and submitting tasks to the executor to run.
  2. Executor: It handles the running of tasks. By default it runs everything inside the scheduler, but most production-suitable executors push task execution out to workers.
  3. Web server: It presents a handy user interface to inspect, trigger, and debug the behavior of DAGs and tasks.
  4. DAG files: A folder of DAG definition files read by the scheduler and executor.
  5. Metadata database: A database used by the scheduler, executor, and web server to store state.
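
The division of labor among these components can be sketched in a few lines of plain Python. This is a toy illustration, not Airflow code: the scheduler reads task state from the metadata store and hands runnable tasks to the executor, which writes results back.

```python
# Toy model of scheduler/executor/metadata-database cooperation.
# All names are hypothetical; real Airflow uses a SQL database,
# worker processes, and far more sophisticated scheduling logic.
metadata_db = {          # stands in for the metadata database
    "extract": "success",
    "transform": "queued",
    "load": "queued",
}
deps = {"transform": ["extract"], "load": ["transform"]}

def executor_run(task_id):
    # a real executor would dispatch to workers; here we just succeed
    metadata_db[task_id] = "success"

def scheduler_step():
    # run queued tasks whose upstream tasks have all succeeded
    for task, state in list(metadata_db.items()):
        if state == "queued" and all(
            metadata_db[d] == "success" for d in deps.get(task, [])
        ):
            executor_run(task)

scheduler_step()  # "transform" runs, then "load" once its upstream is done
```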

What are the advantages of Apache Airflow?

The advantages of Apache Airflow are described below:

  1. Open Source: Apache Airflow is an open-source project, so improvements can be made quickly, without barriers or prolonged procedures.
  2. Easy to use: Anyone with Python knowledge can deploy a workflow. It can be used to transfer data, manage infrastructure, build ML models, and more.
  3. Robust Integrations: It offers plug-and-play operators that can be used to execute tasks on Google Cloud Platform, Amazon Web Services, Microsoft Azure, and other third-party services. This capability makes Airflow easy to apply to current infrastructure and extends to next-generation technologies.

Why do we need Apache Airflow and Azure Data Factory?

As organizations move to the cloud and big data, data integration and migration remain essential across industries. ADF addresses these two needs efficiently, enabling teams to focus on their data while scheduling, monitoring, and managing ETL/ELT pipelines from a single view.

Let’s discuss some reasons why the adoption of Azure Data Factory is on the rise:

  1. Drive more value from data
  2. Improve business process outcomes
  3. Reduce overhead expenses
  4. Make better decisions
  5. Increase business process agility
  6. Run cost-effective processes

How do Apache Airflow and Azure Data Factory help businesses?

The following customer stories illustrate how ADF and Airflow changed these businesses and helped them reach their goals:

Apache Airflow

Case 1:

Problem: The organization needed workflow orchestration for tasks in game development. It had no suitable tool with built-in functions, so every process had to be orchestrated manually and from scratch, which increased the complexity of managing dependencies and monitoring complex workflows. The team needed a centralized tool that surfaced logs, retries, and execution times in one place, and it also lacked the ability to backfill historical data and restart failed tasks.

Solution: Airflow provides built-in solutions, including integrations. With its rich feature set, Apache Airflow simplifies building complex workflows, and the DAG model helps avoid errors by following general patterns. It allowed the team to run game-development processes efficiently, such as processing messages for the support team, analyzing churn rate, and sorting bank offers.

Case 2:

Problem: Big data systems require sophisticated data pipelines that connect to a variety of backend services to support complex operations. These workflows must be deployed, monitored, and executed regularly or in response to external events. The organization's Experience Platform component services designed and developed an orchestration service that lets users author, schedule, and monitor complex hierarchical workflows for Apache Spark and non-Spark jobs. While working with and managing the many applications involved, the organization faced several issues due to this complexity.

Solution: Apache Airflow allowed the organization's Experience Platform to build a smooth orchestration service that meets customer requirements. It is built on the guiding principle of leveraging an off-the-shelf, open-source orchestration engine, abstracted from other services via an API and extensible to any application via a pluggable framework. The platform uses the Apache Airflow execution engine to schedule and execute its workflows, and it also provides insight into those workflows.

ADF

Case 3:

Problem: The organization builds a SaaS data solution that customers use to make transformative, data-driven decisions. As the data warehouse grew, maintaining existing data increasingly required updates to accommodate changes to the data feeds. Continually updating ETL processes and data models is a major maintenance effort, so a more intelligent approach was needed. To solve this problem, they used Microsoft technologies that automatically generate data warehouses and perform the ETL process to customer specifications. This approach drastically reduced development cost and time.

What are the key features of Apache Airflow and Azure Data Factory?

| Feature | Azure Data Factory | Apache Airflow |
| --- | --- | --- |
| Focus | ETL | Orchestration, scheduling, workflows |
| Database replication | Full table; incremental via custom `SELECT` query | Only via plugins |
| SaaS sources | About 20, with several more in preview | Only via plugins |
| Ability to add new data sources | No | Yes |
| Connects to data warehouses / data lakes | Yes / Yes | Yes / Yes |
| Support SLAs | Yes | No |
| Compliance, governance, and security certifications | HIPAA, GDPR, ISO 27001, others | None |
| Data sharing | No | Yes, via plugins |
| Developer tools | REST API, .NET and Python SDKs | Experimental REST API |

Apache Airflow vs. Azure Data Factory: Comparison

Let’s take a deeper look at ADF and Airflow across a few dimensions:

Transformations

  1. Azure Data Factory: It supports both pre- and post-load transformations with a wide range of transformation functions. Transformations can be applied through the GUI or through Power Query Online, which requires coding.
  2. Apache Airflow: It is a tool for authoring, scheduling, and monitoring workflows as directed acyclic graphs of tasks. A DAG is a topological representation of the way data flows within a system. Apache Airflow manages the execution dependencies among jobs in a DAG and supports job failures, retries, and alerts. Data can be transformed as an action in the workflow using Python.

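The "transform as a Python action" pattern can be sketched without Airflow itself: each workflow step is an ordinary Python function (the role an Airflow Python task would play), chained in dependency order. The data and function names below are hypothetical.

```python
# Hypothetical ETL steps written as plain Python functions,
# i.e. the kind of callables an Airflow task would wrap.
def extract():
    return [{"user": "a", "spend": "10"}, {"user": "b", "spend": "25"}]

def transform(rows):
    # cast spend to int and keep only rows with spend >= 20
    return [
        {**row, "spend": int(row["spend"])}
        for row in rows
        if int(row["spend"]) >= 20
    ]

def load(rows, sink):
    sink.extend(rows)

warehouse = []   # stands in for the destination
load(transform(extract()), warehouse)
```
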
Connectors: Data Sources and Destinations

Both tools support a variety of data sources and destinations:

  1. Azure Data Factory: ADF can integrate with about 80 data sources, including SaaS platforms, SQL and NoSQL databases, generic protocols, and several file types. It supports approximately 20 cloud and on-premises data warehouse and database destinations.
  2. Apache Airflow: Apache Airflow orchestrates ETL workflows and the movement of data. It runs tasks, which are sets of activities, via operators: templates for tasks that can be created from Python functions or scripts. Operators can be written for any source or destination. It also supports plugins that implement operators and hooks (interfaces to external platforms), and ships some built-in plugins for databases and SaaS platforms.
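
The operator/hook split can be illustrated with a small sketch: a hook encapsulates the connection to an external platform, and an operator is a reusable task template built on top of it. The classes below are hypothetical stand-ins, not Airflow's real base classes.

```python
# Hypothetical hook: wraps connectivity to some external service.
class FakeServiceHook:
    def __init__(self, conn_id):
        self.conn_id = conn_id

    def fetch(self, query):
        # a real hook would open a connection and run the query
        return f"rows for {query!r} via {self.conn_id}"

# Hypothetical operator: a task template that delegates to the hook.
class FakeServiceOperator:
    def __init__(self, task_id, conn_id, query):
        self.task_id = task_id
        self.hook = FakeServiceHook(conn_id)
        self.query = query

    def execute(self):
        return self.hook.fetch(self.query)

op = FakeServiceOperator("pull_orders", "my_service", "SELECT *")
result = op.execute()
```

Because the hook isolates connection details, the same operator pattern extends to any new source or destination, which is how Airflow plugins add integrations.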

Support, documentation, and training

Working with services like these can be complex, as data integration often is; therefore, to support their customers, both offer help via documentation, forums, and training.

  1. Azure Data Factory: ADF provides support through an online request form and forums, and offers comprehensive official documentation. Customers can also get help by phone and email, and digital training materials are available.
  2. Apache Airflow: Apache Airflow offers documentation that includes a quick start and how-to guides, provides support through its Slack community, and publishes tutorials on its official website.

Pricing

Azure Data Factory

Azure Data Factory v1: The pricing for Data Factory usage is calculated based on the following factors:

  1. Frequency of activities: Whether an activity is low or high frequency. A low-frequency activity executes no more than once a day, whereas a high-frequency activity can execute more than once a day.
  2. Pipeline activity: Whether the pipeline is active or inactive.
  3. Where the activity runs: Whether the activity runs in the cloud or on-premises.
  4. Re-running activities: Activities can be re-run, and the cost of a re-run depends on where the activity runs.
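
How these factors combine can be shown with an illustrative-only cost model. The rates below are made-up placeholders, not Microsoft's prices; only the structure (frequency, run location, reruns) mirrors the billing factors above.

```python
# Hypothetical per-run rates keyed by (frequency, location).
# These numbers are invented for illustration, not real ADF prices.
RATES = {
    ("low", "cloud"): 0.60,
    ("high", "cloud"): 1.00,
    ("low", "on_premises"): 1.50,
    ("high", "on_premises"): 2.50,
}

def activity_cost(runs, frequency, location, reruns=0):
    # reruns are billed at the same per-location rate in this sketch
    rate = RATES[(frequency, location)]
    return round((runs + reruns) * rate, 2)

cost = activity_cost(runs=30, frequency="high", location="cloud", reruns=2)
```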

Azure Data Factory v2: The pricing of the data pipeline is calculated based on the following factors:

  1. Pipeline orchestration and execution
  2. Data flow execution and debugging
  3. Number of Data Factory operations, such as creating and monitoring pipelines

Apache Airflow

Apache Airflow is free and open source, licensed under Apache License 2.0. However, deploying Airflow to a robust and secure production environment has always been challenging, so several companies, consultants, and cloud services, such as AWS, Google, and Astronomer, offer enterprise support for deploying and managing Airflow environments. Pricing therefore varies by vendor.

Azure Data Factory and Airflow Together

ADF is a service commonly used to construct pipelines and jobs without writing much code. It integrates easily with on-premises data sources and Azure services. However, it has some limitations when used alone:

  1. It isn't easy to build and integrate custom tools.
  2. Limited integration with services outside of Azure.
  3. Limited orchestration capabilities.
  4. Custom packages and dependencies are complex to manage.

Conclusion

This is where Airflow comes in. ADF and Airflow can be used together to get the best of both tools: ADF jobs can be run from an Airflow DAG, bringing the full power of Airflow orchestration beyond what ADF offers on its own. Organizations can thus author their jobs comfortably in ADF and use Airflow as the control plane for orchestration.

Airflow's main building blocks, hooks and operators, can easily interact with and execute ADF pipelines.
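
As a sketch of that control-plane pattern, the function below triggers a pipeline run and checks its status, which is the shape of work an Airflow task performs when driving ADF (Airflow's Microsoft Azure provider package ships operators and hooks for this purpose). The client here is a stub so the example is self-contained; all names are hypothetical.

```python
# Stub standing in for an Azure Data Factory management client.
class StubAdfClient:
    def __init__(self):
        self._runs = {}

    def create_run(self, pipeline_name):
        run_id = f"run-{len(self._runs) + 1}"
        self._runs[run_id] = "Succeeded"  # pretend it completes at once
        return run_id

    def get_run_status(self, run_id):
        return self._runs[run_id]

def run_adf_pipeline(client, pipeline_name):
    # the body an Airflow task would execute: trigger, then verify
    run_id = client.create_run(pipeline_name)
    status = client.get_run_status(run_id)
    if status != "Succeeded":
        raise RuntimeError(f"ADF run {run_id} ended with status {status}")
    return run_id, status

run_id, status = run_adf_pipeline(StubAdfClient(), "copy_sales_data")
```

In a real deployment the stub would be replaced by an authenticated Azure client, and polling would loop until the run reaches a terminal state.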