Adopt or not to Adopt Data Mesh? - A Crucial Question

Chandan Gaur | 27 December 2022

Introduction

An energy analytics company provides predictive analytics and power forecasting solutions to wind and solar farms worldwide. Their typical data lake architecture is described later in this article. While explaining the challenges faced during this journey, we walk through how we arrived at Data Mesh architecture patterns and how they solve the major data platform architecture problems.
Data Mesh is not a new technology or framework. It is an idea: reusing an existing ecosystem of technologies and tools in innovative ways to solve the significant problems an enterprise faces as it grows from a start-up into a full product company.

What is Data Mesh?

Data mesh creates a layer of connectivity that removes the complexities associated with connecting, managing, and supporting data access. It is a method of connecting data that is spread across multiple data silos.

Data as a Product

Centralized storage for many kinds of entities integrated from different systems has become very popular in recent years. But it makes it very hard for downstream consumers to understand the data without a data catalog, and maintaining that catalog requires a separate team with domain knowledge.

Data Mesh instead follows several fundamental principles when designing a data platform.

Discoverability

Once data is available as a product, it must be discoverable through a data catalog. Each data product should carry metadata such as owner, lineage, source, and sample data. Data products should be registered in the catalog so that data consumer teams can discover them easily. The mindset shift is to provide data as a product, in a discoverable fashion, to downstream teams.
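
As a minimal sketch of the idea above, a catalog entry could carry the metadata listed (owner, lineage, source, sample data), and a catalog could support registration and keyword search. The names `DataProduct` and `DataCatalog`, the field names, and the wind-turbine example are hypothetical and chosen only for illustration; they do not correspond to any particular catalog tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Metadata a catalog entry might hold (hypothetical field names)."""
    name: str
    owner: str
    source: str
    lineage: list[str] = field(default_factory=list)
    sample_rows: list[dict] = field(default_factory=list)


class DataCatalog:
    """In-memory stand-in for a data catalog (illustrative only)."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        # Refuse entries without an owner so every product stays accountable.
        if not product.owner:
            raise ValueError(f"{product.name}: an owner is required")
        self._products[product.name] = product

    def search(self, keyword: str) -> list[DataProduct]:
        # Case-insensitive match on the product name or its source.
        kw = keyword.lower()
        return [
            p for p in self._products.values()
            if kw in p.name.lower() or kw in p.source.lower()
        ]


catalog = DataCatalog()
catalog.register(DataProduct(
    name="wind_turbine_power_hourly",
    owner="wind-forecasting-team",
    source="turbine SCADA feed",
    lineage=["raw_scada_readings", "cleaned_scada_readings"],
    sample_rows=[{"turbine_id": "T-01", "power_kw": 1520.0}],
))
found = catalog.search("wind")
```

A real catalog would persist entries and index more fields; the point here is only that owner, lineage, source, and sample data travel with the product.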

Addressability

A data product should be easy to locate and access, so a standard should be set for addressing the data. Different domains might store and serve their data in different formats, such as CSV files or serialized Parquet files in S3, or expose it through streams such as Kafka topics. A common convention should still be developed that lets users address the data programmatically.
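
One way to picture such a convention is a logical address that resolves to whatever physical storage a domain uses. The sketch below assumes a made-up `dataproduct://<domain>/<product>` scheme and a hypothetical lookup table; the bucket, path, and topic names are invented for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from logical addresses to physical locations.
# The bucket, path, and topic names are made up for illustration.
PHYSICAL_LOCATIONS = {
    ("wind", "power_hourly"): {
        "format": "parquet",
        "location": "s3://example-bucket/wind/power_hourly/",
    },
    ("solar", "irradiance_events"): {
        "format": "kafka",
        "location": "solar.irradiance.events",
    },
}


def resolve(address: str) -> dict:
    """Resolve a logical address such as dataproduct://wind/power_hourly."""
    parts = urlsplit(address)
    if parts.scheme != "dataproduct":
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    domain = parts.netloc
    product = parts.path.strip("/")
    try:
        return PHYSICAL_LOCATIONS[(domain, product)]
    except KeyError:
        raise LookupError(f"unknown data product: {address}") from None


wind = resolve("dataproduct://wind/power_hourly")
solar = resolve("dataproduct://solar/irradiance_events")
```

Consumers use the same address shape whether the product is served as files or as a stream, and only the resolver knows the physical details.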

Trustworthy

Without data truthfulness, data products have no meaningful use for analytics and other operations. The data owner must provide an acceptable SLO for the data's truthfulness.

The SLO should also cover how closely the data reflects the real-world scenario and the insights generated from those data points. Automated data integrity testing can help provide acceptable data quality at the time a data product is created. Providing data lineage as metadata with data products helps users gain confidence in data integrity.
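
As a rough illustration of automated integrity testing against an SLO, the sketch below measures completeness (the share of rows with all required fields present) and compares it to a threshold. The function names, the sample readings, and the 0.95 threshold are assumptions made for this example, not values from any real platform.

```python
from __future__ import annotations


def completeness(rows: list[dict], required: list[str]) -> float:
    """Fraction of rows in which every required field is present and not None."""
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows
        if all(row.get(name) is not None for name in required)
    )
    return complete / len(rows)


def meets_slo(rows: list[dict], required: list[str], slo: float) -> bool:
    """True when the measured completeness reaches the agreed SLO."""
    return completeness(rows, required) >= slo


# Hypothetical sample: one of four readings is missing its power value.
readings = [
    {"turbine_id": "T-01", "power_kw": 1520.0},
    {"turbine_id": "T-02", "power_kw": None},
    {"turbine_id": "T-03", "power_kw": 1490.5},
    {"turbine_id": "T-04", "power_kw": 1502.3},
]
required_fields = ["turbine_id", "power_kw"]
score = completeness(readings, required_fields)
passes = meets_slo(readings, required_fields, slo=0.95)
```

Here three of the four rows are complete, so the product would fall short of a 0.95 completeness SLO and should not be published as-is.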

Interoperability

In a distributed domain data architecture, the key concern is interoperability between domains. Users should be able to correlate data across different domains and stitch it together meaningfully using joins, filters, aggregates, and so on. Standards should be set for type formatting, common metadata fields, and dataset address conventions to enable interoperability across polyglot domains.
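
To make the point concrete, the sketch below joins records from two hypothetical domains (production and maintenance) that format the same site identifier differently. The shared convention applied in `normalize_site_id` (trimmed, upper-case IDs), the field names, and the sample records are all assumptions for illustration.

```python
from __future__ import annotations


def normalize_site_id(raw: str) -> str:
    """Apply a shared (hypothetical) convention: trimmed, upper-case IDs."""
    return raw.strip().upper()


# Records as two different domains might publish them.
production = [
    {"site_id": "wf-001", "energy_mwh": 120.0},
    {"site_id": "WF-002", "energy_mwh": 95.5},
]
maintenance = [
    {"site_id": " WF-001", "downtime_hours": 4},
    {"site_id": "wf-003", "downtime_hours": 9},
]


def join_on_site(left: list[dict], right: list[dict]) -> list[dict]:
    """Inner join two lists of records on the normalized site_id."""
    right_by_id = {normalize_site_id(r["site_id"]): r for r in right}
    joined = []
    for row in left:
        key = normalize_site_id(row["site_id"])
        match = right_by_id.get(key)
        if match is not None:
            # Merge both records and keep the normalized key.
            joined.append({**row, **match, "site_id": key})
    return joined


joined = join_on_site(production, maintenance)
```

Without the agreed ID format, the only matching site would be missed because `"wf-001"` and `" WF-001"` differ as raw strings.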

Domain-Driven Data Models

Microservices architecture allows product teams to break their overall solution into a group of independent but interconnected services, making it more manageable.

Similarly, when defining the storage architecture, it makes sense to segregate storage into different domains and define the data lake entities accordingly, instead of taking a common-database approach. This helps the BI and analytics teams see which data domains are available without spending their own time working that out.

Cross-Functional Data Engineering Teams

Microservices architecture inspires us to split the data engineering team into sub-teams, each with complete domain knowledge of the datasets it produces, transforms, and serves to analytics teams. This team structure helps each sub-team focus on its own domain and makes collaboration across teams easier.

Clear Ownership & Governance of Datasets

Once the data platform teams clearly understand what they are ingesting, it becomes easier to define ownership of the datasets. By moving away from a centralized governance approach, Data Mesh architecture also makes it easier to define data governance policies.

Challenges in Current Architecture

Lack of Domain Knowledge in Data Platform Team

Typically, data engineers focus on bringing in data from whatever data sources exist, working with the BI and analytics teams to understand their usage patterns, and defining the data lake structure.

However, the data engineering team doesn't have domain knowledge of the given datasets, so by the time the data reaches the analytics team, it has lost its context. The data platform team may even have created its own version of the datasets in the data lake or warehouse according to its own understanding.

Data Platform Team Becoming a Bottleneck for Serving Data with Context

Often, customers want to expose their data to the analytics team, and both the customer and the analytics team understand the data's context, but the engineering team has little idea of the data domain. Bringing the data into the data lake in a read-optimized format then becomes challenging, and traditional knowledge transfer (KT) sessions begin between the customer, data engineering (DE), and analytics teams to design the data lake storage that makes the data available to the BI and analytics teams.

Lack of Ownership of Datasets in Centralized Data Platform

Traditional data lake architecture uses ETL/ELT processes to bring data into the platform, and data platform teams focus entirely on building the data lake tables and exposing the datasets to the analytics team through a metastore (or, nowadays, a catalog). But the question is who will take ownership of those datasets, that is, who can explain to the downstream teams what the data means and how it should be used.

Lack of Domain-Driven Data Quality

Nowadays, many data quality tools and frameworks can help us profile our data and understand its quality. But this isn't enough for the analytics team because, in addition to basic data quality metrics, they define data quality through many domain-specific criteria.
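
The difference between a basic metric and a domain-specific criterion can be sketched with a hypothetical rule from the energy domain: a turbine's reported power should lie between zero and its rated capacity. The rule, the `RATED_CAPACITY_KW` table, and the sample readings below are assumptions for illustration only.

```python
from __future__ import annotations

# Hypothetical rated capacities per turbine, in kW.
RATED_CAPACITY_KW = {"T-01": 2000.0, "T-02": 2000.0}


def violations(readings: list[dict]) -> list[dict]:
    """Return readings that break the rule 0 <= power_kw <= rated capacity."""
    bad = []
    for reading in readings:
        rated = RATED_CAPACITY_KW.get(reading["turbine_id"])
        # Unknown turbines are flagged too, since the rule cannot be checked.
        if rated is None or not (0.0 <= reading["power_kw"] <= rated):
            bad.append(reading)
    return bad


# Every value is present and numeric, so a generic null or type check passes.
readings = [
    {"turbine_id": "T-01", "power_kw": 1520.0},
    {"turbine_id": "T-02", "power_kw": 2450.0},   # above rated capacity
    {"turbine_id": "T-01", "power_kw": -15.0},    # negative output
]
flagged = violations(readings)
```

Only someone who knows the domain would think to write this check, which is why data quality ownership sits more naturally with the domain team than with a central platform team.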

Adopt or not to Adopt Data Mesh?

Whether to adopt Data Mesh depends on the following factors in the organization.

Number of data sources

Take the number of data sources into consideration before ramping up for Data Mesh. How many data sources does the organization have?

Team size

How large are the teams? Consider the size of both the data science and data engineering teams.

Data Domain Quantity

How many products does the company own? Do other teams, such as marketing or sales, rely on the data to make decisions?

Bottleneck

Is the data engineering team a bottleneck in implementing any new product?

A Fundamental Shift

Moving from a traditional data architecture to Data Mesh requires some fundamental shifts.

| FROM | TO |
| --- | --- |
| Centralized ownership | Decentralized ownership |
| Pipelines as a first-class concern | Domain data as a first-class concern |
| Data as a by-product | Data as a product |
| Siloed data engineering team | Cross-functional domain data teams |
| Centralized data lake | An ecosystem of data products |

Typical Data Lake Architecture

[Figure: architecture diagram (data-mesh-architecture)]

Almost every data platform company works around the above architecture pattern. When enterprises reach the stage of serving customers from different domains, they find that their basic team structure and architecture principles do not scale.

Usually, the data ingestion team brings data into the data lake. The engineering and analytics teams define a standard data structure for the lakehouse or warehouse, then process and transform the data into a read-optimized format.

So, generally, three teams are involved:

Data Ingestion: Dedicated team for the integration of customer data sources

Data Platform: Maintains the data platform, including the data lake, warehouse, marts, governance, and catalog

Analytics Team: Responsible for making decisions based on data, i.e., the Business Intelligence and Data Science teams

Conclusion

Data meshes address the shortcomings of data lakes by giving data owners more autonomy and flexibility, allowing for more data experimentation and innovation, and reducing the burden on data teams to meet the needs of every data consumer through a single pipeline.