Understanding Serverless Data Mesh Architecture Challenges

Chandan Gaur | 17 December 2022

What is Data Mesh Architecture?

Data Mesh Architecture is an architectural paradigm that offers a self-configuring, self-service design, embracing the distributed and diverse nature of data while leveraging Domain-Driven Design. In this paradigm, databases, pipelines, structure, and code language correspond to the business domain. Unlike a centralized monolithic architecture, Data Mesh is distributed. For big data to drive innovation, accountability and ownership must rest with domain owners who can provide data as a product. Interoperability, customer satisfaction, and scalability are the main focus points of Data Mesh.

Data mesh architecture also focuses on building data infrastructure as a platform for self-serve use. With this architecture, data consumers such as data scientists, ML engineers, and product teams have easy, direct access to data products under agreed SLAs. It also allows fine-grained management of access rights to data.

Example of Data Mesh Architecture

Think of a data platform that has to be built for an eCommerce business. In a mesh architecture, you define domains such as products, orders, carts, sessions, and leads. These domains relate to one another, yet each is complete on its own. Each domain has its own pipeline and storage while still allowing data to flow across domains. To control access rights, each domain has a governance layer that handles policies, granular access control, and data monitoring.
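To make the eCommerce example concrete, here is a minimal Python sketch of domains that own their storage and decide which other domains may read their data. The `Domain` class, the `s3://example-mesh/` location, and the grant from products to orders are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """One mesh domain that owns its storage and its read grants."""
    name: str
    storage_uri: str  # hypothetical location of the domain's own storage
    readers: set = field(default_factory=set)  # other domains allowed to read

    def grant_read(self, consumer: str) -> None:
        """Governance layer: explicitly allow another domain to read."""
        self.readers.add(consumer)

    def can_read(self, consumer: str) -> bool:
        # A domain can always read its own data; others need a grant.
        return consumer == self.name or consumer in self.readers


# Domains taken from the eCommerce example above.
DOMAIN_NAMES = ["products", "orders", "carts", "sessions", "leads"]
mesh = {n: Domain(n, f"s3://example-mesh/{n}/") for n in DOMAIN_NAMES}

# Data may cross domains, but only where the owning domain allows it.
mesh["products"].grant_read("orders")
```

In this sketch each domain keeps its own grant list, which mirrors the idea that governance sits with the domain rather than with a central team.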


What are the Challenges of Data Mesh Architecture?

With the ever-increasing diversity of data, the modern data tech stack has evolved, and data teams are more diverse than ever. With this evolution, the world is moving from old monolithic architectures to the distributed Data Mesh.

Integrating and ingesting data into a central monolithic data lake or data warehouse is a good idea for analyzing and reporting on data. Still, this approach burdens organizations with huge costs and unmanaged technical debt: hundreds of ETL jobs, batch processing jobs, and databases that only a few specialized people can work on. The outcome is over-promised, under-performing results with no clear ownership, and huge bills.

In the following sections, we discuss Data Mesh Architecture, which is not necessarily a perfect solution for the above issues but allows us to address and avoid most of them.

What is the Logical Architecture Design for Data Mesh?

Four principles form the basis of Data Mesh Architecture.

Domain Oriented

With the shift to microservices architecture, product thinking has grown up alongside domain-driven development. Domain-driven teams provide access to their product data by building APIs and pub/sub messaging systems. The teams focus on delivering the best developer experience by providing understandable and discoverable documentation. For self-serve data platforms, domain-driven data teams must apply the same product thinking to their data sets, treating data scientists, ML engineers, and other cross-functional teams as their customers.

Key points to take care of before serving data as a product:

  1. Data must be reliable.
  2. Data must be easy to understand, with clear semantics and a well-defined schema.
  3. It should be easily discoverable.
  4. Data must be secured both in transit and at rest.
  5. It should be interoperable.
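The five points above can be turned into an automated readiness check. The following Python sketch is one possible way to do that; the `DataProduct` fields, the `OPEN_FORMATS` set, and the idea of using a declared freshness SLA as a stand-in for reliability are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed set of formats treated as interoperable in this sketch.
OPEN_FORMATS = {"parquet", "avro", "csv", "json"}


@dataclass
class DataProduct:
    name: str
    description: str                      # human-readable semantics
    schema: Dict[str, str]                # column name -> type
    catalog_registered: bool              # discoverable through a catalog
    encrypted_in_transit: bool
    encrypted_at_rest: bool
    output_format: str
    freshness_sla_hours: Optional[float]  # stand-in for a reliability promise


def readiness_issues(p: DataProduct) -> List[str]:
    """Return one message per checklist point the product fails."""
    issues = []
    if p.freshness_sla_hours is None:
        issues.append("reliable: no freshness SLA declared")
    if not p.description or not p.schema:
        issues.append("understandable: missing description or schema")
    if not p.catalog_registered:
        issues.append("discoverable: not registered in the catalog")
    if not (p.encrypted_in_transit and p.encrypted_at_rest):
        issues.append("secure: encryption missing in transit or at rest")
    if p.output_format.lower() not in OPEN_FORMATS:
        issues.append("interoperable: non-standard output format")
    return issues


orders = DataProduct(
    name="orders_daily",
    description="One row per confirmed order",
    schema={"order_id": "string", "total": "decimal"},
    catalog_registered=True,
    encrypted_in_transit=True,
    encrypted_at_rest=False,  # deliberately failing one check
    output_format="parquet",
    freshness_sla_hours=24,
)
print(readiness_issues(orders))
```

An empty list means the product passes every check in this sketch; a real platform would likely run richer checks for each point.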

Decentralized Data Ownership

Domain data teams need new skill sets. A single domain data team should include the following roles:

Data Product Owner: The person who is accountable, sets the direction, makes decisions, and creates a vision for data as a product. This person also takes architecture-related decisions and defines KPIs for the team's data products.

Data Engineers: The team needs data engineers to build data pipelines and to organize and convert data into useful information.

Self Serve Data Platform

A self-serve data platform provides data infrastructure that domain teams can use without building it themselves. For example, such infrastructure can automate data ingestion through configuration and let a data product register itself with the catalog.

One success criterion for data infrastructure as a platform is how long it takes to register new data as a product on the platform.
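As a rough illustration of configuration-driven registration, the Python sketch below registers a data product in an in-memory catalog and returns how long the call took. The `Catalog` class, the `REQUIRED_KEYS` set, and the config fields are hypothetical. In a real platform the measured time would span from the team's request until the product is queryable, not a single function call.

```python
import time

# Assumed minimum fields a product config must declare in this sketch.
REQUIRED_KEYS = {"name", "owner", "source", "schedule"}


class Catalog:
    """In-memory stand-in for a data product catalog."""

    def __init__(self):
        self._products = {}

    def register(self, config: dict) -> float:
        """Validate and store a product config; return seconds taken."""
        started = time.perf_counter()
        missing = REQUIRED_KEYS - set(config)
        if missing:
            raise ValueError(f"missing config keys: {sorted(missing)}")
        self._products[config["name"]] = config
        return time.perf_counter() - started

    def find(self, name: str):
        """Look up a registered product, or None if it is unknown."""
        return self._products.get(name)


catalog = Catalog()
elapsed = catalog.register({
    "name": "orders_daily",
    "owner": "orders-team",
    "source": "s3://example-mesh/orders/",  # hypothetical location
    "schedule": "daily",
})
print(f"registered in {elapsed:.6f} s")
```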

A checklist to consider when building data infrastructure as a platform includes, but is not limited to:

  1. Data governance according to defined standards
  2. Scalable data storage
  3. Data versioning
  4. Federated identity management
  5. Data lineage
  6. Data monitoring and proper logging
  7. Data caching
  8. Data quality benchmarks
  9. Encryption of data and secrets

Monitoring and Governance

Moving, accessing, and storing data raise primary security concerns, whether the architecture is centralized or decentralized. In a mesh architecture, access is controlled at a finer level of granularity. Defining a role-based access control policy is a convenient way to manage and monitor access across the platform.
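A role-based access control policy of this kind can be sketched in a few lines of Python. The role names, domains, and actions below are hypothetical, and the in-memory `audit_log` only stands in for whatever monitoring a real platform would use.

```python
# Hypothetical policy: role -> set of (domain, action) pairs it may perform.
ROLE_PERMISSIONS = {
    "data_scientist": {("orders", "read"), ("products", "read")},
    "orders_engineer": {("orders", "read"), ("orders", "write")},
}

# Every decision is recorded so access can be monitored later.
audit_log = []


def is_allowed(role: str, domain: str, action: str) -> bool:
    """Check the policy and record the decision for monitoring."""
    allowed = (domain, action) in ROLE_PERMISSIONS.get(role, set())
    audit_log.append((role, domain, action, allowed))
    return allowed


print(is_allowed("data_scientist", "orders", "read"))
print(is_allowed("data_scientist", "orders", "write"))
```

Unknown roles fall through to an empty permission set, so the sketch denies by default.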

Data Mesh vs. Centralised Platform: How is it different?

Data Mesh brings a shift from the centralised platform in architecture, technical solutions, operations, and thought processes in the data engineering team. It changes how companies manage, analyze, and use analytical data.

  1. Architectural Shift: Architecture shifts from collecting and managing data in monolithic data lakes/warehouses to connecting data through the mesh.
  2. Accountability and Ownership: Ownership and accountability shift from a single central specialized team that works on data platform technologies to decentralized, business domain-specific teams.
  3. Operational Shift: Data governance shifts from a top-down, centralized operational model to a federated model with policies enforced on each mesh node.
  4. Thought Process Shift: Thinking changes from treating data as an asset to be collected to treating data as a product to be served.

Solutions for Implementing Serverless Data Mesh

The major cloud providers offer the following solutions for implementing a serverless data mesh:

Amazon Web Services (AWS)

AWS provides various services that make it easier to set up a Data Mesh Architecture. Let's discuss one solution for building a Data Mesh Architecture platform.

Data Mesh Architecture using AWS Lake Formation and Glue

  1. AWS Glue: AWS Glue is a serverless service for ETL operations. It gives data engineers an easy path to extract, clean, transform, and load data. It provides code-based and visual interfaces to make data operations easy.
  2. AWS Lake Formation: AWS Lake Formation is a managed service for building, securing, and managing data lakes. Lake Formation automates various steps in creating a data lake, including collecting, cleansing, moving, and cataloging data. Once pointed at data sources, it crawls them and moves the data into a new Amazon Simple Storage Service (Amazon S3) data lake.

The workflow for data as a product includes the following steps:

  1. Data sources are cataloged in the AWS Glue Data Catalog and registered with Lake Formation.
  2. When a dataset is added as a product, Lake Formation Data Catalog entities are created under the central governance account.
  3. The central catalog shares the catalog resources with domain accounts, granting the required permissions via Lake Formation.
  4. Data changes made within the domain account are automatically propagated to the central governance copy of the catalog.
  5. Lake Formation in the domain account can define and enforce access permissions on these datasets for users to consume. Users such as data scientists and ML engineers can then access the data with their chosen tool, such as Athena.
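Step 3 of the workflow above, sharing a catalog table with a domain account, can be sketched in Python. The helper below only builds the request for Lake Formation's `GrantPermissions` API and passes it to a client supplied by the caller. The account IDs, database name, and table name are hypothetical, and the choice of `SELECT` and `DESCRIBE` as the shared permissions is an assumption.

```python
def build_grant_request(catalog_account_id, consumer_account_id,
                        database, table,
                        permissions=("SELECT", "DESCRIBE")):
    """Build keyword arguments for Lake Formation's grant_permissions call."""
    return {
        # For cross-account sharing the principal is the consumer account ID.
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {
            "Table": {
                "CatalogId": catalog_account_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": list(permissions),
    }


def apply_grant(lakeformation_client, request):
    """Send the grant; the client is expected to be boto3.client('lakeformation')."""
    return lakeformation_client.grant_permissions(**request)


# Hypothetical account IDs and names, for illustration only.
request = build_grant_request(
    catalog_account_id="111122223333",   # central governance account
    consumer_account_id="444455556666",  # domain account
    database="orders_db",
    table="orders_daily",
)
print(request)
```

Keeping the request builder separate from the call makes the sharing policy easy to review and test without AWS credentials. Calling `apply_grant` requires the `boto3` package and credentials for the central governance account.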

Microsoft Azure

Azure provides various services to support Data Mesh Architecture, such as Azure Data Lake Storage (ADLS), Azure Synapse, and Microsoft SQL Server.

  1. Azure Data Lake: It provides the capabilities required to store data of any size and shape and to do all types of processing across platforms and languages.
  2. Azure Service Fabric Mesh: It is a fully managed service that allows you to build, deploy, and manage applications consisting of polyglot services running within containers using a "serverless" approach. The Service Fabric platform has a built-in, highly available persistent state store. Note that this service was scheduled for retirement in 2021, though existing deployments continued to be supported for a time.

Google Cloud Platform (GCP)

Anthos Service Mesh is a suite of tools that helps you monitor and manage a reliable service mesh on-premises or on Google Cloud. It is a powerful tool for managing and monitoring distributed applications.

Features of Anthos Service Mesh

  1. Traffic Management
  2. Observability Insights
  3. Fine-grained security policies
  4. Failure recovery
  5. Easy Authentication and Encryption
  6. Load Balancing

Conclusion

While a centralized architecture can work well for small firms with a simpler domain, large organizations with diverse domain data need a way to manage each domain's data independently while offering that data through a self-serve platform. Data Mesh Architecture offers a way to address these needs.