
Data Mesh Principles and Logical Architecture Defined


The concept of a data mesh provides new ways to address common problems around managing data at scale. Zhamak Dehghani has provided additional clarity around the four principles of a data mesh, with a corresponding logical architecture and organizational structure. Her article is intended as a follow-up to previous articles, presentations, and podcasts that introduced people to data mesh and domain-oriented data.

Dehghani emphasizes the "great divide" between operational data and analytical data. Traditionally, a data pipeline of ETL jobs (extract, transform, and load) spans this divide between transactional data used for running the business, and data lakes and data warehouses used to provide insights about the business. Data mesh acknowledges the need for these two distinct viewpoints and use cases, but instead of organizing teams and architectures along technology boundaries, data mesh unites them by focusing on domains. 

By following this topology, analytical data is able to scale in the way microservices and self-contained databases have allowed transactional data to scale. To achieve the promise of scale, along with quality and integrity, Dehghani lays out four principles of a data mesh:

    1. Domain-oriented decentralized data ownership and architecture
    2. Data as a product
    3. Self-serve data infrastructure as a platform
    4. Federated computational governance

Logical architecture of data mesh approach (Source: Figure 13 in Dehghani's article)

For each of these principles, Dehghani provides a technology-agnostic logical architecture. She specifically avoids being too prescriptive about tools or implementation details, hoping that leaving some questions unanswered will encourage imagination and creativity for data mesh solutions.

The idea of domain-oriented, decentralized data ownership rests on decomposing the organization along its business domains, with the data mesh following those same seams to find its axis of decomposition. While Conway's Law is not directly mentioned, one can see how it would apply: the teams responsible for both the operational and analytical data, as well as any ETL process between them, belong to that business unit.

Because each business domain is responsible for both sides of the data, those teams are responsible for providing both operational APIs as well as analytical endpoints. One domain can depend on another for operational and/or analytical data, accessed by the exposed endpoints.

Example: domain-oriented ownership of analytical data in addition to operational capabilities (Source: Figure 5 in Dehghani's article)
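This dual responsibility can be pictured as a single domain team exposing both kinds of endpoints. Here is a minimal Python sketch; all names (OrdersDomain, place_order, order_events) are illustrative assumptions, not anything prescribed by Dehghani's article:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Order:
    order_id: str
    amount: float


class OrdersDomain:
    """One business domain serving both operational and analytical endpoints."""

    def __init__(self) -> None:
        self._orders: Dict[str, Order] = {}

    # Operational API: transactional, used for running the business.
    def place_order(self, order_id: str, amount: float) -> Order:
        order = Order(order_id, amount)
        self._orders[order_id] = order
        return order

    # Analytical endpoint: a read-optimized view of the same domain's data,
    # published by the same team for other domains to consume.
    def order_events(self) -> List[dict]:
        return [{"order_id": o.order_id, "amount": o.amount}
                for o in self._orders.values()]


domain = OrdersDomain()
domain.place_order("o-1", 42.0)
print(domain.order_events())  # [{'order_id': 'o-1', 'amount': 42.0}]
```

The point of the sketch is ownership, not technology: the same bounded context serves both endpoints, with no central data team sitting between them.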

Just as microservices have led to a need for service discovery, a distributed data mesh needs to have discoverable data products. If the data is not easily discoverable and understandable, then the act of distributing it can exacerbate any problems in an existing analytical system. To solve this, a data mesh implementation should treat domain data as a product, with a corresponding product owner and developers responsible for building and maintaining the data products. In her logical architecture, Dehghani describes the data product as the architectural quantum:

"Architecturally, to support data as a product that domains can autonomously serve or consume, data mesh introduces the concept of data product as its architectural quantum. Architectural quantum, as defined by Evolutionary Architecture, is the smallest unit of architecture that can be independently deployed with high functional cohesion, and includes all the structural elements required for its function."
For each data product to be a viable node on the data mesh, each team is responsible for code, data and metadata, and infrastructure. Keeping these three structural components within the same bounded context is a departure from past paradigms where the pipeline between transactional and analytical data created a strong separation.
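One way to read "architectural quantum" is as a single deployable descriptor that is only viable when all three structural components are present. A hypothetical Python sketch, where the field names are assumptions rather than a standard schema:

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A data product bundling its code, data and metadata, and infrastructure."""
    name: str
    owner: str                     # product owner within the domain team
    code_repo: str                 # code: pipeline and serving logic
    output_port: str               # data: where consumers read from
    schema: dict                   # metadata: structure of the served data
    infrastructure: dict = field(default_factory=dict)  # declared, self-served

    def is_deployable(self) -> bool:
        # A viable node on the mesh carries all three structural components.
        return bool(self.code_repo and self.schema and self.infrastructure)


orders = DataProduct(
    name="orders-daily",
    owner="orders-team",
    code_repo="git@example.com:orders/orders-daily.git",
    output_port="s3://mesh/orders/daily/",
    schema={"order_id": "string", "amount": "double"},
    infrastructure={"storage": "object-store", "compute": "spark"},
)
print(orders.is_deployable())  # True
```

Keeping all of these fields in one unit mirrors the article's point that code, data and metadata, and infrastructure live in the same bounded context rather than being split across pipeline stages.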

To ensure many teams are all able to deliver data products that meet the organization's standards, there must be a self-serve data infrastructure as a platform. Again, the parallels to creating successful microservices are apparent, as this follows the idea of the "paved road" for a microservices PaaS. While an existing delivery platform for services can be used as a starting point, the addition of data analysis and pipeline code brings a new level of complexity. Dehghani's personal hope is that we will start seeing a sensible convergence of operational and data infrastructure.

The logical architecture model for the self-serve platform is organized into three planes, for data infrastructure provisioning, data product developer experience, and data mesh supervision. These are intended to create a clear separation of concerns, but are not meant to imply a strict layering or physical hierarchy.

Multiple planes of the self-serve data platform; DP stands for data product (Source: Figure 10 in Dehghani's article)
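The separation of concerns between the planes can be sketched as three thin interfaces, where the developer-experience plane builds on provisioning and the supervision plane observes the mesh as a whole. Class and method names here are illustrative assumptions:

```python
from typing import List


class ProvisioningPlane:
    """Data infrastructure provisioning: atomic, low-level resources."""

    def provision_storage(self, product_name: str) -> str:
        # Placeholder for real resource provisioning.
        return f"storage://{product_name}"


class DeveloperExperiencePlane:
    """Data product developer experience: declarative, product-centric."""

    def __init__(self, provisioning: ProvisioningPlane) -> None:
        self._provisioning = provisioning
        self.deployed: List[str] = []

    def deploy_data_product(self, name: str) -> str:
        # One product-level call hides the lower-level provisioning steps.
        location = self._provisioning.provision_storage(name)
        self.deployed.append(name)
        return location


class SupervisionPlane:
    """Data mesh supervision: mesh-wide concerns such as discovery."""

    def __init__(self, experience: DeveloperExperiencePlane) -> None:
        self._experience = experience

    def list_products(self) -> List[str]:
        return list(self._experience.deployed)


provisioning = ProvisioningPlane()
experience = DeveloperExperiencePlane(provisioning)
supervision = SupervisionPlane(experience)
experience.deploy_data_product("orders-daily")
print(supervision.list_products())  # ['orders-daily']
```

Consistent with the article, these are planes rather than strict layers; the composition shown here is one possible arrangement, not a required physical hierarchy.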

While the data mesh allows data teams to work independently, most use cases will require interoperability that easily joins multiple data products together. Without a single data warehouse to enforce the rules (and create a bottleneck in the pipeline), data mesh requires federated computational governance.

A group composed of domain data product owners and data platform product owners must decide what global rules all teams must adhere to, while still empowering teams to have enough control over their domains and implementations. "Ultimately global decisions have one purpose, creating interoperability and a compounding network effect through discovery and composition of data products."

Example of elements of a federated computational governance: teams, incentives, automated implementation, and globally standardized aspects of data mesh (Source: Figure 12 in Dehghani's article)
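"Computational" suggests the global rules are executable. Here is a hypothetical sketch in which the federated group's decisions are encoded as automated checks applied uniformly to every data product; the rule names are invented for illustration:

```python
from typing import Callable, Dict, List

# Each global rule is an executable check agreed upon by the federated
# group of domain data product owners and platform product owners.
Rule = Callable[[Dict], bool]


def has_owner(product: Dict) -> bool:
    return bool(product.get("owner"))


def documents_schema(product: Dict) -> bool:
    return bool(product.get("schema"))


GLOBAL_RULES: List[Rule] = [has_owner, documents_schema]


def passes_governance(product: Dict) -> bool:
    # Local decisions stay with each domain; only these global rules
    # are enforced mesh-wide, and they are enforced automatically.
    return all(rule(product) for rule in GLOBAL_RULES)


print(passes_governance({"owner": "orders-team", "schema": {"id": "string"}}))  # True
print(passes_governance({"owner": "orders-team"}))  # False
```

Encoding the rules as code is what makes the governance computational: interoperability is checked by the platform rather than reviewed manually by a central team.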

