The Distributed Data Mesh as a Solution to Centralized Data Monoliths

Instead of building large, centralized data platforms, enterprise data architects should create distributed data meshes. Such a change in approach requires a paradigm shift, according to Zhamak Dehghani, principal technology consultant at ThoughtWorks, in a presentation at QCon San Francisco and a related article. As data becomes ever more ubiquitous, traditional architectures of data warehouses and data lakes become overwhelmed, and are unable to scale efficiently. Dehghani argues that a distributed data mesh approach can overcome these inherent inefficiencies by embracing domain-oriented data ownership.

"I suggest that the next enterprise data platform architecture is in the convergence of Distributed Domain Driven Architecture, Self-serve Platform Design, and Product Thinking with Data."

Her presentation included some real-world examples, but focused mostly on new governing principles, accompanied by new language to support the mindset: for example, serving over ingesting, and discovering and using over extracting and loading.

Dehghani sees three failure modes in traditional data platform architectures. First, they are centralized and monolithic; putting all types of data together may work for small organizations, but eventually fails for enterprises with large numbers of data sources and diverse consumers of the data.

The second is a problem Dehghani describes as "coupled pipeline decomposition." Generations of architects have decomposed data platform architectures into "pipelines of data processing steps." These pipeline steps are orthogonal to the axis of change, so a new feature requires updates to every step.

Siloed and hyper-specialized ownership is the final failure mode. The centralized architecture naturally creates categories of data source teams providing data, and consumer teams retrieving the processed data. In the middle are data and machine learning specialists. While the two outer groups are domain-oriented, the central team must be domain-agnostic.

The motivation for a data mesh: avoiding siloed data teams.

Image Credit: Zhamak Dehghani

Dehghani compared these challenges to those of n-tier monoliths, where new customer requirements require the modification of all the tiers. Microservices are better aligned with the elements that change, but require a different design approach. A similar, dramatic shift in thinking will be required to successfully implement a data mesh architecture.

"In order to decentralize the monolithic data platform, we need to reverse how we think about data, its locality and ownership. Instead of flowing the data from domains into a centrally owned data lake or platform, domains need to host and serve their domain datasets in an easily consumable way."

The envisioned architecture focuses on domain data products as first class components, each with corresponding ownership by teams that understand the domain. The monolithic, rigid data pipeline is no longer the primary design concern, nor is data clearly segregated into source and consumption patterns. Decentralized teams are able to use the data they need, and can provide their output back into the mesh for other teams.
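One way to picture domain data products as first-class components is a small registry in which each domain team publishes a dataset under an address, and other teams consume it and publish their own derived output back. The following is a minimal sketch under stated assumptions, not Dehghani's design: `DataProduct`, `Mesh`, the domain names, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DataProduct:
    """A dataset served by the domain team that owns it (hypothetical model)."""
    address: str                    # unique name, e.g. "orders/daily-orders"
    owner: str                      # owning domain team
    read: Callable[[], List[dict]]  # how consumers fetch the records


class Mesh:
    """Hypothetical in-memory stand-in for the mesh: a lookup, not a central pipeline."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        self._products[product.address] = product

    def consume(self, address: str) -> List[dict]:
        return self._products[address].read()


mesh = Mesh()

# The (hypothetical) orders domain hosts and serves its own dataset.
mesh.publish(DataProduct(
    address="orders/daily-orders",
    owner="orders-team",
    read=lambda: [
        {"customer": "a", "total": 30},
        {"customer": "b", "total": 12},
        {"customer": "a", "total": 5},
    ],
))


def revenue_per_customer() -> List[dict]:
    """The finance domain consumes the orders product and derives its own."""
    totals: Dict[str, int] = {}
    for row in mesh.consume("orders/daily-orders"):
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["total"]
    return [{"customer": c, "revenue": t} for c, t in sorted(totals.items())]


# The derived dataset goes back into the mesh as a product in its own right.
mesh.publish(DataProduct("finance/revenue-per-customer", "finance-team", revenue_per_customer))
```

In this sketch the finance team is both a consumer and a producer, which is the blurring of source and consumption roles the paragraph describes.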

For such an architecture to be successful, the data products must be discoverable, addressable, trustworthy, self-describing, interoperable, and secure and governed by global access control. These traits are the responsibility of individual data product owners, and are aided by federated governance and a platform for providing data infrastructure.
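The traits listed above could be captured as metadata that each data product owner fills in and that a shared platform checks. This is only an illustrative sketch: `DataProductDescriptor`, its field names, and the mapping of each trait to a single field (for instance, treating "trustworthy" as a declared freshness target) are assumptions, not something specified in the presentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProductDescriptor:
    """Hypothetical metadata record; each field stands in for one trait."""
    address: str = ""                                       # addressable
    description: str = ""                                   # discoverable
    schema: Dict[str, str] = field(default_factory=dict)    # self-describing
    freshness_slo_hours: int = 0                            # trustworthy (assumed: a freshness target)
    data_format: str = ""                                   # interoperable (assumed: an agreed format)
    allowed_roles: List[str] = field(default_factory=list)  # secure, under global access control


def missing_traits(d: DataProductDescriptor) -> List[str]:
    """Return the traits this descriptor does not yet declare."""
    checks = {
        "discoverable": bool(d.description),
        "addressable": bool(d.address),
        "trustworthy": d.freshness_slo_hours > 0,
        "self-describing": bool(d.schema),
        "interoperable": bool(d.data_format),
        "secure": bool(d.allowed_roles),
    }
    return [trait for trait, ok in checks.items() if not ok]


draft = DataProductDescriptor(
    address="orders/daily-orders",
    description="One row per customer order",
    schema={"customer": "string", "total": "int"},
)
print(missing_traits(draft))
```

A check like this is the kind of thing federated governance might enforce, while filling in the fields stays with the owning team.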

Overview of the Data Mesh

Image Credit: Zhamak Dehghani

The data warehouse and data lake can still exist in this architecture, but they become just another node in the mesh, rather than a centralized monolith. If teams still need functionality that is best accomplished by data warehouses and lakes, then they should be free to embrace it. Again, there are correlations to the adoption of microservices and polyglot solutions.

Dehghani's QCon presentation, Data Mesh Paradigm Shift in Data Platform Architecture, will be available in the coming weeks. Her article, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, is available now. She will also be an upcoming guest on the InfoQ podcast.

Community comments

  • Consultant buzzword salad...

    by Greg Liebowitz,

    Monolithic “data lakes” (storage and compute on commodity servers, e.g. HDFS) are long obsolete and were surpassed by better tech on public cloud. Actually I was hoping for an article on IMDGs, which are still widely-used in distributed computing applications.
