
Thoughtworks’ VP of Data and AI Shares Insights for Building a Robust Data Product at QCon London

During his QCon London presentation, Danilo Sato, vice president of data & AI at Thoughtworks, reemphasized the importance of using domain-driven design and Team Topologies principles when implementing data products. This ensures effective data encapsulation in a more complex landscape where data responsibilities are "shifting left" towards the developer.

Cassie Shum, the host of the "Architectures You've Always Wondered About" conference track, said during the conference:

Given the importance of data today with everybody moving towards AI, I cannot imagine how this presentation would not be part of the track.

Sato started the presentation by walking through several data architectures from different industries (from traditional to streaming), stressing that they all share the same familiar components: ingestion (batch or streaming), data pipelines, storage, consumers, and analytics.

Moreover, the two "worlds" of the data universe - the operational and the analytical - are moving closer together. He underlined the importance of focusing on the concept of data products rather than on the underlying technologies: "architect for data products or with products", meaning treating data entities like real products, with versioning and contracts.

Sato pointed out that the technical landscape has evolved from the simple decisions of the 2000s, when you just had to choose which DBMS to use, to an ever-growing landscape now known as Machine Learning, AI and Data (MAD). Regardless of this evolution, he underlined that "modelling is still hard", especially as the usefulness of a model depends on the problem its users are trying to solve, and there is no objective way of evaluating its effectiveness.

George E.P. Box: "All models are wrong, but some are useful."

Starting from Gregor Hohpe’s "architect elevator" concept – which holds that an architect should be able to communicate at every level of a company, from the vision down to the nitty-gritty implementation details – Sato introduced the data architecture concerns to bear in mind, from the bottom up.

The bottom level refers to the data that flows within a system. Although it was historically acceptable to break encapsulation ("give me access to the data, and I will handle the analytics"), when thinking from the product perspective there are more aspects to consider than just operational ones such as volume, velocity, consistency, availability, latency, or access patterns.

The analytical side refers more to the data product, the socio-technical approach to managing and sharing data.


The data needs to be better encapsulated, with input, output and control ports that handle the data flow, as well as a versionable data product specification. If the data within a system is well encapsulated, you can even replace the underlying database without any impact on the outside world. As usual with architectural decisions, there are trade-offs to consider when deciding which type of database to use for the implementation.
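
To make this concrete, the following is a minimal sketch, with hypothetical names and technologies, of what a versioned data product specification with input, output and control ports could look like; Sato did not prescribe a format, so this Python model is purely illustrative.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, minimal model of a versioned data product specification.
# Port kinds follow the input/output/control split described in the talk;
# all names, technologies and schema references are illustrative.

@dataclass
class Port:
    name: str          # e.g. "orders_topic"
    kind: str          # "input", "output" or "control"
    technology: str    # e.g. "kafka", "bigquery", "http"
    schema_ref: str    # pointer to the schema this port consumes or exposes

@dataclass
class DataProductSpec:
    name: str
    version: str       # the spec itself is versioned, like any product artifact
    owner: str         # the team accountable for the product
    ports: List[Port] = field(default_factory=list)

    def output_ports(self) -> List[Port]:
        """The ports that form the product's public, contract-bearing surface."""
        return [p for p in self.ports if p.kind == "output"]

orders = DataProductSpec(
    name="orders",
    version="1.2.0",
    owner="order-management-team",
    ports=[
        Port("orders_topic", "input", "kafka", "schemas/orders_v2.avsc"),
        Port("orders_daily", "output", "bigquery", "schemas/orders_daily_v1.json"),
        Port("quality_metrics", "control", "http", "schemas/quality_v1.json"),
    ],
)
print([p.name for p in orders.output_ports()])   # ['orders_daily']
```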

The middle level refers to the data that flows between systems. At this stage, decisions have a wider impact: exposing data to other systems creates a contract that needs to be respected, i.e. it exposes an API for your data product. Points to consider as part of the data contract include the supported data formats, cross-organisation standards, the data schema, metadata, and discoverability.
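
As an illustration of what such a contract implies, this sketch (the field names are invented, not taken from the talk) treats the contract as the set of fields and types consumers depend on, and lets a producer check whether a proposed schema still honours it before publishing.

```python
# Illustrative sketch: a data contract seen as the fields and types that
# consumers depend on, plus a check a producer can run before publishing
# a new schema version. Field names are made up for the example.

CONTRACT_V1 = {
    "order_id": str,
    "customer_id": str,
    "amount": float,
    "currency": str,
}

def contract_violations(new_schema: dict, contract: dict = CONTRACT_V1) -> list:
    """Return the contracted fields that the proposed schema no longer honours."""
    violations = []
    for field_name, field_type in contract.items():
        if field_name not in new_schema:
            violations.append(f"missing field: {field_name}")
        elif new_schema[field_name] is not field_type:
            violations.append(f"type change on field: {field_name}")
    return violations

# Adding new fields is backwards compatible; dropping or retyping contracted
# fields breaks the consumers that rely on them.
proposed = {"order_id": str, "customer_id": str, "amount": float,
            "currency": str, "sales_channel": str}
print(contract_violations(proposed))   # []
```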

When talking about data in motion, there are two major paradigms: batch (a bounded data set for a given period) and streaming (an unbounded data set). For the latter, you need extra controls in case data arrives late or fails to process, as the sketch below illustrates. The data mesh book mentions three types of data products: source-aligned, aggregated, and consumer-aligned data products.
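
The following sketch is not a real streaming framework; the window size, lateness allowance and events are made up. It shows the kind of extra control late data requires: events are assigned to event-time windows, and a window only accepts stragglers until an allowed-lateness threshold has passed relative to the watermark.

```python
from collections import defaultdict

# Minimal, illustrative event-time windowing with an allowed-lateness threshold.

WINDOW = 60            # window size in seconds of event time
ALLOWED_LATENESS = 30  # how long a window keeps accepting stragglers

windows = defaultdict(list)  # window start -> values counted in that window
side_output = []             # events that arrived too late to be counted
watermark = 0                # highest event time observed so far

def process(event_time: int, value: str) -> None:
    global watermark
    watermark = max(watermark, event_time)
    window_start = (event_time // WINDOW) * WINDOW
    window_end = window_start + WINDOW
    if watermark > window_end + ALLOWED_LATENESS:
        side_output.append((event_time, value))  # too late: handle separately
    else:
        windows[window_start].append(value)      # on time or within the allowance

process(10, "a")
process(70, "b")   # advances the watermark past the first window
process(55, "c")   # late, but within the allowance, so still counted
process(200, "d")  # pushes the watermark far ahead
process(50, "e")   # now beyond the allowance: goes to the side output
print(dict(windows), side_output)
```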

The highest level refers to the enterprise-wide concerns of organisational structure and data governance. At this level, the questions are about how the business is organised, what the domains are, who owns them, and where the seams between those domains lie. The strategic design part of DDD helps answer these questions.

Moving to a decentralized model requires ensuring that each data product is owned for the long term, even though a team might own multiple products. A self-service platform is key to implementing cohesive data products: it helps avoid heterogeneous approaches when building them and makes governance easier to implement.

At the enterprise level, it is also important to talk about data governance: how does the company deal with ownership, accessibility, security and data quality? Given that governance is also about people and processes, Sato pointed to Team Topologies as an inspiration for how to organise teams around data.

He concluded by stating, "Thinking about data is a lot of things: data in the system, between systems and at the enterprise level."

