Data Gateways in the Cloud Native Era

Key Takeaways

  • Application architectures have evolved to split the frontend from the backend, and further split the backend into independent microservices.
  • Modern distributed application architectures created the need for API Gateways and helped popularize API Management and Service Mesh technologies.
  • Microservices give the freedom to use the most suitable database type depending on the needs of the service. Such a polyglot persistence layer raises the need for API Gateway-like capabilities, but for the data layer.
  • Data Gateways act like API Gateways but focus on the data aspect. A Data Gateway offers abstractions, security, scaling, federation, and contract-driven development features.
  • There are many types of Data Gateways, from the traditional data virtualization technologies, to light GraphQL translators, cloud-hosted services, connection pools, and fully open source alternatives.
     

These days, there is a lot of excitement around 12-factor apps, microservices, and service mesh, but not so much around cloud-native data. The number of conference talks, blog posts, best practices, and purpose-built tools around cloud-native data access is relatively low. One of the main reasons is that most data access technologies are architected and built on stacks that favor static environments rather than the dynamic nature of cloud environments and Kubernetes.

In this article, we will explore the different categories of data gateways, from more monolithic ones to those designed for the cloud and Kubernetes. We will see which technical challenges the microservices architecture introduces and how data gateways can complement API gateways to address these challenges in the Kubernetes era.

Application architecture evolutions

Let’s start with what has changed in the way we manage code and data in the past decade or so. I still remember the time when I started my IT career by creating frontends with Servlets, JSP, and JSF. On the backend, EJBs, SOAP, and server-side session management were the state-of-the-art technologies and techniques. But things changed rather quickly with the introduction of REST and the popularization of JavaScript. REST helped us decouple frontends from backends through a uniform interface and resource-oriented requests. By moving all session state to the clients, it popularized stateless services and enabled response caching. This new architecture was the answer to the huge scalability demands of modern businesses.

A similar change happened with the backend services through the microservices movement. Decoupling from the frontend was not enough, and the monolithic backend had to be decoupled into bounded contexts, enabling independent, fast-paced releases. These are examples of how architectures, tools, and techniques evolved under pressure from the business need for fast software delivery of planet-scale applications.

That takes us to the data layer. One of the existential motivations for microservices is having independent data sources per service. If you have microservices touching the same data, that sooner or later introduces coupling and limits independent scalability or releasing. It is not only an independent database but also a heterogeneous one, so every microservice is free to use the database type that fits its needs.

Application architecture evolution brings new challenges

While decoupling the frontend from the backend and splitting monoliths into microservices gave the desired flexibility, it created challenges that were not present before. Service discovery and load balancing, network-level resilience, and observability turned into major areas of technology innovation addressed in the years that followed.

Similarly, creating a database per microservice, with the freedom to choose among different datastore technologies, is a challenge. This has become increasingly visible with the recent explosion of data and the demand for accessing data not only from the services themselves but also for real-time reporting and AI/ML needs.

The rise of API gateways

With the increasing adoption of microservices, it became apparent that operating such an architecture is hard. While having every microservice independent sounds great, it requires tools and practices that we didn’t need and didn’t have before. This gave rise to more advanced release strategies such as blue/green deployments, canary releases, and dark launches. Then that gave rise to fault injection and automatic recovery testing. And finally, that gave rise to advanced network telemetry and tracing. All of these created a whole new layer that sits between the frontend and the backend. This layer is occupied primarily with API management gateways, service discovery, and service mesh technologies, but also with tracing components, application load balancers, and all kinds of traffic management and monitoring proxies. This even includes projects such as Knative, with activation and scale-to-zero features driven by networking activity.

With time, it became apparent that creating microservices at a fast pace and operating them at scale require tooling we didn’t need before. Something that was fully handled by a single load balancer had to be replaced with a new advanced management layer. A new technology layer, a new set of practices and techniques, and a new group of users responsible for them were born.

The case for data gateways

Microservices influence the data layer in two dimensions. First, they demand an independent database per microservice. From a practical implementation point of view, this can range from an independent database instance to independent schemas and logical groupings of tables. The main rule here is that only one microservice owns and touches a dataset, and all data is accessed through the APIs or events of the owning microservice. The second way a microservices architecture influences the data layer is through datastore proliferation. Just as it enables microservices to be written in different languages, this architecture gives every microservices-based system the freedom to have a polyglot persistence layer. With this freedom, one microservice can use a relational database, another can use a document database, and a third can use an in-memory key-value store.

While microservices allow you all that freedom, again it comes at a cost. It turns out that operating a large number of datastores is something existing tooling and practices were not prepared for. In the modern digital world, storing data in a reliable form is not enough. Data is useful when it turns into insights, and for that, it has to be accessible in a controlled form by many. AI/ML experts, data scientists, and business analysts all want to dig into the data, but the application-focused microservices and their data access patterns are not designed for these data-hungry demands.

API and Data gateways offering similar capabilities at different layers

This is where data gateways can help you. A data gateway is like an API gateway, but it understands and acts on the physical data layer rather than the networking layer. Here are a few areas where data gateways differ from API gateways.

Abstraction

An API gateway can hide implementation endpoints and help upgrade and rollback services without affecting service consumers. Similarly, a data gateway can help abstract a physical data source, its specifics, and help alter, migrate, decommission, without affecting data consumers.
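
To make the abstraction idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a gateway. The table and view names (customers_v1, customers_v2, customers) are hypothetical and chosen only for illustration. Consumers query a stable logical view, while the physical table behind it is migrated and then decommissioned.

```python
import sqlite3

# In-memory database standing in for the physical data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_v1 (id INTEGER, full_name TEXT)")
conn.execute("INSERT INTO customers_v1 VALUES (1, 'Ada Lovelace')")

# Consumers only ever see the logical view "customers".
conn.execute(
    "CREATE VIEW customers AS SELECT id, full_name AS name FROM customers_v1"
)


def read_customers(db):
    """What a data consumer runs; it never names a physical table."""
    return db.execute("SELECT id, name FROM customers ORDER BY id").fetchall()


before = read_customers(conn)

# Migrate to a new physical layout, repoint the view, retire the old table.
conn.execute("CREATE TABLE customers_v2 (id INTEGER, first TEXT, last TEXT)")
conn.execute("INSERT INTO customers_v2 VALUES (1, 'Ada', 'Lovelace')")
conn.execute("DROP VIEW customers")
conn.execute(
    "CREATE VIEW customers AS "
    "SELECT id, first || ' ' || last AS name FROM customers_v2"
)
conn.execute("DROP TABLE customers_v1")

after = read_customers(conn)
```

The consumer query is identical before and after the migration. A real data gateway would do the same across separate physical systems rather than within one database.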

Security

An API manager secures resource endpoints based on HTTP methods. A service mesh secures based on network connections. But neither of them can understand and secure the data, and its shape, passing through them. A data gateway, on the other hand, understands the different data sources and the data model and acts on them. It can apply RBAC per data row and column, and filter, obfuscate, and sanitize the individual data elements whenever necessary. This is a more fine-grained security model than the network- or API-level security of API gateways.
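
As a rough illustration of row- and column-level security, the sketch below applies a policy to rows on their way to a consumer. The function names, the policy shape, and the sample data are all assumptions made for this example; they are not taken from any particular gateway product.

```python
def mask(value):
    """Obfuscate everything except the last four characters."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


def apply_policy(rows, visible, masked=(), row_filter=lambda row: True):
    """Return only permitted rows and columns, masking sensitive values."""
    result = []
    for row in rows:
        if not row_filter(row):          # row-level security
            continue
        projected = {}
        for column, value in row.items():
            if column not in visible:    # column-level security
                continue
            projected[column] = mask(value) if column in masked else value
        result.append(projected)
    return result


# Hypothetical rows as they come back from a physical data source.
rows = [
    {"id": 1, "region": "EU", "name": "Ada", "card": "4111111111111111"},
    {"id": 2, "region": "US", "name": "Bob", "card": "5500000000000004"},
]

# Hypothetical "EU analyst" role: EU rows only, no names, card numbers masked.
eu_analyst_view = apply_policy(
    rows,
    visible={"id", "region", "card"},
    masked={"card"},
    row_filter=lambda row: row["region"] == "EU",
)
```

Here the policy is enforced in one place, independent of which data source produced the rows, which is the property that distinguishes this model from endpoint-level security.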

Scaling

API gateways can do service discovery, load-balancing, and assist the scaling of services through an orchestrator such as Kubernetes. But they cannot scale data. Data can scale only through replication and caching. Some data stores can do replication in cloud-native environments but not all. Purpose-built tools, such as Debezium, can perform change data capture from the transaction logs of data stores and enable data replication for scaling and other use cases.
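
The replication side of change data capture can be sketched as replaying a log of change events against a replica. The event shape below (op, before, after) is loosely modeled on the kind of envelope CDC tools emit, but it is simplified for illustration and should not be read as Debezium's actual API or message format.

```python
def apply_change(replica, event):
    """Apply one change event to a replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "u"):      # create or update: take the new row image
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":           # delete: remove the row named by the old image
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported operation: {op}")


# Hypothetical change log, in commit order.
change_log = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica = {}
for event in change_log:
    apply_change(replica, event)
```

Because the replica is built purely from the ordered event stream, any number of read-only copies can be maintained without adding query load to the source database.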

A data gateway, on the other hand, can speed up access to all kinds of data sources by caching data and providing materialized views. It can understand the queries, optimize them based on the capabilities of the data source, and produce the most performant execution plan. The combination of materialized views and the streaming nature of change data capture would be the ultimate data scaling technique, but there are no known cloud-native implementations of this yet.
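
A materialized view can be approximated in a few lines with Python's sqlite3 module: the result of an expensive query is stored as a table and served from there until the next refresh. The table names (orders, order_totals) and the full-refresh strategy are assumptions for this sketch; real gateways typically offer incremental or scheduled refresh policies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ada", 10.0), ("ada", 5.0), ("bob", 7.5)],
)


def refresh_totals(db):
    """Recompute the materialized result from the source table."""
    db.execute("DROP TABLE IF EXISTS order_totals")
    db.execute(
        "CREATE TABLE order_totals AS "
        "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
    )


def read_totals(db):
    """Reads are served from the materialized table, not the source."""
    return dict(db.execute("SELECT customer, total FROM order_totals"))


refresh_totals(conn)
first = read_totals(conn)

conn.execute("INSERT INTO orders VALUES ('bob', 2.5)")
stale = read_totals(conn)    # still the old totals until the next refresh

refresh_totals(conn)
fresh = read_totals(conn)
```

The stale read in the middle shows the trade-off: materialization buys read speed at the price of freshness, which is why pairing it with a change stream is attractive.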

Federation

In API management, response composition is a common technique for aggregating data from multiple different systems. In the data space, the same technique is referred to as heterogeneous data federation. Heterogeneity is the degree of differentiation in various data sources such as network protocols, query languages, query capabilities, data models, error handling, transaction semantics, etc. A data gateway can accommodate all of these differences as a seamless, transparent data-federation layer.
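
The following sketch shows federation in miniature: one logical query is answered by combining a relational source (sqlite3) with a document-style source (a list of dictionaries standing in for a document store). All names and data are hypothetical. The filter on quantity is applied at the document source, and the customer lookup is pushed down to the relational source.

```python
import sqlite3

# Source 1: a relational store.
sql_db = sqlite3.connect(":memory:")
sql_db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
sql_db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Bob")])

# Source 2: a stand-in for a document store.
order_documents = [
    {"customer_id": 1, "item": "keyboard", "qty": 2},
    {"customer_id": 2, "item": "monitor", "qty": 1},
    {"customer_id": 1, "item": "mouse", "qty": 1},
]


def fetch_customer_names(db, ids):
    """Push the id filter down to the relational source."""
    placeholders = ",".join("?" for _ in ids)
    cursor = db.execute(
        f"SELECT id, name FROM customers WHERE id IN ({placeholders})",
        sorted(ids),
    )
    return dict(cursor.fetchall())


def federated_orders(db, documents, min_qty):
    """Join documents with relational rows behind one logical query."""
    matching = [doc for doc in documents if doc["qty"] >= min_qty]
    if not matching:
        return []
    names = fetch_customer_names(db, {doc["customer_id"] for doc in matching})
    return [
        {"customer": names[doc["customer_id"]], "item": doc["item"], "qty": doc["qty"]}
        for doc in matching
    ]


large_orders = federated_orders(sql_db, order_documents, min_qty=2)
```

The caller sees a single result set and does not need to know that two different storage models, with different query capabilities, were involved.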

Schema-first

API gateways allow contract-first service and client development with specifications such as OpenAPI. Data gateways allow schema-first data consumption based on the SQL standard. A SQL schema for data modeling is the OpenAPI equivalent of APIs.
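
One way to picture schema-first consumption is a check that a physical table still satisfies the schema that consumers were promised. The contract format and function names below are assumptions made for this sketch, and sqlite3's PRAGMA table_info is used only because it is readily available; a gateway would hold the contract as a SQL schema of its own.

```python
import sqlite3

# Hypothetical contract: column name -> declared SQL type.
CONTRACT = {"id": "INTEGER", "email": "TEXT"}


def actual_columns(db, table):
    """Read column names and declared types from the physical table."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    return {row[1]: row[2].upper() for row in db.execute(f"PRAGMA table_info({table})")}


def contract_violations(db, table, contract):
    """List every way the physical table falls short of the contract."""
    actual = actual_columns(db, table)
    problems = []
    for column, expected_type in contract.items():
        if column not in actual:
            problems.append(f"missing column {column}")
        elif actual[column] != expected_type:
            problems.append(
                f"{column}: expected {expected_type}, found {actual[column]}"
            )
    return problems


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, email TEXT, created TEXT)")
db.execute("CREATE TABLE legacy_users (id TEXT)")

ok = contract_violations(db, "users", CONTRACT)
broken = contract_violations(db, "legacy_users", CONTRACT)
```

Extra columns in the physical table are allowed, in the same way that an API may return fields a client ignores; only missing or mistyped columns break the contract.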

Many shades of data gateways

In this article, I use the terms API and data gateways loosely to refer to a set of capabilities. There are many types of API gateways, such as API managers, load balancers, service meshes, and service registries. The same is true of data gateways, which range from huge monolithic data virtualization platforms that want to do everything, to data federation libraries, and from purpose-built cloud services to end-user query tools.

Let’s explore the different types of data gateways and see which fit the definition of “a cloud-native data gateway.” When I say a cloud-native data gateway, I mean a containerized first-class Kubernetes citizen. I mean a gateway that is open source and uses open standards; a component that can be deployed on hybrid/multi-cloud infrastructures, work with different data sources and data formats, and be applicable to many use cases.

Classic data virtualization platforms

In the first category of data gateways are the traditional data virtualization platforms such as Denodo and TIBCO/Composite. While these are the most feature-laden data platforms, they tend to do too much and want to be everything, from API management to metadata management, data cataloging, environment management, deployment, configuration management, and whatnot. From an architectural point of view, they are very much like the old ESBs, but for the data layer. You may manage to put them into a container, but it is hard to put them into the cloud-native citizen category.

Databases with data federation capabilities

Another emerging trend is the fact that databases, in addition to storing data, are also starting to act as data federation gateways and allowing access to external data.

For example, PostgreSQL implements the ANSI SQL/MED specification for a standardized way of handling access to remote objects from SQL databases. That means remote data stores, such as SQL, NoSQL, File, LDAP, Web, Big Data, can all be accessed as if they were tables in the same PostgreSQL database. SQL/MED stands for Management of External Data, and it is also implemented by MariaDB CONNECT engine, DB2, Teiid project discussed below, and a few others.

Starting with SQL Server 2019, you can query external data sources without moving or copying the data. The PolyBase engine enables a SQL Server instance to process Transact-SQL queries that access external data in SQL Server, Oracle, Teradata, and MongoDB.

GraphQL data bridges

Compared to traditional data virtualization, this is a new category of data gateways focused on fast web-based data access. What Hasura, Prisma, and SpaceUpTech have in common is that they focus on GraphQL data access by offering a lightweight abstraction on top of a few data sources. This is a fast-growing category specialized in enabling rapid web-based development of data-driven applications rather than BI/AI/ML use cases.

Open-source data gateways

Apache Drill is a schema-free SQL query engine for NoSQL databases and file systems. It offers JDBC and ODBC access to business users, analysts, and data scientists on top of data sources that don’t support such APIs. Again, having uniform SQL-based access to disparate data sources is the driver. While Drill is highly scalable, it relies on Hadoop- or Apache ZooKeeper-style infrastructure, which shows its age.

Teiid is a project sponsored by Red Hat, and it is the one I’m most familiar with. It is a mature data federation engine purposefully rewritten for the Kubernetes ecosystem. It uses the SQL/MED specification for defining the virtual data models and relies on the Kubernetes Operator model for the building, deployment, and management of its runtime on OpenShift. Once deployed, the runtime can scale like any other stateless cloud-native workload on Kubernetes and integrate with other cloud-native projects. For example, it can use Keycloak for single sign-on and data roles, Infinispan for distributed caching needs, export metrics and register with Prometheus for monitoring, use Jaeger for tracing, and even integrate with 3scale for API management. But ultimately, Teiid runs as a single Spring Boot application acting as a data proxy and integrating with other best-of-breed services on OpenShift rather than trying to reinvent everything from scratch.

Architectural overview of Teiid data gateway

On the client side, Teiid offers standard SQL over JDBC/ODBC and OData APIs. Business users, analysts, and data scientists can use standard BI/analytics tools such as Tableau, MicroStrategy, Spotfire, etc. to interact with Teiid. Developers can leverage the REST API or JDBC for custom-built microservices and serverless workloads. In either case, for data consumers, Teiid appears as a standard PostgreSQL database accessed over its JDBC or ODBC protocols, but offering additional abstractions and decoupling from the physical data sources.

PrestoDB is another popular open-source project started by Facebook. It is a distributed SQL query engine targeting big data use cases through its coordinator-worker architecture. The coordinator is responsible for parsing statements, planning queries, managing workers, fetching results from the workers, and returning the final results to the client. The worker is responsible for executing tasks and processing data. Recently, the Presto community split: the project's original creators forked it as PrestoSQL, while PrestoDB is now hosted under The Linux Foundation. While forking is a common and natural path for many open-source projects, unfortunately, in this case, the similarity in the names and all of the other community-facing artifacts generates some confusion. Regardless of this, both distributions of Presto are among the most popular open-source projects in this space.
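
The coordinator-worker split described above can be sketched with a thread pool: a coordinator plans one task per data partition, workers compute partial aggregates, and the coordinator merges them. This is only an illustration of the division of labor under simplified assumptions; the function names and data are hypothetical, and none of this reflects Presto's actual APIs or wire protocol.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(partition, min_amount):
    """Worker: scan one partition and return a partial aggregate."""
    matching = [amount for amount in partition if amount >= min_amount]
    return sum(matching), len(matching)


def coordinator(partitions, min_amount, max_workers=3):
    """Coordinator: plan one task per partition, dispatch, merge results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(
            pool.map(lambda part: worker_task(part, min_amount), partitions)
        )
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return {"sum": total, "count": count}


# Hypothetical dataset, already split into three partitions.
partitions = [[5, 20, 30], [1, 50], [10, 10]]

# Roughly: SELECT SUM(amount), COUNT(*) FROM t WHERE amount >= 10
result = coordinator(partitions, min_amount=10)
```

Because each worker returns only a small partial aggregate, the coordinator never has to hold the full dataset, which is what lets this style of engine scale out with the number of workers.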

Cloud-hosted data gateways services

With a move to the cloud infrastructure, the need for data gateways doesn’t go away but increases instead. Here are a few cloud-based data gateway services:

AWS Athena is an ANSI SQL-based interactive query service for analyzing data, tightly integrated with Amazon S3. It is based on PrestoDB and supports additional data sources and federation capabilities too. Another similar service by Amazon is AWS Redshift Spectrum. It is focused on the same functionality, i.e. querying S3 objects using SQL. The main difference is that Redshift Spectrum requires a Redshift cluster, whereas Athena is a serverless offering that doesn’t require you to manage any servers. BigQuery is a similar service from Google.

These tools require minimal to no setup, can access on-premise or cloud-hosted data, and can process huge datasets. But they couple you to a single cloud provider, as they cannot be deployed on multiple clouds or on-premise. They are ideal for interactive querying rather than acting as a hybrid data frontend for other services and tools to use.

Secure tunneling data-proxies

With cloud-hosted data gateways comes the need for accessing on-premise data. Data has gravity and also might be affected by regulatory requirements preventing it from moving to the cloud. It may also be a conscious decision to keep the most valuable asset (your data) from cloud-coupling. All of these cases require cloud access to on-premise data. And cloud providers make it easy to reach your data. Azure’s On-premises Data Gateway is such a proxy allowing access to on-premise data stores from Azure Service Bus.

In the opposite scenario, accessing cloud-hosted data stores from on-premise clients can be challenging too. Google’s Cloud SQL Proxy provides secure access to Cloud SQL instances without having to whitelist IP addresses or configure SSL.

The Red Hat-sponsored open-source project Skupper takes a more generic approach to addressing these challenges. Skupper solves Kubernetes multi-cluster communication challenges through a layer 7 virtual network that offers advanced routing and secure connectivity capabilities. Rather than being embedded into the business service runtime, Skupper runs as a standalone instance per Kubernetes namespace and acts as a shared sidecar capable of secure tunneling for data access or other general service-to-service communication. It is a generic secure-connectivity proxy applicable to many use cases in the hybrid cloud world.

Connection pools for serverless workloads

Serverless takes software decomposition a step further than microservices. Rather than splitting services by bounded context, serverless is based on the function model, where every function is short-lived and performs a single operation. These granular software constructs are extremely scalable and flexible but come at a cost that previously wasn’t present. It turns out that rapid scaling of functions is a challenge for connection-oriented data sources such as relational databases and message brokers. As a result, cloud providers offer transparent data proxies as a service to manage connection pools effectively. Amazon RDS Proxy is such a service: it sits between your application and your relational database to efficiently manage connections to the database and improve scalability.
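
The core idea behind such a proxy can be sketched as a fixed-size connection pool shared by many short-lived function invocations. The class and handler below are hypothetical and use in-memory sqlite3 connections as stand-ins for real database connections; this is not how Amazon RDS Proxy is implemented, only an illustration of why pooling keeps the connection count bounded.

```python
import queue
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: callers borrow a connection and must return it."""

    def __init__(self, size):
        self._idle = queue.Queue()
        self.opened = 0
        for _ in range(size):
            # check_same_thread=False lets a pooled connection be used by
            # whichever thread borrows it (one thread at a time).
            self._idle.put(sqlite3.connect(":memory:", check_same_thread=False))
            self.opened += 1

    @contextmanager
    def connection(self):
        conn = self._idle.get()   # blocks while all connections are busy
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


def handler(pool, n):
    """Stand-in for a short-lived function invocation."""
    with pool.connection() as conn:
        return conn.execute("SELECT ? * 2", (n,)).fetchone()[0]


pool = ConnectionPool(size=2)
with ThreadPoolExecutor(max_workers=8) as executor:
    results = sorted(executor.map(lambda n: handler(pool, n), range(20)))
```

Twenty invocations running on eight concurrent threads are served with only two database connections; without the pool, each invocation would open its own.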

Conclusion

Modern cloud-native architectures combined with microservices principles enable the creation of highly scalable and independent applications. The large choice of data storage engines, cloud-hosted services, protocols, and data formats gives the ultimate flexibility for delivering software at a fast pace. But all of that comes at a cost that becomes increasingly visible with the need for uniform real-time data access from emerging user groups with different needs. Keeping microservices data only for the microservice itself creates challenges that have no good technological or architectural answers yet. Data gateways, combined with cloud-native technologies, offer features similar to API gateways, but for the data layer, and can help address these new challenges. Data gateways vary in specialization, but they tend to converge on providing uniform SQL-based access, enhanced security with data roles, caching, and abstraction over physical data stores.

Data has gravity, requires granular access control, is hard to scale, and is difficult to move onto, off, and between cloud-native infrastructures. Having a data gateway component as part of the cloud-native tooling arsenal, one that is hybrid, works on multiple cloud providers, and supports different use cases, is becoming a necessity.

About the Author

Bilgin Ibryam is a product manager at Red Hat, committer and member of Apache Software Foundation. He is an open source evangelist, blogger, occasional speaker, and the author of Kubernetes Patterns and Camel Design Patterns books. In his day-to-day job, Bilgin enjoys mentoring, coding and leading developers to be successful with building open source solutions. His current work focuses on blockchain, distributed systems, microservices, devops, and cloud-native application development.

Community comments

  • Data ESB?

    by Richard Clayton,

    I cannot comment on the efficacy or ergonomics of Teiid (looks cool), but I suspect this pattern is not going to be well-liked by developers. It's reminiscent of ESBs, but treads more dangerous waters. Shimming services with an ESB was already a hard sell for many, particularly when you had to deal with clunky glue code or declarative bindings. I can imagine Data Gateways being even worse.

    Some of my immediate concerns would be that you would lose some of the vendor-specific features offered by data stores because the gateway would have to support the lowest-common-denominator interface for the protocol (at the very least, devs would have to deal with upgrade latency between the gateway and DB vendors). Another concern would be the increased latencies by introducing a proxy between the app and the data store (though I admit this could be minimal).

    If you are a large enterprise and have the resources to invest in this strategy, this approach might make sense. I would predict "Data Gateways" will continue to be an effective strategy for building "piers" on Data Lakes, but not really a solution for transactional applications. If I was that large enterprise, I would instead invest in change-data capture at the service level (where I can), and resort to tools like Debezium for systems I could not affect (Debezium being a fine tool, but putting engineers in the position of having to infer model changes).

  • Re: Data ESB?

    by Bilgin Ibryam,

    Thanks for sharing your thoughts, Richard.

    I agree, there are definitely “Data ESBs” around to be mindful of. If you look behind the shiny UI and the D&D tools of some of these “platforms”, you will see that the architecture, and what you build with it, is an ESB for the data layer. (BTW, you can do the same with an API management platform if you start putting a bit of business logic there - that is coming in a different post soon.) A data gateway can be implemented as an ESB, but also as a microservice that you put in a sidecar and deploy next to your microservice - completely independent and within the context of your microservice. It can open controlled data-intensive APIs (JDBC/ODBC) for BAs and data scientists to explore data, and enable architectures such as data mesh. Data gateway is a pattern and can be implemented in good or bad ways.

    As for the “lowest-common-denominator” - the same can be said about not having read optimized features such as federation, materialization, query optimization, data firewall on every data source. Having specialized capabilities, independent from the data sources, can be useful as a common denominator API - SQL.
    
    Back to data: I’m glad you brought up Debezium, as it is one of my preferred tools too. But CDC and federation are two sides of the same problem: Debezium is good for unifying writes, whereas a data gateway is good for unifying reads.
    The idea of a data gateway is to provide a unified query interface (it is not ideal for writes/transactions) to a variety of storage engines, i.e. federation.
    While federation addresses read-only querying across several systems, Debezium addresses the propagation of writes in a reliable manner across different systems. This is described on page 501 of “Designing Data-Intensive Applications” by Martin Kleppmann (@martinkl), an awesome book, where the two approaches are referred to as federated databases and unbundled databases.

  • Re: Data ESB?

    by Gunnar Morling,

    > I would instead invest in change-data capture at the service level (where I can), and resort to tools like Debezium for systems I could not affect

    Note you also can benefit from Debezium and change data capture when being in a position to affect the source system: via its support for the outbox pattern. You could have your application produce messages for downstream consumers into a separate outbox table, which would be captured efficiently via Debezium (no polling etc.). The schema of those messages would be independent from that of the internal business table, providing more flexibility in case of internal model changes.

  • Excellent post

    by BMK Lakshminarayanan,

    Excellent post, Bilgin Ibryam. Thank you for compiling this list of projects, categorizing them brilliantly, and presenting it to us.

    Thank you
    BMK

  • Is the Data Gateway for OLTP, OLAP or Both?

    by Zhong Yang,

    I can see how a data gateway can be used for OLAP. It will not work for OLTP. From an architecture perspective, data belongs to a certain domain. Sometimes data may belong to more than one domain. Microservices are built to operate on data within clearly defined domains. When developers' code needs to access certain data, it is very clear to them what they need to do. If the data is in your domain, your code can directly operate on it. If the data belongs to a different domain, you make a service call. A data abstraction layer will destroy data ownership and allow any service to directly (through the data gateway) operate on any data. Data integrity will become a major issue. What is even worse is that people will begin to leverage the shared data layer for integration. Now we have a cloud-native monolithic application.

  • Data gateway sounds like new monolith.

    by Kartik Rallapalli,

    A good REST API that abstracts the underlying data and provides granular access can serve this purpose. This data can be used for data lakes, business activity monitoring, and prediction models. Using data gateway terminology for this is not appropriate. In a microservices architecture, the driver is business function as a service, not data as a service. A data gateway with a SQL layer would lead to the same issues as doing OLAP on a monolithic transactional DB. This approach contradicts the isolation and abstraction of the data layer, which is core to MSA.

  • Re: Is the Data Gateway for OLTP, OLAP or Both?

    by Bilgin Ibryam,

    I agree the sweet spot here is OLAP - read-only, federated data access, rather than OLTP.
    But there are a few exceptions here that apply to OLTP too:
    - Data gateway is a generic pattern; it can be something transparent, such as Amazon RDS Proxy, that your application is not even aware it is using.
    - It can be used for monolith-to-microservices migration scenarios. I’ll cite another book here: Monolith to Microservices by Sam Newman describes patterns such as the database view pattern, the database-as-a-service pattern, and the multischema storage pattern. These patterns can be implemented with data gateways.

  • Re: Data gateway sounds like new monolith.

    by Bilgin Ibryam,

    A REST API, and business function as a service, is only one (developer-centric) view of data. There are also other users interested in the data whose needs a REST API cannot satisfy. That is the reason for tools such as Debezium, which streams data directly from the database, or Apache Airflow, which runs ETL jobs to move the data. Many of these challenges are described in the following Data Mesh Paradigm talk: www.infoq.com/presentations/data-mesh-paradigm
    A data gateway is no different from these in allowing controlled (I'd say primarily read-only) access to data. And having a data gateway as part of the microservice/domain is actually better than expecting a different team to extract data directly from the database.

  • Re: Data gateway sounds like new monolith.

    by Lawrence Hecht,

    I wish the term Data Mesh were widely used outside of the Thoughtworks consulting community.
