Key Takeaways
- RESTful API design is great for developing API-first platforms but not so great for developing GUI applications.
- GraphQL API Design is great for Backends for Frontends (BfFs) but not so great for data services.
- Accidental complexity is a bad thing and should be avoided where possible. It can be managed if architects enforce a clear separation of concerns across the various layers and components.
- Coupling is a bad thing too; however, it is also inevitable. While architects should strive to keep the coupling between GUI and data services loose, it is understood that there will be tight coupling between frontend applications and BfFs.
- In addition to putting a BfF between each GUI application and the data services, adopting an API gateway is also useful for maintaining a clear separation of concerns (SoC).
It’s been over a decade since the current incarnation of microservice architecture was first embraced by technology companies, with mixed results. Dependencies between components are a fact of life, but the benefits of microservices are limited when those dependencies are such that a change to one service necessitates changes to the others. This is what is known as tight coupling.
What good is a microservice if it cannot deliver what the user wants? This article focuses on how to maintain useful microservices and still benefit from the process improvements promised by the migration of monoliths to microservices. The key is to maintain a clear separation of concerns in the various layers and employ design principles best suited to each layer.
RESTful API Design
In the year 2000, Roy Fielding, a co-founder of the Apache HTTP Server project, submitted a Ph.D. dissertation called “Architectural Styles and the Design of Network-Based Software Architectures.” Chapter 5 covers REST, or Representational State Transfer, as an architectural style for distributed hypermedia systems. A good resource for learning about REST as it applies to API design is Martin Fowler’s blog post on the Richardson Maturity Model.
There is a close and consistent alignment between RESTful API design and the HTTP standard. The path part of the URI identifies a specific entity (also known as a resource) in a hierarchical taxonomy, and the HTTP verb identifies the type of action to be performed on that entity. Entities can link to other entities via the path part of the other entity’s URI. There is also consistency in how to interpret the meaning of the resulting HTTP status code.
There are many reasons why companies that provide online applications are strongly motivated to design their APIs as if they were a platform. Perhaps they wish to open up additional revenue streams from third parties or upsell to power users. They may wish to make it easier for different teams to call each other’s APIs, or to organize their APIs in such a way that they are easily reused by similar or related products in the same portfolio. Finally, platform-style APIs are easier to develop test automation for.
Just as companies develop GUI applications that are easy for users to understand, platform companies should develop APIs that are easy for developers to understand. RESTful APIs are well suited for this requirement.
In the days before RESTful API design, APIs were based on the Remote Procedure Call (RPC). There was no consistency in the design of these APIs, which made them hard to reason about; each one had to be learned on its own terms. REST introduced consistency in API design. When you combine REST with OpenAPI, developers can quickly learn how to use your APIs.
Swagger specification for a RESTful API that supports a rudimentary news feed.
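To make that concrete, here is a minimal sketch of what such endpoints might look like, assuming Express; the route paths, types, and in-memory store are hypothetical stand-ins for the real thing:

```typescript
import express from "express";

interface NewsItem { id: string; headline: string; body: string; postedAt: string; }

// In-memory stand-in for the data access layer; a real service would use a DAO.
const store = new Map<string, NewsItem>();

const app = express();
app.use(express.json());

// GET /feed/news — the collection resource.
app.get("/feed/news", (_req, res) => {
  res.status(200).json([...store.values()]);
});

// GET /feed/news/:id — a single entity, addressed by its path.
app.get("/feed/news/:id", (req, res) => {
  const item = store.get(req.params.id);
  if (!item) return res.status(404).json({ error: "not found" });
  res.status(200).json(item);
});

// POST /feed/news — create an entity; reply 201 with a Location header.
app.post("/feed/news", (req, res) => {
  const item: NewsItem = {
    ...req.body,
    id: String(store.size + 1),
    postedAt: new Date().toISOString(),
  };
  store.set(item.id, item);
  res.status(201).location(`/feed/news/${item.id}`).json(item);
});

app.listen(3000);
```

Note how the path, the verb, and the status code each carry part of the meaning, which is exactly the consistency that makes such APIs easy to learn.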
Although RESTful APIs are easy for backend services to call, they are not so easy for frontend applications to call, because an emotionally satisfying user experience is not very RESTful. Users don’t want a GUI where entities are nicely segmented; they want to see everything at once unless progressive disclosure is called for. For example, I don’t want to navigate through multiple screens to review my travel itinerary; I want to see the summary (including flights, car rental, and hotel reservation) on one screen before I commit to making the purchase.
When a user navigates to a page on a web app, deep links into a Single Page Application (SPA), or opens a particular view in a mobile app, the frontend application needs to call the backend service to fetch the data needed to render the view. With RESTful APIs, it is unlikely that a single call will fetch all the data. Typically, one call is made, then the frontend code iterates through the results of that call and makes several more API calls per result item to get everything needed. Not only does this complicate frontend development, but it also increases page load time for the application. More on this later.
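Here is a sketch of that pattern from the browser’s side, assuming hypothetical news feed endpoints where each item requires three follow-up calls:

```typescript
// Hypothetical 1+N fetch pattern: one call for the list, then several
// more calls per item to gather everything the view needs.
async function loadFeedView(): Promise<unknown[]> {
  const items: Array<{ id: string; authorUri: string }> =
    await (await fetch("/feed/news")).json();

  // Each item costs additional round trips before the view can render.
  return Promise.all(
    items.map(async (item) => {
      const [author, comments, likes] = await Promise.all([
        fetch(item.authorUri).then((r) => r.json()),
        fetch(`/feed/news/${item.id}/comments`).then((r) => r.json()),
        fetch(`/feed/news/${item.id}/likes`).then((r) => r.json()),
      ]);
      return { ...item, author, comments, likes };
    })
  );
}
```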
Push notifications are another area where RESTful APIs are not a good fit for frontend GUI concerns. There is no support for push notifications in RESTful API design per se, though the OpenAPI spec does support callbacks (implemented by webhooks). Webhooks don’t help as much as WebSockets for pushing to the frontend: a WebSocket is initiated by the frontend and stays connected while the backend sends updates over it, whereas a webhook is a request initiated by the backend, and a web browser has no permanent, publicly addressable endpoint to receive such requests. There is also a shape mismatch. Because the path identifies a specific entity in RESTful API design, the request and response shouldn’t change shape dramatically; yet to save on connections, an SPA may open a single WebSocket for all types of push notifications, where the shape of each message can be very different.
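Here is a minimal sketch of that multiplexing from the browser’s side; the endpoint and the message type discriminator are assumed conventions, not part of any standard:

```typescript
// One WebSocket carries every kind of push notification; each message
// self-describes its shape via a type field (a hypothetical convention).
type PushMessage =
  | { type: "newsPosted"; headline: string }
  | { type: "commentAdded"; newsId: string; comment: string }
  | { type: "likeCountChanged"; newsId: string; likes: number };

const socket = new WebSocket("wss://example.com/notifications");

socket.onmessage = (event: MessageEvent<string>) => {
  const message = JSON.parse(event.data) as PushMessage;
  switch (message.type) {
    case "newsPosted":       /* prepend to the feed */          break;
    case "commentAdded":     /* refresh that item's comments */ break;
    case "likeCountChanged": /* update the counter in place */  break;
  }
};
```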
New user requirements, such as additional data fields, may mandate changes on both the frontend and the backend. This is the fundamental cause of tight coupling. Tight coupling across teams slows the pace of development, for reasons best understood via Conway’s law.
Communication costs across teams are higher than within a single team. Teams that mix frontend and backend developers can also lack effectiveness; while notionally on the same team, this structure often still has a divide between the frontend and backend developers. Almost as an unwritten addendum to Conway’s law, these hidden sub-teams hide some of the complexity of the software.
Release management also gets complicated when backend services and frontend apps are tightly coupled. One of the biggest advantages of microservices is that you don’t have to release everything all at once, yet tightly coupled components often have to be released at almost the same time, and both have to be rolled back if either one does.
GraphQL API Design
In the year 2015, Facebook introduced a different approach to API design known as GraphQL (often abbreviated GQL, for Graph Query Language). It centers on the specification of a schema of hierarchically organized types, including up to three special types: query, mutation, and subscription. The caller sends a command (consistent with the schema) that provides the criteria and specifies the shape of the data it expects in the response.
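For example, against a hypothetical news feed schema, a caller might POST a query like this and get back JSON shaped exactly as requested:

```typescript
// The caller declares the exact shape it wants; the server returns no more, no less.
const query = `
  query RecentNews {
    news(limit: 10) {
      headline
      author { displayName }
      comments { body }
    }
  }`;

async function fetchRecentNews() {
  const response = await fetch("/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  const { data } = await response.json();
  return data.news; // shaped exactly as requested above
}
```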
The most popular, full-featured, and mature implementation of a framework for GraphQL servers is shepherded by a small San Francisco startup known as Apollo. Their approach makes it very easy to compose new functionality from the clients without a lot of changes to the server.
The backend developer codes the schema and the resolvers. The framework calls the resolvers identified in the request, then stitches together each resolver’s response in a way that matches the shape the caller specified, as sketched after the schema figure below.
GraphQL schema of a similar rudimentary news feed.
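Here is a minimal sketch of a schema and its resolvers along those lines, assuming the Apollo Server framework; the type and DAO names are illustrative:

```typescript
import { ApolloServer, gql } from "apollo-server";

// Hypothetical DAOs standing in for real data access code.
const newsDao = {
  findRecent: async (limit: number) =>
    [{ id: "1", headline: "hello" }].slice(0, limit),
};
const commentsDao = {
  findByNewsId: async (_newsId: string) => ["nice post"],
};

// A pared-down schema along the lines of the figure above.
const typeDefs = gql`
  type News {
    id: ID!
    headline: String!
    comments: [String!]!
  }
  type Query {
    news(limit: Int): [News!]!
  }
`;

// The framework calls only the resolvers the request needs, then stitches
// their results into a response matching the shape the caller specified.
const resolvers = {
  Query: {
    news: async (_parent: unknown, args: { limit?: number }) =>
      newsDao.findRecent(args.limit ?? 10),
  },
  News: {
    // Field-level resolver: runs once per News item in the result set.
    comments: async (parent: { id: string }) =>
      commentsDao.findByNewsId(parent.id),
  },
};

new ApolloServer({ typeDefs, resolvers }).listen();
```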
Because resolvers operate at the field level, and because the mechanism for fetching the underlying data may actually retrieve more than one field or item at a time, there is the possibility of wastefully fetching the same data repeatedly. This is what is known as the N+1 problem. Backend code should compensate for that with some kind of per-request cache.
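In the Node.js ecosystem, one common way to get that per-request cache is the dataloader package, which batches and memoizes lookups for the lifetime of a single request. A sketch, with a hypothetical DAO standing in for real data access:

```typescript
import DataLoader from "dataloader";

// Hypothetical DAO; findByIds issues one query for many ids, e.g.
// SELECT ... WHERE id IN (...), instead of one query per id.
const authorDao = {
  findByIds: async (ids: string[]) =>
    ids.map((id) => ({ id, displayName: `user ${id}` })),
};

// The batch function must return results in the same order as the keys.
async function batchGetAuthors(ids: readonly string[]) {
  const rows = await authorDao.findByIds([...ids]);
  const byId = new Map(rows.map((row) => [row.id, row]));
  return ids.map((id) => byId.get(id) ?? null);
}

// Build one loader per request (e.g. in the GraphQL context factory) so
// the memoization cache lives exactly as long as the request does.
export function makeLoaders() {
  return { authors: new DataLoader(batchGetAuthors) };
}

// In a resolver: context.loaders.authors.load(parent.authorId)
// Repeated loads of the same id within a request hit the cache; distinct
// ids requested in the same tick are coalesced into one batch call.
```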
Time to Live (TTL)-based, Least Recently Used (LRU) caches are limited in their effectiveness with GQL. The flexibility in specifying what the payload looks like makes it difficult to achieve a highly efficient cache with a high hit ratio and a low dirty-read rate. For this reason, GQL caches tend to be much bigger than their RESTful counterparts. The Apollo GraphQL framework supports caching via cache hints, either annotated in the schema or set dynamically in the resolver. The hints can be honored by browser-side caching, an in-memory cache, or an external cache such as Memcached or Redis.
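A sketch of both styles of hint, assuming Apollo Server’s cache control support (the exact setup, such as whether the directive must be declared in the schema, varies by version):

```typescript
import { ApolloServer, gql } from "apollo-server";

const typeDefs = gql`
  # Newer Apollo Server versions require declaring the directive explicitly.
  directive @cacheControl(maxAge: Int, scope: CacheControlScope) on FIELD_DEFINITION | OBJECT
  enum CacheControlScope { PUBLIC PRIVATE }

  # Static hint: news items may be cached for up to five minutes.
  type News @cacheControl(maxAge: 300) {
    id: ID!
    headline: String!
    likeCount: Int! @cacheControl(maxAge: 10)
  }
  type Query {
    news: [News!]!
  }
`;

const resolvers = {
  Query: {
    news: async (_p: unknown, _a: unknown, _ctx: unknown, info: any) => {
      // Dynamic hint: a resolver can tighten the policy at run time.
      info.cacheControl.setCacheHint({ maxAge: 60 });
      return [{ id: "1", headline: "hello", likeCount: 0 }];
    },
  },
};

new ApolloServer({ typeDefs, resolvers }).listen();
```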
Mutations are similar to RPC in that there are no well-defined standards for how they work. For that reason, they can be hard to reason about.
At the time of this writing, Application Performance Monitoring (APM) for GraphQL is not as mature as it is for RESTful APIs. Support for GraphQL doesn’t come built in, but there are plugins or workarounds for New Relic, Datadog, Prometheus, and AppDynamics. I believe that, over time, APM monitoring of Apollo-style GraphQL will become more mainstream.
With RESTful APIs, the client specifies the path, maybe some query string parameters, maybe a little authentication, and that’s it. With GQL, the client has to specify what the payload looks like. This sort of mini-program raises the complexity of calling GQL services, which increases the likelihood of introducing bugs. It also makes it prohibitively expensive to develop test automation that covers every conceivable query. The usual approach is just to query for everything in the test automation; this should be sufficient unless the resolvers exchange out-of-band data via the context object.
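In practice, “query for everything” can be one smoke test that requests every field in the schema and asserts on the response. A hypothetical sketch, assuming a Jest-style runner and a global fetch:

```typescript
// Hypothetical smoke test: one query that touches every field in the schema.
const everythingQuery = `
  query Everything {
    news(limit: 5) {
      id
      headline
      author { id displayName }
      comments { id body }
    }
  }`;

test("the news feed query exercises every resolver", async () => {
  const response = await fetch("http://localhost:4000/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: everythingQuery }),
  });
  const { data, errors } = await response.json();
  expect(errors).toBeUndefined();
  expect(data.news.length).toBeGreaterThan(0);
});
```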
Separation of Concerns
The key to understanding whether you should base your API design principles on REST or GQL is to grasp a concept in computer science known as Separation of Concerns (SoC). Well-designed yet non-trivial software is composed of many layers where each layer is segmented into many modules. If the SoC for each layer and module is clearly articulated and rigorously followed, then the software will be easier to comprehend and less complex. Why is that? If you know where to look for the implementation of any particular feature, then you will understand how to navigate the codebase (most likely spread across multiple repositories) quickly and efficiently. Just as REST and GQL queries provide consistency in API design, a clear SoC means that you have a consistent approach to where the implementation for each feature belongs. Developers are less likely to introduce new bugs in software that they understand well.
It is up to the software architect to set the standard for a consistent SoC. Here is a common catalog of the various layers and what should go in each layer.
The main layers in typical, modern business software are frontend and backend.
- Frontend software is what the user interacts with directly and usually runs on a mobile device or laptop. The frontend consists of mobile and web apps and should be mostly about rendering, binding, interaction, and user experience (UX). It is best organized internally as some variation of Model View Controller (MVC).
- Backend software is what the frontend software communicates with. In production, backend software almost always runs on servers in centralized data centers such as public clouds.
The backend is further layered, mostly into data, edge, and integration services.
- Data services protect the database(s), enforce business rules, maintain consistency, and focus on scalability, performance, and possibly resilience. They are best organized internally into resource controllers, services, models, and Data Access Objects (DAOs).
- Edge services handle push notifications, cross-endpoint aggregation, and security.
- Integration services should serve as a reactive anti-corruption layer with third-party apps such as e-commerce (backend integration) and spreadsheets (frontend integration).
There are other types of services that won’t get much coverage here.
A full-featured service mesh can be used to handle resilience, discovery, internal authentication, encryption, and observability. The latter will require interaction with other types of services for monitoring, alerting, log aggregation, and distributed tracing if desired.
From a software architect’s perspective, systems start to get into trouble when the engineers decide (typically for reasons of expediency) to blur this separation of concerns. For example, DevOps immaturity may tempt you to decide that data services should take on edge service or integration concerns, or that frontend applications should do work that more naturally belongs to the backend.
A Modern Technology Stack
About six years ago, I started seeing a particular type of edge service known as Backends for Frontends (BfFs). Instead of the client application calling the data services directly, calls go through an intermediate service designed specifically to accommodate the requirements of the calling application. If done properly, this approach decouples the data services from the ever-changing needs of the GUI. The data services remain tightly coupled to the databases, whose schemas usually change at a much slower pace, at least for more mature projects.
What are some useful technical requirements for BfFs? Surfaced as GraphQL, they should provide an aggregating orchestration layer over the RESTful data services. They should either surface WebSockets or exploit GQL’s subscriptions capability. They should be owned by frontend teams and adopt a frontend-developer-friendly tech stack.
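Putting those requirements together, a BfF resolver might fan out to the RESTful data services and let GQL do the aggregation. A sketch, assuming Apollo Server and a runtime with a global fetch; the service URLs and field names are made up:

```typescript
import { ApolloServer, gql } from "apollo-server";

const typeDefs = gql`
  type FeedItem {
    id: ID!
    headline: String!
    authorName: String!
    commentCount: Int!
  }
  type Query {
    feed(limit: Int): [FeedItem!]!
  }
`;

async function getJson(url: string) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`${url} -> ${res.status}`);
  return res.json();
}

// The BfF aggregates several RESTful data service calls server-side, so the
// browser makes one short round trip instead of dozens of long ones.
const resolvers = {
  Query: {
    feed: async (_p: unknown, args: { limit?: number }) => {
      const items = await getJson(`http://news-svc/feed/news?limit=${args.limit ?? 10}`);
      return Promise.all(
        items.map(async (item: { id: string; headline: string; authorUri: string }) => {
          const [author, comments] = await Promise.all([
            getJson(`http://user-svc${item.authorUri}`),
            getJson(`http://news-svc/feed/news/${item.id}/comments`),
          ]);
          return {
            id: item.id,
            headline: item.headline,
            authorName: author.displayName,
            commentCount: comments.length,
          };
        })
      );
    },
  },
};

new ApolloServer({ typeDefs, resolvers }).listen(4000);
```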
You might be tempted to believe that adding a BfF will increase latency due to the extra server round trip. Actually, the opposite is true, and here is why: it is not server hops that increase latency, it is the distance that the data packets travel. To illustrate, consider the following hypothetical. Assume a web page calls an API endpoint that returns, on average, ten items, and must make three other API calls per item to get all the data needed to render the page. For a user in San Francisco loading the page from servers in Amazon’s us-east-1 region, each request travels about 5,600 miles round trip. Since that is 31 requests, data must travel 173,600 miles, or almost 7 times the circumference of the Earth. If you wrap those 31 requests up in a BfF that resides in the same data center as the data services, then you may have made 32 calls in total, but the data traveled only 5,600 miles over the long haul. That is about 3% of the distance without a BfF. Web browsers can use some parallelism when making these API calls, but connection limits for browsers are more restrictive than for backend services.
Another kind of edge service is the API gateway, which is responsible for authentication, authorization, rate limiting, Single Sign-On, and enforcing access rights. It can also route queries to the GQL BfFs and route commands straight to the RESTful services, if desired. You will need different BfFs for different types of client applications, but a single common API gateway should work well for all client types. It may also need to proxy calls from third parties or developer-savvy power users directly to the data services behind the firewall. The API gateway is a very general-purpose piece of software: it can be something your backend engineers support themselves (OpenResty is a popular choice), a third-party product such as Kong, or a PaaS offering available from all three of the most popular public cloud vendors.
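As a rough sketch of that routing, assuming Express with http-proxy-middleware and placeholder upstream hosts:

```typescript
import express from "express";
import { createProxyMiddleware } from "http-proxy-middleware";

const gateway = express();

// Cross-cutting concern: check credentials before routing anywhere.
gateway.use((req, res, next) => {
  if (!req.headers.authorization) return res.status(401).end(); // placeholder check
  next();
});

// Queries go to the GQL BfF for this client type...
gateway.use(
  "/graphql",
  createProxyMiddleware({ target: "http://web-bff:4000", changeOrigin: true })
);

// ...while commands and third-party traffic go straight to the RESTful data services.
gateway.use(
  "/api",
  createProxyMiddleware({ target: "http://data-services:8080", changeOrigin: true })
);

gateway.listen(8443);
```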
Modern web development is all about the SPA. In the old days, a user would point their web browser to an HTML page that was probably generated by a server-side web application. When the user clicked a link on that page, the browser would render a brand new HTML page also generated by a server-side web application. With an SPA, the user points their web browser to a static web page that has only very basic HTML in it. Lots of JavaScript and Cascading Style Sheets (CSS) get downloaded, too. It is the execution of the JavaScript that makes API calls then generates most of the Document Object Model (DOM) that the browser renders as the GUI. When the user clicks on a link, the JavaScript on the existing page just destroys some old DOM and creates some new DOM. The page appears to be different but the browser does not load a brand new document.
These days, the frontend developer is most likely coding in TypeScript (a type-checked variation of JavaScript) using either the Angular or React framework. The TypeScript gets transpiled into JavaScript and CSS as part of the build process. While developing locally, developers use Node.js to serve the JavaScript and CSS and to route requests to the APIs. I recommend using Nginx to perform those functions when not developing locally. Doing so allows you to exploit the Same-Origin Policy with a configuration that lives in the same source code repository as the application itself. The corresponding Docker image bundles the transpiled assets of the application along with Nginx and related configs, which can also take care of other issues such as CORS headers or cache control. You may need to adjust the configuration of your CDN if you go with this approach.
Most mobile applications these days are built for either iOS or Android. Those operating systems have their own specific technical requirements, too numerous to elaborate on here. You can either develop separate iOS and Android apps or use technology like the Ionic Framework or React Native, where one codebase is used to generate the OS-specific binaries. You may need to develop separate BfFs for iOS and Android, or perhaps a single mobile BfF can accommodate both.
Conclusion
You don’t have to choose between REST and GraphQL. REST works well for platform-oriented data service concerns while GraphQL is a good fit for GUI-focused edge concerns. If you maintain data services and edge services in separate layers, then you can afford the luxury of having both REST and GQL at their best.
Maintaining a clear separation of concerns between the different component types helps manage accidental complexity. Be advised that you will be strongly tempted to blur those concerns to meet certain short-term goals after discovering some additional complications. Architects should endeavor to remain open to the needs of all concerned parties. Consider applying a short-term solution that temporarily blurs SoC and raises accidental complexity in order to meet immediate needs in a timely fashion, but this must be almost immediately followed up (guaranteed by all parties) with the long-term refactoring work necessary to restore a clear SoC and lower overall accidental complexity. This is the lesson of tech debt.
Where practical, keep cross-component coupling loose. Components that need to be tightly coupled should belong to the same team and use similar or complementary technology. You should not have to dramatically restructure your teams to accomplish this.
About the Author
Glenn Engstrand is a Software Architect at Rally Health. His focus is working with engineers in order to deliver scalable, server-side, 12-factor-compliant application architectures. Glenn was a breakout speaker at Adobe's internal Advertising Cloud developer's conferences in 2018 and 2017 and at the 2012 Lucene Revolution conference in Boston. He specializes in breaking monolithic applications up into microservices and in deep integration with Real-Time Communications infrastructure.