How Zalando Delivers APIs with Radical Agility


Key takeaways

  • Radical agility relies on small engineering teams that are first class citizens and promotes concepts like autonomy, mastery, purpose and trust to create optimum conditions for teams to deliver innovative products at scale.
  • An API-first approach relies on designing the API outside of code first, and on getting early feedback on this design to improve the quality and align it with overall architecture guidelines.
  • API guilds are transversal teams that develop and maintain RESTful API guidelines and contribute to API reviews, helping keep a uniform look and feel when designing APIs at scale.
  • To bring changes quickly to a live system, run the old and new services in parallel for a certain time and shift the traffic progressively while carefully monitoring the operation.
  • Developing a culture around API design and API as a Product principle requires teams to learn how to accept code reviews without getting defensive, to give good constructive feedback that really provides value, not focusing on the details, but on things that really matter.

InfoQ interviewed Thomas Frauenstein, architect at Zalando, about his team’s radical agility development organization that is optimized for an API-first approach. He explains what an API-first approach is, and provides tips on building good APIs for scalable microservice architectures where a large number of services are offered efficiently.

InfoQ: Can you introduce yourself and your role at Zalando?

Thomas Frauenstein: I work at Zalando as an architect and I focus on global architecture topics around our new business platform, as well as on how we organize our architectural work.

Zalando is Europe's leading fashion e-commerce company and its Tech department has more than 1400 employees, largely software engineers and data scientists distributed over six main tech hubs in Germany, Ireland and Finland.

InfoQ: How did you get interested in web APIs in the first place?

Frauenstein: Our APIs purely express what our systems do and are highly valuable business assets. By combining services to deliver new innovative products, APIs represent a key element in the efficient creation of complex software solutions.

We all know that APIs have to be sustainable while the implementation of services might change radically. APIs are easy to break, and bad APIs slow down engineering as well as business processes and generate high effort in service operations and engineering.

InfoQ: What are the main challenges you face with those APIs?

Frauenstein: We organize our group in a specific way that we call Radical Agility. Radical agility is centered around small engineering teams who are first class citizens of our organization and promotes basic concepts like autonomy, mastery and purpose as well as trust to create optimum conditions for the teams to deliver innovative products at scale.

With this also come a couple of architectural principles. One is that each team owns applications, and is responsible for developing and operating their applications in the cloud using their own AWS cloud account.

These applications are managed as Software-as-a-Service based on a microservice architecture. There's a huge set of microservices implementing different products like product and stock management, order processing, payment services, logistic processes and so on for the e-commerce business. Most of these services provide functionality via APIs, which are basically public APIs designed as OAuth2 secured RESTful APIs with JSON payloads, and we also apply API-first principles.

InfoQ: How do you ensure consistency and best practices among your various APIs and teams behind them?

Frauenstein: When we aligned on radical agility and architectural principles, we made the API-first approach one of our key engineering principles. At Zalando, API-first basically means two things.

The first principle is that we design the API outside of code first. We use a standardized language to define the APIs which is independent from the implementation language. We decided to use the OpenAPI Specification (OAS, aka Swagger Spec). We start service development with API design based on a profound understanding of the domain and required functionality. The design is not use case or implementation specific, but based on generalized business entity resources. We see API definitions as a source of truth, part of a contract between a service provider and consumers.
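To make the first principle concrete, here is a minimal sketch of an API definition that exists before any service code. It is written as a Python dict that mirrors the structure of an OpenAPI 2.0 document; the Order API, its paths and the helper names are hypothetical and are not taken from Zalando.

```python
# Hypothetical API definition mirroring the structure of an
# OpenAPI 2.0 (Swagger) document. Nothing here is Zalando's real API.
ORDER_API = {
    "swagger": "2.0",
    "info": {"title": "Order API (hypothetical)", "version": "1.0.0"},
    "paths": {
        "/orders": {
            "get": {
                "summary": "List orders",
                "responses": {"200": {"description": "A list of orders"}},
            }
        },
        "/orders/{order_id}": {
            "get": {
                "summary": "Get one order",
                "parameters": [
                    {"name": "order_id", "in": "path",
                     "required": True, "type": "string"}
                ],
                "responses": {"200": {"description": "A single order"}},
            }
        },
    },
}

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch"}


def missing_required_fields(definition):
    """Return the required OpenAPI 2.0 top-level fields that are absent."""
    missing = [f for f in ("swagger", "info", "paths") if f not in definition]
    info = definition.get("info", {})
    missing += ["info." + f for f in ("title", "version") if f not in info]
    return missing


def list_operations(definition):
    """Return sorted (METHOD, path) pairs: the contract reviewers discuss."""
    ops = []
    for path, item in definition.get("paths", {}).items():
        for method in item:
            if method in HTTP_METHODS:
                ops.append((method.upper(), path))
    return sorted(ops)
```

Because the definition is plain data, peers and consumers can review the list of operations before a single handler has been implemented.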

The second API-first principle is to get early feedback on API design. We defined a lightweight review process involving peers and consumers. This feedback is important to improve the quality of the APIs and align them with our microservice architecture.

This consistency results in a uniform look and feel across the different APIs. In the context of API-first, we created what we call an API guild that develops and maintains RESTful API guidelines and also contributes to API reviews.

As you know, REST is more an architectural style and does not really specify API design details. We need to have some standards in the API design practices to establish a consistent API look and feel. Ideally, all the APIs should look like they were created by the same person. That’s a very ambitious target, but our guidelines help. We recently open-sourced them and have already received external contributions.

The API guidelines standardize easier things like naming conventions and resource definitions, but also include more complex things like non-breaking changes and how we want to do versioning.
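One way to picture a non-breaking-change rule is a check that compares two versions of an API definition. The sketch below assumes OpenAPI 2.0 style dicts and treats only two things as breaking: a removed operation and a newly required parameter. The actual guidelines cover more cases, and the function names are illustrative.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch"}


def operations(definition):
    """Map (METHOD, path) to the set of required parameter names."""
    ops = {}
    for path, item in definition.get("paths", {}).items():
        for method, operation in item.items():
            if method in HTTP_METHODS:
                params = operation.get("parameters", [])
                ops[(method.upper(), path)] = {
                    p["name"] for p in params if p.get("required")
                }
    return ops


def breaking_changes(old, new):
    """List changes in `new` that would break clients written against `old`."""
    old_ops, new_ops = operations(old), operations(new)
    problems = []
    for key in sorted(old_ops):
        method, path = key
        if key not in new_ops:
            # Existing clients may still call this operation.
            problems.append(f"removed {method} {path}")
            continue
        # Existing clients do not send parameters they never knew about.
        for name in sorted(new_ops[key] - old_ops[key]):
            problems.append(f"{method} {path}: new required parameter {name}")
    return problems
```

Adding an operation or an optional parameter yields an empty list, which matches the idea of a backward compatible extension.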

Peer review by API Guild members makes sure that the APIs are consistent with these guidelines. In the end, the more critical aspect is that all the different services that are part of the platform fit in an overall architecture where you have really clear, separated functions that can easily be orchestrated to build the business functionality that we have in mind.

InfoQ: Can you describe the typical API development workflow at Zalando from a tooling point of view?

Frauenstein: We have quite a high variance among the different teams. We want to give high autonomy to the teams about their technology decisions and how they implement and design their services.

As I mentioned, one of the first things is to define the API using OAS in a separate document. We then gather feedback on this API and, in parallel or in advance depending on how the team executes, start the implementation of the API and related tests.

We have a very heterogeneous implementation infrastructure used by different teams, including Java and Scala as main languages for microservice implementation and other languages like Python, Clojure and Go. At Zalando, a couple of teams manually implement the API definitions. We also have a generator that creates Java client and server stubs out of OAS definitions.

Some teams use this or similar tools for Scala. I’ve also seen some teams that use Spring REST to initially create their payload descriptions. So, I would say it is very heterogeneous.

InfoQ: Can you also describe how you deliver and then operate your APIs in production?

Frauenstein: Each team has its own AWS account and is responsible for the development, deployment and operations of their microservices. We also have a platform team that provides some infrastructure support for easy deployment and to be compliant with our audit and security requirements. It is an infrastructure layer on top of AWS using CloudFormation and other AWS APIs.

An application with a couple of APIs gets deployed and implements a specific REST endpoint to publish its API definitions. As part of our infrastructure we have a central place where all the deployed APIs can be discovered. People can search through these APIs and also see older versions that have been deployed.

This API discovery service is fed by a piece of infrastructure that crawls all the deployed services and gets access to the YAML API definitions via the endpoints provided by the services.
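The crawling step described above can be sketched roughly as follows. The endpoint path `/api-definition` and the function names are assumptions made for illustration; the interview does not name the real endpoint. The `fetch` callable is injected so the sketch runs without network access.

```python
def crawl_api_definitions(service_urls, fetch,
                          definition_path="/api-definition"):
    """Collect API definitions from deployed services.

    `fetch(url)` returns the response body as text and raises OSError
    when a service cannot be reached. `definition_path` is a
    hypothetical endpoint name, not the one Zalando uses.
    """
    catalog = {}       # service base URL -> raw API definition text
    unreachable = []   # services that did not answer
    for base in service_urls:
        url = base.rstrip("/") + definition_path
        try:
            catalog[base] = fetch(url)
        except OSError:
            unreachable.append(base)
    return catalog, unreachable
```

In a real crawler, `fetch` could wrap `urllib.request.urlopen`, whose `URLError` is a subclass of `OSError`, and the catalog would be stored so that older versions stay searchable.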

InfoQ: What are the challenges that you encountered with APIs in production, in terms of scaling, availability or security?

Frauenstein: In respect to security, all our external APIs are accessible via the public Internet. Of course, all these APIs have to be secured so we have an open identity management infrastructure and use OAuth2 flows with JWT for the authorization of our API access.

If you define an API you also have to define the security scopes mandatory to access the endpoints and there is an approval flow implemented in our infrastructure where these scopes are then assigned to teams or services led by teams.
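A minimal sketch of the scope check behind such an endpoint, assuming the granted scopes arrive as the space-delimited `scope` string that OAuth2 defines. Token validation is left out, and the scope names are hypothetical.

```python
def missing_scopes(required, granted):
    """Return the required scopes that the caller has not been granted."""
    return sorted(set(required) - set(granted))


def is_authorized(required, token_scope_claim):
    """Check a space-delimited OAuth2 scope string against an endpoint.

    `required` is the set of scopes declared for the endpoint in the
    API definition. Verifying the token itself (signature, expiry)
    is assumed to have happened before this call.
    """
    granted = token_scope_claim.split()
    return not missing_scopes(required, granted)
```

For example, an endpoint that declares `{"orders.read"}` accepts a token carrying `"orders.read orders.write"` and rejects one carrying only `"stock.read"`.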

For scaling, our applications are stateless. We autoscale the number of instances using an AWS capability that our tooling can configure. If the application uses data, then we have to make sure that these services also scale by using different technologies from AWS or built by us.

For availability, we usually have various levels of business criticality. For high availability, we have at least four nines so we cannot stop the service when we deploy something and test it. We generally do not make use of specific staging environments as part of our quality assurance and only use staging environments in specific situations.
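To see why four nines rules out stopping a service for deployments, it helps to work out the downtime budget. The helper below is plain arithmetic and its name is illustrative.

```python
def allowed_downtime_minutes(availability, period_days=365):
    """Minutes of downtime permitted by an availability target."""
    return (1 - availability) * period_days * 24 * 60


# Four nines leaves roughly 52.56 minutes per year,
# or roughly 4.32 minutes in a 30-day month.
yearly_budget = allowed_downtime_minutes(0.9999)
monthly_budget = allowed_downtime_minutes(0.9999, period_days=30)
```

A handful of deployments with even a minute of interruption each would consume most of that monthly budget.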

We prefer to bring changes quickly to the live system, and have both the old and new services operating in parallel for a certain time as we shift the traffic progressively while carefully monitoring the operation. Another pattern is that we work with feature toggles that activate certain features for specific flows through our system.
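The progressive traffic shift can be sketched as a schedule of traffic shares plus a rollback rule. The schedule values, the error threshold and the function names are assumptions chosen for illustration, not Zalando's actual settings.

```python
import random

# Hypothetical rollout schedule: share of traffic sent to the new version.
SCHEDULE = (0.0, 0.05, 0.25, 0.5, 1.0)


def next_share(current, error_rate, max_error_rate=0.01, schedule=SCHEDULE):
    """Advance to the next traffic share, or roll back if monitoring fails."""
    if error_rate > max_error_rate:
        return 0.0  # send everything back to the old version
    later = [share for share in schedule if share > current]
    return later[0] if later else current


def route(new_share, rng=random.random):
    """Pick which version serves one request, given the current share."""
    return "new" if rng() < new_share else "old"
```

Both versions stay deployed during the whole schedule, so a rollback only changes the share and needs no redeployment.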

InfoQ: How do you ensure the best experience for developers and final users who access the Zalando APIs from various geographical regions?

Frauenstein: Part of the documentation around APIs is making explicit statements around what service level agreements I want to support with this service and its APIs. I also document the load the API is ready to accept, if the API has rate limits, what latency the API provides and what the API availability promise is. These aspects should be clearly defined in the context of the service API documentation but are usually not part of our API definition.

In addition to the API definition, we often have a kind of API user manual that provides service context information, usage examples, details on error handling, and also includes information about SLAs. This document must be accessible online and the URLs are linked in the API definition.

Each team has to make sure that the SLAs are supported by the service implementation and operation. They have defined different metrics that are continuously monitored, and where you can usually define alerts based on thresholds such as latency and throughput.
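A threshold-based alert of this kind can be sketched as follows. It is not how ZMON or any other Zalando tool is implemented; the nearest-rank percentile, the choice of the 99th percentile and the threshold parameters are assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_sla(latencies_ms, requests_per_s, max_p99_ms, min_rps):
    """Return the names of the SLA metrics that crossed their threshold."""
    alerts = []
    if percentile(latencies_ms, 99) > max_p99_ms:
        alerts.append("latency")
    if requests_per_s < min_rps:
        alerts.append("throughput")
    return alerts
```

Checking a high percentile instead of the mean keeps a small share of very slow requests from being averaged away.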

We have also built and open sourced some tooling such as Zalando Monitor (ZMON), Zalando Tracer and Zalando Problem. These support infrastructure monitoring and alerting of our services in a very flexible way. We also use AppDynamics for certain monitoring and logging operational use cases. We also have implemented a solution based on Kafka and Flink for monitoring of long running processes, for instance, the order fulfillment and logistic flows.

We use caching technologies where it is necessary to improve latencies, but this is very use case specific. For instance, sometimes we use Redis or ElastiCache. We also use a CDN for our content delivery.

We used to operate our systems in data centers which are located here in Germany in two different sites. But we have made good progress with our AWS transition and are moving this functionality to AWS cloud hosted services.

For latency critical applications, we use the AWS Frankfurt region and we also use the Dublin region for fallback in case we have a regional outage (Frankfurt region only has two availability zones).

The more performance critical stuff includes the shop, search and order processing. We also have a good deal of other functions that are not that performance critical such as asynchronous insight processing or reporting functions. These functions are often operated in the Dublin region, but this is basically a decision of the team, depending on the requirements for the specific service.

InfoQ: How do you engage with your community of developers and partners around your public API?

Frauenstein: We are currently in a big transition process. In the past, we made our money with our shop activity. We now have different e-commerce applications and we evolve into a fashion platform provider. Basically we want to connect different business partners, like retailers, merchants or logistic providers, with different business models around fashion e-commerce.

All the services within this new fashion platform will support multi-tenancy, and many of them will provide public interfaces to our business partners and will bring more traffic than via our classic shop consumer interfaces. We see other platform providers like Facebook, Twitter, and Salesforce generating more traffic with these APIs than with their consumer interfacing channels.

Today, we do not yet have a great API portal where you can discover all these APIs, with very consistent API documentation and test support with sandbox integration to try out these APIs. These are the major topics we are currently designing. Right now, we only have a basic entry solution in place in our infrastructure that provides a central place where the API reference and user manual are accessible. We are still discussing the best way to expose this to external partners.

InfoQ: Why did you choose to describe your API in OAS 2.0 compared to other API languages? What has been your experience with this API language and tooling so far?

Frauenstein: I think the main driver for the decision on using Open API Specification was the impression that Swagger is widely accepted in the industry. The fact that it’s become an open source project is also important. We felt that the main things we wanted to define in a standardized way were already covered.

Of course, there are some aspects that are not yet covered by OAS, like support for content negotiation. Our impression is that OAS has quite an active community, and we hope that we will benefit from its advances. We also plan to contribute there.

We use the tooling but as we have about 110 teams, there is a lot of diversity around their usage of Swagger.

InfoQ: How do you ensure the testing and quality of your APIs during their development and technical operations?

Frauenstein: Basically the teams are responsible for the quality of their services, end-to-end. They are responsible not only for delivering, but also for the quality and the integration of their applications. We use a couple of different practices which each team develops. There’s a certain freedom to how they decide to implement these responsibilities.

As soon as we design an API, we define the first test cases in order to have a very high level of test automation. These tests are sometimes executed on offline environments, or sometimes in a CI test environment operated in the cloud as a service for the teams.

We have a continuous delivery pipeline that includes these automated tests. There are also more complex features where you have to orchestrate a couple of services and where integration tests involve several teams coordinating themselves.

InfoQ: Do you provide client SDKs to facilitate the consumption of your APIs? If so, how do you develop and maintain them? If not, do you plan to provide some in the future?

Frauenstein: The teams usually only have to provide their API definition. They do not have the responsibility to provide client support for implementation of a specific language for the APIs. For the time being, we use the same approach for our public APIs.

To a certain extent, things become easier for the client if you provide some support and some shared libraries. But as you know, it is not easy to maintain the libraries, deploy them and then manage their lifecycles. We have decided against this approach. For the time being, we do not have plans to change this for external partners.

InfoQ: How do you deal with change in your APIs over time?

Frauenstein: Part of our guidelines is that service providers cannot break APIs. Usually we have to build backward compatible extensions. There are a couple of things defined in our API guidelines where a client has to be robust and offer more flexibility to support revisions. If we cannot avoid breaking compatibility, then we have to at least support the older version using versioned media types and content negotiation.
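Serving an old and a new version side by side through versioned media types comes down to matching the client's Accept header against what the service supports. The sketch below handles only q-values and the `*/*` wildcard, and the media type names in the usage note are hypothetical.

```python
def parse_accept(header):
    """Return media types from an Accept header, best preference first."""
    entries = []
    for index, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by position in the header.
            entries.append((-quality, index, media_type))
    return [media_type for _, _, media_type in sorted(entries)]


def negotiate(header, supported, default):
    """Pick the representation to serve, or None if nothing is acceptable."""
    if not header.strip():
        return default
    for media_type in parse_accept(header):
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return default
    return None
```

With `application/x.order-v1+json` and `application/x.order-v2+json` both supported, an old client that keeps sending the v1 type continues to work, while a `None` result would map to a 406 Not Acceptable response.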

During an API lifecycle, if you want to have changes that could break clients, it is easier if you only have internal clients. You need to have an internal dialogue to phase out these old APIs and substitute them with new APIs.

As part of the discussion, you need to clarify the parts of the API that are deprecated and include this information as part of the API definition. We specify a deprecation date and additional version information so new users don’t use these deprecated parts and instead reference new parts of the API.

When it comes to public APIs, it is not always easy to ensure that external clients contribute to your API evolution and use new APIs. We are currently discussing lifecycle models where we can deprecate an API and inform the client. We need to provide time guidelines when we guarantee support of the old API and allow them time to switch to the new APIs. We are discussing this process and how it should look from a contract point of view.

Part of the guidelines also tells you to make the additional effort of notifying consumers that they are using a deprecated part of the API. The services should know which clients use which APIs, especially which deprecated parts are used by which clients, with which frequency, and so on. It’s part of our monitoring and logging recommendations for API operation.
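Knowing which clients still call deprecated parts of an API can be as simple as counting matching access-log entries. The record layout, a `(client_id, method, path)` tuple, and the function name are assumptions for illustration.

```python
from collections import Counter


def deprecated_usage(access_log, deprecated):
    """Count calls to deprecated operations per client.

    `access_log` yields (client_id, method, path) tuples and
    `deprecated` is a set of (method, path) pairs marked as
    deprecated in the API definition.
    """
    usage = Counter()
    for client_id, method, path in access_log:
        if (method, path) in deprecated:
            usage[(client_id, method, path)] += 1
    return usage
```

The resulting counts show which clients to notify and how much traffic still depends on the old operations before they are phased out.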

InfoQ: What are the next innovations you are planning to work on at Zalando?

Frauenstein: Technology-wise, HTTP/2 is definitely a topic of discussion, but we currently do not have a clear opinion on how to proceed. We don’t have hot technology topics on our radar right now that are specific to APIs. We have more engineering challenges at this stage.

The next API infrastructure steps focus on improving client developer experience via a centralized API portal that provides API discovery, consistent API reference and user manuals, search experience and communication features, like API help contact, blogging and feedback.

The challenge is to balance autonomy and ownership of all the different teams with high consistency and quality standards necessary to nurture our platform ecosystem.

InfoQ: What is your vision for the future of APIs?

Frauenstein: One important aspect is to develop a culture around API design and API as a Product principle. Teams have to learn how to accept code reviews without getting defensive, how to give good constructive feedback that really provides value, not focusing on the details, but on things that really matter.

Teams also have to understand that the API-first principle is not in contradiction with the agile principle. They have to learn how to get feedback on early drafts of their API and not wait for the API to be deployed.

These are more cultural topics around API design and API-first which are also crucial in an organization, and they have to evolve. We’ve been on this path for more than one year and as this gets adopted more and more, we are seeing the value of this culture.

About the Interviewee

Dr. Thomas Frauenstein is Senior Software Architect at Zalando. Frauenstein brings with him over 15 years of in-depth leadership experience in industry grade software engineering. Previously, Frauenstein held technical leadership positions at Siemens and Nokia. Frauenstein holds a PhD from Technical University Berlin.
