Mature Microservices and How to Operate Them: QCon London Q&A

Microservices are an architectural approach that keeps systems decoupled so that you can release many changes a day, said Sarah Wells in her keynote at QCon London 2019. To build resilient and maintainable systems, you need things like load balancing across healthy nodes, backoff and retry, and persistence or fanning out of requests via queues. The best way to know whether your system is resilient is to test it.
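As a rough illustration of the "backoff and retry" pattern mentioned above, here is a minimal Python sketch. It is not taken from the keynote or from the FT's code; the function name `retry_with_backoff`, its parameters, and the choice of exponential backoff with full jitter are all assumptions made for this example.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds or `max_attempts` is reached.

    Hypothetical helper: waits a random time between 0 and an exponentially
    growing cap (capped at `max_delay`) before each retry.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Cap doubles on each failed attempt: 0.1, 0.2, 0.4, ...
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries out so clients do not retry in sync.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter is injectable so that the waiting behaviour can be checked in a test without actually pausing. A real service would usually retry only on errors that are known to be transient.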

The Financial Times adopted microservices because they wanted to be able to experiment with new features and products both quickly and cheaply. To do that, you need to be able to release code multiple times a day, and you can only do that if the individual changes are small and independent, argued Wells.

Wells mentioned that microservices are harder to maintain and operate than a monolith. The complexity is in between the services - the services themselves are simple to understand. But any request going through your system will likely touch a number of different services, maybe multiple queues and data stores, she said. The logs and metrics will be generated on lots of different VMs, and the path of the request changes a lot as teams build new things, combine services, and add new functionality.

With microservices, you have to accept that these are complex distributed systems. That means you are generally running in a state of "grey failure" where something is not quite working perfectly - which likely doesn’t matter, as long as you have enough resilience for your business functionality still to be working as expected, argued Wells.

Chaos engineering is about changing the state of your system - for example by taking down nodes or increasing the latency of responses from a non-critical system - to test that everything else still works as expected. Chaos engineering should be done in production, but it shouldn’t have an impact on users, said Wells. You are coming up with a hypothesis about how your system will cope, then checking if you are right.
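The hypothesis-then-check shape of such an experiment can be sketched in a few lines of Python. This is only an in-process simulation under stated assumptions, not a production chaos tool and not how the FT runs its experiments: `render_page`, `with_injected_latency`, and the 50 ms timeout are hypothetical names and values chosen for the example. The hypothesis is that the page still renders quickly when a non-critical dependency becomes slow.

```python
import concurrent.futures
import time

# Shared worker pool for calls to the (simulated) non-critical dependency.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def render_page(fetch_recommendations, timeout=0.05):
    """Render a page, degrading gracefully if recommendations are slow."""
    future = _pool.submit(fetch_recommendations)
    try:
        recommendations = future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Non-critical data is late: serve the page without it.
        recommendations = []
    return {"article": "main content", "recommendations": recommendations}


def with_injected_latency(func, seconds):
    """Wrap `func` so that every call is delayed by `seconds`."""
    def slow(*args, **kwargs):
        time.sleep(seconds)
        return func(*args, **kwargs)
    return slow


def healthy_recommendations():
    return ["story-a", "story-b"]
```

The experiment then consists of calling `render_page(with_injected_latency(healthy_recommendations, 0.3))` and checking that the article is still returned within the expected time, with an empty recommendations list.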

InfoQ interviewed Sarah Wells, technical director for operations and reliability at Financial Times, about the challenges that come with microservices and how we can deal with them.

InfoQ: At QCon you presented a problem with redirects on the FT website. What happened, and what made it so difficult to solve?

Sarah Wells: We often set up redirects for ft.com, so that we can share human-readable URLs like "https://www.ft.com/companies" rather than our unique URLs like "https://www.ft.com/stream/c47f4dfc-6879-4e95-accf-ca8cbe6a1f69". The human-friendly URL redirects to the unique one. The problem in this case was a badly set-up redirect, where the destination we were being redirected to didn’t exist - so people were getting a "Not found" page. And we couldn’t work out how to reverse this through the URL management tool.

The problem was that the URL management tool is just one of hundreds of services we operate at the FT. And because we have so many services to operate, no-one had really had much experience making changes to this one, or really doing anything with it. We discussed restoring data from a backup, but no-one was really that sure where the backup would be, or what the steps to restore it were. Polyglot architectures where you have lots of different data stores are great, but it means you need to document exactly how this particular data store does backups and restores, and we found that documentation wasn’t detailed enough.

We managed to fix the problem, but we weren’t able to act with confidence, even with a very experienced set of developers involved. For an individual service, it’s easy to take action after the incident to practice a restore from backup, and to update the documentation. But that’s just a spot solution - we also need to work out how to set ourselves up so that all services have this level of ownership.

InfoQ: What have you learned about operating microservices?

Wells: If microservices give you the chance to release many times a day, then that additional complexity is worth it. You can make it easier by building in observability - log aggregation, metrics, business-focused monitoring: things that let you understand what’s going on in your production system. You need in particular to be able to find all the logs that relate to a particular event - by stamping them all with a transaction ID. And you can also improve things by changing the way you test, to do more of it in production and to use monitoring to maintain quality.
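To make the idea of stamping logs with a transaction ID concrete, here is a minimal sketch using only Python's standard library. It is an editorial illustration rather than the FT's implementation: the `X-Request-Id` header name, the `tid_` prefix, the logger name, and the helper functions are assumptions for this example.

```python
import contextvars
import logging
import uuid

# Holds the ID of the transaction currently being handled.
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionIdFilter(logging.Filter):
    """Copy the current transaction ID onto every log record."""

    def filter(self, record):
        record.transaction_id = transaction_id.get()
        return True


def build_logger(stream):
    """Create a logger whose every line carries the transaction ID."""
    logger = logging.getLogger("example-service")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s transaction_id=%(transaction_id)s %(message)s"))
    handler.addFilter(TransactionIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, headers):
    """Reuse the caller's ID if present, otherwise start a new transaction."""
    tid = headers.get("X-Request-Id") or f"tid_{uuid.uuid4().hex[:10]}"
    token = transaction_id.set(tid)
    try:
        logger.info("request received")
        logger.info("request handled")
        # Headers to forward so downstream services log the same ID.
        return {"X-Request-Id": tid}
    finally:
        transaction_id.reset(token)
```

Because every service forwards the same ID, a search for that ID in the log aggregation tool returns the full path of one request across services.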

When people and teams start moving on to new challenges, you need to make sure there is still active ownership of systems - people who know how to restore from backup, fail over, release code, and find relevant logs. We have a store of system information (we call it BizOps) that contains runbook information for every service and we want this to link every service to a team that is responsible for it. We’re also starting to introduce some automated scoring of the quality of that data, to find the places with the most risk that we wouldn’t know what to do in case of an incident.
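One simple way such scoring could work is to measure how many operational fields are filled in for each service. The sketch below is a guess at the general idea, not a description of BizOps: the field names in `REQUIRED_FIELDS`, the record layout, and both function names are hypothetical.

```python
# Hypothetical fields a runbook record is expected to contain.
REQUIRED_FIELDS = (
    "owning_team",
    "restore_from_backup",
    "failover",
    "release_process",
    "log_locations",
)


def runbook_score(record):
    """Return the fraction of required fields that are non-empty (0.0 to 1.0)."""
    filled = [f for f in REQUIRED_FIELDS if str(record.get(f) or "").strip()]
    return len(filled) / len(REQUIRED_FIELDS)


def riskiest_services(records, limit=3):
    """List the services with the least complete runbooks, worst first."""
    ranked = sorted(records, key=lambda r: (runbook_score(r), r["name"]))
    return [(r["name"], runbook_score(r)) for r in ranked[:limit]]
```

A completeness score like this says nothing about whether the documented steps actually work, so it would complement practices such as rehearsing a restore rather than replace them.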

InfoQ: How do you do experiments at the Financial Times?

Wells: For ft.com, we have A/B testing built in, and managed via feature flags. We run many experiments and do statistical analysis on each of them to see whether they had the impact we were looking for.
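The two ingredients named here, flag-based assignment and statistical analysis, can be sketched as follows. This is a generic illustration and not the FT's tooling: the hash-based bucketing scheme, the two-proportion z-test, and the names `variant_for` and `two_proportion_z_test` are assumptions made for this example.

```python
import hashlib
import math


def variant_for(user_id, experiment, treatment_share=0.5):
    """Assign a user to 'treatment' or 'control', stable across requests."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and treatment (b).

    Returns the z statistic and the two-sided p-value under the
    normal approximation with a pooled standard error.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Hashing the experiment name together with the user ID keeps each user in the same variant for a given experiment while keeping assignments independent across experiments.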

Because it’s cheap and easy to experiment, and because we ask people to predict what "success" would look like ahead of time, we often have experiments that prove we were better off the way we were before. So we don’t roll out that feature. That’s only really possible because there isn’t a huge amount of effort and cost already invested in the new feature - people are really reluctant to abandon an idea they’ve invested a lot into, even if it doesn’t work!

InfoQ: What are your suggestions for documenting microservices-based systems?

Wells: I think you need to document as close to the code as possible - even if someone writes a good runbook to start with, they won’t change it every time there’s a code change unless they can easily see the things that are no longer correct. We’re looking to update runbooks automatically on code release, based on information stored in the code repository.

I think the documentation should be about how to work out what’s going on with the service, rather than trying to identify the likely failure scenarios up front. With microservices, most problems are unexpected and involve someone digging into the detail using logs, metrics etc.

You also need to appreciate that lots of information that lived in a traditional runbook is shared across many microservices. You need to find a way to allow that shared context - no-one should have to type in the same information for tens of services!
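One possible way to allow shared context is to keep common runbook fields in a single place and let each service supply only what differs. The sketch below illustrates that idea under stated assumptions; the field names, the example values, and `build_runbook` are hypothetical and do not describe the FT's tooling.

```python
def build_runbook(shared, service):
    """Render a plain-text runbook from shared defaults plus service details.

    Service-specific values override shared ones with the same key.
    """
    merged = {**shared, **service}
    lines = [f"Runbook: {merged['name']}"]
    for key in sorted(k for k in merged if k != "name"):
        lines.append(f"{key}: {merged[key]}")
    return "\n".join(lines)


# Hypothetical context shared by every service a team owns.
TEAM_DEFAULTS = {
    "owning_team": "content-platform",
    "log_locations": "central log aggregation, search by transaction ID",
    "release_process": "merge to main triggers the deployment pipeline",
}
```

If the service-specific part lives in the code repository, the same function could be run on each release to regenerate the runbook.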
