
Facilitating the Spread of Knowledge and Innovation in Professional Software Development



Monitoring All the Things: Keeping Track of a Mixed Estate


Summary

Luke Blaney talks about how to approach monitoring an estate of many technologies and what the Financial Times did to improve visibility across systems built by all its teams.

Bio

Luke Blaney has worked for the Financial Times since 2012. Currently, he is a Principal Engineer on their Reliability Engineering team, where he works on improving operational resilience and reducing duplication of tech effort across the company.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Blaney: I'm Luke. I'm going to talk to you about monitoring. Monitoring a microservice can be a bit tricky. You need to think a lot more about network and communicating with lots of other things that you might not have thought about in your monolith. I'm not going to talk about that. Even trickier is when you have a group, like all your microservices in your team, and trying to keep track of all of those together, how they're doing. Different failure states between them. How they interact, and all that stuff. That's a much trickier problem. I'm also not talking about that. What I want to talk about is how do you monitor every service across your entire organization?

I work for the "Financial Times." You're probably interested in what tech stack we use. We use Go microservices. They're on Kubernetes, which we run across regions. We use CoreOS on EC2. At the FT, we run Node.js microservices. We deploy them onto Heroku using CircleCI, and we have a CDN in front. Our Node.js microservices, we deploy them onto AWS Lambda using a serverless framework. We put that all behind an API gateway. Actually, what we do is we have a PHP monolith. We deploy that using Puppet. We've got a Jenkins box that kicks all that off. We have Varnish sitting in front of that, which is a reverse proxy server and does some caching. We write Java apps and we deploy them using Apache Tomcat, onto Red Hat Linux boxes. We run these on-premise in our own data centers, on Cisco UCS. They're fronted by two manually configured JetNexus load balancers.

You're going to ask me now, "Luke, you've just given me five different contradictory things, which of these does the FT actually use?" The answer is all of them. We use a lot more architectures than that. It's not just those. There are tons and tons of different ones. I could go off into details of lots of different things. The reason we do that is we decided a while back that we wanted to empower teams to make their own technical decisions for what works for them, which is really good. It speeds up delivery. Teams can go off and say, "This works for us. We're going to do this." What it results in is just this wide variety. That's tricky when you're trying to monitor things. We do a lot of different things at the FT. We have a website, ft.com. Six times a week, we print off a version of that, and we send it out to people. We also have various other services: How to Spend It, Investors Chronicle. Then we have internal-facing apps as well for our staff: CMSes, sales tools, lots of different tools that we built.

We have 130 years of quality journalism at the FT. As a technologist, I look at that, and I go, do you know what that actually means? That means we've got 130 years of legacy that we have to support. An interesting story: a few years ago in the newsroom, they were trying to push towards a more digital-first newsroom. We used to always think in terms of getting out the physical newspaper. Now we're like, we want the website to be first and foremost. They were looking at the journalists' workflows when they had to file copy. Then from that, the copy would get put together. It would get sent to our print sites. Things would be printed. They'd be distributed. They were trying to work out why there was a certain deadline, this deadline for when a journalist had to file copy every day. It was the same deadline every day.

All the journalists knew, you need to get your copy in by this time. Someone started asking, why? Why this particular time? No one really knew. They just knew that was the way it was done. They drilled back through history and they tried to work it out. It turns out that the reason they filed copy when they did was because of steam train timetables from the 1900s, because they needed to get the newspapers on the last steam train from London to Edinburgh so people in Scotland could read the newspaper. Nowadays, we have better ways of getting news to people in Scotland. They don't need to wait for a steam train to arrive. No one had ever revisited this. They were just like, that's the way we do it. We will keep on doing that.

Monitoring Systems at the Financial Times

At the FT, we have quite a few different monitoring systems. Each of these different architectures uses different tools behind it, and other people have just decided, "I want to use this here. I'm going to use that there." We were trying to get a grasp of, how do we keep track of all of these things? There are so many things going on in different places. We didn't want to replace these all and build the one monitoring tool, "You must use this." Because, A, that's really hard. You have to get all the teams to buy into that. B, for some really old systems, you don't necessarily understand that old system. Pulling out all the monitoring from a really old system and trying to replace it with something new can actually cause more trouble than it's worth. We wanted to keep these monitoring systems in place, but try to get an overall view of what was going on with our estate. We added a single data store. We use Prometheus for this. We could pull in lots of different metrics from all these tools. Then we built our own custom frontend in front of that. I've zoomed out quite a lot to try and get all the services in but, actually, I couldn't even fit them all on one slide. We have a lot of things going on.
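To make that pattern a little more concrete, here is a minimal Node.js sketch (purely illustrative, not the FT's actual code) of the kind of bridge you can put in front of a legacy monitoring tool so that Prometheus can scrape it like anything else. The legacy tool, metric name, labels, and port are all hypothetical.

```javascript
// Illustrative sketch only: a small "bridge" exporter that polls a legacy
// monitoring tool and re-exposes its check results as Prometheus metrics.
const http = require('http');
const client = require('prom-client'); // Prometheus client for Node.js

const checkStatus = new client.Gauge({
  name: 'legacy_check_ok',
  help: '1 if the legacy check is passing, 0 if it is failing',
  labelNames: ['tool', 'check'],
});

async function poll() {
  // In reality this would call the legacy tool's own API (Nagios, etc.);
  // a stubbed response keeps the sketch self-contained.
  const results = [{ tool: 'nagios', check: 'disk_space', ok: true }];
  for (const r of results) {
    checkStatus.set({ tool: r.tool, check: r.check }, r.ok ? 1 : 0);
  }
}
poll();
setInterval(poll, 60 * 1000);

// Prometheus scrapes this endpoint like any other target, so the custom
// frontend only ever has to query one data store.
http
  .createServer(async (req, res) => {
    if (req.url === '/metrics') {
      res.setHeader('Content-Type', client.register.contentType);
      res.end(await client.register.metrics());
    } else {
      res.statusCode = 404;
      res.end();
    }
  })
  .listen(8080);
```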

I'm going to talk about three main sections. First of all, how do you monitor those legacy monitoring systems? How do you pull them in? Secondly, some current monitoring approaches. Finally, a bit about how you think about monitoring for the future, for the systems that haven't yet been built.

Legacy Monitoring Systems

First of all, legacy monitoring systems. We have a couple of systems. They're there. They were often built at the time of the legacy systems and whatever monitoring was put in place. Some people like to shy away from legacy, they're like, "Let's get off all the legacy." We've realized, legacy tends to be the thing that's bringing in the money right now. You probably have lots of dev teams that are off building shiny new things. Shiny new things tend to not be bringing in the current money. It's the legacy stuff that brings most of the money into the business. It's really important to know whether the legacy systems are working or not.

My first bit of advice around this is to admit that you have some. If you shy away from it, and hide it, you're not going to be able to solve the problem. You need to admit to the problem. Then you can solve it. Be upfront about your legacy. The way we do that is we have to think about, which of these monitoring systems do we want to actively encourage teams to use, and which are we supporting purely for legacy purposes? We have a little tool that lists all the different monitoring solutions available. For each of them, we put a little tag in the corner saying either it's recommended, or it's deprecated. Then, when a new developer comes along, they can see, these are the recommended ones. It's the golden path approach. We will support you in doing that. These deprecated tools, we still need to support. We're still using them day in, day out to keep track of systems. Please, don't build anything new on them, because that just extends how long we have to support these things.

Next, when we were integrating some of these old tools, I think we made a bit of a mistake. We spent a lot of time trying to understand the ins and outs of lots of this monitoring. We'd spend hours discussing, say, an NTP alert, and asking, is the warning state different from the error state? Is a severity 3 more or less major than a warning? How do these things play into each other? At the end of the day, it didn't really matter. We weren't changing the monitoring of these old systems, we were just trying to pull it in. The main thing we should have been focusing on, and we eventually did, was the human reactions. That is, if this thing goes red, what is the person going to do? We have a first-line operations team who then escalates to delivery teams across the company. They need to know, when this thing goes red, what am I doing? Am I going to wake someone out of bed in the middle of the night? Is it something that we can tolerate until Monday when someone comes in and they're going to fix it? It's those things that you need to care about, first and foremost. It's what the humans are going to do with the monitoring. After that, all the detail just follows, really.

Also, one thing I came to realize when we were pulling in these different systems is that there was a lot of config required to keep track of them. Actually, it's ok to hardcode this config. For example, we had Nagios boxes that were spun up across the estate. Lots of different teams were using them. As a first pass, we were like, we'll stick this list of Nagios boxes, just hardcoded, in 250 lines of YAML. That will do for now. We'll come back and look at it later. We actually found that once we'd done this, these things rarely changed. Legacy, once it's there, if you don't have a delivery team working on it day in, day out, not a lot changes to these old systems. They have the same IP addresses. They're sitting in the same place. We didn't need to make this config really dynamic, really easy to change. In fact, having that small bit of friction was fine: someone could still come and change it.

We had a repo, and someone could do pull requests and stuff. A bit of friction for teams to do that actually really helped with legacy, because often legacy is misunderstood. People may not have touched it in months, if not years. A bit of friction in the process of changing these things was really useful, because people would stop, and they'd think. They were still able to make the change when they needed to. We actually found that most of the time when people were updating this, it was as part of decommissioning. They weren't really adding anything new. Most of our pull requests looked something like this. It was all red lines, which is my favorite type of pull request. It's just deletes. Didn't need to add anything new. That tended to be all that needed to happen to these legacy systems. We never actually went around and built a more dynamic way of configuring these because hardcoded was fine.

The first thing you need to do is be upfront about legacy. Everyone's got it, just admit it. Secondly, focus on the human reactions to the monitoring. Thirdly, it is ok sometimes just to hardcode a bunch of config.

Current Monitoring Approaches

I think the most important thing is to add metadata to your monitoring. Monitoring is not there in isolation, you need some context. In the FT we have runbooks. These give some useful information: stuff around how important the system is to us, whether it's in production yet, even things like when it was last deployed. There are three questions you want to ask yourself. First of all, if this thing's broken, why should anyone care? Your monitoring is going off, but in terms of business use, why should someone care that your system is down? The second thing is, who's relying on this? We need to send out comms. We want to communicate that it is down. We don't want to bombard the entire company every single time something goes down, we want to be able to target those comms. We need to understand who's using it. If they're external people, maybe we need to talk to our customer care team so they're on hand to deal with any requests and that kind of thing. Finally, we want to know who can help fix it. If we need to get someone out of bed in the middle of the night, we don't want to be ringing the wrong team. We want to ring the right team first. That is ideal. I'd recommend not hardcoding these things in your monitoring. Often, you're thinking about these things while you're writing your monitoring, so you tend to put them in the same place.

When you think about each of these, each of these can change independently of the technology itself. For example, you might have a system that was built for one purpose, and some other team in the business goes, "That's useful. We could use that for our thing." I've seen that happen. Then they start using it. Then multiple teams start relying on this for a completely different purpose than you built it for. You're like, that's fine. Maybe the team that it was first built for no longer needs it anymore but three other teams have started using it. Also, from the engineering side, you might have a reorg. People come, people leave. The person that you're ringing up to fix it can also change. These are all organizational changes. These are not dependent on the tech.

Putting all this information inside the repo, or in the tech with the monitoring itself, actually means it's much harder to change. Your org structure will change. I think the only guaranteed thing in technology is that at some point your organizational structure is going to change. You need to be prepared for it. You want to make your config as easy to change as possible, because it will change a lot.

In the FT, we've built our own tool, we call it Biz Ops, business operations. It stores data around our operational stuff and how it relates to the business. To make it easy for people to make changes, we have a UI. We've also got APIs. Engineers love APIs. They find that a much easier way to give us the data. It doesn't matter how people give it data, just make it as easy as possible to make those changes. Because these changes will happen a lot. You might think an organizational restructure only happens every couple of years and it's a big announcement. The CTO gets everybody and tells them a big thing. Actually, there are small changes happening constantly throughout any large organization. You might not know about every change, but they are happening. Teams are merging. They're splitting apart. They're getting renamed. Some are leaving. We actually find that by the time you go through and gather all the information you need about all the systems you're monitoring, the first thing you gathered will have already changed. That is the pace of change we have these days with technology. You need to be prepared for change.

For us, we've actually found that a graph model makes it easier to make those changes. With a graph, there's less stuff to update when your org structure changes. For example, the way we used to think about our systems is we'd have a runbook and the system would attach to it. We'd have what team owns it, who the tech lead is, what the cost center is, and a bunch of other information, which in the days of monoliths was quite easy. You might have a handful of these. You'd have to update them every so often. If you think of microservices, if you've got hundreds of these, and your tech lead leaves, or they move teams, you get a new tech lead. You then have hundreds of entries to update. No one wants to do that manually. Even running scripts and stuff is just a pain, every time you have a personnel change, having to update all of these things. Using a graph model, we tease these things apart. We said, actually, the tech lead is the tech lead of a team. A team can own all these microservices. When the tech lead of the team changes, we only need to make one update. Then we pull all that information through, and we automatically surface it on the runbooks.
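As a rough illustration of why that shape helps (the names and structure below are made up for the example, not the real Biz Ops schema), ownership lives on the team rather than on each system, so a personnel change is a single update:

```javascript
// Illustrative only: a tiny in-memory "graph" where systems point at a team,
// and the team points at its tech lead. None of this is the real Biz Ops model.
const people = { 'jane.doe': { name: 'Jane Doe' } };

const teams = {
  'content-platform': { techLead: 'jane.doe' },
};

const systems = {
  'content-ingester': { ownedBy: 'content-platform' },
  'content-search': { ownedBy: 'content-platform' },
  // ...hundreds more microservices, each pointing at a team, not a person
};

// Runbooks resolve the tech lead through the team relationship...
function techLeadFor(systemCode) {
  const team = teams[systems[systemCode].ownedBy];
  return people[team.techLead].name;
}

// ...so when the tech lead changes, there is exactly one update to make,
// and every runbook picks it up automatically.
people['sam.smith'] = { name: 'Sam Smith' };
teams['content-platform'].techLead = 'sam.smith';

console.log(techLeadFor('content-ingester')); // "Sam Smith"
console.log(techLeadFor('content-search'));   // "Sam Smith"
```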

I think it's really important, if you're doing this as a central team supporting other delivery teams, that you don't become a blocker to those changes. It can be tempting. You can think, "We are the monitoring experts. We know microservices are the best. We know all these X, Y, and Z." It would be great if, when a team needs to do a thing, they just run it by us. They'll probably be fine. We'll approve the PR, and X, Y, and Z. As soon as you add a cross-team dependency like that, it really slows the pace of delivery. You find that anywhere: cross-team dependencies slow down delivery. It's the same here, being that central team that tries to be in control of everything, you just become a blocker. Every other team will start resenting you. You really want to trust teams. The teams that build their thing, we say you build it, you run it. You monitor it. You manage the config around it. They should be in charge of all of that themselves, and be capable of making those changes.

One other thing we find is, often, you might have a source of this data around who looks after which system. You might not trust it very well. For example, back in the day we had our runbooks which are static pages on a Google site. Occasionally, someone would go in and make a change, but no one really liked doing it. We went through a bit of time trying to parse all this information and at least put it in a data store. Then we looked into it and we were all a bit like, "I'm not sure we can use that. Most of it is fine but this references a team that doesn't exist anymore. I think that guy left three years ago." We'd look through this, and we'd see this as low quality data. It can be tempting at that stage to go, we can't touch that. We'll need to start again from scratch. We actually found, using that data helps drive change. Provided it's easy for teams to make those changes, relying on it for your tooling really helps improve it.

For example, in our tool, we listed all the teams for our department based on whatever data we had at hand. Then we gave each team their own monitoring dashboard. This wasn't something that they got to configure. This was something we built from the data that we had. We said, according to our data, you own these systems. Therefore, your dashboard has these systems on it. Obviously, we got some people coming to us and going, "Hold on a sec, we don't own that system anymore and you're missing this system." Then we turned to them and said, "This is the data store that we're using for this. If you have a problem with what you see in this dashboard, update it in this data store." Teams were like, that seems fair. They went and made those changes. Actually, it really drove up the quality of the data that we then held. That's the same data that our first-line operations team used for escalating things. The fact that development teams were looking at these dashboards and spotting errors quickly meant that we had much better data around who to escalate incidents to than I think we would have had otherwise.

Do embellish your monitoring with metadata, but don't hardcode it. Keep the two decoupled, so people can make changes. The more you rely on this data, actually, the better the data quality becomes.

Monitoring For the Future

You don't want whatever tool you build to become immediately legacy as soon as you've built it. It's always a risk with these things. You're like, we've incorporated all of the monitoring that we have in the organization. Then you're like, what happens if something comes along the next day? A key to this, and it's a key to a lot of tech, is focus on the interface, not the implementation. Really think about how the data is given to you rather than necessarily stipulating, you must do it this way. In the FT, we have a JSON standard for health checks that each system can expose. We find this really useful. It's just a list of fields and some JSON. Then we say, we can parse that.
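To give a rough sense of the idea, a service might expose a payload along these lines at a well-known health check URL. The field names here are indicative of the concept, not necessarily the FT's exact standard, and the system code and URLs are made up for illustration.

```json
{
  "schemaVersion": 1,
  "systemCode": "example-content-api",
  "name": "Example Content API",
  "description": "Illustrative health check payload only.",
  "checks": [
    {
      "id": "database-connectivity",
      "name": "Database connectivity",
      "ok": true,
      "severity": 2,
      "businessImpact": "New articles cannot be saved or published.",
      "technicalSummary": "Verifies the service can reach its primary database.",
      "panicGuide": "https://runbooks.example.com/example-content-api",
      "lastUpdated": "2020-03-04T10:15:00Z"
    }
  ]
}
```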

Every team, it's up to them how they want to do it. It started off with teams doing it dynamically on websites and stuff. There would be a special endpoint for it. We got to a point where some data pipelines and stuff were like, "That doesn't really work for us. We're just going to push this JSON into an S3 bucket, and you can read it from there." That worked too. Just having a blob of JSON, that's how we're defining it. Teams were free to do it however they wanted. We actually found that teams started using this JSON in ways we'd never imagined. It wasn't just to surface stuff from an application itself. They were also using it to pull in other monitoring systems that we didn't have, which we found really interesting, because we never really thought of it that way.

This is where you may have heard the phrase paving the cow paths. You let teams go off and do what they want with an interface like this. Then you watch what they do, and see if you can help them along their way. We found that there were a lot of teams that had a Grafana dashboard, for example. They wanted to pull in stuff from Grafana. They'd go off and write a little tool themselves that would expose their Grafana alerts in this standard JSON health check. Then we'd pull them in, and it would be surfaced on the big dashboard for the whole company. That was great. We weren't a blocker for that. They were able to do that. One team went and did it. Then another team was like, "We could do that as well." We got to about three or four teams. Then we were like, we've not blocked anybody from doing this.

At this stage, we could probably make the whole process easier. We were able to do this because we weren't blocking them, and they weren't dependent on us. We could do this in our own time. We got to the point where a lot of people were doing it. We were like, we'll replace this all with our own Grafana exporter. We have one shared Grafana instance, so we can actually pull straight from there. We made it as easy as teams just adding a little extra tag onto their dashboards to say which system it was. As soon as we saw that tag, we'd pull it in and surface those alerts on the dashboard for that team.
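A hedged sketch of what an exporter like that might do, assuming a shared Grafana instance and its standard dashboard search API; the tag convention (a "system-" prefix) and what gets done with the results are made up for illustration.

```javascript
// Sketch only: find dashboards in a shared Grafana instance that carry a
// "system-<code>" tag, and group them by system code so their alerts can be
// surfaced on that team's dashboard. Assumes Grafana's /api/search endpoint,
// an API token, and Node 18+ for global fetch; the URL is hypothetical.
const GRAFANA_URL = 'https://grafana.example.com';

async function dashboardsBySystem() {
  const res = await fetch(`${GRAFANA_URL}/api/search?type=dash-db`, {
    headers: { Authorization: `Bearer ${process.env.GRAFANA_API_TOKEN}` },
  });
  const dashboards = await res.json();

  const bySystem = {};
  for (const dash of dashboards) {
    for (const tag of dash.tags || []) {
      if (tag.startsWith('system-')) {
        const systemCode = tag.slice('system-'.length);
        (bySystem[systemCode] = bySystem[systemCode] || []).push(dash.uid);
      }
    }
  }
  return bySystem; // e.g. { "content-ingester": ["dashboard-uid-1"] }
}
```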

One final thing we're starting to think about at the FT is to pull away from monitoring individual microservices. Microservices are great, but no one outside of tech really understands them. We're starting to think more in terms of business capabilities. One key thing we do in the FT is we publish news to a website. We have a content management system. Then that pushes stuff to our publishing platform. Then from there, it goes to ft.com, where it gets surfaced, and users can read it. Each of these is made up of lots of microservices. For example, the publishing platform looks a bit like this. I couldn't actually fit it all on one slide. There is a lot going on here, which is great. Microservices let you do lots of stuff.

Each of these things can be really specialized. When it comes to monitoring, often our monitoring goes, "This content ingester for Neo4j is broken," which is great for the developer who's trying to do the thing. The rest of the business have no idea. Our journalists, for example, they don't care about the content ingester Neo4j microservice. They don't even care about this whole platform. They barely even care about this pipeline. What they actually care about is the question, can we publish news? That's what they want to know. We're starting to think how we can step back from these microservices and monitor end-to-end, whether we're actually doing the thing we set out to do in the first place.

To recap on things you can do to protect yourself from the future: firstly, define an interface. Don't stipulate what the implementation has to be. Let teams go off and do things however they want, provided it meets the interface. Secondly, spot patterns when they happen and make them easier. If lots of people are doing the same thing, there's usually an opportunity to make that path really smooth. Finally, start thinking about what the capabilities you want to support are, not the individual microservices.

Recap

Be upfront about legacy. Everyone's got legacy, admit it. It makes life so much easier. Embellish your monitoring with metadata. People want to know what this means. Don't have monitoring in isolation. Always relate it back to what the business impact is. Start thinking about these business capabilities rather than just the individual microservices.

Questions and Answers

Participant: How do you set up the end-to-end monitoring of a capability?

Blaney: There are various ways to do this. One thing we quite like is synthetic checks. For example, for publishing, we have one really old article, and every five minutes, we just republish that. It's a known article; we just keep republishing that same article again and again. Then we check at the other end, on the website, has that happened? We can alert if it stops happening. I think the beauty of that is it doesn't care about all the microservices in the middle. You can have microservices that each individually are working fine, and yet overall the thing doesn't quite work. We had an incident once, I think it was to do with our image publishing, that went through four different teams. Every single team swore that their thing was working fine. All their monitoring was green. The journalists were like, "My image isn't publishing." That's what the journalists care about. They don't care about every team within the publishing pipeline saying that their thing is green. You want that end-to-end view. The more realistic you can make that monitoring, I think the better it is.
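As a rough sketch of that kind of synthetic check (every URL, ID, and credential here is hypothetical): trigger a real publish of the known article, then verify it where readers would actually see it.

```javascript
// Sketch of a synthetic end-to-end publish check, run every few minutes by a
// scheduler. All URLs, IDs, and auth details are hypothetical; the point is
// the shape: trigger a real publish, then check the reader-facing result.
const ARTICLE_ID = 'known-test-article';

async function republish() {
  const marker = `synthetic-${Date.now()}`;
  await fetch(`https://cms.example.com/api/articles/${ARTICLE_ID}/publish`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.CMS_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ marker }),
  });
  return marker;
}

async function appearedOnSite(marker, timeoutMs = 5 * 60 * 1000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`https://www.example.com/content/${ARTICLE_ID}`);
    if (res.ok && (await res.text()).includes(marker)) {
      return true;
    }
    await new Promise((resolve) => setTimeout(resolve, 10 * 1000));
  }
  return false;
}

async function runCheck() {
  const marker = await republish();
  if (!(await appearedOnSite(marker))) {
    // None of the microservices in the middle matter here: if the article
    // doesn't show up for readers, "can we publish news?" is broken.
    console.error('Synthetic publish check failed: article did not appear on site');
  }
}

runCheck();
```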

 


 

Recorded at:

Jul 14, 2020
