
Making a Lion Bulletproof: SRE in Banking


Summary

Robin van Zijll and Janna Brummel talk about the history, present and future of ING’s SRE team and practices. They touch upon people (hiring, coaching, organizational aspects, culture), process (way of working, education), technology (observability, infrastructure), and share lessons learned that can be applied to any organization starting or growing SRE, financial or not.

Bio

Janna Brummel is Site Reliability Engineer at ING Bank, where she helps other teams within the bank to know more about their services' reliability and to be able to respond more efficiently to incidents. Robin van Zijll is Site Reliability Engineer & Product Owner at ING Bank. He applies his experiences to help other engineers with operations related problems by creating a reliability toolset.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Brummel: We work for ING, and for those of you not familiar, it's a global financial organization active in 41 countries, also in the U.S. Today, this talk is going to be about the Dutch retail bank of ING. This means that we are talking about a bank with 9 million debit cards, 8 million retail customers, and 7 million ATM transactions per month. For those of you not familiar with the size of the population in the Netherlands, this means that we serve about half the country, so we're pretty big. Whenever we have a large-scale outage, it often hits the national news, so in the public eye pretty much.

One of our most well-known products is mobile banking, so we have an app for that. It's used by 4.5 million customers. Together, they log in 6 million times a day, so we hit 100+ TPS on a daily basis. We often see people on the trains with this app open. People seem to be very happy with it. They can check their balance, transfer money to their savings account, stuff like that, or do payment requests. We are not a digital-only bank. We also have branch offices and basically offer any type of banking product that there is.

van Zijll: For those of you wondering why we are talking about lions, it's actually our logo. It's a real Dutch thing. We're not really working with actual lions.

We are really important for customers, because they rely on us for keeping their money safe, for making sure that they can do transactions, and for seeing their balance. Because we are this large and this big a part of society, our regulators demand that we prove we are in control of the risks, so that people can rely on us to store their money in our bank accounts. They do the same for our uptime, and what they demand from us is an uptime of 99.88%. What you can see here is a bar chart with our uptime for doing a payment transaction through either mobile banking or through web. If you look closely, it says prime time. These are availability figures for prime time, prime time being 6:30 a.m. till 1 a.m. For me, as a customer, I don't care about prime time. I want to make my transactions in the middle of the night if I want to, especially if I'm in a bar, for example. Guess what, I'm not the only one.

In this graph, what you see is our actual traffic load on our systems. As you see, there is load on mobile banking logins even in the middle of the night, around 10 transactions per second. We have actual customers doing banking in the middle of the night. If you plot that against our availability figures, then something changes, because then we're not reaching the 99.88% our regulators demand from us. If you have a customer expectation of five nines, then we're not achieving it. As SRE, this is our main focus. We want to be available even if it's in the middle of the night.
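To put those availability targets in perspective, this is the rough downtime budget per 30-day month that each figure implies (illustrative arithmetic, not numbers quoted in the talk):

```latex
\begin{aligned}
\text{allowed downtime per 30-day month} &= (1 - A)\times 43\,200\ \text{min} \\
A = 99.88\%  &\;\Rightarrow\; 0.0012\times 43\,200 \approx 52\ \text{min/month} \\
A = 99.999\% &\;\Rightarrow\; 0.00001\times 43\,200 \approx 0.43\ \text{min} \approx 26\ \text{s/month}
\end{aligned}
```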

What is a site reliability engineering team? For those who don't know, site reliability engineering is typically what you get when you have a software engineer working on operations, and what's really important if you are a software engineer is to create software. An SRE spends at least 50% of their time on software engineering, but with a focus on operations. The main difference is, for example, that our DevOps teams focus on software engineering for customers, creating new functionality, like payment transactions, payment requests, Apple Pay, without much concern for the operational side. Then you have SRE, with a full focus on creating a reliable service, so working on reliability.

People

Brummel: A little bit more about our organization before we dive into the people side of things. We are organized in tribes, with BizDevOps squads responsible for build and run. This means that, in addition to dev and ops engineers, we have customer journey experts as part of the same team, where they serve as product owner. Combining business and IT in one squad allows for higher development speed, fewer handovers, and more interaction between business and IT, so more understanding of what the customer actually wants and, on the other hand, more understanding on the business side of the importance of tech in that development.

These squads are responsible for the full customer journey, so from customer interaction to the run of their services. A majority of tribes are focused on product engineering. A tribe is often centered around a customer functionality, so we have a tribe for mortgages or for payments, but there's also a much smaller set of tribes that focus more on productivity engineering, or IT for IT, as it were.

Our squad is actually part of such a tribe, with only DevOps teams and no traditional business representatives. With our squad and tribe, we support all other engineers in our organization. That means that we actually support 1,700 engineers that work in the Netherlands, across 340 squads, which is a lot. This means that we can solve the bigger reliability problems for engineers, also those that span multiple tribes. If you work in tribes, there is a risk of tribalism, where everyone is really focused on their silo, and we try to see and connect engineers across those silos to solve common issues.

We do some heavy lifting for them since not every product engineering team has the time or expertise to set up a complete monitoring stack. It would also be very inefficient to do that in multiple teams across every tribe. That's what we're trying to do with that. We're not the only team working on reliability within our tribe, but we are the only ones who have adopted the SRE way of working. For example, we have some separate monitoring teams in our organization, but they really focus more on delivering the product, and we do some extra stuff on top of that, which we'll tell you more about.

We have seven engineers in our team, four dev and three ops, so we are DevOps within our team. Basically, anyone can do anything, so there's no real restriction there, but we do see that people from the dev side solve issues in a different way than the ops engineers in our team. For example, if we want to know whether a team has used a correct configuration, a dev engineer would solve that by building an API so teams can check their config, and an ops engineer would write a script to check it. Together, we seem quite well rounded, and we use different approaches to solve issues.

We have two more SREs joining soon, which is very exciting. We have one product owner, which is Robin [van Zijll]. He takes care of prioritization and stakeholder management. We often define strategy and vision with the team as a whole, but Robin [van Zijll] and I have a part in that, too. I'm the chapter lead, which means that I'm sort of an engineering manager who is also part of the team. I take care of hiring, coaching, and way of working.

Because SREs want to spend 50% of their time on engineering, or preferably more in our case, this means that, as product owner and chapter lead, Robin [van Zijll] and I do a lot of the SRE outreach to other teams or to senior leadership and some of the SRE promo, so the rest of the people in our team can just code and do what they do best. Our team composition has been quite stable, so people tend to stay and enjoy their time there, which I think, for me, as a manager, that's great, even though I also encourage people to look in other parts of the organization.

Most of the engineers in the team have actually been working at ING for quite some time. They have experience working in the product engineering part of our organization, most notably working on our mobile banking application or card domains. We think that's really valuable, because it means that these engineers, when they visit other departments as SREs, often already have some level of credibility or a network to reach out to. They understand the problems of the DevOps engineers and the product engineering teams.

People just tend to listen to their advice, because we try to approach the operations heroes from other teams and ask them if they want to join us. They also bring a lot of knowledge to our team, so we understand the rest of the organization better, because that is, I think, one of the dangers of being a silo of a team: we might risk losing touch with the rest of the organization. We try to hire internally whenever we can.

Another important thing to mention is that a lot of engineers in our team have been on-call in the product engineering organization, about 90% actually. We also think that it's very valuable, if SREs want to focus on topics like monitoring or incident response, to have that real-life experience. Whenever we hire a new SRE, we have sort of a checklist of what we're looking for, and we don't like to grow with a lot of new people at once. We do it incrementally, because every change in team composition affects your delivery speed, as it were. We've learned that, sometimes, it can be pretty hard to hire a new SRE, because it's still a relatively new field of expertise. It's desired by more companies than just ours. Also, in the Netherlands, there are not a lot of SREs in the field, so it's not as easy as publishing a vacancy saying, "I'm looking for a Java developer," where a person already has years of experience.

What we do is have a heavy focus on nontechnical skills and mindset, and most of all, the ability to learn, since we'll likely need to teach the person to become an SRE. Firstly, we look for someone who's really passionate about reliability, solving problems, and a DevOps culture where a team takes full responsibility for build and run, but we're also looking for people who believe in open source and creating better quality code together. This is also because people in our team tend to be quite passionate about these topics. If someone joins who is a bit more neutral, this can cause a mismatch and some sort of misunderstanding, since we are all very dedicated to our cause and very driven, so we want to make sure that there's a match.

We also feel, of course, that it's easier to get other people excited about reliability if they see that we love this topic ourselves. We also look for people who are ok with failure. Especially coming from a more traditional organization, where blame can sometimes happen, we feel that this hinders learning or an open culture where we can discuss anything. We either test this by asking some questions about it, or sometimes an interviewee brings it up with a few remarks him- or herself, "Sometimes I make mistakes, but that's fine," remarks like that, and then we think, "Ok, this person seems to be ok with learning and failure." It's really important.

We also look for people who are insensitive to hierarchy. We feel that we are experts in the reliability domain, and we try to be innovators and change our organizational way of working for the better. This means that, sometimes, we have to take initiative and tell senior management what we think is the best approach in case of reliability topics. If someone's very sensitive to hierarchy, this might seem a bit of a daunting task, like talking to a CIO, for example. We sometimes also want to help product engineers to get time on their backlog to focus on reliability topics instead of shiny new customer features, and if they cannot break through themselves, we want to do some stakeholder management for them, and then it also helps if we focus on the topic at hand and not on management roles or whatever.

We also look for willingness to teach and advise engineers about reliability. Since we are a small team covering a lot of ground, we feel that teaching is one of the ways to get us there. Not all engineers have experience with this, and that's fine. We, at least, look for someone who wants to try to do this, and we try to match a way of transferring knowledge that fits with the engineer's personality type. We, for example, have a very senior engineer who is a lot more introverted, and of course, we want him to transfer his knowledge, too. In the end, we came up with a solution where he actually built an online tutorial and a practice program for people to learn more about the Prometheus querying language, and that's great, too. It scales even better. That way, we can make sure more people know about that topic.

We also look for experience with on-call duties, for the reason I mentioned before. Then the first real technical requirement: preferably, a person has experience with at least one language in our stack. This makes it easier to just start and be productive very quickly. When we do interviews, we basically do at least two. The person meets a lot of people in the team, because you want to make the decision together. We don't really do a technical code assessment, because we found it's very hard: we work with a lot of products, a lot of languages, and it's very hard to find a standardized assessment that works. Basically, we test coding skills by having the person talk to the most senior engineer and seeing if he or she is ok with that. We set some realistic job expectations about life on the work floor, as it were. We sometimes have those noisy work floors, and we have some dominant people in the team. We want to make sure that the person doesn't end up feeling really bad. If people don't like noise or people who voice their opinion, then our team might not be the best fit.

Process

How did we start? A lot of engineers in our team used to be on-call, and they were actually part of a small team that was on-call for all online channels. These people did this on top of their regular work. They were ops engineers in different teams; Robin [van Zijll] was one of them. They were the first ones to wake up at night, but they didn't have the power to structurally improve service reliability, because of our DevOps model, where we believe in "you build it, you run it": they could not actually touch the code of the teams involved.

van Zijll: Basically, we were like the incident commanders, so the first line of defense. We were the ones connecting to all the other DevOps teams that were able to solve the issue, because they were the owners of the services, really.

Brummel: It really sucks if your amount of sleep is dependent on other people's service reliability. Something had to change, and we had to do something. It was not sustainable. These people were called so often, on top of doing their regular job, that it wasn't good for their health and it wasn't good for reliability. What we did is start with an SRE pilot. The people in the team had read some articles about SRE before the Google book was actually published, and said, "Let's give this a go." We decided to start a pilot for six months, which was supported by senior management, and there actually was a transformation in the team. About half of the people in the team made it through the selection. Selection was done based on asking what the person would do if he or she became an SRE. Half of the team remained. They still had a period of doing their old work, so their on-call shifts. During the six months, they decided on a model of doing SRE, a way of working, a roadmap, stuff like that. They did a conference visit, they did research. They bonded as a team, and in the end, they presented this proposal to senior management.

Another reason senior management understood that we had to change something is that our senior management also wakes up at night in case of major incidents. They are part of the on-call rotation, so they knew these people very well, and they know all about what goes into fixing reliability in the middle of the night. That was quite nice. Everyone said, "Yes, let's start." After a knowledge transfer of the old tasks, SRE was fully launched, and that's when I joined. I wasn't part of the pilot, but I joined quickly after as chapter lead, and Robin [van Zijll] became product owner at this point in time.

A bit more theoretical background. We generally see three organizational models that you can pick for SRE. The first one is one in which you share service ownership between product engineering and SRE. This means you have separate dev teams and separate SRE teams. I think this is most similar to the Google model. The SRE teams take care of the actual run of the services. There is another model where SREs are distributed across product engineering teams. I think this is most similar to the DevOps model, where the ops engineers are actually SREs. In this case, service ownership is also shared.

Then we have a third model, where service ownership stays with product engineering, and SRE serves more as a creator of tools and does some consulting. This is the model that we chose. I think we chose it partly because of how our team came to exist, but also because it's more scalable and because we didn't want to interfere with the DevOps mentality of "You build it, you run it." For us, that's really important; we spent lots of years getting out there, getting people on-call, getting people to feel responsible, so we didn't want to interfere with that at all. But we also saw that the ops engineers in the teams didn't have enough time to structurally improve bigger reliability issues. That's why we chose this model.

What do we work on? This is the service reliability hierarchy. It's from the O'Reilly Google book, and it represents everything that goes into making a service reliable. The bottom tier is monitoring, because if you don't know whether your service is running or not, then you don't know what you can actually improve, or whether you need to fix anything. The layer on top of that is incident response. Once you've got your monitoring in place, you can respond to incidents. Once you respond to incidents, you can start learning from them in a postmortem or root cause analysis. Once you've done that, you can prevent outages by doing testing or changing your release procedures, or make sure that outages are resolved quicker.

The blue layers are the topics that we work on. We've only recently started to move more into the testing and release procedures domain. We spend a lot of time on monitoring since some of the fundamentals were not there yet. Robin [van Zijll] will tell you a lot more about the technology behind this.

Just a quick promo for our track. If you're more curious about the postmortem/RCA layer, please check out Jason's [Hand] and Ryan's [Kitchens] talk, who will talk about learning from failures. If you're more interested in testing, check out Lorne's [Kligerman] talk on chaos engineering. If you're more interested in high impact outlier system failures, please come see Laura's [Nolan] talk later today.

A little bit more of what we actually do. We spend 80% of our time on engineering, we deliver a white box monitoring and alerting stack called the Reliability Toolkit, and we work on a secure container platform. We have a service mesh in the public cloud. Robin [van Zijll] will tell you a lot more about this.

We spread SRE love and best practices, so we reach out to engineers for consulting and feedback, and we do education. What we don't do is we're not on-call for the product engineering organization, and we also don't work on SRE topics already covered by other teams in our organization. We already, for example, have a major incident commander team. That's been in place for a long time, and they do a good job of coordination during incidents. We're not going to go dive in that topic at all. They also take care of RCAs, postmortems. That's why our focus is different, because we want to add something and not be in competition.

A little bit more on outreach and education. What I mean with "We reach out" is we have a feedback loop for our product that I just mentioned. We have face-to-face interviews, or we do surveys or whatnot. We seem to think that face-to-face interviews are more effective since we can understand the problems of the teams much better. We also serve as reliability advocates in our organization. This means that we do not only do promo about the stuff that we work on, but we actually go to engineers, understand their problems, and also refer them to other reliability-related solutions in our organization since it can often be daunting to find everything in a big, more traditional organization.

van Zijll: And it helps us fill up the backlog within our own tribe.

Brummel: Yes, that, too. We educate. When new engineers join our organization, they go through something called an engineering onboarding. It's three days of education where they delve into different topics, like architecture or risk or monitoring, for example, and that's where we come in. We do some teaching there, helping new engineers, and those engineers then join all kinds of different teams in our organization, taking the information we give them to those teams. That's actually pretty nice. We also give workshops on our products. Our stack is based on Prometheus, so we give workshops there.

van Zijll: The important difference is that, during the onboarding, we're not talking exactly about products but more about the purpose of the products: why you need good monitoring, for example. This is different from our Prometheus workshops, which really focus on the use of Prometheus.

Brummel: Lastly, we facilitate knowledge sharing about SRE topics. We have sort of a meetup group, which we call a guild, for SREs, where interested engineers from other domains can join and share information, basically like super mini-conferences. We also organize monthly demo sessions open to all. Usually, teams tend to give demos to their own organizational unit, but we open it up to as many people as we can, and we demo new features or insights. We also help engineers via chat channels or the intranet. On our intranet, we document ways to improve your operational resilience. We give guidance on the purposes of tools and what tools to use, what Robin [van Zijll] mentioned before. We also have a Prometheus user community. This was actually set up by one of our users, so that's even better: not done by ourselves but by someone who really loved the product. We sponsor it by making sure that we also deliver content there. We promote it to our product users. We try to grow the community.

Lastly, we organize conference report-out sessions, which I wanted to mention because it might inspire you after this conference. We organize sessions where engineers can share what they've learned at conferences. We organize those for our tribe, so that's a lot of productivity engineering topics, but we also, of course, try to set the example and share on our own behalf. This means that we often share information about reliability, and we can also learn more about other technologies. I think this is a great way to share what you've learned and make sure that it's not just one person who went somewhere and learned something, but that it reaches further.

When we demo, we sometimes block the hallway since we do not fit in a regular meeting room, but it's all fun. We just pop up a big screen, one of those monitoring screens, and there's a large crowd of people standing. Sometimes, I order drinks, because that seems to encourage people to join at the end of the day, so just make sure that there's a nice incentive and it's at a nice time. It's really fun.

The last part of my deck: we also have a few principles we use in our way of working. We try to use industry standards whenever we can. In our organization, this has not always been a given. We build a lot of applications ourselves, because we are a financial organization and we want to make sure everything is in control from an IT risk perspective. We cannot always use the standard products out there, but we try to whenever we can. We want to reuse best practices, we want readily available resources. It also makes hiring so much easier. We work with open source products and practices whenever we can. We want to create better quality code together and learn from engineers inside and outside of ING. We really like that part of open source.

We do everything to enable engineers and the product engineering teams to focus on coding for customer value instead of wasting their skills on administrative tasks or maintaining hardware. We try to automate whatever we can, whenever we can, but that will be clearer in Robin [van Zijll]'s part.

Technology

van Zijll: Let's talk some tech, because we spend 80% of our time on software engineering, so we have to create some software. Janna [Brummel] already mentioned the Reliability Toolkit, and it's just a name we came up with. It's a toolkit for other DevOps teams to enhance their reliability. Why did we create it? First of all, why we started it.

We saw that the mean time to repair was too long. It took around 69 minutes from the start of an incident before an alert ended up with the engineer who was able to fix it. Basically, that's because we had this very basic system of alerting, which only says something like the machine is up or down, or in the case of an alert, the system is down. The alert was sent from the system to a centralized organization we call the mass control room, which is kind of our NOC. When they receive an alert (and they receive 4,000 of them each month), they have work instructions created by the teams able to fix the incident. The majority of the work instructions say one thing, "Call me now," and that takes a long time. What we wanted is a simpler path for an alert to go from the application or the system to the engineer.

The other thing we saw is that a lot of teams lacked insight into the health status of their application or their service. That's not only a problem for the teams themselves, it's a problem for SRE as well. If you want to help teams, if you want to advise them and consult them about reliability, you want to know if you succeed in that. You want to know if the uptime improves or the health of the application improves. If you don't have any insight into the health, you don't have any baseline, so you could never see whether you are successful as a team, as SREs. We needed a tool to measure that, but we have a great variety of technologies. We have Java APIs, RESTful APIs, so the modern stuff, but we also have message queuing and even mainframe systems. We needed some kind of tool that fits all of that.

We came up with the Reliability Toolkit, and it's all based around Prometheus. Prometheus is a time series database with alerting functionality. Basically, that's it. It's a pull mechanism, so it pulls metrics from the application or from a service. That means that you need to create a metrics endpoint on your system, and Prometheus comes by every 10 seconds or so and says, "Ok, what's the state now? Ok, I'll save the state in my database." We provision it with Grafana, because at ING we love to have these big, black screens with all kinds of colored graphs and stuff like that, because it looks really interesting. The most important thing is that it comes with alerting functionality, and that comes with an Alert Manager. You're able to send alerts to the Alert Manager, and you tell the Alert Manager, "Ok, this alert is really important. I want this as an actionable alert. I want to be woken up right now, so this you have to send to my phone by SMS," or "This type needs to be sent to my ChatOps channel." We use Mattermost for our ChatOps solution instead of Slack, which is in the cloud and hosted somewhere else.

Brummel: Mattermost is basically the open source version of Slack, so it has all the functionality. It's just a little less shiny.

van Zijll: Yes, and it runs on phones. You could also send it to your email if you want to be notified but only look at it the day after or the morning after. We also provision something custom-made, which we call the Model Builder. The Model Builder is an application that's capable of getting a metric from Prometheus, analyzing it for one, two, three weeks, whatever you want, and exposing a new metric, which is a model of the metric it analyzed. For example, take the load on your service, the http.request metric: we look at it for a couple of weeks, generate a model out of it, and expose it as the model of http.request, which makes it possible to compare those two. It's a kind of anomaly detection on your load, for example. The reason we wanted that is because we want to know when traffic drops, for example. There are a lot of things we'd like to know about an application. If traffic drops, there's something going on. You want to be alerted on it, and a simple, basic threshold isn't going to do that.
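The Model Builder itself is ING-internal, but the idea can be sketched roughly like this: query a few weeks of a metric from Prometheus's HTTP API, build a simple baseline (here an average per hour of the week, which is an assumption of this sketch; the metric name and addresses are placeholders), and re-expose the baseline as a new metric to compare against.

```python
# Sketch of a "model builder": learn a per-hour-of-week baseline for a metric
# and expose it as a new Prometheus metric. Illustrative only; the real
# Model Builder at ING may work quite differently.
import time
from collections import defaultdict
from datetime import datetime, timedelta, timezone

import requests
from prometheus_client import Gauge, start_http_server

PROMETHEUS = "http://localhost:9090"          # assumed Prometheus address
QUERY = "rate(http_requests_total[5m])"       # hypothetical load metric
MODEL = Gauge("model_http_requests", "Learned baseline for http request rate")


def learn_baseline(weeks: int = 3) -> dict:
    """Average the metric per (weekday, hour) over the past `weeks` weeks."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(weeks=weeks)
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query_range",
        params={"query": QUERY, "start": start.timestamp(),
                "end": end.timestamp(), "step": "300"},
    )
    resp.raise_for_status()
    buckets = defaultdict(list)
    for series in resp.json()["data"]["result"]:
        for ts, value in series["values"]:
            t = datetime.fromtimestamp(float(ts), tz=timezone.utc)
            buckets[(t.weekday(), t.hour)].append(float(value))
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}


if __name__ == "__main__":
    start_http_server(8001)  # Prometheus scrapes the learned model from here
    baseline = learn_baseline()
    while True:
        now = datetime.now(timezone.utc)
        MODEL.set(baseline.get((now.weekday(), now.hour), 0.0))
        time.sleep(60)
```

An alerting rule could then fire when the live rate stays well below model_http_requests for several minutes, catching the kind of traffic drop that a fixed threshold would miss.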

How do we provision this toolkit? We make sure that the configuration for Prometheus is combined with some default rules and default configuration generated by us. We also check the configuration, so that we make sure Prometheus is not going to break when you make a mistake in your configuration. We make sure that the binaries you run are up to date and the latest versions, and we deploy that stack for the team itself.
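As a rough sketch of that combine-and-check step (assuming YAML configuration and PyYAML; the real pipeline is ING-internal and the file names are made up), the idea is to merge the team's scrape targets with generated defaults and validate the result before it ever reaches their Prometheus:

```python
# Sketch: combine a team's configuration file with generated defaults
# before deploying it to their Prometheus instance. Illustrative only.
import yaml

DEFAULTS = {
    "global": {"scrape_interval": "10s"},
    "rule_files": ["default_rules.yml"],      # default alerting rules we ship
}


def build_config(team_config_path: str, output_path: str) -> None:
    with open(team_config_path) as f:
        team = yaml.safe_load(f) or {}

    config = dict(DEFAULTS)
    # The team only supplies its own scrape targets and extra rule files.
    config["scrape_configs"] = team.get("scrape_configs", [])
    config["rule_files"] = DEFAULTS["rule_files"] + team.get("rule_files", [])

    # A place for sanity checks, so a team's mistake can't break Prometheus.
    if not config["scrape_configs"]:
        raise ValueError("no scrape targets defined")

    with open(output_path, "w") as f:
        yaml.safe_dump(config, f)


if __name__ == "__main__":
    build_config("team/prometheus-prd.yml", "build/prometheus.yml")
```

Because the merged file is generated and checked centrally, a typo in a team's file fails the build instead of taking down their monitoring.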

If a team comes to us and says, "We want to have this Reliability Toolkit," they get their own instances. They get their own stack of five on three environments, where we deploy their Prometheus, their Alert Manager, their Grafana, and their Model Builder for them. The reason we do that is because we want to encourage teams to do monitoring. We want to encourage them to learn monitoring, so they have their own toolset and can't break the toolsets of others, which is a kind of resilience matter as well.

We help them generate the metrics: we implement client libraries in heavily used frameworks within ING, so that if they use a standard framework, they receive metrics for free and have a basic set to start from. We do the same on the machine level, so they have all those machine metrics as well. If they use standardized environments within ING, they have metrics for free and they can start.
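To illustrate what "metrics for free" from a framework integration can look like (a sketch with made-up metric and handler names, not ING's internal client library), a decorator baked into a shared framework can expose a /metrics endpoint and instrument every handler without the team writing any monitoring code themselves:

```python
# Sketch of framework-level instrumentation: teams using the shared framework
# get a /metrics endpoint plus request counts and latencies "for free".
# Metric and handler names are illustrative.
import functools
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Requests handled", ["handler", "outcome"])
LATENCY = Histogram("app_request_seconds", "Request latency", ["handler"])

start_http_server(8000)  # the endpoint Prometheus pulls from every ~10 seconds


def instrumented(handler):
    """Wrap a request handler so it is measured automatically."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = handler(*args, **kwargs)
            REQUESTS.labels(handler.__name__, "ok").inc()
            return result
        except Exception:
            REQUESTS.labels(handler.__name__, "error").inc()
            raise
        finally:
            LATENCY.labels(handler.__name__).observe(time.perf_counter() - start)
    return wrapper


@instrumented
def get_balance(account_id: str) -> float:
    return 0.0  # the team's own business logic goes here
```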

At first, we thought it was a good idea to have teams deploy their monitoring the same way they deploy their services. We have a Git repository, we combined their configuration with our generated configuration to create a build, and the build was packaged and deployed on the machines we provision for them. In the end, we looked at the usage of our toolkit by the teams and went to the teams themselves to get some feedback, and what we noticed is that they had a successful first test deployment, and then they got stuck. We asked them, "Why didn't you promote it to your acceptance environment or your production environment, so you have a way of monitoring?" Most of the teams said, "I was really enthusiastic about Prometheus, but now I have to create builds and stuff, and monitoring is an ops thing, and most ops engineers at ING are not used to creating builds." In the end, it was too much work for the teams to really use it.

We redesigned our stack, and now the only thing they have to do is maintain configuration. If we onboard teams, we ask them, "Ok, create your own project within Git. Clone our example project. You will find a couple of configuration files, one for your test environment, one for your acceptance environment, and another one for your production environment, plus some other tool files. Once you've created it, tell us where your project lives, and we will find the project and make sure we deploy it to the instances that we provision for you." It helps us, because now teams can easily use our stack, and we have the same binaries and the same overview everywhere. We make sure that teams can simply use it.

What else do we do to increase and improve our Reliability Toolkit? As already mentioned, we implement the client libraries in frameworks. We make sure that we still go to teams and see how they are using it. We have an overview of all the teams using our stack, and we see how many targets there are in each instance, but also how many active alerts they have or how many notifications they sent out in the last week, for example. If you see a team in production firing 2,000 alerts in a week, they might need some help.
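One way to build such an overview is to poll the standard Prometheus and Alertmanager HTTP APIs of every instance that gets handed out; a small sketch, with placeholder URLs and team names, covering scrape targets and currently firing alerts, could look like this:

```python
# Sketch: collect a per-team overview of scrape targets and active alerts
# from their Prometheus and Alertmanager instances. Illustrative only.
import requests

TEAMS = {
    # hypothetical team -> (Prometheus URL, Alertmanager URL)
    "payments": ("http://prom-payments:9090", "http://am-payments:9093"),
}


def team_overview(prom_url: str, am_url: str) -> dict:
    targets = requests.get(f"{prom_url}/api/v1/targets").json()
    alerts = requests.get(f"{am_url}/api/v2/alerts").json()
    return {
        "targets": len(targets["data"]["activeTargets"]),
        "active_alerts": len(alerts),
    }


if __name__ == "__main__":
    for team, (prom, am) in TEAMS.items():
        print(team, team_overview(prom, am))
```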

We do education during onboarding. We have a lot of help and information in our templated configuration files. We also do the workshops. Janna [Brummel] already mentioned the Prometheus querying language workshop: it's an application created by us that generates fake metrics. They're very basic, the logins into ing.nl or the logins into mobile banking, those kinds of fake metrics. Then we have all kinds of exercises that people can do with the Prometheus querying language, and by doing them, they learn the querying language. We're working on integrating this querying language workshop into a larger workshop, where engineers can spin up an application, implement new metrics, and learn how to scrape the metrics, query the metrics, and create alerts out of them.

We also template dashboards with teams and create templated alerts with teams, so that if they start using us, they have a basic set of alerts and dashboards that they can already use as a starting point for generating their own alerts. We have dashboards that focus on all APIs running behind a proxy. We have this black-box overview of applications, we share that with all teams, and we make it available to them to request alerts based on that information.

Does it help? Yesterday, I was looking at this with a colleague of mine, and what we've seen is that the usage of our toolkit increased by around 30 teams over the past 40 days. So yes, the easier setup is helping, and this is active usage. We have now onboarded 70 teams, and 60 are actively using it in production. That's pretty good. Is it enough? No, I don't think so. Now we have the toolkit and the service for them, but they still have to implement all the libraries themselves, they have to enable it, and they have to generate metrics themselves. We reached out to 70 teams, but how are we going to reach out to all 340 of them? Because we want to have a bigger impact.

Then we came up with the service mesh. This is the big buzzword, I guess. I'm not going to explain everything about service meshes, because there is a track for that; there are a couple of talks about service meshes here, and if you are interested, go to those talks. I'm going to explain why we think the service mesh is going to help us here.

Because a service mesh comes with a centralized control plane, we're able to create new features and enforce them for teams from a centralized position. Especially because we have all those policies from the regulators, we are able to enforce those policies in a centralized location. We're not going to the teams asking them if they would implement the latest version of TLS that we use. Now, we do it at a centralized level.

The same goes for observability. We can simply inject our observability through sidecars and have all kinds of metrics, logging, and distributed tracing in place once teams land within our service mesh. We could introduce A/B testing and canary releasing. We're doing that a little within ING, but not a lot. It would really help if we could do a lot more of this stuff, and the service mesh is going to help us here.

Engineers can focus on their service and on creating functionality for customers, which is really important. They only have to take care of security at the application level, and we can do the heavy lifting on IT risk for them. That means that they don't have to spend a lot of time proving that their service is fully risk-controlled, and we take away a lot of that work for them.

What's next for us? We want to scale our stack. Right now, we give a lot of instances to teams. We want to scale the stack down, or scale in, while of course increasing usage. We're looking into whether we could have a multi-tenant solution for Prometheus, like Cortex or Thanos. We want to expand our role as reliability advocates. Now that we have an easy-to-roll-out tool to measure services, we can go to teams and help them with their reliability issues, and actually see those issues, because suddenly we have a tool that makes us capable of seeing that. And we want to complete a proof of concept on the service mesh, because we think it's really promising.

Takeaways

Brummel: For takeaways, on the people side, we recommend hiring SREs from your product engineering domain, so they have the experience that's needed to truly change something. I feel that you should never compromise on mindset in hiring SREs. I always believe that people can learn technical stuff; it is sometimes easier to learn than mindset or cultural aspects. That's really important.

If you want to start with SRE, then you might want to start with a pilot if you're not sure whether it works for you or if you want to convince other people in your organization. Just keep it very low level to start with. Pick an SRE model that works well within your organization. Think really hard about where you want the service ownership to be, basically, and then see how you can match that with SRE. Try to get senior management support and understanding; this has really helped us in our complete journey. And invest in SRE outreach and education.

van Zijll: On the technical side, focus on scalability. We started out, for example, with Prometheus, but we had to reach out to three other teams, so we had to focus on scalability and ease of use. Otherwise, if it's too hard to use or too hard to install, no one is going to use it, and then it loses its purpose. And don't be afraid of redesigning if it makes the life of your engineers easier.

Questions and Answers

Participant 1: I'm very curious to know why would you give a separate instance of your Reliability Toolkit to each of the product engineering teams. I think you really came back on saying you're looking to have a single solution for all of them, because Prometheus does not probably give you the ability for each team to have their own siloes. Having a separate instance adds on to the maintenance cost and other things as well. What are the key advantages versus some of the downsides to them, having a separate instance of Reliability Toolkit?

van Zijll: The question is why we chose a Prometheus instance per team, and why we are now choosing to combine them. The reason we chose that is because we have this BizDevOps model where a team is responsible for build and run, and we wanted them to be in control of their own monitoring while we took care of the maintenance. We wanted to take away the burden of maintaining a full monitoring stack, but we wanted them to stay in control of the monitoring itself. The reason we're now scaling in is because of the costs, of course, and because it's not very efficient for every team to have its own instance. What we see is that multiple teams want to combine their monitoring instances, because they have one on-call team for multiple teams. So we see the wish for a sort of combined Prometheus service.

Brummel: We've also had some experience that we had some centralized monitoring solutions available for teams, but they're not always scaled properly. If, say, the mobile banking application team would send all their logging to a centralized logging environment, then sometimes that could blow up. Maybe we went a bit too far, but we wanted to make sure that everything would be there and that it could handle all the load.

 


 

Recorded at:

Oct 21, 2019
