Transcript
Wrightson: Last year I attended QCon London, and I went to a talk by Stuart Davidson from Skyscanner. He spoke about how Skyscanner were doing true continuous deployment: no test environments, no validation in lower environments. That talk led me to this spot today, both in terms of giving this talk and professionally. I am Nicky Wrightson. I'm a Principal Engineer at Skyscanner. Skyscanner is a leading travel site with a truly global market. I wanted to know how they were able to achieve this.
What Lower Environments Mean
Why do we not replicate our production environment? First, I'd just like to confirm what we mean by lower environments. These are complete replicas of production. They're only accessible internally and not to our consumers. There's often a development cycle where the code progresses towards production, moving into closer and closer replicas of production as it nears release. Staging tends to be a completely replicated environment.
Why We Use Lower Environments
I tried to think of all the use cases for wanting a lower environment that didn't involve testing, or at least gaining confidence in the code you wanted to release. I couldn't come up with anything. In fact, I Googled it and the only additional one was chaos engineering, which is testing of another form, and I wouldn't do that in a lower environment anyway. This applies not only to the code we want to release, but also to the infrastructure and third-party applications such as databases.
Historically, we had lower environments that were like production so that we could test releases whose effects we couldn't confidently reason about: monolithic applications with many commits, often built historically with little test tooling, before there was such a thing as test-driven development. More recently, with modern monoliths, purely the length of time they've been around means PRs stack up against the code base, so it becomes harder to reason about the source of any issue. However, maintaining a production-like environment that will behave like production is not easy.
Real Data: Replicating the Real World in Lower Environments
For lower environments to behave like production, it's not just about the services being deployed like for like. Equally, it's the data. Depending on your system, it can be relatively straightforward to access real-time data in lower environments. However, replicating user behavior consistently is difficult, because we have peaks and troughs in traffic that shift with the time of day, or with unforeseen circumstances such as news events. Load tests cannot anticipate when and where that load will happen.
Real Data: Reasoning About Where Data Is and How It Is Used
Not to mention real data means real responsibility to users. At Skyscanner, we have a saying: traveler first. This particularly applies to how responsible we are with their data, and we adhere to their wishes. Taking user data and publishing it all over the place means it's harder to reason about where that data has ended up. How can we produce a report on how we're using that traveler data? Those are only the problems of intentional data use. What about mistakes, accidental printouts of passwords somewhere? How do we effectively scrub all that data from all over the place, and quickly?
Infrastructure Replication
Then it comes to infrastructure. Even with every nook and cranny of your estate covered with infrastructure as code, you cannot replicate exactly what you have in production in the lower environments. Firstly, we're given layers of abstraction over the actual infrastructure on which the cloud providers are running our code. Secondly, we have different permissions and other settings in lower environments to allow engineers more flexibility. Thirdly, we have identifiers such as ARNs or account references, which invariably means we split up our code paths for staging and production anyway, so we're not testing the same code in the first place. Plus, with infrastructure as code, the whole process needs to be incredibly well defined. How do emergency hot fixes in production make it back into lower environments? How do we make sure that there's no manual intervention?
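To make that third point concrete, here is a minimal, hypothetical Java sketch — the ARNs, account IDs, and environment names are invented for illustration, not Skyscanner's actual configuration — of how environment-specific identifiers tend to force a branch in configuration code, so staging never exercises exactly the path production runs:

```java
// Hypothetical example: environment-specific identifiers force a branch,
// so the staging path is never quite the code path production runs.
public final class EventTopicConfig {

    // Made-up ARNs for illustration only.
    private static final String PROD_TOPIC_ARN =
            "arn:aws:sns:eu-west-1:111111111111:travel-events-prod";
    private static final String STAGING_TOPIC_ARN =
            "arn:aws:sns:eu-west-1:222222222222:travel-events-staging";

    public static String topicArn(String environment) {
        // The moment this branch exists, "testing in staging" is testing
        // a different configuration from the one production will use.
        return "production".equals(environment) ? PROD_TOPIC_ARN : STAGING_TOPIC_ARN;
    }
}
```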
What It All Means For Skyscanner (Stateless Java Services)
These are general difficulties in replicating environments. I want to talk a bit more about what difficulties Skyscanner had that drove us towards this production-first approach. I have used this diagram pretty much for the whole year since I joined Skyscanner. I realized that it doesn't even need to be up to date; it just shows the complexity we have in our distributed systems nowadays. Just to put it into perspective, this is just our Java stateless services. We have Python. We have Node. We have various other bits and pieces as well. We also have our stateless microservices spread over both Kubernetes and ECS. We're currently trying to roll out to 10 regions. Even just trying to work out what we needed to replicate in a lower environment would be a mammoth headache. Currently, our AWS bill is pretty eye watering. I can't even think about what it would look like if we started trying to do these replicated environments.
Elasticsearch
I work on the data platform that delivers all of our application logging and business events through to our ELK stack, and also to our data lake. Recently, we bumped into a problem: we found that there was a hard limit of 4,000 containers on ECS. We had an automatic way of creating a Logstash container for each new event being created in production, which meant the new mappings in Elasticsearch were created. Imagine if we purely replicated those 4,000 containers; the cost of transfer alone would be enormous. I'm doubtful that we would have caught this in a lower environment anyway, because it's dynamic: we'd need to have all of those triggers in place to make sure that we were creating like for like all the time, as it was automatic. Every new data point, every new event logged in production, would need to be mimicked down into, or at least tested in, the lower environments first, which doesn't really make sense.
Operational Costs
We have a hard job keeping production up and running at the best of times. It takes a lot of effort. It's a large system, and that's just our data platform. I think we would have to double that effort to look after a like-for-like production environment. Actually, probably more, because you need a load of operational tooling: turning it off at weekends to save costs, getting real data in, reduction of PII, sampling, reconstruction of those ebbs and flows that I mentioned earlier. It could actually cost you more operationally than running in production.
Spikey Loads
We wanted to do load tests on top of normal traffic, so the load tests sit on top of those spikey loads. We had quite a problem on our platform. What happened was that on a Japanese morning show, the presenter held up a phone, and on the phone was the Skyscanner app. This immediately caused a flood of downloads of our app, and all of these people opened the app at the same time. We were barely able to survive that one. It taught us a really good lesson on when and where we do our load tests.
Cookie Cutter Applications
How do we actually do this? What are the nuts and bolts that we use to get there? Tooling is critical, and tooling in many aspects. I'll go through all the aspects, and how exactly we do it now. At Skyscanner, we absolutely adore cookie cutters. We cookie cut so much: microservices through to our monitoring tooling and our big data ETL jobs. It's one of the ways that we can provide engineers with a golden path. It makes it easier for people to align on approaches and then collaborate on evolving those approaches, and it gives us consistency in those approaches. We even add things like dummy integration tests, to ensure that people are thinking about automated testing beyond unit tests. It also provides us consistency in deployments, GitHub repos, AWS resource tagging, security scanning, and so much more. It means we can get a microservice of a guaranteed level of quality into production in a matter of minutes. It's worth noting here that although we have owners for all of our code, we have the idea of internal source as well, so everybody can submit PRs to improve the code base. This is particularly important for developer experience tooling like this, because those cookie cutters could involve many teams' pieces of work, and we need to make sure that they evolve as our development evolves as well.
Automatic Rollbacks
We have our own continuous deployment tool called Slingshot. We've had it quite a while. What I love about this is we can configure it to pull back releases based on metrics. I've done quite a harsh example in the slides. This is just saying, if we get more than 1% errors, roll back. That seems quite a lot, but to put it in perspective, this is the entry point to our data platform. I've seen it handle 2 million messages per second; 1% of those is actually not a huge margin of error.
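As an illustration only (this is not Slingshot's actual API or configuration), a metric-gated rollback essentially boils down to a check like the following hypothetical Java sketch:

```java
// Hypothetical sketch of a metric-gated rollback decision.
// Slingshot's real configuration and metric source are not shown here;
// the threshold and method names are illustrative.
public final class RollbackGate {

    private static final double MAX_ERROR_RATE = 0.01; // 1% of messages

    /** Returns true if the new release should be rolled back. */
    public static boolean shouldRollBack(long errorCount, long totalCount) {
        if (totalCount == 0) {
            return false; // no traffic yet, nothing to judge
        }
        double errorRate = (double) errorCount / totalCount;
        return errorRate > MAX_ERROR_RATE;
    }
}
```

In practice the error and message counts would come from the deployment tool's metrics source, observed over a window after the release goes out.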
Request Shadowing
We started moving over to Kubernetes. We created a sidecar that could shadow our traffic over to our Kubernetes cluster to check that the deployment worked correctly, for example when moving from ECS to Kubernetes. This allowed us not to start our Kubernetes development in a lower environment, or even worse, have a big bang approach to the move. It meant that we could have like-for-like production environments and monitor on both sides.
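Skyscanner's actual sidecar isn't shown here, but the core idea of request shadowing can be sketched in Java along these lines (the shadow target URL and class are hypothetical): a copy of each request is fired at the new stack asynchronously and its response is discarded, so the live path is unaffected.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch of request shadowing: send a copy of each incoming
// request to the new (Kubernetes) deployment and ignore the response.
// The shadow target URL is made up for illustration.
public final class ShadowingClient {

    private static final URI SHADOW_TARGET =
            URI.create("http://new-k8s-deployment.internal/search");

    private final HttpClient client = HttpClient.newHttpClient();

    /** Forwards a copy of the request body asynchronously; the caller never waits on it. */
    public void shadow(String requestBody) {
        HttpRequest copy = HttpRequest.newBuilder(SHADOW_TARGET)
                .POST(HttpRequest.BodyPublishers.ofString(requestBody))
                .build();
        client.sendAsync(copy, HttpResponse.BodyHandlers.discarding())
                .exceptionally(error -> null); // shadowing must never affect the live path
    }
}
```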
Integration Testing
Most CI systems will allow you to spin up services alongside the build so you can run integration tests against them. This is one that I worked on at the FT, and it's spinning up a Neo4j instance alongside the build. However, we saw an opportunity here. We realized that integration tests are critical when you're going production-first. We also realized that with that CI setup, you were running your tests locally one way, using localstack and docker-compose, and then when they went to the CI system, they were run by the CI's own way of initiating them. We contributed heavily to an open source project called Testcontainers. What Testcontainers allows you to do is actually spin up containers from your tests, from your actual code. We only have these in Java at the moment. It also allows you to isolate the length of time that those containers are up and running. They're only alive for the length of that test, so you don't end up with leftover data from a previous test run, or from earlier tests before this one. Meaning, you won't get a shock when you get into production.
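As a rough illustration of that pattern (the container image, port, and assertion are placeholders rather than Skyscanner's actual tests), a JUnit 5 test using Testcontainers can start its own container and have it torn down when the test finishes:

```java
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

import static org.junit.jupiter.api.Assertions.assertTrue;

// Illustrative only: the container image and assertion are placeholders.
@Testcontainers
class DependencyIntegrationTest {

    // The container is started before the tests and stopped afterwards,
    // so no state leaks between test runs.
    @Container
    private static final GenericContainer<?> redis =
            new GenericContainer<>(DockerImageName.parse("redis:6-alpine"))
                    .withExposedPorts(6379);

    @Test
    void containerIsReachable() {
        // In a real test you would point your client at this host/port
        // and exercise the code under test against it.
        String endpoint = redis.getHost() + ":" + redis.getMappedPort(6379);
        assertTrue(redis.isRunning(), "expected " + endpoint + " to be up");
    }
}
```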
Dependabot
We have our little friend Dependabot. This is for creating automatic PRs against code bases, particularly for library bumps, maybe third-party library bumps where there's a security issue. Equally, this one is for our logging library that logs all of our data. Everybody needed to update it.
Double Dispatching Data
At the moment, we're also rolling out a new data ingestion pipeline. Our old one had become a bit of spaghetti code. Instead of doing a big bang approach, again, we are just double dispatching our data. At the moment, it's going through both pipelines, our visualization tooling can abstract over the data sources, and we can slowly migrate. What's important here, though, is: don't double dispatch and then never switch over and decommission. You've got double the operational cost, double the money. It's critical to work out a decommissioning path for the old legacy pipeline. Also, it encourages best practices. It encourages our code to be highly instrumented. At the end of the day, it's just a lot easier to track issues in production with really well instrumented code than to try to replicate those issues in lower environments.
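A minimal sketch of the double-dispatch idea, with the pipeline interface and names invented for illustration rather than taken from Skyscanner's actual ingestion code:

```java
// Hypothetical sketch of double dispatching during a migration:
// every event is written to both the legacy and the new pipeline
// until the new one has proven itself and the old one is decommissioned.
public final class DoubleDispatcher {

    /** Minimal stand-in for a pipeline sink; invented for this example. */
    public interface Pipeline {
        void publish(String event);
    }

    private final Pipeline legacyPipeline;
    private final Pipeline newPipeline;

    public DoubleDispatcher(Pipeline legacyPipeline, Pipeline newPipeline) {
        this.legacyPipeline = legacyPipeline;
        this.newPipeline = newPipeline;
    }

    public void dispatch(String event) {
        legacyPipeline.publish(event);
        try {
            newPipeline.publish(event);
        } catch (RuntimeException e) {
            // The new pipeline must not break the old path while it's being proven out.
            // In practice you would record this failure in your metrics.
        }
    }
}
```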
Monitoring Over E2E Tests
We prefer monitoring over end-to-end tests: real data through real systems, reacting to real problems. Above is the monitor we actually use to see the completeness of our data as it arrives through our platform, so that we make sure we don't lose data. We make it easy to do this. We have a cookie cutter for microservices, and we also have a cookie cutter for our alerting system. Our microservices ship with a baseline of standard logging and metrics automatically being sent to the data platform. For our alerting system, we use an open source one, Bosun, that we've recently taken ownership of, and that can be automatically created too. The squads are only a few clicks away from having those services automatically monitored, and hooked into various tooling such as VictorOps.
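To give a flavour of what baseline, out-of-the-box metrics can look like in code (using Micrometer purely as a stand-in; it isn't necessarily the library these cookie cutters ship), here's a small hypothetical example:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Illustrative only: Micrometer used as a stand-in for whatever metrics
// library the cookie-cut services actually ship with.
public final class EventMetrics {

    private final Counter received;
    private final Counter failed;

    public EventMetrics(MeterRegistry registry) {
        // Counters like these, emitted from every service by default,
        // are what completeness monitors and alerts are built on.
        this.received = Counter.builder("events.received").register(registry);
        this.failed = Counter.builder("events.failed").register(registry);
    }

    public void onEventReceived() {
        received.increment();
    }

    public void onEventFailed() {
        failed.increment();
    }

    public static void main(String[] args) {
        EventMetrics metrics = new EventMetrics(new SimpleMeterRegistry());
        metrics.onEventReceived();
    }
}
```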
Architecture
Our architectural decisions are necessary for this as well. You architect differently. How can we update the software that's running our software? Kubernetes. Imagine this enormous Kubernetes cluster; we have thousands of services from across the business. Then, there's an update to Kubernetes itself that we need to install. We can't test it in a lower environment, so we roll it straight to prod. You know what happens next. It did take down our entire cluster, and we actually had to resort to rebuilding it before we got a working cluster again. Lucky for us, it was a lesson learned early on in our migration. The impact was not as extensive as it could have been. We were burnt, but it allowed us to pivot.
Now we're in the process of moving to a cell-based architecture where, in each availability zone, we have many cells, each running their own Kubernetes cluster. It means entire clusters can go down and not affect the availability of the services, as they're balanced over many clusters. There would be no impact if we rolled out to an entire cluster and it didn't work; we could either roll back or just rebuild that cell from scratch.
Sandbox
We do actually have a lower environment called sandbox. However, this is definitely not a fully replicated environment, and it is utter cowboy land, much more akin to a scratch environment. Resources are not long-lived; in fact, we nuke the account every month. We also make sure we block account access between sandbox and production. We tend to use it for smaller POCs as well, although I actually do a lot of those in production, and for exploratory checks, but again, it is optional. It can be done in production. The biggest reason for it is psychological safety. This is our safety blanket. With all the best practices in the world, it can be pretty darn daunting to release straight to production. Sandbox can help allay the fears of engineers, and double check that everything we think is happening really is. However, we understand this is not a production environment.
When Is Pushing To Production Not The Right Move?
Some idea of when, then. When is it not ok to move straight to production? I work for a travel company. We have a good tolerance towards issues; some, unfortunately, don't. I've also been using this image all the way through, and it's a bit unfair because it's inherently monolithic. You can't roll forward easily because you can't get hundreds of updates a day out. We actually support our data testing by having a preview data area that's in our production account, so that we can allow for local testing in the app without actually going straight to production.
Integration with Third Parties (Tooling)
Sometimes it's hard to integrate with third parties and test that integration in production, just because of their facilities and their setup. This is definitely not all the time; it's as-and-when, and definitely dependent on the integration. It really comes back to tooling. Even if you were to use fully replicated lower environments, you would need to really invest in the tooling to produce those like-for-like environments. However, investing in tooling to work towards a production-first approach means you're actually investing in many things that will have a positive effect on your data quality, speed to market, operational cost, and cloud bills. It doesn't need to be all or nothing. You can either slowly migrate towards production-first, or just do certain areas of your business incrementally. You can reduce the overhead of those lower environments. That's why we did it at Skyscanner.