In this podcast, Daniel Bryant sat down with Dave Sudia, senior DevOps engineer at GoSpotCheck. Topics discussed included: the benefits of PaaS; building a platform with Kubernetes as the foundation; selecting open source components and open standards in order to facilitate the evolution of a platform; and why care should be taken to prioritize the developer experience and create self-service operation of the platform.
Key Takeaways
- When starting a business and searching for product-market fit, creating an application using a monolithic code base deployed onto a commercial Platform as a Service (PaaS) product is a very effective way of iterating fast and minimising operational costs.
- There may come a point where the PaaS cannot provide bespoke requirements, or it has trouble scaling, or the costs become prohibitive. At this point many teams choose to build a custom platform using cloud technologies, such as Kubernetes.
- Building a Kubernetes platform can be an effective solution, but the appropriate effort needs to be put into designing, building, and maintaining the platform. The platform effectively becomes another product within the business that must be managed accordingly.
- Embracing open standards provides many benefits, especially for the long term. Implementations that are consumed through well-defined interfaces and abstraction can be more readily swapped at a later point in time. It is also generally easier to integrate components that share common interfaces.
- Attention and resources must be provided to create an effective developer experience for the platform. It is essential to prioritize self-service operations, and also to understand the core requirements of the engineers and QA specialists that will be using the platform during their daily work.
- Establishing an effective continuous delivery pipeline can enable more repeatable and scalable testing of applications, and also allows the codification of cross-functional requirements.
- The cloud native landscape has now evolved to a point where most of the frameworks and tooling required to build a platform have become viable for general purpose usage. However, some assembly may still be required, and engineers should be prepared for change, as the ecosystem moves fast.
Show Notes
You started in teaching and then transitioned into software development and ops - can you share your journey?
- 01:00 I was a teacher for seven years in special education and then I made a career change into technology with a boot camp run by Galvanize.
- 01:10 Coming out of that, my path in wasn't as the most advanced engineer on any team, but when I learned about software testing I recognised it as a behaviour plan which I had a lot of experience in.
- 01:25 I started my career in tech as a quality engineer writing end-to-end tests for a retail platform, and then writing a platform to run those tests for other engineers.
- 01:40 That company had a cloud team of 3 people and no manager, that basically said we are going to be making VPCs and security roles, and you have to find a way to deploy them.
- 01:50 I volunteered to learn that, got up to speed in eight months with CloudFormation and devops - then I got headhunted as a devops engineer.
- 02:05 I've been doing that for a little over two years, and it's a space that I really enjoy, thinking about how to build systems, and I love teaching people things to expand their careers.
- 02:25 As a special education teacher, most of my job was spent getting general education teachers to understand how to do my job - general education teachers in the US are now generally expected to do everything.
- 02:45 My job was pushing in and helping those people - and devops is the same thing: pushing ops onto developers and getting them to believe that they can and should be doing it.
- 02:55 My through line on my skill set has been getting people to take on more work.
What was the elevator pitch for your KubeCon 2019 conference talk [https://www.youtube.com/watch?v=AqMxaxJsJKY]?
- 03:20 There are so many talks that you can go and see when you are learning about tools, and fewer case studies of how to use the tool.
- 03:30 When people are talking about tools, they intimately know them - they know the use cases; their elevator pitch is that they are amazing.
- 03:40 There's a reality of implementation that is difficult, while also providing the benefits that they claim to provide.
- 03:45 We wanted to show a real-life scenario - this is what worked out for us, and these are the challenges that we ran into.
You started with a monolith and moved into micro-services?
- 04:00 The company started with a Rails monolith, on Heroku.
- 04:05 As we've grown, it has become more difficult to scale - at one point, we were one of Heroku's top ten customers by revenue.
- 04:15 We hit the edge of what they provide - they are an amazing service, and we didn't move off because we were unhappy.
- 04:20 Heroku enabled us to become big enough that we needed to move off Heroku.
- 04:25 For any given service, they will provide up to 10 servers - and we were at 10.
- 04:35 On their provisioned Postgres database service, it's great until you hit use cases where their settings don't match what you need - and there's nothing you can do about it.
- 04:45 One of the reasons I was hired on was because there was a need in moving off of Heroku, and I had been working exclusively on AWS and had been doing that work to help that transition.
Would you recommend starting with monolith and then evolving to micro-services later?
- 05:20 I think things should start as simple as they can be.
- 05:25 You make a micro-service because you need to scale the team differently to anything else.
- 05:30 If you're a team of three people, then there's no team to scale differently.
- 05:35 If you have a service that needs to scale separately, that's very unlikely to happen in the first couple of years of your existence, unless you have an insane event.
- 05:40 My advice to anyone starting is to write an elegant monolith first.
- 05:50 The reason why you have problems with a monolith is everything is tied up together (like our ActiveRecords).
- 06:00 There's a certain extent to which we are never going to break apart that particular monolith, because it's just not possible at this point.
- 06:10 We can take little bits and pieces, and new functionality, and write new systems that we can transition functionality into.
- 06:15 On the flip side, a lot of our micro-services are deployed as one unit, but internally we have adopted Go and are using Go Kit as a framework for writing those.
- 06:25 Peter Bourgon, who writes Go Kit, will tell you that it's for enterprise micro-services.
- 06:40 Each deployable unit might have 10 services inside it, but each of those services in the future could be removed and broken out into its own deployable unit if we need to scale it differently.
- 06:50 What we have is clusters of mini service groups that are all sharing a repo and deploying together, and just happen to be very micro implementations that are sharing a deployable unit.
- 07:10 My recommendation is: start with a monolith, but with clean interfaces where you could break them apart, and deploy it to an app server.
- 07:20 My favourite keynote from last year's KubeCon was "When is Kubernetes going to have its Rails moment?" [https://www.youtube.com/watch?v=ZqQTEdHVaCw]
- 07:30 Kubernetes is hard: in my talk, I say that in the age of Heroku, you don't need my team - but now we have my team for building the Kubernetes stuff.
- 07:40 Our goal for this year is to build a better platform internally, and we're relying on a whole bunch of other people off my team, because it's a big enough project and we don't have time to do it all.
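The recommendation above, a monolith with clean interfaces that could later be broken apart, can be illustrated with a minimal sketch. It is not GoSpotCheck's code: the names `InventoryService`, `InProcessInventory`, `RemoteInventory`, and `OrderService` are hypothetical, and the "remote" transport is a plain callable so the example runs without a network. The point is only that callers depend on a contract, so an implementation can move into its own deployable unit without the callers changing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class InventoryService(Protocol):
    """Contract that callers depend on (hypothetical name)."""

    def stock_for(self, sku: str) -> int: ...


class InProcessInventory:
    """Implementation that lives inside the same deployable unit."""

    def __init__(self, stock: Dict[str, int]) -> None:
        self._stock = dict(stock)

    def stock_for(self, sku: str) -> int:
        return self._stock.get(sku, 0)


class RemoteInventory:
    """Stand-in for the same service after it has been split out.

    The transport is an injected callable (an assumption made so the
    sketch stays runnable); a real system might use gRPC here.
    """

    def __init__(self, call: Callable[[str, str], int]) -> None:
        self._call = call

    def stock_for(self, sku: str) -> int:
        return int(self._call("stock_for", sku))


@dataclass
class OrderService:
    """Caller that only knows about the InventoryService contract."""

    inventory: InventoryService

    def can_fulfil(self, sku: str, quantity: int) -> bool:
        return self.inventory.stock_for(sku) >= quantity
```

Swapping `InProcessInventory` for `RemoteInventory` when constructing `OrderService` is the only change needed on the caller's side, which is the property that makes a later split cheap.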
You mentioned there were two ways to fix a problem: throw more money at it, or stop throwing money at it. How do you know when to do which?
- 08:25 That wasn't addressing scaling - that was the two phases in the lifecycle of a startup.
- 08:30 GoSpotCheck grew 100% year-over-year for the first five years of its life - if there was a problem, we threw money at it, as funding wasn't a problem then.
- 08:45 So, for example, if we ran out of space in a database, we simply bought the next bigger one from Heroku.
- 08:50 Then we hit a year where we weren't growing 100% year-over-year, and we've risen to the challenge of hitting that first plateau very well.
- 09:00 We were given the mandate that it was time to move off of Heroku, and then two months later there was a spending freeze.
- 09:15 One of the challenges we had was how to build this parallel platform while as much as possible trying not to spend money.
- 09:20 We ended up spending more money, but it wasn't the same way as before - we had to do it in a cost conscious iterative way.
- 09:30 We weren't going out making huge multi-regional disaster recovery cluster systems - keep it as simple as possible.
- 09:35 We are still running zonal clusters in Google at the moment - we'll get regional clusters soon, but they bring their own set of operational complexity.
Why did you choose Kubernetes as the base for your platform?
- 10:10 There was a great presentation at QCon London 2020 on "Kubernetes is not your platform; it's the foundation" [https://www.slideshare.net/ManuelPais/kubernetes-is-not-your-platform-its-just-the-foundation-qcon-london-march-2020]
- 10:15 You have to build something on top of it to make it easy to use.
- 10:20 We were moving off of Heroku, and at the time we had a platform team, and if we were going to make a seismic shift in our infrastructure we wanted to future-proof ourselves as much as possible.
- 10:40 You have to make a huge investment in building infrastructure - it is something that is going to be with you for some time.
- 10:45 There were assumptions about the Heroku environment that we had to figure out how to replicate or bring over into Kubernetes.
- 10:55 We could have gone to auto-scaling groups, but then moving to Kubernetes we would have to invent everything again.
- 11:00 The fact that we are two and a half years on, and we're still working on the platform, speaks to the investment that we are willing to make.
- 11:20 There are tools that you wish existed but don't - one of the lessons I had in my talk was: if you can wait six months, do so.
- 11:25 We are just putting stateful things into our cluster now, like the Couchbase Autonomous Operator.
- 11:30 When one of my devs wanted to use Couchbase with Kubernetes support, we wanted to wait until we had Velero running to back up clusters.
- 11:50 I know how to back up Postgres databases by taking a snapshot of the files, but I didn't know how to take a backup of persistent volumes or restore them.
- 12:00 If you provision a system with Kubernetes with a persistent disk, it will get one for you - it's just a disk, and you can snapshot it - but what's the process for bringing it back in?
- 12:10 Every time we come to a new problem, the first thing I do is read what has cropped up in the last couple of months, because the ecosystem is moving so fast.
- 12:35 Kubernetes has won the war on being the foundation of any future platforms; Heroku behind the scenes is moving to Kubernetes, for example.
- 12:50 Google Cloud runs on top of Borg, which is their internal predecessor to Kubernetes - you can look at the project structure and recognise where you are.
- 12:55 That was our reasoning, and that leads on to a number of other decisions - our platform team was looking at Elixir and Go; I think Elixir is cool, but at the time there was no gRPC library for it.
- 13:10 There wasn't any Prometheus metrics support for it either; the whole accompanying ecosystem of CNCF projects exists for Kubernetes and the languages that are there, but it's more difficult for anything else.
- 13:30 It's becoming less of a binding agreement now - you can get Prometheus for Elixir now - but it felt like the longest term play we could make knowing that there would be difficulty up front.
How did you go about choosing the tech stack from the Cloud Native Computing Foundation (CNCF)?
- 13:55 I have been following the CNCF since there were four projects - and the landscape has grown.
- 14:05 It's not just the technologies - it's all of the companies that have a tech stack built on those technologies as well.
- 14:10 If you look at the CNCF sandbox now, there are forty things in there - I remember when it was just Kubernetes, Linkerd, and Prometheus.
- 14:20 We picked Kubernetes, and from there, the obvious choice at the time was Prometheus for metrics - everything else has been built on Prometheus since.
- 14:35 We already used SumoLogic to ship logs out of Heroku, and they jumped on this very early: someone on their team wrote a Fluentd adapter that you could install as a Helm chart to ship the logs to Sumo.
- 14:50 That became an easy choice: Fluentd to ship the logs, Sumo as the log destination, and Prometheus for metrics.
- 14:55 The platform team made an early commitment to those CNCF open source projects - a lot of these were good long-term decisions but caused pain in the meantime.
- 15:10 We could have stuck with New Relic, which is what we were using on Heroku, or we could have moved to Datadog or another more proprietary system.
- 15:20 I said this in my talk, that these companies are aiming to provide the best value-add on top of these open-source technologies.
- 15:30 It gives us the flexibility to do what we need to do in any situation we find ourselves in.
- 15:35 As an example, right now SumoLogic came out with a system for receiving Prometheus metrics - we were in the beta for it.
- 15:40 You have to use Sumo's metrics queries, and they don't quite match up with Prometheus PromQL because they already quantise stuff in the query.
- 15:55 That's fine, because while we're figuring that out, we're using Grafana in our cluster.
- 16:00 People can go in directly to the Prometheus and run queries, and we can run all of our dashboards in SumoLogic to give an overview of our telemetry.
- 16:15 Let's stick with the open projects, because then we have the best set of options, and we don't have to ask the developers to go back and completely re-instrument things.
- 16:30 If we had stuck with New Relic, and then SumoLogic had come out with Prometheus metrics support, we would have had to go back and re-instrument everything with Prometheus.
- 16:40 The same would be true if we had picked some proprietary tracing system instead of OpenTracing - it's great, because you only need to add five lines at the beginning of main.go and you can send traces to Jaeger.
- 16:55 My developers are using Jaeger to send their traces, and we're in an early pilot with someone where we're sending traces to them - but they're in an alpha at the moment.
- 17:05 In the meantime, I deployed the OpenTelemetry Collector to receive my developers' traces, bifurcate them, and send them to this vendor as well as Jaeger.
- 17:15 The vendor built their product around the OpenTelemetry Collector so they can receive traces shipped to them.
- 17:20 It's a great pipeline, because the only thing that had to change for my developers to adopt a brand new alpha application was changing the address of where to send Jaeger traces.
- 17:35 It was a one-line configuration change, and no other changes were necessary.
- 17:40 If you invest in the open technologies, and everyone else invests in those open technologies, then you have a lot of options to choose from.
- 17:50 I don't want to run my own ELK stack - I don't want to run my own Prometheus with lots of storage; I have three people.
- 17:55 I'm happy to pay for it, but there's been some pain for us in waiting for the market to catch up to where we are and support those open technologies.
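The trace pipeline described above, where applications send spans to one address and a collector forwards copies to both Jaeger and a vendor, can be sketched in a few lines. This is an in-memory illustration of the fan-out idea only, not the OpenTelemetry Collector's actual API or configuration; `Span`, `RecordingBackend`, `FanOutCollector`, `TracerConfig`, and `emit` are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass(frozen=True)
class Span:
    trace_id: str
    name: str


class SpanSink(Protocol):
    """Anything that can accept a span: a backend or a collector."""

    def export(self, span: Span) -> None: ...


@dataclass
class RecordingBackend:
    """Stand-in for a trace backend; it just stores spans in memory."""

    name: str
    received: List[Span] = field(default_factory=list)

    def export(self, span: Span) -> None:
        self.received.append(span)


class FanOutCollector:
    """Receives each span once and forwards it to every backend."""

    def __init__(self, backends: List[SpanSink]) -> None:
        self._backends = list(backends)

    def export(self, span: Span) -> None:
        for backend in self._backends:
            backend.export(span)


@dataclass
class TracerConfig:
    # The only setting the application owns: where to send spans.
    endpoint: SpanSink


def emit(config: TracerConfig, trace_id: str, name: str) -> None:
    """Application-side call; it does not know how many backends exist."""
    config.endpoint.export(Span(trace_id, name))
```

Pointing `TracerConfig.endpoint` at a `FanOutCollector` instead of a single backend is the only application-side change, which mirrors the one-line address change described in the notes. Adding or removing a vendor then happens in the collector, not in each service.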
Are you using Terraform or CloudFormation to knit things together?
- 18:15 All of our infrastructure is created in Terraform, and that's worked out really well for us.
- 18:20 In Heroku, we had a presence in Amazon - we had a couple of S3 buckets, and some experience in running Elastic Container Service.
- 18:30 Terraform has been nice because it handles all these providers and allows us to wrangle everything without having to go through a bunch of different codebases.
- 18:40 For databases, we're using a provider called Aiven as a microservices database, and everyone's making a plugin for Terraform.
- 18:50 From there, on the developer side, we're working with a company called Harness, and they're not built on top of Spinnaker - it's a 100% original platform for people who are investigating the tools.
- 19:05 It's a product that sits in the lifecycle where Spinnaker is; it's a continuous deployment tool that's built around pipelines and workflows and putting things together.
- 19:15 At the time we were investigating, they were the only people in the game - we like them, it's working out really well for us, and we're pushing their roadmap forward in some ways and stretching the product's capabilities.
- 19:30 In terms of these things being early days; I've used Spinnaker once in a Google learning deep dive session on Istio, and something went wrong, and the error message was plain text Java logs in the browser.
- 19:50 Pinterest did a talk at KubeCon last year and they have been building an internal deployment tool that is sugar on top of Spinnaker.
- 20:05 So many tools are written 'for me' and not for developers, and we have QA people who are familiar enough with the command line to run some Heroku commands, but that is about it.
- 20:15 Their preference is to work in a GUI where they can go and push a button - I've reviewed it, ship it.
- 20:20 Previously we had built some horrible things in CircleCI, and we've moved things over to Harness because it's a purpose-built tool.
- 20:30 Circle is what we use for continuous integration and testing.
How important is developer UX and self-service actions?
- 20:50 A couple of people have asked how we built a self-service platform on top of Kubernetes, and my response is we haven't: we've glued together some tools.
- 21:05 Circle remained our CI platform, we got Harness, and my team focussed on writing YAML and working out how to write it well, starting with copy-and-paste.
- 21:25 We moved the first non-greenfield application over from Heroku, and we had also spun up a new team writing completely Kubernetes-native applications.
- 21:40 We had people writing Go micro-services that were being natively built and pushed out as micro-services to Kubernetes.
- 21:45 Then we had these Rails apps where we had to figure out how to containerise them and replicate Heroku buildpacks, which was a lot of work to bring over.
- 21:55 We did a lot of those things, and the most successful ones were when someone from my team sat down with a developer and we co-worked in a devops partnership to build it.
- 22:15 That holds true as we've advanced our conception of the platform.
- 22:20 Early on, we just had every service behind its own load balancer, then we adopted Ambassador as an API gateway.
- 22:30 Recently we had a team that had been working with year-old conceptions of our infra, with separate load balancers and traffic going back out and in, instead of staying inside a service mesh behind Ambassador.
- 22:45 I wasn't going to put it on him to migrate, but he couldn't put it on me because he didn't know his RPC names, so we sat down together for a couple of hours and made it happen.
- 22:55 What we're trying to do this year is to sit down and build a platform-as-a-service.
- 23:00 I'm not sure what that's going to look like yet, but mainly that's because I'm not going to be the only person working on it.
- 23:10 What we've learned in the last year and a half is that you can't just have a platforms team that goes off and builds something.
- 23:15 One of the reasons that our platforms team is not a platforms team is that most of the people (who are still here) are sitting with the application developer teams.
- 23:25 The thing that worked for Pinterest is that they put a product manager on it, and treated it like a product.
- 23:30 That's the tack you have to take - even with our tiny team, we don't have a product manager, and I manage our backlog (very badly).
- 23:40 Even with that, I have borrowed product owner time, I have got our VP of technology in and we have made personas.
- 23:50 We interviewed our developers internally, and we identified people who loved to get deep in the weeds of configuration, and there are people who want to ship.
- 24:00 Both of those things are completely valid - and there's also brand new developers who don't know how any of this works, and there's QA people and support.
- 24:10 One of the things that we've learned over the last year is how you support software that is being continuously delivered.
- 24:20 A lot of companies that are full-in on continuous delivery are still very small - and it's easy to ship five times a day when you don't have Tier-1 and Tier-2 support teams who need to know if something has broken.
- 24:30 You've got all these teams who run their experiments in production by canarying their traffic over, which is great - but then you have to let support know that they may get 50 tickets from broken canary ticks.
- 24:45 The support manager keeps googling how to support continuous delivery, and there's nothing - so we wrote a persona for them as well.
- 24:55 We are trying to build this out in a way where I don't have a platform team to just build this platform - we're trying to do this in a workgroup fashion.
- 25:05 We've split the SDLC up into phases - the first one being local development.
- 25:15 One of the pieces of feedback from people is that the way of improving developer efficiency is to give everyone a 64-core laptop, because with the full stack of micro-services my fan is blowing out.
- 25:25 I can't do that - but I can give it to you in the cloud, so we're seriously looking at how you could do that in a cluster.
- 25:30 We're looking at Telepresence and Skaffold, and everyone is asking how you can do that.
- 25:45 These working groups have representation from someone who is a bit wonky, someone who just wants to ship, someone who is brand new, a QA person - we're trying to get a cross-section of the company into these things.
- 25:55 As we try and get further towards monitoring and observability and shipping it for a second time, we'll get in support people.
- 26:05 The work doesn't rest on me or my team - we're not having to build this whole thing, we're not the arbiters of truth or what's the best thing.
- 26:15 Our role is turning out to be letting people know there's a tool that can let you hot-reload the code into the cluster.
- 26:20 I'm the encyclopaedia - I keep getting the newsletters, reading up on what the new things are, regularly searching for new things, and I'm feeding them out to others.
- 26:30 I'm identifying four or five options to fill a particular gap, and we can collectively decide what looks good.
- 26:35 From there, we will discover that we need our own cloud native buildpacks [https://buildpacks.io/], because the existing ones aren't quite right.
- 26:45 We'll write some stories, write a buildpack that fits our needs for our Ruby on Rails applications, and someone can go and write it and others can benefit from it.
- 27:00 What we found having greenfield or open processes for all those teams is that we wrote those templates for others to use.
- 27:05 A senior engineer from a different team can look at the template and like 97% of that, but not these five pieces; and the next team would say similar things.
- 27:15 I don't think that was bad - we ended up coming to an agreement across an engineering org that we needed to standardise on these things.
- 27:25 We needed one way to write a Go service, because we had people transferring teams, and they needed to interact.
- 27:35 The promise of gRPC is that you have one canonical set of contracts as a package that will allow you to talk to other services internally.
- 27:40 We had three different repositories following three different conventions.
- 27:45 From an early experimentation standpoint, that's fine - we got to the place that we like, and now we're re-convening and landing on the process of how to work through it.
- 28:00 The canonical way of doing things is decided by workgroups - and if you need to do something outside that, fine - but before it goes to production you need to have worked with other teams to bring it in to the box.
- 28:10 If you want to invent the way we need to do functions, that's fine - but we need to have the canonical way of doing functions by the time you go to production and that's the way to do it.
- 28:20 I don't know if that's the right way to do it or not, but let's get together in a year's time to do another podcast to find out.
- 28:30 We're trying to find a balance of how to standardise across an engineering organisation that has scaled to more than 50 people while allowing for experimentation.
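The canarying mentioned earlier in this section, and the support team's need to know who might be affected by a broken canary, can be illustrated with a small sketch. It assumes a deterministic, hash-based traffic split keyed on a customer identifier; the notes do not say how GoSpotCheck splits traffic, and `bucket`, `route`, and `affected_keys` are hypothetical helper names.

```python
import hashlib
from typing import Iterable, List


def bucket(key: str) -> int:
    """Map a key to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(key: str, canary_percent: int) -> str:
    """Send roughly canary_percent of keys to the canary release."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if bucket(key) < canary_percent else "stable"


def affected_keys(keys: Iterable[str], canary_percent: int) -> List[str]:
    """List the keys that would currently be served by the canary."""
    return [key for key in keys if route(key, canary_percent) == "canary"]
```

Because the split is deterministic, the same customer always lands on the same side for a given percentage, and a list like the one `affected_keys` produces is something a support team could be handed before tickets start arriving.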
What's your one bit of focussed advice to those who are starting the journey?
- 29:00 There are two things, which are closely related because they are opposites.
- 29:05 We are at a point where you can be purposeful about what you're going to build.
- 29:15 When we started, it was just Kubernetes - you spun up a cluster, and there were a few tools, but apart from that you were just writing YAMLs.
- 29:20 There are so many tools now: you can go out and find the one that looks like it's going to work best for your team and can build more of a platform from the get go.
- 29:30 If you're jumping into Kubernetes, there are more tools now and it's a bit more polished.
- 29:35 The flip side of that is: always be ready for change - the ecosystem is moving so fast that you are going to get overwhelmed with the amount of information coming in.
- 29:45 The reason to be purposeful is that you can discard what is superfluous to what you are trying to build.
- 30:00 Keep things as simple as possible; there are new tools coming out every couple of days for this environment, and you don't need most of them.
- 30:10 You need what will get your application to production, and what will allow you to make sure it's still running - and that's what you should limit yourself to.