
DevOps

Platform Engineering as a (Community) Service

Mar 28, 2021 19 min read

Nicki Watt

reviewed by

Manuel Pais


Key Takeaways

Platform Engineering is a discipline that involves doing whatever it takes (both technical and non technical) to build, maintain and provide a curated platform experience for the communities using it
Feature and Product Teams may well be your obvious primary users, but they’re not the only ones! Remember to recognise and take all your communities into account.
Start by building on a solid foundation which requires having executive (C-Level) buy-in and being able to leverage strong technical expertise and experience.
Aim to create a platform that has clear boundaries & responsibilities; prioritises self service & automation; offers a flexible & evolvable experience; and is wholly reliable
Adjust your thinking! Aim not just to provide a service, but rather to be a service to your communities. This involves being guided by community driven principles including making teams independent of you; promoting freedom not autonomy; being a role model; as well as respecting & recognising community differences

Platform Thinking Is Not Just About The Tech

As companies push to adopt innovative new technology platforms, they seem to have lost sight of the human innovators and communities who actually need to use them. Platform initiatives that don’t quite work out as planned often seem to suffer from an unhealthy focus on cool new technologies, with far less attention paid to the broader human or developer experience around them. I believe a different way of approaching this challenge is required.

So what does it take to build a fit-for-purpose platform? And specifically, what does it take to build one that manages to successfully target all the communities they aim to serve?

Drawing on work with a number of different clients and projects, this article shares observations and insights into what worked as well as what didn’t, and will hopefully get us a little closer to answering these questions. TL;DR: it's not just about the tech!

Platform Engineering - What Is It?

At its core, platform engineering is all about building, well, a platform. In this context, I mean an internal platform within an organisation, not a general business platform for external consumers. This platform serves as a foundation for other engineering teams building products and systems on top of it for end users. Concrete goals include:

Improving developer productivity and efficiency - through things like tooling, automation and infrastructure-as-code
Providing consistency and confidence around complex cross cutting areas of concerns - such as security and reliable auto scaling
Helping organisations to grow teams in a sustainable manner to meet increased business demands

Matthew Skelton concisely defines a platform as “a curated experience for engineers (the customers of the platform)”. The phrase “curated experience” very nicely encapsulates what I have come to recognise and appreciate as a crucial differentiator for successful platforms. Namely, it’s not just about one technology solving all your problems (Kubernetes, I’m looking at you). Nor is it about creating a wrapper around a bunch of tech. Just as important are the processes, interactions and end experience of those using the platform.

But I would argue we should take this definition a step further, looking beyond the obvious engineering community it targets to the broader, diverse communities involved, and taking their collective requirements into account as a first-class concern when deciding what should, and should not, constitute the platform. In summary:

Platform Engineering is a discipline which involves doing whatever it takes (both technical and non technical) to build, maintain and provide a curated platform experience for all the communities using it

Who Are These Communities?

Traditional Engineering Teams

The primary and most obvious community comprises a mixture of feature- and product-style teams. Focused on delivering business value to end users as fast as possible, this community does not want to get bogged down writing plumbing code. For this community, a good platform will ease their lives considerably by providing tools and services to help with provisioning infrastructure, getting decent and possibly templated/cookie-cutter CI/CD pipelines up and running fast, and wherever possible, minimising the wrangling needed to do non business critical work.

Data Analysts & Scientists

Less obvious, but becoming more prominent, is the data analyst and scientist community. The commoditization and incorporation of machine learning (ML) and artificial intelligence (AI) into systems has also seen a rise in the need to improve the maturity of the operational processes and life cycles backing them. Unlike with traditional engineering teams, you can’t just apply standard DevOps style processes to ML operations - there are some significant differences. Enter MLOps. It’s not only code involved now, but also data (both training and real), model parameters, multiple pipelines and more, all of which need versioning, tracing and handling. Skill sets also differ vastly - the more scientific and less engineering focused nature of the work requires processes to be adapted to allow for a more experimental approach whilst still providing safety nets such as source control. A platform which supports this community must take all these unique challenges into account.

Leadership & Governance

Then there is the “sometimes forgotten, sometimes over emphasized” community of leadership and governance. This primarily means your C-Level sponsors, who are looking for the platform to deliver on bigger, faster, better promises, but it may also include others such as regulatory and compliance, governance and finance functions. These often less technical groups look at the platform from a completely different angle. Instead of using the nuts and bolts of the platform to produce software artifacts, this community is more interested in extracting valuable information out of the platform to measure and assess broader benefits, impacts and outcomes. For example, they are often looking for information which will help them answer questions such as: Is our overall cloud spend and utilisation efficient and within limits (perhaps broader cloud contracts need negotiating, or budget/opex projections need considering)? Are privacy, security and regulatory compliance standards being adhered to across the teams?

So whilst traditional feature and product teams may well be your obvious primary users, they are not the only ones! Remember to recognise and take all your communities into account.

Patterns Underpinning A Successful Platform Experience

So for all of these communities, what does a successful platform experience look like, what does it feel like? And conversely, what does it look like when it all goes a bit wrong?

Pattern #1: Clear Boundaries & Responsibilities

Teams need to understand what is required of them to be a good platform citizen, as well as where they have license to go off-piste and the conditions allowing that. A good platform engineering team will make it very obvious what is a platform responsibility vs that of the teams they serve. Drawing such lines in the sand early helps minimise frustration and also promotes better collaboration and a faster, more efficient delivery of overall value.

Anti-Pattern: The “Blame Game” Or “Pass The Hot Potato”

When clarity in this space is lacking, the end result is that problems simply get moved round, but never solved. Additionally, unrealistic expectations regarding future functionalities also begin to crop up - "Let's just wait for the platform team to do it, it's their responsibility".

We worked with one client where the developers kept getting out of memory issues, but it was not clear where the issue lay and what were considered platform issues vs application errors. Debugging microservice problems became a nightmare, and whilst it started with this one specific issue, within months it had snowballed. At the first sign of trouble, no matter what the problem was, the answer from the dev team was always “It’s not our problem: the platform team released a new feature, they must have broken it”. Likewise the platform team responded with something along the lines of “devs don’t bother looking at logs, when they’ve done that, then we’ll get involved”.

Besides a lot of education and coaching to help ensure each side took appropriate ownership, another key outcome was to explicitly define the categories of errors which belong to the platform vs the application team. This can be done in a platform contract (discussed later under “Define a Platform Contract”) or similar.
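To make the idea of explicit error categories concrete, here is a minimal sketch of how such a mapping could be written down. The category names and the `first_responder` helper are hypothetical and purely illustrative; a real list would be whatever the platform and application teams agree between them.

```python
from enum import Enum


class Owner(Enum):
    PLATFORM = "platform"
    APPLICATION = "application"


# Hypothetical categories for illustration only. A real contract would list
# the categories agreed between the platform team and the teams it serves.
ERROR_OWNERSHIP = {
    "node_not_ready": Owner.PLATFORM,
    "cluster_dns_failure": Owner.PLATFORM,
    "container_exceeded_own_memory_limit": Owner.APPLICATION,
    "unhandled_application_exception": Owner.APPLICATION,
}


def first_responder(category: str) -> Owner:
    """Return the team that investigates first for a known error category."""
    try:
        return ERROR_OWNERSHIP[category]
    except KeyError:
        # An unlisted category is a gap in the contract, not a reason to
        # pass the problem back and forth between teams.
        raise ValueError(
            f"no owner agreed for {category!r}; update the contract"
        ) from None
```

The useful property is the last branch: an error nobody has claimed becomes a visible gap to fix in the contract, rather than a hot potato.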

Pattern #2: Self Service & Automation

End users want tools and platforms which give them the freedom and independence to go as fast as they can in delivering value to their own end users. Innovation and experimentation are also encouraged when there is little to no friction from a platform team. That is why one of THE top wishes of platform end users is to have self service capabilities in the platform through a combination of clearly defined interfaces, tools and processes. And ideally, for this to be as automated as possible to make it reliable and fast.

Anti-Pattern: Death-By-A-Thousand-Jiras

When this is missing, delays become part and parcel of delivery life, and lead times move from hours to days or even weeks. At one large enterprise organisation we encountered, you needed to raise 5-10 Jira requests (all in the right order) for people to create the right machines or infrastructure resources. This was extremely inefficient and caused enormous frustration for all involved. Side note: Jira does not qualify as self service unless a request actually kicks off a fully automated process. Jira, and similar tools, can often hide an army of manual processes behind them if you are not careful.

Pattern #3: Flexible & Evolvable

It is understood platforms will be opinionated to some degree to help ensure order, as well as provide improved productivity and efficiency. But to be practically useful, platforms must be able to adapt to diverse community needs, including allowing for deviation where required. This means moving away from a one-size-fits-all approach to one which instead offers guardrails and guides (through templates, tooling etc). This helps steer engineers in the right direction, ensuring they don’t veer off track completely, without boxing them in unnecessarily. Spotify have written up a great example of how they did this.

Anti-Pattern: Rogue Teams & Solutions

If the platform is inflexible, it simply becomes a bottleneck. By restricting teams' abilities to evolve and take advantage of innovation - you inadvertently land up birthing rogue teams who simply bypass the platform, rendering it a white elephant.

Worse still, teams may choose to stay within your strictly set boundaries, but then find ingenious but completely inappropriate ways to make their stuff work. One client decided the only legitimate action available to teams was to create services and resources within designated OpenShift platform clusters. At that time, no access to native cloud services (AWS) was allowed. Sticking to the letter of the law, teams landed everything in OpenShift, including common databases and messaging systems. In reality, a much better architecture would have involved consuming some of the native cloud services. (Note: this was before Kubernetes got better at running stateful workloads.) The pain and inability to access the right technology because of inflexible platform rules made the teams work around the problem. This resulted in an overall solution that was far from optimal, with unnecessary duplicated storage and synchronisation challenges.

Pattern #4: Reliable & Caters for Day 2 Operations

If engineers and operational staff are going to run their solutions on your platform, they will expect the fundamental operations underpinning it to be rock solid. Confidence that low level plumbing activities such as auto scaling and provisioning will “just work” (provided of course they’re configured correctly) is key to winning the trust of both technical and management teams. This includes making appropriate tooling and dashboards available to help troubleshoot and diagnose issues - whether in development or production environments.

Anti-Pattern: DIY Nightmares

At one end of the scale, if the platform itself is unstable, you will inevitably land up with unhappy end clients as the customer experience will be directly negatively impacted.

At the other end of the scale, if the tools and processes to manage the platform are not reliable, teams may well start taking matters into their own hands. As they build their own coping mechanisms, mini ecosystems begin popping up everywhere. In one case we found a client’s team had resorted to building their own customised log aggregation and visualisation system because they were not given access to, or insight into, real time logs and metrics. They were flying blind, and in the end this DIY system became “the way” to debug your app, instead of using simple but proven and stable options.

So How Do I Platform Engineer Well?

Prerequisite: Executive Buy-In & Technical Expertise

Required changes affect not only technology, but also organisational structures and processes as well. All the successful platform engineering initiatives I have come across have been built on a foundation which enjoys both executive (C-Level) buy-in AND access to a sufficient level of technical expertise and experience.

Executive buy-in empowers teams to make the organisational changes needed to support a platform initiative. Without it you are unlikely to succeed. Likewise, a core part of your team and leadership should include technical expertise with real world experience in modern cloud and distributed systems. This ensures the technical approach is sound and helps prevent going down blind alleys and rabbit holes.

Note that I am not saying you need a whole team of hard core Google or Facebook style engineers, but without some of these core skills and experience “the road tends to be long with many a-winding turns”.

Community Driven Principles

Moving away from the purely traditional “technical” as-a-service mental model, I believe we would be better off moving towards one which puts our communities at the heart of what we do. Platform engineering should aim not just to provide a service, but rather to be a service to our communities.

How? By following what I call four basic “Community Driven Principles” and letting these guide the practical solutions and approaches we come up with to solve problems.

1. Make Teams Independent of You

The platform should create an empowering environment for teams. Whilst teams will be dependent on the platform to some extent, high on your priority list should be avoiding scenarios where they need you, or your team, to personally intervene each time they need to do something. Often referred to as blocking dependencies, these should be avoided at all costs. Instead you want non-blocking dependencies through things such as self-service offerings and great documentation.

Define a Platform Contract

Much like the AWS Shared Responsibility Model concept, defining a platform contract which clearly sets out the areas of responsibility covered by the platform vs those of the consuming teams is a great start. Such a contract is often just a simple document on a wiki. For a Kubernetes-based platform, examples of the areas typically covered include:

Security & Compliance: The platform will be responsible for patching Kubernetes nodes, but teams must take responsibility for scanning their own containers for vulnerabilities. Whilst the platform offering may provide the tools and services to do the scanning, it’s the team’s responsibility to incorporate these into their pipelines.
Resource Handling: Teams will be responsible for setting their own CPU and memory limits on pods and so on. However, management of overall clusterwide resources will be a platform concern.
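As an illustration of the resource handling split, a team-side check might look something like the sketch below. It assumes a pod spec already loaded as a Python dictionary in the usual Kubernetes shape (`containers[].resources.limits`); the `containers_missing_limits` name is my own and not part of any tool.

```python
def containers_missing_limits(pod_spec: dict) -> list[str]:
    """Return names of containers that lack a CPU or memory limit.

    `pod_spec` is assumed to be the `spec` section of a pod manifest,
    already parsed into plain dictionaries and lists.
    """
    missing = []
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if "cpu" not in limits or "memory" not in limits:
            missing.append(container["name"])
    return missing


# Illustrative spec: one container is fully specified, two are not.
example_spec = {
    "containers": [
        {"name": "api",
         "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
        {"name": "sidecar", "resources": {"limits": {"cpu": "100m"}}},
        {"name": "worker"},
    ]
}
unbounded = containers_missing_limits(example_spec)
```

A check like this could run in a team's own pipeline, which keeps the responsibility where the contract puts it while the platform concentrates on cluster-wide capacity.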

Be careful however to ensure boundaries are defined very concretely and ideally have examples. If not, you may fall into the "it's not my responsibility" anti-pattern. For example, the statement below is too broad and leaves far too much open for interpretation.

The Platform team is responsible for the Kubernetes clusters' security
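One way to keep a contract honest is to treat each responsibility as a small record that must carry at least one concrete example. The sketch below is only an assumption about how that might be modelled; the `Responsibility` type, the `too_vague` helper and the example wording are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Responsibility:
    area: str
    owner: str  # "platform" or "team"
    statement: str
    examples: list[str] = field(default_factory=list)


def too_vague(items: list[Responsibility]) -> list[str]:
    """Return statements that have no concrete example attached."""
    return [item.statement for item in items if not item.examples]


# Illustrative entries; the example wording is invented for this sketch.
contract = [
    Responsibility(
        area="Security & Compliance",
        owner="platform",
        statement="The platform team is responsible for cluster security",
    ),
    Responsibility(
        area="Security & Compliance",
        owner="platform",
        statement="The platform patches Kubernetes nodes",
        examples=["Node upgrades are rolled out by the platform team"],
    ),
    Responsibility(
        area="Security & Compliance",
        owner="team",
        statement="Teams scan their own containers for vulnerabilities",
        examples=["Each team adds the provided scanner to its pipeline"],
    ),
]
needs_work = too_vague(contract)
```

Here the broad security statement is the only one flagged, which is the prompt to break it down into specific, example-backed entries.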

Favour Automation & API Interactions

Automation and integration via APIs removes reliance on humans as well as minimises errors. Areas which benefit greatly include onboarding (getting new teams up and running quickly), as well as infrastructure provisioning. Depending on the nature of the teams and structure this will come in various forms.

If there is a high level of trust between teams, the boundary may simply be based on good principles. For instance, “As long as you use an Infrastructure-As-Code approach (for example using Terraform), go for it!”.

Many larger organisations however, tend to operate within firmer boundaries. For example requiring usage through a set of common configurable Terraform templates or modules.

For your leadership and governance community, things are a little different. Often underpinned by good data driven APIs, independence comes by being able to easily access the right information at the right time for decision making purposes. This typically comes in the form of up-to-date, self service dashboards.
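As a rough sketch of the kind of data-driven view a dashboard might sit on top of, the snippet below totals cost records per team and flags teams over budget. The record shape, team names, budgets and function names are all made up for illustration; real figures would come from whatever billing or metering API the platform exposes.

```python
from collections import defaultdict


def spend_by_team(cost_records):
    """Sum cost (in whole currency units) per team."""
    totals = defaultdict(int)
    for record in cost_records:
        totals[record["team"]] += record["cost"]
    return dict(totals)


def teams_over_budget(totals, budgets):
    """Return team names whose spend exceeds their budget, sorted by name.

    A team with no agreed budget is treated as having a budget of zero,
    so unexpected spend shows up rather than being silently ignored.
    """
    return sorted(
        team for team, total in totals.items()
        if total > budgets.get(team, 0)
    )


# Hypothetical monthly figures.
records = [
    {"team": "payments", "service": "compute", "cost": 1200},
    {"team": "payments", "service": "storage", "cost": 300},
    {"team": "search", "service": "compute", "cost": 400},
]
budgets = {"payments": 1000, "search": 500}

totals = spend_by_team(records)
flagged = teams_over_budget(totals, budgets)
```

The point is less the arithmetic than the access pattern: leadership can answer the question themselves, without raising a request to the platform team.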

Provide Good Documentation

It may be a lost art nowadays, but good documentation is really important to ensure people can help themselves. Going beyond textual documentation on wikis, this includes keeping up-to-date reference implementations (for example how to deploy a specific microservice stack), provided through Helm chart templates and so on. It also means promoting a good developer experience via good command line docs and usage info.

Remember to keep the reader’s cognitive load in mind. How much do they need to keep in memory to understand what they're reading? Aim for small chunks of information with good examples that don't require a lot of previous context.

2. Promote Freedom Over Autonomy

Platforms should aim to provide teams with as much freedom as possible, but within agreed boundaries so that everyone can play nicely with each other. Carefully considered boundaries give platform sponsors some chance of extracting value and a return from economies of scale. The alternative is a free-for-all, which just creates a headache for both management and the platform team. For example, you find teams developing in obscure languages that no one cares for, that cannot be recruited for, and that cannot be maintained in the long term.

Greek Lesson

The word Autonomy comes from two Greek roots, autos (meaning self) and nomos (meaning law or rule). To be autonomous can mean to be a “law unto oneself” or “self-rule”. Autonomy knows no boundaries and often manifests as a bit of a free-for-all.

Whilst some may disagree, I would argue that freedom is different. Freedom is often defined as being closer to the concept of liberty, and many argue it needs some form of boundaries to make any practical sense. So if we think of freedom in the context of choice instead, this implies the power to choose among alternatives rather than merely being completely unrestrained.

Establish Ground Rules

Establishing fundamental ground rules upfront is needed to make it crystal clear what the framework and context are for further decisions and options moving forward. This could be done through the platform contract where typical ground rules may include:

Cloud strategies, e.g.:
- AWS Only: Initially only US regions will be targeted. EU and UK will follow in 6 months. Any native services used must be available within all 3 regions.
- Multi Cloud: Deployments will target AWS, Azure and Google. Apps should be deployable across all Kubernetes-as-a-service offerings (EKS, AKS, GKE).
Technology risk appetite, e.g.:
- Only “stable” / GA or supported versions of technology X are allowed (as opposed to simply the latest and greatest).
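Ground rules like these are simple enough to express as checks. The following is a minimal sketch, assuming the “AWS Only” strategy above; the service names, their regional availability and the release stage labels are invented for illustration.

```python
# Target regions taken from the example "AWS Only" ground rule.
TARGET_REGIONS = {"US", "EU", "UK"}

# Hypothetical availability data; in practice this would be looked up
# from the cloud provider rather than hard coded.
SERVICE_REGIONS = {
    "managed-queue": {"US", "EU", "UK"},
    "new-ml-service": {"US"},
}

# Release stages treated as acceptable under the risk appetite rule.
ACCEPTED_STAGES = {"stable", "ga"}


def allowed_by_region_rule(service: str) -> bool:
    """A native service is allowed only if it exists in every target region."""
    return TARGET_REGIONS <= SERVICE_REGIONS.get(service, set())


def allowed_by_risk_appetite(release_stage: str) -> bool:
    """Only stable / GA releases pass; previews and betas do not."""
    return release_stage.lower() in ACCEPTED_STAGES
```

With this data, `allowed_by_region_rule("new-ml-service")` is false because the service is only available in the US, which is exactly the conversation the ground rule is meant to trigger before a team builds on it.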

Choice Over “Anything Goes”

Below are some examples of areas where it has proved beneficial to provide a limited, but evolvable set of choices for teams using the platform.

Technology Stacks

Teams get to choose between a few main tech stacks depending on their solution and problem space (eg Java/SpringBoot; Go; NodeJS) but not just anything. Platform teams provide support and value through templates and reference implementations which should increase their productivity.

Ecosystems

Teams get to choose from a variety of technologies and approaches, with the proviso that they are compatible with a specific ecosystem. For example, Infrastructure-as-Code and provisioning may have to go through the Terraform ecosystem. This in turn may have arisen to enable easier multi cloud maintenance (as per a multi or hybrid cloud strategy). Or perhaps all services must be deployable within a specific orchestrator or scheduler such as Kubernetes or Nomad.

Templated Pipelines

Predefined CI/CD pipeline templates (guard rails) may be established. There should be specific jump off points to allow inclusion of different steps, tools, and so on, even if the broad path and required outputs are set. This latter approach allows for the inclusion of common gates to ensure certain compliance checks can be met.
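The shape of such a template can be sketched in a few lines. This is an illustrative model only, not any particular CI system's syntax: the step names, the two jump off points (`after_build`, `after_test`) and the `render_pipeline` function are assumptions of mine.

```python
# Fixed path with two jump off points where teams may add their own steps.
# "security-scan" and "compliance-check" stand in for the common gates.
TEMPLATE = (
    "build",
    "after_build",       # jump off point
    "security-scan",
    "test",
    "after_test",        # jump off point
    "compliance-check",
    "deploy",
)


def render_pipeline(after_build=(), after_test=()):
    """Expand the template, inserting team steps at the jump off points."""
    extras = {"after_build": list(after_build), "after_test": list(after_test)}
    reserved = {step for step in TEMPLATE if step not in extras}
    for hook_steps in extras.values():
        clash = reserved.intersection(hook_steps)
        if clash:
            # Teams may add steps, but not redefine the platform's own.
            raise ValueError(f"reserved step names used: {sorted(clash)}")
    steps = []
    for step in TEMPLATE:
        # A jump off point expands to the team's steps (possibly none);
        # every other entry is kept exactly as the template defines it.
        steps.extend(extras.get(step, [step]))
    return steps


pipeline = render_pipeline(after_test=["load-test"])
```

Because the gates are part of the template rather than of the team's additions, they are present in every rendered pipeline however the jump off points are used.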

3. Be A Role Model & Walk The Talk

People are far more complicated than technology. Putting yourself in their shoes, as well as modelling the behaviour and approaches you would like and expect will go a long way towards achieving overall success.

Eat Your Own Dogfood

Where possible, try to use the tools you’re building for your own work. Being exposed to your own self imposed constraints is a surefire way to get problems unblocked. If not through tooling, you can still do this by continuously exercising the “reference examples” or templates provided for teams. “Exemplar tenants” can be set up to exercise the platform regularly, ensuring it is always up to date and that the experience of the teams using it is as smooth as it can be.

Offer “Professional Services”

Unlike setups at external product organisations, there is no equivalent of AWS professional services or some 3rd party service to call on. You and your platform engineering team are the professional services for your respective client teams. Techniques such as workshops as well as embedding platform team members into client teams should be part and parcel of how you operate to make teams succeed and work better.

Consider Dedicated Platform Evangelism

Finally, you may want to consider dedicating specific time and effort to evangelising and advocating for the platform. Not to peddle the latest platform snake oil, but rather to explain and promote the benefits and good practices of the platform to different teams. Principal engineers often find themselves doubling up in this capacity. Practically, knowledge can be promoted through share and learn sessions, and by proactively visiting teams or leads to build up an understanding of what their latest needs are.

4. Respect & Recognise Community Differences

Teams using the platform will have a wide variety of skills, maturity and experience. This too will change over time as people come and go, and requirements change. Getting the best out of teams will require flexibility tailored around situational context and awareness. This often comes in the form of evolving team structures and ways of interacting.

Adapt Ways of Working for Different Communities

Whilst there are many ways to look at this problem, Matthew Skelton and Manuel Pais have done a great job in their book Team Topologies. Without going into too much detail, they talk about four main team types (which include a “platform” team) and three main ways in which teams can interact, namely “collaboration”, “X-as-a-service” and “facilitating”. The argument is that if you want to promote “fast flow” (e.g. being able to quickly respond to customer needs and issues) you will need to mix and match between different team structures and interaction modes at different stages along your journey.

Interaction wise, as a platform team, I would argue your de facto interaction mode with other teams should be “X-as-a-Service” as this promotes independence and self service. However, there will be times when platform team members will need to collaborate or embed themselves into less experienced teams for a while, to help get them going in more of a collaborative or facilitating role.

Structure wise, we’ve seen cases where single platform teams needed to be split up to better align with different platform service offerings. For one client, a dedicated platform team was created to deal with machine learning supporting services and the specific requirements of the MLOps community.

In all these cases, there is a movement of people, and a change of process or interaction methods, which needs to be understood, expected and allowed to happen. Provided we have executive buy-in, our sponsors should be able to help facilitate this.

Summary

Assuming you are able to start with a good foundation, i.e. you have executive buy-in and enough technical expertise to progress safely and wisely, we observe that successful platform engineering initiatives start by adjusting their thinking to centre around people and communities, and their experience consuming the platform. Whilst the use of good technology is key, it is not the primary driver of success!

About the Author

Nicki Watt is the Chief Technology Officer at OpenCredo, a pragmatic hands-on software consultancy with specialisms in data engineering, ML & cloud native solutions. Her technical career has seen her wear many hats, from Engineer and Systems & Technical Architect to Consultant and CTO. A techie at heart, her current focus is on graph and data platforms as well as ensuring the successful delivery of large scale platform and cloud development projects. Nicki can be found speaking at various conferences and is also co-author of the graph database book Neo4j in Action.
