
Camille Fournier on Effectively Managing Internal Platform Teams


Camille Fournier, managing director and head of platform engineering at Two Sigma, recently shared lessons from managing internal platform engineering teams. Two key challenges she highlighted are the small size of the customer base and the difficulty of understanding how customers will use the product. She also stressed the importance of ensuring that development work is aligned with the best interests of the product and the end user.

As Fournier noted, "great platform teams can tell a story about what they have built, what they are building, and why these products make the overall engineering team more effective." She stressed the importance of using a metrics-driven strategy to ensure that the team remains "customer-focused and strategic about your platform offerings".

InfoQ sat down with Fournier to discuss her learnings and approach in more detail.

InfoQ: You mention that a metrics-driven strategy is hard to apply when the customer base is small. However, you also note that ignoring these metrics can be dangerous. What balance should a platform team be looking to strike here?

Camille Fournier: You may not be able to use metrics to easily drive your next platform decision. It's hard to run an A/B test on a small user base, for example, and some of the classic data-driven approaches that consumer application product managers employ are much harder to use.

However, that doesn't mean that you should ignore metrics completely. Some of the metrics you may use to drive product decisions are going to be around things like "which are the slowest commonly-called APIs for this system?" Metrics can also help you identify the biggest users of your system, which provides a place to start conversations with those users about what is going well and what isn't. Conversely, this data also allows you to find teams who aren't using your offerings and ask them why they aren't!

I also encourage platform engineers to instrument their systems so that they can track usage metrics. That is a common problem I see in using metrics to drive platform decisions: we don't often track the internal usage of our systems, particularly when the product we are delivering is a library or a framework. Try to instrument your systems early, not just for the operational and performance metrics you need to operate the system effectively, but also for usage metrics and data about how people are using the product, so that you can make better product decisions.
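As an illustration of the usage instrumentation described above (this is not code from the interview), the following minimal Python sketch counts calls to a library function per calling team, then uses those counts to find the biggest users and the teams that have not adopted the feature. The names `track_usage`, `usage_counts`, `cache_get`, `calls_by_team`, and `non_adopters` are hypothetical, as is the convention of passing a `team` keyword argument. A real platform would more likely send these counts to a metrics backend than hold them in memory.

```python
import functools
from collections import Counter

# Hypothetical in-process store keyed by (feature, team).
# Assumption: a real system would export these counts to a metrics backend.
usage_counts = Counter()


def track_usage(feature):
    """Wrap a function so every call is counted per (feature, team)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, team="unknown", **kwargs):
            # The caller identifies itself through the `team` keyword,
            # which is consumed here and not passed to the wrapped function.
            usage_counts[(feature, team)] += 1
            return func(*args, **kwargs)
        return wrapper
    return decorator


@track_usage("cache.get")
def cache_get(key):
    """Placeholder for a platform library call; always reports a miss."""
    return None


def calls_by_team(feature):
    """Return (team, call_count) pairs for a feature, biggest users first."""
    totals = Counter()
    for (name, team), count in usage_counts.items():
        if name == feature:
            totals[team] += count
    return totals.most_common()


def non_adopters(feature, all_teams):
    """Return the teams from all_teams that never called the feature."""
    seen = {team for (name, team) in usage_counts if name == feature}
    return sorted(set(all_teams) - seen)
```

Calling `cache_get("user:1", team="payments")` records one use for the payments team. `calls_by_team("cache.get")` then gives a starting list for conversations with the heaviest users, and `non_adopters("cache.get", [...])` lists the teams worth asking why they are not using the offering.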

InfoQ: Can you elaborate on what you mean by "software that is building to be built"?

Fournier: Engineers tend to enjoy writing code. And sometimes when they see a problem, the solution they always jump to is that the problem needs to have code written to solve it. So you see companies that have homegrown metrics and monitoring systems instead of using a SaaS vendor, because the engineers decided that vendors aren't worth paying for and they could easily meet the unique needs of the company themselves. Or you see engineers building a new web framework because they don't love any of the off-the-shelf open source options. You get the impression that no one asked the question: why is this the most important thing for us to be working on right now? Is it really that much better to build this than to use an off-the-shelf tool, or do some customization to make an off-the-shelf tool work?

A version of this is that sometimes an engineer has an idea for a product that could be cool. They think, "we could store this data in a custom caching system and it would be incredibly fast and useful!" And so they start building, and the system is always one feature away from being widely useful. Maybe it has one team using it but only a little bit, and they aren't actually that pleased with the offering. But still we're investing a lot of work into building, running, and supporting the system. That to me is a sign of software that we're building to be built. We love the idea, but we haven't figured out how to make the idea useful, and we won't admit that it's a sunk cost to walk away from, so we keep building.

InfoQ: When taking over pre-built systems, how do you ensure the team is focused on building on the system and not being distracted by "decisions that they don’t agree with"?

Fournier: I recently had a conversation with an engineer who is running a large legacy system. We've been trying to figure out how to extend this system to do a couple of new things for a while, and it's hard. He was talking about going back and forth between wanting to just get rid of the old system and replace it, knowing that would be a really expensive migration to manage, and wanting to fix the system. So I asked him, what percentage of the original system is good enough? Is it 20%? 50%? 80%? Is the old system mostly good, or mostly bad?

I think that we have to look at our systems this way, whether they are our own legacy systems or someone else's systems that we are taking over with hopes of expanding. No system is perfect. If you end up obsessed with the annoying parts of the system, it is tempting to rewrite too much of it, and you may find yourself making no forward progress on creating a system that is useful for other teams beyond the team that you took the system from. Instead you have to focus on the parts that will break at scale, or the parts that are really preventing you from adding new features, and tackle those first.

When you have broadened adoption, and added new features, and gotten really comfortable working with the system, you may start to appreciate the decisions you didn't agree with, or you may realize that those decisions don't get in the way of the overall effectiveness. Or, rarely, you will end up fixing them. But try not to just rewrite parts of a system for pure aesthetics, or because you don't love the language or framework the original team used.

InfoQ: You encourage taking an account of the existing ecosystem and culture. In many companies, these are complex and diverse concepts. Do you have any suggestions on how to go about building this account?

Fournier: The challenge with ecosystem and culture mostly comes in when you are looking at software and practices that other companies have adopted, which you want to bring into your own company. Google tools are a great example of a place where platform engineers tend to stumble. Google is a great company, with really amazing internal tools and infrastructure, and they've open sourced a lot of these tools. But Googlers don't just use one of these tools in isolation. They use them within the ecosystem of a bunch of internal products and within the culture of Google.

Google's build and test tools are a great example. Bazel is their open-sourced build system. Google has done a bunch of work to make building software within their monorepo really fast. But inside of Google, Bazel doesn't operate on its own. It operates within an ecosystem of other tooling, and it is used within a culture that has certain expectations about what developers do when they write code, what they are expected to fix or own themselves, etc. It operates within a company that has invested a ton of money, and a lot of talented people, into making the developer experience amazing. And so a lot of people have tried to take Bazel and use it at their own company, with the assumption that Bazel is a crucial piece of the magic that will make their internal developer tooling stack more like Google, and therefore their developer experience will be more Googley. But the tool alone is not enough to get that experience.

Within your own company, you have to be realistic. If almost every team uses Java, and you want to get people to start using Python, you are going to be in for a much harder road than if you find or build an offering that uses Java. That doesn't mean you can't make the Python offering work, but you're going to have a lot more work to do. If your company culture is that people do not touch code other than their own code, it's going to be hard to scale a model that expects people who make major API changes to implement them across all the code and systems themselves.

InfoQ: You stress that teams should "only build when you have exhausted the alternatives". What suggestions do you have for helping a team of platform engineers not jump into building as their initial solution?

Fournier: Start by appreciating that you can get more done building on top of other people's solutions to create a true portfolio and platform, rather than individual tools. It is hard to build a monitoring system, and a log aggregator, and a good visualization dashboard, and a good CI/CD system, and a good compute orchestrator. But if you can take these products from other people or open source, and get them to work together seamlessly for all of the developers at your company, you have now built a very powerful platform. And you're going to write some code to do that, but it may not be in the obvious places. You may need to build your own orchestration logic to actually get all of this to work together. You may need to build a Kubernetes plugin because it lacks something that does what you need. You may need to build a more efficient logging library for the language of your company. I don't know, but I bet that in the process of bringing together a suite of cloud-based, SaaS, and open source tools to create a productive platform for the developers at your company, you're going to find plenty of code that needs to be written. And focusing on how all of these products work together well is where you can deliver outsized product value for your customers.
