Building a Successful Platform: Acceleration, Autonomy & Accountability



Smruti Patel discusses successful platform adoption. She explores topics including failed platform-building efforts, the three pillars of a successful platform, and how to bake acceleration, autonomy, and accountability into a platform.


Smruti Patel is the VP of Engineering at Apollo Graph, the leading GraphQL platform for building highly performant APIs at scale for rapid digital transformation. She has led and scaled high-performing engineering teams at Stripe and VMware, building critical infrastructure for global businesses. Her interests include mentoring and coaching, hiking with her boys, and traveling the world.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Patel: Successful platforms. What do these really look like: unicorns and rainbows? I wish. A Cost Platform which didn't make our business efficient. A Service Delivery Platform which didn't make our developers productive. An Async Processing Platform, which didn't make our ecosystem sustainable. I'm Smruti Patel. I've had the privilege to build and deliver platforms over the last two decades. What I just shared are but three examples. Three examples of what sound like unsuccessful platforms. They even felt like massive failures back in the day. These are ones that we turned around. We turned them around to build an efficient business, productive developers, and a healthy sustainable ecosystem. How do we do that?

Before I talk about that, a quick short segue into my first introduction to platforms. I grew up in Mumbai, a city that's known for its pace and hustle. My fondest memories are of waking up early, taking the 6:00 morning train from Ghatkopar, where I lived, to Dadar for my math and science classes. Daily, hundreds of thousands of commuters would use this massively critical infrastructure. As you can see here, unfortunately, platforms were either not wide enough, tall enough, or secure enough to meet everybody's needs. I too had a dream of traveling the world and building and delivering platforms, which are just inherently so delightful to get me where I wanted to be, when I wanted to be, on time, safe and sound, and with my bag and baggage.

Coming back to my second passion, that's engineering. As an engineering manager, a manager of managers, and now an executive leader, I've delivered a wide range of platforms. Whether it was disaster recovery, or data protection, which was an enterprise platform at VMware, to solving for cloud cost efficiency or data needs for mission critical transaction processing at Stripe, to now building GraphOS at Apollo Graph, which is a platform for the platform developers of our world. Through these experiences, I've made many mistakes. I've learned some valuable lessons. I'll share with you three journeys. For each, I'll cover where we got stuck, how we unblocked ourselves, and the key memorabilia that we collected for life. I'll first talk about the Cost Platform, or Solis, where we learned how to drive purpose and create value for our business. I'll then talk about the Service Delivery Platform, or ShuttleCrawler, where we adapted our mindset and focused on driving velocity and autonomy for our users. Lastly, I'll talk about the Async Processing Platform, which is called Monster. It's true to its theme, and I'll tell you why, where we baked in a foundation of trust and accountability, both in the ecosystem, but also in the platform itself. These journeys have basically empowered me to build richer, more empathetic, and sustainable platforms. That's what we're now applying to building GraphOS at Apollo.

Cost Platform, a.k.a. Solis

Welcome to the journey of Cost Platform or Solis. It was 2018, I was back from my trip from Amsterdam. I was looking for a new challenge. I'd heard a lot about Stripe. I'd read about its inspiring leadership. I knew that they had some exciting problems to solve. I joined to lead a team called Prosperity. Good old days. A few months in, I had a one-on-one with John Collison, who said, "Smruti, we're spending a lot on our cloud costs. Stripe is a low margin business as you can imagine. What is your mental model? What's your mental model if we'd made every single decision just right over the last decade of Stripe?" I was mind blown. Like, what a question? I instantly got nerd-sniped, did some statistical number crunching, some theoretical modeling, sat down with the team, and we came up with this thesis that we were spending a lot, because we didn't know what we were spending on. We started with building cloud observability. We said, 100% of our cloud costs would be owned by our users, and 100% of our users would see every single penny they were spending on. We embarked on building the Cloud Cost Platform, and because we wanted to shed light on observability, we called it Solis after the sun.

A few months in, we were so wrong. We'd attributed every single dollar of our cloud costs. We'd built out this shiny, nice platform, and a UI where a user could drill down into their cost, see what they spent on compute, on their big data infrastructure. Even their S3 buckets, we had petabytes of unknown stuff. They could see all of that. As you can imagine, Stripe didn't magically become efficient. Our platform didn't magically drive impact. The mistake we made was assuming that the platform and its existence itself was the outcome. We thought that we'd build it and the results would just follow. Our North Star here was not observability into cloud costs, but our North Star should have been making Stripe efficient and bending that curve of spend. Here the platform itself was not the point. Its existence was not the outcome. The value we actually had to create had to be anchored on some business outcomes. How do you create those business outcomes? First and foremost, starting with the why. Why did we truly need to build the platform: to build or not to build? Platforms take time, they take commitment, they take effort, and they have a high degree of opportunity cost. Engineering, as we know it, is the single biggest asset in any software driven company. The opportunity cost of going down a wrong path or fixating on a wrong problem can be very grave for the entire organization. Before you embark on this journey, you want to be very intentional about the why behind your platform.

A part of asking that why is finding out what's your North Star. What's your durable North Star, which will roughly be true for the next 5 or 10 years? In our case, cloud cost observability was not a 5-plus year problem. Also, at this stage, you don't have to get the specifics right just yet, but you have to know enough to know that the value maximization function for you is roughly going to be true for the next 5 years or so. For us, cloud cost observability was definitely a stepping stone, but it was not the outcome. We went back to the drawing board and asked, what problems were we looking to solve? What did success look like? What were the business KPIs? As we did that, we moved our metrics from cloud cost observability to defining cloud efficiency: the total spend on cloud infrastructure as a function of the total revenue at Stripe. As a compensating control, we also defined a unit metric to see if Stripe was getting more or less efficient over time, which was the cost per transaction. Once we had business KPIs, it was time to get our stakeholders aligned. For us, it was the CTO, the CFO, the founders. We all sat down, did a handshake, and asked ourselves, what was the efficiency target that we needed to set? Did we need in-year savings, or do we optimize for year-over-year savings? Also, remember that building a platform isn't naturally going to make us efficient. How do you then create time and space to take on the cost efficiency projects that do get identified? How does that align with the business and the product strategy? Once we'd got alignment all around, we then focused on finding ourselves an exec sponsor. This was someone who would advocate for the value of the work, the impact of the work, and then spread that message, because we wanted people to make the right decisions as well. Think of it as upfront product marketing: someone who's an ally, who's pitching on your side once the platform does get built out.
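The two KPIs described above can be sketched in a few lines. This is a minimal illustration, not Stripe's actual methodology; all figures and names are invented.

```python
# Hypothetical sketch of the two efficiency KPIs: cloud efficiency as
# spend over revenue, and cost per transaction as the unit metric.

def cloud_efficiency(total_cloud_spend: float, total_revenue: float) -> float:
    """Cloud spend as a fraction of revenue; lower is better."""
    return total_cloud_spend / total_revenue

def cost_per_transaction(total_cloud_spend: float, transactions: int) -> float:
    """Unit metric: is each transaction getting cheaper over time?"""
    return total_cloud_spend / transactions

# Compare two quarters to see whether the curve of spend is bending:
# absolute spend can grow while efficiency still improves.
q1 = cloud_efficiency(12_000_000, 400_000_000)
q2 = cloud_efficiency(13_000_000, 520_000_000)
assert q2 < q1
```

The point of the unit metric is exactly this compensating control: total spend alone can mislead in a growing business, while cost per transaction tracks whether the underlying systems are getting cheaper.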

Once you have a North Star, the second thing is leverage. Platforms have to be able to give you a leg up in some ways. For us, we didn't want all engineering teams to become 5% efficient, but we wanted those key 2 to 3 teams to become a whole lot more efficient. Our database teams, our data platform teams, our machine learning teams, they had to account for 60% to 80% of those efficiency targets. How do you then find those points of leverage across your engineering organization? This is where I found it very useful to think about the 3R's, reduce, reuse, recycle. When you think about platforms, a big leverage point is that you can abstract away a bunch of complexity. This complexity could either be domain shaped, like you don't want multiple teams to figure out the guts of your banking vendors, or your cash platform. Or it could also be technical or skill set specific, where you don't want many teams to have to think about the guts of Kubernetes, and service networking, and service mesh. What does reducing that complexity look like? In that, platform teams are those quintessential 10x team driving that systemic efficiency across the entire organization. The next is finding ways to reuse policies. What I mean by this is when you're building out a platform, think of an hourglass or a funnel, if you will, where it becomes that point where you can introduce or standardize your policies. For us, we wanted to make sure that 100% of our cloud resources would be owned. We used Solis to be the point where we had 100% ownership of everything, including those S3 buckets. This was the place where we could then even enforce some of those policies. Lastly, recycling code. We built out data primitives for data archival and retention. Then over time, we could reuse the same for different data regulatory needs, whether they were PII, GDPR, data locality. 
I found it very useful when thinking about leverage through your platform, think about what reduce, reuse, recycle looks like as you're building it out.
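As one illustration of the "hourglass" idea above, here is a minimal sketch of enforcing an ownership policy at the platform layer. The resource shapes and tag names are assumptions for the example, not Solis's actual schema.

```python
# Hypothetical cloud resources flowing through the platform "funnel",
# where a 100%-ownership policy can be checked in one place.
RESOURCES = [
    {"id": "i-0abc", "type": "ec2", "tags": {"owner": "payments-infra"}},
    {"id": "bucket-logs", "type": "s3", "tags": {}},
]

def unowned(resources):
    """Return every resource that violates the 100%-ownership policy."""
    return [r["id"] for r in resources if not r["tags"].get("owner")]

# A real platform might block provisioning or open a ticket here; this
# sketch just surfaces violations so spend can be fully attributed.
print(unowned(RESOURCES))  # ['bucket-logs']
```

Because every resource passes through this one chokepoint, the same hook can later enforce other policies, such as the data retention primitives mentioned above.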

The last is coverage. If leverage is about being slightly tall, coverage is about being wide. Here's where you want to think about what use cases you want to support. You don't want your platform to be so narrow that it just feels point-to-point. For us, when we were building out Solis, we knew that after we'd solved observability of AWS vendor spend, we could also extend that to other SaaS vendors, whether it was logging, observability, machine learning, or even streaming. When you're thinking about coverage, the three things that come to mind are being opinionated, cohesive, and coherent. When I say opinionated, what I mean by this is, you don't want your platform to be too wide, where it's doing something for everyone, but not really doing anything for anyone. At the same time, you don't want it too narrow, where it feels like it's a tool, but not really a platform. What is the 80/20 Pareto style where your 80% of use cases are solved by 20% of your effort? You want to be very opinionated about which users and which use cases you're going to support. To do that, you might have to even be intentional and say no to some. For us, finance got very excited when they said, you can have budgets on a team-level basis. They wanted to introduce travel spending and offsite spending. I was like, "No, thank you." Solis was all about managing spend on our software systems and services. It's important, as you're thinking about how wide your platform is going to be, to think about whom you're going to say yes to, and whom you're going to say no to.

That brings me to the second bit, which is cohesive. Think about the different nouns and verbs and the building blocks and Legos that are going to make up your platform. You want these to cohesively fit with each other. For us, we had to build out secondary and even tertiary attribution of some of the managed platforms within Stripe. Whether it was your compute platform, or storage platform, or data platform, each had to be able to plug and play their own attribution methodology so that as an end user, as a team, you could drill down into each of these spend and see that aggregated spend view. Lastly, coherent. As you're building out all these pieces of your platform, your platform needs to harmoniously and logically evolve. For us, because we were thinking about the user, we said, if a user wants to know their spend, and then drill down as to why their month over month spend might have spiked, they might come down to some S3 bucket which didn't get read, or a Hadoop job which suddenly was bursty and cost a lot. That user should then be able to take some actionable insights. That's what we did. We first brought in observability, then we brought in predictability. The platform then coherently evolved with the evolving needs of our user. Opinionated, cohesive, and coherent are three things to keep in mind as we're building out the platform.
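The plug-and-play attribution described above can be sketched as a small interface that each managed platform implements, with the cost platform aggregating the results into a per-team spend view. All class and field names here are invented for illustration.

```python
from typing import Protocol

class Attributor(Protocol):
    """Each managed platform plugs in its own attribution methodology."""
    def attribute(self, total_cost: float) -> dict[str, float]: ...

class ComputeAttributor:
    """Example methodology: split compute cost by CPU-hours used."""
    def __init__(self, cpu_hours_by_team: dict[str, float]):
        self.usage = cpu_hours_by_team

    def attribute(self, total_cost: float) -> dict[str, float]:
        total = sum(self.usage.values())
        return {team: total_cost * h / total for team, h in self.usage.items()}

def aggregated_spend(platform_costs: dict[str, float],
                     attributors: dict[str, Attributor]) -> dict[str, float]:
    """The end-user view: per-platform drill-downs roll up per team."""
    view: dict[str, float] = {}
    for platform, cost in platform_costs.items():
        for team, share in attributors[platform].attribute(cost).items():
            view[team] = view.get(team, 0.0) + share
    return view
```

A storage or data platform would supply its own `Attributor` (by bytes stored, by query volume, and so on), while the aggregated view stays uniform for the end user drilling into their spend.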

We'd identified the why. We got opinionated about our leverage and coverage, and we got ourselves a lofty North Star. We'd identified and established our purpose, and that purpose was driving and accelerating value for our business. We used Solis ourselves. We dogfooded it to identify small to extra-large shaped engineering projects. That spectrum was pretty wide. We went from switch off these unused instances, all the way to migrating compute-heavy workloads onto Graviton. We saved Stripe tens of millions of dollars with compounded savings year-over-year. What was super rewarding was we'd actually shifted left the entire engineering culture, from being reactive and having to manage spend, all the way to thinking about what the cost of an initiative would look like at the very design stage. People leveraged this platform and our team to actually design out the multi-region and BCDR story for Stripe. The lesson we learned was that building the platform was not the outcome. It was not an end in and of itself. The value and the purpose of the platform had to be anchored on creating that value, and accelerating it for our business. That is story one, about Solis.

Service Delivery Platform, a.k.a. ShuttleCrawler

Coming on to the second story, which is about our Service Delivery Platform: this journey is also called ShuttleCrawler. We had a monolith and a monorepo with some gnarly dependencies, because we'd accrued years of tech debt. In our dev prod surveys, we would routinely have users saying, I don't know what's running in production. We also saw how this would then manifest in late-breaking changes in production. I said, let's go build a Service Delivery Platform. We called it ShuttleCrawler. We had our why. We had a lofty North Star, where we said we're going to make our developers a whole lot more productive. We will reduce lead times by over 50%. In doing so, we would abstract away a lot of those complexities that just inherently come with service deploys, whether it's the guts of Kubernetes, scheduling, service networking, service mesh, all this stuff. Because we didn't have enough skill set in house to actually debug and diagnose some of those issues. We asked ourselves, what would the platform look like? What should it actually offer to our users? If the why is about creating value for the business, the what is all about driving velocity for your users.

Thinking about users first and driving that velocity, the thing to keep in mind is how you drive a product mindset in building a platform. This was very new to me, because I'd always been an infrastructure engineer focusing on platforms for internal developers. This is where I found it very useful to think about the double diamond framework, where the first diamond covers product discovery and problem definition. It's a phase of divergent thinking and ideation, where you can think about a problem either widely or deeply. The second diamond is all about solutioning. It allows for that action-oriented, focused, convergent thinking into developing and delivering the solution. Applying that product mindset requires you to first ask, who's my user? Who's my customer? Who are the user cohorts that we're going to support? Here I found it very valuable, time and time again, to focus on 10 super delighted users, versus 1000 who are partially satisfied with what we have to offer. We sampled all the cohorts through engineering. We talked to frontend engineers. We talked to full stack engineers, infra engineers, product engineers. We also found a team who was bringing a new product to market, going through their product-market fit, and needing rapid iteration. We said, we'll keep you out of scope for the first iteration. Once we knew whom we were talking to, whom we were focusing on, the next stage was all about product discovery. We had deep meetings with some of these users. We were really able to hone in on what problems they were facing. As we had those deep conversations on user understanding and user research, we found that we could check our own assumptions. We found that none of the developers wanted domain-based namespaces in the Kubernetes cluster. It was also going to add a lot of risk and scope to the project. We said, we're going to rescope even that.
What was really interesting is, as we had these user conversations, we actually found out if there is true demand for what we're going to offer in the v0. We found some users whose eyes literally lit up. They actually signed up to say, if you ship this today, I would use it. We bookmarked those users as our potential alpha users. Once we'd done the product discovery, the next step for us was honing that problem definition, finding out what were the true tradeoffs, what were the challenges, what were the constraints? For this, we built out early mocks, early prototypes. As we did a roadshow with that, we found certain issues. We found that the frontend engineering cohort, which was quite a big popular cohort, was having some difficulty with the CLI based deploy solution. We then said, let's add something more, let's add a nice UI based solution to the scope as well, to make the whole experience super delightful.

Once you're done with your first diamond of product discovery and problem definition, coming to the second diamond, which is all about honing your solution. This is where it's all about, how do you make your users feel awesome? I love this book by Kathy Sierra, which is, "Badass: Making Users Awesome." The key message in the book is, your users are awesome when they are getting better results. It uses this analogy to say, if you're using your camera, and you're trying to perfectly catch that moment in time of a special someone doing a special something in a special place, you're not really anchoring so much on the focal length, or the aperture, or the zoom factor of your lens. Don't really focus on building a better camera, but focus on making your user a better photographer. That just blows my mind. I'd ask myself, what does it take to make our users feel awesome, and have them get better results? A big part of that, that I've found, is meeting the users where they are. If you've done your user analysis really correctly, and you understand your persona skill set, and their strengths, then using the platform should feel like a Swiss knife, where you've got different tools in your toolkit. Using the platform should not feel like eating a mushroom. What do I mean by this? You eat the right one, you get high. You eat the wrong one, you could die. You use it just right, lots of flavor in your food. You don't want your platform to feel like you're eating a mushroom. It has to feel like that Swiss knife. Think about, what are the developer workflows? What is the toolkit that you can also offer, in addition to the platform itself? Here's where, as platforms abstract away complexity on one end, the flip side of it is thinking about, what are those paved paths? What are those gardened paths, where a developer is definitely going to find it easy: easy to do what is right, and hard to do what is wrong? 
If you've truly built a platform in that way, where your user finds it easy to do what's right, they will continue to find themselves in those peaks of success, where they're delighted, they feel awesome, and they're getting great results. Versus in valleys of doom, where they're lost, confused, and demotivated. Identify what those workflows are. What are those gardened, paved paths? What are those guardrails that you want to offer in the platform itself?

That brings me to the last bit, which is progressive disclosure. What I mean by this is, it shouldn't feel like Everything Everywhere All at Once, as much as I love Michelle Yeoh. Think about it as layers. Remember that platforms are for developers, and as you're abstracting away some complexity, you will always have some developer who is curious enough to peel the onion, look under the hood, and tinker around to suit their own unique needs. Think about what are those programmable and composable building blocks that you can offer. For us, we had abstracted away bin packing for cloud cost efficiency. We had some service owners with very bursty workload characteristics, who were really curious about some of those heuristics and wanted to fine-tune them for themselves. We whitelisted a select set of engineers for whom we opened up that abstraction and let them tinker around as they needed to, to suit their own needs. We built it out: we had a lofty North Star, brought a product mindset to building the platform, and focused on delighting our users. A year in, we had less than 5% of production traffic on ShuttleCrawler. Not only that, we now had the dual cost of maintaining two delivery platforms. The new one was not quite at feature parity, because we had correctly been very methodical about its scope, and the old one still continued to bleed developer productivity. Net-net, we were worse off than where we'd begun 18 months ago. Why was that? The mistake we had made was leaving the adoption of the platform to a "build it and they'll come" mindset. That was inherently wrong. It was not enough for us to just build a delightful platform. It was very important for us to also be super intentional about how our platform would get adopted. What was the organizational investment that we were expecting these teams to make? Did they carve out that time and space? How did this fit into their grand priorities and their own roadmaps?
Our work was not really done until we thought about what adoption of our platform looked like.

We thought long and hard about what making adoption easy looks like. Here's where we built out a detailed migration strategy. We said, what was the off-ramp from the old platform, and what was the on-ramp onto the new platform? As we thought about that, we realized we had to build out an A/B testing tool, which originally wasn't in scope, so that our developers could dial up traffic from the old to the new in increments, and make sure that there was roughly zero downtime as far as the migration went. Once we'd done that, we chose to deeply lead some three to five migrations. We had a choice of whom to go after: do we go after the low-hanging fruit, the stateless workloads, or do we go after some of those juggernaut services? We saw that, because we hadn't dogfooded the platform ourselves, it was important for us to figure out what that experience truly looked like. Some of our senior staff ICs and our technical program manager sat down and identified those teams. Remember those alpha users who were super excited, whose eyes lit up? We went back knocking on their doors to say, we've got a v0 here, would you be open to trying it out? We did a deep embed with those teams. Our ICs sat down with their teams. We migrated some of those services over. As we did that, we realized that we had to automate a lot of that tooling to reduce the cost to migrate services. Once we had migrated the 80/20 use cases, worked out the kinks of the platform itself, and built out feature parity, it was then time for us to open up the doors to the masses, now that those 80/20 needs were sorted out. There we then had to think about, what does self-serve onboarding at scale look like? Whether it's docs, whether it's processes, whether it's best practices for operational excellence. We also went on to roll out SLOs and SLAs, which previously just did not exist in the organization. Last, we went on a big roadshow.
We went to all-hands and check-ins, where we got our service developers who had migrated their services, who were reaping the benefits of the platform to talk about how successful they were in the time that it took for developing a feature and taking it all the way to prod. That was super useful.
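The incremental dial-up described above can be sketched as sticky, percentage-based routing between the two platforms. The platform names and hashing scheme here are assumptions for the example; the actual A/B tooling was internal.

```python
import zlib

def route(request_id: str, new_platform_pct: int) -> str:
    """Deterministically bucket a request so the dial-up is sticky per id."""
    bucket = zlib.crc32(request_id.encode()) % 100
    return "shuttlecrawler" if bucket < new_platform_pct else "legacy"

# Dial traffic up in increments (1% -> 10% -> 50% -> 100%), and roll
# back to 0% instantly if the new platform misbehaves: this is what
# makes a roughly zero-downtime migration possible.
assert route("svc-a/req-1", 0) == "legacy"
assert route("svc-a/req-1", 100) == "shuttlecrawler"
```

Using a stable hash rather than a random draw means the same request id always lands on the same platform at a given percentage, which keeps behavior reproducible while debugging the migration.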

About a year in, not 100% of our services were on ShuttleCrawler, but every net new service started onboarding directly on ShuttleCrawler. Our average lead times reduced not by 50%, but actually 65%. We saw far fewer late-breaking changes in production. Like I said, we were very successful in shifting left the culture of operational excellence, because we'd used the platform and leveraged it to drive the SLOs, SLAs, and general discipline that we wanted to see. The big lesson that we'd learned through all of this is moving from "build it and they'll come," to being very intentional about our objective, which now was, how do you lower that barrier to entry? How do you reduce the friction? Doing all of that and focusing on day-zero adoption enabled us to drive both autonomy and high velocity for our users.

Async Processing Platform, a.k.a. Monster

The last journey. If Solis was all about creating value for the business, and ShuttleCrawler was where we learned what done truly looked like, by being intentional about the adoption of the platform, the Async Processing Platform, or Monster, was our journey in learning what unsustained use of the platform could look like. What was this? It was essentially a messaging platform with remote task execution for a monolith written in Ruby. It was 2020, COVID had picked up. We saw unprecedented use in online commerce. Some of that demand was also showing up in the state of our systems. Deploy success rates to Monster fell all the way down to 20%. This was a platform used by over 1500 engineers who ship changes multiple times a day. The problem was not only impacting developer productivity, it was causing massive business impact. We routinely found ourselves in SEV0s, which were user facing. Everyone was rapidly losing trust in whether they should or shouldn't use Monster.

The big question we were now faced with was how. How would we support sustainable, safe, and healthy growth of our platform? How should the platform be used, and what did good look like here? If the why is about creating value, and the what is about velocity for users, the how was all about driving veracity: baking in trust, baking in accountability, both in the platform and in the ecosystem around it. It was very helpful for us to remember that the users of our platform have obligations to their own users. If users are choosing to use our platform, they're taking a bet on us, they're taking a leap of faith on us, our decisions, our intentions. There is this weird power dynamic at play, where the platform holds a certain power over its users. With great power comes great responsibility. A big part of being responsible is, how do you lead with trust? How do you lead with empathy? How do you lead with credibility? When I think about what good and great look like here, I go back to my love for trains and platforms. What you see here on the left is an above-ground track in Tokyo, which couldn't scale to the growing needs of the population around it. What they then needed to do was move this above-ground track to a below-ground subway line. Eighty thousand people worked on this project for 8 years. They were deeply aware of the trust that people put in this massively critical infrastructure. People had decided where to buy houses, where to go to work, where to send their kids to school. They took this trust quite seriously. They aimed for zero downtime, and that's what they got. Between midnight, which was the last train of the day, and 5 a.m., which was the first of the next day, some 1200 engineers worked in serious collaboration, and they were successful in moving this above-ground track to a below-ground subway line. That's what I think about, which is, how do you bake that trust and empathy into your very platform?
It involves asking, what's your backward compatibility story? What's your long-term support story? What's your deprecation story? If you're going to deprecate some APIs or some functionality. It's thinking about, how are you making pivots or doubling down on your core product or platform strategy? Then, how are you being thoughtful and intentional about communicating some of those changes to your users, where if you're making some massive roadmap changes, it is likely to impact them? How are you being thoughtful about communicating some of these changes to your users?

With that, it comes to the second point, which is, platforms cannot be shaky. Our platform's got to be stable. Operational excellence has to be baked in. It's not a nice-to-have; it's core table stakes. For us, the reliability of Monster was seriously at stake. This is true whether it's reliability, security, privacy, governance, or compliance. Think about what these fundamentals are, and how you can lead with trust, lead with stability, lead with credibility. For us, because reliability was so crucial, we decided to put a stop to all feature delivery for about a quarter. We had the choice: do we go after giving our users temporary relief, or do we take on a six-month massive lift-and-shift rewrite? Because our users were in tremendous pain and it was even causing business impact, the team did a methodical analysis of all the failures they were seeing. They found some core patterns. Data-driven decision making solved some of those, and we saw our deploy success rates creep up. We went from 20% to 99%, which felt good. It was far from ideal, but it had given our users some temporary relief. What was more important at that time was that it gave us enough of a runway to dig into what was truly going on. What was deeper? What was under the hood? As we dug into that, we found that most of those Monster tasks didn't have owners. They were not rate limited. They were not isolated. We had noisy neighbor issues. We were running into an extreme case of organizational debt, whether due to frequent reorgs or people having left the company. Also, in some cases, we saw priority inversion, where some of those teams just didn't care to solve those problems, given the other fires that they were putting out.

We realized that a big reason for this was that the platform had baked in the culture all around us, which was trust and amplify. There was no accountability baked into the very platform. We couldn't boot off some of these tasks, which were essentially like rotten mangoes, creating an unhealthy and unsavory experience for the broad majority. We thought long and hard about what baking in accountability looks like. We started with what was in our control at the systems level itself. We asked, what does preventing abuse and misuse look like? We went in and put in some of those basics: isolation, rate limiting, avoiding those noisy neighbor issues. We also built in graceful degradation for some of that. If you're building a platform for external developers, this is where thinking about DDoS protection and governance becomes super useful. Once we'd done that and solved some of the systemic problems, we then started enforcing invariants. This is where we went on to say, 100% ownership: every Monster task had to have an owner. Another invariant we set was that no task could use more than x compute and y memory. What we soon found was that this actually necessitated the task owner driving that task optimization upfront, as opposed to throwing it over the wall. Last, chargeback and accounting. We worked with our Cost Platform, Solis, to be able to account for the cost of executing each task. We billed those users and owners for what it would cost. Ultimately, through all of this, we had a much healthier use of our ecosystem and the platform. We moved away from that thesis of trust and amplify, such that at 10x load with 2x changes per day, we were able to not only get to 99.9% deploy success rates, we had almost eliminated most of our SEV0s, and we had a culture of discipline.
This only happened because we baked verification into the very platform, and led with a bit of trust, because we had cleaned up our act. Most importantly, this is what helped drive veracity in the entire ecosystem, with accountability to boot.
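The invariants described above, mandatory ownership, compute and memory caps, declared rate limits, plus chargeback accounting, can be sketched in a few lines. This is a hypothetical illustration, not Stripe's actual Monster code; all names, limits, and rates are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical platform-wide caps; illustrative numbers only.
MAX_COMPUTE_CORES = 4
MAX_MEMORY_GB = 8

@dataclass
class TaskSpec:
    name: str
    owner: str              # owning team; empty string means unowned
    compute_cores: int
    memory_gb: int
    rate_limit_per_sec: int

def validate_task(spec: TaskSpec) -> list[str]:
    """Return invariant violations; an empty list means the task may register."""
    violations = []
    if not spec.owner:
        violations.append("every task must have an owner")
    if spec.compute_cores > MAX_COMPUTE_CORES:
        violations.append(f"compute {spec.compute_cores} exceeds cap {MAX_COMPUTE_CORES}")
    if spec.memory_gb > MAX_MEMORY_GB:
        violations.append(f"memory {spec.memory_gb}GB exceeds cap {MAX_MEMORY_GB}GB")
    if spec.rate_limit_per_sec <= 0:
        violations.append("task must declare a positive rate limit")
    return violations

def run_cost_dollars(cpu_core_hours: float, gb_hours: float,
                     cpu_rate: float = 0.04, mem_rate: float = 0.005) -> float:
    """Chargeback: bill the owning team for the resources a task run consumed."""
    return cpu_core_hours * cpu_rate + gb_hours * mem_rate
```

The point of enforcing invariants at registration time, rather than auditing after the fact, is that an unowned or oversized task is rejected before it can degrade the ecosystem for everyone else.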

Key Takeaways

Across these journeys, my key takeaways are: platform engineering is not just a technical problem to solve. It's about systems of systems. It's about being intentional about the why of its existence, to the people who build it and operate it, but also to the ecosystem that leverages and consumes it. That makes it messy, but that is what is beautiful about platform engineering: it evolves organically with its evolving environment. In that sense, the platform is not an end in itself. It is a journey toward that ideal. Our role in that journey is to have our feet planted on the ground, being careful and acknowledging what today looks like, but also having our eyes to the skies, aiming for that ideal. That can be super hard. It is a hard investment. For that, it is important to remember: how can we be adaptable? How can we be flexible? Then bake in the right feedback loops, so you can constantly suss out what your delta to that North Star looks like.

Apollo's GraphOS

These are some of the lessons and guiding principles that we're bringing to building GraphOS at Apollo, which is an API platform for microservices.

GraphOS today has a very composable architecture, which is built for reuse for the supergraph. It's got an infrastructure that's scalable, secure, governed, highly reliable, and performant, to be able to operate and scale your supergraph. Lastly, we've also built in a very delightful developer toolkit and workflows where you can drive efficiency and productivity for teams and teams of people, not only those who are publishing to the graph, but also those who are consuming from the graph. In a sense, GraphOS provides everything you need to build that API for your modern stack.


Platform engineering is a journey. It is possible to build successful platforms. The guiding principles that I've found valuable are value, velocity, and veracity: drive acceleration for the business, have a lofty North Star, and be opinionated about your leverage and coverage. Make your users awesome and autonomous by making sure you've got that focus on day-zero adoption, not just on a delightful user experience. Lastly, foster a healthy, sustainable ecosystem where you're leading with trust, but also making sure that you're building accountability into your very platform.

Chronological Timeline to Solis and Monster

Patel: Was there any chronological timeline and sequencing to Solis and Monster?

Yes, Monster existed before Solis. Monster existed for about 7-plus years before Solis showed up. Then the journeys intertwined a bit and continued in parallel.

Questions and Answers

Participant 1: Did you see any of the approaches you took, did the development and evolution of the platform change over time?

Patel: Yes, absolutely. For both, actually. For Solis, like I said, we'd started off thinking that we'd give observability and everything would get better, but it didn't. We had to use that platform ourselves, identify what work we'd have to do, and make the time and space to take on some of those engineering projects. We realized that the platform itself needed additional features. We needed budgets. For us, it was like Maslow's hierarchy of needs: observability first, then predictability, accountability, and then automation. That's how we rolled out the entire efficiency strategy. The platform then had to evolve as we went through that hierarchy of needs. That's about Solis. For Monster, we saw a similar improvement that the platform needed to make over time, where a big part of the mess we uncovered was just hundreds and thousands of tasks, which were just sitting there, unknown. We had to be surgical, because these tasks were actually executing business logic for Stripe, processing volume. We had to be very thoughtful about which ones we did boot off the platform. At steady state, after two years in, we realized that the platform was truly not scaling to the needs, and we started introducing a new ground-up framework and platform called Event Bus, which was a more opinionated and thoughtful platform versus this kludged Monster, because it just wasn't scaling.

Participant 2: You talked about the adoption of your platform and how you got it through [inaudible 00:43:06]. My team is currently at that stage, trying to drive adoption from an old platform to a new platform. We're currently working on what we call door-to-doors, which is one-on-one time with [inaudible 00:43:22] teams, where we hold their hand in the beginning, show them through the platform, and show them the documentation on how to migrate. How do we continue to drive that adoption and the migration [inaudible 00:43:40], that transition, and make sure that your platform is successful?

Patel: It definitely wasn't as hunky-dory as what I had to fit into 10 minutes. We continued to sit in with those teams, not just the alpha and beta. We started off with the stateless workloads and things which felt easy, because we wanted to work out what automation was needed and what kinks in the platform we had to work out. The deepest, longest ones were getting some of those stateful workloads over. In that, we continued to sit in with them; we had a long embed. We'd rotate the engineers who would sit in, because that would be the feedback loop on what to improve. Until we had good coverage of both stateful and stateless, we continued to do small but sure embeds, until we got to the state where we said, this is good enough: we've solved most of the use cases that we officially want to support. Until then, we continued the embeds every quarter. That's also why we worked with the TPM, not just our staff ICs, to say who would be the next wave that we would go sit with, and grease the wheels there.




Recorded at:

Jun 14, 2024