Key Takeaways
- Reliability and ergonomics are not opposing trade-offs: A platform with poor ergonomics is inherently unreliable because it invites the human errors it was trying to prevent.
- When the same workaround appears across multiple teams, it is a signal to absorb the pattern into the platform as a safe default.
- A control plane that continuously reconciles actual and desired state, handling placement, self-healing, and rebalancing, turns reliability from a function of operator response time into a function of code logic.
- Observability should form a linked hierarchy ("something is broken" leads to "where" leads to "why"), paired with declarative, idempotent tooling that helps an engineer on their first on-call rotation resolve incidents as effectively as a ten-year veteran.
- Automated reliability, developer ergonomics, and operator ergonomics form a virtuous cycle: ergonomic SDKs produce predictable traffic, predictable traffic reduces operator burden, and unburdened operators enable further platform improvements.
We are in the era of Internal Developer Platforms (IDPs). The industry promise is seductive: Abstract away the "undifferentiated heavy lifting" of the cloud so product teams can focus entirely on shipping business value.
But as someone who has spent a decade in the trenches of infrastructure, I have seen this promise fall short. Usually, a platform hits a wall. It becomes a leaky abstraction where developers are forced to understand the underlying infrastructure anyway, or it becomes so rigid that it slows down the very teams it was intended to accelerate.
The common diagnosis is that we are struggling to balance a trade-off: reliability vs. ergonomics.
We assume that for a system to be "enterprise grade" (i.e., reliable), it must be complex, heavily guarded, and slow to change. Conversely, we assume that "developer friendly" (i.e., ergonomic) translates into removing the safety rails.
I'd like to share a different framework, one drawn from patterns I've seen work repeatedly across these systems. In my experience, reliability and ergonomics are not in opposition; they are a virtuous cycle.
A platform with poor ergonomics (interfaces that are confusing, manual, or "sharp") is inherently unreliable because it invites human error. To build a platform that scales, we must serve two distinct user groups through three interconnected pillars: automated reliability, developer ergonomics, and operator ergonomics.
Pillar 1: Reliability as Automated State Management
In a small-scale system, reliability is often reactive. An alert fires, a human logs in, and a human fixes the state. But at the scale of a global database or a massive caching fleet, operational heroism doesn't scale. Reliability must be treated as a managed state.
The Control Plane as a "Brain"
The most resilient systems I've worked on follow a strict separation between the data plane (the bit-shifters) and the control plane (the decision-maker).
Think of the control plane as a continuous control loop, much like a thermostat. Its job is to constantly reconcile the actual state with the desired state. While this list is not exhaustive, I've encountered some repeated use cases of such a control plane.
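Before diving into those use cases, here is a minimal sketch of such a reconciliation loop in Python. The helper callables, the action format, and the polling interval are illustrative assumptions, not a prescription:

```python
import time

def reconcile(desired: dict, actual: dict) -> list:
    """Compute the actions needed to move actual state toward desired state."""
    actions = []
    for partition, target_node in desired.items():
        current_node = actual.get(partition)
        if current_node != target_node:
            # The partition is missing or placed on the wrong node.
            actions.append(("move", partition, current_node, target_node))
    return actions

def control_loop(get_desired_state, get_actual_state, apply_action, interval_secs=30):
    """Thermostat-style loop: observe, diff, act, repeat."""
    while True:
        desired = get_desired_state()   # e.g. from a configuration store
        actual = get_actual_state()     # e.g. from node heartbeats
        for action in reconcile(desired, actual):
            apply_action(action)        # must be safe to retry (idempotent)
        time.sleep(interval_secs)
```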
Automated Placement and Rebalancing
In a distributed system, "hot spots" are inevitable. Whether it's a specific database partition or a cached leaderboard for millions of players, some nodes will work harder than others.
In the manual approach, an operator sees a high-CPU alert and initiates a shard split by hand. In the control plane approach, the brain observes traffic patterns, detects that a heat threshold has been crossed, provisions a new node, and moves the partition.
Note that we aren't trying to understate the significant complexity of the work behind "identifying the threshold" and "moving the partition". In the case of a durable database, the move often involves serializing the on-disk representation of your data, extracting it in a backward-compatible format, moving it across geographic locations bounded by network throughput, and deserializing the data and writing it back on the new machine. The critical insight is that the control plane should be able to orchestrate all of this movement without impacting the latency and availability profile of actual customer traffic on these nodes, which are often bound by CPU and network throughput.
These processes require the control plane to be idempotent. Because the network is unreliable, the command to "move partition X to node Y" might be sent twice or fail at the halfway point. The control plane must be able to recover from these partial failures without corrupting the system state.
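To make that requirement concrete, here is a hedged sketch of a restartable "move partition" operation. The cluster API (owner_of, copy_data, transfer_ownership, and so on) is assumed purely for illustration:

```python
def move_partition(partition_id: str, target_node: str, cluster) -> None:
    """Move a partition to target_node; safe to re-run after a partial failure."""
    # If a previous attempt already completed, this is a no-op.
    if cluster.owner_of(partition_id) == target_node:
        return

    # Copy data first; re-copying over an earlier partial copy is harmless
    # because the copy is verified before ownership changes.
    cluster.copy_data(partition_id, target_node)
    cluster.verify_copy(partition_id, target_node)

    # Flip ownership in a single atomic metadata update, so a crash leaves the
    # system either fully before or fully after the switch, never in between.
    cluster.transfer_ownership(partition_id, target_node)

    # Cleanup of the old replica can also be retried without harm.
    cluster.remove_stale_replica(partition_id, exclude=target_node)
```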
Fleet Health and Self-Healing
Hardware fails. Bit rot happens. A reliable platform doesn't page a human for a single node failure; it expects it. The control plane should constantly health-check storage and compute nodes. When a node stops responding to heartbeats, the "brain" should automatically:
- Cordon the node to stop new traffic.
- Re-replicate the missing data to maintain the desired replication factor.
- Decommission the old hardware.
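A minimal sketch of that sequence, again with the fleet and replication APIs assumed for illustration:

```python
import time

HEARTBEAT_TIMEOUT_SECS = 60

def heal_fleet(fleet, replicator, now=None):
    """One pass of the self-healing loop: detect dead nodes and repair around them."""
    now = now or time.time()
    for node in fleet.nodes():
        if now - node.last_heartbeat < HEARTBEAT_TIMEOUT_SECS:
            continue  # node is healthy

        # 1. Cordon: stop routing new traffic to the unresponsive node.
        fleet.cordon(node.id)

        # 2. Re-replicate: restore the desired replication factor for any
        #    partitions that had a replica on the dead node.
        for partition in fleet.partitions_on(node.id):
            replicator.restore_replication(partition, exclude_node=node.id)

        # 3. Decommission: return the hardware to the provisioning pool.
        fleet.decommission(node.id)
```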
Global Decision Making
A single-leader architecture for the control plane goes a long way, often serving for years if not more than a decade. A single leader gives the control plane a global view of the world, which enables simpler and more efficient decision making.
Consider rate limiting as an example. You might want your infrastructure to throttle or reject requests (also known as traffic shaping) from bad or noisy actors. Local rate limiting often falls short in such scenarios, because individual hosts do not have knowledge of what others in the fleet are doing. Another model is the single control plane leader communicating bidirectionally with all the stateless routers or proxies of your fleet, sending information about rate limiting and other metadata accordingly. This model removes the need to implement complex distributed coordination protocols or other mechanisms to achieve global rate limiting.
When the single leader does turn out to become a bottleneck, a common evolution of such architecture is to retain the single leader but distribute/offload its work to others in the fleet. This approach unlocks scalability while still having a single, global, and consistent view of the world.
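As a rough sketch of that model, the leader could aggregate per-client request counts reported by the routers each window and push back a simple allow/throttle decision. The data structures and threshold handling below are illustrative assumptions:

```python
from collections import defaultdict

class RateLimitLeader:
    """Aggregates usage reported by stateless routers and pushes global decisions."""

    def __init__(self, limit_per_window: int):
        self.limit = limit_per_window
        self.usage = defaultdict(int)  # client_id -> requests seen this window

    def report(self, router_counts: dict) -> None:
        """Routers periodically report {client_id: request_count} for the current window."""
        for client_id, count in router_counts.items():
            self.usage[client_id] += count

    def decisions(self) -> dict:
        """Computed once per window and pushed back to every router."""
        return {client_id: ("throttle" if total > self.limit else "allow")
                for client_id, total in self.usage.items()}

    def reset_window(self) -> None:
        self.usage.clear()
```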
Note: Leader election is not trivial without thinking about situations such as split brain. The literature on leader election is outside the scope of this article, but I recommend reading this article by Martin Kleppmann.
By automating these "lower-level" decisions, you ensure that the system's reliability is a function of your code's logic, not the speed at which an engineer can find their laptop at 3 AM.
Pillar 2: Developer Ergonomics
Automation beats documentation. Developer documentation is necessary, but it is an insufficient control for reliability. If you tell a developer, "Please don't use this API in a tight loop without a 100ms sleep," someone, somewhere, will miss that note.
Developer ergonomics is the art of making the "golden path" the path of least resistance. It is about shifting reliability "left", moving it from the runtime environment into the very tools the developer uses to write code.
The Opinionated SDK Pattern
The SDK is the true user interface of your platform, where your infrastructure meets the developer's business logic. A raw SDK is just a set of wrappers around API calls. An ergonomic SDK is a reliability engine.
Pattern-Based Abstractions
In any sufficiently mature platform, you start noticing something interesting: Your users are all solving the same problems, in slightly different ways, with slightly different bugs. They're implementing distributed locks, building rate limiters, or rolling their own leaderboard logic on top of your primitives. This is a signal. When you see the same pattern implemented a dozen times across a dozen teams, it's time to absorb that pattern into the platform.
Consider the locking use case. A distributed lock is surprisingly hard to do correctly; you need TTL heartbeats, fencing tokens, and reservation logic to handle the case where a lock holder dies mid-operation. Most teams underestimate this complexity, and the result is subtle race conditions that surface only under production load. The fix isn't better documentation about how to implement locks. The fix is a Lock client where the developer calls lock.acquire() and lock.release(), and the SDK handles the heartbeating, the fencing, and the reservation logic internally.
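A hedged sketch of what such a client might look like from the inside; the store API (try_acquire, refresh, release) and the timing values are assumptions standing in for a real coordination backend:

```python
import threading
import uuid

class Lock:
    """Distributed lock client: the SDK owns heartbeating and fencing internally."""

    def __init__(self, store, key: str, ttl_secs: int = 10):
        self.store, self.key, self.ttl = store, key, ttl_secs
        self.holder_id = uuid.uuid4().hex  # identifies this lock holder
        self._heartbeat = None

    def acquire(self) -> int:
        """Blocks until the lock is held; returns a monotonically increasing fencing token."""
        while True:
            token = self.store.try_acquire(self.key, self.holder_id, ttl=self.ttl)
            if token is not None:
                break
            # In practice: sleep with backoff instead of spinning.

        # A background heartbeat refreshes the TTL while we hold the lock, so the
        # lock frees itself automatically if this process dies mid-operation.
        self._schedule_heartbeat()
        return token

    def _schedule_heartbeat(self):
        self._heartbeat = threading.Timer(self.ttl / 2, self._refresh)
        self._heartbeat.daemon = True
        self._heartbeat.start()

    def _refresh(self):
        if self.store.refresh(self.key, self.holder_id, ttl=self.ttl):
            self._schedule_heartbeat()

    def release(self) -> None:
        if self._heartbeat:
            self._heartbeat.cancel()
        # Only the current holder can release; a stale holder's release is a no-op.
        self.store.release(self.key, self.holder_id)
```

The fencing token returned by acquire() would be attached to downstream writes so that storage can reject operations from a holder whose lock has already expired.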
The same principle applies to connection management. If your platform uses gRPC or long-lived connections, connection pooling, health checking, and load balancing shouldn't be an exercise left to the reader. A connection provider that handles pool sizing, retries on transient failures, and graceful draining encourages the developer to think about what to query, not how to maintain a healthy connection. This orchestration is particularly important in polyglot environments where each language's ecosystem has different defaults and gotchas around connection lifecycle.
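A minimal sketch of such a provider, assuming a caller-supplied channel factory and channels that expose is_healthy() and close(); a production version would also handle per-environment pool sizing and retry policies:

```python
import itertools

class ConnectionProvider:
    """Maintains a small pool of long-lived channels and hands them out round-robin."""

    def __init__(self, create_channel, pool_size: int = 4):
        self._create = create_channel
        self._channels = [create_channel() for _ in range(pool_size)]
        self._next = itertools.cycle(range(pool_size))

    def get(self):
        """Return a healthy channel, transparently replacing broken ones."""
        idx = next(self._next)
        channel = self._channels[idx]
        if not channel.is_healthy():          # assumed health-check hook
            channel = self._channels[idx] = self._create()
        return channel

    def drain(self):
        """Gracefully close all channels, e.g. on shutdown or redeploy."""
        for channel in self._channels:
            channel.close()
```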
Environment-Aware Defaults
We operate across a plethora of system environments: bare metal instances, container orchestrators, and serverless functions. Each environment has its own very specific traits and properties, and a single set of SDK defaults can be dangerous.
As an example, HTTP keepalives are a standard performance optimization; they reuse TCP connections across requests to avoid the overhead of repeated handshakes. In most environments, this reuse is exactly what you want. But in a serverless environment like AWS Lambda, the execution context freezes between invocations. The keepalive timer keeps ticking on the server side, but the client is frozen. When the function wakes up minutes later, it tries to send a request on a connection the server has already closed. The result is mysterious timeout errors that look like a server-side issue, not a client-side configuration problem.
The fix itself was straightforward: configuring keepalive behavior appropriately for the serverless execution model. Keepalives are not inherently bad; in long-running services they are essential for connection health. But in a freeze-and-thaw environment like Lambda, the default keepalive settings become actively harmful. The hours spent tracing gRPC logs to arrive at that conclusion were the real cost. The deeper lesson here is that your SDK should know where it's running. Having environment-specific configuration profiles (ServerlessConfig, ContainerConfig, and LongRunningConfig) with defaults tuned for each execution model can save your users significant debugging time chasing problems that aren't bugs in their code, but mismatches between infrastructure assumptions and runtime reality.
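Those profile names could map to something as simple as the following sketch; the specific fields and values are illustrative assumptions rather than recommendations:

```python
from dataclasses import dataclass

@dataclass
class ClientConfig:
    keepalive_enabled: bool
    keepalive_interval_secs: int
    max_idle_connection_secs: int

# Long-running services benefit from keepalives to maintain connection health.
LongRunningConfig = ClientConfig(
    keepalive_enabled=True, keepalive_interval_secs=30, max_idle_connection_secs=300)

# Containers are similar, but often sit behind orchestrator-level proxies with
# shorter idle timeouts.
ContainerConfig = ClientConfig(
    keepalive_enabled=True, keepalive_interval_secs=30, max_idle_connection_secs=120)

# In a freeze-and-thaw environment, do not trust idle connections across
# invocations; prefer re-establishing them.
ServerlessConfig = ClientConfig(
    keepalive_enabled=False, keepalive_interval_secs=0, max_idle_connection_secs=0)
```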
Solving the Retry Storm
One of the most common causes of cascading failures in distributed systems is the retry storm. Let's imagine that a backend service has a momentary latency spike. Every client, following a simple "retry 3 times" logic, immediately hammers the service with three times the traffic. The service, which was already under pressure, now collapses entirely. What makes this collapse particularly dangerous is that every individual client is doing the "right" thing, retrying a failed request. But the emergent behavior of thousands of clients retrying simultaneously is catastrophic.
An ergonomic SDK addresses this behavior not by documenting the right retry strategy, but by enforcing it as a default. Exponential backoff with jitter is a technique that ensures that retries spread out over time instead of arriving in a coordinated wave. Circuit breaking ensures that a client experiencing repeated failures gives the backend room to breathe rather than continuing to add load. By baking these into the client library as defaults rather than opt-in configurations, the backend has the best possible chance of gradual recovery.
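A sketch of what those defaults could look like inside the SDK. The exception types, thresholds, and backoff bounds are assumptions for illustration:

```python
import random
import time

class TransientError(Exception):
    """Raised by the transport layer for retryable failures (assumed)."""

class CircuitOpenError(Exception):
    """Raised when the circuit breaker is refusing calls."""

def call_with_backoff(fn, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Retry with exponential backoff and full jitter to avoid coordinated retry waves."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))

class CircuitBreaker:
    """After repeated failures, fail fast for a cooldown period to give the backend room to recover."""

    def __init__(self, failure_threshold=5, cooldown_secs=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.cooldown = cooldown_secs
        self.open_until = 0.0

    def call(self, fn):
        if time.time() < self.open_until:
            raise CircuitOpenError("backend is cooling down; failing fast")
        try:
            result = fn()
            self.failures = 0  # any success closes the circuit again
            return result
        except TransientError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open_until = time.time() + self.cooldown
            raise
```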
The takeaway across all three patterns is the same: Discover the repeated problems your users are solving and absorb those solutions into the platform itself. By doing so, you've reduced the developer's cognitive load and eliminated entire classes of bugs that they can no longer accidentally introduce.
Pillar 3: Operator Ergonomics
Operator ergonomics is the missing link! We often obsess over the developer experience, and rightly so. But there's another user of the platform that we consistently forget: the internal operator. These are the engineers who build, maintain, and, most importantly, debug the platform when things go wrong.
If your platform's internal state is a black box, or if fixing a common issue requires a complex series of manual steps run in precisely the right order, your platform has a reliability problem hiding in plain sight. Poor operator ergonomics directly leads to high Mean Time to Recovery (MTTR).
Why do operators need ergonomics?
Consider a scenario where you need to scale out your database to have four partitions instead of two. A runbook to achieve this partitioning might look like:
- Provision two new instances for the desired partition type.
- Provision the logical partition of the system that wraps those instances.
- "Connect" the two new partitions to the database.
- Optionally, perform any data transfer required to balance storage among the partitions.
These four tasks can be run as four individual scripts or a series of commands by an operator, which seems straightforward enough. But consider when an alert fires and cognitive load is at its peak. The operator accidentally runs step 3 before step 2. The system is now in an inconsistent state: Instances exist, but the logical partition they belong to hasn't been created yet. Now they're debugging two problems: the original hot partition and the mess that was just created. Or they complete all four steps for the wrong partition ID because their terminal history had a similar command from last week. Or they skip step 4 because the runbook says "optionally", not realizing that in this particular case the cluster will become imbalanced and tip over an hour later.
Every infrastructure team has a version of this story. The problem isn't that the operator was careless; it's that the tooling required perfect execution under the worst possible conditions.
Declarative Over Imperative
This is where the connection back to Pillar 1, reliability as automated state management, becomes important. Instead of four imperative steps, we can have the operator specify the what through a declarative interface, while the control plane figures out the how:
update partition <partition-id> --scale-up --new-num-nodes 4
The control plane handles provisioning instances, creating logical partitions, connecting them, and rebalancing data in the correct order, idempotently, with validation at each step. This approach removes the cognitive load on operators for how the desired state is achieved. It also reduces human errors because the sequencing and safety checks are codified in the control plane rather than in a runbook.
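Under the hood, the control plane might expand that single declarative command into an ordered, resumable plan. A rough sketch, with every cluster API assumed and the step names mirroring the runbook above:

```python
def scale_up_partition(partition_id: str, new_num_nodes: int, cluster) -> None:
    """Expand a single declarative request into ordered, idempotent steps."""
    current = cluster.nodes_in_partition(partition_id)
    needed = new_num_nodes - len(current)
    if needed <= 0:
        return  # already at or above the desired size; re-running is a no-op

    # Step 1: provision new instances (skips any left over from a prior, failed run).
    instances = cluster.ensure_instances(partition_id, count=needed)

    # Step 2: create the logical partition wrapper if it does not exist yet.
    cluster.ensure_logical_partition(partition_id, instances)

    # Step 3: connect the new partitions to the database.
    cluster.connect_partitions(partition_id, instances)

    # Step 4: rebalance data if needed; the control plane makes the call instead
    # of leaving an "optional" judgment to an operator under pressure.
    if cluster.is_imbalanced(partition_id):
        cluster.rebalance(partition_id)
```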
A well-designed tool should go further with built-in safety rails.
- Dry-run mode: Before applying any change, the tool shows what will happen: "This will provision 2 new nodes in us-east-1 and rebalance 3 partitions. Estimated time: 12 minutes." This lets the operator verify intent before committing.
- Blast radius controls: The tool should make it harder to accidentally target production when you intended staging, or to affect more nodes than intended. Requiring explicit confirmation for high-impact changes goes a long way here.
- Idempotency recovery: If a command fails halfway through due to a network blip, timeout, etc., the operator should be able to rerun the same command and have it pick up where it left off, rather than creating a second set of half-provisioned resources. This connects directly to the idempotency requirement we discussed in the control plane.
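To make the safety rails tangible, here is a hedged sketch of a CLI wrapper with a dry-run mode and a production confirmation gate; the plan-building and plan-applying helpers are placeholders:

```python
import argparse
from types import SimpleNamespace

def build_plan(partition_id: str, new_num_nodes: int) -> SimpleNamespace:
    # Placeholder plan computation; a real tool would query the control plane.
    return SimpleNamespace(new_nodes=2, moves=3, eta_minutes=12)

def apply_plan(plan) -> None:
    # Placeholder; a real tool would submit the plan to the control plane,
    # idempotently, so a failed run can simply be re-executed.
    print("Applying plan...")

def main():
    parser = argparse.ArgumentParser(description="Scale up a partition (sketch).")
    parser.add_argument("partition_id")
    parser.add_argument("--new-num-nodes", type=int, required=True)
    parser.add_argument("--env", choices=["staging", "production"], required=True)
    parser.add_argument("--dry-run", action="store_true",
                        help="Show the planned changes without applying them.")
    args = parser.parse_args()

    plan = build_plan(args.partition_id, args.new_num_nodes)
    print(f"Plan: provision {plan.new_nodes} new nodes and rebalance {plan.moves} "
          f"partitions in {args.env}. Estimated time: {plan.eta_minutes} minutes.")
    if args.dry_run:
        return  # dry-run mode: show the plan, change nothing

    # Blast radius control: production changes need explicit confirmation.
    if args.env == "production":
        if input("Type the partition id to confirm: ") != args.partition_id:
            print("Confirmation failed; aborting.")
            return

    apply_plan(plan)

if __name__ == "__main__":
    main()
```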
Operational Awareness: Beyond Simple Dashboards
Operational awareness isn't just about having a dashboard with a hundred squiggly lines; it's about having dashboards that answer the question: "What is broken right now, and why?"
I've found it helpful to think of this as a hierarchy of observability.
- What is broken? High-level health indicators that tell you the service is degraded are what wake you up. At this level, keep them binary: healthy or not. The goal isn't to encode every failure mode into the top-level view; it is to answer "Do I need to act right now?" as quickly as possible. The distinction between what is down versus what is degraded belongs at the next level.
- Where is it broken? Intermediate dashboards narrow your search. The control plane is failing to place new shards. Replication lag is spiking in a specific region. Latency is elevated on nodes serving a particular tenant.
- Why is it broken? Deep-dive tooling exposes internal state. Both Nodes X and Z think they own Partition Y. The connection pool is exhausted because a downstream dependency is slow. A load shedding rule is misconfigured and is matching too broadly.
The key insight here is that each level should link to the next. When the high-level alarm fires, one click should take you to the intermediate dashboard already filtered to the affected region. One more click should surface the specific node or component that needs attention. This progression from "something is wrong" to "here is exactly why" should feel natural and fast.
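One way to make that linking mechanical rather than tribal is to have the top-level alert carry pre-filtered drill-down links. A small sketch, with the URL scheme assumed:

```python
def build_alert(service: str, region: str, component: str) -> dict:
    """Top-level alert payload that links straight to the next two levels."""
    base = "https://dashboards.example.internal"  # assumed internal dashboard host
    return {
        "summary": f"{service} is degraded",                                   # what is broken
        "where": f"{base}/{service}/overview?region={region}",                 # where it is broken
        "why": f"{base}/{service}/deep-dive?region={region}&component={component}",  # why it is broken
    }
```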
Encoding Knowledge Into Tooling
A subtler cost to poor operator ergonomics is worth calling out: It creates tribal knowledge. When recovery procedures are complex and poorly built, only the engineers who have been through a specific failure mode before know how to handle it. The "runbook" becomes a Slack thread from six months ago, or it lives in one person's head.
This is a scaling problem disguised as a people problem. You can't hire your way out of it, because every new on-call engineer goes through the same learning curve. The fix is the same principle from Pillar 2, developer ergonomics, applied inward: Absorb repeated operational patterns into tooling. If your team has manually rebalanced partitions ten times, that procedure should be a single command. If diagnosing a replication lag issue always requires checking the same five things (metrics, logs, database state, etc.), build a diagnostic command that checks all five and reports the result.
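A sketch of what such a diagnostic command could look like; the individual checks here are dummy lambdas standing in for real metrics, log, and database queries:

```python
def run_diagnostics(name: str, checks: list) -> None:
    """Run a named set of checks in one command and report each result."""
    print(f"Diagnostics: {name}")
    for label, check in checks:
        try:
            ok, detail = check()
            status = "OK" if ok else "SUSPECT"
        except Exception as exc:  # a failing check is itself a finding
            status, detail = "ERROR", str(exc)
        print(f"  [{status}] {label}: {detail}")

# Example wiring for a replication-lag investigation; each check is an assumption
# standing in for a real query against metrics, logs, or database state.
run_diagnostics("replication lag (us-east-1)", [
    ("replication metrics", lambda: (True, "lag under 2s on all replicas")),
    ("error logs", lambda: (True, "no replication errors in the last 15 minutes")),
    ("database state", lambda: (False, "node-7 reports a stale epoch")),
    ("network saturation", lambda: (True, "NIC utilization below 60%")),
    ("pending control plane ops", lambda: (True, "no stuck operations")),
])
```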
The goal is to have an engineer on their first on-call rotation able to resolve the same incidents as a ten-year veteran, not because the problems are simple, but because the tooling encodes the knowledge that the veteran accumulated over those years.
Synthesis: The Virtuous Cycle
The core argument of this framework is that these three pillars do not exist in isolation. They form a feedback loop that determines the long-term health of your engineering culture.

Figure: A feedback loop between developer ergonomics, operator ergonomics, and automated reliability that builds trust.
Ergonomic SDKs lead to predictable usage patterns. When developers find it easy to use the right patterns, proper retries, connection pooling, and environment-aware defaults, the fleet becomes more stable. The control plane sees fewer anomalies because the clients are well-behaved.
Stable fleets protect operators from frequent pages. When the control plane handles the drudge work of rebalancing, healing, and traffic shaping, operators are no longer fighting fires around the clock. On-call becomes manageable rather than dreaded.
This completes the cycle: Confident operators have the bandwidth to invest back into the platform. They can build better tools, refine the SDKs, and run resilience drills to find latent bugs before they become outages. They fix the CLI tool that's slow, rewrite the dashboard that's confusing, and add the dry-run mode that prevents the next operator error. These small ergonomic improvements are direct investments in future reliability.
The opposite is also true. When tooling is poor, operators make more errors. More errors lead to more incidents. More incidents lead to exhausted engineers who don't have the bandwidth to improve the tools. The tools stay poor. It is a cycle that feeds on itself, and breaking out of it requires a deliberate investment in all three pillars simultaneously.
The next time you are faced with a choice between a quick fix and an architectural improvement, consider which of the three pillars it strengthens. By focusing on the virtuous cycle of reliability and ergonomics, you aren't just building a set of tools; you are building a foundation that allows your entire organization to scale.
When This Framework Is Overkill
This framework is shaped by the realities of large, complex distributed systems, the kind where a single misconfigured client can cascade into a fleet-wide outage. If you're running a small service with a handful of engineers, the overhead of building a full control plane, opinionated SDKs, and layered observability may not pay for itself. A well-written runbook and a simple deploy script might be all you need. The investment starts to pay off when the cost of human errors exceeds the cost of building the automation, when your team is large enough that tribal knowledge becomes a liability, or when your system is complex enough that no single person can hold the full picture in their head.
The Goal Is Trust
Building an infrastructure platform is not just a technical challenge; it is a trust-building exercise.
Developers trust the platform when it is ergonomic and helps them move faster without getting in their way. Operators trust the platform when it is reliable and they know the "brain" is handling the mundane tasks and the tools are there to support them when things get complex.
Trust, once established, compounds. A platform that developers trust gets adopted willingly, not by mandate. A platform that operators trust is invested in, rather than worked around. That trust, more than any architectural diagram or design document, is what separates infrastructure that scales from infrastructure that becomes the bottleneck it was intended to eliminate.