
Engineering Speed at Scale — Architectural Lessons from Sub-100-ms APIs


Key Takeaways

  • Treat latency as a first-class product concern — designed with the same discipline as security and reliability.
  • Use a latency budget to turn "sub-100ms" into enforceable constraints across every hop in the request path.
  • Expect speed to regress unless you actively guard it as the system, traffic, and dependencies evolve.
  • Keep performance ownership broad by baking it into reviews, dashboards, and release practices — not a single "performance team."
  • Let architecture create the fast path, and let culture (measurement + accountability) keep it fast over time.
     

The Cost of a Millisecond: Why Latency Shapes Experience

When we talk about API performance, it’s tempting to think in neat technical terms - response times, CPU cycles, connection pools, and the occasional flame graph. But in real-world systems, especially global commerce and payments platforms, latency has a very human cost. A delay of just 50 or 100 milliseconds rarely registers in isolation, but at scale it can nudge a customer away from completing a purchase, disrupt a payment flow, or simply chip away at the trust users place in your product.

Speed shapes perception long before it shapes metrics. Users don’t measure latency with stopwatches - they feel it. The difference between a 120 ms checkout step and an 80 ms one is invisible to the naked eye, yet emotionally it becomes the difference between "smooth" and "slightly annoying". On a small scale, that’s forgettable. Across millions of sessions, it becomes the friction that compounds into lower conversion rates, abandoned carts, and reduced revenue. And the irony? The engineering effort needed to recover from that friction - new features, experiments, retention strategies - often dwarfs the work needed to prevent it in the first place.

In high-throughput platforms, latency amplifies. If a service adds 30 ms in normal conditions, it might add 60 ms during peak load, then 120 ms when a downstream dependency wobbles. Latency doesn’t degrade gracefully; it compounds. And once your tail latency (p95, p99) drifts, it silently "taxes" every upstream service that depends on you. Each service adds its own jitter, serialization overhead, and network hop. What starts as a tiny bump in one API becomes a cascading slowdown across dozens of interconnected services.

This is why high-performing architecture teams treat speed as a product feature, not a pleasant side effect. They design for latency the same way they design for security and reliability: intentionally, with clear budgets, well-defined expectations, and patterns that protect the user experience under stress.

A helpful way to see this is through a "latency budget". Instead of thinking about performance as a single number - say, "API must respond in under 100 ms" - modern teams break it down across the entire request path:

  • 10 ms at the edge
  • 5 ms for routing
  • 30 ms for application logic
  • 40 ms for data access
  • 10–15 ms for network hops and jitter

Each layer is allocated a slice of the total budget. This transforms latency from an abstract target into a concrete architectural constraint. Suddenly, trade-offs become clearer: "If we add feature X in the service layer, what do we remove or optimize so we don’t blow the budget?" These conversations - technical, cultural, and organizational - are where fast systems are born.

The heart of this article is simple: low latency isn’t an optimization - it’s a design outcome. It emerges from the choices we make about data locality, async vs. sync flows, cache boundaries, error isolation, and observability. Achieving sub-100 ms is possible for many systems, but sustaining it under load takes alignment across engineering, product, and operations.

In the sections that follow, we’ll break down how real systems are structured, how engineering teams make trade-offs when milliseconds matter, and how organizations sustain performance long after the first release ships. Fast systems don’t happen accidentally - they’re engineered with intent.

Inside the Fast Lane: How Low-Latency Systems Are Structured

Before we talk about optimizing performance, we need to zoom out and understand what a low-latency system actually looks like. Sub-100 ms responses don’t come from a single clever trick; they emerge from a carefully orchestrated pipeline of components that work together with minimal friction. Think of it less as "making one thing fast" and more as "removing unnecessary steps from an entire journey".

Most modern systems - especially in commerce and payments - follow a layered architecture that looks deceptively simple from the outside: a client makes a request, it hits an API gateway, flows through a service layer, talks to a database, and returns. But behind that simple flow is an intricate chain where every hop, every serialization, every cache hit or miss shapes the user’s experience.

Let’s walk through the anatomy of a fast system and where milliseconds typically hide.

The Request Journey: Where Latency Sneaks In

A typical sub-100 ms request flow might look like this:

  1. Client → CDN or Edge Network
    The closest node absorbs the request and routes it smartly.
    Latency target: 5–15 ms
  2. Edge → API Gateway
    Authentication, routing, throttling.
    Latency target: 5 ms
  3. Gateway → Service Layer
    Business logic, orchestration, fan-out.
    Latency target: 20–30 ms
  4. Service Layer → Data Layer
    Reads from databases, caches, or search systems.
    Latency target: 25–40 ms
  5. Service → Gateway → Client
    Serialization and the network hop back.
    Latency target: 5–10 ms

When done right, the entire hop chain stays predictable - even during peak load. But if any single hop drifts, the whole chain inherits the slowdown. This is why fast systems start with understanding the complete journey, not just the part you own.

Where Latency Really Comes From (Not Where You Expect)

Latency is rarely caused by "slow code". In production systems, it usually comes from:

1. Network Hops

Each hop adds cost:

  • TLS handshakes
  • Connection pool waits
  • DNS lookups
  • Region-to-region travel

Shaving one hop often saves more latency than rewriting 100 lines of Java.

2. Serialization & Payload Size

JSON serialization/deserialization is more expensive than people realize. Every unnecessary field is extra work. Binary formats (e.g., Protobuf) can help, but they add operational overhead.

3. Cold Caches

A cache miss at the wrong time can double or triple service latency. This is why "warming strategies" matter when deploying new versions.

4. Database Query Shape

Database latency is often an access-pattern problem: query shape, indexes, and cardinality matter. A poorly indexed read can turn a 10 ms query into a 120 ms spike. Multiply that across thousands of requests per second, and tail latency explodes.

5. Dependent Services

This is where latency becomes unpredictable: If your service calls three downstreams, your response time is often gated by the slowest one.

This is why async fan-out, caching, and circuit breakers become critical. (We’ll go deeper on each of these shortly.)

The Latency Budget: Your Most Important Architectural Tool

High-performing engineering teams don’t just "measure latency"; they budget it. A latency budget is like a financial budget: Everyone gets a fixed allowance, and nobody is allowed to overspend.

A typical 100 ms budget might look like:

  • Edge/CDN: 10 ms
  • Gateway: 5 ms
  • Service logic: 30 ms
  • Database/cache: 40 ms
  • Network jitter: 10 ms

Budgets make performance manageable and negotiable. Engineers can now ask:

"If we add feature X, which layer gives up its milliseconds?"

Without budgets, performance conversations become vague and subjective.
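
A budget only has teeth when it is written down where tooling can check it. Here is a minimal sketch - names and values are illustrative, not a standard API - of encoding the budget above and asserting against it in load tests or canary checks:

Map<String, Integer> budgetMs = Map.of(
        "edge", 10,
        "gateway", 5,
        "serviceLogic", 30,
        "dataAccess", 40,
        "networkJitter", 10);

void assertWithinBudget(String layer, long observedP95Ms) {
    // Fail the check when a layer's measured p95 exceeds its slice of the budget.
    int allowance = budgetMs.getOrDefault(layer, Integer.MAX_VALUE);
    if (observedP95Ms > allowance) {
        throw new AssertionError(layer + " p95 of " + observedP95Ms
                + " ms exceeds its " + allowance + " ms budget");
    }
}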

Why Understanding the System Structure Matters

Everything we cover in later sections - async fan-out, caching hierarchies, circuit breakers, fallback strategies - only makes sense when you understand the system anatomy. Optimizing a single service without understanding the ecosystem is like upgrading a car engine but ignoring the wheels, brakes, and fuel system.

Fast systems share these traits:

  • Fewer hops
  • Aggressive local caching
  • Predictable data-access paths
  • Parallelism over serial execution
  • Isolation of slow components
  • Stable tail latency under load

With the system anatomy clear, we can now move into the engineering playbook - how to actually make these systems fly.

The Engineering Playbook: Trade-Offs That Keep APIs Lightning-Fast

Engineering for low latency is really engineering for predictability. Fast systems aren’t built through micro-optimizations - they’re built through a series of deliberate, layered decisions that minimize uncertainty and keep tail latency under control. This section breaks down the actual patterns, trade-offs, and guardrails used in high-throughput systems.

Async Fan-Out: Parallelism Without Pain

Slow APIs often boil down to one root cause: serial dependencies.

If your system performs three downstream calls at 40 ms each, you’ve already lost 120 ms without doing any real business work.

Fan out in parallel

Java’s CompletableFuture is a natural fit, especially when paired with a custom executor tuned for downstream concurrency:

// Bounded pool for downstream IO calls; CallerRunsPolicy applies backpressure
// to the caller instead of silently dropping work when the queue is full.
ExecutorService pool = new ThreadPoolExecutor(
        20, 40, 60, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(500),
        new ThreadPoolExecutor.CallerRunsPolicy()
);

// Fan out the three downstream calls in parallel on the shared pool.
CompletableFuture<UserProfile> profileFuture =
        CompletableFuture.supplyAsync(() -> profileClient.getProfile(userId), pool);

CompletableFuture<List<Recommendation>> recsFuture =
        CompletableFuture.supplyAsync(() -> recClient.getRecs(userId), pool);

CompletableFuture<OrderSummary> orderFuture =
        CompletableFuture.supplyAsync(() -> orderClient.getOrders(userId), pool);

// Wait for all three, then assemble the response; join() is safe here because
// allOf() guarantees each future has already completed.
return CompletableFuture.allOf(profileFuture, recsFuture, orderFuture)
        .thenApply(v -> new HomeResponse(
                profileFuture.join(),
                recsFuture.join(),
                orderFuture.join()
        ));

But here’s the caution most articles never mention:

Async code doesn’t eliminate blocking - it just hides it inside a thread pool.

If your executor is misconfigured, you can trigger:

  • CPU thrashing
  • Thread contention
  • Queue buildup
  • Out-of-memory errors
  • Cascading slowdowns across the entire fleet

Thread Pool Rule of Thumb:

For downstream IO-bound calls, size your pool to:
2×CPU cores × expected parallel downstream calls per request
(adjust using p95/p99 load testing)
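
As a starting point, that heuristic translates into a few lines of arithmetic (illustrative numbers - validate the result against your own p95/p99 load tests):

int cpuCores = Runtime.getRuntime().availableProcessors();
int downstreamCallsPerRequest = 3;                        // e.g., profile + recommendations + orders
int poolSize = 2 * cpuCores * downstreamCallsPerRequest;  // IO-bound sizing heuristic, not a hard rule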

Multi-Level Caching: The Art of Fast Paths

Fast systems don’t eliminate work - they avoid doing the same expensive work repeatedly.

A typical hierarchy:

  1. Local cache (Caffeine) - sub-millisecond
  2. Redis cache - 3–5 ms
  3. Database - 20–60+ ms

Use the dual-level caching pattern. In this example, Redis uses a 10-minute TTL, and the local in-memory cache should also be time-bounded (usually shorter), otherwise it can quietly become a "forever cache" and serve stale data across instances.

private final RedisClient redis;
private final ProductDb db;
private final Cache<String, ProductInfo> localCache;

public ProductService(RedisClient redis, ProductDb db) {
    this.redis = redis;
    this.db = db;
    this.localCache = Caffeine.newBuilder()
        .maximumSize(50_000)
        .expireAfterWrite(Duration.ofMinutes(1)) // deliberately shorter than the 10-minute Redis TTL
        .build();
}

public ProductInfo getProductInfo(String productId) {
    ProductInfo local = localCache.getIfPresent(productId);
    if (local != null) return local;

    ProductInfo redisValue = redis.get(productId);
    if (redisValue != null) {
        localCache.put(productId, redisValue);
        return redisValue;
    }

    ProductInfo dbValue = db.fetch(productId);

    redis.set(productId, dbValue, Duration.ofMinutes(10));
    // localCache is configured with expireAfterWrite(1, MINUTES)
    localCache.put(productId, dbValue);
    return dbValue;
}

This drives most requests into the fast path and reserves slow work for the cold path.
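
Cold caches were called out earlier as a hidden latency source. One hedged way to avoid serving the first post-deploy wave from the cold path is a warming pass at startup; in this sketch, topProductIds is a hypothetical helper returning the most-requested product IDs:

@PostConstruct
void warmCaches() {
    // Pre-populate Redis and the local cache with the hottest keys so the first
    // requests after a deploy hit the fast path instead of the database.
    for (String productId : topProductIds(1_000)) {
        getProductInfo(productId);
    }
}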

Cache Invalidation: The Hardest Problem in Computer Science (Still True)

Low-latency systems are heavily cache-driven, but caching without a clear invalidation strategy is a time bomb.

There are three kinds of invalidation:

  1. Time-based (TTL) - Simple, safe, and widely used; the risk of stale data increases with longer TTLs.
  2. Event-based - The producer sends "invalidate" events to downstream caches whenever data changes; requires strong data ownership.
  3. Version-based - Cache keys include a version (e.g., product:v2:12345); when you bump the version, old data becomes unreachable.

There is no universally "best" invalidation strategy. The right choice depends on how often data changes and how costly staleness is - which is exactly why classification matters.
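
As a tiny illustration of the version-based style (a sketch - the version would normally come from the owning service's schema or configuration, not a hard-coded constant):

static final int PRODUCT_CACHE_VERSION = 2;

static String productCacheKey(String productId) {
    // Bumping PRODUCT_CACHE_VERSION makes every previously written entry unreachable,
    // so stale data can no longer be served after a format or ownership change.
    return "product:v" + PRODUCT_CACHE_VERSION + ":" + productId;  // e.g., product:v2:12345
}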

Data Classification: Not Everything Belongs in Cache

This is the part almost every caching article ignores, but real systems cannot.

You cannot treat all data as equal. Before caching anything, classify the data:

  • Public - Safe to cache anywhere (CDN, Redis, local memory). Cache freely; TTL-based expiry is usually fine. Examples: product titles, images, metadata.
  • Internal - Cacheable with restrictions. Cache with guardrails (scope, TTL, access control). Examples: internal IDs, flags.
  • Confidential (PII) - Sensitive user data. Cache only if encrypted and with a strict TTL. Examples: email, phone, full user info.
  • Restricted (PCI) - Highly regulated payment data. Never cached. Examples: raw card numbers, CVV, unmasked PAN.

When to be strict vs. loose?

The caching strategy depends on the type of data. For example:

  • Product catalog → loose TTL is fine (staleness OK)
  • Pricing, offers → tighter TTL or event-based
  • Payments, balances → never cached, or only tokenized/aggregated versions

A simple classification check can protect engineering teams from accidental compliance violations.

if (data.isRestricted()) {
    throw new UnsupportedOperationException("Cannot cache PCI/PII data");
}

Circuit Breakers: Don’t Let Slow Dependencies Infect Your Tail Latency

Downstream slowness is one of the biggest drivers of p99 spikes. A dependency doesn’t need to be fully down to cause trouble - sustained latency is enough. If every request waits on a degrading downstream call, you start consuming threads, building queues, and turning a local slowdown into a broader tail-latency problem.

A circuit breaker helps by acting as a boundary between your service and an unstable dependency. When errors or timeouts cross a threshold, the breaker opens and temporarily stops sending traffic there. That shifts the system from "wait and accumulate" to a predictable outcome: fail fast and fall back, keeping your own API responsive.

Resilience4j gives lightweight protection:

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)
        .slidingWindowSize(20)
        .waitDurationInOpenState(Duration.ofSeconds(5))
        .build();

CircuitBreaker cb = CircuitBreaker.of("recs", config);

Supplier<List<Recommendation>> supplier =
        CircuitBreaker.decorateSupplier(cb, () -> recClient.getRecs(userId));

try {
    return supplier.get();
} catch (CallNotPermittedException ex) {
    return Collections.emptyList();  // breaker is open: fail fast without calling downstream
} catch (Exception ex) {
    return Collections.emptyList();  // downstream error or timeout: fast fallback
}

When the breaker opens:

  • Calls fail fast (<1 ms)
  • No threads are blocked
  • Your API stays stable

Fallbacks: When "Fast and Partial" Beats "Slow and Perfect"

Fallbacks keep your fast path intact when a dependency is slow or unavailable. The point isn’t to pretend nothing happened - it’s to stop downstream slowness from consuming your latency budget. In many user flows, a slightly degraded response delivered quickly is better than a perfect response delivered late.

Fallbacks should:

  • Provide something useful
  • Be predictably fast
  • Not cause additional load
  • Be easy to reason about

Timeouts are part of the design. If a downstream timeout is "a few seconds", it can quietly destroy a sub-100ms target. Timeouts need to align with the latency budget you set earlier and the dependency’s p95/p99 behavior - especially in fan-out paths where one slow call can dominate tail latency.

Here’s an example that returns a cached snapshot if the full page can’t be assembled quickly. This only works because it builds on the caching approach discussed earlier - another reminder that low latency is holistic (budgets, caching, timeouts, and resilience patterns working together):

public ProductPageResponse getPage(String productId) {
    try {
        return fetchFullPage(productId);
    } catch (TimeoutException e) {
        return fetchCachedSnapshot(productId);  // warm, minimal, safe
    }
}
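
The TimeoutException above has to come from somewhere. A minimal sketch of enforcing the deadline yourself - assuming the executor from the fan-out example, that fetchFullPage and fetchCachedSnapshot are plain unchecked helpers, and an illustrative 60 ms slice of the budget - uses CompletableFuture.orTimeout:

public ProductPageResponse getPageWithDeadline(String productId) {
    return CompletableFuture.supplyAsync(() -> fetchFullPage(productId), pool)
            .orTimeout(60, TimeUnit.MILLISECONDS)                // budget slice, not "a few seconds"
            .exceptionally(ex -> fetchCachedSnapshot(productId)) // timeout or failure -> warm snapshot
            .join();
}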

Fallbacks don’t eliminate failures - they bound the user impact when things get slow.

Data Partitioning: Reducing Hotspots and Tail Spikes

Partitioning reduces lock contention, narrows index scans, and improves locality.

Here is a simple example where data is partitioned by region:

CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('US');
CREATE TABLE orders_eu PARTITION OF orders FOR VALUES IN ('EU');

The application layer needs corresponding updates to use the partition effectively:

// Route the read to the regional partition (assumes a Spring JdbcTemplate and a hypothetical orderRowMapper)
String table = region.equals("US") ? "orders_us" : "orders_eu";
return jdbc.query("SELECT * FROM " + table + " WHERE user_id = ?", orderRowMapper, userId);

Partitioning is essential for read-heavy API systems.

Observability: Making Speed Measurable

Fast systems are not just the result of good architecture - they’re the result of relentless observability. Latency budgets, circuit breakers, caching layers, thread pools… none of them matter if you don’t know when and where your system drifts under real traffic.

The biggest myth about low latency is that once you achieve it, you’re done. The truth is the opposite:

Speed decays unless you actively guard it.

This is why high-performing engineering teams treat observability as a first-class citizen - not a debugging tool, but a continuous performance governance mechanism.

Measure What Matters: p50, p95, p99, and Beyond

Most dashboards proudly show average latency, which is almost useless in distributed systems. What users actually feel is tail latency:

  • p50 → "typical user"
  • p95 → "slightly unlucky user"
  • p99 → "customer who will abandon your product if this happens too often"

If your p50 is 45 ms but your p99 is 320 ms, your system isn’t fast - it’s merely nice sometimes.

Fast systems tune for predictability, not just averages.

Instrumentation with Micrometer

Micrometer is the de facto standard for metrics in modern Java systems, and it makes latency instrumentation almost trivial.

Here’s a Micrometer timer for an API endpoint:

@Autowired
private MeterRegistry registry;

public ProductInfo fetchProduct(String id) {
    return registry.timer("api.product.latency")
            .record(() -> productService.getProductInfo(id));
}

With percentile publishing enabled on the timer (via Micrometer’s distribution statistics), this produces:

  • p50, p90, p95, p99 histograms
  • throughput (requests/sec)
  • max observed latency
  • time-series for dashboards
  • SLO burn-rate signals

Custom tags can be added for deeper insight.

registry.timer("api.product.latency",
        "region", userRegion,
        "cacheHit", cacheHit ? "true" : "false"
);
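
One caveat: Micrometer only publishes percentile data when distribution statistics are enabled for the timer. A sketch that combines the percentile configuration with the tags above:

Timer timer = Timer.builder("api.product.latency")
        .publishPercentiles(0.5, 0.95, 0.99)   // p50, p95, p99
        .publishPercentileHistogram()          // histogram buckets for backends like Prometheus
        .tag("region", userRegion)
        .register(registry);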

A rule we use internally:

Tag everything that might affect latency.
Region, device type, API version, cache hit/miss, fallback triggered, etc.

This creates semantic observability - the opposite of blind metrics.

Distributed Tracing: The Truth Serum of Low-Latency Systems

Metrics tell you how long something took. Tracing tells you why.

Using OpenTelemetry + Jaeger, you can map an entire request journey:

Span span = tracer.spanBuilder("fetchProduct")
        .setSpanKind(SpanKind.SERVER)
        .startSpan();

try (Scope scope = span.makeCurrent()) {
    return productService.getProduct(id);
} finally {
    span.end();
}

When visualized in Jaeger, you’ll see:

  • Gateway time
  • Service logic time
  • Parallel calls
  • Cache vs DB path
  • Downstream delays
  • Serialization time

This is how teams discover issues like:

  • "DB is fine, but Redis has a spike every hour".
  • "API gateway is spending 10 ms doing header parsing".
  • "Thread pool starvation during peak traffic".

Tracing pinpoints latency leaks no dashboard could reveal.

SLOs and Latency Budgets: The Guardrails That Keep Teams Honest

Latency budgets, as discussed earlier, only work when teams measure and enforce them.

A typical SLO (Service Level Objective):

  • Target: p95 < 120 ms
  • Period: Rolling 30 days
  • Error budget: 5% of requests may exceed threshold

SLO burn rate is simply how fast you’re spending that error budget compared to the "expected" pace. A burn rate of 1 means you’re consuming budget at the rate that would use it up exactly by the end of the SLO window; anything above 1 means you’re burning it faster than planned. When the burn rate spikes, teams slow down feature releases and prioritize performance fixes (rollback, reduce load, tune hot paths, fix a slow dependency, etc.). This is one of the most practical ways to keep "sub-100ms" from becoming a quarterly goal that slowly drifts.

A very useful burn-rate alert rule:

Alert if burn-rate > 14.4 over 10 minutes
Translation: 14.4 is a commonly used "fast-burn" threshold - if that pace were sustained, you’d consume a 30-day error budget in roughly ~2 days (≈50 hours), which is why it’s treated as urgent.
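
To make the arithmetic concrete, here is a rough sketch with illustrative numbers, assuming the 5% budget above:

// Burn rate = observed SLO-violation rate divided by the error budget (both as fractions).
double errorBudget = 0.05;                                // 5% of requests may exceed the latency threshold
double observedViolationRate = 0.72;                      // hypothetical: 72% of requests over threshold in the window
double burnRate = observedViolationRate / errorBudget;    // 14.4 -> fast-burn territory
double daysToExhaust = 30.0 / burnRate;                   // ~2.08 days (~50 hours) at this pace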

How this prevents issues from reaching customers: Burn-rate alerts are designed to fire early, while the regression is still small (or still limited to a subset of traffic). That gives you time to pause or reverse a rollout and fix the underlying cause before the slowdown becomes widespread and sustained. Teams often pair this with progressive delivery (canaries) and synthetic checks, but the key is that burn-rate is an SLO-native early warning tied directly to user-facing latency.

Thread Pool Observability: The Hidden Latency Killer

Thread pools are one of the easiest ways to accidentally break a latency budget. They look like a performance win ("parallelize downstream calls"), but under load they can become a bottleneck: threads saturate, queues grow, requests start waiting, and what used to be "async fan-out" quietly turns into backpressure and tail-latency spikes. The tricky part is that this doesn’t always show up as high CPU - it often shows up as waiting.

That’s why observability matters here. Without visibility into pool saturation and queue growth, you only notice the problem after p99 has already exploded. Instrument your pool:

ThreadPoolExecutor executor = (ThreadPoolExecutor) pool;

registry.gauge("threadpool.active", executor, ThreadPoolExecutor::getActiveCount);
registry.gauge("threadpool.queue.size", executor, e -> e.getQueue().size());
registry.gauge("threadpool.completed", executor, e -> e.getCompletedTaskCount());
registry.gauge("threadpool.pool.size", executor, ThreadPoolExecutor::getPoolSize);

If you see:

  • Active threads == max size
  • Queue constantly growing
  • Rejection count increasing

… then your async fan-out is turning into async pile-up, which leads to:

  • Retries
  • Timeouts
  • Cascaded slowness
  • p99 explosions

Thread pool monitoring is non-negotiable in low-latency environments.

Observability Isn’t a Dashboard - It’s a Culture

The most important insight is cultural:

  • Teams own their latency
  • Dashboards are reviewed weekly
  • SLOs drive engineering priorities
  • Performance regressions trigger postmortems
  • Cache hit rates are tracked like uptime
  • Every change has a performance blast radius

Fast systems stay fast only because teams keep them honest.

Beyond Architecture: How Organizations Keep APIs Fast - and Where the Future Is Heading

Engineering a sub-100 ms API is challenging; keeping it consistently fast as the system grows is even harder. Over time, feature creep, new dependencies, shifting traffic patterns, and organizational changes all conspire to slow systems down. Architecture provides the foundation, but long-term performance comes from habits, ownership, and a culture that treats latency as a first-class concern.

The most reliable lesson from real-world systems is simple:

Fast systems stay fast only when teams behave like performance is everyone’s job.

Culture Keeps Performance Alive

High-performing organizations treat performance as a shared responsibility rather than a backend problem, and that cultural mindset is ultimately what keeps APIs fast over the long term. Teams own the latency of the services they build, planning features with an explicit understanding of how many milliseconds each change will cost and holding themselves accountable when performance drifts. Engineers routinely ask performance-conscious questions during design reviews - "How many extra hops does this add?", "Is this cacheable?", "What’s the worst-case p99 impact?" - ensuring that latency remains part of everyday decision-making. And when things do go wrong, these organizations practice blameless learning: instead of pointing fingers, they analyze tail latency, refine patterns, adjust SLOs, and strengthen guardrails. In these cultures, performance isn’t a special project - it’s simply how teams work.

Hard Lessons from Real Low-Latency Systems

Patterns that repeatedly surface in production:

  • Thread Pools Can Quietly Break Everything - Undersized pools cause starvation; oversized pools cause CPU thrashing. Misconfigured pooled async work is a top contributor to p99 explosions.
  • Cache Invalidation Is More Critical Than Cache Hits - Cache hits are only a win when the data is correct. If you can’t invalidate safely, it’s better to be slower than serve stale results. Event-based invalidation helps teams stay fast and correct.
  • Variance Hurts More Than Speed - A dependency that is always 50 ms is far safer than one that fluctuates between 10 and 300 ms. Predictability beats raw throughput.
  • Proximity Beats Optimization - Cross-region calls consistently appear at the root of high latency. Keeping reads close to users matters more than indexing tricks.

These lessons form the "engineering muscle memory" that separates teams who can sustain speed from those who only achieve it once.

Anti-Patterns to Avoid

Even mature systems fall into predictable traps:

  • Treating staging latency as meaningful
  • Overusing reactive patterns without isolation
  • Logging synchronously on the hot path
  • Putting too much logic into the API gateway
  • Using one massive cache instead of layered caching

These anti-patterns create slow drift - small regressions that accumulate unnoticed until p99 collapses.

The Next Frontier of Low-Latency Systems

Fast systems of the next decade will be defined not just by new frameworks, but by intelligent, self-adjusting behaviors:

  1. Adaptive Routing Based on Real-Time Latency - Requests will route to the region, shard, or instance with the lowest real-time tail latency.
  2. AI-Assisted Prediction - Models will predict cache misses, traffic spikes, and dependency degradation - enabling preemptive optimizations.
  3. Predictive Cache Warming - Systems will use access patterns to warm caches minutes or seconds before high-traffic bursts begin.
  4. Edge-Native Execution - Critical logic and pre-computed views will continuously shift closer to users, making "global <50 ms" more achievable.

These shifts push systems from reactive performance tuning toward proactive performance orchestration.

The Real Insight: Architecture Is the Blueprint, Culture Is the Engine

The final and most important takeaway:

Architecture can make your system fast.
Culture is what keeps it fast.

Teams that monitor p99 as closely as correctness, that design with latency budgets, and that learn from regressions are the ones who consistently deliver instant-feeling experiences at scale.

Sustained low latency is not luck - it’s the outcome of small, disciplined decisions made across time, teams, and technology.
