Transcript
Brendan McLoughlin: Good morning, gumshoes. Today I'm here to share a case that landed on my desk not too long ago. You might say this one has all the makings of a classic cloud mystery. It started like they always do, with a phone call. A major customer, a real big account, was reporting errors. Support tickets were piling up faster than coffee cups on an all-night stakeout. Then came those words that make every cloud detective's stomach drop, "If this doesn't get fixed by the end of the quarter, we're canceling our contract". The customer's reports were vague. They kept saying, sometimes things spin for a few seconds, and then an error page shows up. The issues were intermittent, with no clear pattern, and different users were all affected.
On our end, everything looked fine. Health checks were green, traffic patterns were normal, and our application logs showed no sign of trouble. In fact, every request was completed successfully. It was a classic mystery, a ghost in the network. If everything seemed to be working, where were these requests going? Why couldn't we see what the customer was seeing? It was as if the requests were disappearing into a black hole. Today, we're going to solve this mystery together. We'll meet the suspects, examine the evidence, and master the tools you'll need for a case like this. I promise that by the end, you'll know how to map the territory, track down elusive bugs, and most importantly, prevent future cases from haunting you. Because if there's one thing I've learned in all my years of debugging, it's that the cloud may be someone else's network, but the detective work, that's all ours.
The Fictional Greats (Holmes, Blanc, and Cupp)
First, let's talk about some of the greats. Sherlock Holmes shows us the key to solving any mystery is methodical investigation. His famous line, "When you have eliminated the impossible, whatever remains, however improbable, must be the truth", is appropriate for our cloud detective work. It teaches us to systematically eliminate components until we find the culprit. For example, if we see the same behavior, whether we go through the CDN or around it, we can remove the CDN from our suspects list. Like Holmes, we need to be intentional. No jumping to conclusions without evidence. Benoit Blanc tackles cases by breaking them down into fundamentals: means, motive, and opportunity. Let's map these to our cloud problems. Means, which component could technically cause the failure? Motive, what behaviors or situations would trigger the issue? Opportunity, which components lie along the request path where the error happened?
Finally, we have Cordelia Cupp. She excels because she understands that context, relationships, and history are essential to solving any mystery. In our cloud detective work, this translates to looking beyond the immediate error message. Context means understanding the full environment where the error occurred and asking, what recently changed in our infrastructure? Relationships refer to how components interact. Which services depend on each other and how? History reminds us to ask, what patterns have we seen before? Has this system shown related symptoms in the past? Like Cupp's methodical bird watching, cloud debugging requires patience and the opportunity to notice broader patterns.
How Complex Systems Fail
Michelle's keynote already introduced our final great, someone who isn't chasing fictional killers, but hunts down the truth about how disasters really happen. Dr. Richard Cook was an anesthesiologist, professor, and safety researcher whose 1998 treatise, "How Complex Systems Fail", reads like a detective's handbook for the modern world. Originally studying hospital safety, his insights were quickly adopted by cloud detectives who recognized these same patterns in their own work. I believe Holmes, Blanc, and Cupp would all tip their hats to Cook's approach. How Complex Systems Fail has 18 points. You should all read them if you haven't already, and read them again if you have.
Today, let's talk about four of my favorites. Let's start with Cook's observation that catastrophe requires multiple failures. Single-point failures are not enough. This reminds me of Blanc's most puzzling cases. There are always multiple suspects, all of whom have the means, motive, and opportunity. Cloud failures are similar. It's never just the load balancer. It's the load balancer failing, plus missing retry logic, plus an undocumented dependency, all converging at the worst possible moment.
Next, Cook reminds us that post-accident attribution to a root cause is fundamentally wrong. Failure blossoms from a conspiracy of small missteps. Sherlock Holmes models this instinct in Silver Blaze. At first glance, it's a simple whodunit about a missing racehorse. Holmes keeps pulling at the threads. The silent watchdog, the stable boy's late-night meal, a clandestine opiate. Only when he assembles all threads does the picture emerge. An accident born of greed, a little luck, and a drugged guard dog. Not one bad actor.
Similarly, when debugging cloud infrastructure, it's tempting to find a single culprit and call it case closed. Effective cloud detectives understand the real job begins when the obvious ends. We embrace blameless post-mortems, not because it's fashionable, but because we recognize that blaming a single component obscures the systemic issues that enabled the failure. Just as Holmes noticed the dog that didn't bark in the night, cloud detectives look for alerts that stayed silent. During a post-mortem, we don't only chart the metric that spiked. We scan the dashboards that sat suspiciously flat. Which SLOs slept through the incident? Which pager should have howled but never did? These non-events map the blind spots in our observability and show us exactly where to invest to prevent the next mystery. Remember, the most powerful tool we have in our toolkit is our curiosity. Not for finding someone to blame, but for truly understanding everything that went wrong and everything that kept it from getting worse.
I love Cook's observation that all practitioner actions are gambles. It applies perfectly to us cloud detectives. When investigating an outage, you rarely have the luxury of time and you're always placing bets with your attention. Do I check the logs first or the metrics? Should I inspect network connectivity or application code? Like Cordelia Cupp, assembling every suspect into one room to see who flinches, we're always making bets with incomplete information. The trick is to wager small and then watch closely.
For example, interrogate a single trace before pulling every log, a narrow, low-risk probe. Or toggle a feature flag before redeploying a cluster. It's usually a cheaper bet and it will give you a faster signal. Each move is a hypothesis. We place the smallest bet that we can prove or disprove. Read the table and keep enough chips on hand for the next round, because complex systems always deal another round. Cook's final gem is complex systems run in degraded mode. Somewhere in the cloud, right now, a replica is stale. A timeout is too low. A secondary index is corrupt. There are no alarms yet, but the fuse is lit. Just as a seasoned detective knows the city's crime ring doesn't vanish during a quiet shift, cloud detectives assume that latent failures are always prowling. Our first job during an incident is to ask, which of these silent defects just stepped into the spotlight? The second job is to tag the rest for follow-up before they audition for a sequel. Like our fictional greats, Cook's ideas reinforce that effective detective work, whether solving mysteries or debugging cloud systems, depends on systematic analysis, context awareness, and recognizing the interconnected nature of complex failures.
The Detective's Toolkit
All right gumshoes, every good detective knows you don't start investigating without the right tools. Documentation, request flows, runbooks, and observability are essential ingredients to solving a case. Every cloud detective needs solid documentation. Please trust me, in cases like these, the devil is always in the details.
First things first, we need to know what we're dealing with. We start by gathering our clues. C is for context. Why does this service exist? Capture its mission, the tradeoffs accepted, and the history of big design decisions. Without context, every line of code looks suspicious. L is for links. How does the data actually flow? Pay particular attention to service boundaries, where one team's scope ends and another's begins.
Mistakes often happen at the point where services meet. U is for uptime contracts. What promises do we make? SLOs, error budgets, and on-call rotations are all important for knowing when to panic and who to page. E is for edge cases. Where have things already gone wrong? Document known failure modes, timeouts, retry rules, and that one weird query that always blows up the cache. Treat these like past crime scenes. They're informative, but don't over-index on them. S is for signals. How do we know it's alive? Include links to your service's logs, metrics, traces, and alerts. These are your system's heartbeat and fingerprints. Pro tip, don't wait for an incident to start gathering these clues. Build your case file during calm periods when you have time to think clearly. Store it somewhere your entire team can access, whether it's a wiki, a README, or even a simple shared doc. When you're debugging in a crisis, you'll thank your past self for doing the detective work ahead of time.
Next is your request flow diagrams. Think of these as your victim's timeline. Where do they go? Who do they talk to? What should have happened at each step? Ideally, the request flow diagram shows the complete journey of a request, both upstream and downstream. Many teams meticulously document their downstream dependencies, databases, caches, backend services, but treat upstream components like a black box. This is a classic cloud mistake. A bug in your API gateway will break your user's experience just as thoroughly as a database failure. You may trust your cloud vendor to ship bulletproof components, but your company's fingerprints are still all over the settings and configuration. Document everything with equal care and precision. You need to know the latency expectations, timeout policies, and retry logic across the entire request path.
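Here is a minimal sketch of what a request-path map can look like when it's kept as code next to the architecture diagram. The hop names, timeout values, retry counts, and owning teams below are hypothetical placeholders, not settings from the case in this talk.

```python
# A minimal sketch of a request-path map kept as code next to the architecture
# diagram. Hop names, timeouts, retry counts, and owning teams are hypothetical
# placeholders, listed upstream to downstream.
REQUEST_PATH = [
    # (hop, timeout_seconds, retries, owner)
    ("cdn",           60, 0, "platform-team"),
    ("waf",           60, 0, "security-team"),
    ("load_balancer", 30, 0, "platform-team"),
    ("api_gateway",   29, 1, "platform-team"),
    ("k8s_ingress",   28, 0, "app-team"),
    ("application",   25, 0, "app-team"),
]

def print_request_path(path):
    """Print the journey a request takes, along with who to call at each hop."""
    for hop, timeout, retries, owner in path:
        print(f"{hop:14s} timeout={timeout:>3}s retries={retries} owner={owner}")

if __name__ == "__main__":
    print_request_path(REQUEST_PATH)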
Finally, the runbook. This is your operation team's tactical guide for handling live incidents. While documentation tells you how the system should work, the runbook tells you what to do when it doesn't. It should cover common problems, known resolution steps, and include contacts and ownership for each system. A good runbook is like a case file.
At the top, you've got your most wanted list, the common suspects that cause the most trouble. For each one, you want to document the victim statements (user reports timeouts when uploading files), the evidence (load balancer logs show 504 errors), the known associates (which systems are usually involved), and the usual solutions (check CDN upload size limits). Here's the real trick. You organize it by what your victim reports, the symptoms, not the root cause. Your 3 a.m. brain will thank you later. A bad example might be memory leak detection and resolution. A better example would be 502 gateway errors. What you'll see: users reporting errors after midnight. Common causes: memory leaks in the processing service. Quick fix: restart processing nodes. Long-term fix: memory leak tracked in JIRA-133K. Always include team contacts, who to wake up at 3 a.m., and required permissions, because there's nothing worse than lacking access during a crisis. It's also helpful to include links to past incidents, because criminals often return to the scene of a crime.
No detective goes to a crime scene without their forensic tools. When it comes to cloud debugging, we've got some powerful options. First up, observability tools. Think of these like your magnifying glass, zooming in on the details to reveal what's slowing down or breaking requests as they travel through your system. They're best for understanding the full request flow and identifying bottlenecks. In a case like ours, these tools could reveal if the requests are getting delayed at a specific component.
Unfortunately, they're usually integrated at the application level. Sometimes trace information is missing from the higher levels of the cloud network stack. Let's zoom in on a real-life detective tool, Honeycomb. Honeycomb gathers trace data from all your services using the OpenTelemetry spec. It stitches together traces from multiple components, giving you a unified picture of your entire request flow. Think of it as assembling all the pieces of evidence into a single investigation board, allowing you the opportunity to peer across boundaries. Notice these cascading bars. That's the trail of your request as it moves through each component. The length of each bar shows exactly how long each step takes.
At a glance, you can see right away if there's a particular suspect holding things up. In this example, look to the second to last bar. It's clearly the bottleneck. Like a good detective spotting a nervous suspect during questioning, Honeycomb lets you pinpoint exactly where a request is spending too much time. Remember, like any detective tool, Honeycomb only sees what you let it. If you skip instrumenting a critical piece of your infrastructure, you'll lose visibility and miss vital evidence. Good detectives never leave a stone unturned. Great cloud detectives never leave a component untraced.
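For readers who haven't instrumented a service yet, here is a minimal sketch of manual instrumentation with the OpenTelemetry Python SDK, assuming the opentelemetry-api and opentelemetry-sdk packages are installed. It exports spans to the console; pointing it at a backend such as Honeycomb would mean swapping in an OTLP exporter. The service and span names are hypothetical.

```python
# A minimal OpenTelemetry tracing sketch: spans are printed to the console.
# In production you would replace ConsoleSpanExporter with an OTLP exporter
# configured for your tracing backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("upload-service")  # hypothetical service name

def handle_upload(request_id: str):
    # One span per logical step; the nested spans become the cascading bars
    # described above, and the slowest step shows up as the longest bar.
    with tracer.start_as_current_span("handle_upload") as span:
        span.set_attribute("request.id", request_id)
        with tracer.start_as_current_span("validate_payload"):
            pass  # validation logic here
        with tracer.start_as_current_span("write_to_storage"):
            pass  # storage call here

if __name__ == "__main__":
    handle_upload("req-12345")
```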
Logs are like witness statements. They give us timestamps, error codes, request IDs, and client information. Just like in detective work, you need to know which logs to look at. System logs, network logs, load balancer logs, they all tell a different piece of the story. When combing through logs, pay attention to patterns, recurring errors, suspicious gaps in activity, or specific times when issues spike. These can help pinpoint a problematic service or an unusual time of day. Remember, gumshoes, logs can lie by omission. Sometimes the most important clue is what's missing.
Think of log analysis tools like Observe, Inc. here as your forensics lab. In this screenshot, we're tracking a single user's journey through HTTP access logs. We have every request their browser made along with timestamps over a 24-hour window. It's like reconstructing the sequence of a suspect's movements from the night of the crime. The key to success is filtering out the noise. Raw logs alone often overwhelm more than they inform. Get familiar with your log tooling. Good detectives learn their tools inside and out. Narrow down your investigation by filtering on specific user IDs, request paths, timeframes, or log levels to spot suspicious patterns. Another pro tip, visualize your log volume. Errors usually cause a spike in log count, and the human eye excels at noticing these patterns.
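Here is a minimal sketch of that kind of symptom-first filtering over newline-delimited JSON logs, using only the standard library. The field names (user_id, path, status, timestamp) and the file path are hypothetical; adapt them to whatever your log pipeline actually emits.

```python
# A minimal sketch of symptom-first log filtering over newline-delimited JSON logs.
import json
from collections import Counter
from datetime import datetime

def load_logs(path):
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def filter_logs(entries, user_id=None, path_prefix=None, min_status=400):
    """Narrow the investigation to one user, one route, and error responses."""
    for entry in entries:
        if user_id and entry.get("user_id") != user_id:
            continue
        if path_prefix and not entry.get("path", "").startswith(path_prefix):
            continue
        if entry.get("status", 0) >= min_status:
            yield entry

def errors_per_minute(entries):
    """A crude log-volume view: error counts bucketed by minute."""
    buckets = Counter()
    for entry in entries:
        minute = datetime.fromisoformat(entry["timestamp"]).strftime("%Y-%m-%d %H:%M")
        buckets[minute] += 1
    return buckets

if __name__ == "__main__":
    suspects = filter_logs(load_logs("access.log"), user_id="user-42", path_prefix="/api/upload")
    for minute, count in sorted(errors_per_minute(suspects).items()):
        print(minute, "#" * count)
```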
Most popular vendors offer simple log visualizations out of the box to make this easy for you. Finally, there's nothing like seeing the user's perspective to bring clarity to a case. User journey recorders let us watch the crime scene from a user's point of view. Tools like Sentry, LogRocket, or Fullstory record and replay user interactions on a website. They can be an invaluable source of clues because they often contain the full network logs from the user's browser. When requests disappear or errors don't show up in our own logs, these tools give us that missing perspective. We see the story as the user sees it, and that can be crucial for solving the case.
With Fullstory here, you're not just looking at logs, you see exactly what the user sees. Each interaction is recorded, including clicks, scrolls, even frustration signals like rapid clicking or repeated navigation attempts. The visual rendering helps you step into the user's shoes, helping you understand their state of mind. Are they curious, impatient, or downright furious? Pay particular attention to the network requests and status codes captured on the client. Discrepancies between what the user sees and your internal logs can be the key to breaking open a tough case. Just like a detective noticing mismatched witness testimony, these discrepancies highlight invisible problems you might not detect from your logs alone. Or worse, you might suspect these issues but lack the evidence to prove it.
The Usual Suspects
Any good detective knows that a crime doesn't happen in a vacuum. There's a whole scene to examine. In the cloud, our crime scene is vast, layered, and often under-documented. Most people picture the path as simple, from user to cloud to app. In reality, it's a complex web. A whole network of services and checkpoints, each with its own quirks, configurations, and points of failure. Let's break down this crime scene one step at a time.
To solve this case, we need to understand the exact path each request could take. Let's meet our cast of characters, the usual suspects in any cloud caper. DNS has a notorious reputation. Its job is to translate names to addresses, routing requests to the right place. If DNS isn't working correctly, your users will never find your application. Issues here can include propagation delays, misconfiguration, and cache issues. The old pro tip when dealing with any DNS change is to drop your TTL settings down to about 5 minutes roughly 48 hours before the change, so you can quickly recover from any mistakes, and only bump it back up to 24 hours once you're confident the entire system is working, not just the DNS.
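If you want to verify what TTL your records are actually advertising before a change, here is a minimal sketch using the third-party dnspython package. The hostname and the 5-minute target are illustrative.

```python
# A minimal sketch for checking the advertised TTL of a DNS record before a
# planned change. Assumes the third-party dnspython package is installed.
import dns.resolver

def check_ttl(hostname: str, record_type: str = "A", target_ttl: int = 300):
    answer = dns.resolver.resolve(hostname, record_type)
    ttl = answer.rrset.ttl  # TTL of the returned record set, in seconds
    print(f"{hostname} {record_type} TTL is {ttl}s")
    if ttl > target_ttl:
        print(f"Warning: TTL is above {target_ttl}s; lower it well before the cutover.")

if __name__ == "__main__":
    check_ttl("app.example.com")  # hypothetical hostname
```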
Next, we have the CDN. These were originally built as large caching systems, storing content closer to users by placing copies of assets in key locations worldwide. In recent years, their role has grown. They now run code directly at the edge, using services like Cloudflare Workers or Lambda. Some common cases involving the CDN include cache poisoning, when stale or incorrect content gets stored and served; regional routing mishaps, where traffic gets sent to distant edge nodes instead of nearby ones; configuration drift between edge nodes, where what works in one region might fail in another; or edge function timeouts, because code running at the edge typically has stricter limits than your application. A pro tip for CDN investigations: always check your cache control headers. I've seen many a case where misconfigured cache headers led to content being cached when it shouldn't be, or worse, not cached when it should be.
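Here is a minimal sketch of that cache-header check, assuming the Python requests package. The header names vary by vendor (Cloudflare's cf-cache-status, the more generic x-cache and age), so treat the list as a starting point, and the URL is hypothetical.

```python
# A minimal sketch for inspecting cache behavior on a response.
# Header names differ between CDN vendors; the list below is illustrative.
import requests

CACHE_HEADERS = ["cache-control", "age", "x-cache", "cf-cache-status", "expires", "vary"]

def inspect_cache(url: str):
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    for name in CACHE_HEADERS:
        if name in response.headers:  # requests header lookups are case-insensitive
            print(f"  {name}: {response.headers[name]}")

if __name__ == "__main__":
    inspect_cache("https://app.example.com/assets/logo.png")  # hypothetical URL
```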
Once a company reaches a certain size, incentives push them to install a WAF, or web application firewall, in front of all web traffic. WAFs are good at filtering requests, enforcing rules, detecting bots, and limiting rates, but sometimes legitimate requests get caught in the net. Rate limits and low-quality IP filters are some common sources of problems that might originate in the WAF, so keep an eye out if you have a particular feature that triggers a flurry of web requests, or if your customers like to use a VPN to access your website. Load balancers are responsible for distributing requests across multiple services to keep things running smoothly, but when misconfigured, they can cause uneven distribution, overloading some services while others sit idle. Misconfigured timeouts or sticky sessions are other common problems. Your application server might be completing a request, but the user still gets an error if the load balancer gets sick of waiting and closes the connection before the app is done.
As the middleman between users and your application, load balancers are perfectly positioned to cause confusion. They can make your healthy applications look broken to users, or mask real application issues by failing over to other instances.
Next, we have the API gateway, handling request routing, rate limiting, and sometimes authentication. Common issues that crop up here are routing mismatches or configuration drift, sending traffic to the wrong service. Authentication issues are also common in this layer if the API gateway and clients aren't configured correctly. What makes the API gateway a particularly tricky suspect is that it often transforms requests and responses along the way. Requests might look fine going in, but the gateway could be mangling headers, changing payload formats, or applying unexpected rate limits. When debugging, always check what the gateway is actually sending downstream versus what it received.
Next, Kubernetes ingress is another tool for routing external traffic to services running inside your Kubernetes cluster. It acts as a gatekeeper, managing how traffic flows into your cluster and directing it to the appropriate pod based on routing rules. A common issue with ingress is misconfigured routing rules. If these rules don't align perfectly with your application's expectations, requests can get lost, redirected incorrectly, or blocked entirely.
For example, a typo in a host rule might send traffic intended for your service to a dead endpoint, leading to 404 errors. Another frequent challenge is dealing with the behavior of different flavors of ingress controller. Each one has its quirks, so find out which one your company uses, because you can't always trust the default documentation. Firewalls block or allow traffic based on rules, port filtering, and network segmentation. If no requests are making it to your application, it's possible your firewall isn't allowing any traffic from outside its network. Unfortunately, I've found they don't log much. They don't explain much. They just block. Blocked ports or rule conflicts are common reasons that traffic might be getting dropped before it can access your internal network. Another tip for troubleshooting firewalls: always check both inbound and outbound rules. Experienced detectives know that many mysteries hide in overlooked outbound restrictions.
Finally, we have the application server. I'm sure you're already imagining why this is on the suspects list, so I won't bore you with its full rap sheet. Every one of these components was built by engineers like us. Each service, whether it's a load balancer, an API gateway, or a CDN, is just another software solution. Each one has its logs, metrics, and configurations, and we can debug these components like any other service. It's easy to get intimidated by cloud infrastructure and think of it as some mystical black box, but at the end of the day, we're dealing with code written by developers like us. Understanding ownership is crucial too. If we identify an issue with the load balancer, do we know which team manages it? At some point, we need to know who to call when we need backup. Just like old-fashioned detective work, building relationships is crucial.
Follow the Evidence
While detectives read fingerprints and tire tracks for clues, cloud detectives need to master their own form of evidence analysis. Fortunately, most cloud services speak a common language, HTTP. Status codes, response headers, and response bodies are our fingerprints. They tell us who last touched a request and what happened along the way. Here's the catch. Just like with forensic evidence, you need to know how to read the clues correctly. Let's start with our 400-level status codes. These are your client-side errors. Don't let that fool you, the client isn't always to blame. 400, bad request. This is the generic status code for an invalid request. Usually, it's a good sign of an issue in your application code, so checking there first is always a good place to start. Your infrastructure isn't always blameless. Some other suspects include firewalls.
Overly strict firewall rules, or misconfigurations might block or truncate requests leading to a 400. Load balancers are tuned for performance, so they often come with strict limits on header sizes or request bodies to protect themselves and make it easier to optimize. Surpassing these limits can trigger a 400, so keep an eye out for requests that have picked up too many cookies. That's an easy way for a browser to inadvertently get blocked at the load balancer.
Proxies and API gateways often include behavior that modifies or adds request headers. These can sometimes conflict with your application expectations and trigger unexpected 400s. When debugging 400 errors, make sure to check the headers closely. Look for oversized cookies or authentication tokens, as these often push headers past size limits. I like to use developer tools to grab the exact curl command for a failing 400 request. This lets me reproduce the issue and modify the headers as needed. Look at the request payload logs when they're available to spot malformed JSON or unexpected data structures. Remember, just because the client gets the blame doesn't mean the issue isn't actually server-side, especially in complex cloud environments with multiple hops along the request path.
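Here is a minimal sketch of that workflow, assuming the Python requests package: it sizes up the headers copied from a failing request and replays the request with and without cookies. The URL, the headers, and the 8 KB limit are hypothetical; check your load balancer's documented header-size limits.

```python
# A minimal sketch for investigating a 400: measure header size and replay the
# request with and without cookies to see which part tips the balance.
import requests

def debug_400(url: str, headers: dict, limit_bytes: int = 8192):
    total = sum(len(k) + len(v) for k, v in headers.items())
    print(f"total header size: {total} bytes (typical limit ~{limit_bytes})")
    for name, value in sorted(headers.items(), key=lambda kv: -len(kv[1]))[:3]:
        print(f"  biggest offender: {name} = {len(value)} bytes")
    with_all = requests.get(url, headers=headers, timeout=10)
    no_cookies = requests.get(
        url,
        headers={k: v for k, v in headers.items() if k.lower() != "cookie"},
        timeout=10,
    )
    print("with all headers:", with_all.status_code)
    print("without cookies: ", no_cookies.status_code)

if __name__ == "__main__":
    # Paste the headers from the browser's "copy as cURL" output for the failing request.
    debug_400("https://app.example.com/api/profile", {"Cookie": "session=abc; tracking=xyz"})
```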
401, unauthorized. A 401 is your system's way of saying, I don't know who you are, but the why behind the snub can be surprising. A common suspect here is the API gateway because these are often coupled with authentication logic. Make sure to check any component that's acting as a proxy. Anything that might be able to modify the upstream request before it hits your service is suspect. If something upstream isn't forwarding your authorization or cookie headers, then downstream services won't be able to authenticate the request.
Finally, don't forget to check your pocket watch. A JWT expiration claim or signed URL timestamp can fail because of that classic distributed system gremlin, clock skew. Some debugging tips. If you're using JWT tokens for authentication, grab your decoder ring and crack them open to see what's inside. jwt.io is a great resource for inspecting the contents. Make sure the JWT data that you're sending is what you expect and has the right context your service needs to authenticate the request. To check for clock skew on a failing pod or VM, run date -u and compare it to an NTP source. You're looking for a drift greater than 30 seconds, because that can usually indicate a timing issue.
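Here is a minimal, standard-library-only sketch of cracking open a JWT payload and checking its expiry claim against the local clock. It deliberately skips signature verification, which is fine for debugging but never for authentication decisions; the token it builds in the example is a throwaway placeholder.

```python
# A minimal sketch: decode a JWT payload and check its exp claim. No signature
# verification is performed, so use this only for debugging.
import base64
import json
import time

def decode_jwt_payload(token: str) -> dict:
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def check_expiry(token: str, skew_tolerance: int = 30):
    claims = decode_jwt_payload(token)
    print("claims:", {k: claims[k] for k in ("sub", "aud", "iss", "exp") if k in claims})
    exp = claims.get("exp")
    if exp is None:
        print("no exp claim; check iat/nbf and audience instead")
        return
    delta = exp - time.time()  # also compare this host's clock against an NTP source
    print(f"token expires in {delta:.0f}s")
    if abs(delta) < skew_tolerance:
        print("expiry is within the skew tolerance: suspect clock drift between services")

if __name__ == "__main__":
    # Build a throwaway unsigned token for demonstration; paste a real one when debugging.
    payload = base64.urlsafe_b64encode(
        json.dumps({"sub": "user-42", "exp": int(time.time()) + 45}).encode()
    ).rstrip(b"=").decode()
    check_expiry(f"eyJhbGciOiJub25lIn0.{payload}.")
```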
When debugging, keep an eye out for a sudden switch from a 401 status code to a 403. 401 means not authenticated, but 403 means authenticated but not allowed. The change probably means you're making progress, so remember that clue and use it to further your investigation. Nothing beats boots-on-the-ground detective work. Sometimes the only way to make progress is tracking the request slowly across each hop.
418, I'm a teapot. The 418 status code was originally created as an April Fools' Day joke in 1998 as part of RFC 2324, the Hyper Text Coffee Pot Control Protocol. Here's where it gets interesting for us detectives. What started as a joke has become a legitimate debugging tool. Modern web application firewalls have embraced this quirky code because it's so distinctive. When you see a 418, you know immediately that the WAF made a deliberate decision to block your request. It's like leaving a calling card at the scene of the crime. 429, too many requests, our rate limiting friend.
If you see a 429, it means some component, usually an API gateway or WAF, is putting the brakes on your traffic. Here's where it gets tricky for detectives. Rate limiting can happen at multiple layers. Your CDN might have one limit, your API gateway another, and your application a third. Each layer might be using a different window too: one per second, one per minute. When investigating a 429, you need to determine which layer triggered it, so check the response headers. Many services include hints like X-RateLimit headers or a Retry-After value that can help you identify the culprit.
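Here is a minimal sketch of that header check, assuming the Python requests package. Rate-limit header names differ across vendors, so the list below is illustrative rather than exhaustive, and the endpoint is hypothetical.

```python
# A minimal sketch for spotting which layer is rate limiting a request.
import requests

RATE_LIMIT_HEADERS = [
    "retry-after",
    "x-ratelimit-limit",
    "x-ratelimit-remaining",
    "x-ratelimit-reset",
    "ratelimit-limit",
    "ratelimit-remaining",
]

def inspect_rate_limit(url: str):
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    for name in RATE_LIMIT_HEADERS:
        if name in response.headers:
            print(f"  {name}: {response.headers[name]}")
    if response.status_code == 429 and "retry-after" not in response.headers:
        print("  429 with no Retry-After: suspect a layer that doesn't advertise its limits")

if __name__ == "__main__":
    inspect_rate_limit("https://api.example.com/v1/search")  # hypothetical endpoint
```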
500, internal server error. This is the all-purpose something went wrong code. Nine times out of 10, when you see a 500, check your own application first. Most of the usual cloud suspects are well behaved and robust, so they typically don't break in unexpected ways, though you can never say never. Here's the dilemma. 500s are often the least helpful status code for an investigation. Unlike a 404 that tells you exactly what's missing, or a 429 that points to rate limiting, a 500 just says, something's broken. So often, we need to get creative and look for other signals, like timing. Do these errors coincide with a deployment, a traffic spike, or some other service having issues? If timing doesn't help, look for other patterns.
Same user, same endpoint, same geography or device type. Sometimes the pattern can help you break open the case. In the absence of a stack trace, these clues can point you to the real culprit faster than the error message itself. 502, bad gateway. This usually means that something is misconfigured in your request chain. The most common issue I've seen is around HTTPS certs preventing upstream services from forwarding requests to a downstream service. CDNs, API gateways, even your application could all be to blame here, because the issue usually involves some mismatched assumptions on both sides of the request. Think of a 502 as a broken telephone game. Somewhere in your chain of proxies, load balancers, and gateways, one component is saying, I can't talk to the next guy in line. The tricky part is figuring out which link broke. Start by checking if you can bypass each layer. Can you hit your application directly? Can you go around the CDN? Each successful bypass narrows down your suspects list until you find the component that can't forward the request properly.
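Here is a minimal sketch of that bypass-each-layer bisection, assuming the Python requests package. The three URLs stand in for "through the CDN", "straight to the load balancer", and "straight to the application"; they're hypothetical names, and in practice the internal hops may need VPN access or port-forwarding.

```python
# A minimal sketch of bisecting a 502 by hitting each layer of the request path.
import requests

LAYERS = [
    ("via CDN",            "https://www.example.com/healthz"),
    ("via load balancer",  "https://lb.internal.example.com/healthz"),
    ("application direct", "http://app-pod.internal.example.com:8080/healthz"),
]

def bisect_502(path_variants):
    for label, url in path_variants:
        try:
            response = requests.get(url, timeout=10)
            print(f"{label:20s} -> {response.status_code}")
        except requests.RequestException as exc:
            print(f"{label:20s} -> failed: {exc}")

if __name__ == "__main__":
    # The first hop that turns a clean 200 into a 502 is your prime suspect.
    bisect_502(LAYERS)
```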
503, service unavailable. This code usually means that a service is overloaded or temporarily unavailable. Unlike a 500, which indicates something is broken, a 503 suggests the service is intentionally refusing a request to protect itself. This is an optimistic response from your system. It's letting us know that the same request might work in the future. Health checks and resource dashboards are the best place to start here. If you're seeing this regularly, it could be a capacity issue. Ongoing deployments and maintenance modes are other common triggers for 503s. When you find an unexpected 503, timing is your best detective tool. Is the 503 happening during a known traffic spike or seemingly at random? I've debugged cases where 503s appeared during low traffic periods, which led us to discover a memory leak that only surfaced after the application had been running for several weeks. The service wasn't overloaded by requests, but by its own gradual resource consumption.
The other thing to watch out for when debugging a 503 is faulty retry logic. If service A starts returning 503s due to high load and causes service B's retry logic to kick in, you may find service A's workload just suddenly spiked even higher. This is why good detectives know to check both ends of the request. Is the service really overloaded by legitimate traffic, or is it drowning from internal retry attempts? Circuit breakers and proper backoff strategies are the best defense against a self-inflicted denial of service, yet they're often treated as nice-to-have features, at least until the first time your retry storm takes down your system.
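Here is a minimal sketch of the kind of polite retry logic that avoids that trap: exponential backoff with jitter, respect for a Retry-After header, and a hard cap on attempts. It assumes the Python requests package, and the URL is hypothetical.

```python
# A minimal sketch of retrying 503s politely: backoff with jitter, honor
# Retry-After, and give up after a few attempts instead of hammering the service.
import random
import time
import requests

def get_with_backoff(url: str, max_attempts: int = 4, base_delay: float = 0.5):
    for attempt in range(max_attempts):
        response = requests.get(url, timeout=10)
        if response.status_code != 503:
            return response
        # Prefer the server's own hint; otherwise back off exponentially with jitter.
        retry_after = response.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            delay = int(retry_after)
        else:
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        time.sleep(delay)
    return response  # surface the 503 rather than retrying forever

if __name__ == "__main__":
    print(get_with_backoff("https://api.example.com/v1/report").status_code)
```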
504, gateway timeout. This usually means that the request timed out somewhere along the way. When you see a 504, start checking timeouts on load balancers, API gateways, and firewalls. Often, it's a case of one component's timeout expiring before another component has finished processing the request. The tricky part is that the 504 tells you where the timeout was detected, not necessarily where the slowness originated. Your load balancer might return a 504 after waiting 30 seconds, but the real culprit could be a database query that's taking 45 seconds, three services downstream. That's what makes 504s so hard to debug. The downstream service could show a clean 200 response, completely unaware that the upstream component already gave up and served a 504 to the user.
Besides these common status codes, make sure to check the docs for your CDN or cloud vendor. AWS, Cloudflare, nginx, and others all document a list of non-standard status codes they use to provide additional context and nuance for cloud detectives. Some vendors have gotten quite creative with their error codes. Cloudflare uses the entire proprietary range from 520 to 530. Instead of just getting a generic 502 bad gateway, you might see a 521, web server is down, telling you immediately that your origin is actively refusing the connection. Or a 525, SSL handshake failed, pinpointing exactly where in the TLS negotiation something went sideways. AWS's application load balancer uses the 463 status for too many IP addresses, to let you know when someone's tried to stuff more than 30 IP addresses into an X-Forwarded-For header. Azure Application Gateway goes with a minimalist approach. It will log status code 0 when a client closes the connection before the server finished processing.
The Fastly docs reserve the entire 600 through 699 range for custom developer use within their infrastructure. I like to look at each vendor's custom codes and consider what they say about its operating philosophy. Cloudflare prioritizes immediate diagnostics. AWS focuses on the challenges of multi-service integration. Azure knows what it's like to serve an impatient customer. Fastly prioritizes developer experience and experimentation. When you're investigating an incident, don't just look at the number. Make sure you reference the docs to understand what a particular vendor is trying to tell you about where things went wrong in their stack.
Beyond the status code, HTTP responses contain plenty of other interesting clues. Response headers are like breadcrumbs left along the request path. They can tell us where the request has been and give us clues about any changes made along the way. CDN headers reveal a request's journey: which edge location handled it, how it was routed, and whether it was served from cache or fetched fresh from the origin. If requests are misbehaving, check if they're being routed to an unexpected region or if a stale cache entry is causing issues.
A pro tip, CDN vendors often use a three-letter airport code to identify their edge locations. If you and all your infrastructure are on the East Coast, but you see a response served from SFO, you might want to dig deeper into your CDN's routing logic. Don't forget to keep an eye out for CDN headers on internal service calls, too. It's surprisingly easy to accidentally route what should be a short internal hop all the way out to the public internet and back.
Application headers like X-Powered-By can be useful for pinpointing an error. X-Request-ID is a common one that's useful for tracing, while X-Response-Time can be helpful for answering performance questions. Many teams also add custom headers like X-App-Version or X-Build-ID to track which deployment served the request. These can be helpful for identifying when the deployed version is different from what you expect. Pro tip: when investigating headers, don't just look at what's there, look for what's missing. If your application always serves X-Powered-By Spring, but it's missing from a request that has an error in it, that might be the clue you need to rule out your application as a suspect. The absence tells you the error happened before your application could respond, and you might need to consider other components. The response body itself can reveal the origin of the error. Standard error pages often give away the culprit. nginx and Apache include unique signatures in their HTML error documents, while API gateways often default to JSON responses instead. Load balancers like to take a Spartan approach to reporting errors, and typically only return status codes and headers. They prefer an empty body, so they can spend more time serving other requests.
For example, if you see a bare-bones 502 bad gateway with your browser's generic error page, that's likely your load balancer. If you get a nicely formatted error page with CSS, that likely came from your own application. The layout and content of the error page can be a clue that confirms the involvement of a particular infrastructure component. Building an evidence file means keeping a record of every status code, header, and response time in each problematic request. Make sure to note things that seem out of place or don't align with your mental model of how the system works. Not only does writing this down help us with our current case, but over time, it builds a pattern library that we can use in future investigations. Each clue adds to our case profile, helping us to detect anomalies faster the next time. As you collect evidence, refer back to your architecture diagram. See where requests might break down, and try to figure out what aligns with the evidence. Don't just collect the broken request. Grab examples of working requests, too, so you can spot the difference.
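Here is a minimal evidence-file sketch along those lines, assuming the Python requests package: it records status, timing, and a handful of interesting headers for a working request and a failing one, then prints the differences. The header list and URLs are illustrative.

```python
# A minimal sketch of an evidence file: capture status, latency, and selected
# headers for a good and a bad request, then diff them.
import requests

INTERESTING = ["server", "via", "x-cache", "x-request-id", "x-powered-by", "age", "content-type"]

def collect_evidence(url: str) -> dict:
    response = requests.get(url, timeout=10)
    return {
        "url": url,
        "status": response.status_code,
        "elapsed_ms": round(response.elapsed.total_seconds() * 1000),
        "headers": {h: response.headers.get(h) for h in INTERESTING},
    }

def diff_evidence(good: dict, bad: dict):
    print(f"status: {good['status']} vs {bad['status']}, "
          f"latency: {good['elapsed_ms']}ms vs {bad['elapsed_ms']}ms")
    for header in INTERESTING:
        if good["headers"].get(header) != bad["headers"].get(header):
            print(f"  {header}: {good['headers'].get(header)!r} vs {bad['headers'].get(header)!r}")

if __name__ == "__main__":
    diff_evidence(
        collect_evidence("https://app.example.com/api/orders?id=works"),
        collect_evidence("https://app.example.com/api/orders?id=fails"),
    )
```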
The Case of the Phantom Request
Gumshoes, we've gone through our toolkit, met the suspects, and learned how to collect evidence. It's time to revisit our opening case. Let's dive back into that story of the phantom request, the one that would intermittently vanish, leaving a major customer frustrated and our team stumped. First step in any investigation, map the crime scene. We trace the request journey from user to CDN, to the WAF, to the load balancer, through the API gateway, to Kubernetes ingress, and finally, to our application. Six places where the request could vanish. Six suspects in our lineup. Let's review what we knew. The customers were reporting errors. Different users were all affected without a clear pattern. Our application logs all showed success. There were no error spikes or unusual traffic patterns. I think we have enough info here to start testing hypotheses. First, let's start with some of our usual troublemakers.
First, we asked, could the requests be getting lost before they hit our servers? DNS caching, regional routing of the CDN, or SSL misconfigurations have all led to partial site outages in the recent past. The evidence didn't line up. Usually, DNS or CDN errors have a more consistent pattern and don't strike intermittently. The customer should have noticed the entire site being down, not just issues with a handful of requests. Good detective work means ruling out suspects systematically. DNS and CDN issues typically affect all users in a region or all requests to a service, not random individual requests. The intermittent, customer-specific nature of our mystery pointed elsewhere. Our WAF was our next suspect, because it has a reputation for unpredictability. Its rules, a mix of rate limits, behavioral analysis, and IP checks, actually make it a good fit for our intermittent pattern. WAFs are designed to catch suspicious behavior, so sporadic blocking would be right in character.
However, as we dug deeper, the evidence didn't support the theory. We knew our WAF returned a distinctive 418 status code when it blocked requests, so we asked our customer, are you seeing any 418 responses mixed in with your errors? After some back and forth, the customer revealed they were only seeing 504 errors. This appeared to rule out our WAF as a suspect, but it was still a valuable clue: a 504 signals a timeout somewhere along the response path. We checked our application logs again and found the customer's original requests. Timing and user IDs lined up perfectly. We originally missed them because the application was reporting a 200 status code. It assumed the requests were successful. Now we knew the requests were making it all the way to our application, but the 504 timeout told us that somewhere along the response journey, a component was giving up and cutting the connection before our application could deliver its answer.
Looking at the evidence we did have, a 200 in our logs and a 504 at the client suggested something was giving up on our application server before it could respond. Since the load balancer is directly in the middle of our network stack, it was a good place to start, because if it wasn't the culprit, it could help us narrow down the suspects to one of the services that comes higher or lower in the network stack. When we pulled the logs from the load balancer, that's when we found something interesting. A handful of requests were taking just over 30 seconds to complete, around 31 or 32 seconds. Our load balancer had gotten impatient. It had a timeout set to 30 seconds. We'd found the guilty party. Now that we knew the problem, we were ready to fix it. Here's what we did. We increased the load balancer's timeout to 45 seconds.
This gave the application enough time to complete the longer running requests. We also had to verify the timeout rules on the API gateway, Kubernetes ingress, and application to ensure they all matched up. It wouldn't do much good for our customer if we bumped the timeout one layer, only for the same issue to catch them at another hop along the journey. We also implemented a timeout monitoring alert. Now, if any requests approach the timeout threshold, we're alerted before it becomes an issue. Additionally, we documented the new timeout setting so anyone who revisits this component knows exactly what's configured and why.
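Here is a minimal sketch of the kind of timeout sanity check this incident prompted: every upstream hop should wait at least as long as the hop below it, plus some headroom. The components and the numbers are hypothetical examples, not the actual production settings from this case.

```python
# A minimal sketch of a timeout budget check across the request path.
# Values are hypothetical; the hops are listed upstream to downstream.
TIMEOUTS_SECONDS = {
    "load_balancer": 45,
    "api_gateway": 40,
    "k8s_ingress": 38,
    "application": 35,
}

def check_timeout_budget(timeouts: dict, headroom: float = 2.0):
    hops = list(timeouts.items())  # relies on insertion order: upstream -> downstream
    for (upstream, up_t), (downstream, down_t) in zip(hops, hops[1:]):
        if up_t < down_t + headroom:
            print(f"WARN: {upstream} ({up_t}s) may give up before {downstream} ({down_t}s) finishes")
        else:
            print(f"OK:   {upstream} ({up_t}s) outlasts {downstream} ({down_t}s)")

if __name__ == "__main__":
    check_timeout_budget(TIMEOUTS_SECONDS)
```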
Lessons Learned
Some lessons learned. Every detective needs good case files. It would have taken us forever to discover the issue if we didn't have a request diagram helping us understand where to look. Now we know timeouts can be an issue. Updating the runbooks with timeout settings will save us and our colleagues future headaches. Evidence collection is everything. We couldn't accuse any suspect until we had proof of what was going wrong. Distributed tracing and consistent logging at each layer are your best tools for gathering the evidence you need. The real culprits always hide in plain sight. Most cloud networking bugs happen in the gaps between services, the integration points where assumptions break down. Each component here was working correctly in isolation, but they weren't in sync with each other.
Conclusion
Gumshoes, we've come a long way. We've unpacked the tools, met the suspects, and walked step by step through solving a real-life case. By now, you know how to map the layers of the cloud, gather evidence, and track down those elusive bugs. Just as important, you've seen how a small misconfiguration, like a timeout setting, can create a big headache. If you remember anything about this talk, I hope you take away the following tools for your future cloud mysteries. Start with evidence, not assumptions. Gather facts first, document your observations, and let the evidence guide your investigation. Document everything. Keep your architecture diagrams current, and update those runbooks after every incident. Build relationships with system owners. The cloud is too big for any one person, so get to know your infrastructure teams. Share your findings with them and learn from their expertise. Remember, every system was built by engineers. All systems are code, all code is bugs, all services have owners, and all problems are solvable. My name is Brendan McLoughlin. I'm a cloud detective at CarGurus.
Questions and Answers
Participant 1: How does a detective convince the people hiring him to spend that much money on cloud detection tools?
Brendan McLoughlin: I think it's mostly about pointing to the pain. If you do have a lot of incidents you need to investigate, I think it's about figuring out which detective tool might provide the most value, and starting there.
Participant 2: Have you had the company fund the time for so many people to spend the money for you?
Brendan McLoughlin: Luckily, I have not had to do as much of that, but there are some tools out there that are pretty expensive that I haven't had the chance to use.
Participant 3: Speaking of tools, do you have any preferences in terms of telemetry tools or frameworks? Does something like OpenTelemetry work really well for you, or is there something else that's better?
Brendan McLoughlin: OpenTelemetry is great. Honeycomb is a great product. I use that all the time for visualizing the OpenTelemetry traces and understanding what's gone wrong.
Participant 4: You mentioned the runbooks. By the time those incidents might accumulate in multiple thousands, probably, of runbooks. What is the approach that you took to make it easy to search at the time of the incident, for example, that impacts customers? Is there an approach, from experience, to make it easier to find?
Brendan McLoughlin: What's the approach for making the runbook accessible during an incident? The runbook is always a work in progress in my experience, and I like to have a giant wiki page or a giant document, so you can Ctrl-F through the document for common signs. Trying to organize it with what is most frequent at the top is the best practice, but I think, in reality, you end up appending stuff at the end as it comes up. Definitely, as I said in the talk, listing it based on the symptom, what someone is going to report, what error is going to show up in the status, rather than the root cause that you've already figured out, is the best approach. That's going to help the person in their first week on call who doesn't know the system very well when, suddenly, some alert pops up and they don't know what to do.
Participant 4: For example, you have 503. Those notes are, for example, cases that happened and you had to investigate them, or each time something happened, for example, 503 has a runbook 504. Is that the structure that you took for a runbook?
Brendan McLoughlin: Yes. Usually, you have a runbook per service, and so for a 503, there's probably something in the load balancer, and so you might have a section on there that tells you, when you see this error, what to go look for. Usually that means some other service it's talking to is not working, so here are the links to where you'd go to find out: are those services up, are they healthy, and so on. It might be based on the status code, or it might just be some other issue that had happened in the past, or it's like, the customer always calls and complains about this spinner. Here's the runbook case for that.
Participant 5: I saw in your runbook you have also a Jira item. Based on your experience, what happens when the Jira items or DevOps items stay in the runbook for a year or two?
Brendan McLoughlin: That happens more often than it should. Usually, it's because of an issue like a memory leak. As long as you're continuously deploying the service, no one notices the pain from it. I think it's useful to have the links to the Jira in the runbook, so when the incident does eventually come back up, you can put a comment on the Jira and be like, it happened this date, it happened this date, and at some point, someone's going to come across this Jira and see, it's got 30 comments about this issue happening in the last month. It's definitely time to fix it.