
Anthropic Traces Six Weeks of Claude Code Quality Complaints to Three Overlapping Product Changes


Anthropic has published an engineering postmortem explaining six weeks of user complaints about Claude Code's quality.

Users reported wildly different symptoms depending on when they used Claude Code and which features they relied on. The reason: three unrelated product-layer changes shipped between March and April 2026, each affecting a different slice of traffic on its own schedule. The API and underlying model weights were not affected. All three issues have been resolved as of April 20 (v2.1.116), and Anthropic has reset usage limits for all subscribers.

The first change was a reasoning effort downgrade. On March 4, Anthropic switched Claude Code's default reasoning effort from high to medium to address UI latency issues where the interface appeared frozen during long thinking periods. The company acknowledged this was "the wrong tradeoff." Users reported that Claude Code felt less intelligent, and despite UI changes to make the effort setting more visible, most users retained the medium default. The change was reverted on April 7. All models now default to high or xhigh.
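Teams that want to insulate themselves from this kind of default change can pin the thinking budget explicitly at the API layer rather than inherit a product default. Below is a minimal sketch using the Messages API's extended-thinking parameter; the effort-to-budget mapping and the model identifier are illustrative assumptions, not values from the postmortem.

```python
# A minimal sketch: pin the thinking budget explicitly instead of
# relying on the product default. thinking/budget_tokens is part of the
# Messages API; the effort-to-budget mapping and model id are assumed.
import anthropic

EFFORT_BUDGETS = {"medium": 4_096, "high": 16_384}  # illustrative mapping

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",   # hypothetical model id
    max_tokens=32_000,         # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": EFFORT_BUDGETS["high"]},
    messages=[{"role": "user", "content": "Plan the refactor before editing."}],
)
for block in response.content:
    if block.type == "text":
        print(block.text)
```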

The second was a caching bug that progressively erased the model's own reasoning. On March 26, Anthropic shipped an optimization to clear old thinking sections from sessions idle for over an hour, since those sessions would be a full cache miss anyway. A bug caused the clearing to fire on every turn for the rest of the session instead of just once. Claude would continue executing, but increasingly without memory of why it had chosen its current approach.
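The shape of the bug is easy to reconstruct. The sketch below is illustrative, not Anthropic's code: the correct behavior prunes thinking blocks exactly once when a stale session resumes, while the buggy path never resets the staleness flag, so every subsequent turn prunes again.

```python
# A minimal reconstruction of the bug's shape (illustrative, not
# Anthropic's code): pruning was meant to run once when an idle session
# resumed, but a staleness flag was never reset, so it ran every turn.
import time

IDLE_THRESHOLD_S = 3600  # sessions idle over an hour are a full cache miss

class Session:
    def __init__(self):
        self.turns = []
        self.last_active = time.time()
        self.stale = False

def strip_thinking(turns):
    """Drop old thinking blocks, i.e. the record of why the model
    chose its current approach."""
    return [t for t in turns if t.get("type") != "thinking"]

def on_new_turn(session, turn, buggy=False):
    now = time.time()
    if now - session.last_active > IDLE_THRESHOLD_S:
        session.stale = True
    if session.stale:
        session.turns = strip_thinking(session.turns)
        if not buggy:
            session.stale = False  # fixed version: prune exactly once
        # buggy version: the flag stays set, so every later turn prunes
        # again, progressively erasing the session's reasoning history
    session.turns.append(turn)
    session.last_active = now
```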

Boris Cherny from the Claude Code team explained on Hacker News that, in an extreme case, a user with 900K tokens in context who idled for an hour would face a full cache miss on the next message, consuming a significant share of their rate limit, an acute problem for Pro subscribers. The optimization meant to reduce that cost is what introduced the bug. It was fixed on April 10.
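The economics are visible with rough numbers. The back-of-envelope sketch below assumes a base input price and the roughly 10x discount applied to prompt-cache reads; the dollar figures are illustrative, not official pricing.

```python
# Back-of-envelope cost of one turn on a 900K-token context, with an
# assumed base input price and a ~10x prompt-cache read discount
# (illustrative numbers, not official pricing).
CONTEXT_TOKENS = 900_000
INPUT_PRICE_PER_MTOK = 5.00        # assumed $/million input tokens
CACHE_READ_MULTIPLIER = 0.1        # cache reads at ~10% of base price

full_miss = CONTEXT_TOKENS / 1e6 * INPUT_PRICE_PER_MTOK
cache_hit = full_miss * CACHE_READ_MULTIPLIER
print(f"cache hit: ${cache_hit:.2f}/turn, full miss: ${full_miss:.2f}/turn "
      f"({full_miss / cache_hit:.0f}x)")
```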

(The caching bug progressively cleared Claude's reasoning history on every turn instead of just once after an idle period. Source: Anthropic blog post)

The third was a system prompt change shipped alongside Opus 4.7 on April 16. Anthropic added a verbosity limit instructing the model to "keep text between tool calls to 25 words or less" and "keep final responses to 100 words or less." After weeks of internal testing with no regressions, they shipped it. Broader ablations run during the investigation revealed a 3% quality drop for both Opus 4.6 and 4.7. It was reverted on April 20.
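The "broader ablations" the postmortem describes amount to running the same eval suite with and without the added instruction and comparing pass rates. The toy harness below stubs out the grader and simulates the reported three-point drop; the task list, prompts, and grader are all placeholders.

```python
# A toy ablation harness: same suite, two system prompts, compare pass
# rates. The tasks, base prompt, and grader are placeholders; the stub
# simulates the ~3-point drop the postmortem reports.
import random

VERBOSITY_LIMIT = (
    "Keep text between tool calls to 25 words or less. "
    "Keep final responses to 100 words or less."
)
BASE_PROMPT = "You are a coding agent."          # placeholder
TASKS = [f"task-{i}" for i in range(2_000)]      # placeholder eval tasks

def run_task(system_prompt: str, task: str) -> bool:
    # Stand-in for running the agent and grading the result; simulates
    # an 80% baseline pass rate dropping to 77% under the limit.
    p = 0.77 if VERBOSITY_LIMIT in system_prompt else 0.80
    return random.random() < p

def run_suite(system_prompt: str) -> float:
    return sum(run_task(system_prompt, t) for t in TASKS) / len(TASKS)

ablated_prompt = BASE_PROMPT + "\n" + VERBOSITY_LIMIT
print(f"baseline: {run_suite(BASE_PROMPT):.3f}")
print(f"ablated:  {run_suite(ablated_prompt):.3f}")
```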

The investigation surfaced an interesting finding about AI-assisted debugging. Anthropic back-tested its Code Review tool against the offending pull requests. Opus 4.7 found the caching bug when provided with sufficient repository context. Opus 4.6 did not. The company is now adding support for additional repositories as context in Code Review to prevent similar issues.

The Hacker News thread ran predictably hot. Some commenters credited Anthropic for publishing a detailed postmortem at all. Others were less generous about the underlying motivations. One commenter questioned whether latency reduction was the real reason for pruning idle session context:

One of the following must be true: they actually believed latency reduction was worth compromising output quality for sessions that have already been long idle, or what I suspect actually happened: they wanted to cost-reduce idle sessions to the bare minimum, and "latency" is a convenient excuse.

Another pointed out a structural problem with how system prompt changes are communicated:

Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive. At least tell users when the system prompt has changed.

The Fortune coverage noted that some users felt "gaslit" by Anthropic's initial response, which implied nothing was wrong. Anthropic told Fortune that "compute is a constraint across the entire industry" and that it is scaling capacity through expanded partnerships with Amazon and Google.

On Reddit, users flagged an issue the postmortem does not address: sub-agent delegation to Haiku. Claude Code delegates tasks to the cheaper Haiku model more often than users expect, visible only in verbose logging. One commenter highlighted the risk for automated workflows:

In interactive use, quality drops are obvious. You can course-correct. In automated pipelines they're silent until 3 tasks downstream. Much harder to catch.

For teams running Claude Code in CI or automated workflows, silent delegation to a smaller model represents a different class of quality risk than the three issues Anthropic identified.
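One cheap mitigation is to make delegation visible and enforceable: audit the run transcript and fail the pipeline when a turn was handled by an unexpected model. The sketch below assumes a JSONL transcript in which each record carries a model field; the path and schema are assumptions, not a documented Claude Code format.

```python
# A defensive CI check: fail the build if any turn in a (hypothetical)
# JSONL transcript was handled by an unexpected model. The transcript
# schema here is an assumption, not a documented format.
import json
import sys

EXPECTED_PREFIX = "claude-opus"  # the model the pipeline was tuned for

def audit(transcript_path: str) -> int:
    unexpected = []
    with open(transcript_path) as f:
        for line in f:
            record = json.loads(line)
            model = record.get("model", "")
            if model and not model.startswith(EXPECTED_PREFIX):
                unexpected.append(model)
    if unexpected:
        print(f"unexpected models handled turns: {sorted(set(unexpected))}")
        return 1  # non-zero exit fails the CI step
    return 0

if __name__ == "__main__":
    sys.exit(audit(sys.argv[1]))
```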

Another Reddit user independently confirmed that the verbosity instructions remained in the system prompt even after Anthropic's announced revert. The user built a pre-tool hook script that fires before every Write, Edit, and Agent tool call, countering five specific failure modes. As the user explained:

The model's system prompt plus training reward produce these five failure modes: "short and concise" skips re-reading source material, "don't narrate deliberation" skips the verification narration that would expose the mismatch, no structural re-read gate before producing derivative output, the Read tool's "do NOT re-read" guidance is exactly wrong for this failure mode, and training reinforcement for "plausible plus confident" over "matches the source."

That the workaround improved output quality is consistent with the postmortem's diagnosis of the verbosity limit as a quality-degrading change.
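The commenter's actual script is not published, but Claude Code's hooks mechanism supports the pattern: a PreToolUse hook receives a JSON payload on stdin, and exiting with code 2 blocks the tool call while feeding stderr back to the model. The minimal sketch below nudges one re-read per file; the state-file location and reminder wording are assumptions.

```python
#!/usr/bin/env python3
# A minimal PreToolUse hook sketch in the spirit of the workaround
# described above (the commenter's real script is not published).
# Claude Code pipes a JSON payload to the hook on stdin; exit code 2
# blocks the call and surfaces stderr to the model. The state-file
# path and reminder wording are assumptions.
import json
import sys
from pathlib import Path

STATE = Path("/tmp/reread-reminders.json")  # hypothetical scratch state

payload = json.load(sys.stdin)
if payload.get("tool_name") not in ("Write", "Edit"):
    sys.exit(0)

target = payload.get("tool_input", {}).get("file_path", "")
seen = json.loads(STATE.read_text()) if STATE.exists() else {}

if target and not seen.get(target):
    seen[target] = True
    STATE.write_text(json.dumps(seen))
    # Block once per file; the model retries after the instruction,
    # countering the "short and concise skips re-reading" failure mode.
    print(
        f"Re-read the source material relevant to {target} before writing, "
        "and narrate what you verified.",
        file=sys.stderr,
    )
    sys.exit(2)

sys.exit(0)
```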

The broader engineering lesson applies to any team shipping product-layer changes around AI models. Anthropic's internal evals and dogfooding failed to catch any of the three issues because internal staff were using different builds, the caching bug only manifested in a specific state (stale sessions), and the eval suite was too narrow to detect a 3% quality drop from prompt changes. Going forward, Anthropic says it will require more internal staff to use exact public builds, run broader per-model eval suites for every system prompt change, add soak periods and gradual rollouts, and version system prompt changes more carefully.
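The eval-suite point is ultimately statistical: distinguishing, say, an 80% pass rate from 77% requires far more samples than a typical regression suite contains. A quick normal-approximation power calculation makes the scale concrete (the baseline pass rate here is an assumption):

```python
# Sample size to tell an 80% pass rate from 77% (assumed rates) at a
# 5% two-sided significance level with 80% power, via the standard
# normal-approximation formula for comparing two proportions.
from math import ceil, sqrt

p1, p2 = 0.80, 0.77
z_alpha, z_beta = 1.96, 0.84
p_bar = (p1 + p2) / 2

n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
      + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2)) ** 2
print(ceil(n), "tasks per arm")  # ~2,900: far beyond a typical smoke suite
```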

Stella Laurenzo published an independent audit on GitHub, analyzing 6,852 Claude Code session files, 17,871 thinking blocks, and 234,760 tool calls. Her findings showed Claude shifting from research-first behavior toward edit-first behavior, with reasoning depth declining measurably. While not every claim in her report is verified by the postmortem, the reported symptoms align closely with the three causes Anthropic identified.
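Reproducing an audit like Laurenzo's is straightforward once the session files can be parsed. The sketch below assumes JSONL records whose message content is a list of typed blocks such as thinking and tool_use; the on-disk location and schema are internal details and may differ.

```python
# Sketch of a session audit: count thinking blocks and tool calls
# across JSONL session files. The directory and record schema are
# assumptions about internal formats and may differ.
import json
from pathlib import Path

def audit_sessions(root: str):
    sessions = thinking = tool_calls = 0
    for path in Path(root).expanduser().rglob("*.jsonl"):
        sessions += 1
        for line in path.read_text().splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue
            message = record.get("message")
            content = message.get("content") if isinstance(message, dict) else None
            for block in content if isinstance(content, list) else []:
                if not isinstance(block, dict):
                    continue
                if block.get("type") == "thinking":
                    thinking += 1
                elif block.get("type") == "tool_use":
                    tool_calls += 1
    return sessions, thinking, tool_calls

print(audit_sessions("~/.claude/projects"))  # assumed location
```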
