It was a normal release. Nothing fancy. A small change to “clean up” a piece of code that had been annoying us for months. The pull request had two approvals, all tests were green, and we deployed it like we had done a hundred times.
Twenty minutes later, our app started returning 500 errors for every logged-in user.
What Broke (In One Line)
The change looked harmless. We had a middleware that parsed an auth token. The old code wrapped token parsing in a try/catch and fell back to an anonymous user if parsing failed.
// Old behavior
try {
  user = parseToken(token)
} catch (e) {
  user = null
}
In the “cleanup,” someone replaced it with a more “correct” approach: let errors bubble up so we could see them.
// New behavior (the one-line change)
user = parseToken(token) // throws on malformed tokens
The intention was good: we wanted to catch bad tokens early. But production is where intention meets reality.
The Chain Reaction
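We discovered that a small percentage of users still had old tokens generated by a previous auth version. Under the old code those tokens never caused trouble, because the try/catch was forgiving: it swallowed any parse failure and the request carried on as anonymous. With the try/catch gone, the same failure became an uncaught exception.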
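To make the failure concrete, here's a minimal sketch of the two middleware variants handling an old-format token. Everything in it is a stand-in: the real parser and token formats aren't shown in this post, so the `v2.` prefix, the body of `parseToken`, and the names `resolveUserOld` and `resolveUserNew` are hypothetical.

```javascript
// Hypothetical parser: accepts only "v2.<userId>" and throws on anything else.
// The real token format is not shown in this post; this is a stand-in.
function parseToken(token) {
  if (typeof token !== 'string' || !token.startsWith('v2.')) {
    throw new Error('invalid token');
  }
  return { id: token.slice(3) };
}

// Old middleware: any parse failure is swallowed and the request is anonymous.
function resolveUserOld(token) {
  try {
    return parseToken(token);
  } catch (e) {
    return null;
  }
}

// New middleware: the same failure escapes to the caller. In a typical web
// framework, an uncaught error in middleware ends up as a 500 response.
function resolveUserNew(token) {
  return parseToken(token);
}

const legacyToken = 'legacy-abc123'; // made-up example of an old-format token

console.log(resolveUserOld(legacyToken)); // null

try {
  resolveUserNew(legacyToken);
} catch (e) {
  console.log('would be a 500 in production:', e.message);
}
```

The parser is identical in both variants. The only difference is who handles the error, and that is why the diff looked so small.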
Now here’s the part that surprised us: one exception per request doesn’t just break that request. Under load, it becomes an amplifier.
- Errors increased CPU usage (stack traces, logging)
- Response times went up
- More requests piled up, causing more load
- Our autoscaler spun up new instances, which started failing too
In other words, the bug didn’t just fail; it created a feedback loop.
How We Diagnosed It
This was the play-by-play:
- We checked dashboards: error rate spiked right after deploy
- We checked logs: lots of “invalid token” errors
- We compared the release diff: the token parsing line stood out immediately
- We reproduced locally with a token from a failing user
- We rolled back to restore service (first priority: stop the bleeding)
What We Changed Afterward
We didn’t just fix the line. We changed habits.
1) Backward Compatibility Rules
If you change parsing logic for a token, cookie, or any user-provided input, you must assume older formats exist in the wild. We added a versioned token format and a compatibility parser for older versions.
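A versioned format with a compatibility fallback could look roughly like the sketch below. The formats are invented for illustration (`v2.<userId>.<scope>` for current tokens, `<userId>:<scope>` for legacy ones), as are the names `parsers` and `parseTokenCompat`. Our real formats differ, but the shape of the idea is the same.

```javascript
// Hypothetical formats, for illustration only:
//   current: "v2.<userId>.<scope>"
//   legacy:  "<userId>:<scope>"   (no version prefix)
const parsers = {
  v2(body) {
    const [id, scope] = body.split('.');
    if (!id || !scope) throw new Error('malformed v2 token');
    return { id, scope, version: 2 };
  },
  legacy(raw) {
    const [id, scope] = raw.split(':');
    if (!id || !scope) throw new Error('malformed legacy token');
    return { id, scope, version: 1 };
  },
};

function parseTokenCompat(token) {
  if (typeof token !== 'string' || token === '') {
    throw new Error('missing token');
  }
  // Versioned tokens announce their format up front.
  if (token.startsWith('v2.')) {
    return parsers.v2(token.slice(3));
  }
  // Anything without a known prefix is handed to the legacy parser.
  return parsers.legacy(token);
}

console.log(parseTokenCompat('v2.42.read'));
console.log(parseTokenCompat('42:read'));
```

Returning the detected `version` alongside the user is a deliberate choice in this sketch: counting how often version 1 still shows up tells you when the legacy parser can be retired safely.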
2) Safer Middleware
Middleware should be defensive. It sits at the edge of your system. Edge code should absorb weirdness, not amplify it.
try {
  user = parseToken(token)
} catch (e) {
  log.warn('Bad token', { reason: e.message })
  user = null
}
3) Canary Releases
We adopted a simple rule: deploy to 5% of traffic first. Watch metrics for 10 minutes. Then roll out to 100%. It adds a bit of time, but a bad release now hits a small slice of traffic instead of all of it.
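The rule is simple enough to sketch in code. This is not our actual deploy tooling, just one way the two pieces could fit together: a stable way to pick the 5%, and a decision function for the 10-minute watch. The 5% and 10 minutes come from the rule above. The hash choice, the `MAX_ERROR_RATIO` threshold, the `ERROR_RATE_FLOOR`, and all function names are assumptions for illustration.

```javascript
const CANARY_PERCENT = 5;       // from the rule above
const WATCH_MINUTES = 10;       // from the rule above
const MAX_ERROR_RATIO = 2;      // assumption: tolerate up to 2x the baseline error rate
const ERROR_RATE_FLOOR = 0.001; // assumption: avoids rolling back when baseline is ~0

// FNV-1a style hash over the string's UTF-16 code units, reduced to 0..99.
// Any stable hash works; the point is that a user always lands in the same bucket.
function bucketOf(key) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % 100;
}

// True if this user should be served by the canary release.
function inCanary(userId) {
  return bucketOf(String(userId)) < CANARY_PERCENT;
}

// Decide what to do with the canary given current metrics.
function decideRollout({ canaryErrorRate, baselineErrorRate, minutesObserved }) {
  const allowed = Math.max(baselineErrorRate * MAX_ERROR_RATIO, ERROR_RATE_FLOOR);
  // Roll back as soon as the canary is clearly worse; don't wait out the window.
  if (canaryErrorRate > allowed) return 'rollback';
  if (minutesObserved < WATCH_MINUTES) return 'wait';
  return 'promote';
}

console.log(decideRollout({ canaryErrorRate: 0.5, baselineErrorRate: 0.01, minutesObserved: 1 }));
// prints "rollback"
```

Bucketing by user rather than by request is a design choice in this sketch: the same user sees the same version throughout the canary window, which keeps their experience consistent and makes failures easier to trace.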
The Takeaway
Production bugs rarely come from complicated code. They come from assumptions. That day taught me a hard lesson: “small” changes aren’t small if they touch fundamental flows like authentication, payments, or caching. Treat those paths with respect.
