The Day a One-Line Bug Took Down Our App (And What It Taught Me) – HackemTU

It was a normal release. Nothing fancy. A small change to “clean up” a piece of code that had been annoying us for months. The pull request had two approvals, all tests were green, and we deployed it like we had done a hundred times.

Twenty minutes later, our app started returning 500 errors for every logged-in user.

What Broke (In One Line)

The change looked harmless. We had a middleware that parsed an auth token. The old code wrapped token parsing in a try/catch and treated the request as anonymous (a null user) if parsing failed.

// Old behavior
try {
  user = parseToken(token)
} catch (e) {
  user = null
}

In the “cleanup,” someone replaced it with a more “correct” approach: let errors bubble up so we could see them.

// New behavior (the one-line change)
user = parseToken(token)  // throws on malformed tokens

The intention was good: we wanted to catch bad tokens early. But production is where intention meets reality.

The Chain Reaction

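We discovered that a small percentage of users still held old tokens generated by a previous version of our auth system. parseToken had always rejected those tokens, but the old try/catch swallowed the error and quietly treated those requests as anonymous (that was the "forgiving" part). With the catch gone, every one of those requests raised an unhandled exception and came back as a 500.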

Now here’s the part that surprised us: one exception per request doesn’t just break that request. Under load, it becomes an amplifier.

Errors increased CPU usage (stack traces, logging)
Response times went up
More requests piled up, causing more load
Our autoscaler spun up new instances, which started failing too

In other words, the bug didn’t just fail; it created a feedback loop.

How We Diagnosed It

This was the play-by-play:

We checked dashboards: error rate spiked right after deploy
We checked logs: lots of “invalid token” errors
We compared the release diff against the previous version: the token parsing line stood out immediately
We reproduced locally with a token from a failing user
We rolled back to restore service (first priority: stop the bleeding)
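The reproduction step can be turned into a small repeatable check. The sketch below is illustrative only: `parseToken` is a stand-in that accepts just a made-up `v2.` prefix, the sample tokens are invented, and `findCrashingTokens` is a hypothetical helper name. It runs a list of captured tokens through both variants of the code and reports which ones make a variant throw.

```javascript
// Stand-in for the real parser (assumption): only "v2." tokens are accepted.
function parseToken(token) {
  if (!token.startsWith('v2.')) throw new Error('invalid token');
  return { userId: token.slice(3) };
}

// The two variants from the release diff.
function resolveUserOld(token) {
  try {
    return parseToken(token);
  } catch (e) {
    return null; // anonymous
  }
}

function resolveUserNew(token) {
  return parseToken(token); // throws on malformed tokens
}

// Returns the sample tokens that make `resolveUser` throw.
function findCrashingTokens(resolveUser, samples) {
  return samples.filter((token) => {
    try {
      resolveUser(token);
      return false;
    } catch (e) {
      return true;
    }
  });
}

// Invented samples: one current token, one in a legacy shape.
const samples = ['v2.alice', 'legacy-bob-2019'];

console.log(findCrashingTokens(resolveUserOld, samples)); // []
console.log(findCrashingTokens(resolveUserNew, samples)); // [ 'legacy-bob-2019' ]
```

Running a check like this in CI against a handful of real (sanitized) legacy tokens would probably have flagged the change before it shipped.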

What We Changed Afterward

We didn’t just fix the line. We changed habits.

1) Backward Compatibility Rules

If you change parsing logic for a token, cookie, or any user-provided input, you must assume older formats exist in the wild. We added a versioned token format and a compatibility parser for older versions.
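As a rough sketch of what a versioned format plus a compatibility parser could look like: the token shapes below are invented for illustration and are not our real formats. Here "v2" is a `v2.` prefix followed by a base64url JSON payload, and "v1" is an unprefixed `userId:expiry` string. Signature verification is deliberately left out to keep the example short, and `parseTokenCompat`, `parseV1`, and `parseV2` are hypothetical names.

```javascript
// Hypothetical v2 format: "v2." + base64url(JSON with userId and exp).
function parseV2(body) {
  const payload = JSON.parse(Buffer.from(body, 'base64url').toString('utf8'));
  if (!payload || typeof payload.userId !== 'string' || typeof payload.exp !== 'number') {
    throw new Error('v2 token is missing userId or exp');
  }
  return { userId: payload.userId, exp: payload.exp, version: 2 };
}

// Hypothetical legacy format: "<userId>:<expiry in epoch seconds>".
function parseV1(token) {
  const match = /^([A-Za-z0-9_-]+):(\d+)$/.exec(token);
  if (!match) throw new Error('unrecognized token format');
  return { userId: match[1], exp: Number(match[2]), version: 1 };
}

// Dispatch on the version prefix; anything unprefixed is tried as legacy v1.
function parseTokenCompat(token) {
  if (typeof token !== 'string' || token.length === 0) {
    throw new Error('empty token');
  }
  if (token.startsWith('v2.')) return parseV2(token.slice(3));
  return parseV1(token);
}
```

The explicit version prefix is the important part: once every new token says which format it uses, the legacy branch can be retired deliberately instead of by accident.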

2) Safer Middleware

Middleware should be defensive. It sits at the edge of your system. Edge code should absorb weirdness, not amplify it.

try {
  user = parseToken(token)
} catch (e) {
  log.warn('Bad token', { reason: e.message })
  user = null
}
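For context, here is one way that snippet might sit inside a complete middleware function. This is a sketch under assumptions: it uses the common `(req, res, next)` shape without depending on any particular framework, it assumes the token arrives in an `Authorization: Bearer ...` header, and `makeAuthMiddleware` is a hypothetical name. The parser and logger are passed in so the example runs on its own.

```javascript
// Builds an auth middleware around an injected parser and logger.
function makeAuthMiddleware(parseToken, log) {
  return function authMiddleware(req, res, next) {
    const header = (req.headers && req.headers.authorization) || '';
    const token = header.startsWith('Bearer ') ? header.slice(7) : null;

    req.user = null; // default: anonymous
    if (token) {
      try {
        req.user = parseToken(token);
      } catch (e) {
        // Absorb the bad input, but leave a trace so it shows up on dashboards.
        log.warn('Bad token', { reason: e.message });
      }
    }
    next(); // a bad token never aborts the request here
  };
}
```

Routes that require a logged-in user can then reject `req.user === null` with a 401, which is a far better failure mode than a 500.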

3) Canary Releases

We adopted a simple rule: deploy to 5% of traffic first. Watch error rate and latency for 10 minutes. Then roll out to 100%. It adds a bit of time, but a bad release now hurts a small slice of users instead of all of them.
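In practice the load balancer or deploy tool usually handles the traffic split, but the logic behind the rule is small enough to sketch. The code below is an illustration, not our actual tooling: `bucketFor`, `inCanary`, and `canPromote` are hypothetical names, and the `minRequests` and `maxErrorRateDelta` defaults are assumed values. Only the 5% figure comes from the rule above.

```javascript
// Stable hash so the same user always lands in the same bucket (0..99).
function bucketFor(userId) {
  let hash = 0;
  for (const ch of String(userId)) {
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // keep it an unsigned 32-bit int
  }
  return hash % 100;
}

// True if this user should be served by the canary release.
function inCanary(userId, percent = 5) {
  return bucketFor(userId) < percent;
}

// Decide whether to promote, given request/error counts from the watch window.
function canPromote(canary, baseline, opts = {}) {
  const { minRequests = 500, maxErrorRateDelta = 0.01 } = opts;
  if (canary.requests < minRequests) return false; // not enough signal yet
  const canaryRate = canary.errors / canary.requests;
  const baselineRate = baseline.requests > 0 ? baseline.errors / baseline.requests : 0;
  return canaryRate - baselineRate <= maxErrorRateDelta;
}
```

Comparing the canary against the baseline, rather than against a fixed threshold, keeps the gate meaningful even when the whole system is having a noisy day.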

The Takeaway

In my experience, production bugs rarely come from complicated code. They come from assumptions. That day taught me a hard lesson: “small” changes aren’t small if they touch fundamental flows like authentication, payments, or caching. Treat those paths with respect.