The rollback button is part of the feature, not a step after

I used to treat rollback as something to figure out after shipping. One Sunday morning at Remolution taught me that if you don't think about it beforehand, you won't have time to think about it once something breaks.

The Sunday morning at Remolution

Late 2023. I'd just shipped a small feature at Remolution — an update to how candidate data was mapped between a customer's system and their hiring pipeline. Code review passed, staging was green, I deployed on a Friday afternoon because "it's just tweaking how a few fields display." I closed the laptop and went out for hotpot with my wife.

Sunday morning, an enterprise customer in Singapore opened a high-priority ticket. Their candidate names were being matched to the wrong profiles — not a display bug, but real data overwritten in the DB by a background job I'd forgotten to flag. Several hundred records had been scrambled. I opened the laptop, hands cold, and went looking for the rollback button.

There was no rollback button.

There was git revert for the code. There were nightly DB snapshots — but the snapshots also contained new data the customer had entered after the bug ran, so restoring straight from snapshot meant erasing their entire weekend of recruiting work. What I needed was a way to reverse only the part the job had written wrong. I didn't have that. I had to write a SQL script on the spot, reconcile each row against the audit log, then call the customer to explain why they should wait another four hours.

Those four hours were four hours I should have spent two weeks earlier — when I was writing the feature.

What I had wrong about rollback

Before that day, rollback was a postscript in my head: something to "figure out if there's an incident," like backups or monitoring. Something outside the feature, owned by the platform.

Wrong. Rollback is a path the feature carves into the data. When I write a migration that adds a column, the rollback path is dropping the column. When I write a job that overwrites a record, the rollback path is the ability to restore the old value — which means the job has to save the old value somewhere before overwriting it. When I send an email, the rollback path is... nothing, because the email is gone — so the feature needs a stopping gate before sending, not after.
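To make "save the old value before overwriting it" concrete, here is a minimal sketch using SQLite. The table names (candidates, candidate_name_history), the job_run_id label, and the function names are all hypothetical, chosen for illustration; none of them come from the Remolution codebase.

```python
import sqlite3

def create_schema(conn):
    # Hypothetical tables: the live data plus a history of what a job overwrote.
    conn.executescript("""
        CREATE TABLE candidates (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE candidate_name_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            job_run_id TEXT NOT NULL,
            candidate_id INTEGER NOT NULL,
            old_name TEXT NOT NULL,
            new_name TEXT NOT NULL
        );
    """)

def overwrite_name(conn, job_run_id, candidate_id, new_name):
    """Overwrite a name, recording the old value in the same transaction."""
    with conn:  # both writes commit together, or neither does
        row = conn.execute(
            "SELECT name FROM candidates WHERE id = ?", (candidate_id,)
        ).fetchone()
        if row is None:
            return
        conn.execute(
            "INSERT INTO candidate_name_history "
            "(job_run_id, candidate_id, old_name, new_name) VALUES (?, ?, ?, ?)",
            (job_run_id, candidate_id, row[0], new_name),
        )
        conn.execute(
            "UPDATE candidates SET name = ? WHERE id = ?", (new_name, candidate_id)
        )

def undo_job_run(conn, job_run_id):
    """Restore rows written by one job run; return how many were restored.

    A row is restored only if it still holds the value the job wrote,
    so edits made after the job ran are left untouched.
    """
    restored = 0
    with conn:
        history = conn.execute(
            "SELECT candidate_id, old_name, new_name FROM candidate_name_history "
            "WHERE job_run_id = ? ORDER BY id DESC",  # newest write first
            (job_run_id,),
        ).fetchall()
        for candidate_id, old_name, new_name in history:
            cur = conn.execute(
                "UPDATE candidates SET name = ? WHERE id = ? AND name = ?",
                (old_name, candidate_id, new_name),
            )
            restored += cur.rowcount
    return restored
```

The design choice worth noting is the extra `AND name = ?` condition in the undo: it reverses only what the job wrote and skips any row someone has changed since, which is the property a full snapshot restore lacks.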

If I don't design that rollback path while writing the feature, it doesn't grow there on its own later. It has to be built from zero, under time pressure, with a customer waiting on the phone.

Rollback isn't something I do after a feature breaks. It's something I either did or didn't do while writing the feature.

How I changed the way I write specs

After that incident, I added a fixed section to every spec I write — even one-line specs for small tasks. The section is called The way back, and it always sits before the happy path, never after.

I learned that if the "what if it breaks" part sits at the end of a spec, it gets cut when the deadline tightens. Putting it first means I can't pretend a feature is done while that part is still blank.

The way back answers three concrete questions:

  • Where does this feature write? (Which DB row, which file, which external API, which queue)
  • If one write is wrong, how do I reverse it? (Existing script, undo endpoint, soft-delete flag, audit log rich enough to reconstruct from)
  • How long do I have to reverse it? (Five minutes after merge, or three days after a customer notices)

If any of those three doesn't have a concrete answer, the feature isn't done — even with green tests and a pixel-perfect Figma match.

Rollback isn't just git revert

The second thing I came to understand: git revert reverses code, not the state the code created in the real world. After a feature has run in production for a few hours, the real world no longer looks like it did when you deployed. New data has been written with the new schema. Emails have gone out. Webhooks have been received by third-party systems.

I started splitting rollback into three layers when reviewing my own code:

Layer 1 — Code

The easiest layer. git revert, or redeploy the previous version. If the code is written to be backward-compatible (a new column is nullable or has a default, a feature flag wraps the new branch), this takes under a minute.
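As a rough sketch of a feature flag wrapping the new branch: the flag name, the FEATURE_ environment variable convention, and both mapping functions below are hypothetical stand-ins, not any real flag library's API.

```python
import os

def flag_enabled(name, env=os.environ):
    """Hypothetical flag lookup: reads FEATURE_<NAME> from the environment."""
    value = env.get(f"FEATURE_{name.upper()}", "off")
    return value.lower() in {"1", "true", "on"}

def map_candidate_legacy(record):
    # The behavior that was already in production.
    return {"display_name": record["name"]}

def map_candidate_v2(record):
    # The new behavior being rolled out.
    return {"display_name": record["name"].strip().title()}

def map_candidate(record, env=os.environ):
    """Route to the new mapping only while the flag is on."""
    if flag_enabled("new_mapping", env):
        return map_candidate_v2(record)
    return map_candidate_legacy(record)
```

With the flag unset, `map_candidate` keeps returning the legacy result, so switching the new code path off is a configuration change rather than a redeploy. It does nothing for data the new branch has already written.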

Layer 2 — Data

This is the Sunday-at-Remolution layer. Code can be reverted, but what the code wrote into the DB is still there. For this layer to be tractable, you need at least one of: a real audit log captured before the overwrite, soft-delete instead of hard-delete, or a migration shipped with an actual rollback script (not a // TODO: rollback comment).
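Of those options, soft-delete is the simplest to illustrate. The sketch below assumes a hypothetical profiles table with a deleted_at column; the schema and function names are invented for the example.

```python
import sqlite3
from datetime import datetime, timezone

def create_profiles_table(conn):
    # Hypothetical schema: deleted_at is NULL for live rows.
    conn.execute(
        "CREATE TABLE profiles ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, deleted_at TEXT)"
    )

def soft_delete(conn, profile_id):
    """Mark a row as deleted instead of removing it."""
    with conn:
        conn.execute(
            "UPDATE profiles SET deleted_at = ? "
            "WHERE id = ? AND deleted_at IS NULL",
            (datetime.now(timezone.utc).isoformat(), profile_id),
        )

def restore(conn, profile_id):
    """The way back: clear the marker and the row is live again."""
    with conn:
        conn.execute(
            "UPDATE profiles SET deleted_at = NULL WHERE id = ?", (profile_id,)
        )

def live_profiles(conn):
    """Names of rows that have not been soft-deleted, in id order."""
    rows = conn.execute(
        "SELECT name FROM profiles WHERE deleted_at IS NULL ORDER BY id"
    )
    return [name for (name,) in rows]
```

The cost is that every read has to filter on deleted_at, which is why the sketch funnels reads through a single `live_profiles` helper.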

Layer 3 — Outbound side effects

Email sent, payment captured, webhook fired. This layer mostly can't be rolled back — you can only compensate: send a correction email, refund, call a compensating API. So for side effects, the way back is a stopping gate before the side effect happens — confirmation step, dry-run mode, a queue you can pause. You can't fix it after; you have to slow it down before.
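A stopping gate can be as small as a queue that refuses to send until someone says so. The sketch below is one possible shape, assuming a hypothetical Outbox class placed in front of whatever function actually sends the email; the class and its fields are illustrative, not a real library.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Outbox:
    """Hypothetical gate in front of an irreversible side effect."""
    send: Callable[[str, str], None]  # the real sender, supplied by the caller
    dry_run: bool = True              # safe default: nothing leaves
    paused: bool = False              # operator switch during an incident
    pending: List[Tuple[str, str]] = field(default_factory=list)

    def enqueue(self, to: str, body: str) -> None:
        # Queuing is always allowed; only flushing is gated.
        self.pending.append((to, body))

    def preview(self) -> List[Tuple[str, str]]:
        """What would be sent, without sending or draining anything."""
        return list(self.pending)

    def flush(self) -> int:
        """Send queued messages unless gated; return how many were sent."""
        if self.paused or self.dry_run:
            return 0
        sent = 0
        while self.pending:
            to, body = self.pending.pop(0)
            self.send(to, body)
            sent += 1
        return sent
```

Defaulting dry_run to True means a forgotten configuration fails toward "nothing was sent", and `preview` gives a reviewer something to confirm before the gate opens.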

The question I ask before merging

Now, before merging any PR of mine, I ask a single question: If a serious bug is found 15 minutes after this hits production, what will I do — with which specific commands, against which specific files?

If the answer is "I'll figure it out when it happens," the PR isn't ready to merge. Not because the code is wrong, but because I haven't finished the feature. I've only finished one half of it: the happy path.

The other half, after that Sunday morning at Remolution, was never an afterthought for me again.