How Urgent Is This Bug?
I remember the first bug that I shipped to production. I was upset that I’d broken something and was anxious to fix it. But I noticed something curious: the calm demeanor of the senior mentor helping me. They refused to meet my intensity. While the world burned, they wanted instead to discuss the bug and its relative importance.
Engineers don’t want to break things. We don’t want to make mistakes or look dumb. We don’t want to ship a sub-par experience or disappoint customers. When I was a junior engineer I reacted strongly to bugs that I caused or contributed to, no matter how small.
In ‘Avoiding Code Catastrophes’, I wrote:
“Step one: don’t panic! Unless your software is powering a shuttle to Mars, the significance of the bug is probably lower than you think.”
Today I’d like to investigate this idea.
Mental Model: US Armed Forces Defense Readiness
As a frame of reference, we’ll use the US Armed Forces Defense Readiness Condition (DEFCON) system. Here’s a summary:
- DEFCON 1: Maximum military readiness (emergency!)
- DEFCON 2: Ready to deploy and fight in six hours or less
- DEFCON 3: Select forces are ready to deploy in 15 minutes
- DEFCON 4: Above normal readiness
- DEFCON 5: Normal or lowest state of readiness (everything is fine!)
DEFCON 1 is an emergency. DEFCON 5 is a low-priority bug we’ll either eventually fix or ignore. For the sake of this discussion, a high number is good and a low number is bad.
Let’s say that we know nothing about the bug and we’re starting at DEFCON 1. When you encounter a bug, ask the following questions:
- Who can experience this bug?
- How likely are they to experience this bug?
- If this bug is preventing an action, how important is that action?
- How long has this bug been live?
Who Can Experience This Bug?
Users are typically segmented into roles: you might have admins, employees, logged-in customers, and logged-out customers. Who can experience the bug? If the answer is only employees, you can manage expectations with them directly, so the DEFCON level rises (which, remember, is good). It’s not an emergency and we can fix this bug like any other.
How Likely Are They to Experience This Bug?
Does this bug happen when a certain action is taken? When does it happen? If you aren’t sure how common it is, look for metrics or estimate. Sometimes the answer is severe: “The homepage is a white screen for every user.” And many, many other times, it isn’t: “This promo code that we sent to 10 people doesn’t work on a Leap Day, which ends in 30 minutes.”
How Important Is That Action?
If the bug prevents an action, how important is that action? Is this preventing checkout (expensive), or updating a user’s avatar (not expensive)? How does this impact the business? If just a little, raise that DEFCON.
How Long Has This Bug Been Live?
Systems adapt to long-running bugs. Sometimes customers even come to depend on a bug; could that be happening here? If so, raise that DEFCON.
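The four questions above can be sketched as a small scoring function. This is only an illustration, not a formal process: the BugReport fields, the 1% threshold for "unlikely", and the 30-day threshold for "long-running" are all assumptions made up for this example, and your own cutoffs will differ.

```python
from dataclasses import dataclass


@dataclass
class BugReport:
    # Every field here is a hypothetical input invented for this sketch.
    audience: str                  # e.g. "employees", "logged_in", "everyone"
    affected_fraction: float       # estimated share of that audience hitting the bug, 0.0 to 1.0
    blocks_critical_action: bool   # True for something like checkout, False for an avatar update
    days_live: int                 # how long the bug has been in production


def triage(bug: BugReport) -> int:
    """Return a DEFCON level from 1 (emergency) to 5 (it can wait)."""
    level = 1  # we know nothing yet, so start at DEFCON 1

    # Who can experience this bug? Employees can be told directly.
    if bug.audience == "employees":
        level += 1

    # How likely are they to experience it? (1% is an assumed cutoff.)
    if bug.affected_fraction < 0.01:
        level += 1

    # How important is the action the bug prevents?
    if not bug.blocks_critical_action:
        level += 1

    # How long has it been live? (30 days is an assumed cutoff.)
    if bug.days_live > 30:
        level += 1

    return min(level, 5)


white_screen = BugReport("everyone", 1.0, True, 0)
leap_day_promo = BugReport("logged_in", 0.001, False, 0)

print(triage(white_screen))    # DEFCON 1: drop everything
print(triage(leap_day_promo))  # DEFCON 3: worth a conversation, not a panic
```

The numbers matter less than the habit: each question you can answer moves the bug away from DEFCON 1, and a bug only stays an emergency if none of the answers are reassuring.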
Conclusion: Few Bugs are Emergencies
Few bugs are emergencies.
Some might respond: “I’m an individual contributor and my management will tell me if this is a priority.” I disagree. We engineers must advise our management on the importance of bugs. Often it takes a programmer to answer many of these questions. Be a consultant and teach people not to panic.
In the story I started this post with, my mentor cared a lot about the code and customers, and was calm. They understood that fixing a bug requires us to understand it. And you’ll do that better when calm. Your calmness will help you confirm that it’s truly fixed. And you’ll make continual progress because you don’t let every alarm distract you.
I’ll conclude with a relevant quote from Shape Up by Ryan Singer:
There is nothing special about bugs that makes them automatically more important than everything else. The mere fact that something is a bug does not give us an excuse to interrupt ourselves or other people. All software has bugs. The question is: how severe are they? If we’re in a real crisis—data is being lost, the app is grinding to a halt, or a huge swath of customers are seeing the wrong thing—then we’ll drop everything to fix it. But crises are rare. The vast majority of bugs can wait six weeks or longer, and many don’t even need to be fixed. If we tried to eliminate every bug, we’d never be done. You can’t ship anything new if you have to fix the whole world first. pg. 81