Outages and downtime are inevitable. When they happen, it’s important to triage, investigate, and fix the problem while, at the same time, providing timely status updates to customers and our fellow PollEvians. Juggling all these responsibilities can be chaotic, which is why we believe it’s important to have standard practices and a checklist to follow to minimize the chaos.
The incident commander is primarily responsible for communication and coordination. They should be the single source of truth for Customer Service, Account Managers, and any other PollEvians who need updates about the incident. Engineers investigating the incident should also direct their findings to them. If those responsibilities don’t fully utilize the incident commander’s time, or if the issue is low priority, they may also participate in the investigation.
The incident commander is responsible for following our downtime checklist to guide the team through these steps:
The primary goal of triage is to determine what is wrong and how bad it is. Typically, the alert sent to the on-call engineer will give you a good indication of what part of the system is experiencing problems. Determining the severity is much more subjective, but we do have some guiding principles.
An incident is classified as high severity if some percentage of requests to our servers are failing, if response times have significantly risen, or if a user-facing feature is failing. Basically, if it’s an ongoing problem that will affect customers until it’s mitigated, it’s high severity. High severity issues require a more aggressive response and may involve many people.
An incident is classified as low severity if it is a short-lived, transient issue or does not have a direct impact on our customers. Low severity issues typically only involve a couple engineers doing a root cause analysis to find and fix the problem. It’s likely that low severity issues can be addressed when it’s convenient.
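The guiding principles above can be sketched as a simple triage helper. This is only an illustration: the field names and thresholds are assumptions, not values from our actual monitoring setup.

```python
# Hypothetical triage helper: classifies an incident as high or low
# severity using the guiding principles above. Field names and
# thresholds are illustrative assumptions, not real alerting config.

def triage_severity(error_rate, p95_latency_ms, user_facing_feature_down,
                    baseline_latency_ms=200):
    """Return 'high' if the problem will keep affecting customers
    until mitigated, otherwise 'low'."""
    if error_rate > 0:
        return "high"  # some percentage of requests are failing
    if p95_latency_ms > 2 * baseline_latency_ms:
        return "high"  # response times have significantly risen
    if user_facing_feature_down:
        return "high"  # a user-facing feature is failing
    return "low"       # transient, or no direct customer impact

print(triage_severity(0.0, 180, False))   # → low
print(triage_severity(0.05, 180, False))  # → high
```

Anything that returns `low` here is the kind of issue that can wait for a convenient time to do root cause analysis.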
After the on-call engineer knows what the problem is and how severe it is, the next step is to notify those who need to be involved in the incident response. A low severity issue may only require the on-call engineer restarting a service. A high severity issue will likely involve other engineers, a product manager, and customer support. At this point, the on-call engineer may hand off incident commander responsibilities to someone else if they themselves are best suited to investigate and mitigate the issue.
After everyone is notified, the next step is to stop the bleeding. Mitigation is getting our platform to a state where customers are no longer directly impacted. It’s important to note the goal here is not to find and fix the root cause; the goal is to get everything back working again as quickly as possible. While fixing the root cause may be the quickest solution, that’s not necessarily the case.
Unless the root cause was fixed as part of mitigation, it’s now time to dig in to find and fix the root cause. In other words, the primary reason for the resolution step is to figure out what the long term fix for the problem is, and implement it.
As soon as an incident is over, we log information about it for further analysis during our weekly Scrutineering meeting. Generally, this is high-level information about the problem, how we addressed it, when it started and ended, and a more detailed severity level. Our severity levels when reporting the issue are:
| Level | Description |
| --- | --- |
| Critical Bug | A Critical Bug is a “stop the line” type problem, and must be worked on until the mitigation is in place, even if it’s the middle of the night. It requires a story in Pivotal Tracker with the label “critical”. Critical Bugs are always High severity during triage. |
| Downtime Outage | A Downtime Outage is also a “stop the line” type problem and must be worked on until a mitigation is in place, no matter what time it is. It is also categorized as High severity during triage. The main difference between Downtime Outages and Critical Bugs is the cause: Critical Bugs are caused by bugs in our code, while Downtime Outages are likely caused by infrastructure-level problems. |
| Downtime Interruption | A Downtime Interruption is also likely an infrastructure-level problem, like a Downtime Outage, but it is transient, such as a networking issue. It may be categorized as either High or Low severity during triage depending on the likelihood of recurrence. |
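One way the logged record could be modeled in code, as a minimal sketch; the field names are assumptions for illustration, not our actual Scrutineering tooling:

```python
# Hypothetical incident log entry for the weekly Scrutineering review.
# The reporting levels mirror the table above; everything else is an
# illustrative assumption.
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class ReportedSeverity(Enum):
    CRITICAL_BUG = "Critical Bug"                    # "stop the line"; caused by a bug in our code
    DOWNTIME_OUTAGE = "Downtime Outage"              # "stop the line"; infrastructure-level problem
    DOWNTIME_INTERRUPTION = "Downtime Interruption"  # transient infrastructure-level problem

@dataclass
class IncidentLogEntry:
    summary: str               # high-level description of the problem
    resolution: str            # how we addressed it
    started_at: datetime
    ended_at: datetime
    severity: ReportedSeverity

# Example (made-up data):
entry = IncidentLogEntry(
    summary="Elevated error rate on the API",
    resolution="Rolled back the most recent deploy",
    started_at=datetime(2020, 1, 6, 9, 15),
    ended_at=datetime(2020, 1, 6, 9, 40),
    severity=ReportedSeverity.DOWNTIME_OUTAGE,
)
print(entry.severity.value)  # → Downtime Outage
```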
During our weekly Scrutineering meeting, the engineers review the incidents from the past week. The follow-up has a couple of goals. First, we want to raise awareness among all the engineers of the issue and how it was resolved. This should allow others to identify and resolve the problem if it comes up again. Second, we want to identify any systemic problems underlying the incidents, along with changes to our processes or coding practices that could prevent future outages.