How long does it take you to “arrive on scene” for your incidents? That is, how long does it take to go from receiving an incident notification, to dispatching the right team, to getting to the location of the fire? Many teams focus on mitigating an incident quickly and overlook this critical metric, which is the one most squarely within your control.
Assembly time
On-call pages, customer support emails, and manual discovery are all different ways we discover fires. In real life, we call emergency services (e.g., 911) when we know something is wrong. But even 911 itself was an improvement on a convoluted system for getting help quickly. Until 1968, residents had to know the local numbers for police and fire emergency services, which is not very helpful when you’re visiting a different city and suddenly need help.
A standardized way of dispatching emergency services was a critical step toward full-cycle incident management for cities, especially ones operating at enormous scale like New York City. Since 2013, NYC has been publishing its response times to 911 calls. This data is a treat in so many ways, but one metric is notably missing: resolution.
Instead, NYC focuses on the part that is within the realm of their control: How fast they can route a 911 call to the right team suited for the job, and how long it takes for that team to travel to where the incident is occurring.
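A software team can track the same two intervals from three timestamps per incident. The following is a minimal sketch, assuming each incident record carries hypothetical `notified_at`, `dispatched_at`, and `arrived_at` fields. These names are illustrative and not taken from any particular paging tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Incident:
    # Hypothetical field names; adapt them to whatever your tooling records.
    notified_at: datetime    # first page, support email, or manual report
    dispatched_at: datetime  # the owning team was identified and paged
    arrived_at: datetime     # a responder acknowledged and started working


def dispatch_time(incident: Incident) -> timedelta:
    """Time spent routing the notification to the right team."""
    return incident.dispatched_at - incident.notified_at


def travel_time(incident: Incident) -> timedelta:
    """Time from dispatch until a responder is on scene."""
    return incident.arrived_at - incident.dispatched_at


def response_time(incident: Incident) -> timedelta:
    """Total time from notification to arrival (dispatch plus travel)."""
    return incident.arrived_at - incident.notified_at


def median_response_seconds(incidents: list[Incident]) -> float:
    """Median response time in seconds across a set of incidents."""
    return median(response_time(i).total_seconds() for i in incidents)
```

Reporting the median rather than the mean is a design choice in this sketch: a single very slow response does not dominate the figure.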
Assigning neighborhoods
New York City, and every city for that matter, can quickly respond to incidents because of the careful organization and planning that has gone into which departments respond to which incidents, and which of those stations get dispatched to emergencies when a 911 call is received. When it comes to software, it’s all too common to see teams haphazardly assemble before the first hypothesis on what is going wrong is formed.
Software teams can mimic the organization of cities by assigning which teams own which “neighborhoods” of their stack. We can think of a software neighborhood as a functionality, a service, or even an entire environment. The important thing is that each neighborhood already has an assigned team by the time an incident is dispatched, much like a real fire department in Manhattan.
The best people to assign to neighborhoods for incident management are the locals themselves. The people who build the software that will inevitably have an incident (big or small) are the sane default to assign when you smell smoke. Service ownership spans the entire software development lifecycle, including incidents.
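One way to make the assignment explicit is a lookup table from neighborhood to owning team, consulted at dispatch time. This is a minimal sketch: the neighborhood names, team names, and fallback team are all hypothetical.

```python
# Hypothetical ownership map: each "neighborhood" of the stack has one owning team.
OWNERS = {
    "checkout-service": "payments-team",
    "search": "discovery-team",
    "staging-environment": "platform-team",
}

# Assumed fallback so that an unmapped neighborhood still reaches a human.
DEFAULT_TEAM = "incident-commander-on-call"


def team_for(neighborhood: str) -> str:
    """Return the team to dispatch for the given neighborhood."""
    return OWNERS.get(neighborhood, DEFAULT_TEAM)
```

For example, `team_for("search")` returns `"discovery-team"`, while a neighborhood missing from the map falls back to `DEFAULT_TEAM` instead of leaving the page unrouted.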
Travel time
When it comes to incidents, reaction and travel time matter. Smokejumpers, for example, are trained to be airborne and en route to a wildfire in a matter of minutes. When it comes to aircraft safety, getting out of the plane in less than 45 seconds best guarantees survival during an accident. Airplane safety cards, as it turns out, actually work.
So when it comes to software, what is the “safety card” procedure your teams can use? Oftentimes we call these runbooks or playbooks, but the goal is the same: getting to the point where you can mitigate an incident faster.
Greasing the mental wheels of an incident response team can dramatically decrease the time it takes to act on an incident, which in turn shortens how long that incident impacts others.
Response time matters
Every phase of an incident has things that can be improved, and response time has one of the highest returns on investment when reducing the overall impact of an incident. So assign your neighborhoods, standardize your declaration, and get to the fire faster.