Years ago, I worked as a firefighter. One of the most captivating parts of the job was incident pre-planning: simulating possible emergency scenarios and the emergency services' responses to them, to determine what resources would be necessary to respond effectively and how best to deploy those resources. Without such pre-planning, we as firefighters risk being caught off guard by something that prevents us from doing our job – preserving human life and protecting property.
For example, in the event of a house fire, firefighters enter the house to locate and extinguish the fire. Firefighting procedures require that whenever firefighters enter a building, a separate team – known as a Rapid Intervention Team (RIT) – must be on standby, ready to enter if the firefighters inside encounter an emergency and cannot rescue themselves. During incident pre-planning, we assign RIT responsibilities, ensure this team is equipped with specialized tools and training, and establish protocols that enable the team to focus solely on RIT duties when on-scene. The absence of a dedicated, well-prepared RIT could be catastrophic. If firefighters inside a burning house run out of air or become trapped under debris, they need immediate rescue. In such an event, it would be risky to divert other firefighters in the house from their tasks, as this could impede the operation’s aims and put those firefighters at greater risk. That is why the designated RIT firefighters are prepared to step in and take on this essential task in an emergency.
Incident pre-planning – creating these safeguards for possible crisis events – is vital to ensure that when a real incident occurs, firefighters can respond effectively and efficiently while prioritizing the safety of everyone involved. During my training at the fire academy, a battle-tested fire instructor imparted a lesson on this topic that has stayed with me throughout my career. He emphasized the need for a Plan A, a Plan B, and even a Plan C.
His message was clear: proper planning is not just a nice-to-have; it is a must-have for us to succeed. And core to this planning is having backup plans and systems for when things inevitably go wrong. This lesson in contingency planning has shaped my professional life.
In the fields of crisis and risk management, having backup plans or safeguards in place is a cornerstone principle. To ensure success in an organization’s operations, it is imperative to account for the possibility that something might not work as expected. Relying on a single system without a backup plan exposes organizations to massive risks. Even a single component without a ready backup can lead to catastrophic consequences for the whole system.
This brings us to the CrowdStrike outage that shook the world in mid-July 2024. A seemingly ordinary software update by CrowdStrike caused a massive global IT outage that crippled organizations and entire industries, in some cases for days or weeks. The outage caught organizations across various sectors – financial services, airlines, hospitals – completely off guard. Never before has a single service outage impacted so many industries, organizations, and individuals. Experts have already labeled this event the largest IT failure to date.
However, the CrowdStrike outage is more than just a case study in IT failures; it also serves as a stark reminder of the dangers of single points of failure. A single point of failure refers to a part of a system, process, or infrastructure that, if it fails, will cause the entire system to stop functioning.
We saw this single point of failure dynamic play out in the immediate aftermath of the CrowdStrike outage. For many major companies, no contingency plan was in place should their primary IT system fail. As a result, when the outage occurred, normal operations for entire industries came to a halt and chaos ensued as organizations had to figure out how to restore systems that were affected.
For example, Delta Air Lines experienced a massive disruption to its operations because of the CrowdStrike outage. Although many airlines faced disruptions, Delta took notably longer to restore normal operations. The airline explained:
"Upward of half of Delta’s IT systems worldwide are Windows based. The CrowdStrike error required Delta’s IT teams to manually repair and reboot each of the affected systems, with additional time then needed for applications to synchronize and start communicating with each other."
In the end, Delta canceled roughly 7,000 flights because of the outage and estimated the total cost at about $500 million, including lost revenue and customer compensation. This estimate does not include the reputational damage or the cost of winning back customers who now refuse to fly Delta out of frustration with this incident.
Delta is just one example of what single points of failure can cost an organization. The broad lack of backup systems and plans across industries was painfully evident as organizations struggled to maintain or restore their operations. Some estimate that the total cost of the CrowdStrike outage to Fortune 500 companies will be around $5.4 billion.
This incident underscores just how serious the risks associated with single points of failure are, and the lesson for us is clear: we need to take those risks seriously and do all that we can to mitigate them. This means identifying and addressing single points of failure in the critical systems within our organization, and developing backup plans (a Plan A, a Plan B, and a Plan C) in case some part of our system fails. Without backup systems or contingency plans, we leave ourselves vulnerable to the kind of widespread disruption witnessed during the CrowdStrike outage.
In today’s interconnected world, where technology is the backbone of nearly every industry, the risk of single points of failure can no longer be ignored. Proper planning and crisis preparedness are not just best practices; they are necessities in the business world. As we move forward, it is vital that we learn from the CrowdStrike incident and put robust systems in place, so that the next inevitable system failure does not turn into a predictable crisis.