A recent global IT outage, caused by an update to CrowdStrike’s endpoint detection and response (EDR) tool, disrupted a wide range of businesses around the world, including supermarkets, airports, retailers, petrol stations, banks and payment providers, telecommunications companies, and IT service providers.
The update was deployed on a Friday afternoon (NZT), which meant that New Zealand businesses were also affected, just as the country was heading into the weekend. With the possibility of similar incidents happening again, we talked to Canterbury IT specialist Softsource vBridge GM Technical Delivery Iain Boyd about the outage and what it means for local businesses.
What happened?
At 4.09pm (NZT) on 19 July, CrowdStrike released an update to its Falcon EDR product. This product is usually updated multiple times per day so that the Falcon platform can identify and remediate new security threats and vulnerabilities. In this case the update contained a logic error that caused the CrowdStrike driver to crash, which left some Windows-based systems unable to boot and therefore unusable.
How it happened
The Falcon agent interacts with the Windows kernel. In this case the update was to make the sensor watch “named pipe configurations”. This allows the Falcon agent to monitor processes on the Windows operating system and, importantly, interactions between processes that might be malicious. One of the downsides of this configuration is that the channel file must load very early in the boot process (to make sure it captures all interaction). Systems running Falcon sensor for Windows 7.11 and above that downloaded the updated configuration were susceptible to a system crash. That then resulted in a boot loop displaying the Windows Blue Screen of Death.
The issue was identified quickly via the error code on the Windows Blue Screen. CrowdStrike released a workaround at 5.27pm (NZT), but unfortunately the fix was to remove the broken channel file, which in most cases had to be done manually at each affected computer. For enterprise businesses that needed to implement the fix en masse, it sometimes took teams 24-48 hours to understand the problem and come up with a plan for deploying the fix.
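To make the workaround concrete, here is a minimal Python sketch of the "remove the broken channel file" step. The directory and the file name pattern are the ones given in CrowdStrike's public workaround guidance, but they should be treated as assumptions here and checked against the vendor's own advisory. The function name and the dry-run flag are hypothetical. In practice the fix was applied by hand from Safe Mode or the Windows Recovery Environment, where Python is usually not available, so this only illustrates the logic.

```python
from pathlib import Path

# Location and file pattern as given in CrowdStrike's public workaround
# guidance; both are assumptions here and should be verified with the vendor.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(directory=CROWDSTRIKE_DIR, dry_run=True):
    """Find channel files matching the faulty pattern.

    With dry_run=True (the default) the files are only listed.
    With dry_run=False they are deleted. The matches are returned either way.
    """
    directory = Path(directory)
    matches = sorted(directory.glob(FAULTY_PATTERN))
    for path in matches:
        if dry_run:
            print(f"Would delete: {path}")
        else:
            path.unlink()
            print(f"Deleted: {path}")
    return matches
```

Defaulting to a dry run is a deliberate choice: deleting files from a driver directory is not something to do by accident, and the same caution applies to any script that claims to be a fix.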
Could it happen again?
While CrowdStrike has released some information on what happened and how to fix it, it is still conducting root cause analysis to understand how the core logic issue that caused the outage came about, so it’s difficult to say exactly without that information.
Luckily, in this case there was no cybersecurity attack involved in the outage itself; it was simply a piece of faulty software. However, in the days that followed, CrowdStrike itself acknowledged two new attacks in its blog as hackers sought to capitalise on the confusion associated with the outage: one that pretended to be a hotfix for the issue and one that used a Word file with a malicious macro pretending to be a Microsoft recovery manual.
Businesses should be wary of solutions that come from unverified sources, particularly anything that is emailed from an unknown source. If in doubt, businesses should speak with their trusted IT provider before taking action, or refer to the CrowdStrike website, which has trusted content on how to fix the issue. Regardless, businesses should plan on the assumption that it will happen again. CrowdStrike was the affected vendor this time, but it isn’t the only vendor that could cause an outage of this kind and scale; it just happened to be the unlucky one on the day.
What can businesses do to increase their resilience to such events?
These situations are a good reminder to businesses that it’s really important to have the IT basics sorted, or to have a trusted IT provider that can make sure the basics are covered. Businesses should have:
- Backups of critical systems and data. I always recommend companies follow the 3-2-1 backup rule: you should have three copies of your data (your production data and two backup copies) on two different media (for example disk and tape) with one copy off-site for disaster recovery purposes.
- An IT disaster recovery plan. At a minimum the plan should cover:
  - Risk assessment and business impact analysis: Identify potential risks and assess their impact on business operations. This helps prioritise recovery efforts based on how critical each system is.
  - Recovery objectives: Define Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) for critical systems and data. How long can your system be down and how much data can you afford to lose?
  - Hardware and software inventory: Maintain an up-to-date inventory of all hardware and software assets. This helps in quickly identifying and replacing damaged components.
  - Recovery procedures: Develop detailed recovery procedures for restoring IT systems, applications, and data. This includes step-by-step instructions for different disaster scenarios.
  - Communication plan: Establish a communication plan to keep stakeholders informed during a disaster. This includes notifying employees, customers, and partners about the status of recovery efforts.
  - Most importantly, companies should test their plans and backups regularly to ensure the plan is fit for purpose. By practising recovery, staff know what to expect and the business has the comfort of knowing it can reliably recover should the worst happen.
- A business continuity plan (BCP). Businesses that have a good disaster recovery plan may want to consider expanding that plan to a full business continuity plan. This is less about IT and more about how you are going to run the business operationally in the event of a disaster. Some examples of what a BCP might include are:
  - Employee safety and wellbeing: Include procedures to ensure the safety and wellbeing of employees during a crisis, such as evacuation plans and emergency contacts.
  - Supply chain resilience: Develop strategies to maintain supply chain operations and manage relationships with key suppliers, customers and partners.
  - Training and awareness: Conduct regular training and awareness programs to ensure that employees are familiar with their roles during a disruption.
  - Delegations of authority and responsibility: What happens if someone who makes the key decisions is unavailable or uncontactable? Who is in charge? What decisions can they make?
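The 3-2-1 backup rule and the recovery objectives described in the list above lend themselves to a simple automated check. The following is a minimal Python sketch, not a definitive implementation: the `BackupCopy` record, the function names and the sample figures are all hypothetical, and a real check would read this information from your backup tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BackupCopy:
    """One backup copy of a system's data (hypothetical record)."""
    name: str
    media: str              # for example "disk", "tape" or "cloud"
    offsite: bool
    last_success: datetime  # when this backup last completed successfully


def meets_3_2_1(production_media, backups):
    """3 copies (production plus backups), 2 media types, 1 copy off-site."""
    total_copies = 1 + len(backups)
    media_types = {production_media} | {b.media for b in backups}
    has_offsite = any(b.offsite for b in backups)
    return total_copies >= 3 and len(media_types) >= 2 and has_offsite


def rpo_met(backups, rpo, now):
    """True if the newest successful backup is no older than the RPO."""
    if not backups:
        return False
    newest = max(b.last_success for b in backups)
    return now - newest <= rpo


def rto_met(tested_restore_time, rto):
    """True if the restore time measured in the last test fits the RTO."""
    return tested_restore_time <= rto


if __name__ == "__main__":
    now = datetime(2024, 7, 19, 16, 9)
    backups = [
        BackupCopy("local-nas", "disk", False, now - timedelta(hours=2)),
        BackupCopy("offsite-vault", "tape", True, now - timedelta(hours=20)),
    ]
    print("3-2-1 rule met:", meets_3_2_1("disk", backups))
    print("RPO of 4 hours met:", rpo_met(backups, timedelta(hours=4), now))
    print("RTO of 8 hours met:", rto_met(timedelta(hours=3), timedelta(hours=8)))
```

Note that the RTO check relies on a restore time measured during an actual recovery test, which is another reason to practise recovery regularly rather than assume the plan works.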
Whilst widespread outages such as these can be daunting, they aren’t something you need to plan for alone. There is specialist help available that can walk you and your team through what’s required, ensure you have the right plans, policies and mitigations, and assist you in applying these to your business. As cyber threats continue to evolve, the importance of staying vigilant and prepared cannot be overstated.