Keeping Freight Moving in a Cloud Outage
When a hyperscale cloud stumbles, it does not matter how “simple” your workflow looks from the office; you feel it at the dock, in the driver app, and in the truck. The AWS incident was a classic cascade: a DNS problem in a core database path inside a single region rippled into compute and load balancer health checks, and suddenly a long list of everyday apps went sideways. It was not a hack; it was plumbing.
The lesson for non-tech companies is not to memorize acronyms; it is to assume the platform you rent will have a bad day and to design your playbook and vendor expectations around that reality.
We have seen the other side of the same coin: in 2024, a faulty security content update from CrowdStrike knocked millions of Windows machines into blue screens. That was a client-side failure with different mechanics than a cloud outage, but the same business outcome: people and freight stop moving unless you have prepared alternatives. The number that matters is not how many servers hiccupped; it is how many orders, loads, and invoices you did not process while you waited.
So, what do you do if the SaaS you rely on goes dark? First, treat resilience as an operational discipline, not an IT project. You need a minimal backup operating procedure you can run for order capture, tendering, dispatch, check calls, and invoicing when screens freeze. That can be a read-only mirror, CSV exports you can work from for a day, or a small backup tool you control. You also need out-of-band communications that do not depend on the tool that is down: SMS trees for drivers, a public status page for customers, and a standing phone bridge for your teams. The companies that communicated clearly during the AWS incident looked like adults under pressure; the ones that did not watched tickets and anxiety pile up.
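The daily “shadow export” can be as small as a scheduled script. The sketch below is a minimal illustration in Python, assuming a vendor REST API with a hypothetical /v1/loads endpoint and a bearer token; the URL, endpoint path, and field names are placeholders you would swap for your own platform’s real API.

```python
# Minimal daily "shadow export" sketch: pull open loads from a (hypothetical)
# SaaS REST endpoint and write them to a dated CSV that dispatch can work
# from if the platform goes dark. Endpoint path, field names, and auth are
# placeholders -- substitute your vendor's actual API.
import csv
import datetime
import os

import requests

API_BASE = os.environ.get("TMS_API_BASE", "https://api.example-tms.com")  # hypothetical
API_TOKEN = os.environ["TMS_API_TOKEN"]

def export_open_loads(out_dir: str = "shadow_exports") -> str:
    resp = requests.get(
        f"{API_BASE}/v1/loads?status=open",          # placeholder endpoint
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    loads = resp.json()  # assume a list of load records (dicts)

    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"open_loads_{datetime.date.today():%Y%m%d}.csv")
    fields = ["load_id", "customer", "origin", "destination", "pickup_at", "driver", "status"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for load in loads:
            writer.writerow(load)
    return path

if __name__ == "__main__":
    print(f"Wrote {export_open_loads()}")
```

Run it once a day from a machine you control; the point is not elegance, it is having this morning’s loads in a file your dispatchers can open when the screens freeze.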
Next, raise the bar for your vendors, in contracts and in practice. Ask them, in plain language, how they survive a regional failure at their cloud provider. If the answer is “we are in multiple availability zones,” push further: zones protect against data-center-level failures, while only a multi-region design protects against the kind of region-wide problem AWS just had. Ask for their target Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for your most critical flows, and how they have implemented read-only or queue-and-replay modes so you can keep accepting work even when downstream services are unhealthy. The AWS Well-Architected guidance is clear: spread production across multiple availability zones at minimum and consider multi-region for mission-critical workloads. Your SaaS vendors should already meet that bar for the parts of the platform your revenue depends on.
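To make the queue-and-replay question concrete, here is a minimal sketch of the pattern: accept the order locally no matter what, persist it, and push it downstream once the API answers again. The SQLite outbox, endpoint URL, and payload shape are illustrative stand-ins, not any vendor’s actual interface.

```python
# Queue-and-replay sketch: keep accepting orders locally when the downstream
# SaaS is unhealthy, persist them in a SQLite "outbox", and replay them once
# the API responds again. Endpoint and payload are illustrative placeholders.
import json
import sqlite3
import time

import requests

API_URL = "https://api.example-tms.com/v1/orders"  # hypothetical endpoint
DB = sqlite3.connect("pending_orders.db")
DB.execute("CREATE TABLE IF NOT EXISTS outbox "
           "(id INTEGER PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0)")

def accept_order(order: dict) -> None:
    """Always succeeds locally, even if the SaaS is down."""
    DB.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(order),))
    DB.commit()

def replay_pending(max_attempts: int = 3) -> None:
    """Push queued orders downstream; stop early if the API is still unhealthy."""
    rows = DB.execute("SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        for attempt in range(max_attempts):
            try:
                resp = requests.post(API_URL, json=json.loads(payload), timeout=10)
                resp.raise_for_status()
                DB.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
                DB.commit()
                break
            except requests.RequestException:
                time.sleep(2 ** attempt)  # simple backoff between retries
        else:
            return  # still down; try again on the next cycle

if __name__ == "__main__":
    accept_order({"customer": "ACME", "origin": "Dallas", "destination": "Memphis"})
    replay_pending()
```

You are not asking your vendor to show you code like this; you are asking whether something equivalent exists for the flows your revenue runs through.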
Change management is the other non-negotiable. Know how your provider ships updates. Do they roll out in rings, with the ability to pause and automatically roll back if a health metric drops, or do they push big bang to everyone at once? If they deploy any endpoint agents on your machines, confirm you can defer or stage those updates and that there is a documented rollback if they misfire. CrowdStrike’s experience shows why these details matter. If your world is 100 percent Windows with one security stack across every desk, stand up a few continuity stations so tendering and invoicing continue when Windows has a bad day.
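If it helps to picture what “rings with a health gate” means, here is a toy sketch of the pattern. The ring sizes, error budget, and placeholder functions are invented for illustration and do not describe any real product’s tooling.

```python
# Toy sketch of a ringed rollout with an automatic pause-and-rollback gate --
# the pattern to ask your vendors about. deploy_to, error_rate, and roll_back
# are placeholders for whatever deployment and monitoring tooling is in use.
import time

RINGS = [
    {"name": "canary", "hosts": 5},
    {"name": "ring-1", "hosts": 50},
    {"name": "ring-2", "hosts": 500},
]
ERROR_BUDGET = 0.02   # stop if more than 2% of health checks fail
SOAK_SECONDS = 600    # let each ring bake before expanding further

def deploy_to(ring: dict) -> None:
    print(f"deploying to {ring['name']} ({ring['hosts']} hosts)")  # placeholder

def error_rate(ring: dict) -> float:
    return 0.0  # placeholder: read this from the monitoring system

def roll_back(ring: dict) -> None:
    print(f"rolling back {ring['name']}")  # placeholder

def staged_rollout() -> bool:
    for ring in RINGS:
        deploy_to(ring)
        time.sleep(SOAK_SECONDS)            # soak period before judging health
        if error_rate(ring) > ERROR_BUDGET:
            roll_back(ring)                 # contain the blast radius here
            return False                    # pause: do not touch later rings
    return True
```

The contrast with a big-bang push is the whole point: a bad update should stop at a few machines, not reach every desk and every truck at once.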
Make drills part of the business rhythm. Once a quarter, run a 30-minute brownout where your primary SaaS is treated as read-only and prove you can still take orders, dispatch, and bill. Once a year, simulate a region loss for an hour and score your real RTO and RPO against what your vendors promised. If your incident communications plan assumes email, add a Plan B: text, phone, or a status site, because email is often collateral damage when platforms wobble. Industry guidance calls this contingency planning, but the point is simpler: the only plan that works on a Monday morning is the one you have rehearsed.
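Scoring the drill does not need a dashboard; a small scorecard comparing what you observed against what the contract promises is enough. The sketch below is illustrative, with made-up systems and numbers rather than real contract terms.

```python
# Drill scorecard sketch: after a brownout or region-loss exercise, compare
# observed recovery numbers against each vendor's contracted targets.
# Systems and figures below are illustrative, not real contract terms.
from dataclasses import dataclass

@dataclass
class DrillResult:
    system: str
    promised_rto_min: float   # contracted Recovery Time Objective, minutes
    promised_rpo_min: float   # contracted Recovery Point Objective, minutes
    observed_rto_min: float   # time until the flow was usable again
    observed_rpo_min: float   # window of work you had to re-enter

    def passed(self) -> bool:
        return (self.observed_rto_min <= self.promised_rto_min
                and self.observed_rpo_min <= self.promised_rpo_min)

results = [
    DrillResult("order capture", promised_rto_min=30, promised_rpo_min=15,
                observed_rto_min=22, observed_rpo_min=10),
    DrillResult("invoicing", promised_rto_min=60, promised_rpo_min=30,
                observed_rto_min=95, observed_rpo_min=30),
]

for r in results:
    status = "PASS" if r.passed() else "MISS -- raise with the vendor"
    print(f"{r.system}: RTO {r.observed_rto_min}/{r.promised_rto_min} min, "
          f"RPO {r.observed_rpo_min}/{r.promised_rpo_min} min -> {status}")
```

A miss is not a reason to panic; it is the agenda for your next vendor review.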
Finally, put teeth in the paperwork. Embed RTO and RPO in your MSA instead of vague uptime percentages, require post-mortems within five business days for major incidents, and keep a standing right to export all of your data in a clean format on demand. Map your SaaS providers’ own dependencies (cloud, payments, messaging) and ask what their Plan B is for each. Review insurance with your broker; many firms learned in 2024 that third-party outages do not neatly fit standard coverages. You are not trying to punish vendors; you are aligning incentives so that resilience is not an afterthought.
The takeaway is straightforward. Cloud centralization gives us superpowers and shared failure modes. The fix is not bravado; it is redundancy. Architect your minimal shadow flow. Demand region-level resilience and safe update practices from your SaaS vendors. Drill the bad day. Communicate like pros. Do that, and the next time a platform hiccups, your customers will not remember that you went down; they will remember that you kept them moving while everyone else waited for the internet to come back.