Atlassian Outage Puts A Spotlight On SaaS Resilience Gaps

SaaS brings speed, innovation, and enterprise capability …

Today’s enterprises embrace software as a service to add capability quickly without huge investments in infrastructure and staff to build it out internally. A business leader chooses a service, signs a contract, and very rapidly has a CI/CD tool, a full HRM system, or another business application.

… along with new, unfamiliar — and often poorly understood — risks.

Technology and business risks morph with changes in technology and how it is delivered. While cloud services are often considered more dependable, businesses face new risks with SaaS and public cloud, risks that are unfamiliar or not completely understood. People’s eyes pop open and ears perk up when they witness prolonged outage events such as the current issue with Atlassian. Suddenly, SaaS dependencies and resilience issues become relevant, as a business can’t access its favorite SaaS tool. The unique risk of using SaaS is that you don’t control the application or tool and cannot reimplement it yourself. It is also important to understand the cascading risks, as some well-known SaaS services are hosted on a leading hyperscaler’s infrastructure. You need to analyze the business impact of SaaS and cloud service outages just as you would for any other technology in your portfolio.
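One way to make cascading risk concrete is to write down what depends on what and then ask which business capabilities fall over when a single piece fails. The sketch below is a minimal illustration in Python; every service name and dependency in it is a hypothetical placeholder, not a description of any real vendor's architecture.

```python
# Hypothetical dependency map: each entry lists what it directly depends on.
# All names are illustrative assumptions, not real vendor architectures.
DEPENDS_ON = {
    "ci_cd_tool": ["cloud_region_a"],
    "hrm_system": ["cloud_region_b"],
    "release_process": ["ci_cd_tool"],
    "payroll_run": ["hrm_system"],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return everything that directly or transitively depends on `failed`."""
    impacted: set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for service, deps in depends_on.items():
            # Skip services already recorded so cycles cannot loop forever.
            if current in deps and service not in impacted:
                impacted.add(service)
                frontier.append(service)
    return impacted


print(sorted(impacted_by("cloud_region_a", DEPENDS_ON)))
# prints ['ci_cd_tool', 'release_process']
```

In this made-up map, losing one cloud region takes out the CI/CD tool and, through it, the release process, which is the kind of indirect exposure a business impact analysis should surface.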

When Crafting Resilience For SaaS, Two Things Matter: What Your Vendor Does And What You Do

Define vendor responsibility.

Trust but verify vendor claims about service-level agreements supporting operations and resilience plans. To ensure that your SaaS providers deliver on their own promises:

  • Demand that SaaS providers share their resilience capabilities. Understand the design, architecture, and deployment model for these SaaS services. These should be transparent, not opaque. What resilience capabilities has the SaaS provider built to withstand failures? Insist that the provider be clear about which failure scenarios it covers and which it does not.
  • Inquire about IT operations and controls. While some SaaS providers may identify their design and architecture as a secret sauce, don’t settle for boilerplate responses. Engage your recovery practitioners to ask how SaaS providers manage their services, including their operational practices.
  • Build SLAs with real vendor consequences into contracts. Downtime for a vendor represents more than a lack of service to your business. Depending on the particular SaaS tool, an outage can mean substantial cost to your business: idle employees, missed deadlines, inability to sell or ship products, loss of physical or digital security, a threat to life, and reputational damage. Make your SLAs with the vendor match the importance of the service to your business. One company writes into its vendor contracts that SLA-violation payments must be signed by each member of the board of directors so that outages get escalated to the highest level.
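To judge whether an SLA matches the importance of a service, it helps to translate an availability percentage into hours of permitted downtime and to put a rough number on what those hours cost you. The following Python sketch does that arithmetic. The 30-day month, the function names, and the idle-staff cost model are all simplifying assumptions for illustration; the cost function ignores the other losses listed above, such as missed sales or reputational damage.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month, for simplicity


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_MONTH) -> float:
    """Hours of downtime a vendor can incur without violating the SLA."""
    return period_hours * (1 - availability_pct / 100)


def idle_staff_cost(hours_down: float, idle_employees: int,
                    hourly_cost: float) -> float:
    """Rough cost of idle employees only; other outage losses are ignored."""
    return hours_down * idle_employees * hourly_cost


for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% availability -> "
          f"{allowed_downtime_hours(pct):.2f} hours/month of allowed downtime")
```

Under these assumptions, a 99.9% monthly SLA still permits about 0.72 hours of downtime, and `idle_staff_cost(8, 200, 50.0)` puts an eight-hour outage that idles 200 employees at an assumed 50 per hour at 80,000 in idle time alone. Comparing that figure with the SLA-violation payment in the contract shows quickly whether the consequences are real.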

Implement your own controls.

The resilience of your business is your concern; don’t pass the buck to your vendor. With SaaS, you avoid running and maintaining an application, but when the service goes down, you still incur the business losses. And because you don’t run the infrastructure, you can’t put it all back together yourself. Prepare for the risk scenarios that your SaaS provider does not cover, and develop a plan of controls and mitigations that your business can apply to minimize the impact of SaaS outages.

Risks And Controls For SaaS Resilience Vary — Act Accordingly

Risk | Control | Control description
Data loss or corruption | Back up your data | For the most part, SaaS vendors don’t take responsibility for client data; it may be part of their backups, but they aren’t guarding against accidental deletion or corruption, and there is no easy way to initiate a restore. Let’s be clear that backing up data from SaaS does not mean that you can restore your business operations in the case of an outage. Data backups provide a safety net in case of corruption, letting you restore your data back onto the SaaS platform. Backups may also give you a way to execute a service migration if staying with the current provider becomes untenable.
Dependent infrastructure outage | Monitor key cloud service dependencies | Ascertain whether an outage at the infrastructure provider will have a downstream effect on your SaaS vendor’s offering. For instance, if your provider has significant infrastructure in AWS US East, you should monitor the service availability of that region in a resilience dashboard.
Short-term outage | Identify tolerance for service outage | Most cloud and SaaS outages are relatively short, and while disruption is inconvenient, the value that SaaS provides exceeds the hiccups. Identify internally when that equation changes and action must be taken, such as workarounds or service migrations.
Medium-term outage | Workarounds and outage planning | Identify key processes and operations that require workarounds to keep the business running even in a degraded state. When planning for outage scenarios, ask key questions such as: If your CI/CD pipeline fails, how will the developers write and publish code? If your collaboration system is unavailable, how will teams share key documents until service is restored? Is there a hybrid option, or another option available from the vendor?
Long-term outage | Service migration | Most SaaS companies have a healthy set of competitors ready to help you transition to their platform. Identify in advance which vendors would be the best fit for your needs. If possible, test with potential vendors what would be necessary to transform and migrate data backups from your existing vendor into a new platform. Also perform rigorous due diligence on your alternate provider, as it may expose you to risks similar to those of your current provider.
Vendor shuts down or discontinues service | Software escrow/SaaS escrow | Companies such as NCC Group in the UK provide a unique escrow service, contracting with customers and software or SaaS vendors to hold (incremental) code and provide operational expertise in escrow, de-risking the possibility of a vendor discontinuing a product or going out of business.
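The short-, medium-, and long-term rows of the table above amount to an escalation policy: decide in advance at what point waiting stops being acceptable. A minimal Python sketch of that idea follows. The four-hour and seven-day thresholds and the function name are hypothetical; each business needs to set its own tolerance per service.

```python
from datetime import timedelta

# Hypothetical thresholds; set your own tolerance for each service.
MEDIUM_TERM = timedelta(hours=4)
LONG_TERM = timedelta(days=7)


def control_for(outage: timedelta) -> str:
    """Map elapsed outage time to the control named in the table above."""
    if outage < MEDIUM_TERM:
        return "wait within tolerance"
    if outage < LONG_TERM:
        return "workarounds and outage planning"
    return "service migration"


print(control_for(timedelta(hours=1)))
print(control_for(timedelta(days=2)))
print(control_for(timedelta(days=10)))
```

Writing the thresholds down, even this crudely, forces the conversation about when "the equation changes" to happen before an outage rather than during one.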

Practice and test your recovery and resilience options.

Every athlete practices their sport, gauging their own performance — often with the help of coaches or other athletes to determine how to improve. Your resilience operations should be practiced, tested, and improved in the same way. Resilience and recovery is a sport, and executing requires everyone in your organization to know what they are doing in the case of a key application or service being offline. Your sales teams need to know what to do if Salesforce is not available; your HR team needs to know what workarounds to implement if Workday has an outage; and your DevOps teams need to understand how to stay productive if Atlassian goes down.

Just as with self-managed infrastructure, the key to surviving a SaaS outage is knowing the risks, implementing controls to mitigate those risks, and then testing your plans to make sure they work and that everyone knows how to execute in a crisis.

Special thanks to Charles Betz, David Mooter, and Will McKeon-White for their active participation in the creation of this blog.

Check out the companion blog about our observations on the Atlassian outage and what you can do if you are affected.