The AWS definition for operational resilience is:
Operational resilience is the ability to provide continuous service through people, processes, and technology that are aware of and adaptive to constant change. It is a real-time, execution-oriented norm embedded in the culture of AWS that is distinct from traditional approaches in Business Continuity, Disaster Recovery, and Crisis Management, which rely primarily on centralized, hierarchical programs focused on documentation development and maintenance.
Why should you optimize for operational resilience?
The answer is simple. If you don’t, even minor failures can lead to prolonged downtime, driving away customers and angering stakeholders. According to an IDC study (Carvalho & Marden, 2018), the average cost of downtime for Fortune 1000 companies is between $1.25 billion and $2.5 billion per year.
How can you achieve operational resilience?
Avoid assumptions and confirm the facts. There are four AWS operational resilience pillars to check: Infrastructure, Operations, Security, and Software. For each of them, make sure your application can handle failures and recover quickly. If you want to dig deeper, check out the AWS training material for each of these pillars.
Take advantage of the highly resilient infrastructure provided by AWS, and host your resources in multiple Availability Zones to greatly reduce the impact of infrastructure failures caused by hardware faults, natural disasters, and power outages. AWS provides many managed services for building highly available and scalable applications, so you don’t have to reinvent the wheel.
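To make the multi-AZ idea concrete, here is a minimal sketch in plain Python. It is not an AWS API call, only an illustration of the placement logic: the instance names are hypothetical, and the zone names are example values. It spreads instances round-robin across zones and shows which ones keep running if a single zone fails.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_names, zones):
    """Assign each instance to a zone in round-robin order."""
    if not zones:
        raise ValueError("at least one Availability Zone is required")
    zone_cycle = cycle(zones)
    return {name: next(zone_cycle) for name in instance_names}


def surviving_instances(placement, failed_zone):
    """Return the instances that keep running if one zone goes down."""
    return [name for name, zone in placement.items() if zone != failed_zone]


# Example zone names and hypothetical instance names (assumptions).
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
instances = [f"web-{i}" for i in range(6)]

placement = spread_across_zones(instances, zones)
per_zone = Counter(placement.values())
survivors = surviving_instances(placement, "us-east-1a")

print(per_zone)
print(survivors)
```

With six instances in three zones, losing one zone still leaves four instances serving traffic. In practice, managed services such as load balancers and auto scaling groups can handle this distribution for you.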
If you haven’t already, create playbooks so your team knows how to handle failures and disaster recovery, and practice them before going into production. Create regular backups to avoid or minimize data loss.
Use the feedback and lessons learned from handling issues and incidents to continuously improve these playbooks.
AWS follows a shared responsibility model when it comes to security. In short, AWS is responsible for the security of its hardware and infrastructure, and it provides tools so you can implement the required security measures on your side. Always apply the principle of least privilege to keep your data safe, and create regular backups to limit the damage from ransomware and similar attacks.
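As an illustration of least privilege, the sketch below builds an IAM policy document that allows only reading objects under a single S3 prefix, plus a small check that flags wildcard actions. The bucket name, prefix, statement ID, and helper function names are hypothetical. The policy structure follows the standard IAM JSON policy format, and nothing here calls AWS.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Build a policy that only allows reading objects under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadOnlyUnderPrefix",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }


def has_wildcard_action(policy):
    """Return True if any Allow statement grants '*' or 'service:*'."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement is also valid
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if any(a == "*" or a.endswith(":*") for a in actions):
            return True
    return False


# Hypothetical bucket and prefix, used only for illustration.
policy = read_only_prefix_policy("example-bucket", "reports")
too_broad = {"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}

print(json.dumps(policy, indent=2))
print(has_wildcard_action(policy), has_wildcard_action(too_broad))
```

A check like this can run in a review step to catch overly broad grants before they are deployed. It is only a starting point and does not replace the AWS guidance on writing least privilege policies.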
AWS Managed Services helps you reduce the risks of provisioning, running, and supporting your infrastructure by automating common activities such as change requests, monitoring, patch management, security, and backup services.
AWS provides toolkits to improve the stability and security of your application development and delivery, such as AWS CodeDeploy and AWS CodePipeline.
Optimizing for operational resilience is often overlooked until companies feel the pain for the first time. Make sure to plan ahead and retain your customers with stability and trustworthiness.
- Carvalho, L., & Marden, M. (2018, February). Fostering Business and Organizational Transformation to Generate Business Value with Amazon Web Services (IDC White Paper, Document #US43535718). IDC. Retrieved May 5, 2022, from https://pages.awscloud.com/rs/112-TZM-766/images/AWS-BV%20IDC%202018.pdf
- Shared Responsibility Model – Amazon Web Services (AWS). (n.d.). Amazon Web Services, Inc. Retrieved May 5, 2022, from https://aws.amazon.com/compliance/shared-responsibility-model/
- Techniques for writing least privilege IAM policies. (n.d.). Amazon Web Services. Retrieved May 5, 2022, from https://aws.amazon.com/blogs/security/techniques-for-writing-least-privilege-iam-policies/
- Learn AWS with Training and Certification | Cloud Skills Courses and Programs | AWS. (n.d.). Amazon Web Services, Inc. Retrieved May 5, 2022, from https://aws.amazon.com/training/