Reliability Pillar - Dr. Miles Aron's Notes

The Reliability pillar guides implementing consistent, correct AWS workloads, emphasizing lifecycle operation and testing.

## Design Principles

1. Automatically recover from failure: Automate recovery by monitoring business-value KPIs to predict and fix failures.
2. Test recovery procedures: Use cloud automation to simulate and test failure scenarios, validating recovery strategies.
3. Scale horizontally: Use multiple small resources to minimize single failure impacts and avoid common failure points.
4. Stop guessing capacity: Monitor and adjust resources automatically to meet demand without over- or under-provisioning.
5. Manage change in automation: Implement infrastructure changes through automation for better tracking and review.

## Best Practices

There are four best practice areas for reliability in the cloud:

- Foundations: Address network and compute capacity by focusing on managing service quotas and planning network topology.
- Workload architecture: Favor [[Service-Oriented Architecture (SOA)]] or [[Microservices Architecture]], ensuring resilience and robustness in distributed system interactions.
- Change management: Automate scaling, leverage monitoring, and implement change management to ensure workload reliability amidst operational and demand changes.
- Failure management: Emphasize data backup, fault isolation, component resilience, thorough testing, and disaster recovery planning to ensure workload reliability.

## Resources

[Whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel-resources.html#rel-wp)
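
## Example: Sizing Capacity from Demand

To make design principles 3 and 4 (scale horizontally, stop guessing capacity) concrete, here is a minimal sketch of a proportional scaling rule: the instance count is scaled by the ratio of an observed metric to its target, then clamped to a fixed range. This is an illustration only, not the algorithm any AWS service uses. The function name `desired_capacity`, its parameters, and the default limits are all hypothetical.

```python
import math


def desired_capacity(current_instances, observed_metric, target_metric,
                     min_instances=1, max_instances=10):
    """Return an instance count that moves the metric toward its target.

    Hypothetical helper: a simplified proportional rule, assuming the
    metric (for example average CPU percent) scales inversely with the
    number of instances sharing the load.
    """
    if current_instances < 1:
        raise ValueError("current_instances must be at least 1")
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    # Scale the fleet by how far the observed metric is from the target,
    # rounding up so the fleet is never sized below what the ratio asks for.
    raw = math.ceil(current_instances * observed_metric / target_metric)

    # Clamp to the configured bounds so a metric spike cannot trigger
    # unbounded growth and a quiet period cannot scale to zero.
    return max(min_instances, min(max_instances, raw))


if __name__ == "__main__":
    # 4 instances at 90% average CPU against a 60% target.
    print(desired_capacity(4, 90.0, 60.0))
```

The upper bound doubles as a guard against exceeding service quotas, which is the concern the Foundations area raises.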