The Reliability pillar covers a workload's ability to perform its intended function correctly and consistently on AWS, including operating and testing the workload throughout its lifecycle.
## Design Principles
1. Automatically recover from failure: Monitor KPIs that measure business value, and trigger automated recovery when a threshold is breached so that failures are anticipated and fixed.
2. Test recovery procedures: Use cloud automation to simulate and test failure scenarios, validating recovery strategies.
3. Scale horizontally: Replace one large resource with multiple small ones to reduce the impact of a single failure, and distribute requests so the resources do not share a common point of failure.
4. Stop guessing capacity: Monitor demand and utilization, and add or remove resources automatically so the workload is neither over- nor under-provisioned.
5. Manage change in automation: Make infrastructure changes through automation so that every change, including changes to the automation itself, can be tracked and reviewed.
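
The effect of the "scale horizontally" principle can be estimated with some simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes identical nodes that fail independently, which real deployments only approximate (shared dependencies are exactly the common failure points the principle warns about).

```python
from math import comb


def capacity_lost_on_single_failure(node_count: int) -> float:
    """Fraction of total capacity lost when one of `node_count` equal nodes fails."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return 1.0 / node_count


def availability_at_least(required: int, total: int, node_availability: float) -> float:
    """Probability that at least `required` of `total` nodes are up.

    Assumption (not from the source): node failures are independent and
    every node has the same availability.
    """
    if not 0 <= required <= total:
        raise ValueError("required must be between 0 and total")
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    p = node_availability
    # Binomial tail: sum over every count of healthy nodes that is sufficient.
    return sum(
        comb(total, up) * p**up * (1 - p) ** (total - up)
        for up in range(required, total + 1)
    )


if __name__ == "__main__":
    # One large node versus three small nodes where any two can carry the load.
    print(capacity_lost_on_single_failure(1), capacity_lost_on_single_failure(3))
    print(availability_at_least(1, 1, 0.99), availability_at_least(2, 3, 0.99))
```

Under these assumptions, three nodes with a two-node requirement are more available than one node of the same per-node availability, and a single failure removes a third of the capacity instead of all of it.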
## Best Practices
There are four best practice areas for reliability in the cloud:
- Foundations: Manage service quotas and plan network topology so the workload has the network and compute capacity it needs.
- Workload architecture: Favor [[Service-Oriented Architecture (SOA)]] or [[Microservices Architecture]], and design interactions between distributed components to be resilient and robust.
- Change management: Automate scaling, monitor the workload, and control how changes are deployed so the workload stays reliable as demand and operations change.
- Failure management: Emphasize data backup, fault isolation, component resilience, thorough testing, and disaster recovery planning to ensure workload reliability.
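
One common way to make interactions between distributed components more resilient is to retry transient failures with exponential backoff and jitter. The summary above does not name this technique, so treat the following as a minimal sketch rather than a prescribed implementation. `TransientError` and `call_with_backoff` are hypothetical names, and the default delays are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call `operation`, retrying on TransientError with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Retries exhausted: surface the failure to the caller.
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2**attempt)
            # Jitter: wait a random fraction of the ceiling so that many
            # clients do not retry in lockstep.
            sleep(rng() * ceiling)
    raise AssertionError("unreachable")
```

The `sleep` and `rng` parameters are injectable so the retry schedule can be tested without real waiting. Retries are only safe when the operation is idempotent.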
## Resources
[Whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel-resources.html#rel-wp)