The AWS Certified Solutions Architect - Professional exam is well documented with blueprints, preparation guides, and online courses. These materials are full of the industry's best advice for building resilient workloads in the cloud.
Below are concepts, ideas, and excerpts I found unusual, interesting, or useful while reading through the AWS resources on the required reading list of the A Cloud Guru course "AWS Certified Solutions Architect - Professional 2020".
AWS provides a service for reviewing your workloads at no charge. The AWS Well-Architected Tool (AWS WA Tool) is a service in the cloud that provides a consistent process for you to review and measure architecture using the AWS Well-Architected Framework.
Security and operational excellence are generally not traded off against the other pillars.
Technology architecture teams typically include a set of roles such as: Technical Architect (infrastructure), Solutions Architect (software), Data Architect, Networking Architect, and Security Architect.
“Good intentions never work, you need good mechanisms to make anything happen” — Jeff Bezos.
Stop guessing your capacity needs
Evaluate threats to the business (for example, business risk and liabilities, and information security threats) and maintain this information in a risk registry. Evaluate the impact of risks, and tradeoffs between competing interests or alternative approaches. For example, accelerating speed to market for new features may be emphasized over cost optimization.
Ensure that there are identified owners for each application, workload, platform, and infrastructure component, and that each process and procedure has an identified owner responsible for its definition, and owners responsible for their performance.
AWS supports more security standards and compliance certifications than any other offering, including PCI-DSS, HIPAA/HITECH, FedRAMP, GDPR, FIPS 140-2, and NIST 800-171
When responsibility and ownership are undefined or unknown, you are at risk of both not performing necessary action in a timely fashion and of redundant and potentially conflicting efforts emerging to address those needs.
Plan for unsuccessful changes so that you are able to respond faster if necessary, and test and validate the changes you make.
Use a consistent process (including manual or automated checklists) to know when you are ready to go live with your workload or a change.
All of the metrics you collect should be aligned to a business need and the outcomes they support. Develop scripted responses to well-understood events and automate their performance in response to recognizing the event.
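As a toy illustration (my own, not from the whitepaper), a scripted response could be a Lambda function subscribed to an SNS topic that a CloudWatch alarm publishes to. The remediation shown here, rebooting the affected instance, is a placeholder for whatever the well-understood fix actually is:

```python
# Hypothetical automated response to a well-understood event:
# a CloudWatch alarm publishes to SNS, SNS invokes this Lambda,
# and the handler reboots the instance named in the alarm's dimensions.
import json
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    # SNS delivers the CloudWatch alarm as a JSON string in the message body.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    dimensions = message["Trigger"]["Dimensions"]
    instance_ids = [d["value"] for d in dimensions if d["name"] == "InstanceId"]
    if instance_ids:
        # The "well-understood" remediation for this particular alarm.
        ec2.reboot_instances(InstanceIds=instance_ids)
    return {"alarm": message["AlarmName"], "rebooted": instance_ids}
```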
You must learn, share, and continuously improve to sustain operational excellence. Perform post-incident analysis of all customer impacting events.
On AWS, you can export your log data to Amazon S3 or send logs directly to Amazon S3 for long-term storage. Using AWS Glue, you can discover and prepare your log data in Amazon S3 for analytics, and store associated metadata in the AWS Glue Data Catalog. Amazon Athena, through its native integration with AWS Glue, can then be used to analyze your log data, querying it using standard SQL. Using a business intelligence tool like Amazon QuickSight, you can visualize, explore, and analyze your data.
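To make this concrete, here is a minimal boto3 sketch of the Athena step of that pipeline. The database, table, and results bucket names are placeholders, and it assumes the logs have already been cataloged in Glue:

```python
# Sketch: query log data already cataloged by AWS Glue using Amazon Athena.
import time
import boto3

athena = boto3.client("athena")

def query_logs():
    query = """
        SELECT status_code, COUNT(*) AS requests
        FROM alb_access_logs              -- placeholder table name
        WHERE day = '2020-06-01'          -- placeholder partition
        GROUP BY status_code
        ORDER BY requests DESC
    """
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "logs_db"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes (simplified; real code should bound this).
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    return athena.get_query_results(QueryExecutionId=query_id) if state == "SUCCEEDED" else None
```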
Successful evolution of operations is founded in: frequent small improvements; providing safe environments and time to experiment, develop, and test improvements; and environments in which learning from failures is encouraged.
Before you architect any workload, you need to put in place practices that influence security. You will want to control who can do what.
Security
Identity and Access Management
Detection
Infrastructure Protection
Data Protection
Incident Response
In AWS, you can implement detective controls by processing logs, events, and monitoring that allows for auditing, automated analysis, and alarming. CloudTrail logs, AWS API calls, and CloudWatch provide monitoring of metrics with alarming, and AWS Config provides configuration history. Amazon GuardDuty is a managed threat detection service that continuously monitors for malicious or unauthorized behavior to help you protect your AWS accounts and workloads. Service-level logs are also available; for example, you can use Amazon Simple Storage Service (Amazon S3) to log access requests.
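A small sketch of wiring up two of these detective controls with boto3. The bucket names are placeholders, and the target logging bucket is assumed to already grant the S3 log delivery permissions:

```python
# Sketch of two detective controls mentioned above:
# S3 server access logging and a GuardDuty detector.
import boto3

s3 = boto3.client("s3")
guardduty = boto3.client("guardduty")

# Log access requests for a data bucket into a dedicated logging bucket.
s3.put_bucket_logging(
    Bucket="my-data-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-access-logs-bucket",
            "TargetPrefix": "my-data-bucket/",
        }
    },
)

# Turn on GuardDuty's continuous threat detection for this account and Region.
guardduty.create_detector(Enable=True)
```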
Log management is important to a Well-Architected workload for reasons ranging from security or forensics to regulatory or legal requirements.
Enforcing boundary protection, monitoring points of ingress and egress, and comprehensive logging, monitoring, and alerting are all essential to an effective information security plan.
An eye-opener about load testing in production:
"Load testing in production should also be considered as part of game days where the production system is stressed, during hours of lower customer usage, with all personnel on hand to interpret results and
address any problems that arise."
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. – Principles of Chaos Engineering
In pre-production and testing environments, chaos engineering should be done regularly and be part of your CI/CD cycle. Chaos engineering in production is also encouraged; however, teams must take care not to disrupt availability for customers.
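A minimal, hypothetical chaos experiment that could be dropped into such a pipeline stage (this is my own sketch, not an AWS-prescribed tool, and the tag filter is an assumption): terminate one random instance in a test environment and let the Auto Scaling group's health checks replace it.

```python
# Toy chaos experiment: kill one random *test* instance and verify recovery.
import random
import boto3

ec2 = boto3.client("ec2")

def terminate_random_test_instance():
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["test"]},  # assumed tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```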
Check out the interesting resources referenced from the whitepaper.
An interesting concept that applies well in other areas is Game Days.
"Conduct game days regularly: Use game days to regularly exercise your procedures for responding to events and failures as close to production as possible (including in production environments) with the people who will be involved in actual failure scenarios. Game days enforce measures to ensure that production events do not impact users."
The cloud, and AWS in particular, makes it easy to add redundancy to your solution; moreover, AWS encourages you to do so. A common pitfall to avoid is deploying to multiple Availability Zones within a Region at the same time:
"Fault isolated zonal deployment: One of the most important rules AWS has established for its own deployments is to avoid touching multiple Availability Zones within a Region at the same time. This is
critical to ensuring that Availability Zones are independent for purposes of our availability calculations.
We recommend that you use similar considerations in your deployments."
Multi-AZ is not an option for providing redundancy for Amazon EMR. Instead, it is possible to provision multiple master nodes and protect the cluster with termination protection.
It is important to keep in mind that all data stored on an Amazon EMR cluster is lost upon cluster termination. The EMR File System (EMRFS) can store data in Amazon S3, and S3 objects can be replicated across multiple Availability Zones or Regions.
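As a sketch, launching such a cluster with boto3 might look like the following. The instance types, release label, subnet, and bucket names are placeholders, and three master nodes require EMR 5.23.0 or later:

```python
# Sketch: EMR cluster with three master nodes and termination protection.
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="multi-master-demo",
    ReleaseLabel="emr-5.30.0",          # multi-master needs EMR 5.23.0+
    LogUri="s3://my-emr-logs-bucket/",  # keep logs outside the cluster
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 3},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2SubnetId": "subnet-0123456789abcdef0",  # placeholder subnet
        "TerminationProtected": True,               # guard against accidental termination
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```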
The purpose of a bulkhead architecture is to limit the impact of a failure to a small subset of users or requests, so that other users and requests can continue using the service unaffected. Bulkheads for data are partitions; bulkheads for services are called cells.
Each cell is a complete, independent service instance that is allowed to grow only up to a maximum size; workloads grow by adding more cells, and any failure is contained to the cell in which it occurs. The key elements of a cell-based (bulkhead) architecture on AWS are a cell router and n cells. Routing is based on a partition key in the request or user data, tying each request to a particular cell. Each cell uses its own ALB, compute, and storage.
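An illustrative (non-AWS-specific) sketch of the routing idea: hash the partition key to pick one of n cell endpoints, so a given customer always lands in the same cell.

```python
# Illustrative cell router: a partition key from the request deterministically
# maps to one of n cells, each of which owns its own ALB, compute, and storage.
import hashlib

# Hypothetical cell endpoints.
CELLS = [
    "https://cell-0.example.com",
    "https://cell-1.example.com",
    "https://cell-2.example.com",
]

def route(partition_key: str) -> str:
    """Return the cell endpoint responsible for this partition key."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

# A customer ID used as the partition key always lands in the same cell,
# so a failure in one cell only affects the customers mapped to it.
print(route("customer-1234"))
```

In practice a mapping table or consistent hashing would be used instead of a plain modulo, so that adding a cell does not remap existing customers.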
"Note that playbooks are used in response to specific incidents, while runbooks are used to achieve specific outcomes. Often, runbooks are used for routine activities and playbooks are used to respond to non-routine events."
Chapter "Example Implementations for Availability Goals" in "Reliablity Pillar AWS Well-Architected Framework" contains details descriptions for executable architecture implementations with pointers to business cases. Suggested samples cover:
Statically stable in this context is such that does not require control plane changes. Single region examples cover cases of 2-4 9s availability.
Multi-region scenarios have higher cost of operation. Region isolation is a natural boundary to isolate failure. Great care is required to avoid correlated failure across mutliple regions. Availabilities covered: 3.5 9s(99.95%),
"In the cloud, there are a number of principles that can help you strengthen your workload security:
• Strong identity foundation: principle of least privilege, separation of duties with appropriate authorization for each interaction with your AWS resources. Centralize identity management, and aim to eliminate reliance on long-term static credentials.
• Enable traceability: Monitor, alert, and audit actions and changes to your environment in real time.
Integrate log and metric collection with systems to automatically investigate and take action.
• Apply security at all layers: Apply a defense in depth approach with multiple security controls. Apply
to all layers (for example, edge of network, VPC, load balancing, every instance and compute service,
operating system, application, and code).
• Automate security best practices: Automated software-based security mechanisms improve your
ability to securely scale more rapidly and cost-effectively. Create secure architectures, including the
implementation of controls that are defined and managed as code in version-controlled templates.
• Protect data in transit and at rest: Classify your data into sensitivity levels and use mechanisms, such
as encryption, tokenization, and access control where appropriate.
• Keep people away from data: Use mechanisms and tools to reduce or eliminate the need for direct
access or manual processing of data. This reduces the risk of mishandling or modification and human
error when handling sensitive data.
• Prepare for security events: Prepare for an incident by having incident management and investigation
policy and processes that align to your organizational requirements. Run incident response simulations
and use tools with automation to increase your speed for detection, investigation, and recovery."
1. Foundations
2. Identity and access management
3. Detection
4. Infrastructure protection
5. Data protection
6. Incident response
AWS Organizations supports management of Service Control Policies (SCPs). It is recommended to structure AWS Organizations organizational units (OUs) by function rather than by the company's reporting lines. SCPs can be attached to the organization root, to OUs, or to individual AWS accounts. SCPs define the maximum available permissions for IAM entities in an account; IAM entities include all users, roles, and the account root user.
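As an illustration, creating and attaching a simple SCP with boto3 might look like this. The policy content is a common example (deny member accounts from leaving the organization), the OU ID is a placeholder, and the calls must run from the management account:

```python
# Sketch: create a Service Control Policy and attach it to an OU.
import json
import boto3

org = boto3.client("organizations")

scp_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "organizations:LeaveOrganization",
            "Resource": "*",
        }
    ],
}

policy = org.create_policy(
    Name="DenyLeavingOrganization",
    Description="Member accounts may not remove themselves from the organization",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp_document),
)

# Attach the SCP to an OU; the ID below is a placeholder.
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-xxxx-xxxxxxxx",
)
```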
Best-practice OUs:
AWS Control Tower orchestrates the capabilities of several other AWS services (AWS Organizations, AWS Service Catalog, and AWS SSO) to build an AWS landing zone quickly. It helps to combat drift, which is divergence from best practices, by applying preventive and detective controls (guardrails).
Protecting Data at Rest
Protecting Data in Transit
Incident Response
Approaches to address incident response:
Educate
Prepare
Simulate
Game days
Game day scenario examples: a leaked credential, a server communicating with unwanted systems, or a misconfiguration that results in unauthorized exposure
Simulation phases:
Iterate
People Perspective
Platform Perspective
Governance Perspective
Security Perspective
Business Perspective: focused on ensuring that IT is aligned with business needs and that IT investments can be traced to demonstrable business results.
Operations Perspective - Manage and Scale: Every organization has an operations group that defines how day-to-day, quarter-to-quarter, and year-to-year business will be conducted.
Recommended OUs: