The AWS Certified Solutions Architect - Professional exam is well documented with blueprints, preparation guides, and online courses. These materials are full of the industry's best advice for building resilient workloads in the cloud.
Below are concepts, ideas, and excerpts I found unusual, interesting, or useful while reading through the AWS resources on the required reading list of the A Cloud Guru course "AWS Certified Solutions Architect - Professional 2020".
AWS provides a service for reviewing your workloads at no charge. The AWS Well-Architected Tool (AWS WA Tool) is a service in the cloud that provides a consistent process for you to review and measure architecture using the AWS Well-Architected Framework.
Security and operational excellence are generally not traded off against the other pillars.
Technology architecture teams typically include a set of roles such as: Technical Architect (infrastructure), Solutions Architect (software), Data Architect, Networking Architect, and Security Architect.
“Good intentions never work, you need good mechanisms to make anything happen” — Jeff Bezos.
Stop guessing your capacity needs
Evaluate threats to the business (for example, business risk and liabilities, and information security threats) and maintain this information in a risk registry. Evaluate the impact of risks, and tradeoffs between competing interests or alternative approaches. For example, accelerating speed to market for new features may be emphasized over cost optimization.
Ensure that there are identified owners for each application, workload, platform, and infrastructure component, and that each process and procedure has an identified owner responsible for its definition, and owners responsible for their performance.
AWS supports more security standards and compliance certifications than any other offering, including PCI-DSS, HIPAA/HITECH, FedRAMP, GDPR, FIPS 140-2, and NIST 800-171
When responsibility and ownership are undefined or unknown, you are at risk of both not performing necessary action in a timely fashion and of redundant and potentially conflicting efforts emerging to address those needs.
Plan for unsuccessful changes so that you are able to respond faster if necessary, and test and validate the changes you make.
Use a consistent process (including manual or automated checklists) to know when you are ready to go live with your workload or a change.
All of the metrics you collect should be aligned to a business need and the outcomes they support. Develop scripted responses to well-understood events and automate their performance in response to recognizing the event.
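As a toy illustration (my own, not from the whitepaper), a scripted response could be a Lambda function subscribed to an SNS topic that a CloudWatch alarm publishes to. The remediation shown here, rebooting the affected instance, is a placeholder for whatever the well-understood fix actually is:

```python
# Hypothetical automated response to a well-understood event:
# a CloudWatch alarm publishes to SNS, SNS invokes this Lambda,
# and the handler reboots the instance named in the alarm's dimensions.
import json
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    # SNS delivers the CloudWatch alarm as a JSON string in the message body.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    dimensions = message["Trigger"]["Dimensions"]
    instance_ids = [d["value"] for d in dimensions if d["name"] == "InstanceId"]
    if instance_ids:
        # The "well-understood" remediation for this particular alarm.
        ec2.reboot_instances(InstanceIds=instance_ids)
    return {"alarm": message["AlarmName"], "rebooted": instance_ids}
```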
You must learn, share, and continuously improve to sustain operational excellence. Perform post-incident analysis of all customer impacting events.
On AWS, you can export your log data to Amazon S3 or send logs directly to Amazon S3 for long-term storage. Using AWS Glue, you can discover and prepare your log data in Amazon S3 for analytics, and store associated metadata in the AWS Glue Data Catalog. Amazon Athena, through its native integration with AWS Glue, can then be used to analyze your log data, querying it using standard SQL. Using a business intelligence tool like Amazon QuickSight, you can visualize, explore, and analyze your data.
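To make this concrete, here is a minimal boto3 sketch of the Athena step of that pipeline. The database, table, and results bucket names are placeholders, and it assumes the logs have already been cataloged in Glue:

```python
# Sketch: query log data already cataloged by AWS Glue using Amazon Athena.
import time
import boto3

athena = boto3.client("athena")

def query_logs():
    query = """
        SELECT status_code, COUNT(*) AS requests
        FROM alb_access_logs              -- placeholder table name
        WHERE day = '2020-06-01'          -- placeholder partition
        GROUP BY status_code
        ORDER BY requests DESC
    """
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "logs_db"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes (simplified; real code should bound this).
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    return athena.get_query_results(QueryExecutionId=query_id) if state == "SUCCEEDED" else None
```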
Successful evolution of operations is founded in: frequent small improvements; providing safe environments and time to experiment, develop, and test improvements; and environments in which learning from failures is encouraged.
Before you architect any workload, you need to put in place practices that influence security. You will want to control who can do what.
Security
Identity and Access Management
Detection
Infrastructure Protection
Data Protection
Incident Response
In AWS, you can implement detective controls by processing logs, events, and monitoring that allows for auditing, automated analysis, and alarming. CloudTrail logs, AWS API calls, and CloudWatch provide monitoring of metrics with alarming, and AWS Config provides configuration history. Amazon GuardDuty is a managed threat detection service that continuously monitors for malicious or unauthorized behavior to help you protect your AWS accounts and workloads. Service-level logs are also available; for example, you can use Amazon Simple Storage Service (Amazon S3) to log access requests.
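A small sketch of wiring up two of these detective controls with boto3. The bucket names are placeholders, and the target logging bucket is assumed to already grant the S3 log delivery permissions:

```python
# Sketch of two detective controls mentioned above:
# S3 server access logging and a GuardDuty detector.
import boto3

s3 = boto3.client("s3")
guardduty = boto3.client("guardduty")

# Log access requests for a data bucket into a dedicated logging bucket.
s3.put_bucket_logging(
    Bucket="my-data-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-access-logs-bucket",
            "TargetPrefix": "my-data-bucket/",
        }
    },
)

# Turn on GuardDuty's continuous threat detection for this account and Region.
guardduty.create_detector(Enable=True)
```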
Log management is important to a Well-Architected workload for reasons ranging from security or forensics to regulatory or legal requirements.
Enforcing boundary protection, monitoring points of ingress and egress, and comprehensive logging, monitoring, and alerting are all essential to an effective information security plan.
An eye-opener about load testing in production:
"Load testing in production should also be considered as part of game days where the production system is stressed, during hours of lower customer usage, with all personnel on hand to interpret results and
address any problems that arise."
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. – Principles of Chaos Engineering
In pre-production and testing environments, chaos engineering should be done regularly and be part of your CI/CD cycle. Chaos engineering in production is also encouraged; however, teams must take care not to disrupt availability for customers.
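A minimal, hypothetical chaos experiment that could be dropped into such a pipeline stage (this is my own sketch, not an AWS-prescribed tool, and the tag filter is an assumption): terminate one random instance in a test environment and let the Auto Scaling group's health checks replace it.

```python
# Toy chaos experiment: kill one random *test* instance and verify recovery.
import random
import boto3

ec2 = boto3.client("ec2")

def terminate_random_test_instance():
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["test"]},  # assumed tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```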
Check out the interesting resources referenced from the whitepaper.
An interesting concept that applies well in other areas is Game Days.
"Conduct game days regularly: Use game days to regularly exercise your procedures for responding to events and failures as close to production as possible (including in production environments) with the people who will be involved in actual failure scenarios. Game days enforce measures to ensure that production events do not impact users."
The cloud, and AWS in particular, makes it easy to add redundancy to your solution; moreover, AWS encourages you to do so. A common pitfall to avoid is deploying to multiple Availability Zones within a Region at the same time:
"Fault isolated zonal deployment: One of the most important rules AWS has established for its own deployments is to avoid touching multiple Availability Zones within a Region at the same time. This is
critical to ensuring that Availability Zones are independent for purposes of our availability calculations.
We recommend that you use similar considerations in your deployments."
Multi-AZ is not an option for providing redundancy for Amazon EMR. Instead, it is possible to provision multiple master nodes and protect the cluster with termination protection.
It is important to keep in mind that all data stored on an Amazon EMR cluster is lost upon cluster termination. The EMR File System (EMRFS) can store data in Amazon S3, and S3 objects can be replicated across multiple Availability Zones or Regions.
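As a sketch, launching such a cluster with boto3 might look like the following. The instance types, release label, subnet, and bucket names are placeholders, and three master nodes require EMR 5.23.0 or later:

```python
# Sketch: EMR cluster with three master nodes and termination protection.
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="multi-master-demo",
    ReleaseLabel="emr-5.30.0",          # multi-master needs EMR 5.23.0+
    LogUri="s3://my-emr-logs-bucket/",  # keep logs outside the cluster
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 3},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2SubnetId": "subnet-0123456789abcdef0",  # placeholder subnet
        "TerminationProtected": True,               # guard against accidental termination
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```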
The purpose of a bulkhead architecture is to limit the impact of a failure to a small subset of users or requests, so that other users and requests can continue using the service unaffected. Bulkheads for data are partitions; bulkheads for services are called cells.
Each cell is a complete, independent service instance that is allowed to grow only up to a maximum size; workloads grow by adding more cells, and any failure is contained to the cell in which it occurs. The key elements of a cell-based (bulkhead) architecture on AWS are a cell router and n cells. Routing is based on a partition key in the request or user data, tying each request to a particular cell. Each cell uses its own ALB, compute, and storage.
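An illustrative (non-AWS-specific) sketch of the routing idea: hash the partition key to pick one of n cell endpoints, so a given customer always lands in the same cell.

```python
# Illustrative cell router: a partition key from the request deterministically
# maps to one of n cells, each of which owns its own ALB, compute, and storage.
import hashlib

# Hypothetical cell endpoints.
CELLS = [
    "https://cell-0.example.com",
    "https://cell-1.example.com",
    "https://cell-2.example.com",
]

def route(partition_key: str) -> str:
    """Return the cell endpoint responsible for this partition key."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

# A customer ID used as the partition key always lands in the same cell,
# so a failure in one cell only affects the customers mapped to it.
print(route("customer-1234"))
```

In practice a mapping table or consistent hashing would be used instead of a plain modulo, so that adding a cell does not remap existing customers.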
"Note that playbooks are used in response to specific incidents, while runbooks are used to achieve specific outcomes. Often, runbooks are used for routine activities and playbooks are used to respond to non-routine events."
Chapter "Example Implementations for Availability Goals" in "Reliablity Pillar AWS Well-Architected Framework" contains details descriptions for executable architecture implementations with pointers to business cases. Suggested samples cover:
Statically stable in this context is such that does not require control plane changes. Single region examples cover cases of 2-4 9s availability.
Multi-region scenarios have higher cost of operation. Region isolation is a natural boundary to isolate failure. Great care is required to avoid correlated failure across mutliple regions. Availabilities covered: 3.5 9s(99.95%),
"In the cloud, there are a number of principles that can help you strengthen your workload security:
• Strong identity foundation: principle of least privilege, separation of duties with appropriate authorization for each interaction with your AWS resources. Centralize identity management, and aim to eliminate reliance on long-term static credentials.
• Enable traceability: Monitor, alert, and audit actions and changes to your environment in real time.
Integrate log and metric collection with systems to automatically investigate and take action.
• Apply security at all layers: Apply a defense in depth approach with multiple security controls. Apply
to all layers (for example, edge of network, VPC, load balancing, every instance and compute service,
operating system, application, and code).
• Automate security best practices: Automated software-based security mechanisms improve your
ability to securely scale more rapidly and cost-effectively. Create secure architectures, including the
implementation of controls that are defined and managed as code in version-controlled templates.
• Protect data in transit and at rest: Classify your data into sensitivity levels and use mechanisms, such
as encryption, tokenization, and access control where appropriate.
• Keep people away from data: Use mechanisms and tools to reduce or eliminate the need for direct
access or manual processing of data. This reduces the risk of mishandling or modification and human
error when handling sensitive data.
• Prepare for security events: Prepare for an incident by having incident management and investigation
policy and processes that align to your organizational requirements. Run incident response simulations
and use tools with automation to increase your speed for detection, investigation, and recovery."
1. Foundations
2. Identity and access management
3. Detection
4. Infrastructure protection
5. Data protection
6. Incident response
AWS Organizations supports management of Service Control Policies (SCPs). It is recommended to structure AWS Organizations organizational units (OUs) by function rather than by the company's reporting lines. SCPs can be attached to the organization root, to OUs, or to individual AWS accounts. SCPs define the maximum available permissions for IAM entities in an account; IAM entities include all users, roles, and the account root user.
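As an illustration, creating and attaching a simple SCP with boto3 might look like this. The policy content is a common example (deny member accounts from leaving the organization), the OU ID is a placeholder, and the calls must run from the management account:

```python
# Sketch: create a Service Control Policy and attach it to an OU.
import json
import boto3

org = boto3.client("organizations")

scp_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "organizations:LeaveOrganization",
            "Resource": "*",
        }
    ],
}

policy = org.create_policy(
    Name="DenyLeavingOrganization",
    Description="Member accounts may not remove themselves from the organization",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp_document),
)

# Attach the SCP to an OU; the ID below is a placeholder.
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-xxxx-xxxxxxxx",
)
```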
Best-practice OUs:
AWS Control Tower orchestrates the capabilities of several other AWS services (AWS Organizations, AWS Service Catalog, and AWS SSO) to build an AWS landing zone quickly. It helps to combat drift, which is divergence from best practices, by applying preventive and detective controls (guardrails).
Protecting Data at Rest
Protecting Data in Transit
Incident Response
Approaches to address incident response:
Educate
Prepare
Simulate
Game days
Game day scenario examples: a leaked credential, a server communicating with unwanted systems, or a misconfiguration that results in unauthorized exposure
Simulation phases:
Iterate
People Perspective
Platform Perspective
Governance Perspective
Security Perspective
Business Perspective: focused on ensuring that IT is aligned with business needs and that IT investments can be traced to demonstrable business results.
Operations Perspective - Manage and Scale: Every organization has an operations group that defines how day-to-day, quarter-to-quarter, and year-to-year business will be conducted.
Recommended OUs: