Open-Source Study Guide

Cloud Architect Interview Guide

Everything you need to prepare for Solutions Architect interviews — from networking fundamentals to system design walkthroughs, with 148 interview questions and scripted answer frameworks.

15 Topic Pages
100+ Topics Covered
148 Interview Questions
4 System Design Walkthroughs

Infrastructure


Data & Apps


Design & Ops


Foundations


Printable Guides (PDF)


Top 10 Cloud Architect Interview Questions

10 high-frequency Solutions Architect interview questions with STAR-format answers grounded in real experience at Enterprise Pharma, enhanced with key AWS architectural patterns.

The Definitive Interview Compendium

Senior-level frameworks for the most high-stakes questions — with "bad vs. delightful" answer contrasts, comparison tables, and the "So What?" layer that separates architects from engineers.

◆ The most mis-answered foundational question
| Dimension | High Availability (HA) | Disaster Recovery (DR) |
| --- | --- | --- |
| Scope | Within a single Region (cross-AZ) | Across different geographic Regions |
| Primary Goal | Minimize downtime during AZ failure | Restore service after a Region-wide event |
| AWS Examples | Multi-AZ EC2, Application Load Balancers | Route 53 Health Checks, S3 Cross-Region Replication |
  • Bad Answer: "HA means the app stays up; DR is when it falls back to another region." This conflates the two concepts entirely.
  • Delightful Answer: HA is scoped to AZs within one region. DR is scoped to different geographic regions. Confusing them signals you haven't operated at enterprise scale.
  • The "So What?" Layer: A total AZ failure is handled by HA — no human intervention needed. A total regional failure requires DR — a deliberate architectural and operational strategy.
◆ Layer-by-layer HA walkthrough — the "delightful" answer format
  • External ALB: Routes traffic across multiple AZs. Inherently HA — no single point of failure.
  • Web / App Tier (EC2 in ASG): A single EC2 instance resides in one AZ and is a SPOF. Deploy across ≥2 AZs inside an Auto Scaling Group. If one AZ fails, traffic re-routes to instances in surviving AZs automatically.
  • Internal ALB: Decouples the web and app tiers; distributes internal traffic across healthy AZs — identical HA properties to the external ALB.
  • Data Tier (Amazon DynamoDB): A regional service that automatically replicates data across three AZs. AZ failure is fully transparent to application code.
  • So What? This design ensures a complete data-center failure is a non-event for end users — business continuity with zero manual intervention.
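The layer-by-layer walkthrough above can be sketched as a toy routing loop. This is not AWS API code; the instance names, AZ labels, and health flags are invented to illustrate why ≥2 AZs behind a load balancer make an AZ failure a non-event:

```python
import random

# Hypothetical sketch of cross-AZ routing: an ALB-style balancer only
# sends traffic to healthy targets, so losing an entire AZ just shrinks
# the target pool. Names and health states are illustrative, not real
# AWS objects.

targets = {
    "i-web-1a": {"az": "us-east-1a", "healthy": True},
    "i-web-1b": {"az": "us-east-1b", "healthy": True},
}

def route(targets):
    """Return a healthy target, mimicking cross-AZ load balancing."""
    healthy = [name for name, meta in targets.items() if meta["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy targets in any AZ")
    return random.choice(healthy)

# Simulate a total failure of us-east-1a: its instances go unhealthy.
for meta in targets.values():
    if meta["az"] == "us-east-1a":
        meta["healthy"] = False

# Traffic keeps flowing with zero manual intervention -- every request
# now lands in the surviving AZ.
assert route(targets) == "i-web-1b"
```

The same pruning logic applies at every tier: external ALB, internal ALB, and (invisibly) inside regional services like DynamoDB.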
◆ The "trap question" — RPO is always measured in TIME, not data volume
  • RTO (Recovery Time Objective): Max acceptable time the application can be offline. Measured in minutes, hours, or days.
  • RPO (Recovery Point Objective): Max acceptable data loss, expressed as a point in time. A common mistake: measuring RPO in megabytes. Wrong. It's always time.
| Strategy | RTO / RPO | How It Works |
| --- | --- | --- |
| Backup & Restore | Hours / Days | Data backed up to S3; restored to a new region after disaster. Highest RTO/RPO, lowest cost. |
| Pilot Light | Minutes / Hours | Core data (DB) is live and replicated; app servers are "off" (AMIs) until disaster triggers provisioning. |
| Warm Standby | Minutes / Minutes | Scaled-down full environment always running in the DR region. Scale up on activation. |
| Multi-site Active-Active | Real-time / Real-time | Full traffic served in two or more regions simultaneously, with continuous synchronous or asynchronous replication. Highest cost, near-zero RTO/RPO. |
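Because RPO is a time target, checking it is pure time arithmetic. A minimal sketch, with invented timestamps, showing why the worst-case data loss for Backup & Restore is simply the age of the last good backup:

```python
from datetime import datetime, timedelta

# Illustrative arithmetic only: RPO is measured in TIME, so the check
# compares the age of the last good backup against the RPO target,
# never a byte count. All timestamps below are made up.

def worst_case_data_loss(last_backup: datetime, disaster: datetime) -> timedelta:
    """Everything written since the last backup is lost."""
    return disaster - last_backup

def meets_rpo(loss: timedelta, rpo: timedelta) -> bool:
    return loss <= rpo

last_backup = datetime(2024, 1, 1, 0, 0)
disaster = datetime(2024, 1, 1, 5, 30)
loss = worst_case_data_loss(last_backup, disaster)

assert loss == timedelta(hours=5, minutes=30)
assert not meets_rpo(loss, rpo=timedelta(hours=4))   # a 4-hour RPO is blown
assert meets_rpo(loss, rpo=timedelta(hours=24))      # a daily-backup tier is fine
```

The same check works for every tier in the table; only the target shrinks, from days (Backup & Restore) toward near-zero (Active-Active).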
◆ Show fault isolation and EDA asynchrony — not just definitions
  • Monolith: Single tightly coupled unit. One component fails → redeploy everything. No independent scaling.
  • Microservices: Independent services, independently deployed, independently scaled, polyglot by design.
  • Real Flow — store.com (ALB Path-Based Routing):
    • store.com/browse → EC2 + DynamoDB (Python)
    • store.com/purchase → EKS + Aurora (Go) — isolated for high integrity
    • store.com/return → Lambda (serverless, spiky traffic)
    • If /browse crashes, /purchase and revenue remain unaffected. That's fault isolation.
  • EDA — Bad Answer: "API Gateway → Lambda → DynamoDB." That is synchronous, not event-driven.
  • EDA — Delightful: API Gateway → SQS → Lambda → DynamoDB. SQS decouples producer from consumer. Lambda consumes at its own rate. Benefits: independent scaling, built-in retry, no wasted idle capacity.
| Trait | Synchronous | EDA (Asynchronous) |
| --- | --- | --- |
| Scaling | All layers must match | Components scale independently |
| Retries | Hard-coded in client | Built-in queue retry |
| Cost | Over-provision for spikes | Pay only for events processed |
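The decoupling in the "delightful" EDA answer can be modeled with a plain in-memory queue. This is a toy stand-in for API Gateway → SQS → Lambda, with invented burst sizes and a deliberately flaky message to show the built-in retry behavior:

```python
from collections import deque

# Toy model of SQS-style decoupling: the producer bursts far faster
# than the consumer drains, and a failed message is re-queued (as when
# an SQS visibility timeout expires) instead of being lost. The burst
# size, batch size, and failing order ID are invented.

queue = deque()

def produce(n):
    """The producer dumps an entire spike onto the queue instantly."""
    for i in range(n):
        queue.append({"order_id": i, "attempts": 0})

def consume(batch_size, processed, fail_ids=()):
    """Drain up to batch_size messages; requeue first-attempt failures."""
    for _ in range(min(batch_size, len(queue))):
        msg = queue.popleft()
        msg["attempts"] += 1
        if msg["order_id"] in fail_ids and msg["attempts"] == 1:
            queue.append(msg)        # retry instead of dropping the order
        else:
            processed.append(msg["order_id"])

processed = []
produce(100)                          # a 10x traffic spike arrives at once
while queue:                          # the consumer works at its own rate
    consume(batch_size=10, processed=processed, fail_ids={7})

assert sorted(processed) == list(range(100))   # not a single order dropped
```

The producer never waits on the consumer, which is exactly the property a synchronous API Gateway → Lambda chain cannot give you.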
◆ Distinguish by schema approach and consistency model — not just "relational vs. not"
| Property | SQL (Relational) | NoSQL (Non-relational) |
| --- | --- | --- |
| Schema | Schema-on-write — rigid, predefined | Schema-on-read — flexible, evolving |
| Consistency | ACID transactions | Eventual consistency (configurable) |
| Scaling | Vertical primary | Horizontal by design |
| Best For | Complex joins, financial transactions | High-throughput, low-latency, variable structure |
| AWS Examples | Amazon Aurora, RDS | Amazon DynamoDB |
  • Architect's Rule: Never say "NoSQL is better." Choose based on access patterns. If you need ACID joins, SQL wins. If you need single-digit millisecond reads at millions of RPS, DynamoDB wins.
◆ Categorize GenAI use cases — "chatbots" alone is a junior answer
  • Customer Experience: Agentic assistants that take action on behalf of users (not just answer questions).
  • Productivity: Content creation, automated report generation, search summarization — measurable ROI.
  • Cutting Edge (MCP): Automated cost analysis, code-quality evaluation against architectural best practices — agents acting autonomously on cloud infrastructure.
  • Model Context Protocol (MCP): A standardized JSON-RPC protocol enabling LLMs to interact with tools and data sources without custom-coding every integration.
    • Pre-MCP: Manually define every API URL, header, and payload per tool.
    • MCP: MCP Server exposes tool schemas. MCP Client (e.g., VS Code) discovers tools dynamically. LLM chooses which tool to invoke based on purpose — no brittle hard-coded logic.
    • Key Innovation: Dynamic Discovery — the protocol feeds tool purpose and schema to the LLM at runtime.
| Protocol | What It Does |
| --- | --- |
| A2A | Agent-to-Agent communication (orchestration between agents) |
| MCP | Standardized protocol connecting agents/LLMs to tools and data |
| RAG | Retrieval-Augmented Generation — grounding LLM responses in external data without retraining |
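MCP's dynamic discovery can be sketched with plain dicts. MCP is genuinely JSON-RPC 2.0 and genuinely exposes a `tools/list` method; the tool itself (`get_cost_report`), its schema, and the single-function server below are invented for illustration and omit the real protocol's initialization handshake:

```python
import json

# Sketch of MCP-style dynamic discovery: the server advertises tool
# schemas, the client fetches them at runtime, and the LLM picks a tool
# by purpose -- no hard-coded URLs, headers, or payloads per tool.
# get_cost_report and this toy handler are hypothetical.

TOOLS = [{
    "name": "get_cost_report",
    "description": "Summarize AWS spend for a billing period",
    "inputSchema": {
        "type": "object",
        "properties": {"month": {"type": "string"}},
        "required": ["month"],
    },
}]

def handle(request: str) -> str:
    """Answer a JSON-RPC request; only tools/list is modeled here."""
    req = json.loads(request)
    if req["method"] == "tools/list":
        result = {"tools": TOOLS}
    else:
        result = {"error": "unknown method"}
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})

# The client discovers tool purpose + schema at runtime (Dynamic Discovery).
reply = json.loads(handle(json.dumps(
    {"jsonrpc": "2.0", "id": 1, "method": "tools/list"})))
assert reply["result"]["tools"][0]["name"] == "get_cost_report"
```

Contrast this with the pre-MCP world, where that same tool would mean one more hand-written URL, header set, and payload format in every client.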
◆ Know the human-button distinction and the push vs. pull model
  • Continuous Delivery: Code is always in a deployable state; a manual trigger pushes to production.
  • Continuous Deployment: Every passing build auto-deploys to production. Zero human intervention.
  • Senior-Level Pipeline (Source → Build → Test → Deploy):
    • IaC: Non-negotiable — Terraform ensures environment consistency across accounts.
    • Automated Unit Testing: AWS CodeBuild catches logic errors before staging.
    • DevSecOps: Shift security left — secret scanning and vulnerability assessment are pipeline gates, not post-deployment audits.
    • Multi-Account Strategy: Dev → Test → Prod isolation minimizes blast radius of any failure.
| Model | Traditional CI/CD | GitOps |
| --- | --- | --- |
| Direction | Push — tool pushes config to environment | Pull — controller in cluster pulls from Git |
| Truth Source | CI/CD tool state | Git repository (declarative) |
| Controller | Jenkins, GitHub Actions, Harness | ArgoCD, Flux |
| Drift Detection | Manual or scheduled | Continuous, automatic reconciliation |
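The pull model's continuous reconciliation reduces to a simple diff-and-converge loop. A minimal sketch in which two dicts stand in for the Git repo and the live cluster; the app name and spec fields are invented:

```python
# Minimal sketch of the GitOps pull model: a controller (think ArgoCD
# or Flux) repeatedly diffs live state against the declarative state in
# Git and converges live toward Git. The dicts stand in for a repo and
# a cluster; "web" and its fields are invented.

git_desired = {"web": {"image": "web:v2", "replicas": 3}}
cluster_live = {"web": {"image": "web:v1", "replicas": 3}}   # drifted

def reconcile(desired, live):
    """One controller pass: detect drift, apply the desired spec."""
    drift = {
        name: spec for name, spec in desired.items()
        if live.get(name) != spec
    }
    for name, spec in drift.items():
        live[name] = dict(spec)      # the "kubectl apply" equivalent
    return sorted(drift)

assert reconcile(git_desired, cluster_live) == ["web"]   # drift found and fixed
assert reconcile(git_desired, cluster_live) == []        # now converged
```

Because this loop runs continuously, drift is corrected automatically; in a push model the same drift sits undetected until the next pipeline run or a human notices.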
◆ Three layers of scaling, and when each kicks in
  • Why EKS: AWS manages the control plane (API server, etcd) across multiple AZs. Managing this yourself is the hardest part of Kubernetes operations — EKS eliminates it.
| Scaler | What It Scales | Trigger | Notes |
| --- | --- | --- | --- |
| HPA | Pods | CPU / memory / custom metrics | Horizontal Pod Autoscaler — doesn't add nodes |
| Cluster Autoscaler | Nodes (via ASG) | Pending pods with no room | Works with EC2 Auto Scaling Groups |
| Karpenter | Nodes (directly) | Pending pods with no room | Bypasses ASGs; provisions right-sized, right-typed nodes instantly. Faster and cheaper. |
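The HPA row is worth being able to quote exactly: the Kubernetes documentation gives the scaling rule as `desired = ceil(current * currentMetric / targetMetric)`. The replica counts and CPU figures below are illustrative:

```python
from math import ceil

# The Horizontal Pod Autoscaler's core formula, per the Kubernetes
# docs: desiredReplicas = ceil(currentReplicas * currentMetricValue /
# targetMetricValue). The numbers in the asserts are illustrative.

def hpa_desired_replicas(current: int, current_metric: float,
                         target_metric: float) -> int:
    return ceil(current * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 60% target -> scale out to 6 pods.
assert hpa_desired_replicas(4, 90, 60) == 6
# Load drops to 20% average -> scale in to 2 pods.
assert hpa_desired_replicas(4, 20, 60) == 2
```

Note what the formula never does: add nodes. If the 6 desired pods cannot be scheduled, they go Pending, and that is the trigger for the Cluster Autoscaler or Karpenter, the next layer in the table.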
  • DaemonSet: Ensures one pod per node. Use for cluster-wide agents — log collectors (Fluent Bit), monitoring agents (Prometheus node-exporter).
  • Sidecar: Runs alongside the app container in the same pod. Use for pod-specific concerns — service mesh proxy (Envoy), localized log offloading.
  • Monitoring Stack to name: Prometheus + Grafana for metrics; Fluent Bit / CloudWatch Container Insights / ELK for logs. Not naming your stack is a red flag.
◆ CPU is indirect, cold starts are solvable, messaging services are distinct tools
  • Cold Start: Latency when AWS provisions a new execution environment. Mitigate with Provisioned Concurrency (keeps environments warm) or Custom Container Images.
  • Scaling CPU: You cannot set Lambda CPU directly. Increase memory allocation — AWS scales CPU proportionally. More memory often means faster execution, and in the pay-per-ms model that can mean equal or even lower total cost.
  • Lambda vs. Fargate: Lambda scales instantly for burst; Fargate scales slower (node provisioning) but handles tasks exceeding Lambda's 15-minute limit.
  • Security best practices: Least-privilege IAM role per function; KMS for environment variable encryption; VPC integration for private resource access.
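The memory-vs-cost point can be shown with back-of-envelope arithmetic: Lambda compute is billed per GB-second, so if doubling memory (which scales CPU proportionally) halves a CPU-bound function's duration, the compute cost is unchanged while latency halves. The price constant below is an illustrative x86 on-demand figure, not a quoted rate, and the per-request charge is ignored:

```python
# Back-of-envelope Lambda compute cost model. PRICE_PER_GB_SECOND is an
# illustrative figure -- check current AWS pricing -- and the durations
# assume a CPU-bound function whose runtime scales inversely with the
# CPU share granted by the memory setting.

PRICE_PER_GB_SECOND = 0.0000166667

def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Compute charge for one invocation: GB-seconds x unit price."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND

slow = invocation_cost(memory_mb=512, duration_ms=800)    # CPU-starved
fast = invocation_cost(memory_mb=1024, duration_ms=400)   # 2x CPU, 2x faster

assert abs(slow - fast) < 1e-12   # same cost, half the latency
```

This is why memory tuning (rather than any direct CPU knob) is the standard Lambda performance lever.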
| Service | Pattern | Use Case |
| --- | --- | --- |
| SQS | Queue — point-to-point | One producer, one consumer. Decoupling, backpressure, retry. |
| SNS | Pub/sub topic — fan-out | One producer, many consumers. Broadcast notifications. |
| EventBridge | Event bus — content-based routing | Filter and route events from AWS services, SaaS, and custom apps with rules. |
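What sets EventBridge apart in that table is content-based routing. A simplified matcher sketch: real EventBridge patterns also support prefix, numeric, and anything-but matching, while this models only the basic rule that a field's value must appear in the pattern's list. The rule and events are invented:

```python
# Simplified sketch of EventBridge-style content-based routing: a rule
# is a pattern dict mapping fields to lists of allowed values, and an
# event matches when every patterned field holds an allowed value.
# store.orders / OrderPlaced and the events below are invented.

def matches(pattern: dict, event: dict) -> bool:
    return all(event.get(field) in allowed
               for field, allowed in pattern.items())

order_rule = {"source": ["store.orders"], "detail-type": ["OrderPlaced"]}

events = [
    {"source": "store.orders", "detail-type": "OrderPlaced", "id": 1},
    {"source": "store.users",  "detail-type": "SignUp",      "id": 2},
]

routed = [e["id"] for e in events if matches(order_rule, e)]
assert routed == [1]   # only the order event reaches the order target
```

Neither SQS nor SNS filters on event content this way by default, which is exactly the distinction the table is testing for.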
◆ The "So What?" layer separates architects from engineers
  • Trait 1 — The Question Behind the Question: When asked "What is RPO?", the real test is whether you know it's measured in time, not data volume. Pivot to the underlying principle, not the dictionary definition.
  • Trait 2 — System Design Pivot: When asked "What is SQS?", don't just say "It's a queue." Explain how you used SQS to decouple an ordering system and absorb a 10x Black Friday traffic spike without dropping a single order. That's the delight factor.
  • Trait 3 — Big Picture Thinking: Show how services interact to solve a business problem. Walk through flows, not definitions.
  • ❌ Rambling: Five minutes on a VPC question signals inability to communicate with executives. Use the 3-2-1 trick: 3 points, 2 minutes, 1 clear summary.
  • ❌ Generic answers: "I would use the best practices" is not an answer. Name the service, the trade-off, and the operational consequence.
  • ❌ Jumping to solutions: Always ask for requirements first — RTO, RPO, budget, scale, compliance constraints. Skipping this is a junior mistake.
  • ❌ Calling a service "the best": A Principal Architect selects services based on specific system requirements — cost, latency, scale — never on preference.

Ready to start preparing?

Pick a topic above or jump straight into the most popular section.

Start with Architecture →