Open-Source Study Guide
Cloud Architect Interview Guide
Everything you need to prepare for Solutions Architect interviews — from networking fundamentals to system design walkthroughs, with 148 interview questions and scripted answer frameworks.
15 Topic Pages · 100+ Topics Covered · 148 Interview Questions · 4 System Design Walkthroughs
Infrastructure
Networking & VPC 7 topics
VPC design, subnets, Security Groups vs NACLs, Direct Connect, Route 53, VPC
Endpoints.
Explore →
Security & IAM 7 topics
IAM policies, cross-account access, KMS encryption, WAF/Shield, GuardDuty,
Zero Trust.
Explore →
Compute 4 topics
Trade-off matrix, EC2 vs ECS vs EKS vs Lambda, Real-World Scenarios, and
Interview Cheat Sheet.
Explore →
Compute Article Deep-Dive
Choosing the right AWS compute service — EC2, ECS, EKS, or Lambda — with
decision frameworks and STAR scenarios.
Read Article →
Storage 6 topics
S3 storage classes, EBS vs EFS vs FSx, data transfer strategies, backup &
recovery.
Explore →
Data & Apps
Databases 6 topics
RDS vs Aurora, DynamoDB patterns, ElastiCache, database migration with
DMS/SCT.
Explore →
Serverless 9 topics
Lambda internals, Step Functions, EventBridge, cold starts, concurrency, VPC
integration.
Explore →
Streaming 7 topics
SQS vs SNS vs Kinesis, MSK (Kafka), stream processing patterns, video
streaming.
Explore →
Gen AI 5 topics
Bedrock vs SageMaker, RAG architecture, prompt engineering, multi-model
strategies.
Explore →
Design & Ops
Architecture 16 topics
Microservices, event-driven, three-tier, hybrid cloud, system design
walkthroughs, trade-offs.
Explore →
Fault-Tolerant Blueprint 5 layers
First-principles HA design: global edge, stateless compute, async decoupling,
multi-AZ data, multi-region.
Read Article →
DevOps & CI/CD 6 topics
CI/CD pipelines, GitOps, Git branching, DevSecOps, deployment strategies.
Explore →
DevOps Interview ↗ External Q&A
Real-world 45-minute DevOps interview questions covering ECS, EKS, RDS, CI/CD,
and Linux.
View Artifact →
Kubernetes & EKS 16 topics
Karpenter, scaling strategies, KEDA, container lifecycle, sidecars, EKS Auto
Mode.
Explore →
Cost Optimization 5 topics
Pricing models, right-sizing, FinOps governance, hidden costs, cost-efficient
patterns.
Explore →
Foundations
AWS Core 8 topics
Well-Architected Framework, migration 7Rs, API Gateway auth, DR strategies,
global infra.
Explore →
Fundamentals 3 topics
IP vs URL, DNS resolution flow, OSI model, CIDR notation, HTTP codes, cloud
basics.
Explore →
Re:Invent 2024 4 topics
EKS Auto Mode, Aurora DSQL, S3 Tables — the most architecturally significant
launches.
Explore →
Interview Prep 188 questions
Platform Engineering, Mobile CI/CD, Kubernetes, Terraform, AWS, System Design,
Behavioral, DevSecOps & Career Guide.
Explore →
Printable Guides (PDF)
Top 10 Cloud Architect Interview Questions
10 high-frequency Solutions Architect interview questions with STAR-format answers grounded in real experience at Enterprise Pharma, enhanced with key AWS architectural patterns.
The Definitive Interview Compendium
Senior-level frameworks for the most high-stakes questions — with "bad vs. delightful" answer contrasts, comparison tables, and the "So What?" layer that separates architects from engineers.
◆ The most mis-answered foundational question
| Dimension | High Availability (HA) | Disaster Recovery (DR) |
|---|---|---|
| Scope | Within a single Region (cross-AZ) | Across different geographic Regions |
| Primary Goal | Minimize downtime during AZ failure | Restore service after a Region-wide event |
| AWS Examples | Multi-AZ EC2, Application Load Balancers | Route 53 Health Checks, S3 Cross-Region Replication |
- Bad Answer: "HA means the app stays up; DR is when it falls back to another region." This conflates the two concepts entirely.
- Delightful Answer: HA is scoped to AZs within one region. DR is scoped to different geographic regions. Confusing them signals you haven't operated at enterprise scale.
- The "So What?" Layer: A total AZ failure is handled by HA — no human intervention needed. A total regional failure requires DR — a deliberate architectural and operational strategy.
◆ Layer-by-layer HA walkthrough — the "delightful" answer format
- External ALB: Routes traffic across multiple AZs. Inherently HA — no single point of failure.
- Web / App Tier (EC2 in ASG): A single EC2 instance resides in one AZ and is a SPOF. Deploy across ≥2 AZs inside an Auto Scaling Group. If one AZ fails, traffic re-routes to instances in surviving AZs automatically.
- Internal ALB: Decouples the web and app tiers; distributes internal traffic across healthy AZs — identical HA properties to the external ALB.
- Data Tier (Amazon DynamoDB): A regional service that automatically replicates data across three AZs. AZ failure is fully transparent to application code.
- So What? This design ensures a complete data-center failure is a non-event for end users — business continuity with zero manual intervention.
◆ The "trap question" — RPO is always measured in TIME, not data volume
- RTO (Recovery Time Objective): Max acceptable time the application can be offline. Measured in minutes, hours, or days.
- RPO (Recovery Point Objective): Max acceptable data loss, expressed as a point in time. A common mistake: measuring RPO in megabytes. Wrong. It's always time.
| Strategy | RTO / RPO | How It Works |
|---|---|---|
| Backup & Restore | Hours / Days | Data backed up to S3; restored to new region after disaster. Highest RTO/RPO, lowest cost. |
| Pilot Light | Minutes / Hours | Core data (DB) is live and replicated; app servers are "off" (AMIs) until disaster triggers provisioning. |
| Warm Standby | Minutes / Minutes | Scaled-down full environment always running in DR region. Scale up on activation. |
| Multi-site Active-Active | Real-time / Real-time | Full traffic served from two or more regions simultaneously, with continuous cross-region replication. Highest cost, near-zero RTO/RPO. |
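Because RPO is a time budget, checking whether a backup schedule satisfies it is simple arithmetic. A minimal sketch (the function names are illustrative, not from any AWS SDK):

```python
from datetime import timedelta

def worst_case_rpo(backup_interval: timedelta) -> timedelta:
    """With periodic backups, the worst-case data loss window is the full
    interval since the last completed backup."""
    return backup_interval

def meets_rpo(backup_interval: timedelta, rpo_target: timedelta) -> bool:
    # RPO is a TIME budget: backups must occur at least as often as the target.
    return worst_case_rpo(backup_interval) <= rpo_target

# Hourly backups satisfy a 4-hour RPO but not a 15-minute one.
print(meets_rpo(timedelta(hours=1), timedelta(hours=4)))     # True
print(meets_rpo(timedelta(hours=1), timedelta(minutes=15)))  # False
```

The same framing works in reverse during requirements gathering: a stated 15-minute RPO immediately rules out Backup & Restore and points toward continuous replication (Pilot Light or better).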
◆ Show fault isolation and EDA asynchrony — not just definitions
- Monolith: Single tightly coupled unit. One component fails → redeploy everything. No independent scaling.
- Microservices: Independent services, independently deployed, independently scaled, polyglot by design.
- Real Flow — store.com (ALB Path-Based Routing):
  - store.com/browse → EC2 + DynamoDB (Python)
  - store.com/purchase → EKS + Aurora (Go) — isolated for high integrity
  - store.com/return → Lambda (serverless, spiky traffic)
  - If /browse crashes, /purchase and revenue remain unaffected. That's fault isolation.
- EDA — Bad Answer: "API Gateway → Lambda → DynamoDB." That is synchronous, not event-driven.
- EDA — Delightful: API Gateway → SQS → Lambda → DynamoDB. SQS decouples producer from consumer. Lambda consumes at its own rate. Benefits: independent scaling, built-in retry, no wasted idle capacity.
| Trait | Synchronous | EDA (Asynchronous) |
|---|---|---|
| Scaling | All layers must match | Components scale independently |
| Retries | Hard-coded in client | Built-in queue retry |
| Cost | Over-provision for spikes | Pay only for events processed |
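The decoupling benefit can be sketched without AWS at all. Below is a toy in-memory queue standing in for SQS (not boto3, and the retry logic is simplified relative to SQS visibility timeouts and DLQs), showing a producer bursting messages while the consumer drains at its own rate and retries a transient failure:

```python
from collections import deque

class Queue:
    """Minimal in-memory stand-in for SQS, to illustrate the decoupling pattern."""
    def __init__(self):
        self._messages = deque()
    def send(self, body):
        self._messages.append(body)
    def receive(self):
        return self._messages.popleft() if self._messages else None

def consume(queue, handler, max_retries=3):
    """Consumer drains at its own rate; failed messages are re-queued (retry)."""
    processed = []
    while (msg := queue.receive()) is not None:
        attempts = msg.setdefault("_attempts", 0)
        try:
            handler(msg)
            processed.append(msg["id"])
        except Exception:
            if attempts + 1 < max_retries:
                msg["_attempts"] = attempts + 1
                queue.send(msg)  # retry later instead of losing the order
    return processed

# Producer bursts ten orders; the queue absorbs the spike.
q = Queue()
for i in range(10):
    q.send({"id": i})

flaky = {"seen": False}
def handler(msg):
    # Simulated transient failure on the first attempt at message 3.
    if msg["id"] == 3 and not flaky["seen"]:
        flaky["seen"] = True
        raise RuntimeError("transient failure")

result = consume(q, handler)
print(result)  # all ten orders processed; message 3 succeeds on retry
```

Notice what the producer never had to know: how fast the consumer is, whether it was up, or that a retry happened. That is the interview-worthy property to articulate.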
◆ Distinguish by schema approach and consistency model — not just "relational vs. not"
| Property | SQL (Relational) | NoSQL (Non-relational) |
|---|---|---|
| Schema | Schema-on-write — rigid, predefined | Schema-on-read — flexible, evolving |
| Consistency | ACID transactions | Eventual consistency (configurable) |
| Scaling | Vertical primary | Horizontal by design |
| Best For | Complex joins, financial transactions | High-throughput, low-latency, variable structure |
| AWS Examples | Amazon Aurora, RDS | Amazon DynamoDB |
- Architect's Rule: Never say "NoSQL is better." Choose based on access patterns. If you need ACID joins, SQL wins. If you need single-digit millisecond reads at millions of RPS, DynamoDB wins.
◆ Categorize GenAI use cases — "chatbots" alone is a junior answer
- Customer Experience: Agentic assistants that take action on behalf of users (not just answer questions).
- Productivity: Content creation, automated report generation, search summarization — measurable ROI.
- Cutting Edge (MCP): Automated cost analysis, code-quality evaluation against architectural best practices — agents acting autonomously on cloud infrastructure.
- Model Context Protocol (MCP): A standardized JSON-RPC protocol enabling LLMs to interact with tools and data sources without custom-coding every integration.
- Pre-MCP: Manually define every API URL, header, and payload per tool.
- MCP: MCP Server exposes tool schemas. MCP Client (e.g., VS Code) discovers tools dynamically. LLM chooses which tool to invoke based on purpose — no brittle hard-coded logic.
- Key Innovation: Dynamic Discovery — the protocol feeds tool purpose and schema to the LLM at runtime.
| Protocol | What It Does |
|---|---|
| A2A | Agent-to-Agent communication (orchestration between agents) |
| MCP | Standardized protocol connecting agents/LLMs to tools and data |
| RAG | Retrieval-Augmented Generation — grounding LLM responses in external data without retraining |
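A rough sketch of the dynamic-discovery idea, assuming a toy JSON-RPC handler and made-up tool names. Real MCP servers implement a fuller protocol (transports, capability negotiation), but the core contract is the same: the client lists tools and their schemas at runtime instead of hard-coding each integration:

```python
import json

# Hypothetical tool registry playing the role of an MCP server: it exposes
# each tool's purpose and input schema so a client can discover them at runtime.
TOOLS = {
    "get_cost_report": {
        "description": "Return a monthly cost breakdown by service.",
        "input_schema": {"type": "object", "properties": {"month": {"type": "string"}}},
    },
    "lint_architecture": {
        "description": "Evaluate code against architectural best practices.",
        "input_schema": {"type": "object", "properties": {"repo": {"type": "string"}}},
    },
}

def handle_rpc(request: str) -> str:
    """Toy JSON-RPC 2.0 handler supporting only the tool-discovery call."""
    req = json.loads(request)
    if req["method"] == "tools/list":
        result = [{"name": name, **meta} for name, meta in TOOLS.items()]
    else:
        result = None
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})

# The client discovers available tools dynamically; the LLM then picks one
# based on the advertised purpose, with no per-tool URLs or payloads baked in.
resp = json.loads(handle_rpc('{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'))
print([t["name"] for t in resp["result"]])
```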
◆ Know the human-button distinction and the push vs. pull model
- Continuous Delivery: Code is always in a deployable state; a manual trigger pushes to production.
- Continuous Deployment: Every passing build auto-deploys to production. Zero human intervention.
- Senior-Level Pipeline (Source → Build → Test → Deploy):
- IaC: Non-negotiable — Terraform ensures environment consistency across accounts.
- Automated Unit Testing: AWS CodeBuild catches logic errors before staging.
- DevSecOps: Shift security left — secret scanning and vulnerability assessment are pipeline gates, not post-deployment audits.
- Multi-Account Strategy: Dev → Test → Prod isolation minimizes blast radius of any failure.
| Model | Traditional CI/CD | GitOps |
|---|---|---|
| Direction | Push — tool pushes config to environment | Pull — controller in cluster pulls from Git |
| Truth Source | CI/CD tool state | Git repository (declarative) |
| Controller | Jenkins, GitHub Actions, Harness | ArgoCD, Flux |
| Drift Detection | Manual or scheduled | Continuous, automatic reconciliation |
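The pull-plus-reconciliation model can be sketched as a toy control loop (the names and structure are illustrative, not ArgoCD's actual implementation):

```python
def reconcile(desired: dict, live: dict) -> dict:
    """One pass of a GitOps-style control loop: Git holds the desired state,
    the in-cluster controller pulls it and converges the live state toward it."""
    actions = []
    for name, spec in desired.items():
        if live.get(name) != spec:
            actions.append(f"apply {name}")   # create, or correct drift
            live[name] = dict(spec)
    for name in list(live):
        if name not in desired:
            actions.append(f"prune {name}")   # remove what Git no longer declares
            del live[name]
    return {"actions": actions, "live": live}

git_repo = {"web": {"replicas": 3}, "api": {"replicas": 2}}
cluster = {"web": {"replicas": 1}, "old-job": {"replicas": 1}}  # drifted + stale
result = reconcile(git_repo, cluster)
print(result["actions"])  # drift on 'web' fixed, 'api' created, 'old-job' pruned
```

Run continuously, this loop is what makes drift detection "automatic" in the GitOps column above: any out-of-band change to the cluster is overwritten on the next pass.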
◆ Three layers of scaling, and when each kicks in
- Why EKS: AWS manages the control plane (API server, etcd) across multiple AZs. Managing this yourself is the hardest part of Kubernetes operations — EKS eliminates it.
| Scaler | What It Scales | Trigger | Notes |
|---|---|---|---|
| HPA | Pods | CPU / Memory / custom metrics | Horizontal Pod Autoscaler — doesn't add nodes |
| Cluster Autoscaler | Nodes (via ASG) | Pending pods with no room | Works with EC2 Auto Scaling Groups |
| Karpenter | Nodes (directly) | Pending pods with no room | Bypasses ASGs; provisions right-sized, right-typed node instantly. Faster + cheaper. |
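The HPA's pod-scaling decision follows a documented formula: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A quick sketch:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """The HPA scaling formula from the Kubernetes documentation:
    desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 60% target -> scale out to 6 pods.
print(hpa_desired_replicas(4, 90, 60))  # 6
# At 30% average against the same target, HPA scales in to 2.
print(hpa_desired_replicas(4, 30, 60))  # 2
```

If those 6 pods don't fit on existing nodes, they go Pending, which is exactly the trigger for the Cluster Autoscaler or Karpenter in the rows above.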
- DaemonSet: Ensures one pod per node. Use for cluster-wide agents — log collectors (Fluent Bit), monitoring agents (Prometheus node-exporter).
- Sidecar: Runs alongside the app container in the same pod. Use for pod-specific concerns — service mesh proxy (Envoy), localized log offloading.
- Monitoring Stack to name: Prometheus + Grafana for metrics; Fluent Bit / CloudWatch Container Insights / ELK for logs. Not naming your stack is a red flag.
◆ CPU is indirect, cold starts are solvable, messaging services are distinct tools
- Cold Start: Latency when AWS provisions a new execution environment. Mitigate with Provisioned Concurrency (keeps environments warm) or Custom Container Images.
- Scaling CPU: You cannot set Lambda CPU directly. Increase memory allocation — AWS scales vCPU proportionally. More memory often means faster execution and a lower total bill under the pay-per-ms model.
- Lambda vs. Fargate: Lambda scales instantly for burst; Fargate scales slower (node provisioning) but handles tasks exceeding Lambda's 15-minute limit.
- Security best practices: Least-privilege IAM role per function; KMS for environment variable encryption; VPC integration for private resource access.
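The memory/CPU/cost relationship is easy to demonstrate with the GB-second billing model. The price constant and benchmark durations below are assumptions for illustration, not official AWS pricing; 1,769 MB is the allocation at which Lambda provides roughly one full vCPU:

```python
PRICE_PER_GB_SECOND = 0.0000166667  # illustrative assumption, not official pricing

def lambda_compute_cost(memory_mb: int, duration_ms: float, invocations: int) -> float:
    """Lambda bills GB-seconds: memory allocation x execution time x invocations."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000) * invocations
    return gb_seconds * PRICE_PER_GB_SECOND

# Assumed benchmark: a CPU-bound function runs 1200 ms at 512 MB; since CPU
# scales with memory, at 1769 MB (~1 full vCPU) it finishes in 300 ms.
low = lambda_compute_cost(512, 1200, 1_000_000)
high = lambda_compute_cost(1769, 300, 1_000_000)
print(f"512 MB:  ${low:.2f}")
print(f"1769 MB: ${high:.2f}")
print(high < low)  # more memory, faster run, lower bill
```

The general rule: raising memory lowers cost whenever execution time shrinks by a larger factor than memory grows, which is common for CPU-bound workloads; memory-bound or I/O-bound functions see no such benefit, so profile before tuning.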
| Service | Pattern | Use Case |
|---|---|---|
| SQS | Queue — point-to-point | One producer, one consumer. Decoupling, backpressure, retry. |
| SNS | Pub/sub topic — fan-out | One producer, many consumers. Broadcast notifications. |
| EventBridge | Event bus — content-based routing | Filter and route events from AWS services, SaaS, and custom apps with rules. |
◆ The "So What?" layer separates architects from engineers
- Trait 1 — The Question Behind the Question: When asked "What is RPO?", the real test is whether you know it's measured in time, not data volume. Pivot to the underlying principle, not the dictionary definition.
- Trait 2 — System Design Pivot: When asked "What is SQS?", don't just say "It's a queue." Explain how you used SQS to decouple an ordering system and absorb a 10x Black Friday traffic spike without dropping a single order. That's the delight factor.
- Trait 3 — Big Picture Thinking: Show how services interact to solve a business problem. Walk through flows, not definitions.
- ❌ Rambling: Five minutes on a VPC question signals inability to communicate with executives. Use the 3-2-1 trick: 3 points, 2 minutes, 1 clear summary.
- ❌ Generic answers: "I would use the best practices" is not an answer. Name the service, the trade-off, and the operational consequence.
- ❌ Jumping to solutions: Always ask for requirements first — RTO, RPO, budget, scale, compliance constraints. Skipping this is a junior mistake.
- ❌ Calling a service "the best": A Principal Architect selects services based on specific system requirements — cost, latency, scale — never on preference.
Ready to start preparing?
Pick a topic above or jump straight into the most popular section.
Start with Architecture →