Open-Source Study Guide

Cloud Architect Interview Guide

Everything you need to prepare for Solutions Architect interviews — from networking fundamentals to system design walkthroughs, with 148 interview questions and scripted answer frameworks.

15 Topic Pages
100+ Topics Covered
148 Interview Questions
4 System Design Walkthroughs

Infrastructure


Data & Apps


Design & Ops


Foundations


Printable Guides (PDF)


Top 10 Cloud Architect Interview Questions

10 high-frequency Solutions Architect interview questions with STAR-format answers grounded in real experience at Enterprise Pharma, enhanced with key AWS architectural patterns.

The Definitive Interview Compendium

Senior-level frameworks for the most high-stakes questions — with "bad vs. delightful" answer contrasts, comparison tables, and the "So What?" layer that separates architects from engineers.

◆ The most mis-answered foundational question
| Dimension | High Availability (HA) | Disaster Recovery (DR) |
| --- | --- | --- |
| Scope | Within a single Region (cross-AZ) | Across different geographic Regions |
| Primary Goal | Minimize downtime during AZ failure | Restore service after a Region-wide event |
| AWS Examples | Multi-AZ EC2, Application Load Balancers | Route 53 Health Checks, S3 Cross-Region Replication |
  • Bad Answer: "HA means the app stays up; DR is when it falls back to another region." This conflates the two concepts entirely.
  • Delightful Answer: HA is scoped to AZs within one region. DR is scoped to different geographic regions. Confusing them signals you haven't operated at enterprise scale.
  • The "So What?" Layer: A total AZ failure is handled by HA — no human intervention needed. A total regional failure requires DR — a deliberate architectural and operational strategy.
◆ Layer-by-layer HA walkthrough — the "delightful" answer format
  • External ALB: Routes traffic across multiple AZs. Inherently HA — no single point of failure.
  • Web / App Tier (EC2 in ASG): A single EC2 instance resides in one AZ and is a SPOF. Deploy across ≥2 AZs inside an Auto Scaling Group. If one AZ fails, traffic re-routes to instances in surviving AZs automatically.
  • Internal ALB: Decouples the web and app tiers; distributes internal traffic across healthy AZs — identical HA properties to the external ALB.
  • Data Tier (Amazon DynamoDB): A regional service that automatically replicates data across three AZs. AZ failure is fully transparent to application code.
  • So What? This design ensures a complete data-center failure is a non-event for end users — business continuity with zero manual intervention.
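The layer-by-layer walkthrough above can be sketched as a toy routing loop. This is not AWS API code; the instance names, AZ labels, and health flags are invented to illustrate why ≥2 AZs behind a load balancer make an AZ failure a non-event:

```python
import random

# Hypothetical sketch of cross-AZ routing: an ALB-style balancer only
# sends traffic to healthy targets, so losing an entire AZ just shrinks
# the target pool. Names and health states are illustrative, not real
# AWS objects.

targets = {
    "i-web-1a": {"az": "us-east-1a", "healthy": True},
    "i-web-1b": {"az": "us-east-1b", "healthy": True},
}

def route(targets):
    """Return a healthy target, mimicking cross-AZ load balancing."""
    healthy = [name for name, meta in targets.items() if meta["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy targets in any AZ")
    return random.choice(healthy)

# Simulate a total failure of us-east-1a: its instances go unhealthy.
for meta in targets.values():
    if meta["az"] == "us-east-1a":
        meta["healthy"] = False

# Traffic keeps flowing with zero manual intervention -- every request
# now lands in the surviving AZ.
assert route(targets) == "i-web-1b"
```

The same pruning logic applies at every tier: external ALB, internal ALB, and (invisibly) inside regional services like DynamoDB.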
◆ The "trap question" — RPO is always measured in TIME, not data volume
  • RTO (Recovery Time Objective): Max acceptable time the application can be offline. Measured in minutes, hours, or days.
  • RPO (Recovery Point Objective): Max acceptable data loss, expressed as a point in time. A common mistake: measuring RPO in megabytes. Wrong. It's always time.
| Strategy | RTO / RPO | How It Works |
| --- | --- | --- |
| Backup & Restore | Hours / Days | Data backed up to S3; restored to a new region after disaster. Highest RTO/RPO, lowest cost. |
| Pilot Light | Minutes / Hours | Core data (DB) is live and replicated; app servers are "off" (AMIs) until disaster triggers provisioning. |
| Warm Standby | Minutes / Minutes | Scaled-down full environment always running in the DR region. Scale up on activation. |
| Multi-site Active-Active | Real-time / Real-time | Full traffic served in two or more regions simultaneously, with continuous synchronous or asynchronous replication. Highest cost, near-zero RTO/RPO. |
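Because RPO is a time target, checking it is pure time arithmetic. A minimal sketch, with invented timestamps, showing why the worst-case data loss for Backup & Restore is simply the age of the last good backup:

```python
from datetime import datetime, timedelta

# Illustrative arithmetic only: RPO is measured in TIME, so the check
# compares the age of the last good backup against the RPO target,
# never a byte count. All timestamps below are made up.

def worst_case_data_loss(last_backup: datetime, disaster: datetime) -> timedelta:
    """Everything written since the last backup is lost."""
    return disaster - last_backup

def meets_rpo(loss: timedelta, rpo: timedelta) -> bool:
    return loss <= rpo

last_backup = datetime(2024, 1, 1, 0, 0)
disaster = datetime(2024, 1, 1, 5, 30)
loss = worst_case_data_loss(last_backup, disaster)

assert loss == timedelta(hours=5, minutes=30)
assert not meets_rpo(loss, rpo=timedelta(hours=4))   # a 4-hour RPO is blown
assert meets_rpo(loss, rpo=timedelta(hours=24))      # a daily-backup tier is fine
```

The same check works for every tier in the table; only the target shrinks, from days (Backup & Restore) toward near-zero (Active-Active).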
◆ Show fault isolation and EDA asynchrony — not just definitions
  • Monolith: Single tightly coupled unit. One component fails → redeploy everything. No independent scaling.
  • Microservices: Independent services, independently deployed, independently scaled, polyglot by design.
  • Real Flow — store.com (ALB Path-Based Routing):
    • store.com/browse → EC2 + DynamoDB (Python)
    • store.com/purchase → EKS + Aurora (Go) — isolated for high integrity
    • store.com/return → Lambda (serverless, spiky traffic)
    • If /browse crashes, /purchase and revenue remain unaffected. That's fault isolation.
  • EDA — Bad Answer: "API Gateway → Lambda → DynamoDB." That is synchronous, not event-driven.
  • EDA — Delightful: API Gateway → SQS → Lambda → DynamoDB. SQS decouples producer from consumer. Lambda consumes at its own rate. Benefits: independent scaling, built-in retry, no wasted idle capacity.
| Trait | Synchronous | EDA (Asynchronous) |
| --- | --- | --- |
| Scaling | All layers must match | Components scale independently |
| Retries | Hard-coded in client | Built-in queue retry |
| Cost | Over-provision for spikes | Pay only for events processed |
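The decoupling in the "delightful" EDA answer can be modeled with a plain in-memory queue. This is a toy stand-in for API Gateway → SQS → Lambda, with invented burst sizes and a deliberately flaky message to show the built-in retry behavior:

```python
from collections import deque

# Toy model of SQS-style decoupling: the producer bursts far faster
# than the consumer drains, and a failed message is re-queued (as when
# an SQS visibility timeout expires) instead of being lost. The burst
# size, batch size, and failing order ID are invented.

queue = deque()

def produce(n):
    """The producer dumps an entire spike onto the queue instantly."""
    for i in range(n):
        queue.append({"order_id": i, "attempts": 0})

def consume(batch_size, processed, fail_ids=()):
    """Drain up to batch_size messages; requeue first-attempt failures."""
    for _ in range(min(batch_size, len(queue))):
        msg = queue.popleft()
        msg["attempts"] += 1
        if msg["order_id"] in fail_ids and msg["attempts"] == 1:
            queue.append(msg)        # retry instead of dropping the order
        else:
            processed.append(msg["order_id"])

processed = []
produce(100)                          # a 10x traffic spike arrives at once
while queue:                          # the consumer works at its own rate
    consume(batch_size=10, processed=processed, fail_ids={7})

assert sorted(processed) == list(range(100))   # not a single order dropped
```

The producer never waits on the consumer, which is exactly the property a synchronous API Gateway → Lambda chain cannot give you.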
◆ Distinguish by schema approach and consistency model — not just "relational vs. not"
| Property | SQL (Relational) | NoSQL (Non-relational) |
| --- | --- | --- |
| Schema | Schema-on-write — rigid, predefined | Schema-on-read — flexible, evolving |
| Consistency | ACID transactions | Eventual consistency (configurable) |
| Scaling | Vertical primary | Horizontal by design |
| Best For | Complex joins, financial transactions | High-throughput, low-latency, variable structure |
| AWS Examples | Amazon Aurora, RDS | Amazon DynamoDB |
  • Architect's Rule: Never say "NoSQL is better." Choose based on access patterns. If you need ACID joins, SQL wins. If you need single-digit millisecond reads at millions of RPS, DynamoDB wins.
◆ Categorize GenAI use cases — "chatbots" alone is a junior answer
  • Customer Experience: Agentic assistants that take action on behalf of users (not just answer questions).
  • Productivity: Content creation, automated report generation, search summarization — measurable ROI.
  • Cutting Edge (MCP): Automated cost analysis, code-quality evaluation against architectural best practices — agents acting autonomously on cloud infrastructure.
  • Model Context Protocol (MCP): A standardized JSON-RPC protocol enabling LLMs to interact with tools and data sources without custom-coding every integration.
    • Pre-MCP: Manually define every API URL, header, and payload per tool.
    • MCP: MCP Server exposes tool schemas. MCP Client (e.g., VS Code) discovers tools dynamically. LLM chooses which tool to invoke based on purpose — no brittle hard-coded logic.
    • Key Innovation: Dynamic Discovery — the protocol feeds tool purpose and schema to the LLM at runtime.
| Protocol | What It Does |
| --- | --- |
| A2A | Agent-to-Agent communication (orchestration between agents) |
| MCP | Standardized protocol connecting agents/LLMs to tools and data |
| RAG | Retrieval-Augmented Generation — grounding LLM responses in external data without retraining |
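MCP's dynamic discovery can be sketched with plain dicts. MCP is genuinely JSON-RPC 2.0 and genuinely exposes a `tools/list` method; the tool itself (`get_cost_report`), its schema, and the single-function server below are invented for illustration and omit the real protocol's initialization handshake:

```python
import json

# Sketch of MCP-style dynamic discovery: the server advertises tool
# schemas, the client fetches them at runtime, and the LLM picks a tool
# by purpose -- no hard-coded URLs, headers, or payloads per tool.
# get_cost_report and this toy handler are hypothetical.

TOOLS = [{
    "name": "get_cost_report",
    "description": "Summarize AWS spend for a billing period",
    "inputSchema": {
        "type": "object",
        "properties": {"month": {"type": "string"}},
        "required": ["month"],
    },
}]

def handle(request: str) -> str:
    """Answer a JSON-RPC request; only tools/list is modeled here."""
    req = json.loads(request)
    if req["method"] == "tools/list":
        result = {"tools": TOOLS}
    else:
        result = {"error": "unknown method"}
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})

# The client discovers tool purpose + schema at runtime (Dynamic Discovery).
reply = json.loads(handle(json.dumps(
    {"jsonrpc": "2.0", "id": 1, "method": "tools/list"})))
assert reply["result"]["tools"][0]["name"] == "get_cost_report"
```

Contrast this with the pre-MCP world, where that same tool would mean one more hand-written URL, header set, and payload format in every client.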
◆ Know the human-button distinction and the push vs. pull model
  • Continuous Delivery: Code is always in a deployable state; a manual trigger pushes to production.
  • Continuous Deployment: Every passing build auto-deploys to production. Zero human intervention.
  • Senior-Level Pipeline (Source → Build → Test → Deploy):
    • IaC: Non-negotiable — Terraform ensures environment consistency across accounts.
    • Automated Unit Testing: AWS CodeBuild catches logic errors before staging.
    • DevSecOps: Shift security left — secret scanning and vulnerability assessment are pipeline gates, not post-deployment audits.
    • Multi-Account Strategy: Dev → Test → Prod isolation minimizes blast radius of any failure.
| Model | Traditional CI/CD | GitOps |
| --- | --- | --- |
| Direction | Push — tool pushes config to environment | Pull — controller in cluster pulls from Git |
| Truth Source | CI/CD tool state | Git repository (declarative) |
| Controller | Jenkins, GitHub Actions, Harness | ArgoCD, Flux |
| Drift Detection | Manual or scheduled | Continuous, automatic reconciliation |
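The pull model's continuous reconciliation reduces to a simple diff-and-converge loop. A minimal sketch in which two dicts stand in for the Git repo and the live cluster; the app name and spec fields are invented:

```python
# Minimal sketch of the GitOps pull model: a controller (think ArgoCD
# or Flux) repeatedly diffs live state against the declarative state in
# Git and converges live toward Git. The dicts stand in for a repo and
# a cluster; "web" and its fields are invented.

git_desired = {"web": {"image": "web:v2", "replicas": 3}}
cluster_live = {"web": {"image": "web:v1", "replicas": 3}}   # drifted

def reconcile(desired, live):
    """One controller pass: detect drift, apply the desired spec."""
    drift = {
        name: spec for name, spec in desired.items()
        if live.get(name) != spec
    }
    for name, spec in drift.items():
        live[name] = dict(spec)      # the "kubectl apply" equivalent
    return sorted(drift)

assert reconcile(git_desired, cluster_live) == ["web"]   # drift found and fixed
assert reconcile(git_desired, cluster_live) == []        # now converged
```

Because this loop runs continuously, drift is corrected automatically; in a push model the same drift sits undetected until the next pipeline run or a human notices.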
◆ Three layers of scaling, and when each kicks in
  • Why EKS: AWS manages the control plane (API server, etcd) across multiple AZs. Managing this yourself is the hardest part of Kubernetes operations — EKS eliminates it.
| Scaler | What It Scales | Trigger | Notes |
| --- | --- | --- | --- |
| HPA | Pods | CPU / memory / custom metrics | Horizontal Pod Autoscaler — doesn't add nodes |
| Cluster Autoscaler | Nodes (via ASG) | Pending pods with no room | Works with EC2 Auto Scaling Groups |
| Karpenter | Nodes (directly) | Pending pods with no room | Bypasses ASGs; provisions right-sized, right-typed nodes instantly. Faster and cheaper. |
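The HPA row is worth being able to quote exactly: the Kubernetes documentation gives the scaling rule as `desired = ceil(current * currentMetric / targetMetric)`. The replica counts and CPU figures below are illustrative:

```python
from math import ceil

# The Horizontal Pod Autoscaler's core formula, per the Kubernetes
# docs: desiredReplicas = ceil(currentReplicas * currentMetricValue /
# targetMetricValue). The numbers in the asserts are illustrative.

def hpa_desired_replicas(current: int, current_metric: float,
                         target_metric: float) -> int:
    return ceil(current * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 60% target -> scale out to 6 pods.
assert hpa_desired_replicas(4, 90, 60) == 6
# Load drops to 20% average -> scale in to 2 pods.
assert hpa_desired_replicas(4, 20, 60) == 2
```

Note what the formula never does: add nodes. If the 6 desired pods cannot be scheduled, they go Pending, and that is the trigger for the Cluster Autoscaler or Karpenter, the next layer in the table.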
  • DaemonSet: Ensures one pod per node. Use for cluster-wide agents — log collectors (Fluent Bit), monitoring agents (Prometheus node-exporter).
  • Sidecar: Runs alongside the app container in the same pod. Use for pod-specific concerns — service mesh proxy (Envoy), localized log offloading.
  • Monitoring Stack to name: Prometheus + Grafana for metrics; Fluent Bit / CloudWatch Container Insights / ELK for logs. Not naming your stack is a red flag.
◆ CPU is indirect, cold starts are solvable, messaging services are distinct tools
  • Cold Start: Latency when AWS provisions a new execution environment. Mitigate with Provisioned Concurrency (keeps environments warm) or Custom Container Images.
  • Scaling CPU: You cannot set Lambda CPU directly. Increase memory allocation — AWS scales CPU proportionally. More memory often means faster execution, and in the pay-per-ms model that can mean equal or even lower total cost.
  • Lambda vs. Fargate: Lambda scales instantly for burst; Fargate scales slower (node provisioning) but handles tasks exceeding Lambda's 15-minute limit.
  • Security best practices: Least-privilege IAM role per function; KMS for environment variable encryption; VPC integration for private resource access.
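The memory-vs-cost point can be shown with back-of-envelope arithmetic: Lambda compute is billed per GB-second, so if doubling memory (which scales CPU proportionally) halves a CPU-bound function's duration, the compute cost is unchanged while latency halves. The price constant below is an illustrative x86 on-demand figure, not a quoted rate, and the per-request charge is ignored:

```python
# Back-of-envelope Lambda compute cost model. PRICE_PER_GB_SECOND is an
# illustrative figure -- check current AWS pricing -- and the durations
# assume a CPU-bound function whose runtime scales inversely with the
# CPU share granted by the memory setting.

PRICE_PER_GB_SECOND = 0.0000166667

def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Compute charge for one invocation: GB-seconds x unit price."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND

slow = invocation_cost(memory_mb=512, duration_ms=800)    # CPU-starved
fast = invocation_cost(memory_mb=1024, duration_ms=400)   # 2x CPU, 2x faster

assert abs(slow - fast) < 1e-12   # same cost, half the latency
```

This is why memory tuning (rather than any direct CPU knob) is the standard Lambda performance lever.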
| Service | Pattern | Use Case |
| --- | --- | --- |
| SQS | Queue — point-to-point | One producer, one consumer. Decoupling, backpressure, retry. |
| SNS | Pub/sub topic — fan-out | One producer, many consumers. Broadcast notifications. |
| EventBridge | Event bus — content-based routing | Filter and route events from AWS services, SaaS, and custom apps with rules. |
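What sets EventBridge apart in that table is content-based routing. A simplified matcher sketch: real EventBridge patterns also support prefix, numeric, and anything-but matching, while this models only the basic rule that a field's value must appear in the pattern's list. The rule and events are invented:

```python
# Simplified sketch of EventBridge-style content-based routing: a rule
# is a pattern dict mapping fields to lists of allowed values, and an
# event matches when every patterned field holds an allowed value.
# store.orders / OrderPlaced and the events below are invented.

def matches(pattern: dict, event: dict) -> bool:
    return all(event.get(field) in allowed
               for field, allowed in pattern.items())

order_rule = {"source": ["store.orders"], "detail-type": ["OrderPlaced"]}

events = [
    {"source": "store.orders", "detail-type": "OrderPlaced", "id": 1},
    {"source": "store.users",  "detail-type": "SignUp",      "id": 2},
]

routed = [e["id"] for e in events if matches(order_rule, e)]
assert routed == [1]   # only the order event reaches the order target
```

Neither SQS nor SNS filters on event content this way by default, which is exactly the distinction the table is testing for.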
◆ The "So What?" layer separates architects from engineers
  • Trait 1 — The Question Behind the Question: When asked "What is RPO?", the real test is whether you know it's measured in time, not data volume. Pivot to the underlying principle, not the dictionary definition.
  • Trait 2 — System Design Pivot: When asked "What is SQS?", don't just say "It's a queue." Explain how you used SQS to decouple an ordering system and absorb a 10x Black Friday traffic spike without dropping a single order. That's the delight factor.
  • Trait 3 — Big Picture Thinking: Show how services interact to solve a business problem. Walk through flows, not definitions.
  • ❌ Rambling: Five minutes on a VPC question signals inability to communicate with executives. Use the 3-2-1 trick: 3 points, 2 minutes, 1 clear summary.
  • ❌ Generic answers: "I would use the best practices" is not an answer. Name the service, the trade-off, and the operational consequence.
  • ❌ Jumping to solutions: Always ask for requirements first — RTO, RPO, budget, scale, compliance constraints. Skipping this is a junior mistake.
  • ❌ Calling a service "the best": A Principal Architect selects services based on specific system requirements — cost, latency, scale — never on preference.

Ready to start preparing?

Pick a topic above or jump straight into the most popular section.

Start with Architecture →