Architecture

Architecture Interview Guide

Microservices, event-driven systems, three-tier, hybrid cloud, and system design trade-offs.

15 Topics
Intermediate

Microservice with ALB

Architecture diagram: Microservices with ALB — path-based routing to independent microservices on EKS with separate databases.

Application Load Balancer (ALB) enables path-based routing to direct traffic to different microservices:

  • /browse → Target Group 1 — handles traffic for browsing (e.g. cloudwithraj.com/browse)
  • /buy → Target Group 2 — handles traffic for purchases (e.g. cloudwithraj.com/buy)
  • /* (catch-all) → Target Group 3 — handles all other traffic

Each microservice can have its own compute (EC2 with an Auto Scaling Group, containers on Amazon EKS, or Lambda) and its own database (Amazon Aurora, DynamoDB, etc.). The ALB sits behind a domain (e.g. cloudwithraj.com) and routes requests to the appropriate target group based on the URL path, so each microservice can be deployed and scaled independently.
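As a minimal illustration of the listener rules above (the rule set and target group names are hypothetical, not the ALB API), rules are evaluated in priority order and the first matching path prefix wins:

```python
# Sketch of ALB path-based routing: rules are evaluated in priority order
# and the first matching path prefix wins; "/*" is the catch-all.
# RULES and target group names are hypothetical, not the ALB API.

def route(path, rules):
    """Return the target group of the first rule whose prefix matches."""
    for prefix, target_group in rules:
        if prefix == "/*" or path.startswith(prefix):
            return target_group
    return None

RULES = [
    ("/browse", "tg-browse"),   # Target Group 1
    ("/buy", "tg-buy"),         # Target Group 2
    ("/*", "tg-default"),       # Target Group 3 (catch-all)
]

print(route("/browse/shoes", RULES))  # tg-browse
print(route("/checkout", RULES))      # tg-default
```

Rule ordering matters: the catch-all must be last, exactly as an ALB evaluates rules by priority before falling through to the default rule.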

Beginner

Event Driven Architecture (Basic)

Architecture diagram: Event Driven Architecture — API Gateway to SQS to Lambda with 4 key benefits.

An event-driven architecture decouples the producer and processor. The producer (human) invokes an API and sends information in a JSON payload. API Gateway puts it into an event store (SQS), and the processor (Lambda) picks it up and processes it. API Gateway and Lambda can scale and be managed/deployed independently.

Flow: API Gateway → SQS (Event Store) → Lambda

Benefits

  1. Scale and fail independently — Services are only aware of the event router, not each other. If one service fails, the rest keep running. The event router acts as an elastic buffer that accommodates surges in workloads.
  2. Develop with agility — No need to write custom code to poll, filter, and route events; the event router automatically filters and pushes events to consumers. The router removes the need for heavy coordination between producer and consumer services.
  3. Audit with ease — An event router acts as a centralized location to audit your application and define policies to restrict who can publish/subscribe and control access to your data. Events can be encrypted in transit and at rest.
  4. Cut costs — Event-driven architectures are push-based, so everything happens on-demand. No continuous polling means less network bandwidth, less CPU utilization, less idle fleet capacity, and fewer SSL/TLS handshakes.
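The decoupling in benefit 1 can be sketched with a plain in-memory queue standing in for SQS (names are illustrative; this is not the AWS SDK):

```python
# Sketch of producer/processor decoupling via an event store. A plain
# in-memory queue stands in for SQS; function names are illustrative.
import queue

event_store = queue.Queue()

def producer(payload):
    """API Gateway role: accept the request and enqueue it."""
    event_store.put(payload)
    return {"status": "accepted"}   # the caller gets an ack, not the result

def consumer():
    """Lambda role: drain and process events at its own pace."""
    processed = []
    while not event_store.empty():
        processed.append(event_store.get())
    return processed

producer({"order_id": 1})
producer({"order_id": 2})
print(consumer())  # [{'order_id': 1}, {'order_id': 2}]
```

Neither side calls the other directly: the producer only knows the queue, and the consumer drains it at whatever rate it can sustain, which is exactly the elastic-buffer behavior described above.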
Intermediate

Microservice Vs Event Driven

Architecture diagram: Microservice (sync) vs Event Driven (async) — comparison of request flow patterns.

The main differences between traditional microservices and event-driven architecture:

  1. Synchronous vs Asynchronous — A traditional microservice call is synchronous: the request and response happen in the same invocation. With event-driven, the user gets a confirmation that the message was inserted into SQS, but doesn't receive the result of the actual message processing by the Lambda in the same invocation. Instead, the backend Lambda sends a response via WebSocket APIs, or the user can query the status afterwards.
  2. Independent Scaling — With EDA, API Gateway and Lambda/database can scale independently. Lambda can consume messages at a rate that doesn't overwhelm the database.
  3. Built-in Retries — With EDA, retries are built in. With synchronous microservices, if the Lambda fails, the user needs to send the request again. With EDA, once the message is in SQS, even if the Lambda fails, SQS will automatically redeliver it.

Microservice with Event Driven Flow: API Gateway → SQS → Lambda → DynamoDB, with a WebSocket API for async responses.
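The built-in retry behavior from point 3 can be sketched as follows: a failed processing attempt returns the message to the queue, and after a maximum receive count it moves to a dead-letter queue (a simplified model of SQS redelivery and redrive policy, not the SQS API):

```python
# Sketch of SQS-style automatic redelivery: a failed attempt returns the
# message to the queue; after max_receives it moves to a dead-letter queue.
# All names are illustrative, not the SQS API.

def process_with_retries(messages, handler, max_receives=3):
    done, dlq = [], []
    pending = [(m, 0) for m in messages]   # (message, receive count)
    while pending:
        msg, receives = pending.pop(0)
        try:
            handler(msg)
            done.append(msg)               # success: message is deleted
        except Exception:
            receives += 1
            if receives >= max_receives:
                dlq.append(msg)            # exhausted: dead-letter queue
            else:
                pending.append((msg, receives))  # redelivered automatically
    return done, dlq

attempts = {"bad": 0}
def flaky(msg):
    if msg == "bad":
        attempts["bad"] += 1
        raise RuntimeError("processing failed")

done, dlq = process_with_retries(["ok", "bad"], flaky)
print(done, dlq, attempts["bad"])  # ['ok'] ['bad'] 3
```

The user never resends anything: the queue retries on the consumer's behalf, and poison messages end up in the DLQ for inspection instead of blocking the queue.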

Advanced

Event Driven Architectures (Advanced)

Architecture diagram: Advanced EDA with EventBridge rules and SNS fan-out patterns.

Advanced event-driven patterns using multiple AWS services for routing and fan-out:

Flow: API Gateway → SQS 1 → Lambda 1 → EventBridge (Event Store + Router)

EventBridge Rules

Based on values in the message, EventBridge can fire different targets via rules:

  • Rule 1 → Step Function
  • Rule 2 → Lambda 2
  • Rule 3 → SQS 2

SNS Fan-Out

Based on values in the message, SNS can fire different targets using filters:

  • Filter 1 → SQS 3
  • Filter 2 → Lambda 1 / Lambda 2
  • Filter 3 → EKS Application
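Both EventBridge rules and SNS filter policies are forms of content-based routing. A simplified sketch of the idea (patterns here are plain dicts of allowed values; real EventBridge event patterns are richer JSON documents):

```python
# Sketch of content-based routing, the idea behind both EventBridge rules
# and SNS filter policies: each rule's pattern matches attribute values in
# the event, and every matching rule fires its target independently.

def matches(pattern, event):
    """True if every pattern key appears in the event with an allowed value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())

RULES = [
    ({"type": ["refund"]}, "step-function"),                  # Rule 1
    ({"type": ["order"], "priority": ["high"]}, "lambda-2"),  # Rule 2
    ({"type": ["order"]}, "sqs-2"),                           # Rule 3
]

def route(event):
    """Return every target whose rule matches (rules fire independently)."""
    return [target for pattern, target in RULES if matches(pattern, event)]

print(route({"type": "order", "priority": "high"}))  # ['lambda-2', 'sqs-2']
print(route({"type": "refund"}))                     # ['step-function']
```

Note that unlike the ALB's first-match routing, every matching rule fires, which is what makes fan-out possible.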
Beginner

Three-Tier Architecture

Architecture diagram: Three-Tier — Presentation, Application, and Database layers across Availability Zones.

  1. Presentation Layer — Customers consume the application using this layer. Generally, this is where the front end runs (e.g. the amazon.com website). Implemented using an external-facing load balancer distributing traffic to VMs (EC2s) running a webserver, with Auto Scaling Groups across multiple Availability Zones.
  2. Application Layer — This is where the business logic resides. For example, you browse products on amazon.com, find a product you like, and click "add to cart". The flow comes to the application layer, which validates availability and creates a cart. Implemented with an internal-facing load balancer and VMs running applications.
  3. Database Layer — This is where information is stored: product information, shopping cart, order history, etc. The application layer interacts with this layer for CRUD (Create, Read, Update, Delete) operations. This could be implemented using one or a mix of databases — SQL (e.g. Amazon Aurora) and/or NoSQL (DynamoDB).

Why is this so popular in interviews? This architecture comprises many critical patterns — microservices, load balancing, scaling, performance optimization, high availability, and more. Based on your answers, the interviewer can dig deep and check your understanding of the core concepts.

Advanced

Multi-site Active Active DR

Architecture diagram: Multi-site Active-Active DR with Route 53 routing to two AWS regions.

In a multi-site active-active disaster recovery setup, your application runs simultaneously in two or more AWS regions, each serving live production traffic.

How It Works

  • Amazon Route 53 uses latency-based or weighted routing to distribute traffic across regions.
  • Each region has a full stack: Load Balancer, compute (EC2/EKS/Lambda), and database.
  • Amazon DynamoDB Global Tables or Amazon Aurora Global Database provides multi-region, multi-active database replication with low-latency reads and writes in every region.
  • Amazon S3 Cross-Region Replication keeps objects in sync across regions.

Key Characteristics

  • RPO ≈ 0 (near-zero data loss; replication is continuous, though typically asynchronous)
  • RTO ≈ 0 (no failover needed — both sites are already active)
  • Highest cost among DR strategies, but provides the best availability
  • Each region can handle 100% of total traffic if the other region goes down
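A rough sketch of why RTO is near zero with weighted routing plus health checks: healthy regions share traffic by weight, and an unhealthy region simply drops out of the pool with no explicit failover step (region names and weights are illustrative, not the Route 53 API):

```python
# Sketch of weighted routing with health checks in an active-active setup.
# Region names and weights are illustrative, not the Route 53 API.
import random

REGIONS = {"us-east-1": 50, "eu-west-1": 50}   # weighted 50/50

def pick_region(healthy):
    """Pick a region for one request, considering only healthy regions."""
    pool = {r: w for r, w in REGIONS.items() if healthy.get(r)}
    if not pool:
        raise RuntimeError("no healthy region")
    regions, weights = zip(*pool.items())
    return random.choices(regions, weights=weights)[0]

# Both healthy: requests split by weight. One region down: the survivor
# silently takes 100% of traffic, which is why no failover step is needed.
print(pick_region({"us-east-1": False, "eu-west-1": True}))  # eu-west-1
```

In practice Route 53 health checks take tens of seconds to mark a region unhealthy, so "RTO ≈ 0" really means "RTO is the health-check detection window", a nuance worth mentioning in an interview.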
Intermediate

Hybrid Cloud Architecture

Architecture diagram: Hybrid Cloud connecting on-premises data center to AWS via Direct Connect and VPN.

Hybrid cloud connects on-premises data centers with AWS cloud, enabling workloads to run across both environments.

Key AWS Services for Hybrid

  • AWS Direct Connect — Dedicated private network connection from on-premises to AWS (bypasses the public internet for consistent performance and lower latency).
  • AWS Site-to-Site VPN — Encrypted connection over the public internet between on-premises and AWS VPC. Simpler and cheaper than Direct Connect.
  • AWS Outposts — AWS infrastructure and services delivered to your on-premises data center. Run AWS compute, storage, and databases locally with seamless connection to the AWS region.
  • AWS Storage Gateway — Bridges on-premises storage with cloud storage (S3, EBS, Glacier). Supports file, volume, and tape gateway modes.
  • Amazon EKS Anywhere — Run Kubernetes clusters on-premises using the same EKS tooling and APIs as in the cloud.

Common Use Cases

  • Regulatory requirements that mandate certain data stays on-premises
  • Gradual cloud migration (lift-and-shift over time)
  • Low-latency processing at the edge with cloud bursting for peak demand
Advanced

System Design Trade-Offs

Infographic: System Design Trade-Offs — CAP theorem and five key trade-off spectrums.

Every system design decision involves trade-offs. Understanding these is critical for interviews:

CAP Theorem

A distributed system can only guarantee two of three: Consistency, Availability, Partition Tolerance. Since network partitions are unavoidable, the real choice is between CP (consistent but may be unavailable during partition) and AP (available but may return stale data).

Key Trade-Offs

  • Consistency vs Availability — Strong consistency (every read gets the latest write) costs availability. Eventual consistency improves availability, but users may see stale data temporarily.
  • Latency vs Throughput — Optimizing for low latency (fast individual requests) can reduce overall throughput, and vice versa. Batching improves throughput but increases per-request latency.
  • Cost vs Performance — Higher performance (more instances, provisioned IOPS, reserved capacity) costs more. Spot instances save money but risk interruption.
  • Simplicity vs Scalability — Monolithic architectures are simpler but harder to scale. Microservices scale independently but add operational complexity (networking, observability, deployment).
  • SQL vs NoSQL — SQL provides ACID transactions and strong consistency but is harder to scale horizontally. NoSQL scales easily and handles unstructured data but sacrifices joins and complex queries.
Intermediate

Top 3 Popular Design

Infographic: Top 3 Popular Designs for Interviews — Three-Tier, Event-Driven Microservices, Real-time Data Pipeline.

The three most commonly asked system design patterns in cloud architect interviews:

  1. Three-Tier Web Application — Presentation (ALB + EC2 webservers), Application (internal ALB + EC2 app servers), Database (Aurora/DynamoDB). Covers load balancing, auto scaling, multi-AZ, caching (ElastiCache), and CDN (CloudFront). Almost every interview includes a variant of this.
  2. Event-Driven Microservices — API Gateway → SQS/SNS/EventBridge → Lambda/EKS consumers → DynamoDB. Demonstrates decoupling, async processing, fan-out patterns, retry/DLQ handling, and independent scaling. Shows you understand modern cloud-native design.
  3. Real-time Data Pipeline — Data producers → Amazon Kinesis Data Streams → Lambda/KDA (processing) → S3 (data lake) + DynamoDB (hot data) → Athena/QuickSight (analytics). Demonstrates streaming data ingestion, transformation, storage tiering, and visualization.
Intermediate

Three Tier with Microservice

Architecture diagram: Three-Tier with Microservices — Presentation, EKS Microservices, Database per Service.

Combining three-tier architecture with microservices gives you the best of both worlds — a familiar layered structure with independent deployability:

How They Combine

  • Presentation Layer — Frontend served via CloudFront + S3 (static), or EC2/ECS behind an external ALB. This layer only handles UI rendering and API calls.
  • Application Layer (Microservices) — Instead of a single monolithic app server, the business logic is split into independent microservices (e.g., Order Service, Payment Service, User Service). Each runs in its own container on EKS or ECS, with its own internal ALB or service mesh routing.
  • Database Layer (Database per Service) — Each microservice owns its own database (e.g., Order Service uses DynamoDB, User Service uses Aurora). No shared database — this ensures loose coupling.

Key Differences from Traditional Three-Tier

  • Each microservice can be deployed, scaled, and updated independently
  • Different technology stacks per service (polyglot architecture)
  • Inter-service communication via APIs, message queues, or event bus
  • More complex operationally but far more scalable and resilient
Beginner

Monolith Vs Microservice

Infographic: Monolith vs Microservice — visual comparison of architecture, deployment, scaling, and trade-offs.

| Aspect | Monolith | Microservice |
| --- | --- | --- |
| Deployment | Single deployable unit | Each service deployed independently |
| Scaling | Scale the entire application | Scale individual services as needed |
| Tech Stack | Single language/framework | Polyglot — different languages per service |
| Database | Shared database | Database per service |
| Team Structure | One team owns everything | Small teams own individual services |
| Complexity | Simple to develop and debug initially | Complex distributed system (networking, observability) |
| Failure Impact | One bug can bring down entire app | Failures are isolated to individual services |
| Best For | Small teams, MVPs, low complexity | Large teams, high scalability, independent releases |

Start monolith, evolve to microservices. Many successful companies (Amazon, Netflix) started as monoliths and migrated to microservices when the team size and scaling requirements justified the operational complexity.

Advanced

System Design: Global E-Commerce Platform

Prompt: "Design a global e-commerce platform like Amazon that handles 100K orders/day, serves users in US, Europe, and Asia, and survives a full region failure."

Step 1 β€” Clarify Requirements

| Requirement | Decision |
| --- | --- |
| Users | 10M registered, 500K DAU, 100K orders/day |
| Regions | US (primary), EU, Asia (secondary) |
| Availability | 99.99% — multi-region active-active |
| Consistency | Strong for orders/inventory, eventual for catalog/reviews |
| Latency | <200ms for page loads, <500ms for checkout |

Step 2 β€” High-Level Architecture

| Layer | Service | Why This Choice |
| --- | --- | --- |
| Global DNS | Route 53 (latency-based routing) | Routes users to nearest region automatically |
| CDN | CloudFront | Static assets + product images (reduces origin load by 80%+) |
| Frontend | S3 + CloudFront (React SPA) | Global edge distribution, zero server management |
| API Layer | ALB + EKS (Kubernetes) | Microservices need independent scaling and deployment |
| Product Catalog | DynamoDB Global Tables | Multi-region reads; eventually consistent catalog is acceptable |
| Orders & Inventory | Aurora Global Database | Strong consistency for financial transactions, single writer + fast read replicas |
| Search | OpenSearch | Full-text product search with faceted filtering |
| Cart & Sessions | ElastiCache Redis (Global Datastore) | Sub-ms latency for session/cart, replicated across regions |
| Order Events | EventBridge + SQS | Order placed → fan-out to inventory, payment, notification, shipping services |
| Payments | Step Functions + Lambda | Saga pattern: charge → reserve → confirm (with compensating rollback) |

Step 3 β€” Key Trade-Offs to Discuss

  • DynamoDB vs Aurora for orders: Aurora gives SQL joins for complex order queries and strong consistency. DynamoDB Global Tables would give multi-region writes but only eventual consistency — risky for inventory counts.
  • Inventory race conditions: Use Aurora's SELECT FOR UPDATE or DynamoDB conditional writes to prevent overselling. Discuss: at 100K orders/day, hot product contention is real.
  • Region failover: Route 53 health checks detect region failure (30-60s). Aurora Global Database promotes secondary to primary (~1 min). Cart data in Redis Global Datastore is already replicated.
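The conditional-write guard against overselling can be sketched as an atomic check-and-set. Here a dict plus a lock stands in for DynamoDB's per-item atomicity, where the same idea is an UpdateItem with a ConditionExpression such as "stock >= :qty":

```python
# Sketch of the check-and-set that prevents overselling: decrement stock
# only if enough remains, atomically. A dict plus a lock stands in for
# DynamoDB; names and SKUs are illustrative.
import threading

inventory = {"sku-123": 2}
_lock = threading.Lock()   # stands in for DynamoDB's per-item atomicity

def reserve(sku, qty):
    with _lock:
        if inventory.get(sku, 0) >= qty:   # the condition
            inventory[sku] -= qty          # the update
            return True
        return False                       # ConditionalCheckFailed

print(reserve("sku-123", 1))  # True
print(reserve("sku-123", 1))  # True
print(reserve("sku-123", 1))  # False: stock is 0, third buyer rejected
```

The key point for the interview: the check and the decrement happen as one atomic operation, so two concurrent buyers can never both pass the check on the last unit.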

🎯 Key Takeaway

Interview tip: "I'd design this with a microservices architecture on EKS, using DynamoDB Global Tables for the product catalog (eventual consistency is fine for reads) and Aurora Global Database for orders/inventory (strong consistency is critical for financial data). The checkout flow uses Step Functions with a Saga pattern so if payment fails after inventory is reserved, compensating transactions release it automatically."

Advanced

System Design: Real-Time Analytics Pipeline

Prompt: "Design a real-time analytics system that processes 1 million events/second from a mobile app, computes trending content every 5 minutes, and powers a live dashboard."

Step 1 β€” Requirements

  • Throughput: 1M events/second peak (~86B events/day)
  • Latency: Dashboard updates within 30 seconds of event generation
  • Analytics: Trending content (5-min window), user engagement metrics, real-time anomaly detection
  • Retention: Hot data (7 days real-time), warm (90 days queryable), cold (archive)

Step 2 β€” Architecture Layers

| Layer | Service | Rationale |
| --- | --- | --- |
| Ingestion | Kinesis Data Streams (1,000+ shards) | Ordered within partition; at 1,000 records/s per shard, 1M events/s needs on the order of 1,000 shards; 7-day retention |
| Stream Processing | Managed Apache Flink | Sliding window aggregations (5-min trending), CEP for anomaly detection |
| Real-Time Store | ElastiCache Redis (Sorted Sets) | Trending leaderboard with ZADD/ZRANGE, sub-ms reads for dashboard |
| Batch Landing | Kinesis Data Firehose → S3 (Parquet) | Auto-batching, compression, format conversion for analytics |
| Batch Analytics | Athena + Glue Catalog | SQL on S3 for ad-hoc historical queries, $5/TB scanned |
| Dashboard API | API Gateway + Lambda | Reads from Redis (real-time) and Athena (historical) |
| Visualization | Amazon Managed Grafana | Real-time dashboards with auto-refresh, alerting |
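The Redis Sorted Set trending pattern in the table boils down to increment-and-rank. A sketch with a plain dict emulating ZINCRBY and a top-N read (illustrative only, not the redis client API):

```python
# Sketch of the Sorted Set trending pattern: ZINCRBY-style counting per
# item and a top-N read for the dashboard. A plain dict emulates the
# sorted set; in Redis these are single server-side commands.

scores = {}

def zincrby(member, amount=1):
    scores[member] = scores.get(member, 0) + amount

def top_n(n):
    """Highest-scored members first, like a reverse-order ZRANGE."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

for view in ["video-a", "video-b", "video-a", "video-c", "video-a", "video-b"]:
    zincrby(view)

print(top_n(2))  # ['video-a', 'video-b']
```

In the pipeline, Flink's windowed aggregation would emit these increments every window, and the dashboard API would read the top-N in sub-millisecond time.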

Step 3 β€” Key Design Decisions

  • Kinesis vs MSK: Kinesis for tight AWS integration and serverless consumer (Flink). MSK if team has Kafka expertise or needs Kafka Connect ecosystem.
  • Lambda vs Flink for processing: Lambda for simple event-by-event transforms. Flink for windowed aggregations, joins between streams, and complex event processing. At 1M events/s, Flink is the right choice.
  • Hot shard prevention: Partition by user_id (high cardinality). If a viral event causes skew, use random partition key suffix with application-level aggregation.
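The hot-shard mitigation described above (random partition key suffix plus read-side aggregation) can be sketched as follows; the suffix count and key format are illustrative:

```python
# Sketch of hot-partition mitigation by random key suffixing: writes for
# one viral key spread across N suffixed shards, and the read side merges
# the shards back together.
import random
from collections import Counter

N_SUFFIXES = 4

def shard_key(user_id):
    """Write path: append a random suffix so one hot key spans N shards."""
    return f"{user_id}#{random.randrange(N_SUFFIXES)}"

def aggregate(counts, user_id):
    """Read path: merge all suffixed shards for the logical key."""
    return sum(v for k, v in counts.items() if k.split("#")[0] == user_id)

counts = Counter(shard_key("viral-user") for _ in range(1000))
print(len(counts) <= N_SUFFIXES)        # True: load is spread
print(aggregate(counts, "viral-user"))  # 1000: no events lost
```

The trade-off to call out: writes get cheaper (no hot shard), but every read now touches N shards, so this only pays off for genuinely skewed keys.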

🎯 Key Takeaway

Interview tip: "I'd build a Lambda architecture with two paths: a speed layer (Kinesis β†’ Flink β†’ Redis for real-time trending) and a batch layer (Kinesis β†’ Firehose β†’ S3 β†’ Athena for historical queries). Flink's sliding window aggregation computes trending content every 5 minutes, writing results to Redis Sorted Sets that the dashboard API reads with sub-millisecond latency."

Advanced

System Design: Global Chat Application

Prompt: "Design a real-time chat system like Slack or WhatsApp for 10M users with group chats, message delivery guarantees, and read receipts."

Step 1 β€” Requirements

| Requirement | Decision |
| --- | --- |
| Users | 10M registered, 1M concurrent connections |
| Message delivery | At-least-once with deduplication, ordered per conversation |
| Latency | <100ms for message delivery to online users |
| Features | 1:1 chat, group chat (up to 500 members), read receipts, typing indicators, file sharing |
| Storage | Messages stored forever, searchable |

Step 2 β€” Architecture

| Component | Service | Rationale |
| --- | --- | --- |
| WebSocket Gateway | NLB + EKS (custom WebSocket server) | 1M persistent connections; NLB for Layer 4 TCP passthrough. API Gateway WebSocket APIs have a default new-connection rate quota (500/s), a tight fit at this scale. |
| Connection Registry | ElastiCache Redis | Maps user_id → server_id for message routing. Sub-ms lookups. |
| Message Queue | SQS FIFO | Ordered delivery per conversation (MessageGroupId = conversation_id). Deduplication built-in. |
| Message Store | DynamoDB | PK = conversation_id, SK = timestamp#message_id. Single-digit ms writes at any scale. |
| Search | OpenSearch | Full-text message search. DynamoDB Streams → Lambda → OpenSearch for indexing. |
| File Storage | S3 + pre-signed URLs | Client uploads directly to S3 (never through the API). CloudFront for image/file delivery. |
| Push Notifications | SNS (Mobile Push) | For offline users. APNs (iOS), FCM (Android). |
| Presence & Typing | ElastiCache Redis Pub/Sub | Ephemeral signals — no need to persist. Redis Pub/Sub broadcasts to subscribed servers. |

Step 3 β€” Message Delivery Flow

  1. User A sends message via WebSocket → WebSocket server receives it
  2. Write to DynamoDB (persist) + publish to SQS FIFO (delivery)
  3. Look up User B's WebSocket server in Redis → route message to that server
  4. If User B is online → deliver via WebSocket. If offline → send push notification via SNS
  5. User B's client sends read receipt → update DynamoDB → notify User A
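Steps 3 and 4 of the flow can be sketched with a dict standing in for the Redis connection registry (all names are illustrative):

```python
# Sketch of registry lookup plus offline fallback: the registry maps
# user_id -> server_id (the Redis role), and delivery falls back to a
# push notification when the recipient has no live connection.

registry = {"user-a": "ws-server-1"}   # user_id -> server_id

def deliver(to_user, message, push_queue):
    server = registry.get(to_user)
    if server:
        return ("websocket", server)        # online: route to that server
    push_queue.append((to_user, message))   # offline: hand off to SNS push
    return ("push", None)

pushes = []
print(deliver("user-a", "hi", pushes))  # ('websocket', 'ws-server-1')
print(deliver("user-b", "hi", pushes))  # ('push', None)
```

In production the registry entries would carry a TTL and be refreshed by heartbeats, so a crashed WebSocket server's users naturally fall back to the push path.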

🎯 Key Takeaway

Interview tip: "The key challenge is routing messages to the right WebSocket server among thousands. I'd use a Redis-based connection registry that maps each online user to their server. For group chats with 500 members, I'd fan-out through SQS FIFO with MessageGroupId per conversation to maintain ordering. Offline users get SNS push notifications. DynamoDB stores all messages with conversation_id as partition key and timestamp as sort key for efficient range queries."

Advanced

System Design: Multi-Tenant SaaS Platform

Prompt: "Design a multi-tenant SaaS platform where enterprise customers demand data isolation, custom domains, and guaranteed performance SLAs."

Tenant Isolation Models

| Model | Isolation | Cost | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Silo | Separate account/VPC per tenant | Highest | High (per-tenant infra) | Regulated enterprises, healthcare, finance |
| Bridge | Shared compute, separate databases | Medium | Medium | Mid-market B2B SaaS |
| Pool | Shared everything, data partitioned by tenant_id | Lowest | Low (but need row-level security) | SMB, high-volume low-margin |

Architecture (Bridge Model — Most Common)

| Component | Service | Tenant Strategy |
| --- | --- | --- |
| DNS | Route 53 | Custom domain per tenant (CNAME → shared ALB) |
| API | ALB + EKS | Shared compute. JWT contains tenant_id. Middleware extracts and scopes all queries. |
| Auth | Cognito (User Pool per tenant) | Separate user pools for data isolation, or a shared pool with a custom:tenant_id attribute. |
| Database | Aurora PostgreSQL | Schema-per-tenant (CREATE SCHEMA tenant_123) with Row Level Security policies as defense-in-depth. |
| Storage | S3 | Prefix-per-tenant (s3://bucket/tenant_123/files/). Bucket policy restricts cross-tenant access. |
| Queues | SQS | tenant_id in message attributes. Consumers filter or route by tenant. |
| Monitoring | CloudWatch | Custom metrics dimensioned by tenant_id. Per-tenant dashboards and alarms. |
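The tenant-scoping middleware idea from the API row can be sketched as follows: tenant_id is taken only from the verified JWT claims, never from the request body, and the schema name is derived server-side (a simplified sketch; real code would verify the JWT signature and quote identifiers safely):

```python
# Sketch of tenant scoping in the API middleware. JWT verification and
# safe SQL identifier quoting are elided; all names are illustrative.

def tenant_schema(jwt_claims):
    """Derive the schema from the tenant_id claim set at token issuance."""
    return f"tenant_{jwt_claims['tenant_id']}"

def scoped_query(jwt_claims, table):
    """Every query is pinned to the caller's schema (schema-per-tenant)."""
    return f"SELECT * FROM {tenant_schema(jwt_claims)}.{table}"

print(scoped_query({"tenant_id": "123"}, "orders"))
# SELECT * FROM tenant_123.orders
```

Because the schema is derived from the token rather than any client-supplied parameter, a tenant cannot point a request at another tenant's schema; PostgreSQL Row Level Security then backstops any bug in this layer.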

Key Design Challenges

  • Noisy neighbor: One tenant's heavy usage impacts others. Solution: per-tenant rate limiting at ALB/API Gateway, resource quotas, and tenant-aware auto-scaling.
  • Data migration: Moving a growing tenant from Pool to Bridge (or Bridge to Silo). Design the data model to support promotion without downtime.
  • Cost attribution: Track per-tenant resource consumption for usage-based billing. Tag all resources, use CloudWatch metrics with tenant dimension, and generate per-tenant invoices from the CUR.

🎯 Key Takeaway

Interview tip: "I'd use the Bridge model β€” shared EKS compute with schema-per-tenant in Aurora PostgreSQL. Every API request extracts tenant_id from the JWT and scopes database queries to that tenant's schema. Row Level Security in PostgreSQL provides defense-in-depth. For enterprise customers who demand full isolation, I'd promote them to the Silo model with a dedicated Aurora cluster and VPC, managed via AWS Organizations."

Advanced

Interview Questions β€” Architecture Patterns

Scenario-based architecture questions that test your ability to think through trade-offs, not just name services.

  1. Answer Guide
    EDA with SNS/SQS fan-out. Order placed → SNS topic → SQS queues for each downstream service. Discuss idempotency, dead-letter queues, and how to handle partial failures.
  2. Answer Guide
    Strangler Fig pattern. Route traffic through an ALB, gradually extracting services behind new paths. Never rewrite from scratch — it's the riskiest approach.
  3. Answer Guide
    Web tier (ALB + CloudFront with geo-restriction), App tier (ECS/EKS in private subnet), Data tier (Aurora with encryption). Discuss SGs between tiers, NACLs, and data residency.
  4. Answer Guide
    You can't have both during a partition. CP example: Aurora (favors consistency, may reject writes during failures). AP example: DynamoDB (eventual consistency, always available).
  5. Answer Guide
    Circuit breaker pattern for failing services, async processing via SQS for non-critical calls, parallel calls where possible, and timeout limits on each call.
  6. Answer Guide
    DynamoDB Global Tables for active-active with eventual consistency, Aurora Global Database for SQL with read replicas per region. Route 53 latency-based routing. Discuss conflict resolution.
  7. Answer Guide
    SQS FIFO for ordering (MessageGroupId), idempotency keys in DynamoDB for deduplication. Discuss the trade-off: FIFO queues have lower throughput (3,000 msg/s with batching).
  8. Answer Guide
    Start with a well-structured monolith. Microservices add operational complexity (service mesh, distributed tracing, CI/CD per service) that a 3-person team can't sustain. Evolve when team and scale justify it.
  9. Answer Guide
    CQRS (Command Query Responsibility Segregation). Write to Aurora/DynamoDB, project into read-optimized views (ElastiCache, OpenSearch). Events synchronize the two sides.
  10. Answer Guide
    Saga pattern with compensating transactions. Each service publishes events — if step 3 fails, trigger compensating actions for steps 1 and 2. Orchestrated (Step Functions) vs choreographed (events).
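The saga pattern from answer 10 can be sketched as an orchestrator that runs compensations in reverse order when a step fails (a toy model of what Step Functions expresses with Catch transitions to compensation states; names are illustrative):

```python
# Sketch of an orchestrated saga: run steps in order; on failure, run the
# compensations of completed steps in reverse order.

def run_saga(steps):
    """steps: list of (action, compensation) callables."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for comp in reversed(completed):   # roll back in reverse order
                comp()
            return "rolled_back"
    return "committed"

log = []
def fail_shipping():
    raise RuntimeError("ship failed")

steps = [
    (lambda: log.append("charge"),  lambda: log.append("refund")),
    (lambda: log.append("reserve"), lambda: log.append("release")),
    (fail_shipping, lambda: None),
]
print(run_saga(steps))  # rolled_back
print(log)              # ['charge', 'reserve', 'release', 'refund']
```

Note the ordering: the reservation is released before the charge is refunded, mirroring how compensations must undo work in the reverse of the order it was done.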

Preparation Strategy

Architecture interviews test your trade-off reasoning, not your ability to name services. For every design, articulate: "This approach gives us [benefit] at the cost of [trade-off]. I chose it because [constraint makes this trade-off acceptable]."