Architecture

Architecture Interview Guide

Microservices, event-driven systems, three-tier, hybrid cloud, and system design trade-offs.

15 Topics
Intermediate

Microservice with ALB

Architecture diagram: Microservices with ALB — path-based routing to independent microservices on EKS with separate databases.

Application Load Balancer (ALB) enables path-based routing to direct traffic to different microservices:

  • /browse → Target Group 1 — handles traffic for browsing (e.g. cloudwithraj.com/browse)
  • /buy → Target Group 2 — handles traffic for purchases (e.g. cloudwithraj.com/buy)
  • /* (catch-all) → Target Group 3 — handles all other traffic

Each microservice can have its own compute (EC2 with an Auto Scaling Group, containers on Amazon EKS, or Lambda) and its own database (Amazon Aurora, DynamoDB, etc.). The ALB sits behind a domain (e.g. cloudwithraj.com) and routes requests to the appropriate target group based on the URL path, so each microservice can be deployed and scaled independently.
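As a minimal illustration of the listener rules above (the rule set and target group names are hypothetical, not the ALB API), rules are evaluated in priority order and the first matching path prefix wins:

```python
# Sketch of ALB path-based routing: rules are evaluated in priority order
# and the first matching path prefix wins; "/*" is the catch-all.
# RULES and target group names are hypothetical, not the ALB API.

def route(path, rules):
    """Return the target group of the first rule whose prefix matches."""
    for prefix, target_group in rules:
        if prefix == "/*" or path.startswith(prefix):
            return target_group
    return None

RULES = [
    ("/browse", "tg-browse"),   # Target Group 1
    ("/buy", "tg-buy"),         # Target Group 2
    ("/*", "tg-default"),       # Target Group 3 (catch-all)
]

print(route("/browse/shoes", RULES))  # tg-browse
print(route("/checkout", RULES))      # tg-default
```

Rule ordering matters: the catch-all must be last, exactly as an ALB evaluates rules by priority before falling through to the default rule.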

Beginner

Event Driven Architecture (Basic)

Architecture diagram: Event Driven Architecture — API Gateway to SQS to Lambda with 4 key benefits.

An event-driven architecture decouples the producer and processor. The producer (human) invokes an API and sends information in a JSON payload. API Gateway puts it into an event store (SQS), and the processor (Lambda) picks it up and processes it. API Gateway and Lambda can scale and be managed/deployed independently.

Flow: API Gateway → SQS (Event Store) → Lambda

Benefits

  1. Scale and fail independently — Services are only aware of the event router, not each other. If one service fails, the rest keep running. The event router acts as an elastic buffer that accommodates surges in workloads.
  2. Develop with agility — No need to write custom code to poll, filter, and route events; the event router automatically filters and pushes events to consumers. The router removes the need for heavy coordination between producer and consumer services.
  3. Audit with ease — An event router acts as a centralized location to audit your application and define policies to restrict who can publish/subscribe and control access to your data. Events can be encrypted in transit and at rest.
  4. Cut costs — Event-driven architectures are push-based, so everything happens on-demand. No continuous polling means less network bandwidth, less CPU utilization, less idle fleet capacity, and fewer SSL/TLS handshakes.
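The decoupling in benefit 1 can be sketched with a plain in-memory queue standing in for SQS (names are illustrative; this is not the AWS SDK):

```python
# Sketch of producer/processor decoupling via an event store. A plain
# in-memory queue stands in for SQS; function names are illustrative.
import queue

event_store = queue.Queue()

def producer(payload):
    """API Gateway role: accept the request and enqueue it."""
    event_store.put(payload)
    return {"status": "accepted"}   # the caller gets an ack, not the result

def consumer():
    """Lambda role: drain and process events at its own pace."""
    processed = []
    while not event_store.empty():
        processed.append(event_store.get())
    return processed

producer({"order_id": 1})
producer({"order_id": 2})
print(consumer())  # [{'order_id': 1}, {'order_id': 2}]
```

Neither side calls the other directly: the producer only knows the queue, and the consumer drains it at whatever rate it can sustain, which is exactly the elastic-buffer behavior described above.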
Intermediate

Microservice Vs Event Driven

Architecture diagram: Microservice (sync) vs Event Driven (async) — comparison of request flow patterns.

The main differences between traditional microservices and event-driven architecture:

  1. Synchronous vs Asynchronous — A traditional microservice call is synchronous: the request and response happen in the same invocation. With event-driven, the user gets a confirmation that the message was inserted into SQS, but doesn't receive the result of the actual message processing by the Lambda in the same invocation. Instead, the backend Lambda sends a response via WebSocket APIs, or the user can query the status afterwards.
  2. Independent Scaling — With EDA, API Gateway and Lambda/database can scale independently. Lambda can consume messages at a rate that doesn't overwhelm the database.
  3. Built-in Retries — With EDA, retries are built in. With synchronous microservices, if the Lambda fails, the user needs to send the request again. With EDA, once the message is in SQS, even if the Lambda fails, SQS will automatically redeliver it.

Microservice with Event Driven Flow: API Gateway → SQS → Lambda → DynamoDB, with a WebSocket API for async responses.
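The built-in retry behavior from point 3 can be sketched as follows: a failed processing attempt returns the message to the queue, and after a maximum receive count it moves to a dead-letter queue (a simplified model of SQS redelivery and redrive policy, not the SQS API):

```python
# Sketch of SQS-style automatic redelivery: a failed attempt returns the
# message to the queue; after max_receives it moves to a dead-letter queue.
# All names are illustrative, not the SQS API.

def process_with_retries(messages, handler, max_receives=3):
    done, dlq = [], []
    pending = [(m, 0) for m in messages]   # (message, receive count)
    while pending:
        msg, receives = pending.pop(0)
        try:
            handler(msg)
            done.append(msg)               # success: message is deleted
        except Exception:
            receives += 1
            if receives >= max_receives:
                dlq.append(msg)            # exhausted: dead-letter queue
            else:
                pending.append((msg, receives))  # redelivered automatically
    return done, dlq

attempts = {"bad": 0}
def flaky(msg):
    if msg == "bad":
        attempts["bad"] += 1
        raise RuntimeError("processing failed")

done, dlq = process_with_retries(["ok", "bad"], flaky)
print(done, dlq, attempts["bad"])  # ['ok'] ['bad'] 3
```

The user never resends anything: the queue retries on the consumer's behalf, and poison messages end up in the DLQ for inspection instead of blocking the queue.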

Advanced

Event Driven Architectures (Advanced)

Architecture diagram: Advanced EDA with EventBridge rules and SNS fan-out patterns.

Advanced event-driven patterns using multiple AWS services for routing and fan-out:

Flow: API Gateway → SQS 1 → Lambda 1 → EventBridge (Event Store + Router)

EventBridge Rules

Based on values in the message, EventBridge can fire different targets via rules:

  • Rule 1 → Step Function
  • Rule 2 → Lambda 2
  • Rule 3 → SQS 2

SNS Fan-Out

Based on values in the message, SNS can fire different targets using filters:

  • Filter 1 → SQS 3
  • Filter 2 → Lambda 1 / Lambda 2
  • Filter 3 → EKS Application
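Both EventBridge rules and SNS filter policies are forms of content-based routing. A simplified sketch of the idea (patterns here are plain dicts of allowed values; real EventBridge event patterns are richer JSON documents):

```python
# Sketch of content-based routing, the idea behind both EventBridge rules
# and SNS filter policies: each rule's pattern matches attribute values in
# the event, and every matching rule fires its target independently.

def matches(pattern, event):
    """True if every pattern key appears in the event with an allowed value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())

RULES = [
    ({"type": ["refund"]}, "step-function"),                  # Rule 1
    ({"type": ["order"], "priority": ["high"]}, "lambda-2"),  # Rule 2
    ({"type": ["order"]}, "sqs-2"),                           # Rule 3
]

def route(event):
    """Return every target whose rule matches (rules fire independently)."""
    return [target for pattern, target in RULES if matches(pattern, event)]

print(route({"type": "order", "priority": "high"}))  # ['lambda-2', 'sqs-2']
print(route({"type": "refund"}))                     # ['step-function']
```

Note that unlike the ALB's first-match routing, every matching rule fires, which is what makes fan-out possible.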
Beginner

Three-Tier Architecture

Architecture diagram: Three-Tier — Presentation, Application, and Database layers across Availability Zones.

  1. Presentation Layer — Customers consume the application using this layer. Generally, this is where the front end runs (e.g. the amazon.com website). Implemented using an external-facing load balancer distributing traffic to VMs (EC2s) running a webserver, with Auto Scaling Groups across multiple Availability Zones.
  2. Application Layer — This is where the business logic resides. For example, you browse products on amazon.com, find a product you like, and click "add to cart". The flow comes to the application layer, which validates availability and creates a cart. Implemented with an internal-facing load balancer and VMs running applications.
  3. Database Layer — This is where information is stored: product information, shopping cart, order history, etc. The application layer interacts with this layer for CRUD (Create, Read, Update, Delete) operations. This could be implemented using one or a mix of databases — SQL (e.g. Amazon Aurora) and/or NoSQL (DynamoDB).

Why is this so popular in interviews? This architecture comprises many critical patterns — microservices, load balancing, scaling, performance optimization, high availability, and more. Based on your answers, the interviewer can dig deep and check your understanding of the core concepts.

Advanced

Multi-site Active Active DR

Architecture diagram: Multi-site Active-Active DR with Route 53 routing to two AWS regions.

In a multi-site active-active disaster recovery setup, your application runs simultaneously in two or more AWS regions, each serving live production traffic.

How It Works

  • Amazon Route 53 uses latency-based or weighted routing to distribute traffic across regions.
  • Each region has a full stack: Load Balancer, compute (EC2/EKS/Lambda), and database.
  • Amazon DynamoDB Global Tables or Amazon Aurora Global Database provides multi-region, multi-active database replication with low-latency reads and writes in every region.
  • Amazon S3 Cross-Region Replication keeps objects in sync across regions.

Key Characteristics

  • RPO ≈ 0 (near-zero data loss; replication is continuous, though typically asynchronous)
  • RTO ≈ 0 (no failover needed — both sites are already active)
  • Highest cost among DR strategies, but provides the best availability
  • Each region can handle 100% of total traffic if the other region goes down
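A rough sketch of why RTO is near zero with weighted routing plus health checks: healthy regions share traffic by weight, and an unhealthy region simply drops out of the pool with no explicit failover step (region names and weights are illustrative, not the Route 53 API):

```python
# Sketch of weighted routing with health checks in an active-active setup.
# Region names and weights are illustrative, not the Route 53 API.
import random

REGIONS = {"us-east-1": 50, "eu-west-1": 50}   # weighted 50/50

def pick_region(healthy):
    """Pick a region for one request, considering only healthy regions."""
    pool = {r: w for r, w in REGIONS.items() if healthy.get(r)}
    if not pool:
        raise RuntimeError("no healthy region")
    regions, weights = zip(*pool.items())
    return random.choices(regions, weights=weights)[0]

# Both healthy: requests split by weight. One region down: the survivor
# silently takes 100% of traffic, which is why no failover step is needed.
print(pick_region({"us-east-1": False, "eu-west-1": True}))  # eu-west-1
```

In practice Route 53 health checks take tens of seconds to mark a region unhealthy, so "RTO ≈ 0" really means "RTO is the health-check detection window", a nuance worth mentioning in an interview.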
Intermediate

Hybrid Cloud Architecture

Architecture diagram: Hybrid Cloud connecting on-premises data center to AWS via Direct Connect and VPN.

Hybrid cloud connects on-premises data centers with AWS cloud, enabling workloads to run across both environments.

Key AWS Services for Hybrid

  • AWS Direct Connect — Dedicated private network connection from on-premises to AWS (bypasses the public internet for consistent performance and lower latency).
  • AWS Site-to-Site VPN — Encrypted connection over the public internet between on-premises and AWS VPC. Simpler and cheaper than Direct Connect.
  • AWS Outposts — AWS infrastructure and services delivered to your on-premises data center. Run AWS compute, storage, and databases locally with seamless connection to the AWS region.
  • AWS Storage Gateway — Bridges on-premises storage with cloud storage (S3, EBS, Glacier). Supports file, volume, and tape gateway modes.
  • Amazon EKS Anywhere — Run Kubernetes clusters on-premises using the same EKS tooling and APIs as in the cloud.

Common Use Cases

  • Regulatory requirements that mandate certain data stays on-premises
  • Gradual cloud migration (lift-and-shift over time)
  • Low-latency processing at the edge with cloud bursting for peak demand
Advanced

System Design Trade-Offs

Infographic: System Design Trade-Offs — CAP theorem and five key trade-off spectrums.

Every system design decision involves trade-offs. Understanding these is critical for interviews:

CAP Theorem

A distributed system can only guarantee two of three: Consistency, Availability, Partition Tolerance. Since network partitions are unavoidable, the real choice is between CP (consistent but may be unavailable during partition) and AP (available but may return stale data).

Key Trade-Offs

  • Consistency vs Availability — Strong consistency (every read gets the latest write) costs availability. Eventual consistency improves availability, but users may see stale data temporarily.
  • Latency vs Throughput — Optimizing for low latency (fast individual requests) can reduce overall throughput, and vice versa. Batching improves throughput but increases per-request latency.
  • Cost vs Performance — Higher performance (more instances, provisioned IOPS, reserved capacity) costs more. Spot instances save money but risk interruption.
  • Simplicity vs Scalability — Monolithic architectures are simpler but harder to scale. Microservices scale independently but add operational complexity (networking, observability, deployment).
  • SQL vs NoSQL — SQL provides ACID transactions and strong consistency but is harder to scale horizontally. NoSQL scales easily and handles unstructured data but sacrifices joins and complex queries.
Intermediate

Top 3 Popular Design

Infographic: Top 3 Popular Designs for Interviews — Three-Tier, Event-Driven Microservices, Real-time Data Pipeline.

The three most commonly asked system design patterns in cloud architect interviews:

  1. Three-Tier Web Application — Presentation (ALB + EC2 webservers), Application (internal ALB + EC2 app servers), Database (Aurora/DynamoDB). Covers load balancing, auto scaling, multi-AZ, caching (ElastiCache), and CDN (CloudFront). Almost every interview includes a variant of this.
  2. Event-Driven Microservices — API Gateway → SQS/SNS/EventBridge → Lambda/EKS consumers → DynamoDB. Demonstrates decoupling, async processing, fan-out patterns, retry/DLQ handling, and independent scaling. Shows you understand modern cloud-native design.
  3. Real-time Data Pipeline — Data producers → Amazon Kinesis Data Streams → Lambda/KDA (processing) → S3 (data lake) + DynamoDB (hot data) → Athena/QuickSight (analytics). Demonstrates streaming data ingestion, transformation, storage tiering, and visualization.
Intermediate

Three Tier with Microservice

Architecture diagram: Three-Tier with Microservices — Presentation, EKS Microservices, Database per Service.

Combining three-tier architecture with microservices gives you the best of both worlds — a familiar layered structure with independent deployability:

How They Combine

  • Presentation Layer — Frontend served via CloudFront + S3 (static), or EC2/ECS behind an external ALB. This layer only handles UI rendering and API calls.
  • Application Layer (Microservices) — Instead of a single monolithic app server, the business logic is split into independent microservices (e.g., Order Service, Payment Service, User Service). Each runs in its own container on EKS or ECS, with its own internal ALB or service mesh routing.
  • Database Layer (Database per Service) — Each microservice owns its own database (e.g., Order Service uses DynamoDB, User Service uses Aurora). No shared database — this ensures loose coupling.

Key Differences from Traditional Three-Tier

  • Each microservice can be deployed, scaled, and updated independently
  • Different technology stacks per service (polyglot architecture)
  • Inter-service communication via APIs, message queues, or event bus
  • More complex operationally but far more scalable and resilient
Beginner

Monolith Vs Microservice

Infographic: Monolith vs Microservice — visual comparison of architecture, deployment, scaling, and trade-offs.

| Aspect | Monolith | Microservice |
| --- | --- | --- |
| Deployment | Single deployable unit | Each service deployed independently |
| Scaling | Scale the entire application | Scale individual services as needed |
| Tech Stack | Single language/framework | Polyglot — different languages per service |
| Database | Shared database | Database per service |
| Team Structure | One team owns everything | Small teams own individual services |
| Complexity | Simple to develop and debug initially | Complex distributed system (networking, observability) |
| Failure Impact | One bug can bring down entire app | Failures are isolated to individual services |
| Best For | Small teams, MVPs, low complexity | Large teams, high scalability, independent releases |

Start monolith, evolve to microservices. Many successful companies (Amazon, Netflix) started as monoliths and migrated to microservices when the team size and scaling requirements justified the operational complexity.

Advanced

System Design: Global E-Commerce Platform

Prompt: "Design a global e-commerce platform like Amazon that handles 100K orders/day, serves users in US, Europe, and Asia, and survives a full region failure."

Step 1 β€” Clarify Requirements

| Requirement | Decision |
| --- | --- |
| Users | 10M registered, 500K DAU, 100K orders/day |
| Regions | US (primary), EU, Asia (secondary) |
| Availability | 99.99% — multi-region active-active |
| Consistency | Strong for orders/inventory, eventual for catalog/reviews |
| Latency | <200ms for page loads, <500ms for checkout |

Step 2 β€” High-Level Architecture

| Layer | Service | Why This Choice |
| --- | --- | --- |
| Global DNS | Route 53 (latency-based routing) | Routes users to nearest region automatically |
| CDN | CloudFront | Static assets + product images (reduces origin load by 80%+) |
| Frontend | S3 + CloudFront (React SPA) | Global edge distribution, zero server management |
| API Layer | ALB + EKS (Kubernetes) | Microservices need independent scaling and deployment |
| Product Catalog | DynamoDB Global Tables | Multi-region reads; eventually consistent catalog is acceptable |
| Orders & Inventory | Aurora Global Database | Strong consistency for financial transactions, single writer + fast read replicas |
| Search | OpenSearch | Full-text product search with faceted filtering |
| Cart & Sessions | ElastiCache Redis (Global Datastore) | Sub-ms latency for session/cart, replicated across regions |
| Order Events | EventBridge + SQS | Order placed → fan-out to inventory, payment, notification, shipping services |
| Payments | Step Functions + Lambda | Saga pattern: charge → reserve → confirm (with compensating rollback) |

Step 3 β€” Key Trade-Offs to Discuss

  • DynamoDB vs Aurora for orders: Aurora gives SQL joins for complex order queries and strong consistency. DynamoDB Global Tables would give multi-region writes but only eventual consistency — risky for inventory counts.
  • Inventory race conditions: Use Aurora's SELECT FOR UPDATE or DynamoDB conditional writes to prevent overselling. Discuss: at 100K orders/day, hot product contention is real.
  • Region failover: Route 53 health checks detect region failure (30-60s). Aurora Global Database promotes secondary to primary (~1 min). Cart data in Redis Global Datastore is already replicated.
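The conditional-write guard against overselling can be sketched as an atomic check-and-set. Here a dict plus a lock stands in for DynamoDB's per-item atomicity, where the same idea is an UpdateItem with a ConditionExpression such as "stock >= :qty":

```python
# Sketch of the check-and-set that prevents overselling: decrement stock
# only if enough remains, atomically. A dict plus a lock stands in for
# DynamoDB; names and SKUs are illustrative.
import threading

inventory = {"sku-123": 2}
_lock = threading.Lock()   # stands in for DynamoDB's per-item atomicity

def reserve(sku, qty):
    with _lock:
        if inventory.get(sku, 0) >= qty:   # the condition
            inventory[sku] -= qty          # the update
            return True
        return False                       # ConditionalCheckFailed

print(reserve("sku-123", 1))  # True
print(reserve("sku-123", 1))  # True
print(reserve("sku-123", 1))  # False: stock is 0, third buyer rejected
```

The key point for the interview: the check and the decrement happen as one atomic operation, so two concurrent buyers can never both pass the check on the last unit.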

🎯 Key Takeaway

Interview tip: "I'd design this with a microservices architecture on EKS, using DynamoDB Global Tables for the product catalog (eventual consistency is fine for reads) and Aurora Global Database for orders/inventory (strong consistency is critical for financial data). The checkout flow uses Step Functions with a Saga pattern so if payment fails after inventory is reserved, compensating transactions release it automatically."

Advanced

System Design: Real-Time Analytics Pipeline

Prompt: "Design a real-time analytics system that processes 1 million events/second from a mobile app, computes trending content every 5 minutes, and powers a live dashboard."

Step 1 β€” Requirements

  • Throughput: 1M events/second peak (~86B events/day)
  • Latency: Dashboard updates within 30 seconds of event generation
  • Analytics: Trending content (5-min window), user engagement metrics, real-time anomaly detection
  • Retention: Hot data (7 days real-time), warm (90 days queryable), cold (archive)

Step 2 β€” Architecture Layers

| Layer | Service | Rationale |
| --- | --- | --- |
| Ingestion | Kinesis Data Streams (1,000+ shards) | Ordered within partition; at 1,000 records/s per shard, 1M events/s needs on the order of 1,000 shards; 7-day retention |
| Stream Processing | Managed Apache Flink | Sliding window aggregations (5-min trending), CEP for anomaly detection |
| Real-Time Store | ElastiCache Redis (Sorted Sets) | Trending leaderboard with ZADD/ZRANGE, sub-ms reads for dashboard |
| Batch Landing | Kinesis Data Firehose → S3 (Parquet) | Auto-batching, compression, format conversion for analytics |
| Batch Analytics | Athena + Glue Catalog | SQL on S3 for ad-hoc historical queries, $5/TB scanned |
| Dashboard API | API Gateway + Lambda | Reads from Redis (real-time) and Athena (historical) |
| Visualization | Amazon Managed Grafana | Real-time dashboards with auto-refresh, alerting |
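The Redis Sorted Set trending pattern in the table boils down to increment-and-rank. A sketch with a plain dict emulating ZINCRBY and a top-N read (illustrative only, not the redis client API):

```python
# Sketch of the Sorted Set trending pattern: ZINCRBY-style counting per
# item and a top-N read for the dashboard. A plain dict emulates the
# sorted set; in Redis these are single server-side commands.

scores = {}

def zincrby(member, amount=1):
    scores[member] = scores.get(member, 0) + amount

def top_n(n):
    """Highest-scored members first, like a reverse-order ZRANGE."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

for view in ["video-a", "video-b", "video-a", "video-c", "video-a", "video-b"]:
    zincrby(view)

print(top_n(2))  # ['video-a', 'video-b']
```

In the pipeline, Flink's windowed aggregation would emit these increments every window, and the dashboard API would read the top-N in sub-millisecond time.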

Step 3 β€” Key Design Decisions

  • Kinesis vs MSK: Kinesis for tight AWS integration and serverless consumer (Flink). MSK if team has Kafka expertise or needs Kafka Connect ecosystem.
  • Lambda vs Flink for processing: Lambda for simple event-by-event transforms. Flink for windowed aggregations, joins between streams, and complex event processing. At 1M events/s, Flink is the right choice.
  • Hot shard prevention: Partition by user_id (high cardinality). If a viral event causes skew, use random partition key suffix with application-level aggregation.
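The hot-shard mitigation described above (random partition key suffix plus read-side aggregation) can be sketched as follows; the suffix count and key format are illustrative:

```python
# Sketch of hot-partition mitigation by random key suffixing: writes for
# one viral key spread across N suffixed shards, and the read side merges
# the shards back together.
import random
from collections import Counter

N_SUFFIXES = 4

def shard_key(user_id):
    """Write path: append a random suffix so one hot key spans N shards."""
    return f"{user_id}#{random.randrange(N_SUFFIXES)}"

def aggregate(counts, user_id):
    """Read path: merge all suffixed shards for the logical key."""
    return sum(v for k, v in counts.items() if k.split("#")[0] == user_id)

counts = Counter(shard_key("viral-user") for _ in range(1000))
print(len(counts) <= N_SUFFIXES)        # True: load is spread
print(aggregate(counts, "viral-user"))  # 1000: no events lost
```

The trade-off to call out: writes get cheaper (no hot shard), but every read now touches N shards, so this only pays off for genuinely skewed keys.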

🎯 Key Takeaway

Interview tip: "I'd build a Lambda architecture with two paths: a speed layer (Kinesis β†’ Flink β†’ Redis for real-time trending) and a batch layer (Kinesis β†’ Firehose β†’ S3 β†’ Athena for historical queries). Flink's sliding window aggregation computes trending content every 5 minutes, writing results to Redis Sorted Sets that the dashboard API reads with sub-millisecond latency."

Advanced

System Design: Global Chat Application

Prompt: "Design a real-time chat system like Slack or WhatsApp for 10M users with group chats, message delivery guarantees, and read receipts."

Step 1 β€” Requirements

| Requirement | Decision |
| --- | --- |
| Users | 10M registered, 1M concurrent connections |
| Message delivery | At-least-once with deduplication, ordered per conversation |
| Latency | <100ms for message delivery to online users |
| Features | 1:1 chat, group chat (up to 500 members), read receipts, typing indicators, file sharing |
| Storage | Messages stored forever, searchable |

Step 2 β€” Architecture

| Component | Service | Rationale |
| --- | --- | --- |
| WebSocket Gateway | NLB + EKS (custom WebSocket server) | 1M persistent connections; NLB for Layer 4 TCP passthrough. API Gateway WebSocket APIs have a default new-connection rate quota (500/s), a tight fit at this scale. |
| Connection Registry | ElastiCache Redis | Maps user_id → server_id for message routing. Sub-ms lookups. |
| Message Queue | SQS FIFO | Ordered delivery per conversation (MessageGroupId = conversation_id). Deduplication built-in. |
| Message Store | DynamoDB | PK = conversation_id, SK = timestamp#message_id. Single-digit ms writes at any scale. |
| Search | OpenSearch | Full-text message search. DynamoDB Streams → Lambda → OpenSearch for indexing. |
| File Storage | S3 + pre-signed URLs | Client uploads directly to S3 (never through the API). CloudFront for image/file delivery. |
| Push Notifications | SNS (Mobile Push) | For offline users. APNs (iOS), FCM (Android). |
| Presence & Typing | ElastiCache Redis Pub/Sub | Ephemeral signals — no need to persist. Redis Pub/Sub broadcasts to subscribed servers. |

Step 3 β€” Message Delivery Flow

  1. User A sends message via WebSocket → WebSocket server receives it
  2. Write to DynamoDB (persist) + publish to SQS FIFO (delivery)
  3. Look up User B's WebSocket server in Redis → route message to that server
  4. If User B is online → deliver via WebSocket. If offline → send push notification via SNS
  5. User B's client sends read receipt → update DynamoDB → notify User A
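Steps 3 and 4 of the flow can be sketched with a dict standing in for the Redis connection registry (all names are illustrative):

```python
# Sketch of registry lookup plus offline fallback: the registry maps
# user_id -> server_id (the Redis role), and delivery falls back to a
# push notification when the recipient has no live connection.

registry = {"user-a": "ws-server-1"}   # user_id -> server_id

def deliver(to_user, message, push_queue):
    server = registry.get(to_user)
    if server:
        return ("websocket", server)        # online: route to that server
    push_queue.append((to_user, message))   # offline: hand off to SNS push
    return ("push", None)

pushes = []
print(deliver("user-a", "hi", pushes))  # ('websocket', 'ws-server-1')
print(deliver("user-b", "hi", pushes))  # ('push', None)
```

In production the registry entries would carry a TTL and be refreshed by heartbeats, so a crashed WebSocket server's users naturally fall back to the push path.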

🎯 Key Takeaway

Interview tip: "The key challenge is routing messages to the right WebSocket server among thousands. I'd use a Redis-based connection registry that maps each online user to their server. For group chats with 500 members, I'd fan-out through SQS FIFO with MessageGroupId per conversation to maintain ordering. Offline users get SNS push notifications. DynamoDB stores all messages with conversation_id as partition key and timestamp as sort key for efficient range queries."

Advanced

System Design: Multi-Tenant SaaS Platform

Prompt: "Design a multi-tenant SaaS platform where enterprise customers demand data isolation, custom domains, and guaranteed performance SLAs."

Tenant Isolation Models

| Model | Isolation | Cost | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Silo | Separate account/VPC per tenant | Highest | High (per-tenant infra) | Regulated enterprises, healthcare, finance |
| Bridge | Shared compute, separate databases | Medium | Medium | Mid-market B2B SaaS |
| Pool | Shared everything, data partitioned by tenant_id | Lowest | Low (but need row-level security) | SMB, high-volume low-margin |

Architecture (Bridge Model — Most Common)

| Component | Service | Tenant Strategy |
| --- | --- | --- |
| DNS | Route 53 | Custom domain per tenant (CNAME → shared ALB) |
| API | ALB + EKS | Shared compute. JWT contains tenant_id. Middleware extracts and scopes all queries. |
| Auth | Cognito (User Pool per tenant) | Separate user pools for data isolation, or a shared pool with a custom:tenant_id attribute. |
| Database | Aurora PostgreSQL | Schema-per-tenant (CREATE SCHEMA tenant_123) with Row Level Security policies as defense-in-depth. |
| Storage | S3 | Prefix-per-tenant (s3://bucket/tenant_123/files/). Bucket policy restricts cross-tenant access. |
| Queues | SQS | tenant_id in message attributes. Consumers filter or route by tenant. |
| Monitoring | CloudWatch | Custom metrics dimensioned by tenant_id. Per-tenant dashboards and alarms. |
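The tenant-scoping middleware idea from the API row can be sketched as follows: tenant_id is taken only from the verified JWT claims, never from the request body, and the schema name is derived server-side (a simplified sketch; real code would verify the JWT signature and quote identifiers safely):

```python
# Sketch of tenant scoping in the API middleware. JWT verification and
# safe SQL identifier quoting are elided; all names are illustrative.

def tenant_schema(jwt_claims):
    """Derive the schema from the tenant_id claim set at token issuance."""
    return f"tenant_{jwt_claims['tenant_id']}"

def scoped_query(jwt_claims, table):
    """Every query is pinned to the caller's schema (schema-per-tenant)."""
    return f"SELECT * FROM {tenant_schema(jwt_claims)}.{table}"

print(scoped_query({"tenant_id": "123"}, "orders"))
# SELECT * FROM tenant_123.orders
```

Because the schema is derived from the token rather than any client-supplied parameter, a tenant cannot point a request at another tenant's schema; PostgreSQL Row Level Security then backstops any bug in this layer.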

Key Design Challenges

  • Noisy neighbor: One tenant's heavy usage impacts others. Solution: per-tenant rate limiting at ALB/API Gateway, resource quotas, and tenant-aware auto-scaling.
  • Data migration: Moving a growing tenant from Pool to Bridge (or Bridge to Silo). Design the data model to support promotion without downtime.
  • Cost attribution: Track per-tenant resource consumption for usage-based billing. Tag all resources, use CloudWatch metrics with tenant dimension, and generate per-tenant invoices from the CUR.

🎯 Key Takeaway

Interview tip: "I'd use the Bridge model β€” shared EKS compute with schema-per-tenant in Aurora PostgreSQL. Every API request extracts tenant_id from the JWT and scopes database queries to that tenant's schema. Row Level Security in PostgreSQL provides defense-in-depth. For enterprise customers who demand full isolation, I'd promote them to the Silo model with a dedicated Aurora cluster and VPC, managed via AWS Organizations."

Advanced

Interview Questions β€” Architecture Patterns

Scenario-based architecture questions that test your ability to think through trade-offs, not just name services.

  1. Answer Guide
    EDA with SNS/SQS fan-out. Order placed → SNS topic → SQS queues for each downstream service. Discuss idempotency, dead-letter queues, and how to handle partial failures.
  2. Answer Guide
    Strangler Fig pattern. Route traffic through an ALB, gradually extracting services behind new paths. Never rewrite from scratch — it's the riskiest approach.
  3. Answer Guide
    Web tier (ALB + CloudFront with geo-restriction), App tier (ECS/EKS in private subnet), Data tier (Aurora with encryption). Discuss SGs between tiers, NACLs, and data residency.
  4. Answer Guide
    You can't have both during a partition. CP example: Aurora (favors consistency, may reject writes during failures). AP example: DynamoDB (eventual consistency, always available).
  5. Answer Guide
    Circuit breaker pattern for failing services, async processing via SQS for non-critical calls, parallel calls where possible, and timeout limits on each call.
  6. Answer Guide
    DynamoDB Global Tables for active-active with eventual consistency, Aurora Global Database for SQL with read replicas per region. Route 53 latency-based routing. Discuss conflict resolution.
  7. Answer Guide
    SQS FIFO for ordering (MessageGroupId), idempotency keys in DynamoDB for deduplication. Discuss the trade-off: FIFO queues have lower throughput (3,000 msg/s with batching).
  8. Answer Guide
    Start with a well-structured monolith. Microservices add operational complexity (service mesh, distributed tracing, CI/CD per service) that a 3-person team can't sustain. Evolve when team and scale justify it.
  9. Answer Guide
    CQRS (Command Query Responsibility Segregation). Write to Aurora/DynamoDB, project into read-optimized views (ElastiCache, OpenSearch). Events synchronize the two sides.
  10. Answer Guide
    Saga pattern with compensating transactions. Each service publishes events — if step 3 fails, trigger compensating actions for steps 1 and 2. Orchestrated (Step Functions) vs choreographed (events).
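The saga pattern from answer 10 can be sketched as an orchestrator that runs compensations in reverse order when a step fails (a toy model of what Step Functions expresses with Catch transitions to compensation states; names are illustrative):

```python
# Sketch of an orchestrated saga: run steps in order; on failure, run the
# compensations of completed steps in reverse order.

def run_saga(steps):
    """steps: list of (action, compensation) callables."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for comp in reversed(completed):   # roll back in reverse order
                comp()
            return "rolled_back"
    return "committed"

log = []
def fail_shipping():
    raise RuntimeError("ship failed")

steps = [
    (lambda: log.append("charge"),  lambda: log.append("refund")),
    (lambda: log.append("reserve"), lambda: log.append("release")),
    (fail_shipping, lambda: None),
]
print(run_saga(steps))  # rolled_back
print(log)              # ['charge', 'reserve', 'release', 'refund']
```

Note the ordering: the reservation is released before the charge is refunded, mirroring how compensations must undo work in the reverse of the order it was done.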

Preparation Strategy

Architecture interviews test your trade-off reasoning, not your ability to name services. For every design, articulate: "This approach gives us [benefit] at the cost of [trade-off]. I chose it because [constraint makes this trade-off acceptable]."