Gen AI Interview Guide

Layers, prompt engineering, RAG, fine-tuning, and multi-model invocation.

Beginner

Gen AI Layers

Generative AI systems are built in three layers:

  1. Infrastructure Layer — The compute foundation that trains and runs models. Includes GPU/accelerator instances (e.g., EC2 P5, Trainium, Inferentia), storage (S3 for training data), and networking. AWS services: Amazon SageMaker, EC2 GPU instances, AWS Trainium chips.
  2. Model Layer — The foundation models (FMs) themselves. These are large pre-trained models (LLMs) like Anthropic Claude, Meta Llama, Amazon Titan, Stability AI. Access via Amazon Bedrock (managed, serverless access to multiple FMs) or self-host on SageMaker.
  3. Application Layer — The end-user-facing applications built on top of models. Includes prompt engineering, RAG pipelines, agents, chatbots, and code generators. AWS services: Amazon Q (AI assistant), PartyRock (no-code AI app builder), Bedrock Agents.

Bedrock vs SageMaker β€” When to Use Which

| Aspect | Amazon Bedrock | Amazon SageMaker |
|---|---|---|
| Model Type | Pre-built FMs (Claude, Llama, Titan) | Custom models or fine-tuned FMs |
| Infrastructure | Fully serverless — no instances | You manage instances (ml.p4d, etc.) |
| Customization | Prompt engineering + RAG + light fine-tuning | Full training, custom architectures |
| Pricing | Pay per token (input/output) | Pay per instance hour |
| Best For | Application teams consuming AI | ML teams building/training models |
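Consuming a Bedrock FM can be sketched in a few lines of boto3 against the `bedrock-runtime` client. The model ID and region below are example values; the actual call needs AWS credentials and model access granted in the Bedrock console, so `boto3` is imported lazily and only the request-building helper runs without AWS:

```python
import json

# Example model ID -- check the Bedrock console for IDs enabled in your account/region.
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def build_claude_body(prompt: str, max_tokens: int = 512) -> str:
    """Build the Anthropic Messages-format request body that Bedrock expects."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

def invoke(prompt: str) -> str:
    """Call Bedrock's InvokeModel API (requires AWS credentials and model access)."""
    import boto3  # imported lazily so the rest of the module works without it
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = client.invoke_model(modelId=MODEL_ID, body=build_claude_body(prompt))
    return json.loads(response["body"].read())["content"][0]["text"]
```

Switching providers is just a different `modelId` (and request body format) — no infrastructure changes, which is the core of the Bedrock-vs-SageMaker trade-off above.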

🎯 Key Takeaway

Interview tip: Most companies operate at the Application Layer — consuming pre-trained models via APIs (Bedrock) rather than training their own. For SA interviews, know Bedrock (serverless FM access), Knowledge Bases (managed RAG), and Agents (autonomous task execution). Say: "Start with Bedrock for speed and cost. Move to SageMaker only when you need custom model training or specialized ML pipelines."

Intermediate

Prompt Engineering vs RAG vs Fine-Tuning

Three techniques to customize LLM behavior, each with different trade-offs:

| Technique | How It Works | Cost | Latency Impact | Best For |
|---|---|---|---|---|
| Prompt Engineering | Craft instructions, examples, and context in the prompt. No model changes. | Lowest — API calls only | None — same as base model | Quick customization, prototyping, general tasks |
| RAG | Retrieve relevant docs from a vector DB, inject as context before generation. | Medium — vector DB + embedding costs | +100-500 ms (retrieval step) | Domain Q&A, up-to-date knowledge, reducing hallucinations |
| Fine-Tuning | Retrain the model on a custom dataset to permanently change behavior. | Highest — GPU training compute | None (baked into model) | Specialized tone, domain terminology, structured output |

Decision Framework

  • Start here → Prompt Engineering. Zero cost, instant iteration. Use few-shot prompting, chain-of-thought, and system instructions.
  • Model lacks your data? → Add RAG. Connect S3 docs to Bedrock Knowledge Bases. The model stays general but gets domain context per query.
  • Need consistent specialized behavior? → Fine-tune. The model internalizes your style/terminology. Can combine with RAG for best results.

Advanced Prompt Techniques (Know These for Interviews)

  • Few-Shot Prompting — Include 2-5 examples of input → output in the prompt
  • Chain-of-Thought (CoT) — Ask the model to "think step by step" for reasoning tasks
  • System Instructions — Set persona, constraints, and output format in the system message
  • Temperature Control — 0.0 for deterministic (facts), 0.7-1.0 for creative tasks
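Three of these techniques combine naturally in one request. The sketch below builds a few-shot classification prompt in the shape Bedrock's Converse API expects (user/assistant turns for the examples, a system instruction, temperature pinned to 0.0); the support-ticket task and its examples are invented for illustration:

```python
def build_few_shot_messages(examples, query):
    """Interleave example input/output pairs as user/assistant turns, then the real query."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": [{"text": user_text}]})
        messages.append({"role": "assistant", "content": [{"text": assistant_text}]})
    messages.append({"role": "user", "content": [{"text": query}]})
    return messages

# System instruction sets persona + output format; examples provide the few-shot signal.
system = [{"text": "You are a support ticket classifier. Reply with one word: BILLING, TECH, or OTHER."}]
examples = [("I was charged twice this month", "BILLING"),
            ("The app crashes on login", "TECH")]
messages = build_few_shot_messages(examples, "Please cancel my subscription")

# Temperature 0.0 keeps classification deterministic; small maxTokens caps cost.
inference_config = {"temperature": 0.0, "maxTokens": 10}
# client.converse(modelId=..., system=system, messages=messages, inferenceConfig=inference_config)
```

The actual `converse` call is left commented since it needs AWS credentials; the message-construction logic is the part worth knowing cold for interviews.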

🎯 Key Takeaway

Interview tip: The #1 mistake is jumping to fine-tuning. Say: "I always start with prompt engineering, then add RAG if the model needs domain-specific knowledge. Fine-tuning is the last resort — it's expensive, requires labeled data, and the model can't easily be updated."

Intermediate

RAG with Bedrock

Retrieval-Augmented Generation (RAG) with Amazon Bedrock lets you ground LLM responses in your own data without fine-tuning:

Architecture Flow (5 Steps)

  1. Ingest Documents — Upload your knowledge base (PDFs, web pages, internal docs) to Amazon S3. Common formats (PDF, HTML, Markdown, Word, CSV, plain text) are ingested automatically.
  2. Chunk & Embed — Bedrock Knowledge Bases automatically chunks documents (configurable chunk size; 100-300 tokens is a common starting point) and converts them into vector embeddings using an embedding model (Amazon Titan Embeddings v2 or Cohere Embed).
  3. Store in Vector DB — Embeddings are stored in a vector database.
  4. User Query — When a user asks a question, the query is embedded and a similarity search (k-NN) retrieves the most relevant document chunks from the vector DB.
  5. Augmented Prompt — The retrieved chunks are injected into the prompt as context, and the foundation model generates a grounded, cited response.
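Steps 4 and 5 collapse into a single managed call with Bedrock's `RetrieveAndGenerate` API. A minimal sketch, assuming an existing knowledge base — `kb_id` and `model_arn` are placeholders you'd supply, and `boto3` is imported lazily since the call needs AWS credentials:

```python
def build_rag_request(question: str, kb_id: str, model_arn: str) -> dict:
    """Request shape for Bedrock's RetrieveAndGenerate API (runs retrieval + generation)."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,   # your Knowledge Base ID
                "modelArn": model_arn,      # ARN of the generator FM
            },
        },
    }

def ask(question: str, kb_id: str, model_arn: str) -> str:
    import boto3  # lazy import: requires AWS credentials and an existing knowledge base
    client = boto3.client("bedrock-agent-runtime")
    resp = client.retrieve_and_generate(**build_rag_request(question, kb_id, model_arn))
    return resp["output"]["text"]  # resp["citations"] carries the source attributions
```

One design note: the response's `citations` field is what powers the enterprise-trust point below — surface it in your UI rather than discarding it.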

Vector Database Options (Know These for Interviews)

| Vector DB | Type | Best For | Max Vectors |
|---|---|---|---|
| OpenSearch Serverless | AWS-managed, serverless | Default choice for Bedrock KB — zero ops | Billions |
| Aurora PostgreSQL (pgvector) | SQL DB with vector extension | When you already use Aurora — combine relational + vector | Millions |
| Pinecone | 3rd-party SaaS | Multi-cloud, high-performance similarity search | Billions |
| Redis Enterprise | In-memory | Ultra-low-latency retrieval (<1 ms) | Millions |

RAG Best Practices

  • Chunk size matters — Too small = loses context. Too large = adds noise. Start with 200-300 tokens with 10-20% overlap.
  • Metadata filtering — Tag documents with metadata (department, date, type) and filter at query time for precision.
  • Hybrid search — Combine vector similarity + keyword search (BM25) for best recall. OpenSearch Serverless supports both.
  • Guardrails — Use Bedrock Guardrails to filter harmful content, PII, and enforce topic boundaries in responses.
  • Citation tracking — Bedrock Knowledge Bases return source citations automatically — critical for enterprise trust.
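The metadata-filtering practice above translates to a small retrieval configuration on the Knowledge Base `Retrieve` call. The `department` key is a hypothetical attribute — it assumes you attached a matching `.metadata.json` sidecar file to each S3 document at ingestion time:

```python
# Restrict the vector search to one department's documents and cap results at 5.
# The filter key/value are illustrative; they must match metadata you ingested.
retrieval_config = {
    "vectorSearchConfiguration": {
        "numberOfResults": 5,
        "filter": {"equals": {"key": "department", "value": "legal"}},
    }
}
# client.retrieve(knowledgeBaseId=..., retrievalQuery={"text": query},
#                 retrievalConfiguration=retrieval_config)
```

Filters compose (`andAll`, `orAll` alongside `equals`), so date ranges or document types can be layered on the same query.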

🎯 Key Takeaway

Interview tip: RAG is THE most common GenAI architecture pattern. Say: "I'd use Bedrock Knowledge Bases with S3 as the data source and OpenSearch Serverless as the vector store. The key design decisions are chunk size (200-300 tokens), embedding model selection, and whether to add metadata filtering for precision."

Advanced

Gen AI Multi-Model Invocation

Multi-model invocation is the practice of using multiple foundation models in a single application to optimize for cost, latency, accuracy, and reliability:

4 Key Strategies

| Strategy | How It Works | Benefit | Example |
|---|---|---|---|
| Model Routing | Route different request types to different models based on complexity | 50-80% cost reduction | Simple Q&A → Titan Lite ($0.15/M tokens); complex reasoning → Claude Sonnet ($3/M tokens) |
| Model Fallback | If the primary model is unavailable or rate-limited, automatically switch to a secondary | 99.99% availability | Claude → Llama 3 → Titan as fallback chain |
| Ensemble / Consensus | Send the same prompt to multiple models, aggregate responses (voting, best-of-N) | Higher accuracy for critical decisions | Medical diagnosis: 3 models vote, majority wins |
| Chain of Models | Pipeline where the output of one model feeds into another | Specialized processing per step | Model A extracts entities → Model B generates summary from entities |
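For classification-style outputs, the ensemble strategy reduces to a majority vote over model responses. A minimal sketch — the fraud-detection prompt is an invented example:

```python
from collections import Counter

def majority_vote(responses):
    """Aggregate short classification answers from several models.

    Answers are normalized (case/whitespace) before counting; on a tie,
    the first answer seen wins (Counter preserves insertion order).
    """
    counts = Counter(r.strip().upper() for r in responses)
    return counts.most_common(1)[0][0]

# e.g. three models asked: "Is this transaction fraudulent? Answer YES or NO."
assert majority_vote(["YES", "no", "Yes"]) == "YES"
```

In practice you'd fan the prompt out to the models in parallel (e.g. with Step Functions or `concurrent.futures`) and feed their answers into this aggregator.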

AWS Implementation Architecture

  • Amazon Bedrock — Single API endpoint gives access to Claude, Llama, Titan, Cohere, Mistral. Switch models by changing the modelId parameter.
  • Step Functions — Orchestrate multi-model workflows with error handling, retries, and parallel execution.
  • Bedrock Agents — Autonomous agents that can decide which model to call, invoke APIs, and query knowledge bases.
  • Lambda + API Gateway — Build a routing layer that classifies request complexity and routes to the optimal model.
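The routing layer plus fallback chain could be sketched as below. The keyword heuristic and model IDs are illustrative placeholders — production routers typically use a small classifier model rather than substring matching:

```python
# Example model IDs -- substitute the ones enabled in your account.
CHEAP_MODEL = "amazon.titan-text-lite-v1"
STRONG_MODEL = "anthropic.claude-3-sonnet-20240229-v1:0"
FALLBACK_CHAIN = [STRONG_MODEL, "meta.llama3-70b-instruct-v1:0", CHEAP_MODEL]

# Crude complexity signals; a real router would use a trained classifier.
COMPLEX_HINTS = ("why", "explain", "compare", "analyze", "design")

def choose_model(prompt: str) -> str:
    """Route short, simple prompts to the cheap model; reasoning-heavy ones to the strong model."""
    text = prompt.lower()
    if len(prompt) > 400 or any(hint in text for hint in COMPLEX_HINTS):
        return STRONG_MODEL
    return CHEAP_MODEL

def invoke_with_fallback(invoke_fn, prompt: str) -> str:
    """Try each model in the chain until one succeeds (invoke_fn wraps the Bedrock call)."""
    last_error = None
    for model_id in FALLBACK_CHAIN:
        try:
            return invoke_fn(model_id, prompt)
        except Exception as err:  # e.g. throttling or service-unavailable errors
            last_error = err
    raise last_error
```

`invoke_fn` would wrap the Bedrock call from a Lambda; Step Functions can supply the same retry/fallback behavior declaratively if you prefer orchestration over application code.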

Cost Optimization with Model Routing

A typical enterprise pattern: route 80% of requests (simple) to a cheap model and 20% (complex) to a powerful model. This reduces costs by 60-70% vs. using the expensive model for everything.
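Sanity-checking that arithmetic with the example prices quoted in the strategies table (actual savings depend on your traffic split and the models' real prices):

```python
# Blended cost per million tokens under an 80/20 routing split.
CHEAP, EXPENSIVE = 0.15, 3.00   # $ per million tokens (example prices from the table)
SIMPLE_SHARE = 0.80             # fraction of traffic routed to the cheap model

blended = SIMPLE_SHARE * CHEAP + (1 - SIMPLE_SHARE) * EXPENSIVE
savings = 1 - blended / EXPENSIVE

print(f"Blended: ${blended:.2f}/M tokens; savings vs. all-expensive: {savings:.0%}")
# With these example prices the split yields $0.72/M tokens, a 76% reduction --
# within the table's 50-80% range; less aggressive splits land in the 60-70% band.
```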

🎯 Key Takeaway

Interview tip: This shows cost-awareness and production maturity. Say: "I'd use model routing with a classifier Lambda in front of Bedrock. Simple requests go to Titan Lite for cost savings, complex reasoning goes to Claude Sonnet. Step Functions handles the orchestration with built-in error handling and fallback to alternative models."

Advanced

Interview Questions — Gen AI

Gen AI is the hottest interview topic in 2025-2026. These questions test your ability to architect AI-powered applications, not just use APIs.

  1. Answer Guide
    RAG — the documents change frequently (new cases, updated contracts), fine-tuning would require constant retraining. RAG retrieves relevant document chunks at query time. Prompt engineering alone can't inject 50K documents into context. Fine-tuning is for changing model behavior/style, not adding knowledge.
  2. Answer Guide
    Ingestion: S3 → Bedrock Knowledge Base (auto-chunks, embeds, indexes). Query: user prompt → embedding → vector search → top-K chunks → LLM with context. Vector DB options: OpenSearch Serverless (managed, scales), Aurora pgvector (existing PostgreSQL), Pinecone (purpose-built). Choice depends on existing infrastructure and scale.
  3. Answer Guide
    1) Better chunking strategy (smaller chunks, overlap, semantic chunking). 2) Metadata filtering (restrict search to relevant document categories). 3) Bedrock Guardrails (ground truth checks, citation requirements). Also: hybrid search (keyword + semantic), re-ranking retrieved chunks, and prompt engineering to instruct "only answer from provided context."
  4. Answer Guide
    Bedrock — fully managed, API-based, no infrastructure. Perfect for a small team that wants to focus on the application, not model hosting. SageMaker for: custom model training, fine-tuning at scale, full control over inference infrastructure, teams with deep ML expertise. Bedrock for application builders, SageMaker for model builders.
  5. Answer Guide
    Bedrock Guardrails: content filtering (block harmful content), PII detection/redaction, topic denial (prevent off-topic responses), word filters, grounding checks. Additional: input/output logging to S3 (audit trail), VPC endpoints for Bedrock (no public internet), IAM for model access control. Risks: hallucinations, data leakage, prompt injection.
  6. Answer Guide
    Bedrock Agents with Action Groups. Define tools (Lambda functions for each action: book_meeting, query_db, send_email). Agent uses ReAct pattern: reason about user intent → select tool → execute → return result. Include: OpenAPI schema for each action, session management for multi-turn conversations, guardrails to limit which actions are allowed.
  7. Answer Guide
    1) Prompt caching (cache common prefixes). 2) Model routing — use a smaller/cheaper model (Haiku) for simple queries, Sonnet for complex ones. 3) Response caching (cache identical queries in ElastiCache). 4) Prompt optimization (shorter prompts = fewer tokens). 5) Batch API for non-real-time workloads (50% cheaper).
  8. Answer Guide
    Embedding models convert text to vectors (for search/retrieval). Generation models produce text (answers). RAG needs embeddings to find relevant chunks, then a generator to compose the answer. Poor embeddings → wrong chunks retrieved → irrelevant or wrong answers regardless of how good the generator is. Embedding quality is the foundation of RAG accuracy.
  9. Answer Guide
    Metrics: relevance (does the answer address the question?), groundedness (is it supported by source documents?), coherence, completeness, toxicity. Approaches: human evaluation (gold standard), LLM-as-judge (use a model to evaluate another model), RAGAS framework for RAG-specific metrics. A/B testing in production with user feedback.
  10. Answer Guide
    Router Lambda classifies intent → routes to the appropriate model (Claude Haiku for FAQ, Claude Sonnet for complex reasoning and code generation). Step Functions orchestrates with fallback: primary model timeout → retry → fallback model. Cost optimization: cheapest model that meets the quality threshold. Bedrock's single API makes custom routing logic straightforward — switch models by changing the modelId.

Preparation Strategy

Gen AI interviews in 2025-2026 focus on architectural decisions, not model theory. Know when to use RAG vs fine-tuning, understand the cost implications of model selection, and be ready to discuss guardrails and responsible AI. Show that you can build production-grade AI applications, not just prototype demos.