Gen AI Interview Guide

Layers, prompt engineering, RAG, fine-tuning, and multi-model invocation.

Beginner

Gen AI Layers

Generative AI systems are built in three layers:

  1. Infrastructure Layer — The compute foundation that trains and runs models. Includes GPU/accelerator instances (e.g., EC2 P5, Trainium, Inferentia), storage (S3 for training data), and networking. AWS services: Amazon SageMaker, EC2 GPU instances, AWS Trainium chips.
  2. Model Layer — The foundation models (FMs) themselves. These are large pre-trained models (LLMs) like Anthropic Claude, Meta Llama, Amazon Titan, Stability AI. Access via Amazon Bedrock (managed, serverless access to multiple FMs) or self-host on SageMaker.
  3. Application Layer — The end-user-facing applications built on top of models. Includes prompt engineering, RAG pipelines, agents, chatbots, and code generators. AWS services: Amazon Q (AI assistant), PartyRock (no-code AI app builder), Bedrock Agents.

Bedrock vs SageMaker β€” When to Use Which

| Aspect | Amazon Bedrock | Amazon SageMaker |
|---|---|---|
| Model Type | Pre-built FMs (Claude, Llama, Titan) | Custom models or fine-tuned FMs |
| Infrastructure | Fully serverless — no instances | You manage instances (ml.p4d, etc.) |
| Customization | Prompt engineering + RAG + light fine-tuning | Full training, custom architectures |
| Pricing | Pay per token (input/output) | Pay per instance hour |
| Best For | Application teams consuming AI | ML teams building/training models |
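Consuming a Bedrock FM can be sketched in a few lines of boto3 against the `bedrock-runtime` client. The model ID and region below are example values; the actual call needs AWS credentials and model access granted in the Bedrock console, so `boto3` is imported lazily and only the request-building helper runs without AWS:

```python
import json

# Example model ID -- check the Bedrock console for IDs enabled in your account/region.
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def build_claude_body(prompt: str, max_tokens: int = 512) -> str:
    """Build the Anthropic Messages-format request body that Bedrock expects."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

def invoke(prompt: str) -> str:
    """Call Bedrock's InvokeModel API (requires AWS credentials and model access)."""
    import boto3  # imported lazily so the rest of the module works without it
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = client.invoke_model(modelId=MODEL_ID, body=build_claude_body(prompt))
    return json.loads(response["body"].read())["content"][0]["text"]
```

Switching providers is just a different `modelId` (and request body format) — no infrastructure changes, which is the core of the Bedrock-vs-SageMaker trade-off above.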

🎯 Key Takeaway

Interview tip: Most companies operate at the Application Layer — consuming pre-trained models via APIs (Bedrock) rather than training their own. For SA interviews, know Bedrock (serverless FM access), Knowledge Bases (managed RAG), and Agents (autonomous task execution). Say: "Start with Bedrock for speed and cost. Move to SageMaker only when you need custom model training or specialized ML pipelines."

Intermediate

Prompt Engineering vs RAG vs Fine-Tuning

Three techniques to customize LLM behavior, each with different trade-offs:

| Technique | How It Works | Cost | Latency Impact | Best For |
|---|---|---|---|---|
| Prompt Engineering | Craft instructions, examples, and context in the prompt. No model changes. | Lowest — API calls only | None — same as base model | Quick customization, prototyping, general tasks |
| RAG | Retrieve relevant docs from a vector DB, inject as context before generation. | Medium — vector DB + embedding costs | +100-500 ms (retrieval step) | Domain Q&A, up-to-date knowledge, reducing hallucinations |
| Fine-Tuning | Retrain the model on a custom dataset to permanently change behavior. | Highest — GPU training compute | None (baked into model) | Specialized tone, domain terminology, structured output |

Decision Framework

  • Start here → Prompt Engineering. Zero cost, instant iteration. Use few-shot prompting, chain-of-thought, and system instructions.
  • Model lacks your data? → Add RAG. Connect S3 docs to Bedrock Knowledge Bases. The model stays general but gets domain context per query.
  • Need consistent specialized behavior? → Fine-tune. The model internalizes your style/terminology. Can combine with RAG for best results.

Advanced Prompt Techniques (Know These for Interviews)

  • Few-Shot Prompting — Include 2-5 examples of input → output in the prompt
  • Chain-of-Thought (CoT) — Ask the model to "think step by step" for reasoning tasks
  • System Instructions — Set persona, constraints, and output format in the system message
  • Temperature Control — 0.0 for deterministic (facts), 0.7-1.0 for creative tasks
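Three of these techniques combine naturally in one request. The sketch below builds a few-shot classification prompt in the shape Bedrock's Converse API expects (user/assistant turns for the examples, a system instruction, temperature pinned to 0.0); the support-ticket task and its examples are invented for illustration:

```python
def build_few_shot_messages(examples, query):
    """Interleave example input/output pairs as user/assistant turns, then the real query."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": [{"text": user_text}]})
        messages.append({"role": "assistant", "content": [{"text": assistant_text}]})
    messages.append({"role": "user", "content": [{"text": query}]})
    return messages

# System instruction sets persona + output format; examples provide the few-shot signal.
system = [{"text": "You are a support ticket classifier. Reply with one word: BILLING, TECH, or OTHER."}]
examples = [("I was charged twice this month", "BILLING"),
            ("The app crashes on login", "TECH")]
messages = build_few_shot_messages(examples, "Please cancel my subscription")

# Temperature 0.0 keeps classification deterministic; small maxTokens caps cost.
inference_config = {"temperature": 0.0, "maxTokens": 10}
# client.converse(modelId=..., system=system, messages=messages, inferenceConfig=inference_config)
```

The actual `converse` call is left commented since it needs AWS credentials; the message-construction logic is the part worth knowing cold for interviews.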

🎯 Key Takeaway

Interview tip: The #1 mistake is jumping to fine-tuning. Say: "I always start with prompt engineering, then add RAG if the model needs domain-specific knowledge. Fine-tuning is the last resort — it's expensive, requires labeled data, and the model can't easily be updated."

Intermediate

RAG with Bedrock

Retrieval-Augmented Generation (RAG) with Amazon Bedrock lets you ground LLM responses in your own data without fine-tuning:

Architecture Flow (5 Steps)

  1. Ingest Documents — Upload your knowledge base (PDFs, web pages, internal docs) to Amazon S3. Common formats (PDF, HTML, Markdown, Word, CSV, plain text) are ingested automatically.
  2. Chunk & Embed — Bedrock Knowledge Bases automatically chunks documents (configurable chunk size; 100-300 tokens is a common starting point) and converts them into vector embeddings using an embedding model (Amazon Titan Embeddings v2 or Cohere Embed).
  3. Store in Vector DB — Embeddings are stored in a vector database.
  4. User Query — When a user asks a question, the query is embedded and a similarity search (k-NN) retrieves the most relevant document chunks from the vector DB.
  5. Augmented Prompt — The retrieved chunks are injected into the prompt as context, and the foundation model generates a grounded, cited response.
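Steps 4 and 5 collapse into a single managed call with Bedrock's `RetrieveAndGenerate` API. A minimal sketch, assuming an existing knowledge base — `kb_id` and `model_arn` are placeholders you'd supply, and `boto3` is imported lazily since the call needs AWS credentials:

```python
def build_rag_request(question: str, kb_id: str, model_arn: str) -> dict:
    """Request shape for Bedrock's RetrieveAndGenerate API (runs retrieval + generation)."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,   # your Knowledge Base ID
                "modelArn": model_arn,      # ARN of the generator FM
            },
        },
    }

def ask(question: str, kb_id: str, model_arn: str) -> str:
    import boto3  # lazy import: requires AWS credentials and an existing knowledge base
    client = boto3.client("bedrock-agent-runtime")
    resp = client.retrieve_and_generate(**build_rag_request(question, kb_id, model_arn))
    return resp["output"]["text"]  # resp["citations"] carries the source attributions
```

One design note: the response's `citations` field is what powers the enterprise-trust point below — surface it in your UI rather than discarding it.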

Vector Database Options (Know These for Interviews)

| Vector DB | Type | Best For | Max Vectors |
|---|---|---|---|
| OpenSearch Serverless | AWS-managed, serverless | Default choice for Bedrock KB — zero ops | Billions |
| Aurora PostgreSQL (pgvector) | SQL DB with vector extension | When you already use Aurora — combine relational + vector | Millions |
| Pinecone | 3rd-party SaaS | Multi-cloud, high-performance similarity search | Billions |
| Redis Enterprise | In-memory | Ultra-low-latency retrieval (<1 ms) | Millions |

RAG Best Practices

  • Chunk size matters — Too small = loses context. Too large = adds noise. Start with 200-300 tokens with 10-20% overlap.
  • Metadata filtering — Tag documents with metadata (department, date, type) and filter at query time for precision.
  • Hybrid search — Combine vector similarity + keyword search (BM25) for best recall. OpenSearch Serverless supports both.
  • Guardrails — Use Bedrock Guardrails to filter harmful content, PII, and enforce topic boundaries in responses.
  • Citation tracking — Bedrock Knowledge Bases return source citations automatically — critical for enterprise trust.
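The metadata-filtering practice above translates to a small retrieval configuration on the Knowledge Base `Retrieve` call. The `department` key is a hypothetical attribute — it assumes you attached a matching `.metadata.json` sidecar file to each S3 document at ingestion time:

```python
# Restrict the vector search to one department's documents and cap results at 5.
# The filter key/value are illustrative; they must match metadata you ingested.
retrieval_config = {
    "vectorSearchConfiguration": {
        "numberOfResults": 5,
        "filter": {"equals": {"key": "department", "value": "legal"}},
    }
}
# client.retrieve(knowledgeBaseId=..., retrievalQuery={"text": query},
#                 retrievalConfiguration=retrieval_config)
```

Filters compose (`andAll`, `orAll` alongside `equals`), so date ranges or document types can be layered on the same query.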

🎯 Key Takeaway

Interview tip: RAG is THE most common GenAI architecture pattern. Say: "I'd use Bedrock Knowledge Bases with S3 as the data source and OpenSearch Serverless as the vector store. The key design decisions are chunk size (200-300 tokens), embedding model selection, and whether to add metadata filtering for precision."

Advanced

Gen AI Multi-Model Invocation

Multi-model invocation is the practice of using multiple foundation models in a single application to optimize for cost, latency, accuracy, and reliability:

4 Key Strategies

| Strategy | How It Works | Benefit | Example |
|---|---|---|---|
| Model Routing | Route different request types to different models based on complexity | 50-80% cost reduction | Simple Q&A → Titan Lite ($0.15/M tokens); complex reasoning → Claude Sonnet ($3/M tokens) |
| Model Fallback | If the primary model is unavailable or rate-limited, automatically switch to a secondary | 99.99% availability | Claude → Llama 3 → Titan as fallback chain |
| Ensemble / Consensus | Send the same prompt to multiple models, aggregate responses (voting, best-of-N) | Higher accuracy for critical decisions | Medical diagnosis: 3 models vote, majority wins |
| Chain of Models | Pipeline where the output of one model feeds into another | Specialized processing per step | Model A extracts entities → Model B generates summary from entities |
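For classification-style outputs, the ensemble strategy reduces to a majority vote over model responses. A minimal sketch — the fraud-detection prompt is an invented example:

```python
from collections import Counter

def majority_vote(responses):
    """Aggregate short classification answers from several models.

    Answers are normalized (case/whitespace) before counting; on a tie,
    the first answer seen wins (Counter preserves insertion order).
    """
    counts = Counter(r.strip().upper() for r in responses)
    return counts.most_common(1)[0][0]

# e.g. three models asked: "Is this transaction fraudulent? Answer YES or NO."
assert majority_vote(["YES", "no", "Yes"]) == "YES"
```

In practice you'd fan the prompt out to the models in parallel (e.g. with Step Functions or `concurrent.futures`) and feed their answers into this aggregator.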

AWS Implementation Architecture

  • Amazon Bedrock — Single API endpoint gives access to Claude, Llama, Titan, Cohere, Mistral. Switch models by changing the modelId parameter.
  • Step Functions — Orchestrate multi-model workflows with error handling, retries, and parallel execution.
  • Bedrock Agents — Autonomous agents that can decide which model to call, invoke APIs, and query knowledge bases.
  • Lambda + API Gateway — Build a routing layer that classifies request complexity and routes to the optimal model.
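The routing layer plus fallback chain could be sketched as below. The keyword heuristic and model IDs are illustrative placeholders — production routers typically use a small classifier model rather than substring matching:

```python
# Example model IDs -- substitute the ones enabled in your account.
CHEAP_MODEL = "amazon.titan-text-lite-v1"
STRONG_MODEL = "anthropic.claude-3-sonnet-20240229-v1:0"
FALLBACK_CHAIN = [STRONG_MODEL, "meta.llama3-70b-instruct-v1:0", CHEAP_MODEL]

# Crude complexity signals; a real router would use a trained classifier.
COMPLEX_HINTS = ("why", "explain", "compare", "analyze", "design")

def choose_model(prompt: str) -> str:
    """Route short, simple prompts to the cheap model; reasoning-heavy ones to the strong model."""
    text = prompt.lower()
    if len(prompt) > 400 or any(hint in text for hint in COMPLEX_HINTS):
        return STRONG_MODEL
    return CHEAP_MODEL

def invoke_with_fallback(invoke_fn, prompt: str) -> str:
    """Try each model in the chain until one succeeds (invoke_fn wraps the Bedrock call)."""
    last_error = None
    for model_id in FALLBACK_CHAIN:
        try:
            return invoke_fn(model_id, prompt)
        except Exception as err:  # e.g. throttling or service-unavailable errors
            last_error = err
    raise last_error
```

`invoke_fn` would wrap the Bedrock call from a Lambda; Step Functions can supply the same retry/fallback behavior declaratively if you prefer orchestration over application code.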

Cost Optimization with Model Routing

A typical enterprise pattern: route 80% of requests (simple) to a cheap model and 20% (complex) to a powerful model. This reduces costs by 60-70% vs. using the expensive model for everything.
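Sanity-checking that arithmetic with the example prices quoted in the strategies table (actual savings depend on your traffic split and the models' real prices):

```python
# Blended cost per million tokens under an 80/20 routing split.
CHEAP, EXPENSIVE = 0.15, 3.00   # $ per million tokens (example prices from the table)
SIMPLE_SHARE = 0.80             # fraction of traffic routed to the cheap model

blended = SIMPLE_SHARE * CHEAP + (1 - SIMPLE_SHARE) * EXPENSIVE
savings = 1 - blended / EXPENSIVE

print(f"Blended: ${blended:.2f}/M tokens; savings vs. all-expensive: {savings:.0%}")
# With these example prices the split yields $0.72/M tokens, a 76% reduction --
# within the table's 50-80% range; less aggressive splits land in the 60-70% band.
```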

🎯 Key Takeaway

Interview tip: This shows cost-awareness and production maturity. Say: "I'd use model routing with a classifier Lambda in front of Bedrock. Simple requests go to Titan Lite for cost savings, complex reasoning goes to Claude Sonnet. Step Functions handles the orchestration with built-in error handling and fallback to alternative models."

Advanced

Interview Questions — Gen AI

Gen AI is the hottest interview topic in 2025-2026. These questions test your ability to architect AI-powered applications, not just use APIs.

  1. Answer Guide
    RAG — the documents change frequently (new cases, updated contracts), fine-tuning would require constant retraining. RAG retrieves relevant document chunks at query time. Prompt engineering alone can't inject 50K documents into context. Fine-tuning is for changing model behavior/style, not adding knowledge.
  2. Answer Guide
    Ingestion: S3 → Bedrock Knowledge Base (auto-chunks, embeds, indexes). Query: user prompt → embedding → vector search → top-K chunks → LLM with context. Vector DB options: OpenSearch Serverless (managed, scales), Aurora pgvector (existing PostgreSQL), Pinecone (purpose-built). Choice depends on existing infrastructure and scale.
  3. Answer Guide
    1) Better chunking strategy (smaller chunks, overlap, semantic chunking). 2) Metadata filtering (restrict search to relevant document categories). 3) Bedrock Guardrails (ground truth checks, citation requirements). Also: hybrid search (keyword + semantic), re-ranking retrieved chunks, and prompt engineering to instruct "only answer from provided context."
  4. Answer Guide
    Bedrock — fully managed, API-based, no infrastructure. Perfect for a small team that wants to focus on the application, not model hosting. SageMaker for: custom model training, fine-tuning at scale, full control over inference infrastructure, teams with deep ML expertise. Bedrock for application builders, SageMaker for model builders.
  5. Answer Guide
    Bedrock Guardrails: content filtering (block harmful content), PII detection/redaction, topic denial (prevent off-topic responses), word filters, grounding checks. Additional: input/output logging to S3 (audit trail), VPC endpoints for Bedrock (no public internet), IAM for model access control. Risks: hallucinations, data leakage, prompt injection.
  6. Answer Guide
    Bedrock Agents with Action Groups. Define tools (Lambda functions for each action: book_meeting, query_db, send_email). Agent uses ReAct pattern: reason about user intent → select tool → execute → return result. Include: OpenAPI schema for each action, session management for multi-turn conversations, guardrails to limit which actions are allowed.
  7. Answer Guide
    1) Prompt caching (cache common prefixes). 2) Model routing — use a smaller/cheaper model (Haiku) for simple queries, Sonnet for complex ones. 3) Response caching (cache identical queries in ElastiCache). 4) Prompt optimization (shorter prompts = fewer tokens). 5) Batch API for non-real-time workloads (50% cheaper).
  8. Answer Guide
    Embedding models convert text to vectors (for search/retrieval). Generation models produce text (answers). RAG needs embeddings to find relevant chunks, then a generator to compose the answer. Poor embeddings → wrong chunks retrieved → irrelevant or wrong answers regardless of how good the generator is. Embedding quality is the foundation of RAG accuracy.
  9. Answer Guide
    Metrics: relevance (does the answer address the question?), groundedness (is it supported by source documents?), coherence, completeness, toxicity. Approaches: human evaluation (gold standard), LLM-as-judge (use a model to evaluate another model), RAGAS framework for RAG-specific metrics. A/B testing in production with user feedback.
  10. Answer Guide
    Router Lambda classifies intent → routes to the appropriate model (Claude Haiku for FAQ, Claude Sonnet for complex reasoning and code generation). Step Functions orchestrates with fallback: primary model timeout → retry → fallback model. Cost optimization: cheapest model that meets the quality threshold. Bedrock's single API makes custom routing logic straightforward — switch models by changing the modelId.

Preparation Strategy

Gen AI interviews in 2025-2026 focus on architectural decisions, not model theory. Know when to use RAG vs fine-tuning, understand the cost implications of model selection, and be ready to discuss guardrails and responsible AI. Show that you can build production-grade AI applications, not just prototype demos.