Day 19: Testing, Validation, Troubleshooting (Domain 5)
Learning Objectives
- Use Bedrock Evaluations for FM and RAG quality assessment
- Apply LLM-as-a-judge for automated evaluation (up to 98% cost savings)
- Use AgentCore Evaluations for agent correctness and safety
- Implement X-Ray tracing for end-to-end FM call debugging
- Design RAG evaluation with precision, recall, MRR, and NDCG metrics
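The last objective names four standard retrieval metrics. A minimal sketch of how each is computed, with illustrative function and variable names (not from any AWS SDK): `retrieved` is a ranked list of document IDs from the retriever, `relevant` is the ground-truth set.

```python
# Hedged sketch of the retrieval metrics named in the objectives.
# All names here are illustrative, not part of any Bedrock API.
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant doc per query."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(retrieved, relevance, k):
    """NDCG with graded relevance: `relevance` maps doc ID -> gain."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Precision and recall ignore ordering within the top-k; MRR and NDCG reward ranking relevant documents earlier, which is why RAG evaluations typically report both kinds.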
Tasks
- Read 30m
Amazon Bedrock Model Evaluation
LLM-as-a-judge, human evaluation, automated metrics.
- Blog 25m
LLM-as-a-Judge on Bedrock - 98% Cost Savings
Use a judge model to evaluate another model's outputs. Massive cost savings vs human eval.
- Blog 25m
Evaluating RAG with Bedrock KB Evaluation
RAG-specific evaluation: retrieval relevance, answer faithfulness, completeness.
- Blog 20m
Evaluate Agents with RAGAS and LLM-as-Judge
Agent evaluation: task completion, tool usage, reasoning quality.
- Blog 15m
CloudWatch GenAI Observability
Purpose-built GenAI tracing, end-to-end across LLMs/agents/KBs.
- Hands-on 60m
Evaluating LLMs Using LLM-as-a-Judge (Sample Notebooks)
Work through the evaluation notebooks for hands-on practice.
Exam Skills
Write your understanding, then reveal the reference answer.
Hands-On Lab
Build real muscle memory with these activities.
Run Bedrock Evaluations with LLM-as-a-Judge
Set up an automated evaluation job using LLM-as-a-judge to assess model quality.
1. Open Bedrock console → Evaluations → Create evaluation job
2. Select 'LLM-as-a-judge' evaluation type
3. Choose Claude as the judge model and Nova as the target model
4. Upload a test dataset with 10-20 prompt-response pairs in JSONL format
5. Configure evaluation criteria: accuracy, relevance, coherence, and safety
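The console steps above can also be scripted. A hedged sketch, assuming the boto3 `bedrock` client's `create_evaluation_job` operation: the role ARN, S3 URIs, and model IDs are placeholders, and the nested config shapes (especially `evaluatorModelConfig` and the `Builtin.*` metric names) are my best reading of the API and should be verified against current boto3 documentation.

```python
# Hedged sketch: steps 1-5 expressed as a create_evaluation_job request.
# All ARNs, bucket names, and model IDs below are placeholders.
import json

# One line of the JSONL test dataset (step 4): a prompt plus an optional
# reference response the judge can compare against.
dataset_line = json.dumps({
    "prompt": "Summarize the refund policy in one sentence.",
    "referenceResponse": "Refunds are issued within 30 days of purchase."
})

params = {
    "jobName": "llm-judge-demo",
    "roleArn": "arn:aws:iam::123456789012:role/BedrockEvalRole",
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "General",
                "dataset": {
                    "name": "demo",
                    "datasetLocation": {"s3Uri": "s3://my-bucket/eval.jsonl"},
                },
                # Built-in judge metrics roughly covering step 5's criteria
                "metricNames": ["Builtin.Correctness", "Builtin.Relevance",
                                "Builtin.Coherence", "Builtin.Harmfulness"],
            }],
            # Claude as the judge model (step 3)
            "evaluatorModelConfig": {"bedrockEvaluatorModels": [
                {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}]},
        }
    },
    # Nova as the target model under evaluation (step 3)
    "inferenceConfig": {"models": [{"bedrockModel": {
        "modelIdentifier": "amazon.nova-lite-v1:0"}}]},
    "outputDataConfig": {"s3Uri": "s3://my-bucket/eval-results/"},
}
# With credentials configured, the job would be submitted via:
#   boto3.client("bedrock").create_evaluation_job(**params)
```

The judge model and the target model are deliberately different model families here, which is a common way to reduce self-preference bias in LLM-as-a-judge setups.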
Explore LLM-as-a-Judge Sample Notebooks
Work through the evaluation notebooks to understand automated model assessment.
1. Clone the evaluating-large-language-models-using-llm-as-a-judge repository
2. Open the main evaluation notebook
3. Follow the setup to configure the judge model and evaluation criteria
4. Run the evaluation on a sample dataset and review the scores
5. Experiment with different judge prompts to understand how judge instructions affect scoring
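The core loop in a judge notebook is: build a rubric prompt, send it to the judge model, and parse a numeric score from the reply. A minimal sketch of that pattern; the rubric wording and 1-5 scale are illustrative, not the repository's exact prompts.

```python
# Hedged sketch of an LLM-as-a-judge scoring loop. The prompt template
# and score format are illustrative assumptions.
import re

def build_judge_prompt(question, answer, criterion="accuracy"):
    """Assemble a rubric prompt asking the judge to rate one answer."""
    return (
        f"You are an impartial judge. Rate the answer's {criterion} "
        f"on a scale of 1-5.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        f"Reply with 'Score: <n>' followed by a one-line justification."
    )

def parse_score(judge_reply):
    """Extract the integer after 'Score:' from the judge's free text."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(match.group(1)) if match else None

# The prompt is then sent through the Bedrock runtime, e.g.:
#   bedrock_runtime.converse(
#       modelId=judge_model_id,
#       messages=[{"role": "user", "content": [{"text": prompt}]}])
```

Step 5 above is worth doing precisely because `build_judge_prompt` is where scoring behavior lives: small changes to the rubric (scale, criteria definitions, required output format) can shift scores substantially.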
Enable CloudWatch GenAI Observability
Set up the CloudWatch GenAI observability dashboard for monitoring Bedrock invocations.
1. Open CloudWatch console and navigate to GenAI Observability (in the left nav)
2. Enable the pre-configured Bedrock dashboard
3. Review the available metrics: invocation count, token usage, error rates, latency percentiles
4. Create a custom alarm for high error rates: trigger when error count > 10 in 5 minutes
5. Make several Bedrock API calls and verify the metrics appear in the dashboard
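Step 4's alarm can also be defined in code. A hedged sketch of the `put_metric_alarm` parameters: I'd expect the metric to be `InvocationClientErrors` in the `AWS/Bedrock` namespace, but confirm the exact metric name in the CloudWatch console before relying on it.

```python
# Hedged sketch of step 4: alarm when Bedrock errors exceed 10 in 5 minutes.
# Metric and namespace names are assumptions to verify in the console.
alarm = {
    "AlarmName": "bedrock-high-error-rate",
    "Namespace": "AWS/Bedrock",
    "MetricName": "InvocationClientErrors",
    "Statistic": "Sum",
    "Period": 300,                       # 5-minute evaluation window
    "EvaluationPeriods": 1,
    "Threshold": 10,                     # > 10 errors in the window
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching",  # no invocations -> no alarm
}
# With credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

`TreatMissingData` matters here: with low traffic there may be no error datapoints at all, and `notBreaching` keeps an idle period from being ambiguous.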
Scenarios
Think through each scenario before revealing the answer.
RAG Hallucination Diagnosis
- What Guardrails feature detects ungrounded responses?
- How do you trace the full request path to find the failure point?
- What might be wrong with the retrieval vs the prompt?
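For the first question, the relevant feature is Guardrails' contextual grounding check, exercised via the `ApplyGuardrail` API. A hedged sketch of the request shape, per my reading of that API; the guardrail ID is a placeholder and the `qualifiers` names should be verified against the documentation.

```python
# Hedged sketch: checking a RAG answer for grounding with ApplyGuardrail.
# The guardrail identifier is a placeholder; qualifier names are my best
# reading of the API and should be double-checked.
request = {
    "guardrailIdentifier": "gr-example-id",
    "guardrailVersion": "1",
    "source": "OUTPUT",
    "content": [
        # The retrieved passage the answer must be grounded in
        {"text": {"text": "Retrieved passage from the knowledge base...",
                  "qualifiers": ["grounding_source"]}},
        # The user's original question
        {"text": {"text": "What is the refund window?",
                  "qualifiers": ["query"]}},
        # The model's answer to evaluate
        {"text": {"text": "Refunds are issued within 30 days."}},
    ],
}
# With credentials configured:
#   boto3.client("bedrock-runtime").apply_guardrail(**request)
# The response carries grounding and relevance scores; answers scoring
# below the guardrail's configured threshold are flagged as ungrounded.
```

Pairing this with end-to-end tracing (the second question) lets you separate "the retriever returned the wrong passages" from "the model ignored good passages," which is the distinction the third question is driving at.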
Practice Questions
14 questions across 3 difficulty levels.
Further Reading
Go deeper into today's topics.
RAG Evaluation and LLM-as-a-Judge on Bedrock (GA)
GA capabilities: quality, user experience, instruction compliance, and safety metrics, with up to 98% cost savings versus human evaluation.
Evaluate RAG with Bedrock, LlamaIndex, and RAGAS
Open-source RAG evaluation: context relevance, answer faithfulness, answer relevancy metrics.
AgentCore Observability — OTEL-Based Tracing
OTEL-based tracing, debugging agent workflows end-to-end.
CloudWatch AppSignals for Bedrock
Native tracing, LangChain/Strands compatible for production monitoring.
GenAIOps on AWS: End-to-End Observability Stack
Full observability architecture: CloudWatch + X-Ray + AppSignals + OTEL for GenAI production workloads.