Day 19: Testing, Validation, Troubleshooting (Domain 5)
Learning Objectives
- Use Bedrock Evaluations for FM and RAG quality assessment
- Apply LLM-as-a-judge for automated evaluation (up to 98% cost savings)
- Use AgentCore Evaluations for agent correctness and safety
- Implement X-Ray tracing for end-to-end FM call debugging
- Design RAG evaluation with precision, recall, MRR, and NDCG metrics
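The last objective names four standard retrieval metrics. A minimal sketch of how each is computed, with illustrative function and variable names (not from any AWS SDK): `retrieved` is a ranked list of document IDs from the retriever, `relevant` is the ground-truth set.

```python
# Hedged sketch of the retrieval metrics named in the objectives.
# All names here are illustrative, not part of any Bedrock API.
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant doc per query."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(retrieved, relevance, k):
    """NDCG with graded relevance: `relevance` maps doc ID -> gain."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Precision and recall ignore ordering within the top-k; MRR and NDCG reward ranking relevant documents earlier, which is why RAG evaluations typically report both kinds.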
Tasks
- Read 30m
Amazon Bedrock Model Evaluation
LLM-as-a-judge, human evaluation, automated metrics.
- Blog 25m
LLM-as-a-Judge on Bedrock - 98% Cost Savings
Use a judge model to evaluate another model's outputs. Massive cost savings vs human eval.
- Blog 25m
Evaluating RAG with Bedrock KB Evaluation
RAG-specific evaluation: retrieval relevance, answer faithfulness, completeness.
- Blog 20m
Evaluate Agents with RAGAS and LLM-as-Judge
Agent evaluation: task completion, tool usage, reasoning quality.
- Blog 15m
CloudWatch GenAI Observability
Purpose-built GenAI tracing, end-to-end across LLMs/agents/KBs.
- Hands-on 60m
Evaluating LLMs Using LLM-as-a-Judge (Sample Notebooks)
Work through the evaluation notebooks for hands-on practice.
Exam Skills
Write your understanding, then reveal the reference answer.
Hands-On Lab
Build real muscle memory with these activities.
Run Bedrock Evaluations with LLM-as-a-Judge
Set up an automated evaluation job using LLM-as-a-judge to assess model quality.
1. Open Bedrock console → Evaluations → Create evaluation job
2. Select 'LLM-as-a-judge' evaluation type
3. Choose Claude as the judge model and Nova as the target model
4. Upload a test dataset with 10-20 prompt-response pairs in JSONL format
5. Configure evaluation criteria: accuracy, relevance, coherence, and safety
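The console steps above can also be scripted. A hedged sketch, assuming the boto3 `bedrock` client's `create_evaluation_job` operation: the role ARN, S3 URIs, and model IDs are placeholders, and the nested config shapes (especially `evaluatorModelConfig` and the `Builtin.*` metric names) are my best reading of the API and should be verified against current boto3 documentation.

```python
# Hedged sketch: steps 1-5 expressed as a create_evaluation_job request.
# All ARNs, bucket names, and model IDs below are placeholders.
import json

# One line of the JSONL test dataset (step 4): a prompt plus an optional
# reference response the judge can compare against.
dataset_line = json.dumps({
    "prompt": "Summarize the refund policy in one sentence.",
    "referenceResponse": "Refunds are issued within 30 days of purchase."
})

params = {
    "jobName": "llm-judge-demo",
    "roleArn": "arn:aws:iam::123456789012:role/BedrockEvalRole",
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "General",
                "dataset": {
                    "name": "demo",
                    "datasetLocation": {"s3Uri": "s3://my-bucket/eval.jsonl"},
                },
                # Built-in judge metrics roughly covering step 5's criteria
                "metricNames": ["Builtin.Correctness", "Builtin.Relevance",
                                "Builtin.Coherence", "Builtin.Harmfulness"],
            }],
            # Claude as the judge model (step 3)
            "evaluatorModelConfig": {"bedrockEvaluatorModels": [
                {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}]},
        }
    },
    # Nova as the target model under evaluation (step 3)
    "inferenceConfig": {"models": [{"bedrockModel": {
        "modelIdentifier": "amazon.nova-lite-v1:0"}}]},
    "outputDataConfig": {"s3Uri": "s3://my-bucket/eval-results/"},
}
# With credentials configured, the job would be submitted via:
#   boto3.client("bedrock").create_evaluation_job(**params)
```

The judge model and the target model are deliberately different model families here, which is a common way to reduce self-preference bias in LLM-as-a-judge setups.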
Explore LLM-as-a-Judge Sample Notebooks
Work through the evaluation notebooks to understand automated model assessment.
1. Clone the evaluating-large-language-models-using-llm-as-a-judge repository
2. Open the main evaluation notebook
3. Follow the setup to configure the judge model and evaluation criteria
4. Run the evaluation on a sample dataset and review the scores
5. Experiment with different judge prompts to understand how judge instructions affect scoring
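The core loop in a judge notebook is: build a rubric prompt, send it to the judge model, and parse a numeric score from the reply. A minimal sketch of that pattern; the rubric wording and 1-5 scale are illustrative, not the repository's exact prompts.

```python
# Hedged sketch of an LLM-as-a-judge scoring loop. The prompt template
# and score format are illustrative assumptions.
import re

def build_judge_prompt(question, answer, criterion="accuracy"):
    """Assemble a rubric prompt asking the judge to rate one answer."""
    return (
        f"You are an impartial judge. Rate the answer's {criterion} "
        f"on a scale of 1-5.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        f"Reply with 'Score: <n>' followed by a one-line justification."
    )

def parse_score(judge_reply):
    """Extract the integer after 'Score:' from the judge's free text."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(match.group(1)) if match else None

# The prompt is then sent through the Bedrock runtime, e.g.:
#   bedrock_runtime.converse(
#       modelId=judge_model_id,
#       messages=[{"role": "user", "content": [{"text": prompt}]}])
```

Step 5 above is worth doing precisely because `build_judge_prompt` is where scoring behavior lives: small changes to the rubric (scale, criteria definitions, required output format) can shift scores substantially.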
Enable CloudWatch GenAI Observability
Set up the CloudWatch GenAI observability dashboard for monitoring Bedrock invocations.
1. Open CloudWatch console and navigate to GenAI Observability (in the left nav)
2. Enable the pre-configured Bedrock dashboard
3. Review the available metrics: invocation count, token usage, error rates, latency percentiles
4. Create a custom alarm for high error rates: trigger when error count > 10 in 5 minutes
5. Make several Bedrock API calls and verify the metrics appear in the dashboard
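Step 4's alarm can also be defined in code. A hedged sketch of the `put_metric_alarm` parameters: I'd expect the metric to be `InvocationClientErrors` in the `AWS/Bedrock` namespace, but confirm the exact metric name in the CloudWatch console before relying on it.

```python
# Hedged sketch of step 4: alarm when Bedrock errors exceed 10 in 5 minutes.
# Metric and namespace names are assumptions to verify in the console.
alarm = {
    "AlarmName": "bedrock-high-error-rate",
    "Namespace": "AWS/Bedrock",
    "MetricName": "InvocationClientErrors",
    "Statistic": "Sum",
    "Period": 300,                       # 5-minute evaluation window
    "EvaluationPeriods": 1,
    "Threshold": 10,                     # > 10 errors in the window
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching",  # no invocations -> no alarm
}
# With credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

`TreatMissingData` matters here: with low traffic there may be no error datapoints at all, and `notBreaching` keeps an idle period from being ambiguous.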
Scenarios
Think through each scenario before revealing the answer.
RAG Hallucination Diagnosis
- What Guardrails feature detects ungrounded responses?
- How do you trace the full request path to find the failure point?
- What might be wrong with the retrieval vs the prompt?
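For the first question, the relevant feature is Guardrails' contextual grounding check, exercised via the `ApplyGuardrail` API. A hedged sketch of the request shape, per my reading of that API; the guardrail ID is a placeholder and the `qualifiers` names should be verified against the documentation.

```python
# Hedged sketch: checking a RAG answer for grounding with ApplyGuardrail.
# The guardrail identifier is a placeholder; qualifier names are my best
# reading of the API and should be double-checked.
request = {
    "guardrailIdentifier": "gr-example-id",
    "guardrailVersion": "1",
    "source": "OUTPUT",
    "content": [
        # The retrieved passage the answer must be grounded in
        {"text": {"text": "Retrieved passage from the knowledge base...",
                  "qualifiers": ["grounding_source"]}},
        # The user's original question
        {"text": {"text": "What is the refund window?",
                  "qualifiers": ["query"]}},
        # The model's answer to evaluate
        {"text": {"text": "Refunds are issued within 30 days."}},
    ],
}
# With credentials configured:
#   boto3.client("bedrock-runtime").apply_guardrail(**request)
# The response carries grounding and relevance scores; answers scoring
# below the guardrail's configured threshold are flagged as ungrounded.
```

Pairing this with end-to-end tracing (the second question) lets you separate "the retriever returned the wrong passages" from "the model ignored good passages," which is the distinction the third question is driving at.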
Practice Questions
14 questions across 3 difficulty levels.
Further Reading
Go deeper into today's topics.
RAG Evaluation and LLM-as-a-Judge on Bedrock (GA)
GA capabilities: quality, user experience, instruction compliance, and safety metrics, with up to 98% cost savings versus human evaluation.
Evaluate RAG with Bedrock, LlamaIndex, and RAGAS
Open-source RAG evaluation: context relevance, answer faithfulness, answer relevancy metrics.
AgentCore Observability — OTEL-Based Tracing
OTEL-based tracing, debugging agent workflows end-to-end.
CloudWatch AppSignals for Bedrock
Native tracing, LangChain/Strands compatible for production monitoring.
GenAIOps on AWS: End-to-End Observability Stack
Full observability architecture: CloudWatch + X-Ray + AppSignals + OTEL for GenAI production workloads.