AIP-C01 Study Hub
Testing Week 3 · Friday

Day 19: Testing, Validation, Troubleshooting (Domain 5)

Learning Objectives

  • Use Bedrock Evaluations for FM and RAG quality assessment
  • Apply LLM-as-a-judge for automated evaluation (98% cost savings)
  • Use AgentCore Evaluations for agent correctness and safety
  • Implement X-Ray tracing for end-to-end FM call debugging
  • Design RAG evaluation with precision, recall, MRR, and NDCG metrics
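The retrieval metrics named above can be computed directly. A minimal sketch of MRR and NDCG on toy relevance judgments (binary relevance for MRR, graded relevance for NDCG; list contents are illustrative):

```python
import math

def mrr(ranked_relevance):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for rels in ranked_relevance:  # one list of 0/1 flags per query, in rank order
        rr = 0.0
        for i, rel in enumerate(rels, start=1):
            if rel:
                rr = 1.0 / i
                break
        total += rr
    return total / len(ranked_relevance)

def ndcg(rels, k=None):
    """Normalized Discounted Cumulative Gain for one ranked list of graded relevance."""
    k = k or len(rels)
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

# First relevant doc at rank 2 for query 1 and rank 1 for query 2:
print(mrr([[0, 1, 0], [1, 0, 0]]))  # (0.5 + 1.0) / 2 = 0.75
```

A perfect ranking gives NDCG of 1.0; pushing relevant documents down the list discounts them logarithmically by rank.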

Tasks

0/6 completed
  • Read · 30m

    Amazon Bedrock Model Evaluation

    LLM-as-a-judge, human evaluation, automated metrics.

  • Blog · 25m

    LLM-as-a-Judge on Bedrock - 98% Cost Savings

    Use a judge model to evaluate another model's outputs. Massive cost savings vs human eval.

  • Blog · 25m

    Evaluating RAG with Bedrock KB Evaluation

    RAG-specific evaluation: retrieval relevance, answer faithfulness, completeness.

  • Blog · 20m

    Evaluate Agents with RAGAS and LLM-as-Judge

    Agent evaluation: task completion, tool usage, reasoning quality.

  • Blog · 15m

    CloudWatch GenAI Observability

    Purpose-built GenAI tracing, end-to-end across LLMs, agents, and knowledge bases.

  • Hands-on · 60m

    Evaluating LLMs Using LLM-as-a-Judge (Sample Notebooks)

    Work through the evaluation notebooks for hands-on practice.

Exam Skills

Write your understanding, then reveal the reference answer.

0/14 reviewed

Hands-On Lab

Build real muscle memory with these activities.

intermediate 45 min

Run Bedrock Evaluations with LLM-as-a-Judge

Set up an automated evaluation job using LLM-as-a-judge to assess model quality.

  1. Open the Bedrock console → Evaluations → Create evaluation job
  2. Select the 'LLM-as-a-judge' evaluation type
  3. Choose Claude as the judge model and Nova as the target model
  4. Upload a test dataset with 10-20 prompt-response pairs in JSONL format
  5. Configure evaluation criteria: accuracy, relevance, coherence, and safety
Open Lab
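Step 4 of the lab above calls for a JSONL test dataset. A minimal sketch of generating one, assuming the `prompt`/`referenceResponse` field names used by Bedrock evaluation datasets (verify the current schema in the Bedrock docs before uploading; the pair contents here are placeholders):

```python
import json

# Hypothetical prompt/reference pairs; a real dataset would have 10-20 of these.
pairs = [
    {"prompt": "What is Amazon Bedrock?",
     "referenceResponse": "A managed service for accessing foundation models."},
    {"prompt": "What does RAG stand for?",
     "referenceResponse": "Retrieval-Augmented Generation."},
]

# JSONL: exactly one JSON object per line, no enclosing array.
with open("eval_dataset.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```

The resulting file is what you point the evaluation job at (typically via an S3 URI) when configuring the dataset in step 4.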
intermediate 45 min

Explore LLM-as-a-Judge Sample Notebooks

Work through the evaluation notebooks to understand automated model assessment.

  1. Clone the evaluating-large-language-models-using-llm-as-a-judge repository
  2. Open the main evaluation notebook
  3. Follow the setup steps to configure the judge model and evaluation criteria
  4. Run the evaluation on a sample dataset and review the scores
  5. Experiment with different judge prompts to see how judge instructions affect scoring
Open Lab
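The core of LLM-as-a-judge is a scoring prompt sent to the judge model and a parser for its reply. A minimal sketch of that round trip, with a hypothetical prompt template and score format (the notebooks use their own templates; the actual model call would go through bedrock-runtime, e.g. the Converse API):

```python
import re

def build_judge_prompt(question, answer, criterion="accuracy"):
    """Hypothetical judge prompt: ask for a 1-5 score plus a brief rationale."""
    return (
        f"You are an impartial evaluator. Rate the following answer for {criterion} "
        f"on a scale of 1-5, then explain briefly.\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        f"Respond with 'Score: <1-5>' on the first line."
    )

def parse_score(judge_output):
    """Pull the numeric score out of the judge's reply; None if it didn't comply."""
    m = re.search(r"Score:\s*([1-5])", judge_output)
    return int(m.group(1)) if m else None

# Parsing a canned judge reply (no model call here):
print(parse_score("Score: 4\nThe answer is mostly correct."))  # 4
```

Experimenting with the wording of `build_judge_prompt` (step 5) is what shifts the score distribution: stricter rubrics and required output formats make the judge both harsher and easier to parse.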
intermediate 30 min

Enable CloudWatch GenAI Observability

Set up the CloudWatch GenAI observability dashboard for monitoring Bedrock invocations.

  1. Open the CloudWatch console and navigate to GenAI Observability in the left nav
  2. Enable the pre-configured Bedrock dashboard
  3. Review the available metrics: invocation count, token usage, error rates, and latency percentiles
  4. Create a custom alarm for high error rates: trigger when the error count exceeds 10 in 5 minutes
  5. Make several Bedrock API calls and verify that the metrics appear in the dashboard
Open Lab
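The alarm in step 4 can also be created with boto3 instead of the console. A sketch of the alarm parameters, assuming the `InvocationServerErrors` metric in the `AWS/Bedrock` namespace (check the namespace's metric list in your region; the boto3 call itself is commented out so nothing is created by accident):

```python
def bedrock_error_alarm_params(threshold=10, period_seconds=300):
    """Alarm fires when Bedrock server errors exceed `threshold` in one 5-minute period."""
    return {
        "AlarmName": "bedrock-high-error-rate",
        "Namespace": "AWS/Bedrock",
        "MetricName": "InvocationServerErrors",
        "Statistic": "Sum",
        "Period": period_seconds,          # 300 s = the 5-minute window from step 4
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic should not page anyone
    }

# To actually create it (requires credentials and a configured region):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**bedrock_error_alarm_params())
```

`TreatMissingData: notBreaching` keeps the alarm quiet during idle periods, which matters for a dev account where Bedrock traffic is bursty.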

Scenarios

Think through each scenario before revealing the answer.

D5: Testing · Hard
#15

RAG Hallucination Diagnosis

After a prompt update, users report that the RAG chatbot is 'making things up': responses contain information not in the knowledge base. How do you diagnose and fix this?
Think First
  • What Guardrails feature detects ungrounded responses?
  • How do you trace the full request path to find the failure point?
  • What might be wrong with the retrieval vs the prompt?

Practice Questions

14 questions across 3 difficulty levels.

Further Reading

Go deeper into today's topics.