Day 18: Cost Optimization + Performance (Domain 4)
Learning Objectives
- Distinguish 3 caching layers: prompt caching, semantic caching, and exact-match caching
- Design model cascading and Intelligent Prompt Routing architectures
- Know batch inference (50% discount) for non-real-time workloads
- Implement cost monitoring with CloudWatch and Cost Explorer
- Understand provisioned throughput vs. on-demand pricing
Tasks
- Blog (30m): Effective Cost Optimization Strategies for Amazon Bedrock
  Comprehensive guide: model cascading, caching, batch inference, provisioned throughput.
- Blog (20m): Prompt Caching on Bedrock - Up to 85% Cost Reduction
  5-min TTL (1.25x write, 0.1x read) or 1-hour TTL (2.0x write, 0.1x read). Minimum 1,024 tokens.
- Blog (20m): ElastiCache Semantic Cache - 86% Cost Reduction
  Application-level semantic caching using embeddings and cosine similarity.
- Blog (15m): Intelligent Prompt Routing
  Automatically routes each prompt to the cheapest capable model; a managed form of model cascading.
- Read (15m): Model Invocation Logging
  Logs all prompts and responses to CloudWatch or S3. Essential for cost analysis and debugging.
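The prompt-caching multipliers above translate directly into savings you can estimate. A back-of-envelope sketch, where the base token price is a placeholder assumption rather than a current Bedrock rate:

```python
# Back-of-envelope prompt-caching math. BASE_PRICE_PER_1K is a placeholder
# assumption, not a real Bedrock rate; the 1.25x write and 0.1x read
# multipliers are the 5-minute-TTL figures quoted in the task list above.
BASE_PRICE_PER_1K = 0.003          # assumed $ per 1K input tokens
WRITE_MULT, READ_MULT = 1.25, 0.10

def cached_cost(prompt_tokens: int, requests: int) -> float:
    """First request writes the cache; the rest read it within the TTL."""
    per_request = prompt_tokens / 1000 * BASE_PRICE_PER_1K
    return per_request * WRITE_MULT + (requests - 1) * per_request * READ_MULT

def uncached_cost(prompt_tokens: int, requests: int) -> float:
    return requests * prompt_tokens / 1000 * BASE_PRICE_PER_1K

# 100 requests reusing a 2,048-token prefix (cached prefix must be >= 1,024 tokens)
savings = 1 - cached_cost(2048, 100) / uncached_cost(2048, 100)
print(f"savings: {savings:.0%}")   # roughly 89% on the cached prefix
```

Output tokens and any non-cached suffix are billed at normal rates, which is why real-world savings land below this ceiling.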
Exam Skills
Write your understanding, then reveal the reference answer.
Hands-On Lab
Build real muscle memory with these activities.
Implement Semantic Caching with ElastiCache
Set up a semantic cache using ElastiCache to reduce Bedrock API costs for similar queries.
1. Create an ElastiCache Serverless Redis cluster with vector search enabled
2. Write a Lambda function that generates embeddings for incoming queries using Titan Embeddings
3. Before calling Bedrock, search ElastiCache for a cached response with cosine similarity > 0.95
4. On a cache hit, return the cached response (saving the Bedrock API call); on a miss, call Bedrock, then store the query embedding and response in ElastiCache
5. Test with 10 similar queries and measure the cache hit rate and cost savings
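Steps 2-4 above can be sketched in plain Python. The in-memory list stands in for the ElastiCache vector index, and the toy vectors stand in for Titan embeddings; the 0.95 threshold comes from step 3, everything else here is an assumption:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """In-memory stand-in for an ElastiCache vector index (sketch only).

    A real implementation would store embeddings in Redis and use its
    vector-search commands instead of this linear scan.
    """
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []               # list of (embedding, response)

    def lookup(self, embedding):
        best_response, best_sim = None, 0.0
        for emb, response in self.entries:
            sim = cosine_similarity(embedding, emb)
            if sim > best_sim:
                best_response, best_sim = response, sim
        # Only treat it as a hit above the similarity threshold (step 3)
        return best_response if best_sim > self.threshold else None

    def store(self, embedding, response):
        self.entries.append((embedding, response))

cache = SemanticCache(threshold=0.95)
cache.store([0.9, 0.1, 0.0], "cached answer")          # earlier Bedrock response
print(cache.lookup([0.91, 0.09, 0.0]))  # near-duplicate query: cache hit
print(cache.lookup([0.0, 0.0, 1.0]))    # unrelated query: miss, call Bedrock
```

In the Lambda from step 2, the toy vectors would instead come from invoking a Titan Embeddings model on the query text before the lookup.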
Enable and Analyze Model Invocation Logging
Set up model invocation logging to CloudWatch to track token usage and cost per request.
1. Open Bedrock console → Settings → Model invocation logging
2. Enable logging to CloudWatch Logs and select a log group
3. Make several Bedrock API calls with different models
4. Open CloudWatch Logs Insights and run: fields @timestamp, inputTokenCount, outputTokenCount, modelId | sort @timestamp desc
5. Calculate the cost per request using the token counts and published pricing
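Step 5 can be scripted once the token counts are exported from the Logs Insights query. A minimal sketch; the per-1K prices in the table below are placeholder assumptions, so check the current Bedrock pricing page before relying on the numbers:

```python
# Compute cost per request from invocation-log token counts.
# The per-1K-token prices are placeholder assumptions, not guaranteed
# current rates; the model IDs follow Bedrock's naming convention.
PRICING = {
    "anthropic.claude-3-haiku-20240307-v1:0":    {"input": 0.00025, "output": 0.00125},
    "anthropic.claude-3-5-sonnet-20240620-v1:0": {"input": 0.003,   "output": 0.015},
}

def request_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Price one request from the inputTokenCount/outputTokenCount fields."""
    price = PRICING[model_id]
    return (input_tokens / 1000) * price["input"] + \
           (output_tokens / 1000) * price["output"]

cost = request_cost("anthropic.claude-3-5-sonnet-20240620-v1:0", 1500, 400)
print(f"${cost:.6f}")   # → $0.010500
```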
Scenarios
Think through each scenario before revealing the answer.
GenAI App Cost Reduction
- Which caching method handles 'similar but not identical' queries?
- How does semantic caching work technically?
- What similarity threshold is appropriate?
- What additional optimization can you add for remaining queries?
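One answer to the last question is model cascading: send remaining (cache-miss) queries to the cheapest model first and escalate only when its answer fails a quality check. A minimal sketch, with lambda stubs standing in for real Bedrock invocations and an assumed quality predicate:

```python
def cascade(prompt, models, is_good_enough):
    """Try models cheapest-first; escalate when the quality check fails.

    `models` is a list of (name, invoke_fn) ordered cheapest to strongest;
    `is_good_enough` is any response-quality predicate (an assumption here,
    e.g. a confidence score or refusal detector in a real system).
    """
    for name, invoke in models[:-1]:
        response = invoke(prompt)
        if is_good_enough(response):
            return name, response       # cheap model sufficed
    name, invoke = models[-1]           # fall back to the strongest model
    return name, invoke(prompt)

# Stubs standing in for Bedrock calls (hypothetical names and behavior).
cheap_model  = lambda p: "I don't know"
strong_model = lambda p: "Detailed answer about " + p

model_used, answer = cascade(
    "VPC peering",
    [("haiku", cheap_model), ("sonnet", strong_model)],
    is_good_enough=lambda r: "don't know" not in r,
)
print(model_used)   # → sonnet
```

Intelligent Prompt Routing is the managed version of this pattern: Bedrock applies the routing decision before invocation instead of after a failed response.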
Practice Questions
14 questions across 3 difficulty levels.
Further Reading
Go deeper into today's topics.
- Effective Cost Optimization Strategies for Bedrock
  Comprehensive guide: model cascading, caching, batch inference, provisioned throughput, distillation.
- Prompt Caching on Bedrock — Up to 85% Cost Reduction
  Up to 85% cost reduction for repeated contexts. TTL options and pricing.
- Track, Allocate, and Manage GenAI Cost with Bedrock
  Application Inference Profiles, cost allocation tags, Cost Explorer integration.
- Batch Job Orchestration with Step Functions
  50% savings with batch inference: S3 Map state, parallel processing.
- Build a Proactive AI Cost Management System
  Automated cost monitoring and alerting for Bedrock workloads.
- Optimizing Cost for FMs with Amazon Bedrock — FinOps Blog
  FinOps perspective: pricing options, model selection for cost, KB optimization.