The typical throughput for a model in AWS Bedrock depends on the specific model, your capacity configuration, and your workload. For example, models like Claude or Jurassic-2 might handle between 10-50 requests per second (RPS) or 100-500 tokens per second (TPS) under standard configurations, but these numbers can vary widely. Smaller models, such as those optimized for low-latency tasks, may achieve higher RPS but process fewer tokens per request. Larger models, like those designed for complex reasoning, often prioritize accuracy over speed, resulting in lower throughput. Amazon does not publish exact benchmarks, as performance depends on factors like input/output token counts, payload size, and regional server load. Testing with your specific workload is critical: running a load test with realistic prompts, like the sketch below, will give the most accurate baseline for your use case.
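Here is a minimal load-test sketch using boto3. The region, model ID, prompt, and request counts are illustrative placeholders; swap in your own workload, and note that the Claude v2 text-completion body shown here differs from the request format other Bedrock models expect.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

# Assumed region and model ID -- substitute your own.
runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-v2"

def invoke_once(prompt: str) -> float:
    """Send one synchronous request and return its latency in seconds."""
    body = json.dumps({
        # Claude v2 text-completion format; other models expect different bodies.
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 256,
    })
    start = time.perf_counter()
    runtime.invoke_model(modelId=MODEL_ID, body=body, contentType="application/json")
    return time.perf_counter() - start

def load_test(prompt: str, total_requests: int = 100, concurrency: int = 10) -> None:
    """Fire requests from a thread pool and report throughput and mean latency."""
    # Higher concurrency may hit on-demand quotas and raise ThrottlingException.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(invoke_once, [prompt] * total_requests))
    elapsed = time.perf_counter() - start
    print(f"{total_requests / elapsed:.1f} RPS, "
          f"mean latency {sum(latencies) / len(latencies):.2f}s")

load_test("Summarize our refund policy in two sentences.")
```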
Throughput can be increased through configuration and capacity choices. Note that Bedrock is serverless, so unlike SageMaker there are no instance types to size up; the main lever is Provisioned Throughput, a paid feature that reserves dedicated capacity for your model in units called model units, guaranteeing consistent performance during traffic spikes. For example, purchasing model units for Claude-v2 reserves a fixed tokens-per-minute quota that is not subject to the shared on-demand limits (a sketch of setting this up follows below). Additionally, tuning inference parameters like `max_tokens`, or combining multiple small tasks into a single request, can improve efficiency. If your application allows asynchronous processing, using Bedrock's batch inference API to queue requests and process them in bulk can also increase effective throughput by reducing per-call overhead.
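A minimal sketch of reserving dedicated capacity with boto3, assuming the us-east-1 region; the provisioned model name and unit count are hypothetical, and the call is shown without a `commitmentDuration` (some models require a one- or six-month commitment). Capacity is sold in model units, each granting a fixed tokens-per-minute quota, not in RPS.

```python
import boto3

# Control-plane client: "bedrock" manages capacity, "bedrock-runtime" serves requests.
bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="claude-prod-capacity",  # hypothetical name
    modelId="anthropic.claude-v2",
    modelUnits=2,  # capacity is bought in model units, not RPS
)
provisioned_arn = response["provisionedModelArn"]

# Invocations then target the provisioned ARN in place of the base model ID:
runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
# runtime.invoke_model(modelId=provisioned_arn, body=..., contentType="application/json")
```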
Developers should also optimize their implementation. Caching frequent or repetitive queries (e.g., common customer support prompts) reduces redundant model calls. Using newer model versions, such as switching from Titan-v1 to Titan-v2, might offer better token efficiency due to architectural improvements. Regional selection matters too: deploying your application in the same AWS Region as your Bedrock endpoint minimizes network latency. Finally, monitoring CloudWatch metrics like `InvocationLatency` and `Invocations` (published under the `AWS/Bedrock` namespace) helps identify bottlenecks, as in the sketch below. For example, if latency spikes or throttling errors appear at 30 RPS, you have likely hit the on-demand quota for that model, signaling a need to request a quota increase or provision dedicated capacity. Combining these strategies lets you balance cost and performance based on your application's needs.
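A sketch of pulling Bedrock's invocation-latency metric from CloudWatch with boto3; the region, model ID, and time window are placeholders. A sustained jump in average latency, or the appearance of `InvocationThrottles` datapoints, usually marks the point where more capacity is needed.

```python
from datetime import datetime, timedelta, timezone

import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region

end = datetime.now(timezone.utc)
stats = cw.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InvocationLatency",  # reported in milliseconds
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-v2"}],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=300,  # 5-minute buckets
    Statistics=["Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"],
          f"avg={point['Average']:.0f} ms",
          f"max={point['Maximum']:.0f} ms")
```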
Zilliz Cloud is a managed vector database built on Milvus, perfect for building GenAI applications.