

What is the typical throughput (requests per second or tokens per second) one can expect from Bedrock for a given model, and can this throughput be increased through any configuration?

The typical throughput for a model in AWS Bedrock depends on the specific model, the throughput mode (on-demand versus provisioned), and your workload. For example, models like Claude or Jurassic-2 might handle between 10-50 requests per second (RPS) or 100-500 tokens per second (TPS) under on-demand settings, but these numbers can vary widely. Smaller models, such as those optimized for low-latency tasks, may achieve higher RPS but process fewer tokens per request. Larger models, like those designed for complex reasoning, often prioritize accuracy over speed, resulting in lower throughput. Amazon does not publish exact benchmarks, as performance depends on factors like input/output token counts, payload size, and regional load. Testing with your specific workload is critical: running a load test with realistic prompts will give the most accurate baseline for your use case.
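A minimal load-test sketch might look like the following. The `load_test` and `summarize_run` helpers are illustrative names, not part of any AWS SDK; the commented-out wrapper shows how one could plug in a real `bedrock-runtime` call (the model ID shown is just an example).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def summarize_run(token_counts: list, wall_seconds: float) -> dict:
    """Compute effective requests/sec and output tokens/sec for a finished run."""
    return {
        "rps": len(token_counts) / wall_seconds,
        "tps": sum(token_counts) / wall_seconds,
    }


def load_test(invoke, prompts: list, concurrency: int = 8) -> dict:
    """Fire `prompts` at `invoke` concurrently and summarize throughput.

    `invoke` is any callable mapping a prompt string to an output token count.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(invoke, prompts))
    return summarize_run(token_counts, time.perf_counter() - start)


# A real invoke wrapper could be built on boto3's bedrock-runtime client, e.g.:
#
#   import boto3
#   rt = boto3.client("bedrock-runtime", region_name="us-east-1")
#   def invoke(prompt):
#       resp = rt.converse(
#           modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example ID
#           messages=[{"role": "user", "content": [{"text": prompt}]}],
#       )
#       return resp["usage"]["outputTokens"]
```

Running this against a few hundred realistic prompts, at the concurrency your application will actually use, gives a far more trustworthy baseline than any published figure.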

Throughput can be increased through configuration and capacity choices. Bedrock is a serverless service, so you do not select instance types; on-demand traffic is instead governed by per-model, per-Region quotas (requests per minute and tokens per minute), and many of these quotas can be raised by filing a request in the AWS Service Quotas console. For sustained or spiky workloads, enabling provisioned throughput, a paid feature that reserves dedicated capacity (purchased in model units) for a specific model, guarantees consistent performance during traffic spikes: a provisioned Claude model keeps serving your traffic at its reserved capacity even when the shared on-demand pool is saturated. Additionally, tuning inference parameters such as max_tokens (shorter outputs finish faster and free capacity sooner) and sending requests concurrently rather than serially can improve efficiency. If your application allows asynchronous processing, Bedrock's batch inference feature, which queues requests and processes them in bulk, can also increase effective throughput by reducing per-call overhead.
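When a burst of traffic does hit an on-demand quota, Bedrock returns throttling errors, and retrying with exponential backoff and jitter lets the burst drain without dropping requests. This is a generic sketch: `with_backoff` and `is_throttle` are illustrative names, and the boto3-specific check is shown only in a comment.

```python
import random
import time


def with_backoff(call, max_attempts=5, base=0.5, is_throttle=None):
    """Retry `call` with exponential backoff plus jitter while throttled.

    `is_throttle` decides whether an exception is retryable. With boto3 you
    would typically test something like:
        err.response["Error"]["Code"] == "ThrottlingException"
    By default every exception is treated as retryable.
    """
    if is_throttle is None:
        is_throttle = lambda e: True
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as e:
            if not is_throttle(e) or attempt == max_attempts - 1:
                raise
            # Double the wait each attempt; jitter avoids synchronized retries.
            time.sleep(base * (2 ** attempt) * (0.5 + random.random() / 2))
```

Backoff smooths over short spikes but does not raise your ceiling; if throttling is chronic, the fixes are a quota increase or provisioned throughput.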

Developers should also optimize their implementation. Caching frequent or repetitive queries (e.g., common customer support prompts) reduces redundant model calls. Using newer model versions, such as switching from Titan-v1 to Titan-v2, might offer better token efficiency due to architectural improvements. Regional selection matters too: deploying your application in the same AWS Region as your Bedrock endpoint minimizes network latency. Finally, monitoring Bedrock's CloudWatch metrics, such as InvocationLatency, Invocations, and InvocationThrottles, helps identify bottlenecks. For example, if InvocationThrottles climbs or your client starts receiving ThrottlingException errors at around 30 RPS, you have likely hit an on-demand quota for that model, signaling a need to request a quota increase or provision dedicated capacity. Combining these strategies lets you balance cost and performance based on your application's needs.
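The caching idea above can be sketched as a thin wrapper around whatever invoke function you already have. Everything here (`make_cached_invoke`, the normalization rule, the size bound) is an illustrative assumption, not a Bedrock feature:

```python
from collections import OrderedDict


def make_cached_invoke(invoke, maxsize=1024):
    """Wrap a prompt -> completion callable with a bounded in-memory cache.

    Prompts that differ only in whitespace or letter case hit the same entry;
    adjust the normalization to whatever counts as "the same question" in
    your application.
    """
    cache = OrderedDict()

    def cached(prompt):
        key = " ".join(prompt.split()).lower()  # normalize whitespace/case
        if key in cache:
            cache.move_to_end(key)  # mark as recently used
            return cache[key]
        out = invoke(prompt)
        cache[key] = out
        if len(cache) > maxsize:
            cache.popitem(last=False)  # evict the least-recently-used entry
        return out

    return cached
```

For repetitive traffic such as canned support questions, every cache hit is a model call you never make, which lowers both latency and spend without touching quotas at all.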
