Use the table below to plan throughput for each Heroku AI model. Limits apply to both US (us-east-1) and EU (eu-central-1) regions unless otherwise noted.
| Model | Requests / min | Tokens / min | Notes |
|---|---|---|---|
| Claude 4.5 Sonnet | 150 | 800,000 | Supports extended reasoning and prompt caching (system, tools). |
| Claude 4 Sonnet | 150 | 800,000 | Supports extended reasoning and prompt caching (system, tools). |
| Claude 3.7 Sonnet | 150 | 800,000 | Supports extended reasoning and prompt caching (system, tools). |
| Claude 3.5 Sonnet (Latest) | 150 | 800,000 | Prompt caching available for system and tools. |
| Claude 3.5 Haiku | 200 | 800,000 | Prompt caching available for system and tools. |
| Claude 3 Haiku | 250 | 800,000 | Fastest tier for high-volume workloads. |
| GPT-OSS 120B | 200 | 800,000 | Open-weight model hosted via Heroku AI. |
| Nova Pro | 150 | 800,000 | Prompt caching available for system. |
| Nova Lite | 150 | 800,000 | Prompt caching available for system. |
| Cohere Embed Multilingual | 500 | 800,000 | Applies to embedding tokens; batch up to 96 inputs per request. |
| Stable Image Ultra | 20 | N/A | Limit measured per image generation request. |
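The Cohere Embed Multilingual row caps each request at 96 inputs, so embedding a larger corpus requires client-side batching. A minimal sketch of that chunking step (the `chunk_inputs` helper is ours, not part of any Heroku SDK):

```python
def chunk_inputs(texts, batch_size=96):
    """Split a list of inputs into batches no larger than the per-request cap."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

# 250 documents fit in three requests: batches of 96, 96, and 58.
batches = chunk_inputs([f"doc {i}" for i in range(250)])
```

Each batch then becomes one request against the 500 requests/min budget, so batching also stretches your request-rate headroom.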
Token and request rate limits are per minute and calculated with a sliding window. Inspect these headers in API responses to check the current state of your inference add-on’s rate limits:
| Header | Description | Example |
|---|---|---|
| x-ratelimit-limit-requests | Limit on requests per minute | 200 |
| x-ratelimit-limit-tokens | Limit on tokens per minute | 800000 |
| x-ratelimit-remaining-requests | Requests remaining before you reach the rate limit | 198 |
| x-ratelimit-remaining-tokens | Tokens remaining before you reach the rate limit | 799892 |
| x-ratelimit-reset-requests | Time until more request capacity becomes available | 51s |
| x-ratelimit-reset-tokens | Time until more token capacity becomes available | 51s |
The reset headers are calculated with a one-minute sliding window. As entries expire, additional capacity releases gradually rather than all at once. The returned time value indicates when the oldest entry expires in the sliding window.
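To build intuition for how capacity frees up gradually, here is a toy model of a one-minute sliding window. This is an illustration only, not Heroku's implementation; the class and method names are ours:

```python
from collections import deque

class SlidingWindowCounter:
    """Toy one-minute sliding window: each accepted request records a
    timestamp, and capacity frees up as entries older than the window expire."""

    def __init__(self, limit, window=60.0):
        self.limit = limit
        self.window = window
        self.entries = deque()  # timestamps of accepted requests, oldest first

    def _evict(self, now):
        while self.entries and now - self.entries[0] >= self.window:
            self.entries.popleft()

    def try_acquire(self, now):
        """Accept a request at time `now` if the window has capacity."""
        self._evict(now)
        if len(self.entries) < self.limit:
            self.entries.append(now)
            return True
        return False

    def seconds_until_next_slot(self, now):
        """Mirrors the reset headers: time until the oldest entry expires."""
        self._evict(now)
        if len(self.entries) < self.limit:
            return 0.0
        return self.window - (now - self.entries[0])

limiter = SlidingWindowCounter(limit=150)  # e.g. a 150 requests/min model
```

Note that a request admitted at second 5 only releases its slot at second 65, which is why capacity returns in a trickle rather than all at once at the top of the minute.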
```python
import os

import requests

response = requests.post(
    f"{os.getenv('INFERENCE_URL')}/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.getenv('INFERENCE_KEY')}",
        "Content-Type": "application/json",
    },
    json={
        "model": os.getenv("INFERENCE_MODEL_ID"),
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)

# Check rate limit status
print(f"Requests remaining: {response.headers.get('x-ratelimit-remaining-requests')}")
print(f"Tokens remaining: {response.headers.get('x-ratelimit-remaining-tokens')}")
print(f"Request limit resets in: {response.headers.get('x-ratelimit-reset-requests')}")
```
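When a request is rejected with HTTP 429, the reset header tells you how long to wait before retrying. A minimal retry sketch; the helper name and signature are ours, not part of any Heroku SDK:

```python
import time

def with_rate_limit_retry(send, max_retries=3, sleep=time.sleep):
    """Retry a request when the API answers 429 Too Many Requests.

    `send` is any zero-argument callable returning a response object with
    `.status_code` and `.headers` (e.g. a lambda wrapping requests.post).
    """
    resp = send()
    for _ in range(max_retries):
        if resp.status_code != 429:
            break
        # The reset header uses the form '51s': wait until the oldest
        # sliding-window entry expires, then try again.
        reset = resp.headers.get("x-ratelimit-reset-requests", "1s")
        sleep(float(reset.rstrip("s")))
        resp = send()
    return resp
```

Injecting `sleep` as a parameter keeps the helper testable; production code can also add jitter so that many clients don't all retry at the same instant.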
```javascript
const response = await fetch(
  `${process.env.INFERENCE_URL}/v1/chat/completions`,
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.INFERENCE_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: process.env.INFERENCE_MODEL_ID,
      messages: [{ role: 'user', content: 'Hello!' }]
    })
  }
);

// Check rate limit status
console.log('Requests remaining:', response.headers.get('x-ratelimit-remaining-requests'));
console.log('Tokens remaining:', response.headers.get('x-ratelimit-remaining-tokens'));
console.log('Request limit resets in:', response.headers.get('x-ratelimit-reset-requests'));
```
```bash
curl -i https://us.inference.heroku.com/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-sonnet",
    "messages": [{"role": "user", "content": "Hello!"}]
  }' 2>&1 | grep -i "x-ratelimit"
```
Tips
- Prompt caching: When available, cache your system prompt and tool definitions to reduce billed token usage and improve performance. See Prompt Caching for details.
- Scaling beyond limits: Contact Heroku Support if your production workload consistently approaches these thresholds.
- Regional routing: Deploy workloads in the region closest to your users. The limits above apply per region, so running in both US and EU doubles your overall headroom.