Use the table below to plan throughput for each Heroku AI model. Limits apply to both US (us-east-1) and EU (eu-central-1) regions unless otherwise noted.
| Model | Requests / min | Tokens / min | Notes |
|---|---|---|---|
| Claude 4.5 Sonnet | 150 | 800,000 | Supports extended reasoning and prompt caching (system, tools). |
| Claude 4 Sonnet | 150 | 800,000 | Supports extended reasoning and prompt caching (system, tools). |
| Claude 3.7 Sonnet | 150 | 800,000 | Supports extended reasoning and prompt caching (system, tools). |
| Claude 3.5 Sonnet (Latest) | 150 | 800,000 | Prompt caching available for system and tools. |
| Claude 3.5 Haiku | 200 | 800,000 | Prompt caching available for system and tools. |
| Claude 3 Haiku | 250 | 800,000 | Fastest tier for high-volume workloads. |
| GPT-OSS 120B | 200 | 800,000 | Open-weight model hosted via Heroku AI. |
| Nova Pro | 150 | 800,000 | Prompt caching available for system. |
| Nova Lite | 150 | 800,000 | Prompt caching available for system. |
| Cohere Embed Multilingual | 500 | 800,000 | Applies to embedding tokens; batch up to 96 inputs per request. |
| Stable Image Ultra | 20 | N/A | Limit measured per image generation request. |
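To plan throughput against the table above, compare your projected load with both the request and token ceilings, since either can throttle you first. A minimal sketch (the `fits_limits` helper and the workload numbers are illustrative, not part of any Heroku API):

```python
# Rough capacity check: does a projected workload stay under both the
# per-minute request limit and the per-minute token limit?
# Defaults match the Claude Sonnet tiers in the table above.

def fits_limits(requests_per_min, avg_tokens_per_request,
                request_limit=150, token_limit=800_000):
    """Return True if the projected load stays under both limits."""
    tokens_per_min = requests_per_min * avg_tokens_per_request
    return requests_per_min <= request_limit and tokens_per_min <= token_limit

# 120 requests/min at ~5,000 tokens each -> 600,000 tokens/min: fits.
print(fits_limits(120, 5_000))  # True
# 120 requests/min at ~8,000 tokens each -> 960,000 tokens/min: token-bound.
print(fits_limits(120, 8_000))  # False
```

Note that a workload can sit comfortably under the request limit and still be throttled on tokens, which is common for long-context prompts.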

Rate Limit Headers

Token and request rate limits are applied per minute using a sliding window. Check these headers on API responses to see the current state of your inference add-on's rate limits:
| Header | Description | Example |
|---|---|---|
| x-ratelimit-limit-requests | Limit on requests per minute | 200 |
| x-ratelimit-limit-tokens | Limit on tokens per minute | 800000 |
| x-ratelimit-remaining-requests | Remaining requests permitted before reaching rate limit | 198 |
| x-ratelimit-remaining-tokens | Remaining tokens permitted before reaching rate limit | 799892 |
| x-ratelimit-reset-requests | Time until more request capacity becomes available | 51s |
| x-ratelimit-reset-tokens | Time until more token capacity becomes available | 51s |
The reset headers are calculated with a one-minute sliding window. As entries expire, additional capacity releases gradually rather than all at once. The returned time value indicates when the oldest entry expires in the sliding window.
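The gradual-release behavior can be illustrated with a toy sliding-window limiter (a simplified sketch of the mechanism described above, not Heroku's implementation):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Toy sliding-window rate limiter: capacity frees up gradually as
    entries older than the window expire, rather than all at once."""

    def __init__(self, limit, window=60.0):
        self.limit = limit
        self.window = window
        self.entries = deque()  # timestamps of accepted requests

    def _expire(self, now):
        # Drop entries that have aged out of the window.
        while self.entries and now - self.entries[0] >= self.window:
            self.entries.popleft()

    def try_acquire(self, now=None):
        now = time.monotonic() if now is None else now
        self._expire(now)
        if len(self.entries) < self.limit:
            self.entries.append(now)
            return True
        return False

    def seconds_until_capacity(self, now=None):
        """Time until the oldest entry expires -- the 'reset' value."""
        now = time.monotonic() if now is None else now
        self._expire(now)
        if len(self.entries) < self.limit:
            return 0.0
        return self.window - (now - self.entries[0])
```

With `limit=2` and a 60-second window, two requests at t=0 and t=10 exhaust capacity; at t=20 the reset value is 40 seconds, because the t=0 entry expires at t=60.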

Reading Rate Limit Headers

import os
import requests

response = requests.post(
    f"{os.getenv('INFERENCE_URL')}/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.getenv('INFERENCE_KEY')}",
        "Content-Type": "application/json"
    },
    json={
        "model": os.getenv("INFERENCE_MODEL_ID"),
        "messages": [{"role": "user", "content": "Hello!"}]
    }
)

# Check rate limit status
print(f"Requests remaining: {response.headers.get('x-ratelimit-remaining-requests')}")
print(f"Tokens remaining: {response.headers.get('x-ratelimit-remaining-tokens')}")
print(f"Request limit resets in: {response.headers.get('x-ratelimit-reset-requests')}")
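Beyond printing the headers, a client can proactively pause when capacity runs low. A hedged sketch follows: the `parse_reset` and `wait_if_throttled` helpers are illustrative, and the parser assumes the simple `<n>s` format shown in the example column above.

```python
import re
import time

def parse_reset(value):
    """Parse a reset header such as '51s' into seconds.
    Assumes the '<n>s' format shown in the header table; returns
    None if the value is missing or in an unexpected format."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)s", value or "")
    return float(match.group(1)) if match else None

def wait_if_throttled(headers, threshold=5):
    """Sleep until more capacity is available when the remaining
    request budget drops below `threshold`."""
    remaining = int(headers.get("x-ratelimit-remaining-requests", "0"))
    if remaining < threshold:
        delay = parse_reset(headers.get("x-ratelimit-reset-requests"))
        if delay:
            time.sleep(delay)

# Example: wait_if_throttled(response.headers) after each request.
```

Because the window slides, sleeping for the full reset value is conservative; only the oldest entry's capacity is released at that moment, so heavy workloads may still need to retry.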

Tips

  • Prompt caching: When available, cache your system and tool definitions to reduce billed token usage and improve performance. See Prompt Caching for details.
  • Scaling beyond limits: Contact Heroku Support if your production workload consistently approaches these thresholds.
  • Regional routing: Deploy workloads in the region closest to your users. The limits above apply per region, so running in both US and EU doubles the overall headroom.