When interacting with the Heroku AI API, you may encounter errors due to invalid requests, authentication issues, rate limiting, or server-side problems. This guide explains how to interpret error responses, diagnose common issues, and implement robust error handling in your applications. All Heroku AI API errors follow a consistent format and include information to help you identify and resolve the issue. Understanding these errors is essential for building production-ready applications that gracefully handle edge cases and failures.

Error Response Format

When an error occurs, the API returns a JSON response with the following structure:
{
  "error": {
    "code": 400,
    "message": "Invalid request: 'messages' field is required",
    "type": "invalid_request_error"
  }
}
The error object contains three fields:
Field      Type     Description
code       integer  The HTTP status code (e.g., 400, 401, 429)
message    string   A human-readable description of what went wrong
type       string   A machine-readable error category for programmatic handling
The type field helps you categorize errors in your code. Common types include:
  • invalid_request_error - The request was malformed or missing required fields
  • authentication_error - The API key is invalid or missing
  • authorization_error - The API key doesn’t have access to the requested resource
  • rate_limit_error - Too many requests in a given time period
  • server_error - An internal error occurred on Heroku’s servers
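A minimal sketch of categorizing errors by the type field when parsing an error payload. The classification of which types are retryable follows the descriptions above; the function and variable names are illustrative, not part of the API:

```python
import json

# Error types that are typically safe to retry, per the list above.
RETRYABLE_TYPES = {"rate_limit_error", "server_error"}

def classify_error(payload: str) -> dict:
    """Parse an error response body and decide whether to retry."""
    err = json.loads(payload)["error"]
    return {
        "code": err["code"],
        "type": err["type"],
        "retryable": err["type"] in RETRYABLE_TYPES,
    }

result = classify_error(
    '{"error": {"code": 429, "message": "Rate limit exceeded", '
    '"type": "rate_limit_error"}}'
)
print(result)  # {'code': 429, 'type': 'rate_limit_error', 'retryable': True}
```

Branching on type rather than parsing the message string keeps your handling stable even if message wording changes.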

HTTP Status Codes

400 Bad Request

A 400 error indicates that your request was malformed or contained invalid parameters. The API could not process the request because something in the request body, query parameters, or headers was incorrect. Common causes:
  • Missing required fields (model, messages)
  • Invalid JSON syntax in the request body
  • Parameter values outside allowed ranges (e.g., temperature > 1.0)
  • Model names that don’t exist or are misspelled
  • Invalid message format or role values
Example error response:
{
  "error": {
    "code": 400,
    "message": "Invalid request: 'model' field is required",
    "type": "invalid_request_error"
  }
}
How to diagnose: First, validate your request JSON is well-formed. You can use a JSON validator or test with a minimal request:
import json
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("INFERENCE_URL") + "/v1",
    api_key=os.getenv("INFERENCE_KEY")
)

# Minimal valid request to test connectivity
try:
    response = client.chat.completions.create(
        model="claude-4-5-sonnet",
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=10
    )
    print("Request successful:", response.choices[0].message.content)
except Exception as e:
    print(f"Error: {e}")
Resolution: Check the error message for specific details about which field is invalid. Common fixes include:
  • Ensure model matches an available model ID (e.g., claude-4-5-sonnet, not claude-4.5-sonnet)
  • Verify messages is an array with at least one message object
  • Confirm each message has both role and content fields
  • Check that numeric parameters are within valid ranges
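The checks above can also be run client-side before sending a request. This pre-flight validator is a sketch mirroring the list, under the assumption that temperature must be between 0.0 and 1.0 as noted earlier; adjust ranges to your model's documented limits:

```python
def validate_request(body: dict) -> list[str]:
    """Return a list of problems that would likely trigger a 400."""
    problems = []
    if not body.get("model"):
        problems.append("'model' field is required")
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("'messages' must be a non-empty array")
    else:
        for i, msg in enumerate(messages):
            if "role" not in msg or "content" not in msg:
                problems.append(f"message {i} needs both 'role' and 'content'")
    temperature = body.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        problems.append("'temperature' must be between 0.0 and 1.0")
    return problems

print(validate_request({"messages": [{"role": "user"}]}))
# ["'model' field is required", "message 0 needs both 'role' and 'content'"]
```

Failing fast locally gives clearer diagnostics than round-tripping a malformed request to the API.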

401 Unauthorized

A 401 error means the API could not authenticate your request. This typically indicates a problem with your API key. Common causes:
  • Missing Authorization header
  • API key is malformed or contains extra whitespace
  • API key has been regenerated and the old key is no longer valid
  • Using the wrong environment’s API key (production vs. staging)
Example error response:
{
  "error": {
    "code": 401,
    "message": "Invalid API key provided",
    "type": "authentication_error"
  }
}
How to diagnose: Verify your API key is correctly configured:
import os

# Check if the key is set and has expected format
key = os.getenv("INFERENCE_KEY")
if not key:
    print("ERROR: INFERENCE_KEY environment variable is not set")
elif not key.startswith("inf-"):
    print(f"WARNING: Key doesn't start with 'inf-'. First 8 chars: {key[:8]}...")
else:
    print(f"Key format looks correct. First 8 chars: {key[:8]}...")
    print(f"Key length: {len(key)} characters")
Resolution:
  1. Retrieve a fresh API key from your Heroku app:
    heroku config:get INFERENCE_KEY -a your-app-name
    
  2. Ensure there’s no whitespace or newline characters in your key:
    export INFERENCE_KEY=$(heroku config:get INFERENCE_KEY -a your-app-name | tr -d '[:space:]')
    
  3. If using the OpenAI SDK, verify the key is being passed correctly:
    # Correct
    client = OpenAI(api_key=os.getenv("INFERENCE_KEY"), ...)
    
    # Common mistake - hardcoding with typos
    client = OpenAI(api_key="inf-abc123...", ...)  # May have copy-paste errors
    

403 Forbidden

A 403 error indicates your API key is valid but doesn’t have permission to access the requested resource. This is different from 401—your key authenticated successfully, but authorization failed. Common causes:
  • Requesting a model that isn’t provisioned for your app
  • API key doesn’t have access to the specific model tier
  • Attempting to access resources belonging to a different organization or app
Example error response:
{
  "error": {
    "code": 403,
    "message": "You do not have access to that model",
    "type": "authorization_error"
  }
}
How to diagnose: Check which models are provisioned for your app:
# List all inference add-ons attached to your app
heroku addons -a your-app-name | grep inference

# Check the specific model ID configured
heroku config:get INFERENCE_MODEL_ID -a your-app-name
Resolution:
  1. Use the model that matches your provisioned add-on:
    # Get the model from environment, not hardcoded
    model = os.getenv("INFERENCE_MODEL_ID")
    
    response = client.chat.completions.create(
        model=model,  # Use the provisioned model
        messages=[{"role": "user", "content": "Hello"}]
    )
    
  2. Provision the model you need:
    heroku ai:models:create claude-4-5-sonnet -a your-app-name
    
  3. If using multiple models, use the correct INFERENCE_KEY for each:
    # Different models have different keys
    export INFERENCE_KEY_HAIKU=$(heroku config:get HEROKU_INFERENCE_JADE_KEY -a your-app-name)
    export INFERENCE_KEY_SONNET=$(heroku config:get INFERENCE_KEY -a your-app-name)
    

429 Too Many Requests

A 429 error means you’ve exceeded the rate limits for your model. Heroku AI enforces both requests-per-minute and tokens-per-minute limits to ensure fair usage and system stability. Rate limits by model:
Model                Requests/min   Tokens/min
Claude 4.5 Sonnet    150            800,000
Claude 4 Sonnet      150            800,000
Claude 3.5 Haiku     200            800,000
Nova Pro / Lite      150            800,000
Stable Image Ultra   20             N/A
Example error response:
{
  "error": {
    "code": 429,
    "message": "Rate limit exceeded. Please retry after 12 seconds.",
    "type": "rate_limit_error"
  }
}
Response headers: When approaching or exceeding limits, the API returns headers to help you manage your request rate:
Header                           Description
X-RateLimit-Limit-Requests       Maximum requests allowed per minute
X-RateLimit-Remaining-Requests   Requests remaining in current window
X-RateLimit-Reset-Requests       Unix timestamp when the request limit resets
Retry-After                      Seconds to wait before retrying (on 429 responses)
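With the OpenAI SDK you can read these headers through the with_raw_response wrapper. A sketch of parsing them into integers (the demo dictionary uses exact-case header names; real HTTP headers are matched case-insensitively by httpx):

```python
# Parse the rate-limit headers into ints; header names are from the table above.
def parse_rate_limit_headers(headers: dict) -> dict:
    def to_int(name):
        value = headers.get(name)
        return int(value) if value is not None else None
    return {
        "limit": to_int("X-RateLimit-Limit-Requests"),
        "remaining": to_int("X-RateLimit-Remaining-Requests"),
        "reset_at": to_int("X-RateLimit-Reset-Requests"),
    }

# With a live client, raw headers are available via with_raw_response, e.g.:
#   raw = client.chat.completions.with_raw_response.create(model=..., messages=...)
#   stats = parse_rate_limit_headers(raw.headers)
#   completion = raw.parse()

print(parse_rate_limit_headers({
    "X-RateLimit-Limit-Requests": "150",
    "X-RateLimit-Remaining-Requests": "149",
}))  # {'limit': 150, 'remaining': 149, 'reset_at': None}
```

Checking the remaining count proactively lets you throttle before a 429 occurs rather than after.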
How to diagnose: Track your request rate and token usage:
import time
from collections import deque

class RateLimitTracker:
    def __init__(self, window_seconds=60):
        self.requests = deque()
        self.tokens = deque()
        self.window = window_seconds

    def log_request(self, tokens_used):
        now = time.time()
        self.requests.append(now)
        self.tokens.append((now, tokens_used))

        # Clean old entries
        cutoff = now - self.window
        while self.requests and self.requests[0] < cutoff:
            self.requests.popleft()
        while self.tokens and self.tokens[0][0] < cutoff:
            self.tokens.popleft()

    def get_stats(self):
        total_tokens = sum(t[1] for t in self.tokens)
        return {
            "requests_last_minute": len(self.requests),
            "tokens_last_minute": total_tokens
        }

tracker = RateLimitTracker()
Resolution: Implement exponential backoff with jitter (see Retry Strategies below) and consider these optimization strategies:
  1. Batch requests: For embeddings, send up to 96 inputs per request instead of one at a time
  2. Use prompt caching: Cache system prompts and tool definitions to reduce token usage
  3. Queue requests: Implement a request queue that respects rate limits
  4. Choose appropriate models: Use Claude 3.5 Haiku for high-volume, latency-sensitive workloads
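Strategy 1 (batching) amounts to chunking inputs into groups of at most 96 before calling the embeddings endpoint. A sketch, where EMBEDDING_MODEL_ID is a hypothetical environment variable name for your provisioned embeddings model:

```python
def chunk(items, size=96):
    """Split a list of inputs into batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

texts = [f"document {i}" for i in range(250)]
batches = chunk(texts)
print([len(b) for b in batches])  # [96, 96, 58]

# Each batch then becomes a single embeddings request (sketch):
#   for batch in batches:
#       client.embeddings.create(model=os.getenv("EMBEDDING_MODEL_ID"), input=batch)
```

Batching 250 inputs into 3 requests instead of 250 reduces your request count by roughly 99%, which directly helps with the requests-per-minute limit.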

500 Internal Server Error

A 500 error indicates an unexpected problem on Heroku’s servers. These errors are not caused by your request and are typically transient. Example error response:
{
  "error": {
    "code": 500,
    "message": "An internal error occurred. Please try again.",
    "type": "server_error"
  }
}
How to handle: 500 errors are usually safe to retry. Implement exponential backoff:
import time
import random
from openai import OpenAI, APIError

def make_request_with_retry(client, max_retries=3, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except APIError as e:
            # Not every APIError carries a status code (e.g. connection errors),
            # so read it defensively
            status = getattr(e, "status_code", None)
            if status == 500 and attempt < max_retries - 1:
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"Server error, retrying in {wait_time:.1f}s...")
                time.sleep(wait_time)
            else:
                raise
If 500 errors persist for more than a few minutes, check the Heroku Status page and contact support if needed.

503 Service Unavailable

A 503 error indicates the service is temporarily unavailable, usually due to high load or maintenance. Example error response:
{
  "error": {
    "code": 503,
    "message": "Service temporarily unavailable. Please try again later.",
    "type": "server_error"
  }
}
How to handle: Like 500 errors, 503 errors are transient and safe to retry. However, use longer backoff intervals since the service may need time to recover:
def make_request_with_retry(client, max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except APIError as e:
            status = getattr(e, "status_code", None)
            if status in (500, 503) and attempt < max_retries - 1:
                # Longer backoff for 503 since the service may need time to recover
                base_wait = 4 if status == 503 else 2
                wait_time = (base_wait ** attempt) + random.uniform(0, 1)
                print(f"Service unavailable, retrying in {wait_time:.1f}s...")
                time.sleep(wait_time)
            else:
                raise

Retry Strategies

Exponential Backoff

Exponential backoff is the recommended strategy for handling transient errors. The idea is to wait progressively longer between retries, reducing load on the server while eventually succeeding.
import time
import random
from openai import OpenAI, APIError, RateLimitError
import os

def create_completion_with_retry(
    client: OpenAI,
    max_retries: int = 5,
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    **kwargs
):
    """
    Make a chat completion request with exponential backoff retry logic.

    Args:
        client: OpenAI client configured for Heroku AI
        max_retries: Maximum number of retry attempts
        initial_delay: Initial delay in seconds before first retry
        max_delay: Maximum delay between retries
        **kwargs: Arguments passed to chat.completions.create()

    Returns:
        ChatCompletion response

    Raises:
        The last exception if all retries fail
    """
    last_exception = None

    for attempt in range(max_retries + 1):
        try:
            response = client.chat.completions.create(**kwargs)
            return response

        except RateLimitError as e:
            last_exception = e
            if attempt == max_retries:
                raise

            # Use the Retry-After header if present, otherwise calculate backoff
            retry_after = e.response.headers.get("retry-after")
            if retry_after:
                delay = float(retry_after)
            else:
                delay = min(initial_delay * (2 ** attempt), max_delay)

            # Add jitter to prevent thundering herd
            delay += random.uniform(0, delay * 0.1)

            print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)

        except APIError as e:
            last_exception = e

            # Only retry on transient server errors; errors without a status
            # code (e.g. connection failures) are re-raised immediately
            if getattr(e, "status_code", None) not in [500, 502, 503, 504]:
                raise

            if attempt == max_retries:
                raise

            delay = min(initial_delay * (2 ** attempt), max_delay)
            delay += random.uniform(0, delay * 0.1)

            print(f"Server error ({e.status_code}). Retrying in {delay:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)

    raise last_exception


# Usage example
client = OpenAI(
    base_url=os.getenv("INFERENCE_URL") + "/v1",
    api_key=os.getenv("INFERENCE_KEY")
)

response = create_completion_with_retry(
    client,
    model=os.getenv("INFERENCE_MODEL_ID", "claude-4-5-sonnet"),
    messages=[
        {"role": "user", "content": "Explain exponential backoff in one paragraph."}
    ],
    max_tokens=200
)

print(response.choices[0].message.content)
Why jitter matters: When many clients hit a rate limit simultaneously, they might all retry at exactly the same time, causing another wave of rate limiting. Adding random jitter (a small random delay) spreads out the retries and prevents this “thundering herd” problem.
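The effect is easy to see by comparing a deterministic schedule (every client retries in lockstep) with "full jitter", a common variant that spreads each retry uniformly across the backoff interval. The parameters here are illustrative:

```python
import random

def backoff_delays(attempts=5, base=1.0, cap=60.0, jitter=True):
    """Compute a backoff schedule; with jitter, each delay is randomized."""
    delays = []
    for attempt in range(attempts):
        delay = min(base * (2 ** attempt), cap)
        if jitter:
            delay = random.uniform(0, delay)  # "full jitter"
        delays.append(delay)
    return delays

print(backoff_delays(jitter=False))  # [1.0, 2.0, 4.0, 8.0, 16.0]
print([round(d, 2) for d in backoff_delays()])  # randomized each run
```

Two clients using the deterministic schedule retry at identical moments; with jitter their retries land at different times, smoothing the load spike.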

Request Queuing

For high-throughput applications, implement a request queue that respects rate limits:
import asyncio
from collections import deque
from datetime import datetime, timedelta

class RateLimitedQueue:
    def __init__(self, requests_per_minute: int = 150):
        self.requests_per_minute = requests_per_minute
        self.request_times = deque()
        self.lock = asyncio.Lock()

    async def wait_for_capacity(self):
        async with self.lock:
            now = datetime.now()
            cutoff = now - timedelta(minutes=1)

            # Remove old request times
            while self.request_times and self.request_times[0] < cutoff:
                self.request_times.popleft()

            # If at capacity, wait until oldest request expires
            if len(self.request_times) >= self.requests_per_minute:
                wait_until = self.request_times[0] + timedelta(minutes=1)
                wait_seconds = (wait_until - now).total_seconds()
                if wait_seconds > 0:
                    await asyncio.sleep(wait_seconds)

            # Record this request
            self.request_times.append(datetime.now())
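A hypothetical driver for this queue might look like the following. The class is repeated so the sketch runs standalone; in a real application the worker body would be your API call:

```python
import asyncio
from collections import deque
from datetime import datetime, timedelta

# RateLimitedQueue as defined above, repeated so this snippet is self-contained.
class RateLimitedQueue:
    def __init__(self, requests_per_minute: int = 150):
        self.requests_per_minute = requests_per_minute
        self.request_times = deque()
        self.lock = asyncio.Lock()

    async def wait_for_capacity(self):
        async with self.lock:
            now = datetime.now()
            cutoff = now - timedelta(minutes=1)
            while self.request_times and self.request_times[0] < cutoff:
                self.request_times.popleft()
            if len(self.request_times) >= self.requests_per_minute:
                wait_until = self.request_times[0] + timedelta(minutes=1)
                wait_seconds = (wait_until - datetime.now()).total_seconds()
                if wait_seconds > 0:
                    await asyncio.sleep(wait_seconds)
            self.request_times.append(datetime.now())

async def worker(queue: RateLimitedQueue, task_id: int) -> int:
    await queue.wait_for_capacity()
    # In a real application, make the API call here, e.g.:
    #   client.chat.completions.create(...)
    return task_id

async def main() -> list[int]:
    queue = RateLimitedQueue(requests_per_minute=150)
    return await asyncio.gather(*(worker(queue, i) for i in range(10)))

print(asyncio.run(main()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Note that the queue holds its lock while sleeping, which serializes waiting tasks in arrival order; that is a deliberate simplification here, at the cost of some concurrency while the queue is saturated.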

Debugging Tips

Using Request IDs

Every API response includes a unique request ID in the X-Request-ID header. Include this ID when contacting support:
try:
    response = client.chat.completions.create(
        model="claude-4-5-sonnet",
        messages=[{"role": "user", "content": "Hello"}]
    )
except APIError as e:
    print(f"Error: {e}")
    # request_id is populated on errors that received an HTTP response
    print(f"Request ID: {getattr(e, 'request_id', None)}")  # Include this in support requests

Logging Recommendations

Enable detailed logging during development to diagnose issues:
import logging
import httpx

# Enable HTTP-level logging
logging.basicConfig(level=logging.DEBUG)
httpx_logger = logging.getLogger("httpx")
httpx_logger.setLevel(logging.DEBUG)

# Or use the OpenAI SDK's built-in logging (openai-python v1.x reads the
# OPENAI_LOG environment variable; set it before any requests are made)
import os
os.environ["OPENAI_LOG"] = "debug"

When to Contact Support

Contact Heroku Support if you experience:
  • Persistent 500/503 errors lasting more than 15 minutes
  • Rate limit errors when your usage is well below documented limits
  • Authentication errors with keys that previously worked
  • Unexpected model behavior that differs from documentation
Include the following in your support request:
  1. Request ID from the error response
  2. Timestamp of when the issue occurred (with timezone)
  3. The exact error message and status code
  4. A minimal code example that reproduces the issue
  5. Your app name and region

Additional Resources