When interacting with the Heroku AI API, you may encounter errors due to invalid requests, authentication issues, rate limiting, or server-side problems. This guide explains how to interpret error responses, diagnose common issues, and implement robust error handling in your applications. All Heroku AI API errors follow a consistent format and include information to help you identify and resolve the issue. Understanding these errors is essential for building production-ready applications that gracefully handle edge cases and failures.

Error Response Format

When an error occurs, the API returns a JSON response with the following structure:
{
  "error": {
    "code": 400,
    "message": "Invalid request: 'messages' field is required",
    "type": "invalid_request_error"
  }
}
The error object contains three fields:
Field      Type     Description
code       integer  The HTTP status code (e.g., 400, 401, 429)
message    string   A human-readable description of what went wrong
type       string   A machine-readable error category for programmatic handling
The type field helps you categorize errors in your code. Common types include:
  • invalid_request_error - The request was malformed or missing required fields
  • authentication_error - The API key is invalid or missing
  • authorization_error - The API key doesn’t have access to the requested resource
  • rate_limit_error - Too many requests in a given time period
  • server_error - An internal error occurred on Heroku’s servers
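A minimal sketch of categorizing errors by the type field when parsing an error payload. The classification of which types are retryable follows the descriptions above; the function and variable names are illustrative, not part of the API:

```python
import json

# Error types that are typically safe to retry, per the list above.
RETRYABLE_TYPES = {"rate_limit_error", "server_error"}

def classify_error(payload: str) -> dict:
    """Parse an error response body and decide whether to retry."""
    err = json.loads(payload)["error"]
    return {
        "code": err["code"],
        "type": err["type"],
        "retryable": err["type"] in RETRYABLE_TYPES,
    }

result = classify_error(
    '{"error": {"code": 429, "message": "Rate limit exceeded", '
    '"type": "rate_limit_error"}}'
)
print(result)  # {'code': 429, 'type': 'rate_limit_error', 'retryable': True}
```

Branching on type rather than parsing the message string keeps your handling stable even if message wording changes.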

HTTP Status Codes

400 Bad Request

A 400 error indicates that your request was malformed or contained invalid parameters. The API could not process the request because something in the request body, query parameters, or headers was incorrect. Common causes:
  • Missing required fields (model, messages)
  • Invalid JSON syntax in the request body
  • Parameter values outside allowed ranges (e.g., temperature > 1.0)
  • Model names that don’t exist or are misspelled
  • Invalid message format or role values
Example error response:
{
  "error": {
    "code": 400,
    "message": "Invalid request: 'model' field is required",
    "type": "invalid_request_error"
  }
}
How to diagnose: First, validate your request JSON is well-formed. You can use a JSON validator or test with a minimal request:
import json
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("INFERENCE_URL") + "/v1",
    api_key=os.getenv("INFERENCE_KEY")
)

# Minimal valid request to test connectivity
try:
    response = client.chat.completions.create(
        model="claude-4-5-sonnet",
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=10
    )
    print("Request successful:", response.choices[0].message.content)
except Exception as e:
    print(f"Error: {e}")
Resolution: Check the error message for specific details about which field is invalid. Common fixes include:
  • Ensure model matches an available model ID (e.g., claude-4-5-sonnet, not claude-4.5-sonnet)
  • Verify messages is an array with at least one message object
  • Confirm each message has both role and content fields
  • Check that numeric parameters are within valid ranges
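The checks above can also be run client-side before sending a request. This pre-flight validator is a sketch mirroring the list, under the assumption that temperature must be between 0.0 and 1.0 as noted earlier; adjust ranges to your model's documented limits:

```python
def validate_request(body: dict) -> list[str]:
    """Return a list of problems that would likely trigger a 400."""
    problems = []
    if not body.get("model"):
        problems.append("'model' field is required")
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("'messages' must be a non-empty array")
    else:
        for i, msg in enumerate(messages):
            if "role" not in msg or "content" not in msg:
                problems.append(f"message {i} needs both 'role' and 'content'")
    temperature = body.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        problems.append("'temperature' must be between 0.0 and 1.0")
    return problems

print(validate_request({"messages": [{"role": "user"}]}))
# ["'model' field is required", "message 0 needs both 'role' and 'content'"]
```

Failing fast locally gives clearer diagnostics than round-tripping a malformed request to the API.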

401 Unauthorized

A 401 error means the API could not authenticate your request. This typically indicates a problem with your API key. Common causes:
  • Missing Authorization header
  • API key is malformed or contains extra whitespace
  • API key has been regenerated and the old key is no longer valid
  • Using the wrong environment’s API key (production vs. staging)
Example error response:
{
  "error": {
    "code": 401,
    "message": "Invalid API key provided",
    "type": "authentication_error"
  }
}
How to diagnose: Verify your API key is correctly configured:
import os

# Check if the key is set and has expected format
key = os.getenv("INFERENCE_KEY")
if not key:
    print("ERROR: INFERENCE_KEY environment variable is not set")
elif not key.startswith("inf-"):
    print(f"WARNING: Key doesn't start with 'inf-'. First 8 chars: {key[:8]}...")
else:
    print(f"Key format looks correct. First 8 chars: {key[:8]}...")
    print(f"Key length: {len(key)} characters")
Resolution:
  1. Retrieve a fresh API key from your Heroku app:
    heroku config:get INFERENCE_KEY -a your-app-name
    
  2. Ensure there’s no whitespace or newline characters in your key:
    export INFERENCE_KEY=$(heroku config:get INFERENCE_KEY -a your-app-name | tr -d '[:space:]')
    
  3. If using the OpenAI SDK, verify the key is being passed correctly:
    # Correct
    client = OpenAI(api_key=os.getenv("INFERENCE_KEY"), ...)
    
    # Common mistake - hardcoding with typos
    client = OpenAI(api_key="inf-abc123...", ...)  # May have copy-paste errors
    

403 Forbidden

A 403 error indicates your API key is valid but doesn’t have permission to access the requested resource. This is different from 401—your key authenticated successfully, but authorization failed. Common causes:
  • Requesting a model that isn’t provisioned for your app
  • API key doesn’t have access to the specific model tier
  • Attempting to access resources belonging to a different organization or app
Example error response:
{
  "error": {
    "code": 403,
    "message": "You do not have access to that model",
    "type": "authorization_error"
  }
}
How to diagnose: Check which models are provisioned for your app:
# List all inference add-ons attached to your app
heroku addons -a your-app-name | grep inference

# Check the specific model ID configured
heroku config:get INFERENCE_MODEL_ID -a your-app-name
Resolution:
  1. Use the model that matches your provisioned add-on:
    # Get the model from environment, not hardcoded
    model = os.getenv("INFERENCE_MODEL_ID")
    
    response = client.chat.completions.create(
        model=model,  # Use the provisioned model
        messages=[{"role": "user", "content": "Hello"}]
    )
    
  2. Provision the model you need:
    heroku ai:models:create claude-4-5-sonnet -a your-app-name
    
  3. If using multiple models, use the correct INFERENCE_KEY for each:
    # Different models have different keys
    export INFERENCE_KEY_HAIKU=$(heroku config:get HEROKU_INFERENCE_JADE_KEY -a your-app-name)
    export INFERENCE_KEY_SONNET=$(heroku config:get INFERENCE_KEY -a your-app-name)
    

429 Too Many Requests

A 429 error means you’ve exceeded the rate limits for your model. Heroku AI enforces both requests-per-minute and tokens-per-minute limits to ensure fair usage and system stability. Rate limits by model:
Model                Requests/min   Tokens/min
Claude 4.5 Sonnet    150            800,000
Claude 4 Sonnet      150            800,000
Claude 3.5 Haiku     200            800,000
Nova Pro / Lite      150            800,000
Stable Image Ultra   20             N/A
Example error response:
{
  "error": {
    "code": 429,
    "message": "Rate limit exceeded. Please retry after 12 seconds.",
    "type": "rate_limit_error"
  }
}
Response headers: When approaching or exceeding limits, the API returns headers to help you manage your request rate:
Header                           Description
X-RateLimit-Limit-Requests       Maximum requests allowed per minute
X-RateLimit-Remaining-Requests   Requests remaining in current window
X-RateLimit-Reset-Requests       Unix timestamp when the request limit resets
Retry-After                      Seconds to wait before retrying (on 429 responses)
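With the OpenAI SDK you can read these headers through the with_raw_response wrapper. A sketch of parsing them into integers (the demo dictionary uses exact-case header names; real HTTP headers are matched case-insensitively by httpx):

```python
# Parse the rate-limit headers into ints; header names are from the table above.
def parse_rate_limit_headers(headers: dict) -> dict:
    def to_int(name):
        value = headers.get(name)
        return int(value) if value is not None else None
    return {
        "limit": to_int("X-RateLimit-Limit-Requests"),
        "remaining": to_int("X-RateLimit-Remaining-Requests"),
        "reset_at": to_int("X-RateLimit-Reset-Requests"),
    }

# With a live client, raw headers are available via with_raw_response, e.g.:
#   raw = client.chat.completions.with_raw_response.create(model=..., messages=...)
#   stats = parse_rate_limit_headers(raw.headers)
#   completion = raw.parse()

print(parse_rate_limit_headers({
    "X-RateLimit-Limit-Requests": "150",
    "X-RateLimit-Remaining-Requests": "149",
}))  # {'limit': 150, 'remaining': 149, 'reset_at': None}
```

Checking the remaining count proactively lets you throttle before a 429 occurs rather than after.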
How to diagnose: Track your request rate and token usage:
import time
from collections import deque

class RateLimitTracker:
    def __init__(self, window_seconds=60):
        self.requests = deque()
        self.tokens = deque()
        self.window = window_seconds

    def log_request(self, tokens_used):
        now = time.time()
        self.requests.append(now)
        self.tokens.append((now, tokens_used))

        # Clean old entries
        cutoff = now - self.window
        while self.requests and self.requests[0] < cutoff:
            self.requests.popleft()
        while self.tokens and self.tokens[0][0] < cutoff:
            self.tokens.popleft()

    def get_stats(self):
        total_tokens = sum(t[1] for t in self.tokens)
        return {
            "requests_last_minute": len(self.requests),
            "tokens_last_minute": total_tokens
        }

tracker = RateLimitTracker()
Resolution: Implement exponential backoff with jitter (see Retry Strategies below) and consider these optimization strategies:
  1. Batch requests: For embeddings, send up to 96 inputs per request instead of one at a time
  2. Use prompt caching: Cache system prompts and tool definitions to reduce token usage
  3. Queue requests: Implement a request queue that respects rate limits
  4. Choose appropriate models: Use Claude 3.5 Haiku for high-volume, latency-sensitive workloads
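Strategy 1 (batching) amounts to chunking inputs into groups of at most 96 before calling the embeddings endpoint. A sketch, where EMBEDDING_MODEL_ID is a hypothetical environment variable name for your provisioned embeddings model:

```python
def chunk(items, size=96):
    """Split a list of inputs into batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

texts = [f"document {i}" for i in range(250)]
batches = chunk(texts)
print([len(b) for b in batches])  # [96, 96, 58]

# Each batch then becomes a single embeddings request (sketch):
#   for batch in batches:
#       client.embeddings.create(model=os.getenv("EMBEDDING_MODEL_ID"), input=batch)
```

Batching 250 inputs into 3 requests instead of 250 reduces your request count by roughly 99%, which directly helps with the requests-per-minute limit.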

500 Internal Server Error

A 500 error indicates an unexpected problem on Heroku’s servers. These errors are not caused by your request and are typically transient. Example error response:
{
  "error": {
    "code": 500,
    "message": "An internal error occurred. Please try again.",
    "type": "server_error"
  }
}
How to handle: 500 errors are usually safe to retry. Implement exponential backoff:
import time
import random
from openai import OpenAI, APIError

def make_request_with_retry(client, max_retries=3, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except APIError as e:
            # Not every APIError carries a status code (e.g. connection errors),
            # so read it defensively
            status = getattr(e, "status_code", None)
            if status == 500 and attempt < max_retries - 1:
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"Server error, retrying in {wait_time:.1f}s...")
                time.sleep(wait_time)
            else:
                raise
If 500 errors persist for more than a few minutes, check the Heroku Status page and contact support if needed.

503 Service Unavailable

A 503 error indicates the service is temporarily unavailable, usually due to high load or maintenance. Example error response:
{
  "error": {
    "code": 503,
    "message": "Service temporarily unavailable. Please try again later.",
    "type": "server_error"
  }
}
How to handle: Like 500 errors, 503 errors are transient and safe to retry. However, use longer backoff intervals since the service may need time to recover:
def make_request_with_retry(client, max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except APIError as e:
            status = getattr(e, "status_code", None)
            if status in (500, 503) and attempt < max_retries - 1:
                # Longer backoff for 503 since the service may need time to recover
                base_wait = 4 if status == 503 else 2
                wait_time = (base_wait ** attempt) + random.uniform(0, 1)
                print(f"Service unavailable, retrying in {wait_time:.1f}s...")
                time.sleep(wait_time)
            else:
                raise

Retry Strategies

Exponential Backoff

Exponential backoff is the recommended strategy for handling transient errors. The idea is to wait progressively longer between retries, reducing load on the server while eventually succeeding.
import time
import random
from openai import OpenAI, APIError, RateLimitError
import os

def create_completion_with_retry(
    client: OpenAI,
    max_retries: int = 5,
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    **kwargs
):
    """
    Make a chat completion request with exponential backoff retry logic.

    Args:
        client: OpenAI client configured for Heroku AI
        max_retries: Maximum number of retry attempts
        initial_delay: Initial delay in seconds before first retry
        max_delay: Maximum delay between retries
        **kwargs: Arguments passed to chat.completions.create()

    Returns:
        ChatCompletion response

    Raises:
        The last exception if all retries fail
    """
    last_exception = None

    for attempt in range(max_retries + 1):
        try:
            response = client.chat.completions.create(**kwargs)
            return response

        except RateLimitError as e:
            last_exception = e
            if attempt == max_retries:
                raise

            # Use the Retry-After header if present, otherwise calculate backoff
            retry_after = e.response.headers.get("retry-after")
            if retry_after:
                delay = float(retry_after)
            else:
                delay = min(initial_delay * (2 ** attempt), max_delay)

            # Add jitter to prevent thundering herd
            delay += random.uniform(0, delay * 0.1)

            print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)

        except APIError as e:
            last_exception = e

            # Only retry on transient server errors; errors without a status
            # code (e.g. connection failures) are re-raised immediately
            if getattr(e, "status_code", None) not in [500, 502, 503, 504]:
                raise

            if attempt == max_retries:
                raise

            delay = min(initial_delay * (2 ** attempt), max_delay)
            delay += random.uniform(0, delay * 0.1)

            print(f"Server error ({e.status_code}). Retrying in {delay:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)

    raise last_exception


# Usage example
client = OpenAI(
    base_url=os.getenv("INFERENCE_URL") + "/v1",
    api_key=os.getenv("INFERENCE_KEY")
)

response = create_completion_with_retry(
    client,
    model=os.getenv("INFERENCE_MODEL_ID", "claude-4-5-sonnet"),
    messages=[
        {"role": "user", "content": "Explain exponential backoff in one paragraph."}
    ],
    max_tokens=200
)

print(response.choices[0].message.content)
Why jitter matters: When many clients hit a rate limit simultaneously, they might all retry at exactly the same time, causing another wave of rate limiting. Adding random jitter (a small random delay) spreads out the retries and prevents this “thundering herd” problem.
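The effect is easy to see by comparing a deterministic schedule (every client retries in lockstep) with "full jitter", a common variant that spreads each retry uniformly across the backoff interval. The parameters here are illustrative:

```python
import random

def backoff_delays(attempts=5, base=1.0, cap=60.0, jitter=True):
    """Compute a backoff schedule; with jitter, each delay is randomized."""
    delays = []
    for attempt in range(attempts):
        delay = min(base * (2 ** attempt), cap)
        if jitter:
            delay = random.uniform(0, delay)  # "full jitter"
        delays.append(delay)
    return delays

print(backoff_delays(jitter=False))  # [1.0, 2.0, 4.0, 8.0, 16.0]
print([round(d, 2) for d in backoff_delays()])  # randomized each run
```

Two clients using the deterministic schedule retry at identical moments; with jitter their retries land at different times, smoothing the load spike.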

Request Queuing

For high-throughput applications, implement a request queue that respects rate limits:
import asyncio
from collections import deque
from datetime import datetime, timedelta

class RateLimitedQueue:
    def __init__(self, requests_per_minute: int = 150):
        self.requests_per_minute = requests_per_minute
        self.request_times = deque()
        self.lock = asyncio.Lock()

    async def wait_for_capacity(self):
        async with self.lock:
            now = datetime.now()
            cutoff = now - timedelta(minutes=1)

            # Remove old request times
            while self.request_times and self.request_times[0] < cutoff:
                self.request_times.popleft()

            # If at capacity, wait until oldest request expires
            if len(self.request_times) >= self.requests_per_minute:
                wait_until = self.request_times[0] + timedelta(minutes=1)
                wait_seconds = (wait_until - now).total_seconds()
                if wait_seconds > 0:
                    await asyncio.sleep(wait_seconds)

            # Record this request
            self.request_times.append(datetime.now())
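A hypothetical driver for this queue might look like the following. The class is repeated so the sketch runs standalone; in a real application the worker body would be your API call:

```python
import asyncio
from collections import deque
from datetime import datetime, timedelta

# RateLimitedQueue as defined above, repeated so this snippet is self-contained.
class RateLimitedQueue:
    def __init__(self, requests_per_minute: int = 150):
        self.requests_per_minute = requests_per_minute
        self.request_times = deque()
        self.lock = asyncio.Lock()

    async def wait_for_capacity(self):
        async with self.lock:
            now = datetime.now()
            cutoff = now - timedelta(minutes=1)
            while self.request_times and self.request_times[0] < cutoff:
                self.request_times.popleft()
            if len(self.request_times) >= self.requests_per_minute:
                wait_until = self.request_times[0] + timedelta(minutes=1)
                wait_seconds = (wait_until - datetime.now()).total_seconds()
                if wait_seconds > 0:
                    await asyncio.sleep(wait_seconds)
            self.request_times.append(datetime.now())

async def worker(queue: RateLimitedQueue, task_id: int) -> int:
    await queue.wait_for_capacity()
    # In a real application, make the API call here, e.g.:
    #   client.chat.completions.create(...)
    return task_id

async def main() -> list[int]:
    queue = RateLimitedQueue(requests_per_minute=150)
    return await asyncio.gather(*(worker(queue, i) for i in range(10)))

print(asyncio.run(main()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Note that the queue holds its lock while sleeping, which serializes waiting tasks in arrival order; that is a deliberate simplification here, at the cost of some concurrency while the queue is saturated.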

Debugging Tips

Using Request IDs

Every API response includes a unique request ID in the X-Request-ID header. Include this ID when contacting support:
try:
    response = client.chat.completions.create(
        model="claude-4-5-sonnet",
        messages=[{"role": "user", "content": "Hello"}]
    )
except APIError as e:
    print(f"Error: {e}")
    # request_id is populated on errors that received an HTTP response
    print(f"Request ID: {getattr(e, 'request_id', None)}")  # Include this in support requests

Logging Recommendations

Enable detailed logging during development to diagnose issues:
import logging
import httpx

# Enable HTTP-level logging
logging.basicConfig(level=logging.DEBUG)
httpx_logger = logging.getLogger("httpx")
httpx_logger.setLevel(logging.DEBUG)

# Or use the OpenAI SDK's built-in logging (openai-python v1.x reads the
# OPENAI_LOG environment variable; set it before any requests are made)
import os
os.environ["OPENAI_LOG"] = "debug"

When to Contact Support

Contact Heroku Support if you experience:
  • Persistent 500/503 errors lasting more than 15 minutes
  • Rate limit errors when your usage is well below documented limits
  • Authentication errors with keys that previously worked
  • Unexpected model behavior that differs from documentation
Include the following in your support request:
  1. Request ID from the error response
  2. Timestamp of when the issue occurred (with timezone)
  3. The exact error message and status code
  4. A minimal code example that reproduces the issue
  5. Your app name and region

Additional Resources