Chat Completions

POST /v1/chat/completions
The /v1/chat/completions endpoint generates conversational completions for a provided set of input messages. You can specify the model, adjust generation settings such as temperature, and optionally stream responses or enable tool calling.
OpenAI Compatible: This endpoint is fully compatible with the OpenAI Chat Completions API. Simply point the OpenAI SDK to our base URL.
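Because the endpoint is OpenAI-compatible, pointing the SDK at it is a configuration change. A minimal sketch of the constructor arguments (note the trailing /v1: the SDK appends /chat/completions itself; the INFERENCE_KEY environment variable is assumed):

```python
import os

# Sketch: arguments for pointing the OpenAI SDK at Heroku's endpoint.
# The trailing /v1 matters -- the SDK appends /chat/completions on its own.
openai_kwargs = {
    "base_url": "https://us.inference.heroku.com/v1",
    "api_key": os.environ.get("INFERENCE_KEY", "YOUR_INFERENCE_KEY"),
}

# With the openai package installed, usage would look like:
#   from openai import OpenAI
#   client = OpenAI(**openai_kwargs)
#   resp = client.chat.completions.create(
#       model="claude-4-5-sonnet",
#       messages=[{"role": "user", "content": "What is Heroku?"}],
#   )
```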
View our available chat models to see which models support which features.

Base URL

https://us.inference.heroku.com

Authentication

All requests must include an Authorization header with your Heroku Inference API key:
Authorization: Bearer YOUR_INFERENCE_KEY
You can get your API key from your Heroku app’s INFERENCE_KEY config variable.
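The curl examples below translate directly to any HTTP client. A minimal stdlib sketch that assembles (but does not send) an authenticated request, assuming INFERENCE_KEY is set in the environment:

```python
import json
import os
import urllib.request

def build_request(payload: dict) -> urllib.request.Request:
    """Assemble an authenticated POST to the chat completions endpoint."""
    return urllib.request.Request(
        "https://us.inference.heroku.com/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('INFERENCE_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request({
    "model": "claude-4-5-sonnet",
    "messages": [{"role": "user", "content": "What is Heroku?"}],
})
# Send with: urllib.request.urlopen(req)
```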

Request Parameters

model

string · required Model ID to use for completion, typically the value of your INFERENCE_MODEL_ID config var. Example: "claude-4-5-sonnet", "claude-4-5-haiku"

messages

array · required Array of message objects representing the conversation history. Each message must have a role and content. Supported roles: system, user, assistant, tool
[
  {
    "role": "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "What is Heroku?"
  }
]

max_completion_tokens

integer · optional Maximum number of tokens the model can generate before stopping.
  • Max value: 4096 for Haiku models
  • Max value: 8192 for Sonnet models

temperature

float · optional · default: 1.0 Controls randomness of the response. Range: 0.0 to 1.0
  • Values closer to 0 make responses more focused and deterministic
  • Values closer to 1.0 encourage more creative and diverse responses

top_p

float · optional · default: 0.999 Nucleus sampling threshold. Range: 0 to 1.0. Specifies the cumulative probability of tokens to consider.

stream

boolean · optional · default: false Stream responses incrementally via server-sent events. Useful for chat interfaces and avoiding timeout errors.

stop

array of strings · optional List of strings that stop the model from generating further tokens if encountered in the response.

tools

array of objects · optional List of tools the model may call. See Tool Use Guide for details.
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "City and state, e.g. Portland, OR"
        }
      },
      "required": ["location"]
    }
  }
}

tool_choice

string or object · optional · default: "required" Controls how the model uses provided tools.
  • "none" - Model will not call any tools
  • "auto" - Model can call zero or more tools
  • "required" - Model must call at least one tool

extended_thinking

object · optional Enable extended thinking for Claude 3.7 Sonnet and Claude 4 Sonnet only. Allows the model to use additional internal tokens for reasoning steps.
{
  "enabled": true,
  "budget_tokens": 1024,
  "include_reasoning": true
}
Fields:
  • enabled (boolean): Enable extended thinking
  • budget_tokens (integer): Minimum 1024, maximum varies by model
  • include_reasoning (boolean): Include reasoning trace in response
Extended thinking is only supported for Claude 3.7 Sonnet and Claude 4 Sonnet. Requests with extended_thinking for other models will fail.
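Because requests fail for unsupported models, it can help to guard the payload client-side. A sketch enforcing the 1024-token floor described above (the exact model ID strings in SUPPORTED are assumptions for illustration; the doc names the models but not their IDs):

```python
# Model IDs here are illustrative placeholders, not confirmed values.
SUPPORTED = {"claude-3-7-sonnet", "claude-4-sonnet"}

def with_extended_thinking(payload: dict, budget_tokens: int = 1024,
                           include_reasoning: bool = True) -> dict:
    """Attach an extended_thinking block, validating the documented limits."""
    if payload["model"] not in SUPPORTED:
        raise ValueError(f"extended_thinking unsupported for {payload['model']}")
    if budget_tokens < 1024:
        raise ValueError("budget_tokens must be at least 1024")
    return {**payload, "extended_thinking": {
        "enabled": True,
        "budget_tokens": budget_tokens,
        "include_reasoning": include_reasoning,
    }}

p = with_extended_thinking({"model": "claude-4-sonnet", "messages": []})
```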

Response

id

string Unique identifier for the chat completion.

object

string Always returns "chat.completion".

created

integer Unix timestamp when the completion was created.

model

string Model ID used to generate the response.

choices

array Array containing the generated message (always length 1).
index (integer): Index of the choice (always 0)
message (object): Generated message content
  • role (string): Always "assistant"
  • content (string): Text content of the response
  • tool_calls (array, optional): Tool calls requested by the model
  • reasoning (object, optional): Reasoning trace if extended thinking enabled
finish_reason (string): Reason the model stopped
  • "stop" - Natural stopping point
  • "length" - Reached max tokens
  • "tool_calls" - Made tool calls

usage

object Token usage statistics.
  • prompt_tokens (integer): Tokens in the input
  • completion_tokens (integer): Tokens in the output
  • total_tokens (integer): Total tokens used
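Putting the response fields together, a completion can be unpacked like this (a sketch; checking finish_reason for "length" catches responses truncated at max_completion_tokens):

```python
def unpack(response: dict) -> str:
    """Extract the assistant text, warning if the response was truncated."""
    choice = response["choices"][0]  # choices always has length 1
    if choice["finish_reason"] == "length":
        print("warning: response truncated at max_completion_tokens")
    return choice["message"]["content"]

sample = {
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "Heroku is a PaaS."},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 28, "completion_tokens": 45, "total_tokens": 73},
}
text = unpack(sample)  # -> "Heroku is a PaaS."
```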

Examples

curl https://us.inference.heroku.com/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-5-sonnet",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is Heroku?"
      }
    ],
    "max_completion_tokens": 1024,
    "temperature": 0.7
  }'

Response Example

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1745623456,
  "model": "claude-4-5-sonnet",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Heroku is a cloud platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud. It supports multiple programming languages and provides tools for deployment, scaling, and management of applications."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 45,
    "total_tokens": 73
  }
}

Tool Calling Example

curl https://us.inference.heroku.com/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-5-sonnet",
    "messages": [
      {"role": "user", "content": "What is the weather in Portland?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City and state, e.g. Portland, OR"
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "required"
  }'
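When the model responds with tool_calls, your app executes the function locally and sends the result back in a tool message. A minimal dispatch sketch, assuming the OpenAI-compatible tool_calls shape (where arguments arrives as a JSON-encoded string); get_weather here is a local stand-in:

```python
import json

def get_weather(location: str) -> str:
    # Local stand-in implementation for illustration.
    return f"72F and sunny in {location}"

TOOLS = {"get_weather": get_weather}

def run_tool_calls(message: dict) -> list:
    """Execute each requested tool and build tool-role follow-up messages."""
    results = []
    for call in message.get("tool_calls", []):
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # JSON string -> dict
        results.append({"role": "tool",
                        "tool_call_id": call["id"],
                        "content": fn(**args)})
    return results

# Example assistant message in the assumed OpenAI-compatible shape:
assistant_msg = {"role": "assistant", "tool_calls": [{
    "id": "call_1", "type": "function",
    "function": {"name": "get_weather",
                 "arguments": "{\"location\": \"Portland, OR\"}"}}]}
follow_ups = run_tool_calls(assistant_msg)
```

The follow-up messages are appended to the conversation and sent in a second request so the model can produce its final answer.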

Streaming

Enable streaming to receive incremental responses as server-sent events (SSE):
curl https://us.inference.heroku.com/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-5-sonnet",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
Each chunk contains a delta of the completion:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1745623456,"model":"claude-4-5-sonnet","choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1745623456,"model":"claude-4-5-sonnet","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}

data: [DONE]
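The full text is reassembled by concatenating each chunk's delta content until the [DONE] sentinel. A sketch over SSE lines like those above:

```python
import json

def accumulate(sse_lines) -> str:
    """Concatenate delta content from `data:` lines until [DONE]."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        text.append(delta.get("content", ""))  # role-only deltas add nothing
    return "".join(text)

lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}',
    "data: [DONE]",
]
story = accumulate(lines)  # -> "Once upon"
```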

Prompt Caching

Prompt caching lets the Heroku Managed Inference and Agents add-on deliver faster responses for common workflows, document processing, and code-generation tools. Heroku automatically enables prompt caching for system prompts and tool definitions.

How It Works

When agentic apps make calls to models, a portion of the request content remains static. Prompt caching enables Heroku to skip reprocessing this static content for every call and instead use its already processed result from a secure cache.
  1. First request: A request with a new, substantial prompt is processed and Heroku securely caches the results.
  2. Similar requests: For subsequent requests with the same initial prompt or tools, Heroku reuses the cached components to provide a faster response.
Caching only occurs when content meets the minimum token threshold, improving performance where it adds the most value.
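The idea can be pictured as hashing the static parts of a request into a lookup key. This sketch is purely illustrative of the mechanism; Heroku's actual cache internals are not documented here beyond the use of cryptographic hashing:

```python
import hashlib
import json

def cache_key(system_prompt: str, tools: list) -> str:
    """Illustration only: derive a stable key from the static request parts
    (system prompt + tool definitions) via cryptographic hashing."""
    canonical = json.dumps({"system": system_prompt, "tools": tools},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

k1 = cache_key("You are a helpful assistant.", [])
k2 = cache_key("You are a helpful assistant.", [])
# Identical static content -> identical key -> reuse on repeat requests.
```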

What Gets Cached

Heroku only uses prompt caching for:
  • System prompts: Instructions that define the model’s behavior
  • Tool definitions: Function schemas and descriptions
User messages and conversation history are never cached.

Cache Behavior

  • Automatic expiration: The secure cache automatically expires after five minutes of inactivity
  • Security: Each cache is built on Heroku’s secure infrastructure and protects your data with cryptographic hashing
  • No extra cost: Heroku doesn’t charge for cache writes, and cache hits are billed at the same rate as uncached tokens

Disable Prompt Caching

You can disable prompt caching for any request by adding an HTTP header:
curl https://us.inference.heroku.com/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Heroku-Prompt-Caching: false" \
  -d '{
    "model": "claude-4-5-sonnet",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'

