Chat Completions

POST /v1/chat/completions
The /v1/chat/completions endpoint generates conversational completions for a provided set of input messages. You can specify the model, adjust generation settings such as temperature, and optionally stream responses or enable tool calling.
OpenAI Compatible: This endpoint is fully compatible with the OpenAI Chat Completions API. Simply point the OpenAI SDK to our base URL.
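Because the endpoint is OpenAI-compatible, pointing the SDK at it is a configuration change. A minimal sketch of the constructor arguments (note the trailing /v1: the SDK appends /chat/completions itself; the INFERENCE_KEY environment variable is assumed):

```python
import os

# Sketch: arguments for pointing the OpenAI SDK at Heroku's endpoint.
# The trailing /v1 matters -- the SDK appends /chat/completions on its own.
openai_kwargs = {
    "base_url": "https://us.inference.heroku.com/v1",
    "api_key": os.environ.get("INFERENCE_KEY", "YOUR_INFERENCE_KEY"),
}

# With the openai package installed, usage would look like:
#   from openai import OpenAI
#   client = OpenAI(**openai_kwargs)
#   resp = client.chat.completions.create(
#       model="claude-4-5-sonnet",
#       messages=[{"role": "user", "content": "What is Heroku?"}],
#   )
```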
View our available chat models to see which models support which features.

Base URL

https://us.inference.heroku.com

Authentication

All requests must include an Authorization header with your Heroku Inference API key:
Authorization: Bearer YOUR_INFERENCE_KEY
You can get your API key from your Heroku app’s INFERENCE_KEY config variable.
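The curl examples below translate directly to any HTTP client. A minimal stdlib sketch that assembles (but does not send) an authenticated request, assuming INFERENCE_KEY is set in the environment:

```python
import json
import os
import urllib.request

def build_request(payload: dict) -> urllib.request.Request:
    """Assemble an authenticated POST to the chat completions endpoint."""
    return urllib.request.Request(
        "https://us.inference.heroku.com/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('INFERENCE_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request({
    "model": "claude-4-5-sonnet",
    "messages": [{"role": "user", "content": "What is Heroku?"}],
})
# Send with: urllib.request.urlopen(req)
```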

Request Parameters

model

string · required Model ID to use for completion, typically the value of your INFERENCE_MODEL_ID config var. Example: "claude-4-5-sonnet", "claude-4-5-haiku"

messages

array · required Array of message objects representing the conversation history. Each message must have a role and content. Supported roles: system, user, assistant, tool
[
  {
    "role": "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "What is Heroku?"
  }
]

max_completion_tokens

integer · optional Maximum number of tokens the model can generate before stopping.
  • Max value: 4096 for Haiku models
  • Max value: 8192 for Sonnet models

temperature

float · optional · default: 1.0 Controls randomness of the response. Range: 0.0 to 1.0
  • Values closer to 0 make responses more focused and deterministic
  • Values closer to 1.0 encourage more creative and diverse responses

top_p

float · optional · default: 0.999 Nucleus sampling threshold. Range: 0 to 1.0. Specifies the cumulative probability of tokens to consider.

stream

boolean · optional · default: false Stream responses incrementally via server-sent events. Useful for chat interfaces and avoiding timeout errors.

stop

array of strings · optional List of strings that stop the model from generating further tokens if encountered in the response.

tools

array of objects · optional List of tools the model may call. See Tool Use Guide for details.
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "City and state, e.g. Portland, OR"
        }
      },
      "required": ["location"]
    }
  }
}

tool_choice

string or object · optional · default: "required" Controls how the model uses provided tools.
  • "none" - Model will not call any tools
  • "auto" - Model can call zero or more tools
  • "required" - Model must call at least one tool

extended_thinking

object · optional Enable extended thinking for Claude 3.7 Sonnet and Claude 4 Sonnet only. Allows the model to use additional internal tokens for reasoning steps.
{
  "enabled": true,
  "budget_tokens": 1024,
  "include_reasoning": true
}
Fields:
  • enabled (boolean): Enable extended thinking
  • budget_tokens (integer): Minimum 1024, maximum varies by model
  • include_reasoning (boolean): Include reasoning trace in response
Extended thinking is only supported for Claude 3.7 Sonnet and Claude 4 Sonnet. Requests with extended_thinking for other models will fail.
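Because requests fail for unsupported models, it can help to guard the payload client-side. A sketch enforcing the 1024-token floor described above (the exact model ID strings in SUPPORTED are assumptions for illustration; the doc names the models but not their IDs):

```python
# Model IDs here are illustrative placeholders, not confirmed values.
SUPPORTED = {"claude-3-7-sonnet", "claude-4-sonnet"}

def with_extended_thinking(payload: dict, budget_tokens: int = 1024,
                           include_reasoning: bool = True) -> dict:
    """Attach an extended_thinking block, validating the documented limits."""
    if payload["model"] not in SUPPORTED:
        raise ValueError(f"extended_thinking unsupported for {payload['model']}")
    if budget_tokens < 1024:
        raise ValueError("budget_tokens must be at least 1024")
    return {**payload, "extended_thinking": {
        "enabled": True,
        "budget_tokens": budget_tokens,
        "include_reasoning": include_reasoning,
    }}

p = with_extended_thinking({"model": "claude-4-sonnet", "messages": []})
```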

Response

id

string Unique identifier for the chat completion.

object

string Always returns "chat.completion".

created

integer Unix timestamp when the completion was created.

model

string Model ID used to generate the response.

choices

array Array containing the generated message (always length 1).
index (integer): Index of the choice (always 0)
message (object): Generated message content
  • role (string): Always "assistant"
  • content (string): Text content of the response
  • tool_calls (array, optional): Tool calls requested by the model
  • reasoning (object, optional): Reasoning trace if extended thinking enabled
finish_reason (string): Reason the model stopped
  • "stop" - Natural stopping point
  • "length" - Reached max tokens
  • "tool_calls" - Made tool calls

usage

object Token usage statistics.
  • prompt_tokens (integer): Tokens in the input
  • completion_tokens (integer): Tokens in the output
  • total_tokens (integer): Total tokens used
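Putting the response fields together, a completion can be unpacked like this (a sketch; checking finish_reason for "length" catches responses truncated at max_completion_tokens):

```python
def unpack(response: dict) -> str:
    """Extract the assistant text, warning if the response was truncated."""
    choice = response["choices"][0]  # choices always has length 1
    if choice["finish_reason"] == "length":
        print("warning: response truncated at max_completion_tokens")
    return choice["message"]["content"]

sample = {
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "Heroku is a PaaS."},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 28, "completion_tokens": 45, "total_tokens": 73},
}
text = unpack(sample)  # -> "Heroku is a PaaS."
```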

Examples

curl https://us.inference.heroku.com/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-5-sonnet",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is Heroku?"
      }
    ],
    "max_completion_tokens": 1024,
    "temperature": 0.7
  }'

Response Example

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1745623456,
  "model": "claude-4-5-sonnet",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Heroku is a cloud platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud. It supports multiple programming languages and provides tools for deployment, scaling, and management of applications."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 45,
    "total_tokens": 73
  }
}

Tool Calling Example

curl https://us.inference.heroku.com/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-5-sonnet",
    "messages": [
      {"role": "user", "content": "What is the weather in Portland?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City and state, e.g. Portland, OR"
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "required"
  }'
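When the model responds with tool_calls, your app executes the function locally and sends the result back in a tool message. A minimal dispatch sketch, assuming the OpenAI-compatible tool_calls shape (where arguments arrives as a JSON-encoded string); get_weather here is a local stand-in:

```python
import json

def get_weather(location: str) -> str:
    # Local stand-in implementation for illustration.
    return f"72F and sunny in {location}"

TOOLS = {"get_weather": get_weather}

def run_tool_calls(message: dict) -> list:
    """Execute each requested tool and build tool-role follow-up messages."""
    results = []
    for call in message.get("tool_calls", []):
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # JSON string -> dict
        results.append({"role": "tool",
                        "tool_call_id": call["id"],
                        "content": fn(**args)})
    return results

# Example assistant message in the assumed OpenAI-compatible shape:
assistant_msg = {"role": "assistant", "tool_calls": [{
    "id": "call_1", "type": "function",
    "function": {"name": "get_weather",
                 "arguments": "{\"location\": \"Portland, OR\"}"}}]}
follow_ups = run_tool_calls(assistant_msg)
```

The follow-up messages are appended to the conversation and sent in a second request so the model can produce its final answer.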

Streaming

Enable streaming to receive incremental responses as server-sent events (SSE):
curl https://us.inference.heroku.com/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-5-sonnet",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
Each chunk contains a delta of the completion:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1745623456,"model":"claude-4-5-sonnet","choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1745623456,"model":"claude-4-5-sonnet","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}

data: [DONE]
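The full text is reassembled by concatenating each chunk's delta content until the [DONE] sentinel. A sketch over SSE lines like those above:

```python
import json

def accumulate(sse_lines) -> str:
    """Concatenate delta content from `data:` lines until [DONE]."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        text.append(delta.get("content", ""))  # role-only deltas add nothing
    return "".join(text)

lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}',
    "data: [DONE]",
]
story = accumulate(lines)  # -> "Once upon"
```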

Prompt Caching

Prompt caching lets the Heroku Managed Inference and Agents add-on deliver faster responses for common workflows, document processing, and code-generation tools. Heroku automatically enables prompt caching for system prompts and tool definitions.

How It Works

When agentic apps make calls to models, a portion of the request content remains static. Prompt caching enables Heroku to skip reprocessing this static content for every call and instead use its already processed result from a secure cache.
  1. First request: A request with a new, substantial prompt is processed and Heroku securely caches the results.
  2. Similar requests: For subsequent requests with the same initial prompt or tools, Heroku reuses the cached components to provide a faster response.
Caching only occurs when content meets the minimum token threshold, improving performance where it adds the most value.
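The idea can be pictured as hashing the static parts of a request into a lookup key. This sketch is purely illustrative of the mechanism; Heroku's actual cache internals are not documented here beyond the use of cryptographic hashing:

```python
import hashlib
import json

def cache_key(system_prompt: str, tools: list) -> str:
    """Illustration only: derive a stable key from the static request parts
    (system prompt + tool definitions) via cryptographic hashing."""
    canonical = json.dumps({"system": system_prompt, "tools": tools},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

k1 = cache_key("You are a helpful assistant.", [])
k2 = cache_key("You are a helpful assistant.", [])
# Identical static content -> identical key -> reuse on repeat requests.
```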

What Gets Cached

Heroku only uses prompt caching for:
  • System prompts: Instructions that define the model’s behavior
  • Tool definitions: Function schemas and descriptions
User messages and conversation history are never cached.

Cache Behavior

  • Automatic expiration: The secure cache automatically expires after five minutes of inactivity
  • Security: Each cache is built on Heroku’s secure infrastructure and protects your data with cryptographic hashing
  • No extra cost: Heroku doesn’t charge for cache writes, and cache hits are billed at the same rate as uncached tokens

Disable Prompt Caching

You can disable prompt caching for any request by adding an HTTP header:
curl https://us.inference.heroku.com/v1/chat/completions \
  -H "Authorization: Bearer $INFERENCE_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Heroku-Prompt-Caching: false" \
  -d '{
    "model": "claude-4-5-sonnet",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'

