POST /v1/messages
The /v1/messages endpoint provides native Anthropic API compatibility for Claude models. If you’re already using the Anthropic SDK, you can switch to Heroku AI by changing the base URL and API key—no other code changes required.
Anthropic SDK Compatible: This endpoint is fully compatible with the Anthropic Messages API. Use the Anthropic Python or JavaScript SDK by pointing it to our base URL.
Anthropic Models Only: The Messages API is exclusively available for Claude models. For other models (Amazon Nova, Cohere, etc.), use the Chat Completions API.
View our available Claude models to see which models support which features.

Base URL

https://us.inference.heroku.com

Authentication

Unlike the Chat Completions endpoint which uses Authorization: Bearer, the Messages API uses Anthropic’s authentication pattern:
x-api-key: YOUR_INFERENCE_KEY
You can get your API key from your Heroku app’s INFERENCE_KEY config variable.
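If you call the endpoint directly rather than through the Anthropic SDK, the headers can be built once and reused. A minimal sketch; `auth_headers` is a hypothetical helper, not part of any SDK:

```python
import os


def auth_headers(api_key=None):
    """Build request headers for the Messages API.

    Unlike Chat Completions (Authorization: Bearer), this endpoint
    expects the key in an x-api-key header.
    """
    key = api_key or os.environ["INFERENCE_KEY"]
    return {
        "x-api-key": key,
        "Content-Type": "application/json",
    }
```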

Request Parameters

model

string · required The Claude model to use. Use your INFERENCE_MODEL_ID config var value. Example: "claude-4-5-sonnet", "claude-4-5-haiku", "claude-opus-4-5"

max_tokens

integer · required The maximum number of tokens to generate. Unlike Chat Completions where this is optional, the Messages API requires this field.

messages

array · required Array of message objects. Each message has a role (user or assistant) and content.
[
  {"role": "user", "content": "What is Heroku?"},
  {"role": "assistant", "content": "Heroku is a cloud platform..."},
  {"role": "user", "content": "How do I deploy to it?"}
]
The Messages API uses user and assistant roles only. System prompts are passed separately via the system parameter.
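If you’re porting an OpenAI-style conversation history, any interleaved system messages must be lifted out into the separate system parameter. A sketch of that split; `split_system` is a hypothetical helper:

```python
def split_system(history):
    """Separate an OpenAI-style message list into (system, messages).

    The Messages API accepts only user/assistant roles in `messages`,
    so system entries are joined and returned separately.
    """
    system_parts = [m["content"] for m in history if m["role"] == "system"]
    messages = [m for m in history if m["role"] != "system"]
    return "\n".join(system_parts) or None, messages
```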

system

string or array · optional System prompt that sets the assistant’s behavior. Can be a string or an array of content blocks (for prompt caching).

temperature

float · optional · default: 1.0 Controls randomness. Range: 0.0 to 1.0.

top_p

float · optional Nucleus sampling threshold. Range: 0.0 to 1.0.

top_k

integer · optional Only sample from the top K options for each token.

stop_sequences

array of strings · optional Custom strings that cause the model to stop generating.

stream

boolean · optional · default: false Enable streaming responses via server-sent events.

metadata

object · optional Metadata about the request. Includes user_id for tracking.

thinking

object · optional Enable extended thinking for Claude 3.7 Sonnet and Claude 4 Sonnet. The model uses additional internal reasoning steps before responding.
{
  "thinking": {
    "type": "enabled",
    "budget_tokens": 5000
  }
}
Fields:
  • type (string): Set to "enabled" to activate extended thinking
  • budget_tokens (integer): Token budget for reasoning (minimum 1024)
Extended thinking is only available for Claude 3.7 Sonnet and Claude 4 Sonnet. Requests with thinking for other models will fail.
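A request can guard the documented 1,024-token minimum before sending. A sketch; `thinking_config` is a hypothetical helper (in Anthropic’s own API, budget_tokens must also be less than max_tokens, which this sketch does not check):

```python
MIN_THINKING_BUDGET = 1024  # documented minimum


def thinking_config(budget_tokens):
    """Build the `thinking` request field, enforcing the documented minimum."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be >= {MIN_THINKING_BUDGET}")
    return {"type": "enabled", "budget_tokens": budget_tokens}
```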

tools

array · optional Tools the model can use. Follows Anthropic’s tool format.
{
  "name": "get_weather",
  "description": "Get current weather for a location",
  "input_schema": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name"
      }
    },
    "required": ["location"]
  }
}

tool_choice

object · optional Controls how the model uses tools.
  • {"type": "auto"} - Model decides whether to use tools
  • {"type": "any"} - Model must use at least one tool
  • {"type": "tool", "name": "tool_name"} - Model must use the specified tool
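The three documented shapes can be produced by a small helper; `make_tool_choice` is hypothetical:

```python
def make_tool_choice(mode, name=None):
    """Return a tool_choice object for the three documented modes:
    "auto" (model decides), "any" (must use some tool), or
    "tool" (must use the named tool)."""
    if mode == "tool":
        if not name:
            raise ValueError('mode "tool" requires a tool name')
        return {"type": "tool", "name": name}
    if mode in ("auto", "any"):
        return {"type": mode}
    raise ValueError(f"unknown mode: {mode}")
```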

Response

id

string Unique identifier for the message.

type

string Always returns "message".

role

string Always returns "assistant".

content

array Array of content blocks generated by the model.
text - Text response content
  • type (string): "text"
  • text (string): The generated text
tool_use - Tool invocation request
  • type (string): "tool_use"
  • id (string): Unique tool use ID
  • name (string): Name of the tool to call
  • input (object): Arguments for the tool
thinking - Extended thinking content (when enabled)
  • type (string): "thinking"
  • thinking (string): The model’s reasoning process

model

string Model ID used to generate the response.

stop_reason

string Reason the model stopped generating.
  • "end_turn" - Natural stopping point
  • "max_tokens" - Reached max tokens
  • "stop_sequence" - Hit a stop sequence
  • "tool_use" - Made a tool call

usage

object Token usage statistics.
  • input_tokens (integer): Tokens in the input
  • output_tokens (integer): Tokens in the output

Examples

curl https://us.inference.heroku.com/v1/messages \
  -H "x-api-key: $INFERENCE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-sonnet",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "What is Heroku?"}
    ]
  }'

Response Example

{
  "id": "msg_01XYZ...",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Heroku is a cloud platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud. It supports multiple programming languages and provides tools for deployment, scaling, and management of applications."
    }
  ],
  "model": "claude-4-sonnet",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 12,
    "output_tokens": 48
  }
}
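Extracting the assistant’s text from a response like the one above means filtering the content array, since it can also carry tool_use and thinking blocks. A sketch; `response_text` is a hypothetical helper:

```python
def response_text(message):
    """Concatenate the text blocks of a Messages API response,
    skipping tool_use and thinking blocks."""
    return "".join(
        block["text"] for block in message["content"] if block["type"] == "text"
    )
```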

Tool Calling Example

import os

from anthropic import Anthropic

client = Anthropic(
    api_key=os.getenv("INFERENCE_KEY"),
    base_url=os.getenv("INFERENCE_URL")
)

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state, e.g. Portland, OR"
                }
            },
            "required": ["location"]
        }
    }
]

message = client.messages.create(
    model="claude-4-sonnet",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user", "content": "What's the weather in Portland?"}
    ]
)
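When the model answers with stop_reason "tool_use", you execute the tool yourself and send the result back as a user message containing tool_result blocks (the standard Anthropic shape). A sketch; `tool_result_message` and `run_tool` are hypothetical:

```python
def tool_result_message(content_blocks, run_tool):
    """Build the follow-up user message carrying a tool_result block
    for every tool_use block in an assistant response.

    run_tool(name, input) is whatever function actually executes the tool.
    """
    results = []
    for block in content_blocks:
        if block["type"] != "tool_use":
            continue
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": str(run_tool(block["name"], block["input"])),
        })
    return {"role": "user", "content": results}
```

Append this message to the conversation and call `client.messages.create` again to get the model’s final answer.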

Streaming

Enable streaming to receive incremental responses as server-sent events (SSE):
import os

from anthropic import Anthropic

client = Anthropic(
    api_key=os.getenv("INFERENCE_KEY"),
    base_url=os.getenv("INFERENCE_URL")
)

with client.messages.stream(
    model="claude-4-sonnet",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Tell me a story"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
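Under the hood the stream is a sequence of Anthropic-style SSE events; text arrives in content_block_delta events whose delta has type "text_delta". A sketch that accumulates text from already-decoded event dicts; `accumulate_text` is hypothetical:

```python
def accumulate_text(events):
    """Collect streamed text from Anthropic-style SSE events.

    Ignores lifecycle events (message_start, content_block_start,
    message_stop) and non-text deltas.
    """
    parts = []
    for event in events:
        if event.get("type") == "content_block_delta":
            delta = event.get("delta", {})
            if delta.get("type") == "text_delta":
                parts.append(delta.get("text", ""))
    return "".join(parts)
```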

Prompt Caching

The Messages API supports Anthropic-style prompt caching via cache_control blocks. This differs from Chat Completions, which uses a header to enable/disable automatic caching.

How It Works

Add cache_control to cacheable content blocks:
import os

from anthropic import Anthropic

client = Anthropic(
    api_key=os.getenv("INFERENCE_KEY"),
    base_url=os.getenv("INFERENCE_URL")
)

# Cache a large system prompt
message = client.messages.create(
    model="claude-4-sonnet",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert on the Heroku platform...",  # Long system prompt
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "How do I scale dynos?"}
    ]
)

What Can Be Cached

  • System prompts: Add cache_control to system content blocks
  • Messages: Add cache_control to message content blocks
  • Tools: Add cache_control to tool definitions

Cache Behavior

  • Cache entries expire after 5 minutes of inactivity
  • Minimum content length required for caching (varies by model)
  • Cache hits return faster and may have different pricing
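Anthropic’s API reports cache activity through cache_creation_input_tokens and cache_read_input_tokens fields on the usage object; whether this deployment passes them through is an assumption (the usage schema above lists only input and output tokens), so this sketch defaults missing fields to zero. `cache_stats` is a hypothetical helper:

```python
def cache_stats(usage):
    """Summarize prompt-cache activity from a usage object.

    cache_creation_input_tokens / cache_read_input_tokens are Anthropic
    usage fields; missing fields are treated as zero, since this
    deployment may not return them.
    """
    created = usage.get("cache_creation_input_tokens", 0) or 0
    read = usage.get("cache_read_input_tokens", 0) or 0
    return {"cache_write_tokens": created, "cache_hit": read > 0}
```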

When to Use Messages vs Chat Completions

Use the Messages API when:
  • You have existing Anthropic SDK code
  • You want native cache_control for prompt caching
  • You’re using Anthropic-specific features like thinking
  • You want to use the Anthropic Agent SDK (coming soon)

Use Chat Completions when:
  • You need OpenAI SDK compatibility
  • You prefer header-based caching control
  • You’re using non-Anthropic models
  • You need multi-provider support

Limitations

  • Anthropic models only: This endpoint only works with Claude models
  • No anthropic-beta header: Beta features accessed via this header are not currently supported
  • Agent SDK support coming soon: Full Anthropic Agent SDK support will be available with a future Managed Inference plan update

Related

  • Chat Completions: OpenAI-compatible endpoint for all models
  • Agents Endpoint: Automatically execute Heroku tools
  • Embeddings: Generate text embeddings
  • Model Selection: Choose the right Claude model
