POST /v1/messages
The /v1/messages endpoint provides native Anthropic API compatibility for Claude models. If you’re already using the Anthropic SDK, you can switch to Heroku AI by changing the base URL and API key—no other code changes required.
Anthropic SDK Compatible: This endpoint is fully compatible with the Anthropic Messages API. Use the Anthropic Python or JavaScript SDK by pointing it to our base URL.
Anthropic Models Only: The Messages API is exclusively available for Claude models. For other models (Amazon Nova, Cohere, etc.), use the Chat Completions API.
View our available Claude models to see which models support which features.

Base URL

https://us.inference.heroku.com

Authentication

Unlike the Chat Completions endpoint which uses Authorization: Bearer, the Messages API uses Anthropic’s authentication pattern:
x-api-key: YOUR_INFERENCE_KEY
You can get your API key from your Heroku app’s INFERENCE_KEY config variable.
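If you call the endpoint directly rather than through the Anthropic SDK, the headers can be built once and reused. A minimal sketch; `auth_headers` is a hypothetical helper, not part of any SDK:

```python
import os


def auth_headers(api_key=None):
    """Build request headers for the Messages API.

    Unlike Chat Completions (Authorization: Bearer), this endpoint
    expects the key in an x-api-key header.
    """
    key = api_key or os.environ["INFERENCE_KEY"]
    return {
        "x-api-key": key,
        "Content-Type": "application/json",
    }
```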

Request Parameters

model

string · required The Claude model to use. Use your INFERENCE_MODEL_ID config var value. Example: "claude-4-5-sonnet", "claude-4-5-haiku", "claude-opus-4-5"

max_tokens

integer · required The maximum number of tokens to generate. Unlike Chat Completions where this is optional, the Messages API requires this field.

messages

array · required Array of message objects. Each message has a role (user or assistant) and content.
[
  {"role": "user", "content": "What is Heroku?"},
  {"role": "assistant", "content": "Heroku is a cloud platform..."},
  {"role": "user", "content": "How do I deploy to it?"}
]
The Messages API uses user and assistant roles only. System prompts are passed separately via the system parameter.
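If you’re porting an OpenAI-style conversation history, any interleaved system messages must be lifted out into the separate system parameter. A sketch of that split; `split_system` is a hypothetical helper:

```python
def split_system(history):
    """Separate an OpenAI-style message list into (system, messages).

    The Messages API accepts only user/assistant roles in `messages`,
    so system entries are joined and returned separately.
    """
    system_parts = [m["content"] for m in history if m["role"] == "system"]
    messages = [m for m in history if m["role"] != "system"]
    return "\n".join(system_parts) or None, messages
```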

system

string or array · optional System prompt that sets the assistant’s behavior. Can be a string or an array of content blocks (for prompt caching).

temperature

float · optional · default: 1.0 Controls randomness. Range: 0.0 to 1.0.

top_p

float · optional Nucleus sampling threshold. Range: 0.0 to 1.0.

top_k

integer · optional Only sample from the top K options for each token.

stop_sequences

array of strings · optional Custom strings that cause the model to stop generating.

stream

boolean · optional · default: false Enable streaming responses via server-sent events.

metadata

object · optional Metadata about the request. Includes user_id for tracking.

thinking

object · optional Enable extended thinking for Claude 3.7 Sonnet and Claude 4 Sonnet. The model uses additional internal reasoning steps before responding.
{
  "thinking": {
    "type": "enabled",
    "budget_tokens": 5000
  }
}
Fields:
  • type (string): Set to "enabled" to activate extended thinking
  • budget_tokens (integer): Token budget for reasoning (minimum 1024)
Extended thinking is only available for Claude 3.7 Sonnet and Claude 4 Sonnet. Requests with thinking for other models will fail.
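A request can guard the documented 1,024-token minimum before sending. A sketch; `thinking_config` is a hypothetical helper (in Anthropic’s own API, budget_tokens must also be less than max_tokens, which this sketch does not check):

```python
MIN_THINKING_BUDGET = 1024  # documented minimum


def thinking_config(budget_tokens):
    """Build the `thinking` request field, enforcing the documented minimum."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be >= {MIN_THINKING_BUDGET}")
    return {"type": "enabled", "budget_tokens": budget_tokens}
```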

tools

array · optional Tools the model can use. Follows Anthropic’s tool format.
{
  "name": "get_weather",
  "description": "Get current weather for a location",
  "input_schema": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name"
      }
    },
    "required": ["location"]
  }
}

tool_choice

object · optional Controls how the model uses tools.
  • {"type": "auto"} - Model decides whether to use tools
  • {"type": "any"} - Model must use at least one tool
  • {"type": "tool", "name": "tool_name"} - Model must use the specified tool
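The three documented shapes can be produced by a small helper; `make_tool_choice` is hypothetical:

```python
def make_tool_choice(mode, name=None):
    """Return a tool_choice object for the three documented modes:
    "auto" (model decides), "any" (must use some tool), or
    "tool" (must use the named tool)."""
    if mode == "tool":
        if not name:
            raise ValueError('mode "tool" requires a tool name')
        return {"type": "tool", "name": name}
    if mode in ("auto", "any"):
        return {"type": mode}
    raise ValueError(f"unknown mode: {mode}")
```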

Response

id

string Unique identifier for the message.

type

string Always returns "message".

role

string Always returns "assistant".

content

array Array of content blocks generated by the model.
text - Text response content
  • type (string): "text"
  • text (string): The generated text
tool_use - Tool invocation request
  • type (string): "tool_use"
  • id (string): Unique tool use ID
  • name (string): Name of the tool to call
  • input (object): Arguments for the tool
thinking - Extended thinking content (when enabled)
  • type (string): "thinking"
  • thinking (string): The model’s reasoning process

model

string Model ID used to generate the response.

stop_reason

string Reason the model stopped generating.
  • "end_turn" - Natural stopping point
  • "max_tokens" - Reached max tokens
  • "stop_sequence" - Hit a stop sequence
  • "tool_use" - Made a tool call

usage

object Token usage statistics.
  • input_tokens (integer): Tokens in the input
  • output_tokens (integer): Tokens in the output

Examples

curl https://us.inference.heroku.com/v1/messages \
  -H "x-api-key: $INFERENCE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-sonnet",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "What is Heroku?"}
    ]
  }'

Response Example

{
  "id": "msg_01XYZ...",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Heroku is a cloud platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud. It supports multiple programming languages and provides tools for deployment, scaling, and management of applications."
    }
  ],
  "model": "claude-4-sonnet",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 12,
    "output_tokens": 48
  }
}
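Extracting the assistant’s text from a response like the one above means filtering the content array, since it can also carry tool_use and thinking blocks. A sketch; `response_text` is a hypothetical helper:

```python
def response_text(message):
    """Concatenate the text blocks of a Messages API response,
    skipping tool_use and thinking blocks."""
    return "".join(
        block["text"] for block in message["content"] if block["type"] == "text"
    )
```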

Tool Calling Example

import os

from anthropic import Anthropic

client = Anthropic(
    api_key=os.getenv("INFERENCE_KEY"),
    base_url=os.getenv("INFERENCE_URL")
)

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state, e.g. Portland, OR"
                }
            },
            "required": ["location"]
        }
    }
]

message = client.messages.create(
    model="claude-4-sonnet",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user", "content": "What's the weather in Portland?"}
    ]
)
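When the model answers with stop_reason "tool_use", you execute the tool yourself and send the result back as a user message containing tool_result blocks (the standard Anthropic shape). A sketch; `tool_result_message` and `run_tool` are hypothetical:

```python
def tool_result_message(content_blocks, run_tool):
    """Build the follow-up user message carrying a tool_result block
    for every tool_use block in an assistant response.

    run_tool(name, input) is whatever function actually executes the tool.
    """
    results = []
    for block in content_blocks:
        if block["type"] != "tool_use":
            continue
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": str(run_tool(block["name"], block["input"])),
        })
    return {"role": "user", "content": results}
```

Append this message to the conversation and call `client.messages.create` again to get the model’s final answer.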

Streaming

Enable streaming to receive incremental responses as server-sent events (SSE):
import os

from anthropic import Anthropic

client = Anthropic(
    api_key=os.getenv("INFERENCE_KEY"),
    base_url=os.getenv("INFERENCE_URL")
)

with client.messages.stream(
    model="claude-4-sonnet",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Tell me a story"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
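Under the hood the stream is a sequence of Anthropic-style SSE events; text arrives in content_block_delta events whose delta has type "text_delta". A sketch that accumulates text from already-decoded event dicts; `accumulate_text` is hypothetical:

```python
def accumulate_text(events):
    """Collect streamed text from Anthropic-style SSE events.

    Ignores lifecycle events (message_start, content_block_start,
    message_stop) and non-text deltas.
    """
    parts = []
    for event in events:
        if event.get("type") == "content_block_delta":
            delta = event.get("delta", {})
            if delta.get("type") == "text_delta":
                parts.append(delta.get("text", ""))
    return "".join(parts)
```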

Prompt Caching

The Messages API supports Anthropic-style prompt caching via cache_control blocks. This differs from Chat Completions, which uses a header to enable/disable automatic caching.

How It Works

Add cache_control to cacheable content blocks:
import os

from anthropic import Anthropic

client = Anthropic(
    api_key=os.getenv("INFERENCE_KEY"),
    base_url=os.getenv("INFERENCE_URL")
)

# Cache a large system prompt
message = client.messages.create(
    model="claude-4-sonnet",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert on the Heroku platform...",  # Long system prompt
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "How do I scale dynos?"}
    ]
)

What Can Be Cached

  • System prompts: Add cache_control to system content blocks
  • Messages: Add cache_control to message content blocks
  • Tools: Add cache_control to tool definitions

Cache Behavior

  • Cache entries expire after 5 minutes of inactivity
  • Minimum content length required for caching (varies by model)
  • Cache hits return faster and may have different pricing
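Anthropic’s API reports cache activity through cache_creation_input_tokens and cache_read_input_tokens fields on the usage object; whether this deployment passes them through is an assumption (the usage schema above lists only input and output tokens), so this sketch defaults missing fields to zero. `cache_stats` is a hypothetical helper:

```python
def cache_stats(usage):
    """Summarize prompt-cache activity from a usage object.

    cache_creation_input_tokens / cache_read_input_tokens are Anthropic
    usage fields; missing fields are treated as zero, since this
    deployment may not return them.
    """
    created = usage.get("cache_creation_input_tokens", 0) or 0
    read = usage.get("cache_read_input_tokens", 0) or 0
    return {"cache_write_tokens": created, "cache_hit": read > 0}
```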

When to Use Messages vs Chat Completions

Use the Messages API when:
  • You have existing Anthropic SDK code
  • You want native cache_control for prompt caching
  • You’re using Anthropic-specific features like thinking
  • You want to use the Anthropic Agent SDK (coming soon)

Use Chat Completions when:
  • You need OpenAI SDK compatibility
  • You prefer header-based caching control
  • You’re using non-Anthropic models
  • You need multi-provider support

Limitations

  • Anthropic models only: This endpoint only works with Claude models
  • No anthropic-beta header: Beta features accessed via this header are not currently supported
  • Agent SDK support coming soon: Full Anthropic Agent SDK support will be available with a future Managed Inference plan update

Related

  • Chat Completions: OpenAI-compatible endpoint for all models
  • Agents Endpoint: Automatically execute Heroku tools
  • Embeddings: Generate text embeddings
  • Model Selection: Choose the right Claude model
