## /v1/chat/completions

The `/v1/chat/completions` endpoint generates conversational completions for a provided set of input messages. You can specify the model, adjust generation settings such as `temperature`, and optionally stream responses or enable tool calling.
Authenticate with an `Authorization` header containing your Heroku Inference API key, which is stored in the `INFERENCE_KEY` config variable.
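A minimal request helper can be sketched as follows. This is an illustrative sketch, not the official client: `INFERENCE_KEY` is the config var named above, while `INFERENCE_URL` (the service base URL) and the helper names are assumptions.

```python
import json
import os
import urllib.request

# Sketch of a request helper. INFERENCE_KEY is the config var named in the
# docs; INFERENCE_URL (the service base URL) is an assumption here.
def auth_headers() -> dict:
    return {
        "Authorization": f"Bearer {os.environ['INFERENCE_KEY']}",
        "Content-Type": "application/json",
    }

def chat_completion(payload: dict) -> dict:
    req = urllib.request.Request(
        os.environ["INFERENCE_URL"] + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=auth_headers(),
        method="POST",
    )
    # Network call; requires a provisioned model resource.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```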
The model ID for your provisioned resource is stored in the `INFERENCE_MODEL_ID` config var. Examples: `"claude-4-5-sonnet"`, `"claude-4-5-haiku"`.
Each message object has two required fields: `role` and `content`. Supported roles: `system`, `user`, `assistant`, `tool`.
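A minimal request body using these fields might look like the following sketch (the hard-coded model ID is for illustration; in practice it can be read from the `INFERENCE_MODEL_ID` config var):

```python
# A minimal messages array; every message object carries a role and content.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Request body sketch; the model ID could instead be read from the
# INFERENCE_MODEL_ID config var.
payload = {
    "model": "claude-4-sonnet",
    "messages": messages,
}
```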
`max_tokens` (default: 4096 for Haiku models, 8192 for Sonnet models)

`temperature` (default: 1.0)
Controls randomness of the response. Range: 0.0 to 1.0.
- Values closer to 0 make responses more focused and deterministic.
- Values closer to 1.0 encourage more creative and diverse responses.

`top_p` (default: 0.999)
Nucleus sampling threshold. Range: 0 to 1.0. Specifies the cumulative probability of tokens to consider.

`stream` (default: false)
Stream responses incrementally via server-sent events (SSE). Useful for chat interfaces and for avoiding timeout errors.
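When `stream` is true, the body arrives as server-sent events. A sketch of a client-side parser is shown below; the `data:` line prefix is standard SSE, but the chunk shape (`choices[].delta.content`) and the `[DONE]` terminator are assumptions borrowed from similar chat-completion APIs.

```python
import json

# Parse server-sent events from a streamed response body, yielding each
# decoded JSON chunk. Assumes `data: {...}` event lines and a `data: [DONE]`
# terminator (conventions of similar chat APIs, not confirmed by these docs).
def iter_chunks(lines):
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)
```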
### Tool Object Structure

`tool_choice` (allowed values: `"none"`, `"auto"`, `"required"`)
Controls how the model uses provided tools.
- `"none"` - The model will not call any tools.
- `"auto"` - The model can call zero or more tools.
- `"required"` - The model must call at least one tool.

### Extended Thinking Object
- `enabled` (boolean): Enable extended thinking.
- `budget_tokens` (integer): Minimum 1024; maximum varies by model.
- `include_reasoning` (boolean): Include the reasoning trace in the response.

The response's `object` field is always `"chat.completion"`.
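The tool and extended-thinking options above can be combined in one request payload. The sketch below is illustrative: the `get_weather` tool is hypothetical, the tool object's `type`/`function`/`parameters` shape follows the common JSON-schema convention, and the `extended_thinking` field name is inferred from the section names rather than confirmed.

```python
# Hypothetical weather tool; the type/function/parameters shape follows the
# common JSON-schema tool convention and is an assumption here.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

payload = {
    "model": "claude-4-sonnet",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # "none" | "auto" | "required"
    # Field name assumed from the Extended Thinking Object section.
    "extended_thinking": {
        "enabled": True,
        "budget_tokens": 2048,  # must be at least 1024
        "include_reasoning": True,
    },
}
```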
### Choice Object

Message fields:
- `role` (string): Always `"assistant"`.
- `content` (string): Text content of the response.
- `tool_calls` (array, optional): Tool calls requested by the model.
- `reasoning` (object, optional): Reasoning trace if extended thinking is enabled.

`finish_reason` values:
- `"stop"` - Natural stopping point.
- `"length"` - Reached max tokens.
- `"tool_calls"` - The model made tool calls.

Usage fields:
- `prompt_tokens` (integer): Tokens in the input.
- `completion_tokens` (integer): Tokens in the output.
- `total_tokens` (integer): Total tokens used.

### Request Parameters

| Parameter | Description | Notes |
| --- | --- | --- |
| `Authorization` header | Bearer token using your `INFERENCE_KEY` | |
| `model` | Model ID to use for completion | Example: `"claude-4-sonnet"` |
| `messages` | Array of message objects | |
| `max_tokens` | Maximum tokens to generate | Example: `1024` |
| `temperature` | Sampling temperature | `0 <= x <= 1` |
| `top_p` | Nucleus sampling threshold | `0 <= x <= 1` |
| `stream` | Stream responses via SSE | |
| `stop` | Strings that stop generation | |
| `tools` | Tools the model may call | |
| `tool_choice` | Controls tool usage | `none`, `auto`, `required` |
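A response can be handled by inspecting the choice and usage fields described above. The body below is a fabricated example for illustration only, not actual API output:

```python
# Illustrative (fabricated) response body matching the fields described above.
response = {
    "object": "chat.completion",
    "choices": [
        {
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}

choice = response["choices"][0]
if choice["finish_reason"] == "tool_calls":
    # Execute each requested tool, then send the results back in a
    # follow-up request as messages with role="tool".
    pass
else:
    answer = choice["message"]["content"]
```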