Heroku AI exposes OpenAI-compatible endpoints, which lets you reuse the official OpenAI SDKs while routing traffic through Heroku’s managed infrastructure. Use the recipes below to bootstrap real applications quickly, then adapt them to your product needs.
All examples call https://us.inference.heroku.com/v1 and rely on a Heroku-managed API key. Provision a model first and export the key as INFERENCE_KEY.

Before you start

  • Install the Heroku CLI and log in.
  • Provision a model, for example heroku ai:models:create claude-4-5-sonnet --app my-ai-app.
  • Store the resulting key securely (heroku config:get INFERENCE_KEY --app my-ai-app).
  • Set the base URL when instantiating the SDK client (base_url="https://us.inference.heroku.com/v1").

Recipe: Customer support chat

Provide agents with instant answers sourced from your knowledge base.
Step 1: Install dependencies

pip install openai flask
Step 2: Create `app.py`

import os
from flask import Flask, request, jsonify
from openai import OpenAI

client = OpenAI(
    base_url="https://us.inference.heroku.com/v1",
    api_key=os.environ["INFERENCE_KEY"],
)

app = Flask(__name__)

@app.post("/chat")
def chat():
    question = (request.get_json(silent=True) or {}).get("question", "")
    response = client.chat.completions.create(
        model="claude-4-5-sonnet",
        messages=[
            {"role": "system", "content": "You are a Heroku support specialist."},
            {"role": "user", "content": question},
        ],
        max_tokens=600,
        temperature=0.3,
    )
    return jsonify({"answer": response.choices[0].message.content})

if __name__ == "__main__":
    app.run(debug=True)
Step 3: Deploy to Heroku

heroku create my-ai-support
git push heroku main
heroku config:set INFERENCE_KEY=$(heroku config:get INFERENCE_KEY --app my-ai-app) --app my-ai-support
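Inference endpoints can return transient errors under load. A minimal retry-with-backoff sketch you could wrap around the completion call in the handler above (the helper name, attempt count, and delays are illustrative, not part of Heroku's API):

```python
import time

def with_retries(call, attempts=3, base_delay=1.0):
    """Invoke `call` (a zero-argument function), retrying on failure with
    exponential backoff. Re-raises the last error if every attempt fails."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage sketch inside the /chat handler:
# response = with_retries(lambda: client.chat.completions.create(...))
```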

Recipe: Search + summarize notebook

Blend embeddings and chat completions to turn raw documents into actionable briefs.
Step 1: Embed your corpus

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://us.inference.heroku.com/v1",
    api_key=os.environ["INFERENCE_KEY"],
)

documents = [...]
vectors = client.embeddings.create(
    model="cohere-embed-multilingual",
    input=documents,
)
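Embedding endpoints commonly cap the number of inputs per request (the limit of 96 below is an assumption; check your model's documentation). A small batching helper keeps large corpora under that cap:

```python
def batched(items, size=96):
    """Yield successive slices of `items`, each no larger than `size`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage sketch: collect embeddings batch by batch.
# all_vectors = []
# for batch in batched(documents):
#     resp = client.embeddings.create(model="cohere-embed-multilingual", input=batch)
#     all_vectors.extend(item.embedding for item in resp.data)
```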
Step 2: Index vectors in Postgres + pgvector

CREATE TABLE docs (
  id serial PRIMARY KEY,
  content text,
  embedding vector(1024)
);
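With the pgvector extension installed, nearest neighbors can be fetched with its cosine-distance operator. A sketch of the retrieval query behind step 3 (`$1` is the query embedding, passed as a parameter):

```sql
-- Requires: CREATE EXTENSION IF NOT EXISTS vector;
SELECT content
FROM docs
ORDER BY embedding <=> $1  -- <=> is pgvector's cosine-distance operator
LIMIT 4;
```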
Step 3: Answer questions with retrieval

def summarize(question: str) -> str:
    related = fetch_similar_vectors(question, top_k=4)
    context = "\n\n".join(doc.text for doc in related)
    completion = client.chat.completions.create(
        model="claude-4-5-sonnet",
        temperature=0.4,
        max_tokens=700,
        messages=[
            {"role": "system", "content": "Summarize internal Heroku docs."},
            {"role": "user", "content": f"{question}\n\nContext:\n{context}"},
        ],
    )
    return completion.choices[0].message.content
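The function above assumes a `fetch_similar_vectors` helper. Here is a minimal in-memory sketch ranking documents by cosine similarity; in production you would instead embed the question with `client.embeddings.create` and run the pgvector query from step 2 (the `Doc` shape is our assumption):

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    embedding: list  # the vector stored for this document

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def fetch_similar_vectors(query_embedding, corpus, top_k=4):
    """Return the top_k docs from `corpus` ranked by similarity to the query."""
    ranked = sorted(corpus, key=lambda d: cosine(query_embedding, d.embedding),
                    reverse=True)
    return ranked[:top_k]
```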

Recipe: Image generator microservice

Expose Stable Image Ultra behind a simple REST interface.
import base64
import os
from io import BytesIO

from fastapi import Body, FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

client = OpenAI(
    base_url="https://us.inference.heroku.com/v1",
    api_key=os.environ["INFERENCE_KEY"],
)

app = FastAPI()

@app.post("/images")
async def images(prompt: str = Body(..., embed=True)):
    result = client.images.generate(
        model="stable-image-ultra",
        prompt=prompt,
        size="1024x1024",
    )
    data = base64.b64decode(result.data[0].b64_json)
    image_bytes = BytesIO(data)
    image_bytes.seek(0)
    return StreamingResponse(image_bytes, media_type="image/png")
Deploy with a Heroku container stack or the Python buildpack running uvicorn, then protect the endpoint with an API key or a Heroku session token.
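Before streaming bytes back, it can be worth a cheap sanity check that the decoded payload really is a PNG. The 8-byte signature below is the standard PNG magic number; the helper name is ours:

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data: bytes) -> bool:
    """Cheap validity check: every PNG file starts with an 8-byte signature."""
    return data[:8] == PNG_SIGNATURE

# In the handler, after base64-decoding:
# if not looks_like_png(data):
#     raise HTTPException(status_code=502, detail="model returned a non-PNG payload")
```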

What to build next

  • Add streaming UIs with Server-Sent Events for chat interfaces.
  • Swap models dynamically by passing the model from request payloads.
  • Log completions to Heroku Data for Redis for analytics.
  • Pair these recipes with the Chat Completions API guide for advanced parameters such as tool calling and structured outputs.