Heroku AI exposes OpenAI-compatible embedding models, including cohere-embed-multilingual. These notes adapt the OpenAI Cookbook’s retrieval patterns to Heroku’s stack so you can stand up semantic search, question answering, and summarization pipelines quickly. Primary references: Search + Summarize and FAQ bot with embeddings.
## Design the retrieval flow

### Core pipeline
- Chunk source documents (300–500 word windows with overlap).
- Embed chunks with `cohere-embed-multilingual` via the Embeddings API.
- Persist vectors in Postgres + pgvector or another ANN store.
- At query time, embed the question, retrieve top-k matches, and assemble a prompt.
- Send prompt + context to `claude-4-5-sonnet` or `claude-4-5-haiku` via the Chat Completions API.
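The chunking step above can be sketched as follows; the window and overlap sizes are illustrative defaults within the 300–500 word range, not prescribed values:

```python
import re

def chunk_words(text: str, window: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows for embedding."""
    words = re.findall(r"\S+", text)
    chunks = []
    step = window - overlap  # how far the window advances each iteration
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # last window already covers the tail of the document
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side.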
Batch up to 96 strings per request for cost efficiency, mirroring the cookbook guidance. Store the resulting vectors with their source metadata.
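Batching into groups of at most 96 strings can be done with a small generator (the embedding call itself is omitted here; only the batching logic is shown):

```python
from typing import Iterator

MAX_BATCH = 96  # per-request string limit noted above

def batched(texts: list[str], size: int = MAX_BATCH) -> Iterator[list[str]]:
    """Yield successive batches of at most `size` strings."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]
```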
## Assemble the answering prompt

### Prompt template
- System message: define role, tone, and citation requirements.
- Human message: include the user question and formatted context snippets.
- Add explicit output rules (e.g., “If facts are missing, say you don’t know.”).
Follow the cookbook’s recommendation to annotate each chunk with identifiers so you can cite sources back to the end user.
Use Haiku when you prioritize throughput; upgrade to Sonnet for harder questions or multilingual outputs.
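The template above might be assembled like this; the message shape follows the OpenAI-compatible Chat Completions format, and the wording and `(source_id, text)` pairing are illustrative:

```python
def build_messages(question: str, chunks: list[tuple[str, str]]) -> list[dict]:
    """chunks is a list of (source_id, text) pairs retrieved for the question."""
    context = "\n\n".join(f"[{src}] {text}" for src, text in chunks)
    system = (
        "You answer questions using only the provided context. "
        "Cite sources by their [id]. If facts are missing, say you don't know."
    )
    human = f"Question: {question}\n\nContext:\n{context}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": human},
    ]
```

Prefixing each snippet with its `[id]` is what lets the model cite sources back to the end user.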
## Operational guidance

### Best practices
- Normalize text (lowercase, trim whitespace) before embedding.
- Refresh embeddings when source data changes; track versions in Postgres.
- Add automated evaluations (see our prompt patterns cookbook) to spot drift.
- Cache retrieval results for popular questions to reduce load.
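A sketch combining normalization and caching; `embed_fn` stands in for the Embeddings API call, and the brute-force `top_k` would be replaced by an ANN store such as pgvector in practice:

```python
import math

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before embedding or cache lookup."""
    return " ".join(text.lower().split())

def top_k(query_vec, index, k=3):
    """index: list of (chunk_id, vector). Returns the k nearest chunk ids."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    ranked = sorted(index, key=lambda item: cos(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

_cache: dict[str, list[str]] = {}

def retrieve(question, embed_fn, index, k=3):
    """Cache hits on the normalized question skip the embedding round trip."""
    key = normalize(question)
    if key not in _cache:
        _cache[key] = top_k(embed_fn(key), index, k)
    return _cache[key]
```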
### Deployment on Heroku
- Run the embedding pipeline as a worker dyno; trigger rebuilds via Scheduler or webhooks.
- Store secrets (API keys, database URLs) in Config Vars, never in source.
- Use the `pgvector` extension on Heroku Postgres for similarity search.
- Stream chat responses to the client for faster perceived latency.
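With pgvector, the top-k lookup is a single query; `<=>` is pgvector's cosine-distance operator, and the table and column names below are illustrative. A small helper that builds the parameterized SQL (the query embedding is passed separately to the database driver):

```python
def similarity_sql(table: str = "chunks", k: int = 5) -> str:
    """pgvector cosine-distance query; %s is bound to the query embedding."""
    return (
        f"SELECT id, source, content "
        f"FROM {table} "
        f"ORDER BY embedding <=> %s::vector "
        f"LIMIT {k}"
    )
```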