Heroku AI exposes OpenAI-compatible embedding models, including cohere-embed-multilingual. These notes adapt the OpenAI Cookbook’s retrieval patterns to Heroku’s stack so you can stand up semantic search, question answering, and summarization pipelines quickly. Primary references: Search + Summarize and FAQ bot with embeddings.
## Design the retrieval flow

### Core pipeline
- Chunk source documents (300–500 word windows with overlap).
- Embed chunks with `cohere-embed-multilingual` via the Embeddings API.
- Persist vectors in Postgres + pgvector or another ANN store.
- At query time, embed the question, retrieve top-k matches, and assemble a prompt.
- Send prompt + context to `claude-4-5-sonnet` or `claude-4-5-haiku` via the Chat Completions API.
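The chunking step above can be sketched as follows; the window and overlap sizes are illustrative defaults within the 300–500 word range, not prescribed values:

```python
import re

def chunk_words(text: str, window: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows for embedding."""
    words = re.findall(r"\S+", text)
    chunks = []
    step = window - overlap  # how far the window advances each iteration
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # last window already covers the tail of the document
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side.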
Batch up to 96 strings per request for cost efficiency, mirroring the cookbook guidance. Store the resulting vectors with their source metadata.
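Batching into groups of at most 96 strings can be done with a small generator (the embedding call itself is omitted here; only the batching logic is shown):

```python
from typing import Iterator

MAX_BATCH = 96  # per-request string limit noted above

def batched(texts: list[str], size: int = MAX_BATCH) -> Iterator[list[str]]:
    """Yield successive batches of at most `size` strings."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]
```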
## Assemble the answering prompt

### Prompt template
- System message: define role, tone, and citation requirements.
- Human message: include the user question and formatted context snippets.
- Add explicit output rules (e.g., “If facts are missing, say you don’t know.”).
Follow the cookbook’s recommendation to annotate each chunk with identifiers so you can cite sources back to the end user.
Use Haiku when you prioritize throughput; upgrade to Sonnet for harder questions or multilingual outputs.
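The template above might be assembled like this; the message shape follows the OpenAI-compatible Chat Completions format, and the wording and `(source_id, text)` pairing are illustrative:

```python
def build_messages(question: str, chunks: list[tuple[str, str]]) -> list[dict]:
    """chunks is a list of (source_id, text) pairs retrieved for the question."""
    context = "\n\n".join(f"[{src}] {text}" for src, text in chunks)
    system = (
        "You answer questions using only the provided context. "
        "Cite sources by their [id]. If facts are missing, say you don't know."
    )
    human = f"Question: {question}\n\nContext:\n{context}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": human},
    ]
```

Prefixing each snippet with its `[id]` is what lets the model cite sources back to the end user.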
## Operational guidance

### Best practices
- Normalize text (lowercase, trim whitespace) before embedding.
- Refresh embeddings when source data changes; track versions in Postgres.
- Add automated evaluations (see our prompt patterns cookbook) to spot drift.
- Cache retrieval results for popular questions to reduce load.
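A sketch combining normalization and caching; `embed_fn` stands in for the Embeddings API call, and the brute-force `top_k` would be replaced by an ANN store such as pgvector in practice:

```python
import math

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before embedding or cache lookup."""
    return " ".join(text.lower().split())

def top_k(query_vec, index, k=3):
    """index: list of (chunk_id, vector). Returns the k nearest chunk ids."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    ranked = sorted(index, key=lambda item: cos(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

_cache: dict[str, list[str]] = {}

def retrieve(question, embed_fn, index, k=3):
    """Cache hits on the normalized question skip the embedding round trip."""
    key = normalize(question)
    if key not in _cache:
        _cache[key] = top_k(embed_fn(key), index, k)
    return _cache[key]
```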
### Deployment on Heroku
- Run the embedding pipeline as a worker dyno; trigger rebuilds via Scheduler or webhooks.
- Store secrets (API keys, database URLs) in Config Vars, never in source.
- Use the `pgvector` extension on Heroku Postgres for similarity search.
- Stream chat responses to the client for faster perceived latency.
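With pgvector, the top-k lookup is a single query; `<=>` is pgvector's cosine-distance operator, and the table and column names below are illustrative. A small helper that builds the parameterized SQL (the query embedding is passed separately to the database driver):

```python
def similarity_sql(table: str = "chunks", k: int = 5) -> str:
    """pgvector cosine-distance query; %s is bound to the query embedding."""
    return (
        f"SELECT id, source, content "
        f"FROM {table} "
        f"ORDER BY embedding <=> %s::vector "
        f"LIMIT {k}"
    )
```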