
Catalog overview

Pick the best model for every experience

Heroku Managed Inference and Agents exposes leading chat, embedding, and generative image models behind OpenAI-compatible endpoints. Use these cards and comparison tools to balance quality, performance, and spend.
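
Because every model sits behind the same OpenAI-compatible surface, a request differs only in the `model` field. The sketch below builds a chat completions request body; the INFERENCE_URL, INFERENCE_KEY, and INFERENCE_MODEL_ID config var names and the /v1/chat/completions route are assumptions to verify against your attached resource.

```python
import json
import os

# Config vars assumed to be set when a model resource is attached;
# verify the exact names on your app (heroku config --app my-ai-app).
base_url = os.environ.get("INFERENCE_URL", "https://us.inference.heroku.com")
api_key = os.environ.get("INFERENCE_KEY", "example-key")
model_id = os.environ.get("INFERENCE_MODEL_ID", "claude-4-5-sonnet")

# OpenAI-compatible chat completions request body.
payload = {
    "model": model_id,
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize this week's release notes."},
    ],
    "max_tokens": 512,
}
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}
endpoint = f"{base_url}/v1/chat/completions"
body = json.dumps(payload)
```

POST `body` to `endpoint` with any HTTP client (for example `requests.post(endpoint, headers=headers, data=body)`); swapping `model_id` is the only change needed to try another chat model.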

Model selection checklist

  • Prototype with Claude 4.5 Haiku or Nova Lite for fast, cost-effective validation.
  • Upgrade to Claude 4.5 Sonnet for balanced performance on complex tasks.
  • Reach for Claude 4 Sonnet when workflows demand deep reasoning.
  • Use Cohere Embed Multilingual to add semantic search and retrieval.

Compare models at a glance

| Model | Category | Strength | Relative cost | Context window | Extended thinking | Vision | Link |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.5 | Chat | Frontier model with autonomous coding | Premium | 200K tokens | Yes | Yes | Pricing |
| Claude 4.5 Sonnet | Chat | High-performance balanced model with speed | Balanced | 200K tokens | Yes | Yes | Pricing |
| Claude 4 Sonnet | Chat | Deep reasoning with chain-of-thought | Premium | 200K tokens | Yes | Yes | Pricing |
| Claude 3.7 Sonnet ⚠️ | Chat | High intelligence with extended thinking | High | 200K tokens | Yes | Yes | Deprecated Feb 28 |
| Claude 3.5 Sonnet ⚠️ | Chat | Balanced quality, speed, and cost | Balanced | 200K tokens | Optional | Yes | Deprecated Feb 28 |
| Claude 4.5 Haiku | Chat | Fast and highly cost-effective | Low | 200K tokens | No | n/a | Pricing |
| Claude 3.5 Haiku ⚠️ | Chat | Lowest latency, great for scale | Low | 200K tokens | No | n/a | Deprecated Feb 28 |
| Claude 3.0 Haiku ⚠️ | Chat | Legacy low-cost option | Low | 200K tokens | n/a | n/a | Deprecated Feb 28 |
| Amazon Nova 2 Lite | Chat | Fast with extended context and reasoning | Low | 1M tokens | Yes | n/a | Pricing |
| Amazon Nova Pro | Chat | Enterprise-first generalist | High | Large | n/a | n/a | Pricing |
| Amazon Nova Lite | Chat | Efficient conversational model | Low | Medium | n/a | n/a | Pricing |
| Kimi K2 Thinking | Chat | Chain-of-thought open-weight model | Balanced | Large | Yes | n/a | Pricing |
| MiniMax M2 | Chat | Programming and tool-calling specialist | Balanced | Large | n/a | n/a | Pricing |
| Qwen3 235B | Chat | Complex reasoning and agentic coding | Balanced | Large | n/a | n/a | Pricing |
| Qwen3 Coder 480B | Chat | Agentic coding specialist | Balanced | Large | n/a | n/a | Pricing |
| OpenAI GPT OSS 120B | Chat | Open-weight experimentation | Self-managed | 128K tokens | n/a | n/a | Pricing |
| Cohere Embed Multilingual | Embeddings | Multilingual semantic search | Per 1M tokens | n/a | n/a | n/a | Pricing |
| Cohere Rerank 3.5 | Rerank | Multilingual semantic reranking | Per search unit | n/a | n/a | n/a | Pricing |
| Amazon Rerank 1.0 | Rerank | High-performing AWS-backed reranker | Per search unit | n/a | n/a | n/a | Pricing |
| Stable Image Ultra | Image | High fidelity image synthesis | Per image | n/a | n/a | n/a | Pricing |

n/a indicates the capability is not specified in this catalog or the column does not apply.
Cost guidance reflects relative position within the Heroku AI catalog. Check the pricing page for live token and image rates before deploying to production.

Chat models

Advanced language models for assistants, copilots, and automated workflows.

Claude Opus 4.5

Highlights

  • Next-generation frontier LLM from Anthropic.
  • Autonomous coding capabilities with effort control.
  • Enhanced reasoning for complex multi-step tasks.
  • Multimodal: accepts images, PDFs, and mixed media.
  • 200K token window with extended thinking support.

Best for

  • Complex agentic workflows requiring autonomy.
  • Strategic analysis and research tasks.
  • Advanced coding assistants and code generation.
  • High-stakes decision support systems.

Operational tips

  • Use effort control to balance cost vs. reasoning depth.
  • Reserve for tasks that exceed Sonnet capabilities.
  • Stream responses for long-running agent tasks.
  • Combine with tool calling for autonomous workflows.
Model ID: claude-opus-4-5 · Regions: US, EU
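
The streaming and tool-calling tips above can be sketched as a single request body. The `get_ticket_status` tool below is a hypothetical example function, not part of any catalog API.

```python
# Sketch: a streaming, tool-enabled request body for an agentic workflow.
def build_agent_request(model_id: str, user_prompt: str) -> dict:
    return {
        "model": model_id,
        "stream": True,  # stream tokens for long-running agent tasks
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_ticket_status",  # hypothetical tool
                    "description": "Look up the status of a support ticket.",
                    "parameters": {
                        "type": "object",
                        "properties": {"ticket_id": {"type": "string"}},
                        "required": ["ticket_id"],
                    },
                },
            }
        ],
    }

request = build_agent_request("claude-opus-4-5", "Check ticket HK-1234.")
```

When the model emits a tool call, your application executes the function and appends the result as a tool message before continuing the loop.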

Claude 4.5 Sonnet

Highlights

  • Latest Sonnet model balancing intelligence and speed.
  • Multimodal: accepts images, PDFs, and mixed media.
  • 200K token window for large documents and code bases.
  • Supports extended thinking, tool calling, and structured outputs.

Best for

  • Complex tasks requiring both speed and quality.
  • Data processing and sales forecasting workflows.
  • Nuanced content generation for enterprise applications.

Operational tips

  • Optimized for high-throughput tasks and real-time interactions.
  • Use for applications requiring rapid responses and content moderation.
  • Stream responses to maintain low latency for end users.
Model ID: claude-4-5-sonnet

Claude 4 Sonnet

Highlights

  • Extended thinking unlocks deeper multi-step reasoning.
  • Multimodal: accepts images, PDFs, and mixed media.
  • 200K token window for large documents and code bases.
  • Supports tool calling and structured outputs.

Best for

  • Strategic analysis and research copilots.
  • Complex coding assistants and debugging flows.
  • Document intelligence use cases requiring full fidelity.

Operational tips

  • Enable extended thinking only when needed to manage output spend.
  • Use evaluation harnesses in AI Studio before promoting new prompts.
  • Stream responses to keep latency manageable for end users.
Model ID: claude-4-sonnet
Claude 3.7 Sonnet

Deprecated: This model will be removed on February 28, 2026. Migrate to Claude 4.5 Sonnet for improved performance and continued support.

Highlights

  • Extended thinking for difficult reasoning with reduced pricing.
  • Vision support and 200K context window.
  • Fast enough for most production assistants.

Best for

  • Technical documentation Q&A bots.
  • Code review companions.
  • Enterprise knowledge management workflows.

Operational tips

  • Instrument latency metrics; responses are slightly faster than Claude 4.
  • Reserve extended thinking for escalations to balance cost.
  • Attach the model to staging apps to validate prompts before rollout.
Model ID: claude-3-7-sonnet
Claude 3.5 Sonnet

Deprecated: This model will be removed on February 28, 2026. Migrate to Claude 4.5 Sonnet for improved performance and continued support.

Highlights

  • Strong reasoning with lower average latency than Claude 4.
  • Optional extended thinking when prompts demand it.
  • Vision for OCR, screenshot analysis, and product imagery.

Best for

  • General-purpose product assistants.
  • Code generation where turnaround speed matters.
  • Marketing content with light creative editing.

Operational tips

  • Set max_tokens to 600–1200 to keep responses snappy.
  • Toggle streaming for user-facing chat experiences.
  • Use as the baseline model in evaluation suites.
Model ID: claude-3-5-sonnet

Claude 4.5 Haiku

Highlights

  • Fast and highly cost-effective for high-throughput tasks.
  • Optimized for rapid responses and real-time interactions.
  • Full 200K context window for broad prompts.
  • Perfect for content moderation and inventory management.

Best for

  • Applications requiring rapid responses.
  • Content moderation at scale.
  • Inventory management and high-volume automation.

Operational tips

  • Optimized for high throughput with low latency.
  • No extended thinking—design prompts accordingly.
  • Batch inference when possible to maximize concurrency.
Model ID: claude-4-5-haiku
Claude 3.5 Haiku

Deprecated: This model will be removed on February 28, 2026. Migrate to Claude 4.5 Haiku for improved performance and continued support.

Highlights

  • Lowest latency across the Claude family.
  • Full 200K context window for broad prompts.
  • Optimized for throughput and predictable billing.

Best for

  • Customer support automations.
  • Moderation or classification pipelines.
  • Broadcast messaging, notifications, and templated replies.

Operational tips

  • No extended thinking—design prompts accordingly.
  • Batch inference when possible to maximize concurrency.
  • Monitor token mix; output remains capped at 4K tokens.
Model ID: claude-3-5-haiku
Claude 3.0 Haiku

Deprecated: This model will be removed on February 28, 2026. Migrate to Claude 4.5 Haiku for improved performance and continued support.

Highlights

  • Ultra-low latency for simple prompts.
  • Minimal cost overhead for pilot projects.
  • Compatible with existing Haiku workloads.

Best for

  • Lightweight Q&A and FAQ bots.
  • Content moderation stubs.
  • Legacy apps migrating from older Anthropic versions.

Operational tips

  • Plan migration to Claude 4.5 Haiku immediately.
  • Limit prompts to short instructions to maintain output quality.
  • Model will be removed on February 28, 2026.
Model ID: claude-3-haiku

Amazon Nova Pro

Highlights

  • Strong reasoning with AWS-native integrations.
  • Extended context for multilingual and domain-heavy prompts.
  • Governance aligned with AWS Bedrock controls.

Best for

  • Workloads already standardized on AWS security tooling.
  • Enterprise knowledge bases requiring PII guardrails.
  • Finance and regulated industry copilots.

Operational tips

  • Cross-check request volumes with AWS cost allocation tags.
  • Rely on Bedrock safety settings exposed via model parameters.
  • Coordinate upgrades with AWS release cadence.
Model ID: amazon-nova-pro

Amazon Nova Lite

Highlights

  • Lower price point compared to Nova Pro.
  • Optimized for general conversational tasks.
  • Shorter latency with manageable quality trade-offs.

Best for

  • Customer service and triage bots.
  • Internal productivity copilots.
  • High-volume prompts within the AWS ecosystem.

Operational tips

  • Use shorter prompts to keep responses sharp.
  • Pair with Claude models for fallback escalation flows.
  • Benchmark throughput against Haiku to choose the lowest-cost option.
Model ID: amazon-nova-lite
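
The fallback-escalation tip above can be sketched as a small router: send the prompt to the cheap model first and escalate to a Claude model only when the answer trips a heuristic. The model IDs come from this catalog; the confidence check is an illustrative placeholder.

```python
# Sketch: escalate from Nova Lite to Claude when the cheap answer looks weak.
PRIMARY_MODEL = "amazon-nova-lite"
FALLBACK_MODEL = "claude-4-5-sonnet"

def needs_escalation(answer: str) -> bool:
    # Placeholder heuristic: escalate on empty or hedging answers.
    hedges = ("i'm not sure", "i don't know", "cannot answer")
    return not answer or any(h in answer.lower() for h in hedges)

def choose_model(first_answer=None) -> str:
    if first_answer is None:
        return PRIMARY_MODEL   # first attempt: cheapest model
    if needs_escalation(first_answer):
        return FALLBACK_MODEL  # retry on the stronger model
    return PRIMARY_MODEL       # cheap answer was good enough
```

In production you would replace the heuristic with an evaluation signal (for example a grader model or a structured-output validity check).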

OpenAI GPT OSS 120B

Highlights

  • Open-source friendly architecture and weights.
  • Supports custom fine-tuning and adapters.
  • 128K context window for broader experimentation.

Best for

  • Research teams evaluating open models.
  • Hybrid setups mixing managed and self-hosted inference.
  • Education and experimentation environments.

Operational tips

  • Expect higher latency compared to hosted Claude models.
  • Budget for additional evaluation since quality varies by prompt.
  • Track GPU usage via Heroku metrics to control costs.
Model ID: openai-gpt-oss-120b

Amazon Nova 2 Lite

Highlights

  • Fast and cost-effective with 1M token context window.
  • Extended thinking with three intensity levels (low, medium, high).
  • Built-in tools: code interpreter and web grounding.
  • Remote MCP tool support for agentic workflows.

Best for

  • High-volume conversational applications.
  • Document processing and business automation.
  • Cost-sensitive deployments requiring reasoning.

Operational tips

  • Use thinking intensity controls to balance speed vs. reasoning.
  • Leverage 1M context for large document analysis.
  • Built-in code interpreter reduces need for external tools.
Model ID: amazon-nova-2-lite · Regions: US, EU

Kimi K2 Thinking

Highlights

  • Open-weight LLM from Moonshot AI with chain-of-thought.
  • Extended thinking for complex reasoning tasks.
  • Strong performance on multi-step problem solving.
  • Transparent reasoning process in responses.

Best for

  • Mathematical and logical reasoning tasks.
  • Research and analysis requiring step-by-step thinking.
  • Educational applications showing work process.

Operational tips

  • Allow for longer response times due to thinking steps.
  • US-only availability—plan deployments accordingly.
  • Extract intermediate reasoning for debugging.
Model ID: kimi-k2-thinking · Regions: US only

MiniMax M2

Highlights

  • Optimized for conversational chat and tool-calling.
  • Strong programming task capabilities.
  • Efficient for high-throughput applications.
  • Good balance of speed and quality.

Best for

  • Code generation and programming assistants.
  • Tool-enabled workflows and agents.
  • Technical support and developer copilots.

Operational tips

  • US-only availability—consider alternatives for EU.
  • Leverage tool-calling for structured outputs.
  • Test against Claude for quality comparison.
Model ID: minimax-m2 · Regions: US only

Qwen3 235B

Highlights

  • Large-scale model for complex reasoning tasks.
  • Strong conversational and tool-calling support.
  • Agentic coding capabilities for automated workflows.
  • Excellent multilingual understanding.

Best for

  • Complex reasoning and analysis tasks.
  • Agentic coding and autonomous development.
  • Multilingual applications and content.

Operational tips

  • US-only availability—plan deployments accordingly.
  • Use for tasks requiring deep reasoning capabilities.
  • Monitor latency for time-sensitive applications.
Model ID: qwen3-235b · Regions: US only

Qwen3 Coder 480B

Highlights

  • Largest Qwen model optimized for coding tasks.
  • Agentic coding with autonomous development capabilities.
  • Tool-calling for IDE and workflow integrations.
  • Strong performance across programming languages.

Best for

  • Automated code generation and refactoring.
  • Complex coding agents and development tools.
  • Code review and analysis at scale.

Operational tips

  • US-only availability—ideal for US-based dev teams.
  • Expect higher latency due to model size.
  • Combine with tool-calling for IDE integration.
Model ID: qwen3-coder-480b · Regions: US only

Embedding models

Use embeddings for semantic search, classification, clustering, and retrieval-augmented generation.

Cohere Embed Multilingual

Highlights

  • 100+ language support with 1,024-dimension vectors.
  • Optimized presets for search, classification, and clustering.
  • Batch up to 96 inputs per request to lower per-item cost.

Best for

  • Global support search with mixed-language corpora.
  • RAG pipelines feeding Claude Sonnet responses.
  • Content recommendation and deduplication systems.

Operational tips

  • Normalize text (lowercase, trim whitespace) before embedding.
  • Persist vectors in Postgres + pgvector for efficient retrieval.
  • Cache frequently queried embeddings to avoid reprocessing.
Model ID: cohere-embed-multilingual · Input types: search_document, search_query, classification, clustering
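
The normalization and batching tips combine into one helper: clean each document, then split the corpus into requests of at most 96 inputs. The `input`/`input_type` field names assume the OpenAI-style embeddings request shape; verify them against the Embeddings API reference.

```python
# Sketch: normalize documents, then batch them under the 96-input limit.
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before embedding.
    return " ".join(text.lower().split())

def batch_requests(docs, model_id="cohere-embed-multilingual", batch_size=96):
    cleaned = [normalize(d) for d in docs]
    return [
        {
            "model": model_id,
            "input": cleaned[i : i + batch_size],
            "input_type": "search_document",  # preset for corpus documents
        }
        for i in range(0, len(cleaned), batch_size)
    ]

requests_ = batch_requests(["  Hello World  "] * 200)  # 200 docs -> 3 requests
```

Queries would use `input_type="search_query"` so the query and document presets stay paired, and the resulting vectors can go straight into pgvector.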

Image models

Generate photorealistic and stylized visuals straight from prompts.

Stable Image Ultra

Highlights

  • Supports 16:9, 1:1, 21:9, 2:3, 3:2, 4:5, 5:4, 9:16, and 9:21 aspect ratios.
  • Resolutions up to 1536×640 with prompt adherence tuned for enterprise brand safety.
  • Negative prompts and seed control for reproducible iterations.

Best for

  • Marketing and creative production pipelines.
  • Product visualization and storyboarding.
  • Social content generation with tight turnaround times.

Operational tips

  • Start with draft-size renders to validate prompts before increasing resolution.
  • Store seeds alongside prompts so teams can reproduce edits.
  • Use negative prompts to block banned styles or elements.
Model ID: stable-image-ultra
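
The tips above (seed tracking, negative prompts) can be sketched as a request builder. Field names such as `negative_prompt` and `aspect_ratio` follow common Stable Image conventions and should be checked against the Image Generation API reference.

```python
# Sketch: build a reproducible image generation request.
def build_image_request(prompt: str, seed: int) -> dict:
    return {
        "model": "stable-image-ultra",
        "prompt": prompt,
        "negative_prompt": "text, watermark, logo",  # block banned elements
        "aspect_ratio": "16:9",
        "seed": seed,  # store alongside the prompt to reproduce edits later
    }

draft_request = build_image_request("product hero shot, studio lighting", seed=42)
```

Persisting `(prompt, negative_prompt, seed)` together is what lets a teammate regenerate the exact render during review.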

Rerank models

Improve search quality by ranking documents based on semantic relevance to a query.

Cohere Rerank 3.5

Highlights

  • Enhanced reasoning with broad data compatibility.
  • Multilingual support for global applications.
  • Up to 1000 documents per request.
  • Optimized for RAG pipeline integration.

Best for

  • Retrieval-Augmented Generation (RAG) pipelines.
  • Semantic search result ranking.
  • Multilingual document retrieval.

Operational tips

  • Use after initial vector search to improve relevance.
  • Set top_n to limit results and reduce latency.
  • Rate limit: 250 requests per minute.
Model ID: cohere-rerank-3-5 · Regions: US, EU
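
A typical flow runs a vector search first, then reranks the candidates with a bounded `top_n`. The field names below mirror Cohere's rerank request shape; confirm them against the Rerank API reference.

```python
# Sketch: rerank vector-search candidates, respecting the per-request
# document limit and trimming results with top_n to reduce latency.
def build_rerank_request(query: str, documents: list, top_n: int = 3) -> dict:
    return {
        "model": "cohere-rerank-3-5",
        "query": query,
        "documents": documents[:1000],  # up to 1000 documents per request
        "top_n": top_n,                 # keep only the best matches
    }

candidates = [f"doc-{i}" for i in range(1500)]
request = build_rerank_request("reset my password", candidates)
```

Requests beyond the 250-per-minute rate limit should be queued or retried with backoff on the client side.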

Amazon Rerank 1.0

Highlights

  • High-performing reranker backed by AWS infrastructure.
  • Seamless integration with AWS ecosystem.
  • Enterprise-grade reliability and scaling.
  • Consistent performance for production workloads.

Best for

  • Enterprise search applications.
  • AWS-native RAG implementations.
  • High-volume reranking workloads.

Operational tips

  • Ideal for teams already using AWS services.
  • Rate limit: 200 requests per minute.
  • Pair with Cohere embeddings for full RAG pipeline.
Model ID: amazon-rerank-1-0 · Regions: US, EU

Model selection playbooks

Picking a default chat model

  1. Prototype with Claude 4.5 Haiku or Nova 2 Lite for fast, cost-effective performance.
  2. Upgrade to Claude 4.5 Sonnet for balanced performance on complex tasks.
  3. Escalate to Claude 4 Sonnet for high-stakes workflows (compliance, financial analysis, multi-step coding).
  4. Reach for Claude Opus 4.5 for complex agentic workflows and autonomous coding tasks.
  5. Use Nova Pro if you need tighter alignment with AWS governance.
  6. Try Qwen3 Coder 480B or Kimi K2 Thinking for specialized coding or reasoning tasks (US only).
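
The playbook above can be captured as a lookup from task profile to a default model ID; the profile labels are illustrative, not API values.

```python
# Sketch: map an application's task profile to a default catalog model ID.
DEFAULT_MODEL_BY_PROFILE = {
    "prototype": "claude-4-5-haiku",      # fast, cost-effective validation
    "balanced": "claude-4-5-sonnet",      # complex tasks, balanced cost
    "deep-reasoning": "claude-4-sonnet",  # high-stakes, multi-step workflows
    "agentic": "claude-opus-4-5",         # autonomous coding and agents
    "aws-governed": "amazon-nova-pro",    # AWS governance alignment
}

def pick_default_model(profile: str) -> str:
    # Unknown profiles fall back to the balanced tier.
    return DEFAULT_MODEL_BY_PROFILE.get(profile, "claude-4-5-sonnet")
```

Because the endpoints are OpenAI-compatible, swapping tiers is a one-line config change rather than a code change.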

Cost and performance controls

  • Stream completions to surface partial output without waiting for full responses.
  • Cap max_tokens based on UI constraints to avoid runaway output cost.
  • Use temperature ≤ 0.5 for deterministic system flows.
  • Batch embedding requests and reuse embeddings for unchanged documents.
  • Track spend per app in the Heroku Dashboard and set alerts for anomalies.
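
The controls above can be folded into one helper that normalizes any chat request: force streaming, cap max_tokens to a UI budget, and clamp temperature for deterministic flows. The 800-token budget is an arbitrary example value.

```python
# Sketch: apply catalog-wide cost and performance controls to a request body.
def apply_cost_controls(payload: dict, ui_token_budget: int = 800) -> dict:
    controlled = dict(payload)
    controlled["stream"] = True  # surface partial output immediately
    controlled["max_tokens"] = min(
        payload.get("max_tokens", ui_token_budget), ui_token_budget
    )  # cap output to what the UI can display
    controlled["temperature"] = min(
        payload.get("temperature", 0.5), 0.5
    )  # keep system flows deterministic
    return controlled

request = apply_cost_controls(
    {"model": "claude-4-5-sonnet", "messages": [], "max_tokens": 4000}
)
```

Applying the helper at a single chokepoint (for example, an API gateway or shared client wrapper) keeps every app on the same guardrails.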

Provisioning checklist

  1. Inspect availability: heroku ai:models:list
  2. Create a managed resource: heroku ai:models:create claude-4-5-sonnet --app my-ai-app
  3. Verify attachment: heroku ai:models:info --app my-ai-app

Chat Completions API

Implement conversational workloads with streaming and tool calling.

Embeddings API

Build RAG, semantic search, and clustering pipelines.

Rerank API

Improve search relevance by reranking documents semantically.

Image Generation API

Turn prompts into assets for marketing, product, and design.

AI Studio

Evaluate prompts, compare models, and share test links.