Catalog overview
Pick the best model for every experience
Heroku Managed Inference and Agents exposes leading chat, embedding, and generative image models behind OpenAI-compatible endpoints. Use these cards and comparison tools to balance quality, performance, and spend.
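All chat models sit behind the same OpenAI-compatible chat completions endpoint, so switching models is usually just a model-ID change. A minimal sketch using only the standard library, assuming the `INFERENCE_URL`, `INFERENCE_KEY`, and `INFERENCE_MODEL_ID` config vars that Heroku attaches to your app (adjust the names if your setup differs):

```python
import json
import os
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    # Config vars are assumed to be set by the model attachment.
    url = os.environ["INFERENCE_URL"].rstrip("/") + "/v1/chat/completions"
    payload = build_chat_request(os.environ["INFERENCE_MODEL_ID"], prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['INFERENCE_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the payload shape is shared across models, the same `build_chat_request` helper works whether you point it at a Haiku, Sonnet, or Nova model ID.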
Model selection checklist
- Prototype with Claude 4.5 Haiku or Nova Lite for fast, cost-effective validation.
- Upgrade to Claude 4.5 Sonnet for balanced performance on complex tasks.
- Reach for Claude 4 Sonnet when workflows demand deep reasoning.
- Use Cohere Embed Multilingual to add semantic search and retrieval.
Featured models
Compare models at a glance
| Model | Category | Strength | Relative cost | Context window | Extended thinking | Vision | Notes |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | Chat | Frontier model with autonomous coding | Premium | 200K tokens | ✓ | ✓ | Pricing |
| Claude 4.5 Sonnet | Chat | High-performance balanced model with speed | Balanced | 200K tokens | ✓ | ✓ | Pricing |
| Claude 4 Sonnet | Chat | Deep reasoning with chain-of-thought | Premium | 200K tokens | ✓ | ✓ | Pricing |
| Claude 3.7 Sonnet ⚠️ | Chat | High intelligence with extended thinking | High | 200K tokens | ✓ | ✓ | Deprecated Feb 28 |
| Claude 3.5 Sonnet ⚠️ | Chat | Balanced quality, speed, and cost | Balanced | 200K tokens | Optional | ✓ | Deprecated Feb 28 |
| Claude 4.5 Haiku | Chat | Fast and highly cost-effective | Low | 200K tokens | ✗ | ✗ | Pricing |
| Claude 3.5 Haiku ⚠️ | Chat | Lowest latency, great for scale | Low | 200K tokens | ✗ | ✗ | Deprecated Feb 28 |
| Claude 3.0 Haiku ⚠️ | Chat | Legacy low-cost option | Low | 200K tokens | ✗ | ✗ | Deprecated Feb 28 |
| Amazon Nova 2 Lite | Chat | Fast with extended context and reasoning | Low | 1M tokens | ✓ | ✗ | Pricing |
| Amazon Nova Pro | Chat | Enterprise-first generalist | High | Large | ✗ | ✗ | Pricing |
| Amazon Nova Lite | Chat | Efficient conversational model | Low | Medium | ✗ | ✗ | Pricing |
| Kimi K2 Thinking | Chat | Chain-of-thought open-weight model | Balanced | Large | ✓ | ✗ | Pricing |
| MiniMax M2 | Chat | Programming and tool-calling specialist | Balanced | Large | ✗ | ✗ | Pricing |
| Qwen3 235B | Chat | Complex reasoning and agentic coding | Balanced | Large | ✗ | ✗ | Pricing |
| Qwen3 Coder 480B | Chat | Agentic coding specialist | Balanced | Large | ✗ | ✗ | Pricing |
| OpenAI GPT OSS 120B | Chat | Open-weight experimentation | Self-managed | 128K tokens | ✗ | ✗ | Pricing |
| Cohere Embed Multilingual | Embeddings | Multilingual semantic search | Per 1M tokens | — | — | — | Pricing |
| Cohere Rerank 3.5 | Rerank | Multilingual semantic reranking | Per search unit | — | — | — | Pricing |
| Amazon Rerank 1.0 | Rerank | High-performing AWS-backed reranker | Per search unit | — | — | — | Pricing |
| Stable Image Ultra | Image | High fidelity image synthesis | Per image | — | — | — | Pricing |
Chat models
Advanced language models for assistants, copilots, and automated workflows.
Claude Opus 4.5 (frontier model)
Highlights
- Next-generation frontier LLM from Anthropic.
- Autonomous coding capabilities with effort control.
- Enhanced reasoning for complex multi-step tasks.
- Multimodal: accepts images, PDFs, and mixed media.
- 200K token window with extended thinking support.
Best for
- Complex agentic workflows requiring autonomy.
- Strategic analysis and research tasks.
- Advanced coding assistants and code generation.
- High-stakes decision support systems.
Operational tips
- Use effort control to balance cost vs. reasoning depth.
- Reserve for tasks that exceed Sonnet capabilities.
- Stream responses for long-running agent tasks.
- Combine with tool calling for autonomous workflows.
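Streamed responses arrive as OpenAI-style server-sent events. A small sketch of consuming them, assuming the standard `data: {...}` chunk format with a `[DONE]` terminator:

```python
import json

def parse_sse_line(line: str):
    """Extract the text delta from one OpenAI-style SSE line, if any."""
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):].strip()
    if data == "[DONE]":
        return None
    chunk = json.loads(data)
    delta = chunk["choices"][0].get("delta", {})
    return delta.get("content")

def stream_text(lines):
    """Yield text deltas from an iterable of SSE lines (e.g. a streamed body)."""
    for raw in lines:
        piece = parse_sse_line(raw)
        if piece:
            yield piece
```

Surfacing these deltas as they arrive keeps long-running agent tasks responsive instead of blocking until the full completion returns.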
claude-opus-4-5
Claude 4.5 Sonnet (high-performance balance)
Highlights
- Latest Sonnet model balancing intelligence and speed.
- Multimodal: accepts images, PDFs, and mixed media.
- 200K token window for large documents and code bases.
- Supports extended thinking, tool calling, and structured outputs.
Best for
- Complex tasks requiring both speed and quality.
- Data processing and sales forecasting workflows.
- Nuanced content generation for enterprise applications.
Operational tips
- Optimized for high-throughput tasks and real-time interactions.
- Use for applications requiring rapid responses and content moderation.
- Stream responses to maintain low latency for end users.
claude-4-5-sonnet
Claude 4 Sonnet (premium reasoning)
Claude 3.7 Sonnet (DEPRECATED - Feb 28)
Highlights
- Extended thinking for difficult reasoning with reduced pricing.
- Vision support and 200K context window.
- Fast enough for most production assistants.
Best for
- Technical documentation Q&A bots.
- Code review companions.
- Enterprise knowledge management workflows.
Operational tips
- Instrument latency metrics; responses are slightly faster than Claude 4.
- Reserve extended thinking for escalations to balance cost.
- Attach the model to staging apps to validate prompts before rollout.
claude-3-7-sonnet
Claude 3.5 Sonnet (DEPRECATED - Feb 28)
Highlights
- Strong reasoning with faster average latency than Claude 4.
- Optional extended thinking when prompts demand it.
- Vision for OCR, screenshot analysis, and product imagery.
Best for
- General-purpose product assistants.
- Code generation where turnaround speed matters.
- Marketing content with light creative editing.
Operational tips
- Set `max_tokens` between 600–1200 to keep responses snappy.
- Toggle streaming for user-facing chat experiences.
- Use as the baseline model in evaluation suites.
claude-3-5-sonnet
Claude 4.5 Haiku (fast and cost-effective)
Highlights
- Fast and highly cost-effective for high-throughput tasks.
- Optimized for rapid responses and real-time interactions.
- Full 200K context window for broad prompts.
- Perfect for content moderation and inventory management.
Best for
- Applications requiring rapid responses.
- Content moderation at scale.
- Inventory management and high-volume automation.
Operational tips
- Optimized for high-throughput with low latency.
- No extended thinking—design prompts accordingly.
- Batch inference when possible to maximize concurrency.
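The batching tip above can be sketched with a thread pool; `classify` here stands in for any per-item call (a hypothetical helper, not part of the API), such as a Haiku moderation request:

```python
from concurrent.futures import ThreadPoolExecutor

def classify_batch(items, classify, max_workers=8):
    """Run a per-item model call concurrently.

    `classify` is any callable taking one item and returning a result;
    pool.map preserves input order, so results line up with items.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify, items))
```

Tune `max_workers` against your rate limits: low-latency models like Haiku tolerate more concurrency before queueing dominates.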
claude-4-5-haiku
Claude 3.5 Haiku (DEPRECATED - Feb 28)
Highlights
- Lowest latency across the Claude family.
- Full 200K context window for broad prompts.
- Optimized for throughput and predictable billing.
Best for
- Customer support automations.
- Moderation or classification pipelines.
- Broadcast messaging, notifications, and templated replies.
Operational tips
- No extended thinking—design prompts accordingly.
- Batch inference when possible to maximize concurrency.
- Monitor token mix; output remains capped at 4K tokens.
claude-3-5-haiku
Claude 3.0 Haiku (DEPRECATED - Feb 28)
Highlights
- Ultra-low latency for simple prompts.
- Minimal cost overhead for pilot projects.
- Compatible with existing Haiku workloads.
Best for
- Lightweight Q&A and FAQ bots.
- Content moderation stubs.
- Legacy apps migrating from older Anthropic versions.
Operational tips
- Plan migration to Claude 4.5 Haiku immediately.
- Limit prompts to short instructions to maintain output quality.
- Model will be removed on February 28, 2025.
claude-3-haiku
Amazon Nova Pro (enterprise generalist)
Highlights
- Strong reasoning with AWS-native integrations.
- Extended context for multilingual and domain-heavy prompts.
- Governance aligned with AWS Bedrock controls.
Best for
- Workloads already standardized on AWS security tooling.
- Enterprise knowledge bases requiring PII guardrails.
- Finance and regulated industry copilots.
Operational tips
- Cross-check request volumes with AWS cost allocation tags.
- Rely on Bedrock safety settings exposed via model parameters.
- Coordinate upgrades with AWS release cadence.
amazon-nova-pro
Amazon Nova Lite (efficient chat)
Highlights
- Lower price point compared to Nova Pro.
- Optimized for general conversational tasks.
- Shorter latency with manageable quality trade-offs.
Best for
- Customer service and triage bots.
- Internal productivity copilots.
- High-volume prompts within the AWS ecosystem.
Operational tips
- Use shorter prompts to keep responses sharp.
- Pair with Claude models for fallback escalation flows.
- Benchmark throughput against Haiku to choose the lowest-cost option.
amazon-nova-lite
OpenAI GPT OSS 120B (open-weight sandbox)
Highlights
- Open-source friendly architecture and weights.
- Supports custom fine-tuning and adapters.
- 128K context window for broader experimentation.
Best for
- Research teams evaluating open models.
- Hybrid setups mixing managed and self-hosted inference.
- Education and experimentation environments.
Operational tips
- Expect higher latency compared to hosted Claude models.
- Budget for additional evaluation since quality varies by prompt.
- Track GPU usage via Heroku metrics to control costs.
openai-gpt-oss-120b
Amazon Nova 2 Lite (fast reasoning with 1M context)
Highlights
- Fast and cost-effective with 1M token context window.
- Extended thinking with three intensity levels (low, medium, high).
- Built-in tools: code interpreter and web grounding.
- Remote MCP tool support for agentic workflows.
Best for
- High-volume conversational applications.
- Document processing and business automation.
- Cost-sensitive deployments requiring reasoning.
Operational tips
- Use thinking intensity controls to balance speed vs. reasoning.
- Leverage 1M context for large document analysis.
- Built-in code interpreter reduces need for external tools.
amazon-nova-2-lite
Kimi K2 Thinking (chain-of-thought reasoning)
Highlights
- Open-weight LLM from Moonshot AI with chain-of-thought.
- Extended thinking for complex reasoning tasks.
- Strong performance on multi-step problem solving.
- Transparent reasoning process in responses.
Best for
- Mathematical and logical reasoning tasks.
- Research and analysis requiring step-by-step thinking.
- Educational applications showing work process.
Operational tips
- Allow for longer response times due to thinking steps.
- US-only availability—plan deployments accordingly.
- Extract intermediate reasoning for debugging.
kimi-k2-thinking
MiniMax M2 (programming and tools)
Highlights
- Optimized for conversational chat and tool-calling.
- Strong programming task capabilities.
- Efficient for high-throughput applications.
- Good balance of speed and quality.
Best for
- Code generation and programming assistants.
- Tool-enabled workflows and agents.
- Technical support and developer copilots.
Operational tips
- US-only availability—consider alternatives for EU.
- Leverage tool-calling for structured outputs.
- Test against Claude for quality comparison.
minimax-m2
Qwen3 235B (complex reasoning)
Highlights
- Large-scale model for complex reasoning tasks.
- Strong conversational and tool-calling support.
- Agentic coding capabilities for automated workflows.
- Excellent multilingual understanding.
Best for
- Complex reasoning and analysis tasks.
- Agentic coding and autonomous development.
- Multilingual applications and content.
Operational tips
- US-only availability—plan deployments accordingly.
- Use for tasks requiring deep reasoning capabilities.
- Monitor latency for time-sensitive applications.
qwen3-235b
Qwen3 Coder 480B (agentic coding specialist)
Highlights
- Largest Qwen model optimized for coding tasks.
- Agentic coding with autonomous development capabilities.
- Tool-calling for IDE and workflow integrations.
- Strong performance across programming languages.
Best for
- Automated code generation and refactoring.
- Complex coding agents and development tools.
- Code review and analysis at scale.
Operational tips
- US-only availability—ideal for US-based dev teams.
- Expect higher latency due to model size.
- Combine with tool-calling for IDE integration.
qwen3-coder-480b
Embedding models
Use embeddings for semantic search, classification, clustering, and retrieval-augmented generation.
Cohere Embed Multilingual
Highlights
- 100+ language support with 1,024-dimension vectors.
- Optimized presets for search, classification, and clustering.
- Batch up to 96 inputs per request to lower per-item cost.
Best for
- Global support search with mixed-language corpora.
- RAG pipelines feeding Claude Sonnet responses.
- Content recommendation and deduplication systems.
Operational tips
- Normalize text (lowercase, trim whitespace) before embedding.
- Persist vectors in Postgres + pgvector for efficient retrieval.
- Cache frequently queried embeddings to avoid reprocessing.
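The normalization and batching tips can be sketched as follows; the payload field names (`input`, `input_type`) follow common OpenAI/Cohere-style conventions and should be checked against the endpoint reference:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before embedding, per the tips above."""
    return " ".join(text.lower().split())

def build_embed_requests(texts, input_type="search_document", batch_size=96):
    """Yield embedding payloads, batched at 96 inputs per request."""
    cleaned = [normalize(t) for t in texts]
    for i in range(0, len(cleaned), batch_size):
        yield {
            "model": "cohere-embed-multilingual",
            "input": cleaned[i:i + batch_size],
            "input_type": input_type,
        }
```

Use `input_type="search_query"` when embedding user queries and `"search_document"` when embedding the corpus, so query and document vectors land in compatible spaces.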
cohere-embed-multilingual
Input types: search_document, search_query, classification, clustering
Image models
Generate photorealistic and stylized visuals straight from prompts.
Stable Image Ultra
Highlights
- Supports 16:9, 1:1, 21:9, 2:3, 3:2, 4:5, 5:4, 9:16, and 9:21 aspect ratios.
- Resolutions up to 1536×640 with prompt adherence tuned for enterprise brand safety.
- Negative prompts and seed control for reproducible iterations.
Best for
- Marketing and creative production pipelines.
- Product visualization and storyboarding.
- Social content generation with tight turnaround times.
Operational tips
- Start with draft-size renders to validate prompts before increasing resolution.
- Store seeds alongside prompts so teams can reproduce edits.
- Use negative prompts to block banned styles or elements.
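The seed and negative-prompt tips can be captured in a small request builder. The field names (`seed`, `negative_prompt`, `aspect_ratio`) are assumptions based on typical Stability-style APIs; verify them against the model's parameter reference:

```python
def build_image_request(prompt, seed=None, negative_prompt=None, aspect_ratio="1:1"):
    """Build an image generation payload; field names are assumed, not confirmed."""
    payload = {"prompt": prompt, "aspect_ratio": aspect_ratio}
    if seed is not None:
        # Persist this seed alongside the prompt so teams can reproduce the render.
        payload["seed"] = seed
    if negative_prompt:
        payload["negative_prompt"] = negative_prompt
    return payload
```

Storing the returned payload (prompt, seed, and negative prompt together) in your asset pipeline is what makes later edits reproducible.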
stable-image-ultra
Rerank models
Improve search quality by ranking documents based on semantic relevance to a query.
Cohere Rerank 3.5 (multilingual)
Highlights
- Enhanced reasoning with broad data compatibility.
- Multilingual support for global applications.
- Up to 1000 documents per request.
- Optimized for RAG pipeline integration.
Best for
- Retrieval-Augmented Generation (RAG) pipelines.
- Semantic search result ranking.
- Multilingual document retrieval.
Operational tips
- Use after initial vector search to improve relevance.
- Set `top_n` to limit results and reduce latency.
- Rate limit: 250 requests per minute.
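A sketch of a rerank request built after an initial vector search; the field names (`query`, `documents`, `top_n`) follow Cohere's rerank conventions and should be verified against the endpoint:

```python
def build_rerank_request(query, documents, top_n=5):
    """Build a rerank payload to reorder vector-search candidates by relevance."""
    return {
        "model": "cohere-rerank-3-5",
        "query": query,
        # The card above lists a ceiling of 1000 documents per request.
        "documents": documents[:1000],
        "top_n": min(top_n, len(documents)),
    }
```

Keeping `top_n` small (5–20) trims both response latency and the prompt size of any downstream RAG call that consumes the reranked passages.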
cohere-rerank-3-5
Amazon Rerank 1.0 (AWS-backed)
Highlights
- High-performing reranker backed by AWS infrastructure.
- Seamless integration with AWS ecosystem.
- Enterprise-grade reliability and scaling.
- Consistent performance for production workloads.
Best for
- Enterprise search applications.
- AWS-native RAG implementations.
- High-volume reranking workloads.
Operational tips
- Ideal for teams already using AWS services.
- Rate limit: 200 requests per minute.
- Pair with Cohere embeddings for full RAG pipeline.
amazon-rerank-1-0
Model selection playbooks
Picking a default chat model
- Prototype with Claude 4.5 Haiku or Nova 2 Lite for fast, cost-effective performance.
- Upgrade to Claude 4.5 Sonnet for balanced performance on complex tasks.
- Escalate to Claude 4 Sonnet for high-stakes workflows (compliance, financial analysis, multi-step coding).
- Reach for Claude Opus 4.5 for complex agentic workflows and autonomous coding tasks.
- Use Nova Pro if you need tighter alignment with AWS governance.
- Try Qwen3 Coder 480B or Kimi K2 Thinking for specialized coding or reasoning tasks (US only).
Cost and performance controls
- Stream completions to surface partial output without waiting for full responses.
- Cap `max_tokens` based on UI constraints to avoid runaway output cost.
- Use `temperature` ≤ 0.5 for deterministic system flows.
- Batch embedding requests and reuse embeddings for unchanged documents.
- Track spend per app in the Heroku Dashboard and set alerts for anomalies.
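The caps above can be enforced centrally with a small helper applied to every outgoing payload (a hypothetical pattern, not a platform feature):

```python
def apply_cost_controls(payload, max_tokens_cap=800, deterministic=False):
    """Clamp max_tokens and temperature per the guidance above."""
    payload = dict(payload)  # avoid mutating the caller's payload
    payload["max_tokens"] = min(payload.get("max_tokens", max_tokens_cap),
                                max_tokens_cap)
    if deterministic:
        payload["temperature"] = min(payload.get("temperature", 0.5), 0.5)
    # Stream by default so partial output surfaces early.
    payload.setdefault("stream", True)
    return payload
```

Routing all requests through one helper like this gives you a single place to tighten limits when the dashboard flags a spend anomaly.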