
Catalog overview

Pick the best model for every experience

Heroku Managed Inference and Agents exposes leading chat, embedding, and generative image models behind OpenAI-compatible endpoints. Use these cards and comparison tools to balance quality, performance, and spend.
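
Because every model sits behind the same OpenAI-compatible surface, a request differs only in the `model` field. The sketch below builds a chat completions request body; the INFERENCE_URL, INFERENCE_KEY, and INFERENCE_MODEL_ID config var names and the /v1/chat/completions route are assumptions to verify against your attached resource.

```python
import json
import os

# Config vars assumed to be set when a model resource is attached;
# verify the exact names on your app (heroku config --app my-ai-app).
base_url = os.environ.get("INFERENCE_URL", "https://us.inference.heroku.com")
api_key = os.environ.get("INFERENCE_KEY", "example-key")
model_id = os.environ.get("INFERENCE_MODEL_ID", "claude-4-5-sonnet")

# OpenAI-compatible chat completions request body.
payload = {
    "model": model_id,
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize this week's release notes."},
    ],
    "max_tokens": 512,
}
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}
endpoint = f"{base_url}/v1/chat/completions"
body = json.dumps(payload)
```

POST `body` to `endpoint` with any HTTP client (for example `requests.post(endpoint, headers=headers, data=body)`); swapping `model_id` is the only change needed to try another chat model.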

Model selection checklist

  • Prototype with Claude 4.5 Haiku or Nova Lite for fast, cost-effective validation.
  • Upgrade to Claude 4.5 Sonnet for balanced performance on complex tasks.
  • Reach for Claude 4 Sonnet when workflows demand deep reasoning.
  • Use Cohere Embed Multilingual to add semantic search and retrieval.

Compare models at a glance

| Model | Category | Strength | Relative cost | Context window | Extended thinking | Vision | Link |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.5 | Chat | Frontier model with autonomous coding | Premium | 200K tokens | Yes | Yes | Pricing |
| Claude 4.5 Sonnet | Chat | High-performance balanced model with speed | Balanced | 200K tokens | Yes | Yes | Pricing |
| Claude 4 Sonnet | Chat | Deep reasoning with chain-of-thought | Premium | 200K tokens | Yes | Yes | Pricing |
| Claude 3.7 Sonnet ⚠️ | Chat | High intelligence with extended thinking | High | 200K tokens | Yes | Yes | Deprecated Feb 28 |
| Claude 3.5 Sonnet ⚠️ | Chat | Balanced quality, speed, and cost | Balanced | 200K tokens | Optional | Yes | Deprecated Feb 28 |
| Claude 4.5 Haiku | Chat | Fast and highly cost-effective | Low | 200K tokens | No | n/a | Pricing |
| Claude 3.5 Haiku ⚠️ | Chat | Lowest latency, great for scale | Low | 200K tokens | No | n/a | Deprecated Feb 28 |
| Claude 3.0 Haiku ⚠️ | Chat | Legacy low-cost option | Low | 200K tokens | n/a | n/a | Deprecated Feb 28 |
| Amazon Nova 2 Lite | Chat | Fast with extended context and reasoning | Low | 1M tokens | Yes | n/a | Pricing |
| Amazon Nova Pro | Chat | Enterprise-first generalist | High | Large | n/a | n/a | Pricing |
| Amazon Nova Lite | Chat | Efficient conversational model | Low | Medium | n/a | n/a | Pricing |
| Kimi K2 Thinking | Chat | Chain-of-thought open-weight model | Balanced | Large | Yes | n/a | Pricing |
| MiniMax M2 | Chat | Programming and tool-calling specialist | Balanced | Large | n/a | n/a | Pricing |
| Qwen3 235B | Chat | Complex reasoning and agentic coding | Balanced | Large | n/a | n/a | Pricing |
| Qwen3 Coder 480B | Chat | Agentic coding specialist | Balanced | Large | n/a | n/a | Pricing |
| OpenAI GPT OSS 120B | Chat | Open-weight experimentation | Self-managed | 128K tokens | n/a | n/a | Pricing |
| Cohere Embed Multilingual | Embeddings | Multilingual semantic search | Per 1M tokens | n/a | n/a | n/a | Pricing |
| Cohere Rerank 3.5 | Rerank | Multilingual semantic reranking | Per search unit | n/a | n/a | n/a | Pricing |
| Amazon Rerank 1.0 | Rerank | High-performing AWS-backed reranker | Per search unit | n/a | n/a | n/a | Pricing |
| Stable Image Ultra | Image | High fidelity image synthesis | Per image | n/a | n/a | n/a | Pricing |

n/a indicates the capability is not specified in this catalog or the column does not apply.
Cost guidance reflects relative position within the Heroku AI catalog. Check the pricing page for live token and image rates before deploying to production.

Chat models

Advanced language models for assistants, copilots, and automated workflows.

Claude Opus 4.5

Highlights

  • Next-generation frontier LLM from Anthropic.
  • Autonomous coding capabilities with effort control.
  • Enhanced reasoning for complex multi-step tasks.
  • Multimodal: accepts images, PDFs, and mixed media.
  • 200K token window with extended thinking support.

Best for

  • Complex agentic workflows requiring autonomy.
  • Strategic analysis and research tasks.
  • Advanced coding assistants and code generation.
  • High-stakes decision support systems.

Operational tips

  • Use effort control to balance cost vs. reasoning depth.
  • Reserve for tasks that exceed Sonnet capabilities.
  • Stream responses for long-running agent tasks.
  • Combine with tool calling for autonomous workflows.
Model ID: claude-opus-4-5 · Regions: US, EU
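
The streaming and tool-calling tips above can be sketched as a single request body. The `get_ticket_status` tool below is a hypothetical example function, not part of any catalog API.

```python
# Sketch: a streaming, tool-enabled request body for an agentic workflow.
def build_agent_request(model_id: str, user_prompt: str) -> dict:
    return {
        "model": model_id,
        "stream": True,  # stream tokens for long-running agent tasks
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_ticket_status",  # hypothetical tool
                    "description": "Look up the status of a support ticket.",
                    "parameters": {
                        "type": "object",
                        "properties": {"ticket_id": {"type": "string"}},
                        "required": ["ticket_id"],
                    },
                },
            }
        ],
    }

request = build_agent_request("claude-opus-4-5", "Check ticket HK-1234.")
```

When the model emits a tool call, your application executes the function and appends the result as a tool message before continuing the loop.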

Claude 4.5 Sonnet

Highlights

  • Latest Sonnet model balancing intelligence and speed.
  • Multimodal: accepts images, PDFs, and mixed media.
  • 200K token window for large documents and code bases.
  • Supports extended thinking, tool calling, and structured outputs.

Best for

  • Complex tasks requiring both speed and quality.
  • Data processing and sales forecasting workflows.
  • Nuanced content generation for enterprise applications.

Operational tips

  • Optimized for high-throughput tasks and real-time interactions.
  • Use for applications requiring rapid responses and content moderation.
  • Stream responses to maintain low latency for end users.
Model ID: claude-4-5-sonnet

Claude 4 Sonnet

Highlights

  • Extended thinking unlocks deeper multi-step reasoning.
  • Multimodal: accepts images, PDFs, and mixed media.
  • 200K token window for large documents and code bases.
  • Supports tool calling and structured outputs.

Best for

  • Strategic analysis and research copilots.
  • Complex coding assistants and debugging flows.
  • Document intelligence use cases requiring full fidelity.

Operational tips

  • Enable extended thinking only when needed to manage output spend.
  • Use evaluation harnesses in AI Studio before promoting new prompts.
  • Stream responses to keep latency manageable for end users.
Model ID: claude-4-sonnet
Claude 3.7 Sonnet

Deprecated: This model will be removed on February 28, 2026. Migrate to Claude 4.5 Sonnet for improved performance and continued support.

Highlights

  • Extended thinking for difficult reasoning with reduced pricing.
  • Vision support and 200K context window.
  • Fast enough for most production assistants.

Best for

  • Technical documentation Q&A bots.
  • Code review companions.
  • Enterprise knowledge management workflows.

Operational tips

  • Instrument latency metrics; responses are slightly faster than Claude 4.
  • Reserve extended thinking for escalations to balance cost.
  • Attach the model to staging apps to validate prompts before rollout.
Model ID: claude-3-7-sonnet
Claude 3.5 Sonnet

Deprecated: This model will be removed on February 28, 2026. Migrate to Claude 4.5 Sonnet for improved performance and continued support.

Highlights

  • Strong reasoning with lower average latency than Claude 4.
  • Optional extended thinking when prompts demand it.
  • Vision for OCR, screenshot analysis, and product imagery.

Best for

  • General-purpose product assistants.
  • Code generation where turnaround speed matters.
  • Marketing content with light creative editing.

Operational tips

  • Set max_tokens to 600–1200 to keep responses snappy.
  • Toggle streaming for user-facing chat experiences.
  • Use as the baseline model in evaluation suites.
Model ID: claude-3-5-sonnet

Claude 4.5 Haiku

Highlights

  • Fast and highly cost-effective for high-throughput tasks.
  • Optimized for rapid responses and real-time interactions.
  • Full 200K context window for broad prompts.
  • Perfect for content moderation and inventory management.

Best for

  • Applications requiring rapid responses.
  • Content moderation at scale.
  • Inventory management and high-volume automation.

Operational tips

  • Optimized for high throughput with low latency.
  • No extended thinking—design prompts accordingly.
  • Batch inference when possible to maximize concurrency.
Model ID: claude-4-5-haiku
Claude 3.5 Haiku

Deprecated: This model will be removed on February 28, 2026. Migrate to Claude 4.5 Haiku for improved performance and continued support.

Highlights

  • Lowest latency across the Claude family.
  • Full 200K context window for broad prompts.
  • Optimized for throughput and predictable billing.

Best for

  • Customer support automations.
  • Moderation or classification pipelines.
  • Broadcast messaging, notifications, and templated replies.

Operational tips

  • No extended thinking—design prompts accordingly.
  • Batch inference when possible to maximize concurrency.
  • Monitor token mix; output remains capped at 4K tokens.
Model ID: claude-3-5-haiku
Claude 3.0 Haiku

Deprecated: This model will be removed on February 28, 2026. Migrate to Claude 4.5 Haiku for improved performance and continued support.

Highlights

  • Ultra-low latency for simple prompts.
  • Minimal cost overhead for pilot projects.
  • Compatible with existing Haiku workloads.

Best for

  • Lightweight Q&A and FAQ bots.
  • Content moderation stubs.
  • Legacy apps migrating from older Anthropic versions.

Operational tips

  • Plan migration to Claude 4.5 Haiku immediately.
  • Limit prompts to short instructions to maintain output quality.
  • Model will be removed on February 28, 2026.
Model ID: claude-3-haiku

Amazon Nova Pro

Highlights

  • Strong reasoning with AWS-native integrations.
  • Extended context for multilingual and domain-heavy prompts.
  • Governance aligned with AWS Bedrock controls.

Best for

  • Workloads already standardized on AWS security tooling.
  • Enterprise knowledge bases requiring PII guardrails.
  • Finance and regulated industry copilots.

Operational tips

  • Cross-check request volumes with AWS cost allocation tags.
  • Rely on Bedrock safety settings exposed via model parameters.
  • Coordinate upgrades with AWS release cadence.
Model ID: amazon-nova-pro

Amazon Nova Lite

Highlights

  • Lower price point compared to Nova Pro.
  • Optimized for general conversational tasks.
  • Shorter latency with manageable quality trade-offs.

Best for

  • Customer service and triage bots.
  • Internal productivity copilots.
  • High-volume prompts within the AWS ecosystem.

Operational tips

  • Use shorter prompts to keep responses sharp.
  • Pair with Claude models for fallback escalation flows.
  • Benchmark throughput against Haiku to choose the lowest-cost option.
Model ID: amazon-nova-lite
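
The fallback-escalation tip above can be sketched as a small router: send the prompt to the cheap model first and escalate to a Claude model only when the answer trips a heuristic. The model IDs come from this catalog; the confidence check is an illustrative placeholder.

```python
# Sketch: escalate from Nova Lite to Claude when the cheap answer looks weak.
PRIMARY_MODEL = "amazon-nova-lite"
FALLBACK_MODEL = "claude-4-5-sonnet"

def needs_escalation(answer: str) -> bool:
    # Placeholder heuristic: escalate on empty or hedging answers.
    hedges = ("i'm not sure", "i don't know", "cannot answer")
    return not answer or any(h in answer.lower() for h in hedges)

def choose_model(first_answer=None) -> str:
    if first_answer is None:
        return PRIMARY_MODEL   # first attempt: cheapest model
    if needs_escalation(first_answer):
        return FALLBACK_MODEL  # retry on the stronger model
    return PRIMARY_MODEL       # cheap answer was good enough
```

In production you would replace the heuristic with an evaluation signal (for example a grader model or a structured-output validity check).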

OpenAI GPT OSS 120B

Highlights

  • Open-source friendly architecture and weights.
  • Supports custom fine-tuning and adapters.
  • 128K context window for broader experimentation.

Best for

  • Research teams evaluating open models.
  • Hybrid setups mixing managed and self-hosted inference.
  • Education and experimentation environments.

Operational tips

  • Expect higher latency compared to hosted Claude models.
  • Budget for additional evaluation since quality varies by prompt.
  • Track GPU usage via Heroku metrics to control costs.
Model ID: openai-gpt-oss-120b

Amazon Nova 2 Lite

Highlights

  • Fast and cost-effective with 1M token context window.
  • Extended thinking with three intensity levels (low, medium, high).
  • Built-in tools: code interpreter and web grounding.
  • Remote MCP tool support for agentic workflows.

Best for

  • High-volume conversational applications.
  • Document processing and business automation.
  • Cost-sensitive deployments requiring reasoning.

Operational tips

  • Use thinking intensity controls to balance speed vs. reasoning.
  • Leverage 1M context for large document analysis.
  • Built-in code interpreter reduces need for external tools.
Model ID: amazon-nova-2-lite · Regions: US, EU

Kimi K2 Thinking

Highlights

  • Open-weight LLM from Moonshot AI with chain-of-thought.
  • Extended thinking for complex reasoning tasks.
  • Strong performance on multi-step problem solving.
  • Transparent reasoning process in responses.

Best for

  • Mathematical and logical reasoning tasks.
  • Research and analysis requiring step-by-step thinking.
  • Educational applications showing work process.

Operational tips

  • Allow for longer response times due to thinking steps.
  • US-only availability—plan deployments accordingly.
  • Extract intermediate reasoning for debugging.
Model ID: kimi-k2-thinking · Regions: US only

MiniMax M2

Highlights

  • Optimized for conversational chat and tool-calling.
  • Strong programming task capabilities.
  • Efficient for high-throughput applications.
  • Good balance of speed and quality.

Best for

  • Code generation and programming assistants.
  • Tool-enabled workflows and agents.
  • Technical support and developer copilots.

Operational tips

  • US-only availability—consider alternatives for EU.
  • Leverage tool-calling for structured outputs.
  • Test against Claude for quality comparison.
Model ID: minimax-m2 · Regions: US only

Qwen3 235B

Highlights

  • Large-scale model for complex reasoning tasks.
  • Strong conversational and tool-calling support.
  • Agentic coding capabilities for automated workflows.
  • Excellent multilingual understanding.

Best for

  • Complex reasoning and analysis tasks.
  • Agentic coding and autonomous development.
  • Multilingual applications and content.

Operational tips

  • US-only availability—plan deployments accordingly.
  • Use for tasks requiring deep reasoning capabilities.
  • Monitor latency for time-sensitive applications.
Model ID: qwen3-235b · Regions: US only

Qwen3 Coder 480B

Highlights

  • Largest Qwen model optimized for coding tasks.
  • Agentic coding with autonomous development capabilities.
  • Tool-calling for IDE and workflow integrations.
  • Strong performance across programming languages.

Best for

  • Automated code generation and refactoring.
  • Complex coding agents and development tools.
  • Code review and analysis at scale.

Operational tips

  • US-only availability—ideal for US-based dev teams.
  • Expect higher latency due to model size.
  • Combine with tool-calling for IDE integration.
Model ID: qwen3-coder-480b · Regions: US only

Embedding models

Use embeddings for semantic search, classification, clustering, and retrieval-augmented generation.

Cohere Embed Multilingual

Highlights

  • 100+ language support with 1,024-dimension vectors.
  • Optimized presets for search, classification, and clustering.
  • Batch up to 96 inputs per request to lower per-item cost.

Best for

  • Global support search with mixed-language corpora.
  • RAG pipelines feeding Claude Sonnet responses.
  • Content recommendation and deduplication systems.

Operational tips

  • Normalize text (lowercase, trim whitespace) before embedding.
  • Persist vectors in Postgres + pgvector for efficient retrieval.
  • Cache frequently queried embeddings to avoid reprocessing.
Model ID: cohere-embed-multilingual · Input types: search_document, search_query, classification, clustering
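
The normalization and batching tips combine into one helper: clean each document, then split the corpus into requests of at most 96 inputs. The `input`/`input_type` field names assume the OpenAI-style embeddings request shape; verify them against the Embeddings API reference.

```python
# Sketch: normalize documents, then batch them under the 96-input limit.
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before embedding.
    return " ".join(text.lower().split())

def batch_requests(docs, model_id="cohere-embed-multilingual", batch_size=96):
    cleaned = [normalize(d) for d in docs]
    return [
        {
            "model": model_id,
            "input": cleaned[i : i + batch_size],
            "input_type": "search_document",  # preset for corpus documents
        }
        for i in range(0, len(cleaned), batch_size)
    ]

requests_ = batch_requests(["  Hello World  "] * 200)  # 200 docs -> 3 requests
```

Queries would use `input_type="search_query"` so the query and document presets stay paired, and the resulting vectors can go straight into pgvector.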

Image models

Generate photorealistic and stylized visuals straight from prompts.

Stable Image Ultra

Highlights

  • Supports 16:9, 1:1, 21:9, 2:3, 3:2, 4:5, 5:4, 9:16, and 9:21 aspect ratios.
  • Resolutions up to 1536×640 with prompt adherence tuned for enterprise brand safety.
  • Negative prompts and seed control for reproducible iterations.

Best for

  • Marketing and creative production pipelines.
  • Product visualization and storyboarding.
  • Social content generation with tight turnaround times.

Operational tips

  • Start with draft-size renders to validate prompts before increasing resolution.
  • Store seeds alongside prompts so teams can reproduce edits.
  • Use negative prompts to block banned styles or elements.
Model ID: stable-image-ultra
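
The tips above (seed tracking, negative prompts) can be sketched as a request builder. Field names such as `negative_prompt` and `aspect_ratio` follow common Stable Image conventions and should be checked against the Image Generation API reference.

```python
# Sketch: build a reproducible image generation request.
def build_image_request(prompt: str, seed: int) -> dict:
    return {
        "model": "stable-image-ultra",
        "prompt": prompt,
        "negative_prompt": "text, watermark, logo",  # block banned elements
        "aspect_ratio": "16:9",
        "seed": seed,  # store alongside the prompt to reproduce edits later
    }

draft_request = build_image_request("product hero shot, studio lighting", seed=42)
```

Persisting `(prompt, negative_prompt, seed)` together is what lets a teammate regenerate the exact render during review.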

Rerank models

Improve search quality by ranking documents based on semantic relevance to a query.

Cohere Rerank 3.5

Highlights

  • Enhanced reasoning with broad data compatibility.
  • Multilingual support for global applications.
  • Up to 1000 documents per request.
  • Optimized for RAG pipeline integration.

Best for

  • Retrieval-Augmented Generation (RAG) pipelines.
  • Semantic search result ranking.
  • Multilingual document retrieval.

Operational tips

  • Use after initial vector search to improve relevance.
  • Set top_n to limit results and reduce latency.
  • Rate limit: 250 requests per minute.
Model ID: cohere-rerank-3-5 · Regions: US, EU
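
A typical flow runs a vector search first, then reranks the candidates with a bounded `top_n`. The field names below mirror Cohere's rerank request shape; confirm them against the Rerank API reference.

```python
# Sketch: rerank vector-search candidates, respecting the per-request
# document limit and trimming results with top_n to reduce latency.
def build_rerank_request(query: str, documents: list, top_n: int = 3) -> dict:
    return {
        "model": "cohere-rerank-3-5",
        "query": query,
        "documents": documents[:1000],  # up to 1000 documents per request
        "top_n": top_n,                 # keep only the best matches
    }

candidates = [f"doc-{i}" for i in range(1500)]
request = build_rerank_request("reset my password", candidates)
```

Requests beyond the 250-per-minute rate limit should be queued or retried with backoff on the client side.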

Amazon Rerank 1.0

Highlights

  • High-performing reranker backed by AWS infrastructure.
  • Seamless integration with AWS ecosystem.
  • Enterprise-grade reliability and scaling.
  • Consistent performance for production workloads.

Best for

  • Enterprise search applications.
  • AWS-native RAG implementations.
  • High-volume reranking workloads.

Operational tips

  • Ideal for teams already using AWS services.
  • Rate limit: 200 requests per minute.
  • Pair with Cohere embeddings for full RAG pipeline.
Model ID: amazon-rerank-1-0 · Regions: US, EU

Model selection playbooks

Picking a default chat model

  1. Prototype with Claude 4.5 Haiku or Nova 2 Lite for fast, cost-effective performance.
  2. Upgrade to Claude 4.5 Sonnet for balanced performance on complex tasks.
  3. Escalate to Claude 4 Sonnet for high-stakes workflows (compliance, financial analysis, multi-step coding).
  4. Reach for Claude Opus 4.5 for complex agentic workflows and autonomous coding tasks.
  5. Use Nova Pro if you need tighter alignment with AWS governance.
  6. Try Qwen3 Coder 480B or Kimi K2 Thinking for specialized coding or reasoning tasks (US only).
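
The playbook above can be captured as a lookup from task profile to a default model ID; the profile labels are illustrative, not API values.

```python
# Sketch: map an application's task profile to a default catalog model ID.
DEFAULT_MODEL_BY_PROFILE = {
    "prototype": "claude-4-5-haiku",      # fast, cost-effective validation
    "balanced": "claude-4-5-sonnet",      # complex tasks, balanced cost
    "deep-reasoning": "claude-4-sonnet",  # high-stakes, multi-step workflows
    "agentic": "claude-opus-4-5",         # autonomous coding and agents
    "aws-governed": "amazon-nova-pro",    # AWS governance alignment
}

def pick_default_model(profile: str) -> str:
    # Unknown profiles fall back to the balanced tier.
    return DEFAULT_MODEL_BY_PROFILE.get(profile, "claude-4-5-sonnet")
```

Because the endpoints are OpenAI-compatible, swapping tiers is a one-line config change rather than a code change.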

Cost and performance controls

  • Stream completions to surface partial output without waiting for full responses.
  • Cap max_tokens based on UI constraints to avoid runaway output cost.
  • Use temperature ≤ 0.5 for deterministic system flows.
  • Batch embedding requests and reuse embeddings for unchanged documents.
  • Track spend per app in the Heroku Dashboard and set alerts for anomalies.
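
The controls above can be folded into one helper that normalizes any chat request: force streaming, cap max_tokens to a UI budget, and clamp temperature for deterministic flows. The 800-token budget is an arbitrary example value.

```python
# Sketch: apply catalog-wide cost and performance controls to a request body.
def apply_cost_controls(payload: dict, ui_token_budget: int = 800) -> dict:
    controlled = dict(payload)
    controlled["stream"] = True  # surface partial output immediately
    controlled["max_tokens"] = min(
        payload.get("max_tokens", ui_token_budget), ui_token_budget
    )  # cap output to what the UI can display
    controlled["temperature"] = min(
        payload.get("temperature", 0.5), 0.5
    )  # keep system flows deterministic
    return controlled

request = apply_cost_controls(
    {"model": "claude-4-5-sonnet", "messages": [], "max_tokens": 4000}
)
```

Applying the helper at a single chokepoint (for example, an API gateway or shared client wrapper) keeps every app on the same guardrails.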

Provisioning checklist

  1. Inspect availability: heroku ai:models:list
  2. Create a managed resource: heroku ai:models:create claude-4-5-sonnet --app my-ai-app
  3. Verify attachment: heroku ai:models:info --app my-ai-app

Chat Completions API

Implement conversational workloads with streaming and tool calling.

Embeddings API

Build RAG, semantic search, and clustering pipelines.

Rerank API

Improve search relevance by reranking documents semantically.

Image Generation API

Turn prompts into assets for marketing, product, and design.

AI Studio

Evaluate prompts, compare models, and share test links.