Rerank

POST /v1/rerank
curl --request POST \
  --url https://us.inference.heroku.com/v1/rerank \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "cohere-rerank-3-5",
  "query": "How do I optimize database connection pooling?",
  "documents": [
    "Connection pooling reduces overhead by reusing existing connections.",
    "You can monitor application performance using built-in metrics.",
    "Set max pool size based on your dyno count and concurrent queries."
  ],
  "top_n": 2
}
'
{
  "results": [
    {
      "index": 0,
      "relevance_score": 0.674
    }
  ]
}
The /v1/rerank endpoint ranks a list of documents by their semantic relevance to a given query. This is essential for Retrieval-Augmented Generation (RAG) pipelines, semantic search, and question-answering applications where you need to surface the most relevant content.
Improve RAG Quality: Use reranking after initial retrieval to boost the relevance of documents passed to your LLM, improving response accuracy and reducing hallucinations.

Base URL

https://us.inference.heroku.com

Authentication

All requests must include an Authorization header with your Heroku Inference API key:
Authorization: Bearer YOUR_RERANK_KEY
You can get your API key from your Heroku app's RERANK_KEY config variable (assuming you attached the model resource with the --as RERANK flag; without an alias, the default config variable is INFERENCE_KEY).
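In application code, you can read the attached config vars from the environment and build the request headers once. A minimal sketch in Python, assuming the model resource was attached with the RERANK alias:

```python
import os

def auth_headers():
    """Build rerank request headers from the RERANK_KEY config var."""
    key = os.environ["RERANK_KEY"]  # set by the attached model resource
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```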

Available Models

| Model | Description | Rate Limit | Availability |
| --- | --- | --- | --- |
| cohere-rerank-3-5 | Enhanced reasoning with broad data compatibility and multilingual support | 250 RPM | US, EU |
| amazon-rerank-1-0 | High-performing reranker backed by AWS | 200 RPM | US, EU |

Request Parameters

model

string · required ID of the rerank model to use. Example: "cohere-rerank-3-5" or "amazon-rerank-1-0"

query

string · required The search query or question to rank documents against. Example: "How do I optimize database connection pooling?"

documents

array · required List of document strings to rank. Maximum of 1000 documents per request.
[
  "Connection pooling reduces overhead by reusing existing connections.",
  "You can monitor application performance using built-in metrics.",
  "Set max pool size based on your dyno count and concurrent queries."
]
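Because each request accepts at most 1000 documents, larger corpora must be split client-side and reranked in batches. A minimal sketch (the helper name is illustrative; only the 1000-document limit comes from the documentation above):

```python
def batch_documents(documents, batch_size=1000):
    """Yield successive slices of at most batch_size documents,
    matching the per-request limit of the rerank endpoint."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]
```

Each batch can then be sent as a separate rerank request, with scores merged client-side if a global ranking is needed.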

top_n

integer · optional Return only the top N most relevant documents. If not specified, all documents are returned ranked by relevance. Example: 10

Response

id

string Unique identifier for the rerank request (UUID format).

results

array List of ranked documents, ordered by relevance score (highest first).
Each object in the results array includes:
  • index (integer): Original position of the document in the input array (0-indexed)
  • relevance_score (float): Relevance score between 0 and 1, where higher values indicate greater relevance to the query

meta

object API version and billing information.
  • api_version.version (string): API version number
  • api_version.is_experimental (boolean): Whether this API is experimental
  • billed_units.search_units (integer): Number of search units consumed for billing

Examples

eval $(heroku config -a $APP_NAME --shell | grep '^RERANK_' | sed 's/^/export /')

curl $RERANK_URL/v1/rerank \
  -H "Authorization: Bearer $RERANK_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cohere-rerank-3-5",
    "query": "How do I optimize database connection pooling?",
    "documents": [
      "Connection pooling reduces overhead by reusing existing database connections instead of creating new ones for each request.",
      "You can monitor application performance using built-in metrics and logging tools.",
      "Set max pool size based on your dyno count and expected concurrent queries to prevent connection exhaustion.",
      "Regular database backups are essential for disaster recovery planning."
    ],
    "top_n": 2
  }'

Response Example

{
  "id": "f844c7c3-c357-4476-9a9d-d2de06f2106f",
  "results": [
    {
      "index": 0,
      "relevance_score": 0.6740
    },
    {
      "index": 2,
      "relevance_score": 0.5308
    }
  ],
  "meta": {
    "api_version": {
      "version": "2",
      "is_experimental": false
    },
    "billed_units": {
      "search_units": 1
    }
  }
}

Use Cases

RAG Pipeline Enhancement

Reranking is most powerful when combined with initial retrieval. First retrieve candidates using embeddings, then rerank to surface the most relevant documents before passing them to your LLM.
import os
import requests

def rag_with_rerank(query: str, retrieved_docs: list[str], top_k: int = 3):
    """
    Rerank retrieved documents and return the most relevant ones.
    """
    # Step 1: Rerank the retrieved documents
    rerank_response = requests.post(
        f"{os.getenv('RERANK_URL')}/v1/rerank",
        headers={
            "Authorization": f"Bearer {os.getenv('RERANK_KEY')}",
            "Content-Type": "application/json"
        },
        json={
            "model": "cohere-rerank-3-5",
            "query": query,
            "documents": retrieved_docs,
            "top_n": top_k
        }
    )
    rerank_response.raise_for_status()

    reranked = rerank_response.json()["results"]

    # Step 2: Get the top documents in ranked order
    top_docs = [retrieved_docs[r["index"]] for r in reranked]

    # Step 3: Pass to LLM for generation
    context = "\n\n".join(top_docs)
    return context, reranked
Semantic Search

Improve search result quality by reranking keyword or vector search results based on semantic relevance.

Question Answering

Identify the most relevant documents for answering user questions, reducing the context window needed and improving answer accuracy.
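One way to trim the context window is to keep only documents whose relevance score clears a threshold. A minimal sketch assuming the results shape documented above (the threshold value and helper name are illustrative):

```python
def select_context(documents, results, threshold=0.5, max_docs=3):
    """Keep only documents scored above a relevance threshold.

    `documents` is the original input array; `results` is the rerank
    response's results list (dicts with "index" and "relevance_score",
    already ordered highest score first).
    """
    picked = [
        documents[r["index"]]
        for r in results
        if r["relevance_score"] >= threshold
    ]
    return picked[:max_docs]
```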

Error Responses

| Status Code | Description | Common Causes |
| --- | --- | --- |
| 400 | Bad Request | Missing required fields, documents exceed 1000 limit, invalid JSON |
| 401 | Unauthorized | Missing or invalid authorization token |
| 403 | Forbidden | No access to the requested model |
| 404 | Not Found | Invalid model ID |
| 429 | Too Many Requests | Rate limit exceeded (250 RPM for Cohere, 200 RPM for Amazon) |
| 500 | Internal Server Error | Backend service errors |
Implement exponential backoff when handling 429 errors to gracefully handle rate limits.
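The retry loop for 429 responses can be sketched as follows. Here `send` is a stand-in for the actual HTTP POST to /v1/rerank, so the backoff logic stays independent of any HTTP client; the function name and parameters are illustrative:

```python
import random
import time

def post_with_backoff(send, max_retries=5, base_delay=1.0):
    """Retry a request on HTTP 429 with exponential backoff plus jitter.

    `send` is any zero-argument callable returning (status_code, body);
    in a real client it would perform the POST to /v1/rerank.
    """
    for attempt in range(max_retries):
        status, body = send()
        if status != 429:
            return status, body
        # Wait 1s, 2s, 4s, ... plus random jitter before retrying
        time.sleep(base_delay * 2 ** attempt + random.random())
    raise RuntimeError(f"Still rate limited after {max_retries} retries")
```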

Embeddings

Generate embeddings for initial retrieval before reranking

Chat Completions

Generate responses using reranked context

Vector Database

Store and query embeddings with pgvector

Embeddings + RAG Cookbook

Build complete RAG pipelines

Authorizations

Authorization
string
header
required

Bearer token using your model resource's API key (INFERENCE_KEY by default, or RERANK_KEY if you attached the model with --as RERANK)

Body

application/json
model
string
required

ID of the rerank model to use

Example:

"cohere-rerank-3-5"

query
string
required

The search query or question to rank documents against

Example:

"How do I optimize database connection pooling?"

documents
string[]
required

Array of text documents to rank by relevance to the query

Example:
[
  "Connection pooling reduces overhead by reusing existing connections.",
  "You can monitor application performance using built-in metrics.",
  "Set max pool size based on your dyno count and concurrent queries."
]
top_n
integer

Number of most relevant results to return. If not specified, returns all documents.

Example:

3

Response

Successful response

results
object[]

Documents ranked by relevance, highest score first