This guide distills prompt optimization practices from the OpenAI Cookbook so you can harden chat-based Heroku AI apps without relying on the Agents API. Adapted references: Optimize Prompts and Evaluation Flywheel.

Structure prompts for reuse

Template anatomy

  • System message: role, tone, forbidden behaviors.
  • Instructions block: numbered steps to follow (use bullet formatting from the cookbook).
  • Reference data: optional context, clearly delimited.
  • Output contract: JSON schema or textual requirements to aid parsing.

Keep variants in source control so changes can be reviewed just like code.

PROMPT = {
    "role": "system",
    "content": (
        "You are a Heroku AI support assistant.\n"
        "# Style\n"
        "- Be concise.\n"
        "- Cite docs pages when helpful.\n"
        "# Disallowed\n"
        "- Never invent plan names or pricing.\n"
        "# Output\n"
        "Respond in Markdown with a short list of actions."
    ),
}
Store prompts alongside unit tests so changes must pass automated checks before deployment.
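One such check is a structural validator that unit tests can call before deployment; a minimal sketch keyed to the section headings used in PROMPT above:

```python
# Structural check for prompt templates; returns problems instead of raising
# so a test (or CI script) can report all findings at once.
REQUIRED_SECTIONS = ("# Style", "# Disallowed", "# Output")

def validate_prompt(prompt: dict) -> list[str]:
    """Return a list of problems; an empty list means the prompt passes."""
    problems = []
    if prompt.get("role") != "system":
        problems.append("prompt must be a system message")
    content = prompt.get("content", "")
    for section in REQUIRED_SECTIONS:
        if section not in content:
            problems.append(f"missing section: {section}")
    return problems
```

A pytest test then reduces to `assert validate_prompt(PROMPT) == []`.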

Automated prompt reviews

Checker workflow

  1. Run prompt text through heuristic checkers (linting length, missing sections).
  2. Optionally call the chat completions API with a “critic” system message to flag contradictions or formatting gaps (inspired by the cookbook’s multi-agent loop).
  3. Return actionable feedback for human review.
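Step 1 needs no model call at all; a minimal sketch of a heuristic checker, with an illustrative character budget:

```python
# Cheap static checks that run before any model call (step 1 above).
# The 4000-character budget is an illustrative limit, not a hard API cap.
MAX_PROMPT_CHARS = 4000

def lint_prompt(text: str) -> list[str]:
    """Return human-readable findings; empty means the heuristics pass."""
    findings = []
    if not text.strip():
        findings.append("prompt is empty")
    if len(text) > MAX_PROMPT_CHARS:
        findings.append(f"prompt exceeds {MAX_PROMPT_CHARS} chars ({len(text)})")
    # Flag duplicated lines, a common copy/paste artifact in long templates.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for dupe in sorted({line for line in lines if lines.count(line) > 1}):
        findings.append(f"duplicated line: {dupe!r}")
    return findings
```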

Keep the critic prompt deterministic (low temperature) and limit output to a structured checklist.
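The snippets below assume an OpenAI-compatible `client`. One way to construct it against Heroku Managed Inference, which exposes a chat completions endpoint; the INFERENCE_URL and INFERENCE_KEY config var names are assumptions here, so check the add-on's documentation for your app:

```python
import os
from openai import OpenAI

# Heroku Managed Inference sets config vars when the add-on is attached;
# the exact names below are assumed -- verify them with `heroku config`.
client = OpenAI(
    base_url=os.environ["INFERENCE_URL"] + "/v1",
    api_key=os.environ["INFERENCE_KEY"],
)
```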

def audit_prompt(candidate: str) -> str:
    """Ask a low-cost critic model to review a prompt; returns its JSON verdict."""
    critic = client.chat.completions.create(
        model="claude-4-5-haiku",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are PromptChecker.\n"
                    "Identify contradictions, missing output rules, or unclear tone.\n"
                    "Respond as JSON with keys: has_issue (bool), notes (array of strings)."
                ),
            },
            {"role": "user", "content": candidate},
        ],
        temperature=0.0,
        max_tokens=500,
    )
    return critic.choices[0].message.content
Haiku keeps costs low for CI pipelines; switch to Sonnet for more nuanced critiques.
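To wire the critic into CI, parse its verdict and fail the build when issues are found; a sketch assuming the JSON contract from the checker prompt above (`has_issue`, `notes`):

```python
import json

# Gate a CI job on the critic's verdict. Unparseable output is treated as a
# failure so a malformed critique never silently passes the build.
def gate_on_audit(raw_verdict: str) -> tuple[bool, list[str]]:
    """Return (ok, notes) from the critic's raw JSON response."""
    try:
        verdict = json.loads(raw_verdict)
    except json.JSONDecodeError:
        return False, ["critic returned non-JSON output"]
    return (not verdict.get("has_issue", True)), list(verdict.get("notes", []))
```

In CI, `ok, notes = gate_on_audit(audit_prompt(text))`, print the notes, and exit non-zero when `ok` is false.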

Evaluation flywheel

Cycle overview

  1. Collect failing traces from production and label failure modes.
  2. Measure with LLM graders scoring binary pass/fail outcomes.
  3. Improve prompts or context, re-run graders, and deploy if scores rise.

Align graders with human expectations using the cookbook’s TPR/TNR approach.

def run_eval(example: dict) -> bool:
    """Grade one labeled transcript; True means the policy grader voted PASS."""
    response = client.chat.completions.create(
        model="claude-4-5-sonnet",
        messages=[
            {"role": "system", "content": "Judge if the assistant followed policy. Reply PASS or FAIL."},
            {"role": "user", "content": example["transcript"]},
        ],
        temperature=0.0,
        max_tokens=10,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict == "PASS"
Run graders inside Heroku Scheduler or CI; persist scores to Postgres so you can chart quality over time.
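The TPR/TNR alignment check can be computed directly from grader verdicts and matched human labels; a minimal sketch:

```python
# TPR: share of human-labeled passes the grader also passes.
# TNR: share of human-labeled failures the grader also fails.
# Both near 1.0 means the grader tracks human judgment on this sample.
def grader_agreement(grader: list[bool], human: list[bool]) -> tuple[float, float]:
    """Return (true positive rate, true negative rate) of the grader."""
    pairs = list(zip(grader, human))
    passes = [g for g, h in pairs if h]
    fails = [g for g, h in pairs if not h]
    tpr = sum(passes) / len(passes) if passes else 0.0
    tnr = sum(not g for g in fails) / len(fails) if fails else 0.0
    return tpr, tnr
```

If either rate is low, fix the grader prompt before trusting its scores to drive prompt changes.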