This guide distills prompt optimization practices from the OpenAI Cookbook so you can harden chat-based Heroku AI apps without relying on the Agents API. Adapted references: Optimize Prompts and Evaluation Flywheel.

Structure prompts for reuse

Template anatomy

  • System message: role, tone, forbidden behaviors.
  • Instructions block: numbered steps to follow (use bullet formatting from the cookbook).
  • Reference data: optional context, clearly delimited.
  • Output contract: JSON schema or textual requirements to aid parsing.

Keep variants in source control so changes can be reviewed just like code.

PROMPT = {
    "role": "system",
    "content": (
        "You are a Heroku AI support assistant.\n"
        "# Style\n"
        "- Be concise.\n"
        "- Cite docs pages when helpful.\n"
        "# Disallowed\n"
        "- Never invent plan names or pricing.\n"
        "# Output\n"
        "Respond in Markdown with a short list of actions."
    ),
}
Store prompts alongside unit tests so changes must pass automated checks before deployment.
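One such check is a structural validator that unit tests can call before deployment; a minimal sketch keyed to the section headings used in PROMPT above:

```python
# Structural check for prompt templates; returns problems instead of raising
# so a test (or CI script) can report all findings at once.
REQUIRED_SECTIONS = ("# Style", "# Disallowed", "# Output")

def validate_prompt(prompt: dict) -> list[str]:
    """Return a list of problems; an empty list means the prompt passes."""
    problems = []
    if prompt.get("role") != "system":
        problems.append("prompt must be a system message")
    content = prompt.get("content", "")
    for section in REQUIRED_SECTIONS:
        if section not in content:
            problems.append(f"missing section: {section}")
    return problems
```

A pytest test then reduces to `assert validate_prompt(PROMPT) == []`.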

Automated prompt reviews

Checker workflow

  1. Run prompt text through heuristic checkers (linting length, missing sections).
  2. Optionally call the chat completions API with a “critic” system message to flag contradictions or formatting gaps (inspired by the cookbook’s multi-agent loop).
  3. Return actionable feedback for human review.
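Step 1 needs no model call at all; a minimal sketch of a heuristic checker, with an illustrative character budget:

```python
# Cheap static checks that run before any model call (step 1 above).
# The 4000-character budget is an illustrative limit, not a hard API cap.
MAX_PROMPT_CHARS = 4000

def lint_prompt(text: str) -> list[str]:
    """Return human-readable findings; empty means the heuristics pass."""
    findings = []
    if not text.strip():
        findings.append("prompt is empty")
    if len(text) > MAX_PROMPT_CHARS:
        findings.append(f"prompt exceeds {MAX_PROMPT_CHARS} chars ({len(text)})")
    # Flag duplicated lines, a common copy/paste artifact in long templates.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for dupe in sorted({line for line in lines if lines.count(line) > 1}):
        findings.append(f"duplicated line: {dupe!r}")
    return findings
```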

Keep the critic prompt deterministic (low temperature) and limit output to a structured checklist.
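The snippets below assume an OpenAI-compatible `client`. One way to construct it against Heroku Managed Inference, which exposes a chat completions endpoint; the INFERENCE_URL and INFERENCE_KEY config var names are assumptions here, so check the add-on's documentation for your app:

```python
import os
from openai import OpenAI

# Heroku Managed Inference sets config vars when the add-on is attached;
# the exact names below are assumed -- verify them with `heroku config`.
client = OpenAI(
    base_url=os.environ["INFERENCE_URL"] + "/v1",
    api_key=os.environ["INFERENCE_KEY"],
)
```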

def audit_prompt(candidate: str) -> str:
    """Ask a low-cost critic model to review a prompt; returns its JSON verdict."""
    critic = client.chat.completions.create(
        model="claude-4-5-haiku",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are PromptChecker.\n"
                    "Identify contradictions, missing output rules, or unclear tone.\n"
                    "Respond as JSON with keys: has_issue (bool), notes (array of strings)."
                ),
            },
            {"role": "user", "content": candidate},
        ],
        temperature=0.0,
        max_tokens=500,
    )
    return critic.choices[0].message.content
Haiku keeps costs low for CI pipelines; switch to Sonnet for more nuanced critiques.
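To wire the critic into CI, parse its verdict and fail the build when issues are found; a sketch assuming the JSON contract from the checker prompt above (`has_issue`, `notes`):

```python
import json

# Gate a CI job on the critic's verdict. Unparseable output is treated as a
# failure so a malformed critique never silently passes the build.
def gate_on_audit(raw_verdict: str) -> tuple[bool, list[str]]:
    """Return (ok, notes) from the critic's raw JSON response."""
    try:
        verdict = json.loads(raw_verdict)
    except json.JSONDecodeError:
        return False, ["critic returned non-JSON output"]
    return (not verdict.get("has_issue", True)), list(verdict.get("notes", []))
```

In CI, `ok, notes = gate_on_audit(audit_prompt(text))`, print the notes, and exit non-zero when `ok` is false.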

Evaluation flywheel

Cycle overview

  1. Collect failing traces from production and label failure modes.
  2. Measure with LLM graders scoring binary pass/fail outcomes.
  3. Improve prompts or context, re-run graders, and deploy if scores rise.

Align graders with human expectations using the cookbook’s TPR/TNR approach.

def run_eval(example: dict) -> bool:
    """Grade one labeled transcript; True means the policy grader voted PASS."""
    response = client.chat.completions.create(
        model="claude-4-5-sonnet",
        messages=[
            {"role": "system", "content": "Judge if the assistant followed policy. Reply PASS or FAIL."},
            {"role": "user", "content": example["transcript"]},
        ],
        temperature=0.0,
        max_tokens=10,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict == "PASS"
Run graders inside Heroku Scheduler or CI; persist scores to Postgres so you can chart quality over time.
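The TPR/TNR alignment check can be computed directly from grader verdicts and matched human labels; a minimal sketch:

```python
# TPR: share of human-labeled passes the grader also passes.
# TNR: share of human-labeled failures the grader also fails.
# Both near 1.0 means the grader tracks human judgment on this sample.
def grader_agreement(grader: list[bool], human: list[bool]) -> tuple[float, float]:
    """Return (true positive rate, true negative rate) of the grader."""
    pairs = list(zip(grader, human))
    passes = [g for g, h in pairs if h]
    fails = [g for g, h in pairs if not h]
    tpr = sum(passes) / len(passes) if passes else 0.0
    tnr = sum(not g for g in fails) / len(fails) if fails else 0.0
    return tpr, tnr
```

If either rate is low, fix the grader prompt before trusting its scores to drive prompt changes.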