
Prompt engineering is system design

Most prompt advice is about phrasing. "Be specific." "Give examples." "Say please." That gets you from bad to decent. It does not get you from decent to production-ready.

Prompt engineering is not copywriting for LLMs. It is system design. You are defining the input contract for a non-deterministic function. The prompt is the interface. The model is the runtime. And like any interface, the quality of the contract determines the reliability of the system.

This post covers the full discipline: how prompts work mechanically, the techniques that actually matter, how to build prompt-driven systems that hold up in production, and the failure modes that will bite you if you ignore them.

Anatomy of a prompt

Every effective prompt has four components. You do not always need all four, but knowing which ones you are using (and which you are skipping) matters.

┌─────────────────────────────────────────────┐
│                   PROMPT                    │
│                                             │
│  ┌───────────┐  ┌───────────┐               │
│  │Instruction│  │  Context  │               │
│  │           │  │           │               │
│  │ what to   │  │ background│               │
│  │ do        │  │ info      │               │
│  └───────────┘  └───────────┘               │
│                                             │
│  ┌───────────┐  ┌───────────┐               │
│  │Constraints│  │ Examples  │               │
│  │           │  │           │               │
│  │ rules and │  │ input/    │               │
│  │ boundaries│  │ output    │               │
│  │           │  │ pairs     │               │
│  └───────────┘  └───────────┘               │
│                                             │
└─────────────────────────────────────────────┘

Instruction. What you want the model to do. "Summarize this article." "Extract the email addresses." "Review this code for security issues." The clearer the verb, the better the result.

Context. The background information the model needs. The article to summarize. The codebase conventions to follow. The user's previous messages. Context is where most prompt quality is won or lost, because most people either provide too little (the model guesses) or too much (the model gets confused).

Constraints. The rules the output must follow. "Respond in JSON." "Keep it under 200 words." "Do not include personally identifiable information." "If you are unsure, say so instead of guessing." Constraints turn vague requests into reliable contracts.

Examples. Input-output pairs that show the model what you want. This is the single most effective technique for controlling output format and quality. One good example is worth ten sentences of instruction.

Three prompting strategies

The distinction between zero-shot, few-shot, and system prompting is the foundation everything else builds on.

  Strategy        Examples Given    Best For
  ────────        ──────────────    ────────
  Zero-shot       none              simple, well-defined tasks
  Few-shot        1-5 pairs         format control, edge cases
  System prompt   none (persona)    agents, consistent behavior

Zero-shot means you give the instruction and context, but no examples. "Translate this sentence to French." The model already knows how to do the task. You are just directing it. This works for common, well-understood tasks.

Few-shot means you include examples of the input-output mapping you want. This is dramatically more effective than adding more instruction words. Instead of explaining your output format in prose, show it.

Bad (zero-shot, verbose instruction):
"Extract the company name and founding year from the text below.
Return the result as a JSON object with keys 'company' and 'year'.
The year should be a number, not a string. If the year is not
mentioned, use null."

Better (few-shot, one example):
"Extract company info from text.

Text: Apple was founded in 1976 by Steve Jobs.
Output: {"company": "Apple", "year": 1976}

Text: The startup launched last summer.
Output: {"company": "The startup", "year": null}

Text: {{input}}
Output:"

The few-shot version is shorter, clearer, and produces more consistent output. The model learns the format from the example, not from your description of the format.
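A few-shot prompt like this is easiest to keep consistent when it is assembled from data instead of hand-edited prose. A minimal sketch; the `Example` type and `buildFewShotPrompt` helper are illustrative, not from any library:

```typescript
interface Example {
  input: string;
  output: string; // the expected output, e.g. a JSON string
}

// Build a few-shot prompt: instruction, then example pairs, then the real input.
function buildFewShotPrompt(
  instruction: string,
  examples: Example[],
  input: string
): string {
  const shots = examples
    .map((ex) => `Text: ${ex.input}\nOutput: ${ex.output}`)
    .join("\n\n");
  return `${instruction}\n\n${shots}\n\nText: ${input}\nOutput:`;
}

const prompt = buildFewShotPrompt(
  "Extract company info from text.",
  [
    {
      input: "Apple was founded in 1976 by Steve Jobs.",
      output: '{"company": "Apple", "year": 1976}',
    },
    {
      input: "The startup launched last summer.",
      output: '{"company": "The startup", "year": null}',
    },
  ],
  "Vercel was founded in 2015."
);
```

Adding an edge case is now a one-line change to the examples array, not a rewrite of the template.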

System prompting sets the model's persona and behavior rules before the conversation starts. This is how you build agents and assistants with consistent behavior.

const systemPrompt = `You are a code review assistant for a TypeScript project.

Rules:
- Reference specific line numbers
- Categorize issues as "bug", "style", or "performance"
- If code is correct, say so. Do not invent issues.
- Respond in JSON: [{"line": number, "category": string, "issue": string}]`;

The system prompt is not a suggestion. It is the constitution. Every response the model produces should be consistent with it.

Mental models for prompt engineering

Input, transformation, output

Think of the prompt as defining a function:

f(input) = output

where:
  - input    = the data you provide (context, examples, user query)
  - f        = the transformation defined by your instruction + constraints
  - output   = the model's response, shaped by all of the above

When the output is wrong, the debugging question is: which part of the function is broken? Is the input incomplete (missing context)? Is the transformation ambiguous (unclear instruction)? Are the constraints too loose (no format specified)?

Determinism vs variability

LLMs are not deterministic. The same prompt can produce different outputs on different runs. This is a feature for creative tasks and a problem for structured tasks.

  Temperature    Behavior            Use When
  ───────────    ────────            ────────
  0              near-deterministic  structured extraction, JSON, code
  0.3-0.7        balanced            general tasks, summaries
  1.0+           high variability    brainstorming, creative writing

For production systems, use temperature 0 and design your prompt to fully specify the output. If the model needs to "think creatively," that is a sign your prompt is underspecified, not that you need higher temperature.
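If calls go through your own client wrapper, the table above can be encoded once so call sites declare a task kind instead of picking a magic number. A minimal sketch; the `TaskKind` names and the mid-range default are illustrative choices, not a standard:

```typescript
type TaskKind = "structured" | "general" | "creative";

// Map task kinds to the temperature ranges from the table above.
function temperatureFor(kind: TaskKind): number {
  switch (kind) {
    case "structured":
      return 0; // extraction, JSON, code
    case "general":
      return 0.5; // summaries, general tasks
    case "creative":
      return 1.0; // brainstorming, creative writing
  }
}
```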

Constraints as architecture

The most underrated prompting technique is stacking constraints. Each constraint narrows the output space. Enough constraints and the model has almost no room to produce bad output.

Without constraints:
  "Summarize this article."
  → Could be 50 words or 500. Could be bullet points or prose.
     Could include opinions or just facts.

With stacked constraints:
  "Summarize this article in exactly 3 bullet points.
   Each bullet should be one sentence, max 20 words.
   Focus on factual claims, not opinions.
   Start each bullet with a verb."
  → Output space is tiny. Almost every run produces good output.
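Stacked constraints stay consistent if they live in a list rather than in hand-edited prose. A minimal sketch, with the helper name as an illustrative choice:

```typescript
// Keep constraints in a list so they are easy to add, remove, and reuse.
function withConstraints(instruction: string, constraints: string[]): string {
  return [instruction, ...constraints].join("\n");
}

const summaryPrompt = withConstraints(
  "Summarize this article in exactly 3 bullet points.",
  [
    "Each bullet should be one sentence, max 20 words.",
    "Focus on factual claims, not opinions.",
    "Start each bullet with a verb.",
  ]
);
```

When a run produces bad output, the fix is usually one new entry in the constraints array.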

Advanced techniques

Chain-of-thought

Chain-of-thought prompting asks the model to show its reasoning before giving the answer. This consistently improves accuracy on complex tasks because it forces the model to work through the problem step by step instead of jumping to a conclusion.

Without chain-of-thought:
  "Is this code thread-safe? Answer yes or no."
  → Often wrong. The model pattern-matches instead of reasoning.

With chain-of-thought:
  "Analyze this code for thread safety.
   1. List all shared mutable state
   2. Check if each access is synchronized
   3. Identify any race conditions
   4. Conclude whether the code is thread-safe"
  → Much more accurate. The steps force actual analysis.

You do not need to say "think step by step." You need to define the steps.
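Defining the steps can itself be templated, so every analysis task gets explicit numbered steps. An illustrative helper:

```typescript
// Turn a task plus explicit reasoning steps into a numbered chain-of-thought prompt.
function chainOfThought(task: string, steps: string[]): string {
  const numbered = steps.map((step, i) => `${i + 1}. ${step}`).join("\n");
  return `${task}\n${numbered}`;
}

const cotPrompt = chainOfThought("Analyze this code for thread safety.", [
  "List all shared mutable state",
  "Check if each access is synchronized",
  "Identify any race conditions",
  "Conclude whether the code is thread-safe",
]);
```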

Prompt chaining

Break complex tasks into a pipeline of simpler prompts. Each prompt does one well-defined transformation.

  Monolithic prompt (fragile):
  ┌──────────────────────────────────┐
  │ "Read this PR, find bugs,        │
  │  categorize them, suggest fixes, │
  │  estimate severity, write a      │
  │  summary, format as markdown"    │
  └──────────────────────────────────┘
  → Tries to do too much. Quality degrades on every sub-task.

  Chained prompts (reliable):
  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
  │ Extract  │──▶│Categorize│──▶│ Suggest  │──▶│ Format   │
  │ issues   │   │ + score  │   │ fixes    │   │ summary  │
  └──────────┘   └──────────┘   └──────────┘   └──────────┘
  → Each prompt has one job. Quality stays high.

Chaining costs more tokens (each step is an API call) but produces dramatically better results on complex tasks. The trade-off is almost always worth it.
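A chain is just sequential composition of async steps, each one prompt. A sketch of the plumbing; the step functions below are stand-ins, since in a real pipeline each would be one LLM call:

```typescript
type Step<A, B> = (input: A) => Promise<B>;

// Run steps left to right, feeding each step's output into the next.
async function chain(input: unknown, steps: Step<any, any>[]): Promise<any> {
  let value: any = input;
  for (const step of steps) {
    value = await step(value);
  }
  return value;
}

// Stand-in steps for illustration; real ones would call the model.
const extractIssues: Step<string, string[]> = async (_diff) => ["unchecked null"];
const categorize: Step<string[], string[]> = async (issues) =>
  issues.map((issue) => `bug: ${issue}`);

// Usage: await chain(diffText, [extractIssues, categorize, suggestFixes, formatSummary])
```

Because each step is a plain function, you can test, cache, and retry steps independently.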

Self-consistency

Run the same prompt multiple times and take the majority answer. This reduces the impact of random variation.

async function selfConsistentAnswer(prompt: string, runs = 5) {
  const responses = await Promise.all(
    Array.from({ length: runs }, () => callLLM(prompt, { temperature: 0.7 }))
  );

  // Count occurrences of each unique answer
  const counts = new Map<string, number>();
  for (const response of responses) {
    const normalized = response.trim().toLowerCase();
    counts.set(normalized, (counts.get(normalized) || 0) + 1);
  }

  // Return the most common answer
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}

Use this for high-stakes classification or extraction where getting the wrong answer is costly. Do not use it for generative tasks where there is no "correct" answer.

Role prompting and persona design

Assigning the model a specific role changes its behavior more than you might expect. The key is specificity.

Vague role (minimal effect):
  "You are a helpful assistant."

Specific role (strong effect):
  "You are a senior TypeScript engineer reviewing pull requests
   for a fintech startup. You care deeply about type safety,
   error handling, and edge cases around money calculations.
   You are direct and do not sugarcoat feedback."

The specific role activates more relevant knowledge and produces responses that match the described perspective. The model does not "become" the role, but it conditions its output distribution on it.

Instruction layering

Complex systems often need instructions at multiple levels. Layer them from general to specific.

// Layer 1: System prompt (always active, defines persona)
const system = `You are a data extraction agent.
You always respond in valid JSON.
You never fabricate data that is not present in the source.`;

// Layer 2: Task instruction (changes per request type)
const taskInstruction = `Extract all monetary amounts from the
document below. For each amount, include the value, currency,
and the sentence it appeared in.`;

// Layer 3: Constraints (specific to this run)
const constraints = `Output format: [{"value": number, "currency": string, "sentence": string}]
If a currency is ambiguous, use "USD" as default.
Ignore amounts that appear in headers or footnotes.`;

// Layer 4: Examples (optional, for edge cases)
const example = `Example input: "The project cost $1.2M in 2024."
Example output: [{"value": 1200000, "currency": "USD", "sentence": "The project cost $1.2M in 2024."}]`;

const fullPrompt = [taskInstruction, constraints, example, document].join(
  "\n\n"
);

Each layer handles a different concern. The system prompt never changes. The task instruction changes per use case. Constraints and examples change per run. This separation makes the system maintainable.

Prompt engineering for production systems

Dynamic prompt generation

In production, prompts are not static strings. They are templates populated at runtime.

function buildReviewPrompt(diff: string, rules: string[], language: string) {
  const ruleBlock = rules.map((rule, i) => `${i + 1}. ${rule}`).join("\n");

  return `Review this ${language} code diff.

Apply these project-specific rules:
${ruleBlock}

Diff:
\`\`\`
${diff}
\`\`\`

For each issue found, respond with:
- Line number
- Rule violated (by number)
- Suggested fix

If the code follows all rules, respond with: "No issues found."`;
}

The prompt is now a function. The instruction is fixed. The context (diff, rules, language) is injected at runtime. This is how you build systems that handle thousands of different inputs with the same prompt structure.

Context window management

Every token in the prompt costs money and consumes limited space. In production, you need a strategy.

┌──────────────────────────────────────────────┐
│              Context Window                  │
│                                              │
│  ┌─────────────┐  Fixed cost: system prompt, │
│  │   System    │  instructions, constraints  │
│  │   prompt    │  (~500-1000 tokens)         │
│  │   + rules   │                             │
│  └─────────────┘                             │
│                                              │
│  ┌─────────────┐  Variable: grows with input │
│  │   Context   │  (code, docs, history)      │
│  │   data      │  (~1000-50000 tokens)       │
│  └─────────────┘                             │
│                                              │
│  ┌─────────────┐  Reserved for model output  │
│  │   Output    │  (~500-4000 tokens)         │
│  │   space     │                             │
│  └─────────────┘                             │
│                                              │
│  Total budget: 128k-200k tokens              │
└──────────────────────────────────────────────┘

Strategies:

Truncate. When input exceeds the budget, drop the oldest or least relevant items first.

Summarize. Compress long history into a short running summary and pass that instead of the raw transcript.

Retrieve. Instead of including everything, fetch only the chunks relevant to the current request.

Reserve. Keep a fixed budget for output so growing input never crowds out the model's response.
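One common strategy, fitting variable context into a fixed token budget by dropping the oldest items first, can be sketched like this. The 4-characters-per-token estimate is a rough heuristic, not an exact tokenizer:

```typescript
// Rough token estimate: ~4 characters per token (heuristic, not a real tokenizer).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep the most recent items that fit within the budget, preserving order.
function fitToBudget(items: string[], budgetTokens: number): string[] {
  const kept: string[] = [];
  let used = 0;
  // Walk backwards so the newest context survives when the budget runs out.
  for (let i = items.length - 1; i >= 0; i--) {
    const cost = estimateTokens(items[i]);
    if (used + cost > budgetTokens) break;
    kept.unshift(items[i]);
    used += cost;
  }
  return kept;
}
```

In production you would use the model's actual tokenizer, but the shape of the strategy is the same.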

Tool usage and function calling

When the model calls tools, the prompt needs to describe not just what each tool does, but when to use it and what to do with the result.

Bad tool description:
  "search_code: Searches for code"

Good tool description:
  "search_code: Search for a pattern across all files in the repository.
   Returns up to 10 matching files with line numbers and surrounding context.

   When to use: Before modifying code, to find all usages of a function,
   type, or variable. Also use to verify that a proposed change does not
   break existing callers.

   Input: {pattern: string, file_glob?: string}
   Output: Array of {file: string, line: number, context: string}"

The model uses the tool description to decide when and how to call it. A vague description produces wrong tool calls. A detailed description produces correct ones. This is prompt engineering applied to tool interfaces.
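In code, a tool description is usually a structured object passed to the API, with the when-to-use guidance in its description field. An illustrative shape; the exact field names vary by provider:

```typescript
// Illustrative tool definition; exact schema depends on the provider's API.
const searchCodeTool = {
  name: "search_code",
  description:
    "Search for a pattern across all files in the repository. " +
    "Returns up to 10 matching files with line numbers and surrounding context. " +
    "Use before modifying code to find all usages of a function, type, or variable.",
  parameters: {
    type: "object",
    properties: {
      pattern: { type: "string" },
      file_glob: { type: "string" },
    },
    required: ["pattern"],
  },
};
```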

Evaluating prompt quality

You cannot improve what you do not measure. Prompt evaluation needs to be systematic, not vibes-based.

The three metrics

  Metric         What it measures              How to test
  ──────         ────────────────              ───────────
  Accuracy       correct answers               labeled test set
  Consistency    same input = same output      run N times, compare
  Cost           tokens consumed per task      log and aggregate

Accuracy. Build a test set of 20-50 input-output pairs where you know the correct answer. Run your prompt against the test set. Measure how many it gets right. This is the floor.

Consistency. Run the same prompt on the same input 10 times. If you get 10 different outputs, your prompt is underspecified. Add constraints or lower temperature until consistency improves.

Cost. Track input tokens, output tokens, and number of API calls per task. A prompt that uses 50k tokens per run might be correct but too expensive to deploy.
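Consistency can be measured directly: collect N responses and compute the fraction that agree with the most common answer. A sketch of the scoring half; fetching the responses is up to the caller:

```typescript
// Fraction of responses matching the most common (normalized) answer.
// 1.0 means perfectly consistent; 1/N means every run disagreed.
function consistencyScore(responses: string[]): number {
  const counts = new Map<string, number>();
  for (const response of responses) {
    const key = response.trim().toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const top = Math.max(...Array.from(counts.values()));
  return top / responses.length;
}
```

A score below some threshold you choose (say 0.8) is a signal to add constraints or lower temperature before shipping.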

Iterative refinement

  ┌──────────┐     ┌──────────┐     ┌──────────┐
  │  Write   │────▶│ Test on  │────▶│ Analyze  │
  │  prompt  │     │ eval set │     │ failures │
  └──────────┘     └──────────┘     └────┬─────┘
       ▲                                 │
       └─────────────────────────────────┘
              fix the specific failure

Do not rewrite the entire prompt when something fails. Read the failure, understand why it happened, and fix that specific case. Add a constraint. Add an example. Clarify an instruction. Then re-run the full eval set to make sure the fix did not break other cases.

This is the same discipline as debugging code. Surgical fixes, not rewrites.

Failure modes

Hallucination

The model invents facts that are not in the context. This happens most when the prompt asks for information without providing it.

Causes hallucination:
  "What is the return type of the getUserById function?"
  (if the function is not in the context, the model will guess)

Prevents hallucination:
  "Based ONLY on the code provided below, what is the return type
   of getUserById? If the function is not present in the provided
   code, respond with 'Function not found in context.'"

The fix is always the same: tell the model what to do when it does not have the information. Without an explicit fallback, the model fills the gap with plausible fiction.

Prompt brittleness

A prompt that works for your test cases but fails on real inputs. This happens when the prompt is overfit to specific examples.

Brittle:
  "Extract the price from the product description.
   The price is always in the format $X.XX."

  Fails on: "Starting at €49/mo" or "Price: 1,299.00 USD"

Robust:
  "Extract the price from the product description.
   The price may be in any currency and any format
   (e.g., $10.99, €49/mo, 1,299 USD, free).
   Return: {amount: number, currency: string, period?: string}
   If no price is found, return: {amount: null}"

Build your eval set from real inputs, not synthetic ones. The edge cases in real data are what break brittle prompts.
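Robustness also means validating what comes back: parse the model's output and check its shape before trusting it. A minimal guard matching the format in the robust prompt above; the helper name is an illustrative choice:

```typescript
interface PriceResult {
  amount: number | null;
  currency?: string;
  period?: string;
}

// Parse and validate model output; return null instead of throwing on bad shapes.
function parsePriceResult(raw: string): PriceResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // model returned non-JSON
  }
  if (typeof data !== "object" || data === null || !("amount" in data)) return null;
  const amount = (data as { amount: unknown }).amount;
  if (amount !== null && typeof amount !== "number") return null;
  return data as PriceResult;
}
```

The null return gives your system a defined path for malformed output, which is the same fallback discipline the prompt itself uses.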

Token waste

Prompts that use 10,000 tokens when 2,000 would produce the same result. Common causes:

Wasteful:
  System: "You are a helpful assistant that summarizes text..."
  User: "Please summarize the following text. Make sure your
         summary is concise and captures the key points. The
         summary should be in paragraph form and should not
         exceed 3 sentences. Please focus on the main ideas..."

Efficient:
  System: "You summarize text in exactly 3 sentences."
  User: "Summarize:\n\n{{text}}"

Every unnecessary token is money and latency. In high-volume systems, this matters.

Real-world patterns

Content generation

The key insight: separate structure from content. First prompt generates the structure (outline, sections, key points). Second prompt generates the content for each section. This produces dramatically better results than a single "write an article about X" prompt.

  Prompt 1 (structure):
  "Create an outline for an article about {{topic}}.
   Return 5-7 section headings with 1-sentence descriptions."

  Prompt 2 (per section):
  "Write the '{{section_title}}' section of an article about {{topic}}.
   Context from other sections: {{summaries}}
   Target: 200 words. Tone: {{tone}}."

  Prompt 3 (polish):
  "Review this article for consistency in tone and flow.
   Fix transitions between sections. Do not change content."

Data extraction

The key insight: show the edge cases in your examples, not just the happy path.

Good few-shot for extraction:

Input: "Contact John at john@company.com or 555-1234"
Output: {"emails": ["john@company.com"], "phones": ["555-1234"]}

Input: "No contact info available"
Output: {"emails": [], "phones": []}

Input: "Email me at john [at] company [dot] com"
Output: {"emails": ["john@company.com"], "phones": []}

The third example teaches the model to handle obfuscated emails. Without it, most models will miss them. Your examples define the edges of what the model will handle.

Coding assistants

The key insight: the system prompt is your quality floor. Everything the assistant does should be consistent with it.

const system = `You are a TypeScript code assistant for a Next.js project.

Technical context:
- Next.js 16 with Pages Router
- Tailwind CSS v4 with cn() utility
- React 19, no class components
- Strict TypeScript, no "any"

Behavior rules:
- Read existing code before suggesting changes
- Match existing patterns, do not introduce new ones
- When modifying files, show the minimal diff
- If a task is ambiguous, ask for clarification instead of guessing
- Never suggest installing new dependencies unless explicitly asked`;

Every rule in the system prompt prevents a specific failure mode. "Match existing patterns" prevents the model from generating textbook code that looks nothing like your project. "Minimal diff" prevents the model from rewriting entire files when one line needs to change. These are not suggestions. They are constraints that produce reliable behavior.

Prompt engineering vs fine-tuning

  Prompt Engineering           Fine-tuning
  ──────────────────           ───────────
  Change the input             Change the model
  Instant iteration            Hours/days to train
  No data required             Needs training data
  Flexible per task            Fixed after training
  Higher per-call cost         Lower per-call cost
  Works across models          Model-specific

  Use prompting when:          Use fine-tuning when:
  - Iterating quickly          - High volume, fixed task
  - Task varies per user       - Need lower latency/cost
  - Small scale                - Have quality training data
  - Exploring approach         - Prompt is proven, optimized

Start with prompt engineering. Always. Fine-tuning is an optimization you apply after you have proven the task works with prompts and have collected enough high-quality examples to train on. Fine-tuning a model with a bad prompt just bakes in the bad behavior.

The discipline

Prompt engineering is not about finding magic phrases. It is about designing input contracts.

Every prompt is a specification. The instruction says what to do. The context provides the data. The constraints define the boundaries. The examples show the expected behavior. When any of these is missing, the model fills the gap with its best guess, and best guesses are not reliable at scale.

  Reliable prompt = clear instruction
                  + sufficient context
                  + explicit constraints
                  + representative examples
                  + defined failure behavior

Start with the simplest prompt that could work. Test it on real inputs. When it fails, fix the specific failure. Add a constraint, add an example, clarify an instruction. Test again. Repeat.

The prompts that work in production are not clever. They are thorough.