Most prompt advice is about phrasing. "Be specific." "Give examples." "Say please." That gets you from bad to decent. It does not get you from decent to production-ready.
Prompt engineering is not copywriting for LLMs. It is system design. You are defining the input contract for a non-deterministic function. The prompt is the interface. The model is the runtime. And like any interface, the quality of the contract determines the reliability of the system.
This post covers the full discipline: how prompts work mechanically, the techniques that actually matter, how to build prompt-driven systems that hold up in production, and the failure modes that will bite you if you ignore them.
Every effective prompt has four components. You do not always need all four, but knowing which ones you are using (and which you are skipping) matters.
┌─────────────────────────────────────────────┐
│                   PROMPT                    │
│                                             │
│   ┌───────────┐         ┌───────────┐       │
│   │Instruction│         │  Context  │       │
│   │           │         │           │       │
│   │  what to  │         │ background│       │
│   │    do     │         │   info    │       │
│   └───────────┘         └───────────┘       │
│                                             │
│   ┌───────────┐         ┌───────────┐       │
│   │Constraints│         │ Examples  │       │
│   │           │         │           │       │
│   │ rules and │         │  input/   │       │
│   │ boundaries│         │  output   │       │
│   │           │         │   pairs   │       │
│   └───────────┘         └───────────┘       │
│                                             │
└─────────────────────────────────────────────┘
Instruction. What you want the model to do. "Summarize this article." "Extract the email addresses." "Review this code for security issues." The clearer the verb, the better the result.
Context. The background information the model needs. The article to summarize. The codebase conventions to follow. The user's previous messages. Context is where most prompt quality is won or lost, because most people either provide too little (the model guesses) or too much (the model gets confused).
Constraints. The rules the output must follow. "Respond in JSON." "Keep it under 200 words." "Do not include personally identifiable information." "If you are unsure, say so instead of guessing." Constraints turn vague requests into reliable contracts.
Examples. Input-output pairs that show the model what you want. This is the single most effective technique for controlling output format and quality. One good example is worth ten sentences of instruction.
The distinction between zero-shot, few-shot, and system prompting is the foundation everything else builds on.
Strategy         Examples Given    Best For
────────         ──────────────    ────────
Zero-shot        none              simple, well-defined tasks
Few-shot         1-5 pairs         format control, edge cases
System prompt    none (persona)    agents, consistent behavior
Zero-shot means you give the instruction and context, but no examples. "Translate this sentence to French." The model already knows how to do the task. You are just directing it. This works for common, well-understood tasks.
Few-shot means you include examples of the input-output mapping you want. This is dramatically more effective than adding more instruction words. Instead of explaining your output format in prose, show it.
Bad (zero-shot, verbose instruction):
"Extract the company name and founding year from the text below.
Return the result as a JSON object with keys 'company' and 'year'.
The year should be a number, not a string. If the year is not
mentioned, use null."
Better (few-shot, one example):
"Extract company info from text.
Text: Apple was founded in 1976 by Steve Jobs.
Output: {"company": "Apple", "year": 1976}
Text: The startup launched last summer.
Output: {"company": "The startup", "year": null}
Text: {{input}}
Output:"
The few-shot version is shorter, clearer, and produces more consistent output. The model learns the format from the example, not from your description of the format.
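In code, the examples become data. A minimal sketch of assembling few-shot pairs into a prompt at runtime; the Shot type and helper name are illustrative, not a fixed API:

// Few-shot examples stored as data, assembled into the prompt per request.
type Shot = { text: string; output: string };

const shots: Shot[] = [
  {
    text: "Apple was founded in 1976 by Steve Jobs.",
    output: '{"company": "Apple", "year": 1976}',
  },
  {
    text: "The startup launched last summer.",
    output: '{"company": "The startup", "year": null}',
  },
];

function fewShotPrompt(input: string): string {
  const examples = shots
    .map((s) => `Text: ${s.text}\nOutput: ${s.output}`)
    .join("\n\n");
  return `Extract company info from text.\n\n${examples}\n\nText: ${input}\nOutput:`;
}

Keeping examples as data means you can add an edge case to the array without touching the prompt logic.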
System prompting sets the model's persona and behavior rules before the conversation starts. This is how you build agents and assistants with consistent behavior.
const systemPrompt = `You are a code review assistant for a TypeScript project.
Rules:
- Reference specific line numbers
- Categorize issues as "bug", "style", or "performance"
- If code is correct, say so. Do not invent issues.
- Respond in JSON: [{"line": number, "category": string, "issue": string}]`;
The system prompt is not a suggestion. It is the constitution. Every response the model produces should be consistent with it.
Think of the prompt as defining a function:
f(input) = output
where:
- input = the data you provide (context, examples, user query)
- f = the transformation defined by your instruction + constraints
- output = the model's response, shaped by all of the above
When the output is wrong, the debugging question is: which part of the function is broken? Is the input incomplete (missing context)? Is the transformation ambiguous (unclear instruction)? Are the constraints too loose (no format specified)?
LLMs are not deterministic. The same prompt can produce different outputs on different runs. This is a feature for creative tasks and a problem for structured tasks.
Temperature    Behavior              Use When
───────────    ────────              ────────
0              near-deterministic    structured extraction, JSON, code
0.3-0.7        balanced              general tasks, summaries
1.0+           high variability      brainstorming, creative writing
For production systems, use temperature 0 and design your prompt to fully specify the output. If the model needs to "think creatively," that is a sign your prompt is underspecified, not that you need higher temperature.
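In code, that means pinning temperature to 0 and validating the fully specified output. A minimal sketch, assuming a `callLLM` helper that wraps your provider SDK and returns the raw response text (the same assumed helper appears in later examples):

async function extractJSON<T>(prompt: string): Promise<T> {
  // Temperature 0 for structured output; the prompt must fully specify the format.
  const raw = await callLLM(prompt, { temperature: 0 });
  try {
    return JSON.parse(raw) as T;
  } catch {
    // Fail loudly instead of silently accepting malformed output.
    throw new Error(`Expected JSON, got: ${raw.slice(0, 200)}`);
  }
}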
The most underrated prompting technique is stacking constraints. Each constraint narrows the output space. Enough constraints and the model has almost no room to produce bad output.
Without constraints:
"Summarize this article."
→ Could be 50 words or 500. Could be bullet points or prose.
Could include opinions or just facts.
With stacked constraints:
"Summarize this article in exactly 3 bullet points.
Each bullet should be one sentence, max 20 words.
Focus on factual claims, not opinions.
Start each bullet with a verb."
→ Output space is tiny. Almost every run produces good output.
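One way to keep stacked constraints maintainable is to store them as data rather than burying them in a prose string. A sketch, assuming `article` holds the input text:

// Constraints as data: easy to version, reorder, and stack per use case.
const summaryConstraints = [
  "Summarize this article in exactly 3 bullet points.",
  "Each bullet is one sentence, max 20 words.",
  "Focus on factual claims, not opinions.",
  "Start each bullet with a verb.",
];

const prompt = `${summaryConstraints.join("\n")}\n\nArticle:\n${article}`;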
Chain-of-thought prompting asks the model to show its reasoning before giving the answer. This consistently improves accuracy on complex tasks because it forces the model to work through the problem step by step instead of jumping to a conclusion.
Without chain-of-thought:
"Is this code thread-safe? Answer yes or no."
→ Often wrong. The model pattern-matches instead of reasoning.
With chain-of-thought:
"Analyze this code for thread safety.
1. List all shared mutable state
2. Check if each access is synchronized
3. Identify any race conditions
4. Conclude whether the code is thread-safe"
→ Much more accurate. The steps force actual analysis.
You do not need to say "think step by step." You need to define the steps.
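A sketch of consuming chain-of-thought output programmatically, assuming the same `callLLM` helper and that `code` holds the snippet under review; the VERDICT convention is an illustrative trick, not a standard:

// Explicit steps plus a machine-readable final line, so downstream code
// can parse the verdict without parsing the reasoning.
const cotPrompt = `Analyze this code for thread safety.
1. List all shared mutable state
2. Check if each access is synchronized
3. Identify any race conditions
4. End with exactly one line: VERDICT: thread-safe OR VERDICT: not-thread-safe

${code}`;

const analysis = await callLLM(cotPrompt, { temperature: 0 });
const verdict = analysis.match(/VERDICT:\s*([\w-]+)/)?.[1];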
Break complex tasks into a pipeline of simpler prompts. Each prompt does one well-defined transformation.
Monolithic prompt (fragile):
┌──────────────────────────────────┐
│ "Read this PR, find bugs,        │
│  categorize them, suggest fixes, │
│  estimate severity, write a      │
│  summary, format as markdown"    │
└──────────────────────────────────┘
→ Tries to do too much. Quality degrades on every sub-task.
Chained prompts (reliable):
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│ Extract  │──▶│Categorize│──▶│ Suggest  │──▶│  Format  │
│  issues  │   │ + score  │   │  fixes   │   │ summary  │
└──────────┘   └──────────┘   └──────────┘   └──────────┘
→ Each prompt has one job. Quality stays high.
Chaining costs more tokens (each step is an API call) but produces dramatically better results on complex tasks. The trade-off is almost always worth it.
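A sketch of the chained version, assuming the `callLLM` helper from earlier and per-step prompt builders; the builder names (extractIssuesPrompt and friends) are illustrative:

async function reviewPR(diff: string): Promise<string> {
  // Each step is one well-defined transformation; each output feeds the
  // next prompt. The builders wrap their step's instruction around the input.
  const issues = await callLLM(extractIssuesPrompt(diff), { temperature: 0 });
  const scored = await callLLM(categorizePrompt(issues), { temperature: 0 });
  const fixes = await callLLM(suggestFixesPrompt(scored), { temperature: 0 });
  return callLLM(formatSummaryPrompt(fixes), { temperature: 0 });
}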
Run the same prompt multiple times and take the majority answer. This reduces the impact of random variation.
async function selfConsistentAnswer(prompt: string, runs = 5) {
  // Temperature > 0 on purpose: self-consistency needs variation to vote over.
  const responses = await Promise.all(
    Array.from({ length: runs }, () => callLLM(prompt, { temperature: 0.7 }))
  );

  // Count occurrences of each unique answer
  const counts = new Map<string, number>();
  for (const response of responses) {
    const normalized = response.trim().toLowerCase();
    counts.set(normalized, (counts.get(normalized) || 0) + 1);
  }

  // Return the most common answer
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}
Use this for high-stakes classification or extraction where getting the wrong answer is costly. Do not use it for generative tasks where there is no "correct" answer.
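Usage, for a classification task (the prompt text and `review` input are illustrative):

const label = await selfConsistentAnswer(
  `Classify the sentiment of this review as exactly one word:
positive, negative, or neutral.

${review}`
);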
Assigning the model a specific role changes its behavior more than you might expect. The key is specificity.
Vague role (minimal effect):
"You are a helpful assistant."
Specific role (strong effect):
"You are a senior TypeScript engineer reviewing pull requests
for a fintech startup. You care deeply about type safety,
error handling, and edge cases around money calculations.
You are direct and do not sugarcoat feedback."
The specific role activates more relevant knowledge and produces responses that match the described perspective. The model does not "become" the role, but it conditions its output distribution on it.
Complex systems often need instructions at multiple levels. Layer them from general to specific.
// Layer 1: System prompt (always active, defines persona)
const system = `You are a data extraction agent.
You always respond in valid JSON.
You never fabricate data that is not present in the source.`;
// Layer 2: Task instruction (changes per request type)
const taskInstruction = `Extract all monetary amounts from the
document below. For each amount, include the value, currency,
and the sentence it appeared in.`;
// Layer 3: Constraints (specific to this run)
const constraints = `Output format: [{"value": number, "currency": string, "sentence": string}]
If a currency is ambiguous, use "USD" as default.
Ignore amounts that appear in headers or footnotes.`;
// Layer 4: Examples (optional, for edge cases)
const example = `Example input: "The project cost $1.2M in 2024."
Example output: [{"value": 1200000, "currency": "USD", "sentence": "The project cost $1.2M in 2024."}]`;
// `document` is the input to process; the system prompt is sent
// separately as the system message, not concatenated here.
const fullPrompt = [taskInstruction, constraints, example, document].join(
  "\n\n"
);
Each layer handles a different concern. The system prompt never changes. The task instruction changes per use case. Constraints and examples change per run. This separation makes the system maintainable.
In production, prompts are not static strings. They are templates populated at runtime.
function buildReviewPrompt(diff: string, rules: string[], language: string) {
  const ruleBlock = rules.map((rule, i) => `${i + 1}. ${rule}`).join("\n");

  return `Review this ${language} code diff.
Apply these project-specific rules:
${ruleBlock}
Diff:
\`\`\`
${diff}
\`\`\`
For each issue found, respond with:
- Line number
- Rule violated (by number)
- Suggested fix
If the code follows all rules, respond with: "No issues found."`;
}
The prompt is now a function. The instruction is fixed. The context (diff, rules, language) is injected at runtime. This is how you build systems that handle thousands of different inputs with the same prompt structure.
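Usage, with hypothetical project rules (`diff` holds the code diff under review):

const prompt = buildReviewPrompt(
  diff,
  [
    "No floating-point arithmetic on money values",
    "All public functions must have explicit return types",
  ],
  "TypeScript"
);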
Every token in the prompt costs money and consumes limited space. In production, you need a strategy.
┌────────────────────────────────────────────────┐
│                 Context Window                 │
│                                                │
│  ┌─────────────┐   Fixed cost: system prompt,  │
│  │   System    │   instructions, constraints   │
│  │   prompt    │   (~500-1000 tokens)          │
│  │   + rules   │                               │
│  └─────────────┘                               │
│                                                │
│  ┌─────────────┐   Variable: grows with input  │
│  │   Context   │   (code, docs, history)       │
│  │    data     │   (~1000-50000 tokens)        │
│  └─────────────┘                               │
│                                                │
│  ┌─────────────┐   Reserved for model output   │
│  │   Output    │   (~500-4000 tokens)          │
│  │   space     │                               │
│  └─────────────┘                               │
│                                                │
│  Total budget: 128k-200k tokens                │
└────────────────────────────────────────────────┘
Strategies: keep the fixed layer small and stable (you pay for it on every call), trim or summarize conversation history before it crowds out new input, and include only the context relevant to the current task instead of dumping in everything you have. A sketch of enforcing the budget before each call follows; it assumes a `countTokens` tokenizer helper (e.g. a tiktoken-style library) and illustrative limits.
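const MAX_INPUT_TOKENS = 100_000; // illustrative; set per model
const RESERVED_OUTPUT = 4_000; // leave room for the response

function fitContext(system: string, context: string): string {
  const budget = MAX_INPUT_TOKENS - countTokens(system) - RESERVED_OUTPUT;
  // Naive truncation; summarization or retrieval do better but cost more.
  while (countTokens(context) > budget) {
    context = context.slice(0, Math.floor(context.length * 0.9));
  }
  return context;
}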
When the model calls tools, the prompt needs to describe not just what each tool does, but when to use it and what to do with the result.
Bad tool description:
"search_code: Searches for code"
Good tool description:
"search_code: Search for a pattern across all files in the repository.
Returns up to 10 matching files with line numbers and surrounding context.
When to use: Before modifying code, to find all usages of a function,
type, or variable. Also use to verify that a proposed change does not
break existing callers.
Input: {pattern: string, file_glob?: string}
Output: Array of {file: string, line: number, context: string}"
The model uses the tool description to decide when and how to call it. A vague description produces wrong tool calls. A detailed description produces correct ones. This is prompt engineering applied to tool interfaces.
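How that description might be packaged as a tool definition. The exact schema shape varies by provider, so treat this as a sketch in the common JSON Schema style:

const searchCodeTool = {
  name: "search_code",
  description:
    "Search for a pattern across all files in the repository. " +
    "Returns up to 10 matching files with line numbers and surrounding context. " +
    "Use before modifying code to find all usages of a function, type, or " +
    "variable, and to verify a change does not break existing callers.",
  input_schema: {
    type: "object",
    properties: {
      pattern: { type: "string", description: "Pattern to search for" },
      file_glob: { type: "string", description: "Optional glob to restrict the search" },
    },
    required: ["pattern"],
  },
};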
You cannot improve what you do not measure. Prompt evaluation needs to be systematic, not vibes-based.
Metric         What it measures            How to test
──────         ────────────────            ───────────
Accuracy       correct answers             labeled test set
Consistency    same input = same output    run N times, compare
Cost           tokens consumed per task    log and aggregate
Accuracy. Build a test set of 20-50 input-output pairs where you know the correct answer. Run your prompt against the test set. Measure how many it gets right. This is the floor.
Consistency. Run the same prompt on the same input 10 times. If you get 10 different outputs, your prompt is underspecified. Add constraints or lower temperature until consistency improves.
Cost. Track input tokens, output tokens, and number of API calls per task. A prompt that uses 50k tokens per run might be correct but too expensive to deploy.
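A minimal eval harness sketch, reusing the assumed `callLLM` helper. Exact-match scoring is the simplest option and only works when the prompt fully specifies the output:

type EvalCase = { input: string; expected: string };

async function runEval(
  buildPrompt: (input: string) => string,
  cases: EvalCase[]
): Promise<number> {
  let correct = 0;
  for (const c of cases) {
    const output = await callLLM(buildPrompt(c.input), { temperature: 0 });
    if (output.trim() === c.expected.trim()) {
      correct++;
    } else {
      console.log(`FAIL: ${c.input}\n  expected: ${c.expected}\n  got: ${output}`);
    }
  }
  console.log(`Accuracy: ${correct}/${cases.length}`);
  return correct / cases.length;
}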
┌──────────┐     ┌──────────┐     ┌──────────┐
│  Write   │────▶│ Test on  │────▶│ Analyze  │
│  prompt  │     │ eval set │     │ failures │
└──────────┘     └──────────┘     └────┬─────┘
      ▲                                │
      └────────────────────────────────┘
             fix the specific failure
Do not rewrite the entire prompt when something fails. Read the failure, understand why it happened, and fix that specific case. Add a constraint. Add an example. Clarify an instruction. Then re-run the full eval set to make sure the fix did not break other cases.
This is the same discipline as debugging code. Surgical fixes, not rewrites.
The model invents facts that are not in the context. This happens most when the prompt asks for information without providing it.
Causes hallucination:
"What is the return type of the getUserById function?"
(if the function is not in the context, the model will guess)
Prevents hallucination:
"Based ONLY on the code provided below, what is the return type
of getUserById? If the function is not present in the provided
code, respond with 'Function not found in context.'"
The fix is always the same: tell the model what to do when it does not have the information. Without an explicit fallback, the model fills the gap with plausible fiction.
A prompt that works for your test cases but fails on real inputs. This happens when the prompt is overfit to specific examples.
Brittle:
"Extract the price from the product description.
The price is always in the format $X.XX."
Fails on: "Starting at €49/mo" or "Price: 1,299.00 USD"
Robust:
"Extract the price from the product description.
The price may be in any currency and any format
(e.g., $10.99, €49/mo, 1,299 USD, free).
Return: {amount: number, currency: string, period?: string}
If no price is found, return: {amount: null}"
Build your eval set from real inputs, not synthetic ones. The edge cases in real data are what break brittle prompts.
Prompts that use 10,000 tokens when 2,000 would produce the same result. The usual culprit is redundant instruction: saying the same thing in the system prompt and the user prompt, or padding requests with filler.
Wasteful:
System: "You are a helpful assistant that summarizes text..."
User: "Please summarize the following text. Make sure your
summary is concise and captures the key points. The
summary should be in paragraph form and should not
exceed 3 sentences. Please focus on the main ideas..."
Efficient:
System: "You summarize text in exactly 3 sentences."
User: "Summarize:\n\n{{text}}"
Every unnecessary token is money and latency. In high-volume systems, this matters.
The key insight: separate structure from content. First prompt generates the structure (outline, sections, key points). Second prompt generates the content for each section. This produces dramatically better results than a single "write an article about X" prompt.
Prompt 1 (structure):
"Create an outline for an article about {{topic}}.
Return 5-7 section headings with 1-sentence descriptions."
Prompt 2 (per section):
"Write the '{{section_title}}' section of an article about {{topic}}.
Context from other sections: {{summaries}}
Target: 200 words. Tone: {{tone}}."
Prompt 3 (polish):
"Review this article for consistency in tone and flow.
Fix transitions between sections. Do not change content."
The key insight: show the edge cases in your examples, not just the happy path.
Good few-shot for extraction:
Input: "Contact John at john@company.com or 555-1234"
Output: {"emails": ["john@company.com"], "phones": ["555-1234"]}
Input: "No contact info available"
Output: {"emails": [], "phones": []}
Input: "Email me at john [at] company [dot] com"
Output: {"emails": ["john@company.com"], "phones": []}
The third example teaches the model to handle obfuscated emails. Without it, most models will miss them. Your examples define the edges of what the model will handle.
The key insight: the system prompt is your quality floor. Everything the assistant does should be consistent with it.
const system = `You are a TypeScript code assistant for a Next.js project.
Technical context:
- Next.js 16 with Pages Router
- Tailwind CSS v4 with cn() utility
- React 19, no class components
- Strict TypeScript, no "any"
Behavior rules:
- Read existing code before suggesting changes
- Match existing patterns, do not introduce new ones
- When modifying files, show the minimal diff
- If a task is ambiguous, ask for clarification instead of guessing
- Never suggest installing new dependencies unless explicitly asked`;
Every rule in the system prompt prevents a specific failure mode. "Match existing patterns" prevents the model from generating textbook code that looks nothing like your project. "Minimal diff" prevents the model from rewriting entire files when one line needs to change. These are not suggestions. They are constraints that produce reliable behavior.
Prompt Engineering        Fine-tuning
──────────────────        ───────────
Change the input          Change the model
Instant iteration         Hours/days to train
No data required          Needs training data
Flexible per task         Fixed after training
Higher per-call cost      Lower per-call cost
Works across models       Model-specific

Use prompting when:           Use fine-tuning when:
- Iterating quickly           - High volume, fixed task
- Task varies per user        - Need lower latency/cost
- Small scale                 - Have quality training data
- Exploring approach          - Prompt is proven, optimized
Start with prompt engineering. Always. Fine-tuning is an optimization you apply after you have proven the task works with prompts and have collected enough high-quality examples to train on. Fine-tuning a model with a bad prompt just bakes in the bad behavior.
Prompt engineering is not about finding magic phrases. It is about designing input contracts.
Every prompt is a specification. The instruction says what to do. The context provides the data. The constraints define the boundaries. The examples show the expected behavior. When any of these is missing, the model fills the gap with its best guess, and best guesses are not reliable at scale.
Reliable prompt = clear instruction
+ sufficient context
+ explicit constraints
+ representative examples
+ defined failure behavior
Start with the simplest prompt that could work. Test it on real inputs. When it fails, fix the specific failure. Add a constraint, add an example, clarify an instruction. Test again. Repeat.
The prompts that work in production are not clever. They are thorough.