Prompt workflows in production

A single prompt works for simple tasks. "Summarize this paragraph." "Translate this sentence." "Extract the email from this text." One input, one transformation, one output. Clean.

The moment the task gets complex, single prompts start failing in subtle ways. "Review this PR for bugs, style issues, security problems, and test coverage gaps, then summarize your findings in a structured report." That prompt will produce output. It will look reasonable. But it will miss things, conflate categories, and produce inconsistent results across runs. The model is trying to do five things at once, and it does each one worse than if it did them separately.

The fix is not a better prompt. It is a better workflow. Break the task into steps. Run each step with a focused prompt. Pass structured data between steps. Validate output at each stage. Retry on failure. This is prompt orchestration, and it is how every serious LLM-powered system actually works.

Core workflow patterns

Four patterns cover almost every production prompt workflow.

  Pattern              When to use
  ───────              ───────────
  Chaining             steps depend on each other sequentially
  Decomposition        task splits into independent subtasks
  Parallel execution   subtasks can run concurrently
  Queued execution     high volume, rate limits, retries needed

Prompt chaining

The most common pattern. Each step does one transformation. The output of step N becomes the input of step N+1.

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │ Step 1   │────▶│ Step 2   │────▶│ Step 3   │────▶│ Step 4   │
  │ Extract  │     │ Classify │     │ Enrich   │     │ Format   │
  └──────────┘     └──────────┘     └──────────┘     └──────────┘
       │                │                │                │
    raw text        entities         categories      structured
                                    + context          report

Each prompt is simple. "Extract entities from this text." "Classify each entity." "Add context for each." "Format as a report." The model does one thing well at each step instead of doing four things poorly in one shot.

type Step<In, Out> = {
  name: string;
  prompt: (input: In) => string;
  parse: (response: string) => Out;
};

async function chain<A, B, C>(
  input: A,
  step1: Step<A, B>,
  step2: Step<B, C>
): Promise<C> {
  console.log(`[chain] Running ${step1.name}`);
  const raw1 = await callLLM(step1.prompt(input));
  const result1 = step1.parse(raw1);

  console.log(`[chain] Running ${step2.name}`);
  const raw2 = await callLLM(step2.prompt(result1));
  return step2.parse(raw2);
}

The chain function is typed end-to-end. Step 1 takes A and produces B. Step 2 takes B and produces C. If the types do not match, the code does not compile. This prevents the most common chaining bug: step N producing output that step N+1 cannot parse.
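
For example, two steps from the extract and classify stages above might look like this. This is a sketch: the prompts and entity shapes are illustrative, and the parse functions just call JSON.parse where a real step would also validate the shape.

const extractStep: Step<string, { entities: string[] }> = {
  name: "extract",
  prompt: (text) =>
    `Extract the named entities from this text. Respond as JSON: {"entities": [...]}\n\n${text}`,
  parse: (response) => JSON.parse(response),
};

const classifyStep: Step<{ entities: string[] }, { labels: Record<string, string> }> = {
  name: "classify",
  prompt: (input) =>
    `Classify each entity as person, company, or place. Respond as JSON: {"labels": {...}}\n\nEntities: ${input.entities.join(", ")}`,
  parse: (response) => JSON.parse(response),
};

const labels = await chain("Apple Inc. was founded by Steve Jobs.", extractStep, classifyStep);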

Before and after:

Single prompt (fragile):
  "Analyze this customer support ticket. Identify the problem,
   determine urgency, suggest a resolution, draft a response,
   and format everything as JSON."
  → Does 5 things. Quality is mediocre on all of them.

Chained prompts (reliable):
  Step 1: "What is the customer's problem? One sentence."
  Step 2: "Rate urgency 1-5 based on this problem: {{problem}}"
  Step 3: "Suggest a resolution for: {{problem}} (urgency: {{urgency}})"
  Step 4: "Draft a customer response for: {{problem}} + {{resolution}}"
  → Each step is focused. Quality is high on all of them.

Task decomposition

Chaining is sequential. Decomposition is about splitting a task into parts that can be handled independently, whether you run them in sequence or in parallel.

  "Review this PR"
         │
    ┌────┼────┬────────┐
    ▼    ▼    ▼        ▼
  Types  Tests Style  Security
  check  check check  check
    │    │    │        │
    └────┼────┴────────┘
         ▼
    Merge results
    into one review

Each subtask has its own prompt, its own constraints, and its own evaluation criteria. The type checker prompt looks for TypeScript errors. The security prompt looks for injection vulnerabilities. They do not interfere with each other.

interface ReviewTask {
  name: string;
  systemPrompt: string;
  focus: string;
}

const tasks: ReviewTask[] = [
  {
    name: "types",
    systemPrompt: "You check TypeScript code for type safety issues.",
    focus: "Find type errors, unsafe casts, missing null checks.",
  },
  {
    name: "tests",
    systemPrompt: "You evaluate test coverage and quality.",
    focus: "Are changed lines covered? Are edge cases tested?",
  },
  {
    name: "style",
    systemPrompt: "You check code style and readability.",
    focus: "Naming, structure, complexity, dead code.",
  },
  {
    name: "security",
    systemPrompt: "You audit code for security vulnerabilities.",
    focus: "Injection, auth bypass, data exposure, unsafe deps.",
  },
];
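
With the tasks defined, the simplest runner walks them one at a time. A minimal sketch, assuming the same callLLM helper used elsewhere in this piece accepts a system option; the parallel version in the next section is usually what you want:

async function sequentialReview(diff: string, tasks: ReviewTask[]) {
  const findings: Record<string, string> = {};
  for (const task of tasks) {
    // One focused call per subtask; results keyed by task name.
    findings[task.name] = await callLLM(`${task.focus}\n\nDiff:\n${diff}`, {
      system: task.systemPrompt,
    });
  }
  return findings;
}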

Parallel execution

When subtasks are independent, run them at the same time. Total latency drops from the sum of the steps to roughly the duration of the slowest one.

  Sequential (4 steps x 3s each = 12s):
  ┌───┐ ┌───┐ ┌───┐ ┌───┐
  │ A │→│ B │→│ C │→│ D │  total: 12s
  └───┘ └───┘ └───┘ └───┘

  Parallel (4 steps at once = 3s + merge):
  ┌───┐
  │ A │─┐
  └───┘ │
  ┌───┐ │
  │ B │─┼──▶ merge  total: ~4s
  └───┘ │
  ┌───┐ │
  │ C │─┤
  └───┘ │
  ┌───┐ │
  │ D │─┘
  └───┘

async function parallelReview(diff: string, tasks: ReviewTask[]) {
  const results = await Promise.all(
    tasks.map(async (task) => {
      const response = await callLLM(`${task.focus}\n\nDiff:\n${diff}`, {
        system: task.systemPrompt,
      });
      return { name: task.name, findings: response };
    })
  );

  // Merge step: combine all findings into one report
  const merged = results.map((r) => `## ${r.name}\n${r.findings}`).join("\n\n");

  return callLLM(
    `Combine these review sections into a single structured report.
     Remove duplicates. Order by severity.\n\n${merged}`
  );
}

Four API calls run concurrently. One merge call at the end. Total latency is roughly the time of the slowest subtask plus the merge, not the sum of all subtasks.

Queued execution

When you have many tasks, rate limits, or need retry logic, you need a queue.

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │  Tasks   │────▶│  Queue   │────▶│ Executor │────▶│ Results  │
  │  (100s)  │     │  (FIFO)  │     │ (3 at a  │     │  Store   │
  │          │     │          │     │  time)   │     │          │
  └──────────┘     └──────────┘     └──────────┘     └──────────┘
                                          │
                                     ┌────┴────┐
                                     │  Retry  │
                                     │  queue  │
                                     └─────────┘

type QueuedTask = {
  id: string;
  prompt: string;
  resolve: (value: string) => void;
  reject: (error: Error) => void;
  attempts: number;
};

class PromptQueue {
  private queue: QueuedTask[] = [];
  private running = 0;
  private maxConcurrency: number;
  private maxRetries: number;

  constructor(concurrency = 3, retries = 2) {
    this.maxConcurrency = concurrency;
    this.maxRetries = retries;
  }

  enqueue(id: string, prompt: string): Promise<string> {
    return new Promise((resolve, reject) => {
      this.queue.push({ id, prompt, resolve, reject, attempts: 0 });
      this.process();
    });
  }

  private async process() {
    while (this.running < this.maxConcurrency && this.queue.length > 0) {
      const task = this.queue.shift();
      if (!task) break;
      this.running++;

      this.execute(task).finally(() => {
        this.running--;
        this.process();
      });
    }
  }

  private async execute(task: QueuedTask) {
    try {
      task.attempts++;
      console.log(`[queue] ${task.id} attempt ${task.attempts}`);
      const result = await callLLM(task.prompt);
      task.resolve(result);
    } catch (err) {
      // Allow up to maxRetries re-attempts after the first try
      if (task.attempts <= this.maxRetries) {
        console.log(`[queue] ${task.id} failed, requeuing`);
        this.queue.push(task); // back of the queue
      } else {
        task.reject(
          new Error(`${task.id} failed after ${task.attempts} attempts`)
        );
      }
    }
  }
}

Usage:

const queue = new PromptQueue(3, 2); // 3 concurrent, 2 retries

const results = await Promise.all(
  files.map((file, i) =>
    queue.enqueue(`file-${i}`, `Review this file:\n${file}`)
  )
);

A hundred files, three running at a time, automatic retries on failure. The queue handles rate limits and transient errors without any of the calling code needing to know.

System architecture

A production prompt workflow has four components.

┌──────────────────────────────────────────────────────────┐
│                   Workflow Orchestrator                    │
│                                                          │
│   ┌────────────┐   ┌────────────┐   ┌────────────┐     │
│   │   Task     │   │  Prompt    │   │  Context   │     │
│   │   Queue    │   │  Executor  │   │  Store     │     │
│   │            │   │            │   │            │     │
│   │ enqueue    │──▶│ call LLM   │◀─▶│ read/write │     │
│   │ dequeue    │   │ retry      │   │ state      │     │
│   │ prioritize │   │ validate   │   │ history    │     │
│   └────────────┘   └─────┬──────┘   └────────────┘     │
│                          │                               │
│                    ┌─────▼──────┐                        │
│                    │  Output    │                        │
│                    │  Validator │                        │
│                    │            │                        │
│                    │ schema     │                        │
│                    │ assertions │                        │
│                    │ fallback   │                        │
│                    └────────────┘                        │
│                                                          │
└──────────────────────────────────────────────────────────┘

Task queue. Accepts work items, manages ordering (FIFO or priority), tracks completion. For simple systems, this is an in-memory array. For production, it is Redis, SQS, or a database table.

Prompt executor. Makes the LLM API call. Handles retries with exponential backoff. Enforces token limits. Logs every call. This is the only component that touches the LLM API.

Context store. Holds state between pipeline steps. Step 1 writes its output. Step 2 reads it as input. For simple pipelines, this is just function arguments. For complex systems, it is a key-value store or database.

Output validator. Checks that each step's output matches the expected schema. If validation fails, the step retries or the pipeline halts. This is where you catch hallucinations, malformed JSON, and off-topic responses before they propagate downstream.
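
One way to sketch the boundaries between these components is as interfaces. The names below are illustrative, not taken from any particular library:

interface TaskQueue {
  enqueue(id: string, prompt: string): Promise<string>;
}

interface PromptExecutor {
  run(prompt: string, options?: { system?: string }): Promise<string>;
}

interface ContextStore {
  read<T>(key: string): T | undefined;
  write<T>(key: string, value: T): void;
}

interface OutputValidator {
  check(step: string, output: string): { ok: boolean; error?: string };
}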

Data flow and state management

The hardest part of prompt workflows is not the prompts. It is the data flowing between them.

Typed contracts between steps

Every step should have a defined input type and output type.

interface ExtractResult {
  entities: Array<{ name: string; type: string }>;
  raw_text: string;
}

interface ClassifyResult {
  entities: Array<{ name: string; type: string; category: string }>;
}

interface EnrichResult {
  entities: Array<{
    name: string;
    type: string;
    category: string;
    context: string;
  }>;
}

  Step 1                    Step 2                    Step 3
  ──────                    ──────                    ──────
  Input:  raw text          Input:  ExtractResult     Input:  ClassifyResult
  Output: ExtractResult     Output: ClassifyResult    Output: EnrichResult

  ┌──────────┐        ┌──────────┐        ┌──────────┐
  │ entities │───────▶│ entities │───────▶│ entities │
  │ raw_text │        │ +category│        │ +category│
  └──────────┘        └──────────┘        │ +context │
                                          └──────────┘
  Each step adds fields. No step removes or renames them.

Context accumulation vs isolation

Two strategies for passing data through a pipeline:

Accumulation. Each step sees all prior results. The context grows with each step.

// Accumulation: each step gets everything
let context = { input: rawText };
context = { ...context, ...(await step1(context)) };
context = { ...context, ...(await step2(context)) };
context = { ...context, ...(await step3(context)) };

Good for pipelines where later steps need context from earlier steps. Bad for long pipelines because context grows and eventually hits token limits.

Isolation. Each step only sees its direct input. No accumulated state.

// Isolation: each step gets only what it needs
const extracted = await step1(rawText);
const classified = await step2(extracted.entities);
const enriched = await step3(classified.entities);

Good for long pipelines. Keeps each step's context window small. Bad when a later step genuinely needs context from two steps back.

Most production systems use a hybrid: accumulate within a phase, isolate between phases.
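
A rough sketch of the hybrid, using hypothetical extractStep, classifyStep, and reportStep functions: the analysis phase accumulates, and only the fields the next phase needs cross the boundary.

// Hybrid: accumulate within the analysis phase, isolate across the boundary.
// extractStep, classifyStep, and reportStep are hypothetical step functions.
const analysis: Record<string, unknown> = { input: rawText };
Object.assign(analysis, await extractStep(analysis));
Object.assign(analysis, await classifyStep(analysis));

// Only the distilled fields cross into the next phase.
const report = await reportStep({ entities: analysis.entities });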

Reliability and control

Output validation

Every LLM response should be validated before the next step consumes it.

function validateJSON<T>(
  response: string,
  schema: (data: unknown) => data is T
): T {
  let parsed: unknown;
  try {
    parsed = JSON.parse(response);
  } catch {
    throw new Error(`Invalid JSON: ${response.slice(0, 100)}`);
  }

  if (!schema(parsed)) {
    throw new Error(
      `Schema validation failed: ${JSON.stringify(parsed).slice(0, 100)}`
    );
  }

  return parsed;
}

// Usage in a pipeline step
const raw = await callLLM(prompt);
const result = validateJSON(raw, isExtractResult);
// If validation fails, the step can retry or the pipeline can halt
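
The schema argument is just a type guard. A hand-rolled one for ExtractResult could look like this; a validation library such as zod would do the same job with less code:

// One way to write the schema predicate for ExtractResult by hand.
function isExtractResult(data: unknown): data is ExtractResult {
  if (typeof data !== "object" || data === null) return false;
  const d = data as { entities?: unknown; raw_text?: unknown };
  if (typeof d.raw_text !== "string") return false;
  if (!Array.isArray(d.entities)) return false;
  return d.entities.every(
    (e) => typeof e?.name === "string" && typeof e?.type === "string"
  );
}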

Error handling strategies

  Error Type          Strategy             Example
  ──────────          ────────             ───────
  Rate limit          exponential backoff  wait 1s, 2s, 4s, 8s
  Invalid JSON        retry with a hint    "Respond ONLY with valid JSON." (up to 3 attempts)
  Wrong content       retry with a hint    restate the step's scope and output format
  Timeout             retry once           then fail
  Persistent failure  skip + flag          log for human review

async function reliableStep<T>(
  prompt: string,
  validate: (response: string) => T,
  maxRetries = 3
): Promise<T> {
  let lastError: Error | null = null;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await callLLM(prompt);
      return validate(response);
    } catch (err) {
      lastError = err as Error;
      console.log(
        `[retry] Attempt ${attempt + 1} failed: ${lastError.message}`
      );

      // Add a correction hint for the next attempt
      if (lastError.message.includes("Invalid JSON")) {
        prompt += "\n\nIMPORTANT: Respond ONLY with valid JSON. No other text.";
      }

      // Exponential backoff before the next attempt (skip it after the last)
      if (attempt < maxRetries - 1) {
        await sleep(Math.pow(2, attempt) * 1000);
      }
    }
  }

  throw lastError;
}

Logging for debugging

Every step, every input, every output. When a pipeline produces wrong results at step 4, you need to trace back through steps 1-3 to find where the reasoning went wrong.

interface StepLog {
  step: string;
  input: string;
  output: string;
  tokens: { input: number; output: number };
  latencyMs: number;
  attempt: number;
  timestamp: string;
}

const logs: StepLog[] = [];

async function loggedStep(name: string, prompt: string): Promise<string> {
  const start = Date.now();
  const result = await callLLM(prompt);
  logs.push({
    step: name,
    input: prompt.slice(0, 500),
    output: result.slice(0, 500),
    tokens: { input: 0, output: 0 }, // populate from API response
    latencyMs: Date.now() - start,
    attempt: 1,
    timestamp: new Date().toISOString(),
  });
  return result;
}
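
When step 4 looks wrong, walk the trail backwards. A small sketch against the logs array above:

for (const entry of [...logs].reverse()) {
  console.log(`${entry.step} (attempt ${entry.attempt}, ${entry.latencyMs}ms)`);
  console.log(`  in:  ${entry.input}`);
  console.log(`  out: ${entry.output}`);
}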

Performance and scaling

Latency optimization

The fastest pipeline is the one that runs the fewest sequential steps. Every sequential step adds latency. Every parallel step does not.

  4 sequential steps:    total = s1 + s2 + s3 + s4
  2 sequential + 2 parallel:  total = s1 + max(s2, s3) + s4

  Example with 2s per step:
  Sequential:  2 + 2 + 2 + 2 = 8s
  Hybrid:      2 + 2 + 2     = 6s  (s2 and s3 run in parallel)

Identify which steps depend on each other and which are independent. Run independent steps in parallel. Keep sequential steps only where the output of one is the input of the next.
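
The hybrid shape above, sketched with illustrative step functions:

// step2 and step3 only depend on step1's output, so they run concurrently.
const r1 = await step1(input);
const [r2, r3] = await Promise.all([step2(r1), step3(r1)]);
const result = await step4({ ...r2, ...r3 });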

Token budgeting

Every step consumes tokens. Track them.

interface TokenBudget {
  maxInputTokens: number;
  maxOutputTokens: number;
  used: { input: number; output: number };
}

function checkBudget(budget: TokenBudget, stepInput: string): boolean {
  const estimatedTokens = Math.ceil(stepInput.length / 4); // rough estimate
  return budget.used.input + estimatedTokens <= budget.maxInputTokens;
}
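
A usage sketch: check before the call, record after it. In a real system the counts would come from the API response rather than a length estimate.

if (!checkBudget(budget, prompt)) {
  throw new Error("Token budget exceeded for this run");
}
const response = await callLLM(prompt);
budget.used.input += Math.ceil(prompt.length / 4);
budget.used.output += Math.ceil(response.length / 4);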

If a 5-step pipeline uses 10,000 input tokens per step, that is 50,000 tokens per run. At 100 runs per day, that is 5 million tokens. Know your numbers before you scale.

Caching

Same input should produce the same output without another API call.

import { createHash } from "node:crypto";

const cache = new Map<string, string>();

async function cachedLLM(prompt: string): Promise<string> {
  const key = createHash("sha256").update(prompt).digest("hex");

  if (cache.has(key)) {
    console.log("[cache] hit");
    return cache.get(key)!;
  }

  const result = await callLLM(prompt);
  cache.set(key, result);
  return result;
}

Caching works best for deterministic steps (extraction, classification, formatting) where temperature is 0 and the same input always needs the same output. It does not work for generative steps where you want variety.

Real-world pipelines

Content generation

  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
  │ Research │──▶│ Outline  │──▶│  Draft   │──▶│  Edit    │──▶│  Format  │
  │          │   │          │   │          │   │          │   │          │
  │ gather   │   │ structure│   │ write    │   │ refine   │   │ markdown │
  │ sources  │   │ sections │   │ sections │   │ tone +   │   │ + meta   │
  │          │   │          │   │ per plan │   │ accuracy │   │          │
  └──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘

The research step gathers sources. The outline step structures them. The draft step writes each section independently (parallel). The edit step checks tone, accuracy, and transitions. The format step produces the final output. Each step has a focused prompt and a defined output schema.
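
The draft step is where the parallelism pays off. A sketch, assuming the outline step produced section titles and bullet points; the shape of outline here is illustrative.

const sections = await Promise.all(
  outline.sections.map((section: { title: string; points: string[] }) =>
    callLLM(
      `Write the "${section.title}" section of the article. Cover these points:\n` +
        section.points.map((p) => `- ${p}`).join("\n")
    )
  )
);
const draft = sections.join("\n\n");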

Code generation

  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
  │   Spec   │──▶│ Scaffold │──▶│Implement │──▶│  Test    │
  │          │   │          │   │          │   │ + review │
  │ parse    │   │ generate │   │ fill in  │   │          │
  │ require- │   │ file     │   │ function │   │ generate │
  │ ments    │   │ structure│   │ bodies   │   │ tests    │
  └──────────┘   └──────────┘   └──────────┘   └──────────┘

The spec step parses natural language requirements into a structured task definition. The scaffold step generates the file and function signatures. The implement step fills in the function bodies one at a time (each function is an independent subtask, parallelizable). The test step generates and runs tests against the implementation.

Data extraction

  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
  │  Ingest  │──▶│ Extract  │──▶│ Validate │──▶│Normalize │──▶│  Store   │
  │          │   │          │   │          │   │          │   │          │
  │ parse    │   │ pull out │   │ schema   │   │ standard │   │ write to │
  │ document │   │ fields   │   │ check    │   │ format   │   │ database │
  └──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘

Ingest handles file format (PDF, HTML, email). Extract pulls out structured data. Validate checks against a schema and flags anomalies. Normalize converts to a standard format (dates, currencies, units). Store writes to the destination. The validate step is where hallucinated data gets caught before it enters your database.

Anti-patterns

Chaining when a single prompt works

If one prompt reliably produces the right output, adding a chain just adds latency and cost. A two-step pipeline costs twice as much as a single call. Only chain when a single prompt produces inconsistent or low-quality results.

Over-engineering the pipeline

Five steps where two would suffice. Every step adds a failure point, a retry budget, a validation check, and token cost. Start with the fewest steps that produce reliable output. Add steps only when you can point to a specific quality problem that the extra step solves.

Not testing steps in isolation

Every step in a pipeline should be independently testable. If you can only test the pipeline end-to-end, debugging becomes painful because a failure at step 4 might be caused by a subtle issue at step 2.

// Each step has its own test
test("extract step finds entities", async () => {
  const input = "Apple Inc. was founded in 1976.";
  const result = await extractStep(input);
  expect(result.entities).toContainEqual({
    name: "Apple Inc.",
    type: "company",
  });
});

Ignoring the merge step

Parallel pipelines need a merge step that combines results. Without it, you get a list of independent outputs that contradict each other or repeat findings. The merge step is its own prompt, and it needs its own quality criteria.

The principle

Prompt workflows are software engineering applied to LLM calls. The same principles that make code reliable make prompt pipelines reliable: typed interfaces between components, validation at boundaries, retry logic for transient failures, logging for debugging, and tests for each unit.

Start with two steps. Get the data contract right between them. Add validation. Add retries. Get that working reliably. Then add a third step.

The goal is not a complex pipeline. The goal is reliable output. If a two-step chain produces reliable output, stop there. Complexity is cost.