Prompt Management: Prompts as Code
Why prompts do not belong in code
In a prototype, the prompt is a string in code. In production, the prompt changes an order of magnitude more often than the surrounding logic, yet every change requires a code review, a build, and a deploy. That is unnecessary friction. Prompts need their own lifecycle — versioning, testing, rollback — independent of application code.
Prompt management has three maturity levels. Level 1: prompts in code as template strings. Level 2: prompts in dedicated files with variables and versioning via git. Level 3: prompt registry — a central service that serves prompts via API, enables A/B testing, and tracks quality metrics per version.
Prompts as templates with variables
Every production prompt is a template. It contains a static part (instructions, format, constraints) and dynamic variables (user input, context, few-shot examples). Separate them explicitly. Use Mustache, Handlebars, or plain string interpolation — but have a clear contract for what variables the prompt expects.
# prompts/summarize.v2.yaml
name: summarize
version: 2
variables: [document, max_length, language]
model: claude-sonnet-4-20250514
template: |
  Summarize the following document in {{language}}.
  Maximum length: {{max_length}} words.
  Focus on actionable insights, skip background.

  Document:
  {{document}}

  Output format: bullet points, each starting with a verb.
function renderPrompt(
  template: string,
  vars: Record<string, string>
): string {
  return template.replace(
    /\{\{(\w+)\}\}/g,
    (_, key: string) => {
      const value = vars[key];
      // Fail loudly on a missing variable — a silently empty slot
      // produces a subtly broken prompt that is hard to debug.
      if (value === undefined) {
        throw new Error(`Missing prompt variable: ${key}`);
      }
      return value;
    }
  );
}
Versioning and rollback
Every prompt version should be immutable. Version 2 never overwrites version 1 — both exist in parallel. A new version is first deployed to canary traffic (5-10% of requests), quality is compared against the baseline, and only then does it become the default. If the new version degrades, rollback is a single configuration switch.
Simplest implementation: prompts in YAML files in a git repository, with version in the filename (summarize.v1.yaml, summarize.v2.yaml). Configuration determines which version is active. More complex setup: prompt registry as a microservice with REST API, version database, and deployment webhooks.
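The "configuration switch" from the paragraph above can be sketched as a small loader. This is a minimal sketch: the `prompts/` directory layout and the config shape are assumptions, and JSON files stand in for YAML to avoid a parser dependency.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Routing config: which version serves most traffic, plus an optional
// canary candidate. In practice this lives outside the code bundle
// (its own file or a flag service), so changing it needs no deploy.
interface PromptConfig {
  active: number;
  canary?: number;
  canaryPercent?: number; // share of requests (0-100) sent to the canary
}

const config: Record<string, PromptConfig> = {
  summarize: { active: 1, canary: 2, canaryPercent: 10 },
};

// Load one immutable prompt version from disk, e.g. prompts/summarize.v1.json
function loadPromptFile(dir: string, name: string, version: number): unknown {
  const file = path.join(dir, `${name}.v${version}.json`);
  return JSON.parse(fs.readFileSync(file, 'utf8'));
}

// Decide which version one request gets; `bucket` is a 0-99 value,
// e.g. derived from a deterministic user hash.
function pickVersion(name: string, bucket: number): number {
  const cfg = config[name];
  if (!cfg) throw new Error(`No config for prompt '${name}'`);
  if (cfg.canary !== undefined && bucket < (cfg.canaryPercent ?? 0)) {
    return cfg.canary;
  }
  return cfg.active;
}
```

Rollback here is exactly one edit: remove the `canary` entry, or point `active` back at the previous version.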
// Simple file-based prompt registry
interface PromptVersion {
name: string;
version: number;
template: string;
model: string;
variables: string[];
metadata: { author: string; changedAt: string; reason: string };
}
class PromptRegistry {
  private prompts = new Map<string, PromptVersion[]>();
  // Which version is currently served — rollback is one change here.
  private active = new Map<string, number>();

  getActive(name: string): PromptVersion {
    const versions = this.prompts.get(name);
    if (!versions?.length) throw new Error(`Prompt '${name}' not found`);
    const activeVersion = this.active.get(name);
    if (activeVersion !== undefined) {
      return this.getVersion(name, activeVersion);
    }
    return versions[versions.length - 1]; // default: latest version
  }

  getVersion(name: string, version: number): PromptVersion {
    const v = this.prompts.get(name)?.find(p => p.version === version);
    if (!v) throw new Error(`Prompt '${name}' v${version} not found`);
    return v;
  }

  setActive(name: string, version: number): void {
    this.getVersion(name, version); // throws if the version does not exist
    this.active.set(name, version);
  }
}
A/B testing prompts
Changing a single word in a prompt can shift output quality by double-digit percentages. Without A/B testing, you will not know. Basic setup: randomly assign each request to a variant (A = current version, B = candidate), log the variant alongside the output, and after a sufficient sample compare metrics.
Metrics depend on the use case: for summarization measure length, key point coverage, and user rating. For classification measure accuracy and F1 score. For code generation measure whether the code compiles and passes tests. Always have at least one automated metric and one human metric.
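Comparison after the test run can be a simple aggregation. A minimal sketch, assuming each request was logged with its variant and one automated metric score:

```typescript
// One logged A/B observation: the assigned variant plus an automated
// metric score (e.g. accuracy 0/1, or a normalized quality rating).
interface Observation {
  variant: string;
  score: number;
}

// Average the automated metric per variant once the sample is large enough.
function compareVariants(log: Observation[]): Record<string, number> {
  const sums = new Map<string, { total: number; n: number }>();
  for (const { variant, score } of log) {
    const s = sums.get(variant) ?? { total: 0, n: 0 };
    s.total += score;
    s.n += 1;
    sums.set(variant, s);
  }
  const means: Record<string, number> = {};
  for (const [variant, { total, n }] of sums) {
    means[variant] = total / n;
  }
  return means;
}
```

A mean per variant is only the starting point; before promoting a winner you would also check the sample size and the human-rated metric.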
// A/B test assignment
import { createHash } from 'node:crypto';

function assignVariant(
  userId: string,
  testName: string,
  variants: string[],
  weights?: number[] // percentages, must sum to 100
): string {
  // Deterministic assignment based on user+test hash
  const hash = createHash('sha256')
    .update(`${userId}:${testName}`)
    .digest();
  const bucket = hash.readUInt32BE(0) % 100;
  const w = weights ?? variants.map(() => 100 / variants.length);
  let cumulative = 0;
  for (let i = 0; i < variants.length; i++) {
    cumulative += w[i];
    if (bucket < cumulative) return variants[i];
  }
  return variants[variants.length - 1];
}
Prompt composition and chain-of-thought
Complex tasks require prompt composition — breaking the problem into steps where the output of one prompt is the input to the next. Classic pattern: first prompt extracts relevant information from a document, second analyzes it, third generates the output. Each step has its own prompt with its own versioning.
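The three-step pattern above can be sketched as a generic composition. The step bodies here are hypothetical stand-ins; in a real pipeline each would render its own versioned prompt and call the model:

```typescript
// One pipeline step: takes the previous step's output, returns its own.
type Step = (input: string) => Promise<string>;

// Compose steps left to right: the output of one feeds the next.
function pipeline(...steps: Step[]): Step {
  return async (input) => {
    let current = input;
    for (const step of steps) {
      current = await step(current);
    }
    return current;
  };
}

// Hypothetical stand-ins for model calls — in reality each would load
// its own versioned prompt (e.g. extract.v1, analyze.v2, report.v1).
const extract: Step = async (doc) => `facts(${doc})`;
const analyze: Step = async (facts) => `conclusions(${facts})`;
const report: Step = async (conclusions) => `report(${conclusions})`;

const analyzeDocument = pipeline(extract, analyze, report);
```

Because every step is an independent function, each step's prompt can be versioned, A/B tested, and rolled back on its own.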
Chain-of-thought (CoT) in production is not just 'think step by step'. It is a structured format where the model first generates reasoning (which you log but do not display to the user), then the final answer. CoT can improve quality on analytical tasks, with gains in the 10-30% range commonly reported, but it increases token count and latency.
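In code, the structured CoT format is a parsing contract between the prompt and your application. A minimal sketch (the tag names are an assumption; any unambiguous delimiter the model is instructed to use works):

```typescript
// Instruction appended to the prompt: wrap reasoning and answer in tags
// so the application can separate them reliably.
const cotInstruction = `
First write your reasoning inside <reasoning>...</reasoning>,
then the final answer inside <answer>...</answer>.
`;

// Split a model response into the logged reasoning and the visible answer.
function parseCot(output: string): { reasoning: string; answer: string } {
  const reasoning =
    /<reasoning>([\s\S]*?)<\/reasoning>/.exec(output)?.[1]?.trim() ?? '';
  const answer =
    /<answer>([\s\S]*?)<\/answer>/.exec(output)?.[1]?.trim() ?? '';
  return { reasoning, answer }; // log reasoning, show only answer to the user
}
```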
A prompt is a production artifact like code — it needs versioning, review, testing, and rollback. Treat it that way from day one.
Log the prompt version, model, and input variable hash with every API call. When you debug a bad output, you will know exactly what the model received.
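A minimal sketch of such a log entry; the field names are assumptions for illustration:

```typescript
import { createHash } from 'node:crypto';

// One record per model call: enough to reconstruct exactly what
// the model received when you debug a bad output later.
interface CallLogEntry {
  promptName: string;
  version: number;
  model: string;
  varsHash: string; // fingerprint instead of raw vars, in case inputs are sensitive
  at: string;
}

function logEntry(
  promptName: string,
  version: number,
  model: string,
  vars: Record<string, string>
): CallLogEntry {
  // Short, deterministic fingerprint of the input variables
  // (assumes a stable key order in the vars object).
  const varsHash = createHash('sha256')
    .update(JSON.stringify(vars))
    .digest('hex')
    .slice(0, 12);
  return { promptName, version, model, varsHash, at: new Date().toISOString() };
}
```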
Add a 'reason' field to prompt metadata — why this version was created. In a month you won't know why you changed the wording if you don't write it down now.
Build a simple prompt registry: 1) Store two versions of one prompt in YAML files. 2) Write a function that deterministically assigns a variant based on user ID (50/50 split). 3) Log the variant, prompt version, and (simulated) model output to the console. 4) After 100 simulated requests count how many went to each variant.
Hint
Deterministic assignment via hash ensures the same user always gets the same variant — critical for consistent UX and valid A/B testing.
Implement a simple prompt versioning system for your project. Requirements: 1) Prompts are stored in files (not hardcoded), 2) Each prompt has a version, 3) Prompt changes are tracked in git, 4) A/B testing two prompt versions is easy. You don't need a specialized tool — a folder with YAML/JSON files and a simple loader works.
Hint
Document your process and results — they'll serve as reference for similar future tasks.
Implement a 3-prompt pipeline for document analysis: 1) Prompt 1 extracts key information from text (entities, facts, numbers). 2) Prompt 2 analyzes extracted information and draws conclusions. 3) Prompt 3 generates a structured report. Each prompt has its own version and model configuration. Measure quality: does the pipeline or a single large prompt give better results?
Hint
Pipelines work better for complex tasks where each step requires different reasoning. For simple tasks the overhead is unnecessary.
- Prompts change faster than code — they need their own lifecycle
- Every prompt version is immutable, rollback is a configuration switch
- A/B testing reveals quality differences you miss during manual testing
- Log prompt version and model with every call — without it debugging is impossible
In the next lesson, we dive into Evaluating AI Outputs: Measuring Quality.