There’s a gap between how generative AI is explained and how it actually behaves in production. The explanations are either too abstract (“it predicts the next token”) or too breathless (“it understands language”). Neither helps you build reliable systems. This is my working mental model — built from shipping real products with generative AI, not from reading papers.

What Generative AI Actually Is (Engineer’s Version)

A large language model is a function that maps a sequence of tokens to a probability distribution over the next token. It does this billions of times, producing text that looks coherent because it was trained to match the statistical patterns in human-written text at scale. That’s it. No understanding. No reasoning. No intent. Sophisticated pattern matching at a scale that produces outputs that often look like understanding. Why does this matter? Because it sets the right expectations:
  • The model doesn’t know what’s true. It knows what patterns of tokens tend to follow other patterns.
  • Confidence in the output doesn’t track accuracy. A model can be confidently wrong.
  • The model has no goal beyond “produce plausible next tokens.” Alignment, safety, helpfulness are properties we instill through training — they’re not inherent.
With that foundation, you can reason about when generative AI will work and when it will fail.
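The mental model above can be made concrete with a toy sketch: a "model" is just a function from a token sequence to a probability distribution over the next token, sampled in a loop. Everything here (the five-word vocabulary, the fake scoring rule) is invented purely for illustration.

```python
import math
import random

# Hypothetical five-token vocabulary, invented for illustration.
VOCAB = ["the", "cat", "sat", "on", "mat"]

def next_token_distribution(tokens):
    """Map a token sequence to a probability distribution over VOCAB.

    A real LLM computes these scores with billions of parameters;
    here we fake them with character overlap against the last token.
    There is no understanding anywhere in this loop, only scores.
    """
    last = tokens[-1] if tokens else ""
    scores = [1.0 + len(set(last) & set(tok)) for tok in VOCAB]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(VOCAB, exps)}

def generate(prompt, n_tokens=5, seed=0):
    """Generation is just: predict a distribution, sample, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_tokens):
        dist = next_token_distribution(tokens)
        tokens.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return tokens
```

Nothing in this loop checks truth; the output looks coherent only because the scores were shaped by training data.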

The Three Use Cases That Actually Work in Production

1. Text Transformation

Take text in one form and produce it in another: summarise, classify, translate, extract, reformat, expand. This works reliably because it plays directly to the model’s strength: it has seen millions of examples of “here’s a long document, here’s its summary” and can replicate that transformation.

Where I use it: Generating first drafts, summarising long documents for display, extracting structured data from unstructured text, translating between technical and plain language.

Where it breaks: Domain-specific text where precision matters and common patterns are misleading. Medical, legal, financial content where a plausible-but-wrong answer is worse than no answer.
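As a sketch of the extraction case: build a prompt that pins the output to a fixed JSON shape and explicitly allows nulls, so the model is less tempted to invent values for missing fields. The field names and template wording here are my own, not a standard.

```python
import json

def extraction_prompt(text, fields):
    """Build a structured-extraction prompt for a list of field names.

    The null instruction matters: without it, models tend to fill
    missing fields with plausible guesses instead of admitting absence.
    """
    schema = {field: "string or null" for field in fields}
    return (
        "Extract these fields from the text below and return only JSON "
        f"matching this shape: {json.dumps(schema)}\n"
        "Use null for any field the text does not mention.\n\n"
        f"Text:\n{text}"
    )
```

The same pattern covers classification and reformatting: constrain the output shape, and give the model an explicit way to say "not present".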

2. Generation Against a Spec

Given a clear specification — a schema, a set of requirements, a pattern to follow — generate an implementation. This is the core of AI coding tools. The model has seen enough code that “implement this function given this type signature” works well when the spec is clear. It works poorly when the spec is ambiguous, because the model fills ambiguity with patterns instead of asking for clarification.

Where I use it: Scaffolding features, generating boilerplate, writing tests from descriptions, implementing well-specified functions.

Where it breaks: Novel architecture patterns, highly domain-specific logic, security-sensitive code that needs adversarial thinking.

3. Conversational Reasoning

Walking through a problem, generating hypotheses, explaining tradeoffs. The “rubber duck that talks back” use case. This works because the model has seen enough problem-solving discussions to replicate the structure of reasoning, even when it doesn’t have the answer. Asking Claude to think through a complex decision with you often surfaces considerations you’d have missed, not because it knows the right answer, but because it prompts you to consider dimensions you’d glossed over.

Where I use it: Architecture discussions, debugging hypotheses, writing docs that explain tradeoffs, stress-testing plans.

Where it breaks: Anywhere you need ground truth rather than plausible reasoning. The model will construct a coherent argument for a wrong answer.

The Failure Modes Every Engineer Should Know

Hallucination

The model generates false information with high confidence. This is not a bug; it is the system working as designed, producing plausible text. The plausible text just happens to be wrong.

In practice: API endpoints with wrong signatures. Historical facts that are slightly off. Technical procedures that almost work.

Mitigation: Grounding. Give the model the correct information in context (RAG) rather than asking it to recall from training. Instruct it to abstain when uncertain.
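A minimal sketch of the grounding mitigation, with naive keyword-overlap retrieval standing in for a real vector search; the abstention wording is my own:

```python
def grounded_prompt(question, documents, top_k=2):
    """Assemble a RAG-style prompt: retrieve context, then instruct abstention.

    Retrieval here is keyword overlap, a toy stand-in for embedding
    search. The abstention instruction is what curbs hallucination:
    the model is told to prefer "I don't know" over plausible recall.
    """
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    context = "\n\n".join(ranked[:top_k])
    return (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In production the ranking step would be an embedding index, but the prompt shape stays the same.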

Context Drift

In long conversations or large codebases, the model’s attention distributes unevenly. Earlier context gets attended to less. The model can “forget” constraints established at the beginning of a session.

In practice: Claude Code implementing something in the wrong style after a long session. A chatbot ignoring system prompt constraints when the conversation runs long.

Mitigation: Restart conversations for new tasks. Put critical constraints at the end of prompts, not the beginning. Use CLAUDE.md to maintain persistent context that gets reloaded every session.
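The "constraints at the end" mitigation can be sketched as a prompt assembler; the section labels are my own convention, not a standard:

```python
def assemble_prompt(task, history=None, constraints=None):
    """Order prompt sections so critical constraints come LAST.

    In long contexts, recent tokens tend to hold attention better,
    so burying constraints at the top makes them easier to "forget".
    """
    parts = []
    if history:
        parts.append("Conversation so far:\n" + "\n".join(history))
    parts.append("Task:\n" + task)
    if constraints:
        parts.append(
            "Critical constraints (follow exactly):\n"
            + "\n".join(f"- {c}" for c in constraints)
        )
    return "\n\n".join(parts)
```

The same ordering applies whether the sections come from a chat transcript or a CLAUDE.md file reloaded each session.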

Over-Engineering

Models trained on code have seen a lot of well-architected code. They replicate architectural patterns even when simpler solutions would work.

In practice: A factory pattern where a simple function would do. An abstract base class for a one-off implementation.

Mitigation: Explicit constraints in CLAUDE.md. “Do not add abstractions beyond what’s necessary for the current use case.” Code review that specifically flags unnecessary complexity.
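A hypothetical CLAUDE.md excerpt that encodes this constraint; the exact wording is mine, not a standard format:

```markdown
## Code style constraints

- Do not add abstractions beyond what the current use case needs.
- Prefer a plain function over a class unless state is genuinely shared.
- No new design patterns (factory, strategy, visitor) without asking first.
```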

Sycophancy

Models are trained partly on human feedback, and humans tend to give positive feedback to confident, agreeable responses. This creates a systematic bias toward telling you what you want to hear.

In practice: “Is this approach good?” gets a “yes” with a supportive explanation. “What are the problems with this approach?” gets a very different response.

Mitigation: Ask adversarial questions. “What could go wrong with this approach?” Red-team your own plans.

My Architecture for Generative AI Features

After shipping AI features across multiple products, I’ve converged on a consistent architecture:
  • Each step is independently testable.
  • Each step has defined failure modes and fallback behaviour.
  • The LLM is one component in a pipeline, not the entire system.
The most common mistake: treating the LLM call as the whole feature. When the LLM call is the whole feature, every failure mode becomes a production incident.
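As a sketch of that architecture, assuming a summarisation feature: each step returns an explicit result, and the LLM call is one replaceable stage with a fallback. `call_llm` here is a placeholder, not a real client.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class StepResult:
    ok: bool
    value: Any = None
    error: str = ""

def validate_input(text: str) -> StepResult:
    # Independently testable: no model needed to exercise this path.
    if not text or not text.strip():
        return StepResult(False, error="empty input")
    return StepResult(True, value=text.strip())

def call_llm(prompt: str) -> StepResult:
    # Placeholder for the real API call (with timeouts and retries).
    return StepResult(True, value=f"Summary: {prompt[:40]}")

def validate_output(summary: str) -> StepResult:
    # Cheap post-check: catch empty or runaway outputs before users do.
    if not summary or len(summary) > 2000:
        return StepResult(False, error="bad model output")
    return StepResult(True, value=summary)

def summarise(text: str, fallback: str = "Summary unavailable.") -> str:
    """The feature is the pipeline, not the LLM call."""
    step = validate_input(text)
    if not step.ok:
        return fallback
    llm = call_llm(step.value)
    if not llm.ok:
        return fallback
    checked = validate_output(llm.value)
    return checked.value if checked.ok else fallback
```

When the model misbehaves, the user sees the fallback string rather than a stack trace, and each stage can be unit-tested without network access.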

Picking the Right Model

I use different models for different tasks. The “best model” question doesn’t have a universal answer.
| Task | My default | Why |
| --- | --- | --- |
| Complex reasoning, multi-step | Claude Opus | Strongest sustained reasoning |
| Coding, multi-file tasks | Claude Sonnet | Fast, codebase-aware, reliable tool use |
| Quick classification, simple extraction | Haiku or GPT-4o-mini | Speed and cost efficiency |
| Multilingual, international content | GPT-4o | Broader training data coverage |
| Structured output with strict schemas | Claude Sonnet | Reliable JSON schema following |
The fastest way to pick: define 50 representative inputs, run all candidate models, and measure what matters — quality, speed, cost, schema compliance.

The Evaluation Imperative

You cannot ship a generative AI feature responsibly without an evaluation system. This is the step I see skipped most often and regretted most severely. Minimum viable eval:
  1. 50 representative inputs covering happy path, edge cases, and adversarial inputs
  2. Manual labels for expected outputs
  3. A metric that captures what “good” means for your use case
  4. A script you can run before shipping any prompt change
Prompts change. Models change. Input distributions shift. Without an eval system, you’re flying blind every time any of these change. Cost to build: one afternoon. Cost of not having it: weeks of debugging production failures and trust damage you can’t quickly repair.
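The minimum viable eval above fits in a few lines. Here is a sketch, with exact match standing in for whatever "good" means in your use case:

```python
def run_eval(cases, model_fn, metric):
    """Run model_fn over (input, expected) pairs and score each output.

    cases: list of (input, expected) tuples, i.e. your ~50 labelled examples.
    model_fn: wrapper around your LLM call.
    metric: scores (output, expected) to a float in [0, 1].
    """
    results = []
    for inp, expected in cases:
        output = model_fn(inp)
        results.append({
            "input": inp,
            "output": output,
            "score": metric(output, expected),
        })
    mean = sum(r["score"] for r in results) / len(results)
    return {"mean_score": mean, "n": len(results), "results": results}

def exact_match(output, expected):
    # Simplest possible metric; swap in similarity or a rubric as needed.
    return 1.0 if output.strip() == expected.strip() else 0.0
```

Run it before every prompt or model change, and fail the change when `mean_score` drops.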

What I’ve Learned About User Trust

Users form their trust in AI features quickly and update it slowly. One confidently wrong answer can break trust that took months of correct answers to build. Design principles that help:
  • Show sources. When the AI draws on information, cite it. Unverifiable answers feel less trustworthy even when correct.
  • Communicate uncertainty. Not every output deserves the same confidence. “I’m not sure about this…” is a more trustworthy response than always-confident answers.
  • Provide escape hatches. For every AI-assisted action, there should be a clear way to do it without AI. Users who don’t trust the AI today should still be able to work.
  • Make failures visible. A system that fails silently trains users to distrust it. A system that fails with a useful message trains users to understand its limits — which builds appropriate trust.

The Boring Truth About Generative AI in Production

AI features that work in production look a lot like regular software features. They have tests. They have monitoring. They have fallbacks. They have error states that users understand. They ship incrementally. The demo magic gets you to MVP. Engineering discipline gets you to production. They’re different skills and both are necessary. The most reliable AI features I’ve shipped weren’t the ones with the most impressive demos. They were the ones with the most carefully designed fallbacks.