What Generative AI Actually Is (Engineer’s Version)
A large language model is a function that maps a sequence of tokens to a probability distribution over the next token. It does this billions of times, producing text that looks coherent because it was trained to match the statistical patterns in human-written text at scale. That’s it. No understanding. No reasoning. No intent. Sophisticated pattern matching at a scale that produces outputs that often look like understanding.

Why does this matter? Because it sets the right expectations:

- The model doesn’t know what’s true. It knows what patterns of tokens tend to follow other patterns.
- Confidence in the output doesn’t track accuracy. A model can be confidently wrong.
- The model has no goal beyond “produce plausible next tokens.” Alignment, safety, and helpfulness are properties we instill through training — they’re not inherent.
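That “probability distribution over the next token” is not a metaphor; it is literally what a forward pass returns. A toy sketch of the sampling step (three made-up tokens and hand-picked logits, not a real model):

```python
import math
import random

def softmax(logits):
    """Turn raw scores (logits) into a probability distribution."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample one token; higher temperature flattens the distribution."""
    probs = softmax({tok: s / temperature for tok, s in logits.items()})
    r, cumulative = rng.random(), 0.0
    for tok, p in probs.items():
        cumulative += p
        if r <= cumulative:
            return tok
    return tok  # guard against floating-point rounding at the tail

# Hypothetical logits a model might emit after the context "The sky is"
logits = {"blue": 3.2, "clear": 1.1, "falling": -0.5}
probs = softmax(logits)
```

Generation is just this step in a loop: append the sampled token to the context and ask for the next distribution. Nothing in the loop checks truth; it only checks plausibility.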
The Three Use Cases That Actually Work in Production
1. Text Transformation
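To make the “extract structured data from unstructured text” case concrete, here is the shape I’d give such a step: ask for JSON, then validate before trusting it. The model call is stubbed; `call_model`, the field names, and the prompt wording are all illustrative.

```python
import json

def extract_invoice_fields(raw_text, call_model):
    """Ask the model for structured JSON, then validate before trusting it."""
    prompt = (
        "Extract the vendor name and total amount from the text below. "
        'Respond with JSON only, e.g. {"vendor": "...", "total": 0.0}.\n\n' + raw_text
    )
    reply = call_model(prompt)
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None  # malformed output: fall back instead of guessing
    if not isinstance(data.get("vendor"), str):
        return None
    if not isinstance(data.get("total"), (int, float)):
        return None
    return data

# Stub standing in for a real LLM API call
fake_model = lambda prompt: '{"vendor": "Acme Ltd", "total": 249.99}'
result = extract_invoice_fields("Invoice from Acme Ltd, total 249.99", fake_model)
```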
Take text in one form and produce it in another: summarise, classify, translate, extract, reformat, expand. This works reliably because it plays directly to the model’s core strength: pattern matching. The model has seen millions of examples of “here’s a long document, here’s its summary” and can replicate that transformation.

Where I use it: Generating first drafts, summarising long documents for display, extracting structured data from unstructured text, translating between technical and plain language.

Where it breaks: Domain-specific text where precision matters and common patterns are misleading. Medical, legal, and financial content where a plausible-but-wrong answer is worse than no answer.

2. Generation Against a Spec
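A concrete picture of what a clear spec buys you. The docstring below is a hypothetical spec of the kind you would hand to a coding model (typed signature plus examples), and the body is the sort of implementation that comes back when nothing is left ambiguous:

```python
import re

def slugify(title: str) -> str:
    """Spec: lowercase; collapse runs of non-alphanumeric characters into
    single hyphens; strip leading and trailing hyphens.

    Examples: "Hello, World!" -> "hello-world"; "  API v2  " -> "api-v2"
    """
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
```

Remove the examples and the edge-case rules from that docstring and the model will still produce *something*, filling the gaps with whatever pattern is most common in its training data rather than asking what you meant.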
Given a clear specification — a schema, a set of requirements, a pattern to follow — generate an implementation. This is the core of AI coding tools. The model has seen enough code that “implement this function given this type signature” works well when the spec is clear. It works poorly when the spec is ambiguous, because the model fills ambiguity with patterns instead of asking for clarification.

Where I use it: Scaffolding features, generating boilerplate, writing tests from descriptions, implementing well-specified functions.

Where it breaks: Novel architecture patterns, highly domain-specific logic, security-sensitive code that needs adversarial thinking.

3. Conversational Reasoning
Walking through a problem, generating hypotheses, explaining tradeoffs. The “rubber duck that talks back” use case. This works because the model has seen enough problem-solving discussions to replicate the structure of reasoning, even when it doesn’t have the answer. Asking Claude to think through a complex decision with you often surfaces considerations you’d have missed, not because it knows the right answer, but because it prompts you to consider dimensions you’d glossed over.

Where I use it: Architecture discussions, debugging hypotheses, writing docs that explain tradeoffs, stress-testing plans.

Where it breaks: Anywhere you need ground truth rather than plausible reasoning. The model will construct a coherent argument for a wrong answer.

The Failure Modes Every Engineer Should Know
Hallucination
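The standard counter to hallucination is grounding: hand the model the facts in context and give it explicit permission to abstain, instead of asking it to recall. A sketch of that prompt-building step, with illustrative wording and names:

```python
def build_grounded_prompt(question, retrieved_docs):
    """Put retrieved facts in context and allow the model to abstain."""
    context = "\n\n".join(f"[doc {i}] {d}" for i, d in enumerate(retrieved_docs, 1))
    return (
        "Answer using ONLY the documents below. If they do not contain "
        "the answer, reply exactly: INSUFFICIENT CONTEXT.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "What is the default request timeout?",
    ["Clients retry failed requests twice.",
     "The default request timeout is 30 seconds."],
)
```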
The model generates false information with high confidence. This is not a bug — it’s the system working as designed: it produces plausible text, and sometimes the plausible text happens to be wrong.

In practice: API endpoints with wrong signatures. Historical facts that are slightly off. Technical procedures that almost work.

Mitigation: Grounding. Give the model the correct information in context (RAG) rather than asking it to recall from training. Instruct it to abstain when uncertain.

Context Drift
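One way to blunt drift in an application you control: rebuild the prompt each turn from recent history and restate the hard constraints last, where attention is strongest. The turn limit and reminder format below are arbitrary illustrative choices, not a recipe:

```python
def assemble_prompt(history, user_message, critical_constraints, max_turns=10):
    """Keep only recent turns and restate hard constraints at the very end."""
    recent = history[-max_turns:]  # stale turns get diluted attention anyway
    lines = [f"{role}: {text}" for role, text in recent]
    lines.append(f"user: {user_message}")
    lines.append("Reminder, must follow: " + "; ".join(critical_constraints))
    return "\n".join(lines)

prompt = assemble_prompt(
    history=[("user", "Refactor the parser."), ("assistant", "Done.")],
    user_message="Now add tests.",
    critical_constraints=["British spelling", "no new dependencies"],
)
```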
In long conversations or large codebases, the model’s attention distributes unevenly. Earlier context gets attended to less, so the model can “forget” constraints established at the beginning of a session.

In practice: Claude Code implementing something in the wrong style after a long session. A chatbot ignoring system prompt constraints when the conversation runs long.

Mitigation: Restart conversations for new tasks. Put critical constraints at the end of prompts, not the beginning. Use CLAUDE.md to maintain persistent context that gets reloaded every session.

Over-Engineering
Models trained on code have seen a lot of well-architected code. They replicate architectural patterns even when simpler solutions would work.

In practice: A factory pattern where a simple function would do. An abstract base class for a one-off implementation.

Mitigation: Explicit constraints in CLAUDE.md. “Do not add abstractions beyond what’s necessary for the current use case.” Code review that specifically flags unnecessary complexity.

Sycophancy
Models are trained partly on human feedback, and humans tend to give positive feedback to confident, agreeable responses. This creates a systematic bias toward telling you what you want to hear.

In practice: “Is this approach good?” gets a “yes” with a supportive explanation. “What are the problems with this approach?” gets a very different response.

Mitigation: Ask adversarial questions. “What could go wrong with this approach?” Red-team your own plans.

My Architecture for Generative AI Features
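As a concrete (and heavily simplified) sketch: a summarisation feature built as a pipeline of plain functions, with the model call stubbed out. Every name and threshold here is illustrative.

```python
def validate_input(text: str) -> bool:
    """Reject empty or oversized inputs before spending an LLM call."""
    return bool(text.strip()) and len(text) < 10_000

def build_prompt(text: str) -> str:
    return f"Summarise the following in one sentence:\n\n{text}"

def validate_output(summary: str) -> bool:
    """Sanity-check the model's output before showing it to a user."""
    return bool(summary.strip()) and len(summary) < 500

def summarise(text, call_model, fallback="Summary unavailable."):
    """The feature: each step is testable alone and has a defined failure path."""
    if not validate_input(text):
        return fallback
    summary = call_model(build_prompt(text))
    return summary if validate_output(summary) else fallback

result = summarise("A long incident report...", lambda p: "One-sentence summary.")
```

Note that `validate_input`, `build_prompt`, and `validate_output` can each be unit-tested without ever touching a model, and a bad model response degrades to a fallback string rather than an incident.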
After shipping AI features across multiple products, I’ve converged on a consistent architecture: a pipeline of small, explicit steps around the model call. Each step is independently testable. Each step has defined failure modes and fallback behaviour. The LLM is one component in a pipeline, not the entire system.

The most common mistake: treating the LLM call as the whole feature. When the LLM call is the whole feature, every failure mode becomes a production incident.

Picking the Right Model
I use different models for different tasks. The “best model” question doesn’t have a universal answer.

| Task | My default | Why |
|---|---|---|
| Complex reasoning, multi-step | Claude Opus | Strongest sustained reasoning |
| Coding, multi-file tasks | Claude Sonnet | Fast, codebase-aware, reliable tool use |
| Quick classification, simple extraction | Haiku or GPT-4o-mini | Speed and cost efficiency |
| Multilingual, international content | GPT-4o | Broader training data coverage |
| Structured output with strict schemas | Claude Sonnet | Reliable JSON schema following |
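In code, that table often ends up as nothing fancier than a lookup with a cheap default. The task labels and model names below mirror the table loosely and are illustrative, not pinned API model identifiers:

```python
# Defaults mirroring the table above; names are illustrative, not exact model IDs.
MODEL_BY_TASK = {
    "complex_reasoning": "claude-opus",
    "coding": "claude-sonnet",
    "classification": "haiku",
    "multilingual": "gpt-4o",
    "structured_output": "claude-sonnet",
}

def pick_model(task: str) -> str:
    """Route a task type to a default model; unknown tasks get the cheap option."""
    return MODEL_BY_TASK.get(task, "haiku")
```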
The Evaluation Imperative
You cannot ship a generative AI feature responsibly without an evaluation system. This is the step I see skipped most often and regretted most severely.

Minimum viable eval:

- 50 representative inputs covering happy path, edge cases, and adversarial inputs
- Manual labels for expected outputs
- A metric that captures what “good” means for your use case
- A script you can run before shipping any prompt change
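The whole minimum viable eval fits in a page of code. A sketch with an exact-match metric and the model call stubbed out; real cases would live in a file, and the metric would be whatever captures “good” for your use case:

```python
def run_eval(cases, generate, score):
    """Score every case; returns the pass rate. Run before shipping any prompt change."""
    results = [score(generate(case["input"]), case["expected"]) for case in cases]
    return sum(results) / len(results)

# Two toy cases standing in for ~50 labelled ones
cases = [
    {"input": "2+2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

exact_match = lambda output, expected: float(output.strip() == expected)
stub_model = lambda text: {"2+2?": "4", "Capital of France?": "Paris"}[text]

pass_rate = run_eval(cases, generate=stub_model, score=exact_match)
```

Wire this into CI so a prompt change that drops the pass rate fails the build, the same way a code change that breaks tests does.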
