I’ve integrated AI into a fintech scale-up, contributed to AI tooling at Atlassian, and built my own AI-native products. The gap between AI pilots and AI in production is larger than most business cases acknowledge. Here’s what I’ve learned.

Why Most AI Initiatives Fail

The failure pattern is consistent. A team sees a demo, gets excited, picks a use case, builds for six weeks, and then — nothing. The system works in testing. It doesn’t get used. The autopsy usually reveals the same findings:
Demo → Excitement → Build → "This doesn't fit how we work" → Quietly abandoned
The problem wasn’t the technology. It was the approach. AI gets treated as a product to build rather than a workflow to improve. The right question isn’t “what can AI do?” — it’s “where are we currently doing work that AI could compress?”

The Diagnostic: Finding High-Leverage Workflows

Before building anything, I run a workflow audit. The candidates that consistently work:
  • Repetitive decision support: high volume, consistent criteria. Examples: expense approval triage, support ticket routing.
  • Knowledge retrieval: too much to memorise, search is slow. Examples: policy Q&A, runbook lookup.
  • Draft generation: known output structure, a human reviews the final version. Examples: report drafts, email templates, code review summaries.
  • Data extraction: unstructured → structured at scale. Examples: contract clause extraction, invoice parsing.
  • Classification: fast, consistent labelling needed. Examples: sentiment, intent, topic tagging.
The workflows to avoid for first AI initiatives:
  • Anything where a wrong answer causes serious harm (medical, legal, financial decisions with no review)
  • Workflows with no measurable baseline (you can’t prove value if you don’t know the current state)
  • Workflows that vary so much by context that no prompt can cover them
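The audit criteria above can be sketched as a rough scoring function. This is a hypothetical helper, not a tool from the article: the field names, weights, and volume threshold are all illustrative assumptions, but the hard filters mirror the avoid-list directly.

```python
# Hypothetical workflow-audit scorer. Weights and thresholds are
# illustrative; the hard filters come from the avoid-list above.
from dataclasses import dataclass

@dataclass
class Workflow:
    name: str
    weekly_volume: int         # how often the task runs
    has_baseline: bool         # do we know the current time/error cost?
    consistent_criteria: bool  # do the same rules apply across instances?
    harm_if_wrong: bool        # serious consequences with no review step?

def audit_score(w: Workflow) -> float:
    """Return a rough 0-1 suitability score; 0 disqualifies outright."""
    if w.harm_if_wrong or not w.has_baseline:
        return 0.0  # hard filters: serious harm, or no measurable baseline
    score = min(w.weekly_volume / 500, 1.0) * 0.6  # volume drives leverage
    score += 0.4 if w.consistent_criteria else 0.1
    return round(min(score, 1.0), 2)

candidates = [
    Workflow("expense triage", 800, True, True, False),
    Workflow("legal opinion drafting", 20, False, False, True),
]
ranked = sorted(candidates, key=audit_score, reverse=True)
```

The point of the sketch is the shape, not the numbers: disqualify on harm and missing baselines first, then rank what survives by volume and consistency.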

The Thin Slice Method

Every AI initiative I’ve run successfully started with a thin slice: one workflow, one team, minimal scope.

At Weel, we picked expense approval triage as the thin slice: AI summarising the context of a flagged expense before a human reviewer sees it. Baseline: reviewers spent 4-6 minutes per item reconstructing context. After: 90 seconds. That’s the proof of value that gets you budget for the next initiative.

Why thin slices work: they’re low-risk, fast to ship, and produce real data. A controlled comparison between AI-assisted and non-AI-assisted teams tells you more than any benchmark.
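The proof-of-value comparison is simple enough to sketch directly. A minimal helper, assuming you log per-item review times before and after; the sample numbers echo the Weel figures but the function is generic.

```python
# Minimal thin-slice proof of value: compare mean per-item review time
# before and after AI assistance. Sample data is illustrative.
def time_saved_summary(baseline_secs: list[float],
                       assisted_secs: list[float]) -> dict:
    """Return mean per-item times and per-item saving, in seconds."""
    base = sum(baseline_secs) / len(baseline_secs)
    assisted = sum(assisted_secs) / len(assisted_secs)
    return {
        "baseline_mean_s": round(base, 1),
        "assisted_mean_s": round(assisted, 1),
        "saved_per_item_s": round(base - assisted, 1),
    }

summary = time_saved_summary(
    baseline_secs=[240, 300, 360],  # 4-6 minutes per item
    assisted_secs=[85, 90, 95],     # ~90 seconds per item
)
```

Multiply the per-item saving by weekly volume and you have the number that goes in the budget conversation.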

What “Production-Ready” Actually Means

Getting to production means more than the model working. It means the system working — including all the ways the model doesn’t.

Governance Before Scale

The questions you must answer before expanding any AI feature:
  • Data: What data does this model see? Can it see data it shouldn’t? Where does it go after the API call?
  • Audit: For every AI decision that affects a user or business outcome, can you reconstruct what happened? What was the input, what model, what prompt version, what output?
  • Review: Which outputs go to users directly? Which require human review first? This isn’t a philosophical question; it’s a threshold you set explicitly.
  • Cost: What’s the cost per request? At 10 users? At 10,000? Where do the unit economics break?
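The audit question translates into one concrete habit: log every call with enough detail to reconstruct it. A minimal sketch, assuming your own logging schema; the field names are illustrative, and hashing the input is one option when the raw text is sensitive.

```python
# One way to make AI decisions reconstructable: capture input, model,
# prompt version, and output for every call. Field names are assumptions.
import hashlib
import time

def audit_record(model: str, prompt_version: str,
                 user_input: str, output: str) -> dict:
    """Build an audit record; hash the input if it may contain PII."""
    return {
        "ts": time.time(),
        "model": model,
        "prompt_version": prompt_version,
        "input_sha256": hashlib.sha256(user_input.encode()).hexdigest(),
        "output": output,
    }

rec = audit_record(
    model="gpt-4o",
    prompt_version="triage-v3",
    user_input="flagged expense: taxi receipt, $84",
    output="Likely travel; route to reviewer",
)
# In production this would go to an append-only store, not a local dict.
```

With prompt versions in every record, "what prompt produced this?" becomes a query instead of an archaeology project.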

The Approval Gate Pattern

For any AI action that’s hard to reverse, I use an explicit human-in-the-loop gate:
# Actions that are hard to reverse always route through a human.
REQUIRES_APPROVAL = {
    "issue_refund",
    "send_customer_communication",
    "modify_subscription",
    "flag_account",
}

async def execute_ai_action(action: str, params: dict, reasoning: str):
    """Queue irreversible actions for human review; execute the rest."""
    if action in REQUIRES_APPROVAL:
        return await queue_for_human_review(
            action=action,
            params=params,
            ai_reasoning=reasoning,
        )
    return await execute_directly(action, params)
This pattern separates “AI recommends” from “AI acts.” The recommendation can be automated. The action stays with a human until the error rate proves it safe to automate.
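The "until the error rate proves it safe" clause can itself be made explicit. A hedged sketch of that promotion rule, with illustrative thresholds; the point is that both the minimum sample and the acceptable error rate are numbers you choose deliberately, not defaults.

```python
# Promotion rule implied above: an action leaves REQUIRES_APPROVAL only
# once enough human-reviewed outcomes show a low enough error rate.
# min_sample and max_error_rate are illustrative, not recommendations.
def safe_to_automate(reviewed: int, errors: int,
                     min_sample: int = 500,
                     max_error_rate: float = 0.01) -> bool:
    """True if the observed error rate over a sufficient sample is low."""
    if reviewed < min_sample:
        return False  # not enough evidence yet; keep the human in the loop
    return (errors / reviewed) <= max_error_rate
```

A small sample with zero errors still returns False: the rule demands evidence, not luck.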

The Change Management Problem

Technology is the easy part. Adoption is where AI initiatives die quietly. The resistance pattern: people either over-trust AI (stop checking outputs) or under-trust it (never use it). Neither is useful. What you want is calibrated trust: people using AI where it helps, reviewing where it matters, and ignoring it where it doesn’t fit.

What actually moves adoption:
  • Short feedback loops. Show people the before/after within their own workflow. “You used to spend 6 minutes on this. Here’s the same thing with the AI summary upfront.” Numbers make it real.
  • Friction logs. Ask users to log every time AI made their work worse, not just better. The friction log is your product backlog. Ignoring it kills adoption; addressing it builds trust.
  • Win stories. At Weel, we sent brief internal notes when AI saved a reviewer time. Not a newsletter, a Slack message: “This week: 340 approvals processed, 18 hours saved.” That compounds.
  • Escape hatches. Make it optional. Users who don’t trust the AI today should still be able to work. Forced AI adoption produces forced workarounds.

Measuring What Matters

The metrics I use for AI features in business context:
Primary metrics:
  - Time saved per task (requires baseline measurement before launch)
  - Error rate: AI-assisted vs. baseline
  - Escalation rate: % of AI outputs that require human correction

Secondary metrics:
  - Cost per request (track from day one, not after it becomes a problem)
  - Adoption rate (% of eligible tasks using AI assistance)
  - User satisfaction (quick thumbs up/down on AI outputs)

Watch carefully:
  - False confidence rate: how often AI is wrong AND confident
  - Population parity: does quality degrade for any user segment?
The metric most teams skip: false confidence rate. An AI that says “I’m not sure” when it’s wrong is manageable. An AI that’s confidently wrong erodes trust faster than you can rebuild it.
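If outputs are logged with a correctness label and a confidence score, the false confidence rate is a one-liner to compute. A sketch assuming that logging schema; the `correct` and `confidence` field names and the 0.8 threshold are illustrative.

```python
# False confidence rate: share of outputs that were both wrong and
# delivered with high confidence. Record fields are assumptions about
# your own logging schema; the threshold is illustrative.
def false_confidence_rate(records: list[dict],
                          threshold: float = 0.8) -> float:
    """records: each has 'correct' (bool) and 'confidence' (0-1)."""
    if not records:
        return 0.0
    confident_wrong = sum(
        1 for r in records
        if not r["correct"] and r["confidence"] >= threshold
    )
    return confident_wrong / len(records)

logs = [
    {"correct": True,  "confidence": 0.95},
    {"correct": False, "confidence": 0.90},  # confidently wrong
    {"correct": False, "confidence": 0.40},  # wrong but hedged
    {"correct": True,  "confidence": 0.60},
]
rate = false_confidence_rate(logs)
```

Track this per user segment as well as overall: a low global rate can hide a segment where the model is confidently wrong often.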

Lessons From Real Deployments

Weel (expense management, fintech): AI-assisted approval triage. Key learning: the model needed explicit instructions to abstain on unusual cases. Without “if the expense doesn’t fit a clear category, say so and route to senior reviewer,” it would confidently miscategorise edge cases. Routing logic matters as much as the model.

Bugcrowd (bug bounty platform): triage copilot that summarises security reports before human review. Key learning: the summary had to be strictly grounded in the submission, with no extrapolation. A bug report summary that added context the model inferred (rather than the researcher submitted) introduced inaccuracy at the exact moment accuracy mattered most.

PromptLib (my product): AI-powered prompt management. Key learning: users trust AI suggestions more when they can see the source. “Suggested based on your last 5 prompts in this category” lands very differently to “here’s a suggestion.” Transparency about how AI made a decision changes how people evaluate it.
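The abstain-and-route lesson from Weel can be sketched as two pieces: an explicit instruction in the prompt, and routing code that honours the abstention. The wording, sentinel value, and queue names below are illustrative, not the production text.

```python
# Sketch of the abstain-and-route pattern. Prompt wording, the "UNCLEAR"
# sentinel, and queue names are illustrative assumptions.
ABSTAIN_INSTRUCTION = (
    "If the expense does not clearly fit one of the listed categories, "
    "respond with the category 'UNCLEAR' and do not guess."
)

def route(category: str) -> str:
    """Honour the abstention: unclear cases go to a senior reviewer."""
    return "senior_reviewer" if category == "UNCLEAR" else "standard_queue"
```

The instruction alone is not enough; without the routing branch, an honest "UNCLEAR" just becomes another miscategorised item in the standard queue.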

The Maturity Progression

Most teams I work with are at Level 1 trying to jump straight to Level 3. The jump fails because Level 2 is where you build the evaluation infrastructure, governance patterns, and user trust that Level 3 requires. Go through Level 2. It’s not a detour; it’s the prerequisite.

The Honest Business Case

AI in business works when it’s applied to workflows with high volume, consistent structure, and a clear definition of success. It fails when it’s applied to showcase the technology rather than solve a specific problem. The most successful AI features I’ve seen in production weren’t impressive in demos. They were invisible — running in the background, making an existing workflow faster or more reliable. The people using them stopped noticing it was AI. They just noticed the work was easier. That invisibility is the goal.