I’ve integrated AI into a fintech scale-up, contributed to AI tooling at Atlassian, and built my own AI-native products. The gap between AI pilots and AI in production is larger than most business cases acknowledge. Here’s what I’ve learned.

Why Most AI Initiatives Fail

The failure pattern is consistent. A team sees a demo, gets excited, picks a use case, builds for six weeks, and then — nothing. The system works in testing. It doesn’t get used. The autopsy usually reveals the same findings:
Demo → Excitement → Build → "This doesn't fit how we work" → Quietly abandoned
The problem wasn’t the technology. It was the approach. AI gets treated as a product to build rather than a workflow to improve. The right question isn’t “what can AI do?” — it’s “where are we currently doing work that AI could compress?”

The Diagnostic: Finding High-Leverage Workflows

Before building anything, I run a workflow audit. The candidates that consistently work:
  • Repetitive decision support: high volume, consistent criteria. Examples: expense approval triage, support ticket routing.
  • Knowledge retrieval: too much to memorise, search is slow. Examples: policy Q&A, runbook lookup.
  • Draft generation: known output structure, a human reviews the final version. Examples: report drafts, email templates, code review summaries.
  • Data extraction: unstructured → structured at scale. Examples: contract clause extraction, invoice parsing.
  • Classification: fast, consistent labelling needed. Examples: sentiment, intent, topic tagging.
The workflows to avoid for first AI initiatives:
  • Anything where a wrong answer causes serious harm (medical, legal, financial decisions with no review)
  • Workflows with no measurable baseline (you can’t prove value if you don’t know the current state)
  • Workflows that vary so much by context that no prompt can cover them
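The audit criteria above can be sketched as a rough scoring function. This is a hypothetical helper, not a tool from the article: the field names, weights, and volume threshold are all illustrative assumptions, but the hard filters mirror the avoid-list directly.

```python
# Hypothetical workflow-audit scorer. Weights and thresholds are
# illustrative; the hard filters come from the avoid-list above.
from dataclasses import dataclass

@dataclass
class Workflow:
    name: str
    weekly_volume: int         # how often the task runs
    has_baseline: bool         # do we know the current time/error cost?
    consistent_criteria: bool  # do the same rules apply across instances?
    harm_if_wrong: bool        # serious consequences with no review step?

def audit_score(w: Workflow) -> float:
    """Return a rough 0-1 suitability score; 0 disqualifies outright."""
    if w.harm_if_wrong or not w.has_baseline:
        return 0.0  # hard filters: serious harm, or no measurable baseline
    score = min(w.weekly_volume / 500, 1.0) * 0.6  # volume drives leverage
    score += 0.4 if w.consistent_criteria else 0.1
    return round(min(score, 1.0), 2)

candidates = [
    Workflow("expense triage", 800, True, True, False),
    Workflow("legal opinion drafting", 20, False, False, True),
]
ranked = sorted(candidates, key=audit_score, reverse=True)
```

The point of the sketch is the shape, not the numbers: disqualify on harm and missing baselines first, then rank what survives by volume and consistency.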

The Thin Slice Method

Every AI initiative I’ve run successfully started with a thin slice: one workflow, one team, minimal scope.

At Weel, we picked expense approval triage as the thin slice: AI summarising the context of a flagged expense before a human reviewer sees it. Baseline: reviewers spent 4-6 minutes per item reconstructing context. After: 90 seconds. That’s the proof of value that gets you budget for the next initiative.

Why thin slices work: they’re low-risk, fast to ship, and produce real data. A controlled comparison between AI-assisted and non-AI-assisted teams tells you more than any benchmark.
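The proof-of-value comparison is simple enough to sketch directly. A minimal helper, assuming you log per-item review times before and after; the sample numbers echo the Weel figures but the function is generic.

```python
# Minimal thin-slice proof of value: compare mean per-item review time
# before and after AI assistance. Sample data is illustrative.
def time_saved_summary(baseline_secs: list[float],
                       assisted_secs: list[float]) -> dict:
    """Return mean per-item times and per-item saving, in seconds."""
    base = sum(baseline_secs) / len(baseline_secs)
    assisted = sum(assisted_secs) / len(assisted_secs)
    return {
        "baseline_mean_s": round(base, 1),
        "assisted_mean_s": round(assisted, 1),
        "saved_per_item_s": round(base - assisted, 1),
    }

summary = time_saved_summary(
    baseline_secs=[240, 300, 360],  # 4-6 minutes per item
    assisted_secs=[85, 90, 95],     # ~90 seconds per item
)
```

Multiply the per-item saving by weekly volume and you have the number that goes in the budget conversation.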

What “Production-Ready” Actually Means

Getting to production means more than the model working. It means the system working — including all the ways the model doesn’t.

Governance Before Scale

The questions you must answer before expanding any AI feature:
  • Data: What data does this model see? Can it see data it shouldn’t? Where does it go after the API call?
  • Audit: For every AI decision that affects a user or business outcome, can you reconstruct what happened? What was the input, what model, what prompt version, what output?
  • Review: Which outputs go to users directly? Which require human review first? This isn’t a philosophical question; it’s a threshold you set explicitly.
  • Cost: What’s the cost per request? At 10 users? At 10,000? Where do the unit economics break?
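The audit question translates into one concrete habit: log every call with enough detail to reconstruct it. A minimal sketch, assuming your own logging schema; the field names are illustrative, and hashing the input is one option when the raw text is sensitive.

```python
# One way to make AI decisions reconstructable: capture input, model,
# prompt version, and output for every call. Field names are assumptions.
import hashlib
import time

def audit_record(model: str, prompt_version: str,
                 user_input: str, output: str) -> dict:
    """Build an audit record; hash the input if it may contain PII."""
    return {
        "ts": time.time(),
        "model": model,
        "prompt_version": prompt_version,
        "input_sha256": hashlib.sha256(user_input.encode()).hexdigest(),
        "output": output,
    }

rec = audit_record(
    model="gpt-4o",
    prompt_version="triage-v3",
    user_input="flagged expense: taxi receipt, $84",
    output="Likely travel; route to reviewer",
)
# In production this would go to an append-only store, not a local dict.
```

With prompt versions in every record, "what prompt produced this?" becomes a query instead of an archaeology project.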

The Approval Gate Pattern

For any AI action that’s hard to reverse, I use an explicit human-in-the-loop gate:
# Actions that are hard to reverse always route through a human.
REQUIRES_APPROVAL = {
    "issue_refund",
    "send_customer_communication",
    "modify_subscription",
    "flag_account",
}

async def execute_ai_action(action: str, params: dict, reasoning: str):
    """Queue irreversible actions for human review; execute the rest."""
    if action in REQUIRES_APPROVAL:
        return await queue_for_human_review(
            action=action,
            params=params,
            ai_reasoning=reasoning,
        )
    return await execute_directly(action, params)
This pattern separates “AI recommends” from “AI acts.” The recommendation can be automated. The action stays with a human until the error rate proves it safe to automate.
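The "until the error rate proves it safe" clause can itself be made explicit. A hedged sketch of that promotion rule, with illustrative thresholds; the point is that both the minimum sample and the acceptable error rate are numbers you choose deliberately, not defaults.

```python
# Promotion rule implied above: an action leaves REQUIRES_APPROVAL only
# once enough human-reviewed outcomes show a low enough error rate.
# min_sample and max_error_rate are illustrative, not recommendations.
def safe_to_automate(reviewed: int, errors: int,
                     min_sample: int = 500,
                     max_error_rate: float = 0.01) -> bool:
    """True if the observed error rate over a sufficient sample is low."""
    if reviewed < min_sample:
        return False  # not enough evidence yet; keep the human in the loop
    return (errors / reviewed) <= max_error_rate
```

A small sample with zero errors still returns False: the rule demands evidence, not luck.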

The Change Management Problem

Technology is the easy part. Adoption is where AI initiatives die quietly. The resistance pattern: people either over-trust AI (stop checking outputs) or under-trust it (never use it). Neither is useful. What you want is calibrated trust: people using AI where it helps, reviewing where it matters, and ignoring it where it doesn’t fit.

What actually moves adoption:
  • Short feedback loops. Show people the before/after within their own workflow. “You used to spend 6 minutes on this. Here’s the same thing with the AI summary upfront.” Numbers make it real.
  • Friction logs. Ask users to log every time AI made their work worse, not just better. The friction log is your product backlog. Ignoring it kills adoption; addressing it builds trust.
  • Win stories. At Weel, we sent brief internal notes when AI saved a reviewer time. Not a newsletter, a Slack message: “This week: 340 approvals processed, 18 hours saved.” That compounds.
  • Escape hatches. Make it optional. Users who don’t trust the AI today should still be able to work. Forced AI adoption produces forced workarounds.

Measuring What Matters

The metrics I use for AI features in business context:
Primary metrics:
  - Time saved per task (requires baseline measurement before launch)
  - Error rate: AI-assisted vs. baseline
  - Escalation rate: % of AI outputs that require human correction

Secondary metrics:
  - Cost per request (track from day one, not after it becomes a problem)
  - Adoption rate (% of eligible tasks using AI assistance)
  - User satisfaction (quick thumbs up/down on AI outputs)

Watch carefully:
  - False confidence rate: how often AI is wrong AND confident
  - Population parity: does quality degrade for any user segment?
The metric most teams skip: false confidence rate. An AI that says “I’m not sure” when it’s wrong is manageable. An AI that’s confidently wrong erodes trust faster than you can rebuild it.
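If outputs are logged with a correctness label and a confidence score, the false confidence rate is a one-liner to compute. A sketch assuming that logging schema; the `correct` and `confidence` field names and the 0.8 threshold are illustrative.

```python
# False confidence rate: share of outputs that were both wrong and
# delivered with high confidence. Record fields are assumptions about
# your own logging schema; the threshold is illustrative.
def false_confidence_rate(records: list[dict],
                          threshold: float = 0.8) -> float:
    """records: each has 'correct' (bool) and 'confidence' (0-1)."""
    if not records:
        return 0.0
    confident_wrong = sum(
        1 for r in records
        if not r["correct"] and r["confidence"] >= threshold
    )
    return confident_wrong / len(records)

logs = [
    {"correct": True,  "confidence": 0.95},
    {"correct": False, "confidence": 0.90},  # confidently wrong
    {"correct": False, "confidence": 0.40},  # wrong but hedged
    {"correct": True,  "confidence": 0.60},
]
rate = false_confidence_rate(logs)
```

Track this per user segment as well as overall: a low global rate can hide a segment where the model is confidently wrong often.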

Lessons From Real Deployments

Weel (expense management, fintech): AI-assisted approval triage. Key learning: the model needed explicit instructions to abstain on unusual cases. Without “if the expense doesn’t fit a clear category, say so and route to senior reviewer,” it would confidently miscategorise edge cases. Routing logic matters as much as the model.

Bugcrowd (bug bounty platform): triage copilot that summarises security reports before human review. Key learning: the summary had to be strictly grounded in the submission, with no extrapolation. A bug report summary that added context the model inferred (rather than the researcher submitted) introduced inaccuracy at the exact moment accuracy mattered most.

PromptLib (my product): AI-powered prompt management. Key learning: users trust AI suggestions more when they can see the source. “Suggested based on your last 5 prompts in this category” lands very differently to “here’s a suggestion.” Transparency about how AI made a decision changes how people evaluate it.
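The abstain-and-route lesson from Weel can be sketched as two pieces: an explicit instruction in the prompt, and routing code that honours the abstention. The wording, sentinel value, and queue names below are illustrative, not the production text.

```python
# Sketch of the abstain-and-route pattern. Prompt wording, the "UNCLEAR"
# sentinel, and queue names are illustrative assumptions.
ABSTAIN_INSTRUCTION = (
    "If the expense does not clearly fit one of the listed categories, "
    "respond with the category 'UNCLEAR' and do not guess."
)

def route(category: str) -> str:
    """Honour the abstention: unclear cases go to a senior reviewer."""
    return "senior_reviewer" if category == "UNCLEAR" else "standard_queue"
```

The instruction alone is not enough; without the routing branch, an honest "UNCLEAR" just becomes another miscategorised item in the standard queue.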

The Maturity Progression

Most teams I work with are at Level 1 trying to jump straight to Level 3. The jump fails because Level 2 is where you build the evaluation infrastructure, governance patterns, and user trust that Level 3 requires. Go through Level 2. It’s not a detour; it’s the prerequisite.

The Honest Business Case

AI in business works when it’s applied to workflows with high volume, consistent structure, and a clear definition of success. It fails when it’s applied to showcase the technology rather than solve a specific problem. The most successful AI features I’ve seen in production weren’t impressive in demos. They were invisible — running in the background, making an existing workflow faster or more reliable. The people using them stopped noticing it was AI. They just noticed the work was easier. That invisibility is the goal.