From Demo to Production: Shipping AI Features
There’s a moment in every AI project where the demo works beautifully. You show it to your team, they’re impressed, someone says “ship it.” Then reality hits.
The demo ran on 5 carefully chosen inputs. Production gets 50,000 chaotic, unpredictable, adversarial inputs daily. The demo took 8 seconds and nobody minded. Production users bounce after 3. The demo cost a penny per request. Production costs $400/day and your finance team has questions.
I’ve shipped AI features at MetaLabs, PromptLib, and Weel. Every single time, the gap between “working demo” and “production feature” was wider than I expected. Here’s the field guide for crossing that gap.
Why Demos Lie
A demo is a proof of concept running under ideal conditions. Here’s what’s different in production:
Input diversity: Your demo used well-formed English sentences. Production gets typos, code-switching, emojis, 10-page pastes, empty strings, and deliberate prompt injection attempts.
Scale economics: At 100 requests, cost is invisible. At 100,000 requests, it’s a budget line item. The model that’s “basically free” in demos becomes the most expensive service in your stack.
Latency composition: The demo measured “LLM response time.” Production measures the full chain — authentication, input validation, context retrieval, LLM inference, output parsing, caching, logging — and users experience the sum.
Error handling: The demo crashed gracefully when you were watching. Production crashes at 3 AM when a user in Japan encounters a Unicode edge case in your prompt template.
Statefulness: Demos are stateless. Production has user sessions, conversation history, cached results, concurrent requests, and race conditions.
If your AI feature works in a demo but you haven’t stress-tested it with at least 500 diverse real-world inputs, you don’t have a feature. You have a risk.
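Before calling a feature production-ready, it helps to codify that sweep. Here is a minimal sketch of an edge-case smoke test — `classify` is a hypothetical stand-in for whatever function wraps your AI feature:

```python
# A few representative edge cases; a real suite would have hundreds.
EDGE_CASES = [
    "",                                   # empty string
    "   \n\t  ",                          # whitespace only
    "héllo wörld 👋🏽",                     # accents and emoji
    "word " * 5000,                       # oversized paste
    "Ignore previous instructions and",   # prompt-injection style input
]

def smoke_test(classify):
    """Run every edge case; the feature must return something sane, never raise."""
    failures = []
    for case in EDGE_CASES:
        try:
            result = classify(case)
            if result is None:
                failures.append((case[:30], "returned None"))
        except Exception as exc:
            failures.append((case[:30], repr(exc)))
    return failures
```

The point is not this particular list — it's that the sweep runs in CI, against the real pipeline, before every prompt change.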
Latency Budgets for AI Features
Users have different latency expectations for different interactions. Map your AI feature to the right budget.
| Interaction Type | Latency Budget | Example |
|---|---|---|
| Autocomplete / inline suggestions | < 200ms | Code completion, search suggestions |
| Quick transformation | < 1s | Summarize selected text, classify an email |
| Chat response (first token) | < 500ms | Customer support bot, coding assistant |
| Chat response (full) | < 5s | Complex question answering |
| Background processing | < 30s | Document analysis, report generation |
| Async / queued | Minutes | Batch processing, long-form generation |
How to Hit Your Budget
Streaming: For anything over 1 second, stream the response. Users perceive streaming as faster because they see progress. Time to first token (TTFT) is more important than total time.
// Streaming response in Next.js
import { OpenAI } from "openai";
export async function POST(request: Request) {
const { message } = await request.json();
const openai = new OpenAI();
const stream = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [{ role: "user", content: message }],
stream: true,
});
const encoder = new TextEncoder();
return new Response(
new ReadableStream({
async start(controller) {
for await (const chunk of stream) {
const text = chunk.choices[0]?.delta?.content || "";
controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text })}\n\n`));
}
controller.enqueue(encoder.encode("data: [DONE]\n\n"));
controller.close();
},
}),
{ headers: { "Content-Type": "text/event-stream" } }
);
}
Model selection: GPT-4o-mini is 5-10x faster than GPT-4o. If the quality difference doesn’t matter for your use case, use the faster model.
Parallel processing: If you need to make multiple LLM calls, make them concurrently.
import asyncio

# summarize, extract_entities, and classify_sentiment are your own async helpers
async def process_document(doc: str) -> dict:
summary_task = summarize(doc)
entities_task = extract_entities(doc)
sentiment_task = classify_sentiment(doc)
summary, entities, sentiment = await asyncio.gather(
summary_task, entities_task, sentiment_task
)
return {"summary": summary, "entities": entities, "sentiment": sentiment}
Pre-computation: Anything that can be computed ahead of time should be. Embed your knowledge base offline, not at query time. Precompute common prompts.
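A sketch of that split — `embed` here is a hypothetical stand-in for a real embeddings call. The expensive work happens once, offline; query time only reads a file:

```python
import json
import pathlib

def embed(text: str) -> list[float]:
    # Stand-in for a real embeddings API call -- hypothetical for this sketch.
    return [float(len(text)), float(sum(map(ord, text)) % 997)]

def precompute_embeddings(docs: dict[str, str], out_path: str) -> None:
    """Run once offline (cron job or CI step), never at query time."""
    index = {doc_id: embed(text) for doc_id, text in docs.items()}
    pathlib.Path(out_path).write_text(json.dumps(index))

def load_index(path: str) -> dict[str, list[float]]:
    # At query time we only read the precomputed file -- no model calls.
    return json.loads(pathlib.Path(path).read_text())
```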
Error Handling and Fallbacks
LLM APIs fail. Models produce unparseable output. Rate limits hit. Your system needs to handle all of this gracefully.
The Fallback Hierarchy
async def ai_classify(text: str) -> Classification:
try:
result = await llm_classify(text, model="gpt-4o-mini")
if validate_output(result):
return result
except RateLimitError:
logger.warning("Primary model rate limited, falling back")
except TimeoutError:
logger.warning("Primary model timeout, falling back")
except Exception as e:
logger.error(f"Primary model error: {e}")
# Fallback 1: Try a different model
try:
result = await llm_classify(text, model="claude-3-haiku")
if validate_output(result):
return result
except Exception:
pass
# Fallback 2: Rule-based classification
result = rule_based_classify(text)
if result:
return result
# Fallback 3: Queue for human review
await queue_for_review(text)
return Classification(label="pending_review", confidence=0)
What to Show Users When AI Fails
Never show “An error occurred.” Tell the user what happened and what they can do.
function AIResultDisplay({ result, error }: Props) {
if (error?.type === "rate_limit") {
return (
<Banner variant="warning">
High demand right now. Your request is queued and will complete in about
30 seconds. <Button onClick={retry}>Try again</Button>
</Banner>
);
}
if (error?.type === "low_confidence") {
return (
<Banner variant="info">
I'm not confident in this answer. Here's my best attempt, but you may
want to verify: <em>{result.text}</em>
<Button onClick={escalateToHuman}>Get human help</Button>
</Banner>
);
}
if (error?.type === "timeout") {
return (
<Banner variant="error">
This is taking longer than expected.
<Button onClick={retry}>Retry</Button> or{" "}
<Button onClick={useSimpler}>Try simplified version</Button>
</Banner>
);
}
return <AIResult data={result} />;
}
Design your error states before you design your happy path. Users forgive slow or imperfect AI. They don’t forgive AI that fails silently or produces garbage without warning.
Streaming UX Patterns
Streaming changes how users interact with AI features. Here are patterns I’ve found effective.
Progressive Rendering
Don’t just dump streaming text into a container. Structure the rendering.
function StreamingResponse({ stream }: { stream: AsyncIterable<string> }) {
const [chunks, setChunks] = useState<string[]>([]);
const [isComplete, setIsComplete] = useState(false);
useEffect(() => {
let buffer = "";
(async () => {
for await (const chunk of stream) {
buffer += chunk;
setChunks(prev => [...prev, chunk]);
}
setIsComplete(true);
})();
}, [stream]);
return (
<div className="relative">
<div className="prose">
<Markdown>{chunks.join("")}</Markdown>
</div>
{!isComplete && <TypingIndicator />}
{isComplete && (
<div className="mt-4 flex gap-2">
<CopyButton text={chunks.join("")} />
<FeedbackButtons />
</div>
)}
</div>
);
}
Show Confidence During Streaming
If your system has a retrieval step, show the sources while the answer is being generated.
[Finding relevant sources...]
→ Found 3 matching documents
→ Using: pricing-guide.md, faq.md
[Generating answer...]
Based on our pricing guide, the annual plan costs...
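One way to produce that sequence is to emit typed status events ahead of the answer tokens. A hedged sketch — `retrieve` and `generate` are hypothetical stand-ins for your retrieval and generation steps:

```python
def answer_with_status(question, retrieve, generate):
    """Yield status events first, then answer tokens, so the UI
    can show retrieval progress while generation runs."""
    yield {"type": "status", "text": "Finding relevant sources..."}
    sources = retrieve(question)
    yield {"type": "status", "text": f"Found {len(sources)} matching documents"}
    yield {"type": "sources", "docs": [s["id"] for s in sources]}
    yield {"type": "status", "text": "Generating answer..."}
    for token in generate(question, sources):
        yield {"type": "token", "text": token}
```

Each event maps to one line of the UI above; the frontend switches rendering on the `type` field.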
Cancellation
Users should be able to cancel a streaming response. This saves tokens and money.
const controller = new AbortController();
const stream = await openai.chat.completions.create(
{ model: "gpt-4o", messages, stream: true },
{ signal: controller.signal }
);
// On user cancel:
controller.abort();
Cost Management at Scale
This is the section that would have saved me the most money if I’d read it two years ago.
Know Your Unit Economics
For every AI feature, calculate:
Cost per request = (input_tokens × input_price) + (output_tokens × output_price) + infrastructure
Daily cost = cost_per_request × daily_requests
Monthly cost = daily_cost × 30
Annual cost = monthly_cost × 12
Example for a support chatbot:
Average input: 800 tokens (system prompt + context + user message)
Average output: 200 tokens
Model: GPT-4o-mini ($0.15/1M input, $0.60/1M output)
Cost per request: (800 × $0.00000015) + (200 × $0.0000006) = $0.00024
At 10K requests/day: $2.40/day → $72/month
At 100K requests/day: $24/day → $720/month
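The arithmetic above is worth encoding as a helper so projections stay honest as prices and volumes change. A minimal sketch:

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float,
                     infrastructure: float = 0.0) -> float:
    """Per-request cost; prices are dollars per 1M tokens."""
    return (input_tokens * input_price_per_m / 1_000_000
            + output_tokens * output_price_per_m / 1_000_000
            + infrastructure)

def project(cost_per_req: float, daily_requests: int) -> dict:
    daily = cost_per_req * daily_requests
    return {"daily": daily, "monthly": daily * 30, "annual": daily * 30 * 12}
```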
Cost Reduction Strategies
1. Cache aggressively
import hashlib

# `redis` and `metrics` are assumed to be pre-configured clients
def get_cached_response(prompt: str) -> str | None:
cache_key = hashlib.sha256(prompt.encode()).hexdigest()
cached = redis.get(f"llm:cache:{cache_key}")
if cached:
metrics.increment("llm.cache.hit")
return cached
metrics.increment("llm.cache.miss")
return None
def cache_response(prompt: str, response: str, ttl_hours: int = 24):
cache_key = hashlib.sha256(prompt.encode()).hexdigest()
redis.setex(f"llm:cache:{cache_key}", ttl_hours * 3600, response)
In my experience, LLM response caching hits 15-30% for support and FAQ use cases. That’s 15-30% cost savings for an hour of implementation.
2. Use the cheapest model that meets your quality bar
Don’t default to GPT-4o. Start with GPT-4o-mini. Test if the quality is acceptable. Only upgrade when you can demonstrate the bigger model is meaningfully better on your eval set.
3. Minimize context tokens
Every token in your system prompt costs money on every request. Audit your prompts regularly. Remove verbose instructions, compress few-shot examples, truncate unnecessary context.
# Before: 2000 token system prompt
# After: 800 token system prompt
# Savings at 100K requests/day: ~$18/day at GPT-4o-mini input pricing
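A quick audit helper catches prompt bloat before the bill does. This sketch uses a rough ~4-characters-per-token heuristic for English; swap in your model's real tokenizer (e.g. tiktoken) for exact counts:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 chars per token for English); use the model's
    # actual tokenizer for exact numbers.
    return max(1, len(text) // 4)

def audit_prompt(system_prompt: str, budget_tokens: int = 1000) -> dict:
    """Flag a system prompt that has crept past its token budget."""
    est = estimate_tokens(system_prompt)
    return {"estimated_tokens": est, "over_budget": est > budget_tokens}
```

Run it in CI against every prompt file so regressions show up in review, not on the invoice.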
4. Smart batching for non-interactive tasks
async def batch_classify(items: list[str], batch_size: int = 10) -> list:
    results = []
    # Process every batch, not just the first one
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        prompt = "Classify each of the following items:\n"
        for i, item in enumerate(batch):
            prompt += f"{i + 1}. {item}\n"
        prompt += "\nReturn a JSON array of classifications."
        response = await llm.generate(prompt)
        results.extend(json.loads(response))
    return results
Rate Limiting
Protect your budget and your API quotas.
from slowapi import Limiter
limiter = Limiter(key_func=get_user_id)
@app.post("/api/ai/chat")
@limiter.limit("20/minute") # Per user
@limiter.limit("1000/hour") # Global
async def chat(request: Request):
...
Set per-user and global limits. Alert when you’re approaching your API provider’s rate limits. Nothing kills user experience like 429 errors.
Feature Flags for AI
AI features should be behind feature flags. Always. No exceptions.
async function getAISummary(document: string, userId: string): Promise<Summary> {
  const flags = await getFeatureFlags(userId);
if (!flags.ai_summary_enabled) {
return generateRuleSummary(document); // non-AI fallback
}
const model = flags.ai_summary_model || "gpt-4o-mini";
const promptVersion = flags.ai_summary_prompt_version || "v2.1";
return generateAISummary(document, { model, promptVersion });
}
Feature flags let you:
- Gradual rollout: 1% → 10% → 50% → 100% with monitoring at each step
- Instant rollback: Model misbehaving? Kill the flag, users get the fallback
- A/B testing: Route traffic to different models or prompt versions
- Cost control: Disable AI for free-tier users during cost spikes
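Gradual rollout needs stable bucketing: the same user must land in the same bucket at every percentage, so ramping 1% → 10% → 50% only ever adds users, never flips existing ones back and forth. One common approach, sketched here, hashes the user ID together with the feature name:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically map a user to a bucket in 0-99 and compare
    against the rollout percentage."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent
```

Including the feature name in the hash keeps buckets independent across features, so the same users aren't always the guinea pigs.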
Monitoring and Observability
You need to see what your AI features are doing in production. Here’s my monitoring stack.
Structured Logging
import time

import structlog

logger = structlog.get_logger()
async def ai_request(input_text: str, user_id: str) -> str:
start = time.monotonic()
response = await llm.generate(input_text)
duration_ms = (time.monotonic() - start) * 1000
logger.info("ai_request",
user_id=user_id,
model=response.model,
input_tokens=response.usage.prompt_tokens,
output_tokens=response.usage.completion_tokens,
duration_ms=duration_ms,
prompt_version="v2.3",
cache_hit=False,
)
return response.text
Dashboards I Run
- Request volume: Requests per minute, broken down by feature
- Latency: P50, P95, P99 with alerts on sudden increases
- Error rate: By error type (timeout, rate limit, parse failure, model error)
- Cost: Daily spend by model, by feature, by user tier
- Quality: Sampled LLM-as-judge scores, user feedback (thumbs up/down)
- Token usage: Average input/output tokens per request — catches prompt bloat
Alerts That Matter
- Cost exceeds daily budget by 20%
- Error rate exceeds 5% over 15-minute window
- P95 latency exceeds 2x normal baseline
- LLM-as-judge quality score drops below threshold
- Token usage spikes (indicates prompt or input changes)
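The cost alert is the one that saves the most money, and the rule itself is tiny. A sketch, assuming you already track daily spend elsewhere:

```python
def should_alert(daily_spend: float, daily_budget: float,
                 threshold_pct: float = 20.0) -> bool:
    """Fire when spend exceeds the daily budget by more than threshold_pct."""
    return daily_spend > daily_budget * (1 + threshold_pct / 100)
```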
User Trust and Transparency
Users don’t trust AI by default, and they shouldn’t. Building trust requires deliberate design choices.
Show Your Work
When the AI provides information, show where it came from.
Based on your pricing guide (last updated Jan 2025):
Annual plans start at $99/month billed yearly.
Sources: pricing-guide.md, faq.md (sections: "Annual Billing", "Plan Comparison")
Communicate Uncertainty
Don’t present every AI output with equal confidence.
interface AIResponse {
text: string;
confidence: "high" | "medium" | "low";
sources: Source[];
}
function renderResponse(response: AIResponse) {
return (
<>
{response.confidence === "low" && (
<InfoBanner>
This answer may be incomplete. Consider verifying with our support team.
</InfoBanner>
)}
<ResponseText>{response.text}</ResponseText>
<SourceList sources={response.sources} />
</>
);
}
Feedback Loops
Every AI output should have a feedback mechanism. Thumbs up/down is the minimum.
<div className="flex items-center gap-2 mt-2 text-sm text-gray-500">
<span>Was this helpful?</span>
<button onClick={() => submitFeedback("positive")}>👍</button>
<button onClick={() => submitFeedback("negative")}>👎</button>
<button onClick={() => setShowDetails(true)}>Tell us more</button>
</div>
The feedback serves two purposes: it builds your eval dataset from real user judgments, and it makes users feel heard — which builds trust even when the AI is wrong.
Gradual Rollout Strategy
Here’s the rollout plan I use for every AI feature:
Phase 1: Shadow Mode (1-2 weeks)
Run the AI feature in parallel with the existing system. Log AI outputs but don’t show them to users. Compare AI outputs to existing outputs. Measure quality, latency, cost.
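A sketch of the shadow-mode wiring — `existing_system` and `ai_system` are hypothetical callables standing in for your current code path and the new AI path. Users only ever see the existing system's output:

```python
import logging

logger = logging.getLogger("shadow")

def handle_request(text: str, existing_system, ai_system):
    """Serve the existing system's result; run the AI in parallel
    and log the comparison. An AI failure must never reach users."""
    result = existing_system(text)
    try:
        ai_result = ai_system(text)
        logger.info("shadow_compare existing=%r ai=%r match=%s",
                    result, ai_result, result == ai_result)
    except Exception:
        logger.exception("shadow ai_system failed")  # logged, not surfaced
    return result
```

The logged comparisons become your first real-world eval dataset before a single user sees AI output.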
Phase 2: Internal Dogfood (1 week)
Enable for your team. Get feedback on quality, UX, edge cases. Fix the obvious issues.
Phase 3: Beta with Opt-In (2-4 weeks)
Enable for power users or beta testers. Monitor closely. Collect feedback aggressively. This is where you find the edge cases your eval set missed.
Phase 4: Gradual Rollout (2-4 weeks)
1% → 10% → 50% → 100%. At each step, compare quality and user metrics between the AI group and the control group. If metrics regress, pause and investigate.
Phase 5: Production with Monitoring
Feature is live for everyone. Monitoring dashboards running. Weekly eval suite runs. Prompt versioning in place. On-call knows how to disable the feature if something goes wrong.
This process feels slow. It’s not. It’s the fastest way to ship an AI feature that you don’t have to emergency-revert at 2 AM. I’ve learned this the expensive way.
Lessons from Shipping
These are the things I wish I’d known before shipping my first AI feature:
- Budget 3x the time you expect for going from demo to production. The last 20% takes 80% of the time.
- Your prompt will change monthly. Design for that: prompts outside of code, versioning, instant rollback.
- Users will find ways to break it you never imagined. Adversarial testing is not optional.
- Cost surprises are the most common “production incident” for AI features. Monitor cost from day one, not after the first bill.
- The best AI feature is the one users forget is AI. If it’s reliable, fast, and transparent, they stop thinking about the technology and just use the tool.
- Fallbacks are not failure. A system that gracefully falls back to a simpler approach is better than one that fails spectacularly.
The gap between demo and production is real, but it’s crossable. It just takes the same engineering discipline we apply to everything else — testing, monitoring, gradual rollout, and the humility to know that our system will be wrong sometimes and to plan for that.