Deep learning shows up in my work when classical ML hits a ceiling — usually because the input is unstructured (images, audio, long text) or because I need representational power that gradient-boosted trees can’t provide. In practice, I’m almost always using fine-tuned foundation models, not training from scratch. Training from scratch is for researchers with large datasets and compute budgets. Building products means starting from the best available pretrained model and adapting it.

The Decision Framework

When deep learning is the right call:
  • Input is images, audio, or video
  • Text input is high-volume with consistent task structure (classification, extraction)
  • You need sub-100ms latency at scale (impractical with external API calls)
  • Data privacy prevents sending to external APIs
  • Task requires specialized knowledge not well-represented in general LLMs
When to stay with prompting:
  • Text tasks with moderate volume
  • When you don’t have (or can’t build) training data
  • When the task changes frequently
  • When “good enough” accuracy beats “optimal” accuracy with 3 weeks of fine-tuning

The Tools I Use and Why

| Task | Tool | Why |
| --- | --- | --- |
| Language fine-tuning | Hugging Face + LoRA (via PEFT) | Widest model access; LoRA keeps compute manageable |
| Image classification / embedding | CLIP (zero-shot or fine-tuned) | Works without labeled data; OpenAI open-sourced it |
| Object detection | YOLOv8 (Ultralytics) | Best accuracy-to-deployment ratio; excellent Python API |
| Audio transcription | Whisper (large-v3 locally or via API) | Best accuracy on accented and bilingual speech |
| Video key frame extraction | PySceneDetect | Fast, good enough for thumbnail generation |
| Training infrastructure | RunPod spot GPUs | ~60% cheaper than AWS for short training runs |
| Experiment tracking | Weights & Biases | Visual comparison of training runs; built-in sweeps |
| Model optimization | llama.cpp + GGUF, ONNX Runtime | Shrinking models for edge deployment |
| Serving | Modal (GPU inference) / BentoML | Managed GPU for heavy models; BentoML for containerized deployment |

Core Concepts Every Builder Needs

Why Foundation Models Changed Everything

Before foundation models (BERT, GPT, CLIP, Whisper), training a deep learning model for a specific task meant:
  1. Collect thousands to millions of labeled examples
  2. Train a model from random initialization
  3. Hope the architecture and hyperparameters are right
This was expensive, slow, and accessible only to well-resourced teams. Foundation models shifted this:
  1. A massive model is pretrained on enormous data
  2. You fine-tune on your small, task-specific dataset
  3. The model retains general knowledge and gains task-specific capability
For builders, this means you can now add deep learning capabilities with hundreds of examples, not millions.

LoRA: Fine-Tuning Without Melting Your Budget

Fine-tuning a full LLM is expensive. A 7B parameter model has 7 billion weights to update — that’s enormous compute and memory. Low-Rank Adaptation (LoRA) solves this by freezing most of the original model and adding small trainable “adapter” matrices:
Original weight matrix W (frozen): 4096 × 4096 = 16.7M parameters
LoRA: A (4096 × 16) + B (16 × 4096) = 131K parameters
Reduction: 99.2%
You’re training under 1% of the parameters while getting most of the task-specific improvement. For most product use cases, the quality delta between LoRA and full fine-tuning is negligible.
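The arithmetic above is easy to verify. A quick sketch — the 4096×4096 dimensions and rank 16 come from the example above, not from any specific model config:

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int, float]:
    """Compare a full weight matrix against its LoRA adapter pair."""
    full = d_in * d_out                   # frozen W
    adapter = d_in * rank + rank * d_out  # trainable A and B
    reduction = 100 * (1 - adapter / full)
    return full, adapter, reduction

full, adapter, reduction = lora_param_counts(4096, 4096, 16)
print(full)                 # → 16777216
print(adapter)              # → 131072
print(f"{reduction:.1f}%")  # → 99.2%
```

The reduction compounds across layers: every adapted attention matrix in the model gets the same ~99% cut in trainable parameters.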

Quantization: Making Models Fit

The default model precision is float32 (4 bytes per parameter). A 7B parameter model = 28GB just for weights. Quantization reduces precision to save memory and speed up inference:
| Format | Bytes/param | 7B model size | Quality |
| --- | --- | --- | --- |
| float32 | 4 | 28 GB | Full |
| float16 | 2 | 14 GB | ~Full |
| int8 | 1 | 7 GB | Good |
| 4-bit (GGUF Q4) | 0.5 | 3.5 GB | Acceptable |
For local deployment on consumer hardware, 4-bit quantized models are the practical choice. Quality drops ~5-10% on benchmarks, but for most production tasks the difference is imperceptible.
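The table's sizes follow directly from parameter count × bytes per parameter. A sketch — 7B is just the running example, and these figures cover weights only, ignoring activation and KV-cache memory:

```python
def model_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory, using 1 GB = 1e9 bytes."""
    return n_params * bytes_per_param / 1e9

for fmt, bpp in [("float32", 4), ("float16", 2), ("int8", 1), ("GGUF Q4", 0.5)]:
    print(f"{fmt}: {model_size_gb(7e9, bpp):.1f} GB")
# float32: 28.0 GB
# float16: 14.0 GB
# int8: 7.0 GB
# GGUF Q4: 3.5 GB
```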

Production Example: Nishabdham Audio Pipeline

For Nishabdham (a Telugu poetry and literature platform), I built an audio pipeline: record poetry readings → generate bilingual subtitles with timestamps. The pipeline:
import stable_whisper

def transcribe_telugu_poem(audio_path: str, poem_title: str) -> dict:
    """
    Transcribe Telugu audio with word-level timestamps and bilingual output.
    """
    # Load Whisper large-v3 for maximum accuracy on Telugu
    model = stable_whisper.load_model("large-v3")

    # Transcribe with word-level timestamps
    result = model.transcribe(
        audio_path,
        language="te",  # Telugu
        word_timestamps=True,
        condition_on_previous_text=True,  # Better context continuity
        initial_prompt=f"Telugu poetry recitation: {poem_title}"
    )

    # Generate SRT with both Telugu and English context
    segments = []
    for segment in result.segments:
        segments.append({
            "start": segment.start,
            "end": segment.end,
            "telugu_text": segment.text,
            "english_context": generate_context_note(segment.text),  # Claude API call
            "words": [(w.word, w.start, w.end) for w in segment.words]
        })

    return {
        "segments": segments,
        "srt": result.to_srt_vtt("subtitles.srt"),  # writes the SRT file
        "language_detected": result.language
    }
Post-processing for code-switching: Telugu spoken by modern speakers frequently mixes in English technical/modern words. Whisper sometimes transcribes these inconsistently:
import re

CODE_SWITCH_PATTERNS = {
    r'\b(computer|software|internet|mobile)\b': lambda m: m.group(0).lower(),
    r'(\d+)\s*(percent|%|రూపాయలు)': normalize_numbers,  # project helper, defined elsewhere
    # Add patterns as you encounter them in production
}

def normalize_telugu_transcript(text: str) -> str:
    """Handle common code-switching patterns in Telugu tech content."""
    for pattern, replacement in CODE_SWITCH_PATTERNS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text
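To make the pattern-table idea concrete, here is a self-contained sketch with a simplified pattern set. The patterns and the sample sentence are illustrative, and normalize_numbers from the snippet above is a project helper that isn't reproduced here:

```python
import re

# Illustrative subset: lowercase English loanwords that get
# transcribed with inconsistent capitalization
PATTERNS = {
    r'\b(computer|software|internet|mobile)\b': lambda m: m.group(0).lower(),
}

def normalize(text: str) -> str:
    """Apply each pattern in turn; lambdas receive the match object."""
    for pattern, replacement in PATTERNS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

print(normalize("నేను Computer మీద SOFTWARE రాస్తున్నాను"))
# → నేను computer మీద software రాస్తున్నాను
```

Because the replacement values are callables, each pattern can carry its own normalization logic rather than a fixed substitution string.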
What the numbers looked like:
  • Whisper large-v3 accuracy on clean Telugu: ~92%
  • Accuracy on code-switched sentences: ~76%
  • After post-processing: ~84%
  • Manual review still needed for culturally significant errors (not just phonetic ones)

Fine-Tuning in Practice

When prompting isn’t sufficient and I need a fine-tuned model:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType
import torch

def fine_tune_classifier(
    model_name: str,
    train_data: list[dict],
    num_labels: int
):
    # Load base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels
    )

    # Add LoRA adapters
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=16,              # Rank — higher = more parameters, more capacity
        lora_alpha=32,     # Scaling factor
        lora_dropout=0.1,
        target_modules=["query", "value"]  # Which attention layers to adapt
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # trainable params: 296,960 || all params: 109,779,968 || trainable%: 0.27

    # Train with standard PyTorch loop or HuggingFace Trainer
    # ...

    return model, tokenizer
The rule I follow: Don’t fine-tune until prompting has genuinely failed. I define “genuinely failed” as: 3 different prompt strategies attempted, each evaluated on a test set of 50+ representative examples, and none hit the quality threshold. Only then do I build training data and fine-tune.
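That rule is mechanical enough to encode. A sketch of the gate described above — per-example pass/fail as the metric and 0.9 as the threshold are placeholders; your metric and bar will differ:

```python
def should_fine_tune(
    strategy_results: dict[str, list[bool]],
    threshold: float = 0.9,
    min_examples: int = 50,
    min_strategies: int = 3,
) -> bool:
    """True only if enough prompt strategies were each evaluated on a
    large-enough test set and none reached the quality threshold."""
    evaluated = {
        name: sum(results) / len(results)
        for name, results in strategy_results.items()
        if len(results) >= min_examples
    }
    if len(evaluated) < min_strategies:
        return False  # haven't genuinely tried prompting yet
    return all(acc < threshold for acc in evaluated.values())

# Three strategies, 50 examples each, all below 90% accuracy -> fine-tune
results = {
    "zero_shot": [True] * 40 + [False] * 10,       # 80%
    "few_shot": [True] * 43 + [False] * 7,         # 86%
    "chain_of_thought": [True] * 44 + [False] * 6, # 88%
}
print(should_fine_tune(results))  # → True
```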

Deployment Patterns

GPU Inference on Modal

For models that need a GPU in production:
import modal

stub = modal.Stub("whisper-transcription")
image = modal.Image.debian_slim().pip_install("openai-whisper", "stable-whisper")

@stub.function(
    image=image,
    gpu="A10G",
    memory=8192,
    timeout=300
)
def transcribe(audio_bytes: bytes, language: str = "te") -> dict:
    import whisper
    import tempfile

    model = whisper.load_model("large-v3")

    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        f.write(audio_bytes)
        f.flush()  # ensure bytes hit disk before Whisper reads the path
        result = model.transcribe(f.name, language=language)

    return {
        "text": result["text"],
        "segments": result["segments"]
    }
Modal charges per second of GPU time. For batch transcription jobs, this is dramatically cheaper than keeping a GPU instance running 24/7.
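The per-second billing claim is worth sanity-checking with arithmetic. A sketch using made-up hourly rates — the $/hr figures are placeholders, not Modal's or AWS's actual pricing:

```python
def breakeven_utilization(serverless_per_hr: float, dedicated_per_hr: float) -> float:
    """Fraction of the day a GPU must be busy before a 24/7 dedicated
    instance becomes cheaper than per-second serverless billing."""
    return dedicated_per_hr / serverless_per_hr

# Hypothetical rates: serverless GPU at $1.10 per hour of *active* time,
# dedicated instance at $0.75 per hour around the clock
util = breakeven_utilization(1.10, 0.75)
print(f"break-even at {util:.0%} utilization")  # → break-even at 68% utilization
```

Batch transcription jobs sit far below that utilization line, which is why per-second billing wins for this workload.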

Edge Deployment with ONNX

For models that need to run without a GPU:
# Convert PyTorch model to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"}
    }
)

# Run with ONNX Runtime (much faster than PyTorch CPU)
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    # Providers are tried in order; falls back to CPU when no GPU is present
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

outputs = session.run(None, {
    "input_ids": input_ids.numpy(),
    "attention_mask": attention_mask.numpy()
})

What I Learned the Hard Way

Evaluation data is the real asset. The fine-tuned model becomes worthless when the base model is updated or when you find a better architecture. The evaluation dataset is yours forever — it tells you whether any model (fine-tuned or not) is doing the right thing. Invest in building it carefully before you train anything.

Demo images are not real user images. Every model demo uses clean, well-lit, high-contrast inputs. Real users photograph things at night, on angles, with dirty lenses, half obscured. Test on ugly inputs before committing to a vision approach. I now have a “terrible photos” test set for every vision feature.

Multimodal is harder than it looks. Image + text understanding in production hits edge cases that demos are designed to avoid. The OCR accuracy you see in a product demo is a best case, not an average case.

Cold starts are a production concern. Loading a large model (Whisper large-v3 is roughly 3 GB of weights) takes 10-30 seconds. This is fine for batch jobs. It’s unacceptable for synchronous user-facing features. Either keep the model warm (expensive) or design the UX to be async (background processing, email/notification when ready).

The last 8% of errors are often culturally significant. Whisper gets 92% accuracy on Telugu. The 8% of errors it makes aren’t random — they cluster around culturally significant words, names, and phrases where the training data was sparse. A post-processing human review pass on the high-significance content is worth it.
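The warm-model tradeoff in the cold-start lesson is usually handled with a module-level cache, so the expensive load happens once per process rather than once per request. A sketch with a dummy loader standing in for a real call like whisper.load_model:

```python
import time

_MODEL = None
LOAD_COUNT = 0

def _load_model():
    """Stand-in for an expensive load (e.g. reading gigabytes of weights)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.01)  # simulate slow disk/network load
    return object()

def get_model():
    """Load on first use, reuse for every later request in this process."""
    global _MODEL
    if _MODEL is None:
        _MODEL = _load_model()
    return _MODEL

a = get_model()  # slow: pays the cold-start cost once
b = get_model()  # fast: returns the cached model
print(a is b, LOAD_COUNT)  # → True 1
```

Serverless platforms reuse warm containers between requests, so this pattern converts most requests from a cold start into a dictionary lookup.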