Deep learning shows up in my work when classical ML hits a ceiling — usually because the input is unstructured (images, audio, long text) or because I need representational power that gradient-boosted trees can’t provide. In practice, I’m almost always using fine-tuned foundation models, not training from scratch. Training from scratch is for researchers with large datasets and compute budgets. Building products means starting from the best available pretrained model and adapting it.

The Decision Framework

When deep learning is the right call:
  • Input is images, audio, or video
  • Text input is high-volume with consistent task structure (classification, extraction)
  • You need sub-100ms latency at scale (impractical with external API calls)
  • Data privacy prevents sending to external APIs
  • Task requires specialized knowledge not well-represented in general LLMs
When to stay with prompting:
  • Text tasks with moderate volume
  • When you don’t have (or can’t build) training data
  • When the task changes frequently
  • When “good enough” accuracy beats “optimal” accuracy with 3 weeks of fine-tuning

The Tools I Use and Why

| Task | Tool | Why |
| --- | --- | --- |
| Language fine-tuning | Hugging Face + LoRA (via PEFT) | Widest model access; LoRA keeps compute manageable |
| Image classification / embedding | CLIP (zero-shot or fine-tuned) | Works without labeled data; OpenAI open-sourced it |
| Object detection | YOLOv8 (Ultralytics) | Best accuracy-to-deployment ratio; excellent Python API |
| Audio transcription | Whisper (large-v3 locally or via API) | Best accuracy on accented and bilingual speech |
| Video key frame extraction | PySceneDetect | Fast, good enough for thumbnail generation |
| Training infrastructure | RunPod spot GPUs | ~60% cheaper than AWS for short training runs |
| Experiment tracking | Weights & Biases | Visual comparison of training runs; built-in sweeps |
| Model optimization | llama.cpp + GGUF, ONNX Runtime | Shrinking models for edge deployment |
| Serving | Modal (GPU inference) / BentoML | Managed GPU for heavy models; BentoML for containerized deployment |

Core Concepts Every Builder Needs

Why Foundation Models Changed Everything

Before foundation models (BERT, GPT, CLIP, Whisper), training a deep learning model for a specific task meant:
  1. Collect thousands to millions of labeled examples
  2. Train a model from random initialization
  3. Hope the architecture and hyperparameters are right
This was expensive, slow, and accessible only to well-resourced teams. Foundation models shifted this:
  1. A massive model is pretrained on enormous data
  2. You fine-tune on your small, task-specific dataset
  3. The model retains general knowledge and gains task-specific capability
For builders, this means you can now add deep learning capabilities with hundreds of examples, not millions.

LoRA: Fine-Tuning Without Melting Your Budget

Fine-tuning a full LLM is expensive. A 7B parameter model has 7 billion weights to update — that’s enormous compute and memory. Low-Rank Adaptation (LoRA) solves this by freezing most of the original model and adding small trainable “adapter” matrices:
Original weight matrix W (frozen): 4096 × 4096 = 16.7M parameters
LoRA: A (4096 × 16) + B (16 × 4096) = 131K parameters
Reduction: 99.2%
You’re training under 1% of the parameters while getting most of the task-specific improvement. For most product use cases, the quality delta between LoRA and full fine-tuning is negligible.
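The arithmetic above is easy to verify. A quick sketch — the 4096×4096 dimensions and rank 16 come from the example above, not from any specific model config:

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int, float]:
    """Compare a full weight matrix against its LoRA adapter pair."""
    full = d_in * d_out                   # frozen W
    adapter = d_in * rank + rank * d_out  # trainable A and B
    reduction = 100 * (1 - adapter / full)
    return full, adapter, reduction

full, adapter, reduction = lora_param_counts(4096, 4096, 16)
print(full)                 # → 16777216
print(adapter)              # → 131072
print(f"{reduction:.1f}%")  # → 99.2%
```

The reduction compounds across layers: every adapted attention matrix in the model gets the same ~99% cut in trainable parameters.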

Quantization: Making Models Fit

The default model precision is float32 (4 bytes per parameter). A 7B parameter model = 28GB just for weights. Quantization reduces precision to save memory and speed up inference:
| Format | Bytes/param | 7B model size | Quality |
| --- | --- | --- | --- |
| float32 | 4 | 28 GB | Full |
| float16 | 2 | 14 GB | ~Full |
| int8 | 1 | 7 GB | Good |
| 4-bit (GGUF Q4) | 0.5 | 3.5 GB | Acceptable |
For local deployment on consumer hardware, 4-bit quantized models are the practical choice. Quality drops ~5-10% on benchmarks, but for most production tasks the difference is imperceptible.
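The table's sizes follow directly from parameter count × bytes per parameter. A sketch — 7B is just the running example, and these figures cover weights only, ignoring activation and KV-cache memory:

```python
def model_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory, using 1 GB = 1e9 bytes."""
    return n_params * bytes_per_param / 1e9

for fmt, bpp in [("float32", 4), ("float16", 2), ("int8", 1), ("GGUF Q4", 0.5)]:
    print(f"{fmt}: {model_size_gb(7e9, bpp):.1f} GB")
# float32: 28.0 GB
# float16: 14.0 GB
# int8: 7.0 GB
# GGUF Q4: 3.5 GB
```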

Production Example: Nishabdham Audio Pipeline

For Nishabdham (a Telugu poetry and literature platform), I built an audio pipeline: record poetry readings → generate bilingual subtitles with timestamps. The pipeline:
import stable_whisper

def transcribe_telugu_poem(audio_path: str, poem_title: str) -> dict:
    """
    Transcribe Telugu audio with word-level timestamps and bilingual output.
    """
    # Load Whisper large-v3 for maximum accuracy on Telugu
    model = stable_whisper.load_model("large-v3")

    # Transcribe with word-level timestamps
    result = model.transcribe(
        audio_path,
        language="te",  # Telugu
        word_timestamps=True,
        condition_on_previous_text=True,  # Better context continuity
        initial_prompt=f"Telugu poetry recitation: {poem_title}"
    )

    # Generate SRT with both Telugu and English context
    segments = []
    for segment in result.segments:
        segments.append({
            "start": segment.start,
            "end": segment.end,
            "telugu_text": segment.text,
            "english_context": generate_context_note(segment.text),  # Claude API call
            "words": [(w.word, w.start, w.end) for w in segment.words]
        })

    return {
        "segments": segments,
        "srt": result.to_srt_vtt("subtitles.srt"),  # writes the SRT file
        "language_detected": result.language
    }
Post-processing for code-switching: Telugu spoken by modern speakers frequently mixes in English technical/modern words. Whisper sometimes transcribes these inconsistently:
import re

CODE_SWITCH_PATTERNS = {
    r'\b(computer|software|internet|mobile)\b': lambda m: m.group(0).lower(),
    r'(\d+)\s*(percent|%|రూపాయలు)': normalize_numbers,  # project helper, defined elsewhere
    # Add patterns as you encounter them in production
}

def normalize_telugu_transcript(text: str) -> str:
    """Handle common code-switching patterns in Telugu tech content."""
    for pattern, replacement in CODE_SWITCH_PATTERNS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text
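To make the pattern-table idea concrete, here is a self-contained sketch with a simplified pattern set. The patterns and the sample sentence are illustrative, and normalize_numbers from the snippet above is a project helper that isn't reproduced here:

```python
import re

# Illustrative subset: lowercase English loanwords that get
# transcribed with inconsistent capitalization
PATTERNS = {
    r'\b(computer|software|internet|mobile)\b': lambda m: m.group(0).lower(),
}

def normalize(text: str) -> str:
    """Apply each pattern in turn; lambdas receive the match object."""
    for pattern, replacement in PATTERNS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

print(normalize("నేను Computer మీద SOFTWARE రాస్తున్నాను"))
# → నేను computer మీద software రాస్తున్నాను
```

Because the replacement values are callables, each pattern can carry its own normalization logic rather than a fixed substitution string.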
What the numbers looked like:
  • Whisper large-v3 accuracy on clean Telugu: ~92%
  • Accuracy on code-switched sentences: ~76%
  • After post-processing: ~84%
  • Manual review still needed for culturally significant errors (not just phonetic ones)

Fine-Tuning in Practice

When prompting isn’t sufficient and I need a fine-tuned model:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType
import torch

def fine_tune_classifier(
    model_name: str,
    train_data: list[dict],
    num_labels: int
):
    # Load base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels
    )

    # Add LoRA adapters
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=16,              # Rank — higher = more parameters, more capacity
        lora_alpha=32,     # Scaling factor
        lora_dropout=0.1,
        target_modules=["query", "value"]  # Which attention layers to adapt
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # trainable params: 296,960 || all params: 109,779,968 || trainable%: 0.27

    # Train with standard PyTorch loop or HuggingFace Trainer
    # ...

    return model, tokenizer
The rule I follow: Don’t fine-tune until prompting has genuinely failed. I define “genuinely failed” as: 3 different prompt strategies attempted, each evaluated on a test set of 50+ representative examples, and none hit the quality threshold. Only then do I build training data and fine-tune.
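That rule is mechanical enough to encode. A sketch of the gate described above — per-example pass/fail as the metric and 0.9 as the threshold are placeholders; your metric and bar will differ:

```python
def should_fine_tune(
    strategy_results: dict[str, list[bool]],
    threshold: float = 0.9,
    min_examples: int = 50,
    min_strategies: int = 3,
) -> bool:
    """True only if enough prompt strategies were each evaluated on a
    large-enough test set and none reached the quality threshold."""
    evaluated = {
        name: sum(results) / len(results)
        for name, results in strategy_results.items()
        if len(results) >= min_examples
    }
    if len(evaluated) < min_strategies:
        return False  # haven't genuinely tried prompting yet
    return all(acc < threshold for acc in evaluated.values())

# Three strategies, 50 examples each, all below 90% accuracy -> fine-tune
results = {
    "zero_shot": [True] * 40 + [False] * 10,       # 80%
    "few_shot": [True] * 43 + [False] * 7,         # 86%
    "chain_of_thought": [True] * 44 + [False] * 6, # 88%
}
print(should_fine_tune(results))  # → True
```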

Deployment Patterns

GPU Inference on Modal

For models that need a GPU in production:
import modal

stub = modal.Stub("whisper-transcription")
image = modal.Image.debian_slim().pip_install("openai-whisper", "stable-whisper")

@stub.function(
    image=image,
    gpu="A10G",
    memory=8192,
    timeout=300
)
def transcribe(audio_bytes: bytes, language: str = "te") -> dict:
    import whisper
    import tempfile

    model = whisper.load_model("large-v3")

    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        f.write(audio_bytes)
        f.flush()  # ensure bytes hit disk before Whisper reads the path
        result = model.transcribe(f.name, language=language)

    return {
        "text": result["text"],
        "segments": result["segments"]
    }
Modal charges per second of GPU time. For batch transcription jobs, this is dramatically cheaper than keeping a GPU instance running 24/7.
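The per-second billing claim is worth sanity-checking with arithmetic. A sketch using made-up hourly rates — the $/hr figures are placeholders, not Modal's or AWS's actual pricing:

```python
def breakeven_utilization(serverless_per_hr: float, dedicated_per_hr: float) -> float:
    """Fraction of the day a GPU must be busy before a 24/7 dedicated
    instance becomes cheaper than per-second serverless billing."""
    return dedicated_per_hr / serverless_per_hr

# Hypothetical rates: serverless GPU at $1.10 per hour of *active* time,
# dedicated instance at $0.75 per hour around the clock
util = breakeven_utilization(1.10, 0.75)
print(f"break-even at {util:.0%} utilization")  # → break-even at 68% utilization
```

Batch transcription jobs sit far below that utilization line, which is why per-second billing wins for this workload.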

Edge Deployment with ONNX

For models that need to run without a GPU:
# Convert PyTorch model to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"}
    }
)

# Run with ONNX Runtime (much faster than PyTorch CPU)
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    # Providers are tried in order; falls back to CPU when no GPU is present
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

outputs = session.run(None, {
    "input_ids": input_ids.numpy(),
    "attention_mask": attention_mask.numpy()
})

What I Learned the Hard Way

Evaluation data is the real asset. The fine-tuned model becomes worthless when the base model is updated or when you find a better architecture. The evaluation dataset is yours forever — it tells you whether any model (fine-tuned or not) is doing the right thing. Invest in building it carefully before you train anything.

Demo images are not real user images. Every model demo uses clean, well-lit, high-contrast inputs. Real users photograph things at night, on angles, with dirty lenses, half obscured. Test on ugly inputs before committing to a vision approach. I now have a “terrible photos” test set for every vision feature.

Multimodal is harder than it looks. Image + text understanding in production hits edge cases that demos are designed to avoid. The OCR accuracy you see in a product demo is a best case, not an average case.

Cold starts are a production concern. Loading a large model (Whisper large-v3 is roughly 3 GB of weights) takes 10-30 seconds. This is fine for batch jobs. It’s unacceptable for synchronous user-facing features. Either keep the model warm (expensive) or design the UX to be async (background processing, email/notification when ready).

The last 8% of errors are often culturally significant. Whisper gets 92% accuracy on Telugu. The 8% of errors it makes aren’t random — they cluster around culturally significant words, names, and phrases where the training data was sparse. A post-processing human review pass on the high-significance content is worth it.
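The warm-model tradeoff in the cold-start lesson is usually handled with a module-level cache, so the expensive load happens once per process rather than once per request. A sketch with a dummy loader standing in for a real call like whisper.load_model:

```python
import time

_MODEL = None
LOAD_COUNT = 0

def _load_model():
    """Stand-in for an expensive load (e.g. reading gigabytes of weights)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.01)  # simulate slow disk/network load
    return object()

def get_model():
    """Load on first use, reuse for every later request in this process."""
    global _MODEL
    if _MODEL is None:
        _MODEL = _load_model()
    return _MODEL

a = get_model()  # slow: pays the cold-start cost once
b = get_model()  # fast: returns the cached model
print(a is b, LOAD_COUNT)  # → True 1
```

Serverless platforms reuse warm containers between requests, so this pattern converts most requests from a cold start into a dictionary lookup.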