Computer vision wasn’t something I planned to work with. It found me through the problems I was trying to solve. Users wanted to photograph their medical device screens instead of typing numbers. Finance features needed to extract data from uploaded bank statements. Creative tools needed to process images from users with phones, not studios. Every time, the lesson was the same: demo accuracy and production accuracy are different numbers, and the gap between them is your primary engineering challenge.

Where Vision Shows Up in Real Products

| Product | Problem | Vision approach |
|---|---|---|
| GlucosePro | Users photograph CGM device screens | Object detection + OCR |
| AviWealth | Upload bank statements → extract transactions | Document layout parsing |
| Nishabdham | Generate thumbnails for poetry recordings | Key frame extraction + CLIP |
| General | User-uploaded images need classification | CLIP zero-shot |
In each case, I didn’t choose vision because I wanted to work with vision. The user’s natural input was visual, and building a text-based alternative would have made the product worse. Users photograph things. Building systems that work with photographs is the right call.

The Tool Stack

| Task | Tool | Why I chose it |
|---|---|---|
| Document parsing / invoice extraction | LayoutLMv3 / Donut | Purpose-built for structured document understanding |
| General OCR (clean text) | Tesseract | Fast, free, good enough for high-contrast printed text |
| OCR (difficult conditions) | PaddleOCR | Better on rotated text, low contrast, non-Latin scripts |
| Object detection | YOLOv8 (Ultralytics) | Best accuracy-to-deployment ratio; excellent Python API |
| Image classification (no labels) | CLIP zero-shot | Works without training data; great for prototyping |
| Image classification (fine-tuned) | CLIP or EfficientNet | When zero-shot accuracy isn’t good enough |
| Key frame extraction | PySceneDetect | Fast scene detection for thumbnail generation |
| Image annotation/labeling | Label Studio | Open-source; self-hostable; good export formats |
| Data augmentation | Albumentations | Fast, composable augmentations for training data |
| GPU inference | Modal | Managed GPU; pay-per-second; no idle costs |

Core Concepts Every Builder Needs

The Vision Pipeline

Every production vision system follows the same basic pattern: acquire the image → locate the region of interest → preprocess → extract (OCR or model inference) → validate the output. Each stage can fail independently. Building robust production systems means handling failures at every stage, not just at the end.
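As a sketch (the stage names and `StageResult` type are illustrative, not from any particular library), the pattern looks like this:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class StageResult:
    ok: bool
    value: Any = None
    reason: str = ""

def run_pipeline(image: Any, stages: list[tuple[str, Callable]]) -> StageResult:
    """Run stages in order, stopping at the first failure so the caller
    knows exactly which stage broke and can report or retry it."""
    value = image
    for name, stage in stages:
        result = stage(value)
        if not result.ok:
            return StageResult(False, reason=f"{name}: {result.reason}")
        value = result.value
    return StageResult(True, value=value)

# Toy two-stage pipeline: a "detect" pass-through, then a validity check
stages = [
    ("detect", lambda img: StageResult(True, value=img)),
    ("validate", lambda img: StageResult(bool(img), value=img, reason="empty input")),
]
print(run_pipeline("photo-bytes", stages).ok)  # True
```

The payoff of naming stages is in the error messages: a failed run reports which stage rejected the input, which is what the correction UI later in this post needs.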

OCR vs Document Understanding

OCR (Optical Character Recognition): Extracts raw text from an image. Doesn’t understand structure or meaning. Tesseract gives you “42.7 Total: $” with no understanding of which number is the total.

Document Understanding: Extracts structured data while understanding document layout (tables, headers, form fields). Tools like LayoutLMv3 and Donut know that a number to the right of “Total:” is the total amount, not just a random number.

For anything beyond simple text extraction — invoices, bank statements, forms — document understanding is what you actually need.

Model Accuracy ≠ Feature Reliability

A model that’s 94% accurate means 6% of predictions are wrong. In production:
  • Users don’t know which 6% is wrong without checking
  • The error distribution concentrates on hard cases — bad lighting, unusual formats, edge cases
  • These are exactly the cases your most frustrated users will encounter
For every vision feature I build, the critical design question is: what happens in the 6%? A usable fallback is often more important than improving from 94% to 96%.

Production Example: GlucosePro CGM Reading Capture

The user flow: photograph the CGM device screen → extract the glucose reading → log it with timestamp. The challenge: CGM screens vary by device, the text is small and sometimes against a curved surface, and users photograph them in all lighting conditions.

Stage 1: Device Detection (YOLOv8)

I needed to crop to just the device screen before running OCR. Without this, Tesseract would try to read everything in the photo — clothing patterns, background text, noise.
from ultralytics import YOLO
import cv2
import numpy as np

class CGMDetector:
    def __init__(self, model_path: str):
        self.model = YOLO(model_path)
        self.conf_threshold = 0.5

    def detect_screen(self, image_path: str) -> dict | None:
        """
        Detect CGM device screen region in a photo.
        Returns bounding box or None if no device detected.
        """
        results = self.model(image_path, conf=self.conf_threshold)

        if not results[0].boxes:
            return None

        # Get highest-confidence detection
        best = max(results[0].boxes, key=lambda b: b.conf.item())

        return {
            "bbox": best.xyxy[0].tolist(),  # [x1, y1, x2, y2]
            "confidence": best.conf.item(),
        }

    def crop_screen(self, image_path: str, bbox: list, padding: int = 10) -> np.ndarray:
        """Crop image to detected screen with slight padding."""
        img = cv2.imread(image_path)
        x1, y1, x2, y2 = [int(c) for c in bbox]
        x1 = max(0, x1 - padding)
        y1 = max(0, y1 - padding)
        x2 = min(img.shape[1], x2 + padding)
        y2 = min(img.shape[0], y2 + padding)
        return img[y1:y2, x1:x2]
Training data: I collected 200 photos (my own devices + stock images + beta user contributions), labeled them in Label Studio (about 4 hours total for photography and labeling), and augmented to 400 examples.
import albumentations as A

# Augmentations specifically simulate real-world conditions
augmentation = A.Compose([
    A.RandomBrightnessContrast(brightness_limit=0.3, p=0.5),  # Poor lighting
    A.GaussNoise(var_limit=(10, 50), p=0.3),                  # Camera noise
    A.Blur(blur_limit=3, p=0.2),                               # Hand shake
    A.Perspective(scale=(0.05, 0.1), p=0.3)                   # Angled photos
])
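A plain-NumPy stand-in for what that augmentation pass achieves, doubling 200 originals into 400 training examples. (For real detection training, pass `bbox_params` to `A.Compose` so the bounding boxes follow geometric transforms like `Perspective`.)

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_brightness(img: np.ndarray, limit: float = 0.3) -> np.ndarray:
    """Same idea as A.RandomBrightnessContrast, in plain NumPy:
    scale pixel values by a random factor in [1 - limit, 1 + limit]."""
    factor = 1.0 + rng.uniform(-limit, limit)
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

# 200 stand-in "photos" -> 400 examples (originals + one augmented copy each)
originals = [np.full((32, 32, 3), 128, dtype=np.uint8) for _ in range(200)]
dataset = originals + [jitter_brightness(img) for img in originals]
print(len(dataset))  # 400
```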

Stage 2: Preprocessing for OCR

def preprocess_for_ocr(cropped_image: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(cropped_image, cv2.COLOR_BGR2GRAY)

    # Upscale 2x — Tesseract performs better on larger text
    scaled = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

    # Adaptive thresholding — handles uneven lighting better than global threshold
    thresh = cv2.adaptiveThreshold(
        scaled, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    return cv2.fastNlMeansDenoising(thresh, h=10)

Stage 3: OCR + Validation

import pytesseract

def extract_glucose_reading(preprocessed: np.ndarray) -> dict:
    config = "--psm 8 --oem 3 -c tessedit_char_whitelist=0123456789."
    text = pytesseract.image_to_string(preprocessed, config=config).strip()

    try:
        reading = float(text)
        if 2.0 <= reading <= 30.0:  # Plausible blood glucose range in mmol/L
            return {"reading": reading, "confidence": "high"}
        else:
            return {"reading": None, "confidence": "failed", "reason": f"Out of range: {reading}"}
    except ValueError:
        return {"reading": None, "confidence": "failed", "reason": "Parse error", "raw_text": text}
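A hypothetical glue function tying the three stages together, with each stage passed in as a callable so it can be stubbed in tests. In the real flow these would be the `CGMDetector` methods and functions above; the point is that each failure is surfaced separately, so the UI can respond specifically.

```python
def capture_reading(image_path, detect, crop, preprocess, extract) -> dict:
    """Run detection -> crop -> preprocess -> OCR, reporting which
    stage failed instead of returning one opaque error."""
    detection = detect(image_path)
    if detection is None:
        return {"reading": None, "confidence": "failed", "reason": "No device found"}
    return extract(preprocess(crop(image_path, detection["bbox"])))

# Stubbed happy path: every stage succeeds
result = capture_reading(
    "photo.jpg",
    detect=lambda p: {"bbox": [0, 0, 10, 10], "confidence": 0.9},
    crop=lambda p, b: "cropped",
    preprocess=lambda c: "binarized",
    extract=lambda p: {"reading": 6.2, "confidence": "high"},
)
print(result["reading"])  # 6.2
```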

The Results

| Condition | Accuracy |
|---|---|
| Controlled (good lighting, direct angle) | 94% |
| Typical real user conditions | 78% |
The correction flow was as important as the model. When OCR failed, users saw their photo alongside an edit field pre-filled with the best OCR result and could correct it in about 5 seconds. Without that flow, a 22% failure rate is a feature-breaking bug; with it, failure is minor friction.
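That UI can be driven by a small mapping from OCR result to render mode (the field names here are illustrative, not the actual GlucosePro API):

```python
def to_ui_payload(ocr_result: dict) -> dict:
    """High-confidence reads ask for a one-tap confirm; failures fall
    back to an edit field pre-filled with the raw OCR text, so the
    user corrects instead of retyping from scratch."""
    if ocr_result["confidence"] == "high":
        return {"mode": "confirm", "value": ocr_result["reading"], "editable": True}
    return {"mode": "edit", "prefill": ocr_result.get("raw_text", ""), "editable": True}

print(to_ui_payload({"reading": 6.2, "confidence": "high"})["mode"])  # confirm
print(to_ui_payload({"reading": None, "confidence": "failed",
                     "raw_text": "6,2"})["prefill"])                  # 6,2
```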

Document Parsing: Bank Statements

AviWealth needed to extract transactions from bank statement PDFs/photos. LayoutLMv3 handles this better than raw OCR because it understands document structure.
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from PIL import Image
import torch

class DocumentParser:
    def __init__(self):
        self.processor = LayoutLMv3Processor.from_pretrained(
            "microsoft/layoutlmv3-base"
        )
        self.model = LayoutLMv3ForTokenClassification.from_pretrained(
            "your-fine-tuned-bank-statement-model"
        )

    def extract_transactions(self, image_path: str) -> list[dict]:
        image = Image.open(image_path).convert("RGB")

        # Processor handles OCR + layout encoding automatically
        encoding = self.processor(
            image, return_tensors="pt", truncation=True
        )

        with torch.no_grad():
            outputs = self.model(**encoding)

        predictions = outputs.logits.argmax(-1).squeeze().tolist()
        return self._decode_to_transactions(predictions, encoding)
The honest reality: Bank statement formats vary enormously. I have specific parsers for the 8 most common Australian bank formats, and a fallback generic parser. Generic achieves ~70% extraction accuracy. Specific parsers achieve 90-95%. Invest in format-specific handling for your highest-volume document types.
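One way to structure the format-specific/generic split (a sketch; the bank id and parser bodies are placeholders) is a registry that falls back to the generic parser for unrecognized formats:

```python
from typing import Callable

PARSERS: dict[str, Callable[[str], list[dict]]] = {}

def register(bank_id: str):
    """Decorator registering a format-specific parser under a bank id."""
    def wrap(fn):
        PARSERS[bank_id] = fn
        return fn
    return wrap

@register("example_bank")
def parse_example_bank(text: str) -> list[dict]:
    # Placeholder: a real parser knows this bank's column layout
    return [{"parser": "example_bank", "raw": line} for line in text.splitlines() if line]

def parse_generic(text: str) -> list[dict]:
    # Fallback path for formats without a dedicated parser
    return [{"parser": "generic", "raw": line} for line in text.splitlines() if line]

def parse_statement(bank_id: str, text: str) -> list[dict]:
    return PARSERS.get(bank_id, parse_generic)(text)

print(parse_statement("example_bank", "row1\nrow2")[0]["parser"])  # example_bank
print(parse_statement("unknown_bank", "row1")[0]["parser"])        # generic
```

The registry makes the priority explicit: adding a high-volume format is one new function, and everything else degrades to the ~70%-accurate generic path rather than erroring out.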

Production Deployment

Vision models are heavy. Design for async:
# Modal for GPU inference — pay-per-second, no idle costs
import modal

stub = modal.Stub("vision-service")
image = modal.Image.debian_slim().pip_install(
    "ultralytics", "pytesseract", "opencv-python-headless"
)

@stub.function(gpu="T4", memory=4096, timeout=60)
def process_image(image_bytes: bytes, task: str) -> dict:
    import numpy as np, cv2
    nparr = np.frombuffer(image_bytes, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)

    if task == "cgm_reading":
        return extract_cgm_reading(img)
    elif task == "bank_statement":
        return parse_document(img)
    else:
        raise ValueError(f"Unknown task: {task}")
Design principle: Never make users wait for GPU cold starts (10-30 seconds). Show “analyzing your photo…” and process in background. Update with results asynchronously.
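A minimal sketch of that pattern with an in-memory job table and a background thread (a production service would use a real queue and persistent storage; the names here are illustrative):

```python
import threading
import time
import uuid

JOBS: dict[str, dict] = {}

def submit(image_bytes: bytes, process) -> str:
    """Return a job id immediately; run inference in the background so
    the user sees "analyzing your photo..." instead of a frozen screen."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "analyzing"}

    def worker():
        result = process(image_bytes)  # may include a 10-30 s GPU cold start
        JOBS[job_id] = {"status": "done", "result": result}

    threading.Thread(target=worker, daemon=True).start()
    return job_id

job = submit(b"jpeg-bytes", lambda b: {"reading": 6.2})
time.sleep(0.2)  # stand-in for the client polling the job status
print(JOBS[job]["status"])  # done
```

The client polls (or receives a push) on the job id; the "analyzing" state is what the user sees during a cold start instead of a spinner that looks broken.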

What I Learned the Hard Way

“Real user images” is a completely different category. Every model demo uses clean, well-lit, high-contrast inputs. Real users photograph things at night, upside-down, through glass, with smeared lenses. My “terrible photos” test set — intentionally bad photos of test content — is the most valuable dataset I have. Test ugly inputs before building a production system.

OCR quality is a UX problem, not just an accuracy problem. When OCR fails, users need a graceful way to correct the result. Build the correction flow before spending time on accuracy improvements. The correction UX has higher ROI than most model improvements.

Document parsing is harder than it looks. LayoutLMv3 is impressive in demos. Bank statement parsing in the wild hits edge cases every week — unusual table formats, scanned vs digital-native PDFs, multi-page documents with inconsistent headers. Budget 3x your initial estimate.

The compute cost of vision is significant. Image models are 10-50x more expensive to run than text models at equivalent throughput. Design the UX for async processing: “Your photo is being analyzed” rather than blocking the user.

Collecting 200 labeled images is doable. The assumption that “I don’t have training data” is often wrong. For the CGM detector, 200 labeled images took 4 hours total — 2 hours photographing and 2 hours labeling in Label Studio. This is accessible, not research-scale work.