## Where Vision Shows Up in Real Products
| Product | Problem | Vision approach |
|---|---|---|
| GlucosePro | Users photograph CGM device screens | Object detection + OCR |
| AviWealth | Upload bank statements → extract transactions | Document layout parsing |
| Nishabdham | Generate thumbnails for poetry recordings | Key frame extraction + CLIP |
| General | User-uploaded images need classification | CLIP zero-shot |
## The Tool Stack
| Task | Tool | Why I chose it |
|---|---|---|
| Document parsing / invoice extraction | LayoutLMv3 / Donut | Purpose-built for structured document understanding |
| General OCR (clean text) | Tesseract | Fast, free, good enough for high-contrast printed text |
| OCR (difficult conditions) | PaddleOCR | Better on rotated text, low contrast, non-Latin scripts |
| Object detection | YOLOv8 (Ultralytics) | Best accuracy-to-deployment ratio; excellent Python API |
| Image classification (no labels) | CLIP zero-shot | Works without training data; great for prototyping |
| Image classification (fine-tuned) | CLIP or EfficientNet | When zero-shot accuracy isn’t good enough |
| Key frame extraction | PySceneDetect | Fast scene detection for thumbnail generation |
| Image annotation/labeling | Label Studio | Open-source; self-hostable; good export formats |
| Data augmentation | Albumentations | Fast, composable augmentations for training data |
| GPU inference | Modal | Managed GPU; pay-per-second; no idle costs |
## Core Concepts Every Builder Needs

### The Vision Pipeline
Every production vision system follows the same basic pattern: capture the image → detect/crop the region of interest → preprocess → extract the content → validate the result. Each stage can fail independently. Building robust production systems means handling failures at every stage, not just at the end.

### OCR vs Document Understanding
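A minimal sketch of that pattern with per-stage failure handling (the `StageResult` shape and stage names are my own illustration, not from any particular library):

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class StageResult:
    ok: bool
    value: Any = None
    failed_stage: Optional[str] = None

def run_pipeline(image: Any, stages: list[tuple[str, Callable[[Any], Any]]]) -> StageResult:
    """Run each stage in order; report *which* stage failed, not a generic error."""
    value = image
    for name, fn in stages:
        try:
            value = fn(value)
        except Exception:
            return StageResult(ok=False, failed_stage=name)
        if value is None:  # convention here: a stage signals soft failure by returning None
            return StageResult(ok=False, failed_stage=name)
    return StageResult(ok=True, value=value)
```

Tagging failures by stage is what makes production debugging feasible: "detection found no device" and "OCR returned garbage" call for completely different fixes.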
OCR (Optical Character Recognition): Extracts raw text from an image. Doesn’t understand structure or meaning. Tesseract gives you “42.7 Total: $” with no understanding of which number is the total.

Document Understanding: Extracts structured data while understanding document layout (tables, headers, form fields). Tools like LayoutLMv3 and Donut know that a number to the right of “Total:” is the total amount, not just a random number.

For anything beyond simple text extraction — invoices, bank statements, forms — document understanding is what you actually need.

### Model Accuracy ≠ Feature Reliability
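To make the contrast concrete, this is the kind of brittle heuristic raw OCR forces on you (a hypothetical sketch; the regex and function name are mine). It works only while OCR happens to preserve reading order:

```python
import re
from typing import Optional

def extract_total_from_raw_ocr(text: str) -> Optional[float]:
    """Naive post-processing of raw OCR text: hunt for a number near 'Total'.

    A layout-aware model (LayoutLMv3, Donut) instead learns that the number
    spatially adjacent to the 'Total:' label is the total, regardless of the
    order the characters came out in.
    """
    m = re.search(r"\bTotal:?\s*\$?\s*([\d,]+\.?\d*)", text, re.IGNORECASE)
    if m:
        return float(m.group(1).replace(",", ""))
    return None

extract_total_from_raw_ocr("Subtotal: $40.00  Total: $42.70")  # -> 42.7
extract_total_from_raw_ocr("42.7 Total: $")  # -> None: reading order scrambled
```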
A model that’s 94% accurate means 6% of predictions are wrong. In production:

- Users don’t know which 6% is wrong without checking
- The error distribution concentrates on hard cases — bad lighting, unusual formats, edge cases
- These are exactly the cases your most frustrated users will encounter
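One common mitigation is to route predictions by confidence instead of silently logging everything. A sketch (the thresholds and return labels are illustrative, not from any real product):

```python
def route_prediction(confidence: float,
                     auto_accept: float = 0.95, reject: float = 0.50) -> str:
    """Turn a raw model confidence into a product decision.

    - High confidence: log the value automatically.
    - Middle band: pre-fill the value and ask the user to confirm.
    - Low confidence: ask the user to retake the photo.
    """
    if confidence >= auto_accept:
        return "accept"
    if confidence >= reject:
        return "confirm"  # one tap to confirm or edit
    return "retake"
```

This turns "94% accurate" into a tolerable experience: the 6% of bad predictions mostly land in the confirm/retake paths instead of becoming silent wrong data.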
## Production Example: GlucosePro CGM Reading Capture
The user flow: photograph the CGM device screen → extract the glucose reading → log it with timestamp. The challenge: CGM screens vary by device, the text is small and sometimes against a curved surface, and users photograph them in all lighting conditions.

### Stage 1: Device Detection (YOLOv8)
I needed to crop to just the device screen before running OCR. Without this, Tesseract would try to read everything in the photo — clothing patterns, background text, noise.

### Stage 2: Preprocessing for OCR
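Assuming Stage 1 yields a bounding box for the screen, a preprocessing pass might look like the following (a PIL-based sketch; the exact steps and the 3× upscale factor are my assumptions, not GlucosePro's actual code):

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(img: Image.Image, bbox: tuple[int, int, int, int],
                       scale: int = 3) -> Image.Image:
    """Crop to the detected screen, then normalize the crop for OCR.

    Small LCD digits OCR badly at native resolution; upscaling plus
    contrast stretching gives Tesseract's binarization much more to work with.
    """
    screen = img.crop(bbox)                        # keep only the device screen
    gray = ImageOps.grayscale(screen)              # color carries no signal here
    big = gray.resize((gray.width * scale, gray.height * scale),
                      Image.LANCZOS)               # upscale the small digits
    return ImageOps.autocontrast(big)              # stretch low-contrast lighting
```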
### Stage 3: OCR + Validation
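Stage 3 pairs the OCR call with sanity checks. A sketch of the validation half (the 40–400 mg/dL bounds match typical CGM display ranges, but treat the exact limits as my assumption):

```python
import re
from typing import Optional

# The raw text would come from something like:
#   pytesseract.image_to_string(img, config="--psm 7 -c tessedit_char_whitelist=0123456789.")

def parse_glucose_reading(raw_text: str) -> Optional[float]:
    """Extract and range-check a glucose reading from raw OCR output.

    Misreads (a dropped digit, a stray digit from screen glare) usually land
    outside the plausible range, so a cheap range check catches a large share
    of OCR errors before they become logged data.
    """
    m = re.search(r"\d+(?:\.\d+)?", raw_text)
    if not m:
        return None
    value = float(m.group())
    if not 40 <= value <= 400:
        return None  # physiologically implausible -> likely an OCR error
    return value
```

Rejected readings feed back into the product flow (ask the user to retake or confirm) rather than being logged as-is.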
### The Results
| Condition | Accuracy |
|---|---|
| Controlled (good lighting, direct angle) | 94% |
| Typical real user conditions | 78% |
