When ML Beats LLMs (and Vice Versa)
The mental model I use: Reach for classical ML when:
- You have labeled training data (thousands of examples, not dozens)
- The input is structured/tabular (numbers, categories, dates)
- You need sub-50ms latency at scale
- The prediction task is narrow and stable (doesn’t change often)
- Explainability matters — stakeholders need to understand why
Reach for an LLM when:
- The input is unstructured text and volume is moderate
- You need flexibility over consistency
- You don’t have (or can’t afford to build) a labeled dataset
- The task changes frequently and retraining would be constant
- You need to handle a long tail of edge cases
The Tools I Use and Why
| Problem | Tool | Why I chose it |
|---|---|---|
| Tabular prediction (churn, fraud, classification) | CatBoost | Handles mixed types without preprocessing; best defaults out of the box |
| Time-series forecasting | Prophet + NeuralForecast | Prophet for trend/seasonality decomposition; NeuralForecast when you need neural accuracy |
| Anomaly detection | Isolation Forest | Unsupervised; works when you don’t have labeled anomalies |
| Recommendation (collaborative filtering) | Implicit (matrix factorization) | Strong baseline before anything more complex |
| Feature engineering | Featuretools + pandas | Automated feature generation for tabular data |
| Data versioning | LakeFS | Git for datasets — reproducibility without a full data platform |
| Experiment tracking | Weights & Biases | Visual training runs; better than MLflow for small teams |
| Model serving | BentoML → AWS Lambda | Low-ops path from model artifact to HTTP endpoint |
| Drift monitoring | Evidently AI | Production data vs training data comparison with minimal setup |
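Drift monitoring ultimately comes down to comparing the production feature distribution against the training one. A minimal sketch of the kind of per-feature check a tool like Evidently automates, using a two-sample Kolmogorov-Smirnov test; the data and the 0.01 threshold here are illustrative, not Evidently's defaults:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # feature as seen at training time
prod_feature = rng.normal(loc=0.5, scale=1.0, size=5_000)   # same feature in production, mean-shifted

# KS test: a small p-value means the two samples likely come from different distributions
stat, p_value = ks_2samp(train_feature, prod_feature)
drifted = bool(p_value < 0.01)
print(f"KS statistic={stat:.3f}, drifted={drifted}")
```

In practice you run this per feature on a schedule and alert when any feature drifts; the value of a dedicated tool is the reporting and the sensible defaults, not the statistics.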
Core Concepts Every Builder Needs
Bias-Variance Tradeoff
The fundamental tension in ML: a model that’s too simple underfits (high bias, misses real patterns); a model that’s too complex overfits (high variance, memorizes training data and fails on new data). In practice, this means:
- Gradient boosting (CatBoost/XGBoost) — good balance; handles heterogeneous tabular data well
- Deep networks — high capacity, tend to overfit without regularization
- Linear models — low variance, underfit complex relationships but highly interpretable
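The tradeoff is easy to see empirically. A sketch on synthetic data (noisy sine curve, arbitrary polynomial degrees): the degree-1 model underfits and posts a worse held-out error than a moderate-capacity model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # true signal + noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

test_mse = {}
for degree in (1, 4, 15):
    # Higher degree = higher capacity = lower bias, higher variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    test_mse[degree] = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  test MSE={test_mse[degree]:.3f}")
```

The linear fit can't follow the curve at all; the degree-4 fit tracks it well. Past that, added capacity buys nothing and, on smaller or noisier datasets, actively hurts.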
Evaluation Metrics That Actually Matter
Accuracy is almost always the wrong metric. The right metric depends on the cost asymmetry:
| Scenario | What to optimize | Why |
|---|---|---|
| Fraud detection | Recall (catch all fraud) | False negatives (missed fraud) are expensive |
| Spam filter | Precision (avoid false positives) | False positives (blocking legitimate mail) damage user trust |
| Medical screening | AUC-ROC | Balance across all thresholds; choose threshold based on clinical context |
| Recommendation | NDCG, precision@k | Rank quality matters more than classification accuracy |
| Anomaly detection | F1 at operating threshold | Balance precision and recall at the threshold you’ll deploy |
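Several of these metrics come down to one choice: where you set the decision threshold on the model's score. A toy sketch with made-up scores, showing how lowering the threshold trades precision for recall:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Illustrative classifier scores; the last two examples are the true positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90])

results = {}
for threshold in (0.4, 0.5):
    y_pred = (scores >= threshold).astype(int)
    results[threshold] = (precision_score(y_true, y_pred), recall_score(y_true, y_pred))
    print(f"threshold={threshold}  precision={results[threshold][0]:.2f}  recall={results[threshold][1]:.2f}")
```

At 0.5 you catch half the positives with no false positives; at 0.4 you catch them all but a false positive slips in. Fraud detection wants the lower threshold, spam filtering the higher one, and AUC-ROC summarizes rank quality across all thresholds at once.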
The Data Problem
Most ML failures I’ve seen are data problems. The model is fine. The data is the problem. Common failure modes:
- Label leakage — a feature encodes information about the target that won’t be available at prediction time (e.g., using “account closed” as a feature to predict “account will close”)
- Training/serving skew — the feature distribution at training time doesn’t match deployment time
- Survivorship bias — your training data only includes customers who stayed, so you can’t predict churn accurately
- Class imbalance — 0.1% fraud rate means a model that predicts “not fraud” for everything gets 99.9% accuracy
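The class-imbalance failure mode is easy to demonstrate. A sketch with a simulated 0.1% positive rate: the do-nothing classifier scores roughly 99.9% accuracy while catching zero fraud:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(7)
y_true = (rng.random(100_000) < 0.001).astype(int)  # ~0.1% fraud rate
y_pred = np.zeros_like(y_true)                      # predict "not fraud" for everything

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)  # zero positives predicted
print(f"accuracy={acc:.4f}  recall={rec:.1f}")
```

This is why the metrics table above exists: on imbalanced problems, report recall and precision at your operating threshold, and use resampling or class weights during training rather than trusting accuracy.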
Production Pattern: The AviWealth Anomaly Detector
AviWealth’s core value is helping immigrants understand their Australian finances. One feature: flagging months where spending patterns are unusual. The wrong first approach:
- Runs in 120ms per user per month
- False positive rate in beta: 8% (users accepted it)
- One beta user found a fraudulent direct debit through the alert
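The shape of such a detector is standard even if the specifics here aren't AviWealth's. A minimal sketch using Isolation Forest (the tool from the table above) on hypothetical per-user monthly features; the feature names, values, and contamination setting are all invented for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical per-user monthly features: [total_spend, n_transactions, max_single_debit]
history = rng.normal(loc=[3000.0, 80.0, 400.0], scale=[300.0, 10.0, 50.0], size=(500, 3))

# Unsupervised: fit on the user's normal months, no labeled anomalies needed
detector = IsolationForest(contamination=0.01, random_state=0).fit(history)

new_month = np.array([[9000.0, 85.0, 6000.0]])  # month with an unusually large debit
flag = detector.predict(new_month)  # -1 = anomalous, 1 = normal
print("flagged" if flag[0] == -1 else "normal")
```

The `contamination` parameter is the knob that sets the alert threshold, which is where the false-positive rate gets tuned against what users will tolerate.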
