I’m a practitioner, not a researcher. Everything here comes from building ML features into real products — AviWealth, GlucosePro, Thinki.sh — not from reading papers or running benchmarks. The question I get asked most: “Should I use ML or just use an LLM?” The answer is usually both, applied to different parts of the problem. Understanding when each tool fits is the thing that separates expensive over-engineering from the right call.

When ML Beats LLMs (and Vice Versa)

The mental model I use:

Reach for classical ML when:
  • You have labeled training data (thousands of examples, not dozens)
  • The input is structured/tabular (numbers, categories, dates)
  • You need sub-50ms latency at scale
  • The prediction task is narrow and stable (doesn’t change often)
  • Explainability matters — stakeholders need to understand why
Reach for LLMs when:
  • The input is unstructured text and volume is moderate
  • You need flexibility over consistency
  • You don’t have (or can’t afford to build) a labeled dataset
  • The task changes frequently and retraining would be constant
  • You need to handle a long tail of edge cases
The honest version: Start with prompting. It’s faster and often good enough. When it’s not — when you need consistency, speed, or you’re hitting API costs at scale — that’s when ML earns its place.

The Tools I Use and Why

Problem, tool, and why I chose it:
  • Tabular prediction (churn, fraud, classification): CatBoost — handles mixed types without preprocessing; best defaults out of the box
  • Time-series forecasting: Prophet + NeuralForecast — Prophet for trend/seasonality decomposition; NeuralForecast when you need neural accuracy
  • Anomaly detection: Isolation Forest — unsupervised; works when you don’t have labeled anomalies
  • Recommendation (collaborative filtering): Implicit (matrix factorization) — a strong baseline before anything more complex
  • Feature engineering: Featuretools + pandas — automated feature generation for tabular data
  • Data versioning: LakeFS — Git for datasets; reproducibility without a full data platform
  • Experiment tracking: Weights & Biases — visual training runs; better than MLflow for small teams
  • Model serving: BentoML → AWS Lambda — low-ops path from model artifact to HTTP endpoint
  • Drift monitoring: Evidently AI — production data vs training data comparison with minimal setup

Core Concepts Every Builder Needs

Bias-Variance Tradeoff

The fundamental tension in ML: a model that’s too simple underfits (high bias, misses real patterns), a model that’s too complex overfits (high variance, memorizes training data and fails on new data). In practice, this means:
  • Gradient boosting (CatBoost/XGBoost) — good balance; handles heterogeneous tabular data well
  • Deep networks — high capacity, tend to overfit without regularization
  • Linear models — low variance, underfit complex relationships but highly interpretable
For most product ML problems, CatBoost is the right first answer. It regularizes well by default and performs strongly on medium-sized tabular datasets without hyperparameter tuning.
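The tradeoff is easy to see on synthetic data. A minimal sketch with scikit-learn (the data here is illustrative, not from any of the products above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Nonlinear signal plus noise: linear models underfit it, unpruned trees memorize it
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_train, y_train)                                  # high bias
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)            # high variance
pruned = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)  # balanced

for name, model in [("linear", linear), ("deep tree", deep_tree), ("depth-4 tree", pruned)]:
    print(f"{name}: train R2={model.score(X_train, y_train):.2f}, "
          f"test R2={model.score(X_test, y_test):.2f}")
```

The unpruned tree scores near 1.0 on training data and drops on test data; the linear model scores low on both; the depth-limited tree lands in between. Gradient boosting finds that balance largely on its own, which is why it’s the default recommendation.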

Evaluation Metrics That Actually Matter

Accuracy is almost always the wrong metric. The right metric depends on the cost asymmetry:
Scenario, what to optimize, and why:
  • Fraud detection: recall (catch all fraud) — false negatives (missed fraud) are expensive
  • Spam filter: precision (avoid false positives) — false positives (blocking legitimate mail) damage user trust
  • Medical screening: AUC-ROC — balance across all thresholds; choose the threshold based on clinical context
  • Recommendation: NDCG, precision@k — rank quality matters more than classification accuracy
  • Anomaly detection: F1 at the operating threshold — balance precision and recall at the threshold you’ll deploy
I define the evaluation metric before I train anything. It forces clarity about what “better” means for this specific use case.
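Defining the metric up front also means you can compute it before any model exists. A toy sketch with scikit-learn (labels and scores are made up) showing how the operating threshold trades precision against recall:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Made-up fraud labels (1 = fraud) and model scores
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.9, 0.35, 0.6, 0.8])

# Lowering the threshold raises recall at the cost of precision
for threshold in (0.3, 0.5):
    y_pred = (scores >= threshold).astype(int)
    print(f"t={threshold}: precision={precision_score(y_true, y_pred):.2f}, "
          f"recall={recall_score(y_true, y_pred):.2f}")

# AUC-ROC summarizes ranking quality across all thresholds
print(f"AUC: {roc_auc_score(y_true, scores):.2f}")
```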

The Data Problem

Most ML failures I’ve seen are data problems. The model is fine. The data is the problem. Common failure modes:
  • Label leakage — a feature contains information about the target that won’t exist at prediction time (e.g., using “account closed” as a feature to predict “account will close”)
  • Training/serving skew — the feature distribution at training time doesn’t match deployment time
  • Survivorship bias — your training data only includes customers who stayed, so you can’t predict churn accurately
  • Class imbalance — 0.1% fraud rate means a model that predicts “not fraud” for everything gets 99.9% accuracy
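The class-imbalance failure mode is worth seeing once. A do-nothing classifier on simulated labels (illustrative numbers):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Simulated labels at roughly a 0.1% fraud rate
rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.001).astype(int)

# A "model" that predicts "not fraud" for every transaction
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)                 # near 0.999
rec = recall_score(y_true, y_pred, zero_division=0)  # 0.0: catches no fraud at all
print(f"accuracy={acc:.4f}, recall={rec}")
```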

Production Pattern: The AviWealth Anomaly Detector

AviWealth’s core value is helping immigrants understand their Australian finances. One feature: flagging months where spending deviates sharply from the user’s normal pattern. The wrong first approach:
# Naive: flag anything above 2 standard deviations
threshold = mean + 2 * std
flagged = monthly_spend > threshold
This fires too often (normal variance triggers it constantly) and misses patterns (three modestly elevated months in a row signals something, but none individually breach the threshold). What I actually built:
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def build_anomaly_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Build features that capture spend patterns, not just spend levels.
    """
    features = pd.DataFrame()

    for category in SPEND_CATEGORIES:  # app-level list of spend category names
        # Current month vs 90-day baseline
        features[f'{category}_vs_baseline'] = (
            df[f'{category}_spend_30d'] /
            (df[f'{category}_spend_90d'] / 3 + 1)  # +1 to avoid div by zero
        )

        # Month-over-month change
        features[f'{category}_mom_change'] = (
            df[f'{category}_spend_30d'] - df[f'{category}_spend_30d'].shift(1)
        )

        # Seasonality adjustment (same month last year)
        features[f'{category}_seasonal_ratio'] = (
            df[f'{category}_spend_30d'] /
            (df[f'{category}_spend_ytd_avg'] + 1)
        )

    return features

def train_anomaly_detector(user_history: pd.DataFrame):
    features = build_anomaly_features(user_history)
    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)

    model = IsolationForest(
        contamination=0.05,  # Expect ~5% anomalous months
        random_state=42
    )
    model.fit(scaled)
    return model, scaler

# Serving: runs as a Lambda triggered on monthly aggregation
def score_month(user_id: str, month_features: dict) -> dict:
    model, scaler = load_model(user_id)
    features = pd.DataFrame([month_features])
    scaled = scaler.transform(features)
    score = model.score_samples(scaled)[0]  # More negative = more anomalous

    return {
        "anomaly_score": float(score),
        "flagged": score < -0.3,  # Threshold tuned on beta user feedback
        "top_contributors": identify_top_contributors(features, model)
    }
Results:
  • Runs in 120ms per user per month
  • False positive rate in beta: 8% (users accepted it)
  • One beta user found a fraudulent direct debit through the alert
The isolation forest was specifically chosen because I didn’t have labeled “anomalous months” — it’s unsupervised. If I’d had labeled data, I would have used a classifier.

The ML Pipeline in Practice

Shadow deployment is the step most people skip. Before routing real traffic to a new model, run it in parallel with the existing system for 2-4 weeks. Compare outputs without affecting users. This catches serving skew and edge cases that you’ll never see in test data.
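The pattern itself is small. A sketch, where the two scoring functions are hypothetical stand-ins for the live and candidate models:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shadow")

# Hypothetical stand-ins: replace with the existing and candidate model calls
def live_score(features: dict) -> dict:
    return {"flagged": features.get("spend_ratio", 1.0) > 1.4}

def shadow_score(features: dict) -> dict:
    return {"flagged": features.get("spend_ratio", 1.0) > 1.2}

def score_with_shadow(features: dict) -> dict:
    """Serve the live model; run the candidate in parallel and only log it."""
    live = live_score(features)
    try:
        shadow = shadow_score(features)
        logger.info(json.dumps({
            "live": live,
            "shadow": shadow,
            "disagree": live["flagged"] != shadow["flagged"],
        }))
    except Exception:
        # A shadow failure must never break the user-facing path
        logger.exception("shadow model failed")
    return live  # users only ever see the live model's output
```

Comparing the logged disagreements over a few weeks tells you where the candidate differs before it ever touches a user.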

Deployment: BentoML → Lambda

My default serving pattern for models that don’t need real-time:
# service.py
import bentoml
import pandas as pd

@bentoml.service
class AnomalyDetector:
    def __init__(self):
        self.model = bentoml.sklearn.load_model("anomaly_detector:latest")
        self.scaler = bentoml.sklearn.load_model("anomaly_scaler:latest")

    @bentoml.api
    def score(self, features: dict) -> dict:
        X = pd.DataFrame([features])
        X_scaled = self.scaler.transform(X)
        score = self.model.score_samples(X_scaled)[0]
        return {
            "anomaly_score": float(score),
            "flagged": bool(score < -0.3)
        }
# Build container
bentoml build
bentoml containerize anomaly_detector:latest

# Deploy to Lambda via ECR
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
docker push $ECR_URI/anomaly-detector:latest
For models that need <50ms latency at high QPS, I serve on a dedicated instance instead of Lambda. Lambda cold starts add 200-500ms, which is unacceptable for synchronous user-facing predictions.

Drift Monitoring with Evidently

The model I didn’t set up monitoring for degraded silently for 3 months. Setup cost: 2 hours. Cost of not having it: weeks of debugging.
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

def run_weekly_drift_check(reference_data: pd.DataFrame, production_data: pd.DataFrame):
    report = Report(metrics=[
        DataDriftPreset(),
        DataQualityPreset()
    ])

    report.run(
        reference_data=reference_data,
        current_data=production_data
    )

    drift_results = report.as_dict()

    # Alert if >20% of features have drifted
    drifted_features = drift_results["metrics"][0]["result"]["number_of_drifted_columns"]
    total_features = drift_results["metrics"][0]["result"]["number_of_columns"]

    drift_ratio = drifted_features / total_features

    if drift_ratio > 0.2:
        alert_slack(f"Model drift detected: {drifted_features}/{total_features} features drifted")

    return drift_results

What I Learned the Hard Way

Start with a rule, then upgrade. My first anomaly detector was “flag if this month’s spend is 40% above the 3-month average.” One day to build. It caught 70% of what the ML model catches. The model took 3 weeks. For most use cases, ship the rule, learn from real user feedback, then build the model. Data quality > model quality. Every hour I’ve spent improving features has outperformed every hour I’ve spent tuning model hyperparameters. Bad features cannot be compensated by a better model. Good features make even simple models perform well. The model is not the product. I spent 3 weeks improving model accuracy by 4 percentage points. Then I spent 3 days improving how the alert was displayed — clearer message, specific category breakdown, “this is unusual because…” explanation. The UX improvement drove 5x more adoption than the model improvement. Monitor drift from day one. I didn’t set up Evidently until 3 months after launch. By then, the model had silently degraded because user spending patterns shifted when I changed the expense categorization logic. Drift monitoring from launch would have caught it in week 2. Don’t fine-tune what you can prompt. I spent two weeks fine-tuning a classification model for a task that a well-structured prompt on Claude handled in a day. Prompting should always be the first attempt when the input is text.