There are hundreds of AI tools. Most of them overlap. The question isn’t “what exists?” — it’s “what’s actually worth learning?” This is my working stack, updated as things change. It’s opinionated because vague “here are your options” lists aren’t useful. I’ll tell you what I actually use and why.

How I Think About Tool Selection

Before the list: the frame I use to pick tools. The mistake I made early: reaching for a framework when a direct API call would do. Start with the simplest tool that could work, add abstraction only when you actually need it.

Model APIs

These are the foundations. Everything else wraps around them.
| Provider | Best for | When I reach for it |
|---|---|---|
| Anthropic Claude | Complex reasoning, large context, coding | My default. Sonnet for code and production, Opus for hard architectural decisions |
| OpenAI GPT-4o | General purpose, function calling, vision | Multilingual content, when I need the widest training data coverage |
| OpenAI GPT-4o-mini | High-volume, cost-sensitive tasks | Classification, simple extraction, anything running at scale |
| Google Gemini | Multimodal tasks, data residency requirements | When a client has GCP lock-in or I need native Google Docs/Sheets integration |
| Cohere | Enterprise search, reranking | RAG pipelines where reranking quality matters and I want a single vendor |
My honest take: Use Claude for most things. Switch to GPT-4o when you hit a case where it clearly outperforms (I find this happens most with multilingual tasks and certain vision use cases). Avoid tool-hopping — consistency in your stack means fewer surprises in production.

SDKs and Frameworks

For TypeScript / Next.js Projects

Vercel AI SDK — My default for any streaming UI in a Next.js app. Handles streaming, tool calls, and model switching in a way that integrates naturally with React Server Components and Route Handlers. The useChat and useCompletion hooks remove a lot of boilerplate.
```typescript
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { z } from 'zod';

const result = await streamText({
  model: anthropic('claude-sonnet-4-6'),
  messages,
  tools: {
    getWeather: {
      description: 'Get weather for a location',
      parameters: z.object({ location: z.string() }),
      execute: async ({ location }) => getWeatherData(location)
    }
  }
});
```
OpenAI Node SDK — When I’m calling OpenAI directly without the Vercel AI SDK, need the Batch API, or am building something outside Next.js.

Anthropic TypeScript SDK — For direct Claude API calls when I want fine-grained control or I’m not using the Vercel AI SDK.

For Agent Orchestration

LangGraph — When I need stateful, multi-step agents with complex branching logic. The graph metaphor maps well to how I think about agent workflows. It’s verbose, but it’s explicit — which matters in production.

LangChain — I use this less than I used to. Too much abstraction. When something breaks, the stack trace is painful to navigate. I reach for it when a LangChain community integration saves me serious time (e.g., a pre-built connector to a specific vector store).
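The core idea LangGraph formalizes — explicit state passed between nodes, with routing decided by inspecting that state — can be sketched in plain Python. The node names and routing logic below are illustrative, not LangGraph's API:

```python
# A minimal sketch of the stateful-graph idea LangGraph formalizes.
# Node names (search, summarise) and the routing rules are illustrative.

def search(state: dict) -> dict:
    state["results"] = f"results for {state['query']}"
    return state

def summarise(state: dict) -> dict:
    state["answer"] = f"summary of {state['results']}"
    return state

def route(state: dict):
    # Branch explicitly on state, rather than hiding control
    # flow inside a chain abstraction.
    if "results" not in state:
        return "search"
    if "answer" not in state:
        return "summarise"
    return None  # done

def run(state: dict) -> dict:
    nodes = {"search": search, "summarise": summarise}
    node = route(state)
    while node is not None:
        state = nodes[node](state)
        node = route(state)
    return state

final = run({"query": "vector stores"})
```

The verbosity is the point: every transition is visible, so a production failure points at a specific node rather than a stack of wrappers.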

For Python Projects

Anthropic Python SDK — Clean, well-maintained. My go-to for Python-based scripts and data pipelines.

LlamaIndex — For RAG pipelines in Python. Better retrieval abstractions than LangChain for document-heavy use cases. I use it when I need to build a retrieval system with more sophistication than basic embedding search.
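To make "retrieval abstractions" concrete: the first step frameworks like LlamaIndex handle for you is chunking documents into overlapping windows before embedding. A stdlib-only sketch — the chunk size and overlap values are illustrative, not tuned recommendations:

```python
# The kind of chunking step a RAG framework handles for you.
# chunk_size and overlap are illustrative values, not tuned defaults.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows so that context
    spanning a chunk boundary still appears whole in one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("a" * 500, chunk_size=200, overlap=50)
```

Real pipelines chunk on sentence or token boundaries rather than raw characters, which is exactly the sophistication you're buying when you reach for the framework.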

RAG and Retrieval

Pinecone — My default managed vector store. Reliable, fast, and the API is simple. Cost adds up at scale, but for most projects it’s worth the operational simplicity.

pgvector — When I’m already using Postgres and don’t want to add a new service. Works well for moderate scale. I use this more now than I used to — keeping the stack simple has real value.

Qdrant — When I need a self-hosted vector store with good performance. Docker-deployable, solid API.

LlamaIndex — For the retrieval pipeline itself: chunking, indexing, querying, reranking. I pair this with whichever vector store fits the project.

Cohere Rerank — When base embedding similarity isn’t producing precise enough results. Adding a reranker as a second pass often outperforms increasing embedding dimensions.
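The two-stage pattern behind that last point can be shown in miniature: a cheap similarity search for recall, then a more precise scorer over the survivors. The toy vectors and the keyword-overlap "reranker" below stand in for real embeddings and a model like Cohere Rerank:

```python
# Two-stage retrieval in miniature: embedding similarity for recall,
# then a second scoring pass for precision. Toy vectors and a
# keyword-overlap scorer stand in for real embeddings and a reranker.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

docs = {
    "invoice howto": [0.9, 0.1, 0.0],
    "refund policy": [0.7, 0.6, 0.1],
    "team offsite":  [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.5, 0.0]

# Stage 1: cheap vector similarity, keep the top-k candidates.
candidates = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)[:2]

# Stage 2: rerank only the candidates with a more precise scorer.
def rerank_score(query: str, doc: str) -> float:
    return len(set(query.split()) & set(doc.split()))

reranked = sorted(candidates, key=lambda d: rerank_score("refund policy question", d), reverse=True)
```

The economics are the point: the expensive scorer only ever sees the handful of candidates stage one surfaced, which is why a reranking pass is cheap to add.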

Evaluation and Observability

This is the category most teams underinvest in. Don’t.

Promptfoo — My go-to for fast prompt comparisons and regression testing. Define test cases in YAML, run across multiple models, get a side-by-side comparison. Takes an hour to set up, saves weeks of debugging.
```yaml
# promptfoo.yaml
prompts:
  - "Summarise this: {{document}}"
  - "Provide a 3-sentence summary of: {{document}}"

providers:
  - anthropic:claude-sonnet-4-6
  - openai:gpt-4o

tests:
  - vars:
      document: "file://test-cases/article-1.txt"
    assert:
      - type: contains
        value: key_topic
      - type: llm-rubric
        value: "The summary should cover the main argument"
```
LangSmith — When I’m using LangChain and want tracing. Excellent visibility into chain execution. Less useful when I’m not in the LangChain ecosystem.

Braintrust — A more recent addition to my stack. Better for continuous evaluation in CI/CD pipelines. I use it when I need evals to run on every prompt change as part of a proper deployment gate.

Weights & Biases — When I’m doing fine-tuning or need to track experiments across training runs. Overkill for most production prompt engineering work.
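Stripped of tooling, the regression-testing idea these products share is small: run each test case through the model, check assertions, report failures. A sketch with a fake model function standing in for a real API call:

```python
# The eval-as-regression-test idea in miniature. fake_summarise stands
# in for a real model call; the cases and assertion are illustrative.

def fake_summarise(document: str) -> str:
    return f"Summary: {document[:40]}"

test_cases = [
    {"document": "Climate policy shifted this quarter.", "must_contain": "Climate"},
    {"document": "Earnings beat expectations again.", "must_contain": "Earnings"},
]

def run_evals(model_fn, cases: list[dict]) -> list[dict]:
    results = []
    for case in cases:
        output = model_fn(case["document"])
        results.append({
            "output": output,
            "passed": case["must_contain"] in output,
        })
    return results

results = run_evals(fake_summarise, test_cases)
```

Wire something like this into CI so it runs on every prompt change and you have a deployment gate; that is the workflow the dedicated tools make pleasant at scale.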

Hosting and Runtime

Vercel — For edge functions running inference calls. Works seamlessly with the Vercel AI SDK. Good for low-latency completions in Next.js.

Cloudflare Workers AI — When I need inference close to the user without managing GPU infrastructure. Limited model selection, but improving. Good for simple, high-frequency inference tasks.

Modal — For running custom models, batch jobs, or anything that needs a real GPU. The Python decorator API is elegant. I use this for fine-tuning runs and large batch evaluation jobs.

Replicate — When I need to run a specific open-source model quickly without deploying it myself. Good for prototyping with models that aren’t available via main APIs.

Local Models

Ollama — For running models locally. Dead simple to set up, supports most popular open models (Llama, Mistral, Phi, Gemma). I use this when I need to test something offline, work with private data, or prototype without API costs.
```shell
ollama pull llama3.2
ollama run llama3.2
```
LM Studio — GUI-based local model runner. Good for non-technical team members who want to experiment with local models without touching a terminal.

See Local LLMs Guide for a deeper dive on when local makes sense.

Utilities I Actually Use

Instructor — Structured outputs from LLMs using Pydantic. I use this in Python when I need reliable JSON extraction and don’t want to wrestle with prompt engineering for schema compliance.

Whisper.cpp — On-device audio transcription. Fast, private, free. I use this in products where sending audio to an API isn’t appropriate.

Zod — Not AI-specific, but essential for AI work in TypeScript. I define tool call schemas, output validators, and API input schemas with Zod. It integrates cleanly with the Vercel AI SDK for tool definitions.

PromptLib (my product) — Curated prompt library with organization and guardrails. I built this because I kept losing good prompts across projects.
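What "schema compliance" buys you can be sketched with the stdlib alone: parse the model's JSON output and fail loudly when it doesn't match the expected shape. Instructor does this properly with Pydantic models and automatic retries; the `Invoice` schema below is a made-up example:

```python
# What structured-output validation buys you, sketched with the stdlib.
# Instructor does this properly with Pydantic and automatic retries;
# the Invoice schema and the raw JSON string are illustrative.
import json
from dataclasses import dataclass, fields

@dataclass
class Invoice:
    vendor: str
    total: float

def parse_invoice(raw: str) -> Invoice:
    data = json.loads(raw)
    expected = {f.name for f in fields(Invoice)}
    missing = expected - data.keys()
    if missing:
        # In a real pipeline this is where you'd re-prompt the model.
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Invoice(vendor=str(data["vendor"]), total=float(data["total"]))

invoice = parse_invoice('{"vendor": "Acme", "total": 119.5}')
```

The win is that malformed model output fails at the boundary with a clear error, instead of propagating a half-filled dict through your pipeline.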

What I’ve Stopped Using

LangChain for new projects — The abstraction overhead rarely pays off. I use it when I need a specific integration that would take days to rebuild, not as a foundation.

Custom vector similarity functions — pgvector or a managed service handles this better than any custom implementation I’ve written.

Model-specific prompt formatting by hand — The model SDKs handle this. I used to add \n\nHuman: and \n\nAssistant: markers manually. Don’t do this.
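The replacement for hand-rolled role markers is the structured messages format chat APIs expect; the SDK handles any model-specific serialization underneath. (One caveat: Anthropic's Messages API takes the system prompt as a separate parameter rather than a message; the shape below is the OpenAI-style convention.)

```python
# The structured messages format chat SDKs expect. The SDK serializes
# this per-model, so you never concatenate role markers by hand.
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain pgvector in one sentence."},
]

# The hand-rolled legacy style this replaces (what NOT to do):
legacy = "\n\nHuman: Explain pgvector in one sentence.\n\nAssistant:"

roles = [m["role"] for m in messages]
```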

The Honest Principle

Pick the tool that solves your actual constraint. Latency? Look at edge runtime options. Privacy? Local models or private cloud. Cost at scale? Cheaper models with better prompts often beat expensive models with lazy prompts. The goal isn’t the most sophisticated stack. It’s the simplest stack that ships and runs reliably.