There are hundreds of AI tools. Most of them overlap. The question isn’t “what exists?” — it’s “what’s actually worth learning?” This is my working stack, updated as things change. It’s opinionated because vague “here are your options” lists aren’t useful. I’ll tell you what I actually use and why.

How I Think About Tool Selection

Before the list: the frame I use to pick tools. The mistake I made early: reaching for a framework when a direct API call would do. Start with the simplest tool that could work, add abstraction only when you actually need it.

Model APIs

These are the foundations. Everything else wraps around them.
| Provider | Best for | When I reach for it |
|---|---|---|
| Anthropic Claude | Complex reasoning, large context, coding | My default. Sonnet for code and production, Opus for hard architectural decisions |
| OpenAI GPT-4o | General purpose, function calling, vision | Multilingual content, when I need the widest training data coverage |
| OpenAI GPT-4o-mini | High-volume, cost-sensitive tasks | Classification, simple extraction, anything running at scale |
| Google Gemini | Multimodal tasks, data residency requirements | When a client has GCP lock-in or I need native Google Docs/Sheets integration |
| Cohere | Enterprise search, reranking | RAG pipelines where reranking quality matters and I want a single vendor |
My honest take: Use Claude for most things. Switch to GPT-4o when you hit a case where it clearly outperforms (I find this happens most with multilingual tasks and certain vision use cases). Avoid tool-hopping — consistency in your stack means fewer surprises in production.

SDKs and Frameworks

For TypeScript / Next.js Projects

Vercel AI SDK — My default for any streaming UI in a Next.js app. Handles streaming, tool calls, and model switching in a way that integrates naturally with React Server Components and Route Handlers. The useChat and useCompletion hooks remove a lot of boilerplate.
```typescript
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { z } from 'zod';

const result = await streamText({
  model: anthropic('claude-sonnet-4-6'),
  messages,
  tools: {
    getWeather: {
      description: 'Get weather for a location',
      parameters: z.object({ location: z.string() }),
      execute: async ({ location }) => getWeatherData(location)
    }
  }
});
```
OpenAI Node SDK — When I’m calling OpenAI directly without the Vercel AI SDK, need the Batch API, or am building something outside Next.js.

Anthropic TypeScript SDK — For direct Claude API calls when I want fine-grained control or I’m not using the Vercel AI SDK.

For Agent Orchestration

LangGraph — When I need stateful, multi-step agents with complex branching logic. The graph metaphor maps well to how I think about agent workflows. It’s verbose, but it’s explicit — which matters in production.

LangChain — I use this less than I used to. Too much abstraction. When something breaks, the stack trace is painful to navigate. I reach for it when a LangChain community integration saves me serious time (e.g., a pre-built connector to a specific vector store).
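The core idea LangGraph formalizes — explicit state passed between nodes, with routing decided by inspecting that state — can be sketched in plain Python. The node names and routing logic below are illustrative, not LangGraph's API:

```python
# A minimal sketch of the stateful-graph idea LangGraph formalizes.
# Node names (search, summarise) and the routing rules are illustrative.

def search(state: dict) -> dict:
    state["results"] = f"results for {state['query']}"
    return state

def summarise(state: dict) -> dict:
    state["answer"] = f"summary of {state['results']}"
    return state

def route(state: dict):
    # Branch explicitly on state, rather than hiding control
    # flow inside a chain abstraction.
    if "results" not in state:
        return "search"
    if "answer" not in state:
        return "summarise"
    return None  # done

def run(state: dict) -> dict:
    nodes = {"search": search, "summarise": summarise}
    node = route(state)
    while node is not None:
        state = nodes[node](state)
        node = route(state)
    return state

final = run({"query": "vector stores"})
```

The verbosity is the point: every transition is visible, so a production failure points at a specific node rather than a stack of wrappers.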

For Python Projects

Anthropic Python SDK — Clean, well-maintained. My go-to for Python-based scripts and data pipelines.

LlamaIndex — For RAG pipelines in Python. Better retrieval abstractions than LangChain for document-heavy use cases. I use it when I need to build a retrieval system with more sophistication than basic embedding search.
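To make "retrieval abstractions" concrete: the first step frameworks like LlamaIndex handle for you is chunking documents into overlapping windows before embedding. A stdlib-only sketch — the chunk size and overlap values are illustrative, not tuned recommendations:

```python
# The kind of chunking step a RAG framework handles for you.
# chunk_size and overlap are illustrative values, not tuned defaults.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows so that context
    spanning a chunk boundary still appears whole in one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("a" * 500, chunk_size=200, overlap=50)
```

Real pipelines chunk on sentence or token boundaries rather than raw characters, which is exactly the sophistication you're buying when you reach for the framework.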

RAG and Retrieval

Pinecone — My default managed vector store. Reliable, fast, and the API is simple. Cost adds up at scale, but for most projects it’s worth the operational simplicity.

pgvector — When I’m already using Postgres and don’t want to add a new service. Works well for moderate scale. I use this more now than I used to — keeping the stack simple has real value.

Qdrant — When I need a self-hosted vector store with good performance. Docker-deployable, solid API.

LlamaIndex — For the retrieval pipeline itself: chunking, indexing, querying, reranking. I pair this with whichever vector store fits the project.

Cohere Rerank — When base embedding similarity isn’t producing precise enough results. Adding a reranker as a second pass often outperforms increasing embedding dimensions.
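The two-stage pattern behind that last point can be shown in miniature: a cheap similarity search for recall, then a more precise scorer over the survivors. The toy vectors and the keyword-overlap "reranker" below stand in for real embeddings and a model like Cohere Rerank:

```python
# Two-stage retrieval in miniature: embedding similarity for recall,
# then a second scoring pass for precision. Toy vectors and a
# keyword-overlap scorer stand in for real embeddings and a reranker.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

docs = {
    "invoice howto": [0.9, 0.1, 0.0],
    "refund policy": [0.7, 0.6, 0.1],
    "team offsite":  [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.5, 0.0]

# Stage 1: cheap vector similarity, keep the top-k candidates.
candidates = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)[:2]

# Stage 2: rerank only the candidates with a more precise scorer.
def rerank_score(query: str, doc: str) -> float:
    return len(set(query.split()) & set(doc.split()))

reranked = sorted(candidates, key=lambda d: rerank_score("refund policy question", d), reverse=True)
```

The economics are the point: the expensive scorer only ever sees the handful of candidates stage one surfaced, which is why a reranking pass is cheap to add.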

Evaluation and Observability

This is the category most teams underinvest in. Don’t.

Promptfoo — My go-to for fast prompt comparisons and regression testing. Define test cases in YAML, run across multiple models, get a side-by-side comparison. Takes an hour to set up, saves weeks of debugging.
```yaml
# promptfoo.yaml
prompts:
  - "Summarise this: {{document}}"
  - "Provide a 3-sentence summary of: {{document}}"

providers:
  - anthropic:claude-sonnet-4-6
  - openai:gpt-4o

tests:
  - vars:
      document: "file://test-cases/article-1.txt"
    assert:
      - type: contains
        value: key_topic
      - type: llm-rubric
        value: "The summary should cover the main argument"
```
LangSmith — When I’m using LangChain and want tracing. Excellent visibility into chain execution. Less useful when I’m not in the LangChain ecosystem.

Braintrust — A more recent addition to my stack. Better for continuous evaluation in CI/CD pipelines. I use it when I need evals to run on every prompt change as part of a proper deployment gate.

Weights & Biases — When I’m doing fine-tuning or need to track experiments across training runs. Overkill for most production prompt engineering work.
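Stripped of tooling, the regression-testing idea these products share is small: run each test case through the model, check assertions, report failures. A sketch with a fake model function standing in for a real API call:

```python
# The eval-as-regression-test idea in miniature. fake_summarise stands
# in for a real model call; the cases and assertion are illustrative.

def fake_summarise(document: str) -> str:
    return f"Summary: {document[:40]}"

test_cases = [
    {"document": "Climate policy shifted this quarter.", "must_contain": "Climate"},
    {"document": "Earnings beat expectations again.", "must_contain": "Earnings"},
]

def run_evals(model_fn, cases: list[dict]) -> list[dict]:
    results = []
    for case in cases:
        output = model_fn(case["document"])
        results.append({
            "output": output,
            "passed": case["must_contain"] in output,
        })
    return results

results = run_evals(fake_summarise, test_cases)
```

Wire something like this into CI so it runs on every prompt change and you have a deployment gate; that is the workflow the dedicated tools make pleasant at scale.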

Hosting and Runtime

Vercel — For edge functions running inference calls. Works seamlessly with the Vercel AI SDK. Good for low-latency completions in Next.js.

Cloudflare Workers AI — When I need inference close to the user without managing GPU infrastructure. Limited model selection, but improving. Good for simple, high-frequency inference tasks.

Modal — For running custom models, batch jobs, or anything that needs a real GPU. The Python decorator API is elegant. I use this for fine-tuning runs and large batch evaluation jobs.

Replicate — When I need to run a specific open-source model quickly without deploying it myself. Good for prototyping with models that aren’t available via main APIs.

Local Models

Ollama — For running models locally. Dead simple to set up, supports most popular open models (Llama, Mistral, Phi, Gemma). I use this when I need to test something offline, work with private data, or prototype without API costs.
```shell
ollama pull llama3.2
ollama run llama3.2
```
LM Studio — GUI-based local model runner. Good for non-technical team members who want to experiment with local models without touching a terminal.

See Local LLMs Guide for a deeper dive on when local makes sense.

Utilities I Actually Use

Instructor — Structured outputs from LLMs using Pydantic. I use this in Python when I need reliable JSON extraction and don’t want to wrestle with prompt engineering for schema compliance.

Whisper.cpp — On-device audio transcription. Fast, private, free. I use this in products where sending audio to an API isn’t appropriate.

Zod — Not AI-specific, but essential for AI work in TypeScript. I define tool call schemas, output validators, and API input schemas with Zod. It integrates cleanly with the Vercel AI SDK for tool definitions.

PromptLib (my product) — Curated prompt library with organization and guardrails. I built this because I kept losing good prompts across projects.
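What "schema compliance" buys you can be sketched with the stdlib alone: parse the model's JSON output and fail loudly when it doesn't match the expected shape. Instructor does this properly with Pydantic models and automatic retries; the `Invoice` schema below is a made-up example:

```python
# What structured-output validation buys you, sketched with the stdlib.
# Instructor does this properly with Pydantic and automatic retries;
# the Invoice schema and the raw JSON string are illustrative.
import json
from dataclasses import dataclass, fields

@dataclass
class Invoice:
    vendor: str
    total: float

def parse_invoice(raw: str) -> Invoice:
    data = json.loads(raw)
    expected = {f.name for f in fields(Invoice)}
    missing = expected - data.keys()
    if missing:
        # In a real pipeline this is where you'd re-prompt the model.
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Invoice(vendor=str(data["vendor"]), total=float(data["total"]))

invoice = parse_invoice('{"vendor": "Acme", "total": 119.5}')
```

The win is that malformed model output fails at the boundary with a clear error, instead of propagating a half-filled dict through your pipeline.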

What I’ve Stopped Using

LangChain for new projects — The abstraction overhead rarely pays off. I use it when I need a specific integration that would take days to rebuild, not as a foundation.

Custom vector similarity functions — pgvector or a managed service handles this better than any custom implementation I’ve written.

Model-specific prompt formatting by hand — The model SDKs handle this. I used to add \n\nHuman: and \n\nAssistant: markers manually. Don’t do this.
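The replacement for hand-rolled role markers is the structured messages format chat APIs expect; the SDK handles any model-specific serialization underneath. (One caveat: Anthropic's Messages API takes the system prompt as a separate parameter rather than a message; the shape below is the OpenAI-style convention.)

```python
# The structured messages format chat SDKs expect. The SDK serializes
# this per-model, so you never concatenate role markers by hand.
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain pgvector in one sentence."},
]

# The hand-rolled legacy style this replaces (what NOT to do):
legacy = "\n\nHuman: Explain pgvector in one sentence.\n\nAssistant:"

roles = [m["role"] for m in messages]
```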

The Honest Principle

Pick the tool that solves your actual constraint. Latency? Look at edge runtime options. Privacy? Local models or private cloud. Cost at scale? Cheaper models with better prompts often beat expensive models with lazy prompts. The goal isn’t the most sophisticated stack. It’s the simplest stack that ships and runs reliably.