Language is at the center of almost everything I build. Technical documentation, creative writing tools, coaching systems, financial analysis — they all require working with text in ways that go beyond “call the API and hope.” This guide covers what I’ve learned building production NLP systems: when RAG beats fine-tuning beats prompting, what chunking actually needs to look like, and the reality of building multilingual systems when the language isn’t English.

The Decision Framework: Prompting vs RAG vs Fine-Tuning

The most common question I get: which approach for which task?
| Approach | When to use | When to avoid |
|---|---|---|
| Prompting | General tasks, moderate volume, changing requirements | High-volume narrow tasks where consistency is critical |
| RAG | Knowledge-intensive tasks, docs that change, factual grounding needed | Tasks where retrieval quality can’t be controlled |
| Fine-tuning | High-volume narrow tasks, domain-specific patterns, cost reduction at scale | When labeled data doesn’t exist or task changes often |
The most common mistake: jumping to fine-tuning before trying RAG, and jumping to RAG before trying prompting. Start simple. The pipeline complexity should be proportional to the evidence that simpler approaches don’t work.
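As a first pass, the table collapses into a simple heuristic. The sketch below is illustrative only; `suggest_approach` and its flags are my own names, not any library's API:

```python
def suggest_approach(
    high_volume: bool,
    narrow_task: bool,
    needs_fresh_knowledge: bool,
    has_labeled_data: bool,
) -> str:
    """First-pass heuristic mirroring the decision table above."""
    # Knowledge-intensive tasks over changing docs point to RAG
    if needs_fresh_knowledge:
        return "rag"
    # High-volume narrow tasks justify fine-tuning, but only with labels
    if high_volume and narrow_task and has_labeled_data:
        return "fine-tuning"
    # Otherwise: start with the simplest thing that could work
    return "prompting"
```

The ordering encodes the advice above: prompting is the default, and you only move down the list when the evidence demands it.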

The Tools I Use and Why

| Task | Tool | Why |
|---|---|---|
| Embeddings (English) | OpenAI text-embedding-3-small | Good quality/cost ratio; 1536 dimensions; widely supported |
| Embeddings (multilingual) | multilingual-e5-large | Better cross-lingual performance than OpenAI for Telugu |
| Reranking | Cohere Rerank | Significant retrieval quality improvement; worth the latency |
| Vector storage (<1M vectors) | pgvector (Supabase) | Avoids running a separate service; SQL joins work |
| Vector storage (>1M vectors) | Pinecone | Better performance at scale; serverless option |
| RAG orchestration | Vercel AI SDK + LlamaIndex | AI SDK for streaming UI; LlamaIndex for retrieval pipelines |
| Audio → text | Whisper large-v3 | Best accuracy on bilingual Telugu+English content |
| Bilingual processing | Custom normalizer + HuggingFace | Telugu isn’t covered well by off-the-shelf pipelines |
| Text classification | Claude / fine-tuned model | Depends on volume; API for low volume, fine-tuned for high |

Building a Production RAG Pipeline

The standard RAG tutorial misses the steps that actually determine quality. Here’s what the real pipeline looks like:

Step 1: Chunking Strategy

This is the step that makes or breaks retrieval quality. Most tutorials chunk at arbitrary character limits. That produces terrible results.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Wrong: character-based chunking
bad_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

# Better: semantic boundaries with metadata preservation
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

def chunk_document(document: str, metadata: dict) -> list[dict]:
    """
    Chunk at semantic boundaries, not character limits.
    Preserve metadata for filtering and citation.
    """
    splitter = SemanticSplitterNodeParser(
        buffer_size=1,
        breakpoint_percentile_threshold=95,
        embed_model=OpenAIEmbedding()
    )

    # The parser expects Document objects, not raw strings
    nodes = splitter.get_nodes_from_documents([Document(text=document)])

    chunks = []
    for i, node in enumerate(nodes):
        chunks.append({
            "text": node.text,
            "metadata": {
                **metadata,
                "chunk_index": i,
                # extract_section_header is a project helper, not shown here
                "section": extract_section_header(node.text)
            }
        })

    return chunks
```
The principle: chunk at the boundary that matches how users will ask questions. If users ask “what is the refund policy?”, your chunks should contain complete policy statements, not half a sentence from one chunk and the rest from another.
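When the embedding-based splitter is overkill (it makes an API call per document), a dependency-free approximation of the same principle is to chunk at structural boundaries such as headings, so each chunk is a complete section. A sketch, assuming markdown-style documents (`chunk_by_headings` is a hypothetical helper, not a library function):

```python
import re

def chunk_by_headings(markdown: str) -> list[dict]:
    """Split a markdown document at heading boundaries.

    Each chunk is one complete section, so a question like
    "what is the refund policy?" retrieves the whole policy
    section rather than a fragment of it.
    """
    chunks = []
    current_header, current_lines = "Introduction", []
    for line in markdown.splitlines():
        m = re.match(r"^#{1,6}\s+(.*)", line)
        if m:
            # Flush the previous section before starting a new one
            if current_lines:
                chunks.append({
                    "text": "\n".join(current_lines).strip(),
                    "metadata": {"section": current_header},
                })
            current_header, current_lines = m.group(1).strip(), []
        else:
            current_lines.append(line)
    if current_lines:
        chunks.append({
            "text": "\n".join(current_lines).strip(),
            "metadata": {"section": current_header},
        })
    return [c for c in chunks if c["text"]]
```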

Step 2: Embedding and Storage

```python
import json

from openai import OpenAI
import psycopg2

client = OpenAI()

def embed_and_store(chunks: list[dict], collection: str):
    """Batch embed and store with metadata."""
    texts = [chunk["text"] for chunk in chunks]

    # Batch embedding (more efficient than one at a time)
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )

    embeddings = [item.embedding for item in response.data]

    # Store in pgvector; it accepts the '[0.1, 0.2, ...]' text format,
    # so str(embedding) plus a ::vector cast avoids extra dependencies
    with psycopg2.connect(DATABASE_URL) as conn:
        with conn.cursor() as cur:
            for chunk, embedding in zip(chunks, embeddings):
                cur.execute("""
                    INSERT INTO embeddings (collection, text, embedding, metadata)
                    VALUES (%s, %s, %s::vector, %s)
                """, (
                    collection,
                    chunk["text"],
                    str(embedding),
                    json.dumps(chunk["metadata"])
                ))
```
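The `INSERT` above assumes an `embeddings` table already exists. A plausible schema sketch; the column names match the query, but the dimension and index choice are assumptions to tune for your workload:

```sql
-- Schema assumed by the INSERT above (sketch, not a tuned production DDL)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS embeddings (
    id         BIGSERIAL PRIMARY KEY,
    collection TEXT NOT NULL,
    text       TEXT NOT NULL,
    embedding  VECTOR(1536),  -- matches text-embedding-3-small
    metadata   JSONB
);

-- HNSW index for cosine-distance search
CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops);
```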

Step 3: Retrieval with Reranking

The quality improvement from adding a reranker is significant. In my Thinki.sh system, reranking improved user satisfaction ratings from 62% to 81%.
```python
import cohere
from openai import OpenAI

co = cohere.Client(COHERE_API_KEY)
openai_client = OpenAI()

def retrieve_with_rerank(
    query: str,
    collection: str,
    top_k: int = 10,
    top_n: int = 3
) -> list[dict]:
    """
    Two-stage retrieval: broad embedding search → precise reranking.
    """
    # Stage 1: Embedding-based retrieval (top_k candidates)
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    # vector_search: project helper that queries the vector store
    candidates = vector_search(query_embedding, collection, k=top_k)

    # Stage 2: Rerank for semantic precision
    rerank_result = co.rerank(
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=top_n,
        model="rerank-english-v3.0"
    )

    reranked = [
        {
            **candidates[r.index],
            "relevance_score": r.relevance_score
        }
        for r in rerank_result.results
    ]

    return reranked
```
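`vector_search` above is a project helper; against pgvector, stage 1 is a single `ORDER BY embedding <=> %s LIMIT k` query. To make the stage-1 semantics concrete, here is a brute-force in-memory equivalent (hypothetical helper, for illustration only, not what runs in production):

```python
import math

def vector_search_in_memory(
    query_embedding: list[float],
    corpus: list[dict],
    k: int = 10,
) -> list[dict]:
    """Brute-force cosine-similarity search over in-memory chunks.

    Stands in for the vector_search helper above; each corpus entry
    is a dict with at least "text" and "embedding" keys.
    """
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    # Score every chunk, then keep the k most similar
    scored = [
        {**doc, "score": cosine(query_embedding, doc["embedding"])}
        for doc in corpus
    ]
    return sorted(scored, key=lambda d: d["score"], reverse=True)[:k]
```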

Step 4: Context-Grounded Generation

```python
def generate_rag_response(query: str, context_chunks: list[dict]) -> str:
    context = "\n\n---\n\n".join([
        f"Source: {c['metadata'].get('section', 'Unknown')}\n{c['text']}"
        for c in context_chunks
    ])

    prompt = f"""Answer the user's question using ONLY the information in the provided context.

If the context doesn't contain enough information to answer confidently, say so explicitly.
Do not extrapolate or add information not present in the context.

Context:
{context}

Question: {query}

Answer:"""

    # ... LLM call
```
The explicit “do not extrapolate” instruction is the key to grounding. Without it, the model will confidently bridge gaps in the context with plausible-but-wrong information.

Concrete Example: Thinki.sh Knowledge Retrieval

Thinki.sh’s AI coaching layer retrieves relevant mental frameworks based on user problems. The challenge: users describe problems in their own language, which rarely matches framework names. A user says: “I keep being surprised when my projects go over budget.” The relevant frameworks: Pre-Mortem and Second-Order Thinking. Keyword search fails completely here. Semantic search alone gets it partially right. The reranking + grounding approach gets it right:
```
Query embedding: "surprised projects over budget"
Top-10 candidates include: Pre-Mortem, Planning Fallacy, Sunk Cost,
  Inversion, Optimism Bias, Scope Creep, Resource Management...

After Cohere Rerank:
  1. Pre-Mortem (relevance: 0.89)
  2. Planning Fallacy (relevance: 0.82)
  3. Second-Order Thinking (relevance: 0.74)
```
The reranking step added 300ms latency. User satisfaction improved from 62% to 81%. Worth it.

Building Bilingual Systems (Telugu + English)

NLP tooling is built almost entirely around English. Using it for Telugu requires intentional workarounds.

The Challenges

  • Tokenization: Standard tokenizers mangle Telugu script. Subword tokenizers that work well for European languages don’t handle Telugu’s morphological complexity well.
  • Embeddings: OpenAI’s embeddings handle Telugu for semantic similarity but are weaker for fine-grained sentiment and cultural nuance.
  • Code-switching: Modern Telugu speakers freely mix English words. Standard NLP pipelines treat these as unknown tokens.
  • Evaluation: BLEU/ROUGE scores are nearly meaningless for English-Telugu because sentence structures are so different. Human review is the only reliable metric.
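The code-switching problem in particular is tractable with plain Unicode ranges: Telugu script occupies the block U+0C00–U+0C7F. A minimal sketch (`script_spans` is a hypothetical helper) that splits mixed text into script runs, which can then be routed to different normalizers:

```python
def script_spans(text: str) -> list[tuple[str, str]]:
    """Split mixed Telugu/English text into (script, span) runs.

    Telugu occupies Unicode block U+0C00-U+0C7F; everything else
    (Latin letters, digits, punctuation) is grouped as "other".
    """
    def script_of(ch: str) -> str:
        return "telugu" if "\u0c00" <= ch <= "\u0c7f" else "other"

    spans: list[tuple[str, str]] = []
    for ch in text:
        s = script_of(ch)
        if spans and spans[-1][0] == s:
            # Extend the current run
            spans[-1] = (s, spans[-1][1] + ch)
        else:
            # Script changed: start a new run
            spans.append((s, ch))
    return spans
```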

The Normalization Pipeline

```python
import re

# Telugu-specific normalization for mixed content
NORMALIZATIONS = [
    # Standardize common code-switched words
    (r'\bcomputer\b', 'కంప్యూటర్'),
    (r'\bsoftware\b', 'సాఫ్ట్‌వేర్'),
    # Number normalization
    (r'(\d+)\s*%', r'\1 శాతం'),
    # Collapse runs of dandas introduced by OCR errors
    (r'।{2,}', '।'),
]

def normalize_telugu(text: str) -> str:
    """Normalize Telugu text for better model performance."""
    for pattern, replacement in NORMALIZATIONS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text.strip()

def chunk_bilingual_document(text: str) -> list[str]:
    """
    Chunk Telugu+English mixed content at natural language boundaries.
    Telugu sentences end with । (Devanagari danda) or \n
    """
    # Split on sentence boundaries (danda or newline)
    sentences = [s for s in re.split(r'[।\n]+', text) if s.strip()]

    # Regroup into chunks of 3-5 sentences for better context
    chunks = []
    current_chunk = []

    for sentence in sentences:
        current_chunk.append(sentence.strip())
        if len(current_chunk) >= 4:
            chunks.append(' '.join(current_chunk))
            current_chunk = []

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return [c for c in chunks if len(c) > 20]  # Filter tiny fragments
```

Embeddings for Telugu

Testing different embedding models on Telugu semantic similarity:
| Model | Telugu accuracy | Cost | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | Good for similarity | $$ | Handles Telugu, weaker on sentiment |
| multilingual-e5-large | Better for cross-lingual | $ (local) | Good for Telugu-English retrieval |
| paraphrase-multilingual-mpnet | Reasonable | $ (local) | Trained on more Telugu data |
For Nishabdham (pure Telugu creative content), I use multilingual-e5-large locally. For systems where Telugu is mixed with English content, OpenAI embeddings work well enough and the operational simplicity is worth it.

Evaluation for Language Systems

BLEU/ROUGE scores are fast to compute and poorly correlated with what users actually value. I use a layered approach.

Automated:
```python
def automated_eval(outputs: list[dict], test_cases: list[dict]) -> dict:
    # Accumulate per-example scores, then average; a single dict
    # assignment per iteration would overwrite earlier results
    scores = {"fact_coverage": [], "hallucination_rate": [], "format_ok": []}

    for test, output in zip(test_cases, outputs):
        # Check if key facts are present (for RAG)
        scores["fact_coverage"].append(check_facts_covered(
            output["text"], test["required_facts"]
        ))

        # Check if hallucinations are absent
        scores["hallucination_rate"].append(check_hallucination(
            output["text"], output["context"]
        ))

        # Check format compliance
        scores["format_ok"].append(validate_output_format(output["text"]))

    # check_facts_covered, check_hallucination, and validate_output_format
    # are project-specific helpers
    return {k: sum(v) / len(v) for k, v in scores.items()}
```
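The helpers above are project-specific. For concreteness, a minimal version of the fact-coverage check (naive normalized substring matching; a production version needs fuzzy or semantic matching):

```python
def check_facts_covered(text: str, required_facts: list[str]) -> float:
    """Fraction of required facts present in the output.

    Naive case-insensitive, whitespace-normalized substring match;
    a sketch of the helper used above, not the production version.
    """
    if not required_facts:
        return 1.0
    haystack = " ".join(text.lower().split())
    hits = sum(
        1 for fact in required_facts
        if " ".join(fact.lower().split()) in haystack
    )
    return hits / len(required_facts)
```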
LLM-as-judge (for quality):
```python
import json

import anthropic

claude = anthropic.Anthropic()

def llm_judge(query: str, response: str, context: str) -> dict:
    judgment = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,  # required by the Messages API
        messages=[{
            "role": "user",
            "content": f"""Rate this RAG response on a scale of 1-5 for each criterion:

Query: {query}
Context provided: {context}
Response: {response}

Rate:
1. Faithfulness (1-5): Is the response grounded in the context?
2. Relevance (1-5): Does it answer the query?
3. Completeness (1-5): Are all key points covered?

Return JSON with keys: faithfulness (N), relevance (N), completeness (N), reasoning (string)"""
        }]
    )
    return json.loads(judgment.content[0].text)
```
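The bare `json.loads` at the end of `llm_judge` fails whenever the model wraps its JSON in a markdown fence or adds surrounding prose. A defensive parsing sketch (`parse_judge_json` is a hypothetical helper, not part of any SDK):

```python
import json
import re

def parse_judge_json(raw: str) -> dict:
    """Extract the JSON object from an LLM judge response.

    Models often wrap JSON in markdown fences or add prose around it,
    so a bare json.loads on the raw text fails. This pulls out the
    first {...} block before parsing (a sketch, not a full parser).
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if not match:
            raise ValueError(f"No JSON object found in: {raw[:100]}")
        return json.loads(match.group(0))
```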
Human review (for cultural content):

For Telugu content specifically, automated evaluation misses culturally significant errors. I maintain a panel of 3 Telugu speakers who do a monthly review of a random 50-example sample. The errors they catch are qualitatively different from what automated metrics find.

What I Learned the Hard Way

Retrieval quality, not model quality, is usually the bottleneck. My first RAG system had excellent embeddings and a weak retrieval strategy. Switching from GPT-3.5 to GPT-4 didn’t help. Fixing the chunking strategy and adding a reranker did. Diagnose first: is the right information being retrieved, or is the model just failing to use good context?

Users don’t ask questions the way your documents are structured. If your knowledge base is organized by product feature, users will ask by symptom. Design chunking around query patterns, not document structure. The best way to discover these patterns is to log real queries for two weeks before building the retrieval system.

Latency compounds in multi-step pipelines. Embed (50ms) + vector search (30ms) + rerank (300ms) + generate (800ms) is roughly 1.2 seconds before the first token. For chat interfaces this is manageable with streaming. For anything that needs to feel real-time — autocomplete, real-time suggestions — prefetch, cache aggressively, and design the UX to hide latency.

“Handles Telugu” in a model card often means “doesn’t crash on Telugu.” Test on actual representative content from your domain. Telugu NLP quality has improved significantly in recent multilingual models, but the gap with English is real and domain-specific. Build your evaluation set on real content before committing to a model.