Language is at the center of almost everything I build. Technical documentation, creative writing tools, coaching systems, financial analysis — they all require working with text in ways that go beyond “call the API and hope.” This guide covers what I’ve learned building production NLP systems: when RAG beats fine-tuning beats prompting, what chunking actually needs to look like, and the reality of building multilingual systems when the language isn’t English.

The Decision Framework: Prompting vs RAG vs Fine-Tuning

The most common question I get: which approach for which task?
| Approach | When to use | When to avoid |
|---|---|---|
| Prompting | General tasks, moderate volume, changing requirements | High-volume narrow tasks where consistency is critical |
| RAG | Knowledge-intensive tasks, docs that change, factual grounding needed | Tasks where retrieval quality can’t be controlled |
| Fine-tuning | High-volume narrow tasks, domain-specific patterns, cost reduction at scale | When labeled data doesn’t exist or task changes often |
The most common mistake: jumping to fine-tuning before trying RAG, and jumping to RAG before trying prompting. Start simple. The pipeline complexity should be proportional to the evidence that simpler approaches don’t work.
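As a first pass, the table collapses into a simple heuristic. The sketch below is illustrative only; `suggest_approach` and its flags are my own names, not any library's API:

```python
def suggest_approach(
    high_volume: bool,
    narrow_task: bool,
    needs_fresh_knowledge: bool,
    has_labeled_data: bool,
) -> str:
    """First-pass heuristic mirroring the decision table above."""
    # Knowledge-intensive tasks over changing docs point to RAG
    if needs_fresh_knowledge:
        return "rag"
    # High-volume narrow tasks justify fine-tuning, but only with labels
    if high_volume and narrow_task and has_labeled_data:
        return "fine-tuning"
    # Otherwise: start with the simplest thing that could work
    return "prompting"
```

The ordering encodes the advice above: prompting is the default, and you only move down the list when the evidence demands it.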

The Tools I Use and Why

| Task | Tool | Why |
|---|---|---|
| Embeddings (English) | OpenAI text-embedding-3-small | Good quality/cost ratio; 1536 dimensions; widely supported |
| Embeddings (multilingual) | multilingual-e5-large | Better cross-lingual performance than OpenAI for Telugu |
| Reranking | Cohere Rerank | Significant retrieval quality improvement; worth the latency |
| Vector storage (<1M vectors) | pgvector (Supabase) | Avoids running a separate service; SQL joins work |
| Vector storage (>1M vectors) | Pinecone | Better performance at scale; serverless option |
| RAG orchestration | Vercel AI SDK + LlamaIndex | AI SDK for streaming UI; LlamaIndex for retrieval pipelines |
| Audio → text | Whisper large-v3 | Best accuracy on bilingual Telugu+English content |
| Bilingual processing | Custom normalizer + HuggingFace | Telugu isn’t covered well by off-the-shelf pipelines |
| Text classification | Claude / fine-tuned model | Depends on volume; API for low volume, fine-tuned for high |

Building a Production RAG Pipeline

The standard RAG tutorial misses the steps that actually determine quality. Here’s what the real pipeline looks like:

Step 1: Chunking Strategy

This is the step that makes or breaks retrieval quality. Most tutorials chunk at arbitrary character limits. That produces terrible results.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Wrong: character-based chunking
bad_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

# Better: semantic boundaries with metadata preservation
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

def chunk_document(document: str, metadata: dict) -> list[dict]:
    """
    Chunk at semantic boundaries, not character limits.
    Preserve metadata for filtering and citation.
    """
    splitter = SemanticSplitterNodeParser(
        buffer_size=1,
        breakpoint_percentile_threshold=95,
        embed_model=OpenAIEmbedding()
    )

    # The parser expects Document objects, not raw strings
    nodes = splitter.get_nodes_from_documents([Document(text=document)])

    chunks = []
    for i, node in enumerate(nodes):
        chunks.append({
            "text": node.text,
            "metadata": {
                **metadata,
                "chunk_index": i,
                # extract_section_header is a project helper, not shown here
                "section": extract_section_header(node.text)
            }
        })

    return chunks
```
The principle: chunk at the boundary that matches how users will ask questions. If users ask “what is the refund policy?”, your chunks should contain complete policy statements, not half a sentence from one chunk and the rest from another.
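When the embedding-based splitter is overkill (it makes an API call per document), a dependency-free approximation of the same principle is to chunk at structural boundaries such as headings, so each chunk is a complete section. A sketch, assuming markdown-style documents (`chunk_by_headings` is a hypothetical helper, not a library function):

```python
import re

def chunk_by_headings(markdown: str) -> list[dict]:
    """Split a markdown document at heading boundaries.

    Each chunk is one complete section, so a question like
    "what is the refund policy?" retrieves the whole policy
    section rather than a fragment of it.
    """
    chunks = []
    current_header, current_lines = "Introduction", []
    for line in markdown.splitlines():
        m = re.match(r"^#{1,6}\s+(.*)", line)
        if m:
            # Flush the previous section before starting a new one
            if current_lines:
                chunks.append({
                    "text": "\n".join(current_lines).strip(),
                    "metadata": {"section": current_header},
                })
            current_header, current_lines = m.group(1).strip(), []
        else:
            current_lines.append(line)
    if current_lines:
        chunks.append({
            "text": "\n".join(current_lines).strip(),
            "metadata": {"section": current_header},
        })
    return [c for c in chunks if c["text"]]
```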

Step 2: Embedding and Storage

```python
import json

from openai import OpenAI
import psycopg2

client = OpenAI()

def embed_and_store(chunks: list[dict], collection: str):
    """Batch embed and store with metadata."""
    texts = [chunk["text"] for chunk in chunks]

    # Batch embedding (more efficient than one at a time)
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )

    embeddings = [item.embedding for item in response.data]

    # Store in pgvector; it accepts the '[0.1, 0.2, ...]' text format,
    # so str(embedding) plus a ::vector cast avoids extra dependencies
    with psycopg2.connect(DATABASE_URL) as conn:
        with conn.cursor() as cur:
            for chunk, embedding in zip(chunks, embeddings):
                cur.execute("""
                    INSERT INTO embeddings (collection, text, embedding, metadata)
                    VALUES (%s, %s, %s::vector, %s)
                """, (
                    collection,
                    chunk["text"],
                    str(embedding),
                    json.dumps(chunk["metadata"])
                ))
```
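The `INSERT` above assumes an `embeddings` table already exists. A plausible schema sketch; the column names match the query, but the dimension and index choice are assumptions to tune for your workload:

```sql
-- Schema assumed by the INSERT above (sketch, not a tuned production DDL)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS embeddings (
    id         BIGSERIAL PRIMARY KEY,
    collection TEXT NOT NULL,
    text       TEXT NOT NULL,
    embedding  VECTOR(1536),  -- matches text-embedding-3-small
    metadata   JSONB
);

-- HNSW index for cosine-distance search
CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops);
```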

Step 3: Retrieval with Reranking

The quality improvement from adding a reranker is significant. In my Thinki.sh system, reranking improved user satisfaction ratings from 62% to 81%.
```python
import cohere
from openai import OpenAI

co = cohere.Client(COHERE_API_KEY)
openai_client = OpenAI()

def retrieve_with_rerank(
    query: str,
    collection: str,
    top_k: int = 10,
    top_n: int = 3
) -> list[dict]:
    """
    Two-stage retrieval: broad embedding search → precise reranking.
    """
    # Stage 1: Embedding-based retrieval (top_k candidates)
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    # vector_search: project helper that queries the vector store
    candidates = vector_search(query_embedding, collection, k=top_k)

    # Stage 2: Rerank for semantic precision
    rerank_result = co.rerank(
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=top_n,
        model="rerank-english-v3.0"
    )

    reranked = [
        {
            **candidates[r.index],
            "relevance_score": r.relevance_score
        }
        for r in rerank_result.results
    ]

    return reranked
```
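`vector_search` above is a project helper; against pgvector, stage 1 is a single `ORDER BY embedding <=> %s LIMIT k` query. To make the stage-1 semantics concrete, here is a brute-force in-memory equivalent (hypothetical helper, for illustration only, not what runs in production):

```python
import math

def vector_search_in_memory(
    query_embedding: list[float],
    corpus: list[dict],
    k: int = 10,
) -> list[dict]:
    """Brute-force cosine-similarity search over in-memory chunks.

    Stands in for the vector_search helper above; each corpus entry
    is a dict with at least "text" and "embedding" keys.
    """
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    # Score every chunk, then keep the k most similar
    scored = [
        {**doc, "score": cosine(query_embedding, doc["embedding"])}
        for doc in corpus
    ]
    return sorted(scored, key=lambda d: d["score"], reverse=True)[:k]
```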

Step 4: Context-Grounded Generation

```python
def generate_rag_response(query: str, context_chunks: list[dict]) -> str:
    context = "\n\n---\n\n".join([
        f"Source: {c['metadata'].get('section', 'Unknown')}\n{c['text']}"
        for c in context_chunks
    ])

    prompt = f"""Answer the user's question using ONLY the information in the provided context.

If the context doesn't contain enough information to answer confidently, say so explicitly.
Do not extrapolate or add information not present in the context.

Context:
{context}

Question: {query}

Answer:"""

    # ... LLM call
```
The explicit “do not extrapolate” instruction is the key to grounding. Without it, the model will confidently bridge gaps in the context with plausible-but-wrong information.

Concrete Example: Thinki.sh Knowledge Retrieval

Thinki.sh’s AI coaching layer retrieves relevant mental frameworks based on user problems. The challenge: users describe problems in their own language, which rarely matches framework names. A user says: “I keep being surprised when my projects go over budget.” The relevant frameworks: Pre-Mortem and Second-Order Thinking. Keyword search fails completely here. Semantic search alone gets it partially right. The reranking + grounding approach gets it right:
```
Query embedding: "surprised projects over budget"
Top-10 candidates include: Pre-Mortem, Planning Fallacy, Sunk Cost,
  Inversion, Optimism Bias, Scope Creep, Resource Management...

After Cohere Rerank:
  1. Pre-Mortem (relevance: 0.89)
  2. Planning Fallacy (relevance: 0.82)
  3. Second-Order Thinking (relevance: 0.74)
```
The reranking step added 300ms latency. User satisfaction improved from 62% to 81%. Worth it.

Building Bilingual Systems (Telugu + English)

NLP tooling is built almost entirely around English. Using it for Telugu requires intentional workarounds.

The Challenges

  • Tokenization: Standard tokenizers mangle Telugu script. Subword tokenizers that work well for European languages don’t handle Telugu’s morphological complexity well.
  • Embeddings: OpenAI’s embeddings handle Telugu for semantic similarity but are weaker for fine-grained sentiment and cultural nuance.
  • Code-switching: Modern Telugu speakers freely mix English words. Standard NLP pipelines treat these as unknown tokens.
  • Evaluation: BLEU/ROUGE scores are nearly meaningless for English-Telugu because sentence structures are so different. Human review is the only reliable metric.
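The code-switching problem in particular is tractable with plain Unicode ranges: Telugu script occupies the block U+0C00–U+0C7F. A minimal sketch (`script_spans` is a hypothetical helper) that splits mixed text into script runs, which can then be routed to different normalizers:

```python
def script_spans(text: str) -> list[tuple[str, str]]:
    """Split mixed Telugu/English text into (script, span) runs.

    Telugu occupies Unicode block U+0C00-U+0C7F; everything else
    (Latin letters, digits, punctuation) is grouped as "other".
    """
    def script_of(ch: str) -> str:
        return "telugu" if "\u0c00" <= ch <= "\u0c7f" else "other"

    spans: list[tuple[str, str]] = []
    for ch in text:
        s = script_of(ch)
        if spans and spans[-1][0] == s:
            # Extend the current run
            spans[-1] = (s, spans[-1][1] + ch)
        else:
            # Script changed: start a new run
            spans.append((s, ch))
    return spans
```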

The Normalization Pipeline

```python
import re

# Telugu-specific normalization for mixed content
NORMALIZATIONS = [
    # Standardize common code-switched words
    (r'\bcomputer\b', 'కంప్యూటర్'),
    (r'\bsoftware\b', 'సాఫ్ట్‌వేర్'),
    # Number normalization
    (r'(\d+)\s*%', r'\1 శాతం'),
    # Collapse runs of dandas introduced by OCR errors
    (r'।{2,}', '।'),
]

def normalize_telugu(text: str) -> str:
    """Normalize Telugu text for better model performance."""
    for pattern, replacement in NORMALIZATIONS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text.strip()

def chunk_bilingual_document(text: str) -> list[str]:
    """
    Chunk Telugu+English mixed content at natural language boundaries.
    Telugu sentences end with । (Devanagari danda) or \n
    """
    # Split on sentence boundaries (danda or newline)
    sentences = [s for s in re.split(r'[।\n]+', text) if s.strip()]

    # Regroup into chunks of 3-5 sentences for better context
    chunks = []
    current_chunk = []

    for sentence in sentences:
        current_chunk.append(sentence.strip())
        if len(current_chunk) >= 4:
            chunks.append(' '.join(current_chunk))
            current_chunk = []

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return [c for c in chunks if len(c) > 20]  # Filter tiny fragments
```

Embeddings for Telugu

Testing different embedding models on Telugu semantic similarity:
| Model | Telugu accuracy | Cost | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | Good for similarity | $$ | Handles Telugu, weaker on sentiment |
| multilingual-e5-large | Better for cross-lingual | $ (local) | Good for Telugu-English retrieval |
| paraphrase-multilingual-mpnet | Reasonable | $ (local) | Trained on more Telugu data |
For Nishabdham (pure Telugu creative content), I use multilingual-e5-large locally. For systems where Telugu is mixed with English content, OpenAI embeddings work well enough and the operational simplicity is worth it.

Evaluation for Language Systems

BLEU/ROUGE scores are fast to compute and poorly correlated with what users actually value. I use a layered approach.

Automated:
```python
def automated_eval(outputs: list[dict], test_cases: list[dict]) -> dict:
    # Accumulate per-example scores, then average; a single dict
    # assignment per iteration would overwrite earlier results
    scores = {"fact_coverage": [], "hallucination_rate": [], "format_ok": []}

    for test, output in zip(test_cases, outputs):
        # Check if key facts are present (for RAG)
        scores["fact_coverage"].append(check_facts_covered(
            output["text"], test["required_facts"]
        ))

        # Check if hallucinations are absent
        scores["hallucination_rate"].append(check_hallucination(
            output["text"], output["context"]
        ))

        # Check format compliance
        scores["format_ok"].append(validate_output_format(output["text"]))

    # check_facts_covered, check_hallucination, and validate_output_format
    # are project-specific helpers
    return {k: sum(v) / len(v) for k, v in scores.items()}
```
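The helpers above are project-specific. For concreteness, a minimal version of the fact-coverage check (naive normalized substring matching; a production version needs fuzzy or semantic matching):

```python
def check_facts_covered(text: str, required_facts: list[str]) -> float:
    """Fraction of required facts present in the output.

    Naive case-insensitive, whitespace-normalized substring match;
    a sketch of the helper used above, not the production version.
    """
    if not required_facts:
        return 1.0
    haystack = " ".join(text.lower().split())
    hits = sum(
        1 for fact in required_facts
        if " ".join(fact.lower().split()) in haystack
    )
    return hits / len(required_facts)
```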
LLM-as-judge (for quality):
```python
import json

import anthropic

claude = anthropic.Anthropic()

def llm_judge(query: str, response: str, context: str) -> dict:
    judgment = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,  # required by the Messages API
        messages=[{
            "role": "user",
            "content": f"""Rate this RAG response on a scale of 1-5 for each criterion:

Query: {query}
Context provided: {context}
Response: {response}

Rate:
1. Faithfulness (1-5): Is the response grounded in the context?
2. Relevance (1-5): Does it answer the query?
3. Completeness (1-5): Are all key points covered?

Return JSON with keys: faithfulness (N), relevance (N), completeness (N), reasoning (string)"""
        }]
    )
    return json.loads(judgment.content[0].text)
```
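The bare `json.loads` at the end of `llm_judge` fails whenever the model wraps its JSON in a markdown fence or adds surrounding prose. A defensive parsing sketch (`parse_judge_json` is a hypothetical helper, not part of any SDK):

```python
import json
import re

def parse_judge_json(raw: str) -> dict:
    """Extract the JSON object from an LLM judge response.

    Models often wrap JSON in markdown fences or add prose around it,
    so a bare json.loads on the raw text fails. This pulls out the
    first {...} block before parsing (a sketch, not a full parser).
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if not match:
            raise ValueError(f"No JSON object found in: {raw[:100]}")
        return json.loads(match.group(0))
```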
Human review (for cultural content):

For Telugu content specifically, automated evaluation misses culturally significant errors. I maintain a panel of 3 Telugu speakers who do a monthly review of a random 50-example sample. The errors they catch are qualitatively different from what automated metrics find.

What I Learned the Hard Way

Retrieval quality, not model quality, is usually the bottleneck. My first RAG system had excellent embeddings and a weak retrieval strategy. Switching from GPT-3.5 to GPT-4 didn’t help. Fixing the chunking strategy and adding a reranker did. Diagnose first: is the right information being retrieved, or is the model just failing to use good context?

Users don’t ask questions the way your documents are structured. If your knowledge base is organized by product feature, users will ask by symptom. Design chunking around query patterns, not document structure. The best way to discover these patterns is to log real queries for two weeks before building the retrieval system.

Latency compounds in multi-step pipelines. Embed (50ms) + vector search (30ms) + rerank (300ms) + generate (800ms) is roughly 1.2 seconds before the first token. For chat interfaces this is manageable with streaming. For anything that needs to feel real-time — autocomplete, real-time suggestions — prefetch, cache aggressively, and design the UX to hide latency.

“Handles Telugu” in a model card often means “doesn’t crash on Telugu.” Test on actual representative content from your domain. Telugu NLP quality has improved significantly in recent multilingual models, but the gap with English is real and domain-specific. Build your evaluation set on real content before committing to a model.