RAG Development

RAG that answers correctly — and proves it.

From document chunking strategy to reranking to citation-grounded outputs, we build RAG systems engineered to the quality bar your business actually needs — with the eval harness to prove it.

Start a RAG Project · Talk to an AI Architect

Pinecone · Weaviate · Qdrant · pgvector · Cohere Rerank · Ragas · Eval-first delivery

Most RAG systems hallucinate. Often.

The default RAG tutorial works on a demo dataset. Build that same pipeline against your real corpus — messy PDFs, tables, scanned documents, multi-language content — and answer quality drops below the bar your business needs.

Chunks split a sentence in half. The embedding model doesn't understand your domain vocabulary. Top-5 retrieval misses the document with the answer. The LLM confabulates a citation that doesn't exist. And no one has built a golden dataset to even measure the regression. Production RAG is not a vibe — it's a measured system.

The eight-stage RAG pipeline we deploy.

The difference between a 60%-accurate RAG and a 90%-accurate RAG is rarely the LLM — it's how each stage is tuned.

Document parsing

Layout-aware parsers (Unstructured, Docling, Textract) handle real-world PDFs, tables, and scans.

Chunking

Structure-aware, semantic chunking — not naive fixed-size splitting.
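As a rough illustration of what "not naive fixed-size splitting" can mean, the sketch below packs whole sentences into chunks so that no sentence is cut in half. The function name `chunk_by_sentence`, the regex sentence splitter, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for this example; a production chunker would use a real tokenizer and the document's layout.

```python
import re


def chunk_by_sentence(text: str, max_tokens: int = 512) -> list[str]:
    """Pack whole sentences into chunks of at most max_tokens words.

    Words split on whitespace stand in for tokens (an assumption).
    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Refunds take five days. Contact support for help. Returns are free."
for chunk in chunk_by_sentence(sample, max_tokens=8):
    print(chunk)
```

With a budget of 8 words, the first two sentences share a chunk and the third starts a new one; no chunk boundary falls inside a sentence.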

Embedding

Benchmarked on your data: text-embedding-3, Cohere, BGE-M3, Voyage, etc.

Vector store

pgvector, Pinecone, Weaviate, Qdrant, Milvus — chosen for your scale and stack.

Hybrid retrieval

Dense + BM25 sparse retrieval, combined via reciprocal rank fusion.
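Reciprocal rank fusion is simple enough to sketch in a few lines. The version below assumes each retriever returns an ordered list of document IDs (our own simplification) and uses the commonly cited constant k = 60; both the function signature and the sample IDs are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["a", "b", "c"]   # hypothetical dense-retrieval order
sparse = ["b", "d", "a"]  # hypothetical BM25 order
print(reciprocal_rank_fusion(dense, sparse))  # -> ['b', 'a', 'd', 'c']
```

Because the fusion uses ranks and ignores raw scores, dense similarity values and BM25 scores never need to be put on a common scale.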

Reranking

A cross-encoder or late-interaction reranker (Cohere Rerank, BGE reranker, ColBERT) can lift precision@3 by 15–30 points.

Citation-grounded generation

Strict prompt + structured output: answer only from context, cite the source, refuse if unsupported.
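One way to enforce "cite the source, refuse if unsupported" is to validate the model's structured output before anything reaches the user. The sketch below is a minimal example under stated assumptions: the JSON shape (`claims`, `text`, `source_id`), the `Claim` and `Answer` classes, and `parse_answer` are hypothetical names chosen for illustration, not a fixed schema.

```python
import json
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    source_id: str  # expected to match the ID of a retrieved chunk


@dataclass
class Answer:
    claims: list[Claim]
    refused: bool = False


def parse_answer(raw_json: str, retrieved_ids: set[str]) -> Answer:
    """Parse model output; refuse unless every claim cites a retrieved chunk."""
    try:
        payload = json.loads(raw_json)
        claims = [Claim(str(c["text"]), str(c["source_id"])) for c in payload["claims"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed output is treated as a refusal rather than passed through.
        return Answer(claims=[], refused=True)
    if not claims or any(c.source_id not in retrieved_ids for c in claims):
        return Answer(claims=[], refused=True)
    return Answer(claims=claims)


raw = '{"claims": [{"text": "Refunds take 5-7 business days.", "source_id": "policy.pdf#p4"}]}'
print(parse_answer(raw, {"policy.pdf#p4"}).refused)  # cited chunk was retrieved
print(parse_answer(raw, {"faq.pdf#p1"}).refused)     # cited chunk was not retrieved
```

This checks only that a citation points at something that was actually retrieved; whether the quoted text appears in that source is a separate verification step.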

Guardrails + eval

Citation verification, factuality scoring, continuous eval against a golden dataset.

Show, don't tell

What grounded retrieval actually looks like.

Hybrid retrieval, a reranker, and a citation-strict prompt — the same shape we ship to production.

rag_pipeline.py

```python
def answer(question: str) -> Answer:
    # 1. Hybrid retrieval: dense + sparse, fused
    dense = vector_store.search(embed(question), k=40)
    sparse = bm25.search(question, k=40)
    candidates = reciprocal_rank_fusion(dense, sparse)

    # 2. Rerank: the single biggest precision lever
    top = reranker.rank(question, candidates)[:6]

    # 3. Generate strictly from retrieved context
    return llm.generate(
        system="Answer ONLY from context. Cite sources. "
               "If unsupported, say you don't know.",
        context=top,
        question=question,
    )
```

Response

“Refunds are issued within 5–7 business days to the original payment method.” [policy.pdf · p.4]

confidence: 0.94 · citations verified: 1/1

Every claim is grounded in a retrieved source and the citation is verified against the document before the answer is returned.
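The "citations verified: 1/1" figure above implies a check along these lines. This is a minimal sketch, assuming citations arrive as (quote, source ID) pairs and that sources are available as plain text keyed by ID; the names `verify_citations` and `_normalize` are illustrative, and exact substring matching is the simplest possible criterion.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(
    citations: list[tuple[str, str]], sources: dict[str, str]
) -> tuple[int, int]:
    """Return (verified, total) for a list of (quote, source_id) pairs."""
    verified = 0
    for quote, source_id in citations:
        needle = _normalize(quote)
        # An empty quote or an unknown source never counts as verified.
        if needle and source_id in sources and needle in _normalize(sources[source_id]):
            verified += 1
    return verified, len(citations)


sources = {
    "policy.pdf#p4": "Refunds are issued within 5–7 business days\nto the original payment method."
}
citations = [
    ("Refunds are issued within 5–7 business days to the original payment method.", "policy.pdf#p4")
]
print(verify_citations(citations, sources))  # -> (1, 1)
```

An answer whose verified count is lower than its total can be blocked or regenerated instead of returned.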

Eval-first delivery

We measure quality before we ship it.

The difference between a demo and a production RAG system is a golden dataset and an automated eval that runs on every change.

We build that harness first — so prompt tweaks, model swaps, and chunking changes are measured against real questions, and a regression blocks the deploy instead of reaching your users.
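A regression gate of this kind can be very small. The sketch below is illustrative only: it scores answers by checking that an expected phrase appears in the output, which is a deliberately crude stand-in for a real eval metric, and the names `accuracy`, `passes_gate`, and `stub_pipeline`, the baseline, and the tolerance are all assumptions.

```python
from typing import Callable


def accuracy(answer_fn: Callable[[str], str], golden: list[tuple[str, str]]) -> float:
    """Fraction of golden questions whose answer contains the expected text."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(
        1 for question, expected in golden
        if expected.lower() in answer_fn(question).lower()
    )
    return hits / len(golden)


def passes_gate(score: float, baseline: float, tolerance: float = 0.01) -> bool:
    """Pass only if the score is no more than tolerance below the baseline."""
    return score >= baseline - tolerance


# Hypothetical golden dataset and a stub standing in for the real pipeline.
golden = [("How long do refunds take?", "5-7 business days")]


def stub_pipeline(question: str) -> str:
    return "Refunds are issued within 5-7 business days."


score = accuracy(stub_pipeline, golden)
print(score, passes_gate(score, baseline=0.90))
```

In CI, a failed gate would typically translate into a non-zero exit code so the deploy step never runs.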

See how we engage

Six failure modes that kill RAG quality.

Failure mode	What we do about it
Chunks split a sentence or table	Layout-aware parsing; recursive splitting; tables as structured chunks
Right document isn't in top-k	Hybrid (dense + BM25) + RRF; query rewriting; higher initial-k + reranking
No reranker in the pipeline	Always deploy a reranker for production; measure precision@3 with vs. without
LLM hallucinates beyond context	Strict 'answer only from context' prompt; structured output; per-claim citations
Citations point to nothing	Citation verification — every cited quote checked against the source
System slowly degrades	Golden dataset + automated eval on every change; regression blocks deploy
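Measuring precision@3 with and without a reranker, as the table suggests, needs only a small helper. The sketch below assumes results are ranked lists of document IDs and that the relevant IDs for each query are known from the golden dataset; the IDs shown are toy values for illustration, not measured results.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 3) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


relevant = {"d1", "d4"}                      # toy ground truth for one query
without_reranker = ["d7", "d1", "d9", "d4"]  # toy retrieval order
with_reranker = ["d1", "d4", "d7", "d9"]     # toy order after reranking

print(precision_at_k(without_reranker, relevant))  # 1 of the top 3 is relevant
print(precision_at_k(with_reranker, relevant))     # 2 of the top 3 are relevant
```

Averaging this value over every question in the golden dataset gives the number to compare across pipeline variants.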

How RAG projects engage with us.

Discovery & Architecture

1–2 weeks

from $22,000

Corpus audit + embedding benchmark
Vector DB recommendation + TCO
Golden-dataset starter

Start Discovery

Production RAG Build

6–10 weeks

from $65,000

Parsing → chunking → embedding → retrieval
Hybrid + reranking + citation grounding
Full eval harness + observability + 30-day support

Start a Build

RAG Operations

monthly

from $9,500/mo

Weekly eval review + dataset expansion
Embedding-model migrations
Chunking iterations on real queries

Discuss Operations

Common questions.

When is RAG the wrong answer?

When the answers aren't in your documents. RAG retrieves and grounds; it doesn't reason about content that isn't in the source corpus. We pair it with agents or frontier models when needed.

How big should my chunks be?

No universal answer. 256–512 tokens with 50-token overlap is a reasonable start; production chunking is layout-aware and semantic. We test 3–4 strategies and pick the one that maximizes precision@3.
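For reference, the fixed-size starting point described above looks roughly like this. It is a baseline sketch, assuming the text has already been tokenized into a list (whitespace words in the example, which is our simplification); `window_chunks` is an illustrative name.

```python
def window_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the input.
        if start + size >= len(tokens):
            break
    return chunks


tokens = "one two three four five six seven eight nine ten".split()
for chunk in window_chunks(tokens, size=4, overlap=1):
    print(chunk)
```

Each window starts with the last token of the previous one, so text that straddles a boundary still appears intact in at least one chunk when it is shorter than the overlap.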

Which vector database?

pgvector if you're on Postgres; Pinecone for zero ops; Qdrant or Weaviate for self-hosted hybrid; Milvus or Turbopuffer at billion-vector scale.

Do we need a reranker?

For production, yes, almost always. Reranking is the single highest-impact lever after chunking.

Hybrid vs. pure-vector search?

Hybrid wins on real-world corpora. Pure dense misses exact-term queries (codes, identifiers, names). We default to hybrid.

What about hallucinated citations?

Structured output with per-claim citations, then a verification step confirming each cited quote actually appears in the cited source.

Bring us your corpus.

A 30-minute call: document mix, query patterns, quality bar, latency target. We'll tell you honestly whether RAG is the right pattern and what kind of build it would take.

Start a RAG Project · Book a 30-min Architecture Call