DEV.co
RAG Development

RAG that answers correctly — and proves it.

From document chunking strategy to reranking to citation-grounded outputs, we build RAG systems engineered to the quality bar your business actually needs — with the eval harness to prove it.

Pinecone · Weaviate · Qdrant · pgvector · Cohere Rerank · Ragas · Eval-first delivery

Most RAG systems hallucinate. Often.

The default RAG tutorial works on a demo dataset. Build that same pipeline against your real corpus — messy PDFs, tables, scanned documents, multi-language content — and answer quality drops below the bar your business needs.

Chunks split a sentence in half. The embedding model doesn't understand your domain vocabulary. Top-5 retrieval misses the document with the answer. The LLM confabulates a citation that doesn't exist. And no one has built a golden dataset to even measure the regression. Production RAG is not a vibe — it's a measured system.

The eight-stage RAG pipeline we deploy.

The difference between a 60%-accurate RAG and a 90%-accurate RAG is rarely the LLM — it's how each stage is tuned.

01 · Document parsing

Layout-aware parsers (Unstructured, Docling, Textract) handle real-world PDFs, tables, and scans.

02 · Chunking

Structure-aware, semantic chunking — not naive fixed-size splitting.
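
As a rough illustration of structure-aware splitting, the sketch below packs whole paragraphs into chunks under a size budget instead of cutting at fixed offsets. The function name `chunk_paragraphs`, the whitespace word count standing in for tokens, and the default budget are all assumptions made for this example; a real pipeline would count tokens with the embedding model's tokenizer and would also respect headings and tables.

```python
def chunk_paragraphs(text: str, max_tokens: int = 512) -> list[str]:
    """Pack whole paragraphs into chunks, aiming for at most max_tokens each.

    A single paragraph longer than the budget is kept whole rather than cut.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())  # assumption: whitespace words approximate tokens
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Toy input: three paragraphs, budget of five "tokens".
chunks = chunk_paragraphs("a b c\n\nd e\n\nf g h i", max_tokens=5)
```

Because boundaries fall only between paragraphs, no chunk starts or ends mid-sentence, which is the failure the naive fixed-size splitter produces.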

03 · Embedding

Benchmarked on your data: OpenAI text-embedding-3, Cohere Embed, BGE-M3, Voyage, and others.

04 · Vector store

pgvector, Pinecone, Weaviate, Qdrant, Milvus — chosen for your scale and stack.

05 · Hybrid retrieval

Dense + BM25 sparse retrieval, combined via reciprocal rank fusion.
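
Reciprocal rank fusion itself is only a few lines. This is a minimal sketch, assuming each retriever returns document IDs ordered best-first. The constant `k = 60` is the conventional default for RRF and is an assumption here, as are the document IDs in the toy example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each list is ordered best-first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense-retrieval order
sparse = ["doc_b", "doc_d", "doc_a"]  # hypothetical BM25 order
fused = reciprocal_rank_fusion(dense, sparse)
```

RRF uses only rank positions, so dense similarity scores and BM25 scores never need to be put on a common scale.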

06 · Reranking

A reranking model (cross-encoders such as Cohere Rerank or BGE Reranker, or a late-interaction model such as ColBERT) lifts precision@3 by 15–30 points.
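
For readers unfamiliar with the metric, precision@3 is the share of the top three results that are actually relevant. A small sketch of measuring it before and after reranking follows; the document IDs and relevance labels are toy values invented for illustration, not measured results.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 3) -> float:
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k


relevant = {"doc_a", "doc_d"}  # hypothetical labels from a golden dataset

# The same four candidates, in retrieval order and in reranked order.
before = precision_at_k(["doc_b", "doc_a", "doc_c", "doc_d"], relevant)
after = precision_at_k(["doc_a", "doc_d", "doc_b", "doc_c"], relevant)
```

Averaging this over every question in a labeled set, with and without the reranker, gives the comparison in points.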

07 · Citation-grounded generation

Strict prompt + structured output: answer only from context, cite the source, refuse if unsupported.
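
One way to make "cite the source, refuse if unsupported" enforceable is to require the model to return JSON and reject anything without citations. The sketch below assumes a hypothetical output schema (`refused`, `claims`, `text`, `source`, `quote`); these field names are illustrative and not taken from any particular library.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str    # the statement made in the answer
    source: str  # identifier of the cited document
    quote: str   # verbatim supporting span from the retrieved context


def parse_answer(raw: str) -> Optional[list[Claim]]:
    """Parse the model's JSON output; return None when it refuses or cites nothing."""
    data = json.loads(raw)
    if data.get("refused") or not data.get("claims"):
        return None
    return [Claim(c["text"], c["source"], c["quote"]) for c in data["claims"]]


raw = (
    '{"refused": false, "claims": [{"text": "Refunds go back to the original '
    'payment method.", "source": "policy.pdf", '
    '"quote": "to the original payment method"}]}'
)
claims = parse_answer(raw)
```

Treating an empty claim list the same as a refusal means an uncited answer never reaches the user.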

08 · Guardrails + eval

Citation verification, factuality scoring, continuous eval against a golden dataset.

Show, don't tell

What grounded retrieval actually looks like.

Hybrid retrieval, a reranker, and a citation-strict prompt — the same shape we ship to production.

rag_pipeline.py (Python)

def answer(question: str) -> Answer:
    # 1. Hybrid retrieval: dense + sparse, fused
    dense = vector_store.search(embed(question), k=40)
    sparse = bm25.search(question, k=40)
    candidates = reciprocal_rank_fusion(dense, sparse)
    # 2. Rerank: the single biggest precision lever
    top = reranker.rank(question, candidates)[:6]
    # 3. Generate strictly from retrieved context
    return llm.generate(
        system="Answer ONLY from context. Cite sources. "
               "If unsupported, say you don't know.",
        context=top,
        question=question,
    )

Response
“Refunds are issued within 5–7 business days to the
original payment method.” [policy.pdf · p.4]
confidence: 0.94 · citations verified: 1/1

Every claim is grounded in a retrieved source and the citation is verified against the document before the answer is returned.
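
A verification step of this kind can be sketched as a substring check over normalized text. The function names and the sample policy text below are assumptions for illustration; exact-substring matching is also an assumption, and OCR-heavy corpora may need fuzzy matching instead.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_supported(quote: str, source_text: str) -> bool:
    """A quote is supported if it is non-empty and appears in the source text."""
    q = normalize(quote)
    return bool(q) and q in normalize(source_text)


def verify_citations(
    citations: list[tuple[str, str]], sources: dict[str, str]
) -> tuple[int, int]:
    """Return (verified, total) for a list of (source_id, quote) pairs."""
    verified = sum(
        1 for source_id, quote in citations
        if is_supported(quote, sources.get(source_id, ""))
    )
    return verified, len(citations)


# Hypothetical corpus and citations.
sources = {
    "policy.pdf": "Refunds are issued within 5-7 business days\nto the original payment method."
}
citations = [
    ("policy.pdf", "issued within 5-7 business days to the original payment method"),
    ("policy.pdf", "Refunds are issued within 24 hours"),
    ("missing.pdf", "anything"),
]
result = verify_citations(citations, sources)
```

Only the first citation passes: the second quote does not appear in the document and the third points at a source that does not exist, so an answer built on them would be held back.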

Eval-first delivery

We measure quality before we ship it.

The difference between a demo and a production RAG system is a golden dataset and an automated eval that runs on every change.

We build that harness first — so prompt tweaks, model swaps, and chunking changes are measured against real questions, and a regression blocks the deploy instead of reaching your users.
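
The gating logic can be sketched in a few lines. The golden-set format, the top-k hit-rate metric, the tolerance, and the stand-in retriever are all assumptions for this example; a real harness would typically score answers as well as retrieval.

```python
from typing import Callable

# Hypothetical golden-set format: (question, ID of the document that answers it).
GoldenSet = list[tuple[str, str]]


def hit_rate(golden: GoldenSet, retrieve: Callable[[str], list[str]], k: int = 3) -> float:
    """Share of golden questions whose expected source appears in the top-k results."""
    hits = sum(1 for question, expected in golden if expected in retrieve(question)[:k])
    return hits / len(golden)


def gate(score: float, baseline: float, tolerance: float = 0.01) -> bool:
    """Pass only if the score has not dropped below the baseline by more than tolerance."""
    return score >= baseline - tolerance


golden = [("refund window?", "policy.pdf"), ("shipping cost?", "shipping.pdf")]


def fake_retrieve(question: str) -> list[str]:
    """Stand-in retriever that always returns the same documents."""
    return ["policy.pdf", "faq.pdf"]


score = hit_rate(golden, fake_retrieve)
passed = gate(score, baseline=0.9)
```

In a CI job, a failing gate would exit non-zero so the change cannot be deployed until the regression is fixed or the baseline is deliberately updated.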

Six failure modes that kill RAG quality.

| Failure mode | What we do about it |
| --- | --- |
| Chunks split a sentence or table | Layout-aware parsing; recursive splitting; tables as structured chunks |
| Right document isn't in top-k | Hybrid (dense + BM25) + RRF; query rewriting; higher initial-k + reranking |
| Reranker isn't there | Always deploy a reranker for production; measure precision@3 with vs. without |
| LLM hallucinates beyond context | Strict 'answer only from context' prompt; structured output; per-claim citations |
| Citations point to nothing | Citation verification: every cited quote checked against the source |
| System slowly degrades | Golden dataset + automated eval on every change; regression blocks deploy |

How RAG projects engage with us.

Discovery & Architecture · 1–2 weeks · from $22,000
  • Corpus audit + embedding benchmark
  • Vector DB recommendation + TCO
  • Golden-dataset starter

Production RAG Build · 6–10 weeks · from $65,000
  • Parsing → chunking → embedding → retrieval
  • Hybrid + reranking + citation grounding
  • Full eval harness + observability + 30-day support

RAG Operations · monthly · from $9,500/mo
  • Weekly eval review + dataset expansion
  • Embedding-model migrations
  • Chunking iterations on real queries

Common questions.

When is RAG the wrong answer?
When the answers aren't in your documents. RAG retrieves and grounds; it doesn't reason about content that isn't in the source corpus. We pair it with agents or frontier models when needed.
How big should my chunks be?
No universal answer. 256–512 tokens with 50-token overlap is a reasonable start; production chunking is layout-aware and semantic. We test 3–4 strategies and pick the one that maximizes precision@3.
Which vector database?
pgvector if you're on Postgres; Pinecone for zero ops; Qdrant or Weaviate for self-hosted hybrid; Milvus or Turbopuffer at billion-vector scale.
Do we need a reranker?
For production, yes, almost always. Reranking is the single highest-impact lever after chunking.
Hybrid vs. pure-vector search?
Hybrid wins on real-world corpora. Pure dense misses exact-term queries (codes, identifiers, names). We default to hybrid.
What about hallucinated citations?
Structured output with per-claim citations, then a verification step confirming each cited quote actually appears in the cited source.

Bring us your corpus.

A 30-minute call: document mix, query patterns, quality bar, latency target. We'll tell you honestly whether RAG is the right pattern and what kind of build it would take.