RAG Evaluation
In one line
RAG evaluation is the practice of measuring a retrieval-augmented system's quality across both stages — retrieval and generation — so you can see why an answer went wrong, not just that it did.
Going deeper
RAG evaluation breaks the quality question into stage-specific metrics rather than a single thumbs-up/down. Typical signals are context precision and context recall on the retrieval side (how much of what was retrieved is relevant, and how much of what is relevant was retrieved), and faithfulness and answer relevance on the generation side (is every claim in the answer supported by the retrieved sources, and does the answer actually address the query?). Ragas, TruLens and DeepEval are common open-source toolkits that standardise these.
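To make the retrieval-side metrics concrete, here is a minimal sketch of set-based context precision and recall. It assumes you already have hand-labelled relevant document IDs for each query; the function names and document IDs are hypothetical, and this is a simplified illustration rather than the exact formula any of the toolkits above uses (several of them use rank-aware or LLM-judged variants).

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Share of relevant chunks that retrieval managed to find."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    found = relevant & set(retrieved_ids)
    return len(found) / len(relevant)


# Hypothetical example: four chunks retrieved, three labelled as relevant.
retrieved = ["doc-3", "doc-7", "doc-1", "doc-9"]
relevant = ["doc-1", "doc-3", "doc-5"]

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision suggests retrieval is pulling in noise; low recall suggests the right document never reached the model at all.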
Two angles for marketers. First, when an in-house RAG assistant degrades, you can see whether retrieval missed the right document or generation ignored the right source, which is a much faster path to a fix. Second, the same metrics tell you how 'citation-friendly' your own content is from a GEO (generative engine optimisation) standpoint: weak retrieval scores often map to weak chunking and weak structure.
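The "retrieval missed it or generation ignored it" triage can be sketched as a simple rule over two scores. The thresholds and the `diagnose` function below are illustrative assumptions, not values from any toolkit; in practice you would tune them against your own baseline.

```python
def diagnose(context_recall, faithfulness,
             recall_floor=0.5, faithfulness_floor=0.7):
    """Point at the stage most likely responsible for a bad answer.

    Both floors are hypothetical defaults chosen for illustration.
    """
    # If the right document never arrived, generation had no chance.
    if context_recall < recall_floor:
        return "retrieval"
    # The document was there, but the answer is not grounded in it.
    if faithfulness < faithfulness_floor:
        return "generation"
    return "ok"


# Hypothetical scores for three failing or passing queries.
case_missed_doc = diagnose(context_recall=0.2, faithfulness=0.9)
case_ignored_doc = diagnose(context_recall=0.9, faithfulness=0.4)
case_healthy = diagnose(context_recall=0.9, faithfulness=0.95)
print(case_missed_doc, case_ignored_doc, case_healthy)
```

Retrieval is checked first on purpose: a low faithfulness score is hard to interpret when the relevant source was never in the context.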
The mature pattern combines automated RAG evaluation with LLM-as-a-Judge for scale, and human review on flagged regressions. Looking at a single composite score hides too much; tracking retrieval and generation metrics separately is what actually reveals the root cause.
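One way to picture "human review on flagged regressions" is a per-metric comparison between a baseline run and the current run. The sketch below is an assumption-laden illustration: the metric names, scores, `flag_regressions` helper, and 0.05 tolerance are all hypothetical.

```python
def flag_regressions(baseline, current, tolerance=0.05):
    """Return metrics whose score dropped by more than `tolerance`."""
    flagged = {}
    for metric, base_score in baseline.items():
        # A metric missing from the current run counts as a full drop.
        drop = base_score - current.get(metric, 0.0)
        if drop > tolerance:
            flagged[metric] = round(drop, 3)
    return flagged


# Hypothetical scores from two evaluation runs.
baseline = {"context_recall": 0.82, "context_precision": 0.74,
            "faithfulness": 0.91, "answer_relevance": 0.88}
current = {"context_recall": 0.64, "context_precision": 0.73,
           "faithfulness": 0.90, "answer_relevance": 0.87}

flagged = flag_regressions(baseline, current)
print(flagged)  # only the metrics that should go to human review
```

With these made-up numbers, the average of the four scores moves by only about 0.05, while context recall on its own drops by 0.18. That is the sense in which a composite score can hide a retrieval problem that the separate metrics expose.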
Related terms
RAG
RAG (Retrieval-Augmented Generation) lets an LLM fetch external documents at answer time and ground its response in them — the technique behind ChatGPT Search, Perplexity and most AI search products.
LLM-as-a-Judge
LLM-as-a-judge is the practice of using one LLM to grade or compare the answers of another — a standard way to scale evaluation beyond what human labelling can cover.
Reranker
A reranker re-scores the candidate documents that retrieval produced and reorders them — the final gate that decides which sources an AI actually cites.
Hybrid Search
Hybrid search combines keyword (BM25) and vector retrieval to get the best of both — the default retrieval shape behind Perplexity-style answer engines and most production RAG.
LLM Benchmark
An LLM benchmark is a standardised test used to compare model capabilities — the source of those headline scores you see in every model launch announcement.