RAG Evaluation
In one line
RAG evaluation is the practice of measuring a retrieval-augmented system's quality across both stages — retrieval and generation — so you can see why an answer went wrong, not just that it did.
Going deeper
RAG evaluation breaks the quality question into stage-specific metrics rather than a single thumbs-up/down. Typical signals are context precision and recall on the retrieval side, and faithfulness (does the answer match the cited sources?) and answer relevance on the generation side. Ragas, TruLens and DeepEval are common open-source toolkits that standardise these metrics.
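As a rough illustration of the retrieval-side metrics, the sketch below computes context precision and recall for a single query against a set of hand-labelled relevant documents. The function and variable names are hypothetical; toolkits such as Ragas wrap this kind of calculation and typically use an LLM to judge relevance when no labels exist.

```python
# Minimal sketch: retrieval-side metrics for one query.
# Assumes you already know which document IDs are truly relevant (hand-labelled);
# evaluation toolkits estimate this with an LLM judge instead.

def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the relevant chunks that retrieval actually surfaced."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in relevant_ids if doc_id in set(retrieved_ids))
    return hits / len(relevant_ids)

# Example: retrieval returned 4 chunks, 2 of which were among the 3 labelled relevant.
retrieved = ["doc_12", "doc_07", "doc_33", "doc_41"]
relevant = {"doc_12", "doc_33", "doc_90"}
print(context_precision(retrieved, relevant))  # 0.5
print(context_recall(retrieved, relevant))     # 0.666...
```

Generation-side metrics like faithfulness follow the same per-query shape, but the "is this claim supported?" check is usually delegated to a judge model rather than computed from labels.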
Two angles for marketers. First, when an in-house RAG assistant degrades, you can see whether retrieval missed the right document or generation ignored the right source — a much faster path to a fix. Second, the same metrics tell you how 'citation-friendly' your own content is from a GEO standpoint: weak retrieval scores often map to weak chunking and weak structure.
The mature pattern combines automated RAG evaluation with LLM-as-a-Judge for scale, and human review on flagged regressions. Looking at a single composite score hides too much; tracking retrieval and generation metrics separately is what actually reveals the root cause.
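To show why per-stage tracking matters, here is a small, hypothetical regression check: each metric is compared against the previous run, and anything that drops too far gets flagged for human review. The metric names and threshold are illustrative assumptions, not a specific toolkit's API.

```python
# Illustrative sketch: flag per-stage metric regressions between two evaluation runs.
# The 0.1 drop tolerance is an arbitrary example value.

def flag_regressions(scores: dict[str, float], baseline: dict[str, float],
                     drop_tolerance: float = 0.1) -> list[str]:
    """Return the metrics that fell more than `drop_tolerance` below the last run."""
    return [metric for metric, value in scores.items()
            if metric in baseline and value < baseline[metric] - drop_tolerance]

current = {"context_precision": 0.55, "context_recall": 0.80,
           "faithfulness": 0.92, "answer_relevance": 0.88}
previous = {"context_precision": 0.78, "context_recall": 0.81,
            "faithfulness": 0.93, "answer_relevance": 0.90}

print(flag_regressions(current, previous))  # ['context_precision'] -> retrieval-side fix
```

Because only a retrieval metric regressed in this example, the fix lives in chunking, indexing or the retriever, not in the prompt or the generator.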
Related terms
RAG
RAG (Retrieval-Augmented Generation) lets an LLM fetch external documents at answer time and ground its response in them — the technique behind ChatGPT Search, Perplexity and most AI search products.
LLM-as-a-Judge
LLM-as-a-judge is the practice of using one LLM to grade or compare the answers of another — a standard way to scale evaluation beyond what human labelling can cover.
Reranker
A reranker re-scores the candidate documents that retrieval produced and reorders them — the final gate that decides which sources an AI actually cites.
Hybrid Search
Hybrid search combines keyword (BM25) and vector retrieval to get the best of both — the default retrieval shape behind Perplexity-style answer engines and most production RAG.
LLM Benchmark
An LLM benchmark is a standardised test used to compare model capabilities — the source of those headline scores you see in every model launch announcement.
How does your brand show up in AI answers?
Villion measures how your brand appears across ChatGPT, Perplexity and AI Overviews, then automates the work that lifts citation rate and share of voice.
Get a free audit