LLM Evaluation

You can’t align what you don’t evaluate
16 min read

LLM-as-a-Judge: What It Is, Why It Works, and How to Use It to Evaluate AI Models
Large Language Models
A step-by-step guide to building AI quality control using large language models
9 min read

The third and final part on evaluating the retrieval quality of your RAG pipeline with…
8 min read

How to Evaluate Retrieval Quality in RAG Pipelines (part 2): Mean Reciprocal Rank (MRR) and Average Precision (AP)
Large Language Models
Evaluating the retrieval quality of your RAG pipeline with binary, order-aware measures
9 min read

How to Evaluate Retrieval Quality in RAG Pipelines: Precision@k, Recall@k, and F1@k
Large Language Models
In my previous posts, I have walked you through putting together a very basic RAG…
18 min read

Those of you who’ve worked with LLM-powered applications know this: by now, building and deploying these tools…
4 min read

A practical, step-by-step guide to building an evaluation pipeline for a real-world AI application
15 min read

Learn how to reduce the number of hallucinations and the impact they have
7 min read

A guide to building and evaluating RAG solutions by leveraging LLM-as-a-Judge capabilities.
15 min read

This article is adapted from a lecture series I gave at Deeplearn 2025: From Prototype…
4 min read