LLM Evaluation

You can’t align what you don’t evaluate
16 min read

LLM-as-a-Judge: What It Is, Why It Works, and How to Use It to Evaluate AI Models
Large Language Models
A step-by-step guide to building AI quality control using large language models
9 min read

The third and final part on evaluating the retrieval quality of your RAG pipeline with…
8 min read

How to Evaluate Retrieval Quality in RAG Pipelines (part 2): Mean Reciprocal Rank (MRR) and Average Precision (AP)
Large Language Models
Evaluating the retrieval quality of your RAG pipeline with binary, order-aware measures
9 min read

How to Evaluate Retrieval Quality in RAG Pipelines: Precision@k, Recall@k, and F1@k
Large Language Models
In my previous posts, I have walked you through putting together a very basic RAG…
18 min read

Those of you who’ve worked with LLM-powered applications know this: by now, building and deploying these tools…
4 min read

A practical, step-by-step guide to building an evaluation pipeline for a real-world AI application
15 min read

Learn how to reduce the number of hallucinations and the impact they have
7 min read

A guide to building and evaluating RAG solutions by leveraging LLM-as-a-Judge capabilities.
15 min read

This article is adapted from a lecture series I gave at Deeplearn 2025: From Prototype…
4 min read