Unfortunately, high similarity scores don’t always guarantee perfect relevance. In other words, the retriever may retrieve a text chunk that has a high similarity score, but is in fact not that useful – just not what we need to answer our user’s question 🤷🏻‍♀️. And this is where re-ranking comes in, as a way to refine results before feeding them into the LLM.
As in my previous posts, I will once again be using the War and Peace text as an example, licensed as Public Domain and easily accessible through Project Gutenberg.
🍨DataCream is a newsletter offering stories and tutorials on AI, data, tech. If you are interested in these topics, subscribe here.
• • •
What about Reranking?
Text chunks retrieved solely based on a retrieval metric – that is, raw retrieval – may not be that useful, for several different reasons:
- The retrieved chunks we end up with can vary widely with the selected number of top chunks k: a small k and a large k may yield very different results.
- We may retrieve chunks that are semantically close to what we are looking for, but still off-topic and, in reality, not appropriate to answer the user’s query.
- We may get partial matches to specific words included in the user’s query, leading to chunks that include those specific words but are in fact irrelevant.
Back to my favorite question from the ‘War and Peace’ example, if we ask ‘Who is Anna Pávlovna?’, and use a very small k (like k = 2), the retrieved chunks may not contain enough information to comprehensively answer the question. Conversely, if we allow for a large number of chunks k to be retrieved (say k = 20), we are most probably going to also retrieve some irrelevant text chunks where ‘Anna Pávlovna’ is just mentioned, but isn’t the topic of the chunk. Thus, the meaning of some of those chunks is going to be unrelated to the user’s query and useless for answering it. Therefore, we need a way to distinguish the truly relevant retrieved text chunks out of all the retrieved chunks.
Here, it is worth clarifying that one seemingly straightforward solution would be to just retrieve everything and pass it all to the generation step (to the LLM). Unfortunately, this cannot be done, for a couple of reasons: LLMs have limited context windows, and their performance degrades when the context is overstuffed with information.
So, this is the issue we try to tackle by introducing the reranking step. In essence, reranking means re-evaluating the chunks that are retrieved based on the cosine similarity scores with a more accurate, yet also more expensive and slower method.
There are various methods for doing this – for instance, cross-encoders, employing an LLM to do the reranking, or using heuristics. Ultimately, by introducing this extra reranking step, we implement what is called two-stage retrieval with reranking, a standard industry approach. This improves the relevance of the retrieved text chunks and, as a result, the quality of the generated responses.
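To make the idea concrete, here is a minimal, self-contained sketch of two-stage retrieval in plain NumPy. Everything here is illustrative: the toy random vectors stand in for real embeddings, and `rerank_fn` is a placeholder for any expensive-but-accurate scorer (such as a cross-encoder).

```python
import numpy as np

def two_stage_retrieve(query_emb, doc_embs, docs, rerank_fn, k=4, top_n=2):
    # stage 1: fast vector search – inner product on normalized embeddings
    sims = doc_embs @ query_emb
    candidates = np.argsort(sims)[::-1][:k]
    # stage 2: slower but more accurate scoring of only the k candidates
    rescored = sorted(candidates, key=lambda i: rerank_fn(docs[i]), reverse=True)
    return [docs[i] for i in rescored[:top_n]]

# toy example: 5 "documents" with random unit-length embeddings
rng = np.random.default_rng(42)
doc_embs = rng.normal(size=(5, 8)).astype("float32")
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)
docs = [f"chunk {i}" for i in range(5)]
query_emb = doc_embs[0]  # pretend the query matches chunk 0's embedding

top = two_stage_retrieve(query_emb, doc_embs, docs, rerank_fn=len)
```

The expensive scorer only ever sees the k candidates from stage 1, never the whole knowledge base – that is the whole point of the two-stage design.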
So, let’s take a more detailed look… 🔍
• • •
Reranking with a Cross-Encoder
Cross-encoders are the standard models used for reranking in a RAG framework. Unlike the retriever used in the initial retrieval step, which only considers similarity scores between separately embedded texts, a cross-encoder performs a more in-depth comparison of each retrieved chunk with the user’s query: it jointly encodes the document and the query and directly produces a relevance score. In cosine similarity-based retrieval, by contrast, the document and the user’s query are embedded separately from one another, and their similarity is calculated afterwards. Some information of the original texts is lost when the embeddings are created separately, whereas more of it is preserved when the texts are encoded jointly. Consequently, a cross-encoder can better assess the relevance between two texts (that is, the user’s query and a document).
So why not use a cross-encoder in the first place? Because cross-encoders are very slow. For instance, a cosine similarity search over about 1,000 precomputed passages takes less than a millisecond. On the contrary, using solely a cross-encoder (like ms-marco-MiniLM-L-6-v2) to score the same set of 1,000 passages against a single query would be orders of magnitude slower!
This is to be expected if you think about it: using a cross-encoder means that we have to pair each chunk of the knowledge base with the user’s query and run the model on each pair on the spot, for each and every new query. With cosine similarity-based retrieval, on the contrary, we create all the embeddings of the knowledge base beforehand, just once; then, once the user submits a query, we only need to embed the query and calculate the pairwise cosine similarities.
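This cost asymmetry is easy to see in plain NumPy (with toy random vectors standing in for real embeddings): the document matrix is built once, offline, and each query then costs only one embedding plus a single matrix–vector product.

```python
import numpy as np

rng = np.random.default_rng(0)

# one-off, offline: embed and normalize the whole knowledge base
# (toy stand-ins: 1,000 docs, 384-dimensional embeddings)
doc_embs = rng.normal(size=(1000, 384)).astype("float32")
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

# per query, online: one embedding + one matrix-vector product
query_emb = rng.normal(size=384).astype("float32")
query_emb /= np.linalg.norm(query_emb)
scores = doc_embs @ query_emb            # 1,000 cosine similarities at once
top_k = np.argsort(scores)[::-1][:5]     # indices of the 5 most similar documents
```

A cross-encoder, by contrast, would have to run a full transformer forward pass over all 1,000 (query, document) pairs for every single query.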
For that reason, we adjust our RAG pipeline appropriately and get the best of both worlds: first, we narrow down the candidate relevant chunks with the fast cosine similarity search, and then, in the second step, we assess the relevance of those retrieved chunks more accurately with a cross-encoder.
• • •
Back to the ‘War and Peace’ Example
So now let’s see how all these play out in the ‘War and Peace’ example by answering one more time my favorite question – ‘Who is Anna Pávlovna?’.
My code so far looks something like this:
import os
import numpy as np
import faiss
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

api_key = "my_api_key"

# initialize LLM
llm = ChatOpenAI(openai_api_key=api_key, model="gpt-4o-mini", temperature=0.3)

# initialize embeddings model
embeddings = OpenAIEmbeddings(openai_api_key=api_key)

# load documents to be used for RAG
text_folder = "RAG files"
documents = []
for filename in os.listdir(text_folder):
    if filename.lower().endswith(".txt"):
        file_path = os.path.join(text_folder, filename)
        loader = TextLoader(file_path)
        documents.extend(loader.load())

# split documents into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = []
for doc in documents:
    chunks = splitter.split_text(doc.page_content)
    for chunk in chunks:
        split_docs.append(Document(page_content=chunk))
documents = split_docs

# normalize knowledge base embeddings (FAISS expects float32)
def normalize(vectors):
    vectors = np.array(vectors, dtype="float32")
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

doc_texts = [doc.page_content for doc in documents]
doc_embeddings = embeddings.embed_documents(doc_texts)
doc_embeddings = normalize(doc_embeddings)

# faiss index with inner product (== cosine similarity on normalized vectors)
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(doc_embeddings)

# create vector database with FAISS
vector_store = FAISS(embedding_function=embeddings, index=index, docstore=None, index_to_docstore_id=None)
vector_store.docstore = {i: doc for i, doc in enumerate(documents)}

def main():
    print("Welcome to the RAG Assistant. Type 'exit' to quit.\n")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "exit":
            print("Exiting…")
            break

        # embed + normalize the query
        query_embedding = embeddings.embed_query(user_input)
        query_embedding = normalize([query_embedding])

        # search FAISS index
        D, I = index.search(query_embedding, k=2)

        # get relevant documents
        relevant_docs = [vector_store.docstore[i] for i in I[0]]
        retrieved_context = "\n\n".join([doc.page_content for doc in relevant_docs])

        # D contains inner product scores == cosine similarities (since normalized)
        print("\nTop chunks and their cosine similarity scores:\n")
        for rank, (idx, score) in enumerate(zip(I[0], D[0]), start=1):
            print(f"Chunk {rank}:")
            print(f"Cosine similarity: {score:.4f}")
            print(f"Content:\n{vector_store.docstore[idx].page_content}\n{'-'*40}")

        # system prompt
        system_prompt = (
            "You are a helpful assistant. "
            "Use ONLY the following knowledge base context to answer the user. "
            "If the answer is not in the context, say you don't know.\n\n"
            f"Context:\n{retrieved_context}"
        )

        # messages for LLM
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input}
        ]

        # generate response
        response = llm.invoke(messages)
        assistant_message = response.content.strip()
        print(f"\nAssistant: {assistant_message}\n")

if __name__ == "__main__":
    main()
For k = 2, we get the following top chunks retrieved.

But, if we set k = 6, we get the following chunks retrieved, and somewhat of a more informative answer, containing additional data on our question, like the fact that she’s ‘maid of honor and favorite of the Empress Márya Fëdorovna’.

Now, let’s adjust our code to rerank those 6 chunks and see if the top 2 remain the same. To do this, we will be using a cross-encoder model to re-rank the top-k retrieved documents before passing them to the LLM. More specifically, I will be utilizing the cross-encoder/ms-marco-TinyBERT-L-2 cross-encoder, which is a simple, pre-trained cross-encoding model, running on top of PyTorch. To do so, we also need to import the torch and sentence-transformers libraries.
import torch
from sentence_transformers import CrossEncoder
Then we can initialize the cross-encoder and define a function for reranking the top k chunks retrieved from the vector search:
# initialize cross-encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2', device='cuda' if torch.cuda.is_available() else 'cpu')

def rerank_with_cross_encoder(query, relevant_docs):
    pairs = [(query, doc.page_content) for doc in relevant_docs]  # pairs of (query, document) for the cross-encoder
    scores = cross_encoder.predict(pairs)  # relevance scores from the cross-encoder model
    ranked_indices = np.argsort(scores)[::-1]  # sort documents by cross-encoder score (the higher, the better)
    ranked_docs = [relevant_docs[i] for i in ranked_indices]
    ranked_scores = [scores[i] for i in ranked_indices]
    return ranked_docs, ranked_scores
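As a quick sanity check of the sorting logic – without downloading any model – we can swap in a stub scorer. The `StubEncoder` class and the toy chunks below are made up purely for illustration; they just mimic the `predict` interface that `CrossEncoder` exposes:

```python
import numpy as np
from types import SimpleNamespace

# toy chunks, and a stub scorer that pretends longer texts are more relevant
docs = [SimpleNamespace(page_content=t) for t in ["short", "a much longer chunk", "mid one"]]

class StubEncoder:
    def predict(self, pairs):
        return np.array([len(doc) for _, doc in pairs])

stub = StubEncoder()
pairs = [("Who is Anna Pávlovna?", d.page_content) for d in docs]
scores = stub.predict(pairs)
order = np.argsort(scores)[::-1]      # same descending sort as in rerank_with_cross_encoder
ranked = [docs[i] for i in order]     # ranked[0] is the chunk the scorer deems most relevant
```

With the real cross-encoder, the scores come from a transformer forward pass over each (query, document) pair, but the reordering step is exactly the same.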
… and also adjust our main function as follows:
...
        # search FAISS index
        D, I = index.search(query_embedding, k=6)

        # get relevant documents
        relevant_docs = [vector_store.docstore[i] for i in I[0]]

        # rerank with our function
        reranked_docs, reranked_scores = rerank_with_cross_encoder(user_input, relevant_docs)

        # keep only the top reranked chunks as context
        retrieved_context = "\n\n".join([doc.page_content for doc in reranked_docs[:2]])

        # D contains inner product scores == cosine similarities (since normalized)
        print("\nTop 6 Retrieved Chunks:\n")
        for rank, (idx, score) in enumerate(zip(I[0], D[0]), start=1):
            print(f"Chunk {rank}:")
            print(f"Similarity: {score:.4f}")
            print(f"Content:\n{vector_store.docstore[idx].page_content}\n{'-'*40}")

        # display top reranked chunks
        print("\nTop 2 Re-ranked Chunks:\n")
        for rank, (doc, score) in enumerate(zip(reranked_docs[:2], reranked_scores[:2]), start=1):
            print(f"Rank {rank}:")
            print(f"Reranker Score: {score:.4f}")
            print(f"Content:\n{doc.page_content}\n{'-'*40}")
...
… and finally, these are the top 2 chunks, and the respective answer we get, after re-ranking with the cross-encoder:

Notice how these 2 chunks are different from the top 2 chunks we got from the vector search.
Thus, the importance of the reranking step becomes clear: we use the vector search to narrow down the possibly relevant chunks out of all the available documents in the knowledge base, and then use the reranking step to accurately identify the most relevant ones.

We can imagine the two-step retrieval as a funnel: the first stage pulls in a wide set of candidate chunks, and the reranking stage filters out the irrelevant ones. What’s left is the most useful context, leading to clearer and more accurate answers.
• • •
On my mind
So, it becomes apparent that reranking is an essential step for building a robust RAG pipeline. Fundamentally, it bridges the gap between the quick but not-so-precise vector search and context-aware answers. By performing a two-step retrieval – vector search first, reranking second – we get the best of both worlds: efficiency at scale and higher-quality responses. In practice, this two-stage approach is what makes modern RAG pipelines both practical and powerful.
• • •
Loved this post? Let’s be friends! Join me on:
📰Substack 💌 Medium 💼LinkedIn ☕Buy me a coffee!
• • •
What about pialgorithms?
Looking to bring the power of RAG into your organization?
pialgorithms can do it for you 👉 book a demo today!