One Component You Desperately Need in Your RAG Chatbot Toolchain


In the world of Retrieval Augmented Generation (RAG), the difference between a mediocre chatbot or Agentic AI system and an exceptional one often comes down to retrieval quality. No matter how sophisticated your Large Language Model (LLM) is, if it's working with irrelevant context, your results will suffer. This is where rerankers come in: an emerging technology that further improves on modern lexical and AI-generated vector search techniques.
What Is a Reranker and Why Is It Important?
At its core, a reranker is a specialized model that takes an initial set of retrieved documents or chunks and reorders them by relevance to the query. Unlike embedding-based retrievers, which can only compare the similarity of a search term to the text in your documents, rerankers produce more relevant results because they keep the search term in mind as they read each document. This is possible because a reranker is a cross-encoder model, whereas most modern popular LLMs and embedding models are bi-encoders or decoder-only.
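To make the cross-encoder distinction concrete, here is a minimal sketch using the open-source sentence-transformers library; the model name and example texts are purely illustrative and separate from the Bedrock pipeline discussed later.
```python
# A minimal sketch of what a cross-encoder reranker does
from sentence_transformers import CrossEncoder

query = "How should encryption keys be rotated?"
candidates = [
    "Keys must be rotated at least annually per the policy.",
    "The cafeteria menu is updated every Monday.",
    "Key rotation procedures are documented in section 4.2.",
]

# Unlike a bi-encoder, the cross-encoder reads the query and each candidate
# together, producing one relevance score per (query, document) pair.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, doc) for doc in candidates])

# Reorder the candidates by score, highest first
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```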
The Statistical Edge: What Industry Leaders Are Seeing
The importance of rerankers is backed by impressive statistics from industry leaders. Voyage AI recently announced their new Voyage 2 series of rerankers in a blog post, highlighting some remarkable performance improvements. According to their findings, when adding their rerank-2 and rerank-2-lite models on top of OpenAI's latest embedding model (v3 large), they observed accuracy improvements of 13.89% and 11.86% respectively across 93 retrieval datasets spanning multiple domains. This represents 2.3x and 1.7x the improvement attained by competitors. Additionally, these models support impressively long context lengths—16K tokens for rerank-2 and 8K tokens for rerank-2-lite.
Similarly, Cohere has made significant strides with their Rerank 3.5 model. Their latest reranker demonstrates substantial improvements over previous versions, particularly excelling at complex, nuanced query matching. Cohere's benchmarks show that Rerank 3.5 delivers up to 25% better performance on challenging retrieval tasks compared to basic embedding search alone. The model has been specifically optimized for enterprise use cases, making it particularly valuable for business applications where precision is paramount.
The Research Perspective
The importance of reranking in RAG systems is also emphasized in academic research. NVIDIA's comprehensive research paper on building effective chatbots identifies "Chunk Reranking" (RAG-C10) as one of the fifteen critical control points in RAG pipelines. According to their findings, reranking is particularly impactful when dealing with large document collections or when high precision is required. The researchers note that adding a reranking step can significantly improve the relevance of retrieved context, directly translating to better generated responses.
Implementing a Reranker: Easier Than You Might Think
Despite their sophisticated functionality, implementing rerankers in modern RAG systems has become remarkably straightforward thanks to well-designed APIs and integrations. My preferred library, LlamaIndex, provides sample code for integrating Cohere's API-based reranker directly into your existing RAG code.
For AWS users, there's even more good news: Cohere's latest Rerank 3.5 is now available via AWS Bedrock, making integration even more accessible for organizations already using AWS services. More details can be found in this AWS blog post.
In our case, we used Cohere's Rerank 3.5 via AWS Bedrock, which takes just a few lines of code:
```bash
# Install dependencies
pip install llama-index llama-index-core llama-index-readers-file
pip install llama-index-llms-bedrock llama-index-embeddings-bedrock
pip install llama-index-postprocessor-bedrock-rerank
pip install boto3
```
```python
import os

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.llms.bedrock import Bedrock
from llama_index.postprocessor.bedrock_rerank import AWSBedrockRerank

# Initialize the Bedrock LLM (used for answer generation)
llm = Bedrock(
    model="us.anthropic.claude-3-5-haiku-20241022-v1:0",
    aws_access_key_id=os.environ.get("access_key"),
    aws_secret_access_key=os.environ.get("secret_key"),
    # aws_session_token="AWS Session Token to use",
    region_name="us-east-1",
)
resp = llm.complete("Coalfire is ")
print(resp)

# Use a Bedrock embedding model for indexing (model ID shown is an example)
Settings.embed_model = BedrockEmbedding(
    model_name="amazon.titan-embed-text-v2:0",
    aws_access_key_id=os.environ.get("access_key"),
    aws_secret_access_key=os.environ.get("secret_key"),
    region_name="us-east-1",
)

# Load a local PDF file, chunk it, and build a vector index to search and rerank.
# You will need to use your own AWS secrets here.
documents = SimpleDirectoryReader(input_files=["mypdf.pdf"]).load_data()
nodes = SentenceSplitter().get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

# Initialize the reranker. There are now two "top" limits: similarity_top_k is
# the vector-search candidate count, and top_n is the re-ranked limit. Send more
# vector results into the reranker to get a better final result, e.g. a top-k of
# 25 candidates trimmed to the best 2 by top_n.
reranker = AWSBedrockRerank(
    top_n=2,
    model_id="cohere.rerank-v3-5:0",
    aws_access_key_id=os.environ.get("access_key"),
    aws_secret_access_key=os.environ.get("secret_key"),
    region_name="us-west-2",
)

# Vector search, or vector search with rerank: simply apply the reranker as a
# postprocessor on the retrieved nodes.
retriever = vector_index.as_retriever(similarity_top_k=25)  # normal vector search

query = "What does the policy require for encryption at rest?"  # example query
candidate_nodes = retriever.retrieve(query)
reranked_nodes = reranker.postprocess_nodes(candidate_nodes, query_str=query)
```
Why Reranking Really Matters for Product Features: UX
Like many organizations, we put together a simple RAG pattern-based proof-of-concept. And like many organizations, we discovered simple RAG has a lot of problems, many of which show up in user testing and in your perceived user experience.
In our simple RAG feature we retrieved the Top 10 results. In my retrieval quality testing, this netted us solid retrieval benchmarks for the types of documents we’re searching (information security compliance documents), including an 80% hit rate.
Dense embeddings vector search - Top 10

| Top K | Hit Rate | MRR  | Precision | Recall | AP  | NDCG |
|-------|----------|------|-----------|--------|-----|------|
| 10    | 80.6%    | .464 | .081      | .81    | .46 | .55  |
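For context, hit rate and MRR can be computed from a labeled test set roughly as follows. This is a simplified sketch with hypothetical helpers (`test_set`, `retrieve_ids`), not our actual evaluation harness.
```python
# Simplified sketch of computing Hit Rate and MRR at a given top-k.
# `test_set` is assumed to be a list of (query, relevant_chunk_id) pairs, and
# `retrieve_ids` a function returning ranked chunk IDs for a query.

def evaluate(test_set, retrieve_ids, k=10):
    hits, reciprocal_ranks = 0, []
    for query, relevant_id in test_set:
        ranked_ids = retrieve_ids(query)[:k]
        if relevant_id in ranked_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (ranked_ids.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    hit_rate = hits / len(test_set)
    mrr = sum(reciprocal_ranks) / len(test_set)
    return hit_rate, mrr
```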
Modern LLMs have no trouble quickly reading and digesting ten 500-word chunks. But humans do! We found our users did not engage with the citation feature because there was too much cognitive load: ten citations are hard to read, digest, and keep in your head at once. It's also a lot of work, and when your feature is trying to improve the user's productivity, you need to streamline the task. It must be faster and easier than opening the source document directly.
For our next-gen version, we dialed back the cognitive load and showed the user only 3 shorter citations. This makes the task much easier and quicker to accomplish. We had one problem though--plain vector search wasn't good enough! Our metrics dropped to an unacceptably low 58% hit rate at Top 3.
Dense embeddings vector search - Top 3

| Top K | Hit Rate | MRR | Precision | Recall | AP  | NDCG |
|-------|----------|-----|-----------|--------|-----|------|
| 3     | 58%      | .44 | .013      | .67    | .44 | .50  |
And here's where rerankers come back in. With nothing else changed, using the same test documents, simply adding Cohere's Rerank 3.5 via AWS Bedrock produced a massive jump in performance: our hit rate climbed from 58% to 91%, and our NDCG rose from .50 to .82.
25 embeddings reranked with Cohere Rerank 3.5 - Top 3

| Top N | Hit Rate | MRR | Precision | Recall | AP  | NDCG |
|-------|----------|-----|-----------|--------|-----|------|
| 3     | 91%      | .79 | .30       | .91    | .79 | .82  |
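For reference, the only change from the plain Top-3 setup is to widen the vector search and let the reranker choose the final three. Here is a sketch building on the earlier code (variables like `vector_index` and `query` come from that block, and credentials are omitted):
```python
# Retrieve a wide candidate pool, then let the reranker pick the final 3
reranker = AWSBedrockRerank(
    top_n=3,                          # final re-ranked limit shown to the user
    model_id="cohere.rerank-v3-5:0",
    region_name="us-west-2",          # credentials omitted; see the setup above
)
retriever = vector_index.as_retriever(similarity_top_k=25)  # wide candidate pool
candidates = retriever.retrieve(query)
top_3 = reranker.postprocess_nodes(candidates, query_str=query)
```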
Bottom line: without the Reranker, our retrieval (and therefore accuracy) wouldn’t be good enough to enable our improved user experience.
Take a closer look at your RAG chatbot or Agentic AI system and you might find the same!