Scaling Retrieval

Improving Retrieval for Gen AI

Matt Barta

At DeployQL, we simplify finding the right context for your LLM. Here are some insights from our experience to help you optimize your retrieval setup and ensure it scales with your product.

This post is geared towards industry leaders who are less familiar with the retrieval portion of Retrieval Augmented Generation (RAG). Retrieval is a unique blend of infrastructure and machine learning concerns, and it’s common to only have direct experience in one of these areas.

We’ve talked to industry leaders working on RAG about how their retrieval needs to evolve and iterate over time. To be upfront: there’s no silver bullet for retrieval, and we have no predefined formula to share. Instead, we want to accomplish two things in this post:

  1. Share how we think about evolving retrieval over time.
  2. Suggest solutions to problems that have surprised folks when they begin to scale.

1. Evaluate your data quality

While this evaluation might happen naturally as you benchmark and build a RAG application, thinking about data quality beforehand helps you brainstorm and crystallize possible approaches and solutions.

The more specific you can be about what you need to search, the better your results will be.

Here are a few questions you can ask yourself: How large is your data, and how quickly does it grow? How often does it change? How much cleaning and extraction does it need before it’s searchable?

There are many blog posts about cleaning, extracting, and indexing data. It’s a critical step to get right, and it can be costly to fix mistakes. Unstructured.io wrote an excellent blog post covering the concerns they have at this stage.

We’ll cover data size and updating data a bit more below.

2. Consider how you will handle data updates

Suppose a new customer onboards to your product, and you need to index their data into your search system.

OpenSearch (or any Lucene-based search system) uses immutable segments, so updating a lot of data at once can cause a spike in resource usage while segments are written and merged for the new data. Avoiding poor performance in these cases can take a lot of tuning and operational knowledge.
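One common way to soften that spike is to spread a large onboarding load across smaller bulk requests instead of indexing everything in one burst. Here is a minimal sketch, assuming the opensearch-py client and a hypothetical `docs` index and document format:

```python
from opensearchpy import OpenSearch, helpers

# Hypothetical client and index name; adjust hosts, auth, and mappings for your setup.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

new_customer_documents = [
    {"id": "1", "title": "Onboarding guide", "body": "..."},  # placeholder docs
    {"id": "2", "title": "Pricing FAQ", "body": "..."},
]

def actions(documents):
    # Stream documents as individual bulk actions rather than one giant request.
    for doc in documents:
        yield {"_index": "docs", "_id": doc["id"], "_source": doc}

# Smaller chunks spread segment creation and merging over time, smoothing the
# resource spike that a single massive update can cause.
helpers.bulk(client, actions(new_customer_documents), chunk_size=500)
```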

Data updates might also come from attempts to improve retrieval accuracy. New and better embedding models are coming out at a rapid pace, but embedding every document in your index with a new model could be incredibly expensive.

We propose an adapter paradigm to help reduce reindexing costs while still leveraging better embeddings at query time. Adapters learn a translation from one embedding space to another, and the adapter can be as small as a single linear layer.

We’re seeing adapters applied across a growing range of use cases.
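To make the idea concrete, here is a minimal sketch of a single-linear-layer adapter, assuming you can embed a sample of your documents with both the old (index-time) model and the new (query-time) model; the file names and shapes are hypothetical:

```python
import numpy as np

# Hypothetical inputs: the same sample of documents embedded with both models.
# old_embs: (n_docs, d_old) from the model used to build the index.
# new_embs: (n_docs, d_new) from the newer, better model.
old_embs = np.load("old_model_embeddings.npy")
new_embs = np.load("new_model_embeddings.npy")

# Fit a single linear map W so that new_embs @ W approximates old_embs.
W, *_ = np.linalg.lstsq(new_embs, old_embs, rcond=None)

def adapt_query(query_emb: np.ndarray) -> np.ndarray:
    """Take a query embedded with the new model and project it into the old
    index's embedding space, so the index never has to be rebuilt."""
    return query_emb @ W
```

In practice you might train a small neural adapter with a regression or contrastive loss instead, but the principle is the same: translate between embedding spaces at query time rather than re-embedding the whole corpus.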

3. Scaling can bring new problems

There’s a tradeoff between system latency and retrieval performance: to get better results, you generally have to spend more time searching. In vector search, you might see parameters like nprobe or ef_search that directly control this tradeoff. Scan more data at the cost of higher latency.

As your dataset gets larger, it’s common for retrieval performance to go down. After all, you’re searching more data, and that changes which results you’ll see. It’s tempting to raise the parameters above to compensate, but latency will increase and limit how many queries you can handle.
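As an illustration of that tradeoff, here is a small sketch using FAISS (one of several vector libraries that exposes this knob) with random toy data; the sizes and cluster count are arbitrary:

```python
import time
import faiss
import numpy as np

d, n = 128, 100_000
xb = np.random.rand(n, d).astype("float32")    # toy document vectors
xq = np.random.rand(100, d).astype("float32")  # toy query vectors

# IVF index: vectors are grouped into clusters; nprobe controls how many
# clusters each query scans.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 256)
index.train(xb)
index.add(xb)

for nprobe in (1, 8, 64):
    index.nprobe = nprobe  # scan more clusters: better recall, higher latency
    start = time.time()
    distances, ids = index.search(xq, 10)
    print(f"nprobe={nprobe:3d}  latency={time.time() - start:.3f}s")
```

The same pattern applies to ef_search in HNSW-based indexes: widening the search improves recall at the cost of latency and throughput.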

Filtering, such as by date range, can help manage this by narrowing the candidate pool and making search queries faster to run. However, because filtering removes documents before they are ranked, it can lower recall if your filter isn’t accurate. Consider how you structure your metadata so that you can apply filters, and understand your data well enough to know whether those filters will be accurate.
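For example, a metadata filter in an OpenSearch-style query might look like the sketch below; the index and field names (`published_at`, `tenant_id`) are hypothetical:

```python
# Hypothetical query body: the filter clauses narrow the candidate pool before
# ranking, so only documents inside the date range and tenant are scored.
query = {
    "query": {
        "bool": {
            "must": {"match": {"body": "q3 revenue guidance"}},
            "filter": [
                {"range": {"published_at": {"gte": "2024-01-01", "lte": "2024-06-30"}}},
                {"term": {"tenant_id": "customer-123"}},
            ],
        }
    }
}
# client.search(index="docs", body=query)  # e.g. with the opensearch-py client
```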

For more reading, check out how Neo4j addresses filtering in their blog post.

4. You probably want A/B testing

There are a lot of resources around benchmarking RAG up front — chunking strategies, embedding model and LLM performance, top_k parameters.

There’s also a lot of wisdom on how collecting feedback is crucial for fine-tuning models and monitoring performance. Similarly, A/B testing forces you to think through how you’re going to collect data and what metrics you’re measuring for success. It gives you a chance to detect data drift or underperformance in new scenarios, for example as you onboard new types of users.

By thinking through how you’ll compare and measure changes in production, you’ve also thought through how you’ll continue to iterate on your system.
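Here is a minimal sketch of deterministic experiment bucketing, with hypothetical variant and experiment names; the point is that assignment is stable and logged, so retrieval metrics can be attributed to the configuration that actually served the query:

```python
import hashlib

VARIANTS = ("control", "new_retrieval_config")  # hypothetical variant names

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically bucket a user into a variant.

    Hashing the experiment and user IDs keeps assignment stable across
    sessions without storing state, so feedback and metrics line up with
    the retrieval configuration that served each query.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

# Log the variant alongside the query, retrieved documents, and user feedback,
# then compare per-variant metrics (e.g. click-through or answer acceptance).
variant = assign_variant("user-42", "reranker-rollout")
```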

For more reading, check out Daniel’s blog post.


Thinking about long-term iteration processes can be difficult when the state of the art is changing so quickly. Updates carry direct costs that are hard to reason about without understanding your specific retrieval use cases as well as how your system is performing today.

It’s also easy to be blindsided by operational challenges. Hopefully, thinking through the points above gives you concrete ideas on how to avoid them.

If you’re working on a RAG application and want to chat, get in touch at [email protected]
