Martin Lejko

Things to keep in mind when building RAG

/ 6 min read

Introduction

This article is really just for me: a place to note down things from an article about building a better RAG that I want to reflect on when building my own solution. That's it, plain and simple, just my bullet points about things I have already used and need to extend, or things that are completely new that I want to remember when I get there.

Article notes

Right from the start, in section two, the article lists common pitfalls that we should avoid. If these are pitfalls, maybe we want to create tests for them in the testing pipeline. The two mentioned are:

Absence Bias

  • The model can be the best in the world, but if we don't provide the relevant information it will never be able to answer the question. That is why we should measure retrieval success first. That is definitely on the TODO list for the pipeline, since we do not do it right now and it is a good point. Hopefully DeepEval will help us with that as well and will have the tools for it.

Intervention Bias

  • The conclusion is that we should go slow and only keep changes that we see results from. Do not overload the pipeline with all the hyped-up things at once, otherwise we will not know what is good and what is not.

Coming back to the first point about testing retrieval, they talk about what we should test for: precision and recall. Recall is the more crucial of the two, because if we do not recall the right information we can never answer the question.
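To make these two measurable, here is a minimal sketch of how precision and recall could be computed per query, assuming each test question is labeled with the IDs of the chunks that actually contain the answer (the function and variable names are my own, not from the article):

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / len(top_k)


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant chunks that show up in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Example: the retriever returned these chunk IDs for one test question,
# and we know the answer lives in chunks "c2" and "c7".
print(recall_at_k(["c2", "c5", "c9"], {"c2", "c7"}, k=3))     # 0.5
print(precision_at_k(["c2", "c5", "c9"], {"c2", "c7"}, k=3))  # ~0.33
```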

The next good point they made is about synthetic data: once we have the data, it does not matter whether it is private or public. We will use this to create our datasets for the retrieval tests. We do it by taking our data and asking an LLM to “Generate 5 questions that can be answered by each chunk.” After that we have our dataset of question–chunk pairs, which lets us evaluate whether our system returns that chunk for each question. That’s recall.
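A rough sketch of that loop, assuming some `ask_llm` helper that wraps whatever model I end up using; the helper, the prompt wording, and the `retrieve` callback are my own placeholders:

```python
def ask_llm(prompt: str) -> str:
    # Placeholder for whatever LLM client ends up being used (e.g. an ollama call).
    return "Question 1?\nQuestion 2?\nQuestion 3?\nQuestion 4?\nQuestion 5?"


def build_synthetic_dataset(chunks: dict[str, str], questions_per_chunk: int = 5) -> list[tuple[str, str]]:
    """Build (question, source_chunk_id) pairs by asking an LLM to write questions per chunk."""
    pairs = []
    for chunk_id, chunk_text in chunks.items():
        prompt = (
            f"Generate {questions_per_chunk} questions that can be answered "
            f"by the following text. One question per line.\n\n{chunk_text}"
        )
        questions = [q.strip() for q in ask_llm(prompt).splitlines() if q.strip()]
        pairs.extend((question, chunk_id) for question in questions[:questions_per_chunk])
    return pairs


def retrieval_recall(pairs: list[tuple[str, str]], retrieve, k: int = 5) -> float:
    """Share of questions whose source chunk appears in the retriever's top-k results."""
    hits = sum(1 for question, chunk_id in pairs if chunk_id in retrieve(question, k))
    return hits / len(pairs) if pairs else 0.0
```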

After this, in our proof of concept, we already know that we will do iterations. Based on them we will tweak the building blocks of the pipeline and stick with the best version; that is why we created the testing pipeline. Maybe it would also be great to implement a system that tells us which iteration is the best instead of picking it manually, but that is a bit of a stretch for now.
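If that ever happens, it would not need to be fancy; something like logging each iteration's scores and sorting them could be enough (the file name and metric keys below are hypothetical):

```python
import json
from pathlib import Path

RESULTS_FILE = Path("pipeline_results.jsonl")  # hypothetical per-iteration results log


def log_run(version: str, metrics: dict) -> None:
    """Append one pipeline iteration's test metrics to the log."""
    with RESULTS_FILE.open("a") as f:
        f.write(json.dumps({"version": version, **metrics}) + "\n")


def best_version(metric: str = "recall_at_5") -> dict:
    """Return the logged run with the highest value for the chosen metric."""
    runs = [json.loads(line) for line in RESULTS_FILE.read_text().splitlines()]
    return max(runs, key=lambda run: run.get(metric, 0.0))


# log_run("v3-smaller-chunks", {"recall_at_5": 0.72, "precision_at_5": 0.41})
# print(best_version())
```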

If our recall is 50%, that means half the time we are missing the relevant chunk entirely. No advanced prompt can fix that; we must investigate chunk sizing, embeddings, or re-ranking next. Even if we see a higher retrieval rate overall, we should differentiate between queries and split them into groups. Below is a short list of how we could segment our queries, and maybe create a dataset of queries to test on.

  • Topic: “sales questions,” “technical questions,” “pricing questions,” etc.
  • Complexity: “simple single-hop” vs. “multi-hop or comparison-based.”
  • User role: “new users vs. experienced users,” “executives vs. engineers.” Or do a quick LLM-based clustering: feed your queries to a clustering algorithm (like k-means or LLM-based topic labeling) to see emergent groups; see the sketch after this list.
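For the clustering route, a minimal sketch using sentence-transformers and scikit-learn; the embedding model name, the example queries, and the number of clusters are just placeholders I would tune:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

queries = [
    "How do I reset my password?",
    "Compare the pricing of plan A and plan B",
    "What is the API rate limit?",
    # ... the rest of the collected user queries
]

# Embed the queries and group them into a handful of clusters to see which
# kinds of questions dominate; labeling the clusters still needs a human or LLM pass.
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
embeddings = model.encode(queries)
labels = KMeans(n_clusters=3, random_state=0, n_init="auto").fit_predict(embeddings)

for label, query in zip(labels, queries):
    print(label, query)
```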

Quick note from me from the future: I have read the documentation of the DeepEval module that I am using for testing my RAG pipeline, and it does support retrieval tests, including recall. Unfortunately it fails for me, as my llama 3.1:8b is too weak to generate valid JSON, and the test suggests upgrading the model that the testing framework uses. I will look into whether that is possible, but I do not think it is on my current hardware. There is a possibility that I could use a virtual machine with better parameters, but only for testing the RAG; I will not upgrade the parameters for the RAG pipeline itself, as that would defeat the purpose of my thesis. Anyway, I fixed this issue by using G-Eval tests to also measure recall. It ran fine, so that is a success. Even so, my own assessment will be the final test anyway. Future me out.
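For future reference, the G-Eval workaround looked roughly like this; the criteria wording and the example test case are mine, and the exact DeepEval API should be double-checked against the installed version:

```python
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A G-Eval metric that scores whether the retrieved context actually contains
# the information needed to produce the expected answer -- a recall-style check.
recall_metric = GEval(
    name="Context recall",
    criteria=(
        "Check whether the retrieval context contains the facts needed "
        "to produce the expected output."
    ),
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
)

test_case = LLMTestCase(
    input="What is the deployment schedule?",
    actual_output="Deployments happen every Tuesday.",
    expected_output="Deployments are done weekly on Tuesdays.",
    retrieval_context=["Our team deploys to production every Tuesday morning."],
)

evaluate([test_case], [recall_metric])
```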


Inventory versus capability issues arise, and we need to analyze them by asking ourselves where the issue lies: is the data there but we are simply not retrieving it, or is the data not there at all, so it is impossible to get?
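A crude way to triage this per failing question, assuming the raw chunk texts are kept around; the keyword check is deliberately dumb and just a stand-in for a proper relevance check:

```python
def diagnose_failure(question_keywords: list[str], all_chunks: dict[str, str],
                     retrieved_ids: list[str]) -> str:
    """Rough triage: is the needed info absent from the corpus (inventory)
    or present but not retrieved (capability)?"""
    containing = {
        chunk_id
        for chunk_id, text in all_chunks.items()
        if all(kw.lower() in text.lower() for kw in question_keywords)
    }
    if not containing:
        return "inventory problem: the corpus does not seem to contain this information"
    if containing & set(retrieved_ids):
        return "retrieval looks fine: a matching chunk was returned"
    return "capability problem: the information exists but was not retrieved"
```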

Structuring and data annotation

Something to keep in mind is that the metadata of the files can help us with retrieval. Dates, authors, statuses, and locations can be crucial at times, if some of that information is included in the query, for example a query that mentions a date. We should store this data; it can be a separate DB column or a sidecar index. For some queries this can help tremendously. Handling tables can be tricky. I know at the start I was saying that we will be handling PDFs, but I think that is in the past: Confluence pages will be in txt form, so we can forget about PDFs. The Confluence pages can include tables, which will be our enemy in a way. The article suggests that we should not chunk a table, as we lose context, and should keep it structured, since it will be easier to retrieve data from it.
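A tiny sketch of what that sidecar metadata and a date-filtered retrieval could look like; the fields and the `vector_search` stand-in are my own placeholders, not from the article:

```python
from datetime import date

# Sidecar metadata keyed by chunk ID -- could also live as extra DB columns.
chunk_metadata = {
    "c1": {"author": "alice", "updated": date(2024, 3, 1), "status": "current", "space": "ENG"},
    "c2": {"author": "bob", "updated": date(2022, 7, 15), "status": "archived", "space": "ENG"},
}


def vector_search(query: str, candidate_ids: set[str], k: int) -> list[str]:
    # Dummy stand-in for the actual vector store query over the allowed candidates.
    return sorted(candidate_ids)[:k]


def retrieve_with_date_filter(query: str, newer_than: date, k: int = 5) -> list[str]:
    """Pre-filter chunks by metadata, then run the vector search on what is left."""
    candidates = {cid for cid, meta in chunk_metadata.items() if meta["updated"] >= newer_than}
    return vector_search(query, candidates, k)


print(retrieve_with_date_filter("what changed recently?", newer_than=date(2024, 1, 1)))
```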

After that the article goes over fine-tuning and re-rankers. Not really stuff that concerns us, but one cheap thing that would maybe not need high hardware capabilities: do a quick vector search, then run re-ranking on the top K results, which sounded reasonable. It is also advised to collect feedback on the answers the LLM provides, which we can later feed into the re-rankers and embeddings while training them. I have not thought about this at all and I doubt it will happen in my thesis.
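A sketch of that cheap two-stage idea with a small cross-encoder; the model name is just one commonly used example, not something the article prescribes:

```python
from sentence_transformers import CrossEncoder


def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Re-score the vector-search candidates with a cross-encoder and keep the best."""
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # small, CPU-friendly
    scores = cross_encoder.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]


# First do a quick vector search for, say, the top 50 chunks, then:
# best_chunks = rerank(query, top_50_chunks)
```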

And that is it for the article. Later it advertises a course which looks pretty nice, but I do not have time for that now, so we will leave it at that.