Martin Lejko

Writing tests for the law

/ 5 min read

Introduction

Let’s pick things up again. I spent the winter holidays writing some of the text for my thesis and got some work done, but it was not work that produced new information to push the thesis further, just writing up the knowledge I have already mentioned here. It started as a big pain, but after letting go of the quality by maybe 20% (in some places even a bit more), I was able to spit some text out. Now I have reached a point where I need more knowledge and more things to write about, which brings us back to where we left off last time: the CI/CD pipeline for testing.

DeepEval

As mentioned in the previous post, we will be using the DeepEval Python package, at least for now. I say for now because it works the way I would expect, but I may still run into some issue. My plan is to read through their tutorial, mention here everything I find interesting to use, and try to implement it at a later stage, after I have read through it all. I have already worked up some of the tests with my friend Claude, since I wanted to make sure I could run the tests against a private LLM model, which I did successfully. This GitHub issue helped a bit, as it shows how to set the model we want to use for evaluation.

Terminal window
deepeval set-local-model --model-name=llama3.1:8b \
--base-url="http://localhost:11434/v1/" \
--api-key="ollama"

Simple as that. DeepEval suggests using an API key for Confident AI so we can see our results on their website, but I do not want to do that; I want to keep my results private. That is why the testing will be done with the llama3.1:8b model. To run my tests we just need three simple commands.

Terminal window
ollama serve
ollama run llama3.1:8b
deepeval test run name_of_the_test_file.py

Just like that, we start up the model and run the tests. I wanted the tests to be separated from the actual pipeline. For now it is just one function that is initialized, builds the whole pipeline, and is then queried. Since there will eventually be multiple pipelines that we want to test and iterate on, I will have to come up with better encapsulation, but for now it is good enough. With that in place, I can proceed to write the tests I have in mind.
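
To make that concrete, here is a minimal sketch of what such a test file could look like. The `build_rag_pipeline` import and the return shape of `pipeline.query` are hypothetical stand-ins for my own pipeline function; the DeepEval pieces (LLMTestCase, AnswerRelevancyMetric, assert_test) are from their quickstart.

test_rag_pipeline.py
# minimal sketch; deepeval picks up the test_* functions in this file
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from rag_pipeline import build_rag_pipeline  # hypothetical module holding my pipeline function

pipeline = build_rag_pipeline()  # initializes and builds the whole pipeline once


def test_basic_question():
    question = "What does the thesis pipeline do?"
    # assumed return shape: the generated answer plus the retrieved chunks
    answer, retrieved_chunks = pipeline.query(question)

    test_case = LLMTestCase(
        input=question,
        actual_output=answer,
        retrieval_context=retrieved_chunks,
    )
    # threshold is the minimum score the metric has to reach to pass
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

Running deepeval test run test_rag_pipeline.py then executes every test_* function in the file against the local llama3.1:8b model we configured above.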

DeepEval tests

The big three tests are Correctness (GEval), Faithfulness (FaithfulnessMetric) and Contextual Relevancy (ContextualRelevancyMetric). I will start with these, as together they provide a multi-faceted view of system performance, which is crucial for identifying areas of improvement and for ensuring the reliability and effectiveness of the pipeline. I have opted for an architecture where all the test data is stored in one file, so it stays separate from everything else and others can create their own test data. This ensures that no private data will be leaked.

Other than that, the DeepEval documentation mentions datasets, which are basically lists of LLMTestCase objects. Maybe it would be good to use them rather than creating a one-to-one mapping between the test data in the file and the LLMTestCase objects; we could create a factory for building these objects. I will think about this and maybe implement it. Another thing mentioned on that page were goldens, which confused me a bit. From my understanding they are generated by the model itself, and all of their parameters are optional except for input, which seemed weird to me. I will probably take a hybrid approach, meaning I will have a dataset of test cases where I specify (a code sketch follows this list):

  • input
  • expected output
  • actual output
  • context
  • (optional) retrieval context
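
To show how these fields map onto DeepEval, here is a sketch of one hand-written test case run against the big three metrics. The question, answers and context strings are made up for illustration; the metric classes and LLMTestCase fields are from the DeepEval docs.

test_big_three.py
# sketch of the three metrics on a single, hand-written test case
from deepeval import assert_test
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="What retriever does the pipeline use?",                      # input
    expected_output="It uses a dense retriever over the thesis docs.",  # expected output
    actual_output="The pipeline retrieves with dense embeddings.",      # actual output from the RAG run
    context=["The pipeline uses a dense retriever over the thesis documents."],  # ideal grounding
    retrieval_context=["Dense embeddings are used for retrieval."],     # what was actually retrieved
)


def test_big_three():
    assert_test(test_case, [
        correctness,
        FaithfulnessMetric(threshold=0.7),
        ContextualRelevancyMetric(threshold=0.7),
    ])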

This way I have more control and I know exactly what goes in and what is going on. After further inspection I wanted to import the test cases from JSON or CSV straight into the dataset, but since we run the RAG pipeline every time, we do not have the actual output parameter beforehand. That would mean writing and rewriting the CSV before transforming it into the dataset, and I did not read about any other way. So I created a TestCase factory (sketched below) that takes a dictionary similar to the JSON we could have, runs the RAG pipeline with the question at hand, gets the actual output, and creates the test case object, which it appends to the dataset that we then evaluate. Pretty simple and straightforward. It is maybe a bit custom for other users, but I do not think that is a big deal.

From there I dump everything to a JSON file and let my friend Claude create an HTML report from the JSON data; I think it looks pretty good. I will need to come back to the DeepEval Synthesizer, as it allows auto-generating test cases, which I think would be a good idea to use. But since the data is still missing (I was lazy and had other things to do), I will leave that for later. I think a bit more polishing and we are pretty good to go with the testing.
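
To close the loop, here is a rough sketch of that factory and the JSON dump. The shape of the raw dictionaries, the DummyPipeline stand-in and the keys I read from it are my own assumptions; the dataset, evaluation and dump use DeepEval’s EvaluationDataset and evaluate, and the metrics run against the local model configured earlier.

build_dataset.py
# sketch of the TestCase factory: dict in, RAG run, LLMTestCase out, JSON dump at the end
import json

from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase


class DummyPipeline:
    """Hypothetical stand-in for the real RAG pipeline."""

    def query(self, question: str):
        # the real pipeline returns the generated answer and the retrieved chunks
        return "placeholder answer", ["placeholder retrieved chunk"]


def make_test_cases(raw_cases, pipeline):
    test_cases, records = [], []
    for case in raw_cases:
        answer, retrieved = pipeline.query(case["input"])
        test_cases.append(LLMTestCase(
            input=case["input"],
            expected_output=case["expected_output"],
            actual_output=answer,
            context=case.get("context"),
            retrieval_context=retrieved,
        ))
        records.append({"input": case["input"], "actual_output": answer})
    return test_cases, records


raw_cases = [  # this would normally live in the separate test-data file
    {"input": "What is the thesis about?",
     "expected_output": "Testing RAG pipelines.",
     "context": ["The thesis covers testing of RAG pipelines."]},
]

pipeline = DummyPipeline()
test_cases, records = make_test_cases(raw_cases, pipeline)
dataset = EvaluationDataset(test_cases=test_cases)
evaluate(test_cases=dataset.test_cases, metrics=[FaithfulnessMetric(), ContextualRelevancyMetric()])

# dump the inputs and answers so the HTML report can be built from this JSON later
with open("results.json", "w") as f:
    json.dump(records, f, indent=2)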