Martin Lejko

Starting with knowledge

4 min read

Introduction

Hello again! It has been a week since I started my journey into the world of private LLMs, so it is once again time to write down my thoughts and findings. This week started with a focus on knowledge embedding techniques used with LLMs. I watched more YouTube videos and read blog posts to understand how knowledge embedding works and why it matters for improving the performance of LLMs. I mostly encountered two terms that I will need to dive deeper into: knowledge graphs and vector databases. They seem to be the building blocks of knowledge embedding, and I am excited to explore them further. So that is what I will write about today, at least to some extent.

Techniques for Knowledge Embedding

Knowledge embedding is a technique for representing knowledge in a structured format that an LLM can easily work with. The two main approaches I came across are knowledge graphs and vector databases. The guy in the video I watched, and wrote about in the previous diary post, used a vector database: he generated OpenAI Embeddings for his documents and indexed them with Faiss for fast retrieval. The problem is that OpenAI's terms of use are not very clear to me at the moment. They are not straightforward, and I will have to dive deeper into them, because I do not particularly want to take any chances. They state that submitted data can be used to improve their services and models, which is concerning for me. Since this is the first option I have encountered, I will try to explore alternatives as well. Faiss itself is a good start: it is open source under the MIT license, and because it runs locally on my machine, it does not use my data for anyone else's purposes. I will just have to see how it works, but it looks promising. He also pushed the results of the queries the LLM produced to some third-party site. It is a nice touch that I might implement at the end, but I would build it from scratch, with some security measures in place.
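To make the embedding step more concrete, here is a minimal sketch of how I imagine it, using the official openai Python client. The model name and the sample texts are my own placeholders, not something taken from the video.

```python
# Minimal sketch: turning text snippets into embedding vectors with the
# OpenAI API (assumes the `openai` package is installed and an
# OPENAI_API_KEY environment variable is set; the model is my assumption).
from openai import OpenAI

client = OpenAI()

documents = [
    "Knowledge embedding represents documents as numerical vectors.",
    "Faiss can index those vectors for fast similarity search.",
]

response = client.embeddings.create(
    model="text-embedding-3-small",  # placeholder model choice
    input=documents,
)

# Each input text gets one vector with hundreds to thousands of dimensions.
vectors = [item.embedding for item in response.data]
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

These vectors are what later gets stored in the vector database and searched by similarity.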

Vector Databases

Vector databases store and retrieve vectors efficiently. They are essential for knowledge embedding because they let an LLM pipeline access relevant knowledge quickly. The guy in the video used Faiss, an open-source library for efficient vector similarity search developed by Facebook AI Research, designed to search large-scale collections of vectors. Which technique to use depends on the data we have; turning each piece of data into a vector with hundreds of dimensions and searching by similarity is one of the options.
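Here is a small sketch of what indexing and querying with Faiss could look like, assuming the vectors from the embedding step are available as a NumPy array. The dimension and the random data are placeholders of mine, not values from the video.

```python
# Minimal sketch: indexing embedding vectors with Faiss and running a
# nearest-neighbour query (assumes `faiss-cpu` and `numpy` are installed;
# the 1536-dimensional random vectors stand in for real embeddings).
import faiss
import numpy as np

dimension = 1536  # must match the embedding model's output size
vectors = np.random.rand(100, dimension).astype("float32")

index = faiss.IndexFlatL2(dimension)  # exact L2 search, fine for small collections
index.add(vectors)

query = np.random.rand(1, dimension).astype("float32")
distances, ids = index.search(query, 5)  # ids of the 5 closest stored vectors
print(ids[0], distances[0])
```

For larger datasets Faiss also offers approximate indexes, but the flat index above is the simplest starting point.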

Knowledge Graphs

The other method I saw being used is knowledge graphs. They represent knowledge as a graph, where entities are nodes and relationships are edges. Knowledge graphs are useful for capturing complex relationships between entities and can be used to enhance an LLM's performance, although I do not know yet whether they will be particularly useful for my case. The data will show the way: one of the most important things in this field is the quality of the data we use, and if the parsed data turns out to be complex yet structured in a way a knowledge graph can capture, this approach will jump to the top of my agenda. This gives me an idea for a segment in my thesis where I compare the two methods and see which one works better for my case.
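To keep the comparison concrete, here is a tiny sketch of what a knowledge graph could look like in code, using networkx. The entities and relations are invented examples, not anything from my actual data.

```python
# Minimal sketch: a toy knowledge graph with networkx, where entities are
# nodes and labelled edges are relationships (the triples below are
# invented examples, not real company data).
import networkx as nx

graph = nx.DiGraph()

# (subject, relation, object) triples
triples = [
    ("Faiss", "developed_by", "Facebook AI Research"),
    ("Faiss", "licensed_under", "MIT"),
    ("Knowledge embedding", "uses", "Vector databases"),
    ("Knowledge embedding", "uses", "Knowledge graphs"),
]

for subject, relation, obj in triples:
    graph.add_edge(subject, obj, relation=relation)

# Simple lookup: what do we know about "Faiss"?
for _, obj, data in graph.out_edges("Faiss", data=True):
    print(f"Faiss --{data['relation']}--> {obj}")
```

Unlike a vector index, a graph like this makes the relationships themselves queryable, which is exactly the trade-off I want to compare in the thesis.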

Onto the Next Step

I will continue being Dora the Explorer and dive deeper into knowledge embedding techniques, although I will postpone that a bit and first have a look at the data I will be working with. As of now I do not have access to the company data, but that will change in the upcoming weeks. I know there will be PDFs and Docs to parse, certainly with some mathematical equations, though I do not know about their structure yet. That does not really matter for now: next time I will create my own test data from mathematical books or PDFs and see which parsers I can use, along with their limitations and performance. Until next time!