Sometimes the information that the model needs to reply to a prompt is recent and without that, the response is going to be wrong. Sometimes the information is private such as email or internal documents of a company and so on … therefore LLM has not seen the data before and when we ask a question about that, it is not going to give us the correct answer. Therefore we use RAG to provide the new information for the model so that it can respond more specifically and correctly.

Here is a picture showing how RAG works compared to normal LLM use:

The user experience is the same as chatting with an LLM model. However this side route helps the LLM to find the relevant and accurate information.

Now lets see how RAG retrive the information from knowledge base. When a user ask a question, the retriver should look for the relevant document and after finding the relevant docs, a new prompt is generated with the relevant data included and then the new prompt is given to LLM to respond. However searching for a relevant documents is not an easy task. There 2 serach approaches:

1- Keyword search: in which the retriver look for the exact words from the prompt in the knowledge base. 2- Semantic search: The retriver looks for the documents that have similar meaning to the prompt.

in any search method that was mentioned above, the retriver will return 20 to 50 relevant documents, however this documents have a different ranking in each search method. After that the model will use a metadat filtering and only allows some of the relevant documents to be passed on to the LLM. The reason for filtering is that sometimes the person who is prompting is from an specific department and they dont need to receive docs from other departments in a company. After filtering all the passed documents coming from both search methods are going to be combined and then ranked again. Then only the documents that bave been ranked highly withh be passed to the LLM.

Here is a picture showing that:

When both Keyword and Semantic serach is used at the same time, the search method is called Hybrid search.

Metadata filtering:

Let’s assume the documents we have in the knowledge base, are news articles, each doument is going to have multiple tags, such as date, published at, url, author, main article, title and so on. In this case, based on one or multiple metadata, we can filter the documents and find what we need easily. In RAG, the retrival is NOT done by metadata filtering. However the documents are being filtered not using the information from the prompt BUT using the information based on the user demographic or other user info. For example if the user is not a paied user, and is prompting to receive info, first all the paid documents are filtered because the user is not a paid user and there he/she can not receive info from the paid articles.

Metadata filetring alone is not very good for retriving documents and it needs to be paired with other techniques in order to be more effective.

Keyword search methods:

TF-IDF

In this search technique each document is treated like a bag of words. meaning that the order of words are not important. I simply looks at the docs and each docs what have more words like the prompt, it is relevant. when a prompt is sent to the retriver, a sparse vector is made from the prompt and all the documents and then we are ready to start ranking and scoring each doc. each word from the prompt is considered as a keyword, then each document is looks and is given a +1 score for every keyword is conatined/ look athe picture below.

if we score each document once for each keyword it contains, we will miss the frequency. If we want to include the frequency in our scoring, then it is called TF scoring, look at the picture below: and we need to normalize the score by the length of the document too: But there is still another probelm, rare words which are important might apear less frequent but they are important. to correct for this we do IDF (inverse document frequency): So now instead of Term frequency (TF), we use TF-IDF:

However modern approaches are use BM25 approach instead of TF-IDF appraoch which is a refined version of TF-IDF. Let see what is that:

BM25 (Best Matching 25):

In TF-IDF method, if the keyword is found repeatedly in a document, then that document is more important than other documents however the amount of being important is the same as the frequency. for example if the word is found 20 times in a doc and 10 time in another doc, then the first one is twice more important than the second one. In BM25, this is not the case, they first one is more important but not twice … they have something called term frequency saturation.
in TF-IDF method, the longer documents are normalized too aggresively, in BM25 they modify that too

BM25 is the most commonly used search algorithm that has been used for decades.

Now lets talk about the cons of keyword search. In the keyword search if two words are synonyms like “happy” and “glad”, they will be ignored becasue keyword search looks for the exact keyword not the synonyms. or python the programing language and python the snake are going to be considered the same words. This is the reason why there is another method for searching the relevant doumnt called “Semantic search”.

Semantic Search:

Semantic and keyword search from the high level are working the same way. The prompt and the documents will be converted to vectors and then vectors are going to be compared in order to assign scores to each document. But the way that vectors are created are different. There are many embedding models for words, sentences or a whole document. In the embedding space, similar concept tends to be closer to each other. for example:

There are different ways of finding the similarty of two point in the embedding space. There are eucleadian, cosine, dot product and so on … then you can find the most relevant docs by just finding the distance between vectors. But how these embedding are really created. For example two words that have similar meaning, such as “Hello” and “Good morning”, are positive pairs and two words that have dis-simialr meaning such as good and trumpoline are negative examples and they need to be far from each other. One way of create an embedding is to have lots and lots of positive and negative examples of positice and negative pairs.

First all the vecotrs of all the words are randomly assigned, then based on the loss function, two vectors are compared if they are being far apart, then the interanl parameters going to be updated to pur these two pairs closer to each other. This is call contrastive training. This process is repeated many times, until similar words are put close to each other and the model can not be improved further.

In the context of RAG, vector embeddings are used for: Capturing Meaning: Vector embeddings act like a map for text. They convert words and sentences into positions in vector space that capture meaning. These vectors can then be used to locate information matching a query. Comparing Similarity: When a prompt is received, it is converted into an embedding vector of its own. Then, the similarity between this prompt’s vector and other vectors in the database can be calculated. This helps identify texts closest in meaning to the prompt.

One thing is that when embedding the documents, then we need to make sure the token window is alrge enough otherwise after ceratin amount of text, the rest of the text is not being read by the mdoel to generate the embedding. This is an issue that we are going to talk about in a bit. In the hybrid search, when the ranked lists of document are provided using keyword or semantic search, the final lists need to be fused together to generate a final ranked list. One method to fuse and rerank these lists is called Reiprocal ranked fusion (RRF). :

You can also wieght the fusion to have a more weight for semnatic list than the exact keyword search final list.

Evaluating retrival:

In order to evaluate a retrival, you will need 3 things: 1- The prompt, 2- the retrived documents, 3- the ground truth (all the relevant docs that your retriver should return). Two commonly used metrics to evaluate retrival are: Precision and Recall Precision: Total number of relevant retrived docs / total number of retrived docs Recall: Total number of relevant retrived docs/ total number of relevant docs

When Recall is equal to 1, it means that all the relevant docs are retrived, when the precision is equal to 1 it means that all the retrived docs are the correct one, the more retrived we have that are incorrect, the precision will go down. but as long as recall is equal to one, it means that all relebant docs are among the retrived ones. There are several other metrics to evaluate the retrival such as MAP and mean reciprocal rank which I will not expalin here but the goal of all of these metrics is to score our retrivaal.

okay so far then we learned about the Retrival methods and metrics for evaluating them. lets recap again, in the keyword search method, the query become tokenized as well as the knowledge base, then relevant documents will be retrived based on searching the keyword tokens. documents with more keyword in them will be more relevant in this search.longer documents will be normalized ususally using bm25 technique. In semantic search however, we usually will use an embedding model and its embedding space. First we use the model encoder to create embeddings from the query. we can encode the douments in the same space. Then we will find the similarity or the distance between the query vector and the documents. Then we find its closets vectors. In the hybrid method, then the list of relevant docs using semantic search and keyword search will be combined with Reciprocal ranking fusion technique.

When searching into very large databases with milion or bilion documents in them, the retrival searhes become super slow. so what shoudl be done in this case? in the semantic search that we have vector space and we created the embedding point for the knowledge and the prompt, for every search we need to calculate the distance between the prompt vector and ALL the document vectors. This is time consuming. here is a picture of what is happening :

so instead of KNN, K nearest neighbor search, there are other methods in which the accuracy of retrival will be scarified a little bit but they are faster. These methods are called ANN, Approximate neigherst neighbors. one of the ANN methods is the following:

Navigable small world in this method a proximity graph is build first between all the points in the knowledge base embedding space, there are edges between each point and its neighest neighbors. Then query is embedded, a RANDOM node from the knowledge base is chosen, then the distance between that point and query is calculated and then the distance between the doc node’s NN and the query point is calculated, if one the neighbors has smaller distance we move to that point and this process continues until no closer neighbor is found. Then another RANDOM point will be chosen and this continues until few neigherst NN to the query are found. This is not garanteed to find the actual nearest neighbors of the query point but is very fast and the results are acceptable.

Hierarchical navigable small world In this method the way that the proximity graph is build is different:

Also the way the search is done:

in the production RAG systems, the vector databases are used which are optimized and have APIs to perform the retrival searches much faster and optimized. One of these vector databases called Weaviate. You can perform Bm25 serach, Semantic serahc, filtering and hyrbid search using this vector database. In any RAG system, few more adaptiations are required for the model to perform at scale. In the following, we will introduce these few adaptations:

Chunking:

Imagine that your knowledge base has 1000 books. When you vectorize the knowledge base, each book going to be represented by a vector. but the specific infor of a chapter or a paragraph is going to be lost becasue the entire book is shown with a single vector and a lot of details are lost. instead the vector is an average representation of the book. Therefore the retrival is not going to be as good too for the obvious reasons. So instead we can chunk the books to pages, paragraphs or even sentences. In this case instead of 1000 books in our knowledge base, we will have millions of paragraphs for example but we dont care becasue the vector base can search through all that easily. for the chunk size of for example 250 characters, it means that from character 1 to 250 are going to be chunk 1, from 250 to 500 are going to be chunk 2 and so on. However since the sentences might be cut in the middle or the chunk might happen in the middle of the word and so on, to fix this, we can allow overlap between the chunks. for example an overlap of 25 character between the chunks.

The chunch size can be fixed or variant. To recap, we chunk the docs, then our vectorbase, vectorize the chunks and then perform all sorts of searching and give us back few relevant documents that are going to be added to the prompt to be given to the LLM to generate the final response.

Issues with chunking:

When you chunk a document, you might anyway loose the context and that can be sometimes completely change the correct answer. For this to not happen there are advanced techniques for chunking which I explain in a bit:

Semantic chunking In this method a chunk is created, then the following chunk is created, if they are semantically related, then they will be merged to a one chunk and so on. This is good but very expensive becasue we repeatedly should generate vector for a given chunk and so on

LLM chunking You give the docs to LLM and ask it to generate chunks with variable sizes based on their context. You can ask the LLM to add context lables for the chunks too.

Query Parsing:

Most of the times, the user query is not very clear and a bad, messy or ambigues query can result in a bad retrival in our RAG system. So it is always recommended to improve the use query before giving it to the RAG system. This can be as simple as just ask a LLM model to clean up and rewrite the query for us. or it can be more complex. But it is always recommended

Bi-Encoder:

Creates separate semantic vectors for pormpt and the documents using an embeding model encoder. then it performs ANN to find the best match and so on … like we explained before. This can be very fast becasue the document encoding can be done in advance and later on it is only the user query that is going to be encoded. So this capability makes this method fast.

Cross-Encoder:

Combines the prompt and the documents and then create the emebdding for them. It can gives us a relevance score when it combines the prompt with the document. Overall it results better than Bi-Encoder however we can not pre-calculate the embedding of the documents and we need to do at the time we receive the user query

ColBERT:

It combines somehow the Bi-Encoder and Cross-Encoder advantages. You still generate the embedding vectors of the documents ahead of time like in Bi-Encoder method. But try to catch a interaction between prompt and the documents like in cross-encoder. In this method, instead of one vector for prompt and one vector for each chunk of the document, the model generates one vector for each token. This means that prompt will not be represented with one vector but multiple vectors, each for each token. Then the same thing will be done for the documents, per token we will have vectors. Then a similarity score will be computed for prompt tokens with respect to each doument. if two tokens are similar, they will have a higher score. Then a document will have a higher score if its tokens are more similar to the prompt tokens. this score is called SimScore.

About LLMs:

Transformers:

In order to dive deeper and build better RAG systems, we need to understand how LLM works and their architecture. LLMs are made of transformers. a transformer has an encoder and a decoder. The first paper about transformers was published in 2017 called “Attention is all you need”. It was a transformer focused on machine translation. The encoder received a text for example in German language and developed a deep understanding of the paragraph’s meaning. the decoder, then will use this deep understanding of the german paragraph to generate an english version of it. Each sentence in a text first will be tokenized. Then each token will be assigned two sets of vectors. The first set of dense vector is semantic vectors which show the meaning of the token and the second set is the positional vectors which refer to the positon of the token in a sentense. Again EACH token would have 2 vector representation. The relationship between tokens are based on attentions. The attention mechanism is done using the atten-heads. Each transformer can have multiple attention-heads. In each attention-head each token pays attentions to other tokens based on some relationships. So after attention, the vectors go to the feed forward layer and for each token, a new set of vector which represents the meaning of the token is generated. Then this second guess will be feed again to the input embedding to go through the attention layer and then feed forward to generate a third guess for that token, this process is repeated over and over and sometimes up to 64 times in LLMs. see the figure below:

Then the model is ready to generate new tokens based on the refined token it generated after many iterations in transformer model. For each given token the model calculates the probability of the next token for all the tokens in its vocabulary. then a new token will be generated. Then this token is going to be added to the end of the sentence and then to generate a new token, the model will use the newly generated token and the previous one to make sure it will generate a new token that make sense based on the generated token and its context with the previous one. This will continue until it hits the end of a pharse and a task is compeleted like answering a question.

LLM Sampling Sterategy:

Since every token in the vocab has a chance to be generated as the next token, one of the most important aspects of LLM is how to handel randomness in generating next token. If the distribution of sampling next token has an obvious choice then it is easy to pick the best token but when the disributin is flat the nex token can be dfficult to choose.

Greedy Decoding: Only takes the token with highest probability as the next token. this method is deterministic and has its pros and cons. It can generate generic sentences all the time and sometimes get stuck in a loop of generating the same thing over and over. But this can be a good choice if the sentence is a code. Then being deterministic is a good thing in coding and debugging. Temprature is somehting that can change the randomness. look at the figure bellow:

Top-k sampling: In this method you allow LLM to only sample from the k candidate token with highest probabilities. Top-p sampling: In this method you allow LLM to choose from the tokens with accumulate probabiliy to a certain amount p.

LLM Charactristics:

Model Size: How many params LLM has. Small models have 1-10 billion params while larger models have between 100-500 params. Cost: The amount for models charge the users per token to use Context window: the max number of tokens LLM can process split both between the prompt and the completion. Latency and speed: the time to generate the first token and the time to generate the next token Training cutoff date: last point in time the model’s training data

Prompt Engineering:

In order to get the best from your LLM, you need to write a good prompt. There are many different ways of prompting to imporve the output quality of a LLM model. OpenAI message format: In this way of prompting the prompt is written in a series of user, system and assistant messages with roles and content. In the “System” message, usually goes high-level instruction for the model, for example the date the model was trained on, so the model knows that about what question it might not have the answer if the question is asked about something happend after the date of training and so on. It can also contain the tone and the personality that the user likes to receive the answer. how LLM should communicate, not respond with harmful stuff and so on. In-context learning: You add few examples to your prompt to help LLM learn the tone and the structure of the expected response.It can be just one example which is called on-shot learning. If you include few examples more than 1, then it is called few-shots learning. You can do this by RAG too. For example if you are building a costumer service chatbot, you can use RAG to retrieve few examples of succesful chat between the user and the customer assistant and then you can use them in the prompt. Encourging reasoning: Asking LLM to think aloud and write the step by step reasoning before answering the prompt, usually anhances the output. Chain of thought: Asking LLM to write down all the steps it is going to take to arrive to the answer

but be careful because asking the model to print all the steps and so on … it is going to faster fillup your conetxt window available in the model you are using.

Handling hallucination:

There is no perfect way of handeling hallucination and there is no garantee that hallucination never is going to heppen. However we can always try to make the model better. RAG ietself is a way of decreasing ahllucination.

Self-consistency method: In this method, you use the model repeatedly generate the response to a prompt and check consistency. The idea is that if it is hallucinate, then the response is going to be not consistent. Reduce hallucination using RAG: The LLM can give factual response ONLY based on the retrieved documents. Citation generation: Asking LLM to give citation for the source of responses. however the problem can be that LLM halluciate the citation itself.

Evaluating LLM performance:

Response relevancy: Whether a response is relevant to the user prompt or not. One way to do this is to generate multiple prompts that the response can be the response to those prompts. Then a semantic vector will be generated for each of those prompts and the original prompt. Then a cosine similarity measure will be calculated for each of those vectors and the vector of the original prompt. Finally all these similarities will be averaged and a final score which is called the response relevancy will be assigned. Faithfullness metric: To check if the LLM response is consistent with the retrived docs. One way is to find all the factual parts of the answer and then check how many docs in the retrived docs are those fact are from. The percentage of the correct claims are the faithfullness most of the metrics are using some where in their evaluation, LLM-as-a-judge. either that or human-as-a-judge should be used to evaluate the outcome of an LLM model.

Agentic RAG:

You have a RAG model where agents compelets its steps. Each step is done by calling an LLM model sperarately. Complex task are divided to many steps and each step is done with a LLM call. the LLM model used in different steps can be different.

RAG Evaluation Sterategies:

In the context of RAG systems, telemetry is key for monitoring and optimizing performance. By collecting and transmitting data on the system’s operations, such as spans (individual steps) and traces (full workflows), telemetry provides a way to watch how the system retrieves, processes, and generates information. This visibility helps identify bottlenecks and diagnose issues, improving system efficiency. In telemetry, a span represents a single operation or task within your system. It’s like a snapshot of a specific action, recording when it starts and ends. Spans also include details like what the task is doing and any important events that occur. By tracking spans, you can see how long operations take and spot any issues, helping you understand and improve your system’s performance. Using some platform such as Phoenix and Weaviate can help monitoring the performance easy.

Building a costumized dataset is another way of evaluating your system and trying to improve it for real world prompts. How to create a costumized dataset? the costumized dataset is a collection of prompts your system has previously received as well as all the information you collected from the journey of that prompt through your system. You can save only the input prompt and the ansswer of the LLM, or you can save tons of other stuff. Such as the documents your vector database retrived, documents before and after the ranking, the result of the query rewriter, augmentted prompt and so on! Logging this things will help you find the source of problems later on when they arise

Quantization:

LLM models use 16-bit memory for each param they save in the memory. for the models with 1 bilion to 1-trilion params, a huge amount of memory is required. with quantization, it means that smaller amount of bits going to be used to save each params. for example 8 bits instead of 16 bits. With this a little bit of information will be lost but still the model can perform quit well

Cost vs response quality

Smaller models one way of reducing cost is experiment with smaller models. whether it is the core LLM to generate the final response or router LLMs in an agentic system. You might be able to achieve similar final quality with a smaller and therefore cheaper models. a model can be smaller becasue of the few parametrs it has or it is because it has been quantized and therefore less bits of memory are used in those models. Ususally smaller models and smaller prompts can save a lot of cost and it is always important to understand where is the main cost is coming from to try to troubleshoot it

Latency vs response quality

For example adding a retriver to your system, adds latency and whatever else you add to make a better retrival and reranking and so on is going to add to the system latency. However it is alwyas impotrant to rememebr that the real colprit is almost always the calls to your LLM model. it is true that retriver is going also to impact the latency but it is almost much less than the calls to LLM. Therefore always smaller LLMs or quantized LLMs are going to be much faster. Even if you dont want to change the main LLM for the final response, maybe for the router LLMs that you use before the final LLM call, can be replaced with these smaller models.

Build Your Own AI Agent

Recipe Recommender System with VAE