ChatGPT RAG Pipeline
In this lab, you’ll practice implementing a RAG pipeline using JSON. When you’re finished, you’ll have a working generative AI system that can parse JSON data.
Introduction
Welcome to the Code Lab on building a Retrieval-Augmented Generation (RAG) system from scratch! RAG is a powerful technique that enhances large language models (LLMs) by providing them with external, up-to-date, or proprietary information at query time.
Instead of relying solely on the model's pre-trained knowledge, a RAG system first retrieves relevant documents from a knowledge base and then passes that information to the LLM as context to generate an answer.
You will build a Python script that can read a local text file, break it into chunks, convert the chunks and a user's question into vector embeddings using OpenAI's API, and then use cosine similarity to find the most relevant chunk from the document.
-
Challenge
Configuring parameters
Start by opening up RAG_notebook.ipynb. The first and only configuration change you have to make is swapping out <YOUR API KEY HERE> with the API key generated at the top center of your screen; OPENAI_API_KEY is the variable you are meant to define. Other key configuration variables you should identify, and could alter if desired, are the chunking settings.

A larger chunk size means each retrieved chunk contains more surrounding context. That can help when answers depend on nearby paragraphs, but it also makes each chunk less focused. With smaller chunks, retrieval is usually more precise because each chunk is about one narrower idea. The downside is fragmentation: the model may retrieve a chunk that contains only part of the answer and miss the surrounding explanation.
Overlap helps avoid losing information at chunk boundaries. If an important sentence starts near the end of one chunk and continues into the next, overlap lets both chunks contain some of that content. If overlap is too small, boundary cuts can hurt retrieval and answer quality. If overlap is too large, you get a lot of near-duplicate chunks, which wastes storage space and causes redundant results.
TOP_K is responsible for how many chunks to retrieve: too high and you waste time grabbing irrelevant chunks, too low and you risk missing the desired chunk.
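As a rough illustration, the configuration block might look like the following sketch. The variable names other than OPENAI_API_KEY are assumptions based on common chunking setups and may differ slightly from the notebook's actual names:

```python
import os

# Your API key; in the lab you paste it in place of <YOUR API KEY HERE>.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "<YOUR API KEY HERE>")

# Chunking settings: larger chunks carry more context, smaller chunks
# retrieve more precisely; overlap guards against boundary cuts.
CHUNK_SIZE = 500      # characters per chunk
CHUNK_OVERLAP = 100   # characters shared between adjacent chunks
TOP_K = 3             # how many chunks to retrieve per question
```

Note that the overlap must stay smaller than the chunk size, or chunking will never advance through the document.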
Challenge
Reading and embedding files
In a RAG system, chunking and embedding make large documents searchable by meaning instead of by keywords.
Chunking text breaks a long document into smaller pieces so the retrieval system can find the specific section relevant to a user’s question. If you embed an entire document as one vector, the model can only retrieve the whole thing, even if only a small paragraph contains the answer. Smaller chunks improve retrieval precision and keep the prompt size manageable. For section 3.1 you will run code which is designed to read through a document and chunk the text, then embed the chunks and store them in indexed objects to make retrieving a particular chunk easier. It is important that embedded chunks stay tied to their raw chunks to facilitate returning the proper documentation.
In your case, you will create an object with the raw text chunks using the chunking code from 3.2, then parse through the indexed objects to create the embedding for each chunk, pairing them within the same object for easy lookup. Finally, using code from 3.3, you will embed each of the documents into word vectors. A word vector (or embedding vector) is a numeric representation of text that captures its meaning so it can be compared mathematically. Once the chunks are embedded, you can compare the similarities between embedded questions and the context within chunks to retrieve the most similar documents.
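A minimal sketch of this chunk-and-index step might look like the following. The function names here are hypothetical, and embed_text is a stand-in for the OpenAI embeddings call used in the notebook:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of roughly chunk_size characters."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

def embed_text(text):
    # Placeholder for the real OpenAI embeddings call, e.g.
    # client.embeddings.create(model="text-embedding-3-small", input=text)
    return [float(len(text))]  # stand-in vector, for illustration only

def build_index(text):
    """Pair each raw chunk with its embedding and an id for easy lookup."""
    return [
        {"chunk_id": i, "text": chunk, "embedding": embed_text(chunk)}
        for i, chunk in enumerate(chunk_text(text))
    ]
```

Keeping the raw text, the embedding, and a chunk_id in one object is what lets you map a high-scoring vector back to the exact passage it came from.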
BONUS: Word vectors actually live in high-dimensional spaces, typically hundreds to a few thousand dimensions, where each dimension is a continuous value; together they form geometric relationships that encode semantic and syntactic structure learned from training data.
Because of this, you can use math to determine the similarity of chunks of text based on semantic and syntactic features, so in cases where a document references the exact same material without using any of the same keywords, you can still retrieve the proper document.
-
Challenge
Comparison of embeddings
To identify which chunk of the text is most applicable to the question being asked, in 4.1's code you will determine the similarity between the words in the question and the words in the chunks. The most common similarity scoring function used in production-level RAG systems is cosine similarity, which you will use in this lab.
Similarities are scored on the semantic and syntactic meaning of words, allowing for better comparisons than keyword searching. Because the context of a question can never be analyzed perfectly, the model should return the best n results so the user can check the relevant parts of the document and verify that the AI response is correct, especially with large RAG corpora.
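The cosine-similarity comparison described above can be sketched in plain Python. The notebook may use numpy instead, and the function names here are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(question_embedding, index, k=3):
    """Score every chunk against the question and return the best k."""
    scored = [
        (cosine_similarity(question_embedding, item["embedding"]), item)
        for item in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Returning the top k scored pairs, rather than a single winner, is what lets the user inspect several candidate passages when verifying the model's answer.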
-
Challenge
Prompt engineering RAG driver
To start with a prompt for RAG systems in 5.1, you will need to compare and identify the top N most relevant chunks. You will do this by applying your similarity scores to the embedded chunks and appending the best matches to the actual user prompt to answer the user's question.
It is valuable to set up RAGs this way to minimize the amount of input tokens being sent to GPT as well as reduce the size of the prompt to help reduce instances of hallucinations.
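Assembling the final prompt from the retrieved chunks might look like the following sketch; the template wording is an assumption, not the notebook's exact prompt:

```python
def build_prompt(question, top_chunks):
    """Join the retrieved chunks into a context block ahead of the question."""
    context = "\n\n".join(chunk["text"] for chunk in top_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Because only the top N chunks go into the prompt, input token count stays small regardless of how large the source document is, which is exactly the cost and hallucination benefit described above.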
In 5.2, you will gather these chunks and their locations in the larger document by chunk_id. This will make additional verification easier, if that's later desired. This final cell is simply the driver that creates the RAG system and then parses questions sent to it, showing the topChunks. In production environments you will not need to embed documents every time, and the functioning RAG should be separated from backend functionality. Reading files into memory and indexing them into chunks before embedding will always be the first steps. From there you will embed questions and alter the prompt depending on the similarity scores. In production, it is common to set a score threshold so the model does not respond when confidence in the similarity falls below that threshold.
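That kind of score threshold can be sketched like this; the 0.75 cutoff and the function name are assumptions, and the right value depends on the embedding model you use:

```python
SCORE_THRESHOLD = 0.75  # hypothetical cutoff; tune per embedding model

def answer_or_decline(scored_chunks, threshold=SCORE_THRESHOLD):
    """Keep only chunks whose similarity clears the threshold.

    scored_chunks is a list of (score, chunk) pairs. Returns None when no
    chunk is confident enough, so the caller can decline instead of guessing.
    """
    confident = [(score, chunk) for score, chunk in scored_chunks
                 if score >= threshold]
    if not confident:
        return None
    return confident
```

Declining to answer on low scores trades a little coverage for a large reduction in confidently wrong responses.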
Once you receive the response from the model, you can add to the response key features such as links to the different chunks used to answer the question.