ChatGPT RAG Pipeline
In this lab, you’ll practice implementing a RAG pipeline using JSON. When you’re finished, you’ll have a working generative AI system that can parse JSON data.
Introduction
Welcome to the Code Lab on building a Retrieval-Augmented Generation (RAG) system from scratch! RAG is a powerful technique that enhances large language models (LLMs) by providing them with external, up-to-date, or proprietary information at query time.
Instead of relying solely on the model's pre-trained knowledge, a RAG system first retrieves relevant documents from a knowledge base and then passes that information to the LLM as context to generate an answer.
You will build a Python script that can read a local text file, break it into chunks, convert the chunks and a user's question into vector embeddings using OpenAI's API, and then use cosine similarity to find the most relevant chunk from the document.
-
Challenge
Configuring parameters
Start by opening up RAG_notebook.ipynb. The first and only configuration change you have to make is swapping out <YOUR API KEY HERE> with the API key generated at the top center of your screen; OPENAI_API_KEY is the variable you are meant to define. Other key configuration variables you should identify, and could alter if desired, are the chunking settings.

A larger chunk size means each retrieved chunk contains more surrounding context. That can help when answers depend on nearby paragraphs, but it also makes each chunk less focused. With smaller chunks, retrieval is usually more precise because each chunk is about one narrower idea. The downside is fragmentation: the model may retrieve a chunk that contains only part of the answer and miss the surrounding explanation.
Overlap helps avoid losing information at chunk boundaries. If an important sentence starts near the end of one chunk and continues into the next, overlap lets both chunks contain some of that content. If overlap is too small, boundary cuts can hurt retrieval and answer quality. If overlap is too large, you get a lot of near-duplicate chunks, which wastes storage space and causes redundant results.
TOP_K is responsible for how many chunks to retrieve: too high and you waste time grabbing irrelevant chunks, too low and you risk missing the desired chunk.
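As a rough illustration, the configuration block might look like the following sketch. The variable names other than OPENAI_API_KEY are assumptions based on common chunking setups and may differ slightly from the notebook's actual names:

```python
import os

# Your API key; in the lab you paste it in place of <YOUR API KEY HERE>.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "<YOUR API KEY HERE>")

# Chunking settings: larger chunks carry more context, smaller chunks
# retrieve more precisely; overlap guards against boundary cuts.
CHUNK_SIZE = 500      # characters per chunk
CHUNK_OVERLAP = 100   # characters shared between adjacent chunks
TOP_K = 3             # how many chunks to retrieve per question
```

Note that the overlap must stay smaller than the chunk size, or chunking will never advance through the document.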
Challenge
Reading and embedding files
In a RAG system, chunking and embedding make large documents searchable by meaning instead of by keywords.
Chunking text breaks a long document into smaller pieces so the retrieval system can find the specific section relevant to a user’s question. If you embed an entire document as one vector, the model can only retrieve the whole thing, even if only a small paragraph contains the answer. Smaller chunks improve retrieval precision and keep the prompt size manageable. For section 3.1 you will run code which is designed to read through a document and chunk the text, then embed the chunks and store them in indexed objects to make retrieving a particular chunk easier. It is important that embedded chunks stay tied to their raw chunks to facilitate returning the proper documentation.
In your case, you will create an object with the raw text chunks using the chunking code from 3.2, then parse through the indexed objects to create the embedding for each chunk, pairing them within the same object for easy lookup. Finally, using code from 3.3, you will embed each of the documents into word vectors. A word vector (or embedding vector) is a numeric representation of text that captures its meaning so it can be compared mathematically. Once the chunks are embedded, you can compare the similarities between embedded questions and the context within chunks to retrieve the most similar documents.
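A minimal sketch of this chunk-and-index step might look like the following. The function names here are hypothetical, and embed_text is a stand-in for the OpenAI embeddings call used in the notebook:

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of roughly chunk_size characters."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

def embed_text(text):
    # Placeholder for the real OpenAI embeddings call, e.g.
    # client.embeddings.create(model="text-embedding-3-small", input=text)
    return [float(len(text))]  # stand-in vector, for illustration only

def build_index(text):
    """Pair each raw chunk with its embedding and an id for easy lookup."""
    return [
        {"chunk_id": i, "text": chunk, "embedding": embed_text(chunk)}
        for i, chunk in enumerate(chunk_text(text))
    ]
```

Keeping the raw text, the embedding, and a chunk_id in one object is what lets you map a high-scoring vector back to the exact passage it came from.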
BONUS: Word vectors actually live in high-dimensional spaces, typically hundreds to a few thousand dimensions, where each dimension is a continuous value; together they form geometric relationships that encode semantic and syntactic structure learned from training data.
Because of this, you can use math to determine the similarity of chunks of text based on semantic and syntactic features, so in cases where a document references the exact same material without using any of the same keywords, you can still retrieve the proper document.
-
Challenge
Comparison of embeddings
To identify which chunk of the text is most applicable to the question being asked, in 4.1's code you will determine the similarity between the words in the question and the words in the chunks. The most common similarity scoring function used in production-level RAG systems is cosine similarity, which you will use in this lab.
Similarities are scored on the semantic and syntactic meaning of words, allowing for better comparisons than keyword searching. Because the context of a question can never be analyzed perfectly, the model should return the best n results so the user can check the relevant parts of the document and verify that the AI response is correct, especially with large RAG corpora.
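The cosine-similarity comparison described above can be sketched in plain Python. The notebook may use numpy instead, and the function names here are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(question_embedding, index, k=3):
    """Score every chunk against the question and return the best k."""
    scored = [
        (cosine_similarity(question_embedding, item["embedding"]), item)
        for item in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Returning the top k scored pairs, rather than a single winner, is what lets the user inspect several candidate passages when verifying the model's answer.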
-
Challenge
Prompt engineering RAG driver
To start with a prompt for RAG systems in 5.1, you will need to compare and identify the top N most relevant chunks. You will do this by applying your similarity scores to the embedded chunks and appending the best matches to the actual user prompt to answer the user's question.
It is valuable to set up RAGs this way to minimize the amount of input tokens being sent to GPT as well as reduce the size of the prompt to help reduce instances of hallucinations.
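Assembling the final prompt from the retrieved chunks might look like the following sketch; the template wording is an assumption, not the notebook's exact prompt:

```python
def build_prompt(question, top_chunks):
    """Join the retrieved chunks into a context block ahead of the question."""
    context = "\n\n".join(chunk["text"] for chunk in top_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Because only the top N chunks go into the prompt, input token count stays small regardless of how large the source document is, which is exactly the cost and hallucination benefit described above.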
In 5.2, you will gather these chunks and their locations in the larger document by chunk_id. This will make additional verification easier, if that's later desired. This final cell is simply the driver that creates the RAG system and then parses questions sent to it, showing the topChunks. In production environments you will not need to embed documents every time, and the functioning RAG should be separated from backend functionality. Reading files into memory and indexing them into chunks before embedding will always be the first steps. From there you will embed questions and alter the prompt depending on the similarity scores. In production, it is common to set a score threshold so the model does not respond when confidence in the similarity falls below that threshold.
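That kind of score threshold can be sketched like this; the 0.75 cutoff and the function name are assumptions, and the right value depends on the embedding model you use:

```python
SCORE_THRESHOLD = 0.75  # hypothetical cutoff; tune per embedding model

def answer_or_decline(scored_chunks, threshold=SCORE_THRESHOLD):
    """Keep only chunks whose similarity clears the threshold.

    scored_chunks is a list of (score, chunk) pairs. Returns None when no
    chunk is confident enough, so the caller can decline instead of guessing.
    """
    confident = [(score, chunk) for score, chunk in scored_chunks
                 if score >= threshold]
    if not confident:
        return None
    return confident
```

Declining to answer on low scores trades a little coverage for a large reduction in confidently wrong responses.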
Once you receive the response from the model, you can add to the response key features such as links to the different chunks used to answer the question.