Lab: Semantic Search and Recommendation System
Develop a solid understanding of semantic search and recommendation systems in this hands-on lab, Semantic Search and Recommendation System. Through practical exercises, you'll clean a movie dataset, generate text embeddings, build similarity search and recommendation workflows with FAISS, store and filter embeddings in ChromaDB, evaluate retrieval quality with Precision@K and Recall@K, and visualize embedding clusters with UMAP. By working through a practical movie search scenario, you'll gain the skills needed to build, evaluate, and inspect embedding-powered search and recommendation applications.
Challenge
Introduction
Welcome to the Lab: Semantic Search and Recommendation System. This hands-on Code Lab is designed for developers who want to understand how embeddings, vector search, vector databases, evaluation metrics, and visualization work together in a practical semantic search workflow. Throughout the lab, you’ll clean a movie dataset, generate embeddings, search for and recommend movies with FAISS, store and filter embeddings in ChromaDB, evaluate retrieval quality with Precision@K and Recall@K, and visualize embedding clusters with UMAP.
By the end of this Code Lab, you’ll be able to build an end-to-end semantic search and recommendation pipeline. You’ll understand how text is converted into vectors, how nearest-neighbor search finds similar items, how vector databases support metadata filtering, how retrieval quality is measured, and how 2D projections can help reveal whether embeddings capture meaningful structure.
---
Prerequisites
Basic Python Knowledge
- Familiarity with core Python concepts such as functions, lists, dictionaries, loops, and importing modules.
Introductory AI and Embedding Concepts
- A basic understanding of embeddings, semantic search, or vector similarity is helpful but not required.
- Learners should be comfortable with the idea that text can be converted into numeric vectors and compared by meaning.
Text Editor and Terminal Experience
- Comfort using a text editor or IDE.
- Experience with basic command-line operations, such as navigating directories and running Python scripts.
- Ability to run code, test changes, and observe program output in the terminal.
---
## Movie Semantic Search App
This lab provides a simple movie dataset and search app. You will use Python to clean movie data, generate embeddings, and search for similar movies by meaning rather than exact keywords.
Throughout the lab, you will build semantic search and recommendation workflows with FAISS and ChromaDB. You will evaluate search quality with Precision@K and Recall@K, then visualize movie embeddings with UMAP to inspect clusters and outliers.
> The final code for each step is stored in the `__solution/code` folder. For example, the final code for Step 2 is available in the `__solution/code/Step02` directory.
Challenge
Prepare the dataset and generate embeddings
In this step, you will prepare a small movie dataset for semantic search. You will clean the raw movie text, generate embedding vectors from movie summaries, and validate that the embeddings capture meaning by finding similar movies through nearest-neighbor search.
Navigate to the following URL https://{{hostname}}--8080.pluralsight.run/data/movies.json to view the raw movie dataset in JSON format.
To enable embedding generation, copy the API key from the top bar and replace `<pluralsight-openai-api-key>` in the `.env` file.
---
Raw datasets often contain noise such as HTML tags, encoded characters, inconsistent spacing, repeated text, strange symbols, or irrelevant formatting. Cleaning this text helps the embedding model focus on the actual movie content.
The full preprocessing script uses `clean_text` to clean each movie title and summary. It also converts `year` to an integer, parses `genres` into a list, adds a `primary_genre` field, and writes the cleaned records to `data/movies_clean.json`.
This preprocessing step matters because embedding models convert text into vectors. Cleaner text usually produces better embeddings, which helps semantic search find related movies even when the query does not use the exact same words.
Explanation
- `unescape(text)` converts HTML entities into normal characters. For example, `&#39;` becomes `'`.
- `re.sub(r"<[^>]+>", " ", text)` replaces HTML tags with spaces. Using a space instead of an empty string prevents nearby words from being joined together. For example, after whitespace cleanup, `<p>Hello</p><p>World</p>` becomes `Hello World` instead of `HelloWorld`.
- `re.sub(r"\s+", " ", text).strip()` replaces repeated whitespace with a single space and removes extra spaces from the beginning and end of the text.
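The three cleaning calls explained above can be combined into one helper. The following is a minimal sketch that assumes a `clean_text` function shaped like the one the lab describes; the actual code in `src/preprocess.py` may differ in details.

```python
import re
from html import unescape


def clean_text(text: str) -> str:
    """Decode HTML entities, strip tags, and collapse whitespace."""
    # Turn entities such as &#39; or &amp; into normal characters.
    text = unescape(text)
    # Replace each tag with a space so neighbouring words are not joined.
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("<p>Hello</p><p>World</p>"))  # Hello World
print(clean_text("It&#39;s   a &amp; b"))      # It's a & b
```

Running the whitespace cleanup last matters here, because the tag replacement itself introduces extra spaces.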
Run the following command in the **Terminal**:

    python src/preprocess.py

You should see output similar to the example below.

    --- BEFORE ---
    { "id": 1, "title": "Blade Runner", "year": "1982", ... }
    --- AFTER ---
    { "id": 1, "title": "Blade Runner", "year": 1982, ... }
    Saved 100 records to data/movies_clean.json

Navigate to the following URL to view the cleaned movie dataset in JSON format: https://{{hostname}}--8080.pluralsight.run/data/movies_clean.json
---
Embeddings convert text into fixed-length vectors that capture semantic meaning. Similar texts tend to produce similar vectors, allowing applications to compare content based on meaning rather than exact word matches.
For OpenAI's `text-embedding-3-small` model, each embedding contains 1,536 floating-point values. Although the numbers themselves are not human-readable, they form the foundation of semantic search and retrieval systems.
In this task, you'll generate embeddings for movie content using the OpenAI embeddings API and inspect the resulting vectors.
Explanation
- `client.embeddings.create(model=model, input=texts)` sends the list of input texts to the embedding model in a single batched API call.
- The API response contains one embedding for each input text. `[item.embedding for item in response.data]` extracts each embedding vector and stores the vectors in a plain Python list.
- `embed_query(text)` is a helper function that embeds one piece of text and returns one vector.
- The `embed_and_inspect` function builds the input text from the movie title and summary, sends it to the configured embedding provider, and prints the first 10 numbers of the returned vector.

Navigate to the **Terminal** and execute the following command to test the changes.

    python src/embed_sample.py

You should see output similar to the example below.

    Text being embedded:
    Title: Blade Runner
    Summary: In a dystopian Los Angeles of 2019, a special police officer hunts down bioengineered humans called 'replicants' who have escaped to Earth. As he investigates, he begins to question the boundary between human and machine, and what makes a life worth living.
    First 10 dimensions: [-0.0143, 0.0271, -0.0058, 0.0192, -0.0046, 0.0117, 0.0083, -0.0231, 0.0009, 0.0334]
    Embedding length: 1536
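The batched embedding call and the single-text helper from this task can be sketched as follows. This is an illustration, not the lab's code. It assumes the helpers receive the client as an argument, and it uses a hypothetical `FakeEmbeddings` stand-in that returns tiny made-up vectors so the example runs without an API key.

```python
from types import SimpleNamespace


def embed_texts(client, texts, model="text-embedding-3-small"):
    """Embed a list of texts in one batched call and return a list of vectors."""
    response = client.embeddings.create(model=model, input=texts)
    # The response holds one item per input text.
    return [item.embedding for item in response.data]


def embed_query(client, text, model="text-embedding-3-small"):
    """Embed a single piece of text and return its vector."""
    return embed_texts(client, [text], model)[0]


class FakeEmbeddings:
    """Hypothetical stand-in for client.embeddings (no network, no API key)."""

    def create(self, model, input):
        # Return a 3-value vector per text instead of a real 1,536-value one.
        data = [SimpleNamespace(embedding=[float(len(t)), 0.0, 1.0]) for t in input]
        return SimpleNamespace(data=data)


fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
vectors = embed_texts(fake_client, ["Blade Runner", "Up"])
print(len(vectors), vectors[0][:10])
print(embed_query(fake_client, "Up"))
```

With the real OpenAI Python SDK, `fake_client` would presumably be replaced by an `OpenAI()` client configured with the API key from the `.env` file.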
Embeddings represent the meaning of text as numeric vectors. In this step, each movie summary is converted into a vector that can be compared with other movie vectors.
Semantic search compares meaning instead of exact words. For example, a query about "artificial intelligence" can still match a movie about "replicants" or "machines" because those ideas are related.
Before embeddings can be compared efficiently, they are commonly normalized so every vector has the same length. Once normalized, similarity between vectors can be measured using cosine similarity.
In this next task, you'll prepare movie embeddings for similarity comparison and generate a similarity matrix that compares every movie against every other movie.
Explanation
- `np.array(raw, dtype=np.float32)` converts the Python list of embedding vectors into a NumPy array shaped `(N, D)`, where `N` is the number of texts and `D` is the embedding dimension.
- NumPy arrays make it easier to perform fast vector operations such as normalization and matrix multiplication.
- `np.linalg.norm(vectors, axis=1, keepdims=True)` calculates the L2 length of each row. `axis=1` means the length is calculated across the columns of each row. `keepdims=True` keeps the result shaped as `(N, 1)`, so NumPy can divide each row by its own length.
- `vectors = vectors / norms` rescales each row so every movie vector has length `1`.
- Once vectors are normalized, the dot product between two vectors is equivalent to cosine similarity.
- `movie_vectors @ movie_vectors.T` performs one matrix multiplication to compare every movie vector with every other movie vector.
- The result is an `(N, N)` similarity matrix, where `sim_matrix[i, j]` is the cosine similarity between movie `i` and movie `j`.
- The diagonal values are `1.0` because each movie is perfectly similar to itself.

Navigate to the **Terminal** and execute the following command to test the changes.

    python src/validate_embeddings.py

If the embeddings are working well, movies with related themes, genres, or concepts should appear near each other, even when they do not use the same exact words. You should see output similar to the example below.

    Showing each movie's nearest neighbors (self at rank 1, sim=1.000)
    [[1.   0.51 0.15 0.13 0.21]
     [0.51 1.   0.14 0.16 0.22]
     [0.15 0.14 1.   0.47 0.18]
     [0.13 0.16 0.47 1.   0.17]
     [0.21 0.22 0.18 0.17 1.  ]]
    The Matrix (Science Fiction)
      1. The Matrix           Science Fiction  sim=1.000
      2. Blade Runner         Science Fiction  sim=0.510
      3. The Shining          Horror           sim=0.210
      4. Pride and Prejudice  Romance          sim=0.150
      5. The Notebook         Romance          sim=0.130
    Blade Runner (Science Fiction)
      1. Blade Runner         Science Fiction  sim=1.000
      2. The Matrix           Science Fiction  sim=0.510
      3. The Shining          Horror           sim=0.220
      4. The Notebook         Romance          sim=0.160
      5. Pride and Prejudice  Romance          sim=0.140

In this example, `The Matrix` and `Blade Runner` are both science fiction movies and appear close to each other. This shows that the embeddings are capturing semantic similarity rather than only exact word matches.
---
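The normalization and similarity-matrix steps from this task can be tried on a toy example. The sketch below uses three made-up 2-D vectors in place of real embeddings; the variable names follow the explanation above, but the data is purely illustrative.

```python
import numpy as np

# Hypothetical 2-D "embeddings"; real vectors would have 1,536 values each.
raw = [[3.0, 4.0], [4.0, 3.0], [-4.0, 3.0]]

vectors = np.array(raw, dtype=np.float32)                # shape (N, D)
norms = np.linalg.norm(vectors, axis=1, keepdims=True)   # shape (N, 1)
vectors = vectors / norms                                # every row now has length 1

# One matrix multiplication compares every vector with every other vector.
sim_matrix = vectors @ vectors.T                         # shape (N, N)
print(np.round(sim_matrix, 2))

# Neighbours of row 0, most similar first; row 0 itself comes first.
order = np.argsort(-sim_matrix[0])
print(order)
```

The first two vectors point in nearly the same direction, so their similarity is close to 1, while the third is perpendicular to the first and scores close to 0.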
Congratulations! You cleaned the dataset, created your first embeddings, and verified that similar movies appear close together in vector space. This is the foundation of semantic search: embed the query, compare it with stored movie vectors, and return the closest matches.
Challenge
Search and recommend with FAISS
In this step, you will use FAISS to search movie embeddings, compare Flat and HNSW indexes, benchmark their speed, and build a simple movie recommendation workflow from multiple inputs.
FAISS, or Facebook AI Similarity Search, is a library for fast similarity search over dense vector embeddings. A FAISS index stores vectors and lets you search for the stored vectors nearest to a query vector.
In this task, you will build and compare two FAISS Flat search approaches. Each Flat index compares the query vector against all stored movie vectors to retrieve the top-K matches.
---
A Flat index is simple and accurate because it checks every stored vector and returns the exact top-K matches. However, this brute-force approach becomes less practical as the dataset grows because each query must be compared against all vectors. This makes Flat indexes a useful baseline for understanding exact search before exploring faster approximate nearest-neighbor indexes later in the lab.
Explanation
- `faiss.IndexFlatL2(dim)` creates a Flat index that compares vectors using squared L2 distance. Smaller values mean closer matches.
- `faiss.IndexFlatIP(dim)` creates a Flat index that compares vectors using inner product. Larger values mean closer matches.
- `index.add(movie_vectors)` loads the movie vectors into the FAISS index so they can be searched.
- `embed_and_normalize([query])` converts the search query into an embedding vector and normalizes it so it can be compared with the stored movie vectors.
- `.search(query_vec, k)` returns two arrays: the top-K distances or scores, and the indices of the matching movies in the original movie vector array.
- For `IndexFlatL2`, smaller squared L2 distances mean closer matches.
- For `IndexFlatIP`, larger inner product scores mean closer matches. Because both the movie vectors and query vector are L2-normalized, the inner product score behaves like cosine similarity.
- With normalized vectors, `IndexFlatL2` and `IndexFlatIP` should usually return the same movies in the same order, although the distance and score values will differ.
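For unit-length vectors, squared L2 distance and inner product are linked by ||q - v||² = 2 - 2(q · v), which is why the two Flat indexes rank results identically. The sketch below checks this with plain NumPy as a stand-in for the exhaustive comparison a Flat index performs. It does not call FAISS, and the random vectors and the `normalize` helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def normalize(x):
    """Scale each row to unit L2 length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Hypothetical data: 50 stored vectors and one query, all 8-dimensional.
stored = normalize(rng.normal(size=(50, 8)))
query = normalize(rng.normal(size=(1, 8)))

inner = (stored @ query.T).ravel()             # larger = closer
sq_l2 = ((stored - query) ** 2).sum(axis=1)    # smaller = closer

k = 5
top_ip = np.argsort(-inner)[:k]   # best matches by inner product
top_l2 = np.argsort(sq_l2)[:k]    # best matches by squared L2 distance

print(top_ip, top_l2)
print(np.allclose(sq_l2, 2 - 2 * inner))
```

The same relationship is visible in the lab's sample output for this task: a cosine score of 0.4146 corresponds to a squared L2 distance of 2 - 2(0.4146) = 1.1708.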
Run the following command in the **Terminal**:

    python src/flat_search.py

You should see output similar to the example below.

    Query: 'movies about horror and romance'
    IndexFlatL2 (smaller squared L2 distance = closer):
      1. Halloween                  L2_dist=1.1708
      2. Psycho                     L2_dist=1.2535
      3. A Nightmare on Elm Street  L2_dist=1.2778
      4. It Follows                 L2_dist=1.2847
      ...
    IndexFlatIP (larger cosine = closer):
      1. Halloween                  cosine=0.4146
      2. Psycho                     cosine=0.3732
      3. A Nightmare on Elm Street  cosine=0.3611
      4. It Follows                 cosine=0.3576
      ...

Notice that both indexes return the same ranking, but the scores are shown differently. `IndexFlatL2` reports squared distance, where smaller is better. `IndexFlatIP` reports cosine-style similarity, where larger is better.
---
A Flat index compares the query vector against every stored vector. This gives exact results, but it becomes slower and less practical as the dataset grows because each query must be compared with all vectors.
HNSW stands for Hierarchical Navigable Small World. It is an approximate nearest-neighbor algorithm commonly used in vector search systems. Instead of comparing every vector, HNSW builds a graph of vector relationships. Each vector is connected to nearby vectors, and a query walks through the graph to find close matches. This graph-based approach avoids checking every vector, making searches much faster on larger datasets.
Because HNSW is approximate, it may not always return the exact same results as Flat search. In practice, it often returns very similar results while significantly reducing search time.
Explanation
- `faiss.IndexHNSWFlat(dim, M, faiss.METRIC_INNER_PRODUCT)` creates an HNSW index that uses inner-product similarity.
- `M` controls how many graph connections each vector can have. Larger values can improve recall, but they also increase memory usage.
- `efConstruction` controls how carefully the graph is built. Larger values can improve index quality, but they make indexing slower. This value must be set before calling `.add()` because it affects how the graph is constructed.
- `efSearch` controls how deeply the graph is searched at query time. Larger values can improve recall, but they make searches slower. This value can be adjusted later to tune search behavior.
- `index_hnsw.add(movie_vectors)` adds the vectors to the index and builds the HNSW graph.
- `faiss.METRIC_INNER_PRODUCT` uses the same similarity metric as `IndexFlatIP`. Because the vectors are normalized, the resulting similarity scores behave like cosine similarity.
Run the following command in the **Terminal**:

    python src/hnsw_search.py

You should see output similar to the example below.

    Query: 'movies about artificial intelligence and machines'
    Rank  Flat (exact)            HNSW (approx)
    1     Ex Machina              Ex Machina
    2     2001: A Space Odyssey   2001: A Space Odyssey
    3     The Matrix              The Matrix
    4     Blade Runner            Blade Runner
    ...

You already built Flat and HNSW indexes over a 100-movie dataset. At that small scale, both indexes usually return the same or very similar results, so HNSW’s speed advantage is not easy to see.
This benchmark uses random synthetic vectors with sizes of 10,000 and 50,000 to make the tradeoff visible.
Explanation
- `make_vectors(num_vectors, dim)` generates a `(num_vectors, dim)` array of random `float32` vectors. `faiss.normalize_L2(vectors)` then rescales each row to unit length, so inner product is equivalent to cosine similarity.
- `build_flat(vectors)` builds a `faiss.IndexFlatIP` over the synthetic vectors. Flat indexes are cheap to build because they only store the vectors.
- `build_hnsw(vectors)` builds a `faiss.IndexHNSWFlat` over the same vectors. HNSW indexes are slower to build because they construct a navigable graph during `.add()`.
- Both indexes are built over the same `db` and timed against the same `queries` batch. Reusing the same data on both sides keeps the comparison fair, so any difference in search time comes from the index type and not from the inputs.
- `time_search(index, queries, K)` runs one untimed warmup query, then runs several timed search batches, averages them, and returns milliseconds per single query. The warmup keeps one-time costs like cache loading out of the measurement.

Navigate to the **Terminal** and execute the following command to test the changes.

    python src/flat_vs_hnsw_benchmark.py

You should see output similar to the example below.

    Vectors | Flat build (s) | HNSW build (s) | Flat search (ms) | HNSW search (ms) | Speedup
    --------------------------------------------------------------------------------------------
     10,000 |          0.013 |          2.166 |            0.355 |            0.205 |    1.7x
     50,000 |          0.072 |         23.830 |            5.812 |            0.821 |    7.1x
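The timing pattern described for `time_search` (one untimed warmup, several timed batches, an average reported per query) can be sketched without FAISS. The `repeats` parameter and the `SlowIndex` stand-in are assumptions added for illustration; the lab's script may structure this differently.

```python
import time


def time_search(index, queries, k, repeats=5):
    """Return the average milliseconds per single query for index.search()."""
    index.search(queries[:1], k)              # untimed warmup query
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        index.search(queries, k)              # one timed batch
        elapsed.append(time.perf_counter() - start)
    mean_batch_seconds = sum(elapsed) / len(elapsed)
    # Convert seconds per batch into milliseconds per query.
    return mean_batch_seconds / len(queries) * 1000.0


class SlowIndex:
    """Hypothetical stand-in for a FAISS index: sleeps instead of searching."""

    def search(self, queries, k):
        time.sleep(0.001 * len(queries))
        return None, None


ms_per_query = time_search(SlowIndex(), list(range(10)), k=5)
print(f"{ms_per_query:.2f} ms per query")
```

Because the function only relies on a `.search(queries, k)` method, slicing, and `len()`, the same helper would work with a real FAISS index and a NumPy query batch.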
Item-to-item recommendation uses the same FAISS search pattern from earlier tasks. The difference is the query vector. Instead of embedding a new text query, you use the mean of two or more movie vectors.
The mean-centroid approach combines multiple input movies into one query vector. Each movie is a point in embedding space, and the average of those points becomes a new point that represents the combined preference.
This approach can surface movies that sit between the input movies. For example, if the inputs include Sci-Fi, Romance, and Horror movies, the recommendations may include movies that share themes across those genres rather than matching only one title.
Explanation
- `movie_vectors[valid_idxs]` selects the vectors for the input movie titles.
- `.mean(axis=0, keepdims=True)` averages those vectors column by column. `axis=0` creates one average vector across the selected movies, and `keepdims=True` keeps the result as a 2D array with shape `(1, D)`, which is the format FAISS expects for search queries.
- `faiss.normalize_L2(mean_vec)` re-normalizes the averaged vector back to unit length in place. This is important because the index uses inner product, and normalized vectors make inner product work like cosine similarity. The FAISS helper updates the array directly and does not return a new value.
- `index.search(mean_vec, K + len(valid_idxs))` searches for the nearest neighbors of the combined preference vector.
- The search asks for `K + N` results, where `N` is the number of valid input movies. The input movies often appear near the top because the mean vector was built from them. Asking for extra results gives `print_results` enough candidates after it removes the original input titles.
- `set(valid_idxs)` is passed to `skip` so the selected input movies are not shown back as recommendations.
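The mean-centroid steps above can be followed end to end on toy data. This sketch replaces the FAISS index with a plain NumPy dot product and uses made-up 2-D vectors and placeholder titles, so it only illustrates the logic: average, re-normalize, over-fetch, then skip the inputs.

```python
import numpy as np

# Hypothetical unit-length "movie" vectors; real ones have 1,536 values.
titles = ["A", "B", "C", "D", "E"]
movie_vectors = np.array(
    [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [0.28, 0.96], [-1.0, 0.0]],
    dtype=np.float32,
)

valid_idxs = [0, 1]   # the movies the user picked
K = 2                 # number of recommendations wanted

mean_vec = movie_vectors[valid_idxs].mean(axis=0, keepdims=True)   # shape (1, D)
mean_vec /= np.linalg.norm(mean_vec, axis=1, keepdims=True)        # back to unit length

# Score every movie against the centroid, then keep K + N candidates.
scores = (movie_vectors @ mean_vec.T).ravel()
candidates = np.argsort(-scores)[: K + len(valid_idxs)].tolist()

# Drop the input movies so they are not recommended back to the user.
skip = set(valid_idxs)
recommended = [titles[i] for i in candidates if i not in skip][:K]
print(recommended)
```

Fetching `K + N` candidates before filtering is what guarantees that `K` recommendations remain even when every input movie lands in the top results.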
Run the following command in the **Terminal**:

    python src/recommend.py many "The Matrix" "The Notebook" "The Shining"

You should see output similar to the example below.

    More like: [The Matrix, The Notebook, The Shining] (mean of 3 movies)
      1. Ex Machina                             Science Fiction  sim=0.604
      2. Psycho                                 Horror           sim=0.592
      3. Eternal Sunshine of the Spotless Mind  Romance          sim=0.583
      4. Groundhog Day                          Comedy           sim=0.579
      5. Memento                                Thriller         sim=0.578

In this example, the selected movies come from different genres: Sci-Fi, Romance, and Horror. The mean-centroid vector searches near the center of those preferences, so the results include a mix of genres.
---
Congratulations! You compared Flat and HNSW search on the movie corpus, benchmarked both at scale to see the speed tradeoff, and used a FAISS HNSW index to recommend movies from a combined preference vector.
Challenge
Store and search with ChromaDB
In this step, you will use ChromaDB to store movie embeddings, run semantic search, and filter results by metadata such as genre or year.
The embeddings for the cleaned movie dataset have already been generated for this task and are available in `/data/embeddings/movies_clean_openai_text_small.json`.
ChromaDB is a vector database that stores embeddings, documents, and metadata together. This makes it useful for building semantic search and recommendation systems without manually managing a low-level vector index.
A collection is the main storage unit in ChromaDB. In this task, you will create a persistent ChromaDB collection and add the movie embeddings, readable documents, and metadata to it.
Explanation
- In this task, the `movies` collection stores one record for each movie.
- `client.create_collection(name=name, metadata={"hnsw:space": "cosine"})` creates a collection that uses cosine distance for vector search.
- Cosine distance is different from cosine similarity. With cosine distance, smaller values mean closer matches. A distance near `0` means the vectors are very similar.
- `chromadb.PersistentClient(path=...)` stores the database on disk. This allows later scripts to reopen and query the same collection without rebuilding it each time.
- The `ids` list converts each movie ID into a string because ChromaDB expects record IDs to be strings.
- The `documents` list stores readable text for each movie. In this task, each document includes the movie title and summary.
- The `metadatas` list stores structured information about each movie, such as `title`, `year`, `primary_genre`, and `genres_str`.
- ChromaDB metadata values must be scalar values, such as strings, numbers, or booleans. Because `genres` is a list, it is converted into a single comma-separated string named `genres_str`.
- `movie_vectors.tolist()` converts the NumPy embedding array into a regular Python list of lists, which is the format ChromaDB expects when adding embeddings.
- `collection.add(...)` inserts all movie records into the collection in one batch. The `ids`, `embeddings`, `documents`, and `metadatas` lists must stay aligned so each movie gets the correct ID, vector, text, and metadata.
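Keeping the `ids`, `documents`, and `metadatas` lists aligned is easiest when all three are built in the same loop. The sketch below shows one way to do that in plain Python. The `build_chroma_inputs` helper and the two sample records are hypothetical, and the `summary` field name is an assumption about the cleaned dataset.

```python
# Hypothetical cleaned records shaped like entries in movies_clean.json.
movies = [
    {"id": 1, "title": "Blade Runner", "year": 1982,
     "summary": "A special police officer hunts replicants.",
     "primary_genre": "Science Fiction",
     "genres": ["Science Fiction", "Neo-Noir", "Thriller"]},
    {"id": 2, "title": "The Notebook", "year": 2004,
     "summary": "A couple falls in love.",
     "primary_genre": "Romance",
     "genres": ["Romance", "Drama"]},
]


def build_chroma_inputs(movies):
    """Return aligned ids, documents, and metadatas lists."""
    ids, documents, metadatas = [], [], []
    for movie in movies:
        ids.append(str(movie["id"]))                     # record IDs are strings
        documents.append(f"Title: {movie['title']}\n\nSummary: {movie['summary']}")
        metadatas.append({
            "title": movie["title"],
            "year": movie["year"],
            "primary_genre": movie["primary_genre"],
            "genres_str": ", ".join(movie["genres"]),    # list -> scalar string
        })
    return ids, documents, metadatas


ids, documents, metadatas = build_chroma_inputs(movies)
print(ids)
print(metadatas[0]["genres_str"])
```

These lists, together with `movie_vectors.tolist()`, are the kind of arguments that `collection.add(...)` receives in this task.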
Run the following command in the **Terminal**:

    python src/chroma_setup.py

You should see output similar to the example below.

    Collection 'movies' now has 100 records.
    Sample record (id=1):
    {
      "id": "1",
      "metadata": {
        "title": "Blade Runner",
        "year": 1982,
        "primary_genre": "Science Fiction",
        "genres_str": "Science Fiction, Neo-Noir, Thriller"
      },
      "document_preview": "Title: Blade Runner\n\nSummary: In a dystopian Los Angeles of 2019, a special police officer hunts d..."
    }

Congratulations! You created a persistent ChromaDB collection and added movie embeddings, documents, and metadata to it.
The query embedding must be created with the same embedding model used during ingestion. If the stored movie embeddings and query embedding come from different models, they may not be comparable because they are in different embedding spaces.
`collection.query(...)` is ChromaDB's search method. It compares the query embedding against the stored movie embeddings and returns the closest matches.
Because the collection was configured with `metadata={"hnsw:space": "cosine"}` in the previous task, the returned distance values are cosine distances.
Cosine distance is different from cosine similarity. Smaller distance means a closer match, and a distance near `0` means the vectors are very similar.
Explanation
- `embed_query_for_chroma(query)` converts the user's text query into an embedding vector in the format ChromaDB expects.
- `query_embeddings=query_emb` passes the query vector to ChromaDB.
- `n_results=n_results` controls how many matches are returned. For example, if `n_results` is `5`, ChromaDB returns the top five closest movies.
- `include=["metadatas", "distances"]` tells ChromaDB to return the movie metadata and distance values. ChromaDB always returns record IDs, so `include` controls which extra fields are included.
- You do not need to manually normalize the query embedding in this function. The query embedding should be passed in the same format as the stored movie embeddings, and ChromaDB will use the collection's configured cosine distance metric during search.
- The returned result is a dictionary. For a single query, the matching metadata is available from `results["metadatas"][0]`, and the matching distances are available from `results["distances"][0]`.
- The `print_search_results` helper uses those values to display the movie title, year, genre, and cosine distance.
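The nested shape of the query result is easier to see with a concrete dictionary. The sketch below uses a hand-written stand-in for what `collection.query(...)` returns for one query. The record IDs are invented, and `format_results` is a hypothetical helper, not the lab's `print_search_results`.

```python
# Hand-written stand-in for a single-query result dictionary.
results = {
    "ids": [["12", "7"]],   # made-up record IDs
    "metadatas": [[
        {"title": "Ex Machina", "year": 2014, "primary_genre": "Science Fiction"},
        {"title": "2001: A Space Odyssey", "year": 1968, "primary_genre": "Science Fiction"},
    ]],
    "distances": [[0.5292, 0.5821]],
}


def format_results(results):
    """Build one display line per match for a single-query result."""
    lines = []
    # Index [0] selects the matches for the first (and only) query.
    pairs = zip(results["metadatas"][0], results["distances"][0])
    for rank, (meta, dist) in enumerate(pairs, start=1):
        lines.append(
            f"{rank}. {meta['title']} ({meta['year']})  "
            f"{meta['primary_genre']}  cos_dist={dist:.4f}"
        )
    return lines


lines = format_results(results)
print("\n".join(lines))
```

Each top-level value is a list of lists because ChromaDB can answer several queries in one call, which is why the single-query case always reads index `[0]`.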
Run the following command in the **Terminal**:

    python src/chroma_search.py search "movies about artificial intelligence"

You should see output similar to the example below.

    Query: 'movies about artificial intelligence'
      1. Ex Machina (2014)             Science Fiction  cos_dist=0.5292
      2. 2001: A Space Odyssey (1968)  Science Fiction  cos_dist=0.5821
      3. The Matrix (1999)             Science Fiction  cos_dist=0.5954
      4. Arrival (2016)                Science Fiction  cos_dist=0.6139
      5. Blade Runner (1982)           Science Fiction  cos_dist=0.6141
A simple filter checks for an exact metadata match. For example, `{"primary_genre": "Science Fiction"}` returns only movies whose `primary_genre` is `"Science Fiction"`.
ChromaDB also supports comparison operators for numeric metadata. For example, `{"year": {"$gte": 2010}}` returns only movies released in 2010 or later.
Multiple conditions can be combined with `$and` or `$or`. For example:

    {
        "$and": [
            {"primary_genre": "Science Fiction"},
            {"year": {"$gte": 2010}},
        ]
    }

In this task, the metadata filter narrows the candidate records before the final matches are returned.
Explanation
- `search_with_filter` works like `search_collection` from the previous task, but it adds one extra parameter: `where`.
- `where=where` passes a metadata filter to ChromaDB. The filter controls which records are allowed to appear in the search results.
- ChromaDB returns cosine distance because the collection was configured with `metadata={"hnsw:space": "cosine"}` in the previous task. Smaller distance means a closer semantic match.
- If no records match the filter, ChromaDB returns empty result sets.
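A small helper can assemble the `where` filter from optional inputs. The following sketch assumes a hypothetical `build_where` function that takes a genre and a minimum year; the lab's script may build its filters differently.

```python
def build_where(genre=None, min_year=None):
    """Return a ChromaDB-style `where` filter, or None when nothing is set."""
    conditions = []
    if genre is not None:
        conditions.append({"primary_genre": genre})      # exact match
    if min_year is not None:
        conditions.append({"year": {"$gte": min_year}})  # numeric comparison
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]       # a single condition needs no $and wrapper
    return {"$and": conditions}


print(build_where("Romance"))
print(build_where("Romance", 2010))
```

Returning the bare condition when only one is present avoids wrapping a single expression in `$and`, which is meant for combining two or more conditions.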
Run the following command in the **Terminal**:

    python src/chroma_search.py filter "feel-good romantic movie" "Romance"

You should see output similar to the example below.

    Query: 'feel-good romantic movie'
    Filter: primary_genre='Romance'
      1. 500 Days of Summer (2009)                     Romance  cos_dist=0.5262
      2. Eternal Sunshine of the Spotless Mind (2004)  Romance  cos_dist=0.5633
      3. Lost in Translation (2003)                    Romance  cos_dist=0.5751
      4. La La Land (2016)                             Romance  cos_dist=0.5886
      5. Amelie (2001)                                 Romance  cos_dist=0.5918

Now run the same query with an additional year constraint.

    python src/chroma_search.py filter "feel-good romantic movie" "Romance" 2010

You should see output similar to the example below.

    Query: 'feel-good romantic movie'
    Filter: primary_genre='Romance', year>=2010
      1. La La Land (2016)            Romance  cos_dist=0.5885
      2. Call Me by Your Name (2017)  Romance  cos_dist=0.6331

---
Congratulations! You stored movie embeddings in ChromaDB and ran semantic search with and without metadata filters.
Challenge
Evaluate retrieval quality
In this step, you will evaluate semantic search using Precision@K and Recall@K. You will measure how many of the top search results are relevant, determine how many known relevant items were retrieved, and compare two ChromaDB collections built using different embedding models.
The evaluation queries for this task have already been created. The file `data/eval_queries.json` contains the evaluation queries along with the IDs of the relevant movies used to score the search results.
Open https://{{hostname}}--8080.pluralsight.run/data/eval_queries.json to inspect the evaluation queries in JSON format.
Precision@K answers the question: "Of the top K results returned, what fraction are actually relevant?" It is one of the foundational metrics for evaluating semantic search.
To compute Precision@K, you need three things:
- A ranked list of retrieved IDs
- A set of known relevant IDs from an evaluation set
- The value of `K`
The formula is simple: count how many of the top `K` retrieved IDs appear in the relevant set, then divide that count by `K`.
Explanation
- The `relevant_ids` set is the ground truth. It contains the IDs that are considered relevant for a specific evaluation query. In this lab, those relevant IDs are stored in `data/eval_queries.json`.
- `retrieved_ids[:k]` takes the first `k` items from the ranked retrieval results. Since the results are already sorted by similarity, these are the top matches returned by the search system.
- `sum(1 for r in top_k if r in relevant_ids)` checks each retrieved ID in `top_k` and adds `1` when that ID exists in the relevant set. The total count is stored in `hits`.
- `hits / k` returns a score between `0` and `1`. For example, a Precision@5 score of `0.80` means 4 out of the top 5 results were relevant.
- `relevant_ids` is typed as `set[int]` because membership checks such as `r in relevant_ids` are faster with a set than with a list.
- This function assumes `k` is greater than `0`, because dividing by `0` would cause an error.
- Precision@K can decrease as `k` increases because lower-ranked results are usually less relevant than the top results. For example, Precision@5 may be higher than Precision@10.
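Put together, the pieces explained above form a very short function. This is a minimal sketch assembled from that explanation; the lab's own file may differ in naming or typing. The sample IDs are taken from the lab's example output for the romantic-comedy query.

```python
def precision_at_k(retrieved_ids: list[int], relevant_ids: set[int], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant (assumes k > 0)."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for r in top_k if r in relevant_ids)
    return hits / k


# Ranked results and ground truth from the lab's sample output.
retrieved = [46, 47, 43, 48, 44, 57, 40, 39, 22, 49]
relevant = {40, 46, 47, 52, 53}

print(precision_at_k(retrieved, relevant, 5))    # 0.4
print(precision_at_k(retrieved, relevant, 10))   # 0.3
```

Only IDs 46 and 47 are relevant within the top 5, giving 2/5; ID 40 at rank 7 adds a third hit within the top 10, giving 3/10.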
Run the following command in the **Terminal**:

    python src/precision.py romantic

You should see output similar to the example below.

    Query: 'feel-good romantic comedy'
    Relevant items in eval set: [40, 46, 47, 52, 53]
    Top 10 retrieved:
       1. id=46 500 Days of Summer                     ✓ relevant
       2. id=47 Amelie                                 ✓ relevant
       3. id=43 Eternal Sunshine of the Spotless Mind
       4. id=48 Lost in Translation
       5. id=44 La La Land
       6. id=57 Some Like It Hot
       7. id=40 When Harry Met Sally                   ✓ relevant
       8. id=39 The Notebook
       9. id=22 Edward Scissorhands
      10. id=49 Call Me by Your Name
    P@5 = 0.40
    P@10 = 0.30
Precision tells you what fraction of the returned results are relevant, while recall tells you what fraction of the known relevant results were found.
Recall@K is an information retrieval metric that measures what fraction of the known relevant items the search found. It answers the question: "Of all the relevant items that exist, what fraction did we return in the top K?"
To compute Recall@K, compare the top K retrieved IDs with the relevant IDs from the evaluation set. Count how many relevant items were found, then divide that count by the total number of relevant items. The formula is `|top_k ∩ relevant_ids| / |relevant_ids|`. The numerator is the number of relevant items found in the top `k` results. The denominator is the total number of relevant items in the evaluation set for that query.
Explanation
- `set(retrieved_ids[:k])` takes the first `k` retrieved IDs and converts them into a set. This makes it easy and efficient to compare them with the relevant IDs.
- `if not relevant_ids: return 0.0` handles the edge case where the evaluation set has no relevant items for a query. Without this check, the function could cause a divide-by-zero error.
- `top_k & relevant_ids` uses Python's set-intersection operator. It returns the IDs that appear in both sets: the relevant items that were returned in the top `k`.
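The three pieces explained above combine into the function below. It is a minimal sketch built from that explanation, so details such as type hints may differ from the lab's file. The sample IDs come from the lab's example output for the romantic-comedy query.

```python
def recall_at_k(retrieved_ids: list[int], relevant_ids: set[int], k: int) -> float:
    """Fraction of all relevant IDs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0                       # avoid dividing by zero
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Ranked results and ground truth from the lab's sample output.
retrieved = [46, 47, 43, 48, 44, 57, 40, 39, 22, 49]
relevant = {40, 46, 47, 52, 53}

print(recall_at_k(retrieved, relevant, 5))    # 0.4
print(recall_at_k(retrieved, relevant, 10))   # 0.6
```

The denominator stays at 5 for both values of `k`, so moving from the top 5 to the top 10 can only add hits: two relevant items are found at `k=5` and three at `k=10`.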
Run the following command in the **Terminal**:

    python src/recall_and_analysis.py romantic

You should see output similar to this:

    Query: 'feel-good romantic comedy'
    Relevant in eval set (5): [40, 46, 47, 52, 53]
    R@5 = 0.40
    R@10 = 0.60
    === Per-query analysis (K=5) ===
    Found (2/5 relevant items returned in top 5):
      + id=46 500 Days of Summer [primary_genre=Romance]
      + id=47 Amelie [primary_genre=Romance]
    Missed (3 relevant items NOT in top 5):
      - id=40 When Harry Met Sally [primary_genre=Romance]
      - id=52 Bridesmaids [primary_genre=Comedy]
      - id=53 Groundhog Day [primary_genre=Comedy]
    Snuck in (3 irrelevant items in top 5):
      ? id=43 Eternal Sunshine of the Spotless Mind [primary_genre=Romance]
      ? id=48 Lost in Translation [primary_genre=Romance]
      ? id=44 La La Land [primary_genre=Romance]

A Recall@5 score of `1.00` means every relevant item appeared in the top 5 results. A Recall@5 score of `0.40` means 40% of the relevant items were found in the top 5.
Recall@K cannot decrease as `k` grows because the search has more chances to include relevant items. For example, Recall@10 is often higher than Recall@5.
The script also prints a Found/Missed/Snuck-in breakdown. “Found” items are relevant results returned in the top `k`, “Missed” items are relevant results that did not make the top `k`, and “Snuck-in” items are irrelevant results that appeared in the top `k`.
Precision and recall measure retrieval quality from different angles. A search with high precision but low recall returns mostly relevant results but misses some relevant items. A search with high recall but low precision finds many relevant items but may also include more irrelevant results.
Embedding quality is not the only factor that affects retrieval performance. Other factors, such as data cleaning, text preprocessing, metadata quality, the similarity metric, and the value of `K`, can also affect the final Precision@K and Recall@K scores. Comparing the final table helps you see whether changing the embedding model improves or reduces retrieval quality on the same evaluation set.

Now, you will compare Precision@K and Recall@K between two embedding models: `text-embedding-3-small` and `all-minilm`. You will evaluate two pre-built ChromaDB collections: `movies`, which was created using `text-embedding-3-small`, and `movies_all_minilm`, which was created using `all-minilm`.

Both collections are evaluated against the same set of evaluation queries. Since each collection was built with a different embedding model, the evaluation queries must be embedded using the corresponding model before scoring. Queries for the `movies` collection should use `text-embedding-3-small`, while queries for the `movies_all_minilm` collection should use `all-minilm`.

The evaluation queries embedded with `all-minilm` have already been created for this lab and are available in `/data/embeddings/eval_queries_all_minilm.json`.

Navigate to the **Terminal** and run the following command:

Explanation
- `evaluate_collection` opens an existing ChromaDB collection, runs each evaluation query against it, and scores the retrieved movie IDs.
- `collection.query(...)` runs semantic search for one query embedding. The query embedding is passed through `query_embeddings=[query_embeddings[i]]`, so each evaluation query is searched separately.
- ChromaDB returns matching IDs ordered by distance. The first `K` IDs are treated as the top K retrieved results.
- `retrieved_ids = [int(rid) for rid in chroma_result["ids"][0]]` converts ChromaDB string IDs back into integers so they can be compared with the hand-labeled relevant IDs from the evaluation set.
- `precision_at_k(retrieved_ids, relevant, K)` reuses the Precision@K function from Task 1 to measure what fraction of the top K returned results are relevant.
- `recall_at_k(retrieved_ids, relevant, K)` reuses the Recall@K function from Task 2 to measure what fraction of the known relevant items were found in the top K results.
- Each dictionary appended to `results` stores the precision and recall scores for one evaluation query. The provided `show_all` function then averages those scores across the full evaluation set.
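The evaluation loop described in this list can be sketched without a running database. The sketch below is an assumption-heavy illustration, not the lab's code: `StubCollection` is a hypothetical stand-in that only mimics the shape of a ChromaDB query result (one inner list of string IDs per query embedding), `evaluate` and `average` are illustrative names, and `precision_at_k` here divides by `k`, which may differ from the Task 1 function.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (assumes k > 0)."""
    top_k = retrieved_ids[:k]
    return sum(1 for i in top_k if i in relevant_ids) / k


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant items found in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


class StubCollection:
    """Hypothetical stand-in that returns results shaped like ChromaDB's."""

    def __init__(self, ranked_ids_by_query):
        self._ranked = ranked_ids_by_query

    def query(self, query_embeddings, n_results):
        key = tuple(query_embeddings[0])
        return {"ids": [self._ranked[key][:n_results]]}


def evaluate(collection, query_embeddings, relevant_sets, k):
    """Score each query separately and collect per-query metrics."""
    results = []
    for i, relevant in enumerate(relevant_sets):
        chroma_result = collection.query(
            query_embeddings=[query_embeddings[i]], n_results=k
        )
        # String IDs become integers so they match the labeled IDs.
        retrieved_ids = [int(rid) for rid in chroma_result["ids"][0]]
        results.append({
            "precision": precision_at_k(retrieved_ids, relevant, k),
            "recall": recall_at_k(retrieved_ids, relevant, k),
        })
    return results


def average(results, key):
    return sum(r[key] for r in results) / len(results)


# Toy data: two queries with made-up embeddings, rankings, and labels.
queries = [[1.0, 0.0], [0.0, 1.0]]
collection = StubCollection({
    (1.0, 0.0): ["1", "2", "3"],
    (0.0, 1.0): ["4", "5", "6"],
})
relevant_sets = [{1, 3, 9}, {5}]

results = evaluate(collection, queries, relevant_sets, k=3)
print(f"P@3 = {average(results, 'precision'):.2f}")  # P@3 = 0.50
print(f"R@3 = {average(results, 'recall'):.2f}")     # R@3 = 0.83
```

Averaging per-query scores, rather than pooling all retrieved IDs, gives every evaluation query equal weight regardless of how many relevant items it has.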
python src/eval_embeddings.py

You should see output similar to this. The table compares the two collections by embedding dimension, average Precision@5, and average Recall@5.

```
Embedding queries for 'movies' with the configured provider...
Embedding queries for 'movies_all_minilm' with 'all-minilm'...
Collection          Dim    P@5    R@5
-------------------------------------------
movies              1536   0.60   0.57
movies_all_minilm   384    0.43   0.39
```

---

Congratulations! You measured retrieval quality with Precision@K and Recall@K, then used those metrics to compare two ChromaDB collections built with different embedding models.
-
Challenge
Visualize embeddings with UMAP
In this step, you will use UMAP to project a small set of embeddings into a 2D plot. This makes it easier to visually inspect whether similar movies appear close together, whether genres form groups, and whether unrelated items stand apart as outliers.
UMAP, short for Uniform Manifold Approximation and Projection, is a dimensionality reduction technique commonly used to visualize high-dimensional embeddings. Embedding vectors often have hundreds or thousands of dimensions, making them difficult to inspect directly.
UMAP reduces them to 2D coordinates that can be plotted on a scatter plot. UMAP tries to keep similar items close together in the 2D view. This helps reveal clusters, outliers, and overall structure in the embedding space.
Navigate to https://{{hostname}}--8080.pluralsight.run/data/umap_items.json to inspect the sample dataset used for the UMAP analysis in JSON format.
The dataset includes a few intentional outliers, such as `How to fix a flat tire` and `Going fishing at the lake`, to help illustrate how outliers appear in the visualization.

The embeddings for this dataset have already been generated and are available in the `data/embeddings` folder. This step gives you a visual way to check whether the embedding model groups related items together and separates unrelated items.

Navigate to the **Terminal** and run the following command to test the changes.

Explanation
- `umap.UMAP(...)` creates the reducer object with the projection settings.
- `n_neighbors=5` controls how many nearby points UMAP considers when building the projection. Since this task uses only 16 items, a smaller value helps preserve local groups.
- `min_dist=0.15` controls how tightly points can appear in the 2D layout. A smaller value can make related items appear in more compact groups.
- `metric="cosine"` compares vectors using cosine distance, which is commonly used with text embeddings.
- `random_state=seed` makes the projection reproducible, so running the script with the same seed gives a consistent layout.
- `reducer.fit_transform(vectors)` projects the input vectors from shape `(N, D)` to `(N, 2)`, where each item gets an `x` and `y` coordinate.
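To make "nearby points" under `metric="cosine"` more concrete, here is a small pure-Python sketch of cosine distance and a brute-force neighbor lookup. It only illustrates the distance measure. UMAP's own neighbor search and layout optimization are more involved, and the toy 3-D vectors and function names below are assumptions made for this example.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for same direction, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbors(vectors, index, n_neighbors):
    """Brute-force lookup of the closest items to vectors[index]."""
    distances = [
        (cosine_distance(vectors[index], v), j)
        for j, v in enumerate(vectors)
        if j != index
    ]
    return [j for _, j in sorted(distances)[:n_neighbors]]


# Toy vectors: two pairs of similar items plus one item pointing elsewhere.
vectors = [
    [1.0, 0.1, 0.0],  # 0
    [0.9, 0.2, 0.0],  # 1, similar direction to 0
    [0.0, 1.0, 0.1],  # 2
    [0.0, 0.9, 0.2],  # 3, similar direction to 2
    [0.0, 0.0, 1.0],  # 4, far from both pairs
]

print(nearest_neighbors(vectors, 0, 1))  # [1]
print(nearest_neighbors(vectors, 2, 1))  # [3]
print(round(cosine_distance(vectors[0], vectors[4]), 2))  # 1.0
```

Cosine distance ignores vector length and compares direction only, which is one reason it is a common choice for text embeddings.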
python src/umap_analysis.py

You should see output similar to this:

    Saved plots/umap_analysis.png
    Outliers included for visual inspection:
      - How to fix a flat tire
      - Going fishing at the lake
      - Pasta carbonara recipe
      - Learning Python basics

Navigate to the following URL https://{{hostname}}--8080.pluralsight.run/plots/umap_analysis.png to view the projection.

The 16 items should generally show related movie items grouping near each other, while the four non-movie outliers should appear farther from the main movie groups as `X` markers.

This task scales the UMAP projection from a small curated set to the full 100-movie corpus. This exercise helps you visually check whether movies with similar genres or themes appear near each other in embedding space.

Navigate to the **Terminal** and run the following command to test the changes.

Explanation
- `load_movies_data()` loads the movie metadata and embedding vectors for the selected embedding model.
- Returning both `movies` and `coords_2d` allows the plotting code to match each 2D point back to its movie title and genre.
- `umap.UMAP(...)` creates the reducer object with the projection settings.
- `n_neighbors=15` controls how many nearby points UMAP considers when building the projection. Since this task uses 100 movies, this value gives UMAP a broader neighborhood to learn from.
- `min_dist=0.1` controls how tightly points can appear in the 2D layout. A smaller value can make related movies appear in more compact groups.
- `metric="cosine"` compares vectors using cosine distance, which is commonly used with text embeddings.
- `random_state=seed` makes the projection reproducible, so running the script with the same seed gives a consistent layout.
- `reducer.fit_transform(vectors)` projects the input vectors from shape `(100, D)` to `(100, 2)`, where each movie gets an `x` and `y` coordinate.
python src/umap_visualize.py

You should see output similar to this:

    Saved plots/umap_movies.png

Navigate to the following URL https://{{hostname}}--8080.pluralsight.run/plots/umap_movies.png to view the projection.

---

You projected embeddings into 2D with UMAP and used the resulting visualization to inspect clusters, genre-like structure, and outliers in the dataset.
Congratulations! You have successfully completed the lab.
Key Takeaways
- Understand how text embeddings represent movie summaries as vectors for semantic search.
- Build similarity search and recommendation workflows with FAISS Flat and HNSW indexes.
- Store embeddings, documents, and metadata in ChromaDB for persistent semantic search.
- Apply metadata filters to narrow search results by fields such as genre and year.
- Evaluate retrieval quality with Precision@K and Recall@K.
- Visualize embedding clusters and outliers in 2D with UMAP.