
Lab: Semantic Search and Recommendation System

Develop a solid understanding of semantic search and recommendation systems in this hands-on lab, Semantic Search and Recommendation System. Through practical exercises, you'll clean a movie dataset, generate text embeddings, build similarity search and recommendation workflows with FAISS, store and filter embeddings in ChromaDB, evaluate retrieval quality with Precision@K and Recall@K, and visualize embedding clusters with UMAP. By working through a practical movie search scenario, you'll gain the skills needed to build, evaluate, and inspect embedding-powered search and recommendation applications.


Lab Info

Level

Intermediate

Last updated

Jul 31, 2026

Duration

2h 2m

Challenge

Introduction
Welcome to the Lab: Semantic Search and Recommendation System. This hands-on Code Lab is designed for developers who want to understand how embeddings, vector search, vector databases, evaluation metrics, and visualization work together in a practical semantic search workflow. Throughout the lab, you’ll clean a movie dataset, generate embeddings, search for and recommend movies with FAISS, store and filter embeddings in ChromaDB, evaluate retrieval quality with Precision@K and Recall@K, and visualize embedding clusters with UMAP.

By the end of this Code Lab, you’ll be able to build an end-to-end semantic search and recommendation pipeline. You’ll understand how text is converted into vectors, how nearest-neighbor search finds similar items, how vector databases support metadata filtering, how retrieval quality is measured, and how 2D projections can help reveal whether embeddings capture meaningful structure.

---

Prerequisites
Basic Python Knowledge

Familiarity with core Python concepts such as functions, lists, dictionaries, loops, and importing modules.
Introductory AI and Embedding Concepts

A basic understanding of embeddings, semantic search, or vector similarity is helpful but not required.

Learners should be comfortable with the idea that text can be converted into numeric vectors and compared by meaning.
Text Editor and Terminal Experience

Comfort using a text editor or IDE.

Experience with basic command-line operations, such as navigating directories and running Python scripts.

Ability to run code, test changes, and observe program output in the terminal.

---

## Movie Semantic Search App
This lab provides a simple movie dataset and search app. You will use Python to clean movie data, generate embeddings, and search for similar movies by meaning rather than exact keywords.

Throughout the lab, you will build semantic search and recommendation workflows with FAISS and ChromaDB. You will evaluate search quality with Precision@K and Recall@K, then visualize movie embeddings with UMAP to inspect clusters and outliers.

> The final code for each step is stored in the __solution folder. For example, the final code for Step 2 is available in the __solution/Step02 directory.
Challenge

Prepare the dataset and generate embeddings
In this step, you will prepare a small movie dataset for semantic search. You will clean the raw movie text, generate embedding vectors from movie summaries, and validate that the embeddings capture meaning by finding similar movies through nearest-neighbor search.

Navigate to the following URL https://{{hostname}}--8080.pluralsight.run/data/movies.json to view the raw movie dataset in JSON format.

To enable embedding generation, copy the API key from the top bar and replace <pluralsight-openai-api-key> in the .env file.

---

Raw datasets often contain noise such as HTML tags, encoded characters, inconsistent spacing, repeated text, strange symbols, or irrelevant formatting. Cleaning this text helps the embedding model focus on the actual movie content.

The full preprocessing script uses clean_text to clean each movie title and summary. It also converts year to an integer, parses genres into a list, adds a primary_genre field, and writes the cleaned records to data/movies_clean.json.

This preprocessing step matters because embedding models convert text into vectors. Cleaner text usually produces better embeddings, which helps semantic search find related movies even when the query does not use the exact same words.
Explanation

unescape(text) converts HTML entities into normal characters. For example, &#39; becomes '.

re.sub(r"<[^>]+>", " ", text) replaces HTML tags with spaces. Using a space instead of an empty string prevents nearby words from being joined together. After whitespace cleanup, for example, <p>Hello</p><p>World</p> becomes Hello World instead of HelloWorld.

re.sub(r"\s+", " ", text).strip() replaces repeated whitespace with a single space and removes extra spaces from the beginning and end of the text.
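The three calls above can be combined into a single helper. The following is a minimal sketch of what clean_text might look like, assuming the calls run in the order described; the sample string is made up for illustration, and the lab's own function may differ in detail.

```python
import re
from html import unescape


def clean_text(text: str) -> str:
    """Decode HTML entities, drop tags, and collapse whitespace."""
    text = unescape(text)                      # "&amp;" -> "&", "&#39;" -> "'"
    text = re.sub(r"<[^>]+>", " ", text)       # replace each tag with a space
    return re.sub(r"\s+", " ", text).strip()   # squeeze runs of whitespace


# Made-up noisy input, not taken from the lab dataset.
raw = "<p>Hello</p><p>World</p>  It&#39;s   noisy &amp; messy "
print(clean_text(raw))
```

Decoding entities first means that escaped tags such as `&lt;b&gt;` are turned into real tags and then removed by the second step.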
Navigate to the **Terminal** and execute the following command to test the changes.
```
python src/preprocess.py
```
You should see output similar to the example below.
```
--- BEFORE ---
{
  "id": 1,
  "title": "Blade Runner",
  "year": "1982",
  ...
}

--- AFTER ---
{
  "id": 1,
  "title": "Blade Runner",
  "year": 1982,
  ...
}

Saved 100 records to data/movies_clean.json
```
Navigate to the following URL to view the cleaned movie dataset in JSON format: https://{{hostname}}--8080.pluralsight.run/data/movies_clean.json
---
Embeddings convert text into fixed-length vectors that capture semantic meaning. Similar texts tend to produce similar vectors, allowing applications to compare content based on meaning rather than exact word matches.

For OpenAI's text-embedding-3-small model, each embedding contains 1,536 floating-point values. Although the numbers themselves are not human-readable, they form the foundation of semantic search and retrieval systems.

In this task, you'll generate embeddings for movie content using the OpenAI embeddings API and inspect the resulting vectors.
Explanation

client.embeddings.create(model=model, input=texts) sends the list of input texts to the embedding model in a single batched API call.

The API response contains one embedding for each input text. [item.embedding for item in response.data] extracts each embedding vector and stores the vectors in a plain Python list.

embed_query(text) is a helper function that embeds one piece of text and returns one vector.

The embed_and_inspect function builds the input text from the movie title and summary, sends it to the configured embedding provider, and prints the first 10 numbers of the returned vector.
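As a rough sketch of the batching and extraction logic described above, the helpers below take the client as an argument so the example can run offline. That extra `client` parameter, the `FakeEmbeddings` class, and the three-value vectors are assumptions made for illustration; in the lab, the client would be a real OpenAI client and each vector would have 1,536 values.

```python
from types import SimpleNamespace


def embed_texts(client, texts, model="text-embedding-3-small"):
    """Embed a batch of texts with one API call and return plain lists."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


def embed_query(client, text, model="text-embedding-3-small"):
    """Embed a single text and return its one vector."""
    return embed_texts(client, [text], model)[0]


class FakeEmbeddings:
    """Hypothetical offline stand-in that mimics the response shape."""

    def create(self, model, input):
        # One short made-up vector per input text, in input order.
        data = [SimpleNamespace(embedding=[float(len(t)), 0.0, 1.0]) for t in input]
        return SimpleNamespace(data=data)


fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
vector = embed_query(fake_client, "Title: Blade Runner")
print(vector[:10], len(vector))
```

With the real service, `fake_client` would be replaced by a client created from the `openai` package, which reads the API key from the environment.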
Navigate to the **Terminal** and execute the following command to test the changes.
```
python src/embed_sample.py
```
You should see output similar to the example below.
```
Text being embedded:
Title: Blade Runner

Summary: In a dystopian Los Angeles of 2019, a special police officer hunts down bioengineered humans called 'replicants' who have escaped to Earth. As he investigates, he begins to question the boundary between human and machine, and what makes a life worth living.

First 10 dimensions: [-0.0143, 0.0271, -0.0058, 0.0192, -0.0046, 0.0117, 0.0083, -0.0231, 0.0009, 0.0334]
Embedding length:    1536
```
Embeddings represent the meaning of text as numeric vectors. In this step, each movie summary is converted into a vector that can be compared with other movie vectors.

Semantic search compares meaning instead of exact words. For example, a query about "artificial intelligence" can still match a movie about "replicants" or "machines" because those ideas are related.

Before embeddings can be compared efficiently, they are commonly normalized so every vector has the same length. Once normalized, similarity between vectors can be measured using cosine similarity.

In this next task, you'll prepare movie embeddings for similarity comparison and generate a similarity matrix that compares every movie against every other movie.
Explanation

np.array(raw, dtype=np.float32) converts the Python list of embedding vectors into a NumPy array shaped (N, D), where N is the number of texts and D is the embedding dimension.

NumPy arrays make it easier to perform fast vector operations such as normalization and matrix multiplication.

np.linalg.norm(vectors, axis=1, keepdims=True) calculates the L2 length of each row. axis=1 means the length is calculated across the columns of each row. keepdims=True keeps the result shaped as (N, 1), so NumPy can divide each row by its own length.

vectors = vectors / norms rescales each row so every movie vector has length 1.

Once vectors are normalized, the dot product between two vectors is equivalent to cosine similarity.

movie_vectors @ movie_vectors.T performs one matrix multiplication to compare every movie vector with every other movie vector.

The result is an (N, N) similarity matrix, where sim_matrix[i, j] is the cosine similarity between movie i and movie j.

The diagonal values are 1.0 because each movie is perfectly similar to itself.
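The normalization and matrix multiplication steps can be tried on a tiny example. This sketch uses three made-up 3-dimensional vectors in place of real movie embeddings, so the numbers are illustrative only.

```python
import numpy as np

# Made-up 3-D vectors standing in for real 1,536-D movie embeddings.
raw = [[3.0, 4.0, 0.0], [4.0, 3.0, 0.0], [0.0, 0.0, 2.0]]

vectors = np.array(raw, dtype=np.float32)                # shape (N, D)
norms = np.linalg.norm(vectors, axis=1, keepdims=True)   # shape (N, 1)
movie_vectors = vectors / norms                          # every row has length 1

sim_matrix = movie_vectors @ movie_vectors.T             # shape (N, N)
print(np.round(sim_matrix, 3))

# Rank the neighbors of movie 0 from most to least similar (self comes first).
neighbors = np.argsort(-sim_matrix[0])
print(neighbors)
```

Sorting a row of the matrix in descending order is how a nearest-neighbor listing for one movie can be produced.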
Navigate to the **Terminal** and execute the following command to test the changes.
```
python src/validate_embeddings.py
```
If the embeddings are working well, movies with related themes, genres, or concepts should appear near each other, even when they do not use the same exact words. You should see output similar to the example below.
```
Showing each movie's nearest neighbors (self at rank 1, sim=1.000)
[[1.    0.51  0.15  0.13  0.21]
 [0.51  1.    0.14  0.16  0.22]
 [0.15  0.14  1.    0.47  0.18]
 [0.13  0.16  0.47  1.    0.17]
 [0.21  0.22  0.18  0.17  1.  ]]

The Matrix (Science Fiction)
  1. The Matrix                Science Fiction       sim=1.000
  2. Blade Runner              Science Fiction       sim=0.510
  3. The Shining               Horror                sim=0.210
  4. Pride and Prejudice       Romance               sim=0.150
  5. The Notebook              Romance               sim=0.130

Blade Runner (Science Fiction)
  1. Blade Runner              Science Fiction       sim=1.000
  2. The Matrix                Science Fiction       sim=0.510
  3. The Shining               Horror                sim=0.220
  4. The Notebook              Romance               sim=0.160
  5. Pride and Prejudice       Romance               sim=0.140
```
In this example, The Matrix and Blade Runner are both science fiction movies and appear close to each other. This shows that the embeddings are capturing semantic similarity rather than only exact word matches.

---

Congratulations! You cleaned the dataset, created your first embeddings, and verified that similar movies appear close together in vector space. This is the foundation of semantic search: embed the query, compare it with stored movie vectors, and return the closest matches.
Challenge

Search and recommend with FAISS
In this step, you will use FAISS to search movie embeddings, compare Flat and HNSW indexes, benchmark their speed, and build a simple movie recommendation workflow from multiple inputs.

FAISS, or Facebook AI Similarity Search, is a library for fast similarity search over dense vector embeddings. A FAISS index stores vectors and lets you search for the vectors nearest to a query vector.

In this step, you will build and compare two FAISS Flat search approaches. Each Flat index compares the query vector against all stored movie vectors to retrieve the top-K matches.

---

A Flat index is simple and accurate because it checks every stored vector and returns the exact top-K matches. However, this brute-force approach becomes less practical as the dataset grows because each query must be compared against all vectors. This makes Flat indexes a useful baseline for understanding exact search before exploring faster approximate nearest-neighbor indexes later in the lab.
Explanation

faiss.IndexFlatL2(dim) creates a Flat index that compares vectors using squared L2 distance. Smaller values mean closer matches.

faiss.IndexFlatIP(dim) creates a Flat index that compares vectors using inner product. Larger values mean closer matches.

index.add(movie_vectors) loads the movie vectors into the FAISS index so they can be searched.

embed_and_normalize([query]) converts the search query into an embedding vector and normalizes it so it can be compared with the stored movie vectors.

.search(query_vec, k) returns two arrays: the top-K distances or scores, and the indices of the matching movies in the original movie vector array.

For IndexFlatL2, smaller squared L2 distances mean closer matches.

For IndexFlatIP, larger inner product scores mean closer matches. Because both the movie vectors and query vector are L2-normalized, the inner product score behaves like cosine similarity.

With normalized vectors, IndexFlatL2 and IndexFlatIP should usually return the same movies in the same order, although the distance and score values will differ.
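To see why the two metrics agree on ranking, the sketch below performs the same brute-force comparison in plain NumPy rather than FAISS. The vectors are random stand-ins, not movie embeddings. For unit-length vectors, the squared L2 distance equals 2 minus twice the inner product, so sorting by either value gives the same order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for movie embeddings, L2-normalized row by row.
db = rng.normal(size=(50, 8))
db /= np.linalg.norm(db, axis=1, keepdims=True)
query = rng.normal(size=(8,))
query /= np.linalg.norm(query)

ip_scores = db @ query                        # inner product (cosine here)
l2_sq = ((db - query) ** 2).sum(axis=1)       # squared L2 distance

k = 5
top_ip = np.argsort(-ip_scores)[:k]           # larger score is closer
top_l2 = np.argsort(l2_sq)[:k]                # smaller distance is closer

print(top_ip, top_l2)
print(np.allclose(l2_sq, 2 - 2 * ip_scores))
```

The same relationship can be checked by hand against any pair of scores for the same movie: a cosine of 0.4146 corresponds to a squared distance of 2 - 2 × 0.4146 = 1.1708.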
Navigate to the **Terminal** and execute the following command to test the changes.
```
python src/flat_search.py
```
You should see output similar to the example below.
```
Query: 'movies about horror and romance'

IndexFlatL2  (smaller squared L2 distance = closer):
  1. Halloween                       L2_dist=1.1708
  2. Psycho                          L2_dist=1.2535
  3. A Nightmare on Elm Street       L2_dist=1.2778
  4. It Follows                      L2_dist=1.2847
...

IndexFlatIP  (larger cosine = closer):
  1. Halloween                       cosine=0.4146
  2. Psycho                          cosine=0.3732
  3. A Nightmare on Elm Street       cosine=0.3611
  4. It Follows                      cosine=0.3576
...
```
Notice that both indexes return the same ranking, but the scores are shown differently.

IndexFlatL2 reports squared distance, where smaller is better. IndexFlatIP reports cosine-style similarity, where larger is better.

---

A Flat index compares the query vector against every stored vector. This gives exact results, but it becomes slower and less practical as the dataset grows because each query must be compared with all vectors.

HNSW stands for Hierarchical Navigable Small World. It is an approximate nearest-neighbor algorithm commonly used in vector search systems. Instead of comparing every vector, HNSW builds a graph of vector relationships. Each vector is connected to nearby vectors, and a query walks through the graph to find close matches. This graph-based approach avoids checking every vector, making searches much faster on larger datasets.

Because HNSW is approximate, it may not always return the exact same results as Flat search. In practice, it often returns very similar results while significantly reducing search time.
Explanation

faiss.IndexHNSWFlat(dim, M, faiss.METRIC_INNER_PRODUCT) creates an HNSW index that uses inner-product similarity.

M controls how many graph connections each vector can have. Larger values can improve recall, but they also increase memory usage.

efConstruction controls how carefully the graph is built. Larger values can improve index quality, but they make indexing slower. This value must be set before calling .add() because it affects how the graph is constructed.

efSearch controls how deeply the graph is searched at query time. Larger values can improve recall, but they make searches slower. This value can be adjusted later to tune search behavior.

index_hnsw.add(movie_vectors) adds the vectors to the index and builds the HNSW graph.

faiss.METRIC_INNER_PRODUCT uses the same similarity metric as IndexFlatIP. Because the vectors are normalized, the resulting similarity scores behave like cosine similarity.
Navigate to the **Terminal** and execute the following command to test the changes.
```
python src/hnsw_search.py
```
You should see output similar to the example below.
```
Query: 'movies about artificial intelligence and machines'

Rank Flat (exact)                  HNSW (approx)                 
1    Ex Machina                    Ex Machina                    
2    2001: A Space Odyssey         2001: A Space Odyssey         
3    The Matrix                    The Matrix                    
4    Blade Runner                  Blade Runner                  
...
```
You already built Flat and HNSW indexes over a 100-movie dataset. At that small scale, both indexes usually return the same or very similar results, so HNSW’s speed advantage is not easy to see.

This benchmark uses random synthetic vectors with sizes of 10,000 and 50,000 to make the tradeoff visible.
Explanation

make_vectors(num_vectors, dim) generates a (num_vectors, dim) array of random float32 vectors. faiss.normalize_L2(vectors) then rescales each row to unit length, so inner product is equivalent to cosine similarity.

build_flat(vectors) builds a faiss.IndexFlatIP over the synthetic vectors. Flat indexes are cheap to build because they only store the vectors.

build_hnsw(vectors) builds a faiss.IndexHNSWFlat over the same vectors. HNSW indexes are slower to build because they construct a navigable graph during .add().

Both indexes are built over the same db and timed against the same queries batch. Reusing the same data on both sides keeps the comparison fair, so any difference in search time comes from the index type and not from the inputs.

time_search(index, queries, K) runs one untimed warmup query, then runs several timed search batches, averages them, and returns milliseconds per single query. The warmup keeps one-time costs like cache loading out of the measurement.
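The warmup-then-average timing pattern can be sketched as follows. This version is an assumption-laden illustration: it accepts any search function instead of a FAISS index, and `brute_force_search` is a hypothetical NumPy stand-in for an exact Flat search, so the timings say nothing about FAISS itself.

```python
import time

import numpy as np


def time_search(search_fn, queries, k, repeats=5):
    """Return the average milliseconds per single query for search_fn."""
    search_fn(queries[:1], k)                    # untimed warmup query
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        search_fn(queries, k)                    # one timed batch
        elapsed.append(time.perf_counter() - start)
    mean_batch_seconds = sum(elapsed) / repeats
    return mean_batch_seconds / len(queries) * 1000.0


rng = np.random.default_rng(0)
db = rng.normal(size=(2000, 32)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
queries = db[:20]                                # reuse stored rows as queries


def brute_force_search(batch, k):
    """Exact top-k by inner product, comparing against every stored row."""
    scores = batch @ db.T
    top = np.argsort(-scores, axis=1)[:, :k]
    return np.take_along_axis(scores, top, axis=1), top


ms_per_query = time_search(brute_force_search, queries, k=10)
print(f"{ms_per_query:.3f} ms per query")
```

Because the function only needs something callable with `(queries, k)`, a bound method such as an index's `search` could be passed in the same way.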
Navigate to the **Terminal** and execute the following command to test the changes.
```
python src/flat_vs_hnsw_benchmark.py
```
You should see output similar to the example below.
```
  Vectors | Flat build (s) | HNSW build (s) | Flat search (ms) | HNSW search (ms) |  Speedup
--------------------------------------------------------------------------------------------

   10,000 |          0.013 |          2.166 |            0.355 |            0.205 |     1.7x
   50,000 |          0.072 |         23.830 |            5.812 |            0.821 |     7.1x
```
Item-to-item recommendation uses the same FAISS search pattern from earlier tasks. The difference is the query vector. Instead of embedding a new text query, you use the mean of two or more movie vectors.

The mean-centroid approach combines multiple input movies into one query vector. Each movie is a point in embedding space, and the average of those points becomes a new point that represents the combined preference.

This approach can surface movies that sit between the input movies. For example, if the inputs include Sci-Fi, Romance, and Horror movies, the recommendations may include movies that share themes across those genres rather than matching only one title.
Explanation

movie_vectors[valid_idxs] selects the vectors for the input movie titles.

.mean(axis=0, keepdims=True) averages those vectors column by column. axis=0 creates one average vector across the selected movies, and keepdims=True keeps the result as a 2D array with shape (1, D), which is the format FAISS expects for search queries.

faiss.normalize_L2(mean_vec) re-normalizes the averaged vector back to unit length in place. This is important because the index uses inner product, and normalized vectors make inner product work like cosine similarity. The FAISS helper updates the array directly and does not return a new value.

index.search(mean_vec, K + len(valid_idxs)) searches for the nearest neighbors of the combined preference vector.

The search asks for K + N results, where N is the number of valid input movies. The input movies often appear near the top because the mean vector was built from them. Asking for extra results gives print_results enough candidates after it removes the original input titles.

set(valid_idxs) is passed to skip so the selected input movies are not shown back as recommendations.
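The mean-centroid steps can be illustrated without FAISS. In this sketch, NumPy does the re-normalization and the search, the vectors are random, and the titles are placeholders. It is meant only to show the averaging, the K + N request, and the skip logic.

```python
import numpy as np

titles = ["A", "B", "C", "D", "E"]               # placeholder titles
rng = np.random.default_rng(1)
movie_vectors = rng.normal(size=(5, 4))
movie_vectors /= np.linalg.norm(movie_vectors, axis=1, keepdims=True)


def recommend(valid_idxs, k):
    """Return up to k indices nearest to the mean of the chosen movies."""
    mean_vec = movie_vectors[valid_idxs].mean(axis=0, keepdims=True)   # (1, D)
    mean_vec /= np.linalg.norm(mean_vec, axis=1, keepdims=True)        # unit length
    scores = (mean_vec @ movie_vectors.T)[0]
    # Ask for k + N candidates so that k remain after dropping the N inputs.
    candidates = np.argsort(-scores)[: k + len(valid_idxs)]
    skip = set(valid_idxs)
    return [int(i) for i in candidates if int(i) not in skip][:k]


picks = recommend([0, 2], k=2)
print([titles[i] for i in picks])
```

Requesting only K candidates would risk returning fewer than K recommendations once the input movies are filtered out.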
Navigate to the **Terminal** and execute the following command to test the changes.
```
python src/recommend.py many "The Matrix" "The Notebook" "The Shining"
```
You should see output similar to the example below.
```
More like: [The Matrix, The Notebook, The Shining] (mean of 3 movies)
  1. Ex Machina                      Science Fiction       sim=0.604
  2. Psycho                          Horror                sim=0.592
  3. Eternal Sunshine of the Spotless Mind  Romance               sim=0.583
  4. Groundhog Day                   Comedy                sim=0.579
  5. Memento                         Thriller              sim=0.578
```
In this example, the selected movies come from different genres: Sci-Fi, Romance, and Horror. The mean-centroid vector searches near the center of those preferences, so the results include a mix of genres.

---

Congratulations! You compared Flat and HNSW search on the movie corpus, benchmarked both at scale to see the speed tradeoff, and used a FAISS HNSW index to recommend movies from a combined preference vector.
Challenge

Store and search with ChromaDB
In this step, you will use ChromaDB to store movie embeddings, run semantic search, and filter results by metadata such as genre or year.

The embeddings for the cleaned movie dataset have already been generated for this task and are available in /data/embeddings/movies_clean_openai_text_small.json.

ChromaDB is a vector database that stores embeddings, documents, and metadata together. This makes it useful for building semantic search and recommendation systems without manually managing a low-level vector index.

A collection is the main storage unit in ChromaDB. In this task, you will create a persistent ChromaDB collection and add the movie embeddings, readable documents, and metadata to it.
Explanation

In this task, the movies collection stores one record for each movie.

client.create_collection(name=name, metadata={"hnsw:space": "cosine"}) creates a collection that uses cosine distance for vector search.

Cosine distance is different from cosine similarity. With cosine distance, smaller values mean closer matches. A distance near 0 means the vectors are very similar.

chromadb.PersistentClient(path=...) stores the database on disk. This allows later scripts to reopen and query the same collection without rebuilding it each time.

The ids list converts each movie ID into a string because ChromaDB expects record IDs to be strings.

The documents list stores readable text for each movie. In this task, each document includes the movie title and summary.

The metadatas list stores structured information about each movie, such as title, year, primary_genre, and genres_str.

ChromaDB metadata values must be scalar values, such as strings, numbers, or booleans. Because genres is a list, it is converted into a single comma-separated string named genres_str.

movie_vectors.tolist() converts the NumPy embedding array into a regular Python list of lists, which is the format ChromaDB expects when adding embeddings.

collection.add(...) inserts all movie records into the collection in one batch. The ids, embeddings, documents, and metadatas lists must stay aligned so each movie gets the correct ID, vector, text, and metadata.
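Building the aligned lists is plain Python and can be sketched without a running database. The record below is a shortened, hand-written example, and the field names `summary` and `genres` are assumptions about the cleaned dataset.

```python
# One hand-written record; the summary is shortened for illustration.
movies = [
    {
        "id": 1,
        "title": "Blade Runner",
        "year": 1982,
        "summary": "A special police officer hunts replicants.",
        "genres": ["Science Fiction", "Neo-Noir", "Thriller"],
        "primary_genre": "Science Fiction",
    },
]

ids = [str(m["id"]) for m in movies]                      # IDs as strings
documents = [f"Title: {m['title']}\n\nSummary: {m['summary']}" for m in movies]
metadatas = [
    {
        "title": m["title"],
        "year": m["year"],
        "primary_genre": m["primary_genre"],
        "genres_str": ", ".join(m["genres"]),             # list -> one string
    }
    for m in movies
]

# The lists must line up index by index before they are added together, e.g.
# collection.add(ids=ids, embeddings=movie_vectors.tolist(),
#                documents=documents, metadatas=metadatas)
assert len(ids) == len(documents) == len(metadatas)
print(ids[0], metadatas[0]["genres_str"])
```

Deriving all three lists from the same `movies` sequence in the same order is a simple way to keep them aligned.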
Navigate to the **Terminal** and run the following command to test your changes.
```
python src/chroma_setup.py
```
You should see output similar to the example below.
```
Collection 'movies' now has 100 records.

Sample record (id=1):
{
  "id": "1",
  "metadata": {
    "title": "Blade Runner",
    "year": 1982,
    "primary_genre": "Science Fiction",
    "genres_str": "Science Fiction, Neo-Noir, Thriller"
  },
  "document_preview": "Title: Blade Runner\n\nSummary: In a dystopian Los Angeles of 2019, a special police officer hunts d..."
}
```
Congratulations! You created a persistent ChromaDB collection and added movie embeddings, documents, and metadata to it.

The query embedding must be created with the same embedding model used during ingestion. If the stored movie embeddings and query embedding come from different models, they may not be comparable because they are in different embedding spaces.

collection.query(...) is ChromaDB's search method. It compares the query embedding against the stored movie embeddings and returns the closest matches.

Because the collection was configured with metadata={"hnsw:space": "cosine"} in the previous task, the returned distance values are cosine distances.

Cosine distance is different from cosine similarity. Smaller distance means a closer match, and a distance near 0 means the vectors are very similar.
Explanation

embed_query_for_chroma(query) converts the user's text query into an embedding vector in the format ChromaDB expects.

query_embeddings=query_emb passes the query vector to ChromaDB.

n_results=n_results controls how many matches are returned. For example, if n_results is 5, ChromaDB returns the top five closest movies.

include=["metadatas", "distances"] tells ChromaDB to return the movie metadata and distance values. ChromaDB always returns record IDs, so include controls which extra fields are included.

You do not need to manually normalize the query embedding in this function. The query embedding should be passed in the same format as the stored movie embeddings, and ChromaDB will use the collection's configured cosine distance metric during search.

The returned result is a dictionary. For a single query, the matching metadata is available from results["metadatas"][0], and the matching distances are available from results["distances"][0].

The print_search_results helper uses those values to display the movie title, year, genre, and cosine distance.
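The nested shape of the result dictionary is easy to misread, so here is a small sketch of how the first query's matches might be unpacked. `format_results` is a hypothetical helper, not the lab's print_search_results, and the sample dictionary is hand-written with the IDs left out for brevity.

```python
def format_results(results):
    """Turn a single-query result dict into printable lines."""
    metadatas = results["metadatas"][0]      # matches for the first (only) query
    distances = results["distances"][0]
    lines = []
    for rank, (meta, dist) in enumerate(zip(metadatas, distances), start=1):
        lines.append(
            f"{rank}. {meta['title']} ({meta['year']})  "
            f"{meta['primary_genre']}  cos_dist={dist:.4f}"
        )
    return lines


# Hand-written sample in the nested-list shape described above.
sample = {
    "metadatas": [[
        {"title": "Ex Machina", "year": 2014, "primary_genre": "Science Fiction"},
        {"title": "2001: A Space Odyssey", "year": 1968, "primary_genre": "Science Fiction"},
    ]],
    "distances": [[0.5292, 0.5821]],
}

for line in format_results(sample):
    print(line)
```

The outer list has one entry per query, which is why index `[0]` is needed even when only one query is sent.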
Navigate to the **Terminal** and execute the following command to test the changes.
```
python src/chroma_search.py search "movies about artificial intelligence"
```
You should see output similar to the example below.
```
Query: 'movies about artificial intelligence'

  1. Ex Machina                      (2014)  Science Fiction       cos_dist=0.5292
  2. 2001: A Space Odyssey           (1968)  Science Fiction       cos_dist=0.5821
  3. The Matrix                      (1999)  Science Fiction       cos_dist=0.5954
  4. Arrival                         (2016)  Science Fiction       cos_dist=0.6139
  5. Blade Runner                    (1982)  Science Fiction       cos_dist=0.6141
```
A simple filter checks for an exact metadata match. For example, {"primary_genre": "Science Fiction"} returns only movies whose primary_genre is "Science Fiction".

ChromaDB also supports comparison operators for numeric metadata. For example, {"year": {"$gte": 2010}} returns only movies released in 2010 or later.

Multiple conditions can be combined with $and or $or. For example:
```
{
    "$and": [
        {"primary_genre": "Science Fiction"},
        {"year": {"$gte": 2010}},
    ]
}
```
In this task, the metadata filter narrows the candidate records before the final matches are returned.
Explanation

search_with_filter works like search_collection from the previous task, but it adds one extra parameter: where.

where=where passes a metadata filter to ChromaDB. The filter controls which records are allowed to appear in the search results.

ChromaDB returns cosine distance because the collection was configured with metadata={"hnsw:space": "cosine"} in the previous task. Smaller distance means a closer semantic match.

If no records match the filter, ChromaDB returns empty result sets.
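A filter like the ones shown earlier can be assembled from optional arguments. `build_where` below is a hypothetical helper, not part of the lab code. It returns a single condition unwrapped, on the assumption that `$and` is meant for combining two or more conditions.

```python
def build_where(genre=None, min_year=None):
    """Build a metadata filter dict, or None when no filter is requested."""
    conditions = []
    if genre is not None:
        conditions.append({"primary_genre": genre})
    if min_year is not None:
        conditions.append({"year": {"$gte": min_year}})
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]          # a single condition needs no $and wrapper
    return {"$and": conditions}


print(build_where("Romance"))
print(build_where("Romance", 2010))
```

The returned value would then be passed as the `where` argument of the query call.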
Navigate to the **Terminal** and execute the following command to test the changes.
```
python src/chroma_search.py filter "feel-good romantic movie" "Romance"
```
You should see output similar to the example below.
```
Query: 'feel-good romantic movie'  Filter: primary_genre='Romance'

  1. 500 Days of Summer              (2009)  Romance               cos_dist=0.5262
  2. Eternal Sunshine of the Spotless Mind  (2004)  Romance               cos_dist=0.5633
  3. Lost in Translation             (2003)  Romance               cos_dist=0.5751
  4. La La Land                      (2016)  Romance               cos_dist=0.5886
  5. Amelie                          (2001)  Romance               cos_dist=0.5918
```
Now run the same query with an additional year constraint.
```
python src/chroma_search.py filter "feel-good romantic movie" "Romance" 2010
```
You should see output similar to the example below.
```
Query: 'feel-good romantic movie'  Filter: primary_genre='Romance', year>=2010

  1. La La Land                      (2016)  Romance               cos_dist=0.5885
  2. Call Me by Your Name            (2017)  Romance               cos_dist=0.6331
```

---

Congratulations! You stored movie embeddings in ChromaDB and ran semantic search with and without metadata filters.
Challenge

Evaluate retrieval quality
In this step, you will evaluate semantic search using Precision@K and Recall@K. You will measure how many of the top search results are relevant, determine how many known relevant items were retrieved, and compare two ChromaDB collections built using different embedding models.

The evaluation queries for this task have already been created. The file contains the evaluation queries along with the IDs of the relevant movies used to score the search results.

Open https://{{hostname}}--8080.pluralsight.run/data/eval_queries.json to inspect the evaluation queries in JSON format.

Precision@K answers the question: Of the top K results returned, what fraction are actually relevant? It is one of the foundational metrics for evaluating semantic search.

To compute Precision@K, you need three things:
- A ranked list of retrieved IDs
- A set of known relevant IDs from an evaluation set
- The value of K
The formula is simple: count how many of the top K retrieved IDs appear in the relevant set, then divide that count by K.
Explanation

The relevant_ids set is the ground truth. It contains the IDs that are considered relevant for a specific evaluation query. In this lab, those relevant IDs are stored in data/eval_queries.json.

retrieved_ids[:k] takes the first k items from the ranked retrieval results. Since the results are already sorted by similarity, these are the top matches returned by the search system.

sum(1 for r in top_k if r in relevant_ids) checks each retrieved ID in top_k and adds 1 when that ID exists in the relevant set. The total count is stored in hits.

hits / k returns a score between 0 and 1. For example, a Precision@5 score of 0.80 means 4 out of the top 5 results were relevant.

relevant_ids is typed as set[int] because membership checks such as r in relevant_ids are faster with a set than with a list.

This function assumes k is greater than 0, because dividing by 0 would cause an error.

Precision@K can decrease as k increases because lower-ranked results are usually less relevant than the top results. For example, Precision@5 may be higher than Precision@10.
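Putting those pieces together, a minimal version of the function might look like the sketch below. The explicit check on k is an addition for safety; the description above only assumes that k is positive. The IDs are the ones from the romantic-comedy example query.

```python
def precision_at_k(retrieved_ids: list[int], relevant_ids: set[int], k: int) -> float:
    """Fraction of the top k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be greater than 0")
    top_k = retrieved_ids[:k]
    hits = sum(1 for r in top_k if r in relevant_ids)
    return hits / k


retrieved = [46, 47, 43, 48, 44, 57, 40, 39, 22, 49]   # ranked search results
relevant = {40, 46, 47, 52, 53}                        # ground truth

print(f"P@5  = {precision_at_k(retrieved, relevant, 5):.2f}")   # P@5  = 0.40
print(f"P@10 = {precision_at_k(retrieved, relevant, 10):.2f}")  # P@10 = 0.30
```

Two of the top five IDs are relevant, and the third relevant ID only appears at rank 7, which is why the score drops from 0.40 to 0.30 as k doubles.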
Navigate to the **Terminal** and run the following command:
```
python src/precision.py romantic
```
You should see output similar to the example below.
```
Query: 'feel-good romantic comedy'
Relevant items in eval set: [40, 46, 47, 52, 53]

Top 10 retrieved:
   1. id=46   500 Days of Summer              ✓ relevant
   2. id=47   Amelie                          ✓ relevant
   3. id=43   Eternal Sunshine of the Spotless Mind   
   4. id=48   Lost in Translation              
   5. id=44   La La Land                       
   6. id=57   Some Like It Hot                 
   7. id=40   When Harry Met Sally            ✓ relevant
   8. id=39   The Notebook                     
   9. id=22   Edward Scissorhands              
  10. id=49   Call Me by Your Name             

P@5   = 0.40
P@10  = 0.30
```
Precision tells you what fraction of the returned results are relevant, while recall tells you what fraction of the known relevant results were found.

Recall@K is an information retrieval metric that measures what fraction of the known relevant items the search found. Recall@K answers the question: Of all the relevant items that exist, what fraction did we return in the top K?

To compute Recall@K, compare the top K retrieved IDs with the relevant IDs from the evaluation set. Count how many relevant items were found, then divide that count by the total number of relevant items. The formula is |top_k ∩ relevant_ids| / |relevant_ids|. The numerator is the number of relevant items found in the top k results. The denominator is the total number of relevant items in the evaluation set for that query.
Explanation

set(retrieved_ids[:k]) takes the first k retrieved IDs and converts them into a set. This makes it easy and efficient to compare them with the relevant IDs.

if not relevant_ids: return 0.0 handles the edge case where the evaluation set has no relevant items for a query. Without this check, the function could cause a divide-by-zero error.

top_k & relevant_ids uses Python's set-intersection operator. It returns the IDs that appear in both sets: the relevant items that were returned in the top k.
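
Putting those three pieces together, a Recall@K function might look like the sketch below. It is assembled from the explanation above and may differ in detail from the lab's own file; the IDs are copied from the sample 'feel-good romantic comedy' query in this step.

```python
def recall_at_k(retrieved_ids: list[int], relevant_ids: set[int], k: int) -> float:
    """Fraction of the known relevant IDs that appear in the top-k results."""
    if not relevant_ids:
        # No labeled relevant items: avoid dividing by zero.
        return 0.0
    top_k = set(retrieved_ids[:k])
    # Set intersection keeps only the IDs present in both sets.
    return len(top_k & relevant_ids) / len(relevant_ids)


retrieved = [46, 47, 43, 48, 44, 57, 40, 39, 22, 49]
relevant = {40, 46, 47, 52, 53}

r_at_5 = recall_at_k(retrieved, relevant, 5)
r_at_10 = recall_at_k(retrieved, relevant, 10)
r_empty = recall_at_k(retrieved, set(), 5)
print(f"R@5  = {r_at_5:.2f}")
print(f"R@10 = {r_at_10:.2f}")
```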
Navigate to the **Terminal** and run the following command:
```
python src/recall_and_analysis.py romantic
```
You should see output similar to this:
```
Query: 'feel-good romantic comedy'
Relevant in eval set (5): [40, 46, 47, 52, 53]

R@5   = 0.40
R@10  = 0.60

=== Per-query analysis (K=5) ===

Found (2/5 relevant items returned in top 5):
  + id=46   500 Days of Summer              [primary_genre=Romance]
  + id=47   Amelie                          [primary_genre=Romance]

Missed (3 relevant items NOT in top 5):
  - id=40   When Harry Met Sally            [primary_genre=Romance]
  - id=52   Bridesmaids                     [primary_genre=Comedy]
  - id=53   Groundhog Day                   [primary_genre=Comedy]

Snuck in (3 irrelevant items in top 5):
  ? id=43   Eternal Sunshine of the Spotless Mind  [primary_genre=Romance]
  ? id=48   Lost in Translation             [primary_genre=Romance]
  ? id=44   La La Land                      [primary_genre=Romance]
```
A Recall@5 score of 1.00 means every relevant item appeared in the top 5 results. A Recall@5 score of 0.40 means 40% of the relevant items were found in the top 5.

Recall@K cannot decrease as k grows because the top k set only gets larger while the number of relevant items stays fixed. It either stays the same or increases. In the sample output above, Recall@10 (0.60) is higher than Recall@5 (0.40).

The script also prints a Found/Missed/Snuck-in breakdown. “Found” items are relevant results returned in the top k, “Missed” items are relevant results that did not make the top k, and “Snuck-in” items are irrelevant results that appeared in the top k.
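
One way to compute such a breakdown is with a few set and list operations. The sketch below is an assumption about how it could be done, not the script's actual implementation; the helper name breakdown is hypothetical, and the IDs come from the sample output in this step.

```python
def breakdown(retrieved_ids: list[int], relevant_ids: set[int], k: int):
    """Split the top-k results into found, missed, and snuck-in IDs."""
    top_k = retrieved_ids[:k]
    found = [r for r in top_k if r in relevant_ids]         # relevant and returned
    snuck_in = [r for r in top_k if r not in relevant_ids]  # returned but not relevant
    missed = sorted(relevant_ids - set(top_k))              # relevant but not returned
    return found, missed, snuck_in


retrieved = [46, 47, 43, 48, 44, 57, 40, 39, 22, 49]
relevant = {40, 46, 47, 52, 53}

found, missed, snuck_in = breakdown(retrieved, relevant, 5)
print("Found:   ", found)
print("Missed:  ", missed)
print("Snuck in:", snuck_in)
```

Found and snuck-in items keep their rank order, while missed items are sorted by ID because they have no rank within the top k.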

Precision and recall measure retrieval quality from different angles. A search with high precision but low recall returns mostly relevant results but misses some relevant items. A search with high recall but low precision finds many relevant items but may also include more irrelevant results.
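
To see how the two metrics move as K changes, the sketch below prints both for every K from 1 to 10 on the sample ranking from this step. The two helper functions are rewritten here from the formulas in this step so the example stands alone; the lab's own versions may differ.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    return sum(1 for r in retrieved_ids[:k] if r in relevant_ids) / k


def recall_at_k(retrieved_ids, relevant_ids, k):
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


retrieved = [46, 47, 43, 48, 44, 57, 40, 39, 22, 49]
relevant = {40, 46, 47, 52, 53}

# One (K, P@K, R@K) row for each cutoff.
rows = [
    (k, precision_at_k(retrieved, relevant, k), recall_at_k(retrieved, relevant, k))
    for k in range(1, len(retrieved) + 1)
]

print(" K   P@K   R@K")
for k, p, r in rows:
    print(f"{k:>2}  {p:.2f}  {r:.2f}")
```

On this ranking, recall never drops as K grows, while precision falls from K=2 to K=6 and then rises briefly at K=7 when another relevant item appears.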

Embedding quality is not the only factor that affects retrieval performance. Other factors, such as data cleaning, text preprocessing, metadata quality, the similarity metric, and the value of K, can also affect the final Precision@K and Recall@K scores. The comparison table printed at the end of this step helps you see whether changing the embedding model improves or reduces retrieval quality on the same evaluation set.

Now, you will compare Precision@K and Recall@K between two embedding models: text-embedding-3-small and all-minilm. You will evaluate two pre-built ChromaDB collections: movies, which was created using text-embedding-3-small, and movies_all_minilm, which was created using all-minilm.

Both collections are evaluated against the same set of evaluation queries. Since each collection was built with a different embedding model, the evaluation queries must be embedded using the corresponding model before scoring. Queries for the movies collection should use text-embedding-3-small, while queries for the movies_all_minilm collection should use all-minilm.

The evaluation queries embedded with all-minilm have already been created for this lab and are available in /data/embeddings/eval_queries_all_minilm.json.
Explanation

evaluate_collection opens an existing ChromaDB collection, runs each evaluation query against it, and scores the retrieved movie IDs.

collection.query(...) runs semantic search for one query embedding. The query embedding is passed through query_embeddings=[query_embeddings[i]], so each evaluation query is searched separately.

ChromaDB returns matching IDs ordered by distance, with the closest match first. The first K IDs are treated as the top K retrieved results.

retrieved_ids = [int(rid) for rid in chroma_result["ids"][0]] converts ChromaDB string IDs back into integers so they can be compared with the hand-labeled relevant IDs from the evaluation set.

precision_at_k(retrieved_ids, relevant, K) reuses the Precision@K function from Task 1 to measure what fraction of the top K returned results are relevant.

recall_at_k(retrieved_ids, relevant, K) reuses the Recall@K function from Task 2 to measure what fraction of the known relevant items were found in the top K results.

Each dictionary appended to results stores the precision and recall scores for one evaluation query. The provided show_all function then averages those scores across the full evaluation set.
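
The evaluation loop described above can be sketched as follows. To keep the example runnable without a database, it uses a hypothetical StubCollection class that only mimics the shape of a ChromaDB query result (a dictionary whose "ids" entry holds one list of string IDs per query). The toy vectors, the eval_queries structure, and the choice to pass the collection in as an argument are all assumptions for illustration.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    return sum(1 for r in retrieved_ids[:k] if r in relevant_ids) / k


def recall_at_k(retrieved_ids, relevant_ids, k):
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


class StubCollection:
    """Hypothetical stand-in for a collection; ranks IDs by squared distance."""

    def __init__(self, vectors):
        self._vectors = vectors  # maps string ID -> vector

    def query(self, query_embeddings, n_results):
        ids_per_query = []
        for q in query_embeddings:
            ranked = sorted(
                self._vectors,
                key=lambda doc_id: sum((a - b) ** 2 for a, b in zip(q, self._vectors[doc_id])),
            )
            ids_per_query.append(ranked[:n_results])
        return {"ids": ids_per_query}


def evaluate_collection(collection, query_embeddings, eval_queries, k):
    results = []
    for i, item in enumerate(eval_queries):
        # Search one query embedding at a time.
        chroma_result = collection.query(query_embeddings=[query_embeddings[i]], n_results=k)
        # String IDs are converted back to integers for comparison.
        retrieved_ids = [int(rid) for rid in chroma_result["ids"][0]]
        relevant = set(item["relevant_ids"])
        results.append({
            "query": item["query"],
            "precision": precision_at_k(retrieved_ids, relevant, k),
            "recall": recall_at_k(retrieved_ids, relevant, k),
        })
    return results


# Toy data, invented for this sketch.
collection = StubCollection({
    "1": [0.0, 0.0], "2": [1.0, 0.0], "3": [0.0, 1.0],
    "4": [5.0, 5.0], "5": [6.0, 5.0],
})
eval_queries = [
    {"query": "toy query A", "relevant_ids": [1, 2]},
    {"query": "toy query B", "relevant_ids": [3, 5]},
]
query_embeddings = [[0.1, 0.0], [5.0, 5.1]]

results = evaluate_collection(collection, query_embeddings, eval_queries, k=2)
avg_precision = sum(r["precision"] for r in results) / len(results)
avg_recall = sum(r["recall"] for r in results) / len(results)
print(f"avg P@2 = {avg_precision:.2f}, avg R@2 = {avg_recall:.2f}")
```

The last three lines show the kind of averaging across queries that the explanation attributes to show_all.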
Navigate to the **Terminal** and run the following command:
```
python src/eval_embeddings.py
```
You should see output similar to this. The table compares the two collections by embedding dimension, average Precision@5, and average Recall@5.
```
Embedding queries for 'movies' with the configured provider...
Embedding queries for 'movies_all_minilm' with 'all-minilm'...

Collection         Dim     P@5     R@5     
-------------------------------------------
movies             1536    0.60    0.57    
movies_all_minilm  384     0.43    0.39  
```

Congratulations! You measured retrieval quality with Precision@K and Recall@K, then used those metrics to compare two ChromaDB collections built with different embedding models.

Challenge

Visualize embeddings with UMAP
In this step, you will use UMAP to project a small set of embeddings into a 2D plot. This makes it easier to visually inspect whether similar movies appear close together, whether genres form groups, and whether unrelated items stand apart as outliers.

UMAP, short for Uniform Manifold Approximation and Projection, is a dimensionality reduction technique commonly used to visualize high-dimensional embeddings. Embedding vectors often have hundreds or thousands of dimensions, making them difficult to inspect directly.

UMAP reduces them to 2D coordinates that can be plotted on a scatter plot. UMAP tries to keep similar items close together in the 2D view. This helps reveal clusters, outliers, and overall structure in the embedding space.

Navigate to https://{{hostname}}--8080.pluralsight.run/data/umap_items.json to inspect the sample dataset used for the UMAP analysis in JSON format.

The dataset includes a few intentional outliers, such as "How to fix a flat tire" and "Going fishing at the lake", to help illustrate how outliers appear in the visualization.

The embeddings for this dataset have already been generated and are available in the data/embeddings folder. This step gives you a visual way to check whether the embedding model groups related items together and separates unrelated items.
Explanation

umap.UMAP(...) creates the reducer object with the projection settings.

n_neighbors=5 controls how many nearby points UMAP considers when building the projection. Since this task uses only 16 items, a smaller value helps preserve local groups.

min_dist=0.15 controls how tightly points can appear in the 2D layout. A smaller value can make related items appear in more compact groups.

metric="cosine" compares vectors using cosine distance, which is commonly used with text embeddings.

random_state=seed makes the projection reproducible, so running the script with the same seed gives a consistent layout.

reducer.fit_transform(vectors) projects the input vectors from shape (N, D) to (N, 2), where each item gets an x and y coordinate.
Navigate to the **Terminal** and run the following command to test the changes.
```
python src/umap_analysis.py
```
You should see output similar to this:
```
Saved plots/umap_analysis.png

Outliers included for visual inspection:
  - How to fix a flat tire
  - Going fishing at the lake
  - Pasta carbonara recipe
  - Learning Python basics
```
Navigate to the following URL https://{{hostname}}--8080.pluralsight.run/plots/umap_analysis.png to view the projection.

In the plot, related movie items should generally group near each other, while the four non-movie outliers, drawn as X markers, should appear farther from the main movie groups.

This task scales the UMAP projection from the small curated set of 16 items to the full 100-movie corpus. It helps you visually check whether movies with similar genres or themes appear near each other in embedding space.
Explanation

load_movies_data() loads the movie metadata and embedding vectors for the selected embedding model.

Returning both movies and coords_2d allows the plotting code to match each 2D point back to its movie title and genre.

umap.UMAP(...) creates the reducer object with the projection settings.

n_neighbors=15 controls how many nearby points UMAP considers when building the projection. Since this task uses 100 movies, this value gives UMAP a broader neighborhood to learn from.

min_dist=0.1 controls how tightly points can appear in the 2D layout. A smaller value can make related movies appear in more compact groups.

metric="cosine" compares vectors using cosine distance, which is commonly used with text embeddings.

random_state=seed makes the projection reproducible, so running the script with the same seed gives a consistent layout.

reducer.fit_transform(vectors) projects the input vectors from shape (100, D) to (100, 2), where each movie gets an x and y coordinate.
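
To make metric="cosine" more concrete, the sketch below computes cosine distance for a few small vectors in plain Python. The function name cosine_distance is hypothetical and the vectors are invented; UMAP performs this comparison internally, so the lab code does not need a helper like this.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([3.0, 4.0], [6.0, 8.0])  # b is a scaled copy of a
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # vectors at a right angle
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # vectors pointing apart

print(f"same direction: {same_direction:.2f}")
print(f"orthogonal:     {orthogonal:.2f}")
print(f"opposite:       {opposite:.2f}")
```

Because the measure depends on direction rather than length, two vectors that point the same way are treated as identical even when their magnitudes differ.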
Navigate to the **Terminal** and run the following command to test the changes.
```
python src/umap_visualize.py
```
You should see output similar to this:
```
Saved plots/umap_movies.png
```
Navigate to the following URL https://{{hostname}}--8080.pluralsight.run/plots/umap_movies.png to view the projection.

You projected embeddings into 2D with UMAP and used the resulting visualization to inspect clusters, genre-like structure, and outliers in the dataset.

Congratulations! You have successfully completed the lab.

Key Takeaways
- Understand how text embeddings represent movie summaries as vectors for semantic search.
- Build similarity search and recommendation workflows with FAISS Flat and HNSW indexes.
- Store embeddings, documents, and metadata in ChromaDB for persistent semantic search.
- Apply metadata filters to narrow search results by fields such as genre and year.
- Evaluate retrieval quality with Precision@K and Recall@K.
- Visualize embedding clusters and outliers in 2D with UMAP.

About the author

Asmin Bhandari

Asmin Bhandari is a full-stack developer with years of experience in designing, developing, and testing many applications and web-based systems.

