Q-learning
In this lab, you’ll practice tabular Q-learning and train an agent to master a Gym environment. When you’re finished, you’ll have a clever agent that learns from experience and makes smart moves on its own!
Introduction
Imagine standing on a frozen lake made of slippery tiles. Some tiles are safe, others will make you fall into freezing water, and only one leads to safety. You don’t have a map, and no one tells you which move is correct. The only way to succeed is to try, fail, learn, and adapt.
This is the type of problem Reinforcement Learning (RL) is designed to solve.
In Reinforcement Learning, an agent learns how to behave by repeatedly interacting with an environment. At each step, the agent observes its situation, chooses an action, and receives feedback in the form of a reward. Over time, the agent improves its decisions by favoring actions that lead to better long-term outcomes.
In this lab, the environment is discrete and fully observable, making it ideal for Tabular Q-learning — one of the simplest yet most powerful RL algorithms.
Core Elements
Throughout this lab, learning happens through the interaction of a few key components:
- Environment: A grid-based world where each cell represents a possible situation. Some states are safe, some are terminal failures, and one represents success.
- State: The agent's current position on the grid.
- Actions: The possible moves the agent can take (left, right, up, down).
- Reward Signal:
  - A positive reward for reaching the goal
  - Zero or negative reward for falling into a hole or taking unnecessary steps
- Policy: A rule that determines which action to take in each state.
- Q-table: A table that stores the expected long-term value of taking a specific action in a specific state.
What makes Q-learning special?
Unlike supervised learning, there is:
- No labeled dataset
- No correct action provided upfront
- No knowledge of the environment’s dynamics
Q-learning works by estimating action values directly from experience. Each interaction updates the Q-table using observed rewards and future expectations. Over time, this table becomes a map of “what works” and “what doesn’t.”
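Concretely, each interaction nudges one entry of the table toward what the agent just observed, using the update rule you will implement later in the lab (α is the learning rate, γ is the discount factor):
Q[state, action] = Q[state, action] + α * (reward + γ * max(Q[next_state]) - Q[state, action])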
Key characteristics of Tabular Q-learning:
- Model-free: no knowledge of transitions is required
- Off-policy: learning can occur even while exploring
- Incremental: values improve step by step through experience
This makes Q-learning especially well-suited for small, discrete environments like FrozenLake.
Exploration vs Exploitation
Early in training, the agent knows nothing. Random exploration is necessary to discover which paths are safe and which are dangerous. Later, once knowledge accumulates, the agent must rely more on what it has learned.
This balance is handled through an epsilon-greedy strategy, where the agent:
- Explores with probability ε
- Exploits the best known action with probability (1 − ε)
As training progresses, ε is gradually reduced, allowing the agent to shift from exploration to confident decision-making.
Learning Objectives
By the end of this lab, the agent should:
- Learn to navigate the environment without falling into holes
- Discover the safest path to the goal
- Maximize cumulative reward over episodes
- Demonstrate how simple value updates can lead to intelligent behavior
Through hands-on implementation, you’ll see how a table of numbers — updated through trial and error — can encode a complete decision-making strategy.
You are now ready to build the environment, define the learning dynamics, and train your first Tabular Q-learning agent.
info > If you get stuck at some point in the lab, solution files have been provided for you within the `solution` folder in your file tree.
Challenge
Building the foundation (Initialization)
Initializing the Q-table
Before your agent can learn how to navigate FrozenLake, it needs a way to store experience.
In FrozenLake, the world is a grid of icy tiles:
- Some tiles are safe
- Some are holes
- One tile is the goal
At every moment, the agent:
- Is standing on one tile → this is the state
- Can choose one movement (left, right, up, down) → this is the action
How Q-learning Thinks About FrozenLake
Q-learning doesn’t memorize paths. Instead, it learns to answer this question:
“If I am on this tile, how good is it to move in each direction?”
To do that, it uses a Q-table:
- Rows → states (tiles on the lake)
- Columns → actions (possible moves)
- Values → expected future reward for taking an action in a state
At the very beginning:
- The agent knows nothing
- Every move looks equally bad (or equally good)
So you start with a table filled with zeros.
Where Do States and Actions Come From?
FrozenLake already tells you everything you need:
- How many tiles exist on the lake
- How many moves are allowed at each tile
Gym exposes this information directly through the environment.
Once you extract:
- the number of states
- the number of actions
you can create a Q-table with the correct shape and let learning begin.
Files to Modify
Make sure you have completed all the TODOs in the file `Task1.py` located in the `src/` folder.
Helpful Hints:
- `env.observation_space.n` → number of states
- `env.action_space.n` → number of actions
Use these values to define the shape of your Q-table.
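As a rough sketch of what this initialization could look like, assuming the Gymnasium API (the exact environment setup and variable names in `Task1.py` may differ):

```python
import gymnasium as gym
import numpy as np

# Create the FrozenLake environment (the lab's script may configure it differently).
env = gym.make("FrozenLake-v1")

# The environment exposes the sizes of the state and action spaces directly.
n_states = env.observation_space.n   # number of tiles on the lake
n_actions = env.action_space.n       # number of possible moves

# Start with a Q-table of zeros: every action in every state looks equally good.
q_table = np.zeros((n_states, n_actions))
print(q_table.shape)  # (16, 4) for the default 4x4 map
```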
Run Your Experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root. This is important so that all imports work correctly.
- Run the experiment script using the following command:
python3 src/Task1.py
Challenge
Epsilon-greedy
Action Selection with ε-greedy (FrozenLake Intuition)
In FrozenLake, the agent stands on an icy grid where every move matters.
At each state (a tile on the lake), the agent must decide which direction to move next. Early in training, the agent knows almost nothing about the lake:
- Some tiles are safe
- Some tiles lead to holes
- Only one path reaches the goal
To learn efficiently, the agent must balance exploration and exploitation.
Exploration vs Exploitation
- Exploration
  The agent tries random actions to discover the environment:
  - What happens if I go left?
  - Is this path dangerous?
  - Does this tile lead closer to the goal?
- Exploitation
  The agent uses what it has already learned:
  - Choose the action with the highest expected reward
  - Follow the safest and most rewarding path so far
The ε-greedy strategy is a simple and effective way to manage this trade-off.
ε-greedy Strategy in FrozenLake
At each step on the frozen grid:
- With probability ε (epsilon) → the agent explores by choosing a random direction (slips happen, curiosity matters)
- With probability 1 − ε → the agent exploits by choosing the best-known action (following the safest path learned so far)
As training progresses, ε is usually reduced so the agent explores less and trusts its learned policy more.
This function is a core building block of Q-learning:
without exploration, the agent may never discover the optimal path.
Files to Modify
Make sure you have completed all the TODOs in the file `Task2.py` located in the `src/` folder.
Helpful Hints
- `np.random.rand()` generates a random float in [0, 1)
- `np.random.choice(...)` can sample a random action
- `np.argmax(...)` returns the index of the maximum Q-value
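Putting these hints together, a minimal sketch of an ε-greedy selection function might look like this (the function and parameter names are illustrative and may not match `Task2.py` exactly):

```python
import numpy as np

def select_action(q_table, state, epsilon, n_actions):
    """Epsilon-greedy action selection (illustrative sketch)."""
    if np.random.rand() < epsilon:
        # Explore: pick a random direction.
        return np.random.choice(n_actions)
    # Exploit: pick the action with the highest estimated value for this state.
    return int(np.argmax(q_table[state]))
```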
Run Your Experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root. This is important so that all imports work correctly.
- Run the experiment script using the following command:
python3 src/Task2.py
Challenge
Q-value Updates
Q-learning Update Rule (Core Learning Step)
At the heart of Tabular Q-learning lies a simple but powerful idea:
after taking an action, the agent updates its belief about how good that action was.
In FrozenLake, this means:
- The agent slips or moves across icy tiles
- Observes a reward (usually 0, sometimes 1 when reaching the goal)
- Updates one single value in the Q-table to reflect what it just learned
Each update answers the question:
“Given what just happened, was this action in this state better or worse than I thought?”
What the Update Does Intuitively
When the agent moves from one tile to another:
- It looks at the reward it just received
- It looks ahead to the best possible future reward from the next tile
- It slightly adjusts its current Q-value toward this new information
This gradual adjustment is what allows the agent to:
- Learn safe paths
- Avoid holes
- Eventually cross the frozen lake efficiently
Function Responsibilities
Your Q-learning update function must:
- Take the current experience of the agent
- Update only one cell in the Q-table
- Modify the Q-table in place (no copy, no return needed)
What Must Be Updated
Only one value must be updated: the Q-value for the chosen state-action pair, Q[state, action].
Files to Modify
Make sure you have completed all the TODOs in the file `Task3.py` located in the `src/` folder.
Helpful Hints
- Use `np.max(q_table[next_state])` to find the best future value.
- Carefully place parentheses to correctly compute the formula:
Q[state, action] = Q[state, action] + α * (reward + γ * max(Q[next_state]) - Q[state, action])
- Remember, only update the Q-value for the chosen state-action pair.
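Putting the hints together, an in-place update function might look like the sketch below (the function signature is illustrative, not necessarily the one in `Task3.py`):

```python
import numpy as np

def update_q_table(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place (illustrative sketch)."""
    # Best value achievable from the next state, according to the current table.
    best_next = np.max(q_table[next_state])
    # Temporal-difference target and error for the chosen state-action pair.
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state, action]
    # Move the current estimate a small step (alpha) toward the target.
    q_table[state, action] += alpha * td_error
```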
Run Your Experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root. This is important so that all imports work correctly.
- Run the experiment script using the following command:
python3 src/Task3.py
Challenge
Epsilon Decay
Epsilon Decay Function
The epsilon decay function is used in Q-learning to gradually reduce the exploration rate (ε) as training progresses. This allows the agent to explore randomly at the beginning, then exploit its learned knowledge more as it becomes confident in its policy.
Purpose
- Encourage exploration at the start of training.
- Encourage exploitation of learned knowledge later.
- Control how fast the agent shifts from exploring to exploiting.
Files to Modify
Make sure you have completed all the TODOs in the file `Task4.py` located in the `src/` folder.
Helpful Hints
- Use `np.exp(...)` to compute the exponential decay.
- Conceptually, the formula is:
epsilon = epsilon_min + (epsilon_max - epsilon_min) * decay_term
- Ensure that epsilon never falls below epsilon_min, even as episode increases.
- Think of epsilon as sliding from high exploration to low exploration smoothly over episodes.
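A minimal sketch of such a decay function, assuming an exponential decay_term of the form exp(-decay_rate * episode) (the exact form and parameter names used in `Task4.py` may differ):

```python
import numpy as np

def epsilon_decay(episode, epsilon_min=0.01, epsilon_max=1.0, decay_rate=0.005):
    """Exponentially decay epsilon over episodes (illustrative sketch)."""
    # decay_term shrinks from 1 toward 0 as the episode index grows
    # (this exact form is an assumption; check the TODOs in Task4.py).
    decay_term = np.exp(-decay_rate * episode)
    # Slide from epsilon_max toward epsilon_min, never dropping below epsilon_min.
    return epsilon_min + (epsilon_max - epsilon_min) * decay_term
```

Because decay_term stays non-negative, epsilon approaches but never falls below epsilon_min.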
Run Your Experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root. This is important so that all imports work correctly.
- Run the experiment script using the following command:
python3 src/Task4.py
Challenge
Training the Agent & Visualizing Its Behavior
In this step, you will bring everything together and train a Q-learning agent to solve the FrozenLake environment.
FrozenLake is a grid of frozen tiles:
- Some tiles are safe
- Some tiles are holes (episode ends with failure)
- One tile is the goal
The agent starts at the top-left corner and must learn how to reach the goal without falling into holes.
Since there is no labeled dataset, the agent learns purely through trial and error, guided by rewards.
What Happens During Training?
Each episode represents one attempt to cross the frozen lake.
During an episode:
- The agent observes its current position (state)
- It chooses an action (up, down, left, right)
- The environment responds with:
- a new state
- a reward
- a termination signal
- The agent updates its Q-table based on what it just experienced
Over thousands of episodes:
- Bad actions (falling into holes) get low Q-values
- Good actions (reaching the goal) get high Q-values
- Exploration gradually decreases as the agent becomes more confident
Your Objective in This Step
You will:
- Train a Q-learning agent using an epsilon-greedy policy
- Apply epsilon decay to shift from exploration to exploitation
- Store rewards to monitor learning progress
- Run a greedy evaluation episode
- Record a video of the trained agent solving FrozenLake
Video Evaluation
After training, you will visualize what the agent has learned:
- Create a deterministic FrozenLake environment
- Run one greedy episode (no exploration)
- Record the agent’s behavior as a video
- Save the video to the output folder
This allows you to see whether the agent truly learned to avoid holes and reach the goal.
Files to Modify
Make sure you have completed all the TODOs in the file `Task5.py` located in the `src/` folder.
Hints
- `env.reset()` returns `(state, info)`
- `env.step(action)` returns `(next_state, reward, terminated, truncated, info)`
- Use `epsilon_decay(...)` to smoothly reduce exploration
- The Q-table is updated in place
- During evaluation, always use: `np.argmax(q_table[state])`
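Putting the previous tasks together, the core training loop might look roughly like the sketch below. It reuses the illustrative `select_action`, `update_q_table`, and `epsilon_decay` sketches from earlier steps, omits the video-recording part, and its names and hyperparameter values may differ from `Task5.py`:

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, n_episodes = 0.1, 0.99, 10_000  # illustrative values

rewards_per_episode = []
for episode in range(n_episodes):
    state, info = env.reset()
    epsilon = epsilon_decay(episode)       # shift from exploration to exploitation
    total_reward, done = 0.0, False
    while not done:
        action = select_action(q_table, state, epsilon, env.action_space.n)
        next_state, reward, terminated, truncated, info = env.step(action)
        update_q_table(q_table, state, action, reward, next_state, alpha, gamma)
        state = next_state
        total_reward += reward
        done = terminated or truncated
    rewards_per_episode.append(total_reward)  # track learning progress
```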
Run Your Experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root. This is important so that all imports work correctly.
- Run the experiment script using the following command:
python3 src/Task5.py
Challenge
Experimentation
At this stage, the agent knows how to learn — but how well it learns depends heavily on its hyperparameters.
Instead of automatically searching for the “best” values, this step is about intuition-building: you will manually choose hyperparameters, run an experiment, and observe how the agent behaves on FrozenLake.
This mirrors real-world reinforcement learning workflows, where understanding the effect of parameters is more important than blindly optimizing them.
Goal of This Step
You will:
- Train a Q-learning agent using your own chosen parameters
- Evaluate the learned policy using greedy actions only
- Observe how different choices affect:
- Learning speed
- Stability
- Final performance
- Path efficiency on FrozenLake
No automated search.
No predefined lists.
You are the experiment designer.
FrozenLake Intuition
On FrozenLake:
- The agent must reach the goal without falling into holes
- Rewards are sparse (only the goal gives reward)
- Poor parameter choices can cause:
- Endless wandering
- Unsafe paths
- No learning at all
This makes FrozenLake an excellent environment to feel the impact of hyperparameters.
What Is Being Evaluated?
After training, the agent is evaluated using:
- Greedy policy only (`argmax(Q[state])`)
- No exploration (ε = 0)
- Multiple episodes to compute a stable average reward
This answers a simple question:
Has the agent really learned a reliable policy, or was it just lucky during training?
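A greedy evaluation loop along these lines can answer that question. This is a sketch assuming a trained `q_table` and the Gymnasium API; the evaluation code in `Task6.py` may be structured differently:

```python
import numpy as np

def evaluate(env, q_table, n_episodes=100):
    """Run greedy episodes (epsilon = 0) and return the average reward."""
    total = 0.0
    for _ in range(n_episodes):
        state, info = env.reset()
        done = False
        while not done:
            action = int(np.argmax(q_table[state]))  # greedy action only
            state, reward, terminated, truncated, info = env.step(action)
            total += reward
            done = terminated or truncated
    return total / n_episodes
```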
Experiment Structure
Each experiment follows this flow:
- Create a deterministic FrozenLake environment
- Initialize a fresh Q-table
- Train the agent using your chosen parameters
- Evaluate the learned policy
- Print and analyze the results
Every experiment is independent and reproducible.
Hyperparameters You Control
In the `main()` function, you manually set:
- Learning rate (α): controls how strongly new experiences overwrite old ones
- Discount factor (γ): controls how much future rewards matter
- Number of episodes: controls how long the agent is allowed to learn
- Step penalty (optional): encourages shorter, more efficient paths
These parameters directly shape the agent’s behavior on the ice.
How to Experiment Effectively
Try changing one parameter at a time and observe:
- Does learning become unstable?
- Does the agent reach the goal more consistently?
- Does it take shorter paths?
- Does training require more episodes?
Example experiments:
- High α vs low α
- Short-term planning (low γ) vs long-term planning (high γ)
- With vs without step penalty
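For example, a one-parameter-at-a-time comparison could be driven by a loop like the one below, where `train` and `evaluate` stand in for your own training and evaluation functions (their names and signatures are placeholders, not the ones defined in the lab files):

```python
import gymnasium as gym
import numpy as np

# Compare two learning rates while keeping everything else fixed.
for alpha in (0.05, 0.8):
    env = gym.make("FrozenLake-v1", is_slippery=False)  # deterministic variant
    q_table = np.zeros((env.observation_space.n, env.action_space.n))
    train(env, q_table, alpha=alpha, gamma=0.99, n_episodes=5000)  # your training function
    avg_reward = evaluate(env, q_table, n_episodes=100)            # greedy evaluation
    print(f"alpha={alpha}: average reward = {avg_reward:.2f}")
```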
What Success Looks Like
A good experiment setup should result in:
- Non-zero average reward during evaluation
- Consistent goal-reaching behavior
- A clear improvement over random movement
If performance degrades, that’s not failure —
it’s learning how reinforcement learning actually behaves.
Files to Modify
Make sure you have completed all the TODOs in the file `Task6.py` located in the `src/` folder.
Takeaway
This step transforms you from:
“I ran Q-learning”
into
“I understand how Q-learning behaves.”
Once you can reason about hyperparameters on FrozenLake, you’re ready to scale these ideas to larger, more complex environments.