Q-learning
In this lab, you’ll practice tabular Q-learning and train an agent to master a Gym environment. When you’re finished, you’ll have a clever agent that learns from experience and makes smart moves on its own!
Introduction
Imagine standing on a frozen lake made of slippery tiles. Some tiles are safe, others will make you fall into freezing water, and only one leads to safety. You don’t have a map, and no one tells you which move is correct. The only way to succeed is to try, fail, learn, and adapt.
This is the type of problem Reinforcement Learning (RL) is designed to solve.
In Reinforcement Learning, an agent learns how to behave by repeatedly interacting with an environment. At each step, the agent observes its situation, chooses an action, and receives feedback in the form of a reward. Over time, the agent improves its decisions by favoring actions that lead to better long-term outcomes.
In this lab, the environment is discrete and fully observable, making it ideal for Tabular Q-learning — one of the simplest yet most powerful RL algorithms.
Core Elements
Throughout this lab, learning happens through the interaction of a few key components:
- Environment: A grid-based world where each cell represents a possible situation. Some states are safe, some are terminal failures, and one represents success.
- State: The agent's current position on the grid.
- Actions: The possible moves the agent can take (left, right, up, down).
- Reward Signal:
  - A positive reward for reaching the goal
  - Zero or negative reward for falling into a hole or taking unnecessary steps
- Policy: A rule that determines which action to take in each state.
- Q-table: A table that stores the expected long-term value of taking a specific action in a specific state.
What makes Q-learning special?
Unlike supervised learning, there is:
- No labeled dataset
- No correct action provided upfront
- No knowledge of the environment’s dynamics
Q-learning works by estimating action values directly from experience. Each interaction updates the Q-table using observed rewards and future expectations. Over time, this table becomes a map of “what works” and “what doesn’t.”
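Concretely, each interaction nudges one entry of the table toward what the agent just observed, using the update rule you will implement later in the lab (α is the learning rate, γ is the discount factor):
Q[state, action] = Q[state, action] + α * (reward + γ * max(Q[next_state]) - Q[state, action])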
Key characteristics of Tabular Q-learning:
- Model-free: no knowledge of transitions is required
- Off-policy: learning can occur even while exploring
- Incremental: values improve step by step through experience
This makes Q-learning especially well-suited for small, discrete environments like FrozenLake.
Exploration vs Exploitation
Early in training, the agent knows nothing. Random exploration is necessary to discover which paths are safe and which are dangerous. Later, once knowledge accumulates, the agent must rely more on what it has learned.
This balance is handled through an epsilon-greedy strategy, where the agent:
- Explores with probability ε
- Exploits the best known action with probability (1 − ε)
As training progresses, ε is gradually reduced, allowing the agent to shift from exploration to confident decision-making.
Learning Objectives
By the end of this lab, the agent should:
- Learn to navigate the environment without falling into holes
- Discover the safest path to the goal
- Maximize cumulative reward over episodes
- Demonstrate how simple value updates can lead to intelligent behavior
Through hands-on implementation, you’ll see how a table of numbers — updated through trial and error — can encode a complete decision-making strategy.
You are now ready to build the environment, define the learning dynamics, and train your first Tabular Q-learning agent.
info > If you get stuck at some point in the lab, solution files have been provided for you within the `solution` folder in your file tree.
Challenge
Building the foundation (Initialization)
Initializing the Q-table
Before your agent can learn how to navigate FrozenLake, it needs a way to store experience.
In FrozenLake, the world is a grid of icy tiles:
- Some tiles are safe
- Some are holes
- One tile is the goal
At every moment, the agent:
- Is standing on one tile → this is the state
- Can choose one movement (left, right, up, down) → this is the action
How Q-learning Thinks About FrozenLake
Q-learning doesn’t memorize paths. Instead, it learns to answer this question:
“If I am on this tile, how good is it to move in each direction?”
To do that, it uses a Q-table:
- Rows → states (tiles on the lake)
- Columns → actions (possible moves)
- Values → expected future reward for taking an action in a state
At the very beginning:
- The agent knows nothing
- Every move looks equally bad (or equally good)
So you start with a table filled with zeros.
Where Do States and Actions Come From?
FrozenLake already tells you everything you need:
- How many tiles exist on the lake
- How many moves are allowed at each tile
Gym exposes this information directly through the environment.
Once you extract:
- the number of states
- the number of actions
you can create a Q-table with the correct shape and let learning begin.
Files to Modify
Make sure you have completed all the TODOs in the file `Task1.py` located in the `src/` folder.
Helpful Hints:
- `env.observation_space.n` → number of states
- `env.action_space.n` → number of actions
Use these values to define the shape of your Q-table.
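As a rough sketch of what this initialization could look like, assuming the Gymnasium API (the exact environment setup and variable names in `Task1.py` may differ):

```python
import gymnasium as gym
import numpy as np

# Create the FrozenLake environment (the lab's script may configure it differently).
env = gym.make("FrozenLake-v1")

# The environment exposes the sizes of the state and action spaces directly.
n_states = env.observation_space.n   # number of tiles on the lake
n_actions = env.action_space.n       # number of possible moves

# Start with a Q-table of zeros: every action in every state looks equally good.
q_table = np.zeros((n_states, n_actions))
print(q_table.shape)  # (16, 4) for the default 4x4 map
```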
Run Your Experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root. This is important so that all imports work correctly.
- Run the experiment script using the following command:
python3 src/Task1.py
Challenge
Epsilon-greedy
Action Selection with ε-greedy (FrozenLake Intuition)
In FrozenLake, the agent stands on an icy grid where every move matters.
At each state (a tile on the lake), the agent must decide which direction to move next. Early in training, the agent knows almost nothing about the lake:
- Some tiles are safe
- Some tiles lead to holes
- Only one path reaches the goal
To learn efficiently, the agent must balance exploration and exploitation.
Exploration vs Exploitation
- Exploration
  The agent tries random actions to discover the environment:
  - What happens if I go left?
  - Is this path dangerous?
  - Does this tile lead closer to the goal?
- Exploitation
  The agent uses what it has already learned:
  - Choose the action with the highest expected reward
  - Follow the safest and most rewarding path so far
The ε-greedy strategy is a simple and effective way to manage this trade-off.
ε-greedy Strategy in FrozenLake
At each step on the frozen grid:
- With probability ε (epsilon) → the agent explores by choosing a random direction (slips happen, curiosity matters)
- With probability 1 − ε → the agent exploits by choosing the best-known action (following the safest path learned so far)
As training progresses, ε is usually reduced so the agent explores less and trusts its learned policy more.
This function is a core building block of Q-learning:
without exploration, the agent may never discover the optimal path.
Files to Modify
Make sure you have completed all the TODOs in the file `Task2.py` located in the `src/` folder.
Helpful Hints
- `np.random.rand()` generates a random float in [0, 1)
- `np.random.choice(...)` can sample a random action
- `np.argmax(...)` returns the index of the maximum Q-value
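Putting these hints together, a minimal sketch of an ε-greedy selection function might look like this (the function and parameter names are illustrative and may not match `Task2.py` exactly):

```python
import numpy as np

def select_action(q_table, state, epsilon, n_actions):
    """Epsilon-greedy action selection (illustrative sketch)."""
    if np.random.rand() < epsilon:
        # Explore: pick a random direction.
        return np.random.choice(n_actions)
    # Exploit: pick the action with the highest estimated value for this state.
    return int(np.argmax(q_table[state]))
```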
Run Your Experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root. This is important so that all imports work correctly.
- Run the experiment script using the following command:
python3 src/Task2.py
Challenge
Q-value Updates
Q-learning Update Rule (Core Learning Step)
At the heart of Tabular Q-learning lies a simple but powerful idea:
after taking an action, the agent updates its belief about how good that action was.
In FrozenLake, this means:
- The agent slips or moves across icy tiles
- Observes a reward (usually 0, sometimes 1 when reaching the goal)
- Updates one single value in the Q-table to reflect what it just learned
Each update answers the question:
“Given what just happened, was this action in this state better or worse than I thought?”
What the Update Does Intuitively
When the agent moves from one tile to another:
- It looks at the reward it just received
- It looks ahead to the best possible future reward from the next tile
- It slightly adjusts its current Q-value toward this new information
This gradual adjustment is what allows the agent to:
- Learn safe paths
- Avoid holes
- Eventually cross the frozen lake efficiently
Function Responsibilities
Your Q-learning update function must:
- Take the current experience of the agent
- Update only one cell in the Q-table
- Modify the Q-table in place (no copy, no return needed)
What Must Be Updated
Only one value must be updated: the Q-value for the chosen state-action pair, Q[state, action].
Files to Modify
Make sure you have completed all the TODOs in the file `Task3.py` located in the `src/` folder.
Helpful Hints
- Use `np.max(q_table[next_state])` to find the best future value.
- Carefully place parentheses to correctly compute the formula:
Q[state, action] = Q[state, action] + α * (reward + γ * max(Q[next_state]) - Q[state, action])
- Remember, only update the Q-value for the chosen state-action pair.
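Putting the hints together, an in-place update function might look like the sketch below (the function signature is illustrative, not necessarily the one in `Task3.py`):

```python
import numpy as np

def update_q_table(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place (illustrative sketch)."""
    # Best value achievable from the next state, according to the current table.
    best_next = np.max(q_table[next_state])
    # Temporal-difference target and error for the chosen state-action pair.
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state, action]
    # Move the current estimate a small step (alpha) toward the target.
    q_table[state, action] += alpha * td_error
```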
Run Your Experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root. This is important so that all imports work correctly.
- Run the experiment script using the following command:
python3 src/Task3.py
Challenge
Epsilon Decay
Epsilon Decay Function
The epsilon decay function is used in Q-learning to gradually reduce the exploration rate (ε) as training progresses. This allows the agent to explore randomly at the beginning, then exploit its learned knowledge more as it becomes confident in its policy.
Purpose
- Encourage exploration at the start of training.
- Encourage exploitation of learned knowledge later.
- Control how fast the agent shifts from exploring to exploiting.
Files to Modify
Make sure you have completed all the TODOs in the file `Task4.py` located in the `src/` folder.
Helpful Hints
- Use `np.exp(...)` to compute the exponential decay.
- Conceptually, the formula is:
epsilon = epsilon_min + (epsilon_max - epsilon_min) * decay_term
- Ensure that epsilon never falls below epsilon_min, even as episode increases.
- Think of epsilon as sliding from high exploration to low exploration smoothly over episodes.
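A minimal sketch of such a decay function, assuming an exponential decay_term of the form exp(-decay_rate * episode) (the exact form and parameter names used in `Task4.py` may differ):

```python
import numpy as np

def epsilon_decay(episode, epsilon_min=0.01, epsilon_max=1.0, decay_rate=0.005):
    """Exponentially decay epsilon over episodes (illustrative sketch)."""
    # decay_term shrinks from 1 toward 0 as the episode index grows
    # (this exact form is an assumption; check the TODOs in Task4.py).
    decay_term = np.exp(-decay_rate * episode)
    # Slide from epsilon_max toward epsilon_min, never dropping below epsilon_min.
    return epsilon_min + (epsilon_max - epsilon_min) * decay_term
```

Because decay_term stays non-negative, epsilon approaches but never falls below epsilon_min.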
Run Your Experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root. This is important so that all imports work correctly.
- Run the experiment script using the following command:
python3 src/Task4.py
Challenge
Training the Agent & Visualizing Its Behavior
In this step, you will bring everything together and train a Q-learning agent to solve the FrozenLake environment.
FrozenLake is a grid of frozen tiles:
- Some tiles are safe
- Some tiles are holes (episode ends with failure)
- One tile is the goal
The agent starts at the top-left corner and must learn how to reach the goal without falling into holes.
Since there is no labeled dataset, the agent learns purely through trial and error, guided by rewards.
What Happens During Training?
Each episode represents one attempt to cross the frozen lake.
During an episode:
- The agent observes its current position (state)
- It chooses an action (up, down, left, right)
- The environment responds with:
- a new state
- a reward
- a termination signal
- The agent updates its Q-table based on what it just experienced
Over thousands of episodes:
- Bad actions (falling into holes) get low Q-values
- Good actions (reaching the goal) get high Q-values
- Exploration gradually decreases as the agent becomes more confident
Your Objective in This Step
You will:
- Train a Q-learning agent using an epsilon-greedy policy
- Apply epsilon decay to shift from exploration to exploitation
- Store rewards to monitor learning progress
- Run a greedy evaluation episode
- Record a video of the trained agent solving FrozenLake
Video Evaluation
After training, you will visualize what the agent has learned:
- Create a deterministic FrozenLake environment
- Run one greedy episode (no exploration)
- Record the agent’s behavior as a video
- Save the video to the output folder
This allows you to see whether the agent truly learned to avoid holes and reach the goal.
Files to Modify
Make sure you have completed all the TODOs in the file `Task5.py` located in the `src/` folder.
Hints
- `env.reset()` returns `(state, info)`
- `env.step(action)` returns `(next_state, reward, terminated, truncated, info)`
- Use `epsilon_decay(...)` to smoothly reduce exploration
- The Q-table is updated in place
- During evaluation, always use: `np.argmax(q_table[state])`
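Putting the previous tasks together, the core training loop might look roughly like the sketch below. It reuses the illustrative `select_action`, `update_q_table`, and `epsilon_decay` sketches from earlier steps, omits the video-recording part, and its names and hyperparameter values may differ from `Task5.py`:

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, n_episodes = 0.1, 0.99, 10_000  # illustrative values

rewards_per_episode = []
for episode in range(n_episodes):
    state, info = env.reset()
    epsilon = epsilon_decay(episode)       # shift from exploration to exploitation
    total_reward, done = 0.0, False
    while not done:
        action = select_action(q_table, state, epsilon, env.action_space.n)
        next_state, reward, terminated, truncated, info = env.step(action)
        update_q_table(q_table, state, action, reward, next_state, alpha, gamma)
        state = next_state
        total_reward += reward
        done = terminated or truncated
    rewards_per_episode.append(total_reward)  # track learning progress
```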
Run Your Experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root. This is important so that all imports work correctly.
- Run the experiment script using the following command:
python3 src/Task5.py
Challenge
Experimentation
At this stage, the agent knows how to learn — but how well it learns depends heavily on its hyperparameters.
Instead of automatically searching for the “best” values, this step is about intuition-building: you will manually choose hyperparameters, run an experiment, and observe how the agent behaves on FrozenLake.
This mirrors real-world reinforcement learning workflows, where understanding the effect of parameters is more important than blindly optimizing them.
Goal of This Step
You will:
- Train a Q-learning agent using your own chosen parameters
- Evaluate the learned policy using greedy actions only
- Observe how different choices affect:
- Learning speed
- Stability
- Final performance
- Path efficiency on FrozenLake
No automated search.
No predefined lists.
You are the experiment designer.
FrozenLake Intuition
On FrozenLake:
- The agent must reach the goal without falling into holes
- Rewards are sparse (only the goal gives reward)
- Poor parameter choices can cause:
- Endless wandering
- Unsafe paths
- No learning at all
This makes FrozenLake an excellent environment to feel the impact of hyperparameters.
What Is Being Evaluated?
After training, the agent is evaluated using:
- Greedy policy only (`argmax(Q[state])`)
- No exploration (ε = 0)
- Multiple episodes to compute a stable average reward
This answers a simple question:
Has the agent really learned a reliable policy, or was it just lucky during training?
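A greedy evaluation loop along these lines can answer that question. This is a sketch assuming a trained `q_table` and the Gymnasium API; the evaluation code in `Task6.py` may be structured differently:

```python
import numpy as np

def evaluate(env, q_table, n_episodes=100):
    """Run greedy episodes (epsilon = 0) and return the average reward."""
    total = 0.0
    for _ in range(n_episodes):
        state, info = env.reset()
        done = False
        while not done:
            action = int(np.argmax(q_table[state]))  # greedy action only
            state, reward, terminated, truncated, info = env.step(action)
            total += reward
            done = terminated or truncated
    return total / n_episodes
```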
Experiment Structure
Each experiment follows this flow:
- Create a deterministic FrozenLake environment
- Initialize a fresh Q-table
- Train the agent using your chosen parameters
- Evaluate the learned policy
- Print and analyze the results
Every experiment is independent and reproducible.
Hyperparameters You Control
In the `main()` function, you manually set:
- Learning rate (α): controls how strongly new experiences overwrite old ones
- Discount factor (γ): controls how much future rewards matter
- Number of episodes: controls how long the agent is allowed to learn
- Step penalty (optional): encourages shorter, more efficient paths
These parameters directly shape the agent’s behavior on the ice.
How to Experiment Effectively
Try changing one parameter at a time and observe:
- Does learning become unstable?
- Does the agent reach the goal more consistently?
- Does it take shorter paths?
- Does training require more episodes?
Example experiments:
- High α vs low α
- Short-term planning (low γ) vs long-term planning (high γ)
- With vs without step penalty
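For example, a one-parameter-at-a-time comparison could be driven by a loop like the one below, where `train` and `evaluate` stand in for your own training and evaluation functions (their names and signatures are placeholders, not the ones defined in the lab files):

```python
import gymnasium as gym
import numpy as np

# Compare two learning rates while keeping everything else fixed.
for alpha in (0.05, 0.8):
    env = gym.make("FrozenLake-v1", is_slippery=False)  # deterministic variant
    q_table = np.zeros((env.observation_space.n, env.action_space.n))
    train(env, q_table, alpha=alpha, gamma=0.99, n_episodes=5000)  # your training function
    avg_reward = evaluate(env, q_table, n_episodes=100)            # greedy evaluation
    print(f"alpha={alpha}: average reward = {avg_reward:.2f}")
```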
What Success Looks Like
A good experiment setup should result in:
- Non-zero average reward during evaluation
- Consistent goal-reaching behavior
- A clear improvement over random movement
If performance degrades, that’s not failure —
it’s learning how reinforcement learning actually behaves.
Files to Modify
Make sure you have completed all the TODOs in the file `Task6.py` located in the `src/` folder.
Takeaway
This step transforms you from:
“I ran Q-learning”
into
“I understand how Q-learning behaves.”
Once you can reason about hyperparameters on FrozenLake, you’re ready to scale these ideas to larger, more complex environments.