Intro to Reinforcement Learning
In this lab, you'll practice building reinforcement learning environments and agents. When you're finished, you'll have trained an agent and analyzed its performance with visualizations and metrics.
Challenge
Introduction
Welcome to the Intro to Reinforcement Learning Code Lab!
In this hands-on lab, you'll build a Q-learning agent that learns to navigate a grid-world environment. You'll implement core reinforcement learning (RL) components including state spaces, action spaces, reward functions, and training algorithms.
Background
Globomantics is developing autonomous warehouse robots to optimize their logistics operations. Before deploying expensive hardware, they need to validate navigation algorithms in simulation. As a machine learning (ML) engineer on the team, you've been tasked with building a proof-of-concept Q-learning agent that can navigate a grid-world environment to reach target locations while avoiding obstacles.
Your simulation will help the team understand how the robot learns optimal paths, measure training performance, and identify potential issues before hardware deployment. Success in this virtual environment will directly influence the company's decision to proceed with the robotics initiative.
Familiarizing with the Program Structure
The lab environment includes the following key files:
- grid_environment.py: Defines the grid-world environment with states, actions, and rewards
- q_agent.py: Implements the Q-learning agent with epsilon-greedy exploration
- train_agent.py: Trains the agent and saves training history
- visualize.py: Creates trajectory visualizations and performance plots
- analyze_performance.py: Calculates performance metrics and statistics
The environment uses Python 3.10+ with NumPy for numerical operations and Matplotlib for visualization. All dependencies are pre-installed in the lab environment.
To run scripts, use the terminal with commands like:
python3 train_agent.py

The results will be saved to the output/ directory, which you can view by opening the images directly.

Important Note: Complete tasks in order. Each task builds on the previous one. Test your code frequently by running the validation checks to catch errors early.
Challenge
Understanding Reinforcement Learning Fundamentals
Reinforcement Learning is a machine learning approach in which an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions and learns to maximize cumulative rewards over time.
The key components you'll work with are:
- Environment: The world the agent interacts with (our grid-world)
- State: The agent's current position and situation
- Action: Choices the agent can make (move up, down, left, right)
- Reward: Feedback signal indicating action quality
- Policy: The agent's strategy for selecting actions
- Q-Learning: An algorithm that learns action-value functions
Reinforcement learning differs from supervised learning because there is no labeled dataset. Instead, the agent must explore and learn through trial and error. Q-learning learns which actions are valuable in each state without needing a model of the environment.
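To make these components concrete, here is a minimal sketch of how a grid-world environment might expose them in Python. The class and method names are illustrative, not the actual contents of grid_environment.py, and obstacles are omitted for brevity; the goal reward of +10.0 comes from later in this lab, while the per-step penalty shown is an assumption.

```python
# Illustrative sketch only -- grid_environment.py may structure this differently.
class GridWorld:
    """A 5x5 grid: the agent starts at (0, 0) and must reach the goal at (4, 4)."""

    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # UP, DOWN, LEFT, RIGHT as (row, col) deltas

    def __init__(self, size=5):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.state = (0, 0)

    def reset(self):
        """Begin a new episode in the top-left corner."""
        self.state = (0, 0)
        return self.state

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        dr, dc = self.ACTIONS[action]
        row = min(max(self.state[0] + dr, 0), self.size - 1)  # stay inside the grid
        col = min(max(self.state[1] + dc, 0), self.size - 1)
        self.state = (row, col)
        done = self.state == self.goal
        reward = 10.0 if done else -0.1  # goal reward from the lab; step penalty is assumed
        return self.state, reward, done
```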
Now that you understand RL fundamentals, you're ready to implement your first RL environment and agent.
### Understanding Reward Functions
The reward function is the most critical component of Reinforcement Learning because it defines the behavior the agent should learn. A well-designed reward function guides the agent toward the desired outcome, while a poorly designed one can lead to unexpected or suboptimal policies.
In the warehouse robot scenario, the agent should:
- Reach the goal location (large positive reward)
- Avoid wasting time (small negative reward per step)
- Learn efficient paths
The reward function directly influences the agent's behavior. Too much penalty per step and the agent rushes, potentially missing better paths. Too little penalty and it wanders aimlessly. The goal reward must be large enough to motivate the agent to reach the target even when step penalties accumulate.
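As a rough illustration of that balance, the sketch below uses the +10.0 goal reward referenced later in this lab and an assumed per-step penalty of -0.1, which would accumulate to roughly 1.0 over an 8-10 step path; the exact values in grid_environment.py may differ.

```python
# Illustrative reward function; the exact penalty in grid_environment.py may differ.
GOAL_REWARD = 10.0    # large positive reward for reaching the target
STEP_PENALTY = -0.1   # small negative reward per move, discouraging wandering

def compute_reward(next_state, goal):
    """Return the reward the agent receives for arriving in next_state."""
    return GOAL_REWARD if next_state == goal else STEP_PENALTY
```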
Challenge
Understanding Q-Learning Agents
Now that you have a working environment, you need an agent that can learn from it. A Q-learning agent learns by storing Q-values, which estimate how effective each action is in each state.
The Q-table is the core of your agent. It's a 2D array where:
- Rows represent states (positions in the grid)
- Columns represent actions (UP, DOWN, LEFT, RIGHT)
- Values are expected cumulative rewards for taking that action in that state
For our 5x5 grid with 4 actions, the Q-table is 25x4. Initially, all values are zero because the agent knows nothing. As it explores and receives rewards, it updates these values to reflect what it learns.
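As a minimal NumPy sketch, the table and one convenient (assumed) way of flattening a grid position into a row index look like this:

```python
import numpy as np

n_states, n_actions = 25, 4               # 5x5 grid positions, 4 moves
q_table = np.zeros((n_states, n_actions))

def state_index(row, col, grid_size=5):
    """One convenient (assumed) mapping from a (row, col) position to a table row."""
    return row * grid_size + col

print(q_table[state_index(0, 0)])         # [0. 0. 0. 0.] -- the agent knows nothing yet
```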
The agent also needs to store hyperparameters:
- Learning rate (α): How much new information overrides old (0.1 = gradual learning)
- Discount factor (γ): How much future rewards matter (0.95 = values future highly)
- Epsilon (ε): Exploration rate for balancing trying new actions vs using learned knowledge
These parameters control how the agent learns. Too high a learning rate and it forgets quickly; too low and it learns slowly. The discount factor determines whether the agent plans ahead or focuses on immediate rewards.
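A sketch of how an agent might store these pieces at initialization is shown below. The names and defaults are illustrative (the starting epsilon of 1.0 is an assumption); q_agent.py may organize this differently.

```python
import numpy as np

# Illustrative initializer -- q_agent.py may use different names or defaults.
class QLearningAgent:
    def __init__(self, n_states=25, n_actions=4,
                 learning_rate=0.1, discount_factor=0.95, epsilon=1.0):
        self.q_table = np.zeros((n_states, n_actions))  # no knowledge yet
        self.lr = learning_rate        # alpha: how much new information overrides old
        self.gamma = discount_factor   # gamma: how much future rewards matter
        self.epsilon = epsilon         # exploration rate (assumed start; decayed during training)
```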
Now you'll initialize the Q-learning agent with these components.
### Understanding Epsilon-Greedy Exploration
Exploration versus exploitation is a fundamental challenge in reinforcement learning. The agent must balance:
- Exploitation: Using current knowledge to maximize rewards (choosing the best known action)
- Exploration: Trying new actions to discover potentially better strategies
Epsilon-greedy is a simple but effective strategy. With probability epsilon, the agent explores by choosing a random action. Otherwise, it exploits its knowledge by choosing the action with the highest Q-value.
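In code, that rule takes only a few lines. The sketch below is one common way to write it; using NumPy's random generator is an implementation choice here, not a lab requirement.

```python
import numpy as np

rng = np.random.default_rng()

def choose_action(q_table, state, epsilon, n_actions=4):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: pick a random action
    return int(np.argmax(q_table[state]))     # exploit: pick the highest Q-value in this state
```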
During training, epsilon typically decreases over time. The agent explores more early in training, when it has limited knowledge, and exploits more later, when it has learned better strategies. This gradual shift helps the agent learn efficiently and avoid getting stuck in poor behaviors.
### Understanding Q-Learning Updates
Q-learning is a temporal-difference learning algorithm. It updates Q-values using the Bellman equation:
Q(s,a) = Q(s,a) + α[r + γ · max(Q(s',a')) − Q(s,a)]

Where:

| Term | Meaning |
|------|---------|
| max(Q(s',a')) | The maximum Q-value in the next state. |
| α (alpha) | The learning rate that controls how much new information replaces existing knowledge. |
| γ (gamma) | The discount factor that determines how much future rewards contribute to the update. |
| r | The immediate reward. |
| s' | The next state. |

This update process makes the agent's Q-values more accurate over time. The agent learns to predict not just immediate rewards, but long-term cumulative rewards.
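Translated into code, the update is a single in-place adjustment of one Q-table entry. This is a sketch assuming integer state indices and the hyperparameter values discussed earlier; q_agent.py may express it differently.

```python
def update_q(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """One temporal-difference update: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    td_target = reward + gamma * q_table[next_state].max()   # best value reachable from s'
    td_error = td_target - q_table[state, action]            # how wrong the current estimate is
    q_table[state, action] += alpha * td_error               # nudge the estimate by alpha
```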
Challenge
Putting It All Together
The Training Loop
You now have the key parts of a reinforcement learning system: the environment, the Q-learning agent, action selection, and Q-value updates. Now it's time to connect them and watch your agent learn through experience.
Training happens in episodes. Each episode is one complete attempt to reach the goal:
- Start at the initial state
- Select actions and take steps
- Update Q-values after each step
- Continue until reaching the goal or hitting the step limit
- Reset and start a new episode
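A condensed sketch of that loop is shown below. It assumes the GridWorld environment, QLearningAgent, choose_action, update_q, and state_index helpers sketched earlier in this lab, along with assumed episode and step limits; train_agent.py may organize things differently.

```python
# Sketch of the episode loop, reusing the environment and helpers sketched above.
env = GridWorld()
agent = QLearningAgent()
n_episodes, max_steps = 500, 100                      # assumed limits
history = {"rewards": [], "steps": [], "trajectories": []}

for episode in range(n_episodes):
    pos = env.reset()                                 # (row, col) position
    state = state_index(*pos)
    total_reward, trajectory = 0.0, [pos]
    for step in range(max_steps):
        action = choose_action(agent.q_table, state, agent.epsilon)
        next_pos, reward, done = env.step(action)
        next_state = state_index(*next_pos)
        update_q(agent.q_table, state, action, reward, next_state,
                 alpha=agent.lr, gamma=agent.gamma)
        total_reward += reward
        trajectory.append(next_pos)
        state = next_state
        if done:
            break
    agent.epsilon = max(0.01, agent.epsilon * 0.995)  # assumed decay schedule
    history["rewards"].append(total_reward)           # the three metrics described below
    history["steps"].append(step + 1)
    history["trajectories"].append(trajectory)
```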
Over hundreds of episodes, patterns emerge. The agent discovers that certain actions in certain states lead to the goal. Q-values near the goal become positive (because of the +10.0 reward). These positive values propagate backward through the grid as the agent learns paths to success.
Early in training, behavior appears random because the exploration rate (epsilon) is high. Later episodes show clear improvement as epsilon decays: the agent exploits learned knowledge, and paths become more direct. By around episode 400-500, your agent should reach the goal consistently in 8-9 steps (the optimal path length).
The training script will track three key metrics for each episode:
- Total Reward: Sum of rewards accumulated (higher is better)
- Step Count: Number of moves to reach the goal (lower is better)
- Trajectory: The actual path taken (for visualization later)
Now you'll implement the training loop that brings your Q-learning agent to life. Get ready; the fun begins!
### Understanding Trajectory Visualization
Visualizing the agent's behavior helps you understand what it has learned. A trajectory shows the path the agent takes through the environment during an episode.
Early in training, trajectories appear random as the agent explores. As training progresses, trajectories become more direct as the agent learns optimal paths.
Visualizing multiple episodes reveals:
- Whether the agent has learned to reach the goal consistently
- If the agent takes efficient paths or makes unnecessary moves
- How exploration affects the agent's behavior
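One lightweight way to draw a single trajectory with Matplotlib is sketched below. It assumes a trajectory is a list of (row, col) positions like the ones collected in the training-loop sketch, and the output filename is illustrative; visualize.py may do considerably more.

```python
import matplotlib.pyplot as plt

def plot_trajectory(trajectory, grid_size=5, goal=(4, 4), path="output/trajectory.png"):
    """Plot one episode's path through the grid and save it as an image."""
    rows = [r for r, c in trajectory]
    cols = [c for r, c in trajectory]
    plt.figure(figsize=(4, 4))
    plt.plot(cols, rows, "o-", label="agent path")
    plt.scatter(goal[1], goal[0], marker="*", s=200, label="goal")
    plt.xlim(-0.5, grid_size - 0.5)
    plt.ylim(grid_size - 0.5, -0.5)      # invert y so row 0 appears at the top
    plt.xticks(range(grid_size))
    plt.yticks(range(grid_size))
    plt.grid(True)
    plt.legend()
    plt.savefig(path)
    plt.close()
```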
Challenge
Performance Analysis of Your Agent
Measuring Success: Performance Analysis
You've trained your agent and visualized its behavior. Now it's time to quantify how well it learned.
In reinforcement learning, performance evaluation requires looking across many episodes rather than focusing on a single run. Statistical measures help you determine whether the agent has learned an effective policy.
Think of it like evaluating a student. One test score doesn't tell the whole story; you need the average across multiple tests, consistency, and success rate.
Your agent's performance is measured by:
Average Reward: The mean total reward per episode. Higher values indicate that the agent reaches the goal more often and efficiently. A well-trained agent typically averages around 8-9 (the +10.0 goal reward minus ~1.0 in step penalties for an 8-10 step path).
Average Steps: The mean number of moves to reach the goal. Lower is better. The optimal path in a 5x5 grid is 8 steps (Manhattan distance from corner to corner). Your agent should average 8-12 steps after training.
Success Rate: The percentage of episodes where the agent reaches the goal (episodes with total reward > 0). A well-trained agent should achieve 90-100% success rate. Anything below 80% suggests the agent needs more training or hyperparameter tuning.
These metrics tell you not just if your agent learned, but how well it learned. They're essential for comparing different training approaches, debugging learning issues, and proving your RL system works before deploying it.
Now you'll implement the functions that calculate these performance metrics from your training history.
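A minimal sketch of those calculations, assuming the history dictionary gathered in the training-loop sketch, might look like this; analyze_performance.py may structure it differently.

```python
import numpy as np

def summarize(history):
    """Compute average reward, average steps, and success rate from the training history."""
    rewards = np.asarray(history["rewards"])
    steps = np.asarray(history["steps"])
    return {
        "average_reward": float(rewards.mean()),
        "average_steps": float(steps.mean()),
        "success_rate": float((rewards > 0).mean() * 100),  # % of episodes reaching the goal
    }
```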
Challenge
Conclusion
Congratulations on completing the Intro to Reinforcement Learning lab! You've successfully built and trained a Q-learning agent that learns to navigate a grid-world environment.
What You've Accomplished
Throughout this lab, you have:
- Built an RL Environment: Defined state spaces, action spaces, and reward functions for a grid-world simulation.
- Implemented Q-Learning: Created an agent that learns action-value functions through temporal difference learning.
- Applied Exploration Strategies: Used epsilon-greedy action selection to balance exploration and exploitation.
- Trained an Agent: Ran training loops that allowed your agent to learn optimal navigation policies.
- Visualized Behavior: Created trajectory plots showing how the agent moves through the environment.
- Analyzed Performance: Calculated metrics including average rewards, steps per episode, and success rates.
Key Takeaways
The most important lessons from this lab are:
- RL Learns from Experience: Unlike supervised learning, RL agents learn by trial and error through environmental interaction.
- Rewards Shape Behavior: The reward function directly determines what the agent learns to do.
- Q-Learning Learns Value: Q-tables store expected cumulative rewards, enabling the agent to make better decisions.
- Balance Exploration and Exploitation: Epsilon-greedy strategies help agents discover good policies while leveraging known information.
- Visualization Reveals Learning: Trajectory plots and performance metrics make abstract learning processes concrete and interpretable.
Extending Your Skills
To continue improving your RL expertise, consider exploring:
- Deep Q-Networks (DQN) for larger state spaces
- Policy gradient methods like REINFORCE and Actor-Critic
- Multi-agent reinforcement learning
- Model-based RL approaches
- Real-world applications like robotics control and game playing
The Q-learning fundamentals you've mastered here form the foundation for advanced RL algorithms. Every complex RL system builds on these core concepts of states, actions, rewards, and value functions.