
Monte Carlo Methods

In this lab, you’ll practice applying Monte Carlo prediction and control in a custom warehouse-cart simulation. When you’re finished, you’ll have a value-based routing policy and a fully tested workflow for collecting episodes, estimating state values, and evaluating policy performance.

Lab Info

Level: Intermediate
Last updated: Jan 14, 2026
Duration: 35m

Table of Contents
  1. Challenge

    ## Step 1: Introduction

    Welcome to the Monte Carlo Methods Code Lab!

    In this hands-on lab, you'll implement Monte Carlo prediction and control methods for a warehouse cart navigation system. You'll learn how to collect episodes, estimate state values using first-visit and every-visit Monte Carlo methods, and create a greedy policy based on learned values.

    Background

    You are part of a small engineering team developing automation tools for a micro-fulfillment warehouse. The company is rolling out compact, modular storage units that can be installed in retail backrooms, each containing a narrow aisle and a single autonomous cart responsible for retrieving items. Before these units can be deployed at scale, the operations group needs a reliable way to evaluate and improve the cart's routing decisions inside the aisle. As part of the engineering team, you've been tasked with building a proof-of-concept Monte Carlo prediction system that can estimate how valuable each position in the aisle is and create an improved routing policy.

    Your simulation will help the team understand how the cart learns efficient navigation patterns, measure the improvement over random behavior, and identify potential issues before hardware deployment. Success in this virtual environment will directly influence the company's decision to proceed with the automation initiative.

    Familiarizing Yourself with the Program Structure

    The lab environment includes the following key files:

    • warehouse_environment.py: Defines the warehouse aisle environment with states, actions, and rewards
    • collect_episodes.py: Collects episodes using a random policy
    • mc_prediction.py: Implements first-visit and every-visit Monte Carlo prediction
    • greedy_policy.py: Creates and evaluates a greedy policy based on learned state values

    The environment uses Python 3.10+ with NumPy for numerical operations. All dependencies are pre-installed in the lab environment.

    All commands in this lab assume your working directory is /home/ps-user/workspace.

    Understanding the Warehouse Environment

    Before you can collect episodes and implement Monte Carlo methods, you need to understand the environment structure. The episodes you'll collect consist of states, actions, and rewards that come from interacting with this environment. The warehouse environment is structured as a narrow linear aisle where the cart navigates to retrieve items. The environment is defined in warehouse_environment.py and models the following structure:

    States: Each state represents a position along the aisle, numbered from 0 (start position at one end) to aisle_length - 1 (goal position at the other end). The cart's position determines which state it occupies.

    Actions: The cart can take three actions at each position:

    • LEFT (0): Move one position toward the start (decrease position)
    • RIGHT (1): Move one position toward the goal (increase position)
    • STAY (2): Remain at the current position

    Goal: The goal state is positioned at the end of the aisle (aisle_length - 1). This is a deliberate design choice for the lab: it creates a single, unambiguous terminal target and a clear notion of progress as the cart moves through the aisle. Keeping the goal fixed at one end makes the episode boundaries and value estimates easier to interpret, so you can focus on implementing Monte Carlo prediction and building a better policy without extra scenario complexity.

    Rewards: The reward function encourages efficient navigation toward the goal:

    • Reaching the goal state: +10.0 (large positive reward for successful item retrieval)
    • All other positions: -0.1 (small step penalty to discourage unnecessary movement and encourage efficient paths)

    Episodes: An episode begins when the cart starts at position 0 and ends when it reaches the goal position (or when a maximum step limit is reached). Each episode records a sequence of (state, action, reward) tuples showing the cart's path through the aisle.
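
    To make this structure concrete, here is a minimal sketch of how such an environment could be modeled. The class name, method names, and constructor arguments are illustrative assumptions; the actual warehouse_environment.py provided in the lab may be organized differently.

    # Illustrative sketch of a linear-aisle environment (not the lab's exact code).
    LEFT, RIGHT, STAY = 0, 1, 2

    class WarehouseAisle:
        def __init__(self, aisle_length=10, max_steps=100):
            self.aisle_length = aisle_length   # states 0 .. aisle_length - 1
            self.goal = aisle_length - 1       # goal fixed at the far end of the aisle
            self.max_steps = max_steps
            self.position = 0
            self.steps = 0

        def reset(self):
            """Start a new episode with the cart at position 0."""
            self.position = 0
            self.steps = 0
            return self.position

        def step(self, action):
            """Apply an action and return (next_state, reward, done)."""
            if action == LEFT:
                self.position = max(0, self.position - 1)
            elif action == RIGHT:
                self.position = min(self.goal, self.position + 1)
            # STAY leaves the position unchanged
            self.steps += 1

            if self.position == self.goal:
                return self.position, 10.0, True        # goal reached: +10.0
            done = self.steps >= self.max_steps         # safety cutoff on episode length
            return self.position, -0.1, done            # small per-step penalty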

    To run scripts, use the terminal with commands like:

    cd /home/ps-user/workspace/code
    python3 collect_episodes.py
    

    The results will be saved to the output/ directory as NumPy files (.npy format).

    Keep the provided random seed (113) unchanged so your terminal output matches the lab.
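
    As a quick illustration of working with those saved outputs, you could load an array back into Python for inspection. The file name below is an assumption used for illustration, not necessarily one of the lab's actual output names.

    import numpy as np

    # Inspect a result previously written to the output/ directory as a .npy file.
    values = np.load("output/state_values.npy")
    print(values)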

    Important Note: Complete tasks in order. Each task builds on the previous one. Run the provided checks frequently to catch errors early.

    info > If you get stuck on a task, you can find solution files for each task in the solution folder in your filetree.

  2. Challenge

    ## Step 2: Collect Monte Carlo Training Episodes with a Random Policy

    Monte Carlo prediction estimates how valuable each aisle position is by averaging the returns observed from complete episodes of the cart's experience. An episode is a sequence of (state, action, reward) tuples recorded as the cart moves through the aisle from start to goal. From each episode, you compute returns: the discounted cumulative reward that follows from visiting each position. By averaging returns across many episodes, you estimate how good each position is under a given policy. This approach learns directly from the cart's experience without requiring a model of how the environment transitions between states.

    To generate episodes, the cart needs a policy: a rule that decides which action to take at each position. Since you don't have learned values yet, you'll start with a random policy that selects an action uniformly at random from the available actions (LEFT=0, RIGHT=1, STAY=2). This baseline has no strategy, so it produces varied trajectories and gives you a clear reference point for later comparison.

    Now that you understand how Monte Carlo uses sampled episodes and returns, you're ready to implement the random policy and record the transitions from a run through the aisle. Once you can collect a single episode, collect many of them so the next step can learn value estimates from repeated experience.
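
    The sketch below shows one way this collection loop could look. It reuses the illustrative WarehouseAisle class from Step 1, and the function names, episode count, and output file name are assumptions rather than the lab's exact code.

    import numpy as np

    rng = np.random.default_rng(113)   # fixed seed so runs are reproducible

    def collect_episode(env, rng):
        """Run one episode with a uniform random policy.

        Returns a list of (state, action, reward) tuples."""
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = int(rng.integers(0, 3))          # LEFT, RIGHT, or STAY, uniformly
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        return episode

    def collect_episodes(env, n_episodes, rng):
        """Collect many episodes for Monte Carlo prediction."""
        return [collect_episode(env, rng) for _ in range(n_episodes)]

    # Example usage: gather 500 episodes and save them for the next step.
    episodes = collect_episodes(WarehouseAisle(aisle_length=10), 500, rng)
    np.save("output/episodes.npy", np.array(episodes, dtype=object))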

  3. Challenge

    ## Step 3: Compute State Values with First Visit and Every Visit Monte Carlo

    The return (also called the discounted cumulative reward) is the total reward the cart receives from a position in the aisle until the episode ends, with future rewards discounted by a factor gamma. The discount factor gamma (typically 0.9 to 0.99) determines how much the cart values immediate rewards versus future rewards.

    For a state at time t, the return Gₜ is:

    Gₜ = Rₜ₊₁ + γ Rₜ₊₂ + γ² Rₜ₊₃ + ...

    Where Rₜ₊₁ is the immediate reward received after taking an action in state Sₜ, and γ (gamma) is the discount factor.

    Returns are computed backwards through an episode. Starting from the final transition, you accumulate discounted rewards backwards, then reverse the order so returns align with their corresponding states. This backward computation is essential for Monte Carlo methods because each step's return depends on future rewards.
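
    Here is a minimal sketch of that backward pass (the function name and the sample numbers are illustrative):

    def compute_returns(rewards, gamma=0.99):
        """Compute the discounted return G_t for every step, working backwards."""
        returns = []
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g          # G_t = R_{t+1} + gamma * G_{t+1}
            returns.append(g)
        returns.reverse()              # re-align returns with their states
        return returns

    # Worked example with rewards [-0.1, -0.1, 10.0] and gamma = 0.9:
    #   G_2 = 10.0
    #   G_1 = -0.1 + 0.9 * 10.0 = 8.9
    #   G_0 = -0.1 + 0.9 * 8.9  = 7.91
    # compute_returns([-0.1, -0.1, 10.0], gamma=0.9) -> approximately [7.91, 8.9, 10.0]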

    Monte Carlo prediction estimates how valuable each position in the aisle is by averaging the returns observed from the cart's episodes. Unlike dynamic programming methods that require a model of the environment, Monte Carlo methods learn directly from the cart's sampled experience navigating the aisle.

    There are two main approaches:

    First-Visit Monte Carlo: Updates a state's value only the first time it appears in an episode. This ensures each episode contributes at most one sample per state, making the estimates unbiased.

    Every-Visit Monte Carlo: Updates a state's value every time it appears in an episode. This uses all occurrences of a state, potentially providing more data but with different statistical properties.

    Both methods converge to the true state values, but first-visit is theoretically cleaner while every-visit can be more efficient in practice.

    The incremental update formula used in both methods is:

    V[state] += (G - V[state]) / N[state]

    Where N[state] is the number of times the state has been visited. This formula maintains a running average of returns, ensuring convergence to the true expected return as more episodes are processed.

    ### Understanding First-Visit Monte Carlo

    First-visit Monte Carlo only updates an aisle position's value the first time the cart visits it in each episode. This requires tracking which positions have been seen in the current episode using a set. The incremental update formula averages the returns observed for each position, ensuring convergence as more episodes are processed. This approach ensures each episode contributes at most one sample per position, making the estimates unbiased and theoretically cleaner for the operations team's analysis.

    ### Understanding Every-Visit Monte Carlo

    Every-visit Monte Carlo updates an aisle position's value every time the cart visits it in an episode, using all occurrences rather than just the first. This provides more data points per episode but with different statistical properties than first-visit. Comparing both methods helps the operations team understand how different sampling approaches affect the evaluation of the aisle layout and the reliability of the learned state values.
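
    To make the two approaches concrete, here is a minimal sketch of both update loops. It assumes the compute_returns sketch above and the (state, action, reward) episode format from Step 2; the function names are illustrative, and the lab's mc_prediction.py may structure this differently.

    import numpy as np

    def first_visit_mc(episodes, n_states, gamma=0.99):
        """Estimate state values, counting only the first visit per episode."""
        V = np.zeros(n_states)
        N = np.zeros(n_states)
        for episode in episodes:
            rewards = [r for (_, _, r) in episode]
            returns = compute_returns(rewards, gamma)
            seen = set()                                # states already updated this episode
            for (state, _, _), g in zip(episode, returns):
                if state in seen:
                    continue                            # skip repeat visits
                seen.add(state)
                N[state] += 1
                V[state] += (g - V[state]) / N[state]   # incremental running average
        return V

    def every_visit_mc(episodes, n_states, gamma=0.99):
        """Estimate state values, counting every visit in every episode."""
        V = np.zeros(n_states)
        N = np.zeros(n_states)
        for episode in episodes:
            rewards = [r for (_, _, r) in episode]
            returns = compute_returns(rewards, gamma)
            for (state, _, _), g in zip(episode, returns):
                N[state] += 1                           # every occurrence counts
                V[state] += (g - V[state]) / N[state]
        return V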

  4. Challenge

    ## Step 4: Build and Evaluate a Greedy Routing Policy Using Monte Carlo State Values

    A greedy policy selects, at each state, the action that leads to the state with the highest estimated value. This is a simple but effective way to create a policy from a value function.

    In your warehouse environment, the greedy policy selects actions that move the cart toward states with higher values. Since Monte Carlo prediction learns that states closer to the goal have higher values (due to the +10.0 goal reward), the greedy policy will often guide the cart towards the goal.

    With state values learned from Monte Carlo prediction, you will create a greedy policy that looks one step ahead and selects actions leading to the highest-scoring next states. When multiple actions tie for the best score, it breaks ties by preferring RIGHT, making the policy deterministic. This creates a predictable and efficient strategy for the warehouse cart, though it may not explore alternative paths that could be better.

    ### Understanding Policy Evaluation and Comparison

    To demonstrate the value of learning to the operations team, you need to evaluate and compare both policies. Policy evaluation runs the cart through multiple episodes with each policy and collects statistics: total reward per episode, steps to reach the goal, and the trajectory (sequence of positions visited).

    To evaluate a policy's performance, you run multiple episodes and collect statistics:

    • Average reward: The mean total reward per episode (higher is better)
    • Average steps: The mean number of moves to reach the goal (lower is better)
    • Success rate: The percentage of episodes where the goal is reached

    The greedy policy often achieves higher average rewards, fewer steps to goal, and a higher success rate, giving the operations group measurable evidence that learned navigation improves cart efficiency. Comparing the greedy policy to the random baseline demonstrates how learning improves navigation efficiency and reliability.
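
    Bringing the two halves of this step together, here is a minimal sketch of a one-step-lookahead greedy action selector and an evaluation loop that collects the statistics above. It reuses the WarehouseAisle, episodes, rng, and first_visit_mc names from the earlier sketches; the function names and tie-break order are assumptions, and the lab's greedy_policy.py may differ.

    def greedy_action(state, V, goal):
        """Choose the action whose next state has the highest estimated value,
        preferring RIGHT on ties so the policy is deterministic."""
        next_states = {
            LEFT: max(0, state - 1),
            RIGHT: min(goal, state + 1),
            STAY: state,
        }
        best_value = max(V[s] for s in next_states.values())
        for action in (RIGHT, LEFT, STAY):              # tie-break order favors RIGHT
            if V[next_states[action]] == best_value:
                return action

    def evaluate_policy(env, policy_fn, n_episodes):
        """Run a policy for several episodes and summarize its performance."""
        total_rewards, episode_steps, successes = [], [], 0
        for _ in range(n_episodes):
            state = env.reset()
            done, total, steps = False, 0.0, 0
            while not done:
                action = policy_fn(state)
                state, reward, done = env.step(action)
                total += reward
                steps += 1
            total_rewards.append(total)
            episode_steps.append(steps)
            if state == env.goal:
                successes += 1
        return {
            "average_reward": sum(total_rewards) / n_episodes,
            "average_steps": sum(episode_steps) / n_episodes,
            "success_rate": successes / n_episodes,
        }

    # Example comparison against the random baseline (names are illustrative):
    env = WarehouseAisle(aisle_length=10)
    V = first_visit_mc(episodes, env.aisle_length)
    greedy_stats = evaluate_policy(env, lambda s: greedy_action(s, V, env.goal), 100)
    random_stats = evaluate_policy(env, lambda s: int(rng.integers(0, 3)), 100)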

  5. Challenge

    ## Step 5: Conclusion

    Congratulations on completing the Monte Carlo Methods lab! You've successfully implemented Monte Carlo prediction and control for a warehouse cart navigation system.

    What You've Accomplished

    Throughout this lab, you have:

    • Collected Episodes: Created a random policy and collected complete episodes of experience, recording every state, action, and reward encountered during navigation.

    • Computed Returns: Implemented the calculation of discounted cumulative rewards (returns) for each step in episodes, working backwards from the end.

    • Implemented First-Visit MC: Created first-visit Monte Carlo prediction that updates state values only on the first occurrence of each state in an episode.

    • Implemented Every-Visit MC: Extended your code to support every-visit Monte Carlo prediction that updates state values on every occurrence, and compared the two methods.

    • Built a Greedy Policy: Used learned state values to create a greedy routing policy that selects actions leading to higher-value states.

    • Evaluated and Compared Policies: Implemented policy evaluation to run multiple episodes and compared the greedy policy to the random baseline, demonstrating measurable improvements in reward, steps to goal, and success rate.

    Key Takeaways

    The most important lessons from this lab are:

    • Monte Carlo Methods Learn from Episodes: Unlike model-based methods, Monte Carlo algorithms learn directly from complete episodes of experience without requiring a model of the environment.

    • Returns Capture Long-Term Value: The discounted return from a state represents the total future reward, making it a natural target for value estimation.

    • First-Visit vs Every-Visit: First-visit MC only updates values on the first occurrence of a state in an episode, while every-visit MC updates on every occurrence. Both converge to the true values, but with different statistical properties.

    • Value Functions Enable Policy Improvement: Once you have accurate state value estimates, you can create better policies by selecting actions that lead to higher-value states.

    What Happens If You Change Things?

    This is your last chance to experiment in the environment. Take this opportunity to try new things.

    Here are some things to try out:

    • Run several greedy-policy episodes in a row and print each full trajectory so you can visually inspect how the cart moves compared to the random policy.
    • Change the random seed (for example, change 113 to another value) and compare the resulting greedy trajectories to confirm that the learned values produce consistent behavior across different runs.
    • Increase aisle_length (for example, 10 to 20) and compare how many episodes you need before the value estimates stabilize.
    • Change the step penalty (for example, -0.1 to -0.5) and observe how it changes the learned values and policy performance.
    • Make the environment less “obvious” by adding a second terminal outcome. For example: add an “early exit” at state 0 where taking action LEFT immediately ends the episode with a smaller reward (for example, +2.0). Then compare what the greedy policy does with different values of gamma. A sketch of this variant follows.
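
    One way to prototype that last suggestion, assuming the illustrative WarehouseAisle class sketched in Step 1 (the subclass name and reward value are assumptions):

    class AisleWithEarlyExit(WarehouseAisle):
        """Variant with a second terminal outcome: an early exit at state 0."""

        def step(self, action):
            # Taking LEFT at position 0 ends the episode with a smaller reward.
            if self.position == 0 and action == LEFT:
                self.steps += 1
                return self.position, 2.0, True
            return super().step(action)                 # otherwise behave as before
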
About the author

Nicolae has been a Software Engineer since 2013, focusing on Java and web stacks. Nicolae holds a degree in Computer Science and enjoys teaching, traveling and motorsports.
