Actor-Critic Methods
In this lab, you'll practice implementing actor-critic algorithms including A2C, A3C, and SAC. When you're finished, you'll have compared synchronous, asynchronous, and entropy-regularized approaches.
-
Challenge
Introduction
Welcome to the Actor-Critic Methods Code Lab. In this hands-on lab, you'll implement advanced reinforcement learning algorithms including A2C, A3C, and SAC.
- A2C (Advantage Actor-Critic): Synchronous training with advantage estimation
- A3C (Asynchronous Advantage Actor-Critic): Parallel training with multiple workers
- SAC (Soft Actor-Critic): Entropy-regularized training for better exploration
You'll build actor and critic networks, train agents with both synchronous and asynchronous approaches, and compare entropy-regularized methods.
Background
You're an AI researcher at CarvedRock Robotics building control systems for robotic arms used in manufacturing. The company needs to benchmark different actor-critic algorithms to determine which provides the best training speed, sample efficiency, and stability for continuous control tasks.
Your team has decided to evaluate three state-of-the-art algorithms:
- A2C for baseline synchronous training
- A3C for parallel asynchronous training
- SAC for entropy-regularized exploration
Success in this benchmarking effort will guide the selection of the primary algorithm for the production robotic control system.
Familiarizing Yourself with the Program Structure
The lab environment includes the following key files:
- `networks.py`: Defines actor and critic neural network architectures
- `a2c_agent.py`: Implements the Advantage Actor-Critic agent
- `a3c_agent.py`: Implements asynchronous training with multiple workers
- `sac_agent.py`: Implements Soft Actor-Critic with entropy regularization
- `train_a2c.py`: Training script for the A2C agent
- `train_a3c.py`: Training script for A3C with parallel workers
- `train_sac.py`: Training script for the SAC agent
- `evaluate.py`: Evaluates trained agents and compares performance
- `visualize.py`: Creates training curves and comparison plots
The environment uses:
- Python 3.10+ with PyTorch for neural networks
- Gymnasium for RL environments
- NumPy for numerical operations
- Matplotlib for visualization
All dependencies are pre-installed in the lab environment.
Solution Directory
If you are stuck on a task, you can review the complete solution files (prefixed with `sol-`) in the `solutions/` directory.
How to Run Scripts
To run scripts, use the terminal with commands like:
`python3 train_a2c.py`
The results will be saved to the `output/` directory, which you can view by opening files directly.
Workflow Notes
> Note: Complete tasks in order.
Each task builds on the previous one. Test your code frequently by running the task validations to catch errors early.
-
Challenge
Understanding Actor-Critic Methods
Understanding Actor-Critic Methods
Actor-Critic methods combine two key components to learn optimal policies:
- Actor: The policy network that selects actions based on the current state
- Critic: The value network that estimates how good states or actions are
Unlike pure policy gradient methods (which include only an actor) or value-based methods (which include only a critic), actor-critic methods leverage both components. The critic evaluates actions taken by the actor, providing lower-variance gradient estimates that lead to more stable and efficient learning.
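As a rough sketch of what these two components might look like for a continuous-control task (the layer sizes, the Gaussian policy, and the class names are assumptions for illustration; the lab's `networks.py` defines the architectures you'll actually use):
```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps a state to a distribution over continuous actions."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        # State-independent log standard deviation (an assumption for simplicity).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

class Critic(nn.Module):
    """Value network: maps a state to a scalar estimate of V(s)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```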
The three algorithms you'll implement represent different innovations:
- A2C: Synchronous advantage actor-critic that updates after a fixed number of steps
- A3C: Asynchronous version that parallelizes training across multiple workers
- SAC: Adds entropy regularization to encourage exploration and improve stability
Understanding these differences is crucial because each algorithm makes different trade-offs between sample efficiency, computational efficiency, and training stability.
You now have the foundational context needed to begin implementing your first actor-critic networks.
-
Challenge
Advantage Actor-Critic Algorithm
Understanding Advantage Estimation
The advantage function measures how much better an action is compared to the average action in that state. It's calculated as:
`A(s,a) = Q(s,a) - V(s)`
In this equation, `Q(s,a)` represents the action-value and `V(s)` represents the state-value. Using advantages instead of raw action-values reduces variance in policy gradient updates, leading to more stable training.
In practice, you estimate advantages using temporal difference (TD) learning. The critic learns to estimate `V(s)`, and you use TD errors to compute advantages. This approach is more sample-efficient than Monte Carlo methods and provides lower-variance gradient estimates than pure policy-gradient methods.
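A minimal sketch of that computation, assuming a one-step TD target and batched tensors (the lab may use n-step returns or a different helper name):
```python
import torch

def td_advantage(rewards, values, next_values, dones, gamma=0.99):
    """One-step TD advantage: A(s, a) ≈ r + gamma * V(s') - V(s)."""
    # The (1 - done) factor drops the bootstrap term at episode boundaries.
    td_targets = rewards + gamma * next_values * (1.0 - dones)
    return td_targets - values

# Hypothetical usage with tensors produced by the critic from the previous step:
# advantages = td_advantage(rewards, critic(states), critic(next_states), dones)
```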
### Understanding Policy Gradient Loss
The actor needs a loss function to learn from experience, but unlike supervised learning where you have correct answers, reinforcement learning only has rewards. The policy gradient loss bridges this gap.
The key insight is to make actions that led to good outcomes more probable. The loss function accomplishes this through:
- Log probabilities: Measure how likely the action was under the current policy
- Advantages: Tell you whether the action was better or worse than average
- Negative mean: Minimize the negative (which maximizes the positive) to increase probability of good actions
The policy gradient loss function is:
`-(log_probs * advantages).mean()`
The loss function increases the probability of actions with positive advantages and decreases the probability of actions with negative advantages. The advantages act as weights: larger positive advantages create stronger updates toward that action.
This is why advantage estimation in the previous task was crucial. It provides the training signal that tells the actor which actions to reinforce. Without advantage estimates, the actor wouldn't know how to adjust the policy to improve performance.
### Actor Loss Sign Convention
Policy gradient methods maximize expected return. Because most optimizers perform minimization, the actor loss is defined as the negative of the weighted log probabilities.
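Putting the sign convention together with the advantage estimates, an update step might compute both losses roughly like this (the `actor`, `critic`, and `td_targets` names follow the earlier sketches and are assumptions about the lab's code):
```python
import torch
import torch.nn.functional as F

def actor_critic_losses(actor, critic, states, actions, td_targets):
    values = critic(states)
    # Detach so the advantage acts as a fixed weight; the critic's gradients
    # come from its own regression loss below.
    advantages = (td_targets - values).detach()

    dist = actor(states)
    log_probs = dist.log_prob(actions).sum(dim=-1)  # sum over action dimensions

    # Negative sign: the optimizer minimizes, so minimizing the negative
    # maximizes the advantage-weighted log probabilities.
    actor_loss = -(log_probs * advantages).mean()
    critic_loss = F.mse_loss(values, td_targets)
    return actor_loss, critic_loss
```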
-
Challenge
Asynchronous Advantage Actor-Critic Algorithm
Understanding Asynchronous Training
In A3C, you parallelize training by running multiple agents simultaneously, with each agent collecting experience from its own environment. These workers periodically sync their gradients with a global network, enabling faster learning through increased data collection.
Key benefits of asynchronous training:
- Increased throughput: Multiple workers collect experience in parallel.
- Exploration diversity: Different workers explore different parts of the state space.
- Gradient diversity: Reduces correlation between training samples.
The trade-off is increased implementation complexity. You must also manage synchronization carefully to ensure stable learning across all workers.
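To give a feel for the parallel setup, worker processes might be launched roughly as follows (the `worker_fn` callable and the worker count are placeholders; `a3c_agent.py` and `train_a3c.py` define the real entry points):
```python
import torch.multiprocessing as mp

def launch_workers(worker_fn, global_actor, global_critic, num_workers=4):
    """Start worker processes that all train against shared global networks."""
    global_actor.share_memory()   # expose global parameters to every process
    global_critic.share_memory()
    processes = []
    for rank in range(num_workers):
        p = mp.Process(target=worker_fn, args=(rank, global_actor, global_critic))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()  # wait for all workers to finish training
```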
### Understanding Gradient Synchronization
A3C uses asynchronous parallel training, which requires a mechanism for sharing learning across multiple workers without causing interference. Gradient synchronization provides this mechanism.
Gradient synchronization consists of two key operations:
- Sync with global: Before training, each worker copies the latest parameters from the global networks to its local networks. This ensures workers start with the most up-to-date policy.
- Push gradients: After computing gradients locally, workers push them to the global networks. The global networks accumulate these gradients and update their parameters.
This synchronization pattern allows workers to explore different parts of the environment simultaneously while still learning from one another's experiences. Unlike simple parameter copying, gradient-based synchronization preserves the optimization dynamics and enables faster convergence.
Without proper synchronization, workers would train independently and never benefit from parallel exploration. With gradient synchronization, A3C achieves faster training than single-agent A2C while maintaining training stability.
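A simplified sketch of those two operations for one network (assuming the local and global models share an architecture and that a shared optimizer holds the global parameters; the lab's `a3c_agent.py` may organize this differently):
```python
def sync_with_global(local_net, global_net):
    """Pull the latest parameters from the global network into the local copy."""
    local_net.load_state_dict(global_net.state_dict())

def push_gradients(local_net, global_net, global_optimizer):
    """Hand the worker's freshly computed gradients to the global network."""
    for local_p, global_p in zip(local_net.parameters(), global_net.parameters()):
        # Assumes backward() was already called on the local loss.
        global_p.grad = local_p.grad.clone()
    global_optimizer.step()       # the optimizer updates the global parameters
    global_optimizer.zero_grad()
```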
-
Challenge
Soft Actor-Critic Algorithm
Understanding Entropy Regularization
Soft Actor-Critic (SAC) adds an entropy term to the objective function to explicitly encourage exploration during training.
The modified objective is:
`J = E[Σ(r_t + α * H(π(·|s_t)))]`
Here, `H(π)` is the policy entropy and `α` is the temperature parameter.
Higher entropy produces more random action selection, which increases exploration, while lower entropy produces more deterministic behavior and favors exploitation.
This entropy bonus helps the agent:
- Maintain exploration throughout training
- Avoid premature convergence to suboptimal policies
- Discover multiple solutions to tasks
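As an illustration of how the entropy term enters the actor update (the `policy.sample` and `q_net` interfaces are assumptions, and `alpha` is shown as a plain value; the automatic tuning of `α` described next is omitted here):
```python
import torch

def sac_actor_loss(policy, q_net, states, alpha):
    """Entropy-regularized actor objective: maximize Q plus the entropy bonus."""
    actions, log_probs = policy.sample(states)   # assumed: sample() also returns log π(a|s)
    q_values = q_net(states, actions)
    # Minimizing (alpha * log_pi - Q) pushes the policy toward actions that are
    # both high-value and high-entropy; alpha controls the trade-off.
    return (alpha * log_probs - q_values).mean()
```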
The temperature parameter `α` is automatically adjusted during training to balance exploration and exploitation.
-
Challenge
Training and Performance Comparison
Understanding Training Scripts
Now that you've implemented the core algorithms (A2C, A3C, and SAC), you're ready to train them in an environment. Training scripts orchestrate the entire learning process: they create the environment, initialize agents, run episodes, collect experience, and save results.
Each algorithm requires slightly different training setups:
- A2C: Runs a single agent that updates after collecting a fixed number of steps
- A3C: Spawns multiple worker threads that train in parallel and sync with global networks
- SAC: Uses a replay buffer to store and sample past experiences for off-policy learning
When you run all three training scripts, they generate performance data that you can later analyze and compare. The training process typically takes several minutes as each agent learns to balance the pendulum through trial and error.
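To make the shape of such a script concrete, here is a heavily condensed A2C-style loop (the `Pendulum-v1` environment id, the agent methods, and the output file name are assumptions; `train_a2c.py` is more complete):
```python
import gymnasium as gym
import numpy as np

def train(agent, num_episodes=500):
    env = gym.make("Pendulum-v1")
    episode_rewards = []
    for _ in range(num_episodes):
        state, _ = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = agent.select_action(state)                   # assumed agent API
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            agent.store(state, action, reward, next_state, done)  # assumed agent API
            state, total_reward = next_state, total_reward + reward
        agent.update()                           # one update per episode in this sketch
        episode_rewards.append(total_reward)
    np.save("output/a2c_rewards.npy", np.array(episode_rewards))
```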
### Understanding Performance Evaluation
After training your three agents, you need to evaluate and compare their performance. This is crucial for understanding which algorithm works best for your specific task.
Performance evaluation involves:
- Loading the saved training rewards from all three algorithms.
- Computing summary statistics, such as mean and standard deviation, over the final episodes.
- Generating visualizations that compare learning efficiency and final performance.
By analyzing the last 100 episodes, you get a reliable measure of each algorithm's converged performance. This comparison helps you understand the trade-offs, such as A2C's simplicity, A3C's speed, and SAC's stability.
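A small sketch of that kind of comparison (the `output/*_rewards.npy` file names are assumptions about how the training scripts save their results; `evaluate.py` holds the real logic):
```python
import numpy as np

def summarize(name, path, window=100):
    rewards = np.load(path)
    tail = rewards[-window:]  # converged performance over the final episodes
    print(f"{name}: mean={tail.mean():.1f}, std={tail.std():.1f}")

for name, path in [("A2C", "output/a2c_rewards.npy"),
                   ("A3C", "output/a3c_rewards.npy"),
                   ("SAC", "output/sac_rewards.npy")]:
    summarize(name, path)
```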
### Understanding Training Visualization
Raw numbers tell part of the story, but visualizations reveal the learning dynamics. Plotting training curves shows you how each algorithm learned over time, not just where it ended up.
Training visualizations allow you to compare key learning characteristics:
- Learning speed: How quickly each algorithm improves.
- Stability: Whether training curves are smooth or noisy.
- Sample efficiency: How many episodes each algorithm requires to reach strong performance.
- Convergence: Whether performance plateaus or continues to improve.
Creating these plots is the final step in your benchmarking study. You'll generate side-by-side comparisons that clearly show each algorithm's learning behavior and make results easier to interpret and communicate to your team at CarvedRock Robotics.
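For instance, a side-by-side learning-curve plot could be produced roughly like this (the file names and smoothing window are assumptions; `visualize.py` produces the official plots):
```python
import numpy as np
import matplotlib.pyplot as plt

def smooth(x, window=20):
    """Moving average so noisy episode rewards are easier to read."""
    return np.convolve(x, np.ones(window) / window, mode="valid")

plt.figure(figsize=(8, 5))
for name, path in [("A2C", "output/a2c_rewards.npy"),
                   ("A3C", "output/a3c_rewards.npy"),
                   ("SAC", "output/sac_rewards.npy")]:
    plt.plot(smooth(np.load(path)), label=name)
plt.xlabel("Episode")
plt.ylabel("Smoothed reward")
plt.title("Training curves: A2C vs. A3C vs. SAC")
plt.legend()
plt.savefig("output/training_comparison.png")
```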