Distributed Computing for ML
This course teaches you how to build and optimize distributed machine learning pipelines using Ray and PyTorch, covering multi-process training, backend tuning, gradient compression, and remote node integration for scalable model development.
What you'll learn
Building machine learning models at scale introduces a range of performance and infrastructure challenges. In this course, Distributed Computing for ML, you’ll gain the skills to design, deploy, and optimize scalable machine learning workflows across multi-node environments. First, you’ll learn how to set up a distributed cluster using Ray and PyTorch—from simulating a local cluster to training models across multiple processes. Next, you’ll examine key performance factors such as resource utilization, data partitioning, and communication tradeoffs between processes. Finally, you’ll implement optimization techniques including Distributed Stochastic Gradient Descent (DSGD), experiment with communication backends like Gloo and NCCL, and tune cluster topologies for better performance. You’ll also explore advanced strategies like integrating remote GPU nodes, applying gradient compression, and benchmarking I/O efficiency. When you’re finished with this course, you’ll have the skills and knowledge needed to build and monitor distributed machine learning pipelines on both local and remote infrastructure.
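To give a taste of the multi-process training and communication-backend topics described above, here is a minimal sketch (not course material) that simulates a two-worker "cluster" on a single machine using PyTorch's Gloo backend and averages gradients across workers before each optimizer step, the core idea behind Distributed SGD. The tiny linear model, random data, port, and hyperparameters are illustrative placeholders.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Every process joins the same process group; Gloo runs on CPU-only hosts,
    # which makes it convenient for simulating a cluster locally.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # placeholder port
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    torch.manual_seed(0)                 # identical initial weights on every worker
    model = torch.nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    torch.manual_seed(rank)              # but a different random data shard per worker
    for _ in range(5):
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()

        # Distributed SGD: sum gradients across workers with all-reduce,
        # then average, so every replica applies the same update.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
        opt.step()

    if rank == 0:
        print(f"final loss on rank 0: {loss.item():.4f}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # two local processes stand in for two cluster nodes
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

On an actual multi-node cluster, MASTER_ADDR would point at a reachable head node, and the NCCL backend would typically replace Gloo once GPUs are involved; the course covers those tradeoffs in depth.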
About the author
I'm Anthony Alampi, an interactive designer and developer living in Austin, Texas. I'm a former professional video game developer and current web design company owner.