- Course
GenAI Inference and Serving Architecture
Running GenAI systems efficiently is essential for real-world AI. This course will teach you how to make informed model-selection decisions and implement fast, scalable, and cost-optimized transformer inference pipelines.
What you'll learn
Deploying modern large language models (LLMs) efficiently is challenging due to their high computational demands, complex sampling behavior, and rapidly evolving inference optimizations.
In this course, GenAI Inference and Serving Architecture, you’ll gain the ability to design, analyze, and optimize high-performance inference pipelines for transformer models.
First, you’ll explore the fundamentals of model inference, including tokenization, forward passes, sampling strategies, and the key performance metrics that govern latency and throughput.
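To give a flavor of the sampling material, here is a minimal, illustrative sketch of temperature plus top-k sampling over raw logits (the function name and parameters are assumptions for this example, not code from the course):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8, top_k: int = 50) -> int:
    """Pick the next token id from raw logits via temperature + top-k sampling."""
    top_k = min(top_k, logits.size)
    # Temperature rescales the distribution: <1 sharpens it, >1 flattens it.
    scaled = logits / max(temperature, 1e-6)
    # Keep only the k highest logits; mask everything else to -inf.
    keep = np.argpartition(scaled, -top_k)[-top_k:]
    masked = np.full_like(scaled, -np.inf)
    masked[keep] = scaled[keep]
    # Softmax over the survivors, then draw one token.
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```

Knobs like these trade output diversity against determinism, which in turn affects how reproducible and cacheable a serving pipeline can be.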
Next, you’ll discover how to implement batching, KV-cache management, and long-context optimization techniques to dramatically improve efficiency at scale.
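As a preview of why KV-caching matters, the toy single-head decoder step below stores each past token's key/value vectors so a new token attends over cached history instead of re-running attention from scratch (shapes and names are simplified assumptions, not a production design):

```python
import numpy as np

class KVCache:
    """Toy per-layer key/value store for autoregressive decoding."""
    def __init__(self):
        self.keys: list[np.ndarray] = []    # one (d_head,) vector per past token
        self.values: list[np.ndarray] = []

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        self.keys.append(k)
        self.values.append(v)

def attend(query: np.ndarray, cache: KVCache) -> np.ndarray:
    """Single-head attention of one new query over all cached tokens."""
    K = np.stack(cache.keys)                  # (seq_len, d_head)
    V = np.stack(cache.values)                # (seq_len, d_head)
    scores = K @ query / np.sqrt(query.size)  # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over past positions
    return weights @ V                        # weighted sum of cached values

# Decode loop sketch: each step appends one k/v pair and attends over the cache.
cache = KVCache()
for _ in range(5):
    cache.append(np.random.randn(64), np.random.randn(64))
    out = attend(np.random.randn(64), cache)
```

Without the cache, every generated token would recompute keys and values for the entire prefix, which is the main cost that batching and cache-management schemes are designed to avoid.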
Finally, you’ll learn how to optimize GPU utilization, manage infrastructure costs, and apply advanced techniques such as speculative decoding, quantization, and model compression.
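To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor int8 weight quantization (a deliberate simplification; production schemes are typically per-channel or per-group, and the helper names here are illustrative):

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric quantization: int8 weights plus one fp32 scale per tensor."""
    scale = max(float(np.abs(weights).max()) / 127.0, 1e-12)  # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize_int8(q, s)).max())  # small per-weight error
```

Storing weights in int8 roughly quarters memory traffic relative to fp32, which is why quantization figures so prominently in GPU-utilization and cost discussions.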
When you’re finished with this course, you’ll have the LLM inference optimization skills and knowledge needed to build, tune, and scale cost-efficient, high-performance GenAI systems in production.