  • Course

GenAI Inference and Serving Architecture

Running GenAI systems efficiently is essential for deploying AI in production. This course teaches you how to make informed model-selection decisions and implement fast, scalable, cost-optimized transformer inference pipelines.

Advanced
1h 56m

Created by Yasir Khan

Last Updated Jan 30, 2026

This course is included in the libraries shown below:

  • AI

What you'll learn

Deploying modern large language models (LLMs) efficiently is challenging due to their high computational demands, complex sampling behavior, and rapidly evolving inference optimizations.

In this course, GenAI Inference and Serving Architecture, you’ll gain the ability to design, analyze, and optimize high-performance inference pipelines for transformer models.

First, you’ll explore the fundamentals of model inference, including tokenization, forward passes, sampling strategies, and the key performance metrics that govern latency and throughput.
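As a rough illustration of what "sampling strategies" can mean in practice, the sketch below contrasts greedy decoding with temperature-scaled top-k sampling over a list of logits. It is not taken from the course; the function names (`softmax`, `greedy`, `sample_top_k`) and the toy logits are assumptions made purely for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Greedy decoding: always pick the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(logits, k=3, temperature=1.0, rng=random):
    """Sample a token id, restricted to the k highest-scoring candidates."""
    k = max(1, min(k, len(logits)))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top_ids], temperature)
    return rng.choices(top_ids, weights=probs, k=1)[0]


# Hypothetical logits for a 4-token vocabulary.
toy_logits = [0.1, 2.0, -1.0, 1.5]
print(greedy(toy_logits))             # deterministic choice
print(sample_top_k(toy_logits, k=2))  # random choice among the top 2 ids
```

Greedy decoding is deterministic, while top-k sampling trades some determinism for output diversity; the temperature controls how strongly the highest-scoring candidates dominate.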

Next, you’ll discover how to implement batching, KV-cache management, and long-context optimization techniques to dramatically improve efficiency at scale.
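To give a sense of why KV-cache management and batching interact, here is a minimal sketch that estimates KV-cache memory for a standard decoder-only transformer and derives how many sequences fit in a given memory budget. The model dimensions are hypothetical and the helper names (`kv_cache_bytes`, `max_batch_size`) are illustrative assumptions, not part of the course material.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV-cache size in bytes.

    Each layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2. bytes_per_elem=2 assumes 16-bit storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def max_batch_size(memory_budget_bytes, **model_cfg):
    """How many full-length sequences fit in the budget (cache only)."""
    per_sequence = kv_cache_bytes(batch_size=1, **model_cfg)
    return memory_budget_bytes // per_sequence


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, 4096-token context.
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
GIB = 1024 ** 3
print(kv_cache_bytes(**cfg) / GIB)     # cache size per sequence, in GiB
print(max_batch_size(8 * GIB, **cfg))  # sequences that fit in an 8 GiB budget
```

The estimate ignores weights, activations, and allocator overhead, so it is an upper bound on batch size rather than a deployment recipe. It does show why long contexts and large batches compete for the same memory.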

Finally, you’ll learn how to optimize GPU utilization, manage infrastructure costs, and apply advanced techniques such as speculative decoding, quantization, and model compression.
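Of the techniques listed, quantization is the simplest to illustrate in a few lines. The sketch below shows symmetric per-tensor int8 quantization of a small weight list, which is one common scheme among several and not necessarily the one the course covers. The function names and sample weights are assumptions for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid division by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weights.
weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, worst_error)
```

Each weight is stored in one byte instead of two or four, at the cost of a rounding error of at most half the scale per value. Production systems typically use finer-grained (per-channel or per-group) scales to keep that error small.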

When you’re finished with this course, you’ll have the skills and knowledge of LLM inference optimization needed to build, tune, and scale cost-efficient, high-performance GenAI systems in production.

About the author
Yasir Khan
28 courses 0.0 author rating 0 ratings

Dr. Yasir Khan is a global tech consultant and 38Labs founder. He's passionate about digital transformation, data & AI, and regularly shares technology insights on Pluralsight.
