- Course
GenAI Inference and Serving Architecture
Running GenAI systems efficiently is essential for real-world AI. This course will teach you how to make informed model-selection decisions and implement fast, scalable, and cost-optimized transformer inference pipelines.
What you'll learn
Deploying modern large language models (LLMs) efficiently is challenging due to their high computational demands, complex sampling behavior, and rapidly evolving inference optimizations.
In this course, GenAI Inference and Serving Architecture, you’ll gain the ability to design, analyze, and optimize high-performance inference pipelines for transformer models.
First, you’ll explore the fundamentals of model inference, including tokenization, forward passes, sampling strategies, and the key performance metrics that govern latency and throughput.
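To give a flavor of the sampling material, here is a minimal, illustrative sketch of temperature plus top-k sampling over raw logits (the function name and parameters are assumptions for this example, not code from the course):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8, top_k: int = 50) -> int:
    """Pick the next token id from raw logits via temperature + top-k sampling."""
    top_k = min(top_k, logits.size)
    # Temperature rescales the distribution: <1 sharpens it, >1 flattens it.
    scaled = logits / max(temperature, 1e-6)
    # Keep only the k highest logits; mask everything else to -inf.
    keep = np.argpartition(scaled, -top_k)[-top_k:]
    masked = np.full_like(scaled, -np.inf)
    masked[keep] = scaled[keep]
    # Softmax over the survivors, then draw one token.
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```

Knobs like these trade output diversity against determinism, which in turn affects how reproducible and cacheable a serving pipeline can be.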
Next, you’ll discover how to implement batching, KV-cache management, and long-context optimization techniques to dramatically improve efficiency at scale.
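As a preview of why KV-caching matters, the toy single-head decoder step below stores each past token's key/value vectors so a new token attends over cached history instead of re-running attention from scratch (shapes and names are simplified assumptions, not a production design):

```python
import numpy as np

class KVCache:
    """Toy per-layer key/value store for autoregressive decoding."""
    def __init__(self):
        self.keys: list[np.ndarray] = []    # one (d_head,) vector per past token
        self.values: list[np.ndarray] = []

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        self.keys.append(k)
        self.values.append(v)

def attend(query: np.ndarray, cache: KVCache) -> np.ndarray:
    """Single-head attention of one new query over all cached tokens."""
    K = np.stack(cache.keys)                  # (seq_len, d_head)
    V = np.stack(cache.values)                # (seq_len, d_head)
    scores = K @ query / np.sqrt(query.size)  # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over past positions
    return weights @ V                        # weighted sum of cached values

# Decode loop sketch: each step appends one k/v pair and attends over the cache.
cache = KVCache()
for _ in range(5):
    cache.append(np.random.randn(64), np.random.randn(64))
    out = attend(np.random.randn(64), cache)
```

Without the cache, every generated token would recompute keys and values for the entire prefix, which is the main cost that batching and cache-management schemes are designed to avoid.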
Finally, you’ll learn how to optimize GPU utilization, manage infrastructure costs, and apply advanced techniques such as speculative decoding, quantization, and model compression.
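To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor int8 weight quantization (a deliberate simplification; production schemes are typically per-channel or per-group, and the helper names here are illustrative):

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric quantization: int8 weights plus one fp32 scale per tensor."""
    scale = max(float(np.abs(weights).max()) / 127.0, 1e-12)  # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize_int8(q, s)).max())  # small per-weight error
```

Storing weights in int8 roughly quarters memory traffic relative to fp32, which is why quantization figures so prominently in GPU-utilization and cost discussions.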
When you’re finished with this course, you’ll have the LLM inference optimization skills and knowledge needed to build, tune, and scale cost-efficient, high-performance GenAI systems in production.