Parallel Computing with CUDA

An entry-level course on CUDA, a GPU programming technology from NVIDIA.
Course info
Rating: (103)
Level: Intermediate
Updated: Oct 3, 2013
Duration: 4h 12m
Description

This introductory course on CUDA shows how to get started with the CUDA platform and leverage the power of modern NVIDIA GPUs. It covers the basics of CUDA C, explains the architecture of the GPU, and presents solutions to some of the common computational problems that are suitable for GPU acceleration.

About the author

Dmitri is a developer, speaker, podcaster, technical evangelist and wannabe quant.

Section Introduction Transcripts

Introduction to CUDA C
In this module we finally get to write some code, and the language we're going to be using is called CUDA C, which for the most part is just standard C. So we do finally get to write some code in this module, but there will also be plenty of theoretical material that's pretty much essential to understanding how CUDA does what it does. First of all, we'll talk about the compilation process, the way that C code actually gets turned into something that can run on a GPU. Then we'll have the obligatory "Hello CUDA" demo, where we'll write some fairly simplistic C code to add two arrays together. We'll discuss the location qualifiers that CUDA uses to indicate where code runs, and then I want to talk about CUDA's execution model, basically explaining what exactly happens when you send a kernel to CUDA for execution. I'll also cover grid and block dimensions, the way they can be specified and examined at runtime. Then we'll discuss the very simple mechanism that can be used to handle errors in CUDA applications, and to finish things off, I'll show you how to get information about the device you're running on at runtime.
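
To make that a little more concrete, here is a minimal sketch of the kind of "Hello CUDA" program this module builds up to: adding two arrays. This is not the course's exact demo; the kernel name addArrays, the 256-thread block size, and the array length are illustrative choices, but the location qualifier, launch syntax, error check, and device query are standard CUDA C.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// __global__ is one of CUDA's location qualifiers: this function (a kernel)
// runs on the device but is launched from the host.
__global__ void addArrays(const float* a, const float* b, float* c, int n)
{
    // Built-in variables tell each thread which element it owns.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    // Query the device we're running on at runtime.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("running on: %s\n", prop.name);

    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float ha[n], hb[n], hc[n];                       // host arrays
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;                             // device arrays
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Grid/block dimensions: enough 256-thread blocks to cover n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addArrays<<<blocks, threads>>>(da, db, dc, n);

    // CUDA's simple error-handling mechanism: check status codes.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[100] = %f\n", hc[100]);               // expect 300.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```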

The Many Types of Memory
When we talk about memory in programming for the CPU, we mainly talk about RAM, that's Random Access Memory. And of course there are other types of memory you might have (SSDs or hard drives for long-term storage, say), and you obviously have your CPU registers. Now in CUDA the situation is slightly different, because there are several different types of memory that you might, in a way, equate with Random Access Memory. All of these types of memory have their own performance characteristics and programming mechanics, and obviously all of them have a purpose in the CUDA ecosystem. So I want to take you back to a diagram that we've already seen. This slide is interesting for us because it demonstrates where the different types of memory actually reside, and the arrows show us which part of the architecture can read what data. One thing that's not shown here is host memory and whether the CPU can read it, but as you remember, that's what we've got CUDA for: its API lets us read and write device memory. One thing you'll notice about this diagram is the two new arrows that I've added; when we talk about constant and texture memory, I'll explain what these actually mean. So in this module we're going to talk about all the different types of memory presented here, what these types of memory are actually used for, and how to work with them.
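
As a rough illustration of how a few of these memory types look in code, here is a sketch of my own devising (the 4-tap filter, the name smooth, and all sizes are invented, not taken from the module) that touches constant memory, shared memory, registers, and global memory in one kernel.

```cpp
#include <cuda_runtime.h>

// Constant memory: small, read-only on the device, cached, written by the host.
__constant__ float coeff[4];

__global__ void smooth(const float* in, float* out, int n)
{
    // Shared memory: one copy per block, visible to all threads in that block.
    __shared__ float tile[256 + 3];

    int gi = blockIdx.x * blockDim.x + threadIdx.x;  // index into global memory
    int li = threadIdx.x;                            // index into the shared tile

    // Stage a tile of global memory (plus a 3-element halo) into shared memory.
    if (gi < n) tile[li] = in[gi];
    if (li < 3 && gi + blockDim.x < n)
        tile[blockDim.x + li] = in[gi + blockDim.x];
    __syncthreads();

    // Ordinary local variables such as 'sum' live in per-thread registers.
    if (gi + 3 < n)
    {
        float sum = 0.0f;
        for (int k = 0; k < 4; ++k)
            sum += coeff[k] * tile[li + k];          // constant + shared reads
        out[gi] = sum;                               // global write
    }
}

int main()
{
    const int n = 1024;
    float h[4] = { 0.25f, 0.25f, 0.25f, 0.25f };
    // cudaMemcpyToSymbol is how the host writes to __constant__ memory.
    cudaMemcpyToSymbol(coeff, h, sizeof(h));

    float *in, *out;
    cudaMalloc((void**)&in, n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    smooth<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in); cudaFree(out);
    return 0;
}
```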

Thread Cooperation and Synchronization
All the examples I've shown in the previous modules demonstrate how you can get each thread to do something on its own, and all of them have carefully avoided the issue of threads interacting with one another. So in this module we'll talk about interaction between threads, as well as the exact mechanisms by which they're scheduled and executed. Before we jump into this module's contents, I've got a bit of an announcement to make. Until recently we've been using Visual Studio 2010 to code up our examples, but with the release of CUDA 5.5 it's now possible to use Visual Studio 2012, so I hope you don't mind the fact that I'm going to continue this course using Visual Studio 2012. As you can see, I've already taken all the examples from the previous modules and included them in a solution. Notice that Visual Studio has this backwards compatibility thing for C++ projects, so you can open Visual Studio 2010 projects with absolutely no problem whatsoever. I just wanted to give you a heads-up, because the UI will look slightly different now, but in terms of operation pretty much everything is as it was in Visual Studio 2010.
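
A classic example of thread cooperation, sketched below under my own assumptions (the names and sizes are illustrative, not the module's code), is a block-level sum reduction: threads in a block stage data into shared memory and synchronize with __syncthreads() between steps so that no thread races ahead of the others.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 256

// Block-level sum: threads cooperate through shared memory, synchronizing
// with __syncthreads() so every step completes before the next begins.
__global__ void blockSum(const float* in, float* blockTotals, int n)
{
    __shared__ float cache[THREADS];

    int gi = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (gi < n) ? in[gi] : 0.0f;
    __syncthreads();                     // the whole tile must be loaded first

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();                 // each step must finish before the next
    }

    if (threadIdx.x == 0)                // thread 0 publishes the block's total
        blockTotals[blockIdx.x] = cache[0];
}

int main()
{
    const int n = 4096, blocks = n / THREADS;

    float *in, *totals;
    cudaMalloc((void**)&in, n * sizeof(float));
    cudaMalloc((void**)&totals, blocks * sizeof(float));

    float ones[n];
    for (int i = 0; i < n; ++i) ones[i] = 1.0f;
    cudaMemcpy(in, ones, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, THREADS>>>(in, totals, n);

    float partial[blocks], sum = 0.0f;
    cudaMemcpy(partial, totals, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < blocks; ++i) sum += partial[i];
    printf("sum = %.0f (expect %d)\n", sum, n);

    cudaFree(in); cudaFree(totals);
    return 0;
}
```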

Atomic Operations
In this module we're going to talk about what atomic operations are and how you'd actually use them. We'll begin with a very simple example of a situation where things break down without the use of atomics, and we'll then discuss exactly what atomic functions are and which ones are provided by CUDA. Then we'll try doing an array sum once again, but this time we'll first do it in a very naïve way, just to see how things fail without the use of atomics, and then we'll use atomics to actually sum all the elements in an array using a single temporary variable. And finally we'll do a more comprehensive example by calculating the value of Pi using a Monte Carlo method.
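
To sketch the failure mode being described (the kernel names and sizes here are mine, not the module's), compare a naive sum, where concurrent read-modify-write updates to a single variable race with each other, against the same sum done with atomicAdd.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Naive version: a data race. '*total += in[i]' is a read-modify-write,
// so concurrent threads overwrite each other's updates and work gets lost.
__global__ void sumRacy(const int* in, int* total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) *total += in[i];          // broken: not atomic
}

// Atomic version: atomicAdd makes each update indivisible.
__global__ void sumAtomic(const int* in, int* total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, in[i]);  // correct, but updates serialize
}

int main()
{
    const int n = 1 << 20;
    int *in, *total, result;
    cudaMalloc((void**)&in, n * sizeof(int));
    cudaMalloc((void**)&total, sizeof(int));

    int* ones = new int[n];
    for (int i = 0; i < n; ++i) ones[i] = 1;
    cudaMemcpy(in, ones, n * sizeof(int), cudaMemcpyHostToDevice);

    cudaMemset(total, 0, sizeof(int));
    sumRacy<<<n / 256, 256>>>(in, total, n);
    cudaMemcpy(&result, total, sizeof(int), cudaMemcpyDeviceToHost);
    printf("naive sum:  %d (should be %d, and usually isn't)\n", result, n);

    cudaMemset(total, 0, sizeof(int));
    sumAtomic<<<n / 256, 256>>>(in, total, n);
    cudaMemcpy(&result, total, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic sum: %d (should be %d)\n", result, n);

    delete[] ones;
    cudaFree(in); cudaFree(total);
    return 0;
}
```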

Events and Streams
Welcome. In this module we're going to discuss the concepts of CUDA events as well as CUDA streams. We'll begin by discussing what events are and what they're used for, and we'll take a look at the API that's used for creating and recording events. We'll then look at a very simple example of adding event recording to an existing project that we did previously. We'll then talk about the idea of pinned, or page-locked, memory; pinned memory is required for us to be able to use streams, so we'll talk about what those are and where they show up in the CUDA API. And we'll finish off this module with two examples: one just to illustrate how to create a single stream and demonstrate how to queue up operations on it, and the other showing how to add another stream to our calculations in an attempt to improve their performance.
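
Here is a small sketch of how those pieces fit together; it is an illustration rather than the module's own code, and the kernel and variable names are invented. Pinned memory is allocated with cudaHostAlloc, work is queued on a stream with cudaMemcpyAsync plus a stream-qualified launch, and events bracket the work to time it.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleAll(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned (page-locked) host memory: required for async copies on a stream.
    float* host;
    cudaHostAlloc((void**)&host, bytes, cudaHostAllocDefault);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* dev;
    cudaMalloc((void**)&dev, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Events bracket the work so we can time it with the GPU's own clock.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, stream);

    // Queue copy -> kernel -> copy on the stream. They execute in order,
    // but asynchronously with respect to the host thread.
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    doubleAll<<<n / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);          // block until everything queued has run

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("took %.3f ms, host[0] = %f\n", ms, host[0]);   // expect 2.0

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}
```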

CUDA in Advanced Scenarios
Welcome. In this module we're going to talk about using CUDA in advanced scenarios, such as dynamic loading of kernels or, for example, multi-GPU programming. So here are some of the things we're going to talk about. We'll begin this module with a brief reminder of what PTX is and what you can do with it, and then we'll actually use PTX, albeit indirectly, in the discussion of the CUDA driver API. We'll then return to a topic that we've already discussed, memory, and look at some of its peculiarities that can be useful for us. And then we'll talk about using CUDA for multi-GPU programming, that is, programming when you have more than one device in your system. And we'll finish it all off with a discussion of the Thrust library.
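
As a loose sketch of the multi-GPU and Thrust portions (the driver API's PTX loading is left to the module itself, and every specific below is an illustrative assumption rather than the course's code), here is how device enumeration, device selection, and a Thrust reduction might look.

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>

int main()
{
    // Multi-GPU programming starts with enumerating the devices present.
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("%d CUDA device(s) found\n", deviceCount);

    for (int d = 0; d < deviceCount; ++d)
    {
        // cudaSetDevice selects the device that subsequent allocations,
        // copies, and kernel launches will target.
        cudaSetDevice(d);
        // ... per-device allocations and launches would go here ...
    }

    // Thrust: an STL-style library that hides raw kernels behind algorithms.
    cudaSetDevice(0);
    thrust::device_vector<int> v(1024);
    thrust::sequence(v.begin(), v.end());             // fill with 0, 1, 2, ...
    int sum = thrust::reduce(v.begin(), v.end());     // reduction runs on the GPU
    printf("sum = %d (expect %d)\n", sum, 1023 * 1024 / 2);

    return 0;
}
```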