Introduction
CPUs and GPUs are both useful, and each has its own place in our toolbox.
In the context of GPU programming, we often refer to the GPU as the device and to the CPU as the host.
Using GPUs to accelerate computation can provide large performance gains.
Using the GPU from Python is not particularly difficult.
Using your GPU with CuPy
CuPy provides GPU-accelerated versions of many NumPy functions.
Always keep both CPU and GPU versions of your code, so that you can compare performance as well as validate the GPU results.
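The pattern above can be sketched as follows. The code is written against an array module `xp` that is either NumPy (the CPU version) or CuPy (the GPU version, if a CUDA-capable GPU and CuPy are available); the names `xp` and `normalize` are illustrative, not part of either library:

```python
import numpy as np

try:
    import cupy as cp   # GPU version; needs CuPy and a CUDA-capable GPU
    xp = cp
except ImportError:
    xp = np             # fall back to the CPU version

def normalize(data):
    # Scale data to zero mean and unit variance.
    # The same source works for NumPy (CPU) and CuPy (GPU) arrays,
    # because CuPy mirrors the NumPy API.
    return (data - xp.mean(data)) / xp.std(data)

data = xp.asarray([1.0, 2.0, 3.0, 4.0])
result = normalize(data)
if xp is not np:
    result = cp.asnumpy(result)   # copy the device array back to the host

# Validate the (possibly GPU) result against a pure-CPU reference
reference = (np.array([1.0, 2.0, 3.0, 4.0]) - 2.5) / np.std([1.0, 2.0, 3.0, 4.0])
assert np.allclose(result, reference)
```

Because both libraries share the same function names, one implementation doubles as the CPU reference for validating and timing the GPU run.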
Accelerate your Python code with Numba
A Better Look at the GPU
Your First GPU Kernel
Precede your kernel definition with the __global__ keyword.
Use the built-in variables threadIdx, blockIdx, gridDim and blockDim to identify each thread.
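A minimal kernel following these two points might look as below. This is a sketch that assumes CuPy and a CUDA-capable GPU, compiling the CUDA C source with `cupy.RawKernel`; the kernel name `vector_add` is illustrative:

```python
import cupy as cp

vector_add = cp.RawKernel(r'''
extern "C" __global__
void vector_add(const float *a, const float *b, float *c, int n)
{
    // Combine block and thread indices into one unique global index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard threads that fall past the end of the array
        c[i] = a[i] + b[i];
}
''', 'vector_add')

n = 1024
a = cp.random.random(n, dtype=cp.float32)
b = cp.random.random(n, dtype=cp.float32)
c = cp.zeros(n, dtype=cp.float32)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block  # round up
vector_add((blocks,), (threads_per_block,), (a, b, c, n))

assert cp.allclose(c, a + b)   # validate against the CuPy CPU-style API
```

Each thread computes exactly one output element, which is why the global index built from blockIdx, blockDim and threadIdx must be bounds-checked against n.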
Registers, Global, and Local Memory
Registers can be used to locally store data and avoid repeated memory operations.
Global memory is the main memory space; it is used to share data between the host and the device.
Local memory is a particular type of memory that can be used to store data that does not fit in registers; it is private to each thread.
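To illustrate the register point, the hypothetical kernel below (a sketch assuming CuPy and a CUDA-capable GPU) reads each input value from global memory once into a local variable, which the compiler can keep in a register and reuse:

```python
import cupy as cp

scale_shift = cp.RawKernel(r'''
extern "C" __global__
void scale_shift(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];            // one global memory read into a register
        out[i] = x * x + 2.0f * x;  // reuse the register instead of re-reading in[i]
    }
}
''', 'scale_shift')
```

Without the local variable, the expression would read in[i] from global memory twice; with it, the repeated accesses become cheap register operations.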
Shared Memory and Synchronization
Shared memory is faster than global memory and local memory.
Shared memory can be used as a user-controlled cache to speed up code.
The size of a shared memory array must be known at compile time if it is allocated inside the kernel.
It is also possible to declare extern shared memory arrays and pass their size at kernel invocation.
Use the __shared__ keyword to allocate memory in the shared memory space.
Use __syncthreads() to wait until shared memory operations are visible to all threads in a block.
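A sketch combining these points, again assuming CuPy and a CUDA-capable GPU: the illustrative kernel `block_reverse` stages its block's chunk in an extern shared array whose size is passed at launch time, and uses __syncthreads() before any thread reads another thread's element:

```python
import cupy as cp

block_reverse = cp.RawKernel(r'''
extern "C" __global__
void block_reverse(const float *in, float *out, int n)
{
    extern __shared__ float tmp[];   // size supplied at kernel invocation
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    if (i < n) tmp[t] = in[i];       // stage the chunk in shared memory
    __syncthreads();                 // all writes to tmp are now visible
    if (i < n) out[i] = tmp[blockDim.x - 1 - t];
}
''', 'block_reverse')

n = 256
a = cp.arange(n, dtype=cp.float32)
out = cp.zeros_like(a)
# shared_mem passes the extern shared array size (in bytes) at launch
block_reverse((1,), (n,), (a, out, n), shared_mem=n * 4)
assert cp.allclose(out, a[::-1])
```

Omitting the __syncthreads() call would let a thread read tmp entries that another thread has not written yet, a classic shared-memory race.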
Constant Memory
Concurrent access to the GPU