Shared Memory and Synchronization
Overview
Teaching: 30 min
Exercises: 25 minQuestions
Is there a way to share data between threads of a same block?
Can threads inside a block wait for other threads?
Objectives
Learn how to share data between threads
Learn how to synchronize threads
So far we looked at how to use CUDA to accelerate the computation, but a common pattern in all the examples we encountered so far is that threads worked in isolation. While having different threads perform the same operation on different data is a good pattern for working with GPUs, there are cases in which threads need to communicate. This communication may be necessary because of the way the algorithm we are trying to implement works, or it may derive from a performance goal we are trying to achieve.
Shared Memory
Shared memory is a CUDA memory space that is shared by all threads in a thread block. In this case shared means that all threads in a thread block can write and read to block-allocated shared memory, and all changes to this memory will be eventually available to all threads in the block.
To allocate an array in shared memory we need to preface the definition with the identifier __shared__
.
Challenge: use of shared memory
Modify the following code to allocate the
temp
array in shared memory.extern "C" __global__ void vector_add(const float * A, const float * B, float * C, const int size) { int item = (blockIdx.x * blockDim.x) + threadIdx.x; float temp[3]; if ( item < size ) { temp[0] = A[item]; temp[1] = B[item]; temp[2] = temp[0] + temp[1]; C[item] = temp[2]; } }
Solution
To use shared memory for the
temp
array add the identifier__shared__
to its definition, like in the following code.extern "C" __global__ void vector_add(const float * A, const float * B, float * C, const int size) { int item = (blockIdx.x * blockDim.x) + threadIdx.x; __shared__ float temp[3]; if ( item < size ) { temp[0] = A[item]; temp[1] = B[item]; temp[2] = temp[0] + temp[1]; C[item] = temp[2]; } }
While syntactically correct, the previous example is functionally wrong.
The reason is that the temp
array is not anymore private to the thread allocating it, but it is now shared by the whole thread block.
Challenge: what is the result of the previous code block?
The previous code example is functionally wrong. Can you guess what the result of its execution will be?
Solution
The result is non deterministic, and definitely not the same as the previous versions of
vector_add
. Threads will overwrite each other temporary values,and there will be no guarantee on which value is visible by each thread.
To fix the previous kernel we should allocate enough shared memory for each thread to store three values, so that each thread has its own section of the shared memory array to work with.
To allocate enough memory we need to replace the constant 3 in __shared__ float temp[3]
with something else.
If we know that each thread block has 1024 threads, we can write something like the following:
__shared__ float temp[3 * 1024];
But we know by experience that having constants in the code is not a scalable and maintainable solution. The problem is that we need to have a constant value if we want to declare a shared memory array, because the compiler needs to know how much memory to allocate.
A solution to this problem is to not specify the size of the array, and allocate the memory somewhere else.
extern __shared__ float temp[];
And then use CuPy to instruct CUDA about how much shared memory, in bytes, each thread block needs.
This can be done by adding the named parameter shared_mem
to the kernel call.
vector_add_gpu((2, 1, 1), (size // 2, 1, 1), (a_gpu, b_gpu, c_gpu, size), shared_mem=((size // 2) * 3 * cupy.dtype(cupy.float32).itemsize))
As you may have noticed, we had to retrieve the size in bytes of the data type cupy.float32
, and this is done with cupy.dtype(cupy.float32).itemsize
.
After these changes, the body of the kernel needs to be modified to use the right indices:
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
int offset = threadIdx.x * 3;
extern __shared__ float temp[];
if ( item < size )
{
temp[offset + 0] = A[item];
temp[offset + 1] = B[item];
temp[offset + 2] = temp[offset + 0] + temp[offset + 1];
C[item] = temp[offset + 2];
}
}
And for completeness, we present the full Python code.
import math
import numpy as np
import cupy
# vector size
size = 2048
# GPU memory allocation
a_gpu = cupy.random.rand(size, dtype=cupy.float32)
b_gpu = cupy.random.rand(size, dtype=cupy.float32)
c_gpu = cupy.zeros(size, dtype=cupy.float32)
gpu_args = (a_gpu, b_gpu, c_gpu, size)
# CPU memory allocation
a_cpu = cupy.asnumpy(a_gpu)
b_cpu = cupy.asnumpy(b_gpu)
c_cpu = np.zeros(size, dtype=np.float32)
# CUDA code
vector_add_cuda_code = r'''
extern "C"
__global__ void vector_add(const float * A, const float * B, float * C, const int size)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
int offset = threadIdx.x * 3;
extern __shared__ float temp[];
if ( item < size )
{
temp[offset + 0] = A[item];
temp[offset + 1] = B[item];
temp[offset + 2] = temp[offset + 0] + temp[offset + 1];
C[item] = temp[offset + 2];
}
}
'''
# compile and execute code
vector_add_gpu = cupy.RawKernel(vector_add_cuda_code, "vector_add")
threads_per_block = 32
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)
vector_add_gpu(grid_size, block_size, gpu_args, shared_mem=(threads_per_block * 3 * cupy.dtype(cupy.float32).itemsize))
# execute Python code and compare results
vector_add(a_cpu, b_cpu, c_cpu, size)
np.allclose(c_cpu, c_gpu)
The code is now correct, although it is still not very useful. We are definitely using shared memory, and we are using it the correct way, but there is no performance gain we achieved by doing so. Actually, we are making our code slower, not faster, because shared memory is slower than registers.
Let us, therefore, work on an example where using shared memory is actually useful. We start again with some Python code.
def histogram(input_array, output_array):
for item in input_array:
output_array[item] = output_array[item] + 1
The histogram
function, as the name suggests, computes the histogram of an array of integers, i.e. counts how many instances of each integer are in input_array
, and writes the count in output_array
.
We can now generate some data and run the code.
input_array = np.random.randint(256, size=2048, dtype=np.int32)
output_array = np.zeros(256, dtype=np.int32)
histogram(input_array, output_array)
Everything as expected. We can now write the equivalent code in CUDA.
extern "C"
__global__ void histogram(const int * input, int * output)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
output[input[item]] = output[input[item]] + 1;
}
Challenge: error in the histogram
If you look at the CUDA
histogram
code, there is a logical error that prevents it to produce the correct results. Can you find it?extern "C" __global__ void histogram(const int * input, int * output) { int item = (blockIdx.x * blockDim.x) + threadIdx.x; output[input[item]] = output[input[item]] + 1; }
Solution
The GPU is a highly parallel device, executing multiple threads at the same time. In the previous code different threads could be updating the same output item at the same time, producing wrong results.
To solve this problem, we need to use a function from the CUDA library named atomicAdd
.
This function ensures that the increment of output_array
happens in an atomic way, so that there are no conflicts in case multiple threads want to update the same item at the same time.
extern "C"
__global__ void histogram(const int * input, int * output)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
atomicAdd(&(output[input[item]]), 1);
}
And the full Python code snippet.
import math
import numpy as np
import cupy
from cupyx.profiler import benchmark
def histogram(input_array, output_array):
for item in input_array:
output_array[item] = output_array[item] + 1
# input size
size = 2**25
# allocate memory on CPU and GPU
input_gpu = cupy.random.randint(256, size=size, dtype=cupy.int32)
input_cpu = cupy.asnumpy(input_gpu)
output_gpu = cupy.zeros(256, dtype=cupy.int32)
output_cpu = cupy.asnumpy(output_gpu)
# CUDA code
histogram_cuda_code = r'''
extern "C"
__global__ void histogram(const int * input, int * output)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
atomicAdd(&(output[input[item]]), 1);
}
'''
# compile and setup CUDA code
histogram_gpu = cupy.RawKernel(histogram_cuda_code, "histogram")
threads_per_block = 256
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)
# check correctness
histogram(input_cpu, output_cpu)
histogram_gpu(grid_size, block_size, (input_gpu, output_gpu))
if np.allclose(output_cpu, output_gpu):
print("Correct results!")
else:
print("Wrong results!")
# measure performance
%timeit -n 1 -r 1 histogram(input_cpu, output_cpu)
execution_gpu = benchmark(histogram_gpu, (grid_size, block_size, (input_gpu, output_gpu)), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")
The CUDA code is now correct, and computes the same result as the Python code; it is also faster than the Python code, as you can see from measuring the execution time.
However, we are accumulating the results directly in global memory, and the more conflicts we have in global memory, the lower the performance of our histogram
will be.
Moreover, the access pattern to the output array is very irregular, being dependent on the content of the input array.
GPUs are designed for very regular computations, and so if we can make the histogram more regular we can hope in a further improvement in performance.
As you may expect, we can improve the memory access pattern by using shared memory.
Challenge: use shared memory to speed up the histogram
Implement a new version of the CUDA
histogram
function that uses shared memory to reduce conflicts in global memory. Modify the following code and follow the suggestions in the comments.extern "C" __global__ void histogram(const int * input, int * output) { int item = (blockIdx.x * blockDim.x) + threadIdx.x; // Declare temporary histogram in shared memory int temp_histogram[256]; // Update the temporary histogram in shared memory atomicAdd(); // Update the global histogram in global memory, using the temporary histogram atomicAdd(); }
Hint: for this exercise, you can safely assume that the size of
output
is the same as the number of threads in a block.Hint:
atomicAdd
can be used on both global and shared memory.Solution
The following code shows one of the possible solutions.
extern "C" __global__ void histogram(const int * input, int * output) { int item = (blockIdx.x * blockDim.x) + threadIdx.x; __shared__ int temp_histogram[256]; atomicAdd(&(temp_histogram[input[item]]), 1); atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]); }
The idea behind this solution is to reduce the expensive conflicts in global memory by having a temporary histogram in shared memory. After a block has finished processing its fraction of the input array, and the local histogram is populated, threads collaborate to update the global histogram. Not only this solution potentially reduces the conflicts in global memory, it also produces a better access pattern because threads read adjacent items of the
input
array, and write to adjacent elements of theoutput
array during the second call toatomicAdd
.
Thread Synchronization
There is still one potentially big issue in the histogram
code we just wrote, and the issue is that shared memory is not coherent without explicit synchronization.
The problem lies in the following two lines of code:
atomicAdd(&(temp_histogram[input[item]]), 1);
atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
In the first line each thread updates one arbitrary position in shared memory, depending on the value of the input, while in the second line each thread reads the element in shared memory corresponding to its thread ID. However, the changes to shared memory are not automatically available to all other threads, and therefore the final result may not be correct.
To solve this issue, we need to explicitly synchronize all threads in a block, so that memory operations are also finalized and visible to all.
To synchronize threads in a block, we use the __syncthreads()
CUDA function.
Moreover, shared memory is not initialized, and the programmer needs to take care of that too.
So we need to first initialize temp_histogram
, wait that all threads are done doing this, perform the computation in shared memory, wait again that all threads are done, and only then update the global array.
extern "C"
__global__ void histogram(const int * input, int * output)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
__shared__ int temp_histogram[256];
// Initialize shared memory and synchronize
temp_histogram[threadIdx.x] = 0;
__syncthreads();
// Compute shared memory histogram and synchronize
atomicAdd(&(temp_histogram[input[item]]), 1);
__syncthreads();
// Update global histogram
atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}
And the full Python code snippet.
import math
import numpy as np
import cupy
from cupyx.profiler import benchmark
def histogram(input_array, output_array):
for item in input_array:
output_array[item] = output_array[item] + 1
# input size
size = 2**25
# allocate memory on CPU and GPU
input_gpu = cupy.random.randint(256, size=size, dtype=cupy.int32)
input_cpu = cupy.asnumpy(input_gpu)
output_gpu = cupy.zeros(256, dtype=cupy.int32)
output_cpu = cupy.asnumpy(output_gpu)
# CUDA code
histogram_cuda_code = r'''
extern "C"
__global__ void histogram(const int * input, int * output)
{
int item = (blockIdx.x * blockDim.x) + threadIdx.x;
__shared__ int temp_histogram[256];
// Initialize shared memory and synchronize
temp_histogram[threadIdx.x] = 0;
__syncthreads();
// Compute shared memory histogram and synchronize
atomicAdd(&(temp_histogram[input[item]]), 1);
__syncthreads();
// Update global histogram
atomicAdd(&(output[threadIdx.x]), temp_histogram[threadIdx.x]);
}
'''
# compile and setup CUDA code
histogram_gpu = cupy.RawKernel(histogram_cuda_code, "histogram")
threads_per_block = 256
grid_size = (int(math.ceil(size / threads_per_block)), 1, 1)
block_size = (threads_per_block, 1, 1)
# check correctness
histogram(input_cpu, output_cpu)
histogram_gpu(grid_size, block_size, (input_gpu, output_gpu))
if np.allclose(output_cpu, output_gpu):
print("Correct results!")
else:
print("Wrong results!")
# measure performance
%timeit -n 1 -r 1 histogram(input_cpu, output_cpu)
execution_gpu = benchmark(histogram_gpu, (grid_size, block_size, (input_gpu, output_gpu)), n_repeat=10)
gpu_avg_time = np.average(execution_gpu.gpu_times)
print(f"{gpu_avg_time:.6f} s")
While both versions of the GPU histogram are correct, the one using shared memory is faster; but how fast? On a NVIDIA Tesla T4 accessed via Google Colab, the shared memory version is ten times faster than the version doing atomic operations on global memory.
Key Points
Shared memory is faster than global memory and local memory
Shared memory can be used as a user-controlled cache to speedup code
Size of shared memory arrays must be known at compile time if allocated inside a thread
It is possible to declare
extern
shared memory arrays and pass the size during kernel invocationUse
__shared__
to allocate memory in the shared memory spaceUse
__syncthreads()
to wait for shared memory operations to be visible to all threads in a block