CUDA
CUDA is a parallel computing platform and application programming interface (API) created by Nvidia. It lets developers use a CUDA-enabled graphics processing unit (GPU) for general-purpose processing, an approach known as GPGPU (General-Purpose computing on Graphics Processing Units).
Why do we need to learn CUDA?
Many existing foundation models are not optimized for the hardware they run on. We need hardware-aware designs (e.g., the FlashAttention series).
PyTorch does not support every operator that emerging models need, so we sometimes have to implement them ourselves.
GPU architecture
Common CUDA optimizations
Hardware-aware algorithms
CUDA extension for PyTorch
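import torch
from torch.utils.cpp_extension import load_inline

# CUDA source: the device kernel plus a host-side wrapper that launches it.
# load_inline binds C++ functions, not __global__ kernels, so Python calls the wrapper.
cuda_source = r"""
__global__ void square_kernel(const float* __restrict__ input,
                              float* __restrict__ output, int size) {
    const int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < size) {  // threads past the end of the data do nothing
        output[index] = input[index] * input[index];
    }
}

torch::Tensor square(torch::Tensor input) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.scalar_type() == torch::kFloat32, "input must be float32");
    auto in = input.contiguous();
    auto output = torch::empty_like(in);
    const int size = static_cast<int>(in.numel());
    const int threads_per_block = 1024;
    // Ceiling division so every element is covered by some thread.
    const int blocks_per_grid = (size + threads_per_block - 1) / threads_per_block;
    square_kernel<<<blocks_per_grid, threads_per_block>>>(
        in.data_ptr<float>(), output.data_ptr<float>(), size);
    return output;
}
"""

# C++ source: only the declaration of the wrapper, used to generate the Python binding.
cpp_source = "torch::Tensor square(torch::Tensor input);"

module = load_inline(
    name='square',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['square'],
)

# Example usage (requires a CUDA-capable GPU and the CUDA toolkit; there is no CPU fallback)
input_tensor = torch.randn(100, device='cuda')
output_tensor = module.square(input_tensor)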
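To make the launch arithmetic concrete, here is a small pure-Python sketch that emulates the kernel's 1D indexing on the CPU. It only illustrates how `blockIdx.x * blockDim.x + threadIdx.x` maps threads to elements and why the bounds check is needed; it is not how CUDA schedules threads, and the helper names `launch_config` and `emulate_square` are hypothetical, chosen just for this example.

```python
# Pure-Python emulation of the 1D launch geometry used by square_kernel.
# launch_config and emulate_square are hypothetical names for illustration only.

def launch_config(size, threads_per_block=1024):
    """Return (blocks_per_grid, threads_per_block) covering `size` elements."""
    # Ceiling division: round up so the last partial block is still launched.
    blocks_per_grid = (size + threads_per_block - 1) // threads_per_block
    return blocks_per_grid, threads_per_block


def emulate_square(values, threads_per_block=1024):
    """Run the kernel body once per (block, thread) pair, sequentially on the CPU."""
    size = len(values)
    blocks_per_grid, block_dim = launch_config(size, threads_per_block)
    output = [0.0] * size
    idle_threads = 0
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(block_dim):
            index = block_idx * block_dim + thread_idx  # same formula as the kernel
            if index < size:
                output[index] = values[index] * values[index]
            else:
                idle_threads += 1  # launched, but skipped by the bounds check
    return output, idle_threads


if __name__ == "__main__":
    out, idle = emulate_square([float(i) for i in range(100)])
    print(launch_config(100), idle)  # prints "(1, 1024) 924"
```

For the 100-element example above, a single block of 1024 threads is launched and 924 of those threads fall past the end of the data, which is exactly the case the `index < size` check handles.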
References
MIT 6.S096, Jan 2014. “Effective Programming In C And C++”.
Stanford CS149, Fall 2023. “Parallel Computing”.
MIT 6.5940, Fall 2023. “TinyEngine and Parallel Processing”.
MIT Han Lab. “Parallel Computing Tutorial”.
MIT 6.5940, Fall 2024. “Lab 5: Optimize LLM on Edge Devices”.
Tinkerd. “Writing CUDA Kernels for PyTorch” Tech Blog.
Richard Zou. “Custom C++ and CUDA Extensions” PyTorch Tutorial.
Georgi Gerganov. “llama.cpp: LLM inference in C/C++” GitHub repo.
Christopher Ré. “Systems for Foundation Models, and Foundation Models for Systems” NeurIPS 2023 invited talk.
NVIDIA. “CUDA C++ Programming Guide” NVIDIA Documentation.