CUDA

CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows developers to use a CUDA-enabled graphics processing unit (GPU) for general-purpose processing, an approach known as GPGPU (general-purpose computing on graphics processing units).

Why do we need to learn CUDA?

Many existing foundation models are not designed with the underlying hardware in mind, so we need hardware-aware model designs (e.g., the FlashAttention series).
PyTorch does not support every operator that emerging models need, so we have to implement the missing ones ourselves.

GPU architecture

Common CUDA optimizations

Hardware-aware algorithms

CUDA extension for PyTorch

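import torch
from torch.utils.cpp_extension import load_inline

# CUDA source: the kernel itself plus a C++ launcher function.
# load_inline can only bind ordinary C++ functions, not __global__ kernels,
# so Python calls the launcher and the launcher starts the kernel.
cuda_source = """
__global__ void square_kernel(const float* __restrict__ input, float* __restrict__ output, int size) {
    // One thread per element; the guard covers the partially filled last block.
    const int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < size) {
        output[index] = input[index] * input[index];
    }
}

torch::Tensor square(torch::Tensor input) {
    // The kernel reads raw float pointers, so reject anything else up front.
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
    TORCH_CHECK(input.scalar_type() == at::kFloat, "input must be float32");

    auto output = torch::empty_like(input);
    const int size = static_cast<int>(input.numel());
    const int threads_per_block = 1024;
    // Ceiling division: enough blocks to cover every element.
    const int blocks_per_grid = (size + threads_per_block - 1) / threads_per_block;
    if (blocks_per_grid > 0) {
        square_kernel<<<blocks_per_grid, threads_per_block>>>(
            input.data_ptr<float>(), output.data_ptr<float>(), size);
    }
    return output;
}
"""

# C++ declaration of the launcher, used to generate the Python binding.
cpp_source = "torch::Tensor square(torch::Tensor input);"

module = load_inline(
    name='square',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['square'],
)

def square(input):
    return module.square(input)

# Example usage (requires a CUDA-capable GPU and the CUDA toolkit)
input_tensor = torch.randn(100, device='cuda')
output_tensor = square(input_tensor)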
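
The launch configuration and index arithmetic in the example above can be checked on the CPU without a GPU. The following is a minimal sketch in plain Python, not part of the extension: `launch_config` and `simulate_square` are hypothetical helper names introduced only for illustration. The nested loops stand in for the blocks and threads that the GPU would run in parallel.

```python
def launch_config(size, threads_per_block=1024):
    """Return (blocks_per_grid, threads_per_block) using ceiling division."""
    blocks_per_grid = (size + threads_per_block - 1) // threads_per_block
    return blocks_per_grid, threads_per_block


def simulate_square(values, threads_per_block=1024):
    """Sequential CPU simulation of the kernel (hypothetical helper)."""
    size = len(values)
    blocks_per_grid, block_dim = launch_config(size, threads_per_block)
    output = [0.0] * size
    for block_idx in range(blocks_per_grid):       # plays the role of blockIdx.x
        for thread_idx in range(block_dim):        # plays the role of threadIdx.x
            index = block_idx * block_dim + thread_idx
            if index < size:                       # same bounds guard as the kernel
                output[index] = values[index] * values[index]
    return output


print(launch_config(100))    # (1, 1024)
print(launch_config(2049))   # (3, 1024)
print(simulate_square([1.0, 2.0, 3.0], threads_per_block=2))  # [1.0, 4.0, 9.0]
```

With 3 elements and 2 threads per block, two blocks are launched and the fourth thread (index 3) does nothing. This is why the `index < size` guard is needed whenever the element count is not a multiple of the block size.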
