GPU Acceleration with CUDA
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform. In PyTorch, it allows tensors and neural networks to execute on GPUs, providing the parallel computing capabilities and memory optimizations that accelerate deep learning operations.
Syntax
```py
torch.cuda.is_available()  # Check CUDA availability

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model.to(device)   # Move model to GPU
tensor.to(device)  # Move tensor to GPU
```
- `cuda:N`: Specifies GPU device number `N` (0-indexed).
- `device`: Object representing the compute device (GPU/CPU).
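For example, device objects for the CPU and the first GPU can be created directly (a minimal sketch):

```py
import torch

cpu = torch.device("cpu")
gpu0 = torch.device("cuda:0")  # First GPU (0-indexed)

print(cpu)   # cpu
print(gpu0)  # cuda:0
```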
Memory Hierarchy
- CUDA operates with distinct memory spaces: host (CPU) and device (GPU) memory.
- Data must be explicitly transferred between these spaces via CUDA memory operations, as shown in the sketch after this list.
- GPU memory types include global memory, shared memory, and registers, each with different access speeds and capacities.
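A minimal sketch of moving data between these memory spaces in PyTorch (falling back to CPU when no GPU is present):

```py
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Tensors are allocated in host (CPU) memory by default
host_tensor = torch.randn(1024, 1024)

# .to() explicitly copies the data into device (GPU) global memory
device_tensor = host_tensor.to(device)

# .cpu() copies it back into host memory
back_on_host = device_tensor.cpu()

if torch.cuda.is_available():
    # Bytes currently allocated in GPU global memory by PyTorch
    print(torch.cuda.memory_allocated(device))
```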
Parallel Processing
- CUDA organizes computation into grids of thread blocks.
- Thread blocks contain multiple threads that execute in parallel.
- Warps are groups of 32 threads that execute simultaneously on NVIDIA GPUs.
- Thread synchronization and coordination are crucial for correct parallel execution.
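Because kernel launches are asynchronous with respect to the host, explicit synchronization is needed when timing parallel GPU work. A minimal sketch, assuming a CUDA device is present:

```py
import time
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)

    start = time.time()
    c = a @ b  # Launches a kernel executed by many GPU threads in parallel
    torch.cuda.synchronize()  # Block until all queued kernels have finished
    print(f"Matrix multiply took {time.time() - start:.4f} s")
```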
Memory Transfer Optimization
- Asynchronous data transfer using CUDA streams reduces overhead (see the sketch after this list).
- Pinned memory enables faster CPU-GPU transfers.
- Coalesced memory access patterns improve memory bandwidth utilization.
- Memory prefetching can hide transfer latency.
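A minimal sketch combining pinned memory with an asynchronous copy on a dedicated CUDA stream (tensor size and names are illustrative):

```py
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")

    # Pinned (page-locked) host memory enables faster, asynchronous copies
    host_tensor = torch.randn(1_000_000).pin_memory()

    # Issue the transfer on its own stream so it can overlap other work
    copy_stream = torch.cuda.Stream()
    with torch.cuda.stream(copy_stream):
        device_tensor = host_tensor.to(device, non_blocking=True)

    # Make the default stream wait for the copy before using the tensor
    torch.cuda.current_stream().wait_stream(copy_stream)
    print(device_tensor.device)  # cuda:0
```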
Compute Architecture
- Each GPU contains multiple Streaming Multiprocessors (SMs).
- SMs manage resources like registers, shared memory, and cache.
- CUDA cores within SMs execute arithmetic operations in parallel.
- Different GPU architectures (compute capabilities) support varying features.
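These properties can be inspected from PyTorch. A minimal sketch for device 0 (field values vary by GPU):

```py
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name)                       # GPU model
    print(props.multi_processor_count)      # Number of SMs
    print(f"{props.major}.{props.minor}")   # Compute capability
    print(props.total_memory // (1024**2))  # Global memory in MiB
```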
Example
The following example demonstrates moving a linear model and an input tensor to the GPU for accelerated computation:
```py
import torch
import torch.nn as nn

# Create model and sample data
model = nn.Linear(10, 1)
input_data = torch.randn(100, 10)

# Move to GPU
device = torch.device("cuda:0")
model.to(device)
input_data = input_data.to(device)

# Forward pass
output = model(input_data)
print(f"Output tensor device: {output.device}")
```
The output of the above code will be:

```
Output tensor device: cuda:0
```
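Note that this example assumes a CUDA-capable GPU: moving data to `torch.device("cuda:0")` raises a runtime error on CPU-only machines, so the `torch.cuda.is_available()` check from the syntax section should be used as a fallback in portable code.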