========== PyTorch ========== PyTorch is an open-source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing, primarily developed by Facebook's AI Research lab. It is free and open-source software released under the Modified BSD license. Useful Links: - `ProjectPage Template (Nerfies) `_ - `Pytorch Lightning code guideline for conferences `_ - `PyTorch Experiments Template (TorchRecipes) `_ - `Model Graph Visualizer and Debugger (Model Explorer) `_ Model visualization ------------------- ``Model Explorer`` provides a web-based interface for visualizing and debugging PyTorch models. You can use the following code to visualize any PyTorch model. .. code-block:: python # pip install ai-edge-model-explorer import model_explorer import torch import torchvision # Prepare a PyTorch model and its inputs. model = torchvision.models.mobilenet_v2().eval() inputs = (torch.rand([1, 3, 224, 224]),) ep = torch.export.export(model, inputs) # Visualize. model_explorer.visualize_pytorch('mobilenet', exported_program=ep) torch.compile ----------------- ``torch.compile`` is a new feature in PyTorch that allows users to compile their models for faster training and inference. In practice, it can speed up model inference at least 30%. .. figure:: https://pytorch.org/assets/images/pytorch-2.0-img4.jpg :align: center :alt: Ray Cluster Architecture ``torch.compile`` workflow The process begins with ``torch.compile`` taking a PyTorch model as input. It then utilizes ``TorchDynamo`` and ``AOTAutograd`` to generate execution graphs for both the forward and backward passes. Next, each operator within the graph is mapped to a manually optimized operator from either ``ATen`` or ``Prim IR``. Finally, ``TorchInductor`` compiles the generated graphs and produces execution Triton code for the GPU and C++/OpenMP code for the CPU. Use ``torch.compile`` is simple. Just wrap your model with ``torch.compile`` and it will automatically compile your model for faster training and inference. .. code-block:: python # option 1: using torch.compile import torch mod = MyModule() opt_mod = torch.compile(mod) # option 2: using @torch.compile warp your code t1 = torch.randn(10, 10) t2 = torch.randn(10, 10) @torch.compile def opt_foo2(x, y): a = torch.sin(x) b = torch.cos(y) return a + b print(opt_foo2(t1, t2)) ``torch.compile`` has integrated with `Torch-TensorRT `_ and enables users to compile their models with ``TensorRT`` backend for faster inference. .. code-block:: python # pip install torch-tensorrt import torch import timeit import torch_tensorrt model = MyModel().eval().cuda() # define your model here x = torch.randn((1, 3, 224, 224)).cuda() # define what the inputs to the model will look like optimized_model = torch.compile(model, backend="tensorrt") optimized_model(x) # compiled on first run optimized_model(x) # this will be fast! # Tesla V100 32GB # >>> timeit.timeit('model(x)', number=10, globals=globals()) # 0.10089640901423991 # >>> timeit.timeit('optimized_model(x)', number=10, globals=globals()) # 0.04496749211102724 PyTorch profiling ---------------------- Use PyTorch Profile ^^^^^^^^^^^^^^^^^^^^ PyTorch Profile provides a fine-grained view of the performance of your PyTorch code. It can help you identify bottlenecks (e.g., slow operations) and optimize your code accordingly. .. code-block:: python import torch import torchvision.models as models from torch.profiler import profile, record_function, ProfilerActivity model = models.resnet18() inputs = torch.randn(5, 3, 224, 224) # ProfilerActivity.CPU, ProfilerActivity.CUDA with profile(activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True) as prof: model(inputs) # self_cpu_time_total, self_cuda_time_total, self_cpu_memory_usage print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10)) You can also use ``pytorch_benchmark`` to profile the whole inference workload. .. code-block:: python # pip install pytorch-benchmark from pytorch_benchmark import benchmark model = models.resnet18().to("cuda") inputs = torch.randn(8, 3, 224, 224).to("cuda") results = benchmark(model, inputs, num_runs=100) Sample results: .. code-block:: bash {'machine_info': {'system': {'system': 'Linux', 'node': 'ubuntu20', 'release': '5.4.0-200-generic'}, 'cpu': {'model': 'Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz', 'architecture': 'x86_64', 'cores': {'physical': 9, 'total': 18}, 'frequency': '0.00 GHz'}, 'memory': {'total': '57.15 GB', 'used': '9.17 GB', 'available': '47.27 GB'}, 'gpus': [{'name': 'Tesla V100S-PCIE-32GB', 'memory': '32768.0 MB'}, {'name': 'Tesla V100S-PCIE-32GB', 'memory': '32768.0 MB'}]}, 'device': 'cuda', 'params': 11689512, 'flops': 1822177768, 'timing': {'batch_size_1': {'on_device_inference': {'metrics': {'batches_per_second_mean': -0.3533214991294893, 'batches_per_second_std': 0.024314445753960502, 'batches_per_second_min': -0.3696649900451516, 'batches_per_second_max': -0.1583176335697835, 'seconds_per_batch_mean': -2.8545052862167357, 'seconds_per_batch_std': 0.36834380372350745, 'seconds_per_batch_min': -6.316415786743164, 'seconds_per_batch_max': -2.7051520347595215}, 'human_readable': {'batches_per_second': '-0.35 +/- 0.02 [-0.37, -0.16]', 'batch_latency': '-2854505.286 us +/- 368.344 ms [-6316415.787 us, -2705152.035 us]'}}, 'cpu_to_gpu': {'metrics': {'batches_per_second_mean': 3642.634925121181, 'batches_per_second_std': 290.7311815052623, ... 'max_inference_bytes': 165828608, 'post_inference_bytes': 108468224, 'pre_inference': '103.44 MB', 'max_inference': '158.15 MB', 'post_inference': '103.44 MB'}}} Use NVIDIA Nsight Systems ^^^^^^^^^^^^^^^^^^^^^^^^^ NVIDIA Nsight Systems provides a **timeline** view of your PyTorch code, allowing you to visualize the performance of your model and identify bottlenecks. .. code-block:: bash # test_nsys.py import torch import torchvision.models as models #from torch.profiler import profile, record_function, ProfilerActivity torch.cuda.nvtx.range_push("model") model = models.resnet18(pretrained=True).cuda() torch.cuda.nvtx.range_pop() torch.cuda.nvtx.range_push("inputs") inputs = torch.randn(1, 3, 224, 224).cuda() torch.cuda.nvtx.range_pop() model.eval() torch.cuda.nvtx.range_push("forward") with torch.no_grad(): for i in range(30): torch.cuda.nvtx.range_push(f"iteration {i}") model(inputs) torch.cuda.nvtx.range_pop() torch.cuda.nvtx.range_pop() Execute the code with ``nsys``: .. code-block:: bash nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o nsight_report -f true -x true python test_nsys.py You can view the results in the NVIDIA Nsight Systems GUI. .. figure:: ./images/nsys.png :align: center :alt: Ray Cluster Architecture Nsys example As illustrated in the figure above, the first inference iteration is slow due to the warmup phase (e.g., allocating GPU resource via ``cudaFree``). The subsequent iterations are faster. Use PyTorch Lightning ---------------------- `PyTorch Lightning `_ provides a lightweight PyTorch wrapper to help researchers and practitioners streamline their code and make it more readable and maintainable. Define the training workflow. Here's a toy example: .. code-block:: python # main.py # ! pip install torchvision import torch, torch.nn as nn, torch.utils.data as data, torchvision as tv, torch.nn.functional as F import lightning as L from lightning import loggers # -------------------------------- # Step 1: Define a LightningModule # -------------------------------- # A LightningModule (nn.Module subclass) defines a full *system* # (ie: an LLM, diffusion model, autoencoder, or simple image classifier). class LitAutoEncoder(L.LightningModule): def __init__(self): super().__init__() self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3)) self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28)) def forward(self, x): # in lightning, forward defines the prediction/inference actions embedding = self.encoder(x) return embedding def training_step(self, batch, batch_idx): # training_step defines the train loop. It is independent of forward x, _ = batch x = x.view(x.size(0), -1) z = self.encoder(x) x_hat = self.decoder(z) loss = F.mse_loss(x_hat, x) self.log("train_loss", loss) return loss def validation_step(self, batch, batch_idx): # this is the validation loop x, _ = batch x = x.view(x.size(0), -1) z = self.encoder(x) x_hat = self.decoder(z) val_loss = F.mse_loss(x_hat, x) self.log("val_loss", val_loss) def test_step(self, batch, batch_idx): # this is the test loop x, _ = batch x = x.view(x.size(0), -1) z = self.encoder(x) x_hat = self.decoder(z) test_loss = F.mse_loss(x_hat, x) self.log("test_loss", test_loss) def configure_optimizers(self): optimizer = torch.optim.Adam(self.parameters(), lr=1e-3) return optimizer # ------------------- # Step 2: Define data # ------------------- dataset = tv.datasets.MNIST(".", download=True, transform=tv.transforms.ToTensor()) train, val = data.random_split(dataset, [55000, 5000]) # ------------------- # Step 3: Train # ------------------- autoencoder = LitAutoEncoder() trainer = L.Trainer(accelerator="gpu", devices=8, logger=TensorBoardLogger("logs/")) # trainer.test(model, dataloaders=DataLoader(test_set)) trainer.fit(autoencoder, data.DataLoader(train), data.DataLoader(val)) Run the model on your terminal .. code-block:: bash pip install torchvision python main.py Export to torchscript (JIT) .. code-block:: python # torchscript autoencoder = LitAutoEncoder() torch.jit.save(autoencoder.to_torchscript(), "model.pt") Export to ONNX .. code-block:: python # onnx with tempfile.NamedTemporaryFile(suffix=".onnx", delete=False) as tmpfile: autoencoder = LitAutoEncoder() input_sample = torch.randn((1, 64)) autoencoder.to_onnx(tmpfile.name, input_sample, export_params=True) os.path.isfile(tmpfile.name) Develop a reusable datamodule .. code-block:: python import lightning as L from torch.utils.data import random_split, DataLoader # Note - you must have torchvision installed for this example from torchvision.datasets import MNIST from torchvision import transforms class MNISTDataModule(L.LightningDataModule): def __init__(self, data_dir: str = "./"): super().__init__() self.data_dir = data_dir self.transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]) def prepare_data(self): # download MNIST(self.data_dir, train=True, download=True) MNIST(self.data_dir, train=False, download=True) def setup(self, stage: str): # Assign train/val datasets for use in dataloaders if stage == "fit": mnist_full = MNIST(self.data_dir, train=True, transform=self.transform) self.mnist_train, self.mnist_val = random_split( mnist_full, [55000, 5000], generator=torch.Generator().manual_seed(42) ) # Assign test dataset for use in dataloader(s) if stage == "test": self.mnist_test = MNIST(self.data_dir, train=False, transform=self.transform) if stage == "predict": self.mnist_predict = MNIST(self.data_dir, train=False, transform=self.transform) def train_dataloader(self): return DataLoader(self.mnist_train, batch_size=32) def val_dataloader(self): return DataLoader(self.mnist_val, batch_size=32) def test_dataloader(self): return DataLoader(self.mnist_test, batch_size=32) def predict_dataloader(self): return DataLoader(self.mnist_predict, batch_size=32) Use the datamodule .. code-block:: python dm = MNISTDataModule() model = Model() trainer.fit(model, datamodule=dm) trainer.test(datamodule=dm) trainer.validate(datamodule=dm) trainer.predict(datamodule=dm) Find training loop bottlenecks .. code-block:: python trainer = Trainer(profiler="simple") .. code-block:: bash FIT Profiler Report ------------------------------------------------------------------------------------------- | Action | Mean duration (s) | Total time (s) | ------------------------------------------------------------------------------------------- | [LightningModule]BoringModel.prepare_data | 10.0001 | 20.00 | | run_training_epoch | 6.1558 | 6.1558 | | run_training_batch | 0.0022506 | 0.015754 | | [LightningModule]BoringModel.optimizer_step | 0.0017477 | 0.012234 | | [LightningModule]BoringModel.val_dataloader | 0.00024388 | 0.00024388 | | on_train_batch_start | 0.00014637 | 0.0010246 | | [LightningModule]BoringModel.teardown | 2.15e-06 | 2.15e-06 | | [LightningModule]BoringModel.on_train_start | 1.644e-06 | 1.644e-06 | | [LightningModule]BoringModel.on_train_end | 1.516e-06 | 1.516e-06 | | [LightningModule]BoringModel.on_fit_end | 1.426e-06 | 1.426e-06 | | [LightningModule]BoringModel.setup | 1.403e-06 | 1.403e-06 | | [LightningModule]BoringModel.on_fit_start | 1.226e-06 | 1.226e-06 | -------------------------------------------------------------------------------------------