Typically, a deep learning model is executed on a dedicated GPU accelerator using input data batches it receives from a CPU host. Ideally, the GPU, the more expensive resource, should be maximally utilized, with minimal periods of idle time. In particular, this means that whenever the GPU completes its execution on a batch, the next batch should be "ripe and ready" for processing. When this does not happen, the GPU idles while waiting for input data, a common performance bottleneck often referred to as GPU starvation.
In previous posts (e.g., see A Caching Strategy for Identifying Bottlenecks on the Data Input Pipeline), we discussed common causes of this issue, including: inefficient storage retrieval, CPU resource exhaustion, and host-to-device transfer bottlenecks. In this post, we zoom in on data-transfer bottlenecks and revisit their identification and resolution, this time with the help of NVIDIA Nsight™ Systems (nsys), a performance profiler designed for analyzing the system-wide activity of workloads running on NVIDIA GPUs.
NVIDIA Nsight vs. PyTorch Profiler
Readers familiar with our work may be surprised at the mention of the NVIDIA Nsight profiler rather than PyTorch Profiler. In our previous posts we have advocated strongly for the use of PyTorch Profiler in AI/ML model development as a tool for identifying and optimizing runtime performance. Time and again, we have demonstrated its application to a wide variety of performance issues. Its use does not require any special installations and it can be run without special OS permissions. The NVIDIA Nsight profiler, on the other hand, requires a dedicated system setup (or a dedicated NVIDIA container) and, for some of its features, elevated permissions, making it less accessible and more complicated to use than PyTorch Profiler.
The two profilers differ in their focus: PyTorch Profiler is a framework profiler, tightly coupled with PyTorch and heavily focused on how models use the PyTorch software stack and supporting libraries. The NVIDIA Nsight profiler is a system-level profiler; it does not know the details of the model being run or which framework is being used, but rather how the components of the entire system are being used and utilized. While PyTorch Profiler excels at tracing the low-level operations of a PyTorch model execution, nsys provides a detailed view of the activities of the entire system (GPU hardware, CUDA streams, OS interrupts, network, PCIe, etc.). For many performance issues, PyTorch Profiler is sufficient for identifying and fixing the source of the bottleneck; but some situations call for the nsys profiler, the "big guns", to derive deeper insights into the inner workings of the underlying system.
In this post we intend to demonstrate some of the unique capabilities of the nsys profiler and their application to the common data-transfer bottleneck.
Outline
To facilitate our discussion, we will define a toy ML workload with a data-transfer performance bottleneck and proceed to introduce a number of successive optimizations in an attempt to solve it. Throughout the process, we will use the nsys profiler to analyze system performance and assess the impact of the code changes.
Setup
We will run our experiments on an Amazon EC2 g6e.2xlarge instance with an NVIDIA L40S GPU, running an AWS Deep Learning (Ubuntu 24.04) AMI with PyTorch (2.8). To install the nsys-cli profiler (version 2025.6.1) we follow the official NVIDIA guidelines:
wget https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_6/NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb
sudo apt install ./NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb
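To verify that the CLI is installed and on the path, you can check its version (a simple sanity check; the reported version should match the package installed above):
nsys --version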
The NVIDIA Tools Extension (NVTX) library allows us to annotate our code with human-readable labels to increase the readability and comprehension of the performance trace. While PyTorch offers built-in NVTX support via its torch.cuda.nvtx APIs, we will use the standalone nvtx package (version 0.2.14), which supports color-coding the trace timeline for better visual analysis:
pip install nvtx
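As a quick illustration of the API (a minimal sketch; load_next_batch is a hypothetical placeholder for your own code), ranges can be opened as context managers with a label and an optional color:
import nvtx

with nvtx.annotate("load batch", color="orange"):
    batch = load_next_batch()  # hypothetical placeholder; any work inside the block is captured in the NVTX range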
Disclaimers
The code we will share is intended for demonstrative purposes; please do not rely on its correctness or optimality. Please do not interpret our use of any library, tool, or platform as an endorsement of its use. The impact of the optimizations we will cover can vary greatly based on the details of the model and the runtime environment. Please be sure to assess their effect on your own use case before adopting them.
Many thanks to Yitzhak Levi and Gilad Wasserman for their contributions to this post.
A Toy PyTorch Model
We introduce a training script deliberately designed to contain a bottleneck in the data input pipeline.
In the code block below we define a simple image classification model with a ResNet-18 backbone.
import time, torch, torchvision
DEVICE = "cuda"
model = torchvision.models.resnet18().to(DEVICE).train()
optimizer = torch.optim.Adam(model.parameters())
Next, we define a synthetic dataset which we will use to train our toy model.
from torch.utils.data import Dataset, DataLoader
WARMUP_STEPS = 10
PROFILE_STEPS = 3
COOLDOWN_STEPS = 1
TOTAL_STEPS = WARMUP_STEPS + PROFILE_STEPS + COOLDOWN_STEPS
BATCH_SIZE = 64
TOTAL_SAMPLES = TOTAL_STEPS * BATCH_SIZE
IMG_SIZE = 512
# A synthetic Dataset with random images and labels
class FakeDataset(Dataset):
def __len__(self):
return TOTAL_SAMPLES
def __getitem__(self, index):
img = torch.randn((3, IMG_SIZE, IMG_SIZE))
label = torch.tensor(index % 10)
return img, label
train_loader = DataLoader(
FakeDataset(),
batch_size=BATCH_SIZE
)
Finally, we define a standard training step programmed to run the nsys profiler for 3 steps using the torch.cuda.profiler.start and stop commands, intended for use in conjunction with the nsys CLI. We highlight the components of the training step using the nvtx.annotate utility. Please refer to the official documentation for more details on profiling with nsys in PyTorch.
import nvtx
from torch.cuda import profiler
def copy_data(batch):
    data, targets = batch
    data_gpu = data.to(DEVICE)
    targets_gpu = targets.to(DEVICE)
    return data_gpu, targets_gpu

def compute_step(model, batch, optimizer):
    data, targets = batch
    output = model(data)
    loss = torch.nn.functional.cross_entropy(output, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
data_iter = iter(train_loader)
for i in range(TOTAL_STEPS):
    if i == WARMUP_STEPS:
        # start nsys profiler
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        profiler.start()
    elif i == WARMUP_STEPS + PROFILE_STEPS:
        # stop nsys profiler
        torch.cuda.synchronize()
        profiler.stop()
        end_time = time.perf_counter()
    with nvtx.annotate(f"Batch {i}", color="blue"):
        with nvtx.annotate("get batch", color="red"):
            batch = next(data_iter)
        with nvtx.annotate("copy batch", color="yellow"):
            batch = copy_data(batch)
        with nvtx.annotate("Compute", color="green"):
            compute_step(model, batch, optimizer)
total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")
We run our script using the cudaProfilerApi option to start and stop the profiler programmatically. Please see the official documentation for full details on profiling from the nsys CLI.
nsys profile \
    --capture-range=cudaProfilerApi \
    --trace=cuda,nvtx,osrt \
    --output=baseline \
    python train.py
This results in a baseline.nsys-rep trace file that we copy over to our development machine for analysis.
In order to draw a comparison to PyTorch Profiler, we define an alternate training loop programmed with PyTorch Profiler and annotated with the torch.profiler.record_function utility:
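Alternatively, if a GUI is not readily available, a textual summary of the trace can be generated directly from the command line (a minimal example; see the nsys documentation for the full set of report options):
nsys stats baseline.nsys-rep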
from torch.profiler import (
profile, record_function, schedule, tensorboard_trace_handler
)
with profile(
schedule=schedule(wait=0, warmup=WARMUP_STEPS,
             active=PROFILE_STEPS, repeat=1),
on_trace_ready=tensorboard_trace_handler('./baseline'),
record_shapes=True,
with_stack=True
) as prof:
    for i in range(TOTAL_STEPS):
        with record_function("get batch"):
            batch = next(data_iter)
        with record_function("copy batch"):
            batch = copy_data(batch)
        with record_function("compute"):
            compute_step(model, batch, optimizer)
        prof.step()
The throughput of our baseline experiment is 2.97 steps per second. In the next sections we will use the profile traces to identify performance bottlenecks in our training step and attempt to improve on this result.
Baseline Performance Analysis
To analyze the resultant nsys trace file, we open it in the Nsight Systems GUI application. In the image below we zoom in on the timeline of two of the training steps captured by the profiler:

The trace contains a wealth of information, only a subset of which we will touch on in this post. Please see the nsys documentation for more functionalities and features.
The timeline is divided into two parts: the CUDA section, which reports GPU activity, and the threads section, which reports CPU activity. The CUDA section makes a clear distinction between GPU kernel (compute) activity (90.9%) and memory activity (9.1%). The top bars in each section report the utilization of each of the resources, and both sections include an NVTX row with the colored annotations we included in our training step. We note the following observations:
- The GPU is idle for roughly 50% of each training step. This can be seen from the portion of time taken by each batch (in blue) in the GPU NVTX bar and from the large blocks of whitespace in between them.
- The GPU activity for each batch begins immediately after the "get batch" activity has completed on the CPU. It starts with the host-to-device memory copy, marked in light green, and continues with the kernel computations, marked in light blue.
- Once the CPU has launched the GPU memory and compute commands for batch N, it proceeds to the next batch in the training loop, leading to a partial overlap of batch N+1 on the CPU with batch N on the GPU.
- The vast majority of the CPU thread is spent on the "get batch" activity. This constitutes the primary bottleneck in our baseline experiment.
The profiling trace points to a clear culprit: the dataloader. By default, PyTorch performs single-process data loading, in which a single CPU process is used to load the next input data batch, copy it to the GPU, and launch the compute kernels, all in a sequential manner. This typically results in severe under-utilization of the CPU resources by: 1) limiting data loading to just a single process, and 2) making the loading of the next batch contingent on the completion of the CPU processing (i.e., kernel loading) of the previous batch. Our irresponsible use of our CPU resources has resulted in our GPU being starved for input data.
The same conclusion could have been reached using the PyTorch Profiler trace shown below:
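For reference, this is exactly the configuration our baseline uses: DataLoader defaults to num_workers=0, so our earlier declaration is equivalent to the following (shown only to make the default explicit):
train_loader = DataLoader(
    FakeDataset(),
    batch_size=BATCH_SIZE,
    num_workers=0  # the default: data loading runs in the main process
)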

Here too, we can see long periods of GPU under-utilization that are attributable to the long "get batch" blocks on the CPU side.
Optimization 1: Multi-Process Data Loading
The first step is to modify the data input pipeline to use multi-process data loading. We set the number of workers to match the 8 vCPUs available on our Amazon EC2 g6e.2xlarge instance. In a real-world scenario, this value should be tuned for optimal throughput:
NUM_WORKERS = 8
train_loader = DataLoader(
FakeDataset(),
batch_size=BATCH_SIZE,
num_workers=NUM_WORKERS
)
Following this change our throughput jumps to 4.81 steps per second, a 62% improvement over our baseline result. The corresponding nsys profiler trace is shown below:

Note that the red "get batch" segment has become just a tiny sliver of each step in the NVTX bar. Instead, the yellow "copy batch" block now takes center stage. Thanks to our use of multi-process data loading, there is now always a new batch ready for processing. But can we do better?
Taking a closer look at the GPU section, we see that there is still a significant portion (~290 milliseconds) of idle time between the memory operation and the kernel compute. This idle time is perfectly aligned with an "munmap" operation in the OS runtime bar. The "munmap" block is a CPU-side memory cleanup operation performed just after the CUDA memory copy is complete. It occurs at the tail end of the long yellow "copy batch" operation. The compute kernels are launched onto the GPU only after the memory cleanup has completed. This is a clear pattern of a synchronous host-to-device memory copy: the CPU cannot proceed with kernel loading until the data copy operation has fully completed, and the GPU remains idle until the CPU loads the kernels.
The PyTorch Profiler trace shows the same GPU idle time but does not provide the same "munmap" hint. This is our first example of the advantage of the system-wide visibility of the nsys profiler.

With our finding of the data-copy performance bottleneck in hand, we proceed to our next optimization.
Optimization 2: Asynchronous Data Transfer
The solution to the bottleneck we have found is to program our training step to load data asynchronously. This enables the CPU to launch the compute kernels immediately after issuing the memory copy command, without waiting for the memory copy to complete. This way the GPU can begin processing the kernels as soon as the CUDA memory copy is done. Enabling asynchronous data copy requires two changes: first, we must program the dataloader to use pinned memory (instead of pageable memory), and second, we must pass the non_blocking=True argument to the to() operations:
NUM_WORKERS = 8
ASYNC_DATATRANSFER = True
train_loader = DataLoader(
FakeDataset(),
batch_size=BATCH_SIZE,
num_workers=NUM_WORKERS,
pin_memory=ASYNC_DATATRANSFER
)
def copy_data(batch):
    data, targets = batch
    data_gpu = data.to(DEVICE, non_blocking=ASYNC_DATATRANSFER)
    targets_gpu = targets.to(DEVICE, non_blocking=ASYNC_DATATRANSFER)
    return data_gpu, targets_gpu
Using asynchronous data loading results in a throughput of 5.91 steps per second, a further 23% improvement and a 99% improvement overall. The resultant profiling trace is shown below:

We now see all of the CPU operations bunched together at the beginning of the trace. We have removed all performance obstacles on the CPU side, allowing it to freely load the data and kernels to the GPU. In the GPU section, we see continuous activity without any idle time. We do, however, see a clear separation between CUDA memory activities (in light green) and CUDA kernel activities (in light blue). PyTorch Profiler, in contrast, does not make this distinction clear. This is another advantage of the hardware-centric profiler and, in the case of our toy experiment, is what informs the next steps of our optimization.

Optimization 3: Pipelining With CUDA Streams
Our final optimizations derive from the fact that modern GPUs, such as the NVIDIA L40S, use independent engines for copying memory (the DMA engines) and executing compute kernels (the SMs). We can take advantage of this by parallelizing the distinct memory and kernel activities we observed in the nsys profiler trace. We will program this through the use of CUDA streams.
In a previous post, we expanded on the opportunity for optimizing AI/ML workloads using CUDA streams. Here, we apply a similar pipelining strategy: we define two distinct "copy" and "compute" CUDA streams and program the "copy" stream to copy batch N+1 at the same time that the "compute" stream is processing batch N:
# define two CUDA streams
compute_stream = torch.cuda.Stream()
copy_stream = torch.cuda.Stream()
# extract the first batch
next_batch = next(data_iter)
with torch.cuda.stream(copy_stream):
    next_batch = copy_data(next_batch)

for i in range(TOTAL_STEPS):
    if i == WARMUP_STEPS:
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        profiler.start()
    elif i == WARMUP_STEPS + PROFILE_STEPS:
        torch.cuda.synchronize()
        profiler.stop()
        end_time = time.perf_counter()
    with nvtx.annotate(f"Batch {i}", color="blue"):
        # wait for the copy stream to finish copying batch N
        compute_stream.wait_stream(copy_stream)
        batch = next_batch
        # fetch batch N+1 and copy it on the copy stream
        try:
            with nvtx.annotate("get batch", color="red"):
                next_batch = next(data_iter)
            with torch.cuda.stream(copy_stream):
                with nvtx.annotate("copy batch", color="yellow"):
                    next_batch = copy_data(next_batch)
        except StopIteration:
            # reached the end of the dataset
            next_batch = None
        # execute the model on batch N on the compute stream
        with torch.cuda.stream(compute_stream):
            with nvtx.annotate("Compute", color="green"):
                compute_step(model, batch, optimizer)
total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")
This optimization results in a throughput of 6.44 steps per second, a 9% improvement over our previous experiment. We note that the impact of this optimization is capped by the duration of the longer of the two operation types. In our previous profile trace, the memory block took 15.5 milliseconds and the kernel block took 155 milliseconds. In the current profile trace, the entire GPU step takes 155 milliseconds, which means that the memory copy time is completely hidden by the kernel compute time and that our optimization achieves the maximum possible result.
The use of CUDA streams and its impact on GPU utilization can be seen in the traces of both profilers:
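A quick back-of-the-envelope check makes this ceiling concrete (the timings below are taken from our trace; your own numbers will differ):
copy_ms, compute_ms = 15.5, 155.0
best_step_ms = max(copy_ms, compute_ms)  # with perfect overlap, the copy is fully hidden under the compute
print(1000.0 / best_step_ms)             # ~6.45 steps/sec, in line with the measured 6.44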


Optimization 4: Prefetching to CUDA
For our final step, we move the data copying from the main training loop process to the data loading process: rather than explicitly calling the copy function inside the training loop, we assume that the batches returned from the data iterator are already placed on the GPU.
In the code block below, we wrap our dataloader with a CUDA-prefetching iterator class. Note that this is a simplified implementation intended for the purposes of demonstration. More work may be required for more complex scenarios (e.g., DDP training). Alternatively, you may consider a third-party implementation such as torchtnt.utils.data.data_prefetcher.CudaDataPrefetcher:
class DataPrefetcher:
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.next_batch = None
        self.preload()

    def preload(self):
        try:
            data, targets = next(self.loader)
            with torch.cuda.stream(self.stream):
                with nvtx.annotate("copy batch", color="yellow"):
                    next_data = data.to(DEVICE, non_blocking=True)
                    next_targets = targets.to(DEVICE, non_blocking=True)
                    self.next_batch = (next_data, next_targets)
        except StopIteration:
            self.next_batch = (None, None)

    def __iter__(self):
        return self

    def __next__(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        data, targets = self.next_batch
        self.preload()
        return data, targets
data_iter = DataPrefetcher(train_loader)
for i in range(TOTAL_STEPS):
    if i == WARMUP_STEPS:
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        profiler.start()
    elif i == WARMUP_STEPS + PROFILE_STEPS:
        torch.cuda.synchronize()
        profiler.stop()
        end_time = time.perf_counter()
    with nvtx.annotate(f"Batch {i}", color="blue"):
        with nvtx.annotate("get batch", color="red"):
            batch = next(data_iter)
        with nvtx.annotate("Compute", color="green"):
            loss = compute_step(model, batch, optimizer)
total_time = end_time - start_time
throughput = PROFILE_STEPS / total_time
print(f"Throughput: {throughput:.2f} steps/sec")
This optimization results in a throughput of 6.44 steps per second, the same as our previous experiment. This should not surprise us, since we have already seen that the throughput is bound by the 155-millisecond GPU compute, and our optimization has not done anything to reduce the kernel compute time.
More generally, despite the removal of the copy call from the main loop, you will have a hard time finding a scenario where this has a meaningful impact on performance, since the copy is already being performed asynchronously. However, given the minimal changes to the training loop, you may find this solution to be cleaner and/or more applicable for use with high-level libraries that do not allow fine-grained control of the training loop.
Unsurprisingly, the profile traces for this experiment appear nearly identical to the previous ones. The main difference is the placement of the yellow "copy batch" block in the NVTX row of the CPU section.


Results
The table below summarizes the results of our experiments (throughput values and relative speedups as reported in the sections above):

Experiment                      Throughput (steps/sec)   Speedup over baseline
Baseline                        2.97                      1.00x
Multi-process data loading      4.81                      1.62x
Asynchronous data transfer      5.91                      1.99x
CUDA-stream pipelining          6.44                      2.17x
CUDA prefetching                6.44                      2.17x

The optimizations, which were driven with the help of the Nsight Systems profiler, resulted in an overall 2.17x increase in runtime performance.
Summary
GPU starvation is a common performance bottleneck that can have a devastating impact on the efficiency and costs of AI/ML workloads. In this post, we demonstrated how to use the Nsight Systems profiler to study the causes of a performance bottleneck and take informed steps toward its resolution. Along the way, we emphasized the unique capabilities of the Nsight Systems profiler compared to the built-in, framework-centric PyTorch Profiler, particularly its deep system-level visibility.
Our focus in this post has been on the host-to-device data copy that typically occurs at the beginning of the training step. However, data-transfer bottlenecks can appear at different stages of training. In a sequel to this post we intend to repeat our nsys profiling analysis on data copies going in the opposite direction, from the device to the host. Stay tuned!