Pipelining AI/ML Training Workloads with CUDA Streams

This post is the ninth in our series on performance profiling and optimization in PyTorch, aimed at emphasizing the critical role of performance analysis and optimization in machine learning development. Throughout the series we have reviewed a wide variety of practical tools and techniques for analyzing and boosting the runtime performance of PyTorch-based AI/ML models. Our goal has been twofold:

  1. To emphasize the importance of routine analysis and optimization of AI/ML workloads.
  2. To demonstrate the accessibility of a wide variety of tools and techniques for analyzing and optimizing AI/ML runtime performance. You do not need to be a CUDA expert to meaningfully improve your model performance and reduce compute costs.

In this post, we will explore the use of CUDA streams, a powerful feature of NVIDIA's CUDA programming model that offers a sophisticated method of overlapping GPU operations and running them concurrently. Although we typically associate our AI/ML model training workload with a single monolithic (a.k.a. "unbreakable") computation graph G running on the GPU, there are some scenarios where the graph can be decomposed into two distinct subgraphs G1 and G2, where G = G2*G1 (i.e., G2 is applied to the output of G1). In such cases CUDA streams enable "pipelining" the computation graph, i.e., programming our training step to run G1 (on batch input n+1) in parallel with G2 (on the n-th output of G1). This technique is especially useful when:

  • Neither subgraph fully utilizes the GPU when run alone, and
  • The two subgraphs are of comparable computational cost (i.e., neither dominates the runtime).

We will explore two common scenarios where "pipelining" is feasible (a minimal sketch of the general pattern appears right after this list):

  1. Partial-model training or finetuning:
    It is common to freeze a pre-trained model backbone (e.g., a feature extractor or encoder) and train only a model head (e.g., a decoder). Since the frozen backbone does not depend on gradients from the head, the two can be executed concurrently.
  2. Offloading data preprocessing to the GPU:
    A common method for addressing bottlenecks in the input pipeline (also known as GPU starvation) is to move data preprocessing onto the GPU. While prepending the preprocessing operations to the model graph improves performance, additional gains can be achieved by running preprocessing on a separate CUDA stream in parallel with model execution, assuming the preprocessing is not trivial compared to the model compute.
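
Before turning to the full experiments, here is a minimal sketch of the pattern. The modules g1 and g2, the loss, the optimizer, and the random tensors below are illustrative stand-ins (not the models used in the experiments); the sketch only shows how the two streams are arranged:

import torch
import torch.nn as nn

# Tiny stand-in modules: g1 plays the role of the frozen subgraph G1,
# g2 the trainable subgraph G2
device = torch.device("cuda")
g1 = nn.Linear(64, 64).to(device).requires_grad_(False)
g2 = nn.Linear(64, 10).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(g2.parameters(), lr=0.01)

s1 = torch.cuda.Stream()  # runs G1 on batch n+1
s2 = torch.cuda.Stream()  # trains G2 on the n-th output of G1

features = labels = None
for _ in range(100):
    # stand-in for a real data loader
    inputs_next = torch.randn(32, 64, device=device)
    labels_next = torch.randint(0, 10, (32,), device=device)

    if features is not None:
        with torch.cuda.stream(s2):
            s2.wait_stream(s1)  # G2 must wait for G1's output
            optimizer.zero_grad()
            loss = loss_fn(g2(features), labels)
            loss.backward()
            optimizer.step()

    with torch.cuda.stream(s1):
        with torch.no_grad():
            features = g1(inputs_next)  # run G1 on the next batch
        features.record_stream(s1)

    labels = labels_next

torch.cuda.synchronize()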

To facilitate our discussion, we will define two toy training scripts and measure the training performance under different scenarios. The experiments were run on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) running a PyTorch (2.6) Deep Learning AMI (DLAMI).

Please note: the code snippets that we share are for demonstration purposes only; please do not rely on their correctness or optimality. The impact of using CUDA streams will vary depending on model architecture and system configuration. We encourage you to conduct your own profiling and experimentation before integrating CUDA streams (or any other tool or technique we refer to) into your workflow.

Part 1: Pipelining an Encoder-Decoder Model

The first use case we explore involves a CNN-based image segmentation model consisting of a fixed (pre-trained) encoder and a trainable decoder. In this scenario, since the encoder weights are frozen and unaffected by backpropagation, the encoder can be executed independently of the decoder's training. In this section, we assess the impact of pipelining the training process using CUDA streams.

A Toy Image Segmentation Training Experiment

We begin by defining a simple CNN-based image encoder along with its corresponding decoder.

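The original encoder/decoder listing did not survive the conversion of this post. The following is a minimal sketch of what such a pair might look like; the layer configuration, img_size, and num_classes are illustrative choices, not necessarily those used in the original experiments:

import torch
import torch.nn as nn

img_size = 256     # illustrative values; the original settings may differ
num_classes = 10

# A simple convolutional encoder that downsamples the image into a feature map
encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),    # 128x128
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),   # 64x64
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 32x32
    nn.ReLU(inplace=True),
)

# A matching decoder that upsamples back to per-pixel class logits
decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),    # 64x64
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),     # 128x128
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(32, num_classes, kernel_size=2, stride=2),  # 256x256
)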

Next, we construct a synthetic dataset of random images and segmentation maps.

from torch.utils.data import DataLoader
from torchvision.datasets.vision import VisionDataset

# A dataset with random images and per-pixel labels
class FakeDataset(VisionDataset):
    def __init__(self):
        super().__init__(root=None)
        self.size = 1000000

    def __getitem__(self, index):
        # create a random image
        img = torch.randint(0, 256, (3, img_size, img_size),
                            dtype=torch.uint8)

        # create a random label map
        target = torch.randint(0, num_classes, (img_size, img_size))

        return img, target

    def __len__(self):
        return self.size

train_set = FakeDataset()

train_loader = DataLoader(
    dataset=train_set,
    batch_size=8,
    num_workers=8
)

Finally, we define the loss function, optimizer, and training loop. Note that we freeze the encoder's weights and train only the decoder.

import time

device = torch.device("cuda")
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(decoder.parameters())

# Freeze the encoder weights
encoder.requires_grad_(False)
encoder.eval().to(device)

decoder.train().to(device)

warmup = 10
active_batches = 100
total_iters = warmup + active_batches

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True).float()
    labels = data[1].to(device=device, non_blocking=True)
    optimizer.zero_grad()
    with torch.no_grad():
        features = encoder(inputs)
    output = decoder(features)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()

    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

Our baseline training script achieves an average throughput of 83 steps per second, with an average GPU utilization of 85%.

Pipelining the Model Execution With CUDA Streams

In the revised version of the training loop shown below, we introduce two CUDA streams: one for executing the encoder and one for training the decoder. In each iteration, we perform two operations concurrently:

  1. Train the decoder using the image features and labels from batch N.
  2. Execute the encoder on input batch N+1 to generate its image features.

encoder_stream = torch.cuda.Stream()
decoder_stream = torch.cuda.Stream()

# initialize the features to None
features = None

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device, non_blocking=True).float()
    labels_next = data[1].to(device, non_blocking=True)

    if features is not None:
        with torch.cuda.stream(decoder_stream):
            decoder_stream.wait_stream(encoder_stream)

            optimizer.zero_grad()
            output = decoder(features)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

    with torch.cuda.stream(encoder_stream):
        with torch.no_grad():
            features = encoder(inputs)
        # Record that features was produced on encoder_stream
        features.record_stream(encoder_stream)

    labels = labels_next

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()
    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

This modification yields an average throughput of 91 steps per second, representing a 9.6% speedup. This is a significant improvement, especially considering that our baseline already had high GPU utilization (85%).

Sensitivity of Pipelining to Workload Properties

The effectiveness of pipelining with CUDA streams is highly dependent on the specifics of the training workload and runtime environment. If the encoder is significantly larger than the decoder (or vice versa), pipelining may offer little benefit or even hinder performance. Conversely, when the GPU is underutilized, pipelining tends to yield more substantial gains.

To illustrate this dependency, we reran the experiment with varying batch sizes. The results are summarized below:

Impact of Pipelining With CUDA Streams on Throughput (by Author)

As the batch size increases, the benefit of pipelining diminishes. This is likely because larger batch sizes naturally lead to higher (and more efficient) GPU utilization, leaving less room for improvement through concurrent execution.

Part 2: Offloading Augmentations onto the GPU

In this section, we apply the use of CUDA streams to the acceleration of data augmentation. In previous blog posts (e.g., here and here), we studied the problem of bottlenecks in the data input pipeline from different perspectives and reviewed several techniques for diagnosing and addressing them. A common cause of these bottlenecks is CPU resource exhaustion, where the CPU cannot meet the computational demands of the preprocessing pipeline. The result is GPU starvation, a situation in which the expensive GPU sits idle, waiting for data to arrive.

One effective solution is to offload heavy data preprocessing to the GPU. We will demonstrate this technique and take it a step further by executing the augmentations on a dedicated CUDA stream, enabling concurrent execution with the model training.

A Toy Image Classification Training Experiment

We begin by defining a simple CNN-based image classification model:

import torch
import torch.nn as nn

img_size = 256
num_classes = 10
model = nn.Sequential(
    # Start with a 256x256 image
    nn.Conv2d(3, 16, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, kernel_size=2, stride=2),  # 2x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=2, stride=2),  # 4x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=2, stride=2),  # 8x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, kernel_size=2, stride=2),  # 16x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 512, kernel_size=2, stride=2),  # 32x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(512, 1024, kernel_size=2, stride=2),  # 64x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(1024, 2048, kernel_size=2, stride=2),  # 128x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(2048, 4096, kernel_size=2, stride=2),  # 256x downsample
    nn.Flatten(),
    nn.Linear(4096, num_classes)
)

Next, we create a synthetic dataset with an augmentation pipeline deliberately designed to cause a severe performance bottleneck:

import random
from torch.utils.data import DataLoader
import torchvision.transforms.v2 as T
from torchvision.datasets.vision import VisionDataset
import torchvision.transforms.v2.functional as F
import torchvision.ops as ops

# A dataset with random images and labels
class FakeDataset(VisionDataset):
    def __init__(self, transform=None):
        super().__init__(root=None, transform=transform)
        self.size = 1000000

    def __getitem__(self, index):
        # create a random image
        img = torch.randint(0, 256, (3, img_size, img_size),
                            dtype=torch.uint8)
        # create a random label
        target = torch.randint(0, num_classes, (1, ))

        if self.transform:
            # Apply transformations
            img = self.transform(img)

        return img, target

    def __len__(self):
        return self.size

augmentations = T.Compose([
    T.ToDtype(torch.float32),
    T.RandomCrop(img_size//2),
    T.Resize(img_size),
    T.RandomRotation(degrees=45.0),
    T.GaussianBlur(kernel_size=7),
    T.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
])

train_set = FakeDataset(transform=augmentations)

train_loader = DataLoader(
    dataset=train_set,
    batch_size=32,
    num_workers=8
)

Finally, we define the loss function, optimizer, and training loop:

import time

device = torch.device("cuda")
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())

model.train().to(device)

warmup = 10
active_batches = 100
total_iters = warmup + active_batches

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True).squeeze()
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()

    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

Running this baseline script results in an average throughput of 20.41 steps per second and a GPU utilization of only 42%. The heavy data augmentations are choking the CPU, leading to GPU starvation. See our previous post for more information on detecting bottlenecks in the data input pipeline.

Offloading Data Augmentations to the GPU

To address the performance bottleneck in the data input pipeline, we move the augmentations onto the GPU.

The first step is to define custom data transforms that apply random rotations and crops per sample in a batch. This is necessary because the built-in torchvision transforms apply the same augmentation across the entire batch, losing the per-sample randomness seen on the CPU.

We implement the BatchRandomCrop transform using the roi_align operator.

class BatchRandomCrop(T.Transform):
    def __init__(self, output_size):
        super().__init__()
        self.output_size = output_size

    def transform(self, img: torch.Tensor, params: dict):
        batch_size, _, original_height, original_width = img.shape
        device = img.device
        max_top = original_height - self.output_size
        max_left = original_width - self.output_size

        # Generate random top and left coords for each image in the batch
        random_top = torch.randint(0, max_top + 1, (batch_size,),
                                   device=device, dtype=torch.float32)
        random_left = torch.randint(0, max_left + 1, (batch_size,),
                                    device=device, dtype=torch.float32)

        image_indices = torch.arange(batch_size, device=device,
                                     dtype=torch.float32)

        boxes = torch.stack([
            image_indices,
            random_left,
            random_top,
            random_left + self.output_size,
            random_top + self.output_size
        ], dim=1)

        cropped_batch = ops.roi_align(
            img,
            boxes,
            output_size=self.output_size
        )
        return cropped_batch

We implement the BatchRandomRotation transform by iterating over all of the images in the batch and applying a random rotation to each one. Note that this version is not vectorized; a fully vectorized implementation would require greater effort.

class BatchRandomRotation(T.Transform):
    def __init__(self, degrees):
        super().__init__()
        self.degrees = degrees

    def transform(self, inpt: torch.Tensor, params: dict):
        # split the batch into a list of individual images
        images = list(torch.unbind(inpt, dim=0))

        augmented_images = []
        for img_tensor in images:
            # generate a random angle
            angle = random.uniform(-self.degrees, self.degrees)

            # apply the rotation to the single image
            transformed_img = F.rotate(
                img_tensor,
                angle=angle
            )
            augmented_images.append(transformed_img)

        # stack the transformed images
        return torch.stack(augmented_images, dim=0)

We now define a batch_transform that mimics the CPU-based augmentation pipeline defined above:

batch_transform = T.Compose([
    T.ToDtype(torch.float32),
    BatchRandomCrop(img_size//2),
    T.Resize(img_size),
    BatchRandomRotation(degrees=45.0),
    T.GaussianBlur(kernel_size=7),
    T.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
]) 

Finally, we reset the dataset and update the training loop to apply the new batch_transform:

train_set = FakeDataset(transform=None)

train_loader = DataLoader(
    dataset=train_set,
    batch_size=32,
    num_workers=8
)

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True).squeeze()

    # apply augmentations on the GPU
    inputs = batch_transform(inputs)

    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        torch.cuda.synchronize()
        t0 = time.perf_counter()

    if idx == total_iters:
        break

torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

This updated training script improves throughput to 35.22 steps per second, a 72.57% speedup over the baseline result.

Pipelining Augmentations With CUDA Streams

Next, we pipeline the augmentation and training steps using two separate CUDA streams: one for running the data transform and one for training the model. In each iteration of the loop we perform two concurrent operations:

  1. Train the model on the augmented batch N.
  2. Perform GPU-based data augmentations on batch N+1.

transform_stream = torch.cuda.Stream()
model_stream = torch.cuda.Stream()

# initialize the transformed batch to None
transformed = None

for idx, data in enumerate(train_loader):
    inputs = data[0]
    labels_next = data[1]

    if transformed is not None:
        with torch.cuda.stream(model_stream):
            labels = labels.to(device, non_blocking=True).squeeze()
            model_stream.wait_stream(transform_stream)
            optimizer.zero_grad()
            output = model(transformed)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

    with torch.cuda.stream(transform_stream):
        inputs = inputs.to(device, non_blocking=True)
        transformed = batch_transform(inputs)
        # Record that the tensor was produced on transform_stream
        transformed.record_stream(transform_stream)

    labels = labels_next

    if idx == warmup:
        torch.cuda.synchronize()
        t0 = time.perf_counter()
    if idx == total_iters:
        break

torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')

This further improves the throughput to 38.82 steps per second, a 10.2% increase over the serialized solution, and 90.20% faster than the original baseline.

Sensitivity of Pipelining to Workload Properties

As we saw in Part 1, the benefit of pipelining using CUDA streams varies based on the details of the workload. In the table below, we capture the results for several different batch sizes:

Impact of Pipelining With CUDA Streams on Throughput (by Author)

As the batch size increases, GPU offloading becomes more effective, significantly boosting performance. At the same time, the gains from pipelining decrease. This is likely due to the fact that larger batch sizes increase GPU efficiency, reducing the opportunities for overlap.

Summary

When it comes to running AI/ML workloads, every millisecond counts. In this post we explored the impact of pipelining an AI/ML training step using CUDA streams in two common scenarios: partial-model training and offloading data augmentations to the GPU. In both cases, the pipelined solution outperformed the serialized implementation, though the extent of the improvement varied significantly with the batch size.

As we have emphasized throughout the post, the expected impact of using CUDA streams can vary greatly based on the AI/ML workload. For example, in cases where the GPU is already being efficiently utilized, the overhead of using CUDA streams may actually lead to a degradation in runtime performance. We strongly recommend testing this technique on your own workloads before adopting it.

We hope you will find the technique described in this post useful. For more tips, tricks, and techniques for profiling and optimizing AI/ML workflows, check out the other posts in this series.

Tags: AI/ML, CUDA, Pipelining, Streams, Training, Workloads
