
AI Across Multiple GPUs: Understanding the Host and Device Paradigm

by Admin
February 13, 2026
in Machine Learning


This post is part of a series about distributed AI across multiple GPUs:

  • Part 1: Understanding the Host and Device Paradigm (this article)
  • Part 2: Point-to-Point and Collective Operations (coming soon)
  • Part 3: How GPUs Communicate (coming soon)
  • Part 4: Gradient Accumulation & Distributed Data Parallelism (DDP) (coming soon)
  • Part 5: ZeRO (coming soon)
  • Part 6: Tensor Parallelism (coming soon)

Introduction

This guide explains the foundational concepts of how a CPU and a discrete graphics card (GPU) work together. It’s a high-level introduction designed to help you build a mental model of the host-device paradigm. We’ll focus specifically on NVIDIA GPUs, which are the most commonly used for AI workloads.

For integrated GPUs, such as those found in Apple Silicon chips, the architecture is slightly different, and it won’t be covered in this post.

The Big Picture: The Host and the Device

The most important concept to grasp is the relationship between the Host and the Device.

  • The Host: This is your CPU. It runs the operating system and executes your Python script line by line. The Host is the commander; it’s responsible for the overall logic and tells the Device what to do.
  • The Device: This is your GPU. It’s a powerful but specialized coprocessor designed for massively parallel computations. The Device is the accelerator; it doesn’t do anything until the Host gives it a task.

Your program always starts on the CPU. When you want the GPU to perform a task, like multiplying two large matrices, the CPU sends the instructions and the data over to the GPU.
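As a minimal sketch of this handoff in PyTorch (assuming a CUDA-capable GPU is available):

import torch

# The script starts on the Host: these tensors live in CPU RAM
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

# The Host sends the data over to the Device (GPU VRAM)...
a = a.to('cuda')
b = b.to('cuda')

# ...then sends the instruction: the matrix multiply runs on the GPU
c = a @ b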

The CPU-GPU Interaction

The Host talks to the Device through a queuing system.

  1. CPU Initiates Commands: Your script, running on the CPU, encounters a line of code meant for the GPU (e.g., tensor.to('cuda')).
  2. Commands are Queued: The CPU doesn’t wait. It simply places this command onto a special to-do list for the GPU called a CUDA Stream (more on this in the next section).
  3. Asynchronous Execution: The CPU doesn’t wait for the actual operation to be completed by the GPU; the host moves on to the next line of your script. This is called asynchronous execution, and it’s key to achieving high performance. While the GPU is busy crunching numbers, the CPU can work on other tasks, like preparing the next batch of data, as the sketch after this list demonstrates.
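You can observe asynchronous execution with a small timing experiment. This is a sketch (assuming a CUDA GPU; the exact numbers will vary by hardware):

import time
import torch

x = torch.randn(4096, 4096, device='cuda')

start = time.perf_counter()
y = x @ x  # enqueued on the GPU's stream; the CPU returns almost immediately
launch = time.perf_counter() - start

torch.cuda.synchronize()  # block the Host until the GPU finishes all queued work
total = time.perf_counter() - start

print(f'kernel launch: {launch:.6f}s, full computation: {total:.6f}s')

The first measurement captures only the time the CPU spent enqueuing the command; the second includes the GPU actually finishing the multiplication.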

CUDA Streams

A CUDA Stream is an ordered queue of GPU operations. Operations submitted to a single stream execute in order, one after another. However, operations across different streams can execute concurrently: the GPU can juggle multiple independent workloads at the same time.

By default, every PyTorch GPU operation is enqueued on the current active stream (usually the default stream, which is created automatically). This is simple and predictable: every operation waits for the previous one to finish before starting. For most code, you never notice this. But it leaves performance on the table when you have work that could overlap.

Multiple Streams: Concurrency

The classic use case for multiple streams is overlapping computation with data transfers. While the GPU processes batch N, you can simultaneously copy batch N+1 from CPU RAM to GPU VRAM:

Stream 0 (compute): [process batch 0]────[process batch 1]───
Stream 1 (data):    ────[copy batch 1]────[copy batch 2]───

This pipeline is possible because compute and data transfer happen on separate hardware units inside the GPU, enabling true parallelism. In PyTorch, you create streams and schedule work onto them with context managers:


compute_stream = torch.cuda.Stream()
transfer_stream = torch.cuda.Stream()

with torch.cuda.stream(transfer_stream):
    # Enqueue the copy on transfer_stream
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)

with torch.cuda.stream(compute_stream):
    # This runs concurrently with the transfer above
    output = model(current_batch)

Note the non_blocking=True flag on .to(). Without it, the transfer would still block the CPU thread even when you intend it to run asynchronously. For the copy to be truly asynchronous, the source CPU tensor must also live in pinned (page-locked) memory, e.g. via .pin_memory(); otherwise CUDA falls back to a blocking copy.

Synchronization Between Streams

Since streams are independent, you need to explicitly signal when one depends on another. The blunt instrument is:

torch.cuda.synchronize()  # waits for ALL streams on the device to finish

A more surgical approach uses CUDA Events. An event marks a specific point in a stream, and another stream can wait on it without halting the CPU thread:

event = torch.cuda.Event()

with torch.cuda.stream(transfer_stream):
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)
    event.record()  # mark: transfer is finished

with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(event)  # don't start until the transfer completes
    output = model(next_batch)

This is more efficient than stream.synchronize() because it only stalls the dependent stream on the GPU side; the CPU thread stays free to keep queuing work.

For day-to-day PyTorch training code you won’t need to manage streams manually. But features like DataLoader(pin_memory=True) and prefetching rely heavily on this mechanism under the hood. Understanding streams helps you recognize why these settings exist and gives you the tools to diagnose subtle performance bottlenecks when they appear.
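As a rough sketch of how these pieces fit together in an everyday training loop (the dataset and model here are hypothetical placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for real training data
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))

# pin_memory=True stages each batch in page-locked RAM so the
# non_blocking copies below can overlap with GPU compute
loader = DataLoader(dataset, batch_size=256, pin_memory=True)

model = torch.nn.Linear(128, 10).to('cuda')

for inputs, targets in loader:
    inputs = inputs.to('cuda', non_blocking=True)
    targets = targets.to('cuda', non_blocking=True)
    output = model(inputs)  # queued behind the copies on the same stream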

PyTorch Tensors

PyTorch is a powerful framework that abstracts away many details, but this abstraction can sometimes obscure what is happening under the hood.

When you create a PyTorch tensor, it has two parts: metadata (like its shape and data type) and the actual numerical data. So when you run something like t = torch.randn(100, 100, device=device), the tensor’s metadata is stored in the host’s RAM, while its data is stored in the GPU’s VRAM.

This distinction is crucial. When you run print(t.shape), the CPU can immediately access this information because the metadata is already in its own RAM. But what happens if you run print(t), which requires the actual data living in VRAM?
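A small sketch makes the difference concrete (assuming a CUDA device is available):

import torch

device = torch.device('cuda')
t = torch.randn(100, 100, device=device)

print(t.shape)  # cheap: the shape is metadata, already in CPU RAM
print(t.dtype)  # cheap: metadata again

print(t)        # potentially expensive: the values must be copied from VRAM to RAM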

Host-Device Synchronization

Accessing GPU data from the CPU can trigger a Host-Device Synchronization, a common performance bottleneck. This occurs whenever the CPU needs a result from the GPU that isn’t yet available in the CPU’s RAM.

For example, consider the line print(gpu_tensor), which prints a tensor that is still being computed by the GPU. The CPU can’t print the tensor’s values until the GPU has finished all the calculations needed to obtain the final result. When the script reaches this line, the CPU is forced to block, i.e. it stops and waits for the GPU to finish. Only after the GPU completes its work and copies the data from its VRAM to the CPU’s RAM can the CPU proceed.

As another example, what’s the difference between torch.randn(100, 100).to(device) and torch.randn(100, 100, device=device)? The first method is less efficient because it creates the data on the CPU and then transfers it to the GPU. The second method is more efficient because it creates the tensor directly on the GPU; the CPU only sends the creation command.
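A quick way to convince yourself is to time both approaches. This is a sketch; the size and the timings are illustrative and will vary by hardware:

import time
import torch

device = torch.device('cuda')

def timed(fn):
    torch.cuda.synchronize()  # drain any pending GPU work first
    start = time.perf_counter()
    fn()
    torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
    return time.perf_counter() - start

n = 4096
print('CPU create, then transfer:', timed(lambda: torch.randn(n, n).to(device)))
print('create directly on GPU:   ', timed(lambda: torch.randn(n, n, device=device)))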

These synchronization points can severely impact performance. Effective GPU programming involves minimizing them to keep both the Host and the Device as busy as possible. After all, you want your GPUs to go brrrrr.

Image by author: generated with ChatGPT

Scaling Up: Distributed Computing and Ranks

Training large models, such as Large Language Models (LLMs), often requires more compute power than a single GPU can provide. Coordinating work across multiple GPUs brings you into the world of distributed computing.

In this context, a new and essential concept emerges: the Rank.

  • Each rank is a CPU process that gets assigned a single device (GPU) and a unique ID. If you launch a training script across two GPUs, you’ll create two processes: one with rank=0 and another with rank=1.

This means you’re launching two separate instances of your Python script. On a single machine with multiple GPUs (a single node), these processes run on the same CPU but remain independent, without sharing memory or state. Rank 0 commands its assigned GPU (cuda:0), while Rank 1 commands another GPU (cuda:1). Although both ranks run the same code, you can leverage a variable that holds the rank ID to assign different tasks to each GPU, like having each process a different portion of the data (we’ll see examples of this in the next blog post of this series).
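As a minimal sketch of what this looks like in code (assuming the script is launched with torchrun --nproc_per_node=2, which spawns one process per GPU and sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables):

import os
import torch
import torch.distributed as dist

rank = int(os.environ['RANK'])              # unique ID of this process
local_rank = int(os.environ['LOCAL_RANK'])  # GPU index on this node
world_size = int(os.environ['WORLD_SIZE'])  # total number of processes

dist.init_process_group(backend='nccl')

# Each rank commands its own device
device = torch.device(f'cuda:{local_rank}')
torch.cuda.set_device(device)

# Same code in every process, but the rank ID lets each one behave
# differently, e.g. load a different shard of the data
print(f'Hello from rank {rank} of {world_size}, using {device}')

dist.destroy_process_group()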

Conclusion

Congratulations on reading all the way to the end! In this post, you learned about:

  • The Host/Device relationship
  • Asynchronous execution
  • CUDA Streams and how they enable concurrent GPU work
  • Host-Device synchronization

In the next blog post, we will dive deeper into Point-to-Point and Collective Operations, which enable multiple GPUs to coordinate complex workflows such as distributed neural network training.

