Production agents fight over the same GPU, and on one shared card a latency-sensitive agent’s p99 latency quietly got 66% worse while every pod still reported healthy. Here’s what that fight actually costs, measured to the p99, not hand-waved.
This is Part 2 of the “Production-Grade Agentic Inference” series. Each part removes one kind of redundant work from an agentic LLM pipeline. Part 1 kills redundant prefill. Part 2 (this part) tackles redundant waiting: how multiple micro-agents share one GPU through time-slicing. Part 3 keeps RAG retrieval on the GPU with a custom CUDA Top-K kernel. Part 4 persists agent state across hand-offs so the next agent never has the cold-start problem.
Key Takeaways
- Sharing a GPU is not free, and your scheduler will not tell you. When two agents share one time-sliced GPU, Kubernetes happily reports both pods as Running. The damage hides in the latency tail.
- The median lies; the tail tells the truth. In my run (with only 2 agents), both kept an almost-unchanged p50. But the small, latency-sensitive one’s p99 jumped from 3.68 ms to 6.10 ms (≈1.66×) and its jitter (p99/p50) went from 1.02 to 1.70.
- The latency-sensitive agent degrades first. The small, twitchy workload suffered far more than the heavy, steady one, even though both “got a GPU.”
- Throughput barely moved, which is the whole trap. A mean-rate throughput proxy dropped only a few percent, so a dashboard watching averages would call this a success while your tail-sensitive agent quietly misses one deadline in fifty.
- It runs on a $150 GPU. Everything below is measured on a single five-year-old GTX 1080 with the stock NVIDIA Kubernetes Device Plugin and CUDA time-slicing. No H100, no MIG, no magic. This was intentional: not everyone can afford an H100, and some people still keep using their old hardware. And honestly, running agentic AI in production on an H100 doesn’t require any magic; on a $150 GPU, it surely does.
TL;DR: I put two very different agent workloads (a small, latency-sensitive FFT worker and a heavy, transformer-style GEMM worker) into separate Kubernetes pods, each politely asking for nvidia.com/gpu: "1", and let the NVIDIA device plugin’s CUDA time-slicing drop them both onto one physical GTX 1080. Then I timed every iteration with CUDA events, rolled it up into p50/p95/p99, computed a degradation factor (shared tail / solo tail), and cross-checked it against DCGM GPU-utilization counters. Result: medians and throughput barely flinched, but tail latency and jitter blew up, worst for the small, latency-critical agent. Kubernetes reports “two healthy pods.” The silicon reports a memory-bus street fight in which one tenant starves in the queue, and the p99 tail tells you who paid the price.
GitHub repo: https://github.com/AnubhabBanerjee/Kube-Timeslice-Profiler
(Quick confession before we start: I came at this from a 5G/6G RAN engineering background. As it turns out, this is exactly the kind of problem AI RAN is currently dealing with. On edge servers, operators are trying to co-locate latency-critical baseband processing with heavy LLM inference on the same GPUs. It becomes a scheduling nightmare the moment the AI workload starts starving the latency-critical applications of memory bandwidth, and that is exactly why I wrote this post.)
Architecture mental model: keep this open while you read.
Two pods → each asks for nvidia.com/gpu: 1 → the device plugin cheerfully says "sure, here are 4 GPUs" (there is exactly 1) → CUDA time-slices the one real GPU → everybody takes turns → the tail pays the bill.
Everything below is just commentary on one part of that line.
1. A confession: “Running” is the most expensive illusion in Kubernetes
Just like the previous post in this series, let us start with a dramatic conversation before we slowly dive into the more boring, technical stuff.
You: “Kubernetes, please run my two agents.”
Kubernetes: “Done. Both pods are Running. ✅”
You: “On the same GPU?”
Kubernetes: “Yep. Both asked for nvidia.com/gpu: 1, so I gave each one a GPU.”
You: “But I only own one GPU.”
Kubernetes: “Correct. And I gave each of them a GPU.” 🫡
You: “Wait, what!? How?? They can’t both have...”
Kubernetes: “Shhh. Don’t worry about it. Look how green they are.”
Your Grafana dashboard: “Everything looks good, bro. 🟢”
Meanwhile…
Your physical GPU: (screaming in context-switches)
Your p99 latency: (quietly doubling in the corner)
Well, maybe it was not that dramatic after all, but you get my point, right? The scheduler’s idea of “healthy” is that the pod is alive and a process is running. It has no opinion about whether your latency-critical agent is getting elbowed off the GPU forty times a second. Pod phase says Running. The agent says nothing because, well, nobody asked it.
This follows directly from where Part 1 left off. In the SwarmKV post I had two agents reading one document, and I bragged about prefilling once and fanning the KV cache out. Then, in the caveats, I admitted the embarrassing part: every branch’s actual GPU work still ran behind one global mutex. The orchestration fanned out; the compute lined up single file. Two agents, two turns. Fifty agents, fifty turns. I hand-rolled a lock and called it a day.
That’s fine for a demo. It’s a disaster for production, where “an agent swarm” means a dozen small specialized models (a router, a summarizer, a safety checker, a retriever, a pile of tool-callers), all awake at once, all wanting the same accelerator. You cannot buy each of them an H100 (unless your name is Jensen Huang). You pack them onto one shared GPU and hope the scheduler sorts it out.
So I wanted to answer one blunt question: when two agents share one GPU, what does each one actually pay, and will anything in my cluster tell me?
Spoiler alert: it costs real milliseconds, it lands almost entirely on the small fast agent, and no, nothing in your cluster will tell you. So I built a tool that does.
2. Two agents with opposite personalities
The repo behind this post runs two containerized PyTorch workers that stand in for the two types you find in basically every agent swarm:
- A small, twitchy, latency-sensitive agent (fft_worker.py). It runs a continuous loop of large 2-D complex FFTs. Think of it as the router / guardrail / tool-caller class: the agents that must reply now or the whole world starts falling apart.
- A big, steady, compute-hungry agent (matmul_worker.py). It runs a continuous stream of large square matrix multiplies, the GEMM at the heart of a transformer forward pass. This is the heavyweight actually doing the model’s thinking.
Their whole workload is quite simple. The FFT worker pre-allocates a 4096×4096 complex tensor and beats on it:
# ----- Pre-allocate tensors -----
# Single allocation keeps cuFFT plan creation and allocator traffic out of the per-iteration ``elapsed_time`` window on GPU.
# ``complex64`` matches typical PHY IQ data width; real-only FFT would under-report memory traffic relevant to DRAM contention with GEMM tenants.
data = torch.randn(MATRIX_SIZE, MATRIX_SIZE, device=device, dtype=torch.complex64)
# First launches pay JIT/plan costs; 5 iterations is a small fixed count. Formal steady-state trimming still happens in ``generate_results`` §1.4.
# Throwaway ``fft2`` calls prime instruction and constant caches so timed iterations see repeatable SM occupancy, not driver one-shot spikes.
for _ in range(5):
    # Assignment to ``_`` discards the output tensor handle immediately; we only need kernel execution side effects on device-resident ``data``.
    _ = torch.fft.fft2(data)
# Final sync ensures no warmup kernel overlaps the first timed iteration’s event pair, which is critical for CUDA event timing validity §3.
sync()
The GEMM worker pre-allocates two FP32 matrices and multiplies them forever:
# Matmul needs two operands resident on device; allocating once keeps allocator and paging out of the timed cuBLAS path each iteration.
# FP32 is the default training/inference dtype on Pascal-class GPUs without Tensor Cores; this matches the “GEMM on 1080” narrative in README.
A = torch.randn(MATRIX_SIZE, MATRIX_SIZE, device=device)
B = torch.randn(MATRIX_SIZE, MATRIX_SIZE, device=device)
# cuBLAS autotuning can pick different algorithms during first launches; warmup iterations absorb that non-determinism before ``KTS_APP`` lines.
# 5 repeats mirror FFT worker so cross-tenant comparisons in papers don't confound different warmup depths with silicon interference effects.
for _ in range(5):
    # Result discarded; peak memory stays flat because the output tensor is freed each iteration, and the timed loop allocates nothing new per iter.
    torch.matmul(A, B)
# Sync closes the warmup window so first ``_ev_start.record`` doesn't overlap trailing warmup kernels on the same default CUDA stream semantics.
sync()
The point was never to build a clever model; it was to build two GPU residents with opposite manners and watch them share one room. One finishes in about 3.6 ms and wants to go again immediately; the other takes about 20 ms and just wants to grind. Now put them on the same GPU and ask the one interesting question: who blinks first?
Both workers are configured by environment variables, so a pod spec can re-tune them without rebuilding the image:
# ----- Configuration (overridable via env vars so pod specs can tune per experiment) -----
# ``ITERATIONS`` default matches FFT worker so DF numerators/denominators use comparable sample counts without env overrides in YAML.
# Raising iterations lengthens shared-GPU ``kubectl wait``; lowering spikes variance in p99 tails used for contention storytelling in ``results.md``.
ITERATIONS = int(os.environ.get("ITERATIONS", 800))
# ``MATRIX_SIZE`` dominates FLOPs per iteration; env override lets you downshift VRAM when MatMul shares 8 GB with FFT co-tenant allocations.
# Time-slicing does not partition memory: both pods’ peak allocations must fit one physical card or the slower OOMKill path invalidates the experiment.
MATRIX_SIZE = int(os.environ.get("MATRIX_SIZE", 4096))
# ``SLEEP_MS`` defaults slightly above FFT’s 100 ms so two tenants rarely wake in lockstep, spreading scheduler quanta for more realistic interference.
# Same caveat as FFT: sleep is between measured iterations and is excluded from ``latency_ms_device``; only GPU matmul time is in the sample list.
SLEEP_MS = int(os.environ.get("SLEEP_MS", 150))
I suppose by this point you realize that nothing here is domain-specific. The numbers happen to come from a signal-processing workload next to a matmul, but swap in your own two agents (one light and deadline-driven, one heavy and steady) and the story holds. This is a post about workload personalities colliding on one accelerator, not about any one application.
Timing it without being fooled
There’s a classic way to benchmark a GPU and get a beautiful yet completely wrong number: time how long it takes Python to launch the kernel. CUDA is asynchronous, so torch.matmul(A, B) returns almost instantly while the GPU is still sweating. Measure that and you’ll be delighted that your matmul takes only 50 microseconds, and then you’ll bang your head wondering why production is slow.
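The trap is easy to reproduce without a GPU. The sketch below is only an analogy: a Python thread pool stands in for a CUDA stream, and `fake_kernel`, `time_launch_and_completion`, and the 50 ms duration are names and values invented for illustration, not anything from the repo.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(duration_s: float) -> None:
    """Stand-in for GPU work: it simply sleeps for `duration_s` seconds."""
    time.sleep(duration_s)


def time_launch_and_completion(duration_s: float = 0.05):
    """Return (launch_ms, completion_ms) for one asynchronous 'kernel'."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        t0 = time.perf_counter()
        # submit() returns right away, much like an asynchronous kernel launch.
        future = pool.submit(fake_kernel, duration_s)
        launch_ms = (time.perf_counter() - t0) * 1e3  # the flattering, wrong number
        # result() blocks until the work is done, playing the role of synchronize().
        future.result()
        completion_ms = (time.perf_counter() - t0) * 1e3
    return launch_ms, completion_ms


launch_ms, completion_ms = time_launch_and_completion()
print(f"launch returned after {launch_ms:.3f} ms; work finished after {completion_ms:.3f} ms")
```

The first number is how long it took to ask; the second is how long the work took. Only the second one belongs in a latency percentile.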
The workers don’t do that. They wrap each operation in CUDA events and force a torch.cuda.synchronize() so the clock stops after the kernels actually retire on the SMs:
# Start epoch immediately before ``record`` minimizes gap between “intent to launch” and queue submission for join alignment studies.
epoch_ns_start = time.time_ns()
_ev_start.record()
_ = torch.fft.fft2(data)
_ev_end.record()
torch.cuda.synchronize()
epoch_ns_end = time.time_ns()
latency_ms_device = float(_ev_start.elapsed_time(_ev_end))
elapsed_time reads the GPU’s own timeline: sub-microsecond resolution, no host-side jitter. That synchronize() is the difference between measuring “how long did the GPU work” and “how long did Python take to ask.” Then every iteration coughs up one structured line and flushes it, so Kubernetes log streaming sees it immediately:
print(
    f"KTS_APP,v1,FFT,{i},{epoch_ns_start},{epoch_ns_end},{latency_ms_device:.6f},{phase_optional}"
)
# ``flush`` forces line-buffered container stdout through CRI before the next sleep; without it, tail -f can batch lines and scramble join order.
sys.stdout.flush()
Raw silicon execution time goes in; a structured log comes out. A downstream parser aggregates these into real percentiles, creating a strict measurement contract that strips away all host-side noise.
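To make that contract concrete, here is a minimal sketch of such a parser. It assumes only the comma-separated field order shown in the print statement above; `parse_kts_line`, `summarize`, the nearest-rank percentile rule, and the sample log lines are my own illustrative choices, not the repo's generate_results.py.

```python
import math
from collections import defaultdict


def parse_kts_line(line: str):
    """Parse one KTS_APP,v1 record; return None for any other log line."""
    parts = line.strip().split(",")
    if len(parts) < 7 or parts[0] != "KTS_APP" or parts[1] != "v1":
        return None
    return {
        "worker": parts[2],
        "iteration": int(parts[3]),
        "epoch_ns_start": int(parts[4]),
        "epoch_ns_end": int(parts[5]),
        "latency_ms": float(parts[6]),
        "phase": parts[7] if len(parts) > 7 else "",
    }


def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(lines, warmup=0):
    """Group device latencies by worker, drop the first `warmup` iterations, report p50/p95/p99."""
    by_worker = defaultdict(list)
    for line in lines:
        record = parse_kts_line(line)
        if record is not None and record["iteration"] >= warmup:
            by_worker[record["worker"]].append(record["latency_ms"])
    return {
        worker: {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
        for worker, latencies in by_worker.items()
    }


# Hypothetical log excerpt: one unrelated line, one warmup iteration, three timed ones.
sample_log = [
    "some unrelated container log line",
    "KTS_APP,v1,FFT,0,1000,2000,9.000000,",
    "KTS_APP,v1,FFT,1,3000,4000,3.600000,",
    "KTS_APP,v1,FFT,2,5000,6000,3.700000,",
    "KTS_APP,v1,FFT,3,7000,8000,6.100000,",
]
summary = summarize(sample_log, warmup=1)
print(summary)
```

Because every record carries the worker name and the device-side latency, the percentile math never has to touch host timestamps at all.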
3. How two pods end up on one GPU (explained)
This is the part that will feel like magic to people who are new to K8s. Everyone else can safely skip this section and move on to the next one.
By default, Kubernetes treats nvidia.com/gpu as a whole, indivisible thing: one GPU, one claimant, no sharing. The NVIDIA device plugin’s time-slicing feature changes the bookkeeping. You hand it a ConfigMap that says, essentially, “pretend each physical GPU is several”:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-device-plugin
data:
  any: |-
    version: v1
    flags:
      migStrategy: "none"
      failOnInitError: true
    sharing:
      timeSlicing:
        failRequestsGreaterThanOne: true
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4
replicas: 4 is Kubernetes for “lie to the scheduler four times.” After this, one physical GTX 1080 advertises four allocatable nvidia.com/gpu slots to the API. Four pods can each request "1" and all get scheduled, quite happily.
Here is the catch, in bold because the entire post depends on it: this does not physically partition the hardware. It is not MIG. There is no memory fence and no compute fence. The four “GPUs” are the same silicon, and the pods take turns on it through CUDA time-slicing: the GPU context-switches between them like a single barista serving four lines by sprinting between registers. More schedulable slots, exactly zero isolation.
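The bookkeeping trick fits in a few lines. This is a toy model of the accounting only, not the device plugin's or the scheduler's actual code; `Node`, `place`, the pod names, and the slot-to-GPU mapping are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    physical_gpus: int
    replicas: int  # time-slicing replicas advertised per physical GPU

    @property
    def allocatable(self) -> int:
        """Number of nvidia.com/gpu slots the API sees."""
        return self.physical_gpus * self.replicas


def place(node: Node, pod_requests: dict) -> dict:
    """Admit pods while advertised slots remain.

    Returns pod -> index of the physical GPU backing its slot, or None if the
    pod would not fit. The slot-to-GPU mapping is an assumption of this toy.
    """
    placements, used = {}, 0
    for pod, wanted in pod_requests.items():
        if used + wanted > node.allocatable:
            placements[pod] = None
            continue
        placements[pod] = used // node.replicas
        used += wanted
    return placements


node = Node(physical_gpus=1, replicas=4)
placements = place(node, {"fft-worker": 1, "gemm-worker": 1})
print(node.allocatable, placements)
```

Both pods are admitted and both map to physical GPU 0. Nothing in this accounting knows or cares that they will be competing for the same SMs and the same memory bus.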

The experiment is three Kubernetes Jobs: each agent alone (the baselines), and then both at once. The “both at once” manifest is the whole ballgame: two Jobs, each innocently asking for one GPU, deliberately landing on the same card:
containers:
  - name: worker
    image: localhost/kts-worker:v1
    imagePullPolicy: Never
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
Neither pod knows the other exists. Neither asked to share. The scheduler put them in the same room because, as far as it knows, there were four rooms. The baselines tell you how fast each agent runs when it owns the GPU; the shared run tells you what it pays for company. The gap between them is the entire story.
4. The rig, in one sentence
Everything below runs on a seven-year-old NVIDIA GTX 1080 (8 GB, Pascal) on single-node K3s with the stock NVIDIA device plugin and CUDA time-slicing. No H100, no MIG, no datacenter rack, just the card half the people reading this still have under their desk.
I’m using this antique on purpose. Bad scheduling doesn’t magically vanish on an H100; it just executes its bottlenecks at a higher clock speed. If your agents are fighting over a memory bus on a $150 card, throwing $30,000 at the problem won’t prevent the traffic jam; it just makes the crash more expensive and lets you execute bad architecture in fewer milliseconds. The physics of cache eviction don’t care what year your silicon was minted.
(Driver, containerd, and toolkit versions are pinned in the repo for anyone reproducing this; they are boring on purpose and you do not need them to follow the story.)
5. The receipts (i.e., the numbers)
Now the whole story in one picture:

Four panels, one punchline. The medians (the left pair of bars in each latency chart) are basically untouched. The throughputs (bottom row) lost a measly 7.3% and 1.4%, the kind of number you’d report up the chain and get a thumbs-up emoji for. And then there’s that top-right corner of the top-left chart: the small agent’s p99 jumped by 66%. Same dashboard, same Running pods, same boring throughput graph, and one of your two agents is now occasionally, unpredictably 66% slower than it was yesterday. Welcome to GPU sharing.
The exact numbers, so nobody has to squint at the bars:
| Metric | Solo | Shared | Change |
|---|---|---|---|
| FFT (latency-sensitive) p50 | 3.598 ms | 3.593 ms | insignificant |
| FFT p95 | 3.645 ms | 5.868 ms | 1.61× |
| FFT p99 | 3.679 ms | 6.101 ms | 1.66× |
| FFT jitter (p99/p50) | 1.02 | 1.70 | tail blows out |
| GEMM (heavy) p50 | 20.677 ms | 20.669 ms | insignificant |
| GEMM p95 | 20.896 ms | 24.505 ms | 1.17× |
| GEMM p99 | 20.985 ms | 24.690 ms | 1.18× |
| GEMM jitter (p99/p50) | 1.01 | 1.20 | slight |
| FFT throughput (iter/s) | 278.1 | 257.9 | −7.3% |
| GEMM throughput (iter/s) | 49.1 | 48.3 | −1.4% |
Read those FFT rows twice. The median didn’t move. If you were staring at a p50 dashboard you’d swear nothing happened, sign off, and go to lunch. But at least one in every hundred FFT calls now takes 66% longer (and the p95 row says roughly one in twenty takes about 61% longer), and the ratio between a bad iteration and a typical one went from 1.02 to 1.70. You didn’t slow the agent down on average; you made it occasionally, unpredictably late. Which is worse, because now it’s a flaky agent and nobody can reproduce it on a Friday afternoon.
This is the key asymmetry, and it isn’t a coincidence: the small, latency-sensitive agent degrades first and worst. The big GEMM is a bulldozer: it grabs its quantum and grinds through. The little FFT keeps getting tapped on the shoulder mid-stride, shoved off the SMs, and told to wait for its next turn. When two workloads share a single line, the one that needed to be quick is the one that suffers. This has huge implications in the telecom domain: if it keeps happening, calls start to drop and, if worst comes to worst, even emergency service numbers may stop working. Just let that thought sink in!
To make this comparable across any pair of agents, the tool computes a degradation factor (DF) = shared_p99 / baseline_p99. DF = 1.0 means sharing was free. Higher means it hurt. For this run it’s 1.66 for the FFT and 1.18 for the GEMM. That 1.66 is the entire post compressed into a number you can put on a slide to show your manager.
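As a quick check of that arithmetic, the snippet below recomputes DF from the p99 values in the table above. The function name `degradation_factor` and the dictionary layout are mine, chosen for illustration.

```python
def degradation_factor(shared_p99_ms: float, baseline_p99_ms: float) -> float:
    """DF = shared_p99 / baseline_p99; 1.0 means sharing cost nothing at the tail."""
    return shared_p99_ms / baseline_p99_ms


# p99 values copied from the results table (milliseconds).
p99_ms = {
    "FFT": {"solo": 3.679, "shared": 6.101},
    "GEMM": {"solo": 20.985, "shared": 24.690},
}

df = {
    name: degradation_factor(row["shared"], row["solo"])
    for name, row in p99_ms.items()
}
for name, value in df.items():
    print(f"{name}: DF = {value:.2f}")
```

The same two-line function works for any pair of baseline and shared runs, which is what makes DF handy as a single comparable number.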
And here’s the part that should be illegal: the throughput barely moved. If your SLO (Service Level Objective) is written in terms of average throughput, you’d look at “FFT down 7%, GEMM down 1%” and declare victory. Meanwhile your tail-sensitive agent is silently missing one deadline in fifty. Averages are where contention goes to hide. The mean is a kind soul who rounds your worst moments away. The p99 is the friend who remembers everything.
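You can watch the mean round the tail away with made-up numbers. The latencies below are synthetic (not measurements from this run): a hundred iterations at 3.6 ms, then the same hundred with five of them slowed to 6.0 ms. The throughput proxy is 1000 / mean latency.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput_proxy(latencies_ms):
    """Iterations per second implied by the mean latency."""
    return 1000.0 / mean(latencies_ms)


# Synthetic data for illustration only.
solo = [3.6] * 100
shared = [3.6] * 95 + [6.0] * 5

drop_pct = 100 * (1 - throughput_proxy(shared) / throughput_proxy(solo))
p50_ratio = percentile(shared, 50) / percentile(solo, 50)
p99_ratio = percentile(shared, 99) / percentile(solo, 99)

print(f"throughput proxy drop: {drop_pct:.1f}%")
print(f"p50 ratio: {p50_ratio:.2f}, p99 ratio: {p99_ratio:.2f}")
```

In this toy case the throughput proxy falls by about 3% and the median does not move, while the p99 grows by two thirds. A dashboard built on the first two numbers would stay green.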
One sanity check, then we move on. The profiler also scrapes DCGM GPU-utilization counters every 100 ms and joins them to each iteration. In the shared window, the FFT worker’s SM and DRAM activity rise sharply (its execution cycles now overlap with a GEMM hammering the same memory system); in the solo window, they don’t. So the contention shows up at two completely independent layers, application latency and hardware counters, which is how you know this is real and not a stopwatch artifact.
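A 100 ms scrape interval is much coarser than a 3.6 ms iteration, so most iterations will not contain a sample at all. One plausible way to do the join, shown here purely as an assumption of mine and not as the repo's actual rule, is to attach the sample nearest to each iteration's midpoint. The function names and the sample values are hypothetical.

```python
from bisect import bisect_left


def nearest_sample(sample_ts, sample_vals, t_ns):
    """Return the value whose timestamp is closest to t_ns (timestamps sorted ascending)."""
    i = bisect_left(sample_ts, t_ns)
    if i == 0:
        return sample_vals[0]
    if i == len(sample_ts):
        return sample_vals[-1]
    before, after = sample_ts[i - 1], sample_ts[i]
    return sample_vals[i] if (after - t_ns) < (t_ns - before) else sample_vals[i - 1]


def join_iterations_to_samples(iterations, samples):
    """iterations: [(start_ns, end_ns)]; samples: [(ts_ns, value)] sorted by ts_ns."""
    if not samples:
        raise ValueError("need at least one counter sample to join against")
    sample_ts = [ts for ts, _ in samples]
    sample_vals = [value for _, value in samples]
    return [
        nearest_sample(sample_ts, sample_vals, (start_ns + end_ns) // 2)
        for start_ns, end_ns in iterations
    ]


MS = 1_000_000  # nanoseconds per millisecond

# Hypothetical counter samples every 100 ms and three 3.6 ms iterations.
samples = [(0 * MS, 0.10), (100 * MS, 0.40), (200 * MS, 0.20)]
iterations = [
    (10 * MS, 10 * MS + 3_600_000),
    (95 * MS, 95 * MS + 3_600_000),
    (160 * MS, 160 * MS + 3_600_000),
]
joined = join_iterations_to_samples(iterations, samples)
print(joined)
```

This is why the epoch_ns_start and epoch_ns_end fields exist in each log line: they give the counter samples something to align against.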
6. This is about agent swarms, not any one workload
One could easily label section 5 as “an FFT and a matmul fought over a GPU, which surprises absolutely nobody who has ever written a CUDA kernel”, but that misses the point entirely. The two workers are just convenient, measurable stand-ins for a pattern that shows up the instant you put a real agent swarm on shared hardware:
- The light, deadline-driven agents: routers, guardrails, classifiers, tool-callers, small fast models. Cheap individually, constantly running, and the whole pipeline waits on them. (The FFT worker is one concrete example of this personality.)
- The heavy, steady agents: the big transformer forward passes, the GEMM-bound model calls that dominate compute. (The GEMM worker is one concrete example of that one.)
Put any two agents with these shapes on one time-sliced GPU and you get exactly what I measured: medians barely twitch, but the small, latency-critical agent eats the tail. It doesn’t matter what the agents do; it matters how they behave on the SMs: one needs to finish fast and often, the other just wants to grind. Time-slicing hands out turns. It doesn’t hand out deadlines. So the agent that lives or dies by its deadline is the one that suffers when its turn keeps getting interrupted.
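A toy simulation shows why turns without deadlines produce exactly this shape. Everything here is an assumption made for illustration: the 2 ms quantum, the strict alternation, and the fixed heavy-tenant schedule are invented, and real CUDA time-slicing is driver-controlled and far less tidy. Only the rough workload shapes (about 3.6 ms of work with a 100 ms sleep, about 20 ms of work with a 150 ms sleep) come from the article.

```python
import math

LIGHT_WORK_MS, LIGHT_SLEEP_MS = 3.6, 100.0    # roughly the FFT worker's shape
HEAVY_WORK_MS, HEAVY_PERIOD_MS = 20.0, 170.0  # roughly the GEMM worker: 20 ms busy, 150 ms idle


def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def heavy_is_busy(t_ms):
    """Toy assumption: the heavy tenant holds work for the first 20 ms of every 170 ms."""
    return (t_ms % HEAVY_PERIOD_MS) < HEAVY_WORK_MS


def light_latencies(n_jobs, shared, quantum_ms=2.0):
    """Latency of each light job when turns are handed out in fixed quanta."""
    latencies, t = [], 0.0
    for _ in range(n_jobs):
        start, remaining = t, LIGHT_WORK_MS
        while remaining > 1e-9:
            if shared and heavy_is_busy(t):
                t += quantum_ms  # the heavy tenant takes its turn first
            run = min(quantum_ms, remaining)
            t += run
            remaining -= run
        latencies.append(t - start)
        t += LIGHT_SLEEP_MS
    return latencies


solo = light_latencies(800, shared=False)
shared = light_latencies(800, shared=True)
for label, lat in (("solo", solo), ("shared", shared)):
    print(label, {f"p{p}": round(percentile(lat, p), 2) for p in (50, 95, 99)})
```

Most light jobs never meet the heavy tenant, so the median stays put; the minority that do collide wait out one or two quanta, and that minority is what the upper percentiles report.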
That’s the systems thread running through this whole series. Part 1 was about not repeating work across agents (share the KV cache). This part is about not lying to yourself about what sharing the GPU costs those agents. Time-slicing buys you capacity (more schedulable slots on one card) and gives you zero isolation. Watch only averages and your most deadline-sensitive agent breaks first, silently, in the p99, while every pod keeps flashing Running.
7. “So… how do I actually run it?”
The pipeline is deliberately boring, because in systems engineering, ‘exciting’ usually means production is on fire. It’s a linear build → cluster → logs → metrics graph driven from the repo root:
- run.py builds the worker image with Podman, imports it into K3s’ containerd, creates the namespace, optionally starts a DCGM scrape thread, applies the Jobs, waits, and collects logs into logs/run-./
- The workers emit the per-iteration KTS_APP lines you saw above.
- generate_results.py parses the logs, trims warmup, computes p50/p95/p99, the throughput proxy, the degradation factor, and the DCGM join, then writes data/summary.{csv,json}, the plots, and a docs/results.md.
On a node that already has K3s, the NVIDIA driver, the Container Toolkit, the device plugin, and the nvidia RuntimeClass, the whole thing is three commands:
# 1. Install the time-slicing ConfigMap and reload the device plugin
kubectl apply -f time-slicing-config.yaml
# 2. Build the worker image and run the full benchmark (build, import, Jobs, logs)
python3 run.py
# 3. Turn the logs into summaries, plots, and a results page
python3 generate_results.py
The repo link? Well, you can find it near the top of the article. And congratulations on making it this far; I hardly thought anyone ever would!
8. Honest caveats (because the comments are coming)
This is a small, deliberate study, not a datacenter capacity model. Here is exactly what it is not, before someone posts it for me:
- It’s two agents, not fifty. The config exposes four logical slots; the highlighted run pairs one FFT worker with one GEMM worker. That’s the smallest interesting contention case, picked for clarity. Filling all four slots (the full contention matrix) is on the roadmap, not in these numbers. I’m not reporting fifty-agent results, because I didn’t measure fifty agents.
- Throughput is a mean-rate proxy. 1000 / mean latency is an iteration rate, not request-serving throughput under a real arrival process. It earns its keep for the “averages hide the tail” point and nothing fancier.
- The workloads are synthetic. A looping FFT and a looping matmul are honest stand-ins for a light, latency-sensitive agent and a heavy inference agent, but they are not a fully served model behind real traffic. The interference shape generalizes; the absolute milliseconds don’t.
- DCGM activity is a low-magnitude proxy. The workers pace themselves with sleeps, so the GPU idles a lot and the SM/DRAM means look small. Treat them as relative, within-study signals: they corroborate the latency story, they don’t claim full saturation.
- Time-slicing is not the only sharing mode. As §7 lays out, this study deliberately measures the default path, the one most people get the moment they turn on GPU sharing. A head-to-head with MPS and MIG is a separate post.
- One GPU class, one run highlighted. Numbers come from a single Pascal GTX 1080. Newer GPUs context-switch faster and the absolute tails shrink; the direction (small latency-sensitive agent degrades first) is the durable result.
None of this moves the takeaway. It just keeps me honest about its scope, and the moment a benchmark post hides its caveats is the moment its numbers stop being worth anything.
9. Wrap (and the setup for Part 3)
Kubernetes time-slicing is a brilliant illusion. It tells your scheduler that one GPU is four, lets four pods report Running, and then quietly locks them in a room to fight over the memory bus. For throughput-bound, deadline-relaxed work, that illusion is harmless and genuinely useful. For the latency-sensitive members of an agent swarm, the illusion hides the damage exactly where you aren’t looking: the p99.
The solution isn’t to ban GPU sharing; you have to share hardware unless you have an infinite budget. The solution is to stop using a green YAML checkmark as a substitute for microarchitectural reality. Measure the tail, attribute the degradation, and schedule with actual silicon limits in mind. Kube-TimeSlice-Profiler is a step in the right direction: it turns the vague feeling of “the GPU seems slow today” into a measurable Degradation Factor with receipts.
If you came here as a beginner who just wanted to know why “both pods are Running” doesn’t mean “both agents are happy”: congratulations, you now understand GPU sharing better than the green checkmark does. Go ahead and distrust your averages, you’re ready!
Coming up next: The PCIe Walk of Shame

We just survived two agents fighting over a single GPU without lying to ourselves about the latency tail. But there’s another silent tax buried in every RAG pipeline: the PCIe commute.
Right now, every time an agent needs to retrieve context, it pauses, leaves the accelerator, crawls across the PCIe bus back to Python, runs a vector search on the CPU, and trudges all the way back.
In Part 3, we’re killing that commute. We’ll build a custom CUDA Top-K kernel to keep the entire retrieval loop trapped on the GPU hardware: no Python round-trips, no host-side delays. Same budget GPU. Same “stop wasting hardware” philosophy.
See you in Half 3.
Disclaimer: The illustrations in this article were generated using AI (Claude Opus 4.8). They are illustrative, not photographic, and any labels visible inside the images are stylized rather than authoritative; refer to the article body and the code itself for exact function names, metric values, and architecture details.















