NVIDIA RTX Spark — supported day one

Data center intelligence.
Powered by your gaming PC.

Intelligence shouldn't be owned by the hyperscalers alone. BareMetalRT is the world's first global GPU-native edge compute mesh. BareMetalRT uses NVIDIA's own TensorRT-LLM (Apache 2.0) CUDA kernels — 1,500+ hand-optimized .cu files for attention, GEMM, and quantization — the same engine that powers cloud inference APIs. Built for the edge, not the cloud.

The Breakthrough

The first heterogeneous tensor parallelism implementation on consumer hardware over commodity networks. No NVLink. No matched hardware. No Linux. This isn't a roadmap — it's working today.
FP32 GPU-native reduction — the theoretical precision floor
Prior attempts reduced in FP16 on the CPU, producing NaN at layer 23. Our CUDA kernel reduces in FP32 directly on GPU: 2.62×10⁻⁷ mean absolute error at the final layer, the irreducible noise of IEEE 754 arithmetic. 2,500× more accurate than FP16. No implementation at FP32 precision, on any hardware, can do better.
Asymmetric-tolerant TCP transport
NCCL requires Linux, NVLink, and identical GPUs. We replaced it entirely — custom TensorRT plugins intercept every collective call and route through a TCP transport that tolerates asymmetric compute. Each rank computes at its own speed. No barrier, no matched hardware.
TensorRT-LLM on Windows
NVIDIA discontinued Windows support for TensorRT-LLM. We maintain the only working Windows build of v0.12 — 1,500+ hand-optimized CUDA kernels for attention, GEMM, and quantization running natively. No WSL, no Docker, no Linux compatibility layer.
Synchronization, not the network
WiFi (316ms ping) and gigabit ethernet (1ms ping) produced identical throughput — 276ms vs 277ms per token. A 300× improvement in network latency yielded zero improvement in token speed. GPU synchronization overhead dominates, not the wire.
293ms → 185ms → 80ms — Mistral 7B over ethernet
14GB model split across a 4070 Super (12GB) and a 4060 Laptop (8GB) — too large for either GPU alone. Three optimization stages: overlapped AllReduce cut latency 37%, then KV caching shrank payloads from 49KB to 8KB for a further 57% reduction. 12.5 tok/s — faster than reading speed.
MoE: where the mesh wins
Dense TP pays 64 AllReduces per token. Mixture-of-Experts pays 2 point-to-point hops. Experts placed geographically, routed like a CDN. 75% of the mesh is idle per token — available for concurrent serving. Data centers scale via pipeline parallelism (degraded). The mesh scales via expert parallelism (it doesn't degrade).

Whitepaper

Heterogeneous Tensor Parallelism over Commodity TCP Networks via FP32 GPU-Native Reduction

Abstract

We present the first system that achieves heterogeneous tensor parallelism on ad-hoc consumer GPUs over commodity TCP networks, running Mistral 7B — a 14GB model too large for either GPU alone — across an RTX 4070 Super (12GB) and an RTX 4060 Laptop (8GB) at 80ms/token (12.5 tok/s). Every prior distributed inference project on consumer hardware chose pipeline parallelism, accepting idle GPUs and architectural limitations as unavoidable. Data-center systems (HexGen, Helix) support heterogeneous tensor parallelism but require managed clusters, Linux, and offline profiling.

The central contribution is a precision result: FP32 end-to-end computation achieves the theoretical floor of IEEE 754 arithmetic — identical to NCCL on NVLink and 2,500× more accurate than FP16. No implementation of tensor parallelism at FP32 precision, on any hardware, can do better. An asymmetric-tolerant TCP transport replaces NCCL, enabling GPUs with different VRAM and compute capabilities to participate in the same AllReduce without barrier stalls or protocol-level assumptions of symmetric hardware. The system is built on NVIDIA TensorRT-LLM ported natively to Windows.

Our benchmarks reveal a counterintuitive finding: in synchronous tensor parallelism, GPU synchronization overhead — not network latency — is the dominant bottleneck. A 300× improvement in network speed (WiFi to ethernet) yielded zero throughput improvement. We argue that Mixture-of-Experts architectures structurally favor distributed meshes over centralized data centers, replacing dense AllReduce with sparse expert routing over commodity networks.

1. Introduction

A 7-billion-parameter model doesn't fit on most consumer GPUs. A 70-billion-parameter model doesn't fit on any of them. But most households with a gaming PC have more than enough aggregate VRAM across machines to run these models — if those GPUs could work together. They can't. Over 100 million RTX-class NVIDIA GPUs sit in PCs and workstations[17], and almost none of them can combine their memory and compute with the GPU in the next room, let alone across a network.

This isn't a missing feature — it's a missing category. Every existing path forces a choice between quality and accessibility. Cloud inference (OpenAI, Anthropic, Google) uses data-center-grade hardware with optimized kernels — but is expensive, rate-limited, and geographically constrained. Local inference tools (Ollama, LM Studio, Jan, GPT4All) run on consumer hardware — but are limited to a single machine, use a general-purpose CUDA backend that leaves significant performance on the table, and can only run models that fit in one GPU's memory. Distributed inference projects (Petals[1], Exo[2]) can span multiple machines — but only via pipeline parallelism, which wastes half the available compute to pipeline bubbles and limits model architectures. The fastest inference kernels — NVIDIA's TensorRT-LLM[18], with 1,500+ hand-optimized CUDA routines — exist, but require homogeneous data-center GPUs communicating over NVLink or InfiniBand via NCCL[16].

No system combines the three things needed at once: optimized kernels that extract full performance from the hardware, tensor parallelism that keeps every GPU busy on every layer, and heterogeneous support that lets different GPUs with different VRAM and different compute capabilities — on different machines, over a commodity network — contribute to a single inference pass. That is the problem this paper solves. Moreover, the industry's shift toward Mixture-of-Experts (MoE) architectures — which activate only a fraction of parameters per token — creates a structural advantage for distributed meshes over centralized data centers, a direction we explore in Section 9.

2. Related Work

Parallelism strategies and the TP-requires-NVLink axiom. Model parallelism for large neural networks has been explored along two axes. Pipeline parallelism (PP), introduced by GPipe[5], partitions a model into sequential stages assigned to different devices. Megatron-LM[22] introduced intra-layer tensor parallelism (TP) for transformer models, splitting attention and MLP layers horizontally across GPUs connected by NVLink within a single DGX node — and explicitly prescribed that "tensor-parallel size should ideally be confined to the high-bandwidth intra-node network (NVLink domain)." DeepSpeed[20] combines PP with ZeRO-style memory optimization for training at scale. Alpa[26] automates the selection of inter- and intra-operator parallelism strategies via ILP, programmatically encoding the assumption that TP belongs on high-bandwidth links and PP on low-bandwidth links. Domino[25] — described as "the first work that provides a uniform Tensor Parallelism solution for both single-node and multi-node cases" — confirms that as of late 2024, no prior system had a working cross-node TP solution; even Domino addresses the problem by hiding communication behind computation rather than proving the communication itself is fast enough. All of these systems assume homogeneous hardware within each parallelism group and rely on NCCL[16] for collective communication — requiring Linux, NVLink or InfiniBand, and identical GPUs.

Distributed inference on consumer hardware. Petals[1] enables collaborative inference by distributing transformer blocks across volunteers over the internet, using pipeline parallelism exclusively. The authors explicitly state that tensor parallelism is infeasible in their setting due to per-layer activation transfers. Exo[2] similarly builds on pipeline parallelism for home clusters, acknowledging that inter-device latency makes per-layer synchronization impractical. Prima.cpp[11] achieves 30–70B inference on heterogeneous home clusters using pipelined-ring parallelism, reaching 674 ms/token for 70B over WiFi — but still avoids tensor parallelism entirely. Parallax[24] (under review at ICLR 2026) explicitly states that "tensor parallelism requires frequent, high-bandwidth communication, making it suitable only for tightly-coupled GPUs in a datacenter" and defaults to pipeline parallelism for decentralized settings. All of these systems accept pipeline bubbles and sequential stage execution as necessary tradeoffs. None supports heterogeneous GPUs within a parallelism group, and none addresses the precision loss that occurs when AllReduce is performed over a network rather than NVLink.

Offloading and single-GPU approaches. FlexGen[21] enables large model inference on a single GPU by offloading weights to CPU and disk, trading latency for accessibility. llama.cpp provides CPU and CUDA inference via a general-purpose backend but does not use TensorRT-LLM's optimized kernels and does not support multi-machine parallelism. These approaches are complementary: they maximize single-device utilization, while our work enables models too large for any single device.

Heterogeneous parallelism. AMP[10] explores asymmetric partitioning for training on heterogeneous GPU clusters, assigning different numbers of layers to devices based on compute capability — a form of asymmetric pipeline parallelism, not tensor parallelism. HexGen[8] supports asymmetric tensor and pipeline parallelism over heterogeneous GPUs and networks, using constrained optimization to assign workloads proportionally to device capability; its successor HexGen-2[9] extends this to disaggregated prefill/decode at ICLR 2025. Helix[13] formulates heterogeneous LLM serving as a max-flow problem over weighted GPU graphs. Hetis[14] introduces fine-grained dynamic parallelism at attention-head granularity for heterogeneous clusters at SC '25. All achieve significant throughput improvements over homogeneous baselines — but all assume managed datacenter clusters with known topology, require offline profiling and optimization passes, and build on existing Linux-only frameworks (vLLM, Megatron). None targets ad-hoc consumer hardware, WiFi or commodity TCP networks, or Windows.

Custom AllReduce and the NCCL bottleneck. Recent work validates that bypassing NCCL yields substantial gains: MSCCL++[6] achieves a 1.7× geomean speedup over NCCL (up to 5.4×) and is in production at Microsoft Azure. NVRAR[23] achieves 1.9–3.6× lower latency than NCCL for multi-node TP decode on 128KB–2MB messages. However, both target HPC interconnects (InfiniBand, Slingshot), not commodity TCP. Flash Communication[4] finds that on PCIe-connected L40 GPUs without NVLink, AllReduce accounts for 65% of TP inference latency, and proposes INT4 quantization of payloads as a workaround — further evidence that the field treats TP's communication cost as the barrier rather than questioning the underlying transport. Our work takes a different approach: rather than compressing payloads or hiding communication, we replace the transport entirely with a custom FP32 AllReduce over TCP that achieves IEEE 754 single-precision accuracy — identical to NCCL on NVLink — and demonstrate that for autoregressive decode, where AllReduce payloads are O(hidden_dim) per layer, even commodity Ethernet is sufficient.

3. Approach: FP32 Reduction and Asymmetric Tensor Parallelism

Every prior attempt at distributed LLM inference over consumer hardware chose pipeline parallelism — and accepted the bubbles, idle GPUs, and architectural limitations that come with it. Tensor parallelism was considered infeasible outside of homogeneous data center hardware. Our system solves the two problems that blocked it: precision loss during networked AllReduce, and synchronization failure across mismatched GPUs. Together, they enable heterogeneous tensor parallelism on consumer hardware over commodity networks — a setting no prior system has addressed.

As described in Section 2, pipeline parallelism (PP)[5] splits models vertically across stages, while tensor parallelism (TP)[22] splits each layer horizontally, keeping every GPU active on every layer. TP requires AllReduce synchronization at every layer boundary — and every existing implementation assumes identical GPUs.

[Figure: Pipeline parallelism vs. tensor parallelism. Pipeline parallelism: GPU A runs layers 1–11 and GPU B runs layers 12–22, with 2 network transfers per forward pass; execution is sequential, with bubbles, and each GPU is idle 50% of the time. Tensor parallelism: the 4070 Super (12GB, 56 SMs) and the 4060 Laptop (8GB, 24 SMs) both compute every layer from L1 to L22, with an AllReduce after each layer and 22 network round trips per forward pass. Different GPUs, different machines, identical output.]

The consensus in prior work is that per-layer synchronization makes TP infeasible over commodity networks (Section 2). Over WiFi (1-5ms per hop), a 32-layer model pays 32 round trips per token; over NVLink (600 ns), that cost is invisible. This assumption drove every previous project to choose PP. As our benchmarks reveal (Section 7d), GPU synchronization overhead, not network latency, dominates the per-layer cost — but latency was only part of the problem. Behind it was a second, harder wall: precision loss during AllReduce. We address precision in its own section (Section 4) because it is the central technical contribution of this work.

The second innovation is what makes heterogeneous TP possible: a transport layer that tolerates asymmetric compute. NCCL’s AllReduce algorithms — ring, tree, recursive halving — assume all ranks complete each layer at roughly the same time. When a 4070 Super finishes a layer in 2ms and a 4060 Laptop takes 4ms, NCCL’s synchronization model breaks down. Our TCP transport doesn’t assume symmetric timing. Each rank computes at its own speed, signals readiness, and the exchange happens when both sides are done. The slower GPU sets the compute pace, and network latency adds overhead at every synchronization point — but the faster GPU doesn’t stall the protocol, it simply waits on a non-blocking receive. This is what enables two GPUs with different SM counts, different VRAM, and different clock speeds to produce identical results.
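The exchange described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system's transport: the helper names (`recv_exact`, `rank_step`, `run_two_ranks`), the `socket.socketpair` loopback link, the sleep that stands in for layer compute, and the host-side summation are all invented for the example. The real system reduces on the GPU in FP32.

```python
import socket
import struct
import threading
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from a stream socket (TCP may deliver data in pieces)."""
    buf = bytearray()
    while len(buf) < nbytes:
        chunk = sock.recv(nbytes - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def rank_step(sock, partial, compute_seconds, results, rank):
    """One layer on one rank: compute, send own partial, wait for the peer's, sum."""
    time.sleep(compute_seconds)  # hypothetical stand-in for this rank's layer compute
    fmt = f"<{len(partial)}f"    # FP32 payload, little-endian
    payload = struct.pack(fmt, *partial)
    sock.sendall(payload)        # sending is the readiness signal
    # The faster rank waits here until the slower peer has sent its partial.
    peer = struct.unpack(fmt, recv_exact(sock, len(payload)))
    mine = struct.unpack(fmt, payload)
    # Host-side sum for illustration only; a + b == b + a, so both ranks agree.
    results[rank] = [a + b for a, b in zip(mine, peer)]


def run_two_ranks():
    link_a, link_b = socket.socketpair()
    results = {}
    fast = threading.Thread(
        target=rank_step, args=(link_a, [1.0, 2.5, -3.0, 0.125], 0.002, results, 0))
    slow = threading.Thread(
        target=rank_step, args=(link_b, [0.5, 0.25, 4.0, 8.0], 0.004, results, 1))
    fast.start()
    slow.start()
    fast.join()
    slow.join()
    link_a.close()
    link_b.close()
    return results


if __name__ == "__main__":
    print(run_two_ranks())
```

Neither rank needs to know how long the other takes: the only coupling is that a rank cannot finish its sum before the peer's partial has arrived, so the slower rank sets the pace without any shared barrier.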

Heterogeneous GPUs differ in SM count, clock speed, memory bandwidth, and tensor core count — all of which affect per-layer compute time. In a standard ring AllReduce[3], if one rank finishes its chunk while another is still computing, the ring stalls. NCCL’s collective algorithms were not designed for asymmetric participants. The cost of our approach is real: the slowest GPU sets the pace. But a 4070 Super (12GB) paired with a 4060 Laptop (8GB) can run a model too large for either card alone — the tradeoff is reduced throughput versus inability to run the model at all.

The technique is also transport-agnostic. We demonstrate it over TCP, but the same asymmetric-tolerant protocol could enable heterogeneous TP over NVLink or PCIe — a configuration NCCL does not support today, even within a data center.

4. Precision Correctness: The Theoretical Floor

Precision correctness is the central technical challenge of tensor parallelism over commodity hardware — and the reason every prior consumer-GPU project chose pipeline parallelism instead. This section presents the problem, the IEEE 754 arithmetic that governs it, our measurements across two models, and the design that achieves the theoretical floor.

4a. Why Tensor Parallelism Requires Precision That Pipeline Parallelism Does Not

In pipeline parallelism, each GPU runs a contiguous block of layers end-to-end and passes the final hidden state to the next GPU. The hand-off is a simple copy — no arithmetic is performed on the activation during transfer. The only precision requirement is that the tensor is faithfully serialized and deserialized, which any data type satisfies trivially.

Tensor parallelism is fundamentally different. Each layer’s computation is split across GPUs, and an AllReduce operation — a summation of partial results from every rank — must execute at every layer boundary. A typical LLM has 22–80 transformer layers, each containing two AllReduce points: one after the attention projection and one after the MLP. A 32-layer model performs 64 AllReduce operations per token. Every one of these operations is an arithmetic reduction — and every reduction is subject to floating-point rounding. If the rounding error is large enough, it compounds through subsequent layers until the model produces wrong output.

4b. IEEE 754: 10 Bits vs 23 Bits

The difference between FP16 and FP32 is not a matter of degree. It is structural.

IEEE 754 half-precision (FP16) stores 10 bits of mantissa, giving ~3.3 decimal digits of precision. IEEE 754 single-precision (FP32) stores 23 bits of mantissa, giving ~7.2 decimal digits. When two floating-point numbers are added, the smaller value is right-shifted to align exponents, and the trailing bits that fall off the mantissa are lost. In FP16, any addition where the operands differ by more than a factor of ~1,024 (2^10) discards the smaller value entirely. In FP32, this threshold is ~8.4 million (2^23).
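The absorption effect is easy to reproduce on the host. The sketch below emulates FP16 and FP32 rounding with Python's `struct` module (format codes `e` and `f`); the helper names are made up for the example, and the operands are chosen to sit well past the thresholds quoted above.

```python
import struct


def to_fp16(x):
    """Round a Python float (FP64) to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float (FP64) to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def add_fp16(a, b):
    """Add two values and round the result to FP16."""
    return to_fp16(to_fp16(a) + to_fp16(b))


def add_fp32(a, b):
    """Add two values and round the result to FP32."""
    return to_fp32(to_fp32(a) + to_fp32(b))


print(add_fp16(2048.0, 1.0))      # 2048.0: FP16 absorbs the smaller operand
print(add_fp32(2048.0, 1.0))      # 2049.0: FP32 keeps it
print(add_fp32(16777216.0, 1.0))  # 16777216.0: FP32 absorbs it at a ratio of 2**24
```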

In a row-parallel linear layer, each rank computes a partial matrix product and AllReduce sums the partials. These partial sums can differ by orders of magnitude — some ranks hold weight slices with large values, others with small values. FP16 summation silently drops the contribution of the smaller partial. FP32 preserves it.

This is not a theoretical concern. It is measurable at every layer of a real model.

4c. Measurements: Activation Drift Across Layers

We simulate TP=2 on a single GPU by splitting each row-parallel linear layer (attention output projection and MLP down projection) into two halves along the input dimension, computing each half independently, and summing via AllReduce. We compare the resulting hidden states at every layer against a single-GPU baseline. The experiment isolates the effect of the TP split and reduction — all other computation is identical.
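A toy version of this experiment fits in pure Python. It is a sketch only: it uses random data rather than model weights, emulates FP16/FP32 rounding on the host with `struct`, and compares against a high-precision `math.fsum` reference rather than a single-GPU run. All function names are hypothetical.

```python
import math
import random
import struct


def to_fp16(x):
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(xs, ws, rnd):
    """Dot product whose products and running sum are rounded by rnd at every step."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = rnd(acc + rnd(x * w))
    return acc


def tp2_row_parallel(x, weight_cols, rnd):
    """Split each dot product at the midpoint of the input dimension, compute the
    two halves independently, then sum them (the AllReduce step)."""
    half = len(x) // 2
    return [
        rnd(dot(x[:half], col[:half], rnd) + dot(x[half:], col[half:], rnd))
        for col in weight_cols
    ]


def mean_abs_error(x, weight_cols, rnd):
    """Mean absolute difference between the split result and a high-precision reference."""
    reference = [math.fsum(a * b for a, b in zip(x, col)) for col in weight_cols]
    split = tp2_row_parallel(x, weight_cols, rnd)
    return sum(abs(s - r) for s, r in zip(split, reference)) / len(reference)


rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(256)]
weight_cols = [[rng.uniform(-1, 1) for _ in range(256)] for _ in range(16)]
err32 = mean_abs_error(x, weight_cols, to_fp32)
err16 = mean_abs_error(x, weight_cols, to_fp16)
print(f"FP32 error {err32:.3e}, FP16 error {err16:.3e}")
```

The absolute numbers from this toy are not comparable to the tables below; it only shows the shape of the measurement, with the FP16 path landing orders of magnitude above the FP32 path.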

Table 1: TinyLlama 1.1B, FP32 vs FP16 model (22 layers, TP=2 simulated)
Layer    | FP32 Model Error | FP16 Model Error | Ratio (FP16 / FP32)
Layer 1  | 1.64e-9          | 4.91e-6          | 2,996×
Layer 5  | 2.64e-8          | 6.08e-5          | 2,301×
Layer 10 | 4.60e-8          | 1.22e-4          | 2,661×
Layer 15 | 7.02e-8          | 1.72e-4          | 2,447×
Layer 22 | 2.62e-7          | 6.73e-4          | 2,572×
Prompt: “The capital of France is”. Error = mean absolute difference of activation tensors vs FP32 single-GPU baseline. Both produce “Paris” for this short prompt.
Table 2: Mistral 7B, FP16 model (32 layers, TP=2 simulated)
Layer    | TP=2 Error vs Baseline | Max |activation|
Layer 1  | 1.66e-6                | 0.2
Layer 10 | 3.86e-5                | 247.4
Layer 20 | 1.32e-4                | 251.9
Layer 32 | 5.23e-4                | 181.0
FP16 model, same prompt. Error grows from 1.66e-6 at layer 1 to 5.23e-4 at layer 32, a 315× increase across the network. Activation magnitudes reach 252.

Table 1 is the key result. The FP32 column represents the irreducible error introduced by splitting a matrix multiply across two ranks. At layer 22, the mean absolute error is 2.62e-7 — seven orders of magnitude smaller than the activation values themselves. This is not a quality gap that could be improved with engineering. It is the theoretical floor of IEEE 754 single-precision arithmetic.

The floor exists because floating-point addition is not associative. A matrix multiply Y = X · W computed as a single operation produces Y = Σ_i x_i·w_i with one specific summation order. When the same multiply is split across two ranks (rank 0 computing i < N/2 and rank 1 computing i ≥ N/2), the partial sums are added in a different order. The difference is the rounding noise inherent in FP32’s 23-bit mantissa. Every distributed system has this floor, including NCCL on NVLink. There is no implementation of tensor parallelism, on any hardware, that avoids it.
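The same argument can be written with the standard floating-point rounding model. The symbols S_0, S_1, δ, and u are introduced here for illustration and do not appear elsewhere in the paper; u = 2^-24 is the unit roundoff of FP32 under round-to-nearest.

```latex
Y = \sum_{i=0}^{N-1} x_i w_i
  = \underbrace{\sum_{i < N/2} x_i w_i}_{S_0 \ (\text{rank } 0)}
  + \underbrace{\sum_{i \ge N/2} x_i w_i}_{S_1 \ (\text{rank } 1)},
\qquad
\operatorname{fl}(a + b) = (a + b)(1 + \delta), \quad |\delta| \le u = 2^{-24}.
```

Both orderings target the same exact sum Y. They differ only in which intermediate additions are rounded, so the single-rank and two-rank results can differ by a small multiple of u relative to the magnitudes being summed, and by nothing more.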

FP64 (double-precision) would reduce the floor to ~1e-16 at twice the memory cost and a steep loss of compute throughput, since consumer GPUs execute FP64 at a small fraction of their FP32 rate. The improvement is real but irrelevant: 1e-7 error cannot flip a logit. The gap between the top-1 and top-2 token in a confident prediction is typically 1–10 in logit space. An error of 1e-7 is ten million times smaller than the decision boundary. FP32 is effectively lossless for tensor-parallel AllReduce.

4d. The FP16 Failure Mode: Drift and Overflow

FP16 is not close to the floor. It is 2,500× above it.

At every layer, FP16 computation introduces ~2,500× more activation error than FP32 (Table 1, ratio column). The error is not in the AllReduce alone — on GPU, CUDA promotes FP16 additions to FP32 internally, so two-operand FP16 reduction is lossless. The error accumulates in the model computation between AllReduce calls: the matrix multiplies, the residual additions, the layer normalizations. Each of these operations loses bits when performed in FP16, and the losses compound through every subsequent layer.

On TinyLlama (22 layers), the accumulated FP16 error reaches 6.73e-4 at the final layer. For the prompt “The capital of France is”, this is not enough to flip the output — “Paris” dominates the logit distribution by a wide margin. But on harder prompts where the top-2 tokens are separated by a smaller gap — reasoning chains, code generation, ambiguous completions — an error of this magnitude can and will flip the winner. And once one wrong token enters the autoregressive loop, every subsequent token is conditioned on a mistake. The error doesn’t correct itself. It cascades.

On Mistral 7B (32 layers, Table 2), the error grows to 5.23e-4 with activation magnitudes reaching 252. This is still within FP16’s representable range (max 65,504) — but under adversarial conditions, the partial sums at AllReduce boundaries can be much larger. In our early TP experiments, CPU-side FP16 reduction — summing two partial activation tensors in half-precision on the host — produced outright NaN at layer 23. IEEE 754 half-precision overflows at 65,504; two large partial sums exceeded this range, producing Inf, which propagated as NaN through every remaining layer. The failure was silent: no error message, no warning, just garbage output. This is the class of failure that would face any system attempting FP16 tensor parallelism over a network — and the reason no prior project attempted it. Petals[1] and Exo[2] chose pipeline parallelism precisely to avoid per-layer reduction, sidestepping the problem rather than solving it.
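The overflow path can be reproduced on the host in a few lines. This is an emulation, not the original failing code: the partial-sum values are hypothetical, and because Python's `struct` raises an error where IEEE 754 hardware would return infinity, the helper maps that case to infinity explicitly.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite half-precision value


def to_fp16(x):
    """Round to half precision. Python's struct raises where IEEE 754 hardware
    would return infinity, so map that case to +/-inf (an emulation choice)."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def add_fp16(a, b):
    """Add two values and round the result to FP16."""
    return to_fp16(to_fp16(a) + to_fp16(b))


partial_rank0 = 40000.0  # hypothetical partial sums, each within FP16 range
partial_rank1 = 40000.0
reduced = add_fp16(partial_rank0, partial_rank1)  # 80000 exceeds 65504, so inf
downstream = reduced - reduced                    # inf - inf is nan
print(reduced, downstream)
```

Each partial is representable on its own; only their sum leaves the FP16 range. No exception is raised at the point of overflow, which matches the silent failure described above.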

4e. Design: FP32 End-to-End

We solve this with two design choices that work together. First, the entire computation path runs in FP32. Weights are stored in FP16 for memory efficiency, but every matmul, every residual connection, and every normalization executes in FP32 precision. This is the key insight: FP32 AllReduce alone is not sufficient. If the model computes in FP16 and only the reduction is FP32, the drift accumulates in the model layers between reductions. The entire computation path must be FP32 to reach the theoretical floor.

Second, reduction is performed on the GPU, never on the CPU, eliminating the overflow path documented in Section 4d.

4f. Live Validation

We validate the precision design on a real cross-machine TP=2 deployment: an RTX 4070 Super (desktop, 12GB, 56 SMs) and an RTX 4060 Laptop (8GB, 24 SMs) running Mistral 7B Instruct (14GB total — too large for either GPU alone) over ethernet. Three test prompts:

Prompt: “1+1=” → “2”
Prompt: “The capital of France is” → “Paris.”
Prompt: “A cat is a” → “small domesticated mammal known for its playful, curious and independent nature. They belong to the Felidae family...” (120 tokens, coherent)

All three produce correct, coherent output at ~80ms/token. The third prompt is particularly meaningful: 120 tokens of autoregressive generation, each conditioned on every previous token, across 32 layers with 64 AllReduce operations per token — a total of 7,680 cross-machine reductions with zero accumulated error visible in the output. This is the theoretical floor in practice: NCCL-equivalent precision over TCP, across heterogeneous consumer GPUs, on Windows.

5. System Design

The system is a GPU-native compute mesh built on three components: an orchestrator for node discovery and session management, a custom TCP transport replacing NCCL, and a TensorRT plugin layer that intercepts collective operations and routes them through the transport. Each is described below.

System Architecture
[Figure: System architecture. An orchestrator (discovery, sessions, rank assignment) accepts registrations from a daemon on each machine. Machine A (desktop, rank 0) and Machine B (laptop, rank 1) each run a TensorRT-LLM engine (FP16 weights, FP32 compute) with AllReduce and AllGather plugins, on a 4070 Super (12GB, 56 SMs) and a 4060 Laptop (8GB, 24 SMs) respectively. The machines are linked by TCP over WiFi or Ethernet. The plugins intercept AllReduce/AllGather and route them through the TCP transport instead of NCCL.]

Orchestrator

A central orchestrator handles node registration, session matching, and rank assignment. Daemon nodes on each machine register with the orchestrator, advertising their GPU capabilities (model, VRAM, SM count, compute capability). When compatible nodes come online, the orchestrator creates a session and connects them. The orchestrator does not participate in data transfer — it only manages topology.
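As a rough sketch of what registration and session matching could look like, the example below models the advertised capabilities as a record and pairs nodes under an assumed compatibility rule. The `Node` fields mirror the capabilities listed above, but the rule itself (same compute capability, combined VRAM at least the model size, rank 0 given to the node with more SMs) is an assumption made for illustration, not the orchestrator's documented policy.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Node:
    name: str
    gpu: str
    vram_gb: int
    sm_count: int
    compute_capability: str


def match_session(nodes, model_gb):
    """Return [(rank, node), ...] for the first pair of nodes that could host the
    model together, or None if no pair qualifies.
    Assumed rule: same compute capability and combined VRAM >= model size."""
    for a, b in combinations(nodes, 2):
        same_arch = a.compute_capability == b.compute_capability
        if same_arch and a.vram_gb + b.vram_gb >= model_gb:
            # Assumption: the node with more SMs becomes rank 0.
            ordered = sorted((a, b), key=lambda n: n.sm_count, reverse=True)
            return list(enumerate(ordered))
    return None


registry = [
    Node("laptop", "RTX 4060 Laptop", 8, 24, "8.9"),
    Node("desktop", "RTX 4070 Super", 12, 56, "8.9"),
]
session = match_session(registry, model_gb=14)
print(session)
```

The function returns only a topology (who holds which rank); it moves no tensor data, in keeping with the orchestrator's role described above.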

TCP Transport and AllReduce

The custom transport replaces NCCL entirely. Each rank bootstraps by exchanging capability information, then establishes a full-mesh of TCP connections. Each rank operates independently — no barrier, no assumption of symmetric compute speed — which is what enables heterogeneous GPUs to participate in the same AllReduce without protocol-level stalls. All reduction is GPU-native in FP32; no host-side arithmetic ever touches the tensor data, eliminating the FP16 overflow path of prior approaches.

TensorRT Plugin Integration

TensorRT-LLM's multi-GPU support is hardcoded to NCCL — every AllReduce, AllGather, and ReduceScatter operation in the engine graph calls into NCCL's collective communication library. There is no plugin interface or abstraction layer for swapping the communication backend. To replace NCCL with our TCP transport, we implemented custom TensorRT plugins that intercept every collective call at execution time and route it through the transport. The engine graph is unchanged; only the communication backend is swapped.

6. TensorRT-LLM on Windows

NVIDIA discontinued Windows support for TensorRT-LLM. We restored it on v0.12.0, the last release with partial Windows compatibility, and the latest at time of development. This is a prerequisite for consumer deployment: the vast majority of the 100M+ RTX GPUs cited in Section 1 run Windows, and requiring Linux would exclude most of the target hardware.

The port required patches at four layers of the build system:

Package management: Conan dependency profiles assumed Linux toolchains; we wrote Windows-native profiles for all C++ dependencies.
FMHA kernel compilation: fused multi-head attention kernels used GCC-specific intrinsics and POSIX paths; we ported these to MSVC with equivalent CUDA intrinsics.
Python bindings: nanobind's build configuration assumed Unix shared library conventions; we patched the CMake targets for Windows DLL generation.
MSVC/CUDA interop: the CUDA toolkit and MSVC have different C++ ABI expectations for exception handling and name mangling; we resolved linking failures across the toolchain boundary.

The result is a native Windows build that compiles and executes TensorRT-LLM inference engines — the full set of NVIDIA's 1,500+ hand-tuned CUDA kernels for attention, GEMM, and quantization — without WSL, Docker, or any Linux compatibility layer.

7. Experimental Evaluation

We demonstrated tensor parallelism across two heterogeneous GPUs on separate machines:

Rank 0: NVIDIA RTX 4070 Super — 12GB VRAM, 56 Streaming Multiprocessors
Rank 1: NVIDIA RTX 4060 Laptop — 8GB VRAM, 24 Streaming Multiprocessors

These GPUs differ in VRAM, SM count, and clock speed — a configuration NCCL cannot support. Both GPUs share the same SM architecture (Ada Lovelace, sm_89) — heterogeneous TP has only been confirmed across GPUs with identical SM architectures. Cross-architecture TP (e.g., Ampere + Ada) within the same TP group remains untested and may introduce divergence from differing instruction-level behavior. In MoE configurations, different experts can run on different architectures freely since experts never AllReduce with each other — the constraint applies only within each expert's TP group.

Experimental Setup

Software: NVIDIA TensorRT-LLM v0.12.0 (custom Windows build), CUDA Toolkit 12.6, cuDNN 9.3, TensorRT 10.4, Python 3.12, Windows 11 Pro on both machines. Engines built with trtllm-build using TP=2, FP16 weights with FP32 AllReduce accumulation.
Network: WiFi tests used 802.11ac (5GHz), measured ping 316ms round-trip under load. Ethernet tests used gigabit (1000BASE-T) via consumer router, measured ping 0.5-1ms.
Methodology: All latency measurements are per-token generation time during autoregressive decoding, measured after a 5-token warm-up period to ensure CUDA kernels are JIT-compiled and GPU clocks are boosted. Each reported number is the median of 20 consecutive tokens. Throughput is computed as 1/latency. AllReduce timing breakdowns use CUDA event-based instrumentation inserted at each stage of the D2H → TCP → H2D → reduce pipeline.
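The reporting rule in the methodology (drop 5 warm-up tokens, take the median of the next 20, compute throughput as 1/latency) can be expressed as a short helper. The function name and the sample timings are made up for illustration; the timings are synthetic and are not measurements from this paper.

```python
import statistics


def summarize(per_token_ms, warmup=5, window=20):
    """Discard warm-up tokens, take the median of the next `window` tokens,
    and derive throughput as 1 / latency."""
    measured = per_token_ms[warmup:warmup + window]
    if len(measured) < window:
        raise ValueError("not enough tokens after warm-up")
    latency_ms = statistics.median(measured)
    return latency_ms, 1000.0 / latency_ms


# Synthetic timings for illustration only: 5 slow warm-up tokens, then 20 steady ones.
samples = [140.0, 120.0, 100.0, 90.0, 85.0] + [79.0, 81.0] * 10
latency_ms, tok_per_s = summarize(samples)
print(f"{latency_ms} ms/token, {tok_per_s} tok/s")
```

Using the median rather than the mean keeps a single slow token (for example, one delayed by a background process) from shifting the reported latency.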

7a. Correctness — TinyLlama 1.1B

TinyLlama 1.1B (22 layers, FP32) was split horizontally, with each rank holding half the weights. Every transformer layer synchronizes via our custom TCP AllReduce with FP32 accumulation. Verified outputs across both ranks:

"1+1=" → "2" — both ranks agree ✓
"The capital of France is" → "Paris" — both ranks agree ✓
"A cat is a" → "domestic" — both ranks agree ✓

Identical outputs from two different GPUs, on two different machines. FP32 GPU-native reduction is what makes this possible — without it, accumulated precision loss corrupts the output by layer 15.

7b. Scale — Mistral 7B

To prove the architecture scales beyond toy models, we built and deployed Mistral 7B Instruct — a 7 billion parameter model with 32 transformer layers, grouped-query attention (8 KV heads), and 14GB of FP16 weights. This model cannot fit on either GPU alone: the 4070 Super has 12GB, the 4060 Laptop has 8GB. Without tensor parallelism, this model simply doesn't run on consumer hardware with less than 16GB VRAM.

With TP=2 across both machines, Mistral 7B produces coherent output:

"What is the capital of France?" → "The capital of France is Paris."
Generation latency: 80ms/token (12.5 tok/s) over gigabit ethernet with KV cache + overlapped AllReduce

7c. Benchmark Data

Performance Comparison
Configuration | Latency | Throughput | Notes
llama.cpp, 4070 Super single GPU | 3.4 ms/tok | 295 tok/s | Q8, single GPU baseline
BareMetalRT TP=2, localhost | 8.1 ms/tok | 123 tok/s | TinyLlama, same machine, no network
BareMetalRT TP=2, WiFi | 277 ms/tok | 3.6 tok/s | TinyLlama, 316ms ping, pre-overlap
BareMetalRT TP=2, Ethernet | 276 ms/tok | 3.6 tok/s | TinyLlama, 1ms ping, pre-overlap
BareMetalRT TP=2, Mistral 7B (early) | 185 ms/tok | 5.4 tok/s | Gigabit ethernet, overlapped AllReduce, 3-4ms/call, 49KB payload
BareMetalRT TP=2, Mistral 7B (current) | 80 ms/tok | 12.5 tok/s | Gigabit ethernet, KV cache + overlap, 1.2ms/call, 8KB payload

All TP=2 numbers measured over gigabit ethernet on the home LAN unless noted. WiFi 6 (5 GHz, ~2 ms RTT) adds roughly 2–3 ms per AllReduce call to the per-call cost — for Mistral 7B (64 calls/token) that's ~130–190 ms/token of WiFi tax versus ethernet.

7d. Analysis: Network Latency vs. Synchronization Overhead

Before overlap optimization: gigabit ethernet (1ms ping) and WiFi (316ms ping) produced identical throughput — 276ms vs 277ms per token. A 300× improvement in raw network latency yielded zero improvement in token generation speed. This initially appeared to contradict the assumption that network latency makes per-layer synchronization infeasible.

Instrumentation revealed why. In the synchronous (pre-overlap) implementation, each AllReduce call spent 1.5–5ms waiting for GPU compute to finish before any network transfer began. This synchronization wait dominated the per-call cost, masking the actual network transfer time. WiFi and ethernet produced identical results because neither was the bottleneck: the GPU sync wait was.

After overlap optimization: hiding network transfer behind the GPU sync wait reduced per-call AllReduce cost from 7–10ms to 3–4ms. At 64 calls per token (Mistral 7B, 32 layers with AllReduce + AllGather), 3–4ms per call multiplies out to 192–256ms; the ~185ms/tok figure corresponds to the low end of that per-call range.

This improved Mistral 7B generation from 293ms to 185ms/tok (5.4 tok/s) — a 37% latency reduction. Full layer-level pipelining (deferring recv+reduce to the next AllReduce call) was attempted but breaks TRT-LLM's buffer contract.

With recv hidden, the remaining AllReduce cost is dominated by the GPU sync wait — which is the irreducible time for each layer's compute to finish. This is near the theoretical floor for synchronous tensor parallelism. However, when sync_wait is short (fast layers or small tensors), recv time is not fully hidden, and network latency would become exposed. WiFi and ethernet would likely diverge under these conditions — a result not yet re-measured with the overlapped implementation. The pre-overlap finding that "the network is not the bottleneck" applied to the synchronous implementation; with overlap, network bandwidth becomes an increasingly significant factor for layers where compute is fast relative to transfer.

The Mistral 7B result (185ms/tok with overlap) is particularly significant. Despite having 32 layers (vs TinyLlama's 22) and roughly 6.5× the parameters, per-token latency is lower than pre-overlap TinyLlama (276ms), because Mistral's larger per-layer compute provides more sync wait time for recv to hide behind. Larger models have a better compute-to-communication ratio, suggesting TP scales favorably with model size.

KV cache effect on AllReduce payload. The 185ms/tok measurement was taken before the KV cache fix (Section 4f). With correct KV caching, the generation phase processes a single token per step instead of recomputing the full context. This shrinks the AllReduce payload from ~49KB (full activation tensor) to 8KB (single-token hidden state). At 8KB, each AllReduce call completes in ~1.2ms, down from 3–4ms at 49KB. Over 64 calls per token, this reduces the nominal AllReduce budget from ~224ms (3.5ms × 64) to ~77ms (1.2ms × 64). Combined with faster single-token GPU compute, generation latency drops to 80ms/tok (12.5 tok/s), a 2.3× improvement over the pre-KV-cache measurement. The AllReduce overlap optimization (hiding recv behind sync wait) compounds with the smaller payload: at 8KB, recv completes in <1ms, well within even the shortest sync waits.

Optimization trajectory. The three benchmark stages tell a story about where the remaining headroom lies:

293ms/tok → Synchronous AllReduce. Recv blocks the critical path. GPU sync wait (1.5–5ms) and network transfer (~2ms) are paid back to back, for a reported 7–10ms per call × 64 calls.
185ms/tok → Overlapped AllReduce. Recv runs during GPU sync wait. Per-call cost drops to 3–4ms. A 37% latency reduction from hiding I/O behind compute.
80ms/tok → KV cache fix. Generation-phase payload shrinks from 49KB to 8KB. Per-call cost drops to ~1.2ms. A further 57% reduction from sending less data.
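The arithmetic behind the three stages can be checked directly. The sketch below uses only figures quoted in this section (64 calls per token, the per-call ranges, and the per-token latencies); the function names are hypothetical. Note that per-call cost × 64 does not land exactly on the per-token figures, which is consistent with Section 7e's remark that these numbers mix measured per-call costs with projected per-token totals.

```python
CALLS_PER_TOKEN = 64  # Mistral 7B figure quoted in Section 7d

def tokens_per_second(ms_per_token):
    return 1000.0 / ms_per_token

def reduction_pct(before_ms, after_ms):
    return 100.0 * (before_ms - after_ms) / before_ms

def allreduce_budget_ms(per_call_ms):
    """Per-token AllReduce time implied by a per-call cost."""
    return per_call_ms * CALLS_PER_TOKEN

stages = [
    # (label, reported ms/tok, (low, high) per-call ms from the text)
    ("synchronous", 293, (7.0, 10.0)),
    ("overlapped", 185, (3.0, 4.0)),
    ("KV cache fix", 80, (1.2, 1.2)),
]

previous = None
for label, ms_tok, (lo, hi) in stages:
    line = f"{label:>12}: {ms_tok} ms/tok = {tokens_per_second(ms_tok):.1f} tok/s"
    line += f", implied AllReduce {allreduce_budget_ms(lo):.0f}-{allreduce_budget_ms(hi):.0f} ms"
    if previous is not None:
        line += f", {reduction_pct(previous, ms_tok):.0f}% lower than previous stage"
    print(line)
    previous = ms_tok
```

The same helper reproduces the WiFi estimate in Section 7c: 2–3 ms of extra cost per call over 64 calls is 128–192 ms per token.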

Each stage removed a different bottleneck: first the blocking recv, then the payload size. The remaining bottleneck is the GPU sync wait itself (~0.7ms per call), which is the time for each layer’s TensorRT compute to finish — the irreducible cost of the model’s arithmetic. Further gains require either reducing per-layer compute (quantization, pruning) or overlapping the next layer’s compute with the current layer’s reduction (layer-level pipelining, which we attempted but is blocked by TRT-LLM’s buffer contract). Larger models improve the ratio further: more compute per layer means more sync wait to hide communication behind, making AllReduce overhead a smaller fraction of per-token cost.

To contextualize: 12.5 tokens/second is not competitive with single-GPU inference on a model that fits in VRAM — llama.cpp on one 4070 Super runs a quantized model at 295 tok/s. But TP and single-GPU address fundamentally different constraints. Single-GPU runs models that fit on one card. TP splits a model that cannot fit on any single GPU across several for capability — and at 12.5 tok/s, the output streams faster than a human reads. The throughput is practical for interactive use, and the correctness result — identical output from mismatched GPUs over a commodity network — is the primary contribution.

7e. Measured Re-Baseline and the Async AllReduce Experiment (May 2026)

The Section 7c–7d numbers above (293 → 185 → 80 ms/tok) are a mix of measured per-call costs and projected per-token totals. To validate end-to-end, we re-baselined Mistral 7B TP=2 on the same Desktop (RTX 4070 SUPER) + Laptop (RTX 4060) over WiFi, running through the full production path (Cloudflare → droplet → WS bridge → daemon → TCP AllReduce).

v0.7.7 baseline (May 18, WiFi): 1498 ms/tok. Context-recompute generation (full sequence re-run per token).
v0.7.8 incremental KV-cache: 352 ms/tok. Generation step processes one token using cached attention K/V from prior steps.
v0.7.10 + gen-optimized build: 223 ms/tok. remove_input_padding=enable + opt_num_tokens=1 tell TRT to pick kernel tactics tuned for single-token generation rather than the max-input-len prefill profile.
v0.7.12 + async AllReduce: ~275 ms/tok mean, ~265 ms median, tighter p95. CPU returns from allReduce() immediately; TCP work runs as a CUDA host callback on the transport stream.

The earlier 80 ms/tok projection from Section 7c assumed a fully-overlapped CPU enqueue path: the CPU thread should be doing useful next-layer kernel launches while TCP is in flight. We implemented that explicitly in v0.7.12 via the host-callback pattern described below and measured the actual win on WiFi. The measured median improvement was small — well within the WiFi-noise envelope. The CPU enqueue work for one transformer layer turns out to be ~1–2 ms, so even fully hidden behind a 4 ms TCP wait, the savings per layer are smaller than per-call network variance. The architectural change is correct and gives a tighter p95 spread, but it isn’t the step-function we predicted.

The async AllReduce — first attempt (worker thread). The natural design was: record an event on the TRT stream, dispatch a job to a worker thread, return immediately; the worker performs the D2H + TCP + H2D + reduce and records a second event that the TRT stream waits on for its next op. The first implementation produced fluent-but-semantically-random output:

"The capital of France is" → "lid rip ownership induced Aus shade maintenance Hou Flor integrated zeketi lat..."  (~210 ms/tok, gibberish)

Speed proved the dispatch was working; gibberish proved the synchronization was not. Root cause: cudaStreamWaitEvent captures the event’s state at submission time. If the event has never been recorded (or its prior record has already completed), the captured state is “ready” and the wait is a no-op. The CPU thread’s submission of cudaStreamWaitEvent(trt_stream, event_h2d_done) happened before the worker had a chance to call cudaEventRecord(event_h2d_done, ...), so the TRT stream did not actually wait — next-layer kernels ran on unreduced data.

The correct pattern: cudaLaunchHostFunc. Keep all event semantics inside one stream’s submission order. The CPU thread submits the full sequence onto the transport stream — D2H copy, then a host callback that performs the TCP exchange, then H2D + GPU reduce, then a final event record. The CUDA driver runs the callback on its own worker thread when the stream reaches that op, blocking only the transport stream until the callback returns. The TRT stream’s wait on the final event then captures a well-defined pending state and the wait actually waits.
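The ordering hazard can be reproduced without a GPU. The sketch below is a pure-Python model, not CUDA code: `Stream`, `Event`, and `drain` are hypothetical stand-ins that mimic only the one rule described above, namely that a wait captures the event's most recent record at submission time and is a no-op if there is none.

```python
from collections import deque

class Event:
    """Holds the most recent record 'ticket', or None if never recorded."""
    def __init__(self):
        self.latest = None

class Stream:
    """An in-order queue of operations, loosely modeling a CUDA stream."""
    def __init__(self):
        self.ops = deque()

    def launch(self, fn):
        self.ops.append(("run", fn))

    def record(self, event):
        ticket = {"done": False}
        event.latest = ticket          # visible to any wait submitted after this point
        self.ops.append(("record", ticket))

    def wait(self, event):
        ticket = event.latest          # state is captured when the wait is submitted
        if ticket is not None:         # never-recorded event: the wait is a no-op
            self.ops.append(("wait", ticket))

def drain(streams):
    """Run the streams round-robin until every queue is empty."""
    while any(s.ops for s in streams):
        progressed = False
        for s in streams:
            if not s.ops:
                continue
            kind, payload = s.ops[0]
            if kind == "wait" and not payload["done"]:
                continue               # blocked on a record that has not executed yet
            s.ops.popleft()
            if kind == "run":
                payload()
            elif kind == "record":
                payload["done"] = True
            progressed = True
        if not progressed:
            raise RuntimeError("deadlock: every stream is blocked")

def run(record_submitted_first):
    trt, transport, done = Stream(), Stream(), Event()
    buf, seen = {"value": "partial"}, []

    def submit_transport():
        # Stands in for D2H copy + TCP exchange + H2D copy + GPU reduce.
        transport.launch(lambda: buf.update(value="reduced"))
        transport.record(done)

    def submit_trt():
        trt.wait(done)
        trt.launch(lambda: seen.append(buf["value"]))  # next layer reads the buffer

    order = [submit_transport, submit_trt] if record_submitted_first else [submit_trt, submit_transport]
    for submit in order:
        submit()
    drain([trt, transport])
    return seen[0]

print("wait submitted before record:", run(False))
print("record submitted before wait:", run(True))
```

In the first call the wait is submitted before any record exists, so the modeled TRT stream reads the unreduced buffer. In the second, the whole transport sequence including the final record is submitted first, so the wait captures a pending record and the read sees the reduced value.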

Per-call CUDA events are created fresh (cudaEventCreateWithFlags with cudaEventDisableTiming) and freed in a stream-ordered cleanup callback at the end of the sequence. The allocation cost is ~5 µs per event × 2 events × 64 ARs/token = ~0.6 ms/token — negligible against the ~250 ms/token total.

Where the host-callback pattern matters more than on home WiFi. The CPU-bound serialization that async eliminates scales with per-call CPU work. On Mistral 7B incremental generation over home WiFi, per-call CPU work is small and the median win is marginal. The same refactor pays significantly more on:

  • Internet-latency inference (mesh-over-WAN). At 10–100 ms per AR, a CPU thread blocked for 64 × 50 ms = 3.2 s per token is unworkable. Async lets the CPU prep many operations during long network waits.
  • Larger or sparser models (Mixtral, Llama-70B). More kernels per layer means more CPU enqueue work to overlap with each AR’s TCP latency.
  • Multi-user serving. One user’s AR-wait becomes another user’s CPU enqueue time.

The home-WiFi median result is therefore not the headline of this experiment. The headline is that the host-callback pattern is the architecturally correct primitive for distributed-inference work where TCP runs on a CUDA stream, and that the CPU-blocking serialization that previously dominated the analysis was a smaller fraction of per-token cost than the optimization-trajectory model predicted.

8. Limitations

Correctness is proven across two model scales. Overlapped AllReduce and the KV cache fix (Section 7d) together improved Mistral 7B latency from 293ms to 80ms/tok. The remaining limitations and optimizations:

Further computation-communication overlap — the current implementation overlaps TCP recv with GPU sync wait. Full layer-level pipelining — deferring the previous layer's recv+reduce while the next layer's compute begins — was attempted but breaks TRT-LLM's buffer contract. Achieving this would require modifying TensorRT's execution model or introducing a shadow buffer scheme.

Larger models improve the ratio: Mistral 7B (32 layers, 14GB) achieves nearly the same per-token latency as TinyLlama (22 layers, 2GB) despite roughly 6.5× more parameters (7B vs 1.1B). Each layer has more compute relative to the fixed AllReduce overhead. At 70B+ parameters, the AllReduce cost is expected to shrink to a small fraction of per-token time, which is exactly where TP shines.

Same SM architecture required within a TP group — heterogeneous TP has been confirmed only across GPUs sharing the same SM architecture (Ada Lovelace, sm_89). Cross-architecture pairs (e.g., Ampere sm_86 + Ada sm_89) within the same TP group remain untested. In MoE configurations, this constraint applies only within each expert's TP group — different experts can run on different architectures since they never AllReduce with each other.

Asymmetric weight splitting — currently, TP splits weight matrices evenly. With heterogeneous GPUs, proportional splitting (more columns to the GPU with more VRAM) would allow larger models to fit across mismatched hardware. The transport and reduction layers are already agnostic to split ratios.
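A minimal sketch of what proportional splitting could look like, assuming columns are assigned in fixed-size blocks and VRAM is the only weighting factor. The function name and the 128-column block size are assumptions for illustration; 14336 is Mistral 7B's MLP intermediate size, and 12 and 8 are the VRAM sizes from Section 7b.

```python
def proportional_split(total_columns, vram_gb, multiple=128):
    """Split `total_columns` across ranks in proportion to VRAM.

    Each share is rounded down to a whole number of `multiple`-column
    blocks; leftover blocks go to the ranks with the largest remainder.
    """
    if total_columns % multiple != 0:
        raise ValueError("total_columns must be a multiple of the block size")
    blocks = total_columns // multiple
    total_vram = sum(vram_gb)
    ideal = [blocks * v / total_vram for v in vram_gb]
    shares = [int(x) for x in ideal]
    leftover = blocks - sum(shares)
    by_remainder = sorted(range(len(vram_gb)),
                          key=lambda i: ideal[i] - shares[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return [s * multiple for s in shares]

# Columns per rank for a 12GB + 8GB pair.
print(proportional_split(14336, [12, 8]))
```

A real split would also have to respect attention-head boundaries and leave headroom for activations and KV cache on each device, which this sketch ignores.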

Concurrent serving — dense TP locks all ranks for every token. The current system serves one user at a time. Continuous batching — processing multiple users’ tokens in the same forward pass — would amortize AllReduce overhead across N users while growing the payload only linearly (8KB × N). TRT-LLM supports this on Linux via its C++ executor; enabling it over our TCP transport requires adapting the scheduling logic. MoE architectures (Section 9) solve this structurally: only 2 of 8 experts activate per token, so 75% of the mesh can serve other users concurrently without batching.

We prioritized proving correctness on the most challenging transport: WiFi between consumer laptops. The benchmarks demonstrate that at TP=2 for autoregressive decode, GPU synchronization overhead, not network transfer, is the observed bottleneck: a 300× improvement in network latency (WiFi to ethernet) yielded zero throughput improvement. Whether network latency becomes limiting after synchronization overhead is reduced (e.g. via persistent kernels or CUDA graphs across the AllReduce boundary), or at higher TP degrees where payload sizes and peer counts increase, remains an open question. What is settled is that the field's assumption that network latency alone ruled out cross-node TP was wrong: the system works, and the bottleneck is elsewhere.

Notably, our AllReduce uses a ring topology: each rank communicates only with its two neighbors, regardless of N. Scaling from TP=2 to TP=N increases the number of sequential ring steps to 2(N−1), but the per-chunk size shrinks to data_size/N, and per-node bandwidth remains constant. For Mistral 7B's hidden dimension (4096 × 4 bytes = 16KB per AllReduce), even TP=8 produces 14 steps of 2KB chunks — microseconds over gigabit Ethernet. The ring's linear scaling, combined with the small payload sizes inherent to autoregressive decode, suggests that commodity TCP may remain viable well beyond TP=2. We leave empirical validation at higher TP degrees to future work.
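The step and chunk counts above follow from the standard ring algorithm (reduce-scatter, then all-gather). The sketch below is a single-process simulation on Python lists, not the TCP transport itself; `ring_plan`, `wire_time_us`, and `ring_allreduce` are hypothetical names, and the wire-time helper counts serialization time only.

```python
def ring_plan(payload_bytes, world_size):
    """Return (sequential steps, bytes per chunk) for a ring AllReduce."""
    return 2 * (world_size - 1), payload_bytes // world_size

def wire_time_us(num_bytes, gbit_per_s=1.0):
    """Serialization time only; ignores per-hop latency and TCP overhead."""
    return num_bytes * 8 / (gbit_per_s * 1000.0)

def ring_allreduce(vectors):
    """Simulate a ring AllReduce (sum) over one equal-length vector per rank."""
    n = len(vectors)
    size = len(vectors[0])
    if size % n != 0:
        raise ValueError("vector length must be divisible by the number of ranks")
    chunk = size // n
    data = [list(v) for v in vectors]

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: at step s, rank r sends chunk (r - s) mod n to rank r + 1.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][span((r - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Rank r now holds the fully reduced chunk (r + 1) mod n.
    # All-gather: at step s, rank r forwards chunk (r + 1 - s) mod n to rank r + 1.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][span((r + 1 - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][span(c)] = payload
    return data

steps, chunk_bytes = ring_plan(4096 * 4, 8)
print(steps, "steps of", chunk_bytes, "bytes,", wire_time_us(chunk_bytes), "us each on the wire")

ranks = [[float(r + 1)] * 8 for r in range(4)]
print(ring_allreduce(ranks)[0])
```

Every step still pays one network hop of latency on top of the serialization time, so the per-token cost at higher TP degrees depends on hop latency more than on the chunk size.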

9. Future Work

9a. Mixture-of-Experts and Edge Compute

Dense transformer models are the worst case for edge tensor parallelism — every layer requires AllReduce across all ranks, every token. But the industry is moving to Mixture-of-Experts (MoE) architectures, and this changes the math entirely.

In a MoE model like Mixtral[7], each transformer layer contains multiple expert sub-networks, but only 2 of 8 experts activate per token. A gating network routes each token to its best-matching experts. The rest stay idle. This means most of the model's parameters exist but don't participate in any given forward pass — a property that is inefficient in a data center but advantageous for a distributed compute mesh.

Instead of splitting one dense model across GPUs that must synchronize every layer, each expert lives on a different node. A token arrives, the gate selects two experts, and only those two nodes compute. No AllReduce across all ranks — just point-to-point communication between the gate and the selected experts. The per-layer synchronization bottleneck that makes dense TP painful over WiFi largely disappears.
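The routing step can be sketched in a few lines. This is an illustrative top-2 gate in plain Python, assuming the gate produces one logit per expert and the two selected experts' outputs are combined with softmax-normalized weights; `route_top2` and `combine` are hypothetical names, and the expert outputs stand in for results returned by remote nodes.

```python
import math

def route_top2(gate_logits):
    """Pick the two highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:2]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def combine(expert_outputs, routes):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(next(iter(expert_outputs.values())))
    out = [0.0] * dim
    for expert_id, weight in routes:
        for k, x in enumerate(expert_outputs[expert_id]):
            out[k] += weight * x
    return out

# Eight gate logits for one token; only two experts are contacted.
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
routes = route_top2(logits)
# Hypothetical partial results returned by the two selected nodes.
partials = {1: [2.0, 4.0], 4: [4.0, 8.0]}
print(routes, combine(partials, routes))
```

Only the nodes named in `routes` receive the token's hidden state, which is the point-to-point pattern described above; the other six experts see no traffic for this token.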

The compute mesh topology becomes an advantage, not a liability. Experts can be placed geographically — routed to the nearest available GPU with that expert hot in VRAM, like a CDN for inference. A gamer in Jakarta holds Expert 3, a researcher in Nairobi holds Expert 7. A token that needs both pays two point-to-point hops, not a global AllReduce.

The two parallelism strategies also compose. If a single expert is too large for one GPU, it can be tensor-parallelized across nearby GPUs — TP within the expert, expert parallelism across the mesh. This creates two tiers of communication: EP across the internet for routing tokens to the right expert (point-to-point, latency-tolerant, one hop), and TP across local GPUs for computing within the expert (AllReduce over ethernet at 0.1ms, not WiFi at 5ms). Since only 2 of 8 experts activate per token, 75% of the mesh is idle for any given request — available to serve other tokens concurrently.

Data center architecture is overprovisioned for MoE. Dense transformers activate every parameter on every token — they fully utilize NVLink’s 900 GB/s bandwidth and justify the cost of keeping every GPU in a DGX node busy. MoE inverts this: 75% of parameters are idle per token, and the communication pattern is sparse point-to-point, not dense AllReduce. Data centers compensate by batching aggressively, but this requires sustained high request volume. A $3M GB200 NVL72 rack has 75% of its 13.8TB VRAM idle per token. A mesh inverts the cost structure: idle VRAM costs nearly nothing because it sits in consumer hardware that would otherwise be sleeping, and expert routing needs only commodity TCP. Moreover, data centers scale beyond NVLink via pipeline parallelism — a degraded mechanism that increases sequential depth and introduces bubbles. MoE on a mesh scales via expert parallelism: each expert occupies its own node(s), aggregate VRAM grows linearly with mesh size (100 nodes × 12GB = 1.2TB), and per-token communication cost does not increase. Data centers scale capacity by adding pipeline stages. The mesh scales capacity by adding participants. One degrades utilization; the other does not.

Mesh formation. The orchestrator already solves the assignment problem for dense TP: nodes register, advertise GPU capabilities, and receive rank assignments. MoE extends this with expert placement. When a node joins the mesh, the orchestrator evaluates its VRAM capacity, network latency to existing nodes, and which experts are underserved — then assigns one or more experts and pushes the corresponding weight shards. If an expert is too large for a single GPU, the orchestrator groups nearby low-latency nodes into a TP group for that expert, exactly as it does today for dense models. The same heartbeat and session infrastructure that manages dense TP sessions manages expert placement — the unit of assignment changes from "rank in a TP group" to "expert shard on a node."
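As a rough sketch of the placement step, the class below assigns each joining node the least-replicated experts that fit in its free VRAM. It is a simplification under stated assumptions: all experts are the same size, network latency and TP grouping are ignored, and `Orchestrator`, `join`, and `expert_size_gb` are hypothetical names rather than the production orchestrator's API.

```python
from dataclasses import dataclass, field

@dataclass
class Orchestrator:
    num_experts: int
    expert_size_gb: float
    placements: dict = field(default_factory=dict)  # expert id -> list of node ids

    def __post_init__(self):
        for expert in range(self.num_experts):
            self.placements.setdefault(expert, [])

    def join(self, node_id, free_vram_gb):
        """Assign the joining node as many of the least-replicated experts as fit."""
        capacity = int(free_vram_gb // self.expert_size_gb)
        assigned = []
        for _ in range(min(capacity, self.num_experts)):
            candidates = [e for e in self.placements if node_id not in self.placements[e]]
            # Prefer the expert with the fewest replicas; break ties by expert id.
            expert = min(candidates, key=lambda e: (len(self.placements[e]), e))
            self.placements[expert].append(node_id)
            assigned.append(expert)
        return assigned

mesh = Orchestrator(num_experts=8, expert_size_gb=10.0)
for n in range(24):
    mesh.join(f"node{n}", free_vram_gb=12.0)
print({expert: len(nodes) for expert, nodes in mesh.placements.items()})
```

With 24 single-expert nodes and 8 experts, this greedy rule ends with three replicas of every expert. A fuller version would weigh latency to existing replicas and form TP groups for experts that exceed one GPU.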

Redundancy and fault tolerance. Dense TP has no tolerance for node failure — if any rank drops, the entire inference pass is invalid, because every layer's AllReduce requires all participants. This is equally true in data centers: a failed GPU in an 8-way TP job kills the job. Data centers mitigate this with hardware reliability (ECC, redundant power, hot-spare nodes) and checkpoint-restart at the job scheduler level — not at the model level. MoE on a mesh offers a structural alternative. Since only 2 of 8 experts activate per token and 75% of nodes are idle at any moment, the orchestrator can maintain replicated experts — the same expert weights loaded on multiple nodes in different locations. When the orchestrator's heartbeat detects a node failure (missed heartbeats over the existing TCP keepalive), it re-routes subsequent tokens to a surviving replica of that expert. No checkpoint, no restart, no lost inference state — the next token simply goes elsewhere. The replication factor is a function of mesh size: with 24 nodes serving 8 experts, each expert can have 3 replicas at no additional VRAM cost beyond what the mesh already provides. The orchestrator's placement algorithm treats replica count as a parameter alongside VRAM capacity and network latency, preferring geographic diversity so that a local network outage does not eliminate all copies of the same expert.
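The re-routing rule described here reduces to "send the token to the first replica whose heartbeat is fresh". The sketch below illustrates that rule with explicit timestamps so it runs deterministically; `ReplicaRouter`, its methods, and the 3-second timeout are assumptions for illustration, not the orchestrator's actual interface.

```python
class ReplicaRouter:
    """Route each expert request to the first replica with a recent heartbeat."""

    def __init__(self, replicas, heartbeat_timeout_s=3.0):
        self.replicas = replicas          # expert id -> ordered list of node ids
        self.timeout = heartbeat_timeout_s
        self.last_seen = {}               # node id -> time of last heartbeat

    def heartbeat(self, node_id, now):
        self.last_seen[node_id] = now

    def is_alive(self, node_id, now):
        seen = self.last_seen.get(node_id)
        return seen is not None and now - seen <= self.timeout

    def route(self, expert_id, now):
        for node_id in self.replicas[expert_id]:
            if self.is_alive(node_id, now):
                return node_id
        raise RuntimeError(f"no live replica for expert {expert_id}")

router = ReplicaRouter({3: ["jakarta", "nairobi"]})
router.heartbeat("jakarta", now=0.0)
router.heartbeat("nairobi", now=0.0)
print(router.route(3, now=1.0))   # both replicas are fresh

router.heartbeat("nairobi", now=4.0)
print(router.route(3, now=5.0))   # jakarta has missed its heartbeats
```

Because routing is decided per token, a failed node affects only requests in flight to it; the next token for that expert goes to the surviving replica with no restart.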

Concurrent serving. Dense TP locks every GPU into a single inference pass. MoE on a mesh does not: since only 2 of 8 experts activate per token, 75% of nodes can simultaneously serve other users’ tokens. Multiple inference requests flow through the mesh concurrently without explicit batching or model replication — the sparsity that makes MoE parameter-efficient also makes it serving-efficient.

Our transport layer already supports arbitrary world sizes and rank-to-rank communication. The infrastructure for MoE routing is the same TCP mesh we built for tensor parallelism — the difference is traffic pattern, not protocol. Dense TP validates the transport layer; MoE architectures are where it achieves practical scalability.

Figure: MoE Hybrid Parallelism (EP + TP)
A gate/router selects 2 of 8 experts per token. Expert 1 (active, Jakarta) uses TP within the expert: GPU A holds Expert 1a and GPU B holds Expert 1b, joined by a LAN AllReduce (~0.1ms over ethernet). Expert 5 (active, Nairobi) fits on a single GPU (GPU C, full expert) and needs no AllReduce. Experts 2, 3, and 4 are shown idle. Both active experts send a partial result to a Combine & Output stage: 2 point-to-point hops, not 22 global AllReduces. EP runs across the internet (latency-tolerant) and TP runs within experts (LAN-speed). 6 of 8 experts are idle per token and available for other requests.

9b. Distributed KV Cache

Foundation models increasingly support context windows of 128K–1M+ tokens. The KV cache required to serve these contexts grows linearly with sequence length — at 128K tokens, Mistral 7B's KV cache alone consumes 4–8GB per GPU, approaching or exceeding the total VRAM of consumer hardware. This is the next wall after model size: even when tensor parallelism enables a model's weights to fit across multiple GPUs, a long context window can exhaust the remaining VRAM on the smallest device.
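The 4–8GB figure can be reproduced from the model's shape. The sketch below uses the 32 layers and 8 KV heads quoted in Section 7b and assumes a head dimension of 128 (hidden size 4096 over 32 attention heads), KV heads split evenly across TP=2, and that K/V for every token in the context is retained. The function name is hypothetical.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128,
                   bytes_per_value=2, tp=1):
    """Bytes of KV cache per GPU for a given context length."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = one K and one V
    return tokens * per_token // tp

GIB = 1024 ** 3
context = 128 * 1024
for label, width in (("FP16", 2), ("FP8", 1)):
    per_gpu = kv_cache_bytes(context, bytes_per_value=width, tp=2)
    print(f"{label}: {per_gpu / GIB:.0f} GiB per GPU at TP=2")
```

Under these assumptions the cache is 128 KiB per token in FP16, so the low end of the quoted range corresponds to an 8-bit cache and the high end to FP16.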

Existing solutions trade latency for capacity. FlexGen[21] offloads KV cache to CPU RAM and disk, solving the memory problem but adding 10–50ms per page fault. Ring Attention[12] distributes the sequence across devices, requiring all devices to participate in every attention computation — a global synchronization cost that scales with world size. PagedAttention (vLLM) manages KV cache as virtual memory pages on a single GPU but cannot help when that GPU is full.

Our mesh architecture enables a different approach: distributed KV cache across idle nodes. The orchestrator already tracks every node's GPU model, VRAM capacity, and current utilization. The TCP transport already moves tensors between GPUs with pinned-memory staging. In an MoE configuration, 75% of expert nodes are idle for any given token — their VRAM is empty and available. Instead of evicting old KV entries to CPU or dropping context, we propose paging them to peer GPUs in the mesh that have free VRAM.

This creates a three-tier memory hierarchy for KV cache:

Tier 1 — Local GPU VRAM (hot): Recent tokens, zero additional latency. Standard KV cache behavior.
Tier 2 — Peer GPU VRAM (warm): Older tokens paged to idle nodes in the mesh. Retrieval cost is one TCP round trip (1–5ms over ethernet) — slower than local VRAM but faster than disk I/O (10–50ms) and competitive with CPU offloading when accounting for the H2D copy back to GPU.
Tier 3 — CPU RAM / disk (cold): Oldest tokens, last resort. Same as existing offloading approaches.

The critical difference from existing offloading schemes is that Tier 2 uses GPU VRAM on peer nodes, not local CPU RAM. A fetch from peer GPU VRAM over ethernet is a DMA-speed copy on both ends with a network hop in between — the same staged pipeline (GPU → pinned host → TCP → pinned host → GPU) already proven in our AllReduce implementation. Unlike Ring Attention, only the node that needs a specific KV block fetches it — no global synchronization, no ring topology, just point-to-point retrieval.

In MoE configurations, this composes naturally with expert routing. The gate routes a token to two active experts; the remaining six expert nodes are idle, their combined VRAM forming a distributed memory pool. The same orchestrator that routes computation can route KV cache placement — assigning pages to the nearest node with free VRAM, like a CDN for memory. As context grows, the mesh absorbs it across available nodes rather than forcing each node to manage the problem alone.

Three prerequisites make this approach viable: (1) an orchestrator with real-time visibility into per-node VRAM utilization, (2) a transport layer capable of GPU-to-GPU tensor transfer across the network, and (3) idle nodes with empty VRAM available as storage. Our system already satisfies all three. The remaining work is a cache eviction policy (LRU with locality-aware placement) and integration with TensorRT-LLM's attention kernels to support non-contiguous KV cache reads. The same mesh that routes computation can also route memory, reducing dependence on centralized cloud infrastructure.
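One possible shape for the eviction policy is a chain of LRU tiers in which a block evicted from one tier is demoted to the next. The sketch below models the three tiers as in-memory dictionaries; it omits locality-aware placement and any real transfer, and `TieredKVCache` and its methods are hypothetical names.

```python
from collections import OrderedDict

class TieredKVCache:
    """Three LRU tiers: local GPU (hot), peer GPU (warm), CPU/disk (cold)."""
    TIERS = ("local", "peer", "cold")

    def __init__(self, local_blocks, peer_blocks):
        # The cold tier is treated as unbounded in this sketch.
        self.capacity = {"local": local_blocks, "peer": peer_blocks, "cold": None}
        self.tiers = {name: OrderedDict() for name in self.TIERS}

    def _insert(self, tier_index, block_id, data):
        name = self.TIERS[tier_index]
        tier = self.tiers[name]
        tier[block_id] = data
        tier.move_to_end(block_id)                            # mark as most recently used
        cap = self.capacity[name]
        if cap is not None and len(tier) > cap:
            victim_id, victim = tier.popitem(last=False)      # least recently used
            self._insert(tier_index + 1, victim_id, victim)   # demote one tier down

    def put(self, block_id, data):
        for tier in self.tiers.values():
            tier.pop(block_id, None)
        self._insert(0, block_id, data)

    def get(self, block_id):
        """Return (data, tier it was found in) and promote the block to local."""
        for name in self.TIERS:
            if block_id in self.tiers[name]:
                data = self.tiers[name].pop(block_id)
                self._insert(0, block_id, data)
                return data, name
        raise KeyError(block_id)

    def where(self, block_id):
        for name in self.TIERS:
            if block_id in self.tiers[name]:
                return name
        return None

cache = TieredKVCache(local_blocks=2, peer_blocks=2)
for block in range(5):
    cache.put(block, f"kv-{block}")
print({block: cache.where(block) for block in range(5)})
print(cache.get(1)[1], "->", cache.where(1))
```

In a real system the peer tier would map each block to a specific node chosen by the orchestrator, and a `get` from that tier would cost one network round trip rather than a dictionary lookup.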

10. Conclusion

We presented the first implementation of heterogeneous tensor parallelism across consumer GPUs on separate machines over commodity TCP networks. By solving two problems that blocked prior attempts — precision loss during networked AllReduce and synchronization failure across mismatched GPUs — we demonstrated that tensor parallelism is feasible outside of homogeneous data center hardware. Our benchmarks revealed that in synchronous tensor parallelism, GPU synchronization overhead — not network latency — is the dominant bottleneck: a 300× improvement in network latency yielded zero improvement in throughput. Overlapping TCP recv with GPU synchronization reduced per-call AllReduce cost by 50–60%. Combined with correct KV caching — which shrinks the generation-phase AllReduce payload from 49KB to 8KB — Mistral 7B generation improved from 293ms to 80ms/tok (12.5 tok/s).

The system produces identical output from two different GPUs (RTX 4070 Super and RTX 4060 Laptop) running Mistral 7B — a 14GB model too large for either GPU alone — over gigabit ethernet. The industry's shift toward Mixture-of-Experts architectures, where only a fraction of parameters activate per token, structurally favors distributed meshes over centralized data centers: expert routing requires only point-to-point communication over commodity networks, not AllReduce over NVLink, and data centers scale past their NVLink domains via pipeline parallelism — a degraded mechanism that reintroduces the bubbles and idle GPUs that tensor parallelism was designed to eliminate. A mesh scales via expert parallelism, which does not degrade. Dense tensor parallelism proved the transport works. Mixture-of-Experts is where it becomes practical.

References

[1] Borzunov, A., et al. "Petals: Collaborative Inference and Fine-tuning of Large Models." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), System Demonstrations, 2023.

[2] Exo Explore. "exo: Run your own AI cluster at home with everyday devices." github.com/exo-explore/exo, 2024.

[3] Gibiansky, A. "Bringing HPC Techniques to Deep Learning." Baidu Research, 2017.

[4] He, Q., et al. "Flash Communication: Reducing Tensor Parallelism Bottleneck for Fast LLM Inference." arXiv:2412.04964, 2024.

[5] Huang, Y., et al. "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." Advances in Neural Information Processing Systems (NeurIPS), 2019.

[6] Hwang, C., et al. "MSCCL++: Rethinking GPU Communication Abstractions for AI Inference." arXiv:2504.09014, 2025.

[7] Jiang, A.Q., et al. "Mixtral of Experts." arXiv:2401.04088, 2024.

[8] Jiang, Y., et al. "HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment." Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.

[9] Jiang, Y., et al. "HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment." Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025.

[10] Lee, S., et al. "AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness." Advances in Neural Information Processing Systems (NeurIPS), 2022.

[11] Li, Z., et al. "Prima.cpp: Fast 30–70B LLM Inference on Heterogeneous and Low-Resource Home Clusters." arXiv:2504.08791, 2025.

[12] Liu, H., et al. "Ring Attention with Blockwise Transformers for Near-Infinite Context." arXiv:2310.01889, 2023.

[13] Mei, X., et al. "Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs." Proceedings of the 30th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025.

[14] Mo, Z., et al. "Hetis: Serving LLMs in Heterogeneous GPU Clusters with Fine-grained and Dynamic Parallelism." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2025.

[15] NVIDIA. "CUDA C++ Best Practices Guide: Data Transfer Between Host and Device." docs.nvidia.com/cuda/cuda-c-best-practices-guide, 2024.

[16] NVIDIA. "NCCL: NVIDIA Collective Communications Library." developer.nvidia.com/nccl, 2024.

[17] NVIDIA. "Over 100 Million RTX AI PCs and Workstations." NVIDIA Newsroom, CES 2024, January 2024.

[18] NVIDIA. "TensorRT-LLM." github.com/NVIDIA/TensorRT-LLM, 2024.

[19] Patarasuk, P. and Yuan, X. "Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations." Journal of Parallel and Distributed Computing, Vol. 69, No. 2, 2009.

[20] Rasley, J., et al. "DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.

[21] Sheng, Y., et al. "FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU." Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.

[22] Shoeybi, M., et al. "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv:1909.08053, 2019.

[23] Singhania, P., et al. "LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication." arXiv:2511.09557, 2025.

[24] Tong, C., et al. "Parallax: Efficient LLM Inference Service over Decentralized Environment." arXiv:2509.26182, 2025. Under review at ICLR 2026.

[25] Wang, G., et al. "Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping." arXiv:2409.15241, 2024.

[26] Zheng, L., et al. "Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning." Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2022.