PCI Express bandwidth considerations
Real-time GPU video processing systems live or die on PCI Express bandwidth. Composer pairs multiple Blackmagic Decklink capture cards with NVIDIA GPUs for compositing; modern motherboards offer impressive theoretical bandwidth, but a single oversight in slot wiring or CPU-lane allocation can throttle the GPU to a fraction of its capability and turn an otherwise capable host into a frame-dropping liability. This section is the deep-dive on what to watch for.
PCIe bandwidth refresher
PCIe bandwidth is the product of lane count and per-lane speed, and the per-lane speed is set by the generation. Each successive generation roughly doubles the per-lane bandwidth:
| Generation | Bandwidth per lane | x8 slot | x16 slot |
|---|---|---|---|
| PCIe 2.0 | ~500 MB/s | ~4 GB/s | ~8 GB/s |
| PCIe 3.0 | ~1 GB/s | ~8 GB/s | ~16 GB/s |
| PCIe 4.0 | ~2 GB/s | ~16 GB/s | ~32 GB/s |
| PCIe 5.0 | ~4 GB/s | ~32 GB/s | ~64 GB/s |
Modern NVIDIA GPUs are designed to run at PCIe x16. They remain functional at fewer lanes — the slot just delivers proportionally less bandwidth.
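The slot figures in the table are just per-lane speed multiplied by lane count. The sketch below reproduces that arithmetic; it assumes the rounded per-lane values from the table (per direction), and the function name `slot_bandwidth_gb_s` is hypothetical.

```python
# Approximate per-lane bandwidth in GB/s, rounded as in the table above.
PER_LANE_GB_S = {2: 0.5, 3: 1.0, 4: 2.0, 5: 4.0}


def slot_bandwidth_gb_s(generation: int, lanes: int) -> float:
    """Theoretical slot bandwidth: per-lane speed times lane count."""
    return PER_LANE_GB_S[generation] * lanes


if __name__ == "__main__":
    for gen in sorted(PER_LANE_GB_S):
        x8 = slot_bandwidth_gb_s(gen, 8)
        x16 = slot_bandwidth_gb_s(gen, 16)
        print(f"PCIe {gen}.0: x8 = {x8:g} GB/s, x16 = {x16:g} GB/s")
```

One consequence worth noticing: a x8 link of one generation matches a x16 link of the previous one, which is why a Gen 4 GPU in a x8 Gen 4 slot behaves like a x16 Gen 3 card.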
Decklink cards and their PCIe requirements
The capture cards Composer supports operate at modest PCIe generations. Don't expect a Decklink to take advantage of a Gen-5 slot — the hardware itself runs at Gen 2 or Gen 3 regardless of motherboard capability.
| Model | PCIe interface | Channels | Max resolution |
|---|---|---|---|
| Decklink Mini Recorder 4K | x4 Gen 2 | 1 | 2160p30 |
| Decklink Mini Monitor 4K | x4 Gen 2 | 1 | 2160p30 |
| Decklink Duo 2 | x4 Gen 2 | 4 (3G-SDI) | 1080p60 |
| Decklink Quad 2 | x8 Gen 2 | 8 (3G-SDI) | 1080p60 |
| Decklink Quad HDMI Recorder | x8 Gen 3 | 4 (HDMI 2.0) | 2160p60 |
| Decklink 8K Pro | x8 Gen 3 | 4 (12G-SDI) | 4K / 8K |
The lane-allocation problem
The CPU determines how many PCIe lanes are available to the rest of the system. Consumer parts are surprisingly thin once you start adding capture cards alongside an NVIDIA GPU:
| Processor class | Typical PCIe lanes | Example |
|---|---|---|
| Consumer desktop (Intel Core) | 20–24 | 13th-gen Core i9 |
| Consumer desktop (AMD Ryzen) | 24–28 | Ryzen 9 7950X |
| HEDT / workstation | 48–64 (up to 128 on Threadripper PRO) | Threadripper, Threadripper PRO |
| Server (AMD EPYC) | 96–128 | EPYC 9004 / 8004 series |
A typical real-time video system (RTX-class GPU at x16, two Decklink Duo 2 cards at x4 each, an NVMe drive at x4, plus chipset connectivity) needs 28 lanes for the devices alone and 30 or more once the chipset link is counted. On a 20-lane consumer CPU, the motherboard typically drops the GPU slot from x16 to x8 (or worse) as soon as a second CPU-attached slot is populated. Each motherboard handles this differently; the manual usually documents the behaviour under "PCIe bifurcation" or "slot bandwidth".
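A lane audit of this kind is easy to script. The following is a minimal sketch, not a Composer tool: `audit_lanes` is a hypothetical helper, and the x4 chipset uplink is an assumption that varies by platform.

```python
from __future__ import annotations


def audit_lanes(cpu_lanes: int, devices: dict[str, int]) -> tuple[int, int]:
    """Return (lanes requested, shortfall). A shortfall of 0 means it fits."""
    requested = sum(devices.values())
    return requested, max(0, requested - cpu_lanes)


# The example system from the text. The x4 chipset uplink is an assumption;
# check the real figure for your CPU and motherboard.
devices = {
    "GPU": 16,
    "Decklink Duo 2 #1": 4,
    "Decklink Duo 2 #2": 4,
    "NVMe drive": 4,
    "chipset uplink (assumed x4)": 4,
}

requested, shortfall = audit_lanes(20, devices)
print(f"requested {requested} lanes, short by {shortfall}")
```

A non-zero shortfall means at least one device will negotiate fewer lanes than it wants; on consumer boards the GPU slot is usually the one that gives them up.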
Why GPU-lane reduction matters
When the GPU runs at x8 instead of x16, transfers between system RAM and GPU memory become the bottleneck. A GPU operating at PCIe 3.0 x8 has ~8 GB/s of theoretical bandwidth where a x16 slot would deliver ~16 GB/s, and sustained throughput in practice is typically 60–80 % of the theoretical figure.
This directly hits CUDA memory-copy operations and turns into dropped frames as soon as the project's per-frame transfer budget is exceeded.
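As a rough illustration of the 60–80 % rule of thumb, the sketch below turns a theoretical figure into a sustained range. The function name and the default efficiency bounds are illustrative; they come from the range quoted above, not from measurement on any particular host.

```python
def sustained_range_gb_s(theoretical_gb_s: float,
                         low: float = 0.60,
                         high: float = 0.80) -> tuple:
    """Estimate sustained bandwidth as a (low, high) fraction of theoretical."""
    return theoretical_gb_s * low, theoretical_gb_s * high


x8_low, x8_high = sustained_range_gb_s(8.0)     # PCIe 3.0 x8
x16_low, x16_high = sustained_range_gb_s(16.0)  # PCIe 3.0 x16
print(f"x8 Gen 3:  {x8_low:.1f} to {x8_high:.1f} GB/s sustained")
print(f"x16 Gen 3: {x16_low:.1f} to {x16_high:.1f} GB/s sustained")
```

Measured numbers on a real system (for example from a bandwidth test utility) should replace these estimates whenever they are available.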
CUDA memory transfers — the per-frame budget
The standard CUDA memory copy (cudaMemcpy) is blocking — the host CPU thread waits for the transfer to complete. Worse, with the default CUDA stream, copies and kernel execution serialize:
Default stream timeline:
[Host → Device memcpy] → [CUDA kernel] → [Device → Host memcpy]
          ↑                                        ↑
    GPU cores idle                           GPU cores idle
Composer's frame budget at 25 fps is 40 ms per frame. Anything spent on transfers is taken away from compositing.
Quantifying the impact
A 1080p BGRA frame is 1920 × 1080 × 4 bytes ≈ 8.3 MB. Transfer times scale directly with available bandwidth:
| Configuration | Theoretical bandwidth | Time to transfer 8.3 MB |
|---|---|---|
| x16 Gen 3 | ~16 GB/s | ~0.5 ms |
| x8 Gen 3 | ~8 GB/s | ~1.0 ms |
| x16 Gen 2 | ~8 GB/s | ~1.0 ms |
| x8 Gen 2 | ~4 GB/s | ~2.1 ms |
A live show compositing 8 inputs at reduced bandwidth could spend 8–16 ms per frame just moving pixels into the GPU: 20–40 % of the 40 ms frame budget at 25 fps, and roughly half to nearly all of the 16.7 ms budget at 60 fps. That's before any compositing or encoding has happened.
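The same estimate can be worked through in code for any input count, resolution, or frame rate. This is a back-of-the-envelope sketch under stated assumptions: bandwidth is the theoretical figure in decimal GB/s, transfers are host-to-device only, and all function names are hypothetical.

```python
def frame_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Size of one uncompressed frame (4 bytes per pixel for BGRA)."""
    return width * height * bytes_per_pixel


def transfer_ms(n_bytes: int, bandwidth_gb_s: float) -> float:
    """Time to move n_bytes at the given bandwidth (1 GB = 1e9 bytes)."""
    return n_bytes / (bandwidth_gb_s * 1e9) * 1e3


def budget_fraction(inputs: int, bandwidth_gb_s: float,
                    fps: float, n_bytes: int) -> float:
    """Share of the per-frame time budget spent on input transfers."""
    budget_ms = 1000.0 / fps
    return inputs * transfer_ms(n_bytes, bandwidth_gb_s) / budget_ms


size = frame_bytes(1920, 1080)  # 1080p BGRA
for bandwidth in (8.0, 4.0):    # x8 Gen 3 and x8 Gen 2, theoretical
    for fps in (25, 60):
        share = budget_fraction(8, bandwidth, fps, size)
        print(f"{bandwidth:g} GB/s at {fps} fps: {share:.1%} of the budget")
```

With unrounded figures the results land slightly above the 20–40 % quoted in the text, and they get worse once sustained rather than theoretical bandwidth is plugged in.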
Mitigation through asynchronous operations
CUDA does offer techniques that hide some of the transfer cost:
- CUDA streams — overlap transfers with kernel execution by using multiple streams, so the GPU keeps computing while the next transfer is in flight.
- Pinned (page-locked) memory — direct DMA between system RAM and the GPU, up to 2× higher bandwidth than pageable memory.
- CUDA / DirectX interop — when output goes to a display rather than back to system RAM, the framebuffer can stay GPU-resident.
These are real wins, but none of them compensate for fundamentally inadequate PCIe bandwidth. If a card is wired to a x4 slot when it needs x16, no amount of streaming heroics gets the bytes through.
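Why overlap helps, and where it stops helping, can be seen in a simple timing model. The sketch below is an idealised assumption, not a description of Composer's pipeline: it treats upload, kernel, and download as three independent engines with constant per-frame times, and the millisecond values are made up for illustration.

```python
def serialized_ms(frames: int, h2d: float, kernel: float, d2h: float) -> float:
    """Default-stream behaviour: every stage waits for the previous one."""
    return frames * (h2d + kernel + d2h)


def pipelined_ms(frames: int, h2d: float, kernel: float, d2h: float) -> float:
    """Ideal overlap: once the pipeline fills, the slowest stage sets the pace."""
    stages = (h2d, kernel, d2h)
    return sum(stages) + (frames - 1) * max(stages)


# Hypothetical per-frame stage times in milliseconds.
healthy = dict(h2d=2, kernel=5, d2h=2)   # adequate PCIe bandwidth
starved = dict(h2d=8, kernel=5, d2h=2)   # upload slowed by a narrow link

for name, t in (("healthy", healthy), ("starved", starved)):
    print(name,
          "serialized:", serialized_ms(100, **t),
          "pipelined:", pipelined_ms(100, **t))
```

In the healthy case the pipelined rate is bounded by the 5 ms kernel, so the transfers are fully hidden. In the starved case the 8 ms upload becomes the slowest stage, and no amount of overlap brings the per-frame time below it.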
Recommended hardware: AMD EPYC
For multi-capture-card production systems, AMD EPYC is the sane platform — the lane count is generous enough to avoid every compromise.
EPYC 9004 / 9005 (Genoa / Turin)
- 128 PCIe 5.0 lanes per socket.
- In dual-socket setups, 64 lanes per CPU are typically reserved for the inter-socket Infinity Fabric (xGMI) links, leaving 128 lanes total exposed to devices.
- 12 channels of DDR5 memory for the kind of bandwidth-intensive work Composer thrives on.
EPYC 8004
- 96 PCIe 5.0 lanes per socket.
- Single-socket design optimised for compact deployments — fits in a 1U or 2U chassis with sensible thermals.
- Lower power consumption than the 9004 / 9005 — useful for mobile production units (broadcast trucks, flyaway kits).
- 6 channels of DDR5.
Sample configuration
A balanced professional rig on an EPYC 9004 host:
| Component | PCIe lanes |
|---|---|
| NVIDIA RTX 4090 / 5090 | x16 Gen 4 (4090) / Gen 5 (5090) |
| Decklink Duo 2 #1 | x4 Gen 2 |
| Decklink Duo 2 #2 | x4 Gen 2 |
| Decklink Quad HDMI | x8 Gen 3 |
| NVMe RAID controller | x8 Gen 4 |
| 10 GbE NIC | x4 Gen 3 |
| Total | 44 lanes |
44 lanes used out of 128 available — full GPU bandwidth, no compromises, room for future expansion.
Summary recommendations
- Audit PCIe lanes before specifying hardware. Count the GPU, every capture card, NVMe storage, networking, and any chipset overhead. Compare against the CPU's published lane budget.
- Avoid consumer platforms for multi-capture-card deployments. 20–28 lanes force unacceptable compromises.
- Choose EPYC 8004 / 9004 / 9005 for professional installations when you have multiple capture cards and a dedicated GPU. The 96–128 PCIe 5.0 lanes eliminate the lane budget as a constraint.
- PCIe Gen 5 helps the GPU, not capture cards. Decklink cards remain at Gen 2 / 3 native speeds. Plan around the bottleneck, not the marketing.
For per-host tuning beyond hardware sizing, see Tuning for maximum performance above.