PCI Express bandwidth considerations

Real-time GPU video processing systems live or die on PCI Express bandwidth. Composer pairs multiple Blackmagic Decklink capture cards with NVIDIA GPUs for compositing. Modern motherboards offer impressive theoretical bandwidth, but a single oversight in slot wiring or CPU-lane allocation can throttle the GPU to a fraction of its capability and turn an otherwise capable host into a frame-dropping liability. This section covers what to watch for.

PCIe bandwidth refresher

PCIe bandwidth is the product of lane count and per-lane speed, which is set by the generation. Each successive generation roughly doubles the per-lane bandwidth (figures are approximate and per direction):

Generation | Bandwidth per lane | x8 slot  | x16 slot
PCIe 2.0   | ~500 MB/s          | ~4 GB/s  | ~8 GB/s
PCIe 3.0   | ~1 GB/s            | ~8 GB/s  | ~16 GB/s
PCIe 4.0   | ~2 GB/s            | ~16 GB/s | ~32 GB/s
PCIe 5.0   | ~4 GB/s            | ~32 GB/s | ~64 GB/s
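
The table rows can be reproduced with one multiplication. The following is a minimal sketch; the function name and the lookup dictionary are illustrative, and the per-lane figures are the rounded values from the table rather than exact specification numbers.

```python
# Approximate per-lane bandwidth in GB/s, rounded as in the table above.
PER_LANE_GBPS = {2: 0.5, 3: 1.0, 4: 2.0, 5: 4.0}


def slot_bandwidth_gbps(generation: int, lanes: int) -> float:
    """Approximate theoretical bandwidth of a PCIe link, in GB/s."""
    if lanes not in (1, 2, 4, 8, 16):
        raise ValueError(f"unsupported lane count: {lanes}")
    return PER_LANE_GBPS[generation] * lanes


if __name__ == "__main__":
    for gen in sorted(PER_LANE_GBPS):
        print(f"PCIe {gen}.0: x8 ~{slot_bandwidth_gbps(gen, 8):g} GB/s, "
              f"x16 ~{slot_bandwidth_gbps(gen, 16):g} GB/s")
```

A Gen 4 GPU dropped to x8 therefore lands on the same ~16 GB/s as a Gen 3 card at x16, which is why both the generation and the negotiated width matter.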

Modern NVIDIA GPUs are designed to run at PCIe x16. They remain functional at fewer lanes — the slot just delivers proportionally less bandwidth.

The capture cards Composer supports operate at modest PCIe generations. Don't expect a Decklink to take advantage of a Gen-5 slot — the hardware itself runs at Gen 2 or Gen 3 regardless of motherboard capability.

Model                       | PCIe interface | Channels     | Max resolution
Decklink Mini Recorder 4K   | x4 Gen 2       | 1            | 2160p30
Decklink Mini Monitor 4K    | x4 Gen 2       | 1            | 2160p30
Decklink Duo 2              | x4 Gen 2       | 4 (3G-SDI)   | 1080p60
Decklink Quad 2             | x8 Gen 2       | 8 (3G-SDI)   | 1080p60
Decklink Quad HDMI Recorder | x8 Gen 3       | 4 (HDMI 2.0) | 2160p60
Decklink 8K Pro             | x8 Gen 3       | 4 (12G-SDI)  | 4K / 8K

The lane-allocation problem

The CPU determines how many PCIe lanes are available to the rest of the system. Consumer parts are surprisingly thin once you start adding capture cards alongside an NVIDIA GPU:

Processor class               | Typical PCIe lanes | Example
Consumer desktop (Intel Core) | 20–24              | 13th-gen Core i9
Consumer desktop (AMD Ryzen)  | 24–28              | Ryzen 9 7950X
HEDT / workstation            | 48–128             | Threadripper / Threadripper PRO
Server (AMD EPYC)             | 96–128             | EPYC 9004 / 8004 series

A typical real-time video system (RTX-class GPU at x16, two Decklink Duo 2 cards at x4 each, an NVMe drive at x4) already asks for 16 + 4 + 4 + 4 = 28 lanes, and chipset connectivity brings the total to 30 or more. On a 20-lane consumer CPU, the motherboard automatically downgrades the GPU slot from x16 to x8 (or worse) the moment additional PCIe devices are installed. Each motherboard handles this differently; the manual usually documents the behaviour under "PCIe bifurcation" or "slot bandwidth".
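
The lane count above can be turned into a quick pre-purchase check. This is a minimal sketch: `audit_lanes` and the device list are illustrative names, and the model assumes every device hangs directly off the CPU. It ignores the chipset link and any board-specific slot sharing, so treat a "fits" result as necessary rather than sufficient.

```python
def audit_lanes(cpu_lanes: int, devices: dict[str, int]) -> tuple[int, int]:
    """Return (lanes requested, shortfall) for a proposed build."""
    requested = sum(devices.values())
    shortfall = max(0, requested - cpu_lanes)
    return requested, shortfall


# The example system from the paragraph above, before chipset connectivity.
TYPICAL_SYSTEM = {
    "RTX-class GPU": 16,
    "Decklink Duo 2 #1": 4,
    "Decklink Duo 2 #2": 4,
    "NVMe drive": 4,
}

if __name__ == "__main__":
    for cpu_lanes in (20, 28, 128):
        requested, shortfall = audit_lanes(cpu_lanes, TYPICAL_SYSTEM)
        verdict = "fits" if shortfall == 0 else f"short by {shortfall} lanes"
        print(f"{cpu_lanes}-lane CPU: {requested} lanes requested, {verdict}")
```

On the 20-lane budget the shortfall is 8 lanes, which is exactly the difference between running the GPU at x16 and at x8.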

Why GPU-lane reduction matters

When the GPU runs at x8 instead of x16, memory transfers between system RAM and GPU memory become the bottleneck. A GPU operating at PCIe 3.0 x8 has ~8 GB/s of theoretical bandwidth where a x16 slot would deliver ~16 GB/s, and sustained throughput in practice is typically only 60–80 % of the theoretical figure.

This directly hits CUDA memory-copy operations and turns into dropped frames as soon as the project's per-frame transfer budget is exceeded.
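
Whether a GPU has actually been narrowed can be checked on a running host. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` PCIe link fields; the helper names and the sample line are hypothetical and not part of Composer. Idle GPUs may report a lower link generation for power saving, so the width comparison is the more reliable signal.

```python
import subprocess

# Query fields understood by `nvidia-smi --query-gpu`.
QUERY = ("name,pcie.link.gen.current,pcie.link.gen.max,"
         "pcie.link.width.current,pcie.link.width.max")


def read_link_report() -> str:
    """Run nvidia-smi and return its CSV output, one line per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_link_report(csv_text: str) -> list[dict]:
    """Turn the CSV text into one dict per GPU and flag narrowed links."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma in the GPU name cannot shift fields.
        name, gen_cur, gen_max, width_cur, width_max = (
            field.strip() for field in line.rsplit(",", 4)
        )
        gpus.append({
            "name": name,
            "gen_current": int(gen_cur),
            "gen_max": int(gen_max),
            "width_current": int(width_cur),
            "width_max": int(width_max),
            # True when the link runs with fewer lanes than the GPU supports.
            "narrowed": int(width_cur) < int(width_max),
        })
    return gpus


if __name__ == "__main__":
    # Hypothetical sample line; on a real host pass read_link_report() instead.
    sample = "NVIDIA GeForce RTX 4090, 4, 4, 8, 16"
    for gpu in parse_link_report(sample):
        status = "NARROWED" if gpu["narrowed"] else "ok"
        print(f'{gpu["name"]}: x{gpu["width_current"]} of x{gpu["width_max"]}, '
              f'Gen {gpu["gen_current"]} of Gen {gpu["gen_max"]} [{status}]')
```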

CUDA memory transfers — the per-frame budget

The standard CUDA memory copy (cudaMemcpy) is blocking — the host CPU thread waits for the transfer to complete. Worse, with the default CUDA stream, copies and kernel execution serialize:

Default stream timeline:
  [Host → Device memcpy] → [CUDA kernel] → [Device → Host memcpy]
            ↑                                       ↑
       GPU cores idle                          GPU cores idle

Composer's frame budget at 25 fps is 40 ms per frame. Anything spent on transfers is taken away from compositing.

Quantifying the impact

A 1080p BGRA frame is 1920 × 1080 × 4 bytes ≈ 8.3 MB. Transfer times scale directly with available bandwidth:

Configuration | Theoretical bandwidth | Time to transfer 8.3 MB
x16 Gen 3     | ~16 GB/s              | ~0.5 ms
x8 Gen 3      | ~8 GB/s               | ~1.0 ms
x16 Gen 2     | ~8 GB/s               | ~1.0 ms
x8 Gen 2      | ~4 GB/s               | ~2.1 ms

A live show compositing 8 inputs at reduced bandwidth (~8 GB/s down to ~4 GB/s) could spend roughly 8–17 ms per frame just moving pixels into the GPU: about 20–40 % of the 40 ms frame budget at 25 fps, and between half and nearly all of the 16.7 ms budget at 60 fps. That's before any compositing or encoding has happened.
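
The figures in this section follow from simple arithmetic, which the sketch below reproduces for other resolutions and frame rates. The function names are illustrative. The `efficiency` parameter is an assumption that lets you apply the 60–80 % real-world factor mentioned earlier; it defaults to the theoretical figure.

```python
def frame_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Size of one uncompressed frame (BGRA is 4 bytes per pixel)."""
    return width * height * bytes_per_pixel


def transfer_ms(num_bytes: int, bandwidth_gbps: float,
                efficiency: float = 1.0) -> float:
    """Time to move num_bytes over a link of bandwidth_gbps GB/s, in ms."""
    return num_bytes / (bandwidth_gbps * 1e9 * efficiency) * 1000.0


def budget_share(inputs: int, bandwidth_gbps: float, fps: float,
                 efficiency: float = 1.0) -> float:
    """Fraction of the per-frame budget spent uploading 1080p BGRA inputs."""
    upload_ms = inputs * transfer_ms(frame_bytes(1920, 1080),
                                     bandwidth_gbps, efficiency)
    return upload_ms / (1000.0 / fps)


if __name__ == "__main__":
    for gbps in (16, 8, 4):
        one = transfer_ms(frame_bytes(1920, 1080), gbps)
        print(f"{gbps:>2} GB/s: {one:.2f} ms per frame, "
              f"8 inputs use {budget_share(8, gbps, 25):.0%} of the budget "
              f"at 25 fps and {budget_share(8, gbps, 60):.0%} at 60 fps")
```

Passing `efficiency=0.7` gives a more pessimistic and arguably more realistic estimate than the theoretical numbers in the table.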

Mitigation through asynchronous operations

CUDA does offer techniques that hide some of the transfer cost:

  • CUDA streams — overlap transfers with kernel execution by using multiple streams, so the GPU keeps computing while the next transfer is in flight.
  • Pinned (page-locked) memory — direct DMA between system RAM and the GPU, up to 2× higher bandwidth than pageable memory.
  • CUDA / DirectX interop — when output goes to a display rather than back to system RAM, the framebuffer can stay GPU-resident.

These are real wins, but none of them compensate for fundamentally inadequate PCIe bandwidth. If a card is wired to a x4 slot when it needs x16, no amount of streaming heroics gets the bytes through.
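
To see why overlap helps but cannot rescue an undersized link, it is enough to model the pipeline timing. The sketch below is an idealised model, not CUDA code: it assumes asynchronous copies from pinned memory, a GPU that can copy in both directions while a kernel runs, and no other contention. All stage times in the example are hypothetical.

```python
def serial_ms(n_frames: int, h2d: float, kernel: float, d2h: float) -> float:
    """Default-stream behaviour: every stage waits for the previous one."""
    return n_frames * (h2d + kernel + d2h)


def pipelined_ms(n_frames: int, h2d: float, kernel: float, d2h: float) -> float:
    """Ideal multi-stream overlap: after the first frame fills the pipeline,
    each further frame costs only as much as the slowest stage."""
    if n_frames <= 0:
        return 0.0
    slowest = max(h2d, kernel, d2h)
    return (h2d + kernel + d2h) + (n_frames - 1) * slowest


if __name__ == "__main__":
    frames = 100
    budget_ms = 40.0  # per-frame budget at 25 fps

    # Hypothetical case 1: uploads are cheaper than the compositing kernel.
    # Hypothetical case 2: the link is so narrow that uploads dominate.
    for label, h2d in (("adequate link", 8.3), ("starved link", 50.0)):
        serial = serial_ms(frames, h2d, 20.0, 1.0) / frames
        overlapped = pipelined_ms(frames, h2d, 20.0, 1.0) / frames
        verdict = "within" if overlapped <= budget_ms else "over"
        print(f"{label}: serial {serial:.1f} ms/frame, "
              f"overlapped {overlapped:.1f} ms/frame ({verdict} budget)")
```

In the first case overlap hides the upload behind the kernel. In the second the upload itself is the slowest stage, so the per-frame cost stays above the budget however the work is scheduled.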

For multi-capture-card production systems, AMD EPYC is the sane platform — the lane count is generous enough to avoid every compromise.

EPYC 9004 / 9005 (Genoa / Turin)

  • 128 PCIe 5.0 lanes per socket.
  • In dual-socket setups, 64 lanes per CPU are reserved for the inter-socket Infinity Fabric link, leaving 128 lanes total exposed to devices.
  • 12 channels of DDR5 memory for the kind of bandwidth-intensive work Composer thrives on.

EPYC 8004

  • 96 PCIe 5.0 lanes per socket.
  • Single-socket design optimised for compact deployments — fits in a 1U or 2U chassis with sensible thermals.
  • Lower power consumption than the 9004 / 9005 — useful for mobile production units (broadcast trucks, flyaway kits).
  • 6 channels of DDR5.

Sample configuration

A balanced professional rig on an EPYC 9004 host:

Component              | PCIe lanes
NVIDIA RTX 4090 / 5090 | x16 Gen 4 (4090) / x16 Gen 5 (5090)
Decklink Duo 2 #1      | x4 Gen 2
Decklink Duo 2 #2      | x4 Gen 2
Decklink Quad HDMI     | x8 Gen 3
NVMe RAID controller   | x8 Gen 4
10 GbE NIC             | x4 Gen 3
Total                  | 44 lanes

44 lanes used out of 128 available — full GPU bandwidth, no compromises, room for future expansion.

Summary recommendations

  • Audit PCIe lanes before specifying hardware. Count the GPU, every capture card, NVMe storage, networking, and any chipset overhead. Compare against the CPU's published lane budget.
  • Avoid consumer platforms for multi-capture-card deployments. 20–28 lanes force unacceptable compromises.
  • Choose EPYC 8004 / 9004 / 9005 for professional installations when you have multiple capture cards and a dedicated GPU. The 96–128 PCIe 5.0 lanes eliminate the lane budget as a constraint.
  • PCIe Gen 5 helps the GPU, not capture cards. Decklink cards remain at Gen 2 / 3 native speeds. Plan around the bottleneck, not the marketing.

For per-host tuning beyond hardware sizing, see Tuning for maximum performance above.