If you’re building high-performance backends, AI training clusters, low-latency trading systems, or modern storage fabrics, you’ll eventually hit the limits of “traditional” networking: too many copies, too many context switches, and too much CPU spent shuttling bytes instead of doing useful work.
Remote Direct Memory Access (RDMA) changes that by letting one machine read from or write to another machine’s memory directly, largely offloading the data path to the NIC. The result is microsecond-level latency, very high throughput, and dramatically lower CPU load: exactly what today’s data-intensive applications need.
Below, we unpack RDMA from first principles, compare it to DMA/TCP/IP, explain InfiniBand, RoCE, and iWARP, and give you a practical checklist for deploying an RDMA fabric that actually delivers.

RDMA vs. DMA vs. “Traditional” Networking—The Mental Model
DMA, in one minute
Direct Memory Access (DMA) lets a device move data to/from system memory without the CPU doing byte-by-byte copies. The CPU programs the DMA engine, then gets out of the way until an interrupt signals completion.
That’s great inside a single server, but it doesn’t solve the over-the-network cost of TCP/IP stacks, kernel crossings, and user/kernel copies.
RDMA, in one minute
Remote Direct Memory Access (RDMA) extends the DMA idea across machines. User-space applications register memory regions once, post “work requests” to NIC-managed queues, and the NICs move data between hosts directly with minimal kernel involvement and no extra copies along the hot path.
The control plane (connection setup, memory registration) still involves the OS, but steady-state I/O runs mostly in hardware queues (queue pairs) with completions delivered to user-space via completion queues.
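To make that control plane concrete, here is a minimal sketch using the Linux verbs API (libibverbs): open a device, allocate a protection domain, and register (pin) a buffer so the NIC is allowed to DMA into it. The single-device assumption, the 1 MiB buffer size, and the omitted error handling are illustrative choices, not a production recipe.

```c
/* Minimal control-plane sketch (libibverbs): open a device, allocate a
 * protection domain, and register (pin) a buffer for NIC DMA.
 * Assumes at least one RDMA device and elides all error handling. */
#include <infiniband/verbs.h>
#include <stdlib.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx  = ibv_open_device(devs[0]);    /* first device, illustrative */
    struct ibv_pd *pd        = ibv_alloc_pd(ctx);           /* protection domain */

    size_t len = 1 << 20;                                    /* 1 MiB buffer */
    void  *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    /* mr->lkey goes into local work requests; mr->rkey is handed to peers
     * so they can target this region with RDMA READ/WRITE. */

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```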
The short, practical comparison
| Dimension | Traditional TCP/IP I/O | DMA (local only) | RDMA (remote, offloaded) |
| --- | --- | --- | --- |
| CPU involvement | High: syscalls, copies, interrupts | Low (local) | Low (offloaded to NIC) |
| Copies | User→kernel→NIC and back | None (device↔memory) | Zero-copy between hosts’ memories |
| Latency | Milliseconds to high microseconds (stack overhead) | N/A (intra-host) | Low microseconds (kernel bypass) |
| Throughput | Good, but CPU-limited | N/A | Line rate at small CPU cost |
| Scope | Networked | Intra-host | Networked (host↔host) |
| Typical use | General apps | Disk, GPU, NIC transfers | HPC, AI/ML, NVMe-oF, HFT |

How RDMA Works
An RDMA application opens a device, registers memory (pinning pages and granting the NIC DMA rights), then creates queue pairs (QPs): a Send Queue and a Receive Queue bound together. It also creates Completion Queues (CQs), which the NIC fills with “done” events. The app posts work requests (e.g., RDMA WRITE, RDMA READ, SEND/RECV) to the QP, and polls the CQ for completions—no kernel round-trip on each I/O.
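As an illustration of the queue setup described above, the following sketch (libibverbs) creates a completion queue and a reliable-connected queue pair, assuming the `ctx` and `pd` handles from the earlier control-plane step; the queue depths are placeholder values.

```c
#include <infiniband/verbs.h>

/* Sketch: create a completion queue and a reliable-connected (RC) queue pair.
 * `ctx` and `pd` come from the device/PD setup shown earlier; depths are illustrative. */
static struct ibv_qp *make_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                                 struct ibv_cq **cq_out)
{
    struct ibv_cq *cq = ibv_create_cq(ctx, 256 /* CQE depth */, NULL, NULL, 0);

    struct ibv_qp_init_attr init = {
        .send_cq = cq,                 /* completions for posted sends    */
        .recv_cq = cq,                 /* completions for posted receives */
        .qp_type = IBV_QPT_RC,         /* reliable connected transport    */
        .cap = {
            .max_send_wr  = 128,       /* send-queue depth    */
            .max_recv_wr  = 128,       /* receive-queue depth */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };

    *cq_out = cq;
    /* The QP still has to be driven INIT -> RTR -> RTS (see Developer's Corner)
     * before traffic flows. */
    return ibv_create_qp(pd, &init);
}
```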
Why is this fast?
- Kernel bypass & zero copy: Once memory is registered, payloads don’t bounce through the kernel or intermediate buffers.
- NIC-managed transport: The NIC handles segmentation, retransmissions (depending on fabric), and completion signaling.
- Doorbells, not syscalls: Posting work requests involves MMIO doorbells, not heavy system calls.
- Tight, predictable queues: Queue pairs and completion queues are simple ring buffers that are cache-friendly and polled by the app.
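To make the last two points concrete, here is a sketch of a busy-poll completion loop with libibverbs: the application reads completion entries straight out of the CQ with no system call. The batch size of 16 and the bail-out-on-error behavior are illustrative choices.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Sketch: busy-poll the completion queue. No syscalls or interrupts on the
 * hot path; a production loop would bound its spinning and handle errors. */
static void drain_completions(struct ibv_cq *cq)
{
    struct ibv_wc wc[16];
    for (;;) {
        int n = ibv_poll_cq(cq, 16, wc);   /* reads CQEs from a ring in host memory */
        for (int i = 0; i < n; i++) {
            if (wc[i].status != IBV_WC_SUCCESS) {
                fprintf(stderr, "wr %llu failed: %s\n",
                        (unsigned long long)wc[i].wr_id,
                        ibv_wc_status_str(wc[i].status));
                return;
            }
            /* wc[i].wr_id tells us which posted work request just completed */
        }
    }
}
```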
RDMA Fabrics and Protocols: InfiniBand vs. RoCE vs. iWARP
You can think of RDMA as a set of semantics (memory registration, queue pairs, verbs). Those semantics can ride on different underlay transports:
InfiniBand (native RDMA fabric)
InfiniBand is a purpose-built, lossless, low-latency fabric with RDMA as a first-class citizen. It defines link, transport, and management layers for tightly coupled clusters. Historically dominant in HPC because of latency, bandwidth, and mature tooling, InfiniBand still pushes the envelope at high speeds and large cluster scales.
RoCE (RDMA over Converged Ethernet)
RoCE brings RDMA semantics to Ethernet.
- RoCEv1 works at Layer 2 (same L2 domain).
- RoCEv2 encapsulates RDMA traffic over UDP/IPv4 or UDP/IPv6, making it routable across L3 networks (UDP port 4791). RoCEv2 typically relies on data-center controls like PFC and ECN to manage loss/congestion and keep latencies tight.
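In practice, RoCEv2 (and iWARP) endpoints are usually addressed by ordinary IP addresses and connected with librdmacm, which resolves the IP to an RDMA route before the queue pair is wired up. The sketch below uses a placeholder peer address (192.0.2.10) and application port (7471) and elides event handling; the RoCEv2 data packets themselves use UDP destination port 4791 regardless of the application port.

```c
#include <rdma/rdma_cma.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

/* Sketch: resolve an IP peer to an RDMA route with librdmacm, the usual
 * control path for RoCEv2/iWARP. 192.0.2.10 and port 7471 are placeholders;
 * event handling and errors are elided. */
static struct rdma_cm_id *resolve_peer(struct rdma_event_channel *ec)
{
    struct rdma_cm_id *id;
    rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);       /* RC-style port space */

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(7471);
    inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);

    rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000 /* ms */);
    /* ...wait for RDMA_CM_EVENT_ADDR_RESOLVED on `ec`, then: */
    rdma_resolve_route(id, 2000 /* ms */);
    /* ...wait for RDMA_CM_EVENT_ROUTE_RESOLVED, then rdma_connect(). */
    return id;
}
```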
iWARP (RDMA over TCP)
iWARP implements RDMA over the familiar, reliable TCP transport (via a stack of IETF standards: RDMAP, DDP, and MPA over TCP or SCTP). Because TCP is reliable and connection-oriented, iWARP can be attractive in lossy or wide-area environments, though its performance characteristics differ from UDP-based RoCEv2 and native InfiniBand.
Quick side-by-side overview
| Feature | InfiniBand | RoCEv2 | iWARP |
| --- | --- | --- | --- |
| Underlay transport | Native IB fabric | UDP over IPv4/IPv6, L3 routable | TCP (IETF RDMAP/DDP/MPA) |
| Loss behavior | Engineered lossless fabric | Ethernet with PFC/ECN for low loss | TCP handles loss/retransmit |
| Typical domains | HPC, AI training pods, tightly coupled clusters | Enterprise DCs wanting RDMA on Ethernet | Mixed or WAN-ish scenarios needing TCP reliability |
| Operational complexity | IB-specific gear/procedures | Ethernet networking + DC QoS tuning | TCP familiarity, fewer fabric tweaks |
| Latency ceiling (practical) | Excellent | Very good with proper tuning | Good; TCP overhead adds variability |
| Hardware | IB switches & HCAs | Ethernet switches/NICs with RoCE support | NICs with iWARP offload |

Where RDMA Shines: Real-World Use Cases
- HPC & scientific computing: tightly coupled MPI workloads, CFD, weather, genomics—where latency directly impacts wall-clock time. (InfiniBand dominant.)
- AI/ML training & inference at scale: model parallelism and parameter server designs benefit from low-latency, high-throughput exchanges. (InfiniBand and RoCEv2 are common.)
- Disaggregated storage (NVMe-oF): NVMe over Fabrics supports an RDMA transport (alongside TCP/FC), giving you remote flash pools with local-like performance.
- Low-latency finance: Market data distribution and order handling where microseconds matter. (RoCEv2 and specialized Ethernet.)
- Kernel-bypass middleware: modern databases, caches, and streaming systems leveraging verbs or RDMA-enabled RPC to remove kernel overheads.
The RDMA Data Path vs. Traditional TCP/IP
Traditional path (simplified):
- App calls write() → kernel socket → protocol stack (TCP/IP) → NIC DMA → wire → peer NIC → peer stack → peer socket → read() → app.
Multiple copies and context switches along the way.
RDMA path (simplified):
- App registers memory, creates queue pairs and completion queues.
- App posts RDMA WRITE/READ or SEND/RECV to the Send Queue.
- NIC moves the payload directly from local memory to remote registered memory.
- Completion shows up in the CQ. No per-message syscalls.
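As a sketch of the posting step with libibverbs, the function below issues a one-sided RDMA WRITE whose target address and rkey were learned out-of-band from the peer; the wr_id value and the signaling choice are illustrative.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: post a one-sided RDMA WRITE. remote_addr/rkey were exchanged
 * out-of-band; lkey comes from our own memory registration. */
static int post_rdma_write(struct ibv_qp *qp, void *local_buf, uint32_t len,
                           uint32_t lkey, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;                  /* echoed back in the completion      */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided: remote CPU not involved */
    wr.send_flags          = IBV_SEND_SIGNALED;  /* generate a CQE on completion       */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);      /* rings a doorbell; no syscall */
}
```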

Benefits You Can Bank On
- Ultra-low latency: Eliminating kernel involvement on the hot path avoids extra copies, context switches, and interrupts.
- High throughput at low CPU: Data movement is NIC-accelerated; CPUs spend cycles on application logic instead of memcpy().
- Stable tail latency: Hardware queues + tuned fabrics = predictable p99s, especially on IB or well-engineered RoCEv2.
- Scalability: Queue pairs and registered memory scale across many flows and nodes, particularly in IB clusters.
Limitations and Operational Gotchas
- Operational complexity: Memory registration, page pinning, MTUs, QP states, CQ sizes, and NUMA placement need careful tuning.
- Fabric engineering: RoCEv2 demands thoughtful PFC/ECN policies to avoid head-of-line blocking or unfairness; IB has its own queue and credit tuning.
- Debugging difficulty: Silent drops or congestion misconfig can look like application bugs; visibility tools matter.
- Heterogeneous clusters: Mixing NIC generations or fabrics (IB + Ethernet) introduces policy drift and hard-to-reproduce tail behavior.
Storage Spotlight: NVMe over Fabrics with RDMA
NVMe-oF separates compute from storage while keeping latency tight. Its RDMA transport lets hosts talk directly to remote NVMe queues over the fabric, avoiding the overhead of legacy SCSI stacks and delivering near-local performance with central manageability, which is crucial for AI data pipelines and scale-out block stores.
Specs from NVM Express and educational material from SNIA outline RDMA among the standard transports (alongside TCP and Fibre Channel).
Designing an RDMA-Capable Network: A Practical Checklist
1) Choose your transport first
- If you control the cluster end-to-end and want the best latency/bandwidth: InfiniBand.
- If you must stay on Ethernet and need L3 routing: RoCEv2 (plan for PFC/ECN).
- If you expect lossy or WAN segments and value TCP semantics: iWARP.
2) Size your NICs and queues
- Match NIC generation (e.g., PCIe lane width/speed) to node memory bandwidth and CPU sockets.
- Provision queue pairs per core/worker; avoid contention.
- Right-size Completion Queues and polling strategies to balance CPU burn vs. latency.
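One way to trade CPU for latency, sketched below with libibverbs, is to arm a completion channel and sleep until the NIC signals a completion instead of busy polling; the CQ depth and the single-event flow are illustrative.

```c
#include <infiniband/verbs.h>

/* Sketch: event-driven completions instead of busy polling. The thread sleeps
 * in ibv_get_cq_event() rather than burning a core, trading some latency for
 * CPU. Error handling and teardown are elided. */
static void wait_then_drain(struct ibv_context *ctx)
{
    struct ibv_comp_channel *ch = ibv_create_comp_channel(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, ch, 0);

    ibv_req_notify_cq(cq, 0);                 /* arm: next CQE raises an event */

    struct ibv_cq *ev_cq; void *ev_ctx;
    ibv_get_cq_event(ch, &ev_cq, &ev_ctx);    /* blocks until a completion arrives */
    ibv_ack_cq_events(ev_cq, 1);
    ibv_req_notify_cq(ev_cq, 0);              /* re-arm before draining */

    struct ibv_wc wc;
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
        /* handle wc.wr_id / wc.status here */
    }
}
```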
3) Engineer the fabric (Ethernet/RoCEv2)
- Enable and tune PFC where required, but beware of head-of-line blocking.
- Use ECN/RED policies to mark before drop, coupled with congestion notification handling.
- Keep loss domains small; avoid microbursts via adequate buffering and traffic class separation.
4) Align the host stack
- Pin memory judiciously; monitor registered region saturation.
- Co-locate NIC IRQs/doorbells and polling threads with the right NUMA node.
- Pick MTU and inline thresholds that fit your message size profile.
- Consider hybrid paths: small messages over RDMA SEND, large payloads via RDMA WRITE/READ.
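The hybrid-path idea in the last bullet can be expressed as a simple size check at post time. The sketch below (libibverbs) uses an illustrative inline threshold of 188 bytes, which must not exceed the max_inline_data the QP was created with, and assumes the peer has posted receive buffers for the SEND path.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: pick the path by message size. Small payloads go as an inline SEND
 * (the NIC copies the bytes out of the work request; the peer must have a
 * RECV posted); large payloads go as a zero-copy RDMA WRITE. INLINE_MAX is
 * illustrative and must not exceed the QP's max_inline_data. */
#define INLINE_MAX 188

static int post_message(struct ibv_qp *qp, void *buf, uint32_t len, uint32_t lkey,
                        uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list = &sge;
    wr.num_sge = 1;

    if (len <= INLINE_MAX) {
        wr.opcode     = IBV_WR_SEND;                        /* two-sided */
        wr.send_flags = IBV_SEND_INLINE | IBV_SEND_SIGNALED;
    } else {
        wr.opcode              = IBV_WR_RDMA_WRITE;         /* one-sided */
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
    }
    return ibv_post_send(qp, &wr, &bad);
}
```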
5) Instrument everything
- Track CQ overruns, QP error states, retransmits (where applicable), ECN marks, and p99/p999 latencies.
- In storage fabrics, monitor NVMe-oF queue depths end-to-end.
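As a small example of the kind of probe worth wiring into telemetry, the sketch below (libibverbs) checks whether a queue pair has dropped into the error state, often the first application-visible symptom of fabric trouble; the logging is illustrative.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Sketch: a health probe that reports whether a QP has fallen into the error
 * state. Wire the result into whatever metrics pipeline you already run. */
static int qp_is_healthy(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    struct ibv_qp_init_attr init;

    if (ibv_query_qp(qp, &attr, IBV_QP_STATE, &init))
        return 0;                                  /* query failed: treat as unhealthy */
    if (attr.qp_state == IBV_QPS_ERR) {
        fprintf(stderr, "QP %u is in the error state\n", qp->qp_num);
        return 0;
    }
    return 1;
}
```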

Developer’s Corner: Verbs-Level Workflow (Conceptual)
- Open device & allocate PD: A Protection Domain groups QPs and memory registrations.
- Register memory (MRs): Pin pages; hand keys to NIC.
- Create QPs and CQs: Decide queue depths; associate QPs with CQs.
- Exchange QP info out-of-band: use RDMA-CM or your own control channel to move the QP to RTR/RTS (the state transitions are sketched below).
- Post work requests: RDMA_WRITE, RDMA_READ, SEND, RECV.
- Poll completions: Low-latency polling loop reads CQEs; handle errors and back-pressure.
NVIDIA’s programming guide and university tutorials are good starting points for the data structures and queue semantics.
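For reference, here is a sketch of the RTR/RTS step with libibverbs: driving a reliable-connected QP through INIT, RTR, and RTS after the peers have exchanged QP numbers, packet sequence numbers, and addressing out-of-band. The port number, MTU, and timeout/retry constants are illustrative defaults, and RoCE endpoints would fill in a GID rather than an IB LID.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: drive an RC QP through INIT -> RTR -> RTS after exchanging QP number,
 * PSNs, and addressing out-of-band. Port 1, MTU 1024, and the timeout/retry
 * constants are illustrative; RoCE peers would set a GID instead of an IB LID. */
static void connect_qp(struct ibv_qp *qp, uint32_t remote_qpn, uint32_t remote_psn,
                       uint16_t remote_lid, uint32_t local_psn)
{
    struct ibv_qp_attr a;

    memset(&a, 0, sizeof(a));                      /* RESET -> INIT */
    a.qp_state        = IBV_QPS_INIT;
    a.pkey_index      = 0;
    a.port_num        = 1;
    a.qp_access_flags = IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;
    ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                          IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);

    memset(&a, 0, sizeof(a));                      /* INIT -> RTR (ready to receive) */
    a.qp_state           = IBV_QPS_RTR;
    a.path_mtu           = IBV_MTU_1024;
    a.dest_qp_num        = remote_qpn;
    a.rq_psn             = remote_psn;
    a.max_dest_rd_atomic = 1;
    a.min_rnr_timer      = 12;
    a.ah_attr.dlid       = remote_lid;
    a.ah_attr.port_num   = 1;
    ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                          IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                          IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);

    memset(&a, 0, sizeof(a));                      /* RTR -> RTS (ready to send) */
    a.qp_state      = IBV_QPS_RTS;
    a.timeout       = 14;
    a.retry_cnt     = 7;
    a.rnr_retry     = 7;
    a.sq_psn        = local_psn;
    a.max_rd_atomic = 1;
    ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                          IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
}
```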
Cost & Architecture Notes
- CapEx vs. OpEx: IB often wins on raw latency/bandwidth but implies an IB-specific switching domain. RoCEv2 leverages Ethernet skills/hardware you may already have, but you must invest in switch QoS engineering. iWARP can simplify ops in mixed or lossy environments using TCP familiarity.
- Scaling out: For very large clusters, pay attention to path diversity (ECMP for RoCEv2, fat-tree/dragonfly for IB), and ensure your job scheduler or orchestration respects topology-aware placement.
- Interoperability: Stick to well-supported NIC/switch combos; test at Plugfests and lab-validate congestion policies before production cutovers.
Frequently Asked Questions
Q1: Is RDMA faster than TCP/IP?
A: In low-latency regimes and at high throughput, yes, primarily because RDMA bypasses the kernel on the data path, avoids extra copies, and uses NIC queues and completions. The exact win depends on message sizes, congestion policies, and hardware.
Q2: Can RDMA run over Ethernet?
A: Yes. That’s RoCE (L2/L3). RoCEv2 encapsulates RDMA over UDP/IPv4 or UDP/IPv6 using destination port 4791, so it’s routable across subnets with proper QoS and congestion control.
Q3: What about RDMA over TCP?
A: That’s iWARP, specified by the IETF as RDMAP/DDP/MPA over TCP (or SCTP). You get TCP’s reliability semantics, which can be valuable in lossy networks or WANs.
Q4: Is RDMA only for HPC?
A: No. It’s foundational in NVMe-oF storage fabrics (NVMe-RDMA), increasingly common in AI/ML clusters, and used in low-latency trading when engineered correctly.
Q5: What skills do my team need?
A: Network QoS and congestion control (RoCEv2), queue-based programming (verbs), NUMA-aware multi-threading, and solid observability across NICs and switches.