
xv6 RDMA

C · RISC-V · QEMU · xv6 · Systems

Overview

xv6 RDMA is a teaching-grade RDMA stack built on top of xv6-riscv. It introduces memory regions (MRs), queue pairs (QPs), work requests, and completion queues directly into the xv6 kernel, exposes a Verbs-like user-space API through new syscalls, and supports two transport modes: in-process software loopback and Ethernet-framed RDMA-over-E1000 for two-VM messaging.

Description

xv6 RDMA extends MIT's xv6-riscv teaching OS with a complete kernel RDMA subsystem modeled after the InfiniBand Verbs programming interface. The system adds seven new syscalls (numbers 22–28) for registering memory regions, creating and connecting queue pairs, posting send work requests, and polling completion queues. It supports two distinct data paths: a loopback path for single-guest intra-process RDMA that copies data between registered physical pages using memmove, and a network path that frames RDMA WRITE payloads in custom Ethernet frames (ethertype 0x8915) sent and received over the QEMU-emulated E1000 NIC.

What it is and what it does

The kernel RDMA subsystem is organized into three layers:

  • Resource management layer (rdma.c): tracks all MRs and QPs in global tables protected by spinlocks. It handles VA-to-PA translation for registered user pages, enforces queue size constraints (powers of two, single kalloc page), and processes work requests synchronously, dispatching each posted WR to either the loopback or network path based on the QP's connection state.
  • Network transport layer (rdma_net.c): builds and sends Ethernet frames carrying a fixed-size rdma_pkt_hdr (opcode, source QP, destination QP, rkey, remote MR offset, length) prepended to the WRITE payload. On receive it validates the rkey against the destination MR, copies the payload into the MR's physical pages, posts a completion entry to the receiver's CQ, and sends an ACK frame back.
  • Syscall layer (sysrdma.c): bridges user and kernel by using copyin / copyout to safely transfer work-request descriptors, completion entries, and MAC addresses across the user–kernel boundary.
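To make the frame layout concrete, here is a minimal sketch of how a WRITE frame might be assembled. The field list and the 0x8915 ethertype come from the description above; the field widths, the padding, the opcode value, and the name rdma_build_write_frame are assumptions made for illustration and may not match rdma_net.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RDMA_ETHERTYPE 0x8915u /* ethertype named in the text */
#define ETH_HDR_LEN    14u     /* dst MAC (6) + src MAC (6) + ethertype (2) */
#define RDMA_OP_WRITE  1u      /* hypothetical opcode value */

/* Hypothetical layout: the fields follow the text, the widths are guesses. */
struct rdma_pkt_hdr {
    uint8_t  opcode;
    uint8_t  pad[3];        /* keeps the 32-bit fields aligned */
    uint32_t src_qp;
    uint32_t dst_qp;
    uint32_t rkey;
    uint32_t remote_offset; /* byte offset into the remote MR */
    uint32_t length;        /* payload bytes that follow the header */
};

/* Assemble one frame in buf. Returns the frame length, or 0 if it
 * does not fit in cap bytes. Header fields stay in host byte order,
 * which only works when both peers share the same endianness. */
static size_t rdma_build_write_frame(uint8_t *buf, size_t cap,
                                     const uint8_t dst_mac[6],
                                     const uint8_t src_mac[6],
                                     const struct rdma_pkt_hdr *hdr,
                                     const void *payload)
{
    size_t total = ETH_HDR_LEN + sizeof(*hdr) + hdr->length;
    if (total > cap)
        return 0;
    memcpy(buf, dst_mac, 6);
    memcpy(buf + 6, src_mac, 6);
    buf[12] = (RDMA_ETHERTYPE >> 8) & 0xff; /* ethertype is big-endian on the wire */
    buf[13] = RDMA_ETHERTYPE & 0xff;
    memcpy(buf + ETH_HDR_LEN, hdr, sizeof(*hdr));
    memcpy(buf + ETH_HDR_LEN + sizeof(*hdr), payload, hdr->length);
    return total;
}
```

A receiver can dispatch on bytes 12 and 13 of the frame and then read the header at offset 14, which is the role the text assigns to net.c and rdma_net_rx().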

Capabilities

  • RDMA WRITE in both loopback mode (same guest, no NIC) and network mode (two QEMU VMs connected via E1000 + -netdev user)
  • Memory regions registered by user-space VA; the kernel pins and translates each page to a physical address for DMA-style access
  • Queue pairs with configurable send-queue and completion-queue depths, transitioning through INIT → RTR → RTS states mirroring real Verbs
  • rdma_connect() sets a remote MAC and remote QP ID so subsequent post_send calls are automatically routed to the network path
  • Completion queues polled by the user via poll_cq; in network mode, signaled send WRs produce a completion only after an ACK is received
  • Remote MR access authenticated by a per-MR rkey carried in every packet header
  • Docker and CI tooling (Dockerfile, docker-compose.yml, test-xv6.py) support automated test runs
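
The QP state progression and the effect of rdma_connect() listed above can be sketched as a small state machine. This is an illustration only: the struct fields, the helper names, the rule that only RTS may post sends (borrowed from real Verbs), and the idea that connecting drives both transitions are assumptions, not details confirmed by the project.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum qp_state { QP_INIT, QP_RTR, QP_RTS };

/* Hypothetical QP bookkeeping; field names are illustrative. */
struct qp {
    enum qp_state state;
    uint8_t  remote_mac[6];
    uint32_t remote_qp;
    int      network_mode; /* nonzero once a remote peer is recorded */
};

/* Advance exactly one step; reject skipped or backward transitions. */
static int qp_advance(struct qp *q, enum qp_state next)
{
    if ((int)next != (int)q->state + 1)
        return -1;
    q->state = next;
    return 0;
}

/* Record the peer, then walk INIT -> RTR -> RTS. */
static int qp_connect(struct qp *q, const uint8_t mac[6], uint32_t remote_qp)
{
    if (q->state != QP_INIT)
        return -1;
    memcpy(q->remote_mac, mac, 6);
    q->remote_qp = remote_qp;
    q->network_mode = 1;
    if (qp_advance(q, QP_RTR) != 0)
        return -1;
    return qp_advance(q, QP_RTS);
}

/* Assumed rule, as in real Verbs: only a QP in RTS may post sends. */
static int qp_can_post_send(const struct qp *q)
{
    return q->state == QP_RTS;
}
```

Rejecting out-of-order transitions keeps a half-configured QP from ever reaching the transmit path.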

Implementation

Memory regions and address translation: rdma_register_mr() calls walkaddr() on the calling process's page table to resolve the user virtual address to a kernel physical address; the PA is stored in the MR table entry. All subsequent kernel-side accesses use this PA directly via P2V/physical pointer arithmetic, bypassing the user page table entirely.
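
As a rough user-space illustration of the translation step, the sketch below records one physical address per page at registration time and later maps an MR byte offset to a physical address without consulting the page table again. The walk callback stands in for xv6's walkaddr() (the real function also takes a page table argument), and the struct layout, MR_MAX_PAGES, fake_walk, and the per-page array are assumptions for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE       4096u
#define MR_MAX_PAGES 8u /* hypothetical per-MR limit */

struct mr {
    uint64_t va_base;          /* user VA where the region starts */
    uint64_t len;              /* region length in bytes */
    uint32_t npages;
    uint64_t pa[MR_MAX_PAGES]; /* physical address of each backing page */
};

/* Stand-in for walkaddr(): PA of the page holding va, or 0 if unmapped. */
typedef uint64_t (*walk_fn)(uint64_t va);

/* Toy page table for the demo: three VA pages at scattered PAs. */
static uint64_t fake_walk(uint64_t va)
{
    switch (va) {
    case 0x1000: return 0x80200000ULL;
    case 0x2000: return 0x80050000ULL;
    case 0x3000: return 0x80300000ULL;
    default:     return 0;
    }
}

/* Resolve every page the region touches once, at registration time. */
static int mr_register(struct mr *m, uint64_t va, uint64_t len, walk_fn walk)
{
    if (len == 0)
        return -1;
    uint64_t first = va & ~(uint64_t)(PGSIZE - 1);
    uint64_t last  = (va + len - 1) & ~(uint64_t)(PGSIZE - 1);
    uint64_t n     = (last - first) / PGSIZE + 1;
    if (n > MR_MAX_PAGES)
        return -1;
    for (uint64_t i = 0; i < n; i++) {
        uint64_t pa = walk(first + i * PGSIZE);
        if (pa == 0)
            return -1; /* some page in the range is not mapped */
        m->pa[i] = pa;
    }
    m->va_base = va;
    m->len = len;
    m->npages = (uint32_t)n;
    return 0;
}

/* Map a byte offset inside the MR to a physical address; 0 if out of range. */
static uint64_t mr_offset_to_pa(const struct mr *m, uint64_t off)
{
    if (off >= m->len)
        return 0;
    uint64_t v = (m->va_base & (PGSIZE - 1)) + off;
    return m->pa[v / PGSIZE] + (v % PGSIZE);
}
```

Storing one address per page matters because pages that are contiguous in user virtual memory are generally not contiguous in physical memory.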

Queue pairs and work request dispatch: rdma_create_qp() allocates separate kalloc pages for the send queue and completion queue ring buffers. rdma_post_send() copies the work request into the SQ ring, then immediately calls rdma_process_work_requests(), which checks qp->network_mode: if unset it performs a local memmove; if set it calls rdma_net_tx_write() to frame and transmit over E1000.
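
The post-then-process flow can be sketched in user space as follows. The names post_send and process_work_requests echo the functions named above, but the work-request layout, the ring indexing scheme, and the counter that stands in for rdma_net_tx_write() are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PGSIZE 4096u

/* Hypothetical work request; the real descriptor likely differs. */
struct wr {
    void       *dst;
    const void *src;
    uint32_t    len;
    uint64_t    wr_id;
};

struct qp {
    struct wr *sq;           /* ring storage (one kalloc page in the kernel) */
    uint32_t   depth;        /* must be a power of two */
    uint32_t   head, tail;   /* free-running; slot = counter & (depth - 1) */
    int        network_mode;
    uint32_t   net_tx_count; /* stand-in for calls to rdma_net_tx_write() */
};

/* Depth must be a nonzero power of two and the ring must fit in one page. */
static int sq_depth_ok(uint32_t depth)
{
    return depth != 0 && (depth & (depth - 1)) == 0 &&
           (uint64_t)depth * sizeof(struct wr) <= PGSIZE;
}

/* Drain the ring synchronously, choosing a path per QP mode. */
static void process_work_requests(struct qp *q)
{
    while (q->head != q->tail) {
        struct wr *w = &q->sq[q->head & (q->depth - 1)];
        if (q->network_mode)
            q->net_tx_count++;               /* real code would frame and transmit */
        else
            memmove(w->dst, w->src, w->len); /* loopback copy */
        q->head++;
    }
}

/* Copy the WR into the ring, then process it immediately. */
static int post_send(struct qp *q, const struct wr *w)
{
    if (q->tail - q->head == q->depth)
        return -1; /* ring full */
    q->sq[q->tail & (q->depth - 1)] = *w;
    q->tail++;
    process_work_requests(q);
    return 0;
}
```

With a power-of-two depth, the slot index is a single mask operation, and the free-running counters distinguish a full ring from an empty one without wasting a slot.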

Network packet format and RX path: Every outgoing frame carries an rdma_pkt_hdr followed by the WRITE payload. net.c dispatches frames with ethertype 0x8915 to rdma_net_rx(). The receiver validates the rkey, copies the payload to mr->phys_addr + remote_offset, posts a CQ entry, and sends a minimal ACK frame back. On the sender side, rdma_net_rx_ack() matches the ACK's sequence number against the pending_acks table and posts the completion entry.
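
The two checks on this path, rkey validation on the receiver and ACK matching on the sender, might look roughly like the sketch below. The pending_acks name comes from the text; the table size, the entry layout, the bounds check, and the helper names rx_write, ack_track, and ack_complete are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PENDING 8u /* hypothetical table size */

/* Simplified MR: a flat buffer stands in for the MR's physical pages. */
struct mr {
    uint8_t *base;
    uint32_t len;
    uint32_t rkey;
};

/* Receiver: accept a WRITE only if the rkey matches and the range fits. */
static int rx_write(struct mr *m, uint32_t rkey, uint32_t off,
                    const void *payload, uint32_t len)
{
    if (rkey != m->rkey)
        return -1;
    if (off > m->len || len > m->len - off) /* written to avoid overflow */
        return -1;
    memcpy(m->base + off, payload, len);
    return 0;
}

struct pending_ack {
    int      in_use;
    uint32_t seq;
    uint64_t wr_id;
};

static struct pending_ack pending_acks[MAX_PENDING];

/* Sender: remember a transmitted WR until its ACK arrives. */
static int ack_track(uint32_t seq, uint64_t wr_id)
{
    for (uint32_t i = 0; i < MAX_PENDING; i++) {
        if (!pending_acks[i].in_use) {
            pending_acks[i] = (struct pending_ack){ 1, seq, wr_id };
            return 0;
        }
    }
    return -1; /* table full */
}

/* Sender: on ACK, return 1 and the WR id if seq was pending, else 0. */
static int ack_complete(uint32_t seq, uint64_t *wr_id)
{
    for (uint32_t i = 0; i < MAX_PENDING; i++) {
        if (pending_acks[i].in_use && pending_acks[i].seq == seq) {
            *wr_id = pending_acks[i].wr_id;
            pending_acks[i].in_use = 0;
            return 1;
        }
    }
    return 0;
}
```

Freeing the slot on a match means a duplicate ACK finds nothing and cannot post a second completion for the same work request.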

Syscall interface: The seven syscalls are dispatched from the standard xv6 syscall.c table. sysrdma.c uses argint and argaddr to extract arguments and copyin / copyout to safely move structs between user and kernel address spaces.

Testing: rdmatest.c exercises MR registration, QP lifecycle, and loopback WRITE. rdmanet_test.c runs as two roles (host_a / host_b) across two QEMU instances and verifies that a WRITE from host B appears correctly in host A's MR. Kernel-side unit tests gated by -DRDMA_TESTING run automatically at boot.

Demo

No live demo available. To run locally:

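  • Single-VM loopback: run make qemu, then run rdmatest at the xv6 shell prompt (the program is built as _rdmatest; xv6's mkfs drops the leading underscore when it populates the file system image)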
  • Two-VM network test: run ./scripts/run_host_a.sh in terminal 1 (receiver), then ./scripts/run_host_b.sh in terminal 2 (sender)

See docs/NETWORK_RDMA_TESTING.md and docs/SETUP.md for full setup instructions.

Tech & Tools

C (kernel and user space) · RISC-V · QEMU (virt machine, E1000 NIC) · xv6-riscv (MIT teaching OS base)

Highlights

  • Kernel RDMA subsystem with Verbs-like API (MR, QP, WR, CQ) added to xv6-riscv
  • Dual transport: software loopback and RDMA-over-Ethernet via E1000
  • Custom Ethernet ethertype (0x8915) framing with rkey-authenticated remote MR access
  • Full syscall interface with safe copyin/copyout user–kernel data transfer
