From NIC to User Space: Data Structures and Ring Buffer Behavior
When a network interface card (NIC) receives packets, it uses DMA to place them directly into pre-allocated buffers in system RAM. The process is structured around a fixed-size RX ring buffer of descriptors, which the NIC and driver share.
RX Ring Buffer Basics
Descriptor Structure
A simplified receive descriptor might look like:
```c
struct rx_desc {
    uint64_t buf_addr;  // DMA (physical) address of the packet buffer
    uint16_t length;    // Packet length, written back by the NIC
    uint8_t  status;    // Flags: DONE, errors, VLAN info, etc.
};
```
- Ring = a fixed-size array of these descriptors.
- Fixed size because:
  - Hardware allocates internal state for each entry.
  - It's easier for the driver to wrap pointers with modulo arithmetic.
- Circular: after the last entry, pointers wrap back to index 0.
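To make the wrap-around concrete, here is a minimal sketch of the ring and its index arithmetic, reusing the `struct rx_desc` above. `RX_RING_SIZE`, `rx_ring`, `tail`, and `ring_next` are illustrative names, not from any real driver:

```c
#include <stdint.h>

#define RX_RING_SIZE 256   /* illustrative; must match what the NIC was programmed with */

/* The ring is just a fixed-size array of the descriptor struct above. */
static struct rx_desc rx_ring[RX_RING_SIZE];
static unsigned int tail;  /* next descriptor the driver will reclaim */

/* Index of the slot after i, wrapping back to 0 past the last entry. */
static unsigned int ring_next(unsigned int i)
{
    return (i + 1) % RX_RING_SIZE;  /* or (i + 1) & (RX_RING_SIZE - 1) if size is a power of two */
}
```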
Packet Arrival Path
Step 0 — NIC Writes via DMA
- NIC’s PHY and MAC receive a packet from the wire.
- NIC picks the next free descriptor (pointed to by its head pointer).
- DMA engine writes the entire packet into `buf_addr`.
- NIC updates `length` and sets `status = DONE`.
At this point, the RX ring might look like:
[ DONE, DONE, DONE, EMPTY, EMPTY, ... ]
Step 1 — Interrupt
- NIC signals the CPU via IRQ or MSI-X.
- The CPU's Local APIC routes the interrupt to the assigned core.
- The ISR (Interrupt Service Routine) runs quickly:
  - Acknowledges the interrupt.
  - Schedules a NAPI poll for packet processing.
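As a rough sketch of what that ISR looks like in a Linux driver: `napi_schedule()`, `writel()`, `netif_napi_add()`, and the `irqreturn_t` convention are real kernel APIs, while everything prefixed `my_`/`MY_` (the private struct, the register offset) is invented for illustration:

```c
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/netdevice.h>

#define MY_NIC_RX_INTR_ENABLE 0x10  /* invented register offset */

/* Fragment of a hypothetical driver; fields are set up at probe time. */
struct my_nic_priv {
    void __iomem *regs;        /* mapped device registers */
    struct napi_struct napi;   /* registered with netif_napi_add() */
    struct net_device *netdev;
};

static irqreturn_t my_nic_irq(int irq, void *data)
{
    struct my_nic_priv *priv = data;

    /* Acknowledge/mask further RX interrupts (register layout is made up). */
    writel(0, priv->regs + MY_NIC_RX_INTR_ENABLE);

    /* Defer the heavy lifting to NAPI, which runs in softirq context. */
    napi_schedule(&priv->napi);

    return IRQ_HANDLED;
}
```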
Step 2 — NAPI Poll
- NAPI runs in softirq context.
- Driver walks the RX ring starting at its tail pointer.
- For each descriptor marked DONE (ready to be pulled out):
  - Create an `sk_buff` pointing to the DMA buffer.
  - Set length, protocol, and other metadata.
  - Pass the `sk_buff` into `netif_receive_skb()` (the network stack).
  - Mark the descriptor EMPTY and advance the tail pointer.

[ EMPTY, <Tail Here>, DONE, EMPTY, EMPTY, ... ]
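Continuing the same hypothetical driver, the poll function might drain the ring like this. `netif_receive_skb()`, `eth_type_trans()`, and `napi_complete_done()` are real kernel APIs; `my_build_skb()`, `RX_STATUS_DONE`, and the ring variables come from the sketches above:

```c
static int my_nic_poll(struct napi_struct *napi, int budget)
{
    struct my_nic_priv *priv = container_of(napi, struct my_nic_priv, napi);
    int done = 0;

    while (done < budget) {
        struct rx_desc *desc = &rx_ring[tail];
        struct sk_buff *skb;

        if (!(desc->status & RX_STATUS_DONE))
            break;                            /* ring drained */

        skb = my_build_skb(desc);             /* wrap the DMA buffer in an sk_buff */
        skb->protocol = eth_type_trans(skb, priv->netdev);
        netif_receive_skb(skb);               /* hand off to the network stack */

        desc->status = 0;                     /* mark EMPTY so the NIC can reuse it */
        tail = ring_next(tail);
        done++;
    }

    /* Ring drained before the budget ran out: stop polling, re-enable IRQs. */
    if (done < budget && napi_complete_done(napi, done))
        writel(1, priv->regs + MY_NIC_RX_INTR_ENABLE);

    return done;
}
```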
What does softIRQ context mean in Linux?
In Linux, softIRQ context means the code is running in a special, deferred interrupt handling mode — not as a normal process, but not as a hard interrupt either.
It’s a middle ground the kernel uses so that heavy work triggered by an interrupt doesn’t block other interrupts for too long.
Why it exists
Hard IRQ context (ISR): Runs immediately when the CPU gets an interrupt.
- Runs with interrupts disabled.
- Must be very quick — just enough to acknowledge hardware and schedule real work.
- Can’t sleep or block.
SoftIRQ context:
- Scheduled by a hard IRQ handler.
- Runs with interrupts enabled.
- Can take longer because it’s not holding up other interrupts.
- Still not normal process context — can’t sleep, can’t call blocking functions.
- Runs either right after the hard IRQ exits or later in a special kernel thread (`ksoftirqd`).
How this applies to NAPI and NIC RX
- Packet arrives → NIC raises interrupt → ISR runs in hard IRQ context.
- ISR disables further NIC interrupts and calls `napi_schedule()`.
- `napi_schedule()` marks the NIC's poll handler to run in `NET_RX_SOFTIRQ` context.
- Later, the kernel runs the softIRQ handler:
  - Driver's `poll()` function drains the RX ring, creates sk_buffs, and passes them to the networking stack.
- When the poll has drained the ring, NAPI re-enables the NIC's interrupts and the CPU resumes normal work.
Why not just do all work in the ISR?
Because:
- Copying packets, allocating sk_buffs, and running protocol parsing are slow compared to just acknowledging the hardware.
- While you’re in a hard IRQ, all other interrupts on that CPU are masked.
- Spending 500 µs in a hard IRQ could delay timers, disk I/O, other NIC queues, etc.
- Splitting the work into hard IRQ → softIRQ keeps “acknowledge and exit fast” while still giving low-latency packet processing.
Step 3 — Network Stack
- L2 parse: Ethernet header removed.
- L3 parse: IP header validated, checksum verified (or skipped if offloaded).
- L4 parse: TCP/UDP header parsed, ports identified.
- skb is queued into the correct socket receive queue.
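The same layering can be seen in a small user-space sketch that walks the headers of a raw IPv4/TCP frame (ordinary socket headers, not the kernel's actual parsing code; bounds checks are minimal):

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>
#include <arpa/inet.h>      /* ntohs */
#include <net/ethernet.h>   /* struct ether_header, ETHERTYPE_IP */
#include <netinet/ip.h>     /* struct iphdr */
#include <netinet/tcp.h>    /* struct tcphdr */

/* Walk the headers of a raw frame the way the stack does: L2 -> L3 -> L4. */
static void parse_frame(const uint8_t *frame, size_t len)
{
    const struct ether_header *eth = (const struct ether_header *)frame;

    if (len < sizeof(*eth) + sizeof(struct iphdr))
        return;
    if (ntohs(eth->ether_type) != ETHERTYPE_IP)   /* L2: only IPv4 here */
        return;

    const struct iphdr *ip = (const struct iphdr *)(frame + sizeof(*eth));
    if (ip->protocol != IPPROTO_TCP)              /* L3: only TCP here */
        return;

    const struct tcphdr *tcp =
        (const struct tcphdr *)((const uint8_t *)ip + ip->ihl * 4);
    printf("TCP %u -> %u\n", ntohs(tcp->source), ntohs(tcp->dest));  /* L4: ports */
}
```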
Step 4 — User Space
- App calling `recv()` or `read()` wakes up when data is available.
- skb's payload is copied into the user buffer (or mapped in zero-copy modes).
- skb is freed back to the pool.
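From the application's point of view, this whole path collapses into one blocking call; a minimal UDP receiver as a sketch (port 9000 is arbitrary, error handling omitted):

```c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Blocking receiver: recvfrom() sleeps until the stack has queued an skb
 * on this socket, then copies its payload into `buf`. */
int main(void)
{
    char buf[2048];
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {
        .sin_family      = AF_INET,
        .sin_port        = htons(9000),       /* arbitrary example port */
        .sin_addr.s_addr = htonl(INADDR_ANY),
    };

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
    if (n >= 0)
        printf("received %zd bytes\n", n);

    close(fd);
    return 0;
}
```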
What Happens if the RX Ring Fills
Because the RX ring has a fixed number of slots, it can fill up under heavy load:
Filling Condition
- NIC head pointer catches up to driver tail pointer.
- All descriptors are marked DONE, but driver hasn’t processed them yet.
- No free descriptor = nowhere to DMA next packet.
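In terms of the indices from the earlier ring sketch, "no free descriptor" is a simple pointer comparison. The one-slot-gap convention below is an assumption of the sketch; real NICs usually track ownership per descriptor via status bits instead:

```c
/* head: next slot the NIC will DMA into (hardware-owned).
 * tail: next slot the driver will reclaim (software-owned).
 * One slot is kept unused so "full" and "empty" are distinguishable. */
static int ring_full(unsigned int head, unsigned int tail)
{
    return ring_next(head) == tail;   /* nowhere left to DMA the next packet */
}
```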
Result
- Packets are dropped in hardware:
  - NIC increments a missed-packet or RX-drop counter.
  - The dropped frame never makes it into RAM.
- Driver can read these counters via NIC registers for diagnostics.
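On Linux, these hardware counters usually surface through `ethtool` statistics; for example (the interface name is a placeholder, and counter names vary by driver):

```sh
ethtool -S eth0 | grep -i -E 'drop|miss'
# e.g. rx_dropped, rx_missed_errors, rx_no_buffer_count, depending on the NIC
```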
Why it happens
- CPU or driver not processing descriptors fast enough (high interrupt load, other workloads).
- Burst of incoming packets exceeds ring capacity before driver can catch up.
- Interrupt moderation might delay driver wakeup just long enough for the ring to fill.
Mitigations
- Increase RX ring size
  - Some NICs allow larger descriptor rings (e.g., from 256 to 4096 entries).
- Enable NAPI / packet batching
  - Reduces per-packet interrupt cost; processes multiple packets per poll.
- Distribute load with RSS
  - Multiple RX queues mapped to different cores.
- Use higher-performance packet paths
  - XDP, DPDK, or other bypass frameworks.
- Tune interrupt moderation
  - Lower the coalescing delay to drain the ring sooner.
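Several of these mitigations map directly onto `ethtool` knobs; a sketch with illustrative values (supported ranges depend on the NIC and driver):

```sh
ethtool -g eth0                  # show current and maximum RX ring sizes
ethtool -G eth0 rx 4096          # grow the RX descriptor ring
ethtool -L eth0 combined 8       # spread RX across 8 queues (RSS)
ethtool -C eth0 rx-usecs 8       # lower the interrupt coalescing delay
```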
Key Insight
Once a packet is dropped due to a full RX ring, it’s gone — the NIC doesn’t have a secondary overflow buffer. The only way to prevent this is to make sure the driver processes descriptors quickly enough, or to spread the load across more queues/cores.