From NIC to User Space: Data Structures and Ring Buffer Behavior

When a network interface card (NIC) receives packets, it uses DMA to place them directly into pre-allocated buffers in system RAM. The process is structured around a fixed-size RX ring buffer of descriptors, which the NIC and driver share.

RX Ring Buffer Basics

Descriptor Structure

A simplified receive descriptor might look like:

struct rx_desc {
    uint64_t buf_addr;  // DMA (bus) address of the packet buffer
    uint16_t length;    // Length of the received frame
    uint8_t  status;    // Flags: DONE, error bits, VLAN info, etc.
};
  • Ring = a fixed-size array of these descriptors.

  • Fixed size because:

    • Hardware allocates internal state for each entry.
    • It’s easier for the driver to wrap indices with modulo arithmetic (see the sketch below).
  • Circular: after the last entry, indices wrap back to index 0.
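
That wrap is simple enough to show directly. A minimal sketch in C (RING_SIZE and ring_next are illustrative names, not taken from any real driver):

#include <stdint.h>

#define RING_SIZE 256  /* illustrative; many NICs allow 256 to 4096 entries */

/* Advance a ring index by one slot, wrapping back to 0 after the last entry. */
static inline uint32_t ring_next(uint32_t idx)
{
    /* With a power-of-two ring size, drivers often write this as
       (idx + 1) & (RING_SIZE - 1) to avoid the division. */
    return (idx + 1) % RING_SIZE;
}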



Packet Arrival Path

Step 0 — NIC Writes via DMA

  • NIC’s PHY and MAC receive a packet from the wire.
  • NIC picks the next free descriptor (pointed to by its head pointer).
  • The DMA engine writes the entire packet into the buffer at buf_addr.
  • The NIC fills in length and sets status = DONE.

At this point, the RX ring might look like:

[ DONE, DONE, DONE, EMPTY, EMPTY, ... ]
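
To make Step 0 concrete, here is a rough software model of the hardware side, reusing struct rx_desc from above and RING_SIZE/ring_next from the earlier sketch (this logic actually lives in the NIC’s silicon, not in driver code; DESC_DONE and nic_rx_frame are invented names):

#define DESC_DONE 0x01  /* illustrative status flag: descriptor holds a packet */

static struct rx_desc rx_ring[RING_SIZE];
static uint32_t hw_head;  /* next slot the NIC will fill */

/* Model of the NIC’s per-frame receive step. */
static void nic_rx_frame(uint16_t frame_len)
{
    struct rx_desc *d = &rx_ring[hw_head];

    /* ...the DMA engine streams the frame into the buffer at d->buf_addr... */
    d->length = frame_len;
    d->status = DESC_DONE;  /* written last, so the driver never
                               observes a half-filled descriptor */
    hw_head = ring_next(hw_head);
}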

Step 1 — Interrupt

  • NIC signals the CPU via a legacy IRQ, MSI, or MSI-X interrupt.

  • The CPU’s local APIC routes the interrupt to the assigned core.

  • ISR (Interrupt Service Routine) runs quickly:

    • Acknowledges interrupt.
    • Schedules NAPI poll for packet processing.
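
A sketch of such an ISR for a hypothetical driver (the my_nic_ names, register offsets, and priv layout are invented for illustration; irqreturn_t, readl/writel, and napi_schedule() are real kernel APIs):

#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/netdevice.h>

#define MY_NIC_INT_STATUS 0x00  /* invented register offsets */
#define MY_NIC_INT_MASK   0x04

struct my_nic_priv {
    void __iomem *regs;        /* mapped NIC registers */
    struct napi_struct napi;
    struct net_device *netdev;
    struct rx_desc *ring;      /* RX descriptor ring (struct above) */
    void **bufs;               /* CPU addresses of the DMA buffers */
    unsigned int buf_len;      /* size of each buffer */
    u32 tail;                  /* next descriptor the driver will reap */
};

static irqreturn_t my_nic_isr(int irq, void *dev_id)
{
    struct my_nic_priv *priv = dev_id;
    u32 status = readl(priv->regs + MY_NIC_INT_STATUS);

    if (!status)
        return IRQ_NONE;                            /* shared line, not ours */

    writel(status, priv->regs + MY_NIC_INT_STATUS); /* 1. acknowledge */
    writel(0, priv->regs + MY_NIC_INT_MASK);        /* 2. mask RX interrupts */
    napi_schedule(&priv->napi);                     /* 3. defer to softirq */

    return IRQ_HANDLED;
}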

Step 2 — NAPI Poll

  • NAPI runs in softirq context.

  • Driver walks the RX ring starting at its tail pointer:

    1. For each descriptor marked DONE (i.e., holding a completed packet):

      • Create sk_buff pointing to the DMA buffer.
      • Set length, protocol, and other metadata.
      • Pass sk_buff into netif_receive_skb() (network stack).
    2. Mark the descriptor EMPTY (usable by the NIC again) and advance the tail pointer.

[ EMPTY, <Tail Here>, DONE, EMPTY, EMPTY, ... ]
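
Continuing the hypothetical driver from Step 1, the poll callback might look roughly like this. build_skb(), eth_type_trans(), netif_receive_skb(), and napi_complete_done() are real kernel APIs; the buffer management is heavily simplified (a real driver must also sync/unmap the DMA buffer, respect build_skb()’s headroom/tailroom rules, and post a fresh buffer into the slot):

/* Also needs <linux/skbuff.h> and <linux/etherdevice.h>. */
static int my_nic_poll(struct napi_struct *napi, int budget)
{
    struct my_nic_priv *priv = container_of(napi, struct my_nic_priv, napi);
    int done = 0;

    while (done < budget) {
        struct rx_desc *d = &priv->ring[priv->tail];
        struct sk_buff *skb;

        if (!(d->status & DESC_DONE))
            break;                      /* ring drained */

        /* Wrap the filled DMA buffer in an sk_buff (no copy). */
        skb = build_skb(priv->bufs[priv->tail], priv->buf_len);
        if (!skb)
            break;                      /* allocation pressure; retry later */
        skb_put(skb, d->length);
        skb->protocol = eth_type_trans(skb, priv->netdev);
        netif_receive_skb(skb);         /* hand off to the network stack */

        d->status = 0;                  /* mark EMPTY for the NIC */
        priv->tail = ring_next(priv->tail);
        done++;
    }

    if (done < budget) {
        /* Drained within budget: leave polling mode, re-enable RX interrupts. */
        napi_complete_done(napi, done);
        writel(~0u, priv->regs + MY_NIC_INT_MASK);
    }
    return done;
}
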
What does softIRQ context mean in Linux?

In Linux, softIRQ context means the code is running in a special, deferred interrupt handling mode — not as a normal process, but not as a hard interrupt either.
It’s a middle ground the kernel uses so that heavy work triggered by an interrupt doesn’t block other interrupts for too long.


Why it exists

Hard IRQ context (ISR): Runs immediately when the CPU gets an interrupt.

  • Runs with interrupts disabled.
  • Must be very quick — just enough to acknowledge hardware and schedule real work.
  • Can’t sleep or block.

SoftIRQ context:

  • Scheduled by a hard IRQ handler.
  • Runs with interrupts enabled.
  • Can take longer because it’s not holding up other interrupts.
  • Still not normal process context — can’t sleep, can’t call blocking functions.
  • Runs either right after the hard IRQ exits or later in a special kernel thread (ksoftirqd).

How this applies to NAPI and NIC RX

  1. Packet arrives → NIC raises interrupt → ISR runs in hard IRQ context.
  2. ISR disables further NIC interrupts and calls napi_schedule().
  3. napi_schedule() queues the NIC’s poll handler and raises NET_RX_SOFTIRQ so it runs in softIRQ context.
  4. Later, the kernel runs the softIRQ handler:
    • Driver’s poll() function drains RX ring, creates sk_buffs, passes them to networking stack.
  5. When the poll has drained the ring, the driver re-enables the NIC’s interrupts (napi_complete()) and the CPU resumes normal tasks.

Why not just do all work in the ISR?

Because:

  • Copying packets, allocating sk_buffs, and running protocol parsing are slow compared to just acknowledging the hardware.
  • While you’re in a hard IRQ, all other interrupts on that CPU are masked.
  • Spending 500 µs in a hard IRQ could delay timers, disk I/O, other NIC queues, etc.
  • Splitting into hard IRQ → softIRQ keeps “acknowledge and exit fast” while still giving low latency packet processing.

Step 3 — Network Stack

  • L2 parse: Ethernet header removed.
  • L3 parse: IP header validated, checksum verified (or skipped if offloaded).
  • L4 parse: TCP/UDP header parsed, ports identified.
  • skb is queued into the correct socket receive queue.
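
As a userspace illustration of that layering (hand-written header structs, IPv4 + UDP only, checksums and options mostly ignored; not kernel code):

#include <stdint.h>
#include <stddef.h>
#include <arpa/inet.h>  /* ntohs() */

struct eth_hdr  { uint8_t dst[6], src[6]; uint16_t ethertype; } __attribute__((packed));
struct ipv4_hdr { uint8_t ver_ihl, tos; uint16_t tot_len, id, frag_off;
                  uint8_t ttl, proto; uint16_t csum;
                  uint32_t saddr, daddr; } __attribute__((packed));
struct udp_hdr  { uint16_t sport, dport, len, csum; } __attribute__((packed));

/* Walk L2 -> L3 -> L4 over a raw frame; returns the UDP destination port
   (the key used to pick a socket receive queue), or -1 if not IPv4/UDP. */
static int parse_frame(const uint8_t *p, size_t len)
{
    if (len < sizeof(struct eth_hdr) + sizeof(struct ipv4_hdr))
        return -1;
    const struct eth_hdr *eth = (const struct eth_hdr *)p;       /* L2 */
    if (ntohs(eth->ethertype) != 0x0800)
        return -1;                                               /* not IPv4 */

    const struct ipv4_hdr *ip =
        (const struct ipv4_hdr *)(p + sizeof(struct eth_hdr));   /* L3 */
    size_t ihl = (size_t)(ip->ver_ihl & 0x0f) * 4;
    if (ip->proto != 17 ||
        len < sizeof(struct eth_hdr) + ihl + sizeof(struct udp_hdr))
        return -1;                                               /* not UDP */

    const struct udp_hdr *udp =
        (const struct udp_hdr *)((const uint8_t *)ip + ihl);     /* L4 */
    return ntohs(udp->dport);
}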

Step 4 — User Space

  • An app blocked in recv() or read() wakes up when data is available.
  • The skb’s payload is copied into the user buffer (or mapped in zero-copy modes).
  • The skb is freed back to the pool.
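
The user-space end of the path is just a blocking socket read. A minimal UDP receiver (port 9000 is an arbitrary example):

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port   = htons(9000),
        .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
    };
    char buf[2048];
    ssize_t n;

    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("socket/bind");
        return 1;
    }

    /* Sleeps until the stack queues an skb on this socket; the copy from
       kernel skb into buf happens inside this call. */
    n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
    printf("received %zd bytes\n", n);

    close(fd);
    return 0;
}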


What Happens if the RX Ring Fills

Because the RX ring has a fixed number of slots, it can fill up under heavy load:

Filling Condition

  • NIC head pointer catches up to driver tail pointer.
  • All descriptors are marked DONE, but driver hasn’t processed them yet.
  • No free descriptor = nowhere to DMA the next packet (see the check sketched below).
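
In the terms of the earlier model, the condition is simply that the slot at the NIC’s head has not been reclaimed yet (rx_ring, hw_head, and DESC_DONE are from the Step 0 sketch):

/* The NIC cannot overwrite a descriptor the driver has not reclaimed. */
static int ring_full(void)
{
    return rx_ring[hw_head].status & DESC_DONE;
}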

Result

  • Packets are dropped in hardware:

    • NIC increments a missed packet or RX drop counter.
    • The dropped frame never makes it into RAM.
  • Driver can read this counter via NIC registers for diagnostics.


Why it happens

  • CPU or driver not processing descriptors fast enough (high interrupt load, other workloads).
  • Burst of incoming packets exceeds ring capacity before driver can catch up.
  • Interrupt moderation might delay driver wakeup just long enough for the ring to fill.

Mitigations

  1. Increase RX ring size

    • Some NICs allow larger descriptor rings (e.g., from 256 to 4096 entries).
  2. Enable NAPI / packet batching

    • Reduces per-packet interrupt cost, processes multiple per poll.
  3. Distribute load with RSS

    • Multiple RX queues mapped to different cores.
  4. Use higher-performance packet paths

    • XDP, DPDK, or other bypass frameworks.
  5. Tune interrupt moderation

    • Lower coalescing delay to drain ring sooner.

Key Insight

Once a packet is dropped due to a full RX ring, it’s gone — the NIC doesn’t have a secondary overflow buffer. The only way to prevent this is to make sure the driver processes descriptors quickly enough, or to spread the load across more queues/cores.