VPP At 100GbE: Dropping Packets At 140 Mpps With A Flow Cache, And Why Your NIC Is The Real Ceiling

Introduction:

Achieving line-rate packet processing at 100 Gigabit Ethernet (100GbE) is one of the most demanding challenges in modern networking. At 100 Gbps, minimum-sized 64-byte frames (84 bytes each on the wire once the preamble and inter-frame gap are counted) arrive at approximately 148.8 million packets per second (Mpps), leaving virtually no room for inefficient processing. The Vector Packet Processing (VPP) stack, part of the FD.io project, tackles this with a software data plane that can drop packets at 140 Mpps through a combination of a flow cache, tuple-space search, and a strict per-packet cycle budget. This article breaks down how VPP achieves this performance, explains why the NIC ring, not the CPU, becomes the ultimate bottleneck, and provides a hands-on guide to configuring and tuning VPP for high-speed packet filtering.
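
The 148.8 Mpps figure can be reproduced with a few lines of arithmetic. The sketch below assumes standard Ethernet framing, where every frame carries 20 extra bytes on the wire (preamble, start-of-frame delimiter, and inter-frame gap); the function name is illustrative and not part of any VPP tooling.

```python
# Per-frame overhead on the wire beyond the frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second for a given link speed and frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


if __name__ == "__main__":
    for size in (64, 128, 512, 1518):
        mpps = line_rate_pps(100e9, size) / 1e6
        print(f"{size:>5} B frames: {mpps:7.2f} Mpps")
```

For 64-byte frames this gives roughly 148.81 Mpps, and the rate falls quickly as frames grow, which is why small-packet tests are the hard case.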

Learning Objectives:

Understand the architecture of VPP’s flow cache and how tuple‑space search enables near‑line‑rate packet classification.
Learn to calculate and optimize the per‑packet cycle budget on modern x86_64 CPUs for 100GbE workloads.
Configure and benchmark VPP ACLs (Access Control Lists) and flow‑based filtering rules to drop or forward traffic at 140+ Mpps.
Diagnose NIC ring buffer limitations and tune DPDK parameters to push the bottleneck from CPU to the NIC.

You Should Know:

1. The Flow Cache and Tuple-Space Search: The Heart of High-Speed Filtering

VPP's packet processing pipeline is built around a flow cache that stores session state for active traffic flows. When a packet arrives, VPP first looks up its 5-tuple (source IP, destination IP, source port, destination port, protocol) in the flow cache, which is an exact-match hash lookup. If a match is found, the cached action (e.g., drop or forward) is executed without re-evaluating the entire policy set. On a miss, the packet is classified against the rule set using tuple-space search, which groups rules that share the same mask into hash tables so that classification takes a handful of hash probes rather than a linear walk through every rule. This is the key to sustaining 140 Mpps: the first packet of a flow pays the full classification cost, but subsequent packets enjoy a fast-path lookup.
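
The fast-path/slow-path split can be illustrated with a toy model. This is a minimal sketch, not VPP's implementation: `FlowCache`, `FiveTuple`, and `example_policy` are hypothetical names, and a Python dict stands in for VPP's hash tables.

```python
from typing import Callable, Dict, NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int  # IP protocol number, e.g. 6 = TCP, 17 = UDP


class FlowCache:
    """Toy flow cache: one dict lookup on the fast path."""

    def __init__(self, classify: Callable[[FiveTuple], str]) -> None:
        self._classify = classify  # slow path: full policy evaluation
        self._actions: Dict[FiveTuple, str] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, key: FiveTuple) -> str:
        action = self._actions.get(key)
        if action is not None:
            self.hits += 1  # fast path: flow already classified
            return action
        self.misses += 1  # first packet of the flow pays full cost
        action = self._classify(key)
        self._actions[key] = action
        return action


def example_policy(key: FiveTuple) -> str:
    """Hypothetical policy: drop DNS over UDP, forward everything else."""
    return "drop" if key.proto == 17 and key.dst_port == 53 else "forward"


cache = FlowCache(example_policy)
flow = FiveTuple("10.0.0.1", "10.0.0.2", 40000, 53, 17)
results = [cache.lookup(flow) for _ in range(1000)]
print(cache.hits, cache.misses)
```

Only the first of the 1000 packets reaches `example_policy`; the remaining 999 are answered from the cache.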

Step‑by‑step: Enabling and Monitoring the Flow Cache in VPP

Verify the ACL plugin is loaded – Stateful (session-cached) ACLs are provided by VPP's `acl` plugin. From a shell on the VPP host:
```
vppctl show plugins | grep acl
```

Look for `acl_plugin.so` in the list of loaded plugins.

Enable stateful ACLs – When creating an ACL, use the `permit+reflect` action for rules whose flows should be cached (angle brackets mark placeholders; exact syntax can vary between releases, so check the CLI's built-in help):
```
vppctl set acl-plugin acl permit+reflect src <prefix> dst <prefix> proto <n> dport <port>
```
This instructs VPP to create a session entry keyed on the flow's 5-tuple, so later packets of the flow skip rule evaluation. Plain `permit` and `deny` rules are matched statelessly.
View flow cache statistics – Inspect the session table:
```
vppctl show acl-plugin sessions
```
This displays session totals (added, deleted, active) along with the configured session timeouts.
Tune cache size – The session table is bounded to prevent memory exhaustion. Set the maximum number of sessions in the `acl-plugin` section of `/etc/vpp/startup.conf`:
```
acl-plugin { connection count max <N> }
```
For 100GbE with millions of concurrent flows, set N appropriately (e.g., 2,000,000); the hash bucket and memory settings in the same section may need to grow with it.
Test with a synthetic load – Use `TRex` or `pktgen` to generate a mix of new and established flows. Observe the cache hit rate; a well-tuned cache should exceed 95% hits for steady-state traffic.
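
To see why the cache bound matters for the hit rate, the steps above can be mimicked offline. The sketch below is a simplified model under loud assumptions: flows are chosen uniformly at random, and eviction is least-recently-used, which is a stand-in chosen for illustration (VPP ages sessions out by timeout). `simulate_hit_rate` is a hypothetical helper.

```python
import random
from collections import OrderedDict


def simulate_hit_rate(max_entries: int, active_flows: int,
                      packets: int, seed: int = 1) -> float:
    """Return the fraction of packets served from a bounded LRU flow cache."""
    rng = random.Random(seed)
    cache: "OrderedDict[int, None]" = OrderedDict()
    hits = 0
    for _ in range(packets):
        flow_id = rng.randrange(active_flows)  # uniform traffic: a simplification
        if flow_id in cache:
            hits += 1
            cache.move_to_end(flow_id)  # mark as most recently used
        else:
            cache[flow_id] = None  # slow path, then insert
            if len(cache) > max_entries:
                cache.popitem(last=False)  # evict the least recently used flow
    return hits / packets


big = simulate_hit_rate(max_entries=2_000, active_flows=1_000, packets=100_000)
small = simulate_hit_rate(max_entries=100, active_flows=1_000, packets=100_000)
print(f"cache larger than flow count:  {big:.1%} hits")
print(f"cache smaller than flow count: {small:.1%} hits")
```

When the cache can hold every active flow, only the first packet of each flow misses; when it holds a tenth of them, most packets fall back to the slow path.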

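Linux Commands for Traffic Generation:

TRex:
`sudo ./t-rex-64 -i -c 1 --cfg trex_cfg.yaml` – start the TRex server in interactive (stateless) mode.
`./trex-console` – open the TRex console in a second terminal.
`trex> start -f stl/udp_1pkt_simple.py -m 100% -d 60` – generate 100% line rate for 60 seconds, here with one of the stateless profiles shipped with TRex.
Pktgen-DPDK:
`pktgen -l 0-2 -n 4 -- -P -m "[1:2].0"` – launch Pktgen with core 1 handling RX and core 2 handling TX on port 0 (core 0 is left to Pktgen itself); then use `set 0 rate 100` followed by `start 0` to push 100% line rate.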

2. The Per‑Packet Cycle Budget: Math and Reality

At 140 Mpps, a packet arrives every ~7.14 nanoseconds. On a 2.5 GHz core that is roughly 17.9 CPU cycles: this is the per-packet cycle budget when a single core handles all the traffic, and with the load spread evenly over N workers each worker gets about N times that. Modern x86_64 cores can retire multiple instructions per cycle, but memory accesses are costly, and a single cache miss that goes to DRAM can consume several packets' worth of budget. VPP's vector processing, which batches up to 256 packets per vector, amortizes the overhead of I/O and reduces instruction-cache misses. Even so, the cycle budget is tight; any branch misprediction or cache miss can blow the budget and cause packet drops.
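
The budget arithmetic is easy to script when comparing CPUs or worker counts. The sketch below assumes traffic is spread evenly across workers, which is an idealization; the helper names are illustrative.

```python
def ns_per_packet(target_pps: float) -> float:
    """Average packet inter-arrival time in nanoseconds."""
    return 1e9 / target_pps


def cycles_per_packet(cpu_hz: float, target_pps: float, workers: int = 1) -> float:
    """Cycle budget per packet when load is split evenly over `workers` cores."""
    return cpu_hz * workers / target_pps


if __name__ == "__main__":
    print(f"inter-arrival time: {ns_per_packet(140e6):.2f} ns")
    for n in (1, 4, 8):
        budget = cycles_per_packet(2.5e9, 140e6, workers=n)
        print(f"{n} worker(s): {budget:6.1f} cycles per packet")
```

A single 2.5 GHz core gets about 17.9 cycles per packet at 140 Mpps, while eight evenly loaded workers get about 143 cycles each.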

Step‑by‑step: Measuring and Optimizing Per‑Packet Cycles

Measure current throughput – Use VPP's built-in performance counters:
```
vppctl show runtime
```
This shows, for each graph node (e.g., `ip4-input`, `acl-plugin-in-ip4-fa`), the average clocks per packet and the average vector size per call.
Identify bottlenecks – Look for nodes with high clock counts. For example, if the ACL node shows >20 clocks/packet, your ACL rules may be too complex or the flow cache is undersized.
Optimize ACL rules – Rules are matched with first-match semantics. Where rules do not overlap, place the ones that match the most traffic first to reduce the average search depth; where they do overlap, the more specific rule must stay ahead of the broader one. Use `show acl-plugin acl` to review the rule order.
Check vector fill – VPP's maximum vector size is 256 packets, fixed at build time (`VLIB_FRAME_SIZE`) rather than set in `/etc/vpp/startup.conf`. The `Vectors/Call` column of `show runtime` shows how full the vectors are: values near 256 mean the worker is saturated, while small values mean it still has headroom.
Pin CPU cores – Isolate cores for VPP workers using `isolcpus` in Linux kernel boot parameters, then pin each worker to a dedicated core in startup.conf:
```
cpu { main-core 0 corelist-workers 1-3 }
```
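
The effect of rule ordering in the "Optimize ACL rules" step can be estimated before touching a live system. This is a minimal sketch assuming a linear first-match walk and rules that do not overlap (so reordering cannot change which rule a packet hits); the traffic shares are made-up illustration values.

```python
from typing import Sequence


def expected_search_depth(hit_shares: Sequence[float]) -> float:
    """Average number of rules examined per packet in a linear first-match walk.

    hit_shares[i] is the fraction of packets whose first match is rule i.
    """
    return sum(position * share
               for position, share in enumerate(hit_shares, start=1))


# Hypothetical traffic mix: the busiest rule is listed last, then first.
busiest_last = [0.05, 0.15, 0.80]
busiest_first = [0.80, 0.15, 0.05]
print(expected_search_depth(busiest_last))
print(expected_search_depth(busiest_first))
```

With these shares the average depth falls from about 2.75 rules to about 1.25 rules per packet once the busiest rule is moved to the front.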

Linux Command to Check CPU Cycles:

`perf stat -e cycles,instructions,cache-misses -p <pid>` – attach `perf` to the running VPP process (replace `<pid>` with its process ID) to count cycles, instructions, and cache misses until interrupted.

3. Why the NIC Ring, Not the CPU, Is the Ceiling

Even if your CPU can process 140 Mpps, the network interface card (NIC) may become the bottleneck. Each of the NIC's RX/TX descriptor rings has a finite depth, typically 1024 or 2048 descriptors. At 148 Mpps, a single 1024-entry ring fills in under 7 microseconds if nothing drains it. If the CPU cannot drain the rings fast enough, the NIC will drop packets (RX drops) or stall (TX backpressure). VPP and DPDK use poll-mode drivers (PMDs) that poll the rings continuously instead of waiting for interrupts, but if a ring is too shallow to absorb what arrives between two polls, or traffic is concentrated on too few queues, packets are lost in hardware and throughput is capped well below the CPU's potential.
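
How long a ring can absorb a stall follows directly from its depth and the arrival rate. The sketch below is a simple fluid model that assumes constant arrival and drain rates on a single queue; `ring_fill_time_us` is an illustrative helper, not a DPDK or VPP function.

```python
def ring_fill_time_us(descriptors: int, arrival_pps: float,
                      drain_pps: float = 0.0) -> float:
    """Microseconds until an empty ring overflows at constant rates."""
    net_pps = arrival_pps - drain_pps
    if net_pps <= 0:
        return float("inf")  # the consumer keeps up, so the ring never fills
    return descriptors / net_pps * 1e6


if __name__ == "__main__":
    for depth in (1024, 2048, 4096):
        stalled = ring_fill_time_us(depth, 148.8e6)
        behind = ring_fill_time_us(depth, 148.8e6, drain_pps=140e6)
        print(f"{depth:>4} descriptors: {stalled:5.1f} us if stalled, "
              f"{behind:6.1f} us when draining at 140 Mpps")
```

A 1024-entry ring covers roughly 6.9 microseconds of a fully stalled consumer at line rate, and a 4096-entry ring roughly four times that, which is the margin the ring-size tuning in this section buys.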

Step‑by‑step: Tuning NIC Rings for 100GbE

Check current ring sizes – Use DPDK's `testpmd` or VPP's hardware interface view:
```
vppctl show hardware-interfaces <iface>
```
Look for the RX and TX queue and descriptor counts.
Increase ring depth – In VPP's startup.conf, set descriptor counts for your DPDK-managed interfaces:
```
dpdk {
  dev <pci-address> {
    num-rx-queues 4
    num-tx-queues 4
    num-rx-desc 4096
    num-tx-desc 4096
  }
}
```
A ring size of 4096 provides more buffer against micro-bursts. The maximum depends on the NIC and its driver.
Mind the burst size – VPP's DPDK input node pulls at most one vector (256 packets) from a queue per poll, and this is not a startup.conf setting. The ring has to hold whatever arrives between two polls of the same queue, so ring depth and queue count are the knobs to adjust.
Monitor drops – Use `show interface <iface>` and look for `rx-miss` (packets the NIC discarded because no RX descriptor was free) and `drops`. If `rx-miss` keeps increasing, your rings are overflowing.
Consider multiple queues – Spread traffic across multiple RX queues using RSS (Receive Side Scaling). In startup.conf:
```
dpdk { dev <pci> { num-rx-queues 8 } }
```
Then assign each queue to a different worker core. This parallelizes packet reception and reduces per‑queue pressure.
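
The idea behind RSS, hashing a flow's 5-tuple so that every packet of a flow lands on the same queue, can be sketched as follows. Real NICs compute a Toeplitz hash with a configurable key and an indirection table; the SHA-256 digest here is only a stand-in chosen for illustration, and `rss_queue` is a hypothetical helper.

```python
import hashlib
from collections import Counter


def rss_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              proto: int, num_queues: int) -> int:
    """Map a flow to an RX queue; the same 5-tuple always gives the same queue."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()  # stand-in for the NIC's Toeplitz hash
    return int.from_bytes(digest[:4], "big") % num_queues


# Spread 8000 synthetic UDP flows (varying source port) over 8 queues.
counts = Counter(
    rss_queue("10.0.0.1", "10.0.0.2", port, 53, 17, 8)
    for port in range(1024, 9024)
)
for queue in sorted(counts):
    print(f"queue {queue}: {counts[queue]} flows")
```

The spread is only even when traffic contains many distinct flows. A benchmark that sends a single flow will put all of its packets on one queue regardless of how many queues are configured.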

testpmd Equivalent (DPDK without VPP):

The same principles apply: on Linux, use `dpdk-devbind.py` to bind the NIC to a DPDK-compatible driver, and set the `--rxd` and `--txd` parameters in `testpmd` to adjust ring sizes.

4. Building and Deploying a High‑Performance ACL Filter

A practical use case is deploying an ACL that drops unwanted traffic (e.g., DDoS protection) while forwarding legitimate flows at line rate. VPP’s ACL plugin supports both stateless and stateful (flow‑cached) rules. For 100GbE, stateful ACLs are mandatory because they avoid re‑evaluating every packet against the full rule set.

Step‑by‑step: Deploying a Stateful ACL

Create an ACL with stateful rules – Suppose you want to permit SSH (port 22) from a specific subnet and drop everything else. Both rules go into one ACL, separated by a comma:
```
vppctl set acl-plugin acl permit+reflect src 192.168.1.0/24 dst 0.0.0.0/0 proto 6 dport 22, deny src 0.0.0.0/0 dst 0.0.0.0/0
```
The `permit+reflect` action enables session caching for flows that match the first rule; the `deny` rule is matched statelessly. Use `show acl-plugin acl` to look up the index of the new ACL.
Apply the ACL to an interface – Attach the ACL to the input path of your 100GbE interface:
```
vppctl set acl-plugin interface <iface> input acl <acl-index>
```
Verify the filter – Use `show acl-plugin interface` to confirm the ACL is applied.
Test with a mix of traffic – Generate traffic with both permitted and denied flows. Check `show acl-plugin sessions` to see that the session table is being populated.
Tune for micro-bursts – If you see occasional drops, increase the session table size and make sure hash-based rule lookup is enabled (it is normally on by default):
```
vppctl set acl-plugin use-hash-acl-matching 1
```
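
The intended behaviour of the example policy (permit SSH from 192.168.1.0/24, deny everything else) can be checked offline with a small first-match model. This is a sketch of the matching semantics only, not of VPP's data structures; `Rule` and `evaluate` are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    action: str  # "permit" or "deny"
    src: str = "0.0.0.0/0"  # source prefix; the default matches any IPv4 address
    proto: Optional[int] = None  # None matches any protocol
    dst_port: Optional[int] = None  # None matches any destination port

    def matches(self, src_ip: str, proto: int, dst_port: int) -> bool:
        if ipaddress.ip_address(src_ip) not in ipaddress.ip_network(self.src):
            return False
        if self.proto is not None and proto != self.proto:
            return False
        if self.dst_port is not None and dst_port != self.dst_port:
            return False
        return True


def evaluate(rules: Sequence[Rule], src_ip: str, proto: int, dst_port: int) -> str:
    """Return the action of the first matching rule."""
    for rule in rules:
        if rule.matches(src_ip, proto, dst_port):
            return rule.action
    return "deny"  # modelling choice: nothing matched, so drop


RULES = [
    Rule("permit", src="192.168.1.0/24", proto=6, dst_port=22),  # SSH from the subnet
    Rule("deny"),  # everything else
]

print(evaluate(RULES, "192.168.1.10", 6, 22))
print(evaluate(RULES, "192.168.2.10", 6, 22))
print(evaluate(RULES, "192.168.1.10", 6, 80))
```

Only TCP traffic to port 22 from inside the subnet is permitted; a packet from another subnet, or to another port, falls through to the catch-all deny.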

5. Benchmarking 100GbE Packet Drops: Tools and Methodology

To validate that your VPP instance can drop packets at 140 Mpps, you need a reliable benchmarking setup. A common choice is TRex, an open-source traffic generator with stateless and stateful modes that can saturate 100GbE links with realistic traffic patterns.

Step‑by‑step: Benchmarking with TRex

Install TRex on a separate server with a 100GbE NIC (or use two ports on the same server in loopback mode).
Configure TRex – Edit `/etc/trex_cfg.yaml` to set the interfaces and their MAC (or IP) addresses. The transmit rate is chosen when traffic is started, not in this file.
Generate traffic – Start TRex in interactive mode and, from the TRex console, push 100% line rate with small packets, here using one of the stateless profiles shipped with TRex:
```
trex> start -f stl/udp_1pkt_simple.py -m 100% -d 60
```
Monitor VPP – On the VPP side, run `show interface <iface>` and `show runtime` every second to observe drops and per-packet clocks.
Analyze results – Check the `rx-miss` and `drops` counters. If `rx-miss` stays at zero, VPP is draining the rings as fast as the NIC fills them. If it is non-zero, increase ring size or add more RX queues. Also check `show errors` to see which graph nodes are counting drops or errors.
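
When comparing runs, it helps to reduce two counter snapshots to a drop ratio and an achieved rate. The sketch below assumes you have read a received-packet counter and a drop counter before and after a measurement interval; the function names and sample values are illustrative, and no VPP output format is being parsed.

```python
def drop_ratio(rx_before: int, rx_after: int,
               drops_before: int, drops_after: int) -> float:
    """Fraction of offered packets that were dropped during the interval."""
    received = rx_after - rx_before
    dropped = drops_after - drops_before
    offered = received + dropped
    return dropped / offered if offered else 0.0


def achieved_mpps(rx_before: int, rx_after: int, interval_s: float) -> float:
    """Average receive rate over the interval, in millions of packets per second."""
    return (rx_after - rx_before) / interval_s / 1e6


# Made-up sample: 1 second between snapshots.
ratio = drop_ratio(rx_before=0, rx_after=140_000_000,
                   drops_before=0, drops_after=8_800_000)
rate = achieved_mpps(rx_before=0, rx_after=140_000_000, interval_s=1.0)
print(f"{rate:.1f} Mpps received, {ratio:.2%} dropped")
```

Using counter deltas rather than absolute values keeps earlier test runs from skewing the result.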

Linux Command for Real‑time Monitoring:

`watch -n 1 "vppctl show interface | grep -E 'rx|tx|drop'"` – provides a live dashboard of interface statistics.

6. Advanced: Alternative Kernel-Bypass Input Paths

VPP runs entirely in userspace, bypassing the kernel network stack. With DPDK (Data Plane Development Kit), poll-mode drivers access the NIC's rings directly, so the data path involves no system calls. VPP also offers two alternative input paths: AF_XDP, which receives packets through the kernel's eXpress Data Path while the NIC stays bound to its Linux driver, and the native `rdma` plugin, which drives supported NICs through the RDMA verbs interface. Neither is guaranteed to beat DPDK on latency; which path is fastest depends on the NIC and driver, so benchmark before switching.

Step‑by‑step: Enabling AF_XDP in VPP

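Check for AF_XDP support – Ensure your VPP build includes the `af_xdp` plugin (`vppctl show plugins | grep af_xdp`).

Create the interface – AF_XDP interfaces are created from the VPP CLI rather than in `startup.conf`, while the NIC stays bound to its Linux kernel driver:

vppctl create interface af_xdp host-if <iface> num-rx-queues all

Load a custom XDP program (optional) – The plugin attaches its own XDP program when the interface is created, so no separate bind step is needed. If you do want to attach a program manually, the `xdp-loader` tool from the xdp-tools package can do it:
```
xdp-loader load -m skb <iface> /path/to/vpp_af_xdp.o
```
Verify – Use `show hardware-interfaces` to confirm the interface is using the `af_xdp` driver.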

What Undercode Says:

Key Takeaway 1: The flow cache is not optional at 100GbE; without it, the per‑packet ACL evaluation would exceed the 17‑cycle budget, leading to massive drops. Stateful ACLs are a must.
Key Takeaway 2: The NIC ring buffer is often the forgotten bottleneck. Many engineers spend days tuning CPU affinity and vector sizes, only to find that a simple ring‑size increase from 1024 to 4096 eliminates all drops.

Analysis:

VPP’s ability to drop packets at 140 Mpps is a testament to the power of userspace packet processing and vectorized batching. However, achieving this performance in production requires a holistic view: the CPU, memory hierarchy, NIC hardware, and even the PCIe bus speed all play a role. The per‑packet cycle budget is so tight that any deviation—such as a cache miss due to a large ACL table or an interrupt from another core—can cause the system to fall behind. Moreover, the NIC’s ring buffer acts as a finite queue; if the CPU cannot drain it fast enough, packets are dropped at the hardware level, and no amount of CPU tuning can recover them. This underscores the importance of proper queue sizing and RSS configuration. Finally, while VPP is production‑ready, achieving 140 Mpps is not a plug‑and‑play affair; it demands careful benchmarking, iterative tuning, and a deep understanding of both software and hardware bottlenecks.

Prediction:

+1 As 100GbE and 200GbE become standard in data centers, VPP’s software‑only approach will increasingly replace expensive hardware ASICs for many packet‑processing tasks, driving down costs and increasing flexibility.
+1 The adoption of programmable NICs (SmartNICs) with on‑board flow caches will offload even more processing from the CPU, pushing the achievable packet rate beyond 200 Mpps while maintaining the same CPU cycle budget.
-1 The complexity of tuning VPP for line rate will remain a barrier for many organizations, leading to a skills gap and potential misconfigurations that cause performance degradation rather than improvement.
-1 As encryption (IPsec, TLS) becomes ubiquitous, the per‑packet cycle budget will shrink further, potentially making software‑only 100GbE filtering impossible without hardware acceleration, forcing a hybrid approach.

Reported By: Haryachyy Learning – Hackers Feeds