XDP and AF_XDP based on net-next (Dec 18) #5

Open
michalQb wants to merge 21 commits into net-next-main
Conversation

michalQb (Owner)

No description provided.

jahay1 and others added 15 commits December 19, 2023 10:22
Tell hardware to write back completed descriptors even when interrupts
are disabled. Otherwise, descriptors might not be written back until
the hardware can flush a full cacheline of descriptors. This can cause
unnecessary delays when traffic is light (or even trigger a Tx queue
timeout).

An example scenario that reproduces the Tx timeout if the fix is not
applied:
  - configure at least 2 Tx queues to be assigned to the same q_vector,
  - generate heavy Tx traffic on the first Tx queue,
  - try to send a few packets using the second Tx queue.
In such a case, a Tx timeout will appear on the second Tx queue because
no completion descriptors are written back for that queue while
interrupts are disabled due to NAPI polling.

The patch is necessary to start work on the AF_XDP implementation for
the idpf driver, because there may be a case where a regular LAN Tx
queue and an XDP queue share the same NAPI.

Fixes: c2d548c ("idpf: add TX splitq napi poll support")
Fixes: a5ab9ee ("idpf: add singleq start_xmit and napi poll")
Reviewed-by: Przemek Kitszel <[email protected]>
Reviewed-by: Alexander Lobakin <[email protected]>
Signed-off-by: Joshua Hay <[email protected]>
Co-developed-by: Michal Kubiak <[email protected]>
Signed-off-by: Michal Kubiak <[email protected]>
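
A minimal sketch of the write-back-on-ITR idea described in the commit
above. The register layout, bit position and all names here are
assumptions for illustration, not the actual idpf definitions:

```c
#include <linux/bits.h>
#include <linux/io.h>

/* Hypothetical dynamic interrupt-throttling control register. */
#define ITR_DYN_CTL(vec)	(0x1000 + (vec) * 4)
#define ITR_DYN_CTL_WB_ON_ITR	BIT(30)	/* write back without interrupt */

static void vec_disable_irq_wb_on_itr(void __iomem *hw, int vec)
{
	/* The interrupt-enable bit stays cleared so nothing fires while
	 * NAPI polls, but WB_ON_ITR asks the HW to keep writing back
	 * completed descriptors instead of batching them until a full
	 * cacheline accumulates. */
	writel(ITR_DYN_CTL_WB_ON_ITR, hw + ITR_DYN_CTL(vec));
}
```
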
The page pool feature allows for setting the page offset as one of
the creation parameters. Such an offset can be used for XDP-specific
configuration of the page pool when we need some extra space reserved
for the packet headroom.

Unfortunately, that page offset value (from the page pool) was never
used during SKB build, which can have a negative impact when the
XDP_PASS action is returned and the received packet should be passed
to the kernel network stack.

Address the problem by adding the page offset from the page pool when
the SKB offset is being computed.

Fixes: 3a8845a ("idpf: add RX splitq napi poll support")
Signed-off-by: Michal Kubiak <[email protected]>
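
A hedged sketch of the fix (helper and field names are assumptions,
and the page_pool helpers header path varies by kernel version): the
pool's own offset must be included when reserving skb headroom,
otherwise skb->data lands inside the reserved XDP headroom.

```c
#include <linux/skbuff.h>
#include <net/page_pool/helpers.h>

/* hw_offset is where HW placed the packet within the buffer;
 * pool->p.offset is the headroom the page pool itself reserved
 * (e.g. for XDP). Both must be part of the skb data offset. */
static struct sk_buff *sketch_build_skb(struct page_pool *pool,
					void *buf_va, u32 hw_offset,
					u32 size, u32 truesize)
{
	struct sk_buff *skb = build_skb(buf_va, truesize);

	if (!skb)
		return NULL;

	skb_reserve(skb, pool->p.offset + hw_offset);
	skb_put(skb, size);
	return skb;
}
```
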
Extend basic structures of the driver (e.g. 'idpf_vport', 'idpf_queue',
'idpf_vport_user_config_data') by adding members necessary to support XDP.
Add extra XDP Tx queues needed to support XDP_TX and XDP_REDIRECT actions
without interfering with regular Tx traffic.
Also add functions dedicated to supporting XDP initialization for Rx and
Tx queues, and call those functions from the existing queue
configuration logic.

Signed-off-by: Michal Kubiak <[email protected]>
Implement loading the XDP program using ndo_bpf
callback for splitq and XDP_SETUP_PROG parameter.

Add functions for stopping, reconfiguring and restarting
all queues when needed.
Also, implement the XDP hot swap mechanism for when an existing
XDP program is replaced by another one (without the need to
reconfigure anything).

Signed-off-by: Michal Kubiak <[email protected]>
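
A sketch of the hot-swap logic described above, under the assumption
that the driver keeps the active program in a single pointer (all names
here are hypothetical): when XDP stays enabled, the program pointer is
exchanged atomically and the old program released, with no queue
reconfiguration; only enabling or disabling XDP goes through the
stop/reconfigure/restart path.

```c
#include <linux/atomic.h>
#include <linux/bpf.h>
#include <linux/netdevice.h>

struct my_vport {
	struct bpf_prog *xdp_prog;
};

int my_vport_reconfig_for_xdp(struct my_vport *vport, struct bpf_prog *prog);

static int my_xdp_setup_prog(struct my_vport *vport, struct netdev_bpf *bpf)
{
	struct bpf_prog *old;

	if (!!bpf->prog == !!vport->xdp_prog) {
		/* Hot swap: XDP stays on (or off), nothing to rebuild. */
		old = xchg(&vport->xdp_prog, bpf->prog);
		if (old)
			bpf_prog_put(old);
		return 0;
	}

	/* Enabling/disabling XDP changes headroom and queue layout:
	 * stop all queues, reconfigure, restart. */
	return my_vport_reconfig_for_xdp(vport, bpf->prog);
}
```
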
Implement basic setup of the XDP program. Extend the function
for creating the page pool by adding support for XDP headroom
configuration.
Add handling of the XDP_PASS and XDP_DROP actions.

Signed-off-by: Michal Kubiak <[email protected]>
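
A minimal verdict-handling sketch for this stage of the series, where
only XDP_PASS and XDP_DROP are handled (the function name is an
assumption):

```c
#include <linux/filter.h>
#include <net/xdp.h>

/* Returns true when the stack should build an skb (XDP_PASS);
 * false when the Rx buffer was dropped and can be recycled. */
static bool my_rx_run_xdp(struct bpf_prog *prog, struct xdp_buff *xdp)
{
	u32 act = bpf_prog_run_xdp(prog, xdp);

	switch (act) {
	case XDP_PASS:
		return true;
	case XDP_DROP:
	default:
		/* Later patches add XDP_TX and XDP_REDIRECT here;
		 * for now everything else is treated as a drop. */
		return false;
	}
}
```
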
Implement two separate completion queue cleaning functions
which should be used depending on the scheduling mode:
 - queue-based scheduling (idpf_tx_clean_qb_complq)
 - flow-based scheduling (idpf_tx_clean_fb_complq).

Add 4-byte descriptor for queue-based scheduling mode and
perform some refactoring to extract the common code for
both scheduling modes.

Signed-off-by: Michal Kubiak <[email protected]>
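
A sketch of how the two routines might be dispatched (types and names
are assumptions): flow-based completions carry a per-packet tag, while
queue-based completions only advance a queue head, which is why a
compact 4-byte descriptor suffices there.

```c
struct my_complq {
	bool flow_sch_en;	/* flow-based vs. queue-based scheduling */
};

bool my_tx_clean_fb_complq(struct my_complq *cq, int budget);
bool my_tx_clean_qb_complq(struct my_complq *cq, int budget);

static bool my_tx_clean_complq(struct my_complq *cq, int budget)
{
	return cq->flow_sch_en ? my_tx_clean_fb_complq(cq, budget)
			       : my_tx_clean_qb_complq(cq, budget);
}
```
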
Implement sending packets from an XDP ring.
XDP path functions are separate from the general Tx routines,
because this allows us to simplify and therefore speed up the process.
It also makes the code more friendly to future XDP-specific
optimizations.

Signed-off-by: Michal Kubiak <[email protected]>
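
A simplified sketch of an XDP transmit routine (the descriptor layout,
bit positions and all names are assumptions): unlike the regular xmit
path there is no skb and no offload parsing, just a DMA mapping and a
single descriptor per frame.

```c
#include <linux/dma-mapping.h>
#include <net/xdp.h>

#define MY_TX_DESC_CMD_EOP	BIT(24)	/* assumed end-of-packet bit */

struct my_tx_desc {
	__le64 buf_addr;
	__le32 cmd_len;
	__le32 rsvd;
};

struct my_xdpq {
	struct device *dev;
	struct my_tx_desc *desc_ring;
	struct xdp_frame **xdpf_ring;	/* frames to free on completion */
	u16 desc_count;
	u16 next_to_use;
};

static int my_xdpq_xmit_frame(struct my_xdpq *xq, struct xdp_frame *xdpf)
{
	struct my_tx_desc *desc;
	dma_addr_t dma;

	/* Ring-space check omitted for brevity. */
	dma = dma_map_single(xq->dev, xdpf->data, xdpf->len, DMA_TO_DEVICE);
	if (dma_mapping_error(xq->dev, dma))
		return -ENOMEM;

	desc = &xq->desc_ring[xq->next_to_use];
	desc->buf_addr = cpu_to_le64(dma);
	desc->cmd_len = cpu_to_le32(xdpf->len | MY_TX_DESC_CMD_EOP);

	xq->xdpf_ring[xq->next_to_use] = xdpf;
	if (++xq->next_to_use == xq->desc_count)
		xq->next_to_use = 0;
	return 0;
}
```
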
Implement XDP_REDIRECT action and ndo_xdp_xmit() callback.

For now, packets redirected from a CPU with an index greater than the
number of XDP queues are simply dropped with an error.
This is a rather common situation, and it will be addressed in later
patches.

The patch also refactors the Rx XDP handling to use a switch statement
due to the increased number of actions.

Signed-off-by: Michal Kubiak <[email protected]>
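
A sketch of the ndo_xdp_xmit() callback with the interim limitation
described above. It reuses the hypothetical my_xdpq helpers from the
previous sketch and assumes a vport structure carrying num_xdp_txq and
an xdpqs array (all names are assumptions): a CPU whose id exceeds the
number of XDP Tx queues has no ring to use yet, so its frames are
rejected.

```c
#include <linux/netdevice.h>
#include <linux/smp.h>
#include <net/xdp.h>

struct my_xdpq;
int my_xdpq_xmit_frame(struct my_xdpq *xq, struct xdp_frame *xdpf);
void my_xdpq_bump_tail(struct my_xdpq *xq);

struct my_xdp_vport {
	u32 num_xdp_txq;
	struct my_xdpq **xdpqs;
};

static int my_ndo_xdp_xmit(struct net_device *dev, int n,
			   struct xdp_frame **frames, u32 flags)
{
	struct my_xdp_vport *vport = netdev_priv(dev);
	u32 qid = smp_processor_id();
	struct my_xdpq *xq;
	int i, nxmit = 0;

	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
		return -EINVAL;
	if (qid >= vport->num_xdp_txq)
		return -ENXIO;	/* addressed later by the fallback patch */

	xq = vport->xdpqs[qid];
	for (i = 0; i < n; i++) {
		if (my_xdpq_xmit_frame(xq, frames[i]))
			break;
		nxmit++;
	}
	if (flags & XDP_XMIT_FLUSH)
		my_xdpq_bump_tail(xq);	/* write HW tail once per burst */

	return nxmit;	/* caller frees the frames we did not consume */
}
```
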
Port of commit 22bf877 ("ice: introduce XDP_TX fallback path").
The patch handles the case when the number of queues is not sufficient
for the current number of CPUs. To avoid dropping packets redirected
from other interfaces, XDP TxQs are allowed to be shared between CPUs,
which imposes a locking requirement.
The static key approach has little to no performance penalty when
sharing is not needed.

Suggested-by: Larysa Zaremba <[email protected]>
Signed-off-by: Michal Kubiak <[email protected]>
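
A sketch of the static-key locking scheme, mirroring the ice approach
named above (symbol names are assumptions): the spinlock is only taken
when queues are actually shared between CPUs, and the static branch
keeps the unshared fast path free of locking cost.

```c
#include <linux/jump_label.h>
#include <linux/spinlock.h>

static DEFINE_STATIC_KEY_FALSE(my_xdp_locking_key);
/* At config time, if num_online_cpus() > num_xdp_txq, queues must be
 * shared, so do static_branch_inc(&my_xdp_locking_key). */

struct my_locked_xdpq {
	spinlock_t tx_lock;
};

static void my_xdpq_lock(struct my_locked_xdpq *xq)
{
	if (static_branch_unlikely(&my_xdp_locking_key))
		spin_lock(&xq->tx_lock);
}

static void my_xdpq_unlock(struct my_locked_xdpq *xq)
{
	if (static_branch_unlikely(&my_xdp_locking_key))
		spin_unlock(&xq->tx_lock);
}
```

When sharing is active, the ring might be picked as
smp_processor_id() % num_xdp_txq, so several CPUs can map onto one
ring; that is what makes the lock necessary.
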
The relative queue id is one of the required fields of the Tx queue
description in VC 2.0 for splitq mode.
In the current VC implementation all Tx queues are configured
together, so the relative queue id (the index of the Tx queue
in the queue group) can be computed on the fly.

However, such a solution is not flexible because it makes it hard to
configure a single Tx queue. So, instead, introduce a new structure
member in 'idpf_queue' dedicated to storing the relative queue id,
then send that value over the VC.

This patch is the first step in making the existing VC API more flexible
to allow configuration of single queues.

Signed-off-by: Michal Kubiak <[email protected]>
Implement VC functions dedicated to enabling, disabling and configuring
arbitrarily selected queues.

Also, refactor the existing implementation to make the code more
modular. Introduce new generic functions for sending VC messages
consisting of chunks, in order to isolate the sending algorithm
and its implementation for specific VC messages.

Finally, rewrite the function for mapping queues to q_vectors using the
new modular approach to avoid copying the code that implements the VC
message sending algorithm.

Signed-off-by: Michal Kubiak <[email protected]>
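
A generic sketch of the chunked-message pattern described above (all
names and the buffer size are assumptions): the transport caps the
mailbox buffer size, so a message carrying N per-queue chunks is split
into as many messages as needed, while a callback isolates the
message-specific packing.

```c
#include <linux/minmax.h>
#include <linux/types.h>

#define MY_MBX_BUF_SZ	4096	/* assumed mailbox buffer size */

struct my_adapter;

/* Send 'num_chunks' fixed-size chunks, as many per message as fit in
 * the mailbox buffer after the message header. */
static int my_vc_send_chunked(struct my_adapter *ad, const u8 *chunks,
			      u32 num_chunks, size_t chunk_sz, size_t hdr_sz,
			      int (*send_one)(struct my_adapter *ad,
					      const u8 *chunks, u32 n))
{
	u32 max_per_msg = (MY_MBX_BUF_SZ - hdr_sz) / chunk_sz;
	u32 done = 0;

	while (done < num_chunks) {
		u32 n = min(num_chunks - done, max_per_msg);
		int err = send_one(ad, chunks + done * chunk_sz, n);

		if (err)
			return err;
		done += n;
	}
	return 0;
}
```
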
Move the Rx and Tx queue lookup functions from the ethtool implementation
to the idpf header.
Now those functions can be used globally, including in the XDP
configuration code.

Signed-off-by: Michal Kubiak <[email protected]>
michalQb pushed a commit that referenced this pull request Dec 20, 2023
Hou Tao says:

====================
bpf: Fix the release of inner map

From: Hou Tao <[email protected]>

Hi,

The patchset aims to fix the release of inner maps in a map array or
map htab. The release of an inner map is different from that of a
normal map. A normal map is released after the bpf program which uses
it is destroyed, because the bpf program tracks the maps it uses.
However, a bpf program cannot track the used inner maps, because these
inner maps may be updated or deleted dynamically, and for now the
ref-counter of an inner map is decreased right after the inner map is
removed from the outer map. So the inner map may be freed before the
bpf program which is accessing it exits, and there will be a
use-after-free problem, as demonstrated by patch #6.

The patchset fixes the problem by deferring the release of inner map.
The freeing of inner map is deferred according to the sleepable
attributes of the bpf programs which own the outer map. Patch #1 fixes
the warning when running the newly-added selftest under interpreter
mode. Patch #2 adds more parameters to .map_fd_put_ptr() to prepare for
the fix. Patch #3 fixes the incorrect value of need_defer when freeing
the fd array. Patch #4 fixes the potential use-after-free problem by
using call_rcu_tasks_trace() and call_rcu() to wait for one tasks trace
RCU GP and one RCU GP unconditionally. Patch #5 optimizes the free of
inner map by removing the unnecessary RCU GP waiting. Patch #6 adds a
selftest to demonstrate the potential use-after-free problem. Patch #7
updates a selftest to update an outer map in a syscall bpf program.

Please see individual patches for more details. And comments are always
welcome.
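
A generic sketch of the deferred-release scheme this cover letter
describes (simplified; all names are stand-ins for the actual bpf
internals): if sleepable programs may hold the inner map, wait one
tasks-trace RCU grace period and then one regular RCU grace period
before freeing; otherwise one regular RCU grace period suffices.

```c
#include <linux/rcupdate.h>
#include <linux/rcupdate_trace.h>
#include <linux/slab.h>

struct my_map {
	struct rcu_head rcu;
	bool used_in_sleepable;	/* cf. sleepable_refcnt in the patchset */
};

static void my_map_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct my_map, rcu));
}

static void my_map_free_mult_rcu(struct rcu_head *rcu)
{
	/* One tasks-trace RCU GP has elapsed; chain a regular RCU GP. */
	call_rcu(rcu, my_map_free_rcu);
}

static void my_inner_map_release(struct my_map *map)
{
	if (map->used_in_sleepable)
		call_rcu_tasks_trace(&map->rcu, my_map_free_mult_rcu);
	else
		call_rcu(&map->rcu, my_map_free_rcu);
}
```
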

Change Log:
v5:
 * patch #3: rename fd_array_map_delete_elem_with_deferred_free() to
             __fd_array_map_delete_elem() (Alexei)
 * patch #5: use atomic64_t instead of atomic_t to prevent potential
             overflow (Alexei)
 * patch #7: use ptr_to_u64() helper instead of force casting to initialize
             pointers in bpf_attr (Alexei)

v4: https://lore.kernel.org/bpf/[email protected]
  * patch #2: don't use "deferred", use "need_defer" uniformly
  * patch #3: newly-added, fix the incorrect value of need_defer during
              fd array free.
  * patch #4: doesn't consider the case in which a bpf map is not used
              by any bpf program and only use sleepable_refcnt to remove
              unnecessary tasks trace RCU GP (Alexei)
  * patch #4: remove memory barriers added due to cautiousness (Alexei)

v3: https://lore.kernel.org/bpf/[email protected]
  * multiple variable renamings (Martin)
  * define BPF_MAP_RCU_GP/BPF_MAP_RCU_TT_GP as bit (Martin)
  * use call_rcu() and its variants instead of synchronize_rcu() (Martin)
  * remove unnecessary mask in bpf_map_free_deferred() (Martin)
  * place atomic_or() and the related smp_mb() together (Martin)
  * add patch #6 to demonstrate that updating outer map in syscall
    program is dead-lock free (Alexei)
  * update comments about the memory barrier in bpf_map_fd_put_ptr()
  * update commit message for patch #3 and #4 to describe more details

v2: https://lore.kernel.org/bpf/[email protected]
  * defer the invocation of ops->map_free() instead of bpf_map_put() (Martin)
  * update selftest to make it being reproducible under JIT mode (Martin)
  * remove unnecessary preparatory patches

v1: https://lore.kernel.org/bpf/[email protected]
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
michalQb pushed a commit that referenced this pull request Dec 20, 2023
Hou Tao says:

====================
The patch set aims to fix the problems found when inspecting the code
related to maybe_wait_bpf_programs().

Patch #1 removes an unnecessary invocation of maybe_wait_bpf_programs().
Patch #2 calls maybe_wait_bpf_programs() only once for a batched update.
Patch #3 adds the missing wait when doing a batched lookup_and_delete on
an htab of maps. Patch #4 waits only if the update or deletion
operation succeeds. Patch #5 fixes the value of batch.count when memory
allocation fails.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
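
A sketch of the resulting batched-update shape. maybe_wait_bpf_programs()
is the kernel-internal helper that waits for an RCU grace period so
running programs observe the change; the other names here are
hypothetical: the wait happens once per batch, and only after a
successful operation.

```c
#include <linux/bpf.h>

int my_map_update_elem(struct bpf_map *map, void *key, void *value);
void maybe_wait_bpf_programs(struct bpf_map *map);	/* kernel-internal */

static int my_map_update_batch(struct bpf_map *map, void **keys,
			       void **values, u32 count)
{
	int err = 0;
	u32 i;

	for (i = 0; i < count && !err; i++)
		err = my_map_update_elem(map, keys[i], values[i]);

	/* Once per batch, and only after a successful update, so running
	 * bpf programs observe the new map contents before we return. */
	if (!err)
		maybe_wait_bpf_programs(map);
	return err;
}
```
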
michalQb pushed a commit that referenced this pull request Dec 20, 2023
Andrii Nakryiko says:

====================
BPF token support in libbpf's BPF object

Add fuller support for BPF token in high-level BPF object APIs. This is the
most frequently used way to work with BPF using libbpf, so supporting BPF
token there is critical.

Patch #1 improves kernel-side BPF_TOKEN_CREATE behavior by refusing to
create an "empty" BPF token with no delegation. This seems like saner
behavior, which also makes libbpf's caching better overall. If we ever
want to create a BPF token with no delegate_xxx options set on BPF FS,
we can use a new flag to enable that.

Patches #2-#5 refactor libbpf internals, mostly feature detection code, to
prepare it for using a BPF token FD.

Patch #6 adds options to pass a BPF token into BPF object open options. It
also adds implicit BPF token creation logic to the BPF object load step,
even without any explicit involvement of the user. If the environment is
set up properly, a BPF token will be created transparently and used
implicitly. This allows all existing applications to gain BPF token
support by just linking with the latest version of the libbpf library.
No source code modifications are required. All that is under the
assumption that the privileged container management agent has properly
set up the default BPF FS instance at /sys/fs/bpf to allow BPF token
creation.

Patches #7-#8 add more selftests, validating that BPF object APIs work as
expected under unprivileged user-namespaced conditions in the presence of
a BPF token.

Patch #9 extends libbpf with LIBBPF_BPF_TOKEN_PATH envvar knowledge, which
can be used to override the custom BPF FS location used for the implicit
BPF token creation logic without needing to adjust application code. This
allows admins or container managers to mount a BPF token-enabled BPF FS at
a non-standard location without the need to coordinate with applications.
LIBBPF_BPF_TOKEN_PATH can also be used to disable implicit BPF token
creation by setting it to an empty value. Patch #10 tests this new envvar
functionality.
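
An application-side sketch of what the series enables. The
bpf_token_path option name and the default mount point follow the cover
letter above; treat the exact field name as an assumption tied to the
libbpf version:

```c
#include <errno.h>
#include <bpf/libbpf.h>

int load_with_token(void)
{
	/* Explicit opt-in: point libbpf at a token-enabled BPF FS.
	 * Without this, libbpf also tries implicit token creation from
	 * the default BPF FS mount at load time. */
	LIBBPF_OPTS(bpf_object_open_opts, opts,
		.bpf_token_path = "/sys/fs/bpf",
	);
	struct bpf_object *obj;

	obj = bpf_object__open_file("prog.bpf.o", &opts);
	if (!obj)
		return -errno;

	return bpf_object__load(obj);
}
```

At runtime, LIBBPF_BPF_TOKEN_PATH=/path/to/bpffs would override the
location without code changes, and setting it to an empty string
disables implicit token creation.
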

v2->v3:
  - move some stray feature cache refactorings into patch #4 (Alexei);
  - add LIBBPF_BPF_TOKEN_PATH envvar support (Alexei);
v1->v2:
  - remove minor code redundancies (Eduard, John);
  - add acks and rebase.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
michalQb pushed a commit that referenced this pull request Sep 6, 2024
Merge tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next:

Patch #1 fixes checksum calculation in nfnetlink_queue with SCTP,
	 segmenting GSO packets since skb_zerocopy() does not support
	 GSO_BY_FRAGS, from Antonio Ojea.

Patch #2 extends nfnetlink_queue coverage to handle SCTP packets,
	 from Antonio Ojea.

Patch #3 uses consume_skb() instead of kfree_skb() in nfnetlink,
         from Donald Hunter.

Patch #4 adds a dedicated commit list for sets to speed up
	 intra-transaction lookups, from Florian Westphal.

Patch #5 skips removal of elements from the abort path for the pipapo
         backend; ditching the shadow copy of this data structure
	 is sufficient.

Patch #6 moves nf_ct_netns_get() out of nf_conncount_init() to
	 let users of conncount decide when to enable conntrack,
	 this is needed by openvswitch, from Xin Long.

Patch #7 passes context to all nft_parse_register_load() calls in
	 preparation for the next patch.

Patches #8 and #9 reject loads from uninitialized registers from
	 control plane to remove register initialization from
	 datapath. From Florian Westphal.

* tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: don't initialize registers in nft_do_chain()
  netfilter: nf_tables: allow loads only when register is initialized
  netfilter: nf_tables: pass context structure to nft_parse_register_load
  netfilter: move nf_ct_netns_get out of nf_conncount_init
  netfilter: nf_tables: do not remove elements if set backend implements .abort
  netfilter: nf_tables: store new sets in dedicated list
  netfilter: nfnetlink: convert kfree_skb to consume_skb
  selftests: netfilter: nft_queue.sh: sctp coverage
  netfilter: nfnetlink_queue: unbreak SCTP traffic
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
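
For reference, the consume_skb()/kfree_skb() distinction behind patch
#3 above: consume_skb() marks a normally consumed packet (no drop
tracepoint fires), while kfree_skb() signals a genuine drop. A minimal
sketch (the wrapper name is hypothetical):

```c
#include <linux/skbuff.h>

static void my_release_skb(struct sk_buff *skb, bool delivered)
{
	if (delivered)
		consume_skb(skb);	/* normal end of life, no drop event */
	else
		kfree_skb(skb);		/* real drop, visible to drop monitors */
}
```
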