  1. Sep 14, 2021
    • Revert "Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"" · d198b277
      Eric Dumazet authored
      This reverts commit d7807a9a.
      
      As mentioned in https://lkml.org/lkml/2021/9/13/1819,
      the 5-year-old commit 91948309 ("ipv4: fix memory leaks in ip_cmsg_send() callers")
      was a correct fix.
      
        ip_cmsg_send() can loop over multiple cmsghdr()
      
        If IP_RETOPTS has been successful, but a following cmsghdr generates an error,
        we do not free ipc.opt
      
        If IP_RETOPTS is not successful, we have freed the allocated temporary space,
        not the one currently in ipc.opt.
      
      Sure, code could be refactored, but let's not bring back old bugs.
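
      The leak described above can be reproduced in a small userspace sketch. This is not the kernel code: every demo_* name is hypothetical, calloc()/free() stand in for the kernel allocator, and the parsing is reduced to two message kinds. It only illustrates why the caller has to free ipc.opt when a later control message fails.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct demo_opt { int len; };
struct demo_ipcm { struct demo_opt *opt; };
enum demo_cmsg { DEMO_RETOPTS, DEMO_BAD };

static int demo_live_allocs;	/* outstanding allocations, for leak checking */

/* Loops over several control messages. A later message may fail after an
 * earlier DEMO_RETOPTS has already stored an allocation in ipc->opt. */
static int demo_cmsg_send(const enum demo_cmsg *msgs, int n,
			  struct demo_ipcm *ipc)
{
	for (int i = 0; i < n; i++) {
		if (msgs[i] != DEMO_RETOPTS)
			return -1;	/* error: ipc->opt is left to the caller */

		struct demo_opt *opt = calloc(1, sizeof(*opt));
		if (!opt)
			return -1;
		demo_live_allocs++;
		if (ipc->opt) {		/* replace a previously parsed option */
			free(ipc->opt);
			demo_live_allocs--;
		}
		ipc->opt = opt;
	}
	return 0;
}

/* Caller that frees ipc.opt on both the error and the success path. */
static int demo_sendmsg(const enum demo_cmsg *msgs, int n)
{
	struct demo_ipcm ipc = { .opt = NULL };
	int err = demo_cmsg_send(msgs, n, &ipc);

	/* ... on success the packet would be transmitted here ... */
	if (ipc.opt) {
		free(ipc.opt);
		demo_live_allocs--;
	}
	return err;
}
```

      Calling demo_sendmsg() with a DEMO_RETOPTS message followed by a DEMO_BAD one returns an error and leaves demo_live_allocs at zero. Dropping the free on the error path would leave it at one.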
      
      Fixes: d7807a9a ("Revert "ipv4: fix memory leaks in ip_cmsg_send() callers"")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix tp->undo_retrans accounting in tcp_sacktag_one() · 4f884f39
      zhenggy authored
      Commit 10d3be56 ("tcp-tso: do not split TSO packets at retransmit
      time") may directly retransmit a multi-segment TSO/GSO packet without
      splitting it. Since that commit, we can no longer assume that a
      retransmitted packet is a single segment.
      
      This patch fixes the tp->undo_retrans accounting in tcp_sacktag_one()
      to use the actual segment count (pcount) of the retransmitted packet.
      
      Before that commit (10d3be56), the assumption underlying the
      tp->undo_retrans-- seems correct.
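
      The accounting change can be sketched as a pure function. The helper name is hypothetical, and clamping the result at zero is an assumption of this sketch rather than something stated in the message above.

```c
/* Hypothetical helper: new undo_retrans value after a retransmitted packet
 * covering `pcount` segments has been SACKed.
 * Assumption of this sketch: the counter never goes below zero. */
static int demo_undo_retrans_after_sack(int undo_retrans, int pcount)
{
	int left = undo_retrans - pcount;	/* was: undo_retrans - 1 */

	return left > 0 ? left : 0;
}
```

      With three retransmitted segments outstanding and a two-segment packet SACKed, the helper returns 1, where the old single decrement would have left 2.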
      
      Fixes: 10d3be56 ("tcp-tso: do not split TSO packets at retransmit time")
      Signed-off-by: zhenggy <zhenggy@chinatelecom.cn>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 2865ba82
      David S. Miller authored
      
      
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2021-09-14
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 7 non-merge commits during the last 13 day(s) which contain
      a total of 18 files changed, 334 insertions(+), 193 deletions(-).
      
      The main changes are:
      
      1) Fix mmap_lock lockdep splat in BPF stack map's build_id lookup, from Yonghong Song.
      
      2) Fix BPF cgroup v2 program bypass upon net_cls/prio activation, from Daniel Borkmann.
      
      3) Fix kvcalloc() BTF line info splat on oversized allocation attempts, from Bixuan Cui.
      
      4) Fix BPF selftest build of task_pt_regs test for arm64/s390, from Jean-Philippe Brucker.
      
      5) Fix BPF's disasm.{c,h} to dual-license so that it is aligned with bpftool given the former
         is a build dependency for the latter, from Daniel Borkmann with ACKs from contributors.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-caif: avoid user-triggerable WARN_ON(1) · 550ac9c1
      Eric Dumazet authored
      syzbot triggers this warning, which looks like something
      we can easily prevent.
      
      If we initialize priv->list_field in chnl_net_init(),
      then always use list_del_init(), we can remove robust_list_del()
      completely.
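
      Why an initialized node plus list_del_init() makes a defensive removal helper unnecessary can be shown with a minimal userspace re-implementation of a circular doubly linked list. The demo_* functions below mimic the behaviour of the kernel list helpers but are hypothetical stand-ins, not the kernel's own code.

```c
struct demo_list_head { struct demo_list_head *next, *prev; };

/* An initialized node points at itself, i.e. it is an empty list. */
static void demo_init_list_head(struct demo_list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void demo_list_add(struct demo_list_head *n, struct demo_list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink and re-initialize. On a self-linked node this is a no-op, so it is
 * safe on a node that was never added or was already removed. */
static void demo_list_del_init(struct demo_list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	demo_init_list_head(e);
}

static int demo_list_empty(const struct demo_list_head *h)
{
	return h->next == h;
}
```

      Removing a node that was only initialized, or removing the same node twice, leaves both the node and the list head in a consistent empty state, so no search through the list and no warning path is needed.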
      
      WARNING: CPU: 0 PID: 3233 at net/caif/chnl_net.c:67 robust_list_del net/caif/chnl_net.c:67 [inline]
      WARNING: CPU: 0 PID: 3233 at net/caif/chnl_net.c:67 chnl_net_uninit+0xc9/0x2e0 net/caif/chnl_net.c:375
      Modules linked in:
      CPU: 0 PID: 3233 Comm: syz-executor.3 Not tainted 5.14.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:robust_list_del net/caif/chnl_net.c:67 [inline]
      RIP: 0010:chnl_net_uninit+0xc9/0x2e0 net/caif/chnl_net.c:375
      Code: 89 eb e8 3a a3 ba f8 48 89 d8 48 c1 e8 03 42 80 3c 28 00 0f 85 bf 01 00 00 48 81 fb 00 14 4e 8d 48 8b 2b 75 d0 e8 17 a3 ba f8 <0f> 0b 5b 5d 41 5c 41 5d e9 0a a3 ba f8 4c 89 e3 e8 02 a3 ba f8 4c
      RSP: 0018:ffffc90009067248 EFLAGS: 00010202
      RAX: 0000000000008780 RBX: ffffffff8d4e1400 RCX: ffffc9000fd34000
      RDX: 0000000000040000 RSI: ffffffff88bb6e49 RDI: 0000000000000003
      RBP: ffff88802cd9ee08 R08: 0000000000000000 R09: ffffffff8d0e6647
      R10: ffffffff88bb6dc2 R11: 0000000000000000 R12: ffff88803791ae08
      R13: dffffc0000000000 R14: 00000000e600ffce R15: ffff888073ed3480
      FS:  00007fed10fa0700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b2c322000 CR3: 00000000164a6000 CR4: 00000000001506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       register_netdevice+0xadf/0x1500 net/core/dev.c:10347
       ipcaif_newlink+0x4c/0x260 net/caif/chnl_net.c:468
       __rtnl_newlink+0x106d/0x1750 net/core/rtnetlink.c:3458
       rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3506
       rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5572
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
       netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
       netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
       netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:724
       __sys_sendto+0x21c/0x320 net/socket.c:2036
       __do_sys_sendto net/socket.c:2048 [inline]
       __se_sys_sendto net/socket.c:2044 [inline]
       __x64_sys_sendto+0xdd/0x1b0 net/socket.c:2044
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: cc36a070 ("net-caif: add CAIF netdevice")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf, selftests: Add test case for mixed cgroup v1/v2 · 43d2b88c
      Daniel Borkmann authored
      
      
      Minimal selftest which implements a small BPF policy program to the
      connect(2) hook which rejects TCP connection requests to port 60123
      with EPERM. This is being attached to a non-root cgroup v2 path. The
      test asserts that this works under cgroup v2-only and under a mixed
      cgroup v1/v2 environment where net_classid is set in the former case.
      
      Before fix:
      
        # ./test_progs -t cgroup_v1v2
        test_cgroup_v1v2:PASS:server_fd 0 nsec
        test_cgroup_v1v2:PASS:client_fd 0 nsec
        test_cgroup_v1v2:PASS:cgroup_fd 0 nsec
        test_cgroup_v1v2:PASS:server_fd 0 nsec
        run_test:PASS:skel_open 0 nsec
        run_test:PASS:prog_attach 0 nsec
        test_cgroup_v1v2:PASS:cgroup-v2-only 0 nsec
        run_test:PASS:skel_open 0 nsec
        run_test:PASS:prog_attach 0 nsec
        run_test:PASS:join_classid 0 nsec
        (network_helpers.c:219: errno: None) Unexpected success to connect to server
        test_cgroup_v1v2:FAIL:cgroup-v1v2 unexpected error: -1 (errno 0)
        #27 cgroup_v1v2:FAIL
        Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED
      
      After fix:
      
        # ./test_progs -t cgroup_v1v2
        #27 cgroup_v1v2:OK
        Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210913230759.2313-3-daniel@iogearbox.net
    • bpf, selftests: Add cgroup v1 net_cls classid helpers · d8079d80
      Daniel Borkmann authored
      
      
      Minimal set of helpers for net_cls classid cgroupv1 management in order
      to set an id, join from a process, initiate setup and teardown. cgroupv2
      helpers are left as-is, but reused where possible.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210913230759.2313-2-daniel@iogearbox.net
    • bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode · 8520e224
      Daniel Borkmann authored
      Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
      Back in the days, commit bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      embedded per-socket cgroup information into sock->sk_cgrp_data and in order
      to save 8 bytes in struct sock made both mutually exclusive, that is, when
      cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
      falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
      
      The assumption made was "there is no reason to mix the two and this is in line
      with how legacy and v2 compatibility is handled" as stated in bd1060a1.
      However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
      this assumption no longer holds, and the possibility of the v1/v2 mixed mode
      with the v2 root fallback being hit becomes a real security issue.
      
      Many of the cgroup v2 BPF programs are also used for policy enforcement, just
      to pick _one_ example, that is, to programmatically deny socket related system
      calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
      a policy bypass for the affected Pods.
      
      In production environments, we have recently seen this case due to various
      circumstances: i) a different 3rd party agent and/or ii) a container runtime
      such as [0] in the user's environment configuring legacy cgroup v1 net_cls
      tags, which implicitly triggered the mentioned root fallback. Another case is
      Kubernetes projects like kind [1] which create Kubernetes nodes in a container
      and also add cgroup namespaces to the mix, meaning programs which are attached
      to the cgroup v2 root of the cgroup namespace get attached to a non-root
      cgroup v2 path from init namespace point of view. And the latter's root is
      out of reach for agents on a kind Kubernetes node to configure. Meaning, any
      entity on the node setting cgroup v1 net_cls tag will trigger the bypass
      despite cgroup v2 BPF programs attached to the namespace root.
      
      Generally, this mutual exclusiveness does not hold anymore in today's user
      environments and makes cgroup v2 usage from BPF side fragile and unreliable.
      This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
      sock_cgroup_data in order to address these issues; this implicitly also fixes
      the tradeoffs being made back then with regards to races and refcount leaks
      as stated in bd1060a1, and removes the fallback, so that cgroup v2 BPF
      programs always operate as expected.
      
        [0] https://github.com/nestybox/sysbox/
        [1] https://kind.sigs.k8s.io/
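
      The difference between the mutually exclusive layout and a dedicated cgroup v2 pointer can be modelled in plain C. The structures below are heavily simplified, hypothetical stand-ins for struct sock_cgroup_data (the real fields and flags differ); they only show how writing a v1 tag used to make the v2 lookup fall back to the root.

```c
struct demo_cgroup { const char *path; };

static struct demo_cgroup demo_root = { "/" };

/* Old layout (simplified): the v2 pointer and the v1 tag share storage. */
struct demo_old_skcd {
	int is_v1_data;			/* set once a v1 tag has been written */
	union {
		struct demo_cgroup *cgrp;
		unsigned int classid;
	} u;
};

static struct demo_cgroup *demo_old_ptr(struct demo_old_skcd *d)
{
	/* The pointer was overwritten by the tag, so fall back to the root. */
	return d->is_v1_data ? &demo_root : d->u.cgrp;
}

/* New layout (simplified): the v2 pointer has its own field. */
struct demo_new_skcd {
	struct demo_cgroup *cgroup;
	unsigned int classid;
};

static struct demo_cgroup *demo_new_ptr(struct demo_new_skcd *d)
{
	return d->cgroup;		/* unaffected by v1 tagging */
}
```

      In the old model, a socket in a Pod cgroup resolves to the root as soon as a classid is set, which is the policy bypass described above. In the new model the lookup keeps returning the Pod cgroup.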
      
      Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net
    • bpf: Add oversize check before call kvcalloc() · 0e6491b5
      Bixuan Cui authored
      Commit 7661809d ("mm: don't allow oversized kvmalloc() calls") added the
      oversize check. When the allocation is larger than what kmalloc() supports,
      the following warning is triggered:
      
      WARNING: CPU: 0 PID: 8408 at mm/util.c:597 kvmalloc_node+0x108/0x110 mm/util.c:597
      Modules linked in:
      CPU: 0 PID: 8408 Comm: syz-executor221 Not tainted 5.14.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:kvmalloc_node+0x108/0x110 mm/util.c:597
      Call Trace:
       kvmalloc include/linux/mm.h:806 [inline]
       kvmalloc_array include/linux/mm.h:824 [inline]
       kvcalloc include/linux/mm.h:829 [inline]
       check_btf_line kernel/bpf/verifier.c:9925 [inline]
       check_btf_info kernel/bpf/verifier.c:10049 [inline]
       bpf_check+0xd634/0x150d0 kernel/bpf/verifier.c:13759
       bpf_prog_load kernel/bpf/syscall.c:2301 [inline]
       __sys_bpf+0x11181/0x126e0 kernel/bpf/syscall.c:4587
       __do_sys_bpf kernel/bpf/syscall.c:4691 [inline]
       __se_sys_bpf kernel/bpf/syscall.c:4689 [inline]
       __x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:4689
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
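
      A pre-allocation bound check of the kind the commit title describes might look like the userspace sketch below. The names are hypothetical, calloc() stands in for kvcalloc(), and the INT_MAX byte limit is an assumption about the threshold the allocator enforces.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical record type standing in for the BTF line info entries. */
struct demo_line_info {
	unsigned int insn_off, file_name_off, line_off, line_col;
};

static int demo_alloc_line_info(unsigned int nr, struct demo_line_info **out)
{
	*out = NULL;
	if (!nr)
		return 0;		/* nothing to allocate */

	/* Reject a user-controlled count before the allocator can warn.
	 * Assumption: total sizes above INT_MAX bytes are refused. */
	if (nr > INT_MAX / sizeof(struct demo_line_info))
		return -EINVAL;

	*out = calloc(nr, sizeof(**out));
	return *out ? 0 : -ENOMEM;
}
```

      An oversized count is turned into an ordinary -EINVAL for the caller instead of reaching the allocator, which is the point of checking first.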
      
      Reported-by: syzbot+f3e749d4c662818ae439@syzkaller.appspotmail.com
      Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210911005557.45518-1-cuibixuan@huawei.com
  2. Sep 13, 2021
  3. Sep 12, 2021
  4. Sep 11, 2021
    • net: stmmac: allow CSR clock of 300MHz · 08dad2f4
      Jesper Nilsson authored
      
      
      The Synopsys Ethernet IP uses the CSR clock as a base clock for MDC.
      The divisor used is set in the MAC_MDIO_Address register field CR
      (Clock Rate).
      
      The divisor is there to change the CSR clock into a clock that falls
      below the IEEE 802.3 specified max frequency of 2.5MHz.
      
      If the CSR clock is 300MHz, the code falls back to using the reset
      value in the MAC_MDIO_Address register, as described in the comment
      above this code.
      
      However, 300MHz is actually an allowed value and the proper divider
      can be estimated quite easily (it's just 1Hz difference!)
      
      A CSR frequency of 300MHz with the maximum clock rate value of 0x5
      (STMMAC_CSR_250_300M, a divisor of 124) gives somewhere around
      ~2.42MHz which is below the IEEE 802.3 specified maximum.
      
      For the ARTPEC-8 SoC, the CSR clock is this problematic 300MHz,
      and unfortunately, the reset-value of the MAC_MDIO_Address CR field
      is 0x0.
      
      This leads to a clock rate of zero and a divisor of 42, and gives an
      MDC frequency of ~7.14MHz.
      
      Allow CSR clock of 300MHz by making the comparison inclusive.
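
      The arithmetic above (300MHz / 124 is about 2.42MHz, 300MHz / 42 is about 7.14MHz) and the effect of the inclusive comparison can be checked with a small sketch. The demo_* names are hypothetical, and only the 250-300MHz range and the ARTPEC-8 reset-value fallback from the text are modelled; the driver's other clock ranges are left out.

```c
#define DEMO_CSR_F_250M 250000000UL
#define DEMO_CSR_F_300M 300000000UL

/* Divisor chosen for a CSR clock of at least 250MHz.
 * inclusive != 0 models the fixed comparison (<= 300MHz),
 * inclusive == 0 models the old one (< 300MHz).
 * Outside the range the reset value CR = 0x0 (divisor 42) applies,
 * as described for ARTPEC-8. */
static unsigned int demo_mdc_divisor(unsigned long csr_hz, int inclusive)
{
	int in_range = csr_hz >= DEMO_CSR_F_250M &&
		       (inclusive ? csr_hz <= DEMO_CSR_F_300M
				  : csr_hz < DEMO_CSR_F_300M);

	return in_range ? 124 : 42;
}

/* Resulting MDC frequency in Hz (integer division). */
static unsigned long demo_mdc_hz(unsigned long csr_hz, unsigned int divisor)
{
	return csr_hz / divisor;
}
```

      For a 300MHz CSR clock the inclusive comparison selects divisor 124 and yields 2419354Hz, under the 2.5MHz limit, while the exclusive comparison falls back to divisor 42 and yields 7142857Hz.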
      
      Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf, mm: Fix lockdep warning triggered by stack_map_get_build_id_offset() · 2f1aaf3e
      Yonghong Song authored
      Currently the bpf selftest "get_stack_raw_tp" triggered the warning:
      
        [ 1411.304463] WARNING: CPU: 3 PID: 140 at include/linux/mmap_lock.h:164 find_vma+0x47/0xa0
        [ 1411.304469] Modules linked in: bpf_testmod(O) [last unloaded: bpf_testmod]
        [ 1411.304476] CPU: 3 PID: 140 Comm: systemd-journal Tainted: G        W  O      5.14.0+ #53
        [ 1411.304479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [ 1411.304481] RIP: 0010:find_vma+0x47/0xa0
        [ 1411.304484] Code: de 48 89 ef e8 ba f5 fe ff 48 85 c0 74 2e 48 83 c4 08 5b 5d c3 48 8d bf 28 01 00 00 be ff ff ff ff e8 2d 9f d8 00 85 c0 75 d4 <0f> 0b 48 89 de 48 8
        [ 1411.304487] RSP: 0018:ffffabd440403db8 EFLAGS: 00010246
        [ 1411.304490] RAX: 0000000000000000 RBX: 00007f00ad80a0e0 RCX: 0000000000000000
        [ 1411.304492] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e
        [ 1411.304494] RBP: ffff9cf5c2f50000 R08: ffff9cf5c3eb25d8 R09: 00000000fffffffe
        [ 1411.304496] R10: 0000000000000001 R11: 00000000ef974e19 R12: ffff9cf5c39ae0e0
        [ 1411.304498] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9cf5c39ae0e0
        [ 1411.304501] FS:  00007f00ae754780(0000) GS:ffff9cf5fba00000(0000) knlGS:0000000000000000
        [ 1411.304504] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 1411.304506] CR2: 000000003e34343c CR3: 0000000103a98005 CR4: 0000000000370ee0
        [ 1411.304508] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [ 1411.304510] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [ 1411.304512] Call Trace:
        [ 1411.304517]  stack_map_get_build_id_offset+0x17c/0x260
        [ 1411.304528]  __bpf_get_stack+0x18f/0x230
        [ 1411.304541]  bpf_get_stack_raw_tp+0x5a/0x70
        [ 1411.305752] RAX: 0000000000000000 RBX: 5541f689495641d7 RCX: 0000000000000000
        [ 1411.305756] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e
        [ 1411.305758] RBP: ffff9cf5c02b2f40 R08: ffff9cf5ca7606c0 R09: ffffcbd43ee02c04
        [ 1411.306978]  bpf_prog_32007c34f7726d29_bpf_prog1+0xaf/0xd9c
        [ 1411.307861] R10: 0000000000000001 R11: 0000000000000044 R12: ffff9cf5c2ef60e0
        [ 1411.307865] R13: 0000000000000005 R14: 0000000000000000 R15: ffff9cf5c2ef6108
        [ 1411.309074]  bpf_trace_run2+0x8f/0x1a0
        [ 1411.309891] FS:  00007ff485141700(0000) GS:ffff9cf5fae00000(0000) knlGS:0000000000000000
        [ 1411.309896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 1411.311221]  syscall_trace_enter.isra.20+0x161/0x1f0
        [ 1411.311600] CR2: 00007ff48514d90e CR3: 0000000107114001 CR4: 0000000000370ef0
        [ 1411.312291]  do_syscall_64+0x15/0x80
        [ 1411.312941] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [ 1411.313803]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [ 1411.314223] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [ 1411.315082] RIP: 0033:0x7f00ad80a0e0
        [ 1411.315626] Call Trace:
        [ 1411.315632]  stack_map_get_build_id_offset+0x17c/0x260
      
      To reproduce, first build `test_progs` binary:
      
        make -C tools/testing/selftests/bpf -j60
      
      and then run the binary at tools/testing/selftests/bpf directory:
      
        ./test_progs -t get_stack_raw_tp
      
      The warning is due to commit 5b78ed24 ("mm/pagemap: add mmap_assert_locked()
      annotations to find_vma*()") which added mmap_assert_locked() in find_vma()
      function. The mmap_assert_locked() function asserts that mm->mmap_lock needs
      to be held. But this is not the case for bpf_get_stack() or bpf_get_stackid()
      helper (kernel/bpf/stackmap.c), which uses mmap_read_trylock_non_owner()
      instead. Since mm->mmap_lock is not held in bpf_get_stack[id]() use case,
      the above warning is emitted during test run.
      
      This patch fixes the issue by (1) using mmap_read_trylock() instead of
      mmap_read_trylock_non_owner() to satisfy lockdep checking in find_vma(), and
      (2) dropping lockdep for mmap_lock right before the irq_work_queue(). The
      function mmap_read_trylock_non_owner() is also removed since after this
      patch nobody calls it any more.
      
      Fixes: 5b78ed24 ("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Luigi Rizzo <lrizzo@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: linux-mm@kvack.org
      Link: https://lore.kernel.org/bpf/20210909155000.1610299-1-yhs@fb.com
  5. Sep 10, 2021
  6. Sep 09, 2021
    • net: ni65: Avoid typecast of pointer to u32 · e0119126
      Guenter Roeck authored
      
      
      Building alpha:allmodconfig results in the following error.
      
      drivers/net/ethernet/amd/ni65.c: In function 'ni65_stop_start':
      drivers/net/ethernet/amd/ni65.c:751:37: error:
      	cast from pointer to integer of different size
      		buffer[i] = (u32) isa_bus_to_virt(tmdp->u.buffer);
      
      'buffer[]' is declared as unsigned long, so replace the typecast to u32
      with a typecast to unsigned long to fix the problem.
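
      The cast issue can be illustrated outside the driver. The helpers below are hypothetical, and the sketch assumes unsigned long is as wide as a pointer, which holds for the data models Linux builds use but not for every C platform.

```c
/* Store a pointer in an unsigned long slot, as the driver's buffer[] does.
 * Casting to a 32-bit type instead would discard the upper half of the
 * address on a 64-bit target, which is what the compiler complains about. */
static unsigned long demo_store_ptr(void *p)
{
	return (unsigned long)p;
}

/* Recover the pointer from the stored integer value. */
static void *demo_load_ptr(unsigned long v)
{
	return (void *)v;
}
```

      Under that assumption a pointer survives the round trip through unsigned long unchanged on both 32-bit and 64-bit builds.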
      
      Cc: Arnd Bergmann <arnd@kernel.org>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sfx-xdp-fallback-tx-queues' · e3a843f9
      David S. Miller authored
      
      
      Íñigo Huguet says:
      
      ====================
      sfc: fallback for lack of xdp tx queues
      
      If there are not enough hardware resources to allocate one tx queue per
      CPU for XDP, XDP_TX and XDP_REDIRECT actions were unavailable, and using
      them resulted each time in the packet being dropped and this message in
      the logs: XDP TX failed (-22)
      
      These patches implement 2 fallback solutions for 2 different situations
      that might happen:
      1. There are not enough free resources for all the tx queues, but there
         are some free resources available
      2. There are not enough free resources at all for tx queues.
      
      Both solutions are based on sharing tx queues, using __netif_tx_lock for
      synchronization. In the second case, as there are no XDP TX queues to
      share, network stack queues are used instead, but since we're taking
      __netif_tx_lock, concurrent access to the queues is correctly protected.
      
      The solution for this second case might affect performance both of XDP
      traffic and normal traffic due to lock contention if both are used
      intensively. That's why I call it a "last resort" fallback: it's not a
      desirable situation, but at least we have XDP TX working.
      
      Some tests have shown good results and indicate that the non-fallback
      case is not being damaged by these changes. They are also promising for
      the fallback cases. This is the test:
      1. From another machine, send high amount of packets with pktgen, script
         samples/pktgen/pktgen_sample04_many_flows.sh
      2. In the tested machine, run samples/bpf/xdp_rxq_info with arguments
         "-a XDP_TX --swapmac" and see the results
      3. In the tested machine, run also pktgen_sample04 to create high TX
         normal traffic, and see how xdp_rxq_info results vary
      
      Note that this test doesn't check the worst situations for the fallback
      solutions because XDP_TX will only be executed from the same CPUs that
      are processed by sfc, and not from every CPU in the system, so the
      performance drop due to the highest locking contention doesn't happen.
      I'd like to test that, as well, but I don't have access right now to a
      proper environment.
      
      Test results:
      
      Without doing TX:
      Before changes: ~2,900,000 pps
      After changes, 1 queue/core: ~2,900,000 pps
      After changes, 2 queues/core: ~2,900,000 pps
      After changes, 8 queues/core: ~2,900,000 pps
      After changes, borrowing from network stack: ~2,900,000 pps
      
      With multiflow TX at the same time:
      Before changes: ~1,700,000 - 2,900,000 pps
      After changes, 1 queue/core: ~1,700,000 - 2,900,000 pps
      After changes, 2 queues/core: ~1,700,000 pps
      After changes, 8 queues/core: ~1,700,000 pps
      After changes, borrowing from network stack: 1,150,000 pps
      
      Sporadic "XDP TX failed (-5)" warnings are shown when running xdp program
      and pktgen simultaneously. This was expected because XDP doesn't have any
      buffering system if the NIC is under very high pressure. Thousands of
      these warnings are shown in the case of borrowing net stack queues. As I
      said before, this was also expected.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: last resort fallback for lack of xdp tx queues · 6215b608
      Íñigo Huguet authored
      
      
      The previous patch addressed the situation of having some free resources for
      xdp tx but not enough for one tx queue per CPU. This patch addresses the
      worst case of not having resources at all for xdp tx.
      
      Instead of using queues dedicated to xdp, normal queues used by network
      stack are shared for both cases, using __netif_tx_lock for
      synchronization. Also queue stop/restart must be considered in the xdp
      path to avoid freezing the queue.
      
      This is not the ideal situation we might want to be in, and a performance
      penalty is expected both for normal and xdp traffic, but at least XDP
      will work in all possible situations (with a warning in the logs),
      easing a bit the pain of not knowing in what situations we can use it
      and in what situations we cannot.
      
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: fallback for lack of xdp tx queues · 41544618
      Íñigo Huguet authored
      
      
      If there were not enough resources to allocate one TX queue per core for
      XDP TX, it was completely disabled.
      
      This patch implements a fallback solution for sharing the available
      queues using __netif_tx_lock for synchronization. In the normal case that
      there is one TX queue per CPU, no locking is done, as it was before.
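
      One way to picture the sharing scheme is a mapping from CPU to queue that only requires locking when a queue serves more than one CPU. The sketch below is a hypothetical model, not the sfc driver's actual mapping; the structure, the function names and the division-based assignment are all assumptions.

```c
struct demo_xdp_txq_plan {
	unsigned int n_queues;		/* XDP TX queues actually allocated */
	unsigned int cpus_per_queue;	/* 1 means dedicated queues, no lock */
};

/* Plan the queues for n_cpus CPUs when only n_avail queues can be had. */
static struct demo_xdp_txq_plan demo_plan(unsigned int n_cpus,
					  unsigned int n_avail)
{
	struct demo_xdp_txq_plan p = { 0, 0 };

	if (n_cpus == 0 || n_avail == 0)
		return p;		/* no XDP TX queues at all */

	p.n_queues = n_avail < n_cpus ? n_avail : n_cpus;
	p.cpus_per_queue = (n_cpus + p.n_queues - 1) / p.n_queues;
	return p;
}

/* Queue index used by a CPU; returns 0 for an empty plan. */
static unsigned int demo_queue_for_cpu(const struct demo_xdp_txq_plan *p,
				       unsigned int cpu)
{
	return p->cpus_per_queue ? cpu / p->cpus_per_queue : 0;
}

/* Locking is only needed when several CPUs share one queue. */
static int demo_needs_lock(const struct demo_xdp_txq_plan *p)
{
	return p->cpus_per_queue > 1;
}
```

      With 8 CPUs and 3 available queues, each queue serves up to 3 CPUs and the lock is needed. With one queue per CPU the plan degenerates to the lock-free case.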
      
      With this fallback solution, XDP TX will work in many more cases that
      were failing, especially in machines with many CPUs. It's hard for XDP
      users to know what features are supported across different NICs and
      configurations, so they will benefit from having wider support.
      
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: platform: fix build warning when with !CONFIG_PM_SLEEP · 2a48d96f
      Joakim Zhang authored
      Use __maybe_unused for noirq_suspend()/noirq_resume() hooks to avoid
      build warning with !CONFIG_PM_SLEEP:
      
      >> drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c:796:12: error: 'stmmac_pltfr_noirq_resume' defined but not used [-Werror=unused-function]
           796 | static int stmmac_pltfr_noirq_resume(struct device *dev)
               |            ^~~~~~~~~~~~~~~~~~~~~~~~~
      >> drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c:775:12: error: 'stmmac_pltfr_noirq_suspend' defined but not used [-Werror=unused-function]
           775 | static int stmmac_pltfr_noirq_suspend(struct device *dev)
               |            ^~~~~~~~~~~~~~~~~~~~~~~~~~
         cc1: all warnings being treated as errors
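
      The effect of the annotation can be tried in userspace with GCC or Clang. DEMO_MAYBE_UNUSED, DEMO_PM_SLEEP and the demo_* functions are hypothetical names for this sketch; the macro expands to the compiler's unused attribute.

```c
#define DEMO_MAYBE_UNUSED __attribute__((__unused__))

struct demo_device { int suspended; };

/* Without the attribute, building with -Werror=unused-function fails when
 * DEMO_PM_SLEEP is not defined, because nothing references these hooks. */
static int DEMO_MAYBE_UNUSED demo_noirq_suspend(struct demo_device *dev)
{
	dev->suspended = 1;
	return 0;
}

static int DEMO_MAYBE_UNUSED demo_noirq_resume(struct demo_device *dev)
{
	dev->suspended = 0;
	return 0;
}

#ifdef DEMO_PM_SLEEP
/* Only this configuration actually references the hooks. */
int demo_pm_cycle(struct demo_device *dev)
{
	int err = demo_noirq_suspend(dev);

	return err ? err : demo_noirq_resume(dev);
}
#endif
```

      The attribute keeps both configurations building from the same source without wrapping the hook definitions in preprocessor conditionals.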
      
      Fixes: 276aae37 ("net: stmmac: fix system hang caused by eee_ctrl_timer during suspend/resume")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/l2tp: Fix reference count leak in l2tp_udp_recv_core · 9b6ff7eb
      Xiyu Yang authored
      The reference count leak issue may take place in an error handling
      path. If both conditions of tunnel->version == L2TP_HDR_VER_3 and the
      return value of l2tp_v3_ensure_opt_in_linear is nonzero, the function
      would directly jump to label invalid, without decrementing the reference
      count of the l2tp_session object session increased earlier by
      l2tp_tunnel_get_session(). This may result in refcount leaks.
      
      Fix this issue by decreasing the reference count before jumping to the
      label invalid.
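
      The shape of the error path can be sketched with a plain integer reference count. All demo_* names are hypothetical and the function is a reduced model of the receive path, not the l2tp code itself.

```c
struct demo_session { int refcnt; int version; };

static struct demo_session *demo_session_get(struct demo_session *s)
{
	s->refcnt++;
	return s;
}

static void demo_session_put(struct demo_session *s)
{
	s->refcnt--;
}

/* Returns 0 on success and -1 for an invalid packet. */
static int demo_recv_core(struct demo_session *entry, int opt_check_fails)
{
	struct demo_session *session = demo_session_get(entry);

	if (session->version == 3 && opt_check_fails) {
		demo_session_put(session);	/* drop the ref before bailing */
		goto invalid;
	}

	/* ... the packet would be delivered here ... */
	demo_session_put(session);
	return 0;

invalid:
	return -1;
}
```

      After either outcome the count is back at its starting value. Without the put on the error branch every rejected version 3 packet would leave one extra reference behind.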
      
      Fixes: 4522a70d ("l2tp: fix reading optional fields of L2TPv3")
      Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
      Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>