Skip to content
  1. Jul 28, 2021
    • John Fastabend's avatar
      bpf, sockmap: Fix memleak on ingress msg enqueue · 9635720b
      John Fastabend authored
      If backlog handler is running during a tear down operation we may enqueue
      data on the ingress msg queue while tear down is trying to free it.
      
       sk_psock_backlog()
         sk_psock_handle_skb()
           skb_psock_skb_ingress()
             sk_psock_skb_ingress_enqueue()
               sk_psock_queue_msg(psock,msg)
                                                 spin_lock(ingress_lock)
                                                  sk_psock_zap_ingress()
                                                   _sk_psock_purge_ingerss_msg()
                                                    _sk_psock_purge_ingress_msg()
                                                  -- free ingress_msg list --
                                                 spin_unlock(ingress_lock)
                 spin_lock(ingress_lock)
                 list_add_tail(msg,ingress_msg) <- entry on list with no one
                                                   left to free it.
                 spin_unlock(ingress_lock)
      
      To fix we only enqueue from backlog if the ENABLED bit is set. The tear
      down logic clears the bit with ingress_lock set so we wont enqueue the
      msg in the last step.
      
      Fixes: 799aa7f9
      
       ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210727160500.1713554-4-john.fastabend@gmail.com
      9635720b
    • John Fastabend's avatar
      bpf, sockmap: On cleanup we additionally need to remove cached skb · 476d9801
      John Fastabend authored
      Its possible if a socket is closed and the receive thread is under memory
      pressure it may have cached a skb. We need to ensure these skbs are
      free'd along with the normal ingress_skb queue.
      
      Before 799aa7f9 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") tear
      down and backlog processing both had sock_lock for the common case of
      socket close or unhash. So it was not possible to have both running in
      parrallel so all we would need is the kfree in those kernels.
      
      But, latest kernels include the commit 799aa7f98d5e and this requires a
      bit more work. Without the ingress_lock guarding reading/writing the
      state->skb case its possible the tear down could run before the state
      update causing it to leak memory or worse when the backlog reads the state
      it could potentially run interleaved with the tear down and we might end up
      free'ing the state->skb from tear down side but already have the reference
      from backlog side. To resolve such races we wrap accesses in ingress_lock
      on both sides serializing tear down and backlog case. In both cases this
      only happens after an EAGAIN error case so having an extra lock in place
      is likely fine. The normal path will skip the locks.
      
      Note, we check state->skb before grabbing lock. This works because
      we can only enqueue with the mutex we hold already. Avoiding a race
      on adding state->skb after the check. And if tear down path is running
      that is also fine if the tear down path then removes state->skb we
      will simply set skb=NULL and the subsequent goto is skipped. This
      slight complication avoids locking in normal case.
      
      With this fix we no longer see this warning splat from tcp side on
      socket close when we hit the above case with redirect to ingress self.
      
      [224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
      [224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
      [224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
      [224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
      [224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
      [224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
      [224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
      [224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
      [224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
      [224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
      [224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
      [224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
      [224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
      [224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
      [224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [224913.936007] Call Trace:
      [224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
      [224913.936033]  __tcp_close+0x620/0x790
      [224913.936047]  tcp_close+0x20/0x80
      [224913.936056]  inet_release+0x8f/0xf0
      [224913.936070]  __sock_release+0x72/0x120
      [224913.936083]  sock_close+0x14/0x20
      
      Fixes: a136678c
      
       ("bpf: sk_msg, zap ingress queue on psock down")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210727160500.1713554-3-john.fastabend@gmail.com
      476d9801
    • John Fastabend's avatar
      bpf, sockmap: Zap ingress queues after stopping strparser · 343597d5
      John Fastabend authored
      We don't want strparser to run and pass skbs into skmsg handlers when
      the psock is null. We just sk_drop them in this case. When removing
      a live socket from map it means extra drops that we do not need to
      incur. Move the zap below strparser close to avoid this condition.
      
      This way we stop the stream parser first stopping it from processing
      packets and then delete the psock.
      
      Fixes: a136678c
      
       ("bpf: sk_msg, zap ingress queue on psock down")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210727160500.1713554-2-john.fastabend@gmail.com
      343597d5
  2. Jul 20, 2021
    • Lorenz Bauer's avatar
      bpf: Fix OOB read when printing XDP link fdinfo · d6371c76
      Lorenz Bauer authored
      We got the following UBSAN report on one of our testing machines:
      
          ================================================================================
          UBSAN: array-index-out-of-bounds in kernel/bpf/syscall.c:2389:24
          index 6 is out of range for type 'char *[6]'
          CPU: 43 PID: 930921 Comm: systemd-coredum Tainted: G           O      5.10.48-cloudflare-kasan-2021.7.0 #1
          Hardware name: <snip>
          Call Trace:
           dump_stack+0x7d/0xa3
           ubsan_epilogue+0x5/0x40
           __ubsan_handle_out_of_bounds.cold+0x43/0x48
           ? seq_printf+0x17d/0x250
           bpf_link_show_fdinfo+0x329/0x380
           ? bpf_map_value_size+0xe0/0xe0
           ? put_files_struct+0x20/0x2d0
           ? __kasan_kmalloc.constprop.0+0xc2/0xd0
           seq_show+0x3f7/0x540
           seq_read_iter+0x3f8/0x1040
           seq_read+0x329/0x500
           ? seq_read_iter+0x1040/0x1040
           ? __fsnotify_parent+0x80/0x820
           ? __fsnotify_update_child_dentry_flags+0x380/0x380
           vfs_read+0x123/0x460
           ksys_read+0xed/0x1c0
           ? __x64_sys_pwrite64+0x1f0/0x1f0
           do_syscall_64+0x33/0x40
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
          <snip>
          ================================================================================
          ================================================================================
          UBSAN: object-size-mismatch in kernel/bpf/syscall.c:2384:2
      
      From the report, we can infer that some array access in bpf_link_show_fdinfo at index 6
      is out of bounds. The obvious candidate is bpf_link_type_strs[BPF_LINK_TYPE_XDP] with
      BPF_LINK_TYPE_XDP == 6. It turns out that BPF_LINK_TYPE_XDP is missing from bpf_types.h
      and therefore doesn't have an entry in bpf_link_type_strs:
      
          pos:	0
          flags:	02000000
          mnt_id:	13
          link_type:	(null)
          link_id:	4
          prog_tag:	bcf7977d3b93787c
          prog_id:	4
          ifindex:	1
      
      Fixes: aa8d3a71
      
       ("bpf, xdp: Add bpf_link-based XDP attachment API")
      Signed-off-by: default avatarLorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210719085134.43325-2-lmb@cloudflare.com
      d6371c76
  3. Jul 16, 2021
    • Daniel Borkmann's avatar
      bpf, selftests: Add test cases for pointer alu from multiple paths · a6c39de7
      Daniel Borkmann authored
      
      
      Add several test cases for checking update_alu_sanitation_state() under
      multiple paths:
      
        # ./test_verifier
        [...]
        #1061/u map access: known scalar += value_ptr unknown vs const OK
        #1061/p map access: known scalar += value_ptr unknown vs const OK
        #1062/u map access: known scalar += value_ptr const vs unknown OK
        #1062/p map access: known scalar += value_ptr const vs unknown OK
        #1063/u map access: known scalar += value_ptr const vs const (ne) OK
        #1063/p map access: known scalar += value_ptr const vs const (ne) OK
        #1064/u map access: known scalar += value_ptr const vs const (eq) OK
        #1064/p map access: known scalar += value_ptr const vs const (eq) OK
        #1065/u map access: known scalar += value_ptr unknown vs unknown (eq) OK
        #1065/p map access: known scalar += value_ptr unknown vs unknown (eq) OK
        #1066/u map access: known scalar += value_ptr unknown vs unknown (lt) OK
        #1066/p map access: known scalar += value_ptr unknown vs unknown (lt) OK
        #1067/u map access: known scalar += value_ptr unknown vs unknown (gt) OK
        #1067/p map access: known scalar += value_ptr unknown vs unknown (gt) OK
        [...]
        Summary: 1762 PASSED, 0 SKIPPED, 0 FAILED
      
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a6c39de7
    • Daniel Borkmann's avatar
      bpf: Fix pointer arithmetic mask tightening under state pruning · e042aa53
      Daniel Borkmann authored
      In 7fedb63a ("bpf: Tighten speculative pointer arithmetic mask") we
      narrowed the offset mask for unprivileged pointer arithmetic in order to
      mitigate a corner case where in the speculative domain it is possible to
      advance, for example, the map value pointer by up to value_size-1 out-of-
      bounds in order to leak kernel memory via side-channel to user space.
      
      The verifier's state pruning for scalars leaves one corner case open
      where in the first verification path R_x holds an unknown scalar with an
      aux->alu_limit of e.g. 7, and in a second verification path that same
      register R_x, here denoted as R_x', holds an unknown scalar which has
      tighter bounds and would thus satisfy range_within(R_x, R_x') as well as
      tnum_in(R_x, R_x') for state pruning, yielding an aux->alu_limit of 3:
      Given the second path fits the register constraints for pruning, the final
      generated mask from aux->alu_limit will remain at 7. While technically
      not wrong for the non-speculative domain, it would however be possible
      to craft similar cases where the mask would be too wide as in 7fedb63a
      
      .
      
      One way to fix it is to detect the presence of unknown scalar map pointer
      arithmetic and force a deeper search on unknown scalars to ensure that
      we do not run into a masking mismatch.
      
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      e042aa53
    • Daniel Borkmann's avatar
      bpf: Remove superfluous aux sanitation on subprog rejection · 59089a18
      Daniel Borkmann authored
      Follow-up to fe9a5ca7
      
       ("bpf: Do not mark insn as seen under speculative
      path verification"). The sanitize_insn_aux_data() helper does not serve a
      particular purpose in today's code. The original intention for the helper
      was that if function-by-function verification fails, a given program would
      be cleared from temporary insn_aux_data[], and then its verification would
      be re-attempted in the context of the main program a second time.
      
      However, a failure in do_check_subprogs() will skip do_check_main() and
      propagate the error to the user instead, thus such situation can never occur.
      Given its interaction is not compatible to the Spectre v1 mitigation (due to
      comparing aux->seen with env->pass_cnt), just remove sanitize_insn_aux_data()
      to avoid future bugs in this area.
      
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      59089a18
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 20192d9c
      David S. Miller authored
      
      
      Andrii Nakryiko says:
      
      ====================
      pull-request: bpf 2021-07-15
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 9 non-merge commits during the last 5 day(s) which contain
      a total of 9 files changed, 37 insertions(+), 15 deletions(-).
      
      The main changes are:
      
      1) Fix NULL pointer dereference in BPF_TEST_RUN for BPF_XDP_DEVMAP and
         BPF_XDP_CPUMAP programs, from Xuan Zhuo.
      
      2) Fix use-after-free of net_device in XDP bpf_link, from Xuan Zhuo.
      
      3) Follow-up fix to subprog poke descriptor use-after-free problem, from
         Daniel Borkmann and John Fastabend.
      
      4) Fix out-of-range array access in s390 BPF JIT backend, from Colin Ian King.
      
      5) Fix memory leak in BPF sockmap, from John Fastabend.
      
      6) Fix for sockmap to prevent proc stats reporting bug, from John Fastabend
         and Jakub Sitnicki.
      
      7) Fix NULL pointer dereference in bpftool, from Tobias Klauser.
      
      8) AF_XDP documentation fixes, from Baruch Siach.
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      20192d9c
    • Dongliang Mu's avatar
      usb: hso: fix error handling code of hso_create_net_device · a6ecfb39
      Dongliang Mu authored
      
      
      The current error handling code of hso_create_net_device is
      hso_free_net_device, no matter which errors lead to. For example,
      WARNING in hso_free_net_device [1].
      
      Fix this by refactoring the error handling code of
      hso_create_net_device by handling different errors by different code.
      
      [1] https://syzkaller.appspot.com/bug?id=66eff8d49af1b28370ad342787413e35bbe76efe
      
      Reported-by: default avatar <syzbot+44d53c7255bb1aea22d2@syzkaller.appspotmail.com>
      Fixes: 5fcfb6d0
      
       ("hso: fix bailout in error case of probe")
      Signed-off-by: default avatarDongliang Mu <mudongliangabcd@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a6ecfb39
    • Jia He's avatar
      qed: fix possible unpaired spin_{un}lock_bh in _qed_mcp_cmd_and_union() · 6206b798
      Jia He authored
      
      
      Liajian reported a bug_on hit on a ThunderX2 arm64 server with FastLinQ
      QL41000 ethernet controller:
       BUG: scheduling while atomic: kworker/0:4/531/0x00000200
        [qed_probe:488()]hw prepare failed
        kernel BUG at mm/vmalloc.c:2355!
        Internal error: Oops - BUG: 0 [#1] SMP
        CPU: 0 PID: 531 Comm: kworker/0:4 Tainted: G W 5.4.0-77-generic #86-Ubuntu
        pstate: 00400009 (nzcv daif +PAN -UAO)
       Call trace:
        vunmap+0x4c/0x50
        iounmap+0x48/0x58
        qed_free_pci+0x60/0x80 [qed]
        qed_probe+0x35c/0x688 [qed]
        __qede_probe+0x88/0x5c8 [qede]
        qede_probe+0x60/0xe0 [qede]
        local_pci_probe+0x48/0xa0
        work_for_cpu_fn+0x24/0x38
        process_one_work+0x1d0/0x468
        worker_thread+0x238/0x4e0
        kthread+0xf0/0x118
        ret_from_fork+0x10/0x18
      
      In this case, qed_hw_prepare() returns error due to hw/fw error, but in
      theory work queue should be in process context instead of interrupt.
      
      The root cause might be the unpaired spin_{un}lock_bh() in
      _qed_mcp_cmd_and_union(), which causes botton half is disabled incorrectly.
      
      Reported-by: default avatarLijian Zhang <Lijian.Zhang@arm.com>
      Signed-off-by: default avatarJia He <justin.he@arm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6206b798
    • Ziyang Xuan's avatar
      net: fix uninit-value in caif_seqpkt_sendmsg · 991e6343
      Ziyang Xuan authored
      
      
      When nr_segs equal to zero in iovec_from_user, the object
      msg->msg_iter.iov is uninit stack memory in caif_seqpkt_sendmsg
      which is defined in ___sys_sendmsg. So we cann't just judge
      msg->msg_iter.iov->base directlly. We can use nr_segs to judge
      msg in caif_seqpkt_sendmsg whether has data buffers.
      
      =====================================================
      BUG: KMSAN: uninit-value in caif_seqpkt_sendmsg+0x693/0xf60 net/caif/caif_socket.c:542
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x220 lib/dump_stack.c:118
       kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
       __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
       caif_seqpkt_sendmsg+0x693/0xf60 net/caif/caif_socket.c:542
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg net/socket.c:672 [inline]
       ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2343
       ___sys_sendmsg net/socket.c:2397 [inline]
       __sys_sendmmsg+0x808/0xc90 net/socket.c:2480
       __compat_sys_sendmmsg net/compat.c:656 [inline]
      
      Reported-by: default avatar <syzbot+09a5d591c1f98cf5efcb@syzkaller.appspotmail.com>
      Link: https://syzkaller.appspot.com/bug?id=1ace85e8fc9b0d5a45c08c2656c3e91762daa9b8
      Fixes: bece7b23
      
       ("caif: Rewritten socket implementation")
      Signed-off-by: default avatarZiyang Xuan <william.xuanziyang@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      991e6343
    • Tobias Klauser's avatar
      bpftool: Check malloc return value in mount_bpffs_for_pin · d444b06e
      Tobias Klauser authored
      Fix and add a missing NULL check for the prior malloc() call.
      
      Fixes: 49a086c2
      
       ("bpftool: implement prog load command")
      Signed-off-by: default avatarTobias Klauser <tklauser@distanz.ch>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarQuentin Monnet <quentin@isovalent.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/bpf/20210715110609.29364-1-tklauser@distanz.ch
      d444b06e
    • Jakub Sitnicki's avatar
      bpf, sockmap, udp: sk_prot needs inuse_idx set for proc stats · 54ea2f49
      Jakub Sitnicki authored
      The proc socket stats use sk_prot->inuse_idx value to record inuse sock
      stats. We currently do not set this correctly from sockmap side. The
      result is reading sock stats '/proc/net/sockstat' gives incorrect values.
      The socket counter is incremented correctly, but because we don't set the
      counter correctly when we replace sk_prot we may omit the decrement.
      
      To get the correct inuse_idx value move the core_initcall that initializes
      the UDP proto handlers to late_initcall. This way it is initialized after
      UDP has the chance to assign the inuse_idx value from the register protocol
      handler.
      
      Fixes: edc6741c
      
       ("bpf: Add sockmap hooks for UDP sockets")
      Signed-off-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarCong Wang <cong.wang@bytedance.com>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20210714154750.528206-1-jakub@cloudflare.com
      54ea2f49
    • John Fastabend's avatar
      bpf, sockmap, tcp: sk_prot needs inuse_idx set for proc stats · 228a4a7b
      John Fastabend authored
      The proc socket stats use sk_prot->inuse_idx value to record inuse sock
      stats. We currently do not set this correctly from sockmap side. The
      result is reading sock stats '/proc/net/sockstat' gives incorrect values.
      The socket counter is incremented correctly, but because we don't set the
      counter correctly when we replace sk_prot we may omit the decrement.
      
      To get the correct inuse_idx value move the core_initcall that initializes
      the TCP proto handlers to late_initcall. This way it is initialized after
      TCP has the chance to assign the inuse_idx value from the register protocol
      handler.
      
      Fixes: 604326b4
      
       ("bpf, sockmap: convert to generic sk_msg interface")
      Suggested-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarCong Wang <cong.wang@bytedance.com>
      Link: https://lore.kernel.org/bpf/20210712195546.423990-3-john.fastabend@gmail.com
      228a4a7b
    • John Fastabend's avatar
      bpf, sockmap: Fix potential memory leak on unlikely error case · 7e6b27a6
      John Fastabend authored
      If skb_linearize is needed and fails we could leak a msg on the error
      handling. To fix ensure we kfree the msg block before returning error.
      Found during code review.
      
      Fixes: 4363023d
      
       ("bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarCong Wang <cong.wang@bytedance.com>
      Link: https://lore.kernel.org/bpf/20210712195546.423990-2-john.fastabend@gmail.com
      7e6b27a6
    • Colin Ian King's avatar
      s390/bpf: Perform r1 range checking before accessing jit->seen_reg[r1] · 91091656
      Colin Ian King authored
      Currently array jit->seen_reg[r1] is being accessed before the range
      checking of index r1. The range changing on r1 should be performed
      first since it will avoid any potential out-of-range accesses on the
      array seen_reg[] and also it is more optimal to perform checks on r1
      before fetching data from the array. Fix this by swapping the order
      of the checks before the array access.
      
      Fixes: 05462310
      
       ("s390/bpf: Add s390x eBPF JIT compiler backend")
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarIlya Leoshkevich <iii@linux.ibm.com>
      Acked-by: default avatarIlya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/bpf/20210715125712.24690-1-colin.king@canonical.com
      91091656
    • Qitao Xu's avatar
      net_sched: introduce tracepoint trace_qdisc_enqueue() · 70713ddd
      Qitao Xu authored
      Tracepoint trace_qdisc_enqueue() is introduced to trace skb at
      the entrance of TC layer on TX side. This is similar to
      trace_qdisc_dequeue():
      
      1. For both we only trace successful cases. The failure cases
         can be traced via trace_kfree_skb().
      
      2. They are called at entrance or exit of TC layer, not for each
         ->enqueue() or ->dequeue(). This is intentional, because
         we want to make trace_qdisc_enqueue() symmetric to
         trace_qdisc_dequeue(), which is easier to use.
      
      The return value of qdisc_enqueue() is not interesting here,
      we have Qdisc's drop packets in ->dequeue(), it is impossible to
      trace them even if we have the return value, the only way to trace
      them is tracing kfree_skb().
      
      We only add information we need to trace ring buffer. If any other
      information is needed, it is easy to extend it without breaking ABI,
      see commit 3dd344ea
      
       ("net: tracepoint: exposing sk_family in all
      tcp:tracepoints").
      
      Reviewed-by: default avatarCong Wang <cong.wang@bytedance.com>
      Signed-off-by: default avatarQitao Xu <qitao.xu@bytedance.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      70713ddd
    • Qitao Xu's avatar
      net_sched: use %px to print skb address in trace_qdisc_dequeue() · 851f36e4
      Qitao Xu authored
      
      
      Print format of skbaddr is changed to %px from %p, because we want
      to use skb address as a quick way to identify a packet.
      
      Note, trace ring buffer is only accessible to privileged users,
      it is safe to use a real kernel address here.
      
      Reviewed-by: default avatarCong Wang <cong.wang@bytedance.com>
      Signed-off-by: default avatarQitao Xu <qitao.xu@bytedance.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      851f36e4
    • Qitao Xu's avatar
      net: use %px to print skb address in trace_netif_receive_skb · 65875073
      Qitao Xu authored
      
      
      The print format of skb adress in tracepoint class net_dev_template
      is changed to %px from %p, because we want to use skb address
      as a quick way to identify a packet.
      
      Note, trace ring buffer is only accessible to privileged users,
      it is safe to use a real kernel address here.
      
      Reviewed-by: default avatarCong Wang <cong.wang@bytedance.com>
      Signed-off-by: default avatarQitao Xu <qitao.xu@bytedance.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      65875073
    • Colin Ian King's avatar
      liquidio: Fix unintentional sign extension issue on left shift of u16 · e7efc2ce
      Colin Ian King authored
      Shifting the u16 integer oct->pcie_port by CN23XX_PKT_INPUT_CTL_MAC_NUM_POS
      (29) bits will be promoted to a 32 bit signed int and then sign-extended
      to a u64. In the cases where oct->pcie_port where bit 2 is set (e.g. 3..7)
      the shifted value will be sign extended and the top 32 bits of the result
      will be set.
      
      Fix this by casting the u16 values to a u64 before the 29 bit left shift.
      
      Addresses-Coverity: ("Unintended sign extension")
      
      Fixes: 3451b97c
      
       ("liquidio: CN23XX register setup")
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e7efc2ce
    • Geert Uytterhoeven's avatar
      net: dsa: mv88e6xxx: NET_DSA_MV88E6XXX_PTP should depend on NET_DSA_MV88E6XXX · 99bb2eba
      Geert Uytterhoeven authored
      Making global2 support mandatory removed the Kconfig symbol
      NET_DSA_MV88E6XXX_GLOBAL2.  This symbol also served as an intermediate
      symbol to make NET_DSA_MV88E6XXX_PTP depend on NET_DSA_MV88E6XXX.  With
      the symbol removed, the user is always asked about PTP support for
      Marvell 88E6xxx switches, even if the latter support is not enabled.
      
      Fix this by reinstating the dependency.
      
      Fixes: 63368a74
      
       ("net: dsa: mv88e6xxx: Make global2 support mandatory")
      Signed-off-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: default avatarVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      99bb2eba
  4. Jul 15, 2021
    • David S. Miller's avatar
      Merge branch 'r8152-pm-fixxes' · 3ffd3dad
      David S. Miller authored
      
      
      Takashi Iwai says:
      
      ====================
      r8152: Fix a couple of PM problems
      
      it seems that r8152 driver suffers from the deadlock at both runtime
      and system PM.  Formerly, it was seen more often at hibernation
      resume, but now it's triggered more frequently, as reported in SUSE
      Bugzilla:
        https://bugzilla.suse.com/show_bug.cgi?id=1186194
      
      While debugging the problem, I stumbled on a few obvious bugs and here
      is the results with two patches for addressing the resume problem.
      
      ***
      
      However, the story doesn't end here, unfortunately, and those patches
      don't seem sufficing.  The rest major problem is that the driver calls
      napi_disable() and napi_enable() in the PM suspend callbacks.  This
      makes the system stalling at (runtime-)suspend.  If we drop
      napi_disable() and napi_enable() calls in the PM suspend callbacks, it
      starts working (that was the result in Bugzilla comment 13):
        https://bugzilla.suse.com/show_bug.cgi?id=1186194#c13
      
      So, my patches aren't enough and we still need to investigate
      further.  It'd be appreciated if anyone can give a fix or a hint for
      more debugging.  The usage of napi_disable() at PM callbacks is unique
      in this driver and looks rather suspicious to me; but I'm no expert in
      this area so I might be wrong...
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3ffd3dad
    • Takashi Iwai's avatar
      r8152: Fix a deadlock by doubly PM resume · 776ac63a
      Takashi Iwai authored
      
      
      r8152 driver sets up the MAC address at reset-resume, while
      rtl8152_set_mac_address() has the temporary autopm get/put.  This may
      lead to a deadlock as the PM lock has been already taken for the
      execution of the runtime PM callback.
      
      This patch adds the workaround to avoid the superfluous autpm when
      called from rtl8152_reset_resume().
      
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1186194
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      776ac63a
    • Takashi Iwai's avatar
      r8152: Fix potential PM refcount imbalance · 9c23aa51
      Takashi Iwai authored
      
      
      rtl8152_close() takes the refcount via usb_autopm_get_interface() but
      it doesn't release when RTL8152_UNPLUG test hits.  This may lead to
      the imbalance of PM refcount.  This patch addresses it.
      
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1186194
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9c23aa51
    • Linus Torvalds's avatar
      Merge tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 8096acd7
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski.
       "Including fixes from bpf and netfilter.
      
        Current release - regressions:
      
         - sock: fix parameter order in sock_setsockopt()
      
        Current release - new code bugs:
      
         - netfilter: nft_last:
             - fix incorrect arithmetic when restoring last used
             - honor NFTA_LAST_SET on restoration
      
        Previous releases - regressions:
      
         - udp: properly flush normal packet at GRO time
      
         - sfc: ensure correct number of XDP queues; don't allow enabling the
           feature if there isn't sufficient resources to Tx from any CPU
      
         - dsa: sja1105: fix address learning getting disabled on the CPU port
      
         - mptcp: addresses a rmem accounting issue that could keep packets in
           subflow receive buffers longer than necessary, delaying MPTCP-level
           ACKs
      
         - ip_tunnel: fix mtu calculation for ETHER tunnel devices
      
         - do not reuse skbs allocated from skbuff_fclone_cache in the napi
           skb cache, we'd try to return them to the wrong slab cache
      
         - tcp: consistently disable header prediction for mptcp
      
        Previous releases - always broken:
      
         - bpf: fix subprog poke descriptor tracking use-after-free
      
         - ipv6:
             - allocate enough headroom in ip6_finish_output2() in case
               iptables TEE is used
             - tcp: drop silly ICMPv6 packet too big messages to avoid
               expensive and pointless lookups (which may serve as a DDOS
               vector)
             - make sure fwmark is copied in SYNACK packets
             - fix 'disable_policy' for forwarded packets (align with IPv4)
      
         - netfilter: conntrack:
             - do not renew entry stuck in tcp SYN_SENT state
             - do not mark RST in the reply direction coming after SYN packet
               for an out-of-sync entry
      
         - mptcp: cleanly handle error conditions with MP_JOIN and syncookies
      
         - mptcp: fix double free when rejecting a join due to port mismatch
      
         - validate lwtstate->data before returning from skb_tunnel_info()
      
         - tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path
      
         - mt76: mt7921: continue to probe driver when fw already downloaded
      
         - bonding: fix multiple issues with offloading IPsec to (thru?) bond
      
         - stmmac: ptp: fix issues around Qbv support and setting time back
      
         - bcmgenet: always clear wake-up based on energy detection
      
        Misc:
      
         - sctp: move 198 addresses from unusable to private scope
      
         - ptp: support virtual clocks and timestamping
      
         - openvswitch: optimize operation for key comparison"
      
      * tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits)
        net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
        sfc: add logs explaining XDP_TX/REDIRECT is not available
        sfc: ensure correct number of XDP queues
        sfc: fix lack of XDP TX queues - error XDP TX failed (-22)
        net: fddi: fix UAF in fza_probe
        net: dsa: sja1105: fix address learning getting disabled on the CPU port
        net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload
        net: Use nlmsg_unicast() instead of netlink_unicast()
        octeontx2-pf: Fix uninitialized boolean variable pps
        ipv6: allocate enough headroom in ip6_finish_output2()
        net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific
        net: bridge: multicast: fix MRD advertisement router port marking race
        net: bridge: multicast: fix PIM hello router port marking race
        net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340
        dsa: fix for_each_child.cocci warnings
        virtio_net: check virtqueue_add_sgs() return value
        mptcp: properly account bulk freed memory
        selftests: mptcp: fix case multiple subflows limited by server
        mptcp: avoid processing packet if a subflow reset
        mptcp: fix syncookie process if mptcp can not_accept new subflow
        ...
      8096acd7
    • Christian Brauner's avatar
      fs: add vfs_parse_fs_param_source() helper · d1d488d8
      Christian Brauner authored
      
      
      Add a simple helper that filesystems can use in their parameter parser
      to parse the "source" parameter. A few places open-coded this function
      and that already caused a bug in the cgroup v1 parser that we fixed.
      Let's make it harder to get this wrong by introducing a helper which
      performs all necessary checks.
      
      Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d1d488d8
    • Christian Brauner's avatar
      cgroup: verify that source is a string · 3b046272
      Christian Brauner authored
      The following sequence can be used to trigger a UAF:
      
          int fscontext_fd = fsopen("cgroup");
          int fd_null = open("/dev/null, O_RDONLY);
          int fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", fd_null);
          close_range(3, ~0U, 0);
      
      The cgroup v1 specific fs parser expects a string for the "source"
      parameter.  However, it is perfectly legitimate to e.g.  specify a file
      descriptor for the "source" parameter.  The fs parser doesn't know what
      a filesystem allows there.  So it's a bug to assume that "source" is
      always of type fs_value_is_string when it can reasonably also be
      fs_value_is_file.
      
      This assumption in the cgroup code causes a UAF because struct
      fs_parameter uses a union for the actual value.  Access to that union is
      guarded by the param->type member.  Since the cgroup paramter parser
      didn't check param->type but unconditionally moved param->string into
      fc->source a close on the fscontext_fd would trigger a UAF during
      put_fs_context() which frees fc->source thereby freeing the file stashed
      in param->file causing a UAF during a close of the fd_null.
      
      Fix this by verifying that param->type is actually a string and report
      an error if not.
      
      In follow up patches I'll add a new generic helper that can be used here
      and by other filesystems instead of this error-prone copy-pasta fix.
      But fixing it in here first makes backporting a it to stable a lot
      easier.
      
      Fixes: 8d2451f4
      
       ("cgroup1: switch to option-by-option parsing")
      Reported-by: default avatar <syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: <stable@kernel.org>
      Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b046272
  5. Jul 14, 2021
    • Vladimir Oltean's avatar
      net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave() · bcb9928a
      Vladimir Oltean authored
      This was not caught because there is no switch driver which implements
      the .port_bridge_join but not .port_bridge_leave method, but it should
      nonetheless be fixed, as in certain conditions (driver development) it
      might lead to NULL pointer dereference.
      
      Fixes: f66a6a69
      
       ("net: dsa: permit cross-chip bridging between all trees in the system")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bcb9928a
    • Linus Torvalds's avatar
      Merge tag 'vboxsf-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux · 40226a3d
      Linus Torvalds authored
      Pull vboxsf fixes from Hans de Goede:
       "This adds support for the atomic_open directory-inode op to vboxsf.
      
        Note this is not just an enhancement this also fixes an actual issue
        which users are hitting, see the commit message of the "boxsf: Add
        support for the atomic_open directory-inode" patch"
      
      * tag 'vboxsf-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux:
        vboxsf: Add support for the atomic_open directory-inode op
        vboxsf: Add vboxsf_[create|release]_sf_handle() helpers
        vboxsf: Make vboxsf_dir_create() return the handle for the created file
        vboxsf: Honor excl flag to the dir-inode create op
      40226a3d
    • Linus Torvalds's avatar
      Merge tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · f02bf857
      Linus Torvalds authored
      Pull btrfs zoned mode fixes from David Sterba:
      
       - fix deadlock when allocating system chunk
      
       - fix wrong mutex unlock on an error path
      
       - fix extent map splitting for append operation
      
       - update and fix message reporting unusable chunk space
      
       - don't block when background zone reclaim runs with balance in
         parallel
      
      * tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree
        btrfs: don't block if we can't acquire the reclaim lock
        btrfs: properly split extent_map for REQ_OP_ZONE_APPEND
        btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
        btrfs: fix deadlock with concurrent chunk allocations involving system chunks
        btrfs: zoned: print unusable percentage when reclaiming block groups
        btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work
      f02bf857
    • David S. Miller's avatar
      Merge branch 'sfc-tx-queues' · 28efd208
      David S. Miller authored
      Íñigo Huguet says:
      
      ====================
      sfc: Fix lack of XDP TX queues
      
      A change introduced in commit e26ca4b5 ("sfc: reduce the number of
      requested xdp ev queues") created a bug in XDP_TX and XDP_REDIRECT
      because it unintentionally reduced the number of XDP TX queues, letting
      not enough queues to have one per CPU, which leaded to errors if XDP
      TX/REDIRECT was done from a high numbered CPU.
      
      This patchs make the following changes:
      - Fix the bug mentioned above
      - Revert commit 99ba0ea6
      
       ("sfc: adjust efx->xdp_tx_queue_count with
        the real number of initialized queues") which intended to fix a related
        problem, created by mentioned bug, but it's no longer necessary
      - Add a new error log message if there are not enough resources to make
        XDP_TX/REDIRECT work
      
      V1 -> V2: keep the calculation of how many tx queues can handle a single
      event queue, but apply the "max. tx queues per channel" upper limit.
      V2 -> V3: WARN_ON if the number of initialized XDP TXQs differs from the
      expected.
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      28efd208
    • Íñigo Huguet's avatar
      sfc: add logs explaining XDP_TX/REDIRECT is not available · d2a16bde
      Íñigo Huguet authored
      
      
      If it's not possible to allocate enough channels for XDP, XDP_TX and
      XDP_REDIRECT don't work. However, only a message saying that not enough
      channels were available was shown, but not saying what are the
      consequences in that case. The user didn't know if he/she can use XDP
      or not, if the performance is reduced, or what.
      
      Signed-off-by: default avatarÍñigo Huguet <ihuguet@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d2a16bde
    • Íñigo Huguet's avatar
      sfc: ensure correct number of XDP queues · 788bc000
      Íñigo Huguet authored
      Commit 99ba0ea6 ("sfc: adjust efx->xdp_tx_queue_count with the real
      number of initialized queues") intended to fix a problem caused by a
      round up when calculating the number of XDP channels and queues.
      However, this was not the real problem. The real problem was that the
      number of XDP TX queues had been reduced to half in
      commit e26ca4b5
      
       ("sfc: reduce the number of requested xdp ev queues"),
      but the variable xdp_tx_queue_count had remained the same.
      
      Once the correct number of XDP TX queues is created again in the
      previous patch of this series, this also can be reverted since the error
      doesn't actually exist.
      
      Only in the case that there is a bug in the code we can have different
      values in xdp_queue_number and efx->xdp_tx_queue_count. Because of this,
      and per Edward Cree's suggestion, I add instead a WARN_ON to catch if it
      happens again in the future.
      
      Note that the number of allocated queues can be higher than the number
      of used ones due to the round up, as explained in the existing comment
      in the code. That's why we also have to stop increasing xdp_queue_number
      beyond efx->xdp_tx_queue_count.
      
      Signed-off-by: default avatarÍñigo Huguet <ihuguet@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      788bc000
    • Íñigo Huguet's avatar
      sfc: fix lack of XDP TX queues - error XDP TX failed (-22) · f28100cb
      Íñigo Huguet authored
      Fixes: e26ca4b5
      
       sfc: reduce the number of requested xdp ev queues
      
      The buggy commit intended to allocate less channels for XDP in order to
      be more unlikely to reach the limit of 32 channels of the driver.
      
      The idea was to use each IRQ/eventqeue for more XDP TX queues than
      before, calculating which is the maximum number of TX queues that one
      event queue can handle. For example, in EF10 each event queue could
      handle up to 8 queues, better than the 4 they were handling before the
      change. This way, it would have to allocate half of channels than before
      for XDP TX.
      
      The problem is that the TX queues are also contained inside the channel
      structs, and there are only 4 queues per channel. Reducing the number of
      channels means also reducing the number of queues, resulting in not
      having the desired number of 1 queue per CPU.
      
      This leads to getting errors on XDP_TX and XDP_REDIRECT if they're
      executed from a high numbered CPU, because there only exist queues for
      the low half of CPUs, actually. If XDP_TX/REDIRECT is executed in a low
      numbered CPU, the error doesn't happen. This is the error in the logs
      (repeated many times, even rate limited):
      sfc 0000:5e:00.0 ens3f0np0: XDP TX failed (-22)
      
      This errors happens in function efx_xdp_tx_buffers, where it expects to
      have a dedicated XDP TX queue per CPU.
      
      Reverting the change makes again more likely to reach the limit of 32
      channels in machines with many CPUs. If this happen, no XDP_TX/REDIRECT
      will be possible at all, and we will have this log error messages:
      
      At interface probe:
      sfc 0000:5e:00.0: Insufficient resources for 12 XDP event queues (24 other channels, max 32)
      
      At every subsequent XDP_TX/REDIRECT failure, rate limited:
      sfc 0000:5e:00.0 ens3f0np0: XDP TX failed (-22)
      
      However, without reverting the change, it makes the user to think that
      everything is OK at probe time, but later it fails in an unpredictable
      way, depending on the CPU that handles the packet.
      
      It is better to restore the predictable behaviour. If the user sees the
      error message at probe time, he/she can try to configure the best way it
      fits his/her needs. At least, he/she will have 2 options:
      - Accept that XDP_TX/REDIRECT is not available (he/she may not need it)
      - Load sfc module with modparam 'rss_cpus' with a lower number, thus
        creating less normal RX queues/channels, letting more free resources
        for XDP, with some performance penalty.
      
      Anyway, let the calculation of maximum TX queues that can be handled by
      a single event queue, and use it only if it's less than the number of TX
      queues per channel. This doesn't happen in practice, but could happen if
      some constant values are tweaked in the future, such us
      EFX_MAX_TXQ_PER_CHANNEL, EFX_MAX_EVQ_SIZE or EFX_MAX_DMAQ_SIZE.
      
      Related mailing list thread:
      https://lore.kernel.org/bpf/20201215104327.2be76156@carbon/
      
      Signed-off-by: default avatarÍñigo Huguet <ihuguet@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f28100cb
    • Pavel Skripkin's avatar
      net: fddi: fix UAF in fza_probe · deb7178e
      Pavel Skripkin authored
      fp is netdev private data and it cannot be
      used after free_netdev() call. Using fp after free_netdev()
      can cause UAF bug. Fix it by moving free_netdev() after error message.
      
      Fixes: 61414f5e
      
       ("FDDI: defza: Add support for DEC FDDIcontroller 700
      TURBOchannel adapter")
      Signed-off-by: default avatarPavel Skripkin <paskripkin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      deb7178e
    • Vladimir Oltean's avatar
      net: dsa: sja1105: fix address learning getting disabled on the CPU port · b0b33b04
      Vladimir Oltean authored
      In May 2019 when commit 640f763f ("net: dsa: sja1105: Add support
      for Spanning Tree Protocol") was introduced, the comment that "STP does
      not get called for the CPU port" was true. This changed after commit
      0394a63a ("net: dsa: enable and disable all ports") in August 2019
      and went largely unnoticed, because the sja1105_bridge_stp_state_set()
      method did nothing different compared to the static setup done by
      sja1105_init_mac_settings().
      
      With the ability to turn address learning off introduced by the blamed
      commit, there is a new priv->learn_ena port mask in the driver. When
      sja1105_bridge_stp_state_set() gets called and we are in
      BR_STATE_LEARNING or later, address learning is enabled or not depending
      on priv->learn_ena & BIT(port).
      
      So what happens is that priv->learn_ena is not being set from anywhere
      for the CPU port, and the static configuration done by
      sja1105_init_mac_settings() is being overwritten.
      
      To solve this, acknowledge that the static configuration of STP state is
      no longer necessary because the STP state is being set by the DSA core
      now, but what is necessary is to set priv->learn_ena for the CPU port.
      
      Fixes: 4d942354
      
       ("net: dsa: sja1105: offload bridge port flags to device")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b0b33b04
    • Vladimir Oltean's avatar
      net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload · e56c6bbd
      Vladimir Oltean authored
      The point with a *dev and a *brport_dev is that when we have a LAG net
      device that is a bridge port, *dev is an ocelot net device and
      *brport_dev is the bonding/team net device. The ocelot net device
      beneath the LAG does not exist from the bridge's perspective, so we need
      to sync the switchdev objects belonging to the brport_dev and not to the
      dev.
      
      Fixes: e4bd44e8
      
       ("net: ocelot: replay switchdev events when joining bridge")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e56c6bbd
    • Yajun Deng's avatar
      net: Use nlmsg_unicast() instead of netlink_unicast() · 01757f53
      Yajun Deng authored
      
      
      It has 'if (err >0 )' statement in nlmsg_unicast(), so use nlmsg_unicast()
      instead of netlink_unicast(), this looks more concise.
      
      v2: remove the change in netfilter.
      
      Signed-off-by: default avatarYajun Deng <yajun.deng@linux.dev>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      01757f53
  6. Jul 13, 2021
    • Xuan Zhuo's avatar
      xdp, net: Fix use-after-free in bpf_xdp_link_release · 5acc7d3e
      Xuan Zhuo authored
      The problem occurs between dev_get_by_index() and dev_xdp_attach_link().
      At this point, dev_xdp_uninstall() is called. Then xdp link will not be
      detached automatically when dev is released. But link->dev already
      points to dev, when xdp link is released, dev will still be accessed,
      but dev has been released.
      
      dev_get_by_index()        |
      link->dev = dev           |
                                |      rtnl_lock()
                                |      unregister_netdevice_many()
                                |          dev_xdp_uninstall()
                                |      rtnl_unlock()
      rtnl_lock();              |
      dev_xdp_attach_link()     |
      rtnl_unlock();            |
                                |      netdev_run_todo() // dev released
      bpf_xdp_link_release()    |
          /* access dev.        |
             use-after-free */  |
      
      [   45.966867] BUG: KASAN: use-after-free in bpf_xdp_link_release+0x3b8/0x3d0
      [   45.967619] Read of size 8 at addr ffff00000f9980c8 by task a.out/732
      [   45.968297]
      [   45.968502] CPU: 1 PID: 732 Comm: a.out Not tainted 5.13.0+ #22
      [   45.969222] Hardware name: linux,dummy-virt (DT)
      [   45.969795] Call trace:
      [   45.970106]  dump_backtrace+0x0/0x4c8
      [   45.970564]  show_stack+0x30/0x40
      [   45.970981]  dump_stack_lvl+0x120/0x18c
      [   45.971470]  print_address_description.constprop.0+0x74/0x30c
      [   45.972182]  kasan_report+0x1e8/0x200
      [   45.972659]  __asan_report_load8_noabort+0x2c/0x50
      [   45.973273]  bpf_xdp_link_release+0x3b8/0x3d0
      [   45.973834]  bpf_link_free+0xd0/0x188
      [   45.974315]  bpf_link_put+0x1d0/0x218
      [   45.974790]  bpf_link_release+0x3c/0x58
      [   45.975291]  __fput+0x20c/0x7e8
      [   45.975706]  ____fput+0x24/0x30
      [   45.976117]  task_work_run+0x104/0x258
      [   45.976609]  do_notify_resume+0x894/0xaf8
      [   45.977121]  work_pending+0xc/0x328
      [   45.977575]
      [   45.977775] The buggy address belongs to the page:
      [   45.978369] page:fffffc00003e6600 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4f998
      [   45.979522] flags: 0x7fffe0000000000(node=0|zone=0|lastcpupid=0x3ffff)
      [   45.980349] raw: 07fffe0000000000 fffffc00003e6708 ffff0000dac3c010 0000000000000000
      [   45.981309] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      [   45.982259] page dumped because: kasan: bad access detected
      [   45.982948]
      [   45.983153] Memory state around the buggy address:
      [   45.983753]  ffff00000f997f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   45.984645]  ffff00000f998000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.985533] >ffff00000f998080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.986419]                                               ^
      [   45.987112]  ffff00000f998100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.988006]  ffff00000f998180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.988895] ==================================================================
      [   45.989773] Disabling lock debugging due to kernel taint
      [   45.990552] Kernel panic - not syncing: panic_on_warn set ...
      [   45.991166] CPU: 1 PID: 732 Comm: a.out Tainted: G    B             5.13.0+ #22
      [   45.991929] Hardware name: linux,dummy-virt (DT)
      [   45.992448] Call trace:
      [   45.992753]  dump_backtrace+0x0/0x4c8
      [   45.993208]  show_stack+0x30/0x40
      [   45.993627]  dump_stack_lvl+0x120/0x18c
      [   45.994113]  dump_stack+0x1c/0x34
      [   45.994530]  panic+0x3a4/0x7d8
      [   45.994930]  end_report+0x194/0x198
      [   45.995380]  kasan_report+0x134/0x200
      [   45.995850]  __asan_report_load8_noabort+0x2c/0x50
      [   45.996453]  bpf_xdp_link_release+0x3b8/0x3d0
      [   45.997007]  bpf_link_free+0xd0/0x188
      [   45.997474]  bpf_link_put+0x1d0/0x218
      [   45.997942]  bpf_link_release+0x3c/0x58
      [   45.998429]  __fput+0x20c/0x7e8
      [   45.998833]  ____fput+0x24/0x30
      [   45.999247]  task_work_run+0x104/0x258
      [   45.999731]  do_notify_resume+0x894/0xaf8
      [   46.000236]  work_pending+0xc/0x328
      [   46.000697] SMP: stopping secondary CPUs
      [   46.001226] Dumping ftrace buffer:
      [   46.001663]    (ftrace buffer empty)
      [   46.002110] Kernel Offset: disabled
      [   46.002545] CPU features: 0x00000001,23202c00
      [   46.003080] Memory Limit: none
      
      Fixes: aa8d3a71
      
       ("bpf, xdp: Add bpf_link-based XDP attachment API")
      Reported-by: default avatarAbaci <abaci@linux.alibaba.com>
      Signed-off-by: default avatarXuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: default avatarDust Li <dust.li@linux.alibaba.com>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210710031635.41649-1-xuanzhuo@linux.alibaba.com
      5acc7d3e
    • Daniel Borkmann's avatar
      bpf: Fix tail_call_reachable rejection for interpreter when jit failed · 5dd0a6b8
      Daniel Borkmann authored
      During testing of f263a814 ("bpf: Track subprog poke descriptors correctly
      and fix use-after-free") under various failure conditions, for example, when
      jit_subprogs() fails and tries to clean up the program to be run under the
      interpreter, we ran into the following freeze:
      
        [...]
        #127/8 tailcall_bpf2bpf_3:FAIL
        [...]
        [   92.041251] BUG: KASAN: slab-out-of-bounds in ___bpf_prog_run+0x1b9d/0x2e20
        [   92.042408] Read of size 8 at addr ffff88800da67f68 by task test_progs/682
        [   92.043707]
        [   92.044030] CPU: 1 PID: 682 Comm: test_progs Tainted: G   O   5.13.0-53301-ge6c08cb33a30-dirty #87
        [   92.045542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
        [   92.046785] Call Trace:
        [   92.047171]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.047773]  ? __bpf_prog_run_args32+0x8b/0xb0
        [   92.048389]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.049019]  ? ktime_get+0x117/0x130
        [...] // few hundred [similar] lines more
        [   92.659025]  ? ktime_get+0x117/0x130
        [   92.659845]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.660738]  ? __bpf_prog_run_args32+0x8b/0xb0
        [   92.661528]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.662378]  ? print_usage_bug+0x50/0x50
        [   92.663221]  ? print_usage_bug+0x50/0x50
        [   92.664077]  ? bpf_ksym_find+0x9c/0xe0
        [   92.664887]  ? ktime_get+0x117/0x130
        [   92.665624]  ? kernel_text_address+0xf5/0x100
        [   92.666529]  ? __kernel_text_address+0xe/0x30
        [   92.667725]  ? unwind_get_return_address+0x2f/0x50
        [   92.668854]  ? ___bpf_prog_run+0x15d4/0x2e20
        [   92.670185]  ? ktime_get+0x117/0x130
        [   92.671130]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.672020]  ? __bpf_prog_run_args32+0x8b/0xb0
        [   92.672860]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.675159]  ? ktime_get+0x117/0x130
        [   92.677074]  ? lock_is_held_type+0xd5/0x130
        [   92.678662]  ? ___bpf_prog_run+0x15d4/0x2e20
        [   92.680046]  ? ktime_get+0x117/0x130
        [   92.681285]  ? __bpf_prog_run32+0x6b/0x90
        [   92.682601]  ? __bpf_prog_run64+0x90/0x90
        [   92.683636]  ? lock_downgrade+0x370/0x370
        [   92.684647]  ? mark_held_locks+0x44/0x90
        [   92.685652]  ? ktime_get+0x117/0x130
        [   92.686752]  ? lockdep_hardirqs_on+0x79/0x100
        [   92.688004]  ? ktime_get+0x117/0x130
        [   92.688573]  ? __cant_migrate+0x2b/0x80
        [   92.689192]  ? bpf_test_run+0x2f4/0x510
        [   92.689869]  ? bpf_test_timer_continue+0x1c0/0x1c0
        [   92.690856]  ? rcu_read_lock_bh_held+0x90/0x90
        [   92.691506]  ? __kasan_slab_alloc+0x61/0x80
        [   92.692128]  ? eth_type_trans+0x128/0x240
        [   92.692737]  ? __build_skb+0x46/0x50
        [   92.693252]  ? bpf_prog_test_run_skb+0x65e/0xc50
        [   92.693954]  ? bpf_prog_test_run_raw_tp+0x2d0/0x2d0
        [   92.694639]  ? __fget_light+0xa1/0x100
        [   92.695162]  ? bpf_prog_inc+0x23/0x30
        [   92.695685]  ? __sys_bpf+0xb40/0x2c80
        [   92.696324]  ? bpf_link_get_from_fd+0x90/0x90
        [   92.697150]  ? mark_held_locks+0x24/0x90
        [   92.698007]  ? lockdep_hardirqs_on_prepare+0x124/0x220
        [   92.699045]  ? finish_task_switch+0xe6/0x370
        [   92.700072]  ? lockdep_hardirqs_on+0x79/0x100
        [   92.701233]  ? finish_task_switch+0x11d/0x370
        [   92.702264]  ? __switch_to+0x2c0/0x740
        [   92.703148]  ? mark_held_locks+0x24/0x90
        [   92.704155]  ? __x64_sys_bpf+0x45/0x50
        [   92.705146]  ? do_syscall_64+0x35/0x80
        [   92.706953]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
        [...]
      
      Turns out that the program rejection from e411901c ("bpf: allow for tailcalls
      in BPF subprograms for x64 JIT") is buggy since env->prog->aux->tail_call_reachable
      is never true. Commit ebf7d1f5 ("bpf, x64: rework pro/epilogue and tailcall
      handling in JIT") added a tracker into check_max_stack_depth() which propagates
      the tail_call_reachable condition throughout the subprograms. This info is then
      assigned to the subprogram's func[i]->aux->tail_call_reachable. However, in the
      case of the rejection check upon JIT failure, env->prog->aux->tail_call_reachable
      is used. func[0]->aux->tail_call_reachable which represents the main program's
      information did not propagate this to the outer env->prog->aux, though. Add this
      propagation into check_max_stack_depth() where it needs to belong so that the
      check can be done reliably.
      
      Fixes: ebf7d1f5 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
      Fixes: e411901c
      
       ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
      Co-developed-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/bpf/618c34e3163ad1a36b1e82377576a6081e182f25.1626123173.git.daniel@iogearbox.net
      5dd0a6b8