  1. Feb 01, 2024
    • wifi: iwlwifi: fix a memory corruption · aa2cc936
      Emmanuel Grumbach authored
      commit cf4a0d84 upstream.
      
      iwl_fw_ini_trigger_tlv::data is a pointer to a __le32, which means that
      if we copy to iwl_fw_ini_trigger_tlv::data + offset while offset is in
      bytes, we'll write past the buffer.
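
The hazard can be sketched in plain C. This is an illustration rather than the driver code: `struct demo_tlv` and both helpers are hypothetical, and `uint32_t` stands in for `__le32`. Adding a byte offset to a 32-bit pointer advances four times too far, so the destination is computed through a byte pointer instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical TLV whose payload is declared as 32-bit words (64 bytes). */
struct demo_tlv {
	uint32_t data[16];
};

/*
 * How many bytes "data + offset" really advances when data is a
 * uint32_t pointer. Precondition: offset <= 16, so the pointer stays
 * inside (or one past) the array.
 */
static size_t demo_word_ptr_advance(const struct demo_tlv *t, size_t offset)
{
	return (size_t)((const uint8_t *)(t->data + offset) -
			(const uint8_t *)t->data);
}

/* Copy len bytes to a byte offset inside data, with a bounds check. */
static int demo_copy_at_byte_offset(struct demo_tlv *t, size_t offset,
				    const void *src, size_t len)
{
	if (offset > sizeof(t->data) || len > sizeof(t->data) - offset)
		return -1;
	/* Cast to a byte pointer first so the offset is not scaled. */
	memcpy((uint8_t *)t->data + offset, src, len);
	return 0;
}
```

With this 64-byte payload, a byte offset of 8 applied to the word pointer lands at byte 32, which is how an offset that looks valid can end up outside the buffer.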
      
      Cc: stable@vger.kernel.org
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218233
      Fixes: cf29c5b6 ("iwlwifi: dbg_ini: implement time point handling")
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
      Link: https://msgid.link/20240111150610.2d2b8b870194.I14ed76505a5cf87304e0c9cc05cc0ae85ed3bf91@changeid
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • exec: Fix error handling in begin_new_exec() · dcc54a54
      Bernd Edlinger authored
      
      commit 84c39ec5 upstream.
      
      If get_unused_fd_flags() fails, the error handling is incomplete because
      bprm->cred is already set to NULL, and therefore free_bprm will not
      unlock the cred_guard_mutex. Note there are two error conditions which
      end up here, one before and one after bprm->cred is cleared.
      
      Fixes: b8a61c9e ("exec: Generic execfd support")
      Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Acked-by: Eric W. Biederman <ebiederm@xmission.com>
      Link: https://lore.kernel.org/r/AS8P193MB128517ADB5EFF29E04389EDAE4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM
      Cc: stable@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • rbd: don't move requests to the running list on errors · 46464457
      Ilya Dryomov authored
      
      commit ded080c8 upstream.
      
      The running list is supposed to contain requests that are pinning the
      exclusive lock, i.e. those that must be flushed before exclusive lock
      is released.  When wake_lock_waiters() is called to handle an error,
      requests on the acquiring list are failed with that error and no
      flushing takes place.  Briefly moving them to the running list is not
      only pointless but also harmful: if exclusive lock gets acquired
      before all of their state machines are scheduled and go through
      rbd_lock_del_request(), we trigger
      
          rbd_assert(list_empty(&rbd_dev->running_list));
      
      in rbd_try_acquire_lock().
      
      Cc: stable@vger.kernel.org
      Fixes: 637cd060 ("rbd: new exclusive lock wait/wake code")
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: don't abort filesystem when attempting to snapshot deleted subvolume · 6e6bca99
      Omar Sandoval authored
      
      commit 7081929a upstream.
      
      If the source file descriptor to the snapshot ioctl refers to a deleted
      subvolume, we get the following abort:
      
        BTRFS: Transaction aborted (error -2)
        WARNING: CPU: 0 PID: 833 at fs/btrfs/transaction.c:1875 create_pending_snapshot+0x1040/0x1190 [btrfs]
        Modules linked in: pata_acpi btrfs ata_piix libata scsi_mod virtio_net blake2b_generic xor net_failover virtio_rng failover scsi_common rng_core raid6_pq libcrc32c
        CPU: 0 PID: 833 Comm: t_snapshot_dele Not tainted 6.7.0-rc6 #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
        RIP: 0010:create_pending_snapshot+0x1040/0x1190 [btrfs]
        RSP: 0018:ffffa09c01337af8 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: ffff9982053e7c78 RCX: 0000000000000027
        RDX: ffff99827dc20848 RSI: 0000000000000001 RDI: ffff99827dc20840
        RBP: ffffa09c01337c00 R08: 0000000000000000 R09: ffffa09c01337998
        R10: 0000000000000003 R11: ffffffffb96da248 R12: fffffffffffffffe
        R13: ffff99820535bb28 R14: ffff99820b7bd000 R15: ffff99820381ea80
        FS:  00007fe20aadabc0(0000) GS:ffff99827dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000559a120b502f CR3: 00000000055b6000 CR4: 00000000000006f0
        Call Trace:
         <TASK>
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? __warn+0x81/0x130
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? report_bug+0x171/0x1a0
         ? handle_bug+0x3a/0x70
         ? exc_invalid_op+0x17/0x70
         ? asm_exc_invalid_op+0x1a/0x20
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         ? create_pending_snapshot+0x1040/0x1190 [btrfs]
         create_pending_snapshots+0x92/0xc0 [btrfs]
         btrfs_commit_transaction+0x66b/0xf40 [btrfs]
         btrfs_mksubvol+0x301/0x4d0 [btrfs]
         btrfs_mksnapshot+0x80/0xb0 [btrfs]
         __btrfs_ioctl_snap_create+0x1c2/0x1d0 [btrfs]
         btrfs_ioctl_snap_create_v2+0xc4/0x150 [btrfs]
         btrfs_ioctl+0x8a6/0x2650 [btrfs]
         ? kmem_cache_free+0x22/0x340
         ? do_sys_openat2+0x97/0xe0
         __x64_sys_ioctl+0x97/0xd0
         do_syscall_64+0x46/0xf0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
        RIP: 0033:0x7fe20abe83af
        RSP: 002b:00007ffe6eff1360 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fe20abe83af
        RDX: 00007ffe6eff23c0 RSI: 0000000050009417 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000000 R09: 00007fe20ad16cd0
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        R13: 00007ffe6eff13c0 R14: 00007fe20ad45000 R15: 0000559a120b6d58
         </TASK>
        ---[ end trace 0000000000000000 ]---
        BTRFS: error (device vdc: state A) in create_pending_snapshot:1875: errno=-2 No such entry
        BTRFS info (device vdc: state EA): forced readonly
        BTRFS warning (device vdc: state EA): Skipping commit of aborted transaction.
        BTRFS: error (device vdc: state EA) in cleanup_transaction:2055: errno=-2 No such entry
      
      This happens because create_pending_snapshot() initializes the new root
      item as a copy of the source root item. This includes the refs field,
      which is 0 for a deleted subvolume. The call to btrfs_insert_root()
      therefore inserts a root with refs == 0. btrfs_get_new_fs_root() then
      finds the root and returns -ENOENT if refs == 0, which causes
      create_pending_snapshot() to abort.
      
      Fix it by checking the source root's refs before attempting the
      snapshot, but after locking subvol_sem to avoid racing with deletion.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args · 52e02f26
      Qu Wenruo authored
      
      commit 173431b2 upstream.
      
      Add extra sanity check for btrfs_ioctl_defrag_range_args::flags.
      
      This is not really to enhance fuzzing tests, but as a preparation for
      future expansion on btrfs_ioctl_defrag_range_args.
      
      In the future we're going to add new members, allowing more fine tuning
      for btrfs defrag.  Without the -EOPNOTSUPP error, there would be no way
      to detect if the kernel supports those new defrag features.
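
The check described above follows a common pattern, sketched here in user-space C. The `DEMO_*` names and `demo_check_flags()` are hypothetical and do not reproduce the btrfs definitions; only `EOPNOTSUPP` from `<errno.h>` is a real identifier. Any bit outside the supported mask is refused, so a caller can probe for a newer flag and learn from the error that the kernel does not implement it.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits; the real btrfs flag values are not reproduced. */
#define DEMO_FLAG_COMPRESS	(1ULL << 0)
#define DEMO_FLAG_START_IO	(1ULL << 1)

/* Every bit this version understands. */
#define DEMO_FLAGS_SUPPORTED	(DEMO_FLAG_COMPRESS | DEMO_FLAG_START_IO)

/* Return 0 if only known bits are set, -EOPNOTSUPP otherwise. */
static int demo_check_flags(uint64_t flags)
{
	if (flags & ~DEMO_FLAGS_SUPPORTED)
		return -EOPNOTSUPP;
	return 0;
}
```

Without such a check, unknown bits would be silently ignored and a caller could not tell an old kernel from a new one.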
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: don't warn if discard range is not aligned to sector · 86aff7c5
      David Sterba authored
      
      commit a208b3f1 upstream.
      
      There's a warning in btrfs_issue_discard() when the range is not aligned
      to 512 bytes, originally added in 4d89d377 ("btrfs:
      btrfs_issue_discard ensure offset/length are aligned to sector
      boundaries"). We can't do sub-sector writes anyway so the adjustment is
      the only thing that we can do and the warning is unnecessary.
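
The adjustment mentioned above can be illustrated with a small helper. This is a sketch, not the btrfs code: `demo_align_to_sectors()` and `struct demo_range` are hypothetical, and it assumes the range is shrunk inward (start rounded up, end rounded down) and that `start + len` does not overflow.

```c
#include <stdint.h>

#define DEMO_SECTOR_SIZE 512u

struct demo_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Shrink [start, start + len) to the largest sector-aligned range it
 * contains. Returns a zero-length range if no whole sector fits.
 */
static struct demo_range demo_align_to_sectors(uint64_t start, uint64_t len)
{
	const uint64_t mask = ~(uint64_t)(DEMO_SECTOR_SIZE - 1);
	struct demo_range r = { 0, 0 };
	uint64_t aligned_start = (start + DEMO_SECTOR_SIZE - 1) & mask;
	uint64_t aligned_end = (start + len) & mask;

	if (aligned_end > aligned_start) {
		r.start = aligned_start;
		r.len = aligned_end - aligned_start;
	}
	return r;
}
```

For example, a range starting at byte 100 with length 2000 shrinks to start 512 with length 1536; nothing about that outcome calls for a warning.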
      
      CC: stable@vger.kernel.org # 4.19+
      Reported-by: <syzbot+4a4f1eba14eb5c3417d1@syzkaller.appspotmail.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: tree-checker: fix inline ref size in error messages · b60f748a
      Chung-Chiang Cheng authored
      
      commit f398e70d upstream.
      
      The error message should accurately reflect the size rather than the
      type.
      
      Fixes: f82d1c7c ("btrfs: tree-checker: Add EXTENT_ITEM and METADATA_ITEM check")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: ref-verify: free ref cache before clearing mount opt · c91c247b
      Fedor Pchelkin authored
      
      commit f03e274a upstream.
      
      As clearing REF_VERIFY mount option indicates there were some errors in a
      ref-verify process, a ref cache is not relevant anymore and should be
      freed.
      
      btrfs_free_ref_cache() requires REF_VERIFY option being set so call
      it just before clearing the mount option.
      
      Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
      
      Reported-by: <syzbot+be14ed7728594dc8bd42@syzkaller.appspotmail.com>
      Fixes: fd708b81 ("Btrfs: add a extent ref verify tool")
      CC: stable@vger.kernel.org # 5.4+
      Closes: https://lore.kernel.org/lkml/000000000000e5a65c05ee832054@google.com/
      Reported-by: <syzbot+c563a3c79927971f950f@syzkaller.appspotmail.com>
      Closes: https://lore.kernel.org/lkml/0000000000007fe09705fdc6086c@google.com/
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted · 9ebd514f
      Omar Sandoval authored
      
      commit 3324d054 upstream.
      
      Sweet Tea spotted a race between subvolume deletion and snapshotting
      that can result in the root item for the snapshot having the
      BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:
      
      Thread 1                                      | Thread 2
      ----------------------------------------------|----------
      btrfs_delete_subvolume                        |
        btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
                                                    |btrfs_mksubvol
                                                    |  down_read(subvol_sem)
                                                    |  create_snapshot
                                                    |    ...
                                                    |    create_pending_snapshot
                                                    |      copy root item from source
        down_write(subvol_sem)                      |
      
      This flag is only checked in send and swap activate, which this would
      cause to fail mysteriously.
      
      create_snapshot() now checks the root refs to reject a deleted
      subvolume, so we can fix this by locking subvol_sem earlier so that the
      BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • nbd: always initialize struct msghdr completely · d9c54763
      Eric Dumazet authored
      
      commit 78fbb92a upstream.
      
      syzbot complains that msg->msg_get_inq value can be uninitialized [1]
      
      struct msghdr got many new fields recently, so we should always make
      sure their values are zero by default.
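
The general idea can be shown with a hypothetical structure. `struct demo_msghdr` below is not the kernel's `struct msghdr`; it only mimics a header that gained a field (`get_inq`) after its callers were written. Initializing with a designated initializer zeroes every member that is not named, including ones added later.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message header; get_inq plays the role of a newer field. */
struct demo_msghdr {
	void *name;
	size_t namelen;
	int flags;
	bool get_inq;
};

/* Build a header for the given flags with all other members zeroed. */
static struct demo_msghdr demo_make_msg(int flags)
{
	/* Members not listed in the initializer are implicitly set to zero. */
	struct demo_msghdr msg = { .flags = flags };

	return msg;
}
```

Assigning fields one by one on an uninitialized local leaves any member the caller does not know about with an indeterminate value, which is the kind of read KMSAN reports.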
      
      [1]
       BUG: KMSAN: uninit-value in tcp_recvmsg+0x686/0xac0 net/ipv4/tcp.c:2571
        tcp_recvmsg+0x686/0xac0 net/ipv4/tcp.c:2571
        inet_recvmsg+0x131/0x580 net/ipv4/af_inet.c:879
        sock_recvmsg_nosec net/socket.c:1044 [inline]
        sock_recvmsg+0x12b/0x1e0 net/socket.c:1066
        __sock_xmit+0x236/0x5c0 drivers/block/nbd.c:538
        nbd_read_reply drivers/block/nbd.c:732 [inline]
        recv_work+0x262/0x3100 drivers/block/nbd.c:863
        process_one_work kernel/workqueue.c:2627 [inline]
        process_scheduled_works+0x104e/0x1e70 kernel/workqueue.c:2700
        worker_thread+0xf45/0x1490 kernel/workqueue.c:2781
        kthread+0x3ed/0x540 kernel/kthread.c:388
        ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
        ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
      
      Local variable msg created at:
        __sock_xmit+0x4c/0x5c0 drivers/block/nbd.c:513
        nbd_read_reply drivers/block/nbd.c:732 [inline]
        recv_work+0x262/0x3100 drivers/block/nbd.c:863
      
      CPU: 1 PID: 7465 Comm: kworker/u5:1 Not tainted 6.7.0-rc7-syzkaller-00041-gf016f7547aee #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
      Workqueue: nbd5-recv recv_work
      
      Fixes: f94fd25c ("tcp: pass back data left in socket after receive")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: stable@vger.kernel.org
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: nbd@other.debian.org
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240112132657.647112-1-edumazet@google.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: fec: fix the unhandled context fault from smmu · 0a5a083c
      Shenwei Wang authored
      
      [ Upstream commit 5e344807 ]
      
      When repeatedly changing the interface link speed using the command below:
      
      ethtool -s eth0 speed 100 duplex full
      ethtool -s eth0 speed 1000 duplex full
      
      The following errors may sometimes be reported by the ARM SMMU driver:
      
      [ 5395.035364] fec 5b040000.ethernet eth0: Link is Down
      [ 5395.039255] arm-smmu 51400000.iommu: Unhandled context fault:
      fsr=0x402, iova=0x00000000, fsynr=0x100001, cbfrsynra=0x852, cb=2
      [ 5398.108460] fec 5b040000.ethernet eth0: Link is Up - 100Mbps/Full -
      flow control off
      
      It is identified that the FEC driver does not properly stop the TX queue
      during the link speed transitions, and this results in the invalid virtual
      I/O address translations from the SMMU and causes the context faults.
      
      Fixes: dbc64a8e ("net: fec: move calls to quiesce/resume packet processing out of fec_restart()")
      Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
      Link: https://lore.kernel.org/r/20240123165141.2008104-1-shenwei.wang@nxp.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • fjes: fix memleaks in fjes_hw_setup · 5b1086d2
      Zhipeng Lu authored
      
      [ Upstream commit f6cc4b6a ]
      
      fjes_hw_setup allocates several memory regions and defers their
      deallocation to fjes_hw_exit in fjes_probe through the following call chain:
      
      fjes_probe
        |-> fjes_hw_init
              |-> fjes_hw_setup
        |-> fjes_hw_exit
      
      However, when fjes_hw_setup fails, fjes_hw_exit won't be called and thus
      all the resources allocated in fjes_hw_setup will be leaked. This patch
      frees those resources in fjes_hw_setup on failure to prevent such leaks.
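
A minimal user-space sketch of this kind of fix follows. `struct demo_hw`, `demo_hw_setup()` and `demo_hw_exit()` are hypothetical stand-ins, not the fjes functions, and the `fail_second` parameter exists only to simulate an allocation failure. The setup routine unwinds its own partial work on failure, so the caller never needs the exit routine for a setup that did not succeed.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical device state holding two allocations. */
struct demo_hw {
	void *buf_a;
	void *buf_b;
};

/* Allocate both buffers; on any failure, free what was already taken. */
static int demo_hw_setup(struct demo_hw *hw, int fail_second)
{
	hw->buf_a = malloc(64);
	if (!hw->buf_a)
		return -ENOMEM;

	/* fail_second simulates the second allocation failing. */
	hw->buf_b = fail_second ? NULL : malloc(64);
	if (!hw->buf_b)
		goto free_a;

	return 0;

free_a:
	free(hw->buf_a);
	hw->buf_a = NULL;
	return -ENOMEM;
}

/* Teardown for a setup that succeeded. */
static void demo_hw_exit(struct demo_hw *hw)
{
	free(hw->buf_b);
	free(hw->buf_a);
	hw->buf_a = NULL;
	hw->buf_b = NULL;
}
```

The labelled unwind keeps the error path in one place, which is the usual kernel idiom for multi-step allocation.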
      
      Fixes: 2fcbca68 ("fjes: platform_driver's .probe and .remove routine")
      Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240122172445.3841883-1-alexious@zju.edu.cn
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • selftests: netdevsim: fix the udp_tunnel_nic test · 4b4dcb3f
      Jakub Kicinski authored
      
      [ Upstream commit 0879020a ]
      
      This test is missing a whole bunch of checks for interface
      renaming and one ifup. Presumably it was only used on a system
      with renaming disabled and NetworkManager running.
      
      Fixes: 91f430b2 ("selftests: net: add a test for UDP tunnel info infra")
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240123060529.1033912-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: mvpp2: clear BM pool before initialization · cec65f09
      Jenishkumar Maheshbhai Patel authored
      
      [ Upstream commit 9f538b41 ]
      
      Register values persist after booting the kernel using
      kexec, which results in a kernel panic. Thus clear the
      BM pool registers before initialisation to fix the issue.
      
      Fixes: 3f518509 ("ethernet: Add new driver for Marvell Armada 375 network unit")
      Signed-off-by: Jenishkumar Maheshbhai Patel <jpatel2@marvell.com>
      Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Link: https://lore.kernel.org/r/20240119035914.2595665-1-jpatel2@marvell.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: stmmac: Wait a bit for the reset to take effect · acb6eaf2
      Bernd Edlinger authored
      
      [ Upstream commit a5f5eee2 ]
      
      otherwise the synopsys_id value may be read out wrong,
      because the GMAC_VERSION register might still be in reset
      state, for at least 1 us after the reset is de-asserted.
      
      Add a wait for 10 us before continuing to be on the safe side.
      
      > From what have you got that delay value?
      
      Just trial and error, with very old linux versions and old gcc versions
      the synopsys_id was read out correctly most of the time (but not always),
      with recent linux versions and recent gcc versions it was read out
      wrongly most of the time, but again not always.
      I don't have access to the VHDL code in question, so I cannot
      tell why it takes so long to get the correct values, I also do not
      have more than a few hardware samples, so I cannot tell how long
      this timeout must be in worst case.
      Experimentally I can tell that the register is read several times
      as zero immediately after the reset is de-asserted, also adding several
      no-ops is not enough, adding a printk is enough, also udelay(1) seems to
      be enough but I tried that not very often, and I do not have access to many
      hardware samples to be 100% sure about the necessary delay.
      And since the udelay here is only executed once per device instance,
      it seems acceptable to delay the boot for 10 us.
      
      BTW: my hardware's synopsys id is 0x37.
      
      Fixes: c5e4ddbd ("net: stmmac: Add support for optional reset control")
      Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
      Link: https://lore.kernel.org/r/AS8P193MB1285A810BD78C111E7F6AA34E4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • netfilter: nf_tables: validate NFPROTO_* family · 67ee3736
      Pablo Neira Ayuso authored
      
      [ Upstream commit d0009eff ]
      
      Several expressions explicitly refer to NF_INET_* hook definitions
      from expr->ops->validate, however, family is not validated.
      
      Bail out with EOPNOTSUPP in case they are used from unsupported
      families.
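
The shape of such a check can be sketched as below. All names are hypothetical: `enum demo_family` does not reproduce the kernel's NFPROTO_* values, and the set of accepted families is an assumption, since the commit message does not list which families each expression supports.

```c
#include <errno.h>

/* Hypothetical family identifiers, for illustration only. */
enum demo_family {
	DEMO_FAMILY_IPV4,
	DEMO_FAMILY_IPV6,
	DEMO_FAMILY_INET,
	DEMO_FAMILY_ARP,
	DEMO_FAMILY_NETDEV,
};

/* Validate callback for an expression that relies on inet-style hooks. */
static int demo_expr_validate(enum demo_family family)
{
	switch (family) {
	case DEMO_FAMILY_IPV4:
	case DEMO_FAMILY_IPV6:
	case DEMO_FAMILY_INET:
		/* Hook numbers mean what the expression assumes. */
		return 0;
	default:
		/* Refuse instead of interpreting foreign hook numbers. */
		return -EOPNOTSUPP;
	}
}
```

Checking the family before looking at hook numbers matters because the same numeric hook value can mean something different in another family.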
      
      Fixes: 0ca743a5 ("netfilter: nf_tables: add compatibility layer for x_tables")
      Fixes: a3c90f7a ("netfilter: nf_tables: flow offload expression")
      Fixes: 2fa84193 ("netfilter: nf_tables: introduce routing expression")
      Fixes: 554ced0a ("netfilter: nf_tables: add support for native socket matching")
      Fixes: ad49d86e ("netfilter: nf_tables: Add synproxy support")
      Fixes: 4ed8eb65 ("netfilter: nf_tables: Add native tproxy support")
      Fixes: 6c472602 ("netfilter: nf_tables: add xfrm expression")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • netfilter: nf_tables: restrict anonymous set and map names to 16 bytes · ed5b62bb
      Florian Westphal authored
      
      [ Upstream commit b462579b ]
      
      nftables has two types of sets/maps, one where userspace defines the
      name, and anonymous sets/maps, where userspace defines a template name.
      
      For the latter, kernel requires presence of exactly one "%d".
      nftables uses "__set%d" and "__map%d" for this.  The kernel will
      expand the format specifier and replace it with the smallest unused
      number.

      As-is, userspace could define a template name that pushes the set
      name past the 256-byte upper limit (post-expansion).
      
      I don't see how this could be a problem, but I would prefer if userspace
      cannot do this, so add a limit of 16 bytes for the '%d' template name.
      
      16 bytes is the old total upper limit for set names that existed when
      nf_tables was merged initially.
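
The two rules described above (exactly one "%d", and a 16-byte cap on the template) can be sketched in user-space C. `demo_check_template()` and `DEMO_MAX_ANON_LEN` are hypothetical names, and the sketch assumes the 16-byte limit includes the terminating NUL; the commit message does not say which way the kernel counts.

```c
#include <errno.h>
#include <string.h>

/* Assumed cap on the template name, including the terminating NUL. */
#define DEMO_MAX_ANON_LEN 16

/* Return 0 if name is an acceptable anonymous-set template, else -EINVAL. */
static int demo_check_template(const char *name)
{
	const char *p = strchr(name, '%');

	/* Exactly one format specifier, and it must be "%d". */
	if (!p || p[1] != 'd' || strchr(p + 2, '%'))
		return -EINVAL;

	/* Keep the template short so the expanded name stays bounded. */
	if (strlen(name) >= DEMO_MAX_ANON_LEN)
		return -EINVAL;

	return 0;
}
```

Templates such as "__set%d" and "__map%d" pass, while a name with no specifier, a second specifier, or 16 or more characters is refused.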
      
      Fixes: 38745490 ("netfilter: nf_tables: Allow set names of up to 255 chars")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • btrfs: fix race between reading a directory and adding entries to it · c25d7922
      Filipe Manana authored
      
      commit 8e7f82de upstream.
      
      When opening a directory (opendir(3)) or rewinding it (rewinddir(3)), we
      are not holding the directory's inode locked, and this can result in later
      attempting to add two entries to the directory with the same index number,
      resulting in a transaction abort, with -EEXIST (-17), when inserting the
      second delayed dir index. This results in a trace like the following:
      
        Sep 11 22:34:59 myhostname kernel: BTRFS error (device dm-3): err add delayed dir index item(name: cockroach-stderr.log) into the insertion tree of the delayed node(root id: 5, inode id: 4539217, errno: -17)
        Sep 11 22:34:59 myhostname kernel: ------------[ cut here ]------------
        Sep 11 22:34:59 myhostname kernel: kernel BUG at fs/btrfs/delayed-inode.c:1504!
        Sep 11 22:34:59 myhostname kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        Sep 11 22:34:59 myhostname kernel: CPU: 0 PID: 7159 Comm: cockroach Not tainted 6.4.15-200.fc38.x86_64 #1
        Sep 11 22:34:59 myhostname kernel: Hardware name: ASUS ESC500 G3/P9D WS, BIOS 2402 06/27/2018
        Sep 11 22:34:59 myhostname kernel: RIP: 0010:btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel: Code: eb dd 48 (...)
        Sep 11 22:34:59 myhostname kernel: RSP: 0000:ffffa9980e0fbb28 EFLAGS: 00010282
        Sep 11 22:34:59 myhostname kernel: RAX: 0000000000000000 RBX: ffff8b10b8f4a3c0 RCX: 0000000000000000
        Sep 11 22:34:59 myhostname kernel: RDX: 0000000000000000 RSI: ffff8b177ec21540 RDI: ffff8b177ec21540
        Sep 11 22:34:59 myhostname kernel: RBP: ffff8b110cf80888 R08: 0000000000000000 R09: ffffa9980e0fb938
        Sep 11 22:34:59 myhostname kernel: R10: 0000000000000003 R11: ffffffff86146508 R12: 0000000000000014
        Sep 11 22:34:59 myhostname kernel: R13: ffff8b1131ae5b40 R14: ffff8b10b8f4a418 R15: 00000000ffffffef
        Sep 11 22:34:59 myhostname kernel: FS:  00007fb14a7fe6c0(0000) GS:ffff8b177ec00000(0000) knlGS:0000000000000000
        Sep 11 22:34:59 myhostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        Sep 11 22:34:59 myhostname kernel: CR2: 000000c00143d000 CR3: 00000001b3b4e002 CR4: 00000000001706f0
        Sep 11 22:34:59 myhostname kernel: Call Trace:
        Sep 11 22:34:59 myhostname kernel:  <TASK>
        Sep 11 22:34:59 myhostname kernel:  ? die+0x36/0x90
        Sep 11 22:34:59 myhostname kernel:  ? do_trap+0xda/0x100
        Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel:  ? do_error_trap+0x6a/0x90
        Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel:  ? exc_invalid_op+0x50/0x70
        Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel:  ? asm_exc_invalid_op+0x1a/0x20
        Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
        Sep 11 22:34:59 myhostname kernel:  btrfs_insert_dir_item+0x200/0x280
        Sep 11 22:34:59 myhostname kernel:  btrfs_add_link+0xab/0x4f0
        Sep 11 22:34:59 myhostname kernel:  ? ktime_get_real_ts64+0x47/0xe0
        Sep 11 22:34:59 myhostname kernel:  btrfs_create_new_inode+0x7cd/0xa80
        Sep 11 22:34:59 myhostname kernel:  btrfs_symlink+0x190/0x4d0
        Sep 11 22:34:59 myhostname kernel:  ? schedule+0x5e/0xd0
        Sep 11 22:34:59 myhostname kernel:  ? __d_lookup+0x7e/0xc0
        Sep 11 22:34:59 myhostname kernel:  vfs_symlink+0x148/0x1e0
        Sep 11 22:34:59 myhostname kernel:  do_symlinkat+0x130/0x140
        Sep 11 22:34:59 myhostname kernel:  __x64_sys_symlinkat+0x3d/0x50
        Sep 11 22:34:59 myhostname kernel:  do_syscall_64+0x5d/0x90
        Sep 11 22:34:59 myhostname kernel:  ? syscall_exit_to_user_mode+0x2b/0x40
        Sep 11 22:34:59 myhostname kernel:  ? do_syscall_64+0x6c/0x90
        Sep 11 22:34:59 myhostname kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      The race leading to the problem happens like this:
      
      1) Directory inode X is loaded into memory, its ->index_cnt field is
         initialized to (u64)-1 (at btrfs_alloc_inode());
      
      2) Task A is adding a new file to directory X, holding its vfs inode lock,
         and calls btrfs_set_inode_index() to get an index number for the entry.
      
         Because the inode's index_cnt field is set to (u64)-1 it calls
         btrfs_inode_delayed_dir_index_count() which fails because no dir index
         entries were added yet to the delayed inode and then it calls
         btrfs_set_inode_index_count(). This function finds the last dir index
         key and then sets index_cnt to that index value + 1. It found that the
         last index key has an offset of 100. However before it assigns a value
         of 101 to index_cnt...
      
      3) Task B calls opendir(3), ending up at btrfs_opendir(), where the VFS
         lock for inode X is not taken, so it calls btrfs_get_dir_last_index()
         and sees index_cnt still with a value of (u64)-1. Because of that it
         calls btrfs_inode_delayed_dir_index_count() which fails since no dir
         index entries were added to the delayed inode yet, and then it also
         calls btrfs_set_inode_index_count(). This also finds that the last
         index key has an offset of 100, and before it assigns the value 101
         to the index_cnt field of inode X...
      
      4) Task A assigns a value of 101 to index_cnt. And then the code flow
         goes to btrfs_set_inode_index() where it increments index_cnt from
         101 to 102. Task A then creates a delayed dir index entry with a
         sequence number of 101 and adds it to the delayed inode;
      
      5) Task B assigns 101 to the index_cnt field of inode X;
      
      6) At some later point when someone tries to add a new entry to the
         directory, btrfs_set_inode_index() will return 101 again and shortly
         after an attempt to add another delayed dir index key with index
         number 101 will fail with -EEXIST resulting in a transaction abort.
      
      Fix this by locking the inode at btrfs_get_dir_last_index(), which is
      only used when opening a directory or attempting to lseek on it.
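
The interleaving in steps 2) to 6) can be replayed deterministically on a single thread. The sketch below is a simplified model, not btrfs code: `struct demo_dir` and the `demo_*` helpers are hypothetical, and the "lock" in the serialized variant is represented only by the ordering of statements.

```c
#include <stdint.h>

#define DEMO_UNSET UINT64_MAX	/* plays the role of (u64)-1 */

struct demo_dir {
	uint64_t index_cnt;		/* next index to hand out, or DEMO_UNSET */
	uint64_t last_key_offset;	/* offset of the last dir index key */
};

/* Read half of the lazy init: the value index_cnt should get. */
static uint64_t demo_compute_index_cnt(const struct demo_dir *d)
{
	return d->last_key_offset + 1;
}

/* Hand out the next index, as a task adding an entry would. */
static uint64_t demo_take_index(struct demo_dir *d)
{
	return d->index_cnt++;
}

/* Unlocked interleaving; returns 1 if one index is handed out twice. */
static int demo_replay_race(void)
{
	struct demo_dir d = { DEMO_UNSET, 100 };
	uint64_t a_val = demo_compute_index_cnt(&d);	/* step 2: A computes 101 */
	uint64_t b_val = demo_compute_index_cnt(&d);	/* step 3: B computes 101 */
	uint64_t first, second;

	d.index_cnt = a_val;			/* step 4: A stores 101 ... */
	first = demo_take_index(&d);		/* ... and uses it (count is 102) */
	d.index_cnt = b_val;			/* step 5: B overwrites with 101 */
	second = demo_take_index(&d);		/* step 6: 101 is handed out again */
	return first == second;
}

/* Same work with each check-and-init done as one unit; returns 0. */
static int demo_replay_serialized(void)
{
	struct demo_dir d = { DEMO_UNSET, 100 };
	uint64_t first, second;

	if (d.index_cnt == DEMO_UNSET)		/* A initializes under the lock */
		d.index_cnt = demo_compute_index_cnt(&d);
	first = demo_take_index(&d);
	if (d.index_cnt == DEMO_UNSET)		/* B now sees it already set */
		d.index_cnt = demo_compute_index_cnt(&d);
	second = demo_take_index(&d);
	return first == second;
}
```

Once the check and the store happen as one unit, the second task finds index_cnt already initialized and never writes the stale value back.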
      
      Reported-by: ken <ken@bllue.org>
      Link: https://lore.kernel.org/linux-btrfs/CAE6xmH+Lp=Q=E61bU+v9eWX8gYfLvu6jLYxjxjFpo3zHVPR0EQ@mail.gmail.com/
      
      
      Reported-by: <syzbot+d13490c82ad5353c779d@syzkaller.appspotmail.com>
      Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
      
      
      Fixes: 9b378f6a ("btrfs: fix infinite directory reads")
      CC: stable@vger.kernel.org # 6.5+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c25d7922
    • Filipe Manana's avatar
      btrfs: refresh dir last index during a rewinddir(3) call · fd968e68
      Filipe Manana authored
      
      
      commit e60aa5da upstream.
      
      When opening a directory we find what's the index of its last entry and
      then store it in the directory's file handle private data (struct
      btrfs_file_private::last_index), so that in the case new directory entries
      are added to a directory after an opendir(3) call we don't end up in an
      infinite loop (see commit 9b378f6a ("btrfs: fix infinite directory
      reads")) when calling readdir(3).
      
      However, once rewinddir(3) is called, POSIX states [1] that any new
      directory entries added after the previous opendir(3) call must be
      returned by subsequent calls to readdir(3):
      
        "The rewinddir() function shall reset the position of the directory
         stream to which dirp refers to the beginning of the directory.
         It shall also cause the directory stream to refer to the current
         state of the corresponding directory, as a call to opendir() would
         have done."
      
      We currently don't refresh the last_index field of the struct
      btrfs_file_private associated with the directory, so after a rewinddir(3)
      we are not returning any new entries added after the opendir(3) call.
      
      Fix this by finding the current last index of the directory when llseek
      is called against the directory.
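
      The intended semantics can be sketched with a toy directory stream. The
      model_* names below are hypothetical and this is not the btrfs
      implementation: the stream snapshots the end of the directory when it is
      opened, and a rewind takes a fresh snapshot.

```c
#include <assert.h>
#include <stddef.h>

/* Toy directory: entries are simply the indexes 0 .. nr_entries - 1. */
struct model_dir {
    size_t nr_entries;
};

/* Toy directory stream with a snapshot of where iteration must stop. */
struct model_stream {
    const struct model_dir *dir;
    size_t pos;  /* next index to return */
    size_t end;  /* snapshot: first index that must not be returned */
};

void model_open(struct model_stream *s, const struct model_dir *d)
{
    assert(s != NULL && d != NULL);
    s->dir = d;
    s->pos = 0;
    s->end = d->nr_entries;      /* snapshot taken at open time */
}

void model_rewind(struct model_stream *s)
{
    assert(s != NULL);
    s->pos = 0;
    s->end = s->dir->nr_entries; /* refresh the snapshot on rewind */
}

/* Returns 1 and stores the entry index in *idx, or 0 at end of stream. */
int model_read(struct model_stream *s, size_t *idx)
{
    if (s->pos >= s->end)
        return 0;
    *idx = s->pos++;
    return 1;
}
```

      In this model, entries added after model_open() stay hidden until
      model_rewind() is called, which is the behaviour the quoted POSIX text
      asks for.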
      
      This can be reproduced by the following C program provided by Ian Johnson:
      
         #include <dirent.h>
         #include <stdio.h>
      
         int main(void) {
           DIR *dir = opendir("test");
      
           FILE *file;
           file = fopen("test/1", "w");
           fwrite("1", 1, 1, file);
           fclose(file);
      
           file = fopen("test/2", "w");
           fwrite("2", 1, 1, file);
           fclose(file);
      
           rewinddir(dir);
      
           struct dirent *entry;
           while ((entry = readdir(dir))) {
              printf("%s\n", entry->d_name);
           }
           closedir(dir);
           return 0;
         }
      
      Reported-by: Ian Johnson <ian@ianjohnson.dev>
      Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
      
      
      Fixes: 9b378f6a ("btrfs: fix infinite directory reads")
      CC: stable@vger.kernel.org # 6.5+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fd968e68
    • Filipe Manana's avatar
      btrfs: set last dir index to the current last index when opening dir · a045b6b1
      Filipe Manana authored
      commit 35795036 upstream.
      
      When opening a directory for reading, we set the last index where we
      stop iteration to the value in struct btrfs_inode::index_cnt. That value
      does not match the index of the most recently added directory entry;
      instead it is the index number that will be assigned to the next
      directory entry.
      
      This means that if after the call to opendir(3) new directory entries are
      added, a readdir(3) call will return the first new directory entry. This
      is fine because POSIX says the following [1]:
      
        "If a file is removed from or added to the directory after the most
         recent call to opendir() or rewinddir(), whether a subsequent call to
         readdir() returns an entry for that file is unspecified."
      
      For example for the test script from commit 9b378f6a ("btrfs: fix
      infinite directory reads"), where we have 2000 files in a directory, ext4
      doesn't return any new directory entry after opendir(3), while xfs returns
      the first 13 new directory entries added after the opendir(3) call.
      
      If we move to a shorter example with an empty directory when opendir(3) is
      called, and 2 files added to the directory after the opendir(3) call, then
      readdir(3) on btrfs will return the first file, ext4 and xfs return the 2
      files (but in a different order). A test program for this, reported by
      Ian Johnson, is the following:
      
         #include <dirent.h>
         #include <stdio.h>
      
         int main(void) {
           DIR *dir = opendir("test");
      
           FILE *file;
           file = fopen("test/1", "w");
           fwrite("1", 1, 1, file);
           fclose(file);
      
           file = fopen("test/2", "w");
           fwrite("2", 1, 1, file);
           fclose(file);
      
           struct dirent *entry;
           while ((entry = readdir(dir))) {
              printf("%s\n", entry->d_name);
           }
           closedir(dir);
           return 0;
         }
      
      To make this less odd, change the behaviour to never return new entries
      that were added after the opendir(3) call. This is done by setting the
      last_index field of the struct btrfs_file_private attached to the
      directory's file handle with a value matching btrfs_inode::index_cnt
      minus 1, since that value always matches the index of the next new
      directory entry and not the index of the most recently added entry.
      
      [1] https://pubs.opengroup.org/onlinepubs/007904875/functions/readdir_r.html
      
      Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
      
      
      CC: stable@vger.kernel.org # 6.5+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a045b6b1
    • Filipe Manana's avatar
      btrfs: fix infinite directory reads · 2aa515b5
      Filipe Manana authored
      
      
      commit 9b378f6a upstream.
      
      The readdir implementation currently always processes entries up to the
      last index it finds. This however can result in an infinite loop if the
      directory has a large number of entries such that they won't all fit in
      the buffer passed to the readdir callback, that is, dir_emit() fails
      (returns false). In that case readdir() will be called again, and if in
      the meantime new directory entries were added and we still can't put all
      the remaining entries in the buffer, we keep repeating this over and over.
      
      The following C program and test script reproduce the problem:
      
        $ cat /mnt/readdir_prog.c
        #include <sys/types.h>
        #include <dirent.h>
        #include <stdio.h>
      
        int main(int argc, char *argv[])
        {
          DIR *dir = opendir(".");
          struct dirent *dd;
      
          while ((dd = readdir(dir))) {
            printf("%s\n", dd->d_name);
            rename(dd->d_name, "TEMPFILE");
            rename("TEMPFILE", dd->d_name);
          }
          closedir(dir);
        }
      
        $ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV &> /dev/null
        #mkfs.xfs -f $DEV &> /dev/null
        #mkfs.ext4 -F $DEV &> /dev/null
      
        mount $DEV $MNT
      
        mkdir $MNT/testdir
        for ((i = 1; i <= 2000; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        cd $MNT/testdir
        /mnt/readdir_prog
      
        cd /mnt
      
        umount $MNT
      
      This behaviour is surprising to applications and it's unlike ext4, xfs,
      tmpfs, vfat and other filesystems, which always finish. In this case where
      new entries were added due to renames, some file names may be reported
      more than once, but this varies according to each filesystem - for example
      ext4 never reported the same file more than once while xfs reports the
      first 13 file names twice.
      
      So change our readdir implementation to track the last index number when
      opendir() is called and then make readdir() never process beyond that
      index number. This gives the same behaviour as ext4.
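
      A rough simulation shows why bounding the iteration makes it terminate.
      It is a simplified model with assumed behaviour, not the kernel code: it
      assumes every entry returned by readdir() is renamed and thereby re-added
      at a fresh index at the end of the directory, and sim_readdir() and its
      parameters are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns how many entries a reader sees. max_steps caps the simulation so
 * that the unbounded variant also returns.
 */
uint64_t sim_readdir(uint64_t nr_files, bool bounded, uint64_t max_steps)
{
    uint64_t next_index, last_index;
    uint64_t pos = 0, returned = 0;

    assert(nr_files > 0);
    next_index = nr_files;      /* index the next new entry will get */
    last_index = nr_files - 1;  /* snapshot taken when the dir is opened */

    while (returned < max_steps) {
        /* Old behaviour: chase the live end. New behaviour: stop at snapshot. */
        uint64_t limit = bounded ? last_index : next_index - 1;

        if (pos > limit)
            break;
        pos++;
        returned++;
        next_index++;  /* the rename re-adds the entry at a fresh index */
    }
    return returned;
}
```

      In the unbounded case the limit moves forward as fast as the reader does,
      so only the step cap stops the loop.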
      
      Reported-by: Rob Landley <rob@landley.net>
      Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681
      
      
      CC: stable@vger.kernel.org # 5.15
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2aa515b5
    • Florian Westphal's avatar
      netfilter: nft_limit: reject configurations that cause integer overflow · bc6e242b
      Florian Westphal authored
      
      
      [ Upstream commit c9d9eb9c ]
      
      Reject bogus configs where the internal token counter wraps around.
      This only occurs with very large requests, such as 17gbyte/s.
      
      It's better to reject this than to apply an incorrect ratelimit.
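
      A minimal sketch of this kind of check, using the GCC/Clang overflow
      builtins. The function name, the parameters and the token formula
      (nsecs * (rate + burst) / rate) are assumptions made for illustration,
      not the exact nft_limit code.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000ULL

/*
 * Computes the initial token count for a byte rate limit, or returns a
 * negative errno value if any intermediate product would wrap around.
 */
int toy_limit_tokens(uint64_t rate, uint64_t burst, uint64_t unit_secs,
                     uint64_t *tokens)
{
    uint64_t nsecs, sum, scaled;

    if (rate == 0)
        return -EINVAL;
    if (__builtin_mul_overflow(unit_secs, TOY_NSEC_PER_SEC, &nsecs))
        return -EOVERFLOW;
    if (__builtin_add_overflow(rate, burst, &sum))
        return -EOVERFLOW;
    if (__builtin_mul_overflow(nsecs, sum, &scaled))
        return -EOVERFLOW;   /* reject instead of using a wrapped value */

    *tokens = scaled / rate;
    return 0;
}
```

      Rejecting the configuration up front keeps a wrapped counter from
      silently turning a huge limit into a tiny one.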
      
      Fixes: d2168e84 ("netfilter: nft_limit: add per-byte limiting")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      bc6e242b
    • Frederic Weisbecker's avatar
      rcu: Defer RCU kthreads wakeup when CPU is dying · c817f5c0
      Frederic Weisbecker authored
      
      
      [ Upstream commit e787644c ]
      
      When the CPU goes idle for the last time during the CPU down hotplug
      process, RCU reports a final quiescent state for the current CPU. If
      this quiescent state propagates up to the top, some tasks may then be
      woken up to complete the grace period: the main grace period kthread
      and/or the expedited main workqueue (or kworker).
      
      If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
      arm the RT bandwidth timer to the local offline CPU. Since this happens
      after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
      timer gets ignored. Therefore if the RCU kthreads are waiting for RT
      bandwidth to be available, they may never be actually scheduled.
      
      This triggers TREE03 rcutorture hangs:
      
      	 rcu: INFO: rcu_preempt self-detected stall on CPU
      	 rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
      	 rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
      	 rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
      	 rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
      	 rcu: RCU grace-period kthread stack dump:
      	 task:rcu_preempt     state:R  running task     stack:14896 pid:14    tgid:14    ppid:2      flags:0x00004000
      	 Call Trace:
      	  <TASK>
      	  __schedule+0x2eb/0xa80
      	  schedule+0x1f/0x90
      	  schedule_timeout+0x163/0x270
      	  ? __pfx_process_timeout+0x10/0x10
      	  rcu_gp_fqs_loop+0x37c/0x5b0
      	  ? __pfx_rcu_gp_kthread+0x10/0x10
      	  rcu_gp_kthread+0x17c/0x200
      	  kthread+0xde/0x110
      	  ? __pfx_kthread+0x10/0x10
      	  ret_from_fork+0x2b/0x40
      	  ? __pfx_kthread+0x10/0x10
      	  ret_from_fork_asm+0x1b/0x30
      	  </TASK>
      
      The situation can't be solved with just unpinning the timer. The hrtimer
      infrastructure and the nohz heuristics involved in finding the best
      remote target for an unpinned timer would then also need to handle
      enqueues from an offline CPU in the most horrendous way.
      
      So fix this on the RCU side instead and defer the wake up to an online
      CPU if it's too late for the local one.
      
      Reported-by: Paul E. McKenney <paulmck@kernel.org>
      Fixes: 5c0930cc ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c817f5c0
    • Dinghao Liu's avatar
      net/mlx5e: fix a potential double-free in fs_any_create_groups · b2fa86b2
      Dinghao Liu authored
      
      
      [ Upstream commit aef855df ]
      
      When kcalloc() for ft->g succeeds but kvzalloc() for in fails,
      fs_any_create_groups() will free ft->g. However, its caller
      fs_any_create_table() will free ft->g again through calling
      mlx5e_destroy_flow_table(), which will lead to a double-free.
      Fix this by setting ft->g to NULL in fs_any_create_groups().
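
      The pattern can be sketched in userspace C. struct toy_table and the
      toy_* functions are hypothetical stand-ins for the driver code, and the
      fail_second_alloc flag only exists to force the error path.

```c
#include <errno.h>
#include <stdlib.h>

struct toy_table {
    int *g;  /* group array, owned by the table */
};

/* Allocates the group array plus a scratch buffer, as the driver does. */
int toy_create_groups(struct toy_table *ft, int fail_second_alloc)
{
    void *in;

    ft->g = calloc(4, sizeof(*ft->g));
    in = fail_second_alloc ? NULL : calloc(1, 64);
    if (!ft->g || !in) {
        free(ft->g);
        ft->g = NULL;  /* the fix: the caller's cleanup must not see a stale pointer */
        free(in);
        return -ENOMEM;
    }

    free(in);  /* scratch buffer is only needed during setup */
    return 0;
}

/* Caller-side cleanup, run on every error path and on teardown. */
void toy_destroy_table(struct toy_table *ft)
{
    free(ft->g);  /* free(NULL) is a no-op, so a cleared pointer is safe */
    ft->g = NULL;
}
```

      Without the ft->g = NULL assignment, toy_destroy_table() would free the
      same pointer a second time after a failed toy_create_groups().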
      
      Fixes: 0f575c20 ("net/mlx5e: Introduce Flow Steering ANY API")
      Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b2fa86b2
    • Zhipeng Lu's avatar
      net/mlx5e: fix a double-free in arfs_create_groups · 42876db0
      Zhipeng Lu authored
      
      
      [ Upstream commit 3c6d5189 ]
      
      When the kvzalloc() allocation of `in` fails, arfs_create_groups will free
      ft->g and return an error. However, arfs_create_table, the only caller of
      arfs_create_groups, will catch this error and call
      mlx5e_destroy_flow_table, in which ft->g will be freed again.
      
      Fixes: 1cabe6b0 ("net/mlx5e: Create aRFS flow tables")
      Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      42876db0
    • Leon Romanovsky's avatar
      net/mlx5e: Allow software parsing when IPsec crypto is enabled · 890881d1
      Leon Romanovsky authored
      
      
      [ Upstream commit 20f5468a ]
      
      All ConnectX devices have software parsing capability enabled, but it is
      more correct to set allow_swp only if the capability exists, which for
      IPsec means that crypto offload is supported.
      
      Fixes: 2451da08 ("net/mlx5: Unify device IPsec capabilities check")
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      890881d1
    • Rahul Rameshbabu's avatar
      net/mlx5: Use mlx5 device constant for selecting CQ period mode for ASO · 62ce1600
      Rahul Rameshbabu authored
      
      
      [ Upstream commit 20cbf8cb ]
      
      mlx5 devices have specific constants for choosing the CQ period mode. These
      constants do not have to match the constants used by the kernel software
      API for DIM period mode selection.
      
      Fixes: cdd04f4d ("net/mlx5: Add support to create SQ and CQ for ASO")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      62ce1600
    • Yevgeny Kliteynik's avatar
      net/mlx5: DR, Can't go to uplink vport on RX rule · 75d9ed49
      Yevgeny Kliteynik authored
      
      
      [ Upstream commit 5b2a2523 ]
      
      The Go-To-Vport action on RX is not allowed when the vport is the uplink.
      In such a case, the packet should be dropped.
      
      Fixes: 9db810ed ("net/mlx5: DR, Expose steering action functionality")
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      75d9ed49
    • Yevgeny Kliteynik's avatar
      net/mlx5: DR, Use the right GVMI number for drop action · e54aedd4
      Yevgeny Kliteynik authored
      
      
      [ Upstream commit 56659542 ]
      
      When FW provides ICM addresses for drop RX/TX, the provided capability
      is 64 bits that contain its GVMI as well as the ICM address itself.
      In case of TX DROP this GVMI is different from the GVMI that the
      domain is operating on.
      
      This patch fixes the action to use these GVMI IDs, as provided by FW.
      
      Fixes: 9db810ed ("net/mlx5: DR, Expose steering action functionality")
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e54aedd4
    • Zhengchao Shao's avatar
      ipv6: init the accept_queue's spinlocks in inet6_create · f11792c3
      Zhengchao Shao authored
      
      
      [ Upstream commit 435e202d ]
      
      In commit 198bc90e ("tcp: make sure init the accept_queue's spinlocks
      once"), the spinlocks of accept_queue are initialized only when a socket is
      created in the inet4 scenario. The locks are not initialized when a socket
      is created in the inet6 scenario. The kernel reports the following error:
      INFO: trying to register non-static key.
      The code is fine but needs lockdep annotation, or maybe
      you didn't initialize this object before use?
      turning off the locking correctness validator.
      Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      Call Trace:
      <TASK>
      	dump_stack_lvl (lib/dump_stack.c:107)
      	register_lock_class (kernel/locking/lockdep.c:1289)
      	__lock_acquire (kernel/locking/lockdep.c:5015)
      	lock_acquire.part.0 (kernel/locking/lockdep.c:5756)
      	_raw_spin_lock_bh (kernel/locking/spinlock.c:178)
      	inet_csk_listen_stop (net/ipv4/inet_connection_sock.c:1386)
      	tcp_disconnect (net/ipv4/tcp.c:2981)
      	inet_shutdown (net/ipv4/af_inet.c:935)
      	__sys_shutdown (./include/linux/file.h:32 net/socket.c:2438)
      	__x64_sys_shutdown (net/socket.c:2445)
      	do_syscall_64 (arch/x86/entry/common.c:52)
      	entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      RIP: 0033:0x7f52ecd05a3d
      Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7
      48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
      ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48
      RSP: 002b:00007f52ecf5dde8 EFLAGS: 00000293 ORIG_RAX: 0000000000000030
      RAX: ffffffffffffffda RBX: 00007f52ecf5e640 RCX: 00007f52ecd05a3d
      RDX: 00007f52ecc8b188 RSI: 0000000000000000 RDI: 0000000000000004
      RBP: 00007f52ecf5de20 R08: 00007ffdae45c69f R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000293 R12: 00007f52ecf5e640
      R13: 0000000000000000 R14: 00007f52ecc8b060 R15: 00007ffdae45c6e0
      
      Fixes: 198bc90e ("tcp: make sure init the accept_queue's spinlocks once")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240122102001.2851701-1-shaozhengchao@huawei.com
      
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f11792c3
    • Zhengchao Shao's avatar
      netlink: fix potential sleeping issue in mqueue_flush_file · de061604
      Zhengchao Shao authored
      
      
      [ Upstream commit 234ec0b6 ]
      
      Consider the potential sleeping issue in the following sequence:
      Thread A                                Thread B
      ...                                     netlink_create  //ref = 1
      do_mq_notify                            ...
        sock = netlink_getsockbyfilp          ...     //ref = 2
        info->notify_sock = sock;             ...
      ...                                     netlink_sendmsg
      ...                                       skb = netlink_alloc_large_skb  //skb->head is vmalloced
      ...                                       netlink_unicast
      ...                                         sk = netlink_getsockbyportid //ref = 3
      ...                                         netlink_sendskb
      ...                                           __netlink_sendskb
      ...                                             skb_queue_tail //put skb to sk_receive_queue
      ...                                         sock_put //ref = 2
      ...                                     ...
      ...                                     netlink_release
      ...                                       deferred_put_nlk_sk //ref = 1
      mqueue_flush_file
        spin_lock
        remove_notification
          netlink_sendskb
            sock_put  //ref = 0
              sk_free
                ...
                __sk_destruct
                  netlink_sock_destruct
                    skb_queue_purge  //get skb from sk_receive_queue
                      ...
                      __skb_queue_purge_reason
                        kfree_skb_reason
                          __kfree_skb
                          ...
                          skb_release_all
                            skb_release_head_state
                              netlink_skb_destructor
                                vfree(skb->head)  //sleeping while holding spinlock
      
      In netlink_sendmsg, if the memory pointed to by skb->head is allocated by
      vmalloc and the skb is put on the sk_receive_queue queue without being
      freed, then the sleeping bug will occur when the mqueue executes flush. Use
      vfree_atomic instead of vfree in netlink_skb_destructor to solve the issue.
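
      The idea behind a deferred free can be sketched in userspace: while a
      lock is held, the buffer is only linked onto a pending list, and the
      real free runs later from a context where it is allowed. The toy_* names
      are hypothetical, and this single-threaded sketch leaves out the
      lock-free list and worker that a real implementation would need.

```c
#include <assert.h>
#include <stdlib.h>

/* The freed buffer itself is reused as the list node. */
struct deferred_node {
    struct deferred_node *next;
};

static struct deferred_node *deferred_head;

/*
 * Cheap enough to call from a restricted context: it only links the buffer.
 * The buffer must be at least sizeof(struct deferred_node) bytes.
 */
void toy_free_deferred(void *buf)
{
    struct deferred_node *n = buf;

    if (!buf)
        return;
    n->next = deferred_head;
    deferred_head = n;
}

/* Runs later, outside the restricted context; returns the number freed. */
size_t toy_drain_deferred(void)
{
    size_t freed = 0;

    while (deferred_head) {
        struct deferred_node *next = deferred_head->next;

        free(deferred_head);
        deferred_head = next;
        freed++;
    }
    return freed;
}
```

      The caller that holds the lock never performs the potentially sleeping
      operation itself.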
      
      Fixes: c05cdb1b ("netlink: allow large data transfers from user-space")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Link: https://lore.kernel.org/r/20240122011807.2110357-1-shaozhengchao@huawei.com
      
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      de061604
    • Salvatore Dipietro's avatar
      tcp: Add memory barrier to tcp_push() · 90fba981
      Salvatore Dipietro authored
      [ Upstream commit 7267e8dc ]
      
      On CPUs with weak memory models, reads and updates performed by tcp_push
      to the sk variables can get reordered leaving the socket throttled when
      it should not. The tasklet running tcp_wfree() may also not observe the
      memory updates in time and will skip flushing any packets throttled by
      tcp_push(), delaying the sending. This can pathologically cause 40ms
      extra latency due to bad interactions with delayed acks.
      
      Adding a memory barrier in tcp_push removes the bug, similarly to the
      previous commit bf06200e ("tcp: tsq: fix nonagle handling").
      smp_mb__after_atomic() is used to avoid incurring unnecessary overhead
      on x86, which is not affected.
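
      The ordering requirement can be sketched with C11 atomics. This is a
      userspace analogy, not the kernel code: struct toy_sock, its fields and
      toy_try_autocork() are hypothetical, and a sequentially consistent fence
      stands in for smp_mb__after_atomic().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_THROTTLED 1u

struct toy_sock {
    atomic_uint flags;       /* stand-in for the throttled flag word */
    atomic_uint wmem_alloc;  /* stand-in for the bytes still in flight */
};

/*
 * Returns true if the sender may hold the data back, relying on the
 * completion path to see the flag and push it later.
 */
bool toy_try_autocork(struct toy_sock *sk, unsigned int truesize)
{
    unsigned int flags = atomic_load_explicit(&sk->flags, memory_order_relaxed);

    if (!(flags & TOY_THROTTLED)) {
        atomic_fetch_or_explicit(&sk->flags, TOY_THROTTLED,
                                 memory_order_relaxed);
        /* Full barrier: the flag update is ordered before the read below. */
        atomic_thread_fence(memory_order_seq_cst);
    }

    /* Completion may already have happened; re-check in-flight bytes. */
    return atomic_load_explicit(&sk->wmem_alloc, memory_order_relaxed) > truesize;
}
```

      Unlike smp_mb__after_atomic(), the C11 fence is not free on x86, so the
      sketch shows the ordering only, not the cost trade-off described above.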
      
      Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu
      22.04 and Apache Tomcat 9.0.83 running the basic servlet below:
      
      import java.io.IOException;
      import java.io.OutputStreamWriter;
      import java.io.PrintWriter;
      import javax.servlet.ServletException;
      import javax.servlet.http.HttpServlet;
      import javax.servlet.http.HttpServletRequest;
      import javax.servlet.http.HttpServletResponse;
      
      public class HelloWorldServlet extends HttpServlet {
          @Override
          protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
              response.setContentType("text/html;charset=utf-8");
              OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
              String s = "a".repeat(3096);
              osw.write(s,0,s.length());
              osw.flush();
          }
      }
      
      Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
      c6i.8xlarge instance. Before the patch, an additional 40ms of latency is
      observed at P99.99 and above, while with the patch the extra latency
      disappears.
      
      No patch and tcp_autocorking=1
      ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
        ...
       50.000%    0.91ms
       75.000%    1.13ms
       90.000%    1.46ms
       99.000%    1.74ms
       99.900%    1.89ms
       99.990%   41.95ms  <<< 40+ ms extra latency
       99.999%   48.32ms
      100.000%   48.96ms
      
      With patch and tcp_autocorking=1
      ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
      
      
        ...
       50.000%    0.90ms
       75.000%    1.13ms
       90.000%    1.45ms
       99.000%    1.72ms
       99.900%    1.83ms
       99.990%    2.11ms  <<< no 40+ ms extra latency
       99.999%    2.53ms
      100.000%    2.62ms
      
      The patch has also been tested on x86 (m7i.2xlarge instance), which is not
      affected by this issue, and the patch doesn't introduce any additional
      delay there.
      
      Fixes: 7aa5470c ("tcp: tsq: move tsq_flags close to sk_wmem_alloc")
      Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240119190133.43698-1-dipiets@amazon.com
      
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      90fba981
    • David Howells's avatar
      afs: Hide silly-rename files from userspace · ab49164c
      David Howells authored
      
      
      [ Upstream commit 57e9d49c ]
      
      There appears to be a race between silly-rename files being created/removed
      and various userspace tools iterating over the contents of a directory,
      leading to such errors as:
      
      	find: './kernel/.tmp_cpio_dir/include/dt-bindings/reset/.__afs2080': No such file or directory
      	tar: ./include/linux/greybus/.__afs3C95: File removed before we read it
      
      when building a kernel.
      
      Fix afs_readdir() so that it doesn't return .__afsXXXX silly-rename files
      to userspace.  This doesn't stop them being looked up directly by name as
      we need to be able to look them up from within the kernel as part of the
      silly-rename algorithm.
      
      Fixes: 79ddbfa5 ("afs: Implement sillyrename for unlink and rename")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ab49164c
    • Petr Pavlu's avatar
      tracing: Ensure visibility when inserting an element into tracing_map · f4f7e696
      Petr Pavlu authored
      [ Upstream commit 2b447606 ]
      
      Running the following two commands in parallel on a multi-processor
      AArch64 machine can sporadically produce an unexpected warning about
      duplicate histogram entries:
      
       $ while true; do
           echo hist:key=id.syscall:val=hitcount > \
             /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
           cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
           sleep 0.001
         done
       $ stress-ng --sysbadaddr $(nproc)
      
      The warning looks as follows:
      
      [ 2911.172474] ------------[ cut here ]------------
      [ 2911.173111] Duplicates detected: 1
      [ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
      [ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
      [ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
      [ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G            E      6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
      [ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
      [ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
      [ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.185310] sp : ffff8000a1513900
      [ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
      [ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
      [ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
      [ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
      [ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
      [ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
      [ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
      [ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
      [ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
      [ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
      [ 2911.194259] Call trace:
      [ 2911.194626]  tracing_map_sort_entries+0x3e0/0x408
      [ 2911.195220]  hist_show+0x124/0x800
      [ 2911.195692]  seq_read_iter+0x1d4/0x4e8
      [ 2911.196193]  seq_read+0xe8/0x138
      [ 2911.196638]  vfs_read+0xc8/0x300
      [ 2911.197078]  ksys_read+0x70/0x108
      [ 2911.197534]  __arm64_sys_read+0x24/0x38
      [ 2911.198046]  invoke_syscall+0x78/0x108
      [ 2911.198553]  el0_svc_common.constprop.0+0xd0/0xf8
      [ 2911.199157]  do_el0_svc+0x28/0x40
      [ 2911.199613]  el0_svc+0x40/0x178
      [ 2911.200048]  el0t_64_sync_handler+0x13c/0x158
      [ 2911.200621]  el0t_64_sync+0x1a8/0x1b0
      [ 2911.201115] ---[ end trace 0000000000000000 ]---
      
      The problem appears to be caused by CPU reordering of writes issued from
      __tracing_map_insert().
      
      The check for the presence of an element with a given key in this
      function is:
      
       val = READ_ONCE(entry->val);
       if (val && keys_match(key, val->key, map->key_size)) ...
      
      The write of a new entry is:
      
       elt = get_free_elt(map);
       memcpy(elt->key, key, map->key_size);
       entry->val = elt;
      
      The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
      stores may become visible in the reversed order on another CPU. This
      second CPU might then incorrectly determine that a new key doesn't match
      an already present val->key and subsequently insert a new element,
      resulting in a duplicate.
      
      Fix the problem by adding a write barrier between
      "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
      good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
      element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
      and the "val->key" check which has an address dependency.
      
      The barrier is placed on a path executed when adding an element for
      a new key. Subsequent updates targeting the same key remain unaffected.
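
      The ordering requirement described above can be sketched in userspace
      with C11 atomics. This is only an analogue under stated assumptions, not
      the kernel code: struct elt, struct entry, publish() and entry_matches()
      are hypothetical names, the release store stands in for the write barrier
      plus WRITE_ONCE(), and the acquire load stands in for READ_ONCE() with
      its address dependency.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define KEY_SIZE 8

/* Hypothetical stand-ins for the tracing map element and hash entry. */
struct elt {
    char key[KEY_SIZE];
};

struct entry {
    _Atomic(struct elt *) val;
};

/* Writer: fill in the key first, then publish the pointer. */
static void publish(struct entry *entry, struct elt *elt, const char *key)
{
    memcpy(elt->key, key, KEY_SIZE);
    /*
     * The release store keeps the key bytes from becoming visible
     * after the pointer. In this sketch it plays the role of the
     * write barrier followed by WRITE_ONCE(entry->val, elt).
     */
    atomic_store_explicit(&entry->val, elt, memory_order_release);
}

/* Reader: returns 1 if the entry is populated and its key matches. */
static int entry_matches(struct entry *entry, const char *key)
{
    /*
     * Acquire is an assumption of this sketch; the kernel side relies
     * on READ_ONCE() plus the address dependency on val->key.
     */
    struct elt *val = atomic_load_explicit(&entry->val,
                                           memory_order_acquire);

    return val != NULL && memcmp(val->key, key, KEY_SIZE) == 0;
}
```

      With this pairing, a reader that observes a non-NULL pointer also
      observes the fully written key, so it cannot mistake an in-flight insert
      of the same key for a mismatch.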
      
      From the user's perspective, the issue was introduced by commit
      c193707d ("tracing: Remove code which merges duplicates"), which
      followed commit cbf4100e ("tracing: Add support to detect and avoid
      duplicates"). The previous code operated differently; it inherently
      expected potential races which result in duplicates but merged them
      later when they occurred.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com
      
      
      
      Fixes: c193707d ("tracing: Remove code which merges duplicates")
      Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
      Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f4f7e696
    • Dan Carpenter's avatar
      netfs, fscache: Prevent Oops in fscache_put_cache() · 82a9bc34
      Dan Carpenter authored
      
      
      [ Upstream commit 3be0b3ed ]
      
      This function dereferences "cache" and then checks if it's
      IS_ERR_OR_NULL().  Check first, then dereference.
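
      A minimal userspace sketch of the "check first, then dereference" order.
      Everything here is hypothetical: is_err_or_null() only imitates the
      kernel's IS_ERR_OR_NULL() convention of treating the top 4095 addresses
      as error codes, and struct cache and put_cache() are illustrative names,
      not the fscache implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitation of IS_ERR_OR_NULL(): NULL or an encoded errno. */
static int is_err_or_null(const void *ptr)
{
    return ptr == NULL || (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, heavily reduced cache object. */
struct cache {
    unsigned int debug_id;
    int ref;
};

/* Drops one reference; returns the new count, or -1 if nothing was done. */
static int put_cache(struct cache *cache)
{
    unsigned int debug_id;

    /* Validate the pointer before touching any field. */
    if (is_err_or_null(cache))
        return -1;

    debug_id = cache->debug_id; /* safe: the pointer was checked above */
    (void)debug_id;             /* would be used for tracing */

    return --cache->ref;
}
```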
      
      Fixes: 9549332d ("fscache: Implement cache registration")
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: https://lore.kernel.org/r/e84bc740-3502-4f16-982a-a40d5676615c@moroto.mountain/ # v2
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      82a9bc34
    • Sharath Srinivasan's avatar
      net/rds: Fix UBSAN: array-index-out-of-bounds in rds_cmsg_recv · 71024928
      Sharath Srinivasan authored
      
      
      [ Upstream commit 13e788de ]
      
      A UBSAN crash reported by syzkaller occurs in rds_cmsg_recv(),
      which reads inc->i_rx_lat_trace[j + 1] with index 4 (3 + 1),
      although the array has only 4 elements (RDS_RX_MAX_TRACES).
      Here 'j' is assigned from rs->rs_rx_trace[i], which in turn comes from
      trace.rx_trace_pos[i] in rds_recv_track_latency();
      both arrays are sized 3 (RDS_MSG_RX_DGRAM_TRACE_MAX). So fix the
      off-by-one bounds check in rds_recv_track_latency() to prevent
      a potential crash in rds_cmsg_recv().
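
      The arithmetic behind the off-by-one can be sketched as follows. The two
      constants take the values stated above; trace_pos_valid() and
      read_next_latency() are hypothetical helpers written for illustration,
      and the strict comparison is an assumption about what a correct bounds
      check looks like, not a copy of the patch.

```c
#define RDS_MSG_RX_DGRAM_TRACE_MAX 3
#define RDS_RX_MAX_TRACES          4

/*
 * Hypothetical validator: a position j is usable only if j + 1 is
 * still a valid index into an array of RDS_RX_MAX_TRACES elements.
 * Accepting j == RDS_MSG_RX_DGRAM_TRACE_MAX would allow index 4.
 */
static int trace_pos_valid(unsigned int pos)
{
    return pos < RDS_MSG_RX_DGRAM_TRACE_MAX;
}

/* Reads lat[j + 1]; returns 0 and sets *ok to 0 when j is rejected. */
static unsigned long long
read_next_latency(const unsigned long long lat[RDS_RX_MAX_TRACES],
                  unsigned int j, int *ok)
{
    *ok = trace_pos_valid(j);
    if (!*ok)
        return 0;

    return lat[j + 1];
}
```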
      
      Found by syzkaller:
      =================================================================
      UBSAN: array-index-out-of-bounds in net/rds/recv.c:585:39
      index 4 is out of range for type 'u64 [4]'
      CPU: 1 PID: 8058 Comm: syz-executor228 Not tainted 6.6.0-gd2f51b3516da #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS 1.15.0-1 04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x136/0x150 lib/dump_stack.c:106
       ubsan_epilogue lib/ubsan.c:217 [inline]
       __ubsan_handle_out_of_bounds+0xd5/0x130 lib/ubsan.c:348
       rds_cmsg_recv+0x60d/0x700 net/rds/recv.c:585
       rds_recvmsg+0x3fb/0x1610 net/rds/recv.c:716
       sock_recvmsg_nosec net/socket.c:1044 [inline]
       sock_recvmsg+0xe2/0x160 net/socket.c:1066
       __sys_recvfrom+0x1b6/0x2f0 net/socket.c:2246
       __do_sys_recvfrom net/socket.c:2264 [inline]
       __se_sys_recvfrom net/socket.c:2260 [inline]
       __x64_sys_recvfrom+0xe0/0x1b0 net/socket.c:2260
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      ==================================================================
      
      Fixes: 3289025a ("RDS: add receive message trace used by application")
      Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
      Closes: https://lore.kernel.org/linux-rdma/CALGdzuoVdq-wtQ4Az9iottBqC5cv9ZhcE5q8N7LfYFvkRsOVcw@mail.gmail.com/
      
      
      Signed-off-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      71024928
    • Horatiu Vultur's avatar
      net: micrel: Fix PTP frame parsing for lan8814 · fcb0b4b6
      Horatiu Vultur authored
      
      
      [ Upstream commit aaf632f7 ]
      
      The HW can check each frame to determine whether it is a PTP frame,
      which domain it belongs to, which PTP frame type it is, and the IP
      addresses in the frame. If one of these checks fails, the frame is not
      timestamped. Most of these checks were disabled, except the check of the
      minorVersionPTP field inside the PTP header. As a result, once a partner
      sent a frame compliant with 802.1AS, which has minorVersionPTP set to 1,
      the frame was not timestamped because the HW expected by default a value
      of 0 in minorVersionPTP. This is exactly the same issue as on lan8841.
      Fix this issue by removing this check so that userspace can decide on this.
      
      Fixes: ece19502 ("net: phy: micrel: 1588 support for LAN8814 phy")
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Reviewed-by: Divya Koppera <divya.koppera@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      fcb0b4b6
    • Yunjian Wang's avatar
      tun: add missing rx stats accounting in tun_xdp_act · 7a581f59
      Yunjian Wang authored
      
      
      [ Upstream commit f1084c42 ]
      
      TUN can be used as a vhost-net backend, so it is necessary to
      count the packets transmitted from TUN to vhost-net/virtio-net.
      However, some places in the receive path were not taken into
      account when using XDP. Add accounting for successfully
      received bytes using dev_sw_netstats_rx_add().
      
      Fixes: 761876c8 ("tap: XDP support")
      Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7a581f59
    • Yunjian Wang's avatar
      tun: fix missing dropped counter in tun_xdp_act · 41e7decd
      Yunjian Wang authored
      
      
      [ Upstream commit 5744ba05 ]
      
      Commit 8ae1aff0 ("tuntap: split out XDP logic") added the
      dropped counter for XDP_DROP, XDP_ABORTED, and invalid XDP actions.
      Unfortunately, that commit missed the dropped counter when an error
      occurs during the XDP_TX and XDP_REDIRECT actions. This patch fixes
      this issue.
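
      The accounting rule can be illustrated with a small model. This is a
      sketch under assumptions, not the driver code: enum model_xdp_act,
      struct model_stats and account_xdp() are hypothetical, and the err
      argument stands for the result of whatever transmit or redirect helper
      the real code calls.

```c
/* Hypothetical mirror of the XDP verdicts named above. */
enum model_xdp_act {
    ACT_ABORTED,
    ACT_DROP,
    ACT_PASS,
    ACT_TX,
    ACT_REDIRECT,
};

/* Hypothetical per-device counters. */
struct model_stats {
    unsigned long rx_dropped;
};

/*
 * Returns the action on success or the negative error on failure.
 * Every path that discards the frame bumps rx_dropped, including
 * the error paths of ACT_TX and ACT_REDIRECT.
 */
static int account_xdp(struct model_stats *st, enum model_xdp_act act,
                       int err)
{
    switch (act) {
    case ACT_REDIRECT:
    case ACT_TX:
        if (err < 0) {
            st->rx_dropped++; /* the case the fix adds */
            return err;
        }
        break;
    case ACT_PASS:
        break;
    default: /* invalid action: treated like a drop */
    case ACT_ABORTED:
    case ACT_DROP:
        st->rx_dropped++;
        break;
    }

    return (int)act;
}
```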
      
      Fixes: 8ae1aff0 ("tuntap: split out XDP logic")
      Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      41e7decd
    • Jakub Kicinski's avatar
      net: fix removing a namespace with conflicting altnames · a2232f29
      Jakub Kicinski authored
      
      
      [ Upstream commit d09486a0 ]
      
      Mark reports a BUG() when a net namespace is removed.
      
          kernel BUG at net/core/dev.c:11520!
      
      Physical interfaces moved outside of init_net get "refunded"
      to init_net when that namespace disappears. The main interface
      name may get overwritten in the process if it would have
      conflicted. We need to also discard all conflicting altnames.
      Recent fixes addressed ensuring that altnames get moved
      with the main interface, which surfaced this problem.
      
      Reported-by: Марк Коренберг <socketpair@gmail.com>
      Link: https://lore.kernel.org/all/CAEmTpZFZ4Sv3KwqFOY2WKDHeZYdi0O7N5H1nTvcGp=SAEavtDg@mail.gmail.com/
      
      
      Fixes: 7663d522 ("net: check for altname conflicts when changing netdev's netns")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a2232f29