  1. Dec 08, 2021
    • RDMA/irdma: Report correct WC errors · 25b5d6fd
      Shiraz Saleem authored
      Return IBV_WC_REM_OP_ERR for responder QP errors instead of
      IBV_WC_REM_ACCESS_ERR.
      
      Return IBV_WC_LOC_QP_OP_ERR for errors detected on the SQ with bad opcodes
      
      Fixes: 44d9e529 ("RDMA/irdma: Implement device initialization definitions")
      Link: https://lore.kernel.org/r/20211201231509.1930-1-shiraz.saleem@intel.com
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • RDMA/irdma: Fix a potential memory allocation issue in 'irdma_prm_add_pble_mem()' · 117697cc
      Christophe JAILLET authored
      'pchunk->bitmapbuf' is a bitmap. Its size (in number of bits) is stored in
      'pchunk->sizeofbitmap'.
      
      When it is allocated, the size (in bytes) is computed by:
         size_in_bits >> 3
      
      There are 2 issues (numbers below assume that longs are 64 bits):
         - there is no guarantee here that 'pchunk->bitmapmem.size' is a
           multiple of BITS_PER_LONG, but bitmaps are stored as longs
           (sizeofbitmap=8 bits will only allocate 1 byte, instead of 8 (1 long))
      
         - the number of bytes is computed with a shift, not a round up, so we
           may allocate less memory than needed
           (sizeofbitmap=65 bits will only allocate 8 bytes (i.e. 1 long), when 2
           longs are needed = 16 bytes)
      
      Fix both issues by using 'bitmap_zalloc()' and remove the useless
      'bitmapmem' from 'struct irdma_chunk'.
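
      The two sizing problems described above can be reproduced with plain
      arithmetic. The following is a minimal user-space sketch, not the driver
      code: the helper names and SKETCH_BITS_PER_LONG are made up for
      illustration, and the round-up mirrors what a bitmap allocator has to do
      (size in whole longs).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bits in one unsigned long on this platform (64 on typical LP64 systems). */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Old computation: a truncating shift from bits to bytes. */
static size_t bytes_by_shift(size_t nbits)
{
	return nbits >> 3;
}

/* Bitmap-style computation: round up to a whole number of longs. */
static size_t bytes_by_longs(size_t nbits)
{
	size_t longs;

	assert(nbits > 0);
	longs = (nbits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
	return longs * sizeof(unsigned long);
}
```

      With 64-bit longs, bytes_by_shift(65) yields 8 while bytes_by_longs(65)
      yields 16, which matches the second case in the list above.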
      
      While at it, remove some useless NULL test before calling
      kfree/bitmap_free.
      
      Fixes: 915cc7ac ("RDMA/irdma: Add miscellaneous utility definitions")
      Link: https://lore.kernel.org/r/5e670b640508e14b1869c3e8e4fb970d78cbe997.1638692171.git.christophe.jaillet@wanadoo.fr
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • RDMA/irdma: Fix a user-after-free in add_pble_prm · 1e11a39a
      Shiraz Saleem authored
      When irdma_hmc_sd_one fails, 'chunk' is freed while it's still on the PBLE
      info list.
      
      Add the chunk entry to the PBLE info list only after successful setting of
      the SD in irdma_hmc_sd_one.
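
      The ordering rule described above (publish an object on a list only once
      every fallible step has succeeded) can be sketched in user space as
      follows. All names here are hypothetical, and the 'setup_ok' flag merely
      stands in for the result of a step such as irdma_hmc_sd_one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct chunk {
	struct chunk *next;
	int id;
};

struct chunk_list {
	struct chunk *head;
};

/*
 * Allocate a chunk and link it into the list only after the fallible
 * setup step has succeeded. On failure the chunk was never reachable
 * from the list, so freeing it cannot leave a dangling list entry.
 */
static struct chunk *add_chunk(struct chunk_list *list, int id, bool setup_ok)
{
	struct chunk *c;

	assert(list != NULL);
	c = calloc(1, sizeof(*c));
	if (!c)
		return NULL;
	c->id = id;

	if (!setup_ok) {
		free(c);	/* safe: not on the list yet */
		return NULL;
	}

	c->next = list->head;	/* publish last */
	list->head = c;
	return c;
}
```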
      
      Fixes: e8c4dbc2 ("RDMA/irdma: Add PBLE resource manager")
      Link: https://lore.kernel.org/r/20211207152135.2192-1-shiraz.saleem@intel.com
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • IB/hfi1: Fix leak of rcvhdrtail_dummy_kvaddr · 60a8b5a1
      Mike Marciniszyn authored
      This buffer is currently allocated in hfi1_init():
      
      	if (reinit)
      		ret = init_after_reset(dd);
      	else
      		ret = loadtime_init(dd);
      	if (ret)
      		goto done;
      
      	/* allocate dummy tail memory for all receive contexts */
      	dd->rcvhdrtail_dummy_kvaddr = dma_alloc_coherent(&dd->pcidev->dev,
      							 sizeof(u64),
      							 &dd->rcvhdrtail_dummy_dma,
      							 GFP_KERNEL);
      
      	if (!dd->rcvhdrtail_dummy_kvaddr) {
      		dd_dev_err(dd, "cannot allocate dummy tail memory\n");
      		ret = -ENOMEM;
      		goto done;
      	}
      
      The reinit triggered path will overwrite the old allocation and leak it.
      
      Fix by moving the allocation to hfi1_alloc_devdata() and the deallocation
      to hfi1_free_devdata().
      
      Link: https://lore.kernel.org/r/20211129192008.101968.91302.stgit@awfm-01.cornelisnetworks.com
      Cc: stable@vger.kernel.org
      Fixes: 46b010d3 ("staging/rdma/hfi1: Workaround to prevent corruption during packet delivery")
      Signed-off-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • IB/hfi1: Fix early init panic · f6a3cfec
      Mike Marciniszyn authored
      The following trace can be observed with an init failure such as firmware
      load failures:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
        PGD 0 P4D 0
        Oops: 0010 [#1] SMP PTI
        CPU: 0 PID: 537 Comm: kworker/0:3 Tainted: G           OE    --------- -  - 4.18.0-240.el8.x86_64 #1
        Workqueue: events work_for_cpu_fn
        RIP: 0010:0x0
        Code: Bad RIP value.
        RSP: 0000:ffffae5f878a3c98 EFLAGS: 00010046
        RAX: 0000000000000000 RBX: ffff95e48e025c00 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff95e48e025c00
        RBP: ffff95e4bf3660a4 R08: 0000000000000000 R09: ffffffff86d5e100
        R10: ffff95e49e1de600 R11: 0000000000000001 R12: ffff95e4bf366180
        R13: ffff95e48e025c00 R14: ffff95e4bf366028 R15: ffff95e4bf366000
        FS:  0000000000000000(0000) GS:ffff95e4df200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffffffffffffd6 CR3: 0000000f86a0a003 CR4: 00000000001606f0
        Call Trace:
         receive_context_interrupt+0x1f/0x40 [hfi1]
         __free_irq+0x201/0x300
         free_irq+0x2e/0x60
         pci_free_irq+0x18/0x30
         msix_free_irq.part.2+0x46/0x80 [hfi1]
         msix_clean_up_interrupts+0x2b/0x70 [hfi1]
         hfi1_init_dd+0x640/0x1a90 [hfi1]
         do_init_one.isra.19+0x34d/0x680 [hfi1]
         local_pci_probe+0x41/0x90
         work_for_cpu_fn+0x16/0x20
         process_one_work+0x1a7/0x360
         worker_thread+0x1cf/0x390
         ? create_worker+0x1a0/0x1a0
         kthread+0x112/0x130
         ? kthread_flush_work_fn+0x10/0x10
         ret_from_fork+0x35/0x40
      
      The free_irq() results in a callback to the registered interrupt handler,
      and rcd->do_interrupt is NULL because the receive context data structures
      are not fully initialized.
      
      Fix by ensuring that do_interrupt is always assigned and by adding
      guards in the slow path handler to detect and handle a partially
      initialized receive context and noop the receive.
      
      Link: https://lore.kernel.org/r/20211129192003.101968.33612.stgit@awfm-01.cornelisnetworks.com
      Cc: stable@vger.kernel.org
      Fixes: b0ba3c18 ("IB/hfi1: Move normal functions from hfi1_devdata to const array")
      Signed-off-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • IB/hfi1: Insure use of smp_processor_id() is preempt disabled · b6d57e24
      Mike Marciniszyn authored
      The following BUG has just surfaced with our 5.16 testing:
      
        BUG: using smp_processor_id() in preemptible [00000000] code: mpicheck/1581081
        caller is sdma_select_user_engine+0x72/0x210 [hfi1]
        CPU: 0 PID: 1581081 Comm: mpicheck Tainted: G S                5.16.0-rc1+ #1
        Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
        Call Trace:
         <TASK>
         dump_stack_lvl+0x33/0x42
         check_preemption_disabled+0xbf/0xe0
         sdma_select_user_engine+0x72/0x210 [hfi1]
         ? _raw_spin_unlock_irqrestore+0x1f/0x31
         ? hfi1_mmu_rb_insert+0x6b/0x200 [hfi1]
         hfi1_user_sdma_process_request+0xa02/0x1120 [hfi1]
         ? hfi1_write_iter+0xb8/0x200 [hfi1]
         hfi1_write_iter+0xb8/0x200 [hfi1]
         do_iter_readv_writev+0x163/0x1c0
         do_iter_write+0x80/0x1c0
         vfs_writev+0x88/0x1a0
         ? recalibrate_cpu_khz+0x10/0x10
         ? ktime_get+0x3e/0xa0
         ? __fget_files+0x66/0xa0
         do_writev+0x65/0x100
         do_syscall_64+0x3a/0x80
      
      Fix this long standing bug by moving the smp_processor_id() to after the
      rcu_read_lock().
      
      The rcu_read_lock() implicitly disables preemption.
      
      Link: https://lore.kernel.org/r/20211129191958.101968.87329.stgit@awfm-01.cornelisnetworks.com
      Cc: stable@vger.kernel.org
      Fixes: 0cb2aa69 ("IB/hfi1: Add sysfs interface for affinity setup")
      Signed-off-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • IB/hfi1: Correct guard on eager buffer deallocation · 9292f8f9
      Mike Marciniszyn authored
      The code tests the dma address which legitimately can be 0.
      
      The code should test the kernel logical address to avoid leaking eager
      buffer allocations that happen to map to a dma address of 0.
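
      The difference between the two guards can be illustrated with a small
      user-space sketch. The struct and function names are hypothetical, and
      malloc/free stand in for the DMA allocation and release calls; the point
      is only that a bus address of 0 says nothing about whether a buffer
      exists, while the CPU-side pointer does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct eager_buf {
	void *addr;	/* CPU-visible address, NULL when not allocated */
	uint64_t dma;	/* bus address, where 0 can be a valid mapping */
};

/* Old guard: skips the free whenever the bus address happens to be 0. */
static int free_if_dma_nonzero(struct eager_buf *b)
{
	assert(b != NULL);
	if (!b->dma)
		return 0;	/* a buffer mapped at 0 would leak here */
	free(b->addr);
	b->addr = NULL;
	b->dma = 0;
	return 1;
}

/* Corrected guard: decide on the CPU-side pointer instead. */
static int free_if_addr_set(struct eager_buf *b)
{
	assert(b != NULL);
	if (!b->addr)
		return 0;
	free(b->addr);
	b->addr = NULL;
	b->dma = 0;
	return 1;
}
```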
      
      Fixes: 60368186 ("IB/hfi1: Fix user-space buffers mapping with IOMMU enabled")
      Link: https://lore.kernel.org/r/20211129191952.101968.17137.stgit@awfm-01.cornelisnetworks.com
      Signed-off-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  2. Nov 29, 2021
    • RDMA/rtrs: Call {get,put}_cpu_ptr to silence a debug kernel warning · db6169b5
      Guoqing Jiang authored
      
      
      With preemption enabled (CONFIG_DEBUG_PREEMPT=y), the following appeared
      when rnbd client tries to map remote block device.
      
        BUG: using smp_processor_id() in preemptible [00000000] code: bash/1733
        caller is debug_smp_processor_id+0x17/0x20
        CPU: 0 PID: 1733 Comm: bash Not tainted 5.16.0-rc1 #5
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         <TASK>
         dump_stack_lvl+0x5d/0x78
         dump_stack+0x10/0x12
         check_preemption_disabled+0xe4/0xf0
         debug_smp_processor_id+0x17/0x20
         rtrs_clt_update_all_stats+0x3b/0x70 [rtrs_client]
         rtrs_clt_read_req+0xc3/0x380 [rtrs_client]
         ? rtrs_clt_init_req+0xe3/0x120 [rtrs_client]
         rtrs_clt_request+0x1a7/0x320 [rtrs_client]
         ? 0xffffffffc0ab1000
         send_usr_msg+0xbf/0x160 [rnbd_client]
         ? rnbd_clt_put_sess+0x60/0x60 [rnbd_client]
         ? send_usr_msg+0x160/0x160 [rnbd_client]
         ? sg_alloc_table+0x27/0xb0
         ? sg_zero_buffer+0xd0/0xd0
         send_msg_sess_info+0xe9/0x180 [rnbd_client]
         ? rnbd_clt_put_sess+0x60/0x60 [rnbd_client]
         ? blk_mq_alloc_tag_set+0x2ef/0x370
         rnbd_clt_map_device+0xba8/0xcd0 [rnbd_client]
         ? send_msg_open+0x200/0x200 [rnbd_client]
         rnbd_clt_map_device_store+0x3e5/0x620 [rnbd_client
      
      To suppress the call trace, call the get_cpu_ptr/put_cpu_ptr pair in
      rtrs_clt_update_rdma_stats to disable preemption when accessing the
      per-cpu variable.
      
      While at it, make a similar change in rtrs_clt_update_wc_stats. As for
      rtrs_clt_inc_failover_cnt, though it is only called inside an rcu
      section, it can still be preempted when CONFIG_PREEMPT_RCU is enabled,
      so change it to the {get,put}_cpu_ptr pair as well.
      
      Link: https://lore.kernel.org/r/20211128133501.38710-1-guoqing.jiang@linux.dev
      Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  3. Nov 26, 2021
    • RDMA/hns: Do not destroy QP resources in the hw resetting phase · b0969f83
      Yangyang Li authored
      When hns_roce_v2_destroy_qp() is called, the brief calling process of the
      driver is as follows:
      
       ......
       hns_roce_v2_destroy_qp
       hns_roce_v2_qp_modify
      	   hns_roce_cmd_mbox
       hns_roce_qp_destroy
      
      If hns_roce_cmd_mbox() detects that the hardware is being reset during the
      execution of the hns_roce_cmd_mbox(), the driver will not be able to get
      the return value from the hardware (the firmware cannot respond to the
      driver's mailbox during the hardware reset phase).
      
      The driver needs to wait for the hardware reset to complete before
      continuing to execute hns_roce_qp_destroy(), otherwise the driver may
      release the resources while the hardware is still accessing them. To
      fix this problem, HNS RoCE needs to add a piece of code to wait for the
      hardware reset to complete.
      
      The original interface get_hw_reset_stat() returns the instantaneous
      state of the hardware reset, which cannot accurately reflect whether the
      hardware reset is completed, so it needs to be replaced with the
      ae_dev_reset_cnt interface.
      
      The sign that the hardware reset is complete is that the return value of
      the ae_dev_reset_cnt interface is greater than the original value
      reset_cnt recorded by the driver.
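
      The completion test described above (current counter greater than the
      recorded one) can be sketched as a bounded polling loop. This is a
      user-space illustration under stated assumptions: the callback type, the
      fake hardware struct and the poll limit are all hypothetical, and a real
      driver would sleep between polls rather than spin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accessor returning the device's current reset counter. */
typedef unsigned long (*read_reset_cnt_fn)(void *ctx);

/* Fake hardware: the counter increments after a given number of polls. */
struct fake_hw {
	unsigned int polls_until_done;
	unsigned long reset_cnt;
};

static unsigned long fake_read_reset_cnt(void *ctx)
{
	struct fake_hw *hw = ctx;

	if (hw->polls_until_done > 0 && --hw->polls_until_done == 0)
		hw->reset_cnt++;
	return hw->reset_cnt;
}

/*
 * The reset counts as complete once the counter read from the device
 * exceeds the value recorded before the reset began. Give up after
 * max_polls attempts.
 */
static bool wait_for_reset_done(read_reset_cnt_fn read_cnt, void *ctx,
				unsigned long recorded_cnt,
				unsigned int max_polls)
{
	unsigned int i;

	assert(read_cnt != NULL);
	for (i = 0; i < max_polls; i++) {
		if (read_cnt(ctx) > recorded_cnt)
			return true;
	}
	return false;
}
```

      Comparing counters rather than sampling an "in reset" flag avoids
      misreading a reset that started and finished between two samples.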
      
      Fixes: 6a04aed6 ("RDMA/hns: Fix the chip hanging caused by sending mailbox&CMQ during reset")
      Link: https://lore.kernel.org/r/20211123142402.26936-1-liangwenpeng@huawei.com
      Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
      Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • RDMA/hns: Do not halt commands during reset until later · 52414e27
      Yangyang Li authored
      is_reset is used to indicate whether the hardware has started to reset.
      When hns_roce_hw_v2_reset_notify_down() is called, the hardware has not
      yet started to reset. If is_reset is set at this time, all mailbox
      operations of resource destroy actions will be intercepted by the
      driver. If the driver then cleans up resources while the hardware is
      still accessing them, the following errors will appear:
      
        arm-smmu-v3 arm-smmu-v3.2.auto: event 0x10 received:
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x0000350100000010
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x000002088000003f
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x00000000a50e0800
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x0000000000000000
        arm-smmu-v3 arm-smmu-v3.2.auto: event 0x10 received:
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x0000350100000010
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x000002088000043e
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x00000000a50a0800
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x0000000000000000
        arm-smmu-v3 arm-smmu-v3.2.auto: event 0x10 received:
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x0000350100000010
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x0000020880000436
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x00000000a50a0880
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x0000000000000000
        arm-smmu-v3 arm-smmu-v3.2.auto: event 0x10 received:
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x0000350100000010
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x000002088000043a
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x00000000a50e0840
        hns3 0000:35:00.0: INT status: CMDQ(0x0) HW errors(0x0) other(0x0)
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x0000000000000000
        hns3 0000:35:00.0: received unknown or unhandled event of vector0
        arm-smmu-v3 arm-smmu-v3.2.auto: event 0x10 received:
        arm-smmu-v3 arm-smmu-v3.2.auto: 	0x0000350100000010
        {34}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 7
      
      is_reset will be set correctly in check_aedev_reset_status(), so the
      setting in hns_roce_hw_v2_reset_notify_down() should be deleted.
      
      Fixes: 726be12f ("RDMA/hns: Set reset flag when hw resetting")
      Link: https://lore.kernel.org/r/20211123084809.37318-1-liangwenpeng@huawei.com
      Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
      Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • Remove Doug Ledford from MAINTAINERS · c4a6f9cd
      Doug Ledford authored
      
      
      Moving on to other things
      
      Link: https://lore.kernel.org/r/12fe41e3d0a515e4fcf5c9e62ac88c39e09c1639.1637616139.git.dledford@redhat.com
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • RDMA/mlx5: Fix releasing unallocated memory in dereg MR flow · f0ae4afe
      Alaa Hleihel authored
      For the case of IB_MR_TYPE_DM the mr doesn't have a umem, even though
      it is a user MR. This causes mlx5_free_priv_descs() to think that it is
      a kernel MR and to access mr->descs, which holds wrong values because
      of the union. The result is an attempt to release resources that were
      never allocated in the first place.
      
      For example:
       DMA-API: mlx5_core 0000:08:00.1: device driver tries to free DMA memory it has not allocated [device address=0x0000000000000000] [size=0 bytes]
       WARNING: CPU: 8 PID: 1021 at kernel/dma/debug.c:961 check_unmap+0x54f/0x8b0
       RIP: 0010:check_unmap+0x54f/0x8b0
       Call Trace:
        debug_dma_unmap_page+0x57/0x60
        mlx5_free_priv_descs+0x57/0x70 [mlx5_ib]
        mlx5_ib_dereg_mr+0x1fb/0x3d0 [mlx5_ib]
        ib_dereg_mr_user+0x60/0x140 [ib_core]
        uverbs_destroy_uobject+0x59/0x210 [ib_uverbs]
        uobj_destroy+0x3f/0x80 [ib_uverbs]
        ib_uverbs_cmd_verbs+0x435/0xd10 [ib_uverbs]
        ? uverbs_finalize_object+0x50/0x50 [ib_uverbs]
        ? lock_acquire+0xc4/0x2e0
        ? lock_acquired+0x12/0x380
        ? lock_acquire+0xc4/0x2e0
        ? lock_acquire+0xc4/0x2e0
        ? ib_uverbs_ioctl+0x7c/0x140 [ib_uverbs]
        ? lock_release+0x28a/0x400
        ib_uverbs_ioctl+0xc0/0x140 [ib_uverbs]
        ? ib_uverbs_ioctl+0x7c/0x140 [ib_uverbs]
        __x64_sys_ioctl+0x7f/0xb0
        do_syscall_64+0x38/0x90
      
      Fix it by reorganizing the dereg flow and mlx5_ib_mr structure:
       - Move the ib_umem field into the user MRs structure in the union as it's
         applicable only there.
       - Function mlx5_ib_dereg_mr() will now call mlx5_free_priv_descs() only
         in case there isn't udata, which indicates that this isn't a user MR.
      
      Fixes: f18ec422 ("RDMA/mlx5: Use a union inside mlx5_ib_mr")
      Link: https://lore.kernel.org/r/66bb1dd253c1fd7ceaa9fc411061eefa457b86fb.1637581144.git.leonro@nvidia.com
      Signed-off-by: Alaa Hleihel <alaa@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • RDMA: Fix use-after-free in rxe_queue_cleanup · 84b01721
      Pavel Skripkin authored
      On the error handling path in rxe_qp_from_init(), qp->sq.queue is freed,
      and then rxe_create_qp() drops the last reference to this object. The qp
      cleanup function then tries to free this queue one more time, which
      causes a UAF bug.
      
      Fix it by zeroing queue pointer after freeing queue in rxe_qp_from_init().
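
      The pattern behind this fix (clear the pointer after freeing it on the
      error path, so a later cleanup pass skips it) can be sketched in user
      space as follows. The types and function names are hypothetical and only
      mirror the shape of the problem, not the rxe code itself.

```c
#include <assert.h>
#include <stdlib.h>

struct queue {
	int depth;
};

struct qp {
	struct queue *sq_queue;
};

static void queue_cleanup(struct queue *q)
{
	free(q);
}

/* Simulated init whose later step fails after the queue was created. */
static int qp_init_failing(struct qp *qp)
{
	assert(qp != NULL);
	qp->sq_queue = calloc(1, sizeof(*qp->sq_queue));
	if (!qp->sq_queue)
		return -1;

	/* Error path: release the queue and forget the pointer. */
	queue_cleanup(qp->sq_queue);
	qp->sq_queue = NULL;	/* without this, cleanup frees it again */
	return -1;
}

/* Generic cleanup run when the last reference is dropped. */
static int qp_cleanup(struct qp *qp)
{
	assert(qp != NULL);
	if (!qp->sq_queue)
		return 0;	/* nothing left to free */
	queue_cleanup(qp->sq_queue);
	qp->sq_queue = NULL;
	return 1;
}
```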
      
      Fixes: 514aee66 ("RDMA: Globally allocate and release QP memory")
      Link: https://lore.kernel.org/r/20211121202239.3129-1-paskripkin@gmail.com
      Reported-by: <syzbot+aab53008a5adf26abe91@syzkaller.appspotmail.com>
      Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
      Reviewed-by: Zhu Yanjun <zyjzyj2000@gmail.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  4. Nov 22, 2021
    • Linux 5.16-rc2 · 13605725
      Linus Torvalds authored
      v5.16-rc2
    • Merge tag 'x86-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 40c93d7f
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
      
       - Move the command line preparation and the early command line parsing
         earlier so that the command line parameters which affect
         early_reserve_memory(), e.g. efi=nosoftreserve, are taken into
         account. This was broken when the invocation of
         early_reserve_memory() was moved recently.
      
       - Use an atomic type for the SGX page accounting, which is read and
         written locklessly, to plug various race conditions related to it.
      
      * tag 'x86-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/sgx: Fix free page accounting
        x86/boot: Pull up cmdline preparation and early param parsing
    • Merge tag 'perf-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · af16bdea
      Linus Torvalds authored
      Pull x86 perf fixes from Thomas Gleixner:
      
       - Remove unneeded PEBS disabling when taking LBR snapshots to prevent an
         unchecked MSR access error.
      
       - Fix IIO event constraints for Snowridge and Skylake server chips.
      
      * tag 'perf-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/perf: Fix snapshot_branch_stack warning in VM
        perf/x86/intel/uncore: Fix IIO event constraints for Snowridge
        perf/x86/intel/uncore: Fix IIO event constraints for Skylake Server
        perf/x86/intel/uncore: Fix filter_tid mask for CHA events on Skylake Server
    • Merge tag 'powerpc-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 75603b14
      Linus Torvalds authored
      Pull more powerpc fixes from Michael Ellerman:
      
       - Fix a bug in copying of sigset_t for 32-bit systems, which caused X
         to not start.
      
       - Fix handling of shared LSIs (rare) with the xive interrupt controller
         (Power9/10).
      
       - Fix missing TOC setup in some KVM code, which could result in oopses
         depending on kernel data layout.
      
       - Fix DMA mapping when we have persistent memory and only one DMA
         window available.
      
       - Fix further problems with STRICT_KERNEL_RWX on 8xx, exposed by a
         recent fix.
      
       - A couple of other minor fixes.
      
      Thanks to Alexey Kardashevskiy, Aneesh Kumar K.V, Cédric Le Goater,
      Christian Zigotzky, Christophe Leroy, Daniel Axtens, Finn Thain, Greg
      Kurz, Masahiro Yamada, Nicholas Piggin, and Uwe Kleine-König.
      
      * tag 'powerpc-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/xive: Change IRQ domain to a tree domain
        powerpc/8xx: Fix pinned TLBs with CONFIG_STRICT_KERNEL_RWX
        powerpc/signal32: Fix sigset_t copy
        powerpc/book3e: Fix TLBCAM preset at boot
        powerpc/pseries/ddw: Do not try direct mapping with persistent memory and one window
        powerpc/pseries/ddw: simplify enable_ddw()
        powerpc/pseries/ddw: Revert "Extend upper limit for huge DMA window for persistent memory"
        powerpc/pseries: Fix numa FORM2 parsing fallback code
        powerpc/pseries: rename numa_dist_table to form2_distances
        powerpc: clean vdso32 and vdso64 directories
        powerpc/83xx/mpc8349emitx: Drop unused variable
        KVM: PPC: Book3S HV: Use GLOBAL_TOC for kvmppc_h_set_dabr/xdabr()
    • pstore/blk: Use "%lu" to format unsigned long · 61eb495c
      Geert Uytterhoeven authored
      On 32-bit:
      
          fs/pstore/blk.c: In function ‘__best_effort_init’:
          include/linux/kern_levels.h:5:18: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 3 has type ‘long unsigned int’ [-Wformat=]
      	5 | #define KERN_SOH "\001"  /* ASCII Start Of Header */
      	  |                  ^~~~~~
          include/linux/kern_levels.h:14:19: note: in expansion of macro ‘KERN_SOH’
             14 | #define KERN_INFO KERN_SOH "6" /* informational */
      	  |                   ^~~~~~~~
          include/linux/printk.h:373:9: note: in expansion of macro ‘KERN_INFO’
            373 |  printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
      	  |         ^~~~~~~~~
          fs/pstore/blk.c:314:3: note: in expansion of macro ‘pr_info’
            314 |   pr_info("attached %s (%zu) (no dedicated panic_write!)\n",
      	  |   ^~~~~~~
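
      The warning above comes from pairing "%zu" with an unsigned long
      argument: on 32-bit targets size_t is typically unsigned int, so the
      types differ. A minimal user-space sketch of the matching specifier
      follows; the helper name and the shortened message are hypothetical, and
      snprintf stands in for the kernel's pr_info.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a size held in an unsigned long. "%lu" matches the argument
 * type on both 32-bit and 64-bit targets, whereas "%zu" only matches
 * where size_t and unsigned long are the same type.
 */
static int format_attached(char *buf, size_t len, const char *name,
			   unsigned long total_size)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "attached %s (%lu)", name, total_size);
}
```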
      
      Cc: stable@vger.kernel.org
      Fixes: 7bb9557b ("pstore/blk: Use the normal block device I/O path")
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210629103700.1935012-1-geert@linux-m68k.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. Nov 21, 2021
    • Merge branch 'akpm' (patches from Andrew) · 923dcc5e
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "15 patches.
      
        Subsystems affected by this patch series: ipc, hexagon, mm (swap,
        slab-generic, kmemleak, hugetlb, kasan, damon, and highmem), and proc"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        proc/vmcore: fix clearing user buffer by properly using clear_user()
        kmap_local: don't assume kmap PTEs are linear arrays in memory
        mm/damon/dbgfs: fix missed use of damon_dbgfs_lock
        mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation
        kasan: test: silence intentional read overflow warnings
        hugetlb, userfaultfd: fix reservation restore on userfaultfd error
        hugetlb: fix hugetlb cgroup refcounting during mremap
        mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag
        hexagon: ignore vmlinux.lds
        hexagon: clean up timer-regs.h
        hexagon: export raw I/O routines for modules
        mm: emit the "free" trace report before freeing memory in kmem_cache_free()
        shm: extend forced shm destroy to support objects from several IPC nses
        ipc: WARN if trying to remove ipc object which is absent
        mm/swap.c:put_pages_list(): reinitialise the page list
    • Merge tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block · 61564e7b
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - Flip a cap check to avoid a selinux error (Alistair)
      
       - Fix for a regression this merge window where we can miss a queue ref
         put (me)
      
       - Un-mark pstore-blk as broken, as the condition that triggered that
         change has been rectified (Kees)
      
       - Queue quiesce and sync fixes (Ming)
      
       - FUA insertion fix (Ming)
      
       - blk-cgroup error path put fix (Yu)
      
      * tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block:
        blk-mq: don't insert FUA request with data into scheduler queue
        blk-cgroup: fix missing put device in error path from blkg_conf_pref()
        block: avoid to quiesce queue in elevator_init_mq
        Revert "mark pstore-blk as broken"
        blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
        block: fix missing queue put in error path
        block: Check ADMIN before NICE for IOPRIO_CLASS_RT
    • Merge tag 'pinctrl-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · b100274c
      Linus Torvalds authored
      Pull pin control fixes from Linus Walleij:
       "There is an ACPI stubs fix which is ACKed by the ACPI maintainer for
        merging through my tree.
      
        One item stands out and that is that I delete the <linux/sdb.h> header
        that is used by nothing. I deleted this subsystem (through the GPIO
        tree) a while back so I feel responsible for tidying up the floor.
      
        Other than that it is the usual mistakes, a bit noisy around build
        issue and Kconfig then driver fixes.
      
        Specifics:
      
         - Fix some stubs causing compile issues for ACPI.
      
         - Fix some wakeups on AMD IRQs shared between GPIO and SCI.
      
         - Fix a build warning in the Tegra driver.
      
         - Fix a Kconfig issue in the Qualcomm driver.
      
         - Add a missing include in the RALink driver.
      
         - Return a valid type for the Apple pinctrl IRQs.
      
         - Implement some Qualcomm SDM845 dual-edge errata.
      
         - Remove the unused <linux/sdb.h> header. (The subsystem was once
           deleted by the pinctrl maintainer...)
      
         - Fix a duplicate initializer in the Tegra driver.
      
         - Fix register offsets for UFS and SDC in the Qualcomm SM8350 driver"
      
      * tag 'pinctrl-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: qcom: sm8350: Correct UFS and SDC offsets
        pinctrl: tegra194: remove duplicate initializer again
        Remove unused header <linux/sdb.h>
        pinctrl: qcom: sdm845: Enable dual edge errata
        pinctrl: apple: Always return valid type in apple_gpio_irq_type
        pinctrl: ralink: include 'ralink_regs.h' in 'pinctrl-mt7620.c'
        pinctrl: qcom: fix unmet dependencies on GPIOLIB for GPIOLIB_IRQCHIP
        pinctrl: tegra: Return const pointer from tegra_pinctrl_get_group()
        pinctrl: amd: Fix wakeups when IRQ is shared with SCI
        ACPI: Add stubs for wakeup handler functions
    • Merge tag 's390-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 6b38e2fb
      Linus Torvalds authored
      Pull s390 updates from Heiko Carstens:
      
       - Add missing Kconfig option for ftrace direct multi sample, so it can
         be compiled again, and also add s390 support for this sample.
      
       - Update Christian Borntraeger's email address.
      
       - Various fixes for memory layout setup. Among other things, this makes
         it possible to load shared DCSS segments again.
      
       - Fix copy to user space of swapped kdump oldmem.
      
       - Remove -mstack-guard and -mstack-size compile options when building
         vdso binaries. This can happen when CONFIG_VMAP_STACK is disabled and
         results in broken vdso code which causes more or less random
         exceptions. Also remove the not needed -nostdlib option.
      
       - Fix memory leak on cpu hotplug and return code handling in kexec
         code.
      
       - Wire up futex_waitv system call.
      
       - Replace snprintf with sysfs_emit where appropriate.
      
      * tag 's390-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        ftrace/samples: add s390 support for ftrace direct multi sample
        ftrace/samples: add missing Kconfig option for ftrace direct multi sample
        MAINTAINERS: update email address of Christian Borntraeger
        s390/kexec: fix memory leak of ipl report buffer
        s390/kexec: fix return code handling
        s390/dump: fix copying to user-space of swapped kdump oldmem
        s390: wire up sys_futex_waitv system call
        s390/vdso: filter out -mstack-guard and -mstack-size
        s390/vdso: remove -nostdlib compiler flag
        s390: replace snprintf in show functions with sysfs_emit
        s390/boot: simplify and fix kernel memory layout setup
        s390/setup: re-arrange memblock setup
        s390/setup: avoid using memblock_enforce_memory_limit
        s390/setup: avoid reserving memory above identity mapping
      6b38e2fb
    • Linus Torvalds's avatar
      Merge tag '5.16-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 · b38bfc74
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
       "Three small cifs/smb3 fixes: two to address minor coverity issues and
        one cleanup"
      
      * tag '5.16-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: introduce cifs_ses_mark_for_reconnect() helper
        cifs: protect srv_count with cifs_tcp_ses_lock
        cifs: move debug print out of spinlock
      b38bfc74
    • David Hildenbrand's avatar
      proc/vmcore: fix clearing user buffer by properly using clear_user() · c1e63117
      David Hildenbrand authored
      To clear a user buffer we cannot simply use memset, we have to use
      clear_user().  With a virtio-mem device that registers a vmcore_cb and
      has some logically unplugged memory inside an added Linux memory block,
      I can easily trigger a BUG by copying the vmcore via "cp":
      
        systemd[1]: Starting Kdump Vmcore Save Service...
        kdump[420]: Kdump is using the default log level(3).
        kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
        kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
        kdump[465]: saving vmcore-dmesg.txt complete
        kdump[467]: saving vmcore
        BUG: unable to handle page fault for address: 00007f2374e01000
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0003) - permissions violation
        PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
        Oops: 0003 [#1] PREEMPT SMP NOPTI
        CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
        RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
        Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
        RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
        RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
        RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
        RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
        R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
        R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
        FS:  00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
        Call Trace:
         read_vmcore+0x236/0x2c0
         proc_reg_read+0x55/0xa0
         vfs_read+0x95/0x190
         ksys_read+0x4f/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
      Prevention (SMAP)", which is used to detect wrong access from the kernel
      to user buffers like this: SMAP triggers a permissions violation on
      wrong access.  In the x86-64 variant of clear_user(), SMAP is properly
      handled via clac()+stac().
      
      To fix, properly use clear_user() when we're dealing with a user buffer.
      
      Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com
      Fixes: 997c136f ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Baoquan He <bhe@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Philipp Rudo <prudo@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1e63117
    • Ard Biesheuvel's avatar
      kmap_local: don't assume kmap PTEs are linear arrays in memory · 825c43f5
      Ard Biesheuvel authored
      The kmap_local conversion broke the ARM architecture, because the new
      code assumes that all PTEs used for creating kmaps form a linear array
      in memory, and uses array indexing to look up the kmap PTE belonging to
      a certain kmap index.
      
      On ARM, this cannot work, not only because the PTE pages may be
      non-adjacent in memory, but also because ARM/!LPAE interleaves hardware
      entries and extended entries (carrying software-only bits) in a way that
      is not compatible with array indexing.
      
      Fortunately, this only seems to affect configurations with more than 8
      CPUs, due to the way the per-CPU kmap slots are organized in memory.
      
      Work around this by permitting an architecture to set a Kconfig symbol
      that signifies that the kmap PTEs do not form a linear array in memory,
      and so the only way to locate the appropriate one is to walk the page
      tables.
      
      Link: https://lore.kernel.org/linux-arm-kernel/20211026131249.3731275-1-ardb@kernel.org/
      Link: https://lkml.kernel.org/r/20211116094737.7391-1-ardb@kernel.org
      Fixes: 2a15ba82 ("ARM: highmem: Switch to generic kmap atomic")
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Reported-by: Quanyang Wang <quanyang.wang@windriver.com>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      825c43f5
    • SeongJae Park's avatar
      mm/damon/dbgfs: fix missed use of damon_dbgfs_lock · d78f3853
      SeongJae Park authored
      DAMON debugfs is supposed to protect dbgfs_ctxs, dbgfs_nr_ctxs, and
      dbgfs_dirs using damon_dbgfs_lock.  However, some of the code is
      accessing the variables without the protection.  This fixes it by
      protecting all such accesses.
      
      Link: https://lkml.kernel.org/r/20211110145758.16558-3-sj@kernel.org
      Fixes: 75c1c2b5 ("mm/damon/dbgfs: support multiple contexts")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d78f3853
    • SeongJae Park's avatar
      mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation · db7a347b
      SeongJae Park authored
      Patch series "DAMON fixes".
      
      This patch (of 2):
      
      DAMON users can trigger the warning below in '__alloc_pages()' by
      invoking write() on some DAMON debugfs files with an arbitrarily high
      count argument, because the DAMON debugfs interface allocates some
      buffers based on the user-specified 'count'.
      
              if (unlikely(order >= MAX_ORDER)) {
                      WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
                      return NULL;
              }
      
      Because the DAMON debugfs interface code checks failure of the
      'kmalloc()', this commit simply suppresses the warnings by adding
      '__GFP_NOWARN' flag.
      
      Link: https://lkml.kernel.org/r/20211110145758.16558-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20211110145758.16558-2-sj@kernel.org
      Fixes: 4bc05954 ("mm/damon: implement a debugfs-based user space interface")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db7a347b
    • Kees Cook's avatar
      kasan: test: silence intentional read overflow warnings · cab71f74
      Kees Cook authored
      As done in commit d73dad4e ("kasan: test: bypass __alloc_size
      checks") for __write_overflow warnings, also silence some more cases
      that trip the __read_overflow warnings seen in 5.16-rc1[1]:
      
        In file included from include/linux/string.h:253,
                         from include/linux/bitmap.h:10,
                         from include/linux/cpumask.h:12,
                         from include/linux/mm_types_task.h:14,
                         from include/linux/mm_types.h:5,
                         from include/linux/page-flags.h:13,
                         from arch/arm64/include/asm/mte.h:14,
                         from arch/arm64/include/asm/pgtable.h:12,
                         from include/linux/pgtable.h:6,
                         from include/linux/kasan.h:29,
                         from lib/test_kasan.c:10:
        In function 'memcmp',
            inlined from 'kasan_memcmp' at lib/test_kasan.c:897:2:
        include/linux/fortify-string.h:263:25: error: call to '__read_overflow' declared with attribute error: detected read beyond size of object (1st parameter)
          263 |                         __read_overflow();
              |                         ^~~~~~~~~~~~~~~~~
        In function 'memchr',
            inlined from 'kasan_memchr' at lib/test_kasan.c:872:2:
        include/linux/fortify-string.h:277:17: error: call to '__read_overflow' declared with attribute error: detected read beyond size of object (1st parameter)
          277 |                 __read_overflow();
              |                 ^~~~~~~~~~~~~~~~~
      
      [1] http://kisskb.ellerman.id.au/kisskb/buildresult/14660585/log/
      
      Link: https://lkml.kernel.org/r/20211116004111.3171781-1-keescook@chromium.org
      Fixes: d73dad4e ("kasan: test: bypass __alloc_size checks")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cab71f74
    • Mina Almasry's avatar
      hugetlb, userfaultfd: fix reservation restore on userfaultfd error · cc30042d
      Mina Almasry authored
      Currently in the is_continue case in hugetlb_mcopy_atomic_pte(), if we
      bail out using "goto out_release_unlock;" in the cases where idx >=
      size, or !huge_pte_none(), the code will detect that new_pagecache_page
      == false, and so call restore_reserve_on_error().  In this case I see
      restore_reserve_on_error() delete the reservation, and the following
      call to remove_inode_hugepages() will increment h->resv_hugepages
      causing a 100% reproducible leak.
      
      We should treat the is_continue case similar to adding a page into the
      pagecache and set new_pagecache_page to true, to indicate that there is
      no reservation to restore on the error path, and we need not call
      restore_reserve_on_error().  Rename new_pagecache_page to
      page_in_pagecache to make that clear.
      
      Link: https://lkml.kernel.org/r/20211117193825.378528-1-almasrymina@google.com
      Fixes: c7b1850d ("hugetlb: don't pass page cache pages to restore_reserve_on_error")
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Reported-by: James Houghton <jthoughton@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc30042d
    • Bui Quang Minh's avatar
      hugetlb: fix hugetlb cgroup refcounting during mremap · afe041c2
      Bui Quang Minh authored
      When hugetlb_vm_op_open() is called during copy_vma(), we may take the
      reference to resv_map->css.  Later, when clearing the reservation
      pointer of old_vma after transferring it to new_vma, we forget to drop
      the reference to resv_map->css.  This leads to a reference leak of css.
      
      Fix this by adding a check to drop the reservation css reference in
      clear_vma_resv_huge_pages().
      
      Link: https://lkml.kernel.org/r/20211113154412.91134-1-minhquangbui99@gmail.com
      Fixes: 550a7d60 ("mm, hugepages: add mremap() support for hugepage backed vma")
      Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mina Almasry <almasrymina@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      afe041c2
    • Rustam Kovhaev's avatar
      mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag · 34dbc3aa
      Rustam Kovhaev authored
      When kmemleak is enabled for SLOB, the system does not boot and does
      not print anything to the console.  At a very early stage in the boot
      process we hit infinite recursion from kmemleak_init() and eventually
      the kernel crashes.
      
      kmemleak_init() specifies SLAB_NOLEAKTRACE for KMEM_CACHE(), but
      kmem_cache_create_usercopy() removes it because CACHE_CREATE_MASK is not
      valid for SLOB.
      
      Let's fix CACHE_CREATE_MASK and make kmemleak work with SLOB.
      
      Link: https://lkml.kernel.org/r/20211115020850.3154366-1-rkovhaev@gmail.com
      Fixes: d8843922 ("slab: Ignore internal flags in cache creation")
      Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34dbc3aa
    • Nathan Chancellor's avatar
      hexagon: ignore vmlinux.lds · eaac2f89
      Nathan Chancellor authored
      
      
      After building allmodconfig, there is an untracked vmlinux.lds file in
      arch/hexagon/kernel:
      
          $ git ls-files . --exclude-standard --others
          arch/hexagon/kernel/vmlinux.lds
      
      Ignore it as all other architectures have.
      
      Link: https://lkml.kernel.org/r/20211115174250.1994179-4-nathan@kernel.org
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eaac2f89
    • Nathan Chancellor's avatar
      hexagon: clean up timer-regs.h · 51f2ec59
      Nathan Chancellor authored
      When building allmodconfig, there is a warning about TIMER_ENABLE being
      redefined:
      
        drivers/clocksource/timer-oxnas-rps.c:39:9: error: 'TIMER_ENABLE' macro redefined [-Werror,-Wmacro-redefined]
        #define TIMER_ENABLE            BIT(7)
                ^
        arch/hexagon/include/asm/timer-regs.h:13:9: note: previous definition is here
        #define TIMER_ENABLE            0
                 ^
        1 error generated.
      
      The values in this header are only used in one file each, if they are
      used at all.  Remove the header and sink all of the constants into their
      respective files.
      
      TCX0_CLK_RATE is only used in arch/hexagon/include/asm/timex.h
      
      TIMER_ENABLE, RTOS_TIMER_INT, RTOS_TIMER_REGS_ADDR are only used in
      arch/hexagon/kernel/time.c.
      
      SLEEP_CLK_RATE and TIMER_CLR_ON_MATCH have both been unused since the
      file's introduction in commit 71e4a47f ("Hexagon: Add time and timer
      functions").
      
      TIMER_ENABLE is redefined as BIT(0) so the shift is moved into the
      definition, rather than its use.
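
      The change of TIMER_ENABLE from a bit position to a bit mask can be
      illustrated with a small userspace sketch.  The BIT() macro here is a
      stand-in for the kernel's, and OLD_TIMER_ENABLE_SHIFT and the two
      helper functions are hypothetical names used only for illustration;
      they are not taken from arch/hexagon/kernel/time.c.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's BIT() macro. */
#define BIT(n) (1UL << (n))

/* Old style: the macro is a bit position, so every user has to shift.
 * OLD_TIMER_ENABLE_SHIFT is a hypothetical name for the old value 0. */
#define OLD_TIMER_ENABLE_SHIFT 0

/* New style: the macro is the mask itself, matching other drivers. */
#define TIMER_ENABLE BIT(0)

/* Hypothetical helper: value written to an enable register, old style. */
static inline uint32_t enable_value_old(void)
{
	return (uint32_t)(1UL << OLD_TIMER_ENABLE_SHIFT);
}

/* Hypothetical helper: the same value with the shift in the definition. */
static inline uint32_t enable_value_new(void)
{
	return (uint32_t)TIMER_ENABLE;
}

/* Both styles must produce the same register value. */
static inline void check_equivalent(void)
{
	assert(enable_value_old() == enable_value_new());
}
```

      Both helpers yield the same value, so behaviour is unchanged; only
      the place where the shift happens moves.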
      
      Link: https://lkml.kernel.org/r/20211115174250.1994179-3-nathan@kernel.org
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Acked-by: Brian Cain <bcain@codeaurora.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51f2ec59
    • Nathan Chancellor's avatar
      hexagon: export raw I/O routines for modules · ffb92ce8
      Nathan Chancellor authored
      Patch series "Fixes for ARCH=hexagon allmodconfig", v2.
      
      This series fixes some issues noticed with ARCH=hexagon allmodconfig.
      
      This patch (of 3):
      
      When building ARCH=hexagon allmodconfig, the following errors occur:
      
        ERROR: modpost: "__raw_readsl" [drivers/i3c/master/svc-i3c-master.ko] undefined!
        ERROR: modpost: "__raw_writesl" [drivers/i3c/master/dw-i3c-master.ko] undefined!
        ERROR: modpost: "__raw_readsl" [drivers/i3c/master/dw-i3c-master.ko] undefined!
        ERROR: modpost: "__raw_writesl" [drivers/i3c/master/i3c-master-cdns.ko] undefined!
        ERROR: modpost: "__raw_readsl" [drivers/i3c/master/i3c-master-cdns.ko] undefined!
      
      Export these symbols so that modules can use them without any errors.
      
      Link: https://lkml.kernel.org/r/20211115174250.1994179-1-nathan@kernel.org
      Link: https://lkml.kernel.org/r/20211115174250.1994179-2-nathan@kernel.org
      Fixes: 013bf24c ("Hexagon: Provide basic implementation and/or stubs for I/O routines.")
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Acked-by: Brian Cain <bcain@codeaurora.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffb92ce8
    • Yunfeng Ye's avatar
      mm: emit the "free" trace report before freeing memory in kmem_cache_free() · 9a543f00
      Yunfeng Ye authored
      
      
      After the memory is freed, it can be immediately allocated by other
      CPUs, before the "free" trace report has been emitted.  This causes
      inaccurate traces.
      
      For example, if the following sequence of events occurs:
      
          CPU 0                 CPU 1
      
        (1) alloc xxxxxx
        (2) free  xxxxxx
                               (3) alloc xxxxxx
                               (4) free  xxxxxx
      
      Then they will be inaccurately reported via tracing, so that they appear
      to have happened in this order:
      
          CPU 0                 CPU 1
      
        (1) alloc xxxxxx
                               (2) alloc xxxxxx
        (3) free  xxxxxx
                               (4) free  xxxxxx
      
      This makes it look like CPU 1 somehow managed to allocate memory that
      CPU 0 still had allocated for itself.
      
      In order to avoid this, emit the "free xxxxxx" tracing report just
      before the actual call to free the memory, instead of just after it.
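
      The ordering described above can be sketched in userspace C.  This is
      only an illustration of "record the free, then free": the in-memory
      log, trace(), traced_alloc() and traced_free() are hypothetical names,
      and malloc()/free() stand in for the slab allocator.

```c
#include <assert.h>
#include <stdlib.h>

enum event { EV_ALLOC, EV_FREE };

#define LOG_CAP 16
static enum event log_buf[LOG_CAP];	/* hypothetical in-memory trace log */
static int log_len;

/* Append one event to the log. */
static void trace(enum event ev)
{
	assert(log_len < LOG_CAP);
	log_buf[log_len++] = ev;
}

/* Allocate, then report the allocation. */
static void *traced_alloc(size_t size)
{
	void *p = malloc(size);

	trace(EV_ALLOC);
	return p;
}

/*
 * Report the free first, then release the memory.  Once free() returns,
 * another CPU could obtain the same address and log its own "alloc", so
 * the "free" record has to be in the log before that can happen.
 */
static void traced_free(void *p)
{
	trace(EV_FREE);
	free(p);
}
```

      With this ordering, a "free" record for an object always precedes
      any later "alloc" record that reuses the same memory.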
      
      Link: https://lkml.kernel.org/r/374eb75d-7404-8721-4e1e-65b0e5b17279@huawei.com
      Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a543f00
    • Alexander Mikhalitsyn's avatar
      shm: extend forced shm destroy to support objects from several IPC nses · 85b6d246
      Alexander Mikhalitsyn authored
      Currently, the exit_shm() function is not designed to work properly
      when task->sysvshm.shm_clist holds shm objects from different IPC
      namespaces.

      This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
      leads to use-after-free (reproducer exists).

      This is an attempt to fix the problem by extending the exit_shm()
      mechanism to handle destruction of shm objects from several IPC
      namespaces.
      
      To achieve that we do several things:
      
      1. Add a (non-refcounted) namespace pointer to struct shmid_kernel.

      2. During new shm object creation (newseg()/shmget syscall),
         initialize this pointer from the current task's IPC ns.

      3. Fully rework exit_shm() so that it traverses all shp's in
         task->sysvshm.shm_clist and gets the IPC namespace not from the
         current task, as it did before, but from the shp object itself,
         then calls shm_destroy(shp, ns).
      
      Note: We need to be really careful here because, as said in (1), our
      pointer to the IPC ns is not refcounted.  To be on the safe side we
      use the special helper get_ipc_ns_not_zero(), which takes a reference
      on the IPC ns only if the IPC ns is not in the "state of destruction".
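
      The "take a reference only if the object is not already being
      destroyed" pattern can be sketched in userspace with C11 atomics.
      This is not the kernel's get_ipc_ns_not_zero(); struct ns_sketch,
      ns_get_not_zero() and ns_put() are hypothetical names, and the sketch
      assumes that a count of zero means teardown has started.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted namespace object. */
struct ns_sketch {
	atomic_int count;	/* assumption: 0 means teardown has started */
};

/* Take a reference only if the count has not already dropped to zero. */
static bool ns_get_not_zero(struct ns_sketch *ns)
{
	int old = atomic_load(&ns->count);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&ns->count, &old, old + 1))
			return true;
	}
	return false;	/* object is going away; the caller must not use it */
}

/* Drop a reference previously taken. */
static void ns_put(struct ns_sketch *ns)
{
	int prev = atomic_fetch_sub(&ns->count, 1);

	assert(prev > 0);
	(void)prev;
}
```

      A caller holding only a non-refcounted pointer checks the return
      value and skips the object when the function returns false.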
      
      Q/A
      
      Q: Why can we access shp->ns memory using a non-refcounted pointer?
      A: Because the shp object lifetime is always shorter than the IPC
         namespace lifetime, so if we get the shp object from
         task->sysvshm.shm_clist while holding task_lock(task), nobody can
         steal our namespace.

      Q: Does this patch change semantics of unshare/setns/clone syscalls?
      A: No. It just fixes the uncovered case in which a process may leave
         an IPC namespace without getting its task->sysvshm.shm_clist list
         cleaned up.
      
      Link: https://lkml.kernel.org/r/67bb03e5-f79c-1815-e2bf-949c67047418@colorfullife.com
      Link: https://lkml.kernel.org/r/20211109151501.4921-1-manfred@colorfullife.com
      Fixes: ab602f79 ("shm: make exit_shm work proportional to task activity")
      Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85b6d246
    • Alexander Mikhalitsyn's avatar
      ipc: WARN if trying to remove ipc object which is absent · 126e8bee
      Alexander Mikhalitsyn authored
      
      
      Patch series "shm: shm_rmid_forced feature fixes".
      
      Some time ago I hit a kernel crash after a CRIU restore procedure.
      Fortunately, because it was a CRIU restore, I had dump files and could
      repeat the restore many times, and the crash reproduced easily.  After
      some investigation I constructed a minimal reproducer.  It turned out
      to be a use-after-free that happens only if sysctl
      kernel.shm_rmid_forced = 1.

      The key to the problem is that the exit_shm() function does not handle
      destruction of shp objects when task->sysvshm.shm_clist contains items
      from different IPC namespaces.  In most cases this list will contain
      only items from one IPC namespace.

      How can this list contain objects from different namespaces? The
      exit_shm() function is designed to always clean up this list when a
      process leaves an IPC namespace.  But we made a mistake a long time ago
      and did not add an exit_shm() call into the setns() syscall procedures.

      The first idea was just to add this call to the setns() syscall, but
      that obviously changes the semantics of setns() and is a
      userspace-visible change.  So, I gave up on this idea.
      
      The first real attempt to address the issue was just to omit the forced
      destroy if we meet an shp object that is not from the current task's
      IPC namespace [1].  But that was not the best idea, because
      task->sysvshm.shm_clist was protected by an rwsem which belongs to the
      current task's IPC namespace.  It means that list corruption may occur.

      The second approach is to extend exit_shm() to properly handle shp's
      from different IPC namespaces [2].  This is really non-trivial; I put a
      lot of effort into it but did not believe it was possible to make it
      fully safe, clean and clear.

      Thanks to the efforts of Manfred Spraul, an elegant working solution
      was designed.  Thanks a lot, Manfred!

      Eric also suggested a way to address the issue in ("[RFC][PATCH] shm:
      In shm_exit destroy all created and never attached segments").  Eric's
      idea was to maintain a list of shm_clists, one per IPC namespace, and
      use lock-less lists.  But there are some concerns about extra memory
      consumption.
      
      An alternative solution, which I suggested, was implemented in
      ("shm: reset shm_clist on setns but omit forced shm destroy").  The
      idea is pretty simple: we add an exit_shm() call to setns() but DO NOT
      destroy shm segments even if sysctl kernel.shm_rmid_forced = 1; we just
      clean up the task->sysvshm.shm_clist list.

      This changes the semantics of the setns() syscall a little bit, but in
      comparison to the "naive" solution, where we just add exit_shm()
      without any special exclusions, this looks like a safer option.
      
      [1] https://lkml.org/lkml/2021/7/6/1108
      [2] https://lkml.org/lkml/2021/7/14/736
      
      This patch (of 2):
      
      Let's produce a warning if we are trying to remove a non-existing IPC
      object from the IPC namespace kht/idr structures.
      
      This allows us to catch possible bugs when the ipc_rmid() function was
      called with inconsistent struct ipc_ids*, struct kern_ipc_perm*
      arguments.
      
      Link: https://lkml.kernel.org/r/20211027224348.611025-1-alexander.mikhalitsyn@virtuozzo.com
      Link: https://lkml.kernel.org/r/20211027224348.611025-2-alexander.mikhalitsyn@virtuozzo.com
      Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      126e8bee
    • Matthew Wilcox's avatar
      mm/swap.c:put_pages_list(): reinitialise the page list · 3cd018b4
      Matthew Wilcox authored
      While free_unref_page_list() puts pages onto the CPU local LRU list, it
      does not remove them from the list they were passed in on.  That makes
      the list_head appear to be non-empty, and would lead to various
      corruption problems if we didn't have an assertion that the list was
      empty.
      
      Reinitialise the list after calling free_unref_page_list() to avoid this
      problem.
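
      Why the caller's list head has to be reinitialised can be shown with
      a minimal userspace list sketch.  The list helpers below mimic the
      kernel's list_head API but are local re-implementations, and
      consume_all() and put_list() are hypothetical stand-ins for
      free_unref_page_list() and put_pages_list().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Make 'h' an empty list that points at itself. */
static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Link 'entry' in just before 'head', i.e. at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	assert(entry != NULL && head != NULL);
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/*
 * Hypothetical consumer: takes over every entry on the list but never
 * unlinks them from 'head', so 'head' still looks non-empty afterwards.
 */
static int consume_all(struct list_head *head)
{
	int n = 0;

	for (struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Hand the entries off, then reset the caller's head to empty. */
static int put_list(struct list_head *head)
{
	int n = consume_all(head);

	init_list_head(head);
	return n;
}
```

      After consume_all() alone the head still reports entries it no
      longer owns; put_list() leaves it empty and safe to reuse.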
      
      Link: https://lkml.kernel.org/r/YYp40A2lNrxaZji8@casper.infradead.org
      Fixes: 988c69f1 ("mm: optimise put_pages_list()")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Steve French <stfrench@microsoft.com>
      Reported-by: Namjae Jeon <linkinjeon@kernel.org>
      Tested-by: Steve French <stfrench@microsoft.com>
      Tested-by: Namjae Jeon <linkinjeon@kernel.org>
      Cc: Steve French <smfrench@gmail.com>
      Cc: Hyeoncheol Lee <hyc.lee@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3cd018b4
  6. Nov 20, 2021
    • Linus Torvalds's avatar
      Merge tag 'libata-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata · a90af8f1
      Linus Torvalds authored
      Pull libata fixes from Damien Le Moal:
      
       - Prevent accesses to unsupported log pages as that causes device scan
         failures with LLDDs using libsas (from me).
      
       - A couple of fixes for AMD AHCI adapters handling of low power modes
         and resume (from Mario).
      
       - Fix a compilation warning (from me).
      
      * tag 'libata-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
        ata: libata-sata: Declare ata_ncq_sdev_attrs static
        ata: libahci: Adjust behavior when StorageD3Enable _DSD is set
        ata: ahci: Add Green Sardine vendor ID as board_ahci_mobile
        ata: libata: add missing ata_identify_page_supported() calls
        ata: libata: improve ata_read_log_page() error message
      a90af8f1
    • Linus Torvalds's avatar
      Merge tag 'trace-v5.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · e4365e36
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Fix double free in destroy_hist_field
      
       - Harden memset() of trace_iterator structure
      
       - Do not warn in trace printk check when test buffer fills up
      
      * tag 'trace-v5.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        tracing: Don't use out-of-sync va_list in event printing
        tracing: Use memset_startat() to zero struct trace_iterator
        tracing/histogram: Fix UAF in destroy_hist_field()
      e4365e36