  1. Apr 14, 2022
    • vdpa/mlx5: Rename control VQ workqueue to vdpa wq · dc872b72
      Eli Cohen authored
      [ Upstream commit 218bdd20 ]
      
      A subsequent patch will use the same workqueue for executing other
      work not related to control VQ. Rename the workqueue and the work queue
      entry used to convey information to the workqueue.
      
      Signed-off-by: Eli Cohen <elic@nvidia.com>
      Link: https://lore.kernel.org/r/20210909123635.30884-3-elic@nvidia.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: zorro7xx: Fix a resource leak in zorro7xx_remove_one() · aefd755a
      Christophe JAILLET authored
      [ Upstream commit 16ed828b ]
      
      The error handling path of the probe releases a resource that is not freed
      in the remove function. In some cases, an ioremap() must be undone.
      
      Add the missing iounmap() call in the remove function.
      
      Link: https://lore.kernel.org/r/247066a3104d25f9a05de8b3270fc3c848763bcc.1647673264.git.christophe.jaillet@wanadoo.fr
      Fixes: 45804fbb ("[SCSI] 53c700: Amiga Zorro NCR53c710 SCSI")
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: core: Fix sbitmap depth in scsi_realloc_sdev_budget_map() · cd483e17
      John Garry authored
      [ Upstream commit eaba83b5 ]
      
      In commit edb854a3 ("scsi: core: Reallocate device's budget map on
      queue depth change"), the sbitmap for the device budget map may be
      reallocated after the slave device depth is configured.
      
      When the sbitmap is reallocated we use the result from
      scsi_device_max_queue_depth() for the sbitmap size, but don't resize to
      match the actual device queue depth.
      
      Fix by resizing the sbitmap after reallocating the budget sbitmap. We do
      this instead of init'ing the sbitmap to the device queue depth as the user
      may want to change the queue depth later via sysfs or other means.
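
      The allocate-at-maximum, then resize-to-current pattern described above
      can be sketched in userspace. This is a toy model, not the SCSI core
      code: struct toy_bitmap and every toy_* function are hypothetical
      stand-ins for the kernel's sbitmap and its helpers, and one byte per bit
      is used only to keep the model short.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for an sbitmap: storage for alloc_depth bits,
 * of which only the first 'depth' are handed out. */
struct toy_bitmap {
	unsigned int alloc_depth;	/* bits the storage can hold */
	unsigned int depth;		/* bits currently usable */
	unsigned char *bits;		/* one byte per bit, for simplicity */
};

int toy_bitmap_init(struct toy_bitmap *bm, unsigned int depth)
{
	bm->bits = calloc(depth ? depth : 1, sizeof(*bm->bits));
	if (!bm->bits)
		return -1;
	bm->alloc_depth = depth;
	bm->depth = depth;
	return 0;
}

/* Shrink or grow the usable depth without reallocating storage. */
void toy_bitmap_resize(struct toy_bitmap *bm, unsigned int depth)
{
	assert(depth <= bm->alloc_depth);
	bm->depth = depth;
}

void toy_bitmap_free(struct toy_bitmap *bm)
{
	free(bm->bits);
	bm->bits = NULL;
	bm->alloc_depth = 0;
	bm->depth = 0;
}

/* Reallocate for the maximum depth, then resize to the depth in use.
 * Skipping the resize step leaves depth == max_depth, which is the
 * mismatch the commit message describes. */
int toy_realloc_budget_map(struct toy_bitmap *bm, unsigned int max_depth,
			   unsigned int queue_depth)
{
	toy_bitmap_free(bm);
	if (toy_bitmap_init(bm, max_depth))
		return -1;
	toy_bitmap_resize(bm, queue_depth);
	return 0;
}
```

      Because the storage stays sized for the maximum, a later depth change
      only needs another toy_bitmap_resize() call and no reallocation.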
      
      Link: https://lore.kernel.org/r/1647423870-143867-1-git-send-email-john.garry@huawei.com
      Fixes: edb854a3 ("scsi: core: Reallocate device's budget map on queue depth change")
      Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: sr: Fix typo in CDROM(CLOSETRAY|EJECT) handling · 0610371c
      Kevin Groeneveld authored
      [ Upstream commit bc5519c1 ]
      
      Commit 2e27f576 ("scsi: scsi_ioctl: Call scsi_cmd_ioctl() from
      scsi_ioctl()") seems to have a typo as it is checking ret instead of cmd in
      the if statement checking for CDROMCLOSETRAY and CDROMEJECT.  This changes
      the behaviour of these ioctls as the cdrom_ioctl handling of these is more
      restrictive than the scsi_ioctl version.
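
      The effect of testing the wrong variable can be shown with a small
      model. This is a sketch under assumptions, not the sr driver code: the
      enum and both function names are hypothetical, and the only claim taken
      from the text above is that the condition compared ret where it should
      have compared cmd. The two ioctl numbers are the values from
      <linux/cdrom.h>, repeated here so the block compiles on its own.

```c
#include <assert.h>

/* ioctl numbers as defined in <linux/cdrom.h> */
#define CDROMEJECT	0x5309
#define CDROMCLOSETRAY	0x5319

/* Which handler gets the first chance at an ioctl (hypothetical names). */
enum toy_handler { TOY_HANDLER_CDROM, TOY_HANDLER_SCSI };

/* Typo: 'ret' is a status code, so it practically never equals an
 * ioctl number and the cdrom handler is always tried first. */
enum toy_handler toy_first_handler_buggy(unsigned int cmd, int ret)
{
	(void)cmd;
	if (ret != CDROMCLOSETRAY && ret != CDROMEJECT)
		return TOY_HANDLER_CDROM;
	return TOY_HANDLER_SCSI;
}

/* Intended check: tray commands bypass the cdrom handler. */
enum toy_handler toy_first_handler_fixed(unsigned int cmd)
{
	assert(cmd != 0);
	if (cmd != CDROMCLOSETRAY && cmd != CDROMEJECT)
		return TOY_HANDLER_CDROM;
	return TOY_HANDLER_SCSI;
}
```

      With the typo, CDROMEJECT and CDROMCLOSETRAY take the same route as every
      other command; with the fix, only those two go straight to the SCSI
      handler.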
      
      Link: https://lore.kernel.org/r/20220323002242.21157-1-kgroeneveld@lenbrook.com
      Fixes: 2e27f576 ("scsi: scsi_ioctl: Call scsi_cmd_ioctl() from scsi_ioctl()")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Kevin Groeneveld <kgroeneveld@lenbrook.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • NFSv4: fix open failure with O_ACCMODE flag · 6f52d4cd
      ChenXiaoSong authored
      [ Upstream commit b243874f ]
      
      A second open() with O_ACCMODE|O_DIRECT flags fails.
      
      Reproducer:
        1. mount -t nfs -o vers=4.2 $server_ip:/ /mnt/
        2. fd = open("/mnt/file", O_ACCMODE|O_DIRECT|O_CREAT)
        3. close(fd)
        4. fd = open("/mnt/file", O_ACCMODE|O_DIRECT)
      
      On the server, nfsd4_decode_share_access() fails with nfserr_bad_xdr when
      the client uses an incorrect share access mode of 0.

      Fix this by using the NFS4_SHARE_ACCESS_BOTH share access mode in the client,
      just like on the first open.
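
      A minimal sketch of the intended mapping from open flags to a share
      access mode follows. It is an illustration only: toy_share_access() is a
      hypothetical helper, not the NFS client function changed by this commit,
      and it assumes Linux, where O_ACCMODE is 3 and the special access mode is
      the value with both low bits set. The NFS4_SHARE_ACCESS_* values are the
      protocol constants (1, 2, 3).

```c
#include <assert.h>
#include <fcntl.h>

/* NFSv4 OPEN share access values */
#define NFS4_SHARE_ACCESS_READ	1u
#define NFS4_SHARE_ACCESS_WRITE	2u
#define NFS4_SHARE_ACCESS_BOTH	3u

/* Map the access-mode bits of open(2) flags to a share access mode.
 * The result is never 0, which a server rejects as malformed. */
unsigned int toy_share_access(int flags)
{
	unsigned int access;

	switch (flags & O_ACCMODE) {
	case O_RDONLY:
		access = NFS4_SHARE_ACCESS_READ;
		break;
	case O_WRONLY:
		access = NFS4_SHARE_ACCESS_WRITE;
		break;
	case O_RDWR:
		access = NFS4_SHARE_ACCESS_BOTH;
		break;
	default:
		/* Special Linux access mode (all O_ACCMODE bits set). */
		access = NFS4_SHARE_ACCESS_BOTH;
		break;
	}
	assert(access != 0);
	return access;
}
```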
      
      Fixes: ce4ef7c0 ("NFS: Split out NFS v4 file operations")
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Revert "NFSv4: Handle the special Linux file open access mode" · 9f0c2174
      ChenXiaoSong authored
      [ Upstream commit ab0fc21b ]
      
      This reverts commit 44942b4e.

      After opening a file a second time with O_ACCMODE|O_DIRECT flags,
      nfs4_valid_open_stateid() dereferences a NULL nfs4_state on lseek().
      
      Reproducer:
        1. mount -t nfs -o vers=4.2 $server_ip:/ /mnt/
        2. fd = open("/mnt/file", O_ACCMODE|O_DIRECT|O_CREAT)
        3. close(fd)
        4. fd = open("/mnt/file", O_ACCMODE|O_DIRECT)
        5. lseek(fd)
      
      Reported-by: Lyu Tao <tao.lyu@epfl.ch>
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Drivers: hv: vmbus: Fix potential crash on module unload · dcd6b1a6
      Guilherme G. Piccoli authored
      [ Upstream commit 792f232d ]
      
      The vmbus driver relies on the panic notifier infrastructure to perform
      some operations when a panic event is detected. Since vmbus can be built
      as a module, the driver must handle both registering and
      unregistering the panic notifier callback.
      
      After commit 74347a99 ("x86/Hyper-V: Unload vmbus channel in hv panic callback")
      though, the panic notifier registration is done unconditionally in the module
      initialization routine whereas the unregistering procedure is conditionally
      guarded and executes only if HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE capability
      is set.
      
      This patch fixes that by unconditionally unregistering the panic notifier
      in the module's exit routine as well.
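
      The asymmetry can be reduced to a counter. The sketch below is a toy
      model with hypothetical names (toy_registered and the toy_module_*
      functions); it does not use the kernel notifier API and only mirrors the
      register-always, unregister-sometimes pattern described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of panic notifier callbacks currently registered (toy state). */
int toy_registered;

/* Module init: registration happens unconditionally. */
void toy_module_init(void)
{
	toy_registered++;
}

/* Exit path before the fix: unregistration is guarded by a capability
 * check, so a callback can stay registered after the module is gone. */
void toy_module_exit_buggy(bool crash_msr_available)
{
	if (crash_msr_available)
		toy_registered--;
}

/* Exit path after the fix: always undo what init did. */
void toy_module_exit_fixed(void)
{
	assert(toy_registered > 0);
	toy_registered--;
}
```

      In the buggy variant without the capability, the counter stays at 1 after
      exit, which corresponds to a notifier entry pointing into unloaded module
      code.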
      
      Fixes: 74347a99 ("x86/Hyper-V: Unload vmbus channel in hv panic callback")
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
      Reviewed-by: Michael Kelley <mikelley@microsoft.com>
      Link: https://lore.kernel.org/r/20220315203535.682306-1-gpiccoli@igalia.com
      Signed-off-by: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • drm/amdgpu: fix off by one in amdgpu_gfx_kiq_acquire() · 5ba9d78a
      Dan Carpenter authored
      [ Upstream commit 1647b54e ]
      
      This post-op should be a pre-op so that we do not pass -1 as the bit
      number to test_bit().  The current code will loop downwards from 63 to
      -1.  After changing to a pre-op, it loops from 63 to 0.
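
      The difference between the two loop forms can be checked in isolation.
      The sketch below is not the driver code: TOY_MAX_QUEUES and both function
      names are hypothetical, and the start value of 64 is inferred from the
      "loop downwards from 63" wording above. Each function returns the lowest
      index that the loop body observes.

```c
#include <assert.h>

#define TOY_MAX_QUEUES 64

/* Post-decrement: the comparison sees the old value, so the body
 * runs once more with queue_bit == -1. */
int toy_lowest_index_post_op(void)
{
	int queue_bit = TOY_MAX_QUEUES;
	int lowest = TOY_MAX_QUEUES;

	while (queue_bit-- >= 0)
		lowest = queue_bit;	/* sees 63, 62, ..., 0, -1 */
	return lowest;
}

/* Pre-decrement: the comparison sees the new value, so the body
 * stops after queue_bit == 0. */
int toy_lowest_index_pre_op(void)
{
	int queue_bit = TOY_MAX_QUEUES;
	int lowest = TOY_MAX_QUEUES;

	while (--queue_bit >= 0)
		lowest = queue_bit;	/* sees 63, 62, ..., 0 */
	return lowest;
}
```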
      
      Fixes: 71c37505 ("drm/amdgpu/gfx: move more common KIQ code to amdgpu_gfx.c")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • rtc: mc146818-lib: fix RTC presence check · 985d87e6
      Mateusz Jończyk authored
      [ Upstream commit ea6fa496 ]
      
      To prevent an infinite loop in mc146818_get_time(),
      commit 211e5db1 ("rtc: mc146818: Detect and handle broken RTCs")
      added a check for RTC availability. Together with a later fix, it
      checked if bit 6 in register 0x0d is cleared.
      
      This, however, caused a false negative on a motherboard with an AMD
      SB710 southbridge; according to the specification [1], bit 6 of register
      0x0d of this chipset is a scratchbit. This caused a regression in Linux
      5.11 - the RTC was determined broken by the kernel and not used by
      rtc-cmos.c [3]. This problem was also reported in Fedora [4].
      
      As a better alternative, check whether the UIP ("Update-in-progress")
      bit is set for longer than 10 ms. If that is the case, then apparently
      the RTC is either absent (and all register reads return 0xff) or broken.
      Also limit the number of loop iterations in mc146818_get_time() to 10 to
      prevent an infinite loop there.
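
      A bounded UIP poll of this kind might look like the following userspace
      sketch. It is a model under assumptions, not the code added to
      rtc-mc146818-lib.c: toy_rtc_responds() and the two simulated RTCs are
      hypothetical, register access is abstracted behind a function pointer,
      and each poll stands in for roughly 1 ms of waiting. RTC_FREQ_SELECT
      (register A) and RTC_UIP (0x80) follow the usual mc146818 definitions.

```c
#include <assert.h>
#include <stdbool.h>

#define RTC_FREQ_SELECT	10	/* register A, holds the UIP bit */
#define RTC_UIP		0x80	/* update-in-progress flag */

typedef unsigned char (*toy_reg_read_fn)(unsigned char reg);

/* Poll UIP up to max_polls times; report whether it ever cleared. */
bool toy_rtc_responds(toy_reg_read_fn read_reg, int max_polls)
{
	assert(read_reg);
	for (int i = 0; i < max_polls; i++) {
		if (!(read_reg(RTC_FREQ_SELECT) & RTC_UIP))
			return true;
		/* Real code would delay about 1 ms between polls. */
	}
	return false;
}

/* Absent RTC: every register read returns 0xff, so UIP never clears. */
unsigned char toy_absent_rtc(unsigned char reg)
{
	(void)reg;
	return 0xff;
}

/* Working RTC: UIP is set for the first three reads, then clears. */
unsigned char toy_working_rtc(unsigned char reg)
{
	static int busy_reads = 3;

	(void)reg;
	if (busy_reads > 0) {
		busy_reads--;
		return RTC_UIP | 0x26;
	}
	return 0x26;
}
```

      The loop bound is what guarantees termination for an absent or broken
      RTC; a working one leaves the loop as soon as the update cycle ends.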
      
      The functions mc146818_get_time() and mc146818_does_rtc_work() will be
      refactored later in this patch series, in order to fix a separate
      problem with reading / setting the RTC alarm time. This is done to
      avoid confusion about what is being fixed when.
      
      In a previous approach to this problem, I implemented a check whether
      the RTC_HOURS register contains a value <= 24. This, however, sometimes
      did not work correctly on my Intel Kaby Lake laptop. According to
      Intel's documentation [2], "the time and date RAM locations (0-9) are
      disconnected from the external bus" during the update cycle so reading
      this register without checking the UIP bit is incorrect.
      
      [1] AMD SB700/710/750 Register Reference Guide, page 308,
      https://developer.amd.com/wordpress/media/2012/10/43009_sb7xx_rrg_pub_1.00.pdf
      
      [2] 7th Generation Intel ® Processor Family I/O for U/Y Platforms [...] Datasheet
      Volume 1 of 2, page 209
      Intel's Document Number: 334658-006,
      https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/7th-and-8th-gen-core-family-mobile-u-y-processor-lines-i-o-datasheet-vol-1.pdf
      
      [3] Functions in arch/x86/kernel/rtc.c apparently were using it.
      
      [4] https://bugzilla.redhat.com/show_bug.cgi?id=1936688
      
      Fixes: 211e5db1 ("rtc: mc146818: Detect and handle broken RTCs")
      Fixes: ebb22a05 ("rtc: mc146818: Dont test for bit 0-5 in Register D")
      Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20211210200131.153887-5-mat.jonczyk@o2.pl
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • rtc: Check return value from mc146818_get_time() · be6c3152
      Mateusz Jończyk authored
      [ Upstream commit 0dd8d6cb ]
      
      There are 4 users of mc146818_get_time() and none of them was checking
      the return value from this function. Change this.
      
      Print the appropriate warnings in callers of mc146818_get_time() instead
      of in the function mc146818_get_time() itself, in order not to add
      strings to rtc-mc146818-lib.c, which is kind of a library.
      
      The callers of alpha_rtc_read_time() and cmos_read_time() may use the
      contents of (struct rtc_time *) even when the functions return a failure
      code. Therefore, set the contents of (struct rtc_time *) to 0x00,
      which looks more sensible than 0xff and aligns with the (possibly
      stale?) comment in cmos_read_time:
      
      	/*
      	 * If pm_trace abused the RTC for storage, set the timespec to 0,
      	 * which tells the caller that this RTC value is unusable.
      	 */
      
      For consistency, do this in mc146818_get_time().
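
      The resulting caller pattern can be sketched as follows. This is an
      illustration with hypothetical names: toy_get_time() stands in for the
      library function, toy_read_time() for one of its callers, struct tm for
      struct rtc_time, and -EIO is an assumed error code chosen for the
      example. What it shows is the split described above: the library zeroes
      the output and returns an error, and the caller checks the return value
      and prints the warning.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Library side: no strings here; zero the output so that stale or
 * 0xff-filled contents never reach callers that ignore the error. */
int toy_get_time(bool rtc_works, struct tm *out)
{
	assert(out);
	memset(out, 0, sizeof(*out));
	if (!rtc_works)
		return -EIO;	/* assumed error code, for illustration */
	out->tm_hour = 12;	/* pretend this was read from the RTC */
	return 0;
}

/* Caller side: check the return value and emit the warning here. */
int toy_read_time(bool rtc_works, struct tm *out)
{
	int ret = toy_get_time(rtc_works, out);

	if (ret < 0)
		fprintf(stderr, "toy: unable to read current time from RTC\n");
	return ret;
}
```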
      
      Note: hpet_rtc_interrupt() may call mc146818_get_time() many times a
      second. It is very unlikely, though, that the RTC suddenly stops
      working and mc146818_get_time() would consistently fail.
      
      Only compile-tested on alpha.
      
      Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: linux-alpha@vger.kernel.org
      Cc: x86@kernel.org
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20211210200131.153887-4-mat.jonczyk@o2.pl
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • rtc: mc146818-lib: change return values of mc146818_get_time() · 8c692107
      Mateusz Jończyk authored
      [ Upstream commit d35786b3 ]
      
      No function is checking mc146818_get_time() return values yet, so
      correct them to make them more customary.
      
      Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20211210200131.153887-3-mat.jonczyk@o2.pl
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm: fix race between MADV_FREE reclaim and blkdev direct IO read · c9f50e06
      Mauricio Faria de Oliveira authored
      commit 6c8e2a25 upstream.
      
      Problem:
      =======
      
      Userspace might read the zero-page instead of actual data from a direct IO
      read on a block device if madvise(MADV_FREE) was called on the buffers
      earlier (this is discussed below), due to a race between page reclaim on
      MADV_FREE and blkdev direct IO read.
      
      - Race condition:
        ==============
      
      During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
      if the page is not dirty, then discards its rmap PTE(s) (vs.  remap back
      if the page is dirty).
      
      However, after try_to_unmap_one() returns to shrink_page_list(), it might
      keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
      _one_ page reference, from the isolation for page reclaim).
      
      Well, blkdev_direct_IO() gets references for all pages, and on READ
      operations it only sets them dirty _later_.
      
      So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
      IO read from block devices, and page reclaim happens during
      __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
      returns, but BEFORE the pages are set dirty, the race is hit.
      
      The direct IO read eventually completes.  Now, when userspace reads the
      buffers, the PTE is no longer there and the page fault handler
      do_anonymous_page() services that with the zero-page, NOT the data!
      
      A synthetic reproducer is provided.
      
      - Page faults:
        ===========
      
      If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
      happen, because that faults-in all pages as writeable, so
      do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
      direct IO.  The userspace reads don't fault as the PTE is there (thus
      zero-page is not used/setup).
      
      But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
      is no longer there; the subsequent page faults can't help:
      
      The data-read from the block device probably won't generate faults due to
      DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
      different virtual addresses (not user-mapped addresses) because `struct
      bio_vec` stores `struct page` to figure addresses out (which are different
      from user-mapped addresses) for the read.
      
      Thus userspace reads (to user-mapped addresses) still fault, then
      do_anonymous_page() gets another `struct page` that would address/ map to
      other memory than the `struct page` used by `struct bio_vec` for the read.
      (The original `struct page` is not available, since it wasn't freed, as
      page_ref_freeze() failed due to more page refs.  And even if it were
      available, its data cannot be trusted anymore.)
      
      Solution:
      ========
      
      One solution is to check for the expected page reference count in
      try_to_unmap_one().
      
      There should be one reference from the isolation (that is also checked in
      shrink_page_list() with page_ref_freeze()) plus one or more references
      from page mapping(s) (put in discard: label).  Further references mean
      that rmap/PTE cannot be unmapped/nuked.
      
      (Note: there might be more than one reference from mapping due to
      fork()/clone() without CLONE_VM, which use the same `struct page` for
      references, until the copy-on-write page gets copied.)
      
      So, additional page references (e.g., from direct IO read) now prevent the
      rmap/PTE from being unmapped/dropped, similarly to how the page is not freed
      per shrink_page_list()/page_ref_freeze().
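
      The reference-count condition can be written down as a small predicate.
      The sketch below is a userspace model with hypothetical names (struct
      toy_page, toy_can_discard_lazyfree()); it deliberately leaves out the PTE
      lock and the memory barriers discussed in the next section and only
      encodes "one isolation reference plus one per mapping, and not dirty".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy view of the page state that matters for the check. */
struct toy_page {
	int ref_count;	/* all references held on the page */
	int map_count;	/* references that come from page mappings */
	bool dirty;	/* written since madvise(MADV_FREE)? */
};

/* A lazyfree page may be discarded only if nobody besides reclaim
 * isolation (1) and its mappings (map_count) holds a reference,
 * and it has not been dirtied. Otherwise the PTE must be kept. */
bool toy_can_discard_lazyfree(const struct toy_page *page)
{
	assert(page);
	return page->ref_count == 1 + page->map_count && !page->dirty;
}
```

      An in-flight direct IO read shows up as one extra reference, so the
      predicate is false and the mapping survives until the data has landed.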
      
      - Races and Barriers:
        ==================
      
      The new check in try_to_unmap_one() should be safe in races with
      bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
      done under the PTE lock.
      
      The fast path doesn't take the lock, but it checks if the PTE has changed
      and if so, it drops the reference and leaves the page for the slow path
      (which does take that lock).
      
      The fast path requires synchronization w/ full memory barrier: it writes
      the page reference count first then it reads the PTE later, while
      try_to_unmap() writes PTE first then it reads page refcount.
      
      And a second barrier is needed, as the page dirty flag should not be read
      before the page reference count (as in __remove_mapping()).  (This can be
      a load memory barrier only; no writes are involved.)
      
      Call stack/comments:
      
      - try_to_unmap_one()
        - page_vma_mapped_walk()
          - map_pte()			# see pte_offset_map_lock():
              pte_offset_map()
              spin_lock()
      
        - ptep_get_and_clear()	# write PTE
        - smp_mb()			# (new barrier) GUP fast path
        - page_ref_count()		# (new check) read refcount
      
        - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
            pte_unmap()
            spin_unlock()
      
      - bio_iov_iter_get_pages()
        - __bio_iov_iter_get_pages()
          - iov_iter_get_pages()
            - get_user_pages_fast()
              - internal_get_user_pages_fast()
      
                # fast path
                - lockless_pages_from_mm()
                  - gup_{pgd,p4d,pud,pmd,pte}_range()
                      ptep = pte_offset_map()		# not _lock()
                      pte = ptep_get_lockless(ptep)
      
                      page = pte_page(pte)
                      try_grab_compound_head(page)	# inc refcount
                                                  	# (RMW/barrier
                                                   	#  on success)
      
                      if (pte_val(pte) != pte_val(*ptep)) # read PTE
                              put_compound_head(page) # dec refcount
                              			# go slow path
      
                # slow path
                - __gup_longterm_unlocked()
                  - get_user_pages_unlocked()
                    - __get_user_pages_locked()
                      - __get_user_pages()
                        - follow_{page,p4d,pud,pmd}_mask()
                          - follow_page_pte()
                              ptep = pte_offset_map_lock()
                              pte = *ptep
                              page = vm_normal_page(pte)
                              try_grab_page(page)	# inc refcount
                              pte_unmap_unlock()
      
      - Huge Pages:
        ==========
      
      Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
      (aka lazyfree) pages are PageAnon() && !PageSwapBacked()
      (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
      thus should reach shrink_page_list() -> split_huge_page_to_list() before
      try_to_unmap[_one](), so it deals with normal pages only.
      
      (And in the unlikely case that TTU_SPLIT_HUGE_PMD/split_huge_pmd_address()
      happens, which should be rare if it occurs at all, the page refcount should
      be greater than mapcount: the head page is referenced by tail pages.  That
      also prevents checking the head `page` and then incorrectly calling
      page_remove_rmap(subpage) for a tail page that isn't even in
      shrink_page_list()'s page_list (an effect of split huge pmd/pvmw), as
      might happen today in this unlikely scenario.)
      
      MADV_FREE'd buffers:
      ===================
      
      So, back to the "if MADV_FREE pages are used as buffers" note.  The case
      is arguable, and subject to multiple interpretations.
      
      The madvise(2) manual page on the MADV_FREE advice value says:
      
      1) 'After a successful MADV_FREE ... data will be lost when
         the kernel frees the pages.'
      2) 'the free operation will be canceled if the caller writes
         into the page' / 'subsequent writes ... will succeed and
         then [the] kernel cannot free those dirtied pages'
      3) 'If there is no subsequent write, the kernel can free the
         pages at any time.'
      
      Thoughts, questions, considerations... respectively:
      
      1) Since the kernel didn't actually free the page (page_ref_freeze()
         failed), should the data not have been lost? (on userspace read.)
      2) Should writes performed by the direct IO read be able to cancel
         the free operation?
         - Should the direct IO read be considered as 'the caller' too,
           as it's been requested by 'the caller'?
         - Should the bio technique to dirty pages on return to userspace
           (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
           be considered in another/special way here?
      3) Should an upcoming write from a previously requested direct IO
         read be considered as a subsequent write, so the kernel should
         not free the pages? (as it's known at the time of page reclaim.)
      
      And lastly:
      
      Technically, the last point would seem a reasonable consideration and
      balance, as the madvise(2) manual page apparently (and fairly) seems to
      assume that 'writes' are memory accesses from the userspace process (not
      explicitly considering writes from the kernel or its corner cases; again,
      fairly). Plus, the kernel fix for the corner case of the largely
      'non-atomic write' encompassed by a direct IO read operation is
      relatively simple, and it helps.
      
      Reproducer:
      ==========
      
      @ test.c (simplified, but works)
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main() {
      		int fd, i;
      		char *buf;
      
      		fd = open(DEV, O_RDONLY | O_DIRECT);
      
      		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                      	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			buf[i] = 1; // init to non-zero
      
      		madvise(buf, BUF_SIZE, MADV_FREE);
      
      		read(fd, buf, BUF_SIZE);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			printf("%p: 0x%x\n", &buf[i], buf[i]);
      
      		return 0;
      	}
      
      @ block/fops.c (formerly fs/block_dev.c)
      
      	+#include <linux/swap.h>
      	...
      	... __blkdev_direct_IO[_simple](...)
      	{
      	...
      	+	if (!strcmp(current->comm, "good"))
      	+		shrink_all_memory(ULONG_MAX);
      	+
               	ret = bio_iov_iter_get_pages(...);
      	+
      	+	if (!strcmp(current->comm, "bad"))
      	+		shrink_all_memory(ULONG_MAX);
      	...
      	}
      
      @ shell
      
              # NUM_PAGES=4
              # PAGE_SIZE=$(getconf PAGE_SIZE)
      
              # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
              # DEV=$(losetup -f --show test.img)
      
              # gcc -DDEV=\"$DEV\" \
                    -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
                    -DPAGE_SIZE=${PAGE_SIZE} \
                     test.c -o test
      
              # od -tx1 $DEV
              0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
              *
              0040000
      
              # mv test good
              # ./good
              0x7f7c10418000: 0x79
              0x7f7c10419000: 0x79
              0x7f7c1041a000: 0x79
              0x7f7c1041b000: 0x79
      
              # mv good bad
              # ./bad
              0x7fa1b8050000: 0x0
              0x7fa1b8051000: 0x0
              0x7fa1b8052000: 0x0
              0x7fa1b8053000: 0x0
      
      Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
      support of MADV_FREE on v4.5 (60%-70% error; needs swap).  [wrap
      do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].
      
      - v5.17-rc3:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x0
      
              # free | grep Swap
              Swap:             0           0           0
      
      - v4.5:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 2702  0x0
                 1298  0x79
      
              # swapoff -av
              swapoff /swap
      
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
      Ceph/TCMalloc:
      =============
      
      For documentation purposes, the use case driving the analysis/fix is Ceph
      on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
      release unused memory to the system from the mmap'ed page heap (might be
      committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
      -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
      TCMalloc_SystemCommit() -> do nothing.
      
      Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
      release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
      just 'disappeared' on Ceph on later Ubuntu releases but is still present
      in the kernel, and can be hit by other use cases.
      
      The observed issue seems to be the old Ceph bug #22464 [1], where checksum
      mismatches are observed (and instrumentation with buffer dumps shows
      zero-pages read from mmap'ed/MADV_FREE'd page ranges).
      
      The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
      mostly worked around with a retry mechanism, but other parts of Ceph could
      still hit that (rocksdb).  Anyway, it's less likely to be hit again as
      TCMalloc switched out of MADV_FREE by default.
      
      (Some kernel versions/reports from the Ceph bug, and relation with
      the MADV_FREE introduction/changes; TCMalloc versions not checked.)
      - 4.4 good
      - 4.5 (madv_free: introduction)
      - 4.9 bad
      - 4.10 good? maybe a swapless system
      - 4.12 (madv_free: no longer free instantly on swapless systems)
      - 4.13 bad
      
      [1] https://tracker.ceph.com/issues/22464
      
      Thanks:
      ======
      
      Several people contributed to analysis/discussions/tests/reproducers in
      the first stages when drilling down on ceph/tcmalloc/linux kernel:
      
      - Dan Hill
      - Dan Streetman
      - Dongdong Tao
      - Gavin Guo
      - Gerald Yang
      - Heitor Alves de Siqueira
      - Ioanna Alifieraki
      - Jay Vosburgh
      - Matthew Ruffell
      - Ponnuvel Palaniyappan
      
      Reviews, suggestions, corrections, comments:
      
      - Minchan Kim
      - Yu Zhao
      - Huang, Ying
      - John Hubbard
      - Christoph Hellwig
      
      [mfo@canonical.com: v4]
        Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.com
      Link: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
      Fixes: 802a3a92 ("mm: reclaim MADV_FREE pages")
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Hill <daniel.hill@canonical.com>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Cc: Dongdong Tao <dongdong.tao@canonical.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Gerald Yang <gerald.yang@canonical.com>
      Cc: Heitor Alves de Siqueira <halves@canonical.com>
      Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
      Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
      Cc: <stable@vger.kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [mfo: backport: replace folio/test_flag with page/flag equivalents;
       real Fixes: 854e9ed0 ("mm: support madvise(MADV_FREE)") in v4.]
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • parisc: Fix patch code locking and flushing · 93a8347f
      John David Anglin authored
      [ Upstream commit a9fe7fa7 ]
      
      This change fixes the following:
      
      1) The flags variable is not initialized. Always use raw_spin_lock_irqsave
      and raw_spin_unlock_irqrestore to serialize patching.
      
      2) flush_kernel_vmap_range is primarily intended for DMA flushes. Since
      __patch_text_multiple is often called with interrupts disabled, it is
      better to directly call flush_kernel_dcache_range_asm and
      flush_kernel_icache_range_asm. This avoids an extra call.
      
      3) The final call to flush_icache_range is unnecessary.
      
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      93a8347f
    • Helge Deller's avatar
      parisc: Fix CPU affinity for Lasi, WAX and Dino chips · f77f482e
      Helge Deller authored
      [ Upstream commit 939fc856 ]
      
      Add the missing logic to allow Lasi, WAX and Dino to set the
      CPU affinity. This fixes IRQ migration to other CPUs when a
      CPU that currently holds the IRQs for one of those chips is
      shut down.
      
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f77f482e
    • Naresh Kamboju's avatar
      selftests: net: Add tls config dependency for tls selftests · 30dd4af4
      Naresh Kamboju authored
      [ Upstream commit d9142e1c ]
      
      The net TLS selftest cases need CONFIG_TLS=m; without it the test hangs.
      Enabling the TLS config option solves this problem and lets the tests
      run to completion.
        - CONFIG_TLS=m
      
      Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Signed-off-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      30dd4af4
    • Trond Myklebust's avatar
      NFS: Avoid writeback threads getting stuck in mempool_alloc() · ea029e4c
      Trond Myklebust authored
      [ Upstream commit 0bae835b ]
      
      In a low memory situation, allow the NFS writeback code to fail without
      getting stuck in infinite loops in mempool_alloc().
      
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ea029e4c
    • Trond Myklebust's avatar
      NFS: nfsiod should not block forever in mempool_alloc() · da747de6
      Trond Myklebust authored
      [ Upstream commit 515dcdcd ]
      
      The concern is that since nfsiod is sometimes required to kick off a
      commit, it can get locked up waiting forever in mempool_alloc() instead
      of failing gracefully and leaving the commit until later.
      
      Try to allocate from the slab first, with GFP_KERNEL | __GFP_NORETRY,
      then fall back to a non-blocking attempt to allocate from the memory
      pool.
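
The allocation order described above can be sketched in userspace C. This is only an illustrative model, not the NFS code: `toy_pool`, `toy_slab_alloc()`, `toy_pool_alloc_nowait()` and the `toy_slab_should_fail` knob are hypothetical names, with malloc() standing in for the fail-fast slab attempt and a fixed array standing in for the memory pool. The point it shows is that neither step ever waits, so the caller sees NULL and can defer the work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_POOL_SLOTS 2
#define TOY_OBJ_SIZE   64

/* Hypothetical stand-in for a mempool: a small fixed reserve of objects. */
struct toy_pool {
    unsigned char slots[TOY_POOL_SLOTS][TOY_OBJ_SIZE];
    int in_use[TOY_POOL_SLOTS];
};

/* Test knob standing in for memory pressure on the general allocator. */
static int toy_slab_should_fail;

/* Models the fail-fast first attempt (the GFP_KERNEL | __GFP_NORETRY step). */
static void *toy_slab_alloc(void)
{
    return toy_slab_should_fail ? NULL : malloc(TOY_OBJ_SIZE);
}

/* Models a non-blocking reserve allocation: returns NULL instead of waiting. */
static void *toy_pool_alloc_nowait(struct toy_pool *pool)
{
    for (int i = 0; i < TOY_POOL_SLOTS; i++) {
        if (!pool->in_use[i]) {
            pool->in_use[i] = 1;
            return pool->slots[i];
        }
    }
    return NULL;
}

/* General allocator first, then the reserve; the caller must handle NULL. */
static void *toy_alloc(struct toy_pool *pool)
{
    void *obj = toy_slab_alloc();

    return obj ? obj : toy_pool_alloc_nowait(pool);
}

/* Return an object to the reserve if it came from there, else to the heap. */
static void toy_free(struct toy_pool *pool, void *obj)
{
    assert(obj != NULL);
    for (int i = 0; i < TOY_POOL_SLOTS; i++) {
        if (obj == (void *)pool->slots[i]) {
            pool->in_use[i] = 0;
            return;
        }
    }
    free(obj);
}
```

With the knob set, the third toy_alloc() call on a two-slot reserve returns NULL rather than sleeping, which is the "fail gracefully and leave the commit until later" behaviour the message asks for.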
      
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      da747de6
    • Trond Myklebust's avatar
      SUNRPC: Fix socket waits for write buffer space · e04ef859
      Trond Myklebust authored
      [ Upstream commit 7496b59f ]
      
      The socket layer requires that we use the socket lock to protect changes
      to the sock->sk_write_pending field and others.
      
      Reported-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e04ef859
    • Haimin Zhang's avatar
      jfs: prevent NULL deref in diFree · d925b7e7
      Haimin Zhang authored
      [ Upstream commit a5304629 ]
      
      Add a validation check for JFS_IP(ipimap)->i_imap to prevent a NULL deref
      in diFree, since diFree uses it without any validation.
      When jfs_mount calls diMount to initialize the fileset inode
      allocation map, it can fail and JFS_IP(ipimap)->i_imap won't be
      initialized. It then calls diFreeSpecial to close the fileset inode
      allocation map inode, which flows into jfs_evict_inode. jfs_evict_inode
      only validates JFS_SBI(inode->i_sb)->ipimap before calling diFree. diFree
      uses JFS_IP(ipimap)->i_imap directly, which causes a NULL deref.
      
      Reported-by: TCS Robot <tcs_robot@tencent.com>
      Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
      Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d925b7e7
    • Randy Dunlap's avatar
      virtio_console: eliminate anonymous module_init & module_exit · 44c2d5fb
      Randy Dunlap authored
      [ Upstream commit fefb8a2a ]
      
      Eliminate anonymous module_init() and module_exit(), which can lead to
      confusion or ambiguity when reading System.map, crashes/oops/bugs,
      or an initcall_debug log.
      
      Give each of these init and exit functions unique driver-specific
      names to eliminate the anonymous names.
      
      Example 1: (System.map)
       ffffffff832fc78c t init
       ffffffff832fc79e t init
       ffffffff832fc8f8 t init
      
      Example 2: (initcall_debug log)
       calling  init+0x0/0x12 @ 1
       initcall init+0x0/0x12 returned 0 after 15 usecs
       calling  init+0x0/0x60 @ 1
       initcall init+0x0/0x60 returned 0 after 2 usecs
       calling  init+0x0/0x9a @ 1
       initcall init+0x0/0x9a returned 0 after 74 usecs
      
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Amit Shah <amit@kernel.org>
      Cc: virtualization@lists.linux-foundation.org
      Cc: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20220316192010.19001-3-rdunlap@infradead.org
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      44c2d5fb
    • Jiri Slaby's avatar
      serial: samsung_tty: do not unlock port->lock for uart_write_wakeup() · 053bbff8
      Jiri Slaby authored
      [ Upstream commit 988c7c00 ]
      
      The commit c15c3747 (serial: samsung: fix potential soft lockup
      during uart write) added an unlock of port->lock before
      uart_write_wakeup() and a lock after it. It was always problematic to
      write data from tty_ldisc_ops::write_wakeup and it was even documented
      that way. We fixed the line disciplines to conform to this recently.
      So if there is still a missed one, we should fix them instead of this
      workaround.
      
      On top of that, s3c24xx_serial_tx_dma_complete() in this driver
      still holds the port->lock while calling uart_write_wakeup().
      
      So revert the wrap added by the commit above.
      
      Cc: Thomas Abraham <thomas.abraham@linaro.org>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Hyeonkook Kim <hk619.kim@samsung.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Link: https://lore.kernel.org/r/20220308115153.4225-1-jslaby@suse.cz
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      053bbff8
    • Nathan Chancellor's avatar
      x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy · c393a9f4
      Nathan Chancellor authored
      [ Upstream commit aaeed6ec ]
      
      There are two outstanding issues with CONFIG_X86_X32_ABI and
      llvm-objcopy, with similar root causes:
      
      1. llvm-objcopy does not properly convert .note.gnu.property when going
         from x86_64 to x86_x32, resulting in a corrupted section when
         linking:
      
         https://github.com/ClangBuiltLinux/linux/issues/1141
      
      2. llvm-objcopy produces corrupted compressed debug sections when going
         from x86_64 to x86_x32, also resulting in an error when linking:
      
         https://github.com/ClangBuiltLinux/linux/issues/514
      
      
      
      After commit 41c5ef31ad71 ("x86/ibt: Base IBT bits"), the
      .note.gnu.property section is always generated when
      CONFIG_X86_KERNEL_IBT is enabled, which causes the first issue to become
      visible with an allmodconfig build:
      
        ld.lld: error: arch/x86/entry/vdso/vclock_gettime-x32.o:(.note.gnu.property+0x1c): program property is too short
      
      To avoid this error, do not allow CONFIG_X86_X32_ABI to be selected when
      using llvm-objcopy. If the two issues ever get fixed in llvm-objcopy,
      this can be turned into a feature check.
      
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220314194842.3452-3-nathan@kernel.org
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c393a9f4
    • Peter Zijlstra's avatar
      x86: Annotate call_on_stack() · e3c961c5
      Peter Zijlstra authored
      [ Upstream commit be007595 ]
      
      vmlinux.o: warning: objtool: page_fault_oops()+0x13c: unreachable instruction
      
      0000 000000000005b460 <page_fault_oops>:
      ...
      0128    5b588:  49 89 23                mov    %rsp,(%r11)
      012b    5b58b:  4c 89 dc                mov    %r11,%rsp
      012e    5b58e:  4c 89 f2                mov    %r14,%rdx
      0131    5b591:  48 89 ee                mov    %rbp,%rsi
      0134    5b594:  4c 89 e7                mov    %r12,%rdi
      0137    5b597:  e8 00 00 00 00          call   5b59c <page_fault_oops+0x13c>    5b598: R_X86_64_PLT32   handle_stack_overflow-0x4
      013c    5b59c:  5c                      pop    %rsp
      
      vmlinux.o: warning: objtool: sysvec_reboot()+0x6d: unreachable instruction
      
      0000 00000000000033f0 <sysvec_reboot>:
      ...
      005d     344d:  4c 89 dc                mov    %r11,%rsp
      0060     3450:  e8 00 00 00 00          call   3455 <sysvec_reboot+0x65>        3451: R_X86_64_PLT32    irq_enter_rcu-0x4
      0065     3455:  48 89 ef                mov    %rbp,%rdi
      0068     3458:  e8 00 00 00 00          call   345d <sysvec_reboot+0x6d>        3459: R_X86_64_PC32     .text+0x47d0c
      006d     345d:  e8 00 00 00 00          call   3462 <sysvec_reboot+0x72>        345e: R_X86_64_PLT32    irq_exit_rcu-0x4
      0072     3462:  5c                      pop    %rsp
      
      Both cases are due to a call_on_stack() calling a __noreturn function.
      Since that's an inline asm, GCC can't do anything about the
      instructions after the CALL. Therefore put in an explicit
      ASM_REACHABLE annotation to make sure objtool and gcc are consistently
      confused about control flow.
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Link: https://lore.kernel.org/r/20220308154319.468805622@infradead.org
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e3c961c5
    • NeilBrown's avatar
      NFS: swap-out must always use STABLE writes. · 6bb22702
      NeilBrown authored
      [ Upstream commit c265de25 ]
      
      The commit handling code is not safe against memory-pressure deadlocks
      when writing to swap.  In particular, nfs_commitdata_alloc() blocks
      indefinitely waiting for memory, and this can consume all available
      workqueue threads.
      
      swap-out most likely uses STABLE writes anyway as COND_STABLE indicates
      that a stable write should be used if the write fits in a single
      request, and it normally does.  However if we ever swap with a small
      wsize, or gather unusually large numbers of pages for a single write,
      this might change.
      
      For safety, make it explicit in the code that direct writes used for swap
      must always use FLUSH_STABLE.
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6bb22702
    • NeilBrown's avatar
      NFS: swap IO handling is slightly different for O_DIRECT IO · 24d28d9b
      NeilBrown authored
      [ Upstream commit 64158668 ]
      
      1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
         possible deadlocks with "fs_reclaim".  These deadlocks could, I believe,
         eventuate if a buffered read on the swapfile was attempted.
      
         We don't need coherence with the page cache for a swap file, and
         buffered writes are forbidden anyway.  There is no other need for
         i_rwsem during direct IO.  So never take it for swap_rw()
      
      2/ generic_write_checks() explicitly forbids writes to swap, and
         performs checks that are not needed for swap.  So bypass it
         for swap_rw().
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      24d28d9b
    • NeilBrown's avatar
      SUNRPC: remove scheduling boost for "SWAPPER" tasks. · a5538640
      NeilBrown authored
      [ Upstream commit a80a8461 ]
      
      Currently, tasks marked as "swapper" tasks get put to the front of
      non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
      the transport's ->xmit_queue.
      
      This is pointless as currently *all* tasks for a mount that has swap
      enabled on *any* file are marked as "swapper" tasks.  So the net result
      is that the non-priority rpc_queues are reverse-ordered (LIFO).
      
      This scheduling boost is not necessary to avoid deadlocks, and hurts
      fairness, so remove it.  If there were a need to expedite some requests,
      the tk_priority mechanism is a more appropriate tool.
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a5538640
    • NeilBrown's avatar
      SUNRPC/xprt: async tasks mustn't block waiting for memory · 20700aa0
      NeilBrown authored
      [ Upstream commit a7210354 ]
      
      When memory is short, new worker threads cannot be created and we depend
      on the minimum one rpciod thread to be able to handle everything.  So it
      must not block waiting for memory.
      
      xprt_dynamic_alloc_slot can block indefinitely.  This can tie up all
      workqueue threads and NFS can deadlock.  So when called from a
      workqueue, set __GFP_NORETRY.
      
      The rdma alloc_slot already does not block.  However it sets the error
      to -EAGAIN suggesting this will trigger a sleep.  It does not.  As we
      can see in call_reserveresult(), only -ENOMEM causes a sleep.  -EAGAIN
      causes immediate retry.
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      20700aa0
    • NeilBrown's avatar
      SUNRPC/call_alloc: async tasks mustn't block waiting for memory · a19fd1d6
      NeilBrown authored
      [ Upstream commit c487216b ]
      
      When memory is short, new worker threads cannot be created and we depend
      on the minimum one rpciod thread to be able to handle everything.
      So it must not block waiting for memory.
      
      mempools are particularly a problem as memory can only be released back
      to the mempool by an async rpc task running.  If all available
      workqueue threads are waiting on the mempool, no thread is available to
      return anything.
      
      rpc_malloc() can block, and this might cause deadlocks.
      So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
      blocking is acceptable.
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a19fd1d6
    • Maxime Ripard's avatar
      clk: Enforce that disjoints limits are invalid · b07387c4
      Maxime Ripard authored
      [ Upstream commit 10c46f2e ]
      
      If we were to have two users of the same clock, doing something like:
      
      clk_set_rate_range(user1, 1000, 2000);
      clk_set_rate_range(user2, 3000, 4000);
      
      The second call would fail with -EINVAL, preventing us from getting into a
      situation where we end up with impossible limits.
      
      However, this is never explicitly checked against and enforced, and
      works by relying on an undocumented behaviour of clk_set_rate().
      
      Indeed, the first clk_set_rate_range() call will make sure the current clock
      rate is within the new range, so it will be between 1000 and 2000Hz. The
      second clk_set_rate_range() call will consider (rightfully) that our
      current clock is outside of the 3000-4000Hz range, and will call
      clk_core_set_rate_nolock() to set it to 3000Hz.

      clk_core_set_rate_nolock() will then call clk_calc_new_rates(), which will
      eventually check that our 3000Hz rate is outside the min 3000Hz, max
      2000Hz range and bail out; the error will propagate and we'll
      eventually return -EINVAL.
      
      This solely relies on the fact that clk_calc_new_rates(), and in
      particular clk_core_determine_round_nolock(), won't modify the new rate
      allowing the error to be reported. That assumption won't be true for all
      drivers, and most importantly we'll break that assumption in a later
      patch.
      
      It can also be argued that we shouldn't even reach the point where we're
      calling clk_core_set_rate_nolock().
      
      Let's make an explicit check for disjoint ranges before we do
      anything.
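
As a rough illustration of such an up-front check, the sketch below intersects every user's requested range before accepting a new one. It is a userspace model under stated assumptions, not the clk framework code: `toy_range`, `toy_ranges_overlap()` and `toy_set_rate_range()` are hypothetical names, and the per-user array stands in for the list of clock consumers.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical per-user rate request, in Hz. */
struct toy_range {
    unsigned long min, max;
};

/* True if the intersection of all users' ranges is non-empty. */
static bool toy_ranges_overlap(const struct toy_range *users, int n)
{
    unsigned long lo = 0, hi = ULONG_MAX;

    for (int i = 0; i < n; i++) {
        if (users[i].min > lo)
            lo = users[i].min;
        if (users[i].max < hi)
            hi = users[i].max;
    }
    return lo <= hi;
}

/* Reject a disjoint request before touching the clock rate at all. */
static int toy_set_rate_range(struct toy_range *users, int n, int idx,
                              unsigned long min, unsigned long max)
{
    struct toy_range old;

    assert(idx >= 0 && idx < n);
    if (min > max)
        return -EINVAL;

    old = users[idx];
    users[idx].min = min;
    users[idx].max = max;
    if (!toy_ranges_overlap(users, n)) {
        users[idx] = old;   /* roll back so the stored state stays valid */
        return -EINVAL;
    }
    return 0;
}
```

Replaying the example from the message, user1 asking for 1000-2000 succeeds and user2 asking for 3000-4000 is refused with -EINVAL without any rate change being attempted.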
      
      Signed-off-by: Maxime Ripard <maxime@cerno.tech>
      Link: https://lore.kernel.org/r/20220225143534.405820-4-maxime@cerno.tech
      
      
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b07387c4
    • Tony Lindgren's avatar
      clk: ti: Preserve node in ti_dt_clocks_register() · 15bfec9d
      Tony Lindgren authored
      [ Upstream commit 80864594 ]
      
      In preparation for making use of the clock-output-names, we want to
      keep node around in ti_dt_clocks_register().
      
      This change should not be needed as a fix currently.
      
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Link: https://lore.kernel.org/r/20220204071449.16762-3-tony@atomide.com
      
      
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      15bfec9d
    • Dongli Zhang's avatar
      xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32 · 5c0750ca
      Dongli Zhang authored
      [ Upstream commit eed05744 ]
      
      The sched_clock() can be used very early since commit 857baa87
      ("sched/clock: Enable sched clock early"). In addition, with commit
      38669ba2 ("x86/xen/time: Output xen sched_clock time from 0"), the kdump
      kernel in a Xen HVM guest may panic at a very early stage when accessing
      &__this_cpu_read(xen_vcpu)->time as shown below:
      
      setup_arch()
       -> init_hypervisor_platform()
           -> x86_init.hyper.init_platform = xen_hvm_guest_init()
               -> xen_hvm_init_time_ops()
                   -> xen_clocksource_read()
                       -> src = &__this_cpu_read(xen_vcpu)->time;
      
      This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
      embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
      used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.
      
      However, when a Xen HVM guest panics on vcpu >= 32,
      xen_vcpu_info_reset(0) sets per_cpu(xen_vcpu, cpu) = NULL because
      vcpu >= 32, so xen_clocksource_read() on that vcpu would panic.
      
      This patch calls xen_hvm_init_time_ops() again later in
      xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
      registered when the boot vcpu is >= 32.
      
      This issue can be reproduced on purpose via the command below on the guest
      side when kdump/kexec is enabled:
      
      "taskset -c 33 echo c > /proc/sysrq-trigger"
      
      The bugfix for PVM is not implemented due to the lack of a testing
      environment.
      
      [boris: xen_hvm_init_time_ops() returns on errors instead of jumping to end]
      
      Cc: Joe Jin <joe.jin@oracle.com>
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Link: https://lore.kernel.org/r/20220302164032.14569-3-dongli.zhang@oracle.com
      
      
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5c0750ca
    • Ohad Sharabi's avatar
      habanalabs: fix possible memory leak in MMU DR fini · 12e49aef
      Ohad Sharabi authored
      [ Upstream commit eb85eec8 ]
      
      This patch fixes what seems to be a copy-paste error.
      
      We will have a memory leak if the host-resident shadow is NULL (which
      will likely happen as the DR and HR are not dependent).
      
      Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
      Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
      Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      12e49aef
    • Trond Myklebust's avatar
      NFSv4: Protect the state recovery thread against direct reclaim · a34752aa
      Trond Myklebust authored
      [ Upstream commit 3e17898a ]
      
      If memory allocation triggers a direct reclaim from the state recovery
      thread, then we can deadlock. Use memalloc_nofs_save/restore to ensure
      that doesn't happen.
      
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a34752aa
    • Xin Xiong's avatar
      NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify() · b37f482b
      Xin Xiong authored
      [ Upstream commit b7f114ed ]
      
      The reference counting issue happens in two error paths in the
      function _nfs42_proc_copy_notify(). In both error paths, the function
      simply returns the error code and forgets to balance the refcount of
      object `ctx`, bumped by get_nfs_open_context() earlier, which may
      cause refcount leaks.
      
      Fix it by balancing refcount of the `ctx` object before the function
      returns in both error paths.
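
The pattern can be sketched with a toy reference counter. This is an illustrative model only, not the NFS function: `toy_ctx`, `toy_get_ctx()`, `toy_put_ctx()`, `toy_copy_notify()` and the `fail_step` parameter are hypothetical, and a single exit label is just one way to guarantee that every return path drops the reference taken at the top.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical refcounted object standing in for the open context. */
struct toy_ctx {
    int refcount;
};

static struct toy_ctx *toy_get_ctx(struct toy_ctx *ctx)
{
    ctx->refcount++;
    return ctx;
}

static void toy_put_ctx(struct toy_ctx *ctx)
{
    assert(ctx->refcount > 0);
    ctx->refcount--;
}

/*
 * fail_step selects which (simulated) step fails: 0 = none, 1 = first
 * error path, 2 = second error path. Every path leaves through out_put,
 * so the reference taken at entry is always dropped.
 */
static int toy_copy_notify(struct toy_ctx *open_ctx, int fail_step)
{
    struct toy_ctx *ctx = toy_get_ctx(open_ctx);
    int status = 0;

    if (fail_step == 1) {
        status = -ENOMEM;
        goto out_put;
    }
    if (fail_step == 2) {
        status = -EIO;
        goto out_put;
    }

    /* ... the real work would happen here ... */

out_put:
    toy_put_ctx(ctx);
    return status;
}
```

Whichever step fails, the caller's reference count is the same after the call as before it, which is the balance the fix restores.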
      
      Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
      Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b37f482b
    • Lucas Denefle's avatar
      w1: w1_therm: fixes w1_seq for ds28ea00 sensors · 24acdd5f
      Lucas Denefle authored
      [ Upstream commit 41a92a89 ]
      
      w1_seq was failing due to several devices responding to the
      CHAIN_DONE at the same time. The current device in the chain
      is now properly selected with MATCH_ROM. Also, the acknowledgment
      was read twice.
      
      Signed-off-by: Lucas Denefle <lucas.denefle@converge.io>
      Link: https://lore.kernel.org/r/20220223113558.232750-1-lucas.denefle@converge.io
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      24acdd5f
    • Xiaoke Wang's avatar
      staging: wfx: fix an error handling in wfx_init_common() · 86efcb52
      Xiaoke Wang authored
      [ Upstream commit 60f1d3c9 ]
      
      One error handler of wfx_init_common() returns without calling
      ieee80211_free_hw(hw), which may result in a memory leak. Also add
      one err label to unify the error handling, which is useful for
      subsequent changes.
      
      Suggested-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
      Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
      Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com>
      Link: https://lore.kernel.org/r/tencent_24A24A3EFF61206ECCC4B94B1C5C1454E108@qq.com
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      86efcb52
    • Viresh Kumar's avatar
      opp: Expose of-node's name in debugfs · 7295544b
      Viresh Kumar authored
      [ Upstream commit 021dbeca ]
      
      It is difficult to find which OPPs are active at the moment, especially
      if there are multiple OPPs with the same frequency available in the device
      tree (controlled by the supported hardware feature).
      
      Expose name of the DT node to find out the exact OPP.
      
      While at it, also expose level field.
      
      Reported-by: Leo Yan <leo.yan@linaro.org>
      Tested-by: Leo Yan <leo.yan@linaro.org>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7295544b
    • Pierre Gondois's avatar
      cpufreq: CPPC: Fix performance/frequency conversion · ea1f2958
      Pierre Gondois authored
      [ Upstream commit ec1c7ad4 ]
      
      CPUfreq governors request CPU frequencies using information
      on current CPU usage. The CPPC driver converts them to
      performance requests. Frequency targets are computed as:
      	target_freq = (util / cpu_capacity) * max_freq
      target_freq is then clamped between [policy->min, policy->max].
      
      The CPPC driver converts performance values to frequencies
      (and vice-versa) using cppc_cpufreq_perf_to_khz() and
      cppc_cpufreq_khz_to_perf(). These functions both use two different
      factors depending on the range of the input value. For
      cppc_cpufreq_khz_to_perf():
      - (NOMINAL_PERF / NOMINAL_FREQ) or
      - (LOWEST_PERF / LOWEST_FREQ)
      and for cppc_cpufreq_perf_to_khz():
      - (NOMINAL_FREQ / NOMINAL_PERF) or
      - ((NOMINAL_FREQ - LOWEST_FREQ) / (NOMINAL_PERF - LOWEST_PERF))
      
      This means:
      1- the functions are not inverse for some values:
         (perf_to_khz(khz_to_perf(x)) != x)
      2- cppc_cpufreq_perf_to_khz(LOWEST_PERF) can sometimes give
         a different value from LOWEST_FREQ due to integer approximation
      3- it is implied that performance and frequency are proportional
         (NOMINAL_FREQ / NOMINAL_PERF) == (LOWEST_FREQ / LOWEST_PERF)
      
      This patch changes the conversion functions to an affine function.
      This fixes the 3 points above.
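
To make "affine function" concrete, the sketch below maps performance to frequency along the straight line through the points (LOWEST_PERF, LOWEST_FREQ) and (NOMINAL_PERF, NOMINAL_FREQ), and back. It is a minimal illustration and not the driver's code: `toy_caps`, `toy_perf_to_khz()` and `toy_khz_to_perf()` are hypothetical names, and using the two anchor points directly is an assumption about how such a conversion could be written.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability set; signed 64-bit keeps the arithmetic simple. */
struct toy_caps {
    int64_t lowest_perf, nominal_perf;
    int64_t lowest_khz, nominal_khz;
};

/* khz = lowest_khz + (perf - lowest_perf) * slope, with slope = mul / div */
static int64_t toy_perf_to_khz(const struct toy_caps *c, int64_t perf)
{
    int64_t mul = c->nominal_khz - c->lowest_khz;
    int64_t div = c->nominal_perf - c->lowest_perf;

    assert(div > 0);
    return c->lowest_khz + (perf - c->lowest_perf) * mul / div;
}

/* Inverse of the same line: perf = lowest_perf + (khz - lowest_khz) / slope */
static int64_t toy_khz_to_perf(const struct toy_caps *c, int64_t khz)
{
    int64_t mul = c->nominal_perf - c->lowest_perf;
    int64_t div = c->nominal_khz - c->lowest_khz;

    assert(div > 0);
    return c->lowest_perf + (khz - c->lowest_khz) * mul / div;
}
```

Because both directions use the same line, the lowest and nominal points map onto each other exactly even when performance and frequency are not proportional (for example lowest 10 at 800000 kHz and nominal 40 at 2000000 kHz). Integer division can still round for intermediate values when the slope is not a whole number.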
      
      Suggested-by: Lukasz Luba <lukasz.luba@arm.com>
      Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ea1f2958
    • Sascha Hauer's avatar
      clk: rockchip: drop CLK_SET_RATE_PARENT from dclk_vop* on rk3568 · 26f0a9e3
      Sascha Hauer authored
      [ Upstream commit ff3187ea ]
      
      The pixel clocks dclk_vop[012] can be clocked from hpll, vpll, gpll or
      cpll. gpll and cpll also drive many other clocks, so changing the
      dclk_vop[012] clocks could change these other clocks as well. Drop
      CLK_SET_RATE_PARENT to fix that. With this change the VOP2 driver can
      only adjust the pixel clocks with the divider between the PLL and the
      dclk_vop[012] which means the user may have to adjust the PLL clock to a
      suitable rate using the assigned-clock-rates device tree property.
      
      Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
      Link: https://lore.kernel.org/r/20220126145549.617165-25-s.hauer@pengutronix.de
      
      
      Signed-off-by: Heiko Stuebner <heiko@sntech.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      26f0a9e3
    • Amjad Ouled-Ameur's avatar
      phy: amlogic: meson8b-usb2: fix shared reset control use · caffa76d
      Amjad Ouled-Ameur authored
      [ Upstream commit 6f1dedf0 ]
      
      Call reset_control_rearm() if phy_meson8b_usb2_power_on() fails after
      reset() has been called, or when phy_meson8b_usb2_power_off() is
      called, i.e. the resource is no longer used and the reset line may be
      triggered again by other devices.

      reset_control_rearm() keeps the use of triggered_count sane in the reset
      framework; use of reset_control_reset() on a shared reset line should
      be balanced with reset_control_rearm().
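
The balancing rule can be modelled with a small counter. This is a toy userspace sketch, not the reset framework: the `toy_` functions and the `pulses` field are hypothetical, and it assumes that a shared line is only actually pulsed when its triggered counter goes from zero to one.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical shared reset line. */
struct toy_reset {
    int triggered_count;  /* consumers that have triggered and not re-armed */
    int pulses;           /* how many times the line was really pulsed */
};

/* Only the first outstanding trigger pulses the line (assumed behaviour). */
static int toy_reset_control_reset(struct toy_reset *r)
{
    if (++r->triggered_count != 1)
        return 0;
    r->pulses++;
    return 0;
}

/* Drop this consumer's trigger so a later reset can pulse the line again. */
static void toy_reset_control_rearm(struct toy_reset *r)
{
    assert(r->triggered_count > 0);
    r->triggered_count--;
}

/* Power-on sketch: re-arm on the error path that follows the reset. */
static int toy_phy_power_on(struct toy_reset *r, int fail_after_reset)
{
    toy_reset_control_reset(r);
    if (fail_after_reset) {
        toy_reset_control_rearm(r);
        return -EIO;
    }
    return 0;
}

/* Power-off sketch: the resource is no longer used, so re-arm. */
static void toy_phy_power_off(struct toy_reset *r)
{
    toy_reset_control_rearm(r);
}
```

If the error path skipped the re-arm, the counter would stay above zero and, in this model, no later power-on could pulse the line again.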
      
      Signed-off-by: Amjad Ouled-Ameur <aouledameur@baylibre.com>
      Reported-by: Jerome Brunet <jbrunet@baylibre.com>
      Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Acked-by: Neil Armstrong <narmstrong@baylibre.com>
      Link: https://lore.kernel.org/r/20220111095255.176141-4-aouledameur@baylibre.com
      
      
      Signed-off-by: Vinod Koul <vkoul@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      caffa76d