  1. Feb 01, 2020
• powerpc: book3s64: convert to pin_user_pages() and put_user_page() · aa4b87fe
      John Hubbard authored
      
      
      1. Convert from get_user_pages() to pin_user_pages().
      
      2. As required by pin_user_pages(), release these pages via
         put_user_page().
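
As a rough illustration (a hypothetical call site, not the actual book3s64
code), the before/after pattern of such a conversion is:

    /* before: FOLL_GET acquisition, released with put_page() */
    nr = get_user_pages(addr, npages, gup_flags, pages, NULL);
    /* ... use the pages ... */
    for (i = 0; i < nr; i++)
            put_page(pages[i]);

    /* after: FOLL_PIN acquisition, released with put_user_page() */
    nr = pin_user_pages(addr, npages, gup_flags, pages, NULL);
    /* ... use the pages ... */
    for (i = 0; i < nr; i++)
            put_user_page(pages[i]);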
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-21-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa4b87fe
• vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion · 19fed0da
      John Hubbard authored
      
      
      1. Change vfio from get_user_pages_remote(), to
         pin_user_pages_remote().
      
      2. Because all FOLL_PIN-acquired pages must be released via
         put_user_page(), also convert the put_page() call over to
         put_user_pages_dirty_lock().
      
      Note that this effectively changes the code's behavior in
      vfio_iommu_type1.c: put_pfn(): it now ultimately calls
      set_page_dirty_lock(), instead of set_page_dirty().  This is probably
      more accurate.
      
      As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
      dealing with a file backed page where we have reference on the inode it
      hangs off." [1]
      
      [1] https://lore.kernel.org/r/20190723153640.GB720@lst.de
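
Roughly, the resulting release path in put_pfn() looks like this
(simplified):

    static int put_pfn(unsigned long pfn, int prot)
    {
            if (!is_invalid_reserved_pfn(pfn)) {
                    struct page *page = pfn_to_page(pfn);

                    /* unpins, and dirties via set_page_dirty_lock() when
                     * the mapping was writable */
                    put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
                    return 1;
            }
            return 0;
    }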
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-20-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19fed0da
• media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion · 1f815afc
      John Hubbard authored
      
      
      1. Change v4l2 from get_user_pages() to pin_user_pages().
      
      2. Because all FOLL_PIN-acquired pages must be released via
         put_user_page(), also convert the put_page() call over to
         put_user_pages_dirty_lock().
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-19-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f815afc
• net/xdp: set FOLL_PIN via pin_user_pages() · fb48b474
      John Hubbard authored
      
      
Convert net/xdp to use the new pin_user_pages() call, which sets
      FOLL_PIN.  Setting FOLL_PIN is now required for code that requires
      tracking of pinned pages.
      
      In partial anticipation of this work, the net/xdp code was already calling
      put_user_page() instead of put_page().  Therefore, in order to convert
      from the get_user_pages()/put_page() model, to the
      pin_user_pages()/put_user_page() model, the only change required here is
      to change get_user_pages() to pin_user_pages().
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-18-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb48b474
• fs/io_uring: set FOLL_PIN via pin_user_pages() · 2113b05d
      John Hubbard authored
      
      
      Convert fs/io_uring to use the new pin_user_pages() call, which sets
      FOLL_PIN.  Setting FOLL_PIN is now required for code that requires
      tracking of pinned pages, and therefore for any code that calls
      put_user_page().
      
      In partial anticipation of this work, the io_uring code was already
      calling put_user_page() instead of put_page().  Therefore, in order to
      convert from the get_user_pages()/put_page() model, to the
      pin_user_pages()/put_user_page() model, the only change required here is
      to change get_user_pages() to pin_user_pages().
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-17-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2113b05d
• drm/via: set FOLL_PIN via pin_user_pages_fast() · a5adf0a0
      John Hubbard authored
      
      
      Convert drm/via to use the new pin_user_pages_fast() call, which sets
      FOLL_PIN.  Setting FOLL_PIN is now required for code that requires
      tracking of pinned pages, and therefore for any code that calls
      put_user_page().
      
      In partial anticipation of this work, the drm/via driver was already
      calling put_user_page() instead of put_page().  Therefore, in order to
      convert from the get_user_pages()/put_page() model, to the
      pin_user_pages()/put_user_page() model, the only change required is to
      change get_user_pages() to pin_user_pages().
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-16-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5adf0a0
• mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() · 803e4572
      John Hubbard authored
      
      
      Convert process_vm_access to use the new pin_user_pages_remote() call,
      which sets FOLL_PIN.  Setting FOLL_PIN is now required for code that
      requires tracking of pinned pages.
      
      Also, release the pages via put_user_page*().
      
      Also, rename "pages" to "pinned_pages", as this makes for easier reading
      of process_vm_rw_single_vec().
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-15-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      803e4572
• IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP · dfa0a4ff
      John Hubbard authored
      
      
      Convert infiniband to use the new pin_user_pages*() calls.
      
      Also, revert earlier changes to Infiniband ODP that had it using
      put_user_page().  ODP is "Case 3" in
      Documentation/core-api/pin_user_pages.rst, which is to say, normal
      get_user_pages() and put_page() is the API to use there.
      
      The new pin_user_pages*() calls replace corresponding get_user_pages*()
      calls, and set the FOLL_PIN flag.  The FOLL_PIN flag requires that the
      caller must return the pages via put_user_page*() calls, but infiniband
      was already doing that as part of an earlier commit.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-14-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfa0a4ff
• goldish_pipe: convert to pin_user_pages() and put_user_page() · 57459435
      John Hubbard authored
      
      
      1. Call the new global pin_user_pages_fast(), from
         pin_goldfish_pages().
      
      2. As required by pin_user_pages(), release these pages via
         put_user_page().  In this case, do so via put_user_pages_dirty_lock().
      
      That has the side effect of calling set_page_dirty_lock(), instead of
      set_page_dirty().  This is probably more accurate.
      
      As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
      dealing with a file backed page where we have reference on the inode it
      hangs off." [1]
      
      Another side effect is that the release code is simplified because the
      page[] loop is now in gup.c instead of here, so just delete the local
      release_user_pages() entirely, and call put_user_pages_dirty_lock()
      directly, instead.
      
      [1] https://lore.kernel.org/r/20190723153640.GB720@lst.de
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-13-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57459435
• mm/gup: introduce pin_user_pages*() and FOLL_PIN · eddb1c22
      John Hubbard authored
      
      
Introduce pin_user_pages*() variations of the get_user_pages*() calls.
      
      For now, these are placeholder calls, until the various call sites are
      converted to use the correct get_user_pages*() or pin_user_pages*() API.
      
      These variants will eventually all set FOLL_PIN, which is also
      introduced, and thoroughly documented.
      
          pin_user_pages()
          pin_user_pages_remote()
          pin_user_pages_fast()
      
      All pages that are pinned via the above calls, must be unpinned via
      put_user_page().
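
The signatures mirror their get_user_pages*() counterparts; sketched here
from the description above (see include/linux/mm.h in that tree for the
authoritative declarations):

    long pin_user_pages(unsigned long start, unsigned long nr_pages,
                        unsigned int gup_flags, struct page **pages,
                        struct vm_area_struct **vmas);
    long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
                               unsigned long start, unsigned long nr_pages,
                               unsigned int gup_flags, struct page **pages,
                               struct vm_area_struct **vmas, int *locked);
    int pin_user_pages_fast(unsigned long start, int nr_pages,
                            unsigned int gup_flags, struct page **pages);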
      
      The underlying rules are:
      
      * FOLL_PIN is a gup-internal flag, so the call sites should not directly
        set it.  That behavior is enforced with assertions.
      
      * Call sites that want to indicate that they are going to do DirectIO
        ("DIO") or something with similar characteristics, should call a
        get_user_pages()-like wrapper call that sets FOLL_PIN.  These wrappers
        will:
      
          * Start with "pin_user_pages" instead of "get_user_pages".  That
            makes it easy to find and audit the call sites.
      
          * Set FOLL_PIN
      
      * For pages that are received via FOLL_PIN, those pages must be returned
        via put_user_page().
      
      Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in
      this documentation.  (I've reworded it and expanded upon it.)
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-12-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>	[Documentation]
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eddb1c22
• media/v4l2-core: set pages dirty upon releasing DMA buffers · 3c7470b6
      John Hubbard authored
      
      
      After DMA is complete, and the device and CPU caches are synchronized,
      it's still required to mark the CPU pages as dirty, if the data was
      coming from the device.  However, this driver was just issuing a bare
      put_page() call, without any set_page_dirty*() call.
      
      Fix the problem, by calling set_page_dirty_lock() if the CPU pages were
      potentially receiving data from the device.
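
Schematically, the release loop becomes (simplified, field names
illustrative):

    for (i = 0; i < dma->nr_pages; i++) {
            if (dma->direction == DMA_FROM_DEVICE)
                    set_page_dirty_lock(dma->pages[i]);
            put_page(dma->pages[i]);
    }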
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-11-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c7470b6
• IB/umem: use get_user_pages_fast() to pin DMA pages · 4789fcdd
      John Hubbard authored
      
      
Switch to get_user_pages_fast() and, as part of that, get rid of the
mmap_sem calls.  Note that
      get_user_pages_fast() will, if necessary, fall back to
      __gup_longterm_unlocked(), which takes the mmap_sem as needed.
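
A simplified before/after sketch of the pinning loop (not the exact umem
code; names are illustrative):

    /* before: the caller takes mmap_sem around the slow-path call */
    down_read(&mm->mmap_sem);
    ret = get_user_pages(cur_base, npages, gup_flags | FOLL_LONGTERM,
                         page_list, NULL);
    up_read(&mm->mmap_sem);

    /* after: no mmap_sem at the call site; the fast path takes it
     * internally only when it must fall back */
    ret = get_user_pages_fast(cur_base, npages, gup_flags | FOLL_LONGTERM,
                              page_list);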
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-10-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4789fcdd
• mm/gup: allow FOLL_FORCE for get_user_pages_fast() · f4000fdf
      John Hubbard authored
      Commit 817be129 ("mm: validate get_user_pages_fast flags") allowed
      only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
      This, combined with the fact that get_user_pages_fast() falls back to
      "slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
      if you need FOLL_FORCE, you cannot call get_user_pages_fast().
      
      There does not appear to be any reason for filtering out FOLL_FORCE.
      There is nothing in the _fast() implementation that requires that we
      avoid writing to the pages.  So it appears to have been an oversight.
      
      Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().
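
The change itself is essentially one line in the gup_flags sanity check of
the fast path (sketched):

    /* before: FOLL_FORCE was rejected outright */
    if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
            return -EINVAL;

    /* after: FOLL_FORCE is accepted and passed through to the slow path */
    if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_FORCE)))
            return -EINVAL;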
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-9-jhubbard@nvidia.com
Fixes: 817be129 ("mm: validate get_user_pages_fast flags")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4000fdf
• vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call · 3567813e
      John Hubbard authored
      
      
      Update VFIO to take advantage of the recently loosened restriction on
      FOLL_LONGTERM with get_user_pages_remote().  Also, now it is possible to
      fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
      wasn't setting FOLL_LONGTERM.
      
Also, remove an unnecessary pair of calls that were releasing and
      reacquiring the mmap_sem.  There is no need to avoid holding mmap_sem
      just in order to call page_to_pfn().
      
Also, now that the DAX check ("if a VMA is DAX, don't allow long
      term pinning") is in the internals of get_user_pages_remote() and
      __gup_longterm_locked(), there's no need for it at the VFIO call site.  So
      remove it.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-8-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3567813e
• mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM · c4237f8b
      John Hubbard authored
      
      
      As it says in the updated comment in gup.c: current FOLL_LONGTERM
      behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the FS
      DAX check requirement on vmas.
      
      However, the corresponding restriction in get_user_pages_remote() was
      slightly stricter than is actually required: it forbade all
      FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
      that do not set the "locked" arg.
      
      Update the code and comments to loosen the restriction, allowing
      FOLL_LONGTERM in some cases.
      
      Also, copy the DAX check ("if a VMA is DAX, don't allow long term
      pinning") from the VFIO call site, all the way into the internals of
      get_user_pages_remote() and __gup_longterm_locked().  That is:
      get_user_pages_remote() calls __gup_longterm_locked(), which in turn
      calls check_dax_vmas().  This check will then be removed from the VFIO
      call site in a subsequent patch.
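
The loosened check in get_user_pages_remote() ends up roughly like this:

    if (gup_flags & FOLL_LONGTERM) {
            /* FOLL_LONGTERM still cannot be combined with "locked" */
            if (WARN_ON_ONCE(locked))
                    return -EINVAL;
            /* the DAX vma check now lives in __gup_longterm_locked() */
            return __gup_longterm_locked(tsk, mm, start, nr_pages,
                                         pages, vmas,
                                         gup_flags | FOLL_TOUCH);
    }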
      
      Thanks to Jason Gunthorpe for pointing out a clean way to fix this, and
      to Dan Williams for helping clarify the DAX refactoring.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-7-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4237f8b
• goldish_pipe: rename local pin_user_pages() routine · 1023369c
      John Hubbard authored
      
      
      Avoid naming conflicts: rename local static function from
      "pin_user_pages()" to "goldfish_pin_pages()".
      
      An upcoming patch will introduce a global pin_user_pages() function.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-6-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1023369c
• mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages · 07d80269
      John Hubbard authored
      
      
      An upcoming patch changes and complicates the refcounting and especially
      the "put page" aspects of it.  In order to keep everything clean,
      refactor the devmap page release routines:
      
      * Rename put_devmap_managed_page() to page_is_devmap_managed(), and
        limit the functionality to "read only": return a bool, with no side
        effects.
      
      * Add a new routine, put_devmap_managed_page(), to handle decrementing
        the refcount for ZONE_DEVICE pages.
      
      * Change callers (just release_pages() and put_page()) to check
        page_is_devmap_managed() before calling the new
        put_devmap_managed_page() routine.  This is a performance point:
put_page() is a hot path, so we need to avoid non-inline function calls
        where possible.
      
      * Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
        limit the functionality to unconditionally freeing a devmap page.
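
Put together, the new shape is approximately (simplified):

    static inline bool page_is_devmap_managed(struct page *page)
    {
            if (!static_branch_unlikely(&devmap_managed_key))
                    return false;
            if (!is_zone_device_page(page))
                    return false;
            switch (page->pgmap->type) {
            case MEMORY_DEVICE_PRIVATE:
            case MEMORY_DEVICE_FS_DAX:
                    return true;
            default:
                    return false;
            }
    }

    /* put_page() stays inline and only branches out for devmap pages */
    static inline void put_page(struct page *page)
    {
            page = compound_head(page);

            if (page_is_devmap_managed(page)) {
                    put_devmap_managed_page(page);
                    return;
            }

            if (put_page_testzero(page))
                    __put_page(page);
    }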
      
      This is originally based on a separate patch by Ira Weiny, which applied
      to an early version of the put_user_page() experiments.  Since then,
      Jérôme Glisse suggested the refactoring described above.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-5-jhubbard@nvidia.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07d80269
• mm: Cleanup __put_devmap_managed_page() vs ->page_free() · 429589d6
      Dan Williams authored
      
      
      After the removal of the device-public infrastructure there are only 2
->page_free() callbacks in the kernel.  One of those is a
      device-private callback in the nouveau driver, the other is a generic
      wakeup needed in the DAX case.  In the hopes that all ->page_free()
      callbacks can be migrated to common core kernel functionality, move the
      device-private specific actions in __put_devmap_managed_page() under the
      is_device_private_page() conditional, including the ->page_free()
      callback.  For the other page types just open-code the generic wakeup.
      
      Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
      does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
      case.
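
A rough sketch of the resulting shape (refcount bookkeeping and other
details elided):

    static void __put_devmap_managed_page(struct page *page)
    {
            /* ... refcount handling elided ... */
            if (is_device_private_page(page)) {
                    /* device-private pages still need the driver callback */
                    page->pgmap->ops->page_free(page);
            } else {
                    /* FSDAX (and, harmlessly, the other types): open-coded
                     * generic wakeup */
                    wake_up_var(&page->_refcount);
            }
    }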
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-4-jhubbard@nvidia.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      429589d6
• mm/gup: move try_get_compound_head() to top, fix minor issues · a707cdd5
      John Hubbard authored
      
      
      An upcoming patch uses try_get_compound_head() more widely, so move it to
      the top of gup.c.
      
      Also fix a tiny spelling error and a checkpatch.pl warning.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a707cdd5
• mm/gup: factor out duplicate code from four routines · a43e9820
      John Hubbard authored
      
      
      Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12.
      
      Overview:
      
      This is a prerequisite to solving the problem of proper interactions
      between file-backed pages, and [R]DMA activities, as discussed in [1],
      [2], [3], and in a remarkable number of email threads since about
      2017.  :)
      
      A new internal gup flag, FOLL_PIN is introduced, and thoroughly
      documented in the last patch's Documentation/vm/pin_user_pages.rst.
      
      I believe that this will provide a good starting point for doing the
      layout lease work that Ira Weiny has been working on.  That's because
      these new wrapper functions provide a clean, constrained, systematically
      named set of functionality that, again, is required in order to even
      know if a page is "dma-pinned".
      
      In contrast to earlier approaches, the page tracking can be
      incrementally applied to the kernel call sites that, until now, have
      been simply calling get_user_pages() ("gup").  In other words, opt-in by
      changing from this:
      
          get_user_pages() (sets FOLL_GET)
          put_page()
      
      to this:
          pin_user_pages() (sets FOLL_PIN)
          unpin_user_page()
      
      Testing:
      
      * I've done some overall kernel testing (LTP, and a few other goodies),
        and some directed testing to exercise some of the changes. And as you
        can see, gup_benchmark is enhanced to exercise this. Basically, I've
        been able to runtime test the core get_user_pages() and
        pin_user_pages() and related routines, but not so much on several of
        the call sites--but those are generally just a couple of lines
        changed, each.
      
        Not much of the kernel is actually using this, which on one hand
        reduces risk quite a lot. But on the other hand, testing coverage
        is low. So I'd love it if, in particular, the Infiniband and PowerPC
        folks could do a smoke test of this series for me.
      
        Runtime testing for the call sites so far is pretty light:
      
          * io_uring: Some directed tests from liburing exercise this, and
                      they pass.
          * process_vm_access.c: A small directed test passes.
          * gup_benchmark: the enhanced version hits the new gup.c code, and
                           passes.
          * infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py
          * VFIO: compiles (I'm vowing to set up a run time test soon, but it's
                            not ready just yet)
          * powerpc: it compiles...
          * drm/via: compiles...
          * goldfish: compiles...
          * net/xdp: compiles...
          * media/v4l2: compiles...
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
      [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
      
      This patch (of 22):
      
      There are four locations in gup.c that have a fair amount of code
      duplication.  This means that changing one requires making the same
      changes in four places, not to mention reading the same code four times,
      and wondering if there are subtle differences.
      
      Factor out the common code into static functions, thus reducing the
      overall line count and the code's complexity.
      
      Also, take the opportunity to slightly improve the efficiency of the
      error cases, by doing a mass subtraction of the refcount, surrounded by
      get_page()/put_page().
      
Also, further simplify (slightly) by waiting until the successful end
of each routine to increment *nr.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a43e9820
• mm/gup.c: use is_vm_hugetlb_page() to check whether to follow huge · be9d3045
      Wei Yang authored
      
      
No functional change; just leverage the helper function to improve
readability, as other call sites already do.
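
That is, the open-coded flag test becomes the helper (illustrative sketch):

    /* before */
    if (vma->vm_flags & VM_HUGETLB) {
            /* ... follow the hugetlb mapping ... */
    }

    /* after: same check, but self-documenting */
    if (is_vm_hugetlb_page(vma)) {
            /* ... follow the hugetlb mapping ... */
    }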
      
      Link: http://lkml.kernel.org/r/20200113070322.26627-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be9d3045
• mm: fix gup_pud_range · 15494520
      Qiujun Huang authored
      
      
Sorry for not following up on this for a long time.  I ran into it again.
      
      patch v1   https://lkml.org/lkml/2019/9/20/656
      
      do_machine_check()
        do_memory_failure()
          memory_failure()
            hw_poison_user_mappings()
              try_to_unmap()
                pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
      
      ...and now we have a swap entry that indicates that the page entry
      refers to a bad (and poisoned) page of memory, but gup_fast() at this
      level of the page table was ignoring swap entries, and incorrectly
      assuming that "!pxd_none() == valid and present".
      
And this was not just a poisoned page problem, but a general swap entry
problem.  So, any swap entry type (device memory migration, NUMA
migration, or just regular swapping) could lead to the same problem.
      
      Fix this by checking for pxd_present(), instead of pxd_none().
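
In gup_pud_range(), for example, the check becomes roughly:

    /* before: only "none" entries were skipped, so swap/hwpoison
     * entries looked like valid, present mappings */
    if (pud_none(pud))
            return 0;

    /* after: anything that is not present (including swap entries)
     * is rejected */
    if (unlikely(!pud_present(pud)))
            return 0;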
      
      Link: http://lkml.kernel.org/r/1578479084-15508-1-git-send-email-hqjagain@gmail.com
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15494520
• mm/filemap.c: clean up filemap_write_and_wait() · ddf8f376
      Ira Weiny authored
      
      
At some point filemap_write_and_wait() and
filemap_write_and_wait_range() got the exact same implementation, with
the exception of the range being specified in *_range().

Similar to other functions in fs.h which call *_range(..., 0,
LLONG_MAX), change filemap_write_and_wait() to be a static inline which
calls filemap_write_and_wait_range().
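
The resulting helper is essentially:

    static inline int filemap_write_and_wait(struct address_space *mapping)
    {
            return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
    }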
      
      Link: http://lkml.kernel.org/r/20191129160713.30892-1-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ddf8f376
• mm/debug.c: always print flags in dump_page() · 5b57b8f2
      Vlastimil Babka authored
      Commit 76a1850e ("mm/debug.c: __dump_page() prints an extra line")
      inadvertently removed printing of page flags for pages that are neither
      anon nor ksm nor have a mapping.  Fix that.
      
      Using pr_cont() again would be a solution, but the commit explicitly
      removed its use.  Avoiding the danger of mixing up split lines from
      multiple CPUs might be beneficial for near-panic dumps like this, so fix
      this without reintroducing pr_cont().
      
      Link: http://lkml.kernel.org/r/9f884d5c-ca60-dc7b-219c-c081c755fab6@suse.cz
Fixes: 76a1850e ("mm/debug.c: __dump_page() prints an extra line")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b57b8f2
• mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t · 8c96f1bc
      He Zhe authored
      
      
kmemleak_lock, as a rwlock, can on RT possibly be acquired in atomic
context, which does not work.
      
Since the kmemleak operation is performed in atomic context, make it a
raw_spinlock_t so it can also be acquired on RT.  This is used for
debugging and is not enabled by default in a production-like environment
(where performance/latency matters), so it makes sense to make it a
raw_spinlock_t instead of trying to get rid of the atomic context.  Also
turn the kmemleak_object->lock into a raw_spinlock_t, which is acquired
(nested) while the kmemleak_lock is held.
      
The time spent in "echo scan > kmemleak" improved slightly on a 64-core
box with this patch applied after boot.
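
Schematically (an illustrative sketch, not the full patch):

    /* before: rwlock for the global lock, plain spinlock_t per object */
    static DEFINE_RWLOCK(kmemleak_lock);
    /* ... write_lock_irqsave(&kmemleak_lock, flags); ... */

    /* after: raw locks, which may also be taken in atomic context on RT */
    static DEFINE_RAW_SPINLOCK(kmemleak_lock);
    /* ... raw_spin_lock_irqsave(&kmemleak_lock, flags); ...
     * and struct kmemleak_object gains a raw_spinlock_t lock member */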
      
      [bigeasy@linutronix.de: redo the description, update comments. Merge the individual bits:  He Zhe did the kmemleak_lock, Liu Haitao the ->lock and Yongxin Liu forwarded Liu's patch.]
      Link: http://lkml.kernel.org/r/20191219170834.4tah3prf2gdothz4@linutronix.de
      Link: https://lkml.kernel.org/r/20181218150744.GB20197@arrakis.emea.arm.com
      Link: https://lkml.kernel.org/r/1542877459-144382-1-git-send-email-zhe.he@windriver.com
      Link: https://lkml.kernel.org/r/20190927082230.34152-1-yongxin.liu@windriver.com
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Liu Haitao <haitao.liu@windriver.com>
Signed-off-by: Yongxin Liu <yongxin.liu@windriver.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c96f1bc
• mm/slub.c: avoid slub allocation while holding list_lock · 90e9f6a6
      Yu Zhao authored
      
      
      If we are already under list_lock, don't call kmalloc().  Otherwise we
      will run into a deadlock because kmalloc() also tries to grab the same
      lock.
      
      Fix the problem by using a static bitmap instead.
      
        WARNING: possible recursive locking detected
        --------------------------------------------
        mount-encrypted/4921 is trying to acquire lock:
        (&(&n->list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437
      
        but task is already holding lock:
        (&(&n->list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(&(&n->list_lock)->rlock);
          lock(&(&n->list_lock)->rlock);
      
         *** DEADLOCK ***
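
The fix replaces the kmalloc()-backed bitmap with a shared, statically
allocated one guarded by its own lock; a minimal sketch (identifier names
are assumptions, see the patch for the exact code):

    /* one static map, serialized by its own lock, so nothing needs to
     * be allocated while n->list_lock is already held */
    static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
    static DEFINE_SPINLOCK(object_map_lock);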
      
      Link: http://lkml.kernel.org/r/20191108193958.205102-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90e9f6a6
• ocfs2: use ocfs2_update_inode_fsync_trans() to access t_tid in handle->h_transaction · 25b69918
      wangyan authored
      
      
For a uniform format, use ocfs2_update_inode_fsync_trans() to access
t_tid in handle->h_transaction.
      
      Link: http://lkml.kernel.org/r/6ff9a312-5f7d-0e27-fb51-bc4e062fcd97@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25b69918
• ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans() · 9f16ca48
      wangyan authored
      
      
I found a NULL pointer dereference in ocfs2_update_inode_fsync_trans();
handle->h_transaction may be NULL in this situation:
      
      ocfs2_file_write_iter
        ->__generic_file_write_iter
            ->generic_perform_write
              ->ocfs2_write_begin
                ->ocfs2_write_begin_nolock
                  ->ocfs2_write_cluster_by_desc
                    ->ocfs2_write_cluster
                      ->ocfs2_mark_extent_written
                        ->ocfs2_change_extent_flag
                          ->ocfs2_split_extent
                            ->ocfs2_try_to_merge_extent
                              ->ocfs2_extend_rotate_transaction
                                ->ocfs2_extend_trans
                                  ->jbd2_journal_restart
                                    ->jbd2__journal_restart
                                      // handle->h_transaction is NULL here
                                      ->handle->h_transaction = NULL;
                                      ->start_this_handle
                                        /* journal aborted due to storage
                                           network disconnection, return error */
                                        ->return -EROFS;
                               /* line 3806 in ocfs2_try_to_merge_extent (),
                                  it will ignore ret error. */
                              ->ret = 0;
              ->...
              ->ocfs2_write_end
                ->ocfs2_write_end_nolock
                  ->ocfs2_update_inode_fsync_trans
                    // NULL pointer dereference
                    ->oi->i_sync_tid = handle->h_transaction->t_tid;
      
      The information of NULL pointer dereference as follows:
          JBD2: Detected IO errors while flushing file data on dm-11-45
          Aborting journal on device dm-11-45.
          JBD2: Error -5 detected when updating journal superblock for dm-11-45.
          (dd,22081,3):ocfs2_extend_trans:474 ERROR: status = -30
          (dd,22081,3):ocfs2_try_to_merge_extent:3877 ERROR: status = -30
          Unable to handle kernel NULL pointer dereference at
          virtual address 0000000000000008
          Mem abort info:
            ESR = 0x96000004
            Exception class = DABT (current EL), IL = 32 bits
            SET = 0, FnV = 0
            EA = 0, S1PTW = 0
          Data abort info:
            ISV = 0, ISS = 0x00000004
            CM = 0, WnR = 0
          user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000e74e1338
          [0000000000000008] pgd=0000000000000000
          Internal error: Oops: 96000004 [#1] SMP
          Process dd (pid: 22081, stack limit = 0x00000000584f35a9)
          CPU: 3 PID: 22081 Comm: dd Kdump: loaded
          Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
          pstate: 60400009 (nZCv daif +PAN -UAO)
          pc : ocfs2_write_end_nolock+0x2b8/0x550 [ocfs2]
          lr : ocfs2_write_end_nolock+0x2a0/0x550 [ocfs2]
          sp : ffff0000459fba70
          x29: ffff0000459fba70 x28: 0000000000000000
          x27: ffff807ccf7f1000 x26: 0000000000000001
          x25: ffff807bdff57970 x24: ffff807caf1d4000
          x23: ffff807cc79e9000 x22: 0000000000001000
          x21: 000000006c6cd000 x20: ffff0000091d9000
          x19: ffff807ccb239db0 x18: ffffffffffffffff
          x17: 000000000000000e x16: 0000000000000007
          x15: ffff807c5e15bd78 x14: 0000000000000000
          x13: 0000000000000000 x12: 0000000000000000
          x11: 0000000000000000 x10: 0000000000000001
          x9 : 0000000000000228 x8 : 000000000000000c
          x7 : 0000000000000fff x6 : ffff807a308ed6b0
          x5 : ffff7e01f10967c0 x4 : 0000000000000018
          x3 : d0bc661572445600 x2 : 0000000000000000
          x1 : 000000001b2e0200 x0 : 0000000000000000
          Call trace:
           ocfs2_write_end_nolock+0x2b8/0x550 [ocfs2]
           ocfs2_write_end+0x4c/0x80 [ocfs2]
           generic_perform_write+0x108/0x1a8
           __generic_file_write_iter+0x158/0x1c8
           ocfs2_file_write_iter+0x668/0x950 [ocfs2]
           __vfs_write+0x11c/0x190
           vfs_write+0xac/0x1c0
           ksys_write+0x6c/0xd8
           __arm64_sys_write+0x24/0x30
           el0_svc_common+0x78/0x130
           el0_svc_handler+0x38/0x78
           el0_svc+0x8/0xc
      
      To prevent a NULL pointer dereference in this situation, we check
      is_handle_aborted() before dereferencing handle->h_transaction->t_tid.
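      
      In sketch form, the guarded update looks roughly like this (only names
      visible in the trace above are used; the actual hunk in
      ocfs2_update_inode_fsync_trans() may differ in detail):
      
          /* Sketch: only touch h_transaction while the handle is still alive. */
          if (!is_handle_aborted(handle))
                  oi->i_sync_tid = handle->h_transaction->t_tid;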
      
      Link: http://lkml.kernel.org/r/03e750ab-9ade-83aa-b000-b9e81e34e539@huawei.com
      Signed-off-by: default avatarYan Wang <wangyan122@huawei.com>
      Reviewed-by: default avatarJun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9f16ca48
    • Andy Shevchenko's avatar
      ocfs2/dlm: move BITS_TO_BYTES() to bitops.h for wider use · dd3e7cba
      Andy Shevchenko authored
      
      
      There are already users of the BITS_TO_BYTES() macro, and there will be
      more.  Move it to bitops.h for wider use.
      
      In the case of ocfs2 the replacement is identical.
      
      As for bnx2x, there are two places where the floor version is used.  In
      the first case it calculates the number of structures that can fit in
      one memory page.  In this case the ceiling variant is clearly correct,
      and the original code might have a potential bug if the number of bits
      % 8 is not 0.  In the second case the macro is used to calculate the
      bytes transmitted in one microsecond.  This works unchanged for all
      speeds that are a multiple of 1 Gbps; for the rest, the new code gives
      the ceiling value.  For instance, 100 Mbps now gives 13 bytes, while the
      old code gives 12 bytes and the arithmetically correct value is 12.5
      bytes.  Further, the value is used to set up a timer threshold, which in
      any case has its own margins due to limited resolution.  I don't see an
      issue here with slightly shifting thresholds for low-speed connections;
      the card is supposed to utilize the highest available rate, which is
      usually 10 Gbps.
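      
      For reference, the ceiling form is essentially a one-liner on top of
      existing kernel helpers; a sketch of what the shared definition amounts
      to (the floor variant is shown only for comparison):
      
          /* Ceiling conversion: a partial byte still occupies a whole byte. */
          #define BITS_TO_BYTES(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE)
      
          /* Floor conversion, for comparison: nr / BITS_PER_BYTE
           * e.g. 100 bits -> 12 bytes, versus 13 with the ceiling form. */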
      
      Link: http://lkml.kernel.org/r/20200108121316.22411-1-andriy.shevchenko@linux.intel.com
      Signed-off-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: default avatarSudarsana Reddy Kalluru <skalluru@marvell.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dd3e7cba
    • Colin Ian King's avatar
      ocfs2/dlm: remove redundant assignment to ret · d8f18750
      Colin Ian King authored
      
      
      The variable ret is initialized with a value that is never read, and it
      is updated later with a new value.  The initialization is therefore
      redundant and can be removed.
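      
      The pattern being cleaned up looks roughly like this (identifiers are
      illustrative, not the actual dlm hunk):
      
          int ret = -ENOMEM;        /* value is never read ... */
      
          ret = setup_something();  /* ... because it is overwritten here */
          if (ret)
                  return ret;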
      
      Addresses-Coverity: ("Unused value")
      
      Link: http://lkml.kernel.org/r/20191202164833.62865-1-colin.king@canonical.com
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d8f18750
    • Masahiro Yamada's avatar
      ocfs2: make local header paths relative to C files · ca322fb6
      Masahiro Yamada authored
      
      
      Gang He reports the failure of building fs/ocfs2/ as an external module
      of the kernel installed on the system:
      
       $ cd fs/ocfs2
       $ make -C /lib/modules/`uname -r`/build M=`pwd` modules
      
      If you want to make it work reliably, I'd recommend removing ccflags-y
      from the Makefiles and making the header paths relative to the C files.
      I think this is the correct usage of the #include "..." directive.
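      
      Illustratively, the change trades a Makefile-wide include path for paths
      relative to each C file (the directories below are an example, not the
      literal hunk):
      
          /* Makefile, before: a global include path via ccflags-y,
           * e.g. ccflags-y := -I$(src)/.. ; after: no ccflags-y needed. */
      
          /* in a subdirectory C file, before: */
          #include "ocfs2.h"
          /* after: path relative to the C file itself */
          #include "../ocfs2.h"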
      
      Link: http://lkml.kernel.org/r/20191227022950.14804-1-ghe@suse.com
      Signed-off-by: default avatarMasahiro Yamada <masahiroy@kernel.org>
      Signed-off-by: default avatarGang He <ghe@suse.com>
      Reported-by: default avatarGang He <ghe@suse.com>
      Reviewed-by: default avatarGang He <ghe@suse.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ca322fb6
    • zhengbin's avatar
      ocfs2: remove unneeded semicolons · 5b43d645
      zhengbin authored
      
      
      Fixes coccicheck warnings:
      
        fs/ocfs2/cluster/quorum.c:76:2-3: Unneeded semicolon
        fs/ocfs2/dlmglue.c:573:2-3: Unneeded semicolon
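      
      The warning flags the classic stray semicolon after a closing brace,
      e.g. (illustrative, not the literal quorum.c/dlmglue.c hunks):
      
          switch (state) {
          default:
                  break;
          };      /* <-- the trailing ';' is the unneeded semicolon */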
      
      Link: http://lkml.kernel.org/r/6ee3aa16-9078-30b1-df3f-22064950bd98@linux.alibaba.com
      Signed-off-by: default avatarzhengbin <zhengbin13@huawei.com>
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Acked-by: default avatarJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b43d645
    • Aditya Pakki's avatar
      fs: ocfs: remove unnecessary assertion in dlm_migrate_lockres · 67e2d2eb
      Aditya Pakki authored
      
      
      In the only caller of dlm_migrate_lockres(), dlm_empty_lockres(), target
      is already checked against O2NM_MAX_NODES.  Thus the assertion in
      dlm_migrate_lockres() is unnecessary; this patch removes it.
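      
      Roughly, the shape of the change (a simplified sketch; the real calls
      carry more arguments):
      
          /* dlm_empty_lockres(), the only caller, already bails out when no
           * valid target node exists, so target is valid on entry ... */
          if (target == O2NM_MAX_NODES)
                  goto leave;
          ret = dlm_migrate_lockres(dlm, res, target);
      
          /* ... which makes the assertion inside dlm_migrate_lockres()
           * redundant:
           *         BUG_ON(target == O2NM_MAX_NODES);     <-- removed
           */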
      
      Link: http://lkml.kernel.org/r/20191218194111.26041-1-pakki001@umn.edu
      Signed-off-by: default avatarAditya Pakki <pakki001@umn.edu>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67e2d2eb
    • Luca Ceresoli's avatar
      scripts/spelling.txt: add "issus" typo · 4efc61c7
      Luca Ceresoli authored
      
      
      Add "issus" and correct it as "issues".
      
      Link: http://lkml.kernel.org/r/20200105221950.8384-1-luca@lucaceresoli.net
      Signed-off-by: default avatarLuca Ceresoli <luca@lucaceresoli.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4efc61c7
    • Xiong's avatar
      scripts/spelling.txt: add more spellings to spelling.txt · 2ab1278f
      Xiong authored
      
      
      Here are some of the common spelling mistakes and typos that I've found
      while fixing up spelling mistakes in the kernel.  Most of them still
      exist in more than two source files.
      
      Link: http://lkml.kernel.org/r/20191229143626.51238-1-xndchn@gmail.com
      Signed-off-by: default avatarXiong <xndchn@gmail.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Chris Paterson <chris.paterson2@renesas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2ab1278f
    • Yang Shi's avatar
      mm: move_pages: report the number of non-attempted pages · 5984fabb
      Yang Shi authored
      Since commit a49bd4d7 ("mm, numa: rework do_pages_move"), the
      semantics of move_pages() have changed: it returns the number of
      non-migrated pages if the failures were the result of non-fatal
      reasons (usually a busy page).
      
      This was an unintentional change that hasn't been noticed except for LTP
      tests which checked for the documented behavior.
      
      There are two ways to deal with this change.  We could go back to the
      original behavior and return -EAGAIN whenever migrate_pages() is not
      able to migrate pages due to non-fatal reasons.  Another option is to
      keep the changed semantics and extend the move_pages() documentation to
      clarify that -errno is returned on an invalid input or when migration
      simply cannot succeed (e.g.  -ENOMEM, -EBUSY), while a positive return
      value is the number of pages that couldn't be migrated due to ephemeral
      reasons (e.g.  the page is pinned or locked for other reasons).
      
      This patch implements the second option because this behavior has been
      in place for some time without anybody complaining, and there are
      possibly new users depending on it.  It also allows slightly easier
      error handling, as the caller knows that it is worth retrying when
      err > 0.
      
      But since under the new semantics the call aborts immediately when
      migration fails due to ephemeral reasons, the number of non-attempted
      pages needs to be included in the return value too.
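      
      From user space, the resulting contract can be handled roughly like this
      (a sketch against the documented numaif.h interface; migrate_or_report()
      and its parameters are illustrative):
      
          #include <numaif.h>     /* move_pages(); link with -lnuma */
          #include <stdio.h>
      
          /* Ask the kernel to migrate 'count' pages of the calling process. */
          static long migrate_or_report(unsigned long count, void **pages,
                                        const int *nodes, int *status)
          {
                  long rc = move_pages(0 /* self */, count, pages, nodes,
                                       status, MPOL_MF_MOVE);
                  if (rc < 0) {
                          perror("move_pages");   /* hard failure: -errno */
                  } else if (rc > 0) {
                          /* rc pages were not migrated (busy, pinned, ...);
                           * with this change it also counts pages that were
                           * never attempted, so a retry is worthwhile. */
                          fprintf(stderr, "%ld pages not migrated\n", rc);
                  }
                  return rc;      /* 0 means every page was migrated */
          }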
      
      Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: a49bd4d7 ("mm, numa: rework do_pages_move")
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Cc: <stable@vger.kernel.org>    [4.17+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5984fabb
    • Wei Yang's avatar
      mm: thp: don't need care deferred split queue in memcg charge move path · fac0516b
      Wei Yang authored
      If compound is true, the page is a PMD-mapped THP, which implies it is
      not linked to any deferred split list.  So the first code chunk will
      never be executed.
      
      For the same reason, it would not be proper to add this page to a
      deferred split list.  So the second code chunk is not correct either.
      
      Based on this, we should remove the deferred split list related code
      (sketched below).
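      
      The code being removed from the memcg charge move path has roughly this
      shape (a sketch reconstructed from the reasoning above, not the verbatim
      hunk):
      
          /* chunk 1: with compound == true the page is a PMD-mapped THP and
           * can never sit on a deferred split list, so this branch is dead */
          if (compound && !list_empty(page_deferred_list(page))) {
                  spin_lock(&from->deferred_split_queue.split_queue_lock);
                  list_del_init(page_deferred_list(page));
                  from->deferred_split_queue.split_queue_len--;
                  spin_unlock(&from->deferred_split_queue.split_queue_lock);
          }
      
          /* chunk 2 (not shown): for the same reason it would be wrong to
           * queue the page on the target memcg's deferred split list here */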
      
      [yang.shi@linux.alibaba.com: better patch title]
      Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com
      Fixes: 87eaceb3 ("mm: thp: make deferred split shrinker memcg aware")
      Signed-off-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Suggested-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>    [5.4+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fac0516b
    • Dan Williams's avatar
      mm/memory_hotplug: fix remove_memory() lockdep splat · f1037ec0
      Dan Williams authored
      The daxctl unit test for the dax_kmem driver currently triggers the
      (false positive) lockdep splat below.  It results from the fact that
      remove_memory_block_devices() is invoked under the mem_hotplug_lock()
      causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
      active state tracking).  It is a false positive because the sysfs
      attribute path triggering the memory remove is not the same attribute
      path associated with memory-block device.
      
      sysfs_break_active_protection() is not applicable since there is no real
      deadlock conflict; instead, move memory-block device removal outside the
      lock.  The mem_hotplug_lock() is not needed to synchronize the
      memory-block device removal vs the page online state, that is already
      handled by lock_device_hotplug().  Specifically, lock_device_hotplug()
      is sufficient to allow try_remove_memory() to check the offline state of
      the memblocks and be assured that any in progress online attempts are
      flushed / blocked by kernfs_drain() / attribute removal.
      
      The add_memory() path safely creates memblock devices under the
      mem_hotplug_lock().  There is no kernfs active state synchronization in
      the memblock device_register() path, so nothing to fix there.
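      
      The resulting ordering in try_remove_memory() is, in sketch form
      (simplified; assuming the post-refactor structure of the function):
      
          /* memory-block device removal happens under lock_device_hotplug()
           * only, before mem_hotplug_lock is taken ... */
          remove_memory_block_devices(start, size);
      
          /* ... so kernfs attribute removal no longer nests inside it */
          mem_hotplug_begin();
          arch_remove_memory(nid, start, size, NULL);
          mem_hotplug_done();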
      
      This change is only possible thanks to the recent change that refactored
      memory block device removal out of arch_remove_memory() (commit
      4c4b7f9b "mm/memory_hotplug: remove memory block devices before
      arch_remove_memory()"), and David's due diligence tracking down the
      guarantees afforded by kernfs_drain().  Not flagged for -stable since
      this only impacts ongoing development and lockdep validation, not a
      runtime issue.
      
          ======================================================
          WARNING: possible circular locking dependency detected
          5.5.0-rc3+ #230 Tainted: G           OE
          ------------------------------------------------------
          lt-daxctl/6459 is trying to acquire lock:
          ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80
      
          but task is already holding lock:
          ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #2 (mem_hotplug_lock.rw_sem){++++}:
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 get_online_mems+0x3e/0xb0
                 kmem_cache_create_usercopy+0x2e/0x260
                 kmem_cache_create+0x12/0x20
                 ptlock_cache_init+0x20/0x28
                 start_kernel+0x243/0x547
                 secondary_startup_64+0xb6/0xc0
      
          -> #1 (cpu_hotplug_lock.rw_sem){++++}:
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 cpus_read_lock+0x3e/0xb0
                 online_pages+0x37/0x300
                 memory_subsys_online+0x17d/0x1c0
                 device_online+0x60/0x80
                 state_store+0x65/0xd0
                 kernfs_fop_write+0xcf/0x1c0
                 vfs_write+0xdb/0x1d0
                 ksys_write+0x65/0xe0
                 do_syscall_64+0x5c/0xa0
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          -> #0 (kn->count#241){++++}:
                 check_prev_add+0x98/0xa40
                 validate_chain+0x576/0x860
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 __kernfs_remove+0x25f/0x2e0
                 kernfs_remove_by_name_ns+0x41/0x80
                 remove_files.isra.0+0x30/0x70
                 sysfs_remove_group+0x3d/0x80
                 sysfs_remove_groups+0x29/0x40
                 device_remove_attrs+0x39/0x70
                 device_del+0x16a/0x3f0
                 device_unregister+0x16/0x60
                 remove_memory_block_devices+0x82/0xb0
                 try_remove_memory+0xb5/0x130
                 remove_memory+0x26/0x40
                 dev_dax_kmem_remove+0x44/0x6a [kmem]
                 device_release_driver_internal+0xe4/0x1c0
                 unbind_store+0xef/0x120
                 kernfs_fop_write+0xcf/0x1c0
                 vfs_write+0xdb/0x1d0
                 ksys_write+0x65/0xe0
                 do_syscall_64+0x5c/0xa0
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          other info that might help us debug this:
      
          Chain exists of:
            kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem
      
           Possible unsafe locking scenario:
      
                 CPU0                    CPU1
                 ----                    ----
            lock(mem_hotplug_lock.rw_sem);
                                         lock(cpu_hotplug_lock.rw_sem);
                                         lock(mem_hotplug_lock.rw_sem);
            lock(kn->count#241);
      
           *** DEADLOCK ***
      
      No fixes tag as this has been a long standing issue that predated the
      addition of kernfs lockdep annotations.
      
      Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1037ec0
    • Wei Yang's avatar
      mm/migrate.c: also overwrite error when it is bigger than zero · dfe9aa23
      Wei Yang authored
      If we get here after successfully adding the page to the list, err is 1,
      indicating that the page is queued in the list.
      
      The current code has two problems:
      
        * on success, 0 is not returned
        * on error, if add_page_for_migration() returns 1 and the following err1
          from do_move_pages_to_node() is set, err1 is not returned since err
          is 1
      
      Both behaviors break the user interface.
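      
      The flush path therefore needs to overwrite err not only when it is zero
      but also when it is the leftover positive 1, roughly:
      
          /* in do_pages_move(), when flushing the final page list */
          err1 = do_move_pages_to_node(mm, &pagelist, current_node);
          if (err >= 0)           /* was: if (!err) -- missed the err == 1 case */
                  err = err1;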
      
      Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com
      Fixes: e0153fc2 ("mm: move_pages: return valid node id in status if
      the page is already on the target node")
      Signed-off-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Acked-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dfe9aa23
    • Pingfan Liu's avatar
      mm/sparse.c: reset section's mem_map when fully deactivated · 1f503443
      Pingfan Liu authored
      After commit ba72b4c8 ("mm/sparsemem: support sub-section hotplug"),
      when a mem section is fully deactivated, section_mem_map still records
      the section's start pfn, which is not used any more and will be
      reassigned during re-addition.
      
      In analogy with the alloc/free pattern, it is better to clear all fields
      of section_mem_map.
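      
      In sketch form, the deactivation path then drops the whole encoding
      instead of leaving the stale start pfn behind (simplified; the real
      section_deactivate() does more bookkeeping):
      
          /* section fully deactivated: clear every field of the encoding so
           * the section no longer looks like it still has a mem_map */
          ms->section_mem_map = 0;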
      
      Besides this, it breaks the user space tool "makedumpfile" [1], which
      assumes that a hot-removed section has mem_map as NULL, instead of
      checking directly against the SECTION_MARKED_PRESENT bit.  (makedumpfile
      would do better to change this assumption, but that needs a separate
      patch.)
      
      The bug can be reproduced on IBM POWERVM by running "drmgr -c mem -r -q 5",
      triggering a crash, and saving the vmcore with makedumpfile.
      [1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")
      
      Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com
      Signed-off-by: default avatarPingfan Liu <kernelfans@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f503443