  1. Nov 24, 2019
    • drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror · 62914a99
      Jason Gunthorpe authored
      
      
      Remove the interval tree in the driver and rely on the tree maintained by
      the mmu_notifier for delivering mmu_notifier invalidation callbacks.
      
      For some reason amdgpu has a very complicated arrangement where it tries
      to prevent duplicate entries in the interval tree. This is not necessary;
      each amdgpu_bo can be its own standalone entry, since the interval_tree
      already allows duplicates and overlaps in the tree.
      
      Also, there is no need to remove entries upon a release callback, the
      mmu_interval API safely allows objects to remain registered beyond the
      lifetime of the mm. The driver only has to stop touching the pages during
      release.
      
      Link: https://lore.kernel.org/r/20191112202231.3856-12-jgg@ziepe.ca
      Reviewed-by: Philip Yang <Philip.Yang@amd.com>
      Tested-by: Philip Yang <Philip.Yang@amd.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • drm/amdgpu: Call find_vma under mmap_sem · a9ae8731
      Jason Gunthorpe authored
      find_vma() must be called under the mmap_sem, reorganize this code to
      do the vma check after entering the lock.
      
      Further, fix the unlocked use of struct task_struct's mm, instead use
      the mm from hmm_mirror which has an active mm_grab. Also the mm_grab
      must be converted to a mm_get before acquiring mmap_sem or calling
      find_vma().
      
      Fixes: 66c45500 ("drm/amdgpu: use new HMM APIs and helpers")
      Fixes: 0919195f ("drm/amdgpu: Enable amdgpu_ttm_tt_get_user_pages in worker threads")
      Link: https://lore.kernel.org/r/20191112202231.3856-11-jgg@ziepe.ca
      Acked-by: Christian König <christian.koenig@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Reviewed-by: Philip Yang <Philip.Yang@amd.com>
      Tested-by: Philip Yang <Philip.Yang@amd.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • nouveau: use mmu_interval_notifier instead of hmm_mirror · 20fef4ef
      Jason Gunthorpe authored
      
      
      Remove the hmm_mirror object and use the mmu_interval_notifier API instead
      for the range, and use the normal mmu_notifier API for the general
      invalidation callback.
      
      While here re-organize the pagefault path so the locking pattern is clear.
      
      nouveau is the only driver that uses a temporary range object and instead
      forwards nearly every invalidation range directly to the HW. While this is
      not how the mmu_interval_notifier was intended to be used, the overheads on
      the pagefaulting path are similar to the existing hmm_mirror version.
      Particularly since the interval tree will be small.
      
      Link: https://lore.kernel.org/r/20191112202231.3856-10-jgg@ziepe.ca
      Tested-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • nouveau: use mmu_notifier directly for invalidate_range_start · c625c274
      Jason Gunthorpe authored
      
      
      There is no reason to get the invalidate_range_start() callback via an
      indirection through hmm_mirror, just register a normal notifier directly.
      
      Link: https://lore.kernel.org/r/20191112202231.3856-9-jgg@ziepe.ca
      Tested-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • drm/radeon: use mmu_interval_notifier_insert · 3506ff69
      Jason Gunthorpe authored
      
      
      The new API is an exact match for the needs of radeon.
      
      For some reason radeon tries to remove overlapping ranges from the
      interval tree, but interval trees (and mmu_interval_notifier_insert())
      support overlapping ranges directly. Simply delete all this code.
      
      Since this driver is missing an invalidate_range_end callback, but
      still calls get_user_pages(), it cannot be correct against all races.
      
      Link: https://lore.kernel.org/r/20191112202231.3856-8-jgg@ziepe.ca
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv · 3889551d
      Jason Gunthorpe authored
      
      
      This converts one of the two users of mmu_notifiers to use the new API.
      The conversion is fairly straightforward, however the existing use of
      notifiers here seems to be racy.
      
      Link: https://lore.kernel.org/r/20191112202231.3856-7-jgg@ziepe.ca
      Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/odp: Use mmu_interval_notifier_insert() · f25a546e
      Jason Gunthorpe authored
      Replace the internal interval tree based mmu notifier with the new common
      mmu_interval_notifier_insert() API. This removes a lot of code and fixes a
      deadlock that can be triggered in ODP:
      
       zap_page_range()
        mmu_notifier_invalidate_range_start()
         [..]
          ib_umem_notifier_invalidate_range_start()
             down_read(&per_mm->umem_rwsem)
        unmap_single_vma()
          [..]
            __split_huge_page_pmd()
              mmu_notifier_invalidate_range_start()
              [..]
                 ib_umem_notifier_invalidate_range_start()
                    down_read(&per_mm->umem_rwsem)   // DEADLOCK
      
              mmu_notifier_invalidate_range_end()
                 up_read(&per_mm->umem_rwsem)
        mmu_notifier_invalidate_range_end()
           up_read(&per_mm->umem_rwsem)
      
      The umem_rwsem is held across the range_start/end as the ODP algorithm for
      invalidate_range_end cannot tolerate changes to the interval
      tree. However, due to the nested invalidation regions the second
      down_read() can deadlock if there are competing writers. The new core code
      provides an alternative scheme to solve this problem.
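
The deadlock mechanism above can be modeled with a toy fair rwsem, in which new readers must queue behind a waiting writer. This is purely an illustration with invented names (FairRWSem, demonstrate_deadlock), not kernel code; it only shows why the nested down_read() cannot make progress once a writer is queued.

```python
class FairRWSem:
    """Toy fair rwsem: once a writer is waiting, new readers queue behind it."""
    def __init__(self):
        self.readers = 0
        self.writer_waiting = False

    def down_read(self):
        if self.writer_waiting and self.readers > 0:
            # The nested reader waits on the writer, the writer waits on the
            # outer reader: neither can ever make progress.
            raise RuntimeError("deadlock")
        self.readers += 1

    def up_read(self):
        self.readers -= 1

def demonstrate_deadlock():
    sem = FairRWSem()
    sem.down_read()            # outer invalidate_range_start()
    sem.writer_waiting = True  # a competing writer queues up
    try:
        sem.down_read()        # nested invalidate_range_start()
        return False
    except RuntimeError:
        return True            # the nested down_read() can never succeed
```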
      
      Fixes: ca748c39 ("RDMA/umem: Get rid of per_mm->notifier_count")
      Link: https://lore.kernel.org/r/20191112202231.3856-6-jgg@ziepe.ca
      Tested-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • mm/hmm: define the pre-processor related parts of hmm.h even if disabled · 107e8998
      Jason Gunthorpe authored
      
      
      Only the function calls are stubbed out with static inlines that always
      fail. This is the standard way to write a header for an optional component
      and makes it easier for drivers that only optionally need HMM_MIRROR.
      
      Link: https://lore.kernel.org/r/20191112202231.3856-5-jgg@ziepe.ca
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Tested-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror · 04ec32fb
      Jason Gunthorpe authored
      
      
      hmm_mirror's handling of ranges does not use a sequence count which
      results in this bug:
      
               CPU0                                   CPU1
                                           hmm_range_wait_until_valid(range)
                                               valid == true
                                           hmm_range_fault(range)
      hmm_invalidate_range_start()
         range->valid = false
      hmm_invalidate_range_end()
         range->valid = true
                                           hmm_range_valid(range)
                                                valid == true
      
      Where the hmm_range_valid() should not have succeeded.
      
      Adding the required sequence count would make it nearly identical to the
      new mmu_interval_notifier. Instead replace the hmm_mirror stuff with
      mmu_interval_notifier.
      
      Co-existence of the two APIs is the first step.
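
The race above can be re-enacted deterministically in a single thread. The sketch below uses invented names (Range, run_flag_check, run_seq_check), not the kernel structures: a full invalidate cycle between CPU1's two checks leaves the hmm_mirror-style boolean looking valid, while a forward-only sequence count records that a collision happened.

```python
# Toy model of the race; all names are illustrative, not kernel code.
class Range:
    def __init__(self):
        self.valid = True  # hmm_mirror-style flag, restored by invalidate_end()
        self.seq = 0       # mmu_interval_notifier-style forward-only counter

    def invalidate_start(self):
        self.valid = False
        self.seq += 1

    def invalidate_end(self):
        self.valid = True

def run_flag_check():
    r = Range()
    ok_before = r.valid           # CPU1: hmm_range_wait_until_valid()
    r.invalidate_start()          # CPU0: a full invalidate cycle in between
    r.invalidate_end()
    return ok_before and r.valid  # CPU1: hmm_range_valid() wrongly succeeds

def run_seq_check():
    r = Range()
    seq_before = r.seq
    r.invalidate_start()
    r.invalidate_end()
    return seq_before == r.seq    # False: the collision is detected
```

run_flag_check() returns True even though an invalidation ran in between, which is exactly the bug; run_seq_check() returns False, forcing a retry.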
      
      Link: https://lore.kernel.org/r/20191112202231.3856-4-jgg@ziepe.ca
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Tested-by: Philip Yang <Philip.Yang@amd.com>
      Tested-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • mm/mmu_notifier: add an interval tree notifier · 99cb252f
      Jason Gunthorpe authored
      
      
      Of the 13 users of mmu_notifiers, 8 of them use only
      invalidate_range_start/end() and immediately intersect the
      mmu_notifier_range with some kind of internal list of VAs.  4 use an
      interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
      of some kind (scif_dma, vhost, gntdev, hmm).
      
      And the remaining 5 either don't use invalidate_range_start() or do some
      special thing with it.
      
      It turns out that building a correct scheme with an interval tree is
      pretty complicated, particularly if the use case is synchronizing against
      another thread doing get_user_pages().  Many of these implementations have
      various subtle and difficult to fix races.
      
      This approach puts the interval tree as common code at the top of the mmu
      notifier call tree and implements a shareable locking scheme.
      
      It includes:
       - An interval tree tracking VA ranges, with per-range callbacks
       - A read/write locking scheme for the interval tree that avoids
         sleeping in the notifier path (for OOM killer)
       - A sequence counter based collision-retry locking scheme to tell
         device page fault that a VA range is being concurrently invalidated.
      
      This is based on various ideas:
      - hmm accumulates invalidated VA ranges and releases them when all
        invalidates are done, via active_invalidate_ranges count.
        This approach avoids having to intersect the interval tree twice (as
        umem_odp does) at the potential cost of a longer device page fault.
      
      - kvm/umem_odp use a sequence counter to drive the collision retry,
        via invalidate_seq
      
      - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
        This makes adding/removing interval tree members more deterministic
      
      - seqlock, except this version makes the seqlock idea multi-holder on the
        write side by protecting it with active_invalidate_ranges and a spinlock
      
      To minimize MM overhead when only the interval tree is being used, the
      entire SRCU and hlist overheads are dropped using some simple
      branches. Similarly the interval tree overhead is dropped when in hlist
      mode.
      
      The overhead from the mandatory spinlock is broadly the same as for most
      of the existing users, which already had a lock (or two) of some sort on
      the invalidation path.
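
The sequence-counter collision-retry scheme described above can be sketched as follows. This is an illustrative model only, not the real mmu_interval_read_begin()/mmu_interval_read_retry() implementation; the names and the collapsed start/end step are assumptions.

```python
class IntervalNotifier:
    """Toy per-range notifier; invalidate_seq only moves forward."""
    def __init__(self):
        self.invalidate_seq = 0

    def invalidate(self):
        self.invalidate_seq += 1   # start/end collapsed into one step

    def read_begin(self):
        return self.invalidate_seq

    def read_retry(self, seq):
        return seq != self.invalidate_seq

def device_fault(notifier, get_pages):
    """Collision-retry: loop until no invalidation raced with get_pages()."""
    while True:
        seq = notifier.read_begin()
        pages = get_pages()
        if not notifier.read_retry(seq):
            return pages

def demo():
    n = IntervalNotifier()
    attempts = []
    def get_pages():
        attempts.append(1)
        if len(attempts) == 1:
            n.invalidate()  # an invalidation races with the first attempt
        return "pages"
    return device_fault(n, get_pages), len(attempts)
```

In demo(), the first fault attempt collides with an invalidation and is retried; the second attempt sees a stable sequence number and succeeds.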
      
      Link: https://lore.kernel.org/r/20191112202231.3856-3-jgg@ziepe.ca
      Acked-by: Christian König <christian.koenig@amd.com>
      Tested-by: Philip Yang <Philip.Yang@amd.com>
      Tested-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  2. Nov 13, 2019
  3. Nov 02, 2019
    • Merge branch 'odp_rework' into hmm.git · 0e64e5b3
      Jason Gunthorpe authored
      This branch is shared with the rdma.git for dependencies in following
      patches:
      
      ====================
      In order to hoist the interval tree code out of the drivers and into the
      mmu_notifiers it is necessary for the drivers to not use the interval tree
      for other things.
      
      This series replaces the interval tree with an xarray and along the way
      re-aligns all the locking to use a sensible SRCU model where the 'update'
      step is done by modifying an xarray.
      
      The result is overall much simpler and with less locking in the critical
      path. Many functions were reworked for clarity and small details like
      using 'imr' to refer to the implicit MR make the entire code flow here
      more readable.
      
      This also squashes at least two race bugs on its own, and quite possibly
      more that haven't been identified.
      ====================
      
      * branch 'odp_rework':
        RDMA/odp: Remove broken debugging call to invalidate_range
        RDMA/mlx5: Do not race with mlx5_ib_invalidate_range during create and destroy
        RDMA/mlx5: Do not store implicit children in the odp_mkeys xarray
        RDMA/mlx5: Rework implicit ODP destroy
        RDMA/mlx5: Avoid double lookups on the pagefault path
        RDMA/mlx5: Reduce locking in implicit_mr_get_data()
        RDMA/mlx5: Use an xarray for the children of an implicit ODP
        RDMA/mlx5: Split implicit handling from pagefault_mr
        RDMA/mlx5: Set the HW IOVA of the child MRs to their place in the tree
        RDMA/mlx5: Lift implicit_mr_alloc() into the two routines that call it
        RDMA/mlx5: Rework implicit_mr_get_data
        RDMA/mlx5: Delete struct mlx5_priv->mkey_table
        RDMA/mlx5: Use a dedicated mkey xarray for ODP
        RDMA/mlx5: Split sig_err MR data into its own xarray
        RDMA/mlx5: Use SRCU properly in ODP prefetch
  4. Oct 30, 2019
  5. Oct 29, 2019
    • RDMA/odp: Remove broken debugging call to invalidate_range · 46870b23
      Jason Gunthorpe authored
      
      
      invalidate_range() also obtains the umem_mutex which is being held at this
      point, so if this path were ever called it would deadlock. Thus conclude
      the debugging never triggers, rework it into a simple WARN_ON, and leave
      things as they are.
      
      While here add a note to explain how we could possibly get inconsistent
      page pointers.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-16-jgg@ziepe.ca
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Do not race with mlx5_ib_invalidate_range during create and destroy · 09689703
      Jason Gunthorpe authored
      For creation, as soon as the umem_odp is created the notifier can be
      called, however the underlying MR may not have been setup yet. This would
      cause problems if mlx5_ib_invalidate_range() runs. There is some
      confusing/unlocked/racy code that might be trying to solve this, but
      without locks it isn't going to work right.
      
      Instead trivially solve the problem by short-circuiting the invalidation
      if there are not yet any DMA mapped pages. By definition there is nothing
      to invalidate in this case.
      
      The create code will have the umem fully setup before anything is DMA
      mapped, and npages is fully locked by the umem_mutex.
      
      For destroy, invalidate the entire MR at the HW to stop DMA then DMA unmap
      the pages before destroying the MR. This drives npages to zero and
      prevents similar racing with invalidate while the MR is undergoing
      destruction.
      
      Arguably it would be better if the umem was created after the MR and
      destroyed before, but that would require a big rework of the MR code.
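
The create-side fix reduces to a guard at the top of the invalidate callback. Below is a sketch with invented names (UmemOdp, invalidate_range), not the mlx5 code, showing the short-circuit behavior.

```python
class UmemOdp:
    def __init__(self):
        self.npages = 0        # DMA mapped page count, guarded by umem_mutex
        self.invalidated = []  # record of processed invalidations (demo only)

def invalidate_range(umem, start, end):
    # Short-circuit: nothing is DMA mapped yet (create) or anymore (destroy),
    # so by definition there is nothing to invalidate.
    if umem.npages == 0:
        return
    umem.invalidated.append((start, end))

def demo():
    u = UmemOdp()
    invalidate_range(u, 0, 4096)   # MR not set up yet: skipped
    skipped = (u.invalidated == [])
    u.npages = 1                   # pages are now DMA mapped
    invalidate_range(u, 0, 4096)   # processed normally
    return skipped, u.invalidated
```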
      
      Fixes: 6aec21f6 ("IB/mlx5: Page faults handling infrastructure")
      Link: https://lore.kernel.org/r/20191009160934.3143-15-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Do not store implicit children in the odp_mkeys xarray · d561987f
      Jason Gunthorpe authored
      
      
      These mkeys are entirely internal and are never used by the HW for
      page fault. They should also never be used by userspace for prefetch.
      Simplify & optimize things by not including them in the xarray.
      
      Since the prefetch path can now never see a child mkey there is no need
      for the second synchronize_srcu() during imr destroy.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-14-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Rework implicit ODP destroy · 5256edcb
      Jason Gunthorpe authored
      
      
      Use SRCU in a sensible way by removing all MRs in the implicit tree from
      the two xarrays (the update operation), then a synchronize, followed by a
      normal single threaded teardown.
      
      This is only a little unusual from the normal pattern as there can still
      be some work pending in the unbound wq that may also require a workqueue
      flush. This is tracked with a single atomic, consolidating the redundant
      existing atomics and wait queue.
      
      For understandability the entire ODP implicit create/destroy flow now
      largely exists in a single pair of functions within odp.c, with a few
      support functions for tearing down an unused child.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-13-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Avoid double lookups on the pagefault path · b70d785d
      Jason Gunthorpe authored
      
      
      Now that the locking is simplified combine pagefault_implicit_mr() with
      implicit_mr_get_data() so that we sweep over the idx range only once,
      and do the single xlt update at the end, after the child umems are
      setup.
      
      This avoids double iteration/xa_loads plus the sketchy failure path if the
      xa_load() fails.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-12-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Reduce locking in implicit_mr_get_data() · 3389baa8
      Jason Gunthorpe authored
      
      
      Now that the child MRs are stored in an xarray we can rely on the SRCU
      lock to protect the xa_load and use xa_cmpxchg on the slow allocation path
      to resolve races with concurrent page fault.
      
      This reduces the scope of the critical section of umem_mutex for implicit
      MRs to only cover mlx5_ib_update_xlt, and avoids taking a lock at all if
      the child MR is already in the xarray. This makes it consistent with the
      normal ODP MR critical section for umem_lock, and the locking approach
      used for destroying an unused implicit child MR.
      
      The MLX5_IB_UPD_XLT_ATOMIC is no longer needed in implicit_get_child_mr()
      since it is no longer called with any locks.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-11-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Use an xarray for the children of an implicit ODP · 423f52d6
      Jason Gunthorpe authored
      
      
      Currently the child leaves are stored in the shared interval tree and
      every lookup for a child must be done under the interval tree rwsem.
      
      This is further complicated by dropping the rwsem during iteration (ie the
      odp_lookup(), odp_next() pattern), which requires a very tricky and
      difficult-to-understand locking scheme with SRCU.
      
      Instead reserve the interval tree for the exclusive use of the mmu
      notifier related code in umem_odp.c and give each implicit MR an xarray
      containing all the child MRs.
      
      Since the size of each child is 1GB of VA, a 1 level xarray will index 64G
      of VA, and a 2 level will index 2TB, making xarray a much better
      data structure choice than an interval tree.
      
      The locking properties of xarray will be used in the next patches to
      rework the implicit ODP locking scheme into something simpler.
      
      At this point, the xarray is locked by the implicit MR's umem_mutex, and
      read can also be locked by the odp_srcu.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-10-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Split implicit handling from pagefault_mr · 54375e73
      Jason Gunthorpe authored
      
      
      The single routine has a very confusing scheme to advance to the next
      child MR when working on an implicit parent. This scheme can only be used
      when working with an implicit parent and must not be triggered when
      working on a normal MR.
      
      Re-arrange things by directly putting all the single-MR stuff into one
      function and calling it in a loop for the implicit case. Simplify some of
      the error handling in the new pagefault_real_mr() to remove unneeded gotos.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-9-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Set the HW IOVA of the child MRs to their place in the tree · 9162420d
      Jason Gunthorpe authored
      
      
      Instead of rewriting all the IOVAs to 0 as things progress down the tree,
      make the IOVA of the children equal to their placement in the tree. This makes
      things easier to understand by keeping mmkey.iova == HW configuration.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-8-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Lift implicit_mr_alloc() into the two routines that call it · c2edcd69
      Jason Gunthorpe authored
      
      
      This makes the routines easier to understand, particularly with respect to
      the locking requirements of the entire sequence. The implicit_mr_alloc()
      had a lot of ifs specializing it to each of the callers, and only a very
      small amount of code was actually shared.
      
      Following patches will cause the flow in the two functions to diverge
      further.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-7-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Rework implicit_mr_get_data · 3d5f3c54
      Jason Gunthorpe authored
      
      
      This function is intended to loop across each MTT chunk in the implicit
      parent that intersects the range [io_virt, io_virt+bcnt). But it has a
      confusing construction, so:
      
      - Consistently use imr and odp_imr to refer to the implicit parent
        to avoid confusion with the normal mr and odp of the child
      - Directly compute the inclusive start/end indexes by shifting. This is
        clearer to understand the intent and avoids any errors from unaligned
        values of addr
      - Iterate directly over the range of MTT indexes, do not make a loop
        out of goto
      - Follow 'success oriented flow', with goto error unwind
      - Directly calculate the range of idx's that need update_xlt
      - Ensure that any leaf MR added to the interval tree always results in an
        update to the XLT
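
The "compute the inclusive start/end indexes by shifting" step can be sketched as below. The 1GB chunk size follows the 1GB-per-child layout mentioned elsewhere in this series; the shift constant's name is invented for illustration.

```python
IMR_MTT_SHIFT = 30  # assumed: each implicit child MR covers 1GB of VA

def child_idx_range(io_virt, bcnt):
    """Inclusive [start, end] child indexes intersecting
    [io_virt, io_virt + bcnt); correct even for unaligned io_virt."""
    start = io_virt >> IMR_MTT_SHIFT
    end = (io_virt + bcnt - 1) >> IMR_MTT_SHIFT
    return start, end
```

For example, an 8KB access that straddles the first 1GB boundary covers indexes 0 and 1, which a truncating division of the unshifted end address would miss.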
      
      Link: https://lore.kernel.org/r/20191009160934.3143-6-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Delete struct mlx5_priv->mkey_table · 74bddb36
      Jason Gunthorpe authored
      
      
      No users are left, delete it.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-5-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Use a dedicated mkey xarray for ODP · 806b101b
      Jason Gunthorpe authored
      
      
      There is a per device xarray storing mkeys that is used to store every
      mkey in the system. However, this xarray is now only read by ODP for
      certain ODP designated MRs (ODP, implicit ODP, MW, DEVX_INDIRECT).
      
      Create an xarray only for use by ODP, that only contains ODP related
      MKeys. This xarray is protected by SRCU and all erases are protected by a
      synchronize.
      
      This improves performance:
      
       - All MRs in the odp_mkeys xarray are ODP MRs, so some tests for is_odp()
         can be deleted. The xarray will also consume fewer nodes.
      
       - Normal MRs are never mixed with ODP MRs in an SRCU data structure, so
         the performance-sucking synchronize_srcu() on every MR destruction is
         not needed.
      
       - No smp_load_acquire(live) and xa_load() double barrier on read
      
      Due to the SRCU locking scheme care must be taken with the placement of
      the xa_store(). Once it completes the MR is immediately visible to other
      threads and only through a xa_erase() & synchronize_srcu() cycle could it
      be destroyed.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-4-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Split sig_err MR data into its own xarray · 50211ec9
      Jason Gunthorpe authored
      
      
      The locking model for signature is completely different than ODP, do not
      share the same xarray that relies on SRCU locking to support ODP.
      
      Simply store the active mlx5_core_sig_ctx's in an xarray when signature
      MRs are created and rely on trivial xarray locking to serialize
      everything.
      
      The overhead of storing only a handful of SIG related MRs is going to be
      much less than an xarray full of every mkey.
      
      Link: https://lore.kernel.org/r/20191009160934.3143-3-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/mlx5: Use SRCU properly in ODP prefetch · fb985e27
      Jason Gunthorpe authored
      
      
      When working with SRCU protected xarrays the xarray itself should be the
      SRCU 'update' point. Instead prefetch is using live as the SRCU update
      point and this prevents switching the locking design to use the xarray
      instead.
      
      To solve this the prefetch must only read from the xarray once, and hold
      on to the actual MR pointer for the duration of the async
      operation. Incrementing num_pending_prefetch delays destruction of the MR,
      so it is suitable.
      
      Prefetch calls directly to the pagefault_mr using the MR pointer and only
      does a single xarray lookup.
      
      All the testing if a MR is prefetchable or not is now done only in the
      prefetch code and removed from the pagefault critical path.
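
The single-lookup-then-pin pattern can be sketched as below. MR, start_prefetch, and finish_prefetch are invented illustration names; the real code pins via num_pending_prefetch and then calls pagefault_mr() with the held pointer.

```python
class MR:
    def __init__(self, prefetchable=True):
        self.prefetchable = prefetchable
        self.num_pending_prefetch = 0  # nonzero delays destruction of the MR

def start_prefetch(odp_mkeys, key):
    mr = odp_mkeys.get(key)  # the single lookup, under the read-side lock
    if mr is None or not mr.prefetchable:
        return None          # all prefetchability tests happen here, once
    mr.num_pending_prefetch += 1
    return mr                # the async work uses this pointer directly

def finish_prefetch(mr):
    mr.num_pending_prefetch -= 1  # async work done; MR may now be destroyed

def demo():
    mkeys = {1: MR(), 2: MR(prefetchable=False)}
    mr = start_prefetch(mkeys, 1)
    pinned = mr.num_pending_prefetch
    finish_prefetch(mr)
    return (pinned, mr.num_pending_prefetch,
            start_prefetch(mkeys, 2), start_prefetch(mkeys, 3))
```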
      
      Link: https://lore.kernel.org/r/20191009160934.3143-2-jgg@ziepe.ca
      Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  6. Oct 28, 2019
  7. Oct 27, 2019
    • Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 153a971f
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "Two fixes for the VMware guest support:
      
         - Unbreak VMware platform detection which got wrecked by converting
           an integer constant to a string constant.
      
         - Fix the clang build of the VMware hypercall by explicitly
           specifying the output register for INL instead of using the short
           form"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/cpu/vmware: Fix platform detection VMWARE_PORT macro
        x86/cpu/vmware: Use the full form of INL in VMWARE_HYPERCALL, for clang/llvm
    • Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2b776b54
      Linus Torvalds authored
      Pull timer fixes from Thomas Gleixner:
       "A small set of fixes for time(keeping):
      
         - Add a missing include to prevent compiler warnings.
      
         - Make the VDSO implementation of clock_getres() POSIX compliant
           again. A recent change dropped the NULL pointer guard which is
           required as NULL is a valid pointer value for this function.
      
         - Fix two function documentation typos"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        posix-cpu-timers: Fix two trivial comments
        timers/sched_clock: Include local timekeeping.h for missing declarations
        lib/vdso: Make clock_getres() POSIX compliant again
    • Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a8a31fdc
      Linus Torvalds authored
      Pull perf fixes from Thomas Gleixner:
       "A set of perf fixes:
      
        kernel:
      
         - Unbreak the tracking of auxiliary buffer allocations which got
           imbalanced, causing resource limit failures.
      
         - Fix the fallout of splitting of ToPA entries which failed to shift
           the base entry PA correctly.
      
         - Use the correct context to lookup the AUX event when unmapping the
           associated AUX buffer so the event can be stopped and the buffer
           reference dropped.
      
        tools:
      
         - Fix buildid-cache mode setting in copyfile_mode_ns() when copying
           /proc/kcore
      
         - Fix freeing id arrays in the event list so the correct event is
           closed.
      
   - Sync sched.h and kvm.h headers with the kernel sources.
      
         - Link jvmti against tools/lib/ctype.o to have weak strlcpy().
      
   - Fix multiple memory and file descriptor leaks, found by Coverity,
     in 'perf annotate'.
      
   - Fix leaks in error handling paths in 'perf c2c' and 'perf kmem',
     found by a static analysis tool"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/aux: Fix AUX output stopping
        perf/aux: Fix tracking of auxiliary trace buffer allocation
        perf/x86/intel/pt: Fix base for single entry topa
        perf kmem: Fix memory leak in compact_gfp_flags()
        tools headers UAPI: Sync sched.h with the kernel
        tools headers kvm: Sync kvm.h headers with the kernel sources
        tools headers kvm: Sync kvm headers with the kernel sources
        tools headers kvm: Sync kvm headers with the kernel sources
        perf c2c: Fix memory leak in build_cl_output()
        perf tools: Fix mode setting in copyfile_mode_ns()
        perf annotate: Fix multiple memory and file descriptor leaks
        perf tools: Fix resource leak of closedir() on the error paths
        perf evlist: Fix fix for freed id arrays
        perf jvmti: Link against tools/lib/ctype.h to have weak strlcpy()
      a8a31fdc
    • Linus Torvalds's avatar
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1e1ac1cb
      Linus Torvalds authored
      Pull irq fixes from Thomas Gleixner:
       "Two fixes for interrupt controller drivers:
      
   - Skip IRQ_M_EXT entries in the device tree when initializing the
     RISC-V PLIC controller to avoid a double init attempt.
      
   - Use the correct ITS list when issuing the VMOVP synchronization
     command so the operation works only on the ITS instances associated
     with the VM"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/sifive-plic: Skip contexts except supervisor in plic_init()
        irqchip/gic-v3-its: Use the exact ITSList for VMOVP
      1e1ac1cb
    • Linus Torvalds's avatar
      Merge tag '5.4-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 · c9a2e4a8
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
       "Seven cifs/smb3 fixes, including three for stable"
      
      * tag '5.4-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs
        CIFS: Fix use after free of file info structures
        CIFS: Fix retry mid list corruption on reconnects
        cifs: Fix missed free operations
        CIFS: avoid using MID 0xFFFF
        cifs: clarify comment about timestamp granularity for old servers
        cifs: Handle -EINPROGRESS only when noblockcnt is set
      c9a2e4a8
    • Linus Torvalds's avatar
      Merge tag 'riscv/for-v5.4-rc5-b' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 6995a6a5
      Linus Torvalds authored
      Pull RISC-V fixes from Paul Walmsley:
       "Several minor fixes and cleanups for v5.4-rc5:
      
         - Three build fixes for various SPARSEMEM-related kernel
           configurations
      
         - Two cleanup patches for the kernel bug and breakpoint trap handler
           code"
      
      * tag 'riscv/for-v5.4-rc5-b' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: cleanup do_trap_break
        riscv: cleanup <asm/bug.h>
        riscv: Fix undefined reference to vmemmap_populate_basepages
        riscv: Fix implicit declaration of 'page_to_section'
        riscv: fix fs/proc/kcore.c compilation with sparsemem enabled
      6995a6a5
    • Linus Torvalds's avatar
      Merge tag 'mips_fixes_5.4_3' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · 5a1e843c
      Linus Torvalds authored
      Pull MIPS fixes from Paul Burton:
       "A few MIPS fixes:
      
   - Fix VDSO time-related functions on systems that need to fall back
     to syscalls; they were instead returning bogus results.
      
         - A fix to TLB exception handlers for Cavium Octeon systems where
           they would inadvertently clobber the $1/$at register.
      
         - A build fix for bcm63xx configurations.
      
         - Switch to using my @kernel.org email address"
      
      * tag 'mips_fixes_5.4_3' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: tlbex: Fix build_restore_pagemask KScratch restore
        MIPS: bmips: mark exception vectors as char arrays
        mips: vdso: Fix __arch_get_hw_counter()
        MAINTAINERS: Use @kernel.org address for Paul Burton
      5a1e843c
    • Linus Torvalds's avatar
      Merge tag 'tty-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 29768954
      Linus Torvalds authored
      Pull tty/serial driver fix from Greg KH:
       "Here is a single tty/serial driver fix for 5.4-rc5 that resolves a
        reported issue.
      
        It has been in linux-next for a while with no problems"
      
      * tag 'tty-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        8250-men-mcb: fix error checking when get_num_ports returns -ENODEV
      29768954
    • Linus Torvalds's avatar
      Merge tag 'staging-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 228bd624
      Linus Torvalds authored
      Pull staging driver fix from Greg KH:
       "Here is a single staging driver fix, for the wlan-ng driver, that
        resolves a reported issue.
      
It has been in linux-next for a while with no reported issues"
      
      * tag 'staging-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: wlan-ng: fix exit return when sme->key_idx >= NUM_WEPKEYS
      228bd624
    • Linus Torvalds's avatar
      Merge tag 'driver-core-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core · 13fa692e
      Linus Torvalds authored
      Pull driver core fix from Greg KH:
       "Here is a single sysfs fix for 5.4-rc5.
      
It resolves an error if you actually try to use the __BIN_ATTR_WO()
macro; it seems I never tested it properly before :(
      
        This has been in linux-next for a while with no reported issues"
      
      * tag 'driver-core-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
        sysfs: Fixes __BIN_ATTR_WO() macro
      13fa692e
    • Linus Torvalds's avatar
      Merge tag 'char-misc-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · a03885d5
      Linus Torvalds authored
      Pull binder fix from Greg KH:
"This is a single binder fix to resolve an issue reported by Jann.
It's been in linux-next for a while with no reported issues"
      
      * tag 'char-misc-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        binder: Don't modify VMA bounds in ->mmap handler
      a03885d5