Skip to content
  1. Mar 30, 2022
  2. Mar 28, 2022
  3. Mar 26, 2022
  4. Mar 25, 2022
  5. Mar 24, 2022
    • NeilBrown's avatar
      SUNRPC: avoid race between mod_timer() and del_timer_sync() · 3848e96e
      NeilBrown authored
      
      
      xprt_destory() claims XPRT_LOCKED and then calls del_timer_sync().
      Both xprt_unlock_connect() and xprt_release() call
       ->release_xprt()
      which drops XPRT_LOCKED and *then* xprt_schedule_autodisconnect()
      which calls mod_timer().
      
      This may result in mod_timer() being called *after* del_timer_sync().
      When this happens, the timer may fire long after the xprt has been freed,
      and run_timer_softirq() will probably crash.
      
      The pairing of ->release_xprt() and xprt_schedule_autodisconnect() is
      always called under ->transport_lock.  So if we take ->transport_lock to
      call del_timer_sync(), we can be sure that mod_timer() will run first
      (if it runs at all).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      3848e96e
  6. Mar 23, 2022
  7. Mar 22, 2022
  8. Mar 21, 2022
  9. Mar 14, 2022
    • NeilBrown's avatar
      SUNRPC: change locking for xs_swap_enable/disable · 693486d5
      NeilBrown authored
      
      
      It is not in general safe to wait for XPRT_LOCKED to clear.
      A wakeup is only sent when
       - connection completes
       - sock close completes
      so during normal operations, this can wait indefinitely.
      
      The event we need to protect against is ->inet being set to NULL, and
      that happens under the recv_mutex lock.
      
      So drop the handlign of XPRT_LOCKED and use recv_mutex instead.
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      693486d5
    • NeilBrown's avatar
      NFS: swap-out must always use STABLE writes. · c265de25
      NeilBrown authored
      
      
      The commit handling code is not safe against memory-pressure deadlocks
      when writing to swap.  In particular, nfs_commitdata_alloc() blocks
      indefinitely waiting for memory, and this can consume all available
      workqueue threads.
      
      swap-out most likely uses STABLE writes anyway as COND_STABLE indicates
      that a stable write should be used if the write fits in a single
      request, and it normally does.  However if we ever swap with a small
      wsize, or gather unusually large numbers of pages for a single write,
      this might change.
      
      For safety, make it explicit in the code that direct writes used for swap
      must always use FLUSH_STABLE.
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      c265de25
    • NeilBrown's avatar
      NFS: swap IO handling is slightly different for O_DIRECT IO · 64158668
      NeilBrown authored
      
      
      1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
         possible deadlocks with "fs_reclaim".  These deadlocks could, I believe,
         eventuate if a buffered read on the swapfile was attempted.
      
         We don't need coherence with the page cache for a swap file, and
         buffered writes are forbidden anyway.  There is no other need for
         i_rwsem during direct IO.  So never take it for swap_rw()
      
      2/ generic_write_checks() explicitly forbids writes to swap, and
         performs checks that are not needed for swap.  So bypass it
         for swap_rw().
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      64158668
    • NeilBrown's avatar
      NFSv4: keep state manager thread active if swap is enabled · 4dc73c67
      NeilBrown authored
      
      
      If we are swapping over NFSv4, we may not be able to allocate memory to
      start the state-manager thread at the time when we need it.
      So keep it always running when swap is enabled, and just signal it to
      start.
      
      This requires updating and testing the cl_swapper count on the root
      rpc_clnt after following all ->cl_parent links.
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      4dc73c67
    • NeilBrown's avatar
      SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC · 8db55a03
      NeilBrown authored
      
      
      rpc tasks can be marked as RPC_TASK_SWAPPER.  This causes GFP_MEMALLOC
      to be used for some allocations.  This is needed in some cases, but not
      in all where it is currently provided, and in some where it isn't
      provided.
      
      Currently *all* tasks associated with a rpc_client on which swap is
      enabled get the flag and hence some GFP_MEMALLOC support.
      
      GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it.
      However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does
      need it.
      
      xdr_alloc_bvec is called while the XPRT_LOCK is held.  If this blocks,
      then it blocks all other queued tasks.  So this allocation needs
      GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used
      for any swap writes.
      
      Similarly, if the transport is not connected, that will block all
      requests including swap writes, so memory allocations should get
      GFP_MEMALLOC if swap writes are possible.
      
      So with this patch:
       1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
       2/ __rpc_execute() sets PF_MEMALLOC while handling any task
          with RPC_TASK_SWAPPER set, or when handling any task that
          holds the XPRT_LOCKED lock on an xprt used for swap.
          This removes the need for the RPC_IS_SWAPPER() test
          in ->buf_alloc handlers.
       3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
          any task to a swapper xprt.  __rpc_execute() will clear it.
       3/ PF_MEMALLOC is set for all the connect workers.
      
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com> (for xprtrdma parts)
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      8db55a03
    • NeilBrown's avatar
      NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS · 89c2be8a
      NeilBrown authored
      
      
      NFS_RPC_SWAPFLAGS is only used for READ requests.
      It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
      requests.  This is not needed for swap READ - though it is for writes
      where it is set via a different mechanism.
      
      RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
      This is not needed as the root credential is saved when the swap file is
      opened, and this is used for all IO.
      
      So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
      RPC_TASK_ROOTCREDS, that isn't needed either.
      
      Remove both.
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      89c2be8a
    • NeilBrown's avatar
      SUNRPC: remove scheduling boost for "SWAPPER" tasks. · a80a8461
      NeilBrown authored
      
      
      Currently, tasks marked as "swapper" tasks get put to the front of
      non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
      the transport's ->xmit_queue.
      
      This is pointless as currently *all* tasks for a mount that has swap
      enabled on *any* file are marked as "swapper" tasks.  So the net result
      is that the non-priority rpc_queues are reverse-ordered (LIFO).
      
      This scheduling boost is not necessary to avoid deadlocks, and hurts
      fairness, so remove it.  If there were a need to expedite some requests,
      the tk_priority mechanism is a more appropriate tool.
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      a80a8461
    • NeilBrown's avatar
      SUNRPC/xprt: async tasks mustn't block waiting for memory · a7210354
      NeilBrown authored
      
      
      When memory is short, new worker threads cannot be created and we depend
      on the minimum one rpciod thread to be able to handle everything.  So it
      must not block waiting for memory.
      
      xprt_dynamic_alloc_slot can block indefinitely.  This can tie up all
      workqueue threads and NFS can deadlock.  So when called from a
      workqueue, set __GFP_NORETRY.
      
      The rdma alloc_slot already does not block.  However it sets the error
      to -EAGAIN suggesting this will trigger a sleep.  It does not.  As we
      can see in call_reserveresult(), only -ENOMEM causes a sleep.  -EAGAIN
      causes immediate retry.
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      a7210354
    • NeilBrown's avatar
      SUNRPC/auth: async tasks mustn't block waiting for memory · a41b05ed
      NeilBrown authored
      
      
      When memory is short, new worker threads cannot be created and we depend
      on the minimum one rpciod thread to be able to handle everything.  So it
      must not block waiting for memory.
      
      mempools are particularly a problem as memory can only be released back
      to the mempool by an async rpc task running.  If all available workqueue
      threads are waiting on the mempool, no thread is available to return
      anything.
      
      lookup_cred() can block on a mempool or kmalloc - and this can cause
      deadlocks.  So add a new RPCAUTH_LOOKUP flag for async lookups and don't
      block on memory.  If the -ENOMEM gets back to call_refreshresult(), wait
      a short while and try again.  HZ>>4 is chosen as it is used elsewhere
      for -ENOMEM retries.
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      a41b05ed
    • NeilBrown's avatar
      SUNRPC/call_alloc: async tasks mustn't block waiting for memory · c487216b
      NeilBrown authored
      
      
      When memory is short, new worker threads cannot be created and we depend
      on the minimum one rpciod thread to be able to handle everything.
      So it must not block waiting for memory.
      
      mempools are particularly a problem as memory can only be released back
      to the mempool by an async rpc task running.  If all available
      workqueue threads are waiting on the mempool, no thread is available to
      return anything.
      
      rpc_malloc() can block, and this might cause deadlocks.
      So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
      blocking is acceptable.
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      c487216b
    • NeilBrown's avatar
      NFS: remove IS_SWAPFILE hack · 944d95f7
      NeilBrown authored
      
      
      This code is pointless as IS_SWAPFILE is always defined.
      So remove it.
      
      Suggested-by: default avatarMark Hemment <markhemm@googlemail.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      944d95f7
    • Dave Wysochanski's avatar
      NFS: Remove remaining dfprintks related to fscache and remove NFSDBG_FSCACHE · b5fdf66f
      Dave Wysochanski authored
      
      
      The fscache cookie APIs including fscache_acquire_cookie() and
      fscache_relinquish_cookie() now have very good tracing.  Thus,
      there is no real need for dfprintks in the NFS fscache interface.
      
      The NFS fscache interface has removed all dfprintks so remove the
      NFSDBG_FSCACHE defines.
      
      Signed-off-by: default avatarDave Wysochanski <dwysocha@redhat.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      b5fdf66f
    • Dave Wysochanski's avatar
      NFS: Replace dfprintks with tracepoints in fscache read and write page functions · e3f0a7fe
      Dave Wysochanski authored
      
      
      Most of fscache and other NFS IO paths are now using tracepoints.
      Remove the dfprintks in the NFS fscache read/write page functions
      and replace with tracepoints at the begin and end of the functions.
      
      Signed-off-by: default avatarDave Wysochanski <dwysocha@redhat.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      e3f0a7fe
    • Dave Wysochanski's avatar
      NFS: Rename fscache read and write pages functions · fc1c5abf
      Dave Wysochanski authored
      
      
      Rename NFS fscache functions in a more consistent fashion
      to better reflect when we read from and write to fscache.
      
      Signed-off-by: default avatarDave Wysochanski <dwysocha@redhat.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      fc1c5abf