  1. Jun 03, 2020
    • mm: use memalloc_nofs_save in readahead path · f2c817be
      Matthew Wilcox (Oracle) authored
      
      
      Ensure that memory allocations in the readahead path do not attempt to
      reclaim file-backed pages, which could lead to a deadlock.  It is
      possible, though unlikely, that this is the root cause of a problem
      observed by Cong Wang.
      
      Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Mikl...
      f2c817be
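
      The scoped behaviour this commit relies on can be sketched in user space. This is a toy model, not the kernel implementation: `toy_task_flags` stands in for the task's flags word, and every `toy_` name is hypothetical. The point it illustrates is that save returns the previous state, so nested scopes restore correctly.

```c
/* Hypothetical stand-ins for the per-task flags word and the NOFS bit. */
#define TOY_PF_MEMALLOC_NOFS 0x1u
static unsigned int toy_task_flags;

/* Enter a no-FS-reclaim scope; returns the previous state of the flag. */
static unsigned int toy_memalloc_nofs_save(void)
{
	unsigned int old = toy_task_flags & TOY_PF_MEMALLOC_NOFS;

	toy_task_flags |= TOY_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave the scope, restoring the state the matching save observed. */
static void toy_memalloc_nofs_restore(unsigned int old)
{
	toy_task_flags = (toy_task_flags & ~TOY_PF_MEMALLOC_NOFS) | old;
}

/* An allocator would check this before recursing into filesystem reclaim. */
static int toy_may_enter_fs_reclaim(void)
{
	return !(toy_task_flags & TOY_PF_MEMALLOC_NOFS);
}

/*
 * Model of a readahead pass: allocations made between save and restore
 * see the NOFS flag. Returns what an allocation inside the scope would
 * have been told.
 */
static int toy_readahead(void)
{
	unsigned int nofs = toy_memalloc_nofs_save();
	int allowed_inside = toy_may_enter_fs_reclaim();

	toy_memalloc_nofs_restore(nofs);
	return allowed_inside;
}
```

      Calling `toy_readahead()` from a context that is already inside a NOFS scope leaves the flag set on return, because restore puts back the saved state rather than clearing the bit unconditionally.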
    • mm: document why we don't set PageReadahead · 2d8163e4
      Matthew Wilcox (Oracle) authored
      
      
      If the page is already in cache, we don't set PageReadahead on it.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-15-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d8163e4
    • mm: add page_cache_readahead_unbounded · 2c684234
      Matthew Wilcox (Oracle) authored
      
      
      ext4 and f2fs have duplicated the guts of the readahead code so they can
      read past i_size.  Instead, separate out the guts of the readahead code
      so they can call it directly.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-14-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c684234
    • mm: move end_index check out of readahead loop · b0f31d78
      Matthew Wilcox (Oracle) authored
      
      
      By reducing nr_to_read, we can eliminate this check from inside the loop.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-13-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0f31d78
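
      The idea of reducing nr_to_read up front, so the loop needs no per-iteration end-of-file test, can be sketched as follows. This is a user-space illustration under assumptions: 4 KiB pages (`TOY_PAGE_SHIFT` of 12) and hypothetical `toy_` helper names; it is not the kernel function itself.

```c
/* Assumed page size for the sketch: 4 KiB. */
#define TOY_PAGE_SHIFT 12

/*
 * Return how many pages to read starting at 'index' so that the loop never
 * goes past the page holding the last byte of a file of 'isize' bytes.
 */
static unsigned long toy_clamp_nr_to_read(unsigned long index,
					  unsigned long nr_to_read,
					  unsigned long long isize)
{
	unsigned long end_index;

	if (isize == 0)
		return 0;
	end_index = (unsigned long)((isize - 1) >> TOY_PAGE_SHIFT);
	if (index > end_index)
		return 0;
	/* Don't read past the page containing the last byte of the file. */
	if (nr_to_read > end_index - index)
		nr_to_read = end_index - index + 1;
	return nr_to_read;
}

/* With the count clamped first, the loop body has no end_index check. */
static unsigned long toy_count_pages_visited(unsigned long index,
					     unsigned long nr_to_read,
					     unsigned long long isize)
{
	unsigned long visited = 0;
	unsigned long i;

	nr_to_read = toy_clamp_nr_to_read(index, nr_to_read, isize);
	for (i = 0; i < nr_to_read; i++)
		visited++;	/* a real loop would allocate page index + i */
	return visited;
}
```

      For a ten-page file, a request for 32 pages starting at index 8 is reduced to 2 pages before the loop starts.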
    • mm: add readahead address space operation · 8151b4c8
      Matthew Wilcox (Oracle) authored
      
      
      This replaces ->readpages with a saner interface:
       - Return void instead of an ignored error code.
       - Page cache is already populated with locked pages when ->readahead
         is called.
       - New arguments can be passed to the implementation without changing
         all the filesystems that use a common helper function like
         mpage_readahead().
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-12-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8151b4c8
    • mm: put readahead pages in cache earlier · c1f6925e
      Matthew Wilcox (Oracle) authored
      
      
      When populating the page cache for readahead, mappings that use
      ->readpages must populate the page cache themselves as the pages are
      passed on a linked list which would normally be used for the page
      cache's LRU.  For mappings that use ->readpage or the upcoming
      ->readahead method, we can put the pages into the page cache as soon as
      they're allocated, which solves a race between readahead and direct IO.
      It also lets us remove the gfp argument from read_pages().
      
      Use the new readahead_page() API to implement the repeated calls to
      ->readpage(), just like most filesystems will.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-11-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1f6925e
    • mm: remove 'page_offset' from readahead loop · ef8153b6
      Matthew Wilcox (Oracle) authored
      
      
      Replace the page_offset variable with 'index + i'.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-10-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef8153b6
    • mm: rename readahead loop variable to 'i' · c2c7ad74
      Matthew Wilcox (Oracle) authored
      
      
      Change the type of page_idx to unsigned long, and rename it -- it's just
      a loop counter, not a page index.
      
      Suggested-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-9-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2c7ad74
    • mm: rename various 'offset' parameters to 'index' · 08eb9658
      Matthew Wilcox (Oracle) authored
      
      
      The word 'offset' is used ambiguously to mean 'byte offset within a
      page', 'byte offset from the start of the file' and 'page offset from
      the start of the file'.
      
      Use 'index' to mean 'page offset from the start of the file' throughout
      the readahead code.
      
      [ We should probably rename the 'pgoff_t' type to 'pgidx_t' too - Linus ]
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-8-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08eb9658
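
      The three meanings of 'offset' that this commit disentangles can be made concrete with a small sketch. It assumes 4 KiB pages, and the `toy_` helper names are hypothetical, chosen only for this illustration.

```c
/* Assumed page size for the sketch: 4 KiB. */
#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE (1UL << TOY_PAGE_SHIFT)

/* 'index': page offset from the start of the file. */
static unsigned long toy_pos_to_index(unsigned long long pos)
{
	return (unsigned long)(pos >> TOY_PAGE_SHIFT);
}

/* Byte offset within the page that contains file position 'pos'. */
static unsigned long toy_offset_in_page(unsigned long long pos)
{
	return (unsigned long)(pos & (TOY_PAGE_SIZE - 1));
}

/* Byte offset from the start of the file of the first byte of page 'index'. */
static unsigned long long toy_index_to_pos(unsigned long index)
{
	return (unsigned long long)index << TOY_PAGE_SHIFT;
}
```

      For byte position 10000 in a file, the index is 2 and the byte offset within that page is 1808, since page 2 starts at byte 8192.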
    • mm: use readahead_control to pass arguments · a4d96536
      Matthew Wilcox (Oracle) authored
      
      
      In this patch, only between __do_page_cache_readahead() and
      read_pages(), but it will be extended in upcoming patches.  The
      read_pages() function becomes aops centric, as this makes the most sense
      by the end of the patchset.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-7-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4d96536
    • mm: add new readahead_control API · 042124cc
      Matthew Wilcox (Oracle) authored
      
      
      Filesystems which implement the upcoming ->readahead method will get
      their pages by calling readahead_page() or readahead_page_batch().
      These functions support large pages, even though none of the filesystems
      to be converted do yet.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-6-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      042124cc
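
      How a filesystem might walk its pages through a readahead_page()-style iterator can be sketched with a toy user-space model. Everything here is a simplifying assumption: the `toy_` structures are hypothetical stand-ins, the "page cache" is a fixed array, and a page's `nr` field models a large page spanning several base pages. It only illustrates that the pages are already in the cache and are looked up by index rather than passed on a list.

```c
#include <stddef.h>

/* Hypothetical stand-in for a page; 'nr' is how many base pages it spans. */
struct toy_page {
	unsigned long index;
	unsigned int nr;
};

#define TOY_CACHE_SLOTS 16

/* Toy page cache: slot i holds the page whose first base page is index i. */
struct toy_mapping {
	struct toy_page *slots[TOY_CACHE_SLOTS];
};

/* Simplified model of the readahead cursor handed to the filesystem. */
struct toy_readahead_control {
	struct toy_mapping *mapping;
	unsigned long index;		/* index of the next page to hand out */
	unsigned int nr_pages;		/* base pages left in the request */
	unsigned int batch_count;	/* base pages handed out by the last call */
};

/* Return the next page of the request, or NULL once it is exhausted. */
static struct toy_page *toy_readahead_page(struct toy_readahead_control *rac)
{
	struct toy_page *page;

	/* Account for whatever the previous call handed out. */
	rac->nr_pages -= rac->batch_count;
	rac->index += rac->batch_count;

	if (rac->nr_pages == 0) {
		rac->batch_count = 0;
		return NULL;
	}

	/* The page was put in the cache before the filesystem was called. */
	page = rac->mapping->slots[rac->index];
	rac->batch_count = page->nr;	/* a large page advances by more than one */
	return page;
}

/*
 * Shape of a filesystem readahead method in this model. It returns the
 * number of base pages consumed only so the sketch can be checked.
 */
static unsigned int toy_fs_readahead(struct toy_readahead_control *rac)
{
	struct toy_page *page;
	unsigned int submitted = 0;

	while ((page = toy_readahead_page(rac)) != NULL)
		submitted += page->nr;	/* a real filesystem would start I/O here */
	return submitted;
}
```

      The model assumes the request fits inside the toy array and that every index in the request is populated; a real implementation has to handle far more than that.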
    • mm: move readahead nr_pages check into read_pages · ad4ae1c7
      Matthew Wilcox (Oracle) authored
      
      
      Simplify the callers by moving the check for nr_pages and the BUG_ON
      into read_pages().
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-5-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad4ae1c7
    • mm: ignore return value of ->readpages · a1ef8566
      Matthew Wilcox (Oracle) authored
      
      
      We used to assign the return value to a variable, which we then ignored.
      Remove the pretence of caring.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-4-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1ef8566
    • mm: return void from various readahead functions · 9a42823a
      Matthew Wilcox (Oracle) authored
      
      
      ondemand_readahead has two callers, neither of which uses the return
      value.  That means that both ra_submit and __do_page_cache_readahead()
      can return void, and we don't need to worry that a present page in the
      readahead window causes us to return a smaller nr_pages than we ought to
      have.
      
      Similarly, no caller uses the return value from
      force_page_cache_readahead().
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: ...
      9a42823a
    • mm: move readahead prototypes from mm.h · cee9a0c4
      Matthew Wilcox (Oracle) authored
      
      
      Patch series "Change readahead API", v11.
      
      This series adds a readahead address_space operation to replace the
      readpages operation.  The key difference is that pages are added to the
      page cache as they are allocated (and then looked up by the filesystem)
      instead of passing them on a list to the readpages operation and having
      the filesystem add them to the page cache.  It's a net reduction in code
      for each implementation, more efficient than walking a list, and solves
      the direct-write vs buffered-read problem reported by yu kuai at
      http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com
      
      The only unconverted filesystems are those which use fscache.  Their
      conversion is pending Dave Howells' rewrite which will make the
      conversion substantially easier.  This should be completed by the end of
      the year.
      
      I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric
      Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and
      Miklos Szeredi have done a marvellous job of providing constructive
      criticism.
      
      These patches pass an xfstests run on ext4, xfs & btrfs with no
      regressions that I can tell (some of the tests seem a little flaky
      before and remain flaky afterwards).
      
      This patch (of 25):
      
      The readahead code is part of the page cache so should be found in the
      pagemap.h file.  force_page_cache_readahead is only used within mm, so
      move it to mm/internal.h instead.  Remove the parameter names where they
      add no value, and rename the ones which were actively misleading.
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Chao Yu <yuchao0@huawei.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Gao Xiang <gaoxiang25@huawei.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org
      Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cee9a0c4
    • mm, dump_page(): do not crash with invalid mapping pointer · 002ae705
      Vlastimil Babka authored
      
      
      We have seen the following problem on a RPi4 with 1G RAM:
      
          BUG: Bad page state in process systemd-hwdb  pfn:35601
          page:ffff7e0000d58040 refcount:15 mapcount:131221 mapping:efd8fe765bc80080 index:0x1 compound_mapcount: -32767
          Unable to handle kernel paging request at virtual address efd8fe765bc80080
          Mem abort info:
            ESR = 0x96000004
            Exception class = DABT (current EL), IL = 32 bits
            SET = 0, FnV = 0
            EA = 0, S1PTW = 0
          Data abort info:
            ISV = 0, ISS = 0x00000004
            CM = 0, WnR = 0
          [efd8fe765bc80080] address between user and kernel address ranges
          Internal error: Oops: 96000004 [#1] SMP
          Modules linked in: btrfs libcrc32c xor xor_neon zlib_deflate raid6_pq mmc_block xhci_pci xhci_hcd usbcore sdhci_iproc sdhci_pltfm sdhci mmc_core clk_raspberrypi gpio_raspberrypi_exp pcie_brcmstb bcm2835_dma gpio_regulator phy_generic fixed sg scsi_mod efivarfs
          Supported: No, Unreleased kernel
          CPU: 3 PID: 408 Comm: systemd-hwdb Not tainted 5.3.18-8-default #1 SLE15-SP2 (unreleased)
          Hardware name: raspberrypi rpi/rpi, BIOS 2020.01 02/21/2020
          pstate: 40000085 (nZcv daIf -PAN -UAO)
          pc : __dump_page+0x268/0x368
          lr : __dump_page+0xc4/0x368
          sp : ffff000012563860
          x29: ffff000012563860 x28: ffff80003ddc4300
          x27: 0000000000000010 x26: 000000000000003f
          x25: ffff7e0000d58040 x24: 000000000000000f
          x23: efd8fe765bc80080 x22: 0000000000020095
          x21: efd8fe765bc80080 x20: ffff000010ede8b0
          x19: ffff7e0000d58040 x18: ffffffffffffffff
          x17: 0000000000000001 x16: 0000000000000007
          x15: ffff000011689708 x14: 3030386362353637
          x13: 6566386466653a67 x12: 6e697070616d2031
          x11: 32323133313a746e x10: 756f6370616d2035
          x9 : ffff00001168a840 x8 : ffff00001077a670
          x7 : 000000000000013d x6 : ffff0000118a43b5
          x5 : 0000000000000001 x4 : ffff80003dd9e2c8
          x3 : ffff80003dd9e2c8 x2 : 911c8d7c2f483500
          x1 : dead000000000100 x0 : efd8fe765bc80080
          Call trace:
           __dump_page+0x268/0x368
           bad_page+0xd4/0x168
           check_new_page_bad+0x80/0xb8
           rmqueue_bulk.constprop.26+0x4d8/0x788
           get_page_from_freelist+0x4d4/0x1228
           __alloc_pages_nodemask+0x134/0xe48
           alloc_pages_vma+0x198/0x1c0
           do_anonymous_page+0x1a4/0x4d8
           __handle_mm_fault+0x4e8/0x560
           handle_mm_fault+0x104/0x1e0
           do_page_fault+0x1e8/0x4c0
           do_translation_fault+0xb0/0xc0
           do_mem_abort+0x50/0xb0
           el0_da+0x24/0x28
          Code: f9401025 8b8018a0 9a851005 17ffffca (f94002a0)
      
      Besides the underlying issue of page->mapping containing a bogus value
      for some reason, we can see that __dump_page() crashed by trying to read
      the pointer at mapping->host, turning a recoverable warning into a full
      Oops.
      
      When a page is reported as being in a bad state for some reason, the
      pointers in it should not be trusted blindly.
      
      So this patch treats all data in __dump_page() that depends on
      page->mapping as lava, using probe_kernel_read_strict().  Ideally this
      would include the dentry->d_parent recursively, but that would mean
      changing the printk handler for %pd.  Chances of reaching the dentry
      printing part with an initially bogus mapping pointer should be rather
      low, though.
      
      Also prefix the printing of mapping->a_ops with a description of what is
      being printed.  If the value is bogus, %ps will print the raw value
      instead of the symbol name, and then it is not at all obvious that it is
      printing a_ops.
      
      Reported-by: Petr Tesarik <ptesarik@suse.cz>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Link: http://lkml.kernel.org/r/20200331165454.12263-1-vbabka@suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      002ae705
    • Documentation/vm/slub.rst: s/Toggle/Enable/ · a3df6927
      Andrew Morton authored
      
      
      "toggle" means to change a boolean thing's state.  This operation
      doesn't do that - it sets it to "true".
      
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3df6927
    • mm/slub: fix stack overruns with SLUB_STATS · a68ee057
      Qian Cai authored
      There is no need to copy SLUB_STATS items from root memcg cache to new
      memcg cache copies.  Doing so could result in stack overruns because the
      store function only accepts 0 to clear the stat and returns an error for
      everything else while the show method would print out the whole stat.
      
      The mismatch between the lengths returned by the show and store methods
      then shows up in memcg_propagate_slab_attrs():
      
      	else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf))
      		buf = mbuf;
      
      max_attr_size is only 2 from slab_attr_store(), so mbuf[64] is later
      used in show_stat(), where a bunch of sprintf() calls would overrun the
      stack variable.  Fix it by always allocating a page of buffer to be used
      in show_stat() if SLUB_STATS=y, which should only be used for debugging
      purposes.
      
        # echo 1 > /sys/kernel/slab/fs_cache/shrink
        BUG: KASAN: stack-out-of-bounds in number+0x421/0x6e0
        Write of size 1 at addr ffffc900256cfde0 by task kworker/76:0/53251
      
        Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
        Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
        Call Trace:
          number+0x421/0x6e0
          vsnprintf+0x451/0x8e0
          sprintf+0x9e/0xd0
          show_stat+0x124/0x1d0
          alloc_slowpath_show+0x13/0x20
          __kmem_cache_create+0x47a/0x6b0
      
        addr ffffc900256cfde0 is located in stack of task kworker/76:0/53251 at offset 0 in frame:
         process_one_work+0x0/0xb90
      
        this frame has 1 object:
         [32, 72) 'lockdep_map'
      
        Memory state around the buggy address:
         ffffc900256cfc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         ffffc900256cfd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        >ffffc900256cfd80: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
                                                               ^
         ffffc900256cfe00: 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00 00
         ffffc900256cfe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ==================================================================
        Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __kmem_cache_create+0x6ac/0x6b0
        Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
        Call Trace:
          __kmem_cache_create+0x6ac/0x6b0
      
      Fixes: 107dab5c ("slub: slub-specific propagation changes")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Glauber Costa <glauber@scylladb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200429222356.4322-1-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: remove kmalloc under list_lock from list_slab_objects() V2 · aa456c7a
      Christopher Lameter authored
      list_slab_objects() is called when a slab is destroyed while objects
      are still left in it, in order to list those objects in the syslog.
      This is a pretty rare event.
      
      And there it seems we take the list_lock and call kmalloc while holding
      that lock.
      
      Perform the allocation in free_partial() before the list_lock is taken.
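
The pattern described here, allocating scratch memory before entering the locked region, can be sketched in userspace as follows. This is an illustration under stated assumptions, not the slub code: struct slab_list, count_in_use() and the use of a pthread mutex in place of list_lock are all hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a per-node slab list protected by a lock. */
struct slab_list {
	pthread_mutex_t lock;
	size_t nr_objects;
	const unsigned char *in_use;	/* one flag per object */
};

/*
 * Count in-use objects using a private snapshot of the flags.
 * The scratch buffer is allocated before the lock is taken, so the
 * allocator is never entered while the lock is held.
 * Returns the count, or -1 if the allocation failed.
 */
long count_in_use(struct slab_list *list)
{
	unsigned char *map;
	long count = 0;
	size_t i;

	/* Allocate first, outside the critical section. */
	map = calloc(list->nr_objects ? list->nr_objects : 1, 1);
	if (!map)
		return -1;

	pthread_mutex_lock(&list->lock);
	if (list->nr_objects)
		memcpy(map, list->in_use, list->nr_objects);
	pthread_mutex_unlock(&list->lock);

	for (i = 0; i < list->nr_objects; i++)
		if (map[i])
			count++;

	free(map);
	return count;
}
```

Keeping the allocation outside the critical section also shortens the time the lock is held.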
      
      Fixes: bbd7d57b ("slub: Potential stack overflow")
      Signed-off-by: Christopher Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2002031721250.1668@www.lameter.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: Remove userspace notifier for cache add/remove · d7660ce5
      Christoph Lameter authored
      
      
      I came across some unnecessary uevents once again, which reminded me
      of this.  The patch seems to be lost in the leaves of the original
      discussion [1], so resending.
      
      [1] https://lore.kernel.org/r/alpine.DEB.2.21.2001281813130.745@www.lameter.com
      
      Kmem caches are internal kernel structures so it is strange that
      userspace notifiers would be needed.  And I am not aware of any use of
      these notifiers.  These notifiers may just exist because in the initial
      slub release the sysfs code was copied from another subsystem.
      
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200423115721.19821-1-mkoutny@suse.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub.c: fix corrupted freechain in deactivate_slab() · 52f23478
      Dongli Zhang authored
      
      
      slub_debug is able to fix a corrupted slab freelist/page.  However,
      alloc_debug_processing() only checks the validity of the current and
      next freepointers in the allocation path.  As a result, once some
      objects have their freepointers corrupted, deactivate_slab() may cause
      a page fault.

      The trace below is from a test kernel module run with
      'slub_debug=PUF,kmalloc-128 slub_nomerge'.  The test corrupts the
      freepointer of one free object on purpose.  Unfortunately,
      deactivate_slab() does not detect it when iterating the freechain.
      
        BUG: unable to handle page fault for address: 00000000123456f8
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        ... ...
        RIP: 0010:deactivate_slab.isra.92+0xed/0x490
        ... ...
        Call Trace:
         ___slab_alloc+0x536/0x570
         __slab_alloc+0x17/0x30
         __kmalloc+0x1d9/0x200
         ext4_htree_store_dirent+0x30/0xf0
         htree_dirblock_to_tree+0xcb/0x1c0
         ext4_htree_fill_tree+0x1bc/0x2d0
         ext4_readdir+0x54f/0x920
         iterate_dir+0x88/0x190
         __x64_sys_getdents+0xa6/0x140
         do_syscall_64+0x49/0x170
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Therefore, this patch adds an extra consistency check in
      deactivate_slab().  Once an object's freepointer is found to be
      corrupted, all following objects starting at this object are isolated.
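
The idea of validating each freepointer while walking the chain, and cutting the chain at the first corrupted object, can be sketched in userspace as below. This is a simplified model and not the mm/slub.c code: the slab layout, the constants and the names valid_object() and sanitize_freelist() are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OBJ_SIZE 32
#define NR_OBJS  8

/* Hypothetical slab: NR_OBJS objects, free pointer stored at offset 0. */
struct slab {
	_Alignas(void *) unsigned char mem[NR_OBJS * OBJ_SIZE];
};

/* A pointer is trusted only if it hits an object boundary in the slab. */
static int valid_object(const struct slab *s, const void *p)
{
	uintptr_t base = (uintptr_t)s->mem;
	uintptr_t addr = (uintptr_t)p;

	if (addr < base || addr >= base + sizeof(s->mem))
		return 0;
	return (addr - base) % OBJ_SIZE == 0;
}

static void *get_next(void *obj)
{
	void *next;

	memcpy(&next, obj, sizeof(next));
	return next;
}

static void set_next(void *obj, void *next)
{
	memcpy(obj, &next, sizeof(next));
}

/*
 * Walk the free chain starting at *head.  If an object carries an
 * invalid free pointer, drop that object and everything after it.
 * Returns the number of objects that remain on the chain.
 */
size_t sanitize_freelist(struct slab *s, void **head)
{
	void *prev = NULL;
	void *obj = *head;
	size_t kept = 0;

	if (obj && !valid_object(s, obj)) {
		*head = NULL;
		return 0;
	}

	/* The bound also stops the walk if the chain loops. */
	while (obj && kept < NR_OBJS) {
		void *next = get_next(obj);

		if (next && !valid_object(s, next)) {
			/* obj is corrupted: isolate it and its followers. */
			if (prev)
				set_next(prev, NULL);
			else
				*head = NULL;
			return kept;
		}
		prev = obj;
		obj = next;
		kept++;
	}
	return kept;
}
```

The objects cut off this way are leaked rather than reused, trading some memory for not dereferencing an untrusted pointer.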
      
      [akpm@linux-foundation.org: fix build with CONFIG_SLAB_DEBUG=n]
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joe Jin <joe.jin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200331031450.12182-1-dongli.zhang@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • usercopy: mark dma-kmalloc caches as usercopy caches · 49f2d241
      Vlastimil Babka authored
      
      
      We have seen a "usercopy: Kernel memory overwrite attempt detected to
      SLUB object 'dma-kmalloc-1 k' (offset 0, size 11)!" error on s390x, as
      IUCV uses kmalloc() with __GFP_DMA because of memory address
      restrictions.  The issue has been discussed [2] and it has been noted
      that if all the kmalloc caches are marked as usercopy, there's little
      reason not to mark dma-kmalloc caches too.  The 'dma' part merely means
      that __GFP_DMA is used to restrict memory address range.
      
      As Jann Horn put it [3]:
       "I think dma-kmalloc slabs should be handled the same way as normal
        kmalloc slabs. When a dma-kmalloc allocation is freshly created, it is
        just normal kernel memory - even if it might later be used for DMA -,
        and it should be perfectly fine to copy_from_user() into such
        allocations at that point, and to copy_to_user() out of them at the
        end. If you look at the places where such allocations are created, you
        can see things like kmemdup(), memcpy() and so on - all normal
        operations that shouldn't conceptually be different from usercopy in
        any relevant way."
      
      Thus this patch marks the dma-kmalloc-* caches as usercopy.
      
      [1] https://bugzilla.suse.com/show_bug.cgi?id=1156053
      [2] https://lore.kernel.org/kernel-hardening/bfca96db-bbd0-d958-7732-76e36c667c68@suse.cz/
      [3] https://lore.kernel.org/kernel-hardening/CAG48ez1a4waGk9kB0WLaSbs4muSoK0AYAVk8=XYaKj4_+6e6Hg@mail.gmail.com/
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Julian Wiedmann <jwi@linux.ibm.com>
      Cc: Ursula Braun <ubraun@linux.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: David Windsor <dave@nullcore.net>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Luis de Bethencourt <luisbg@kernel.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Matthew Garrett <mjg59@google.com>
      Cc: Michal Kubecek <mkubecek@suse.cz>
      Link: http://lkml.kernel.org/r/7d810f6d-8085-ea2f-7805-47ba3842dc50@suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/buffer.c: record blockdev write errors in super_block that it backs · 485e9605
      Jeff Layton authored
      
      
      When syncing out a block device (à la __sync_blockdev), any error
      encountered will only be recorded in the bd_inode's mapping.  When the
      blockdev contains a filesystem, however, we'd like to also record the
      error in the super_block that's stored there.
      
      Make mark_buffer_write_io_error also record the error in the
      corresponding super_block when a writeback error occurs and the block
      device contains a mounted superblock.
      
      Since superblocks are RCU freed, hold the rcu_read_lock to ensure that
      the superblock doesn't go away while we're marking it.
      
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andres Freund <andres@anarazel.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Link: http://lkml.kernel.org/r/20200428135155.19223-3-jlayton@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: track per-sb writeback errors and report them to syncfs · 735e4ae5
      Jeff Layton authored
      
      
      Patch series "vfs: have syncfs() return error when there are writeback
      errors", v6.
      
      Currently, syncfs does not return errors when one of the inodes fails to
      be written back.  It will return errors based on the legacy AS_EIO and
      AS_ENOSPC flags when syncing out the block device fails, but that's not
      particularly helpful for filesystems that aren't backed by a blockdev.
      It's also possible for a stray sync to lose those errors.
      
      The basic idea in this set is to track writeback errors at the
      superblock level, so that we can quickly and easily check whether
      something bad happened without having to fsync each file individually.
      syncfs is then changed to reliably report writeback errors after they
      occur, much in the same fashion as fsync does now.
      
      This patch (of 2):
      
      Usually we suggest that applications call fsync when they want to ensure
      that all data written to the file has made it to the backing store, but
      that can be inefficient when there are a lot of open files.
      
      Calling syncfs on the filesystem can be more efficient in some
      situations, but the error reporting doesn't currently work the way most
      people expect.  If a single inode on a filesystem reports a writeback
      error, syncfs won't necessarily return an error.  syncfs only returns an
      error if __sync_blockdev fails, and on some filesystems that's a no-op.
      
      It would be better if syncfs reported an error if there were any
      writeback failures.  Then applications could call syncfs to see if there
      are any errors on any open files, and could then call fsync on all of
      the other descriptors to figure out which one failed.
      
      This patch adds a new errseq_t to struct super_block, and has
      mapping_set_error also record writeback errors there.
      
      To report those errors, we also need to keep an errseq_t in struct file
      to act as a cursor.  This patch adds a dedicated field for that purpose,
      which slots nicely into 4 bytes of padding at the end of struct file on
      x86_64.
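
The cursor idea can be modelled in a few lines of userspace C. This is a deliberately simplified model, not the kernel's errseq_t (which packs the error and a counter into one 32-bit value). All structure and function names below are hypothetical.

```c
#include <errno.h>

/* Simplified model: last error plus a counter bumped on every error. */
struct sb_model {
	int err;		/* last writeback error, positive errno */
	unsigned int seq;	/* incremented each time an error is recorded */
};

/* Each open file remembers the counter value it has already seen. */
struct file_model {
	struct sb_model *sb;
	unsigned int cursor;
};

/* Called where a writeback error would be recorded for the superblock. */
void sb_record_error(struct sb_model *sb, int err)
{
	sb->err = err;
	sb->seq++;
}

/* Opening a file samples the current position, so old errors are skipped. */
void file_open_model(struct file_model *file, struct sb_model *sb)
{
	file->sb = sb;
	file->cursor = sb->seq;
}

/*
 * Model of the reporting step: return the error once per file if any
 * error was recorded since this file last looked, then advance the cursor.
 */
int syncfs_model(struct file_model *file)
{
	struct sb_model *sb = file->sb;

	if (file->cursor == sb->seq)
		return 0;

	file->cursor = sb->seq;
	return -sb->err;
}
```

Because every open file keeps its own cursor, one caller seeing the error does not hide it from another caller holding a different file description.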
      
      An earlier version of this patch used an O_PATH file descriptor to cue
      the kernel that the open file should track the superblock error and not
      the inode's writeback error.
      
      I think that API is just too weird though.  This is simpler and should
      make syncfs error reporting "just work" even if someone is multiplexing
      fsync and syncfs on the same fds.
      
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Andres Freund <andres@anarazel.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
      Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch/parisc/include/asm/pgtable.h: remove unused `old_pte' · 78128fab
      Andrew Morton authored
      
      
      parisc's set_pte_at() macro has a set-but-not-used variable:
      
        include/linux/pgtable.h: In function 'pte_clear_not_present_full':
        arch/parisc/include/asm/pgtable.h:96:9: warning: variable 'old_pte' set but not used [-Wunused-but-set-variable]
      
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: mount shared volume without ha stack · 912f655d
      Gang He authored
      
      
      Usually we create and use an ocfs2 shared volume on top of an HA
      stack.  For a pcmk-based HA stack, this includes the DLM, corosync and
      pacemaker services.

      Customers have complained that they cannot mount an existing ocfs2
      volume on a single node without the HA stack, e.g. in a single-node
      backup/restore scenario.

      In this case, the customers just want to access the data on the
      existing ocfs2 volume quickly, and do not want to restart or set up the
      HA stack.

      So add a mount option "nocluster".  If users mount an ocfs2 shared
      volume with this option, the mount does not depend on the HA-related
      services.  The command mounts the existing ocfs2 volume directly (like
      a local mount), which avoids setting up the HA stack.
      
      Signed-off-by: Gang He <ghe@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Jun Piao <piaojun@huawei.com>
      Link: http://lkml.kernel.org/r/20200423053300.22661-1-ghe@suse.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add missing annotation for dlm_empty_lockres() · 8f745e62
      Jules Irenge authored
      
      
      Sparse reports a warning at dlm_empty_lockres()
      
        warning: context imbalance in dlm_purge_lockres() - unexpected unlock
      
      The root cause is the missing annotation at dlm_purge_lockres()
      
      Add the missing __must_hold(&dlm->spinlock)
      
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Link: http://lkml.kernel.org/r/20200403160505.2832-4-jbi.octave@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • squashfs: migrate from ll_rw_block usage to BIO · 93e72b3c
      Philippe Liard authored
      
      
      The ll_rw_block() function has been deprecated in favor of BIO, which
      appears to come with large performance improvements.
      
      This patch decreases boot time by close to 40% when using squashfs for
      the root file-system.  This is observed at least in the context of
      starting an Android VM on Chrome OS using crosvm.  The patch was tested
      on 4.19 as well as master.
      
      This patch is largely based on Adrien Schildknecht's patch that was
      originally sent as https://lkml.org/lkml/2017/9/22/814 though with some
      significant changes and simplifications while also taking Phillip
      Lougher's feedback into account, around preserving support for
      FILE_CACHE in particular.
      
      [akpm@linux-foundation.org: fix build error reported by Randy]
        Link: http://lkml.kernel.org/r/319997c2-5fc8-f889-2ea3-d913308a7c1f@infradead.org
      Signed-off-by: Philippe Liard <pliard@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Adrien Schildknecht <adrien+dev@schischi.me>
      Cc: Phillip Lougher <phillip@squashfs.org.uk>
      Cc: Guenter Roeck <groeck@chromium.org>
      Cc: Daniel Rosenberg <drosen@google.com>
      Link: https://chromium.googlesource.com/chromiumos/platform/crosvm
      Link: http://lkml.kernel.org/r/20191106074238.186023-1-pliard@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Jun 02, 2020
    • Merge tag 'x86_cache_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9bf9511e
      Linus Torvalds authored
      Pull x86 cache resource control updates from Borislav Petkov:
       "Add support for wider Memory Bandwidth Monitoring counters by querying
        their width from CPUID.
      
        As a prerequisite for that, streamline and unify the CPUID detection of
        the respective resource control attributes.
      
        By Reinette Chatre"
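
Why the counter width matters can be shown with a small userspace sketch: the difference between two raw reads has to be computed modulo the counter width, otherwise a wrapped counter yields a bogus value. The function name counter_delta() is hypothetical and the code is only an illustration of width-aware arithmetic, not a copy of the resctrl implementation.

```c
#include <stdint.h>

/*
 * Number of units counted between two raw reads of a hardware counter
 * that is `width` bits wide (1..64) and wrapped at most once in between.
 * Shifting both values to the top of a 64-bit word lets the natural
 * unsigned wraparound do the modulo-2^width arithmetic.
 */
uint64_t counter_delta(uint64_t prev, uint64_t cur, unsigned int width)
{
	unsigned int shift = 64 - width;
	uint64_t delta = (cur << shift) - (prev << shift);

	return delta >> shift;
}
```

The same pair of raw values gives different results for a 24-bit and a 32-bit counter, so assuming the wrong width misreports the bandwidth after a wrap.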
      
      * tag 'x86_cache_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/resctrl: Support wider MBM counters
        x86/resctrl: Support CPUID enumeration of MBM counter width
        x86/resctrl: Maintain MBM counter width per resource
        x86/resctrl: Query LLC monitoring properties once during boot
        x86/resctrl: Remove unnecessary RMID checks
        x86/cpu: Move resctrl CPUID code to resctrl/
        x86/resctrl: Rename asm/resctrl_sched.h to asm/resctrl.h
    • Merge tag 'x86_microcode_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ef34ba6d
      Linus Torvalds authored
      Pull x86 microcode update from Borislav Petkov:
       "A single fix for late microcode loading to handle the correct return
        value from stop_machine(), from Mihai Carabas"
      
      * tag 'x86_microcode_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/microcode: Fix return value for microcode late loading
    • Merge tag 'edac_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras · 8b11dd54
      Linus Torvalds authored
      Pull EDAC updates from Borislav Petkov:
      
       - Fix i10nm_edac loading on some Ice Lake and Tremont/Jacobsville
         steppings due to the offset change of the bus number configuration
         register, by Qiuxu Zhuo.
      
       - The usual cleanups and fixes all over the place.
      
      * tag 'edac_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
        EDAC/amd64: Remove redundant assignment to variable ret in hw_info_get()
        EDAC/skx: Use the mcmtr register to retrieve close_pg/bank_xor_enable
        EDAC/i10nm: Update driver to support different bus number config register offsets
        EDAC, {skx,i10nm}: Make some configurations CPU model specific
        EDAC/amd8131: Remove defined but not used bridge_str
        EDAC/thunderx: Make symbols static
        MAINTAINERS: Remove sifive_l2_cache.c from EDAC-SIFIVE pattern
        EDAC/xgene: Remove set but not used address local var
        EDAC/armada_xp: Fix some log messages
    • Merge tag 'printk-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux · ca1f5df2
      Linus Torvalds authored
      Pull printk updates from Petr Mladek:
      
       - Benjamin Herrenschmidt solved a problem with non-matched console
         aliases by first checking consoles defined on the command line. It is
         a more conservative approach than the previous attempts.
      
       - Benjamin also made sure that the console accessible via /dev/console
         always has CON_CONSDEV flag.
      
       - Andy Shevchenko added the %ptT modifier for printing time64_t. It
         extends the existing %ptR handling for struct rtc_time.
      
       - Bruno Meneguele fixed /dev/kmsg error value returned by unsupported
         SEEK_CUR.
      
       - Tetsuo Handa removed unused pr_cont_once().
      
      ... and a few small fixes.
      
      * tag 'printk-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
        printk: Remove pr_cont_once()
        printk: handle blank console arguments passed in.
        kernel/printk: add kmsg SEEK_CUR handling
        printk: Fix a typo in comment "interator"->"iterator"
        usb: pulse8-cec: Switch to use %ptT
        ARM: bcm2835: Switch to use %ptT
        lib/vsprintf: Print time64_t in human readable format
        lib/vsprintf: update comment about simple_strto<foo>() functions
        printk: Correctly set CON_CONSDEV even when preferred console was not registered
        printk: Fix preferred console selection with multiple matches
        printk: Move console matching logic into a separate function
        printk: Convert a use of sprintf to snprintf in console_unlock
    • Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt · 4d67829e
      Linus Torvalds authored
      Pull fsverity updates from Eric Biggers:
       "Fix kerneldoc warnings and some coding style inconsistencies.
      
        This mirrors the similar cleanups being done in fs/crypto/"
      
      * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
        fs-verity: remove unnecessary extern keywords
        fs-verity: fix all kerneldoc warnings
    • Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt · afdb0f2e
      Linus Torvalds authored
      Pull fscrypt updates from Eric Biggers:
      
       - Add the IV_INO_LBLK_32 encryption policy flag which modifies the
         encryption to be optimized for eMMC inline encryption hardware.
      
       - Make the test_dummy_encryption mount option for ext4 and f2fs support
         v2 encryption policies.
      
       - Fix kerneldoc warnings and some coding style inconsistencies.
      
      * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
        fscrypt: add support for IV_INO_LBLK_32 policies
        fscrypt: make test_dummy_encryption use v2 by default
        fscrypt: support test_dummy_encryption=v2
        fscrypt: add fscrypt_add_test_dummy_key()
        linux/parser.h: add include guards
        fscrypt: remove unnecessary extern keywords
        fscrypt: name all function parameters
        fscrypt: fix all kerneldoc warnings
    • Merge tag 'pstore-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 829f3b94
      Linus Torvalds authored
      Pull pstore updates from Kees Cook:
       "Fixes and new features for pstore.
      
        This is a pretty big set of changes (relative to past pstore pulls),
        but it has been in -next for a while. The biggest change here is the
        ability to support a block device as a pstore backend, which has been
        desired for a while. A lot of additional fixes and refactorings are
        also included, mostly in support of the new features.
      
         - refactor pstore locking for safer module unloading (Kees Cook)
      
         - remove orphaned records from pstorefs when backend unloaded (Kees
           Cook)
      
         - refactor dump_oops parameter into max_reason (Pavel Tatashin)
      
         - introduce pstore/zone for common code for contiguous storage
           (WeiXiong Liao)
      
         - introduce pstore/blk for block device backend (WeiXiong Liao)
      
         - introduce mtd backend (WeiXiong Liao)"
      
      * tag 'pstore-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (35 commits)
        mtd: Support kmsg dumper based on pstore/blk
        pstore/blk: Introduce "best_effort" mode
        pstore/blk: Support non-block storage devices
        pstore/blk: Provide way to query pstore configuration
        pstore/zone: Provide way to skip "broken" zone for MTD devices
        Documentation: Add details for pstore/blk
        pstore/zone,blk: Add ftrace frontend support
        pstore/zone,blk: Add console frontend support
        pstore/zone,blk: Add support for pmsg frontend
        pstore/blk: Introduce backend for block devices
        pstore/zone: Introduce common layer to manage storage zones
        ramoops: Add "max-reason" optional field to ramoops DT node
        pstore/ram: Introduce max_reason and convert dump_oops
        pstore/platform: Pass max_reason to kmesg dump
        printk: Introduce kmsg_dump_reason_str()
        printk: honor the max_reason field in kmsg_dumper
        printk: Collapse shutdown types into a single dump reason
        pstore/ftrace: Provide ftrace log merging routine
        pstore/ram: Refactor ftrace buffer merging
        pstore/ram: Refactor DT size parsing
        ...
    • Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · 81e8c10d
      Linus Torvalds authored
      Pull crypto updates from Herbert Xu:
       "API:
         - Introduce crypto_shash_tfm_digest() and use it wherever possible.
         - Fix use-after-free and race in crypto_spawn_alg.
         - Add support for parallel and batch requests to crypto_engine.
      
        Algorithms:
         - Update jitter RNG for SP800-90B compliance.
         - Always use jitter RNG as seed in drbg.
      
        Drivers:
         - Add Arm CryptoCell driver cctrng.
         - Add support for SEV-ES to the PSP driver in ccp"
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits)
        crypto: hisilicon - fix driver compatibility issue with different versions of devices
        crypto: engine - do not requeue in case of fatal error
        crypto: cavium/nitrox - Fix a typo in a comment
        crypto: hisilicon/qm - change debugfs file name from qm_regs to regs
        crypto: hisilicon/qm - add DebugFS for xQC and xQE dump
        crypto: hisilicon/zip - add debugfs for Hisilicon ZIP
        crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE
        crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC
        crypto: hisilicon/qm - add debugfs to the QM state machine
        crypto: hisilicon/qm - add debugfs for QM
        crypto: stm32/crc32 - protect from concurrent accesses
        crypto: stm32/crc32 - don't sleep in runtime pm
        crypto: stm32/crc32 - fix multi-instance
        crypto: stm32/crc32 - fix run-time self test issue.
        crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
        crypto: hisilicon/zip - Use temporary sqe when doing work
        crypto: hisilicon - add device error report through abnormal irq
        crypto: hisilicon - remove codes of directly report device errors through MSI
        crypto: hisilicon - QM memory management optimization
        crypto: hisilicon - unify initial value assignment into QM
        ...
    • Merge tag 'i3c/for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux · 729ea4e0
      Linus Torvalds authored
      Pull i3c update from Boris Brezillon:
       "Fix GETMRL's logic"
      
      * tag 'i3c/for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
        i3c master: GETMRL's 3rd byte is optional even with BCR_IBI_PAYLOAD
    • Merge tag 'regulator-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator · d30fc97c
      Linus Torvalds authored
      Pull regulator updates from Mark Brown:
       "The big change in this release is that Matti Vaittinen has factored
        out the linear ranges support into a separate library in lib/ since it
        is also useful for at least the power subsystem (and most likely
        others too), it helps subsystems which need to map register values
        into more useful real world values do so with minimal per-driver code.
      
         - Factoring out of the linear ranges support into a library in lib/
           from Matti Vaittinen.
      
         - Trace points for bypass mode.
      
         - Use the consumer name in debugfs to make it easier to understand.
      
         - New drivers for Maxim MAX77826 and MAX8998"
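
Mapping register values to real-world values through a linear range can be sketched as follows. This is a userspace illustration only. The structure layout, the names range_value() and range_selector(), and the example voltages are assumptions, not the API of the new library.

```c
#include <stdbool.h>

/* Hypothetical linear range: selectors min_sel..max_sel, evenly spaced. */
struct lin_range {
	unsigned int min;	/* value at min_sel */
	unsigned int min_sel;	/* lowest register selector */
	unsigned int max_sel;	/* highest register selector */
	unsigned int step;	/* value increase per selector step */
};

/* Selector -> value.  Returns false if the selector is out of range. */
bool range_value(const struct lin_range *r, unsigned int sel,
		 unsigned int *val)
{
	if (sel < r->min_sel || sel > r->max_sel)
		return false;

	*val = r->min + (sel - r->min_sel) * r->step;
	return true;
}

/*
 * Value -> highest selector whose value does not exceed val.
 * Returns false if val lies outside the range.
 */
bool range_selector(const struct lin_range *r, unsigned int val,
		    unsigned int *sel)
{
	unsigned int max = r->min + (r->max_sel - r->min_sel) * r->step;

	if (val < r->min || val > max)
		return false;

	if (r->step == 0)
		*sel = r->min_sel;
	else
		*sel = r->min_sel + (val - r->min) / r->step;
	return true;
}
```

A driver then only has to describe its ranges as data, for example 0.80 V with 50 mV steps over selectors 0..15, instead of open-coding the arithmetic.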
      
      * tag 'regulator-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (23 commits)
        regulator: max8998: max8998_set_current_limit() can be static
        dt-bindings: regulator: Convert anatop regulator to json-schema
        regulator: core: Add regulator bypass trace points
        regulator: extract voltage balancing code to the separate function
        regulator/mfd: max8998: Document charger regulator
        regulator: max8998: Add charger regulator
        MAINTAINERS: Add maintainer entry for linear ranges helper
        regulator: bd718x7: remove voltage change restriction from BD71847 LDOs
        lib: linear_ranges: Add missing MODULE_LICENSE()
        regulator: use linear_ranges helper
        power: supply: bd70528: rename linear_range to avoid collision
        lib/test_linear_ranges: add a test for the 'linear_ranges'
        lib: add linear ranges helpers
        regulator: db8500-prcmu: Use true,false for bool variable
        regulator: bd718x7: remove voltage change restriction from BD71847
        regulator: max77826: Remove erroneous additionalProperties
        regulator: qcom-rpmh: Fix typos in pm8150 and pm8150l
        regulator: Document bindings for max77826
        regulator: max77826: Add max77826 regulator driver
        regulator: tps80031: remove redundant assignment to variables ret and val
        ...
    • Merge tag 'spi-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · a36de5eb
      Linus Torvalds authored
      Pull spi updates from Mark Brown:
       "This has been a very active release for the DesignWare driver in
        particular - after a long period of inactivity we have had a lot of
        people actively working on it for unrelated reasons this cycle with
        some of that work still not landed.
      
        Otherwise it's been fairly quiet for the subsystem.
      
        Highlights include:
      
         - Lots of performance improvements and fixes for the DesignWare
           driver from Serge Semin, Andy Shevchenko, Wan Ahmad Zainie, Clement
           Leger, Dinh Nguyen and Jarkko Nikula.
      
         - Support for octal mode transfers in spidev.
      
         - Slave mode support for the Rockchip drivers.
      
         - Support for AMD controllers, Broadcom mspi and Raspberry Pi 4, and
           Intel Elkhart Lake"
      
      * tag 'spi-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (125 commits)
        spi: spi-fsl-dspi: fix native data copy
        spi: Convert DW SPI binding to DT schema
        spi: dw: Refactor mid_spi_dma_setup() to separate DMA and IRQ config
        spi: dw: Make DMA request line assignments explicit for Intel Medfield
        spi: bcm2835: Remove shared interrupt support
        dt-bindings: snps,dw-apb-ssi: add optional reset property
        spi: dw: add reset control
        spi: bcm2835: Enable shared interrupt support
        spi: bcm2835: Implement shutdown callback
        spi: dw: Use regset32 DebugFS method to create regdump file
        spi: dw: Add DMA support to the DW SPI MMIO driver
        spi: dw: Cleanup generic DW DMA code namings
        spi: dw: Add DW SPI DMA/PCI/MMIO dependency on the DW SPI core
        spi: dw: Remove DW DMA code dependency from DW_DMAC_PCI
        spi: dw: Move Non-DMA code to the DW PCIe-SPI driver
        spi: dw: Add core suffix to the DW APB SSI core source file
        spi: dw: Fix Rx-only DMA transfers
        spi: dw: Use DMA max burst to set the request thresholds
        spi: dw: Parameterize the DMA Rx/Tx burst length
        spi: dw: Add SPI Rx-done wait method to DMA-based transfer
        ...
    • Merge tag 'regmap-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap · 213fd09e
      Linus Torvalds authored
      Pull regmap updates from Mark Brown:
       "This has been a very active release for the regmap API for some
        reason, a lot of it due to new devices with odd requirements that can
        sensibly be handled here.
      
         - Add support for buses implementing a custom reg_update_bits()
           method in case the bus has a native operation for this.
      
         - Support 16 bit register addresses in SMBus.
      
         - Allow customization of the device attached to regmap-irq.
      
         - Helpers for bitfield operations and per-port field initializations"
      
      * tag 'regmap-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
        regmap: provide helpers for simple bit operations
        regmap: add helper for per-port regfield initialization
        regmap-i2c: add 16-bit width registers support
        regmap: Simplify implementation of the regmap_field_read_poll_timeout() macro
        regmap: Simplify implementation of the regmap_read_poll_timeout() macro
        regmap: add reg_sequence helpers
        regmap-irq: make it possible to add irq_chip do a specific device node
        regmap: Add bus reg_update_bits() support
        regmap: debugfs: check count when read regmap file