Skip to content
  1. Nov 01, 2023
    • Brian Foster's avatar
      ext4: fix racy may inline data check in dio write · ce56d213
      Brian Foster authored
      
      
      syzbot reports that the following warning from ext4_iomap_begin()
      triggers as of the commit referenced below:
      
              if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
                      return -ERANGE;
      
      This occurs during a dio write, which is never expected to encounter
      an inode with inline data. To enforce this behavior,
      ext4_dio_write_iter() checks the current inline state of the inode
      and clears the MAY_INLINE_DATA state flag to either fall back to
      buffered writes, or enforce that any other writers in progress on
      the inode are not allowed to create inline data.
      
      The problem is that the check for existing inline data and the state
      flag can span a lock cycle. For example, if the ilock is originally
      locked shared and subsequently upgraded to exclusive, another writer
      may have reacquired the lock and created inline data before the dio
      write task acquires the lock and proceeds.
      
      The commit referenced below loosens the lock requirements to allow
      some forms of unaligned dio writes to occur under shared lock, but
      AFAICT the inline data check was technically already racy for any
      dio write that would have involved a lock cycle. Regardless, lift
      clearing of the state bit to the same lock critical section that
      checks for preexisting inline data on the inode to close the race.
      
      Cc: stable@kernel.org
      Reported-by: default avatar <syzbot+307da6ca5cb0d01d581a@syzkaller.appspotmail.com>
      Fixes: 310ee090
      
       ("ext4: allow concurrent unaligned dio overwrites")
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Link: https://lore.kernel.org/r/20231002185020.531537-1-bfoster@redhat.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      ce56d213
  2. Oct 06, 2023
    • Kemeng Shi's avatar
      ext4: run mballoc test with different layouts setting · 28b95ee8
      Kemeng Shi authored
      
      
      Use KUNIT_CASE_PARAM to run mballoc test with different layouts setting.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-13-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      28b95ee8
    • Kemeng Shi's avatar
      ext4: add first unit test for ext4_mb_new_blocks_simple in mballoc · 7c9fa399
      Kemeng Shi authored
      
      
      Here are prepared work:
      1. Include mballoc-test.c to mballoc.c to be able test static function
      in mballoc.c.
      2. Implement static stub to avoid read IO to disk.
      3. Construct fake super_block. Only partial members are set, more members
      will be set when more functions are tested.
      Then unit test for ext4_mb_new_blocks_simple is added.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-12-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      7c9fa399
    • Kemeng Shi's avatar
      ext4: add some kunit stub for mballoc kunit test · bdefd689
      Kemeng Shi authored
      
      
      Multiblocks allocation will read and write block bitmap and group
      descriptor which reside on disk. Add kunit stub to function
      ext4_get_group_desc, ext4_read_block_bitmap_nowait, ext4_wait_block_bitmap
      and ext4_mb_mark_context to avoid real IO to disk.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-11-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      bdefd689
    • Kemeng Shi's avatar
      ext4: call ext4_mb_mark_context in ext4_group_add_blocks() · 5c657db4
      Kemeng Shi authored
      
      
      Call ext4_mb_mark_context in ext4_group_add_blocks() to remove repeat code.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-10-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      5c657db4
    • Kemeng Shi's avatar
      ext4: Separate block bitmap and buddy bitmap freeing in ext4_group_add_blocks() · 03c7fc39
      Kemeng Shi authored
      
      
      This patch separates block bitmap and buddy bitmap freeing in order to
      update block bitmap with ext4_mb_mark_context in following patch.
      The reason why this can be sperated is explained in previous submit.
      Put the explanation here to simplify the code archeology to
      ext4_group_add_blocks():
      
      Separated freeing is safe with concurrent allocation as long as:
      1. Firstly allocate block in buddy bitmap, and then in block bitmap.
      2. Firstly free block in block bitmap, and then buddy bitmap.
      Then freed block will only be available to allocation when both buddy
      bitmap and block bitmap are updated by freeing.
      Allocation obeys rule 1 already, just do sperated freeing with rule 2.
      
      Separated freeing has no race with generate_buddy as:
      Once ext4_mb_load_buddy_gfp is executed successfully, the update-to-date
      buddy page can be found in sbi->s_buddy_cache and no more buddy
      initialization of the buddy page will be executed concurrently until
      buddy page is unloaded. As we always do free in "load buddy, free,
      unload buddy" sequence, separated freeing has no race with generate_buddy.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-9-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      03c7fc39
    • Kemeng Shi's avatar
      ext4: call ext4_mb_mark_context in ext4_mb_clear_bb · 38b8f70c
      Kemeng Shi authored
      
      
      Call ext4_mb_mark_context in ext4_mb_clear_bb to remove repeat code.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-8-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      38b8f70c
    • Kemeng Shi's avatar
      ext4: Separate block bitmap and buddy bitmap freeing in ext4_mb_clear_bb() · 33e728c6
      Kemeng Shi authored
      
      
      This patch separates block bitmap and buddy bitmap freeing in order to
      update block bitmap with ext4_mb_mark_context in following patch.
      
      Separated freeing is safe with concurrent allocation as long as:
      1. Firstly allocate block in buddy bitmap, and then in block bitmap.
      2. Firstly free block in block bitmap, and then buddy bitmap.
      Then freed block will only be available to allocation when both buddy
      bitmap and block bitmap are updated by freeing.
      Allocation obeys rule 1 already, just do sperated freeing with rule 2.
      
      Separated freeing has no race with generate_buddy as:
      Once ext4_mb_load_buddy_gfp is executed successfully, the update-to-date
      buddy page can be found in sbi->s_buddy_cache and no more buddy
      initialization of the buddy page will be executed concurrently until
      buddy page is unloaded. As we always do free in "load buddy, free,
      unload buddy" sequence, separated freeing has no race with generate_buddy.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-7-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      33e728c6
    • Kemeng Shi's avatar
      ext4: call ext4_mb_mark_context in ext4_mb_mark_diskspace_used · 2f94711b
      Kemeng Shi authored
      
      
      Call ext4_mb_mark_context in ext4_mb_mark_diskspace_used to:
      1. Remove repeat code to normally update bitmap and group descriptor
      on disk.
      2. Now that we have a common API for marking blocks inuse/free in block
      bitmap, use that instead of open coding it in function
      ext4_mb_mark_diskspace_used(). The current code was not updating
      checksum and other counters. ext4_mb_mark_context() should fix these
      consistency problems.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-6-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      2f94711b
    • Kemeng Shi's avatar
      ext4: extend ext4_mb_mark_context to support allocation under journal · c431d386
      Kemeng Shi authored
      
      
      Previously, ext4_mb_mark_context is only called under fast commit
      replay path, so there is no valid handle when we update block bitmap
      and group descriptor. This patch try to extend ext4_mb_mark_context
      to be used by code under journal. There are several improvement:
      1. Add "handle_t *handle" to struct ext4_mark_context to journal block
      bitmap and group descriptor update inside ext4_mb_mark_context (the
      added journal code is based on ext4_mb_mark_diskspace_used where
      ext4_mb_mark_context is going to be used.)
      2. Adds a flag argument to ext4_mb_mark_context() which controls
      a. EXT4_MB_BITMAP_MARKED_CHECK - whether block bitmap checking is needed.
      b. EXT4_MB_SYNC_UPDATE - whether dirty buffers (bitmap and group
      descriptor) needs sync.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-5-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      c431d386
    • Kemeng Shi's avatar
      ext4: call ext4_mb_mark_context in ext4_free_blocks_simple · 26d0f87b
      Kemeng Shi authored
      
      
      call ext4_mb_mark_context in ext4_free_blocks_simple to:
      1. remove repeat code
      2. pair update of free_clusters in ext4_mb_new_blocks_simple.
      3. add missing ext4_lock_group/ext4_unlock_group protection.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-4-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      26d0f87b
    • Kemeng Shi's avatar
      ext4: factor out codes to update block bitmap and group descriptor on disk from ext4_mb_mark_bb · f9e2d95a
      Kemeng Shi authored
      
      
      There are several reasons to add a general function ext4_mb_mark_context
      to update block bitmap and group descriptor on disk:
      1. pair behavior of alloc/free bits. For example,
      ext4_mb_new_blocks_simple will update free_clusters in struct flex_groups
      in ext4_mb_mark_bb while ext4_free_blocks_simple forgets this.
      2. remove repeat code to read from disk, update and write back to disk.
      3. reduce future unit test mocks to catch real IO to update structure
      on disk.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-3-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      f9e2d95a
    • Kemeng Shi's avatar
      ext4: make state in ext4_mb_mark_bb to be bool · d2f7cf40
      Kemeng Shi authored
      
      
      As state could only be either 0 or 1, just make it bool.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20230928160407.142069-2-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      d2f7cf40
    • Zhihao Cheng's avatar
      jbd2: fix potential data lost in recovering journal raced with synchronizing fs bdev · 61187fce
      Zhihao Cheng authored
      
      
      JBD2 makes sure journal data is fallen on fs device by sync_blockdev(),
      however, other process could intercept the EIO information from bdev's
      mapping, which leads journal recovering successful even EIO occurs during
      data written back to fs device.
      
      We found this problem in our product, iscsi + multipath is chosen for block
      device of ext4. Unstable network may trigger kpartx to rescan partitions in
      device mapper layer. Detailed process is shown as following:
      
        mount          kpartx          irq
      jbd2_journal_recover
       do_one_pass
        memcpy(nbh->b_data, obh->b_data) // copy data to fs dev from journal
        mark_buffer_dirty // mark bh dirty
               vfs_read
      	  generic_file_read_iter // dio
      	   filemap_write_and_wait_range
      	    __filemap_fdatawrite_range
      	     do_writepages
      	      block_write_full_folio
      	       submit_bh_wbc
      	            >>  EIO occurs in disk  <<
      	                     end_buffer_async_write
      			      mark_buffer_write_io_error
      			       mapping_set_error
      			        set_bit(AS_EIO, &mapping->flags) // set!
      	    filemap_check_errors
      	     test_and_clear_bit(AS_EIO, &mapping->flags) // clear!
       err2 = sync_blockdev
        filemap_write_and_wait
         filemap_check_errors
          test_and_clear_bit(AS_EIO, &mapping->flags) // false
       err2 = 0
      
      Filesystem is mounted successfully even data from journal is failed written
      into disk, and ext4/ocfs2 could become corrupted.
      
      Fix it by comparing the wb_err state in fs block device before recovering
      and after recovering.
      
      A reproducer can be found in the kernel bugzilla referenced below.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=217888
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarZhihao Cheng <chengzhihao1@huawei.com>
      Signed-off-by: default avatarZhang Yi <yi.zhang@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20230919012525.1783108-1-chengzhihao1@huawei.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      61187fce
    • Max Kellermann's avatar
      ext4: apply umask if ACL support is disabled · 484fd6c1
      Max Kellermann authored
      
      
      The function ext4_init_acl() calls posix_acl_create() which is
      responsible for applying the umask.  But without
      CONFIG_EXT4_FS_POSIX_ACL, ext4_init_acl() is an empty inline function,
      and nobody applies the umask.
      
      This fixes a bug which causes the umask to be ignored with O_TMPFILE
      on ext4:
      
       https://github.com/MusicPlayerDaemon/MPD/issues/558
       https://bugs.gentoo.org/show_bug.cgi?id=686142#c3
       https://bugzilla.kernel.org/show_bug.cgi?id=203625
      
      Reviewed-by: default avatar"J. Bruce Fields" <bfields@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMax Kellermann <max.kellermann@ionos.com>
      Link: https://lore.kernel.org/r/20230919081824.1096619-1-max.kellermann@ionos.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      484fd6c1
    • Ojaswin Mujoo's avatar
      ext4: mark buffer new if it is unwritten to avoid stale data exposure · 2cd8bdb5
      Ojaswin Mujoo authored
      
      
      ** Short Version **
      
      In ext4 with dioread_nolock, we could have a scenario where the bh returned by
      get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
      UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
      never zero out the range of bh that is not under write, causing whatever stale
      data is present in the folio at that time to be written out to disk. To fix this
      mark the buffer as new, in case it is unwritten, in ext4_get_block_unwritten().
      
      ** Long Version **
      
      The issue mentioned above was resulting in two different bugs:
      
      1. On block size < page size case in ext4, generic/269 was reliably
      failing with dioread_nolock. The state of the write was as follows:
      
        * The write was extending i_size.
        * The last block of the file was fallocated and had an unwritten extent
        * We were near ENOSPC and hence we were switching to non-delayed alloc
          allocation.
      
      In this case, the back trace that triggers the bug is as follows:
      
        ext4_da_write_begin()
          /* switch to nodelalloc due to low space */
          ext4_write_begin()
            ext4_should_dioread_nolock() // true since mount flags still have delalloc
            __block_write_begin(..., ext4_get_block_unwritten)
              __block_write_begin_int()
                for(each buffer head in page) {
                  /* first iteration, this is bh1 which contains i_size */
                  if (!buffer_mapped)
                    get_block() /* returns bh with only UNWRITTEN and MAPPED */
                  /* second iteration, bh2 */
                    if (!buffer_mapped)
                      get_block() /* we fail here, could be ENOSPC */
                }
                if (err)
                  /*
                   * this would zero out all new buffers and mark them uptodate.
                   * Since bh1 was never marked new, we skip it here which causes
                   * the bug later.
                   */
                  folio_zero_new_buffers();
            /* ext4_wrte_begin() error handling */
            ext4_truncate_failed_write()
              ext4_truncate()
                ext4_block_truncate_page()
                  __ext4_block_zero_page_range()
                    if(!buffer_uptodate())
                      ext4_read_bh_lock()
                        ext4_read_bh() -> ... ext4_submit_bh_wbc()
                          BUG_ON(buffer_unwritten(bh)); /* !!! */
      
      2. The second issue is stale data exposure with page size >= blocksize
      with dioread_nolock. The conditions needed for it to happen are same as
      the previous issue ie dioread_nolock around ENOSPC condition. The issue
      is also similar where in __block_write_begin_int() when we call
      ext4_get_block_unwritten() on the buffer_head and the underlying extent
      is unwritten, we get an unwritten and mapped buffer head. Since it is
      not new, we never zero out the partial range which is not under write,
      thus writing stale data to disk. This can be easily observed with the
      following reproducer:
      
       fallocate -l 4k testfile
       xfs_io -c "pwrite 2k 2k" testfile
       # hexdump output will have stale data in from byte 0 to 2k in testfile
       hexdump -C testfile
      
      NOTE: To trigger this, we need dioread_nolock enabled and write happening via
      ext4_write_begin(), which is usually used when we have -o nodealloc. Since
      dioread_nolock is disabled with nodelalloc, the only alternate way to call
      ext4_write_begin() is to ensure that delayed alloc switches to nodelalloc ie
      ext4_da_write_begin() calls ext4_write_begin(). This will usually happen when
      ext4 is almost full like the way generic/269 was triggering it in Issue 1 above.
      This might make the issue harder to hit. Hence, for reliable replication, I used
      the below patch to temporarily allow dioread_nolock with nodelalloc and then
      mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale
      data issue 100% of times:
      
      @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
        if (ext4_should_journal_data(inode))
          return 0;
        /* temporary fix to prevent generic/422 test failures */
      - if (!test_opt(inode->i_sb, DELALLOC))
      -   return 0;
      + // if (!test_opt(inode->i_sb, DELALLOC))
      + //  return 0;
        return 1;
       }
      
      After applying this patch to mark buffer as NEW, both the above issues are
      fixed.
      
      Signed-off-by: default avatarOjaswin Mujoo <ojaswin@linux.ibm.com>
      Cc: stable@kernel.org
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatar"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/d0ed09d70a9733fbb5349c5c7b125caac186ecdf.1695033645.git.ojaswin@linux.ibm.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      2cd8bdb5
    • Gou Hao's avatar
      ext4: move 'ix' sanity check to corrent position · af90a8f4
      Gou Hao authored
      Check 'ix' before it is used.
      
      Fixes: 80e675f9
      
       ("ext4: optimize memmmove lengths in extent/index insertions")
      Signed-off-by: default avatarGou Hao <gouhao@uniontech.com>
      Link: https://lore.kernel.org/r/20230906013341.7199-1-gouhao@uniontech.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      af90a8f4
    • Ye Bin's avatar
      jbd2: fix printk format type for 'io_block' in do_one_pass() · 8b6b5621
      Ye Bin authored
      
      
      'io_block' is unsinged long but print it by '%ld'.
      
      Signed-off-by: default avatarYe Bin <yebin10@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20230904105817.1728356-3-yebin10@huawei.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      8b6b5621
    • Ye Bin's avatar
      jbd2: print io_block if check data block checksum failed when do recovery · 71cd5a5a
      Ye Bin authored
      
      
      Now, if check data block checksum failed only print data's block number
      then skip write data. However, one data block may in more than one transaction.
      In some scenarios, offline analysis is inconvenient. As a result, it is
      difficult to locate the areas where data is faulty.
      So print 'io_block' if check data block checksum failed.
      
      Signed-off-by: default avatarYe Bin <yebin10@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20230904105817.1728356-2-yebin10@huawei.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      71cd5a5a
    • Kemeng Shi's avatar
      ext4: remove unnecessary initialization of count2 in set_flexbg_block_bitmap · 248b45b6
      Kemeng Shi authored
      
      
      We always overwrite count2 to "EXT4_CLUSTERS_PER_GROUP(sb) -
      (first_cluster - start)" after its initialization in for loop
      initialization statement .
      Just remove unnecessary initialization of count2.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20230826174712.4059355-14-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      248b45b6
    • Kemeng Shi's avatar
      ext4: remove unnecessary check to avoid repeat update_backups for the same gdb · 350bb48b
      Kemeng Shi authored
      The sbi->s_group_desc contains array of bh's for block group descriptors
      and continuous EXT4_DESC_PER_BLOCK(sb) bg descriptors in single block
      share the same bh.
      Simply call update_backups for each gdb_bh in sbi->s_group_desc will not
      update same group descriptors block for multiple times.
      
      Commit 0acdb887
      
       ("ext4: don't call update_backups() multiple times for
      the same bg") wrongly assumed each block group descriptor in the same block
      has a individual bh and unnecessary check was added.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Link: https://lore.kernel.org/r/20230826174712.4059355-13-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      350bb48b
    • Kemeng Shi's avatar
      ext4: simplify the gdbblock calculation in add_new_gdb_meta_bg · 9dca529b
      Kemeng Shi authored
      
      
      We always call add_new_gdb_meta_bg with first group in mete_bg. Remove the
      unnecessary ext4_meta_bg_first_group conversion to simplify the gdbblock
      calculation.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Link: https://lore.kernel.org/r/20230826174712.4059355-12-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      9dca529b
    • Kemeng Shi's avatar
      ext4: use saved local variable sbi instead of EXT4_SB(sb) · 70cbfd25
      Kemeng Shi authored
      
      
      We save EXT4_SB(sb) to local variable sbi at beginning of function
      ext4_resize_begin. Use sbi directly instead of EXT4_SB(sb) to
      remove unnecessary pointer dereference.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20230826174712.4059355-11-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      70cbfd25
    • Kemeng Shi's avatar
      ext4: remove EXT4FS_DEBUG defination in resize.c · 95b63568
      Kemeng Shi authored
      
      
      Remove EXT4FS_DEBUG defination in resize.c for following reasons:
      1. EXT4FS_DEBUG will enable debug messages, it should only be defined
      when debugging.
      2. ext4.h included from ext4_jbd2.h after EXT4FS_DEBUG defination will
      "#undef EXT4FS_DEBUG", then EXT4FS_DEBUG defination in resize.c can't
      actually turn on ext4_debug messages.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20230826174712.4059355-10-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      95b63568
    • Kemeng Shi's avatar
      ext4: calculate free_clusters_count in cluster unit in verify_group_input · 1fc1bd2d
      Kemeng Shi authored
      
      
      The field free_cluster_count in struct ext4_new_group_data should be
      in units of clusters.  In verify_group_input() this field is being
      filled in units of blocks.  Fortunately, we don't support online
      resizing of bigalloc file systems, and for non-bigalloc file systems,
      the cluster size == block size.  But fix this in case we do support
      online resizing of bigalloc file systems in the future.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20230826174712.4059355-9-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      1fc1bd2d
    • Kemeng Shi's avatar
      ext4: remove commented code in reserve_backup_gdb · 31458077
      Kemeng Shi authored
      
      
      Remove commented code in reserve_backup_gdb
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20230826174712.4059355-8-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      31458077
    • Kemeng Shi's avatar
      ext4: remove redundant check of count · 7d4cd3b4
      Kemeng Shi authored
      
      
      Remove zero check of count which is always non-zero.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20230826174712.4059355-7-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      7d4cd3b4
    • Kemeng Shi's avatar
      ext4: fix typo in setup_new_flex_group_blocks · e44fc921
      Kemeng Shi authored
      
      
      grop -> group
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20230826174712.4059355-6-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      e44fc921
    • Kemeng Shi's avatar
      ext4: remove gdb backup copy for meta bg in setup_new_flex_group_blocks · 40dd7953
      Kemeng Shi authored
      
      
      Wrong check of gdb backup in meta bg as following:
      first_group is the first group of meta_bg which contains target group, so
      target group is always >= first_group. We check if target group has gdb
      backup by comparing first_group with [group + 1] and [group +
      EXT4_DESC_PER_BLOCK(sb) - 1]. As group >= first_group, then [group + N] is
      > first_group. So no copy of gdb backup in meta bg is done in
      setup_new_flex_group_blocks.
      
      No need to do gdb backup copy in meta bg from setup_new_flex_group_blocks
      as we always copy updated gdb block to backups at end of
      ext4_flex_group_add as following:
      
      ext4_flex_group_add
        /* no gdb backup copy for meta bg any more */
        setup_new_flex_group_blocks
      
        /* update current group number */
        ext4_update_super
          sbi->s_groups_count += flex_gd->count;
      
        /*
         * if group in meta bg contains backup is added, the primary gdb block
         * of the meta bg will be copy to backup in new added group here.
         */
        for (; gdb_num <= gdb_num_end; gdb_num++)
          update_backups(...)
      
      In summary, we can remove wrong gdb backup copy code in
      setup_new_flex_group_blocks.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20230826174712.4059355-5-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      40dd7953
    • Kemeng Shi's avatar
      ext4: correct return value of ext4_convert_meta_bg · 48f15515
      Kemeng Shi authored
      
      
      Avoid to ignore error in "err".
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Link: https://lore.kernel.org/r/20230826174712.4059355-4-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      48f15515
    • Kemeng Shi's avatar
      ext4: add missed brelse in update_backups · 9adac8b0
      Kemeng Shi authored
      
      
      add missed brelse in update_backups
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20230826174712.4059355-3-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      9adac8b0
    • Kemeng Shi's avatar
      ext4: correct offset of gdb backup in non meta_bg group to update_backups · 31f13421
      Kemeng Shi authored
      Commit 0aeaa255
      
       ("ext4: fix corruption when online resizing a 1K
      bigalloc fs") found that primary superblock's offset in its group is
      not equal to offset of backup superblock in its group when block size
      is 1K and bigalloc is enabled. As group descriptor blocks are right
      after superblock, we can't pass block number of gdb to update_backups
      for the same reason.
      
      The root casue of the issue above is that leading 1K padding block is
      count as data block offset for primary block while backup block has no
      padding block offset in its group.
      
      Remove padding data block count to fix the issue for gdb backups.
      
      For meta_bg case, update_backups treat blk_off as block number, do no
      conversion in this case.
      
      Signed-off-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20230826174712.4059355-2-shikemeng@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      31f13421
    • Wang Jianjian's avatar
      ext4: no need to generate from free list in mballoc · ebf6cb7c
      Wang Jianjian authored
      Commit 7a2fcbf7 ("ext4: don't use blocks freed but not yet committed in
      buddy cache init") added a code to mark as used blocks in the list of not yet
      committed freed blocks during initialization of a buddy page. However
      ext4_mb_free_metadata() makes sure buddy page is already loaded and takes a
      reference to it so it cannot happen that ext4_mb_init_cache() is called
      when efd list is non-empty. Just remove the
      ext4_mb_generate_from_freelist() call.
      
      Fixes: 7a2fcbf7
      
      ('ext4: don't use blocks freed but not yet committed in buddy cache init')
      Signed-off-by: default avatarWang Jianjian <wangjianjian0@foxmail.com>
      Link: https://lore.kernel.org/r/tencent_53CBCB1668358AE862684E453DF37B722008@qq.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      ebf6cb7c
    • Wang Jianjian's avatar
      ext4: fix incorrect offset · 8fedebb5
      Wang Jianjian authored
      
      
      The last argument of ext4_check_dir_entry is dentry offset int the
      file.  Luckily this error only results in the wrong offset being
      printed in the eventual error message.
      
      Signed-off-by: default avatarWang Jianjian <wangjianjian0@foxmail.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/tencent_F992989953734FD5DE3F88ECB2191A856206@qq.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      8fedebb5
    • Zhang Yi's avatar
      ext4: make sure allocate pending entry not fail · 8e387c89
      Zhang Yi authored
      
      
      __insert_pending() allocate memory in atomic context, so the allocation
      could fail, but we are not handling that failure now. It could lead
      ext4_es_remove_extent() to get wrong reserved clusters, and the global
      data blocks reservation count will be incorrect. The same to
      extents_status entry preallocation, preallocate pending entry out of the
      i_es_lock with __GFP_NOFAIL, make sure __insert_pending() and
      __revise_pending() always succeeds.
      
      Signed-off-by: default avatarZhang Yi <yi.zhang@huawei.com>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/r/20230824092619.1327976-3-yi.zhang@huaweicloud.com
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      8e387c89
    • Zhang Yi's avatar
      ext4: correct the start block of counting reserved clusters · 40ea9839
      Zhang Yi authored
      
      
      When big allocate feature is enabled, we need to count and update
      reserved clusters before removing a delayed only extent_status entry.
      {init|count|get}_rsvd() have already done this, but the start block
      number of this counting isn't correct in the following case.
      
        lblk            end
         |               |
         v               v
                -------------------------
                |                       | orig_es
                -------------------------
                         ^              ^
            len1 is 0    |     len2     |
      
      If the start block of the orig_es entry founded is bigger than lblk, we
      passed lblk as start block to count_rsvd(), but the length is correct,
      finally, the range to be counted is offset. This patch fix this by
      passing the start blocks to 'orig_es->lblk + len1'.
      
      Signed-off-by: default avatarZhang Yi <yi.zhang@huawei.com>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/r/20230824092619.1327976-2-yi.zhang@huaweicloud.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      40ea9839
    • Jinke Han's avatar
      ext4: make running and commit transaction have their own freed_data_list · ce774e53
      Jinke Han authored
      
      
      When releasing space in jbd, we traverse s_freed_data_list to get the
      free range belonging to the current commit transaction. In extreme cases,
      the time spent may not be small, and we have observed cases exceeding
      10ms. This patch makes running and commit transactions manage their own
      free_data_list respectively, eliminating unnecessary traversal.
      
      And in the callback phase of the commit transaction, no one will touch
      it except the jbd thread itself, so s_md_lock is no longer needed.
      
      Signed-off-by: default avatarJinke Han <hanjinke.666@bytedance.com>
      Reviewed-by: default avatarZhang Yi <yi.zhang@huawei.com>
      Link: https://lore.kernel.org/r/20230612124017.14115-1-hanjinke.666@bytedance.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      ce774e53
    • Lu Hongfei's avatar
      ext4: fix traditional comparison using max/min method · a8c1eb77
      Lu Hongfei authored
      
      
      It would be better to replace the traditional ternary conditional
      operator with max()/min()
      
      Signed-off-by: default avatarLu Hongfei <luhongfei@vivo.com>
      Reviewed-by: default avatarKemeng Shi <shikemeng@huaweicloud.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20230529070930.37949-1-luhongfei@vivo.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      a8c1eb77
    • Baokun Li's avatar
      ext4: fix race between writepages and remount · 745f17a4
      Baokun Li authored
      We got a WARNING in ext4_add_complete_io:
      ==================================================================
       WARNING: at fs/ext4/page-io.c:231 ext4_put_io_end_defer+0x182/0x250
       CPU: 10 PID: 77 Comm: ksoftirqd/10 Tainted: 6.3.0-rc2 #85
       RIP: 0010:ext4_put_io_end_defer+0x182/0x250 [ext4]
       [...]
       Call Trace:
        <TASK>
        ext4_end_bio+0xa8/0x240 [ext4]
        bio_endio+0x195/0x310
        blk_update_request+0x184/0x770
        scsi_end_request+0x2f/0x240
        scsi_io_completion+0x75/0x450
        scsi_finish_command+0xef/0x160
        scsi_complete+0xa3/0x180
        blk_complete_reqs+0x60/0x80
        blk_done_softirq+0x25/0x40
        __do_softirq+0x119/0x4c8
        run_ksoftirqd+0x42/0x70
        smpboot_thread_fn+0x136/0x3c0
        kthread+0x140/0x1a0
        ret_from_fork+0x2c/0x50
      ==================================================================
      
      Above issue may happen as follows:
      
                  cpu1                        cpu2
      ----------------------------|----------------------------
      mount -o dioread_lock
      ext4_writepages
       ext4_do_writepages
        *if (ext4_should_dioread_nolock(inode))*
          // rsv_blocks is not assigned here
                                       mount -o remount,dioread_nolock
        ext4_journal_start_with_reserve
         __ext4_journal_start
          __ext4_journal_start_sb
           jbd2__journal_start
            *if (rsv_blocks)*
              // h_rsv_handle is not initialized here
        mpage_map_and_submit_extent
          mpage_map_one_extent
            dioread_nolock = ext4_should_dioread_nolock(inode)
            if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN))
              mpd->io_submit.io_end->handle = handle->h_rsv_handle
              ext4_set_io_unwritten_flag
                io_end->flag |= EXT4_IO_END_UNWRITTEN
            // now io_end->handle is NULL but has EXT4_IO_END_UNWRITTEN flag
      
      scsi_finish_command
       scsi_io_completion
        scsi_io_completion_action
         scsi_end_request
          blk_update_request
           req_bio_endio
            bio_endio
             bio->bi_end_io  > ext4_end_bio
              ext4_put_io_end_defer
      	 ext4_add_complete_io
      	  // trigger WARN_ON(!io_end->handle && sbi->s_journal);
      
      The immediate cause of this problem is that ext4_should_dioread_nolock()
      function returns inconsistent values in the ext4_do_writepages() and
      mpage_map_one_extent(). There are four conditions in this function that
      can be changed at mount time to cause this problem. These four conditions
      can be divided into two categories:
      
          (1) journal_data and EXT4_EXTENTS_FL, which can be changed by ioctl
          (2) DELALLOC and DIOREAD_NOLOCK, which can be changed by remount
      
      The two in the first category have been fixed by commit c8585c6f
      ("ext4: fix races between changing inode journal mode and ext4_writepages")
      and commit cb85f4d2 ("ext4: fix race between writepages and enabling
      EXT4_EXTENTS_FL") respectively.
      
      Two cases in the other category have not yet been fixed, and the above
      issue is caused by this situation. We refer to the fix for the first
      category, when applying options during remount, we grab s_writepages_rwsem
      to avoid racing with writepages ops to trigger this problem.
      
      Fixes: 6b523df4
      
       ("ext4: use transaction reservation for extent conversion in ext4_end_io")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20230524072538.2883391-1-libaokun1@huawei.com
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      745f17a4
    • Theodore Ts'o's avatar
      ext4: add missing initialization of call_notify_error in update_super_work() · ee6a12d0
      Theodore Ts'o authored
      Fixes: ff0722de
      
       ("ext4: add periodic superblock update check")
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      ee6a12d0