  1. Jun 21, 2015
    • jbd2: speedup jbd2_journal_dirty_metadata() · 2143c196
      Jan Kara authored
      
      
      It is often the case that we mark a buffer as having dirty metadata when
      the buffer is already in that state (frequent for bitmaps, inode table
      blocks, superblock). Thus it is unnecessary to contend on grabbing the
      journal head reference and bh_state lock. Avoid that by checking whether
      any modification to the buffer is needed before grabbing any locks or
      references.
      
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
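The check-before-locking idea in this commit can be illustrated with a minimal user-space sketch. This is not the jbd2 code: `struct buf`, `BUF_INITIALIZER`, `mark_dirty_metadata()` and the `lock_acquisitions` counter are hypothetical names invented for the example, and a C11 `atomic_flag` spinlock stands in for the bh_state lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a journaled buffer; not the jbd2 structures. */
struct buf {
    atomic_bool dirty_metadata;  /* the state the caller wants to reach */
    atomic_flag state_lock;      /* stand-in for the bh_state lock */
    int lock_acquisitions;       /* instrumentation for this example only */
};

#define BUF_INITIALIZER { false, ATOMIC_FLAG_INIT, 0 }

static void mark_dirty_metadata(struct buf *b)
{
    /* Fast path: the buffer is already in the target state, so there is
     * nothing to modify and no reason to contend on the lock. */
    if (atomic_load(&b->dirty_metadata))
        return;

    /* Slow path: spin until the lock is acquired, then update the state. */
    while (atomic_flag_test_and_set(&b->state_lock))
        ;
    b->lock_acquisitions++;
    atomic_store(&b->dirty_metadata, true);
    atomic_flag_clear(&b->state_lock);
}
```

Calling `mark_dirty_metadata()` repeatedly on the same buffer takes the lock only on the first call; later calls return from the fast path.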
  2. Jun 16, 2015
    • jbd2: get rid of open coded allocation retry loop · 7b506b10
      Michal Hocko authored
      
      
      insert_revoke_hash does an open coded endless allocation loop if
      journal_oom_retry is true. It doesn't implement any allocation fallback
      strategy between the retries, though. The memory allocator doesn't know
      about the never fail requirement so it cannot potentially help to move
      on with the allocation (e.g. use memory reserves).
      
      Get rid of the retry loop and use __GFP_NOFAIL instead. We will lose the
      debugging message but I am not sure it is anyhow helpful.
      
      Do the same for journal_alloc_journal_head which is doing a similar
      thing.
      
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: improve warning directory handling messages · b03a2f7e
      Andreas Dilger authored
      
      
      Several ext4_warning() messages in the directory handling code do not
      report the inode number of the (potentially corrupt) directory where a
      problem is seen, and others report this in an ad-hoc manner.  Add an
      ext4_warning_inode() helper to print the inode number and command name
      consistent with ext4_error_inode().
      
      Consolidate the place in ext4.h that these macros are defined.
      
      Clean up some other directory error and warning messages to print the
      calling function name.
      
      Minor code style fixes in nearby lines.
      
      Signed-off-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • jbd2: fix ocfs2 corrupt when updating journal superblock fails · 6f6a6fda
      Joseph Qi authored
      If updating the journal superblock fails after the journal data has
      been flushed, the error is dropped and the caller is misled into
      treating this as the normal case.  In ocfs2, the checkpoint is then
      treated as successful and the other node can get the lock to update.
      Since sb_start still points to the old log block, the other node will
      rewrite that journal data during journal recovery. Thus the new
      updates will be overwritten and ocfs2 is corrupted.  So in the above
      case we have to return the error, and ocfs2_commit_cache will take
      care of the error and prevent the other node from updating first.
      Only after recovering the journal can it do the new updates.
      
      The issue discussion mail can be found at:
      https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html
      http://comments.gmane.org/gmane.comp.file-systems.ext4/48841
      
      
      
      [ Fixed bug in patch which allowed a non-negative error return from
        jbd2_cleanup_journal_tail() to leak out of jbd2_journal_flush(); this
        was causing xfstests ext4/306 to fail. -- Ted ]
      
      Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Tested-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: stable@vger.kernel.org
  3. Jun 15, 2015
    • ext4: mballoc: avoid 20-argument function call · 97b4af2f
      Rasmus Villemoes authored
      
      
      Making a function call with 20 arguments is rather expensive in both
      stack and .text. In this case, doing the formatting manually doesn't
      make it any less readable, so we might as well save 155 bytes of .text
      and 112 bytes of stack.
      
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    • ext4: wait for existing dio workers in ext4_alloc_file_blocks() · 0d306dcf
      Lukas Czerner authored
      
      
      Currently existing dio workers can jump in and potentially increase
      extent tree depth while we're allocating blocks in
      ext4_alloc_file_blocks().  This may cause us to underestimate the
      number of credits needed for the transaction because the extent tree
      depth can change after our estimation.
      
      Fix this by waiting for all the existing dio workers in the same way
      as we do it in ext4_punch_hole.  We've seen errors caused by this in
      xfstest generic/299, however it's really hard to reproduce.
      
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: recalculate journal credits as inode depth changes · 4134f5c8
      Lukas Czerner authored
      
      
      Currently in ext4_alloc_file_blocks() the number of credits is
      calculated only once before we enter the allocation loop. However, within
      the allocation loop the extent tree depth can change, so the number of
      credits needed can increase, potentially exceeding the number of credits
      reserved in the handle, which can cause journal failures.

      Fix this by recalculating the number of credits when the inode depth
      changes. Note that even though ext4_alloc_file_blocks() is currently
      only used by extent-based inodes, we will avoid recalculating the number
      of credits unnecessarily in the case of indirect-based inodes.
      
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
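The recalculate-on-depth-change pattern described in this commit can be sketched as a small standalone loop. This is an illustration only, not the ext4 code: `credits_for_depth()`, `struct alloc_stats` and `alloc_loop()` are hypothetical names, the credit formula is made up, and the changing tree depth is simulated with an array.

```c
/* Hypothetical credit estimate: deeper trees need more credits.
 * The formula is invented for this example. */
static int credits_for_depth(int depth)
{
    return 2 * (depth + 1);
}

struct alloc_stats {
    int recalculations;  /* how often the estimate was recomputed */
    int last_credits;    /* estimate used by the final iteration */
};

/* depth_at_step[i] is the simulated tree depth seen before iteration i. */
static struct alloc_stats alloc_loop(const int *depth_at_step, int steps)
{
    struct alloc_stats st = { 0, 0 };
    int cached_depth = -1;  /* forces a calculation on the first pass */

    for (int i = 0; i < steps; i++) {
        /* Recompute only when the depth differs from the cached one,
         * instead of once before the loop or on every iteration. */
        if (depth_at_step[i] != cached_depth) {
            cached_depth = depth_at_step[i];
            st.last_credits = credits_for_depth(cached_depth);
            st.recalculations++;
        }
        /* A real loop would start a handle with st.last_credits here
         * and allocate blocks, which may itself deepen the tree. */
    }
    return st;
}
```

With depths `{1, 1, 2, 2, 2}` the estimate is computed twice: once on entry and once when the depth grows.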
    • jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail() · b4f1afcd
      Dmitry Monakhov authored
      
      
      jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start(),
      so allocations should be done with GFP_NOFS.
      
      [Full stack trace snipped from 3.10-rh7]
      [<ffffffff815c4bd4>] dump_stack+0x19/0x1b
      [<ffffffff8105dba1>] warn_slowpath_common+0x61/0x80
      [<ffffffff8105dcca>] warn_slowpath_null+0x1a/0x20
      [<ffffffff815c2142>] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17
      [<ffffffff8119c045>] kmem_cache_alloc+0x55/0x210
      [<ffffffff811477f5>] ? mempool_alloc_slab+0x15/0x20
      [<ffffffff811477f5>] mempool_alloc_slab+0x15/0x20
      [<ffffffff81147939>] mempool_alloc+0x69/0x170
      [<ffffffff815cb69e>] ? _raw_spin_unlock_irq+0xe/0x20
      [<ffffffff8109160d>] ? finish_task_switch+0x5d/0x150
      [<ffffffff811f1a8e>] bio_alloc_bioset+0x1be/0x2e0
      [<ffffffff8127ee49>] blkdev_issue_flush+0x99/0x120
      [<ffffffffa019a733>] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL
      [<ffffffffa019aca1>] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2]
      [<ffffffffa019afc7>] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2]
      [<ffffffffa01952d8>] start_this_handle+0x2d8/0x550 [jbd2]
      [<ffffffff811b02a9>] ? __memcg_kmem_put_cache+0x29/0x30
      [<ffffffff8119c120>] ? kmem_cache_alloc+0x130/0x210
      [<ffffffffa019573a>] jbd2__journal_start+0xba/0x190 [jbd2]
      [<ffffffff811532ce>] ? lru_cache_add+0xe/0x10
      [<ffffffffa01c9549>] ? ext4_da_write_begin+0xf9/0x330 [ext4]
      [<ffffffffa01f2c77>] __ext4_journal_start_sb+0x77/0x160 [ext4]
      [<ffffffffa01c9549>] ext4_da_write_begin+0xf9/0x330 [ext4]
      [<ffffffff811446ec>] generic_file_buffered_write_iter+0x10c/0x270
      [<ffffffff81146918>] __generic_file_write_iter+0x178/0x390
      [<ffffffff81146c6b>] __generic_file_aio_write+0x8b/0xb0
      [<ffffffff81146ced>] generic_file_aio_write+0x5d/0xc0
      [<ffffffffa01bf289>] ext4_file_write+0xa9/0x450 [ext4]
      [<ffffffff811c31d9>] ? pipe_read+0x379/0x4f0
      [<ffffffff811b93f0>] do_sync_write+0x90/0xe0
      [<ffffffff811b9b6d>] vfs_write+0xbd/0x1e0
      [<ffffffff811ba5b8>] SyS_write+0x58/0xb0
      [<ffffffff815d4799>] system_call_fastpath+0x16/0x1b
      
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  4. Jun 13, 2015
    • ext4: use swap() in mext_page_double_lock() · bf865467
      Fabian Frederick authored
      
      
      Use kernel.h macro definition.
      
      Thanks to Julia Lawall for Coccinelle scripting support.
      
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: use swap() in memswap() · 4b7e2db5
      Fabian Frederick authored
      
      
      Use kernel.h macro definition.
      
      Thanks to Julia Lawall for Coccinelle scripting support.
      
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix race between truncate and __ext4_journalled_writepage() · bdf96838
      Theodore Ts'o authored
      The commit cf108bca: "ext4: Invert the locking order of page_lock
      and transaction start" caused __ext4_journalled_writepage() to drop
      the page lock before the page was written back, as part of changing
      the locking order to jbd2_journal_start -> page_lock.  However, this
      introduced a potential race if there was a truncate racing with the
      data=journalled writeback mode.
      
      Fix this by grabbing the page lock after starting the journal handle,
      and then checking to see if page had gotten truncated out from under
      us.
      
      This fixes a number of different warnings or BUG_ON's when running
      xfstests generic/086 in data=journalled mode, including:
      
      jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7c0, 164), jh->b_transaction (  (null), 0), jh->b_next_transaction (  (null), 0), jlist 0
      
      	      	      	  - and -
      
      kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
          ...
      Call Trace:
       [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
       [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
       [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
       [<c027d883>] ? lock_buffer+0x36/0x36
       [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
       [<c0229139>] do_invalidatepage+0x22/0x26
       [<c0229198>] truncate_inode_page+0x5b/0x85
       [<c022934b>] truncate_inode_pages_range+0x156/0x38c
       [<c0229592>] truncate_inode_pages+0x11/0x15
       [<c022962d>] truncate_pagecache+0x55/0x71
       [<c02b913b>] ext4_setattr+0x4a9/0x560
       [<c01ca542>] ? current_kernel_time+0x10/0x44
       [<c026c4d8>] notify_change+0x1c7/0x2be
       [<c0256a00>] do_truncate+0x65/0x85
       [<c0226f31>] ? file_ra_state_init+0x12/0x29
      
      	      	      	  - and -
      
      WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
      irty_metadata+0x14a/0x1ae()
          ...
      Call Trace:
       [<c01b879f>] ? console_unlock+0x3a1/0x3ce
       [<c082cbb4>] dump_stack+0x48/0x60
       [<c0178b65>] warn_slowpath_common+0x89/0xa0
       [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
       [<c0178bef>] warn_slowpath_null+0x14/0x18
       [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
       [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
       [<c02b2f44>] write_end_fn+0x40/0x53
       [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
       [<c02b59e7>] ext4_writepage+0x354/0x3b8
       [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
       [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
       [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
       [<c02b5a5b>] __writepage+0x10/0x2e
       [<c0225956>] write_cache_pages+0x22d/0x32c
       [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
       [<c02b6ee8>] ext4_writepages+0x102/0x607
       [<c019adfe>] ? sched_clock_local+0x10/0x10e
       [<c01a8a7c>] ? __lock_is_held+0x2e/0x44
       [<c01a8ad5>] ? lock_is_held+0x43/0x51
       [<c0226dff>] do_writepages+0x1c/0x29
       [<c0276bed>] __writeback_single_inode+0xc3/0x545
       [<c0277c07>] writeback_sb_inodes+0x21f/0x36d
          ...
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4 crypto: fail the mount if blocksize != pagesize · 1cb767cd
      Theodore Ts'o authored
      
      
      We currently don't correctly handle the case where blocksize !=
      pagesize, so disallow the mount in those cases.
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  5. Jun 09, 2015
  6. Jun 08, 2015
    • ext4: BUG_ON assertion repeated for inode1, not done for inode2 · 8bc3b1e6
      David Moore authored
      
      
      During a source code review of fs/ext4/extents.c I noted identical
      consecutive lines. An assertion is repeated for inode1 and never done
      for inode2. This is not in keeping with the rest of the code in the
      ext4_swap_extents function and appears to be a bug.
      
      Assert that the inode2 mutex is not locked.
      
      Signed-off-by: David Moore <dmoorefo@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
    • ext4: return error code from ext4_mb_good_group() · 42ac1848
      Lukas Czerner authored
      
      
      Currently ext4_mb_good_group() only returns 0 or 1 depending on whether
      the allocation group is suitable for use or not. However we might get
      various errors and fail while initializing new group including -EIO
      which would never get propagated up the call chain. This might lead to
      an endless loop at writeback when we're trying to find a good group to
      allocate from and we fail to initialize new group (read error for
      example).
      
      Fix this by returning proper error code from ext4_mb_good_group() and
      using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator()
      we will always return only the first error that occurred in
      ext4_mb_good_group(), and we only propagate it back to the caller if we
      do not get any other errors and we fail to allocate any blocks.
      
      Note that with other modes than errors=continue, we will fail
      immediately in ext4_mb_good_group() in case of error, however with
      errors=continue we should try to continue using the file system, that's
      why we're not going to fail immediately when we see an error from
      ext4_mb_good_group(), but rather when we fail to find a suitable block
      group to allocate from due to a problem in group initialization.
      
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
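The "remember the first error, report it only if nothing could be allocated" behaviour described in this commit can be sketched as follows. This is a simplified model, not the mballoc code: `pick_group()` and the `group_status` encoding are hypothetical, and returning `-ENOSPC` when no group is usable and no error occurred is an assumption made for this example.

```c
#include <errno.h>

/*
 * group_status[i] models the result of checking group i:
 *   > 0  the group is suitable,
 *   == 0 the group is not suitable,
 *   < 0  checking the group failed with that negative errno.
 */
static int pick_group(const int *group_status, int ngroups, int *chosen)
{
    int first_err = 0;

    for (int i = 0; i < ngroups; i++) {
        int ret = group_status[i];

        if (ret < 0) {
            /* Keep scanning, but remember only the first error seen. */
            if (!first_err)
                first_err = ret;
            continue;
        }
        if (ret > 0) {
            /* A usable group was found: earlier errors are not reported. */
            *chosen = i;
            return 0;
        }
    }

    /* Nothing usable: surface the first error if there was one. */
    *chosen = -1;
    return first_err ? first_err : -ENOSPC;
}
```

A scan over `{-EIO, 0, 1}` succeeds with group 2, while a scan over `{-EIO, -ENOMEM, 0}` fails with `-EIO`, the first error encountered.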
    • ext4: try to initialize all groups we can in case of failure on ppc64 · bbdc322f
      Lukas Czerner authored
      
      
      Currently on the machines with page size > block size when initializing
      block group buddy cache we initialize it for all the block group bitmaps
      in the page. However in the case of read error, checksum error, or if
      a single bitmap is in any way corrupted we would fail to initialize all
      of the bitmaps. This is problematic because we will not have access to
      the other allocation groups even though those might be perfectly fine
      and usable.
      
      Fix this by reading all the bitmaps instead of erroring out on the first
      problem, and simply skip the bitmaps which were either not read properly
      or are not valid.
      
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: verify block bitmap even after fresh initialization · 41e5b7ed
      Lukas Czerner authored
      
      
      If we want to rely on the buffer_verified() flag of the block bitmap
      buffer, we have to set it consistently. However currently if we're
      initializing uninitialized block bitmap in
      ext4_read_block_bitmap_nowait() we're not going to set buffer verified
      at all.
      
      We can do this by simply setting the flag on the buffer, but I think
      it's actually better to run ext4_validate_block_bitmap() to make sure
      that what we did in the ext4_init_block_bitmap() is right.
      
      So run ext4_validate_block_bitmap() even after the block bitmap
      initialization. Also bail out early from ext4_validate_block_bitmap() if
      we see corrupt bitmap, since we already know it's corrupt and we do not
      need to verify that.
      
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • jbd2: revert must-not-fail allocation loops back to GFP_NOFAIL · 6ccaf3e2
      Michal Hocko authored
      This basically reverts 47def826 (jbd2: Remove __GFP_NOFAIL from jbd2
      layer). The deprecation of __GFP_NOFAIL was a bad choice because it led
      to open coding the endless loop around the allocator rather than
      removing the dependency on the non failing allocation. So the
      deprecation was a clear failure and the reality tells us that
      __GFP_NOFAIL is not even close to going away.
      
      It is still true that __GFP_NOFAIL allocations are generally discouraged,
      and new uses should be evaluated and an alternative (pre-allocations or
      reservations) should be considered, but it doesn't make any sense to lie
      to the allocator about the requirements. The allocator can take steps to
      help make progress if it knows the requirements.
      
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Acked-by: David Rientjes <rientjes@google.com>
  7. Jun 03, 2015
    • ext4 crypto: allocate bounce pages using GFP_NOWAIT · 3dbb5eb9
      Theodore Ts'o authored
      
      
      Previously we allocated bounce pages using a combination of
      alloc_page() and mempool_alloc() with the __GFP_WAIT bit set.
      Instead, use mempool_alloc() with GFP_NOWAIT.  The mempool_alloc()
      function will try using alloc_pages() initially, and then only use the
      mempool reserve of pages if alloc_pages() is unable to fulfill the
      request.
      
      This minimizes the impact on the mm layer when we need to do a
      large amount of writeback of encrypted files, as Jaeguk Kim had
      reported that under a heavy fio workload on a system with restricted
      amounts of memory (which, unfortunately, includes many mobile handsets),
      he had observed the OOM killer getting triggered several times.
      Using GFP_NOWAIT
      
      If the mempool_alloc() function fails, we will retry the page
      writeback at a later time; the function of the mempool is to ensure
      that we can writeback at least 32 pages at a time, so we can more
      efficiently dispatch I/O under high memory pressure situations.  In
      the future we should make this be a tunable so we can determine the
      best tradeoff between permanently sequestering memory and the ability
      to quickly launder pages so we can free up memory quickly when
      necessary.
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
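The allocation order this commit message describes (try the ordinary allocator first, fall back to a fixed reserve, otherwise fail so the caller can retry later) can be modelled in user space. This is a rough sketch and not the kernel mempool API: `struct bounce_pool`, `bounce_pool_init()`, `bounce_alloc()`, `page_alloc_fn` and `always_fail()` are hypothetical names, and only the reserve size of 32 pages comes from the commit message.

```c
#include <stddef.h>

#define PAGE_BYTES    4096
#define RESERVE_PAGES 32   /* reserve size mentioned in the commit message */

/* Hypothetical fixed reserve of preallocated pages. */
struct bounce_pool {
    unsigned char pages[RESERVE_PAGES][PAGE_BYTES];
    int free_count;  /* reserve pages not yet handed out */
};

/* Hypothetical primary allocator: returns NULL instead of waiting. */
typedef void *(*page_alloc_fn)(void);

static void bounce_pool_init(struct bounce_pool *pool)
{
    pool->free_count = RESERVE_PAGES;
}

static void *bounce_alloc(struct bounce_pool *pool, page_alloc_fn try_alloc)
{
    /* 1. Try the ordinary allocator first, without blocking. */
    void *page = try_alloc();
    if (page)
        return page;

    /* 2. Only dip into the reserve when the ordinary path fails. */
    if (pool->free_count > 0)
        return pool->pages[--pool->free_count];

    /* 3. Reserve exhausted: report failure so the caller can retry later. */
    return NULL;
}

/* Simulates memory pressure: the ordinary allocator never succeeds. */
static void *always_fail(void)
{
    return NULL;
}
```

Under simulated memory pressure the first 32 requests are served from the reserve and the next one fails, which corresponds to the case where the commit message says the page writeback is retried at a later time.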
  8. Jun 01, 2015
  9. May 19, 2015