  1. Jul 02, 2022
    • fs: port higher-level mapping helpers · 21c6c720
      Christian Brauner authored
      commit 209188ce upstream.
      
      Enable the mapped_fs{g,u}id() helpers to support filesystems mounted
      with an idmapping. Apart from core mapping helpers that use
      mapped_fs{g,u}id() to initialize struct inode's i_{g,u}id fields xfs is
      the only place that uses these low-level helpers directly.
      
      The patch only extends the helpers to be able to take the filesystem
      idmapping into account. Since we don't actually yet pass the
      filesystem's idmapping in no functional changes happen. This will happen
      in a final patch.
      
      Link: https://lore.kernel.org/r/20211123114227.3124056-9-brauner@kernel.org (v1)
      Link: https://lore.kernel.org/r/20211130121032.3753852-9-brauner@kernel.org (v2)
      Link: https://lore.kernel.org/r/20211203111707.3901969-9-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fs: remove unused low-level mapping helpers · 7d0536a8
      Christian Brauner authored
      commit 02e40799 upstream.
      
      Now that we ported all places to use the new low-level mapping helpers
      that are able to support filesystems mounted with an idmapping we can
      remove the old low-level mapping helpers. With the removal of these old
      helpers we also conclude the renaming of the mapping helpers we started
      in commit a65e58e7 ("fs: document and rename fsid helpers").
      
      Link: https://lore.kernel.org/r/20211123114227.3124056-8-brauner@kernel.org (v1)
      Link: https://lore.kernel.org/r/20211130121032.3753852-8-brauner@kernel.org (v2)
      Link: https://lore.kernel.org/r/20211203111707.3901969-8-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fs: use low-level mapping helpers · f895d0ff
      Christian Brauner authored
      commit 44720713 upstream.
      
      In a few places the vfs needs to interact with bare k{g,u}ids directly
      instead of struct inode. These are just a few. In previous patches we
      introduced low-level mapping helpers that are able to support
      filesystems mounted with an idmapping. This patch simply converts the places
      to use these new helpers.
      
      Link: https://lore.kernel.org/r/20211123114227.3124056-7-brauner@kernel.org (v1)
      Link: https://lore.kernel.org/r/20211130121032.3753852-7-brauner@kernel.org (v2)
      Link: https://lore.kernel.org/r/20211203111707.3901969-7-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • docs: update mapping documentation · 1c62e018
      Christian Brauner authored
      commit 8cc5c54d upstream.
      
      Now that we implement the full remapping algorithms described in our
      documentation, remove the section about short-circuiting them.
      
      Link: https://lore.kernel.org/r/20211123114227.3124056-6-brauner@kernel.org (v1)
      Link: https://lore.kernel.org/r/20211130121032.3753852-6-brauner@kernel.org (v2)
      Link: https://lore.kernel.org/r/20211203111707.3901969-6-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fs: account for filesystem mappings · b20dcf60
      Christian Brauner authored
      commit 1ac2a410 upstream.
      
      Currently we only support idmapped mounts for filesystems mounted
      without an idmapping. This was a conscious decision mentioned in
      multiple places (cf. e.g. [1]).
      
      As explained at length in [3] it is perfectly fine to extend support for
      idmapped mounts to filesystems mounted with an idmapping should the
      need arise. The need has been there for some time now. Various container
      projects in userspace need this to run unprivileged and nested
      unprivileged containers (cf. [2]).
      
      Before we can port any filesystem that is mountable with an idmapping to
      support idmapped mounts we need to first extend the mapping helpers to
      account for the filesystem's idmapping. This again, is explained at
      length in our documentation at [3] but I'll give an overview here again.
      
      Currently, the low-level mapping helpers implement the remapping
      algorithms described in [3] in a simplified manner. Because we could
      rely on the fact that all filesystems supporting idmapped mounts are
      mounted without an idmapping the translation step from or into the
      filesystem idmapping could be skipped.
      
      In order to support idmapped mounts of filesystems mountable with an
      idmapping the translation step we were able to skip before cannot be
      skipped anymore. A filesystem mounted with an idmapping is very likely
      to not use an identity mapping and will instead use a non-identity
      mapping. So the translation step from or into the filesystem's idmapping
      in the remapping algorithm cannot be skipped for such filesystems. More
      details with examples can be found in [3].
      
      This patch adds a few new and prepares some already existing low-level
      mapping helpers to perform the full translation algorithm explained in
      [3]. The low-level helpers can be written in a way that they only
      perform the additional translation step when the filesystem is indeed
      mounted with an idmapping.
      
      If the low-level helpers detect that they are not dealing with an
      idmapped mount they can simply return the relevant k{g,u}id unchanged;
      no remapping needs to be performed at all. The no_idmapping() helper
      detects whether the shortcut can be used.
      
      If the low-level helpers detect that they are dealing with an idmapped
      mount but the underlying filesystem is mounted without an idmapping we
      can rely on the previous shortcut and can continue to skip the
      translation step from or into the filesystem's idmapping.
      
      These checks guarantee that only the minimal amount of work is
      performed. As before, if idmapped mounts aren't used the low-level
      helpers are idempotent and no work is performed at all.
      
      This patch adds the helpers mapped_k{g,u}id_fs() and
      mapped_k{g,u}id_user(). Following patches will port all places to
      replace the old k{g,u}id_into_mnt() and k{g,u}id_from_mnt() with these
      two new helpers. After the conversion is done k{g,u}id_into_mnt() and
      k{g,u}id_from_mnt() will be removed. This also concludes the renaming of
      the mapping helpers we started in [4]. Now, all mapping helpers will
      start with the "mapped_" prefix, making everything nice and consistent.
      
      The mapped_k{g,u}id_fs() helpers replace the k{g,u}id_into_mnt()
      helpers. They are to be used when k{g,u}ids are to be mapped from the
      vfs, e.g. from struct inode's i_{g,u}id.  Conversely, the
      mapped_k{g,u}id_user() helpers replace the k{g,u}id_from_mnt() helpers.
      They are to be used when k{g,u}ids are to be written to disk, e.g. when
      entering from a system call to change ownership of a file.
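
The two directions described above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel implementation: `struct idmap`, `initial_map`, `map_up()`, `map_down()`, `mapped_kid_fs()` and `mapped_kid_user()` are hypothetical names, each idmapping is reduced to a single extent, and "not an idmapped mount" is modelled as the mount and the filesystem sharing the same mapping.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_ID UINT32_MAX	/* stands in for an invalid k{g,u}id */

/* Hypothetical one-extent idmapping: u<first_user>:k<first_kernel>:r<count>. */
struct idmap {
	uint32_t first_user;
	uint32_t first_kernel;
	uint32_t count;
};

/* Model of the initial (identity) idmapping. */
static const struct idmap initial_map = { 0, 0, UINT32_MAX };

/* Userspace id -> kernel id within one idmapping. */
static inline uint32_t map_up(const struct idmap *map, uint32_t id)
{
	assert(map);
	if (id < map->first_user || id - map->first_user >= map->count)
		return INVALID_ID;
	return map->first_kernel + (id - map->first_user);
}

/* Kernel id -> userspace id within one idmapping. */
static inline uint32_t map_down(const struct idmap *map, uint32_t kid)
{
	assert(map);
	if (kid < map->first_kernel || kid - map->first_kernel >= map->count)
		return INVALID_ID;
	return map->first_user + (kid - map->first_kernel);
}

/* Model of the "_fs" direction: an id coming from the vfs, e.g. an inode owner. */
static inline uint32_t mapped_kid_fs(const struct idmap *mnt,
				     const struct idmap *fs, uint32_t kid)
{
	uint32_t id;

	if (mnt == fs)			/* assumed shortcut: no idmapped mount */
		return kid;
	if (fs == &initial_map)		/* filesystem mounted without an idmapping */
		id = kid;
	else
		id = map_down(fs, kid);	/* the step that can no longer be skipped */
	if (id == INVALID_ID)
		return INVALID_ID;
	return map_up(mnt, id);
}

/* Model of the "_user" direction: an id that is about to be written to disk. */
static inline uint32_t mapped_kid_user(const struct idmap *mnt,
				       const struct idmap *fs, uint32_t kid)
{
	uint32_t id;

	if (mnt == fs)
		return kid;
	id = map_down(mnt, kid);
	if (id == INVALID_ID)
		return INVALID_ID;
	if (fs == &initial_map)
		return id;
	return map_up(fs, id);
}
```

With a mount mapping of u0:k10000:r10000 and a filesystem mapping of u0:k20000:r10000, the model maps an inode owner of 21000 to 11000 in the "_fs" direction and maps 11000 back to 21000 in the "_user" direction.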
      
      This patch only introduces the helpers. It doesn't yet convert the
      relevant places to account for filesystems mounted with an idmapping.
      
      [1]: commit 2ca4dcc4 ("fs/mount_setattr: tighten permission checks")
      [2]: https://github.com/containers/podman/issues/10374
      [3]: Documentation/filesystems/idmappings.rst
      [4]: commit a65e58e7 ("fs: document and rename fsid helpers")
      
      Link: https://lore.kernel.org/r/20211123114227.3124056-5-brauner@kernel.org (v1)
      Link: https://lore.kernel.org/r/20211130121032.3753852-5-brauner@kernel.org (v2)
      Link: https://lore.kernel.org/r/20211203111707.3901969-5-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fs: tweak fsuidgid_has_mapping() · 3374eb1b
      Christian Brauner authored
      commit 476860b3 upstream.
      
      If the caller's fs{g,u}id aren't mapped in the mount's idmapping we can
      return early and skip the check whether the mapped fs{g,u}id also have a
      mapping in the filesystem's idmapping. If the fs{g,u}id aren't mapped in
      the mount's idmapping they consequently can't be mapped in the
      filesystem's idmapping. So there's no point in checking that.
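
A minimal userspace sketch of this early return, assuming a one-extent model of an idmapping. The names `struct idmap`, `map_down()` and `fsids_have_mapping()` are hypothetical and only illustrate the ordering of the checks; they are not the kernel's types or signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_ID UINT32_MAX	/* stands in for an invalid k{g,u}id */

/* Hypothetical one-extent idmapping: u<first_user>:k<first_kernel>:r<count>. */
struct idmap {
	uint32_t first_user;
	uint32_t first_kernel;
	uint32_t count;
};

/* Kernel id -> userspace id within one idmapping, or INVALID_ID if unmapped. */
static inline uint32_t map_down(const struct idmap *map, uint32_t kid)
{
	assert(map);
	if (kid < map->first_kernel || kid - map->first_kernel >= map->count)
		return INVALID_ID;
	return map->first_user + (kid - map->first_kernel);
}

/*
 * Model of the tweaked check: map the caller's fsuid and fsgid through the
 * mount's idmapping first and bail out as soon as one of them is unmapped.
 * Only ids that survived that step are checked against the filesystem's
 * idmapping.
 */
static inline bool fsids_have_mapping(const struct idmap *mnt,
				      const struct idmap *fs,
				      uint32_t fsuid, uint32_t fsgid)
{
	uint32_t uid, gid;

	uid = map_down(mnt, fsuid);
	if (uid == INVALID_ID)
		return false;		/* early return: skip the fs check */
	gid = map_down(mnt, fsgid);
	if (gid == INVALID_ID)
		return false;
	return map_down(fs, uid) != INVALID_ID &&
	       map_down(fs, gid) != INVALID_ID;
}
```

The early returns do not change the result; they only avoid the second lookup when the first one has already failed.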
      
      Link: https://lore.kernel.org/r/20211123114227.3124056-4-brauner@kernel.org (v1)
      Link: https://lore.kernel.org/r/20211130121032.3753852-4-brauner@kernel.org (v2)
      Link: https://lore.kernel.org/r/20211203111707.3901969-4-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fs: move mapping helpers · 7bc23abc
      Christian Brauner authored
      commit a793d79e upstream.
      
      The low-level mapping helpers were so far crammed into fs.h. They are
      out of place there. The fs.h header should just contain the higher-level
      mapping helpers that interact directly with vfs objects such as struct
      super_block or struct inode and not the bare mapping helpers. Similarly,
      only vfs and specific fs code shall interact with low-level mapping
      helpers. And so they won't be made accessible automatically through
      regular {g,u}id helpers.
      
      Link: https://lore.kernel.org/r/20211123114227.3124056-3-brauner@kernel.org (v1)
      Link: https://lore.kernel.org/r/20211130121032.3753852-3-brauner@kernel.org (v2)
      Link: https://lore.kernel.org/r/20211203111707.3901969-3-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fs: add is_idmapped_mnt() helper · b3679e8b
      Christian Brauner authored
      commit bb49e9e7 upstream.
      
      Multiple places open-code the same check to determine whether a given
      mount is idmapped. Introduce a simple helper function that can be used
      instead. This allows us to get rid of the fragile open-coding. We will
      later change the check that is used to determine whether a given mount
      is idmapped. Introducing a helper allows us to do this in a single
      place instead of doing it for multiple places.
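
The idea of replacing an open-coded check with one helper can be sketched as follows. This is a userspace model only: `struct user_ns_model`, `struct mount_model`, `init_ns_model` and `is_idmapped_mnt_model()` are hypothetical stand-ins, and the comparison against the initial namespace is an assumption about what the open-coded check looked like.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a user namespace attached to a mount. */
struct user_ns_model {
	int unused;
};

/* Model of the initial user namespace. */
static const struct user_ns_model init_ns_model = { 0 };

/* Hypothetical stand-in for a mount carrying an idmapping. */
struct mount_model {
	const struct user_ns_model *mnt_userns;
};

/*
 * Single place that decides whether a mount is idmapped. Callers use this
 * instead of repeating the comparison, so a later change to the rule only
 * has to touch this function.
 */
static inline bool is_idmapped_mnt_model(const struct mount_model *mnt)
{
	assert(mnt);
	return mnt->mnt_userns != &init_ns_model;
}
```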
      
      Link: https://lore.kernel.org/r/20211123114227.3124056-2-brauner@kernel.org (v1)
      Link: https://lore.kernel.org/r/20211130121032.3753852-2-brauner@kernel.org (v2)
      Link: https://lore.kernel.org/r/20211203111707.3901969-2-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc/ftrace: Remove ftrace init tramp once kernel init is complete · ab0b6dc5
      Naveen N. Rao authored
      
      
      commit 84ade0a6 upstream.
      
      Stop using the ftrace trampoline for init section once kernel init is
      complete.
      
      Fixes: 67361cf8 ("powerpc/ftrace: Handle large kernel configs")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20220516071422.463738-1-naveen.n.rao@linux.vnet.ibm.com
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: only bother with sync_filesystem during readonly remount · ce6bfe55
      Darrick J. Wong authored
      
      
      [ Upstream commit b97cca3b ]
      
      In commit 02b9984d, we pushed a sync_filesystem() call from the VFS
      into xfs_fs_remount.  The only time that we ever need to push dirty file
      data or metadata to disk for a remount is if we're remounting the
      filesystem read only, so this really could be moved to xfs_remount_ro.
      
      Once we've moved the call site, actually check the return value from
      sync_filesystem.
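
The resulting control flow can be sketched in a userspace model. The types and functions here (`struct fs_model`, `sync_filesystem_model()`, `remount_ro_model()`, `remount_rw_model()`, `remount_model()`) are hypothetical; the sketch only shows that the sync happens on the read-only transition and that its return value aborts the remount.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical filesystem state; sync_result is what the fake sync returns. */
struct fs_model {
	bool readonly;
	int sync_result;
	int sync_calls;
};

/* Stand-in for flushing dirty data and metadata to disk. */
static inline int sync_filesystem_model(struct fs_model *fs)
{
	fs->sync_calls++;
	return fs->sync_result;
}

/* Read-only transition: sync first and propagate any failure. */
static inline int remount_ro_model(struct fs_model *fs)
{
	int error;

	assert(fs);
	error = sync_filesystem_model(fs);
	if (error)
		return error;	/* stay read-write if the sync failed */
	fs->readonly = true;
	return 0;
}

/* Read-write transition: nothing needs to be pushed to disk here. */
static inline int remount_rw_model(struct fs_model *fs)
{
	assert(fs);
	fs->readonly = false;
	return 0;
}

/* Dispatcher: no unconditional sync at this level any more. */
static inline int remount_model(struct fs_model *fs, bool want_ro)
{
	if (want_ro == fs->readonly)
		return 0;
	return want_ro ? remount_ro_model(fs) : remount_rw_model(fs);
}
```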
      
      Fixes: 02b9984d ("fs: push sync_filesystem() down to the file system's remount_fs()")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: prevent UAF in xfs_log_item_in_current_chkpt · 3465b167
      Darrick J. Wong authored
      
      
      [ Upstream commit f8d92a66 ]
      
      While I was running with KASAN and lockdep enabled, I stumbled upon a
      KASAN report about a UAF to a freed CIL checkpoint.  Looking at the
      comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
      that the original patch to xfs_defer_finish_noroll should have done
      something to lock the CIL to prevent it from switching the CIL contexts
      while the predicate runs.
      
      For upper level code that needs to know if a given log item is new
      enough not to need relogging, add a new wrapper that takes the CIL
      context lock long enough to sample the current CIL context.  This is
      kind of racy in that the CIL can switch the contexts immediately after
      sampling, but that's ok because the consequence is that the defer ops
      code is a little slow to relog items.
      
       ==================================================================
       BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs]
       Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999
      
       CPU: 1 PID: 527999 Comm: fsstress Tainted: G      D      5.16.0-rc4-xfsx #rc4
       Call Trace:
        <TASK>
        dump_stack_lvl+0x45/0x59
        print_address_description.constprop.0+0x1f/0x140
        kasan_report.cold+0x83/0xdf
        xfs_log_item_in_current_chkpt+0x139/0x160
        xfs_defer_finish_noroll+0x3bb/0x1e30
        __xfs_trans_commit+0x6c8/0xcf0
        xfs_reflink_remap_extent+0x66f/0x10e0
        xfs_reflink_remap_blocks+0x2dd/0xa90
        xfs_file_remap_range+0x27b/0xc30
        vfs_dedupe_file_range_one+0x368/0x420
        vfs_dedupe_file_range+0x37c/0x5d0
        do_vfs_ioctl+0x308/0x1260
        __x64_sys_ioctl+0xa1/0x170
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f2c71a2950b
       Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
      ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
      f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
       RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
       RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b
       RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004
       RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003
       R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
       R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20
        </TASK>
      
       Allocated by task 464064:
        kasan_save_stack+0x1e/0x50
        __kasan_kmalloc+0x81/0xa0
        kmem_alloc+0xcd/0x2c0 [xfs]
        xlog_cil_ctx_alloc+0x17/0x1e0 [xfs]
        xlog_cil_push_work+0x141/0x13d0 [xfs]
        process_one_work+0x7f6/0x1380
        worker_thread+0x59d/0x1040
        kthread+0x3b0/0x490
        ret_from_fork+0x1f/0x30
      
       Freed by task 51:
        kasan_save_stack+0x1e/0x50
        kasan_set_track+0x21/0x30
        kasan_set_free_info+0x20/0x30
        __kasan_slab_free+0xed/0x130
        slab_free_freelist_hook+0x7f/0x160
        kfree+0xde/0x340
        xlog_cil_committed+0xbfd/0xfe0 [xfs]
        xlog_cil_process_committed+0x103/0x1c0 [xfs]
        xlog_state_do_callback+0x45d/0xbd0 [xfs]
        xlog_ioend_work+0x116/0x1c0 [xfs]
        process_one_work+0x7f6/0x1380
        worker_thread+0x59d/0x1040
        kthread+0x3b0/0x490
        ret_from_fork+0x1f/0x30
      
       Last potentially related work creation:
        kasan_save_stack+0x1e/0x50
        __kasan_record_aux_stack+0xb7/0xc0
        insert_work+0x48/0x2e0
        __queue_work+0x4e7/0xda0
        queue_work_on+0x69/0x80
        xlog_cil_push_now.isra.0+0x16b/0x210 [xfs]
        xlog_cil_force_seq+0x1b7/0x850 [xfs]
        xfs_log_force_seq+0x1c7/0x670 [xfs]
        xfs_file_fsync+0x7c1/0xa60 [xfs]
        __x64_sys_fsync+0x52/0x80
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
       The buggy address belongs to the object at ffff88804ea5f600
        which belongs to the cache kmalloc-256 of size 256
       The buggy address is located 8 bytes inside of
        256-byte region [ffff88804ea5f600, ffff88804ea5f700)
       The buggy address belongs to the page:
       page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e
       head:ffffea00013a9780 order:1 compound_mapcount:0
       flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff)
       raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40
       raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                             ^
        ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ==================================================================
      
      Fixes: 4e919af7 ("xfs: periodically relog deferred intent items")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: check sb_meta_uuid for dabuf buffer recovery · 4f0c91ab
      Dave Chinner authored
      
      
      [ Upstream commit 09654ed8 ]
      
      Got a report that a repeated crash test of a container host would
      eventually fail with a log recovery error preventing the system from
      mounting the root filesystem. It manifested as a directory leaf node
      corruption on writeback like so:
      
       XFS (loop0): Mounting V5 Filesystem
       XFS (loop0): Starting recovery (logdev: internal)
       XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158
       XFS (loop0): Unmount and run xfs_repair
       XFS (loop0): First 128 bytes of corrupted metadata buffer:
       00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b  ........=.......
       00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc  .......X...)....
       00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23  ..x..~J}.S...G.#
       00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00  .........C......
       00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a  ................
       00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50  .5y....0.......P
       00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4  .@.......A......
       00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c  .b.......P!A....
       XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514).  Shutting down.
       XFS (loop0): Please unmount the filesystem and rectify the problem(s)
       XFS (loop0): log mount/recovery failed: error -117
       XFS (loop0): log mount failed
      
      Tracing indicated that we were recovering changes from a transaction
      at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57.
      That is, log recovery was overwriting a buffer with newer changes on
      disk than was in the transaction. Tracing indicated that we were
      hitting the "recovery immediately" case in
      xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the
      buffer.
      
      The code was extracting the LSN correctly, then ignoring it because
      the UUID in the buffer did not match the superblock UUID. The
      problem arises because the UUID check uses the wrong UUID - it
      should be checking the sb_meta_uuid, not sb_uuid. This filesystem
      has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the
      correct matching sb_meta_uuid in it, it's just the code checked it
      against the wrong superblock uuid.
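
The effect of comparing against the right UUID can be sketched in a userspace model. Everything named here (`struct sb_model`, `struct da_buf_model`, `make_lsn()`, `buf_recovery_lsn()`, `should_replay()`, `LSN_RECOVER_NOW`) is hypothetical, and the packing of an LSN as cycle and block in one 64-bit value is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UUID_LEN 16
#define LSN_RECOVER_NOW (-1)	/* "recover immediately": ignore the buffer LSN */

/* Hypothetical superblock carrying both UUIDs. */
struct sb_model {
	uint8_t sb_uuid[UUID_LEN];
	uint8_t sb_meta_uuid[UUID_LEN];
};

/* Hypothetical on-disk buffer header: the UUID stamped into it and its LSN. */
struct da_buf_model {
	uint8_t uuid[UUID_LEN];
	int64_t lsn;
};

/* Pack cycle/block the way the message prints them, e.g. 0x29/0x1d57. */
static inline int64_t make_lsn(uint32_t cycle, uint32_t block)
{
	return ((int64_t)cycle << 32) | block;
}

/*
 * Trust the LSN in the buffer only if the buffer belongs to this filesystem.
 * The comparison uses sb_meta_uuid; comparing against sb_uuid is the bug the
 * message describes.
 */
static inline int64_t buf_recovery_lsn(const struct sb_model *sb,
				       const struct da_buf_model *buf)
{
	assert(sb && buf);
	if (memcmp(buf->uuid, sb->sb_meta_uuid, UUID_LEN) != 0)
		return LSN_RECOVER_NOW;
	return buf->lsn;
}

/* Replay a logged change only if the buffer on disk is older than it. */
static inline bool should_replay(const struct sb_model *sb,
				 const struct da_buf_model *buf,
				 int64_t trans_lsn)
{
	int64_t lsn = buf_recovery_lsn(sb, buf);

	if (lsn == LSN_RECOVER_NOW)
		return true;
	return lsn < trans_lsn;
}
```

With the LSNs quoted above, a buffer at 0x29/0x1d57 that carries the matching sb_meta_uuid is not overwritten by the transaction at 0x29/0x1c16 in this model.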
      
      There is no corruption in the filesystem, and failing to recover the
      buffer due to a write verifier failure means the recovery bug did
      not propagate the corruption to disk. Hence there is no corruption
      before or after this bug has manifested, the impact is limited
      simply to an unmountable filesystem....
      
      This was missed back in 2015 during an audit of incorrect sb_uuid
      usage that resulted in commit fcfbe2c4 ("xfs: log recovery needs
      to validate against sb_meta_uuid") that fixed the magic32 buffers to
      validate against sb_meta_uuid instead of sb_uuid. It missed the
      magicda buffers....
      
      Fixes: ce748eaa ("xfs: create new metadata UUID field and incompat flag")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: remove all COW fork extents when remounting readonly · c4f376ba
      Darrick J. Wong authored
      
      
      [ Upstream commit 089558bc ]
      
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro			# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw			# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that incore
      COW forks of inode (A) are now inconsistent with the ondisk metadata.  If
      one of those former COW extents is allocated and mapped into another
      file (B) and someone triggers a COW to the stale reservation in (A), A's
      dirty data will be written into (B) and once that's done, those blocks
      will be transferred to (A)'s data fork without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  Solve this race by forcing the xfs_blockgc_free_space to run
      synchronously, which causes xfs_icwalk to return to inodes that were
      skipped because the blockgc code couldn't take the IOLOCK.  This is safe
      to do here because the VFS has already prohibited new writer threads.
      
      Fixes: 10ddf64e ("xfs: remove leftover CoW reservations when remounting ro")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: Fix the free logic of state in xfs_attr_node_hasname · 40de647b
      Yang Xu authored
      
      
      [ Upstream commit a1de97fe ]
      
      When testing xfstests xfs/126 on the latest upstream kernel, it will hang on some machines.
      After adding a getxattr operation once the xattr is corrupted, I can reproduce it 100% of the time.
      
      The deadlock is as below:
      [983.923403] task:setfattr        state:D stack:    0 pid:17639 ppid: 14687 flags:0x00000080
      [  983.923405] Call Trace:
      [  983.923410]  __schedule+0x2c4/0x700
      [  983.923412]  schedule+0x37/0xa0
      [  983.923414]  schedule_timeout+0x274/0x300
      [  983.923416]  __down+0x9b/0xf0
      [  983.923451]  ? xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
      [  983.923453]  down+0x3b/0x50
      [  983.923471]  xfs_buf_lock+0x33/0xf0 [xfs]
      [  983.923490]  xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
      [  983.923508]  xfs_buf_get_map+0x4c/0x320 [xfs]
      [  983.923525]  xfs_buf_read_map+0x53/0x310 [xfs]
      [  983.923541]  ? xfs_da_read_buf+0xcf/0x120 [xfs]
      [  983.923560]  xfs_trans_read_buf_map+0x1cf/0x360 [xfs]
      [  983.923575]  ? xfs_da_read_buf+0xcf/0x120 [xfs]
      [  983.923590]  xfs_da_read_buf+0xcf/0x120 [xfs]
      [  983.923606]  xfs_da3_node_read+0x1f/0x40 [xfs]
      [  983.923621]  xfs_da3_node_lookup_int+0x69/0x4a0 [xfs]
      [  983.923624]  ? kmem_cache_alloc+0x12e/0x270
      [  983.923637]  xfs_attr_node_hasname+0x6e/0xa0 [xfs]
      [  983.923651]  xfs_has_attr+0x6e/0xd0 [xfs]
      [  983.923664]  xfs_attr_set+0x273/0x320 [xfs]
      [  983.923683]  xfs_xattr_set+0x87/0xd0 [xfs]
      [  983.923686]  __vfs_removexattr+0x4d/0x60
      [  983.923688]  __vfs_removexattr_locked+0xac/0x130
      [  983.923689]  vfs_removexattr+0x4e/0xf0
      [  983.923690]  removexattr+0x4d/0x80
      [  983.923693]  ? __check_object_size+0xa8/0x16b
      [  983.923695]  ? strncpy_from_user+0x47/0x1a0
      [  983.923696]  ? getname_flags+0x6a/0x1e0
      [  983.923697]  ? _cond_resched+0x15/0x30
      [  983.923699]  ? __sb_start_write+0x1e/0x70
      [  983.923700]  ? mnt_want_write+0x28/0x50
      [  983.923701]  path_removexattr+0x9b/0xb0
      [  983.923702]  __x64_sys_removexattr+0x17/0x20
      [  983.923704]  do_syscall_64+0x5b/0x1a0
      [  983.923705]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      [  983.923707] RIP: 0033:0x7f080f10ee1b
      
      When getxattr calls the xfs_attr_node_get function, xfs_da3_node_lookup_int fails with EFSCORRUPTED in
      xfs_attr_node_hasname because we used blocktrash to randomize it in xfs/126. So it
      frees the state internally and xfs_attr_node_get doesn't do the xfs_buf_trans release job.
      
      Then subsequent removexattr will hang because of it.
      
      This bug was introduced by kernel commit 07120f1a ("xfs: Add xfs_has_attr and subroutines").
      That commit added the xfs_attr_node_hasname() helper and stated that the caller is responsible
      for freeing the state in this case. However, xfs_attr_node_hasname() frees the state itself,
      instead of leaving it to the caller, if xfs_da3_node_lookup_int() fails.
      
      Fix this bug by moving the freeing of the state into the caller.
      
      Also, use "goto error/out" instead of returning the error directly in the
      xfs_attr_node_addname_find_attr() and xfs_attr_node_removename_setup() functions, because we
      must free the state ourselves.
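
The ownership rule described above can be illustrated with a small user-space sketch. This is not the actual patch: struct lookup_state, node_hasname(), node_get(), the TOY_* codes and the live_states counter are all hypothetical stand-ins, assumed only for illustration. The point is that the helper hands the state to the caller on every path, including failure, and the caller releases it at a single "goto" label.

```c
#include <stdlib.h>

/* Hypothetical error codes for this sketch (not kernel errno values). */
#define TOY_ENOMEM   1
#define TOY_EEXIST   2
#define TOY_ECORRUPT 3

/* Hypothetical stand-in for the lookup state; not the real XFS structure. */
struct lookup_state {
    int buffers_held;
};

static int live_states; /* states allocated but not yet freed */

static struct lookup_state *state_alloc(void)
{
    struct lookup_state *s = calloc(1, sizeof(*s));

    if (s)
        live_states++;
    return s;
}

static void state_free(struct lookup_state *s)
{
    if (!s)
        return;
    live_states--;
    free(s);
}

/*
 * Helper: allocates the state and passes it to the caller through
 * *statep on every path after allocation, including lookup failure.
 * It never frees the state itself.
 */
static int node_hasname(int simulate_corruption, struct lookup_state **statep)
{
    struct lookup_state *s = state_alloc();

    if (!s)
        return -TOY_ENOMEM;
    *statep = s;
    s->buffers_held = 1;
    if (simulate_corruption)
        return -TOY_ECORRUPT;
    return -TOY_EEXIST; /* attribute found */
}

/* Caller: owns the state and frees it at one exit label. */
static int node_get(int simulate_corruption)
{
    struct lookup_state *state = NULL;
    int error;

    error = node_hasname(simulate_corruption, &state);
    if (error != -TOY_EEXIST)
        goto out_release;
    error = 0;
out_release:
    state_free(state); /* safe when state is NULL */
    return error;
}
```

With this shape, node_get(1) returns the corruption error and still leaves live_states at zero, because cleanup no longer depends on which side of the call detected the failure.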
      
      Fixes: 07120f1a ("xfs: Add xfs_has_attr and subroutines")
      Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      40de647b
    • xfs: punch out data fork delalloc blocks on COW writeback failure · 0e84e17c
      Brian Foster authored
      
      
      [ Upstream commit 5ca5916b ]
      
      If writeback I/O to a COW extent fails, the COW fork blocks are
      punched out and the data fork blocks left alone. It is possible for
      COW fork blocks to overlap non-shared data fork blocks (due to
      cowextsz hint prealloc), however, and writeback unconditionally maps
      to the COW fork whenever blocks exist at the corresponding offset of
      the page undergoing writeback. This means it's quite possible for a
      COW fork extent to overlap delalloc data fork blocks, writeback to
      convert and map to the COW fork blocks, writeback to fail, and
      finally for ioend completion to cancel the COW fork blocks and leave
      stale data fork delalloc blocks around in the inode. The blocks are
      effectively stale because writeback failure also discards dirty page
      state.
      
      If this occurs, it is likely to trigger assert failures, free space
      accounting corruption and failures in unrelated file operations. For
      example, a subsequent reflink attempt of the affected file to a new
      target file will trip over the stale delalloc in the source file and
      fail. Several of these issues are occasionally reproduced by
      generic/648, but are reproducible on demand with the right sequence
      of operations and timely I/O error injection.
      
      To fix this problem, update the ioend failure path to also punch out
      underlying data fork delalloc blocks on I/O error. This is analogous
      to the writeback submission failure path in xfs_discard_page() where
      we might fail to map data fork delalloc blocks and consistent with
      the successful COW writeback completion path, which is responsible
      for unmapping from the data fork and remapping in COW fork blocks.
      
      Fixes: 787eb485 ("xfs: fix and streamline error handling in xfs_end_io")
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0e84e17c
    • xfs: use kmem_cache_free() for kmem_cache objects · 71a218ca
      Rustam Kovhaev authored
      
      
      [ Upstream commit c30a0cbd ]
      
      For kmalloc() allocations, SLOB prepends a 4-byte header to each block
      and stores the size of the allocated block in that header.
      Blocks allocated with kmem_cache_alloc() do not have that header.
      
      SLOB explodes when you allocate memory with kmem_cache_alloc() and then
      try to free it with kfree() instead of kmem_cache_free().
      SLOB will assume that there is a header when there is none, read
      garbage into the size variable and corrupt the adjacent objects, which
      eventually leads to a hang or panic.
      
      Let's make XFS work with SLOB by using the proper free function.
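
The header mismatch can be reproduced in a toy user-space model. This is only a sketch: the arena, toy_kmalloc(), toy_cache_alloc() and toy_size_from_header() are hypothetical names and do not reflect SLOB's real implementation. It assumes only what the message states, namely a 4-byte size header in front of kmalloc()-style blocks and none in front of cache objects.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HDR 4 /* size of the header placed before kmalloc-style blocks */

static unsigned char arena[256]; /* toy backing store */
static size_t arena_used;

/* kmalloc-style allocation: store the size in a header, return the payload. */
static void *toy_kmalloc(uint32_t size)
{
    unsigned char *p = arena + arena_used;

    if (arena_used + TOY_HDR + size > sizeof(arena))
        return NULL;
    memcpy(p, &size, TOY_HDR);
    arena_used += TOY_HDR + size;
    return p + TOY_HDR;
}

/* cache-style allocation: no header, the cache itself knows the object size. */
static void *toy_cache_alloc(uint32_t object_size)
{
    unsigned char *p = arena + arena_used;

    if (arena_used + object_size > sizeof(arena))
        return NULL;
    arena_used += object_size;
    return p;
}

/* What a kfree()-style path assumes: the size sits right before the pointer. */
static uint32_t toy_size_from_header(const void *obj)
{
    uint32_t size;

    memcpy(&size, (const unsigned char *)obj - TOY_HDR, TOY_HDR);
    return size;
}
```

If a 16-byte toy_kmalloc() block is filled with 0xAB and a cache object is allocated right after it, toy_size_from_header() on the cache object returns 0xABABABAB: the "size" is really the tail of the neighbouring payload, which is the kind of garbage the message describes.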
      
      Fixes: 9749fee8 ("xfs: enable the xfs_defer mechanism to process extents to free")
      Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      71a218ca
    • bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init() · 1cdcd496
      Coly Li authored
      
      
      commit 7d6b902e upstream.
      
      The local variables check_state (in bch_btree_check()) and state (in
      bch_sectors_dirty_init()) should be fully zeroed, because before they
      were moved onto the stack they were dynamically allocated with kzalloc().
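
A minimal user-space sketch of the idea follows. struct check_state and count_unstarted() are hypothetical stand-ins, not the bcache structures; the sketch only shows that a stack variable needs an explicit memset() to get the all-zero contents a zeroing allocator used to provide.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_THREADS 4

/* Hypothetical stand-in for a per-check state structure. */
struct check_state {
    int total_threads;
    int started;
    struct {
        int result;
        void *thread;
    } infos[TOY_MAX_THREADS];
};

/* Counts the slots that still look untouched after initialization. */
static int count_unstarted(void)
{
    struct check_state state; /* on the stack: contents are indeterminate */
    int unstarted = 0;

    /* Restore the all-zero guarantee the zeroing heap allocation gave. */
    memset(&state, 0, sizeof(state));
    state.total_threads = TOY_MAX_THREADS;

    for (int i = 0; i < state.total_threads; i++) {
        if (state.infos[i].thread == NULL && state.infos[i].result == 0)
            unstarted++;
    }
    return unstarted;
}
```

Without the memset(), the loop would read indeterminate stack contents, so code that relied on zeroed fields while the structure lived on the heap could misbehave.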
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20220527152818.27545-2-colyli@suse.de
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1cdcd496
    • x86, kvm: use proper ASM macros for kvm_vcpu_is_preempted · edbaf6e5
      Greg Kroah-Hartman authored
      
      
      The build rightfully complains about:
      	arch/x86/kernel/kvm.o: warning: objtool: __raw_callee_save___kvm_vcpu_is_preempted()+0x12: missing int3 after ret
      
      because the ASM_RET macro is not being used correctly in kvm_vcpu_is_preempted().
      
      This was hand-fixed-up in the kvm merge commit a4cfff3f ("Merge branch
      'kvm-older-features' into HEAD"), which of course cannot be backported to
      stable kernels, so just fix this up directly instead.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      edbaf6e5
    • tick/nohz: unexport __init-annotated tick_nohz_full_setup() · f4a80ec8
      Masahiro Yamada authored
      
      
      commit 23900951 upstream.
      
      EXPORT_SYMBOL and __init are a bad combination because the .init.text
      section is freed after initialization. Hence, modules cannot
      use symbols annotated __init. Accessing a freed symbol may end in
      a kernel panic.
      
      modpost used to detect this, but the check had been broken for a decade.
      
      Commit 28438794 ("modpost: fix section mismatch check for exported
      init/exit sections") fixed it, so modpost started to warn about it again,
      and then this showed up:
      
          MODPOST vmlinux.symvers
        WARNING: modpost: vmlinux.o(___ksymtab_gpl+tick_nohz_full_setup+0x0): Section mismatch in reference from the variable __ksymtab_tick_nohz_full_setup to the function .init.text:tick_nohz_full_setup()
        The symbol tick_nohz_full_setup is exported and annotated __init
        Fix this by removing the __init annotation of tick_nohz_full_setup or drop the export.
      
      Drop the export because tick_nohz_full_setup() is only called from the
      built-in code in kernel/sched/isolation.c.
      
      Fixes: ae9e557b ("time: Export tick start/stop functions for rcutorture")
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Tested-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Backlund <tmb@tmb.nu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f4a80ec8
  2. Jun 29, 2022