Skip to content
  1. Dec 11, 2014
    • Filipe Manana's avatar
      Btrfs: remove non-sense btrfs_error_discard_extent() function · 1edb647b
      Filipe Manana authored
      
      
      It doesn't do anything special, it just calls btrfs_discard_extent(),
      so just remove it.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1edb647b
    • Filipe Manana's avatar
      Btrfs: fix fs corruption on transaction abort if device supports discard · 678886bd
      Filipe Manana authored
      When we abort a transaction we iterate over all the ranges marked as dirty
      in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
      from those trees, add them back (unpin) to the free space caches and, if
      the fs was mounted with "-o discard", perform a discard on those regions.
      Also, after adding the regions to the free space caches, a fitrim ioctl call
      can see those ranges in a block group's free space cache and perform a discard
      on the ranges, so the same issue can happen without "-o discard" as well.
      
      This causes corruption, affecting one or multiple btree nodes (in the worst
      case leaving the fs unmountable) because some of those ranges (the ones in
      the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
      referred by the last committed super block - breaking the rule that anything
      that was committed by a transaction is untouched until the next transaction
      commits successfully.
      
      I ran into this while running in a loop (for several hours) the fstest that
      I recently submitted:
      
        [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
      
      The corruption always happened when a transaction aborted and then fsck complained
      like this:
      
         _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
         *** fsck.btrfs output ***
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         read block failed check_tree_block
         Couldn't open file system
      
      In this case 94945280 corresponded to the root of a tree.
      Using frace what I observed was the following sequence of steps happened:
      
         1) transaction N started, fs_info->pinned_extents pointed to
            fs_info->freed_extents[0];
      
         2) node/eb 94945280 is created;
      
         3) eb is persisted to disk;
      
         4) transaction N commit starts, fs_info->pinned_extents now points to
            fs_info->freed_extents[1], and transaction N completes;
      
         5) transaction N + 1 starts;
      
         6) eb is COWed, and btrfs_free_tree_block() called for this eb;
      
         7) eb range (94945280 to 94945280 + 16Kb) is added to
            fs_info->pinned_extents (fs_info->freed_extents[1]);
      
         8) Something goes wrong in transaction N + 1, like hitting ENOSPC
            for example, and the transaction is aborted, turning the fs into
            readonly mode. The stack trace I got for example:
      
            [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
            [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
            [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
            [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
            [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
            [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
            [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
            (...)
            [112065.264493] ---[ end trace dd7903a975a31a08 ]---
            [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
            [112065.264997] BTRFS info (device sdc): forced readonly
      
         9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
            fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
            turn calls btrfs_destroy_pinned_extent();
      
         10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
             marked as dirty in fs_info->freed_extents[], and for each one
             it calls discard, if the fs was mounted with "-o discard", and
             adds the range to the free space cache of the respective block
             group;
      
         11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
             sees the free space entries and performs a discard;
      
         12) After an umount and mount (or fsck), our eb's location on disk was full
             of zeroes, and it should have been untouched, because it was marked as
             dirty in the fs_info->pinned_extents tree, and therefore used by the
             trees that the last committed superblock points to.
      
      Fix this by not performing a discard and not adding the ranges to the free space
      caches - it's useless from this point since the fs is now in readonly mode and
      we won't write free space caches to disk anymore (otherwise we would leak space)
      nor any new superblock. By not adding the ranges to the free space caches, it
      prevents other code paths from allocating that space and write to it as well,
      therefore being safer and simpler.
      
      This isn't a new problem, as it's been present since 2011 (git commit
      acce952b
      
      ).
      
      Cc: stable@vger.kernel.org  # any kernel released after 2011-01-06
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      678886bd
    • Filipe Manana's avatar
      Btrfs: always clear a block group node when removing it from the tree · 01eacb27
      Filipe Manana authored
      
      
      Always clear a block group's rbnode after removing it from the rbtree to
      ensure that any tasks that might be holding a reference on the block group
      don't end up accessing stale rbnode left and right child pointers through
      next_block_group().
      
      This is a leftover from the change titled:
      "Btrfs: fix invalid block group rbtree access after bg is removed"
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      01eacb27
    • Filipe Manana's avatar
      Btrfs: ensure deletion from pinned_chunks list is protected · a1e7e16e
      Filipe Manana authored
      
      
      The call to remove_extent_mapping() actually deletes the extent map
      from the list it's included in - fs_info->pinned_chunks - and that
      list is protected by the chunk mutex. Therefore make that call
      while holding the chunk mutex and remove the redundant list delete
      call because it's a noop.
      
      This fixes an overlook of the patch titled
      "Btrfs: fix race between fs trimming and block group remove/allocation"
      following the same obvervation from the patch titled
      "Btrfs: fix unprotected deletion from pending_chunks list".
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a1e7e16e
  2. Dec 03, 2014
    • Chris Mason's avatar
    • Josef Bacik's avatar
      Btrfs: make get_caching_control unconditionally return the ctl · cb83b7b8
      Josef Bacik authored
      
      
      This was written when we didn't do a caching control for the fast free space
      cache loading.  However we started doing that a long time ago, and there is
      still a small window of time that we could be caching the block group the fast
      way, so if there is a caching_ctl at all on the block group just return it, the
      callers all wait properly for what they want.  Thanks,
      
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      cb83b7b8
    • Filipe Manana's avatar
      Btrfs: fix unprotected deletion from pending_chunks list · 8dbcd10f
      Filipe Manana authored
      
      
      On block group remove if the corresponding extent map was on the
      transaction->pending_chunks list, we were deleting the extent map
      from that list, through remove_extent_mapping(), without any
      synchronization with chunk allocation (which iterates that list
      and adds new elements to it). Fix this by ensure that this is done
      while the chunk mutex is held, since that's the mutex that protects
      the list in the chunk allocation code path.
      
      This applies on top (depends on) of my previous patch titled:
      "Btrfs: fix race between fs trimming and block group remove/allocation"
      
      But the issue in fact was already present before that change, it only
      became easier to hit after Josef's 3.18 patch that added automatic
      removal of empty block groups.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8dbcd10f
    • Filipe Manana's avatar
      Btrfs: fix fs mapping extent map leak · 495e64f4
      Filipe Manana authored
      
      
      On chunk allocation error (label "error_del_extent"), after adding the
      extent map to the tree and to the pending chunks list, we would leave
      decrementing the extent map's refcount by 2 instead of 3 (our allocation
      + tree reference + list reference).
      
      Also, on chunk/block group removal, if the block group was on the list
      pending_chunks we weren't decrementing the respective list reference.
      
      Detected by 'rmmod btrfs':
      
      [20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
      [20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-1+ #1
      [20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [20770.106130]  0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
      [20770.106132]  ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
      [20770.106134]  ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
      [20770.106136] Call Trace:
      [20770.106142]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
      [20770.106145]  [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
      [20770.106164]  [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
      [20770.106176]  [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
      [20770.106179]  [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
      [20770.106182]  [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
      [20770.106184]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
      
      This applies on top (depends on) of my previous patch titled:
      "Btrfs: fix race between fs trimming and block group remove/allocation"
      
      But the issue in fact was already present before that change, it only
      became easier to hit after Josef's 3.18 patch that added automatic
      removal of empty block groups.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      495e64f4
    • Filipe Manana's avatar
      Btrfs: fix memory leak after block remove + trimming · 946ddbe8
      Filipe Manana authored
      
      
      There was a free space entry structure memeory leak if a block
      group is remove while a free space entry is being trimmed, which
      the following diagram explains:
      
                 CPU 1                                          CPU 2
      
        btrfs_trim_block_group()
            trim_no_bitmap()
                remove free space entry from
                block group cache's rbtree
                do_trimming()
      
                                                      btrfs_remove_block_group()
                                                          btrfs_remove_free_space_cache()
      
                    add back free space entry to
                    block group's cache rbtree
        btrfs_put_block_group()
      
                                                          (...)
                                                          btrfs_put_block_group()
                                                              kfree(bg->free_space_ctl)
                                                              kfree(bg)
      
      The free space entry added after doing the discard of its respective
      range ends up never being freed.
      Detected after doing an "rmmod btrfs" after running the stress test
      recently submitted for fstests:
      
      [ 8234.642212] kmem_cache_destroy btrfs_free_space: Slab cache still has objects
      [ 8234.642657] CPU: 1 PID: 32276 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-2+ #1
      [ 8234.642660] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [ 8234.642664]  0000000000000000 ffff8801af1b3eb8 ffffffff8140c7b6 ffff8801dbedd0c0
      [ 8234.642670]  ffff8801af1b3ed0 ffffffff811149ce 0000000000000000 ffff8801af1b3ee0
      [ 8234.642676]  ffffffffa042dbe7 ffff8801af1b3ef0 ffffffffa0487422 ffff8801af1b3f78
      [ 8234.642682] Call Trace:
      [ 8234.642692]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
      [ 8234.642699]  [<ffffffff811149ce>] kmem_cache_destroy+0x4d/0x92
      [ 8234.642731]  [<ffffffffa042dbe7>] btrfs_destroy_cachep+0x63/0x76 [btrfs]
      [ 8234.642757]  [<ffffffffa0487422>] exit_btrfs_fs+0x9/0xbe7 [btrfs]
      [ 8234.642762]  [<ffffffff810a76a5>] SyS_delete_module+0x155/0x1c6
      [ 8234.642768]  [<ffffffff8122a7eb>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [ 8234.642773]  [<ffffffff814122d2>] system_call_fastpath+0x16/0x1b
      
      This applies on top (depends on) of my previous patch titled:
      "Btrfs: fix race between fs trimming and block group remove/allocation"
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      946ddbe8
    • Filipe Manana's avatar
      Btrfs: make btrfs_abort_transaction consider existence of new block groups · c92f6be3
      Filipe Manana authored
      
      
      If the transaction handle doesn't have used blocks but has created new block
      groups make sure we turn the fs into readonly mode too. This is because the
      new block groups didn't get all their metadata persisted into the chunk and
      device trees, and therefore if a subsequent transaction starts, allocates
      space from the new block groups, writes data or metadata into that space,
      commits successfully and then after we unmount and mount the filesystem
      again, the same space can be allocated again for a new block group,
      resulting in file data or metadata corruption.
      
      Example where we don't abort the transaction when we fail to finish the
      chunk allocation (add items to the chunk and device trees) and later a
      future transaction where the block group is removed fails because it can't
      find the chunk item in the chunk tree:
      
      [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
      [25230.404301] BTRFS: Transaction aborted (error -28)
      [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
      [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
      [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
      [25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
      [25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
      [25230.404334] Call Trace:
      [25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
      [25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
      [25230.404351]  [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
      [25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
      [25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs]
      [25230.404374]  [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
      [25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs]
      [25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [25230.404408]  [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
      [25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs]
      [25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
      [25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
      [25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs]
      [25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
      [25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
      [25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
      [25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
      [25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
      [25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
      [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
      [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
      [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c92f6be3
    • Filipe Manana's avatar
      Btrfs: fix race between writing free space cache and trimming · 55507ce3
      Filipe Manana authored
      
      
      Trimming is completely transactionless, and the way it operates consists
      of hiding free space entries from a block group, perform the trim/discard
      and then make the free space entries visible again.
      Therefore while a free space entry is being trimmed, we can have free space
      cache writing running in parallel (as part of a transaction commit) which
      will miss the free space entry. This means that an unmount (or crash/reboot)
      after that transaction commit and mount again before another transaction
      starts/commits after the discard finishes, we will have some free space
      that won't be used again unless the free space cache is rebuilt. After the
      unmount, fsck (btrfsck, btrfs check) reports the issue like the following
      example:
      
              *** fsck.btrfs output ***
              checking extents
              checking free space cache
              There is no free space entry for 521764864-521781248
              There is no free space entry for 521764864-1103101952
              cache appears valid but isnt 29360128
              Checking filesystem on /dev/sdc
              UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
              found 1235681286 bytes used err is -22
              (...)
      
      Another issue caused by this race is a crash while writing bitmap entries
      to the cache, because while the cache writeout task accesses the bitmaps,
      the trim task can be concurrently modifying the bitmap or worse might
      be freeing the bitmap. The later case results in the following crash:
      
      [55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
      [55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
      [55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
      [55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
      [55650.807166] RIP: 0010:[<ffffffffa037cf45>]  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
      [55650.807514] RSP: 0018:ffff88007153bc30  EFLAGS: 00010246
      [55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
      [55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
      [55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
      [55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
      [55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
      [55650.808017] FS:  0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
      [55650.808017] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
      [55650.808017] Stack:
      [55650.808017]  ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
      [55650.808017]  ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
      [55650.808017]  ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
      [55650.808017] Call Trace:
      [55650.808017]  [<ffffffffa037d2fa>] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
      [55650.808017]  [<ffffffffa037d959>] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
      [55650.808017]  [<ffffffffa033936b>] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
      [55650.808017]  [<ffffffffa03aa98e>] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
      [55650.808017]  [<ffffffff813eb9c7>] ? _raw_spin_lock+0xe/0x10
      [55650.808017]  [<ffffffffa0346e46>] btrfs_commit_transaction+0x411/0x882 [btrfs]
      [55650.808017]  [<ffffffffa03432a4>] transaction_kthread+0xf2/0x1a4 [btrfs]
      [55650.808017]  [<ffffffffa03431b2>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
      [55650.808017]  [<ffffffff8105966b>] kthread+0xb7/0xbf
      [55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [55650.808017]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
      [55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 <f3> a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
      [55650.808017] RIP  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
      [55650.808017]  RSP <ffff88007153bc30>
      [55650.815725] ---[ end trace 1c032e96b149ff86 ]---
      
      Fix this by serializing both tasks in such a way that cache writeout
      doesn't wait for the trim/discard of free space entries to finish and
      doesn't miss any free space entry.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      55507ce3
    • Filipe Manana's avatar
      Btrfs: fix race between fs trimming and block group remove/allocation · 04216820
      Filipe Manana authored
      
      
      Our fs trim operation, which is completely transactionless (doesn't start
      or joins an existing transaction) consists of visiting all block groups
      and then for each one to iterate its free space entries and perform a
      discard operation against the space range represented by the free space
      entries. However before performing a discard, the corresponding free space
      entry is removed from the free space rbtree, and when the discard completes
      it is added back to the free space rbtree.
      
      If a block group remove operation happens while the discard is ongoing (or
      before it starts and after a free space entry is hidden), we end up not
      waiting for the discard to complete, remove the extent map that maps
      logical address to physical addresses and the corresponding chunk metadata
      from the the chunk and device trees. After that and before the discard
      completes, the current running transaction can finish and a new one start,
      allowing for new block groups that map to the same physical addresses to
      be allocated and written to.
      
      So fix this by keeping the extent map in memory until the discard completes
      so that the same physical addresses aren't reused before it completes.
      
      If the physical locations that are under a discard operation end up being
      used for a new metadata block group for example, and dirty metadata extents
      are written before the discard finishes (the VM might call writepages() of
      our btree inode's i_mapping for example, or an fsync log commit happens) we
      end up overwriting metadata with zeroes, which leads to errors from fsck
      like the following:
      
              checking extents
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              owner ref check failed [833912832 16384]
              Errors found in extent allocation tree or chunk allocation
              checking free space cache
              checking fs roots
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              root 5 root dir 256 error
              root 5 inode 260 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 262 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 263 errors 2001, no inode item, link count wrong
              (...)
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      04216820
    • Zhao Lei's avatar
      Btrfs, replace: enable dev-replace for raid56 · 5d3edd8f
      Zhao Lei authored
      
      
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      5d3edd8f
    • Filipe Manana's avatar
      Btrfs: fix freeing used extents after removing empty block group · ae0ab003
      Filipe Manana authored
      
      
      There's a race between adding a block group to the list of the unused
      block groups and removing an unused block group (cleaner kthread) that
      leads to freeing extents that are in use or a crash during transaction
      commmit. Basically the cleaner kthread, when executing
      btrfs_delete_unused_bgs(), might catch the newly added block group to
      the list fs_info->unused_bgs and clear the range representing the whole
      group from fs_info->freed_extents[] before the task that added the block
      group to the list (running update_block_group()) marked the last freed
      extent as dirty in fs_info->freed_extents (pinned_extents).
      
      That is:
      
           CPU 1                                CPU 2
      
                                        btrfs_delete_unused_bgs()
      update_block_group()
         add block group to
         fs_info->unused_bgs
                                          got block group from the list
                                          clear_extent_bits for the whole
                                          block group range in freed_extents[]
         set_extent_dirty for the
         range covering the freed
         extent in freed_extents[]
         (fs_info->pinned_extents)
      
                                        block group deleted, and a new block
                                        group with the same logical address is
                                        created
      
                                        reserve space from the new block group
                                        for new data or metadata - the reserved
                                        space overlaps the range specified by
                                        CPU 1 for set_extent_dirty()
      
                                        commit transaction
                                          find all ranges marked as dirty in
                                          fs_info->pinned_extents, clear them
                                          and add them to the free space cache
      
      Alternatively, if CPU 2 doesn't create a new block group with the same
      logical address, we get a crash/BUG_ON at transaction commit when unpining
      extent ranges because we can't find a block group for the range marked as
      dirty by CPU 1. Sample trace:
      
      [ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc
      gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache
       sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
      [ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
      [ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
      [ 2163.429157] RIP: 0010:[<ffffffffa037728e>]  [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
      [ 2163.429562] RSP: 0018:ffff8801356bfda8  EFLAGS: 00010246
      [ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
      [ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
      [ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
      [ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
      [ 2163.430042] FS:  0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
      [ 2163.430042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
      [ 2163.430042] Stack:
      [ 2163.430042]  ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
      [ 2163.430042]  ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
      [ 2163.430042]  ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
      [ 2163.430042] Call Trace:
      [ 2163.430042]  [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
      [ 2163.430042]  [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
      [ 2163.430042]  [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
      [ 2163.430042]  [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
      [ 2163.430042]  [<ffffffff8105966b>] kthread+0xb7/0xbf
      [ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [ 2163.430042]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
      [ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      
      So fix this by making update_block_group() first set the range as dirty
      in pinned_extents before adding the block group to the unused_bgs list.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ae0ab003
    • Filipe Manana's avatar
      Btrfs: fix crash caused by block group removal · 4f69cb98
      Filipe Manana authored
      
      
      If we remove a block group (because it became empty), we might have left
      a caching_ctl structure in fs_info->caching_block_groups that points to
      the block group and is accessed at transaction commit time. This results
      in accessing an invalid or incorrect block group. This issue became visible
      after Josef's patch "Btrfs: remove empty block groups automatically".
      
      So if the block group is removed make sure we don't leave a dangling
      caching_ctl in caching_block_groups.
      
      Sample crash trace:
      
      [58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
      [58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
      [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
      [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
      [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
      [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
      [58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
      [58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
      [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
      [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
      [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
      [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
      [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
      [58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
      [58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
      [58380.443238] Stack:
      [58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
      [58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
      [58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
      [58380.443238] Call Trace:
      [58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
      [58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
      [58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
      [58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
      [58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
      [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
      [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      4f69cb98
    • Filipe Manana's avatar
      Btrfs: fix invalid block group rbtree access after bg is removed · 292cbd51
      Filipe Manana authored
      
      
      If we grab a block group, for example in btrfs_trim_fs(), we will be holding
      a reference on it but the block group can be removed after we got it (via
      btrfs_remove_block_group), which means it will no longer be part of the
      rbtree.
      
      However, btrfs_remove_block_group() was only calling rb_erase() which leaves
      the block group's rb_node left and right child pointers with the same content
      they had before calling rb_erase. This was dangerous because a call to
      next_block_group() would access the node's left and right child pointers (via
      rb_next), which can be no longer valid.
      
      Fix this by clearing a block group's node after removing it from the tree,
      and have next_block_group() do a tree search to get the next block group
      instead of using rb_next() if our block group was removed.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      292cbd51
    • Miao Xie's avatar
      Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 · 4245215d
      Miao Xie authored
      The commit c404e0dc
      
       (Btrfs: fix use-after-free in the finishing
      procedure of the device replace) fixed a use-after-free problem
      which happened when removing the source device at the end of device
      replace, but at that time, btrfs didn't support device replace
      on raid56, so we didn't fix the problem on the raid56 profile.
      Currently, we implemented device replace for raid56, so we need
      kick that problem out before we enable that function for raid56.
      
      The fix method is very simple, we just increase the bio per-cpu
      counter before we submit a raid56 io, and decrease the counter
      when the raid56 io ends.
      
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      4245215d
    • Miao Xie's avatar
      Btrfs, replace: write raid56 parity into the replace target device · 76035976
      Miao Xie authored
      
      
      This function reused the code of parity scrub, and we just write
      the right parity or corrected parity into the target device before
      the parity scrub end.
      
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      76035976
    • Miao Xie's avatar
      Btrfs, replace: write dirty pages into the replace target device · 2c8cdd6e
      Miao Xie authored
      
      
      The implementation is simple:
      - In order to avoid changing the code logic of btrfs_map_bio and
        RAID56, we add the stripes of the replace target devices at the
        end of the stripe array in btrfs bio, and we sort those target
        device stripes in the array. And we keep the number of the target
        device stripes in the btrfs bio.
      - Except write operation on RAID56, all the other operation don't
        take the target device stripes into account.
      - When we do write operation, we read the data from the common devices
        and calculate the parity. Then write the dirty data and new parity
        out, at this time, we will find the relative replace target stripes
        and wirte the relative data into it.
      
      Note: The function that copying old data on the source device to
      the target device was implemented in the past, it is similar to
      the other RAID type.
      
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      2c8cdd6e
    • Miao Xie's avatar
      Btrfs, raid56: support parity scrub on raid56 · 5a6ac9ea
      Miao Xie authored
      
      
      The implementation is:
      - Read and check all the data with checksum in the same stripe.
        All the data which has checksum is COW data, and we are sure
        that it is not changed though we don't lock the stripe. because
        the space of that data just can be reclaimed after the current
        transction is committed, and then the fs can use it to store the
        other data, but when doing scrub, we hold the current transaction,
        that is that data can not be recovered, it is safe that read and check
        it out of the stripe lock.
      - Lock the stripe
      - Read out all the data without checksum and parity
        The data without checksum and the parity may be changed if we don't
        lock the stripe, so we need read it in the stripe lock context.
      - Check the parity
      - Re-calculate the new parity and write back it if the old parity
        is not right
      - Unlock the stripe
      
      If we can not read out the data or the data we read is corrupted,
      we will try to repair it. If the repair fails. we will mark the
      horizontal sub-stripe(pages on the same horizontal) as corrupted
      sub-stripe, and we will skip the parity check and repair of that
      horizontal sub-stripe.
      
      And in order to skip the horizontal sub-stripe that has no data, we
      introduce a bitmap. If there is some data on the horizontal sub-stripe,
      we will the relative bit to 1, and when we check and repair the
      parity, we will skip those horizontal sub-stripes that the relative
      bits is 0.
      
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      5a6ac9ea
    • Miao Xie's avatar
      Btrfs, raid56: use a variant to record the operation type · 1b94b556
      Miao Xie authored
      
      
      We will introduce new operation type later, if we still use integer
      variant as bool variant to record the operation type, we would add new
      variant and increase the size of raid bio structure. It is not good,
      by this patch, we define different number for different operation,
      and we can just use a variant to record the operation type.
      
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      1b94b556
    • Miao Xie's avatar
      Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted · af8e2d1d
      Miao Xie authored
      
      
      This patch implement the RAID5/6 common data repair function, the
      implementation is similar to the scrub on the other RAID such as
      RAID1, the differentia is that we don't read the data from the
      mirror, we use the data repair function of RAID5/6.
      
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      af8e2d1d
    • Miao Xie's avatar
      Btrfs, raid56: don't change bbio and raid_map · b89e1b01
      Miao Xie authored
      
      
      Because we will reuse bbio and raid_map during the scrub later, it is
      better that we don't change any variant of bbio and don't free it at
      the end of IO request. So we introduced similar variants into the raid
      bio, and don't access those bbio's variants any more.
      
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      b89e1b01
    • Zhao Lei's avatar
      Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block · 6de65650
      Zhao Lei authored
      
      
      stripe_index's value was set again in latter line:
      stripe_index = 0;
      
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      6de65650
    • Zhao Lei's avatar
      Btrfs: remove noused bbio_ret in __btrfs_map_block in condition · f90523d1
      Zhao Lei authored
      
      
      bbio_ret in this condition is always !NULL because previous code
      already have a check-and-skip:
      4908 if (!bbio_ret)
      4909     goto out;
      
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      f90523d1
  3. Nov 25, 2014
    • Filipe Manana's avatar
      Btrfs: fix snapshot inconsistency after a file write followed by truncate · 9ea24bbe
      Filipe Manana authored
      
      
      If right after starting the snapshot creation ioctl we perform a write against a
      file followed by a truncate, with both operations increasing the file's size, we
      can get a snapshot tree that reflects a state of the source subvolume's tree where
      the file truncation happened but the write operation didn't. This leaves a gap
      between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
      
      For example, if we perform the following file operations:
      
          $ mkfs.btrfs -f /dev/vdd
          $ mount /dev/vdd /mnt
          $ xfs_io -f \
                -c "pwrite -S 0xaa -b 32K 0 32K" \
                -c "fsync" \
                -c "pwrite -S 0xbb -b 32770 16K 32770" \
                -c "truncate 90123" \
                /mnt/foobar
      
      and the snapshot creation ioctl was just called before the second write, we often
      can get the following inode items in the snapshot's btree:
      
              item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                      inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
              item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                      inode ref index 282 namelen 10 name: foobar
              item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                      extent data disk byte 1104855040 nr 32768
                      extent data offset 0 nr 32768 ram 32768
                      extent compression 0
              item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                      extent data disk byte 0 nr 0
                      extent data offset 0 nr 40960 ram 40960
                      extent compression 0
      
      There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
      for which there's no file extent item covering it. This is because the file write
      and file truncate operations happened both right after the snapshot creation ioctl
      called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
      ordered extent that matches the write and, in btrfs_setsize(), we were able to call
      btrfs_cont_expand() before being able to commit the current transaction in the
      snapshot creation ioctl. So this made it possibe to insert the hole file extent
      item in the source subvolume (which represents the region added by the truncate)
      right before the transaction commit from the snapshot creation ioctl.
      
      Btrfs' fsck tool complains about such cases with a message like the following:
      
          "root 331 inode 257 errors 100, file extent discount"
      
      >From a user perspective, the expectation when a snapshot is created while those
      file operations are being performed is that the snapshot will have a file that
      either:
      
      1) is empty
      2) only the first write was captured
      3) only the 2 writes were captured
      4) both writes and the truncation were captured
      
      But never capture a state where only the first write and the truncation were
      captured (since the second write was performed before the truncation).
      
      A test case for xfstests follows.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      9ea24bbe
    • Filipe Manana's avatar
      Btrfs: ensure send always works on roots without orphans · e5fa8f86
      Filipe Manana authored
      
      
      Move the logic from the snapshot creation ioctl into send. This avoids
      doing the transaction commit if send isn't used, and ensures that if
      a crash/reboot happens after the transaction commit that created the
      snapshot and before the transaction commit that switched the commit
      root, send will not get a commit root that differs from the main root
      (that has orphan items).
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e5fa8f86
    • Filipe Manana's avatar
      Btrfs: fix freeing used extent after removing empty block group · 758eb51e
      Filipe Manana authored
      
      
      Due to ignoring errors returned by clear_extent_bits (at the moment only
      -ENOMEM is possible), we can end up freeing an extent that is actually in
      use (i.e. return the extent to the free space cache).
      
      The sequence of steps that lead to this:
      
      1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
         the goal of freeing empty block groups;
      
      2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
         transaction (or starts a new one if none is running) and attempts to
         clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
         and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);
      
      3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
         -ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
         to delete the block group and the respective chunk, while pinned_extents
         remains with that bit set for the whole (or a part of the) range covered
         by the block group;
      
      4) Later while the transaction is still running, the chunk ends up being reused
         for a new block group (maybe for different purpose, data or metadata), and
         extents belonging to the new block group are allocated for file data or btree
         nodes/leafs;
      
      5) The current transaction is committed, meaning that we unpinned one or more
         extents from the new block group (through btrfs_finish_extent_commit() and
         unpin_extent_range()) which are now being used for new file data or new
         metadata (through btrfs_finish_extent_commit() and unpin_extent_range()).
         And unpinning means we returned the extents to the free space cache of the
         new block group, which implies those extents can be used for future allocations
         while they're still in use.
      
      Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
      object in unpin_extent_range() if a new block group didn't end up being allocated for
      the same chunk (step 4 above).
      
      Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      758eb51e
    • Chris Mason's avatar
      Btrfs: include vmalloc.h in check-integrity.c · 8f608de6
      Chris Mason authored
      
      
      Fengguang's build monster reported warnings on some arches because we
      don't have vmalloc.h included
      
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reported-by: default avatar <fengguang.wu@intel.com>
      8f608de6
    • Qu Wenruo's avatar
      btrfs: Fix a lockdep warning when running xfstest. · 084b6e7c
      Qu Wenruo authored
      
      
      The following lockdep warning is triggered during xfstests:
      
      [ 1702.980872] =========================================================
      [ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
      [ 1702.981482] 3.18.0-rc1 #27 Not tainted
      [ 1702.981781] ---------------------------------------------------------
      [ 1702.982095] kswapd0/77 just changed the state of lock:
      [ 1702.982415]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
      [ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
      [ 1702.983160]  (&fs_info->dev_replace.lock){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      [ 1702.984675]
      other info that might help us debug this:
      [ 1702.985524] Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
      
      [ 1702.986799]  Possible interrupt unsafe locking scenario:
      
      [ 1702.987681]        CPU0                    CPU1
      [ 1702.988137]        ----                    ----
      [ 1702.988598]   lock(&fs_info->dev_replace.lock);
      [ 1702.989069]                                local_irq_disable();
      [ 1702.989534]                                lock(&delayed_node->mutex);
      [ 1702.990038]                                lock(&found->groups_sem);
      [ 1702.990494]   <Interrupt>
      [ 1702.990938]     lock(&delayed_node->mutex);
      [ 1702.991407]
       *** DEADLOCK ***
      
      It is because the btrfs_kobj_{add/rm}_device() will call memory
      allocation with GFP_KERNEL,
      which may flush fs page cache to free space, waiting for it self to do
      the commit, causing the deadlock.
      
      To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
      dev_replace lock range, also involing split the
      btrfs_rm_dev_replace_srcdev() function into remove and free parts.
      
      Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
      lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
      called out of the lock range.
      
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      084b6e7c
    • Chris Mason's avatar
      Merge branch 'dev/pending-changes' of... · ad27c0da
      Chris Mason authored
      Merge branch 'dev/pending-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
      ad27c0da
  4. Nov 24, 2014
    • Linus Torvalds's avatar
      Linux 3.18-rc6 · 5d01410f
      Linus Torvalds authored
      5d01410f
    • Andy Lutomirski's avatar
      uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME · 82975bc6
      Andy Lutomirski authored
      
      
      x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set but
      not on non-paranoid returns.  I suspect that this is a mistake and that
      the code only works because int3 is paranoid.
      
      Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
      for the x86 bug.  With that bug fixed, we can remove _TIF_NOTIFY_RESUME
      from the uprobes code.
      
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      82975bc6
    • Thomas Gleixner's avatar
      sched: Provide update_curr callbacks for stop/idle scheduling classes · 90e362f4
      Thomas Gleixner authored
      Chris bisected a NULL pointer deference in task_sched_runtime() to
      commit 6e998916 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
      inconsistency'.
      
      Chris observed crashes in atop or other /proc walking programs when he
      started fork bombs on his machine.  He assumed that this is a new exit
      race, but that does not make any sense when looking at that commit.
      
      What's interesting is that, the commit provides update_curr callbacks
      for all scheduling classes except stop_task and idle_task.
      
      While nothing can ever hit that via the clock_nanosleep() and
      clock_gettime() interfaces, which have been the target of the commit in
      question, the author obviously forgot that there are other code paths
      which invoke task_sched_runtime()
      
      do_task_stat(()
       thread_group_cputime_adjusted()
         thread_group_cputime()
           task_cputime()
             task_sched_runtime()
              if (task_current(rq, p) && task_on_rq_queued(p)) {
                update_rq_clock(rq);
                up->sched_class->update_curr(rq);
              }
      
      If the stats are read for a stomp machine task, aka 'migration/N' and
      that task is current on its cpu, this will happily call the NULL pointer
      of stop_task->update_curr.  Ooops.
      
      Chris observation that this happens faster when he runs the fork bomb
      makes sense as the fork bomb will kick migration threads more often so
      the probability to hit the issue will increase.
      
      Add the missing update_curr callbacks to the scheduler classes stop_task
      and idle_task.  While idle tasks cannot be monitored via /proc we have
      other means to hit the idle case.
      
      Fixes: 6e998916
      
       'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
      Reported-by: default avatarChris Mason <clm@fb.com>
      Reported-and-tested-by: default avatarBorislav Petkov <bp@alien8.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90e362f4
    • Linus Torvalds's avatar
      Merge branch 'x86-traps' (trap handling from Andy Lutomirski) · 00c89b2f
      Linus Torvalds authored
      Merge x86-64 iret fixes from Andy Lutomirski:
       "This addresses the following issues:
      
         - an unrecoverable double-fault triggerable with modify_ldt.
         - invalid stack usage in espfix64 failed IRET recovery from IST
           context.
         - invalid stack usage in non-espfix64 failed IRET recovery from IST
           context.
      
        It also makes a good but IMO scary change: non-espfix64 failed IRET
        will now report the correct error.  Hopefully nothing depended on the
        old incorrect behavior, but maybe Wine will get confused in some
        obscure corner case"
      
      * emailed patches from Andy Lutomirski <luto@amacapital.net>:
        x86_64, traps: Rework bad_iret
        x86_64, traps: Stop using IST for #SS
        x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in C
      00c89b2f
    • Andy Lutomirski's avatar
      x86_64, traps: Rework bad_iret · b645af2d
      Andy Lutomirski authored
      
      
      It's possible for iretq to userspace to fail.  This can happen because
      of a bad CS, SS, or RIP.
      
      Historically, we've handled it by fixing up an exception from iretq to
      land at bad_iret, which pretends that the failed iret frame was really
      the hardware part of #GP(0) from userspace.  To make this work, there's
      an extra fixup to fudge the gs base into a usable state.
      
      This is suboptimal because it loses the original exception.  It's also
      buggy because there's no guarantee that we were on the kernel stack to
      begin with.  For example, if the failing iret happened on return from an
      NMI, then we'll end up executing general_protection on the NMI stack.
      This is bad for several reasons, the most immediate of which is that
      general_protection, as a non-paranoid idtentry, will try to deliver
      signals and/or schedule from the wrong stack.
      
      This patch throws out bad_iret entirely.  As a replacement, it augments
      the existing swapgs fudge into a full-blown iret fixup, mostly written
      in C.  It's should be clearer and more correct.
      
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b645af2d
    • Andy Lutomirski's avatar
      x86_64, traps: Stop using IST for #SS · 6f442be2
      Andy Lutomirski authored
      
      
      On a 32-bit kernel, this has no effect, since there are no IST stacks.
      
      On a 64-bit kernel, #SS can only happen in user code, on a failed iret
      to user space, a canonical violation on access via RSP or RBP, or a
      genuine stack segment violation in 32-bit kernel code.  The first two
      cases don't need IST, and the latter two cases are unlikely fatal bugs,
      and promoting them to double faults would be fine.
      
      This fixes a bug in which the espfix64 code mishandles a stack segment
      violation.
      
      This saves 4k of memory per CPU and a tiny bit of code.
      
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6f442be2
    • Andy Lutomirski's avatar
      x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in C · af726f21
      Andy Lutomirski authored
      There's nothing special enough about the espfix64 double fault fixup to
      justify writing it in assembly.  Move it to C.
      
      This also fixes a bug: if the double fault came from an IST stack, the
      old asm code would return to a partially uninitialized stack frame.
      
      Fixes: 3891a04a
      
      
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af726f21
    • Linus Torvalds's avatar
      Merge tag 'armsoc-for-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 27946315
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       "A collection of fixes this week:
      
         - A set of clock fixes for shmobile platforms
         - A fix for tegra that moves serial port labels to be per board.
           We're choosing to merge this for 3.18 because the labels will start
           being parsed in 3.19, and without this change serial port numbers
           that used to be stable since the dawn of time will change numbers.
         - A few other DT tweaks for Tegra.
         - A fix for multi_v7_defconfig that makes it stop spewing cpufreq
           errors on Arndale (Exynos)"
      
      * tag 'armsoc-for-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        ARM: multi_v7_defconfig: fix failure setting CPU voltage by enabling dependent I2C controller
        ARM: tegra: roth: Fix SD card VDD_IO regulator
        ARM: tegra: Remove eMMC vmmc property for roth/tn7
        ARM: dts: tegra: move serial aliases to per-board
        ARM: tegra: Add serial port labels to Tegra124 DT
        ARM: shmobile: kzm9g legacy: Set i2c clks_per_count to 2
        ARM: shmobile: r8a7740 dtsi: Correct IIC0 parent clock
        ARM: shmobile: r8a7790: Fix SD3CKCR address to device tree
        ARM: shmobile: r8a7740 legacy: Correct IIC0 parent clock
        ARM: shmobile: r8a7740 legacy: Add missing INTCA clock for irqpin module
        ARM: shmobile: r8a7790: Fix SD3CKCR address
        ARM: dts: sun6i: Re-parent ahb1_mux to pll6 as required by dma controller
      27946315
    • Linus Torvalds's avatar
      Merge branch 'for-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu · 9f2e0f63
      Linus Torvalds authored
      Pull percpu fix from Tejun Heo:
       "This contains one patch to fix a race condition which can lead to
        percpu_ref using a percpu pointer which is corrupted with a set DEAD
        bit.  The bug was introduced while separating out the ATOMIC mode flag
        from the DEAD flag.  The fix is pretty straight forward.
      
        I just committed the patch to the percpu tree but am sending out the
        pull request early as I'll be on vacation for a week.  The patch
        should be fairly safe and while the latency will be higher I'll be
        checking emails"
      
      * 'for-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
        percpu-ref: fix DEAD flag contamination of percpu pointer
      9f2e0f63