  1. Jul 24, 2017
    • btrfs: round down size diff when shrinking/growing device · 0e4324a4
      Nikolay Borisov authored
      Further testing showed that the fix introduced in 7dfb8be1 ("btrfs:
      Round down values which are written for total_bytes_size") was
      insufficient and it could still lead to discrepancies between the
      total_bytes in the super block and the device total bytes. So this patch
      also ensures that the difference between old/new sizes when
      shrinking/growing is also rounded down. This ensures that we won't be
      subtracting/adding non-sectorsize multiples to the superblock/device
      total sizes.
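
The arithmetic described above can be sketched in a few lines of Python. This is only an illustration of rounding the size difference down to a sector multiple, not the kernel code: the names `SECTORSIZE`, `round_down` and `resize` are hypothetical, and the 4096-byte sector size is an assumption.

```python
SECTORSIZE = 4096  # assumed sector size; must be a power of two


def round_down(value: int, align: int) -> int:
    """Round value down to a multiple of align (align is a power of two)."""
    return value & ~(align - 1)


def resize(total_bytes: int, old_size: int, new_size: int) -> int:
    """Apply a device grow/shrink to a running total.

    The difference between the old and new size is rounded down before it
    is added or subtracted, so an aligned total stays aligned.
    """
    if new_size >= old_size:
        return total_bytes + round_down(new_size - old_size, SECTORSIZE)
    return total_bytes - round_down(old_size - new_size, SECTORSIZE)


total = 1 << 30                                # sector-aligned starting total
grown = resize(total, 1 << 30, (1 << 30) + 5000)
shrunk = resize(total, 1 << 30, (1 << 30) - 5000)
print(grown % SECTORSIZE, shrunk % SECTORSIZE)
```

Without the rounding step, the 5000-byte difference would be applied as-is and both totals would stop being sector multiples.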
      
      Fixes: 7dfb8be1 ("btrfs: Round down values which are written for total_bytes_size")
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix early ENOSPC due to delalloc · 17024ad0
      Omar Sandoval authored
      If a lot of metadata is reserved for outstanding delayed allocations, we
      rely on shrink_delalloc() to reclaim metadata space in order to fulfill
      reservation tickets. However, shrink_delalloc() has a shortcut where if
      it determines that space can be overcommitted, it will stop early. This
      made sense before the ticketed enospc system, but now it means that
      shrink_delalloc() will often not reclaim enough space to fulfill any
      tickets, leading to an early ENOSPC. (Reservation tickets don't care
      about being able to overcommit; they need every byte accounted for.)
      
      Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
      all of the metadata it is supposed to. This fixes early ENOSPCs we were
      seeing when doing a btrfs receive to populate a new filesystem, as well
      as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.
      
      Fixes: 957780eb ("Btrfs: introduce ticketed enospc infrastructure")
      Tested-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
      Cc: stable@vger.kernel.org
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix lockup in find_free_extent with read-only block groups · 14443937
      Jeff Mahoney authored
      
      
      If we have a block group that is all of the following:
      1) uncached in memory
      2) is read-only
      3) has a disk cache state that indicates we need to recreate the cache
      
      AND the file system has enough free space fragmentation such that the
      request for an extent of a given size can't be honored;
      
      AND have a single CPU core;
      
      AND it's the block group with the highest starting offset such that
      there are no opportunities (like reading from disk) for the loop to
      yield the CPU;
      
      We can end up with a lockup.
      
      The root cause is simple.  Once we're in the position that we've read in
      all of the other block groups directly and none of those block groups
      can honor the request, there are no more opportunities to sleep.  We end
      up trying to start a caching thread which never gets run if we only have
      one core.  This *should* present as a hung task waiting on the caching
      thread to make some progress, but it doesn't.  Instead, it degrades into
      a busy loop because of the placement of the read-only check.
      
      During the first pass through the loop, block_group->cached will be set
      to BTRFS_CACHE_STARTED and have_caching_bg will be set.  Then we hit the
      read-only check and short circuit the loop.  We're not yet in
      LOOP_CACHING_WAIT, so we skip that loop back before going through the
      loop again for other raid groups.
      
      Then we move to LOOP_CACHING_WAIT state.
      
      During this pass through the loop, ->cached will still be
      BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter
      cache_block_group, do a lot of nothing, and return, and also set
      have_caching_bg again.  Then we hit the read-only check and short circuit
      the loop.  The same thing happens as before except now we DO trigger
      the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the
      top.  We do this forever.
      
      There are two fixes in this patch since they address the same underlying
      bug.
      
      The first is to add a cond_resched to the end of the loop to ensure
      that the caching thread always has an opportunity to run.  This will
      fix the soft lockup issue, but find_free_extent will still loop doing
      nothing until the thread has completed.
      
      The second is to move the read-only check to the top of the loop.  We're
      never going to return an allocation within a read-only block group so
      we may as well skip it early.  The check for ->cached == BTRFS_CACHE_ERROR
      would cause the same problem except that BTRFS_CACHE_ERROR is considered
      a "done" state and we won't re-set have_caching_bg again.
      
      Many thanks to Stephan Kulow <coolo@suse.de> for his excellent help in
      the testing process.
      
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. Jul 20, 2017
  3. Jul 15, 2017
  4. Jul 14, 2017
    • Btrfs: fix write corruption due to bio cloning on raid5/6 · 6592e58c
      Filipe Manana authored
      
      
      The recent changes to make bio cloning faster (added in the 4.13 merge
      window) by using the bio_clone_fast() API introduced a regression on
      raid5/6 modes, because cloned bios have an invalid bi_vcnt field
      (therefore it can not be used) and the raid5/6 code uses the
      bio_for_each_segment_all() API to iterate the segments of a bio, and this
      API uses a bio's bi_vcnt field.
      
      The issue is very simple to trigger by doing for example a direct IO write
      against a raid5 or raid6 filesystem and then attempting to read what we
      wrote before:
      
        $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
        $ mount /dev/sdc /mnt
        $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
        $ od -t x1 /mnt/foobar
        od: /mnt/foobar: read error: Input/output error
      
      For that example, the following is also reported in dmesg/syslog:
      
        [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
        [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
        [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
        [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
        [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
      
      A scrub will also fail correcting bad copies, mentioning the following in
      dmesg/syslog:
      
        [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
        [18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $
        [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
        [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        [18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$
        [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        [18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$
        [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
        [18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$
        [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        [18276.434127] BTRFS warning (device sdf): checksum error at logical 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, length 4096, links$
        [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
        [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186477568 on dev /dev/sdd
        [18276.538400] BTRFS warning (device sdf): checksum error at logical 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, length 4096, links$
        [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
        [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186481664 on dev /dev/sdd
        [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186346496 on dev /dev/sde
        [18276.598306] BTRFS warning (device sdf): checksum error at logical 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, length 4096, links$
        [18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
        [18276.598582] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186350592 on dev /dev/sde
        [18276.603455] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
        [18276.638362] BTRFS warning (device sdf): checksum error at logical 2186354688 on dev /dev/sde, sector 2116624, root 5, inode 257, offset 73728, length 4096, links $
        [18276.640445] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
        [18276.645942] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186354688 on dev /dev/sde
        [18276.657204] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186412032 on dev /dev/sde
        [18276.660563] BTRFS warning (device sdf): checksum error at logical 2186416128 on dev /dev/sde, sector 2116744, root 5, inode 257, offset 135168, length 4096, links$
        [18276.664609] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
        [18276.664609] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186358784 on dev /dev/sde
      
      So fix this by first setting the bio's bi_iter field to the value of the
      corresponding btrfs bio container's saved iterator and then using the
      bio_for_each_segment() API, if we are processing a cloned bio in the
      raid5/6 code (the same code processes both cloned and non-cloned bios).
      
      This incorrect iteration of cloned bios was also causing some occasional
      BUG_ONs when running fstest btrfs/064, which have a trace like the
      following:
      
        [ 6674.416156] ------------[ cut here ]------------
        [ 6674.416157] kernel BUG at fs/btrfs/raid56.c:1897!
        [ 6674.416159] invalid opcode: 0000 [#1] PREEMPT SMP
        [ 6674.416160] Modules linked in: dm_flakey dm_mod dax ppdev tpm_tis parport_pc tpm_tis_core evdev tpm psmouse sg i2c_piix4 pcspkr parport i2c_core serio_raw button s
        [ 6674.416184] CPU: 3 PID: 19236 Comm: kworker/u32:10 Not tainted 4.12.0-rc6-btrfs-next-44+ #1
        [ 6674.416185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
        [ 6674.416210] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
        [ 6674.416211] task: ffff880147f6c740 task.stack: ffffc90001fb8000
        [ 6674.416229] RIP: 0010:__raid_recover_end_io+0x1ac/0x370 [btrfs]
        [ 6674.416230] RSP: 0018:ffffc90001fbbb90 EFLAGS: 00010217
        [ 6674.416231] RAX: ffff8801ff4b4f00 RBX: 0000000000000002 RCX: 0000000000000001
        [ 6674.416232] RDX: ffff880099b045d8 RSI: ffffffff81a5f6e0 RDI: 0000000000000004
        [ 6674.416232] RBP: ffffc90001fbbbc8 R08: 0000000000000001 R09: 0000000000000001
        [ 6674.416233] R10: ffffc90001fbbac8 R11: 0000000000001000 R12: 0000000000000002
        [ 6674.416234] R13: ffff880099b045c0 R14: 0000000000000004 R15: ffff88012bff2000
        [ 6674.416235] FS:  0000000000000000(0000) GS:ffff88023f2c0000(0000) knlGS:0000000000000000
        [ 6674.416235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 6674.416236] CR2: 00007f28cf282000 CR3: 00000001000c6000 CR4: 00000000000006e0
        [ 6674.416239] Call Trace:
        [ 6674.416259]  __raid56_parity_recover+0xfc/0x16e [btrfs]
        [ 6674.416276]  raid56_parity_recover+0x157/0x16b [btrfs]
        [ 6674.416293]  btrfs_map_bio+0xe0/0x259 [btrfs]
        [ 6674.416310]  btrfs_submit_bio_hook+0xbf/0x147 [btrfs]
        [ 6674.416327]  end_bio_extent_readpage+0x27b/0x4a0 [btrfs]
        [ 6674.416331]  bio_endio+0x17d/0x1b3
        [ 6674.416346]  end_workqueue_fn+0x3c/0x3f [btrfs]
        [ 6674.416362]  btrfs_scrubparity_helper+0x1aa/0x3b8 [btrfs]
        [ 6674.416379]  btrfs_endio_helper+0xe/0x10 [btrfs]
        [ 6674.416381]  process_one_work+0x276/0x4b6
        [ 6674.416384]  worker_thread+0x1ac/0x266
        [ 6674.416386]  ? rescuer_thread+0x278/0x278
        [ 6674.416387]  kthread+0x106/0x10e
        [ 6674.416389]  ? __list_del_entry+0x22/0x22
        [ 6674.416391]  ret_from_fork+0x27/0x40
        [ 6674.416395] Code: 44 89 e2 be 00 10 00 00 ff 15 b0 ab ef ff eb 72 4d 89 e8 89 d9 44 89 e2 be 00 10 00 00 ff 15 a3 ab ef ff eb 5d 41 83 fc ff 74 02 <0f> 0b 49 63 97
        [ 6674.416432] RIP: __raid_recover_end_io+0x1ac/0x370 [btrfs] RSP: ffffc90001fbbb90
        [ 6674.416434] ---[ end trace 74d56ebe7489dd6a ]---
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  5. Jul 07, 2017
    • Btrfs: incremental send, fix invalid memory access · 24e52b11
      Filipe Manana authored
      When doing an incremental send, while processing an extent that changed
      between the parent and send snapshots and that extent was an inline extent
      in the parent snapshot, it's possible to access a memory region beyond
      the end of leaf if the inline extent is very small and it is the first
      item in a leaf.
      
      An example scenario is described below.
      
      The send snapshot has the following leaf:
      
       leaf 33865728 items 33 free space 773 generation 46 owner 5
       fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
       chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
              (...)
              item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53
                      generation 36 type 1 (regular)
                      extent data disk byte 12791808 nr 4096
                      extent data offset 0 nr 4096 ram 4096
                      extent compression 0 (none)
              item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53
                      generation 36 type 1 (regular)
                      extent data disk byte 138170368 nr 225280
                      extent data offset 0 nr 225280 ram 225280
                      extent compression 0 (none)
              (...)
      
      And the parent snapshot has the following leaf:
      
       leaf 31272960 items 17 free space 17 generation 31 owner 5
       fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
       chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
              item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44
                      generation 31 type 0 (inline)
                      inline extent data size 23 ram_bytes 613 compression 1 (zlib)
              (...)
      
      When computing the send stream, it is detected that the extent of inode
      335 at file offset 0 changed, and at fs/btrfs/send.c:is_extent_unchanged()
      we grab the leaf from the parent snapshot and access the inline extent item.
      However, before jumping to the 'out' label, we access the 'offset' and
      'disk_bytenr' fields of the extent item, which should not be done for
      inline extents since the inlined data starts at the offset of the
      'disk_bytenr' field and can be very small. For example accessing the
      'offset' field of the file extent item results in the following trace:
      
      [  599.705368] general protection fault: 0000 [#1] PREEMPT SMP
      [  599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$
      [  599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1
      [  599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000
      [  599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs]
      [  599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286
      [  599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001
      [  599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f
      [  599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098
      [  599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7
      [  599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088
      [  599.709340] FS:  00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000
      [  599.709340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0
      [  599.709340] Call Trace:
      [  599.709340]  btrfs_get_token_64+0x93/0xce [btrfs]
      [  599.709340]  ? printk+0x48/0x50
      [  599.709340]  btrfs_get_64+0xb/0xd [btrfs]
      [  599.709340]  process_extent+0x3a1/0x1106 [btrfs]
      [  599.709340]  ? btree_read_extent_buffer_pages+0x5/0xef [btrfs]
      [  599.709340]  changed_cb+0xb03/0xb3d [btrfs]
      [  599.709340]  ? btrfs_get_token_32+0x7a/0xcc [btrfs]
      [  599.709340]  btrfs_compare_trees+0x432/0x53d [btrfs]
      [  599.709340]  ? process_extent+0x1106/0x1106 [btrfs]
      [  599.709340]  btrfs_ioctl_send+0x960/0xe26 [btrfs]
      [  599.709340]  btrfs_ioctl+0x181b/0x1fed [btrfs]
      [  599.709340]  ? trace_hardirqs_on_caller+0x150/0x1ac
      [  599.709340]  vfs_ioctl+0x21/0x38
      [  599.709340]  ? vfs_ioctl+0x21/0x38
      [  599.709340]  do_vfs_ioctl+0x611/0x645
      [  599.709340]  ? rcu_read_unlock+0x5b/0x5d
      [  599.709340]  ? __fget+0x6d/0x79
      [  599.709340]  SyS_ioctl+0x57/0x7b
      [  599.709340]  entry_SYSCALL_64_fastpath+0x18/0xad
      [  599.709340] RIP: 0033:0x7f51945eec47
      [  599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [  599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47
      [  599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004
      [  599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700
      [  599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046
      [  599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040
      [  599.709340]  ? trace_hardirqs_off_caller+0x43/0xb1
      [  599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$
      [  599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00
      [  599.762057] ---[ end trace fe00d7af61b9f49e ]---
      
      This is because the 'offset' field starts at an offset of 37 bytes
      (offsetof(struct btrfs_file_extent_item, offset)), has a length of 8
      bytes and therefore attempting to read it causes a 1 byte access beyond
      the end of the leaf, as the first item's content in a leaf is located
      at the tail of the leaf, the item size is 44 bytes and the offset of
      that field plus its length (37 + 8 = 45) goes beyond the item's size
      by 1 byte.
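
The offsets quoted above can be re-derived with a short Python sketch. It assumes the file extent item is a packed, little-endian structure with the field order listed in `FIELDS`. The field names, the `layout` and `overrun` helpers are assumptions of this sketch, and only the 37-byte offset, the 44-byte inline item and the 53-byte regular item come from the text and leaf dumps above.

```python
import struct

# Assumed packed, little-endian layout of struct btrfs_file_extent_item:
# (field name, struct format code).
FIELDS = [
    ("generation", "Q"),
    ("ram_bytes", "Q"),
    ("compression", "B"),
    ("encryption", "B"),
    ("other_encoding", "H"),
    ("type", "B"),
    ("disk_bytenr", "Q"),
    ("disk_num_bytes", "Q"),
    ("offset", "Q"),
    ("num_bytes", "Q"),
]


def layout(fields):
    """Return {name: (offset, size)} and the total packed size."""
    table = {}
    pos = 0
    for name, code in fields:
        size = struct.calcsize("<" + code)
        table[name] = (pos, size)
        pos += size
    return table, pos


LAYOUT, REGULAR_ITEM_SIZE = layout(FIELDS)
# Inline data starts where 'disk_bytenr' would be in a regular extent item.
INLINE_HEADER_SIZE = LAYOUT["disk_bytenr"][0]


def overrun(item_size, field):
    """Bytes read past the end of the item when accessing `field`."""
    offset, size = LAYOUT[field]
    return max(0, offset + size - item_size)


# Parent snapshot item from the example: 23 bytes of inline data.
item_size = INLINE_HEADER_SIZE + 23
print(item_size, overrun(item_size, "offset"))  # prints "44 1"
```

The same helper shows why 'disk_bytenr' is affected too: with fewer than 8 bytes of inline data, `overrun(INLINE_HEADER_SIZE + n, "disk_bytenr")` is non-zero.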
      
      So fix this by accessing the 'offset' and 'disk_bytenr' fields after
      jumping to the 'out' label if we are processing an inline extent. We
      move the reading operation of the 'disk_bytenr' field too because we
      have the same problem as for the 'offset' field explained above when
      the inline data is less than 8 bytes. The access to the 'generation'
      field is also moved but just for the sake of grouping access to all
      the fields.
      
      Fixes: e1cbfd7b ("Btrfs: send, fix file hole not being preserved due to inline extent")
      Cc: <stable@vger.kernel.org>  # v4.12+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: incremental send, fix invalid path for link commands · f5962781
      Filipe Manana authored
      
      
      In some scenarios an incremental send stream can contain link commands
      with an invalid target path. Such scenarios happen after moving some
      directory inode A, renaming a regular file inode B into the old name of
      inode A and finally creating a new hard link for inode B at directory
      inode A.
      
      Consider the following example scenario where this issue happens.
      
      Parent snapshot:
      
        .                                                      (ino 256)
        |
        |--- dir1/                                             (ino 257)
        |      |--- dir2/                                      (ino 258)
        |             |--- dir3/                               (ino 259)
        |                   |--- file1                         (ino 261)
        |                   |--- dir4/                         (ino 262)
        |
        |--- dir5/                                             (ino 260)
      
      Send snapshot:
      
        .                                                      (ino 256)
        |
        |--- dir1/                                             (ino 257)
               |--- dir2/                                      (ino 258)
               |      |--- dir3/                               (ino 259)
               |            |--- dir4                          (ino 261)
               |
               |--- dir6/                                      (ino 263)
                      |--- dir44/                              (ino 262)
                             |--- file11                       (ino 261)
                             |--- dir55/                       (ino 260)
      
      When attempting to apply the corresponding incremental send stream, a
      link command contains an invalid target path which makes the receiver
      fail. The following is the verbose output of the btrfs receive command:
      
        receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
        utimes
        utimes dir1
        utimes dir1/dir2/dir3
        utimes
        rename dir1/dir2/dir3/dir4 -> o262-7-0
        link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
        link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
        ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
      
      The following steps happen during the computation of the incremental send
      stream that lead to this issue:
      
      1) When processing inode 261, we orphanize inode 262 due to a name/location
         collision with one of the new hard links for inode 261 (created in the
         second step below).
      
      2) We create one of the 2 new hard links for inode 261, the one whose
         location is at "dir1/dir2/dir3/dir4".
      
      3) We then attempt to create the other new hard link for inode 261, which
         has inode 262 as its parent directory. Because the path for this new
         hard link was computed before we started processing the new references
         (hard links), it reflects the old name/location of inode 262, that is,
         it does not account for the orphanization step that happened when
         we started processing the new references for inode 261, whence it is
         no longer valid, causing the receiver to fail.
      
      So fix this issue by recomputing the full path of new references if we
      ended up orphanizing other inodes which are directories.
      
      A test case for fstests follows soon.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  6. Jun 30, 2017
    • btrfs: Remove false alert when fiemap range is smaller than on-disk extent · 848c23b7
      Qu Wenruo authored
      Commit 4751832d ("btrfs: fiemap: Cache and merge fiemap extent before
      submit it to user") introduced a warning to catch unemitted cached
      fiemap extent.
      
      However, that warning doesn't take the following case into consideration:
      
      0			4K			8K
      |<---- fiemap range --->|
      |<----------- On-disk extent ------------------>|
      
      In this case, the whole 0~8K is cached, and since it's larger than the
      fiemap range, it breaks the fiemap extent emit loop.
      This leaves the fiemap extent cached but not emitted, so it is caught by
      the final fiemap extent sanity check, causing a kernel warning.
      
      This patch removes the kernel warning and renames the sanity check to
      emit_last_fiemap_cache() since it's possible and valid to have cached
      fiemap extent.
      
      Reported-by: David Sterba <dsterba@suse.cz>
      Reported-by: Adam Borowski <kilobyte@angband.pl>
      Fixes: 4751832d ("btrfs: fiemap: Cache and merge fiemap extent ...")
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Don't clear SGID when inheriting ACLs · b7f8a09f
      Jan Kara authored
      When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
      set, DIR1 is expected to have SGID bit set (and owning group equal to
      the owning group of 'DIR0'). However when 'DIR0' also has some default
      ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
      'DIR1' to get cleared if user is not member of the owning group.
      
      Fix the problem by moving posix_acl_update_mode() out of
      __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
      called when inheriting ACLs which is what we want as it prevents SGID
      bit clearing and the mode has been properly set by posix_acl_create()
      anyway.
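
A toy Python model of the change may help. It assumes the mode-update step drops the SGID bit when the caller is neither in the owning group nor privileged. All function names below are hypothetical stand-ins for the two btrfs entry points, not the kernel API.

```python
import stat


def update_mode(mode, user_in_group, privileged=False):
    """Model of the SGID-clearing step applied when a user sets an ACL."""
    if not user_in_group and not privileged:
        mode &= ~stat.S_ISGID
    return mode


def low_level_set_acl(mode):
    """Stand-in for the inner helper after the fix: it leaves the mode alone."""
    return mode


def user_set_acl(mode, user_in_group):
    """Stand-in for the outer entry point after the fix: updates mode first."""
    return low_level_set_acl(update_mode(mode, user_in_group))


def inherit_before_fix(mode, user_in_group):
    """Before the fix the inner helper itself ran the mode update."""
    return update_mode(mode, user_in_group)


def inherit_after_fix(mode, user_in_group):
    """After the fix inheritance only reaches the inner helper."""
    return low_level_set_acl(mode)


dir1_mode = stat.S_IFDIR | stat.S_ISGID | 0o775
print(bool(inherit_before_fix(dir1_mode, False) & stat.S_ISGID),
      bool(inherit_after_fix(dir1_mode, False) & stat.S_ISGID))
```

In this model an explicit ACL change by a user outside the group still clears SGID, while ACL inheritance at directory creation no longer does.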
      
      Fixes: 07393101
      CC: stable@vger.kernel.org
      CC: linux-btrfs@vger.kernel.org
      CC: David Sterba <dsterba@suse.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix integer overflow in calc_reclaim_items_nr · 6374e57a
      Chris Mason authored
      Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
      v4.12-rc6.  This was because commit 70e7af24 made it possible for
      calc_reclaim_items_nr() to return a negative number.  It's not really a
      bug in that commit, it just didn't go far enough down the stack to find
      all the possible 64->32 bit overflows.
      
      This switches calc_reclaim_items_nr() to return a u64 and changes everyone
      that uses the results of that math to u64 as well.
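
The 64->32 bit truncation can be illustrated with a small Python sketch. The helper names and the input values are hypothetical, chosen only so that the quotient exceeds 2^31 - 1. This models the arithmetic and is not the kernel function.

```python
def to_s32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def items_nr_as_int(to_reclaim: int, bytes_per_item: int) -> int:
    """Quotient squeezed into a signed 32-bit return value."""
    return to_s32(to_reclaim // bytes_per_item)


def items_nr_as_u64(to_reclaim: int, bytes_per_item: int) -> int:
    """Same quotient kept as an unsigned 64-bit value."""
    return (to_reclaim // bytes_per_item) & 0xFFFFFFFFFFFFFFFF


to_reclaim = 3 << 40      # illustrative byte count
bytes_per_item = 1024     # illustrative per-item cost
print(items_nr_as_int(to_reclaim, bytes_per_item))   # prints -1073741824
print(items_nr_as_u64(to_reclaim, bytes_per_item))   # prints 3221225472
```

A negative count is exactly what a check such as WARN_ON(nr < 0) would catch further down the call chain.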
      
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Fixes: 70e7af24 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: fix target device intialization while setting up scrub context · ded56184
      David Sterba authored
      
      
      The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a
      helper but wrongly sets up the target device. Incidentally there's a
      local variable with the same name as a parameter in the previous
      function, so this got caught during runtime as crash in test btrfs/027.
      
      Reported-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges · bc42bda2
      Qu Wenruo authored
      
      
      [BUG]
      For the following case, btrfs can underflow qgroup reserved space
      at an error path:
      (Page size 4K, function name without "btrfs_" prefix)
      
               Task A                  |             Task B
      ----------------------------------------------------------------------
      Buffered_write [0, 2K)           |
      |- check_data_free_space()       |
      |  |- qgroup_reserve_data()      |
      |     Range aligned to page      |
      |     range [0, 4K)          <<< |
      |     4K bytes reserved      <<< |
      |- copy pages to page cache      |
                                       | Buffered_write [2K, 4K)
                                       | |- check_data_free_space()
                                 | |  |- qgroup_reserve_data()
                                 | |     Range aligned to page
                                       | |     range [0, 4K)
                                       | |     Already reserved by A <<<
                                       | |     0 bytes reserved      <<<
                                       | |- delalloc_reserve_metadata()
                                       | |  And it *FAILED* (Maybe EQUOTA)
                                       | |- free_reserved_data_space()
                                            |- qgroup_free_data()
                                               Range aligned to page range
                                               [0, 4K)
                                               Freeing 4K
      (Special thanks to Chandan for the detailed report and analysis)
      
      [CAUSE]
      Above, Task B frees the reserved data range [0, 4K), which was actually
      reserved by Task A.
      
      At writeback time, the page dirtied by Task A will go through the
      writeback routine, which will free 4K of reserved data space at file
      extent insert time, causing the qgroup underflow.
      
      [FIX]
      For btrfs_qgroup_free_data(), add a @reserved parameter so that it only
      frees data ranges reserved by a previous btrfs_qgroup_reserve_data() call.
      In the above case, Task B will then try to free 0 bytes, so no underflow
      occurs.
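
The idea of freeing only what the current call reserved can be sketched with a toy model. This is not the kernel implementation: `toy_inode`, `toy_changeset`, `toy_reserve()` and `toy_free_reserved()` are hypothetical names, and a fixed array of page flags stands in for the real extent io tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_PAGES 16u

/* Hypothetical per-inode state: which pages hold a qgroup data reservation. */
struct toy_inode {
	bool reserved[TOY_MAX_PAGES];
	uint64_t qgroup_reserved;	/* total bytes currently reserved */
};

/* Hypothetical stand-in for an extent changeset: pages reserved by one call. */
struct toy_changeset {
	bool changed[TOY_MAX_PAGES];
};

/* Reserve [start, start + len), page aligned; return the bytes newly reserved. */
static uint64_t toy_reserve(struct toy_inode *inode, struct toy_changeset *cs,
			    uint64_t start, uint64_t len)
{
	uint64_t first, last, i, bytes = 0;

	assert(len > 0 && start + len <= (uint64_t)TOY_MAX_PAGES * TOY_PAGE_SIZE);
	first = start / TOY_PAGE_SIZE;
	last = (start + len - 1) / TOY_PAGE_SIZE;
	for (i = first; i <= last; i++) {
		if (inode->reserved[i])
			continue;	/* already reserved by an earlier call */
		inode->reserved[i] = true;
		cs->changed[i] = true;	/* remember that this call reserved it */
		bytes += TOY_PAGE_SIZE;
	}
	inode->qgroup_reserved += bytes;
	return bytes;
}

/* Free only the pages recorded in @cs; return the bytes actually freed. */
static uint64_t toy_free_reserved(struct toy_inode *inode,
				  struct toy_changeset *cs)
{
	uint64_t i, bytes = 0;

	for (i = 0; i < TOY_MAX_PAGES; i++) {
		if (!cs->changed[i] || !inode->reserved[i])
			continue;
		inode->reserved[i] = false;
		cs->changed[i] = false;
		bytes += TOY_PAGE_SIZE;
	}
	inode->qgroup_reserved -= bytes;
	return bytes;
}
```

In this model Task A's write to [0, 2K) reserves 4K, Task B's write to [2K, 4K) reserves nothing, and Task B's error path therefore frees nothing, leaving Task A's reservation intact.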
      
      Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc42bda2
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce extent changeset for qgroup reserve functions · 364ecf36
      Qu Wenruo authored
      
      
      Introduce a new parameter, struct extent_changeset, for
      btrfs_qgroup_reserve_data() and its callers.
      
      Such an extent_changeset was already used inside btrfs_qgroup_reserve_data()
      to record which ranges the current call reserved, so that it can free them
      in error paths.
      
      The reason we need to export it to callers is that, in the buffered write
      error path, without knowing exactly which ranges we reserved in the current
      allocation, we can free space which was not reserved by us.
      
      This will lead to qgroup reserved space underflow.
      
      Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      364ecf36
    • Qu Wenruo's avatar
      btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled · a12b877b
      Qu Wenruo authored
      
      
      [BUG]
      Under the following case, we can underflow qgroup reserved space.
      
                  Task A                |            Task B
      ---------------------------------------------------------------
       Quota disabled                   |
       Buffered write                   |
       |- btrfs_check_data_free_space() |
       |  *NO* qgroup space is reserved |
       |  since quota is *DISABLED*     |
       |- All pages are copied to page  |
          cache                         |
                                        | Enable quota
                                        | Quota scan finished
                                        |
                                        | Sync_fs
                                        | |- run_delalloc_range
                                        | |- Write pages
                                        | |- btrfs_finish_ordered_io
                                        |    |- insert_reserved_file_extent
                                        |       |- btrfs_qgroup_release_data()
                                        |          Since no qgroup space is
                                                   reserved in Task A, we
                                                   underflow qgroup reserved
                                                   space
      This can be detected by fstest btrfs/104.
      
      [CAUSE]
      In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes
      size of qgroup reserved_space in all cases.
      And btrfs_qgroup_release_data() will check if quotas are enabled.
      
      However in the above case, the buffered write happens before quota is
      enabled, so we don't have the reserved space for that range.
      
      [FIX]
      In insert_reserved_file_extent(), we tell qgroup to release only the
      number of bytes that btrfs_qgroup_release_data() actually released.
      In the above case, since we don't have the reserved space, we tell
      qgroups to release 0 bytes, so the problem can be fixed.
      
      And thanks to the @reserved parameter introduced by the qgroup rework,
      and previous patch to return released bytes, the fix can be as small as
      10 lines.
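
A toy model of the difference between releasing @ram_bytes unconditionally and releasing only what was actually reserved. All names here (`toy_qgroup`, `toy_range`, `toy_release_data()`, `toy_insert_extent_old()`, `toy_insert_extent_fixed()`) are hypothetical and only illustrate the accounting, not the real call chain.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical qgroup counter; signed so that an underflow shows up as < 0. */
struct toy_qgroup {
	int64_t reserved;
};

/* Hypothetical file range: bytes reserved for it when the write was buffered. */
struct toy_range {
	uint64_t reserved_bytes;
};

/* Release the reservation of @range and return the bytes actually released. */
static uint64_t toy_release_data(struct toy_range *range)
{
	uint64_t released;

	assert(range);
	released = range->reserved_bytes;
	range->reserved_bytes = 0;
	return released;
}

/* Before the fix: always account @ram_bytes, whether or not it was reserved. */
static void toy_insert_extent_old(struct toy_qgroup *qg,
				  struct toy_range *range, uint64_t ram_bytes)
{
	toy_release_data(range);
	qg->reserved -= (int64_t)ram_bytes;
}

/* After the fix: account only what the release step reports. */
static void toy_insert_extent_fixed(struct toy_qgroup *qg,
				    struct toy_range *range)
{
	qg->reserved -= (int64_t)toy_release_data(range);
}
```

For a range written while quota was disabled, `reserved_bytes` is 0, so the old variant drives the counter negative while the fixed variant leaves it unchanged.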
      
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog updates ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      a12b877b
    • Qu Wenruo's avatar
      btrfs: qgroup: Return actually freed bytes for qgroup release or free data · 7bc329c1
      Qu Wenruo authored
      
      
      btrfs_qgroup_release/free_data() only returns 0 or a negative error
      number (ENOMEM is the only possible error).
      
      This is normally good enough, but sometimes we need the exact byte
      count it freed/released.
      
      Change it to return the number of bytes actually released/freed instead
      of 0 for success.
      Also slightly modify the related extent_changeset structure: since in btrfs
      one no-hole data extent won't be larger than 128M, "unsigned int"
      is large enough for the use case.
      
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7bc329c1
    • Qu Wenruo's avatar
      btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function · d1b8b94a
      Qu Wenruo authored
      
      
      Quite a lot of qgroup corruption happens because
      btrfs_qgroup_prepare_account_extents() is called at the wrong time.
      
      Since the safest time to call it is just before
      btrfs_qgroup_account_extents(), there is no need to separate these 2
      functions.
      
      Merging them will make the code cleaner and less bug prone.
      
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog and comment adjustments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      d1b8b94a
    • Qu Wenruo's avatar
      btrfs: qgroup: Add quick exit for non-fs extents · 5edfd9fd
      Qu Wenruo authored
      
      
      Modify btrfs_qgroup_account_extent() to exit quicker for non-fs extents.
      
      The quick exit condition is:
      1) The extent belongs to a non-fs tree
         Only fs-tree extents can affect qgroup numbers, and they are the only
         case where an extent can be shared between different trees.
      
         Although strictly speaking an extent in the data-reloc or tree-reloc
         tree can be shared, the data/tree-reloc root won't appear in the result
         of btrfs_find_all_roots(), so we can ignore such a case.
      
         So we can check the first root in old_roots/new_roots ulist.
         - if we find the 1st root is not a fs/subvol root, then we can skip
           the extent
         - if we find the 1st root is a fs/subvol root, then we must continue
           calculation
      
      OR
      
      2) both 'nr_old_roots' and 'nr_new_roots' are 0
         This means either such extent got allocated then freed in current
         transaction or it's a new reloc tree extent, whose nr_new_roots is 0.
         Either way it won't affect qgroup accounting and can be skipped
         safely.
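
The two conditions above can be sketched as a small predicate. This follows the description in this changelog rather than the kernel code itself: `toy_is_fs_root()` and `toy_can_skip_extent()` are hypothetical helpers, plain arrays stand in for ulists, and the root ID test is a simplification (ID 5 for the top-level fs tree, IDs from 256 up to -256 for subvolumes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified object ID ranges; treat these as assumptions of the sketch. */
#define TOY_FS_TREE_OBJECTID	5ULL
#define TOY_FIRST_FREE_OBJECTID	256ULL
#define TOY_LAST_FREE_OBJECTID	((uint64_t)-256LL)

/* True if @rootid looks like the fs tree or a subvolume tree. */
static bool toy_is_fs_root(uint64_t rootid)
{
	return rootid == TOY_FS_TREE_OBJECTID ||
	       (rootid >= TOY_FIRST_FREE_OBJECTID &&
		rootid <= TOY_LAST_FREE_OBJECTID);
}

/* Return true if qgroup accounting can be skipped for this extent. */
static bool toy_can_skip_extent(const uint64_t *old_roots, size_t nr_old,
				const uint64_t *new_roots, size_t nr_new)
{
	uint64_t first;

	/* Condition 2: no roots before or after, nothing to account. */
	if (nr_old == 0 && nr_new == 0)
		return true;

	assert(nr_new == 0 || new_roots);
	assert(nr_old == 0 || old_roots);

	/*
	 * Condition 1: fs and non-fs trees are assumed not to share extents,
	 * so checking the first root that is available is enough.
	 */
	first = nr_new ? new_roots[0] : old_roots[0];
	return !toy_is_fs_root(first);
}
```

With this predicate, an extent owned only by root 2 (the extent tree) is skipped, while one owned by a subvolume such as root 257 still goes through the full calculation.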
      
      Such a quick exit can make trace output quieter and less confusing:
      (example with fs uuid and time stamp removed)
      
      Before:
      ------
      add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0
      btrfs_qgroup_account_extent: bytenr=29556736 num_bytes=16384 nr_old_roots=0 nr_new_roots=1
      ------
      Extent tree block will trigger btrfs_qgroup_account_extent() trace point
      while no qgroup number is changed, as extent tree won't affect qgroup
      accounting.
      
      After:
      ------
      add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0
      ------
      Now such unrelated extent won't trigger btrfs_qgroup_account_extent()
      trace point, making the trace less noisy.
      
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog and comment adjustments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      5edfd9fd
    • Omar Sandoval's avatar
      Btrfs: rework delayed ref total_bytes_pinned accounting · d7eae340
      Omar Sandoval authored
      The total_bytes_pinned counter is completely broken when accounting
      delayed refs:
      
      - If two drops for the same extent are merged, we will decrement
        total_bytes_pinned twice but only increment it once.
      - If an add is merged into a drop or vice versa, we will decrement the
        total_bytes_pinned counter but never increment it.
      - If multiple references to an extent are dropped, we will account it
        multiple times, potentially vastly over-estimating the number of bytes
        that will be freed by a commit and doing unnecessary work when we're
        close to ENOSPC.
      
      The last issue is relatively minor, but the first two make the
      total_bytes_pinned counter leak or underflow very often. These
      accounting issues were introduced in b150a4f1 ("Btrfs: use a percpu
      to keep track of possibly pinned bytes"), but they were papered over by
      zeroing out the counter on every commit until d288db5d ("Btrfs: fix
      race of using total_bytes_pinned").
      
      We need to make sure that an extent is accounted as pinned exactly once
      if and only if we will drop references to it when the transaction
      is committed. Ideally we would only add to total_bytes_pinned when the
      *last* reference is dropped, but this information isn't readily
      available for data extents. Again, this over-estimation can lead to
      extra commits when we're close to ENOSPC, but it's not as bad as before.
      
      The fix implemented here is to increment total_bytes_pinned when the
      total refmod count for an extent goes negative and decrement it if the
      refmod count goes back to non-negative or after we've run all of the
      delayed refs for that extent.
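
The sign-crossing rule described above can be sketched as follows. This is a simplified model with hypothetical names (`toy_ref_head`, `toy_mod_ref()`); it covers only the increment and decrement on crossing zero, not the final decrement after all delayed refs for the extent have run.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical delayed ref head: extent size plus the summed ref changes. */
struct toy_ref_head {
	uint64_t num_bytes;
	int total_ref_mod;
};

/*
 * Apply @delta (+1 for an added ref, -1 for a dropped ref) and adjust the
 * pinned counter only when the total crosses between >= 0 and < 0.
 */
static void toy_mod_ref(int64_t *total_bytes_pinned,
			struct toy_ref_head *head, int delta)
{
	int old_mod, new_mod;

	assert(total_bytes_pinned && head);
	old_mod = head->total_ref_mod;
	new_mod = old_mod + delta;
	head->total_ref_mod = new_mod;

	if (old_mod >= 0 && new_mod < 0)
		*total_bytes_pinned += (int64_t)head->num_bytes;
	else if (old_mod < 0 && new_mod >= 0)
		*total_bytes_pinned -= (int64_t)head->num_bytes;
}
```

Two drops of the same 16K extent pin 16K once rather than twice, and two later adds bring the counter back to zero exactly once.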
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d7eae340
    • Omar Sandoval's avatar
      Btrfs: return old and new total ref mods when adding delayed refs · 7be07912
      Omar Sandoval authored
      
      
      We need this to decide when to account pinned bytes.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7be07912
    • Omar Sandoval's avatar
      Btrfs: always account pinned bytes when dropping a tree block ref · 0a16c7d7
      Omar Sandoval authored
      
      
      Currently, we only increment total_bytes_pinned in
      btrfs_free_tree_block() when dropping the last reference on the block.
      However, when the delayed ref is run later, we will decrement
      total_bytes_pinned regardless of whether it was the last reference or
      not. This causes the counter to underflow when the reference we dropped
      was not the last reference. Fix it by incrementing the counter
      unconditionally, which is what btrfs_free_extent() does. This makes
      total_bytes_pinned an overestimate when references to shared extents are
      dropped, but in the worst case this will just make us try to commit the
      transaction to try to free up space and find we didn't free enough.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0a16c7d7
    • Omar Sandoval's avatar
      Btrfs: update total_bytes_pinned when pinning down extents · 4da8b76d
      Omar Sandoval authored
      
      
      The extents marked in pin_down_extent() will be unpinned later in
      unpin_extent_range(), which decrements total_bytes_pinned.
      pin_down_extent() must increment the counter to avoid underflowing it.
      Also adjust btrfs_free_tree_block() to avoid accounting for the same
      extent twice.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4da8b76d
    • Omar Sandoval's avatar
      Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() · 55e8196a
      Omar Sandoval authored
      
      
      The value of flags is one of DATA/METADATA/SYSTEM, and these must exist
      when add_pinned_bytes() is called.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ added changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      55e8196a
    • Omar Sandoval's avatar
      Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 · 0d9f824d
      Omar Sandoval authored
      
      
      There are a few places where we pass in a negative num_bytes, so make it
      signed for clarity. Also move it up in the file since later patches will
      need it there.
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0d9f824d
    • David Sterba's avatar
      btrfs: fix validation of XATTR_ITEM dir items · 1164a9fb
      David Sterba authored
      
      
      The XATTR_ITEM is a type of a directory item so we use the common
      validator helper. Unlike other dir items, it can have data. The way the
      name len validation is currently implemented does not reflect that. We'd
      have to adjust by the data_len when comparing the read and item limits.
      
      However, this will not work for multi-item xattr dir items.
      
      Example from tree dump of generic/337:
      
              item 7 key (257 XATTR_ITEM 751495445) itemoff 15667 itemsize 147
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 3 name_len 11
                      name: user.foobar
                      data 123
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 6 name_len 13
                      name: user.WvG1c1Td
                      data qwerty
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 5 name_len 19
                      name: user.J3__T_Km3dVsW_
                      data hello
      
      At the point of the btrfs_is_name_len_valid() call we don't have access to
      the data_len value of the 2nd and 3rd sub-item. So a simple
      btrfs_dir_data_len(leaf, di) would always return 3, although we'd need to
      get 6 and 5 respectively to get the calculations right (read_end +
      name_len + data_len vs item_end).
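
The arithmetic can be checked against the tree dump above with a small sketch. It assumes a 30-byte packed dir item header (17-byte key, 8-byte transid, 2-byte data_len, 2-byte name_len, 1-byte type), which is consistent with the itemsize of 147 in the dump; `toy_sub_item` and both walk helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed header size of one packed dir item (see the lead-in). */
#define TOY_DIR_ITEM_HEADER 30u

/* Hypothetical view of one sub-item: only the two length fields matter here. */
struct toy_sub_item {
	uint16_t data_len;
	uint16_t name_len;
};

/* Walk the sub-items using each one's own data_len; return the end offset. */
static uint32_t toy_walk_end(const struct toy_sub_item *items, size_t n)
{
	uint32_t cur = 0;
	size_t i;

	assert(items || n == 0);
	for (i = 0; i < n; i++)
		cur += TOY_DIR_ITEM_HEADER + items[i].name_len +
		       items[i].data_len;
	return cur;
}

/* Same walk, but wrongly reusing the first sub-item's data_len throughout. */
static uint32_t toy_walk_end_first_data_len(const struct toy_sub_item *items,
					    size_t n)
{
	uint32_t cur = 0;
	size_t i;

	assert(items || n == 0);
	for (i = 0; i < n; i++)
		cur += TOY_DIR_ITEM_HEADER + items[i].name_len +
		       items[0].data_len;
	return cur;
}
```

For the three sub-items in the dump, the correct walk ends at 147, matching itemsize, while reusing the first data_len ends at 142, so a check against item_end based on it would compare the wrong offsets.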
      
      We'd have to also pass data_len externally, which is not the point of the
      name validation. The last check is supposed to test if there's at least
      one dir item space after the one we're processing. I don't think this is
      particularly useful, validation of the next item would catch that too.
      So the check is removed and we don't weaken the validation. Now tests
      btrfs/048, btrfs/053, generic/273 and generic/337 pass.
      
      Signed-off-by: David Sterba <dsterba@suse.com>
      1164a9fb
  7. Jun 22, 2017