Commit 6abaa83c authored by Linus Torvalds
Pull f2fs updates from Jaegeuk Kim:
 "In this cycle, we've addressed some performance issues such as lock
  contention, misbehaving compress_cache, allowing extent_cache for
  compressed files, and new sysfs to adjust ra_size for fadvise.

  In order to diagnose the performance issues quickly, we also added an
  iostat which shows the IO latencies periodically.

  On the stability side, we've found two memory leakage cases in the
  error path in compression flow. And, we've also fixed various corner
  cases in fiemap, quota, checkpoint=disable, zstd, and so on.

  Enhancements:
   - avoid long checkpoint latency by releasing nat_tree_lock
   - collect and show iostats periodically
   - support extent_cache for compressed files
   - add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
   - report f2fs GC status via sysfs
   - add discard_unit=%s in mount option to handle zoned device

  Bug fixes:
   - fix two memory leakages when an error happens in the compressed IO flow
   - fix compress_cache to get the right LBA
   - fix fiemap to deal with compressed case correctly
   - fix wrong EIO returns due to SBI_NEED_FSCK
   - fix missing writes when enabling checkpoint back
   - fix quota deadlock
   - fix zstd level mount option

  In addition to the above major updates, we've cleaned up several code
  paths such as dio, unnecessary operations, debugfs/f2fs/status, sanity
  check, and typos"

* tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (46 commits)
  f2fs: should put a page beyond EOF when preparing a write
  f2fs: deallocate compressed pages when error happens
  f2fs: enable realtime discard iff device supports discard
  f2fs: guarantee to write dirty data when enabling checkpoint back
  f2fs: fix to unmap pages from userspace process in punch_hole()
  f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()
  f2fs: fix to account missing .skipped_gc_rwsem
  f2fs: adjust unlock order for cleanup
  f2fs: Don't create discard thread when device doesn't support realtime discard
  f2fs: rebuild nat_bits during umount
  f2fs: introduce periodic iostat io latency traces
  f2fs: separate out iostat feature
  f2fs: compress: do sanity check on cluster
  f2fs: fix description about main_blkaddr node
  f2fs: convert S_IRUGO to 0444
  f2fs: fix to keep compatibility of fault injection interface
  f2fs: support fault injection for f2fs_kmem_cache_alloc()
  f2fs: compress: allow write compress released file after truncate to zero
  f2fs: correct comment in segment.h
  f2fs: improve sbi status info in debugfs/f2fs/status
  ...
parents 0961f0c0 9605f75c
+21 −2
@@ -41,8 +41,7 @@ Description: This parameter controls the number of prefree segments to be
What:		/sys/fs/f2fs/<disk>/main_blkaddr
Date:		November 2019
Contact:	"Ramon Pantin" <pantin@google.com>
-Description:
-		 Shows first block address of MAIN area.
+Description:	Shows first block address of MAIN area.

What:		/sys/fs/f2fs/<disk>/ipu_policy
Date:		November 2013
@@ -493,3 +492,23 @@ Contact: "Chao Yu" <yuchao0@huawei.com>
Description:	When ATGC is on, it controls age threshold to bypass GCing young
		candidates whose age is not beyond the threshold, by default it was
		initialized as 604800 seconds (equals to 7 days).

+What:		/sys/fs/f2fs/<disk>/gc_reclaimed_segments
+Date:		July 2021
+Contact:	"Daeho Jeong" <daehojeong@google.com>
+Description:	Show how many segments have been reclaimed by GC during a specific
+		GC mode (0: GC normal, 1: GC idle CB, 2: GC idle greedy,
+		3: GC idle AT, 4: GC urgent high, 5: GC urgent low)
+		You can re-initialize this value to "0".
+
+What:		/sys/fs/f2fs/<disk>/gc_segment_mode
+Date:		July 2021
+Contact:	"Daeho Jeong" <daehojeong@google.com>
+Description:	You can control for which gc mode the "gc_reclaimed_segments" node shows.
+		Refer to the description of the modes in "gc_reclaimed_segments".
+
+What:		/sys/fs/f2fs/<disk>/seq_file_ra_mul
+Date:		July 2021
+Contact:	"Daeho Jeong" <daehojeong@google.com>
+Description:	You can	control the multiplier value of	bdi device readahead window size
+		between 2 (default) and 256 for POSIX_FADV_SEQUENTIAL advise option.
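
The seq_file_ra_mul entry only matters for files that a process has marked with POSIX_FADV_SEQUENTIAL. As a rough userspace illustration of issuing that hint (not part of this commit), the sketch below opens a file and advises sequential access over its whole length. The helper name advise_sequential is hypothetical; the multiplier itself would be set separately through the sysfs node.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a file read-only and tell the kernel it will be
 * read sequentially. Returns the open fd on success, -1 on failure.
 */
static int advise_sequential(const char *path)
{
	int fd, err;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* offset 0 with len 0 covers the whole file; returns 0 or an errno value */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
	if (err != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller would read from the returned descriptor as usual and close it when done; whether the larger readahead window actually helps depends on the workload.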
+15 −2
@@ -185,6 +185,7 @@ fault_type=%d Support configuring fault injection type, should be
			 FAULT_KVMALLOC		  0x000000002
			 FAULT_PAGE_ALLOC	  0x000000004
			 FAULT_PAGE_GET		  0x000000008
+			 FAULT_ALLOC_BIO	  0x000000010 (obsolete)
			 FAULT_ALLOC_NID	  0x000000020
			 FAULT_ORPHAN		  0x000000040
			 FAULT_BLOCK		  0x000000080
@@ -195,6 +196,7 @@ fault_type=%d Support configuring fault injection type, should be
			 FAULT_CHECKPOINT	  0x000001000
			 FAULT_DISCARD		  0x000002000
			 FAULT_WRITE_IO		  0x000004000
+			 FAULT_SLAB_ALLOC	  0x000008000
			 ===================	  ===========
mode=%s			 Control block allocation mode which supports "adaptive"
			 and "lfs". In "lfs" mode, there should be no random
@@ -312,6 +314,14 @@ inlinecrypt When possible, encrypt/decrypt the contents of encrypted
			 Documentation/block/inline-encryption.rst.
atgc			 Enable age-threshold garbage collection, it provides high
			 effectiveness and efficiency on background GC.
+discard_unit=%s		 Control discard unit, the argument can be "block", "segment"
+			 and "section", issued discard command's offset/size will be
+			 aligned to the unit, by default, "discard_unit=block" is set,
+			 so that small discard functionality is enabled.
+			 For blkzoned device, "discard_unit=section" will be set by
+			 default, it is helpful for large sized SMR or ZNS devices to
+			 reduce memory cost by getting rid of fs metadata supports small
+			 discard.
======================== ============================================================
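
The discard_unit text says that a discard's offset and size are aligned to the chosen unit. One plausible reading is that a candidate range is shrunk inward so that only whole units are discarded. The sketch below shows that arithmetic only; it is an assumption about the intent, not the kernel's code, and struct blk_range and align_to_unit are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative half-open block range [start, end), in filesystem blocks. */
struct blk_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Shrink a range inward so both ends fall on multiples of unit_blocks.
 * The caller picks unit_blocks: 1 for "block", blocks per segment for
 * "segment", blocks per section for "section" (assumed mapping).
 * Returns an empty range (start == end) if no whole unit fits.
 */
static struct blk_range align_to_unit(struct blk_range r, uint64_t unit_blocks)
{
	struct blk_range out;

	assert(unit_blocks > 0 && r.start <= r.end);

	/* round the start up and the end down to a unit boundary */
	out.start = (r.start + unit_blocks - 1) / unit_blocks * unit_blocks;
	out.end = r.end / unit_blocks * unit_blocks;
	if (out.end < out.start)
		out.end = out.start;
	return out;
}
```

For example, with an assumed unit of 512 blocks, the range [100, 1500) shrinks to [512, 1024), while [100, 400) contains no whole unit and becomes empty.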

Debugfs Entries
@@ -857,8 +867,11 @@ Compression implementation
  directly in order to guarantee potential data updates later to the space.
  Instead, the main goal is to reduce data writes to flash disk as much as
  possible, resulting in extending disk life time as well as relaxing IO
-  congestion. Alternatively, we've added ioctl interface to reclaim compressed
-  space and show it to user after putting the immutable bit.
+  congestion. Alternatively, we've added ioctl(F2FS_IOC_RELEASE_COMPRESS_BLOCKS)
+  interface to reclaim compressed space and show it to user after putting the
+  immutable bit. Immutable bit, after release, it doesn't allow writing/mmaping
+  on the file, until reserving compressed space via
+  ioctl(F2FS_IOC_RESERVE_COMPRESS_BLOCKS) or truncating filesize to zero.

Compress metadata layout::

+13 −6
@@ -105,6 +105,13 @@ config F2FS_FS_LZO
	help
	  Support LZO compress algorithm, if unsure, say Y.

+config F2FS_FS_LZORLE
+	bool "LZO-RLE compression support"
+	depends on F2FS_FS_LZO
+	default y
+	help
+	  Support LZO-RLE compress algorithm, if unsure, say Y.
+
config F2FS_FS_LZ4
	bool "LZ4 compression support"
	depends on F2FS_FS_COMPRESSION
@@ -114,7 +121,6 @@ config F2FS_FS_LZ4

config F2FS_FS_LZ4HC
	bool "LZ4HC compression support"
-	depends on F2FS_FS_COMPRESSION
	depends on F2FS_FS_LZ4
	default y
	help
@@ -128,10 +134,11 @@ config F2FS_FS_ZSTD
	help
	  Support ZSTD compress algorithm, if unsure, say Y.

-config F2FS_FS_LZORLE
-	bool "LZO-RLE compression support"
-	depends on F2FS_FS_COMPRESSION
-	depends on F2FS_FS_LZO
+config F2FS_IOSTAT
+	bool "F2FS IO statistics information"
+	depends on F2FS_FS
	default y
	help
-	  Support LZO-RLE compress algorithm, if unsure, say Y.
+	  Support getting IO statistics through sysfs and printing out periodic
+	  IO statistics tracepoint events. You have to turn on "iostat_enable"
+	  sysfs node to enable this feature.
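
The help text notes that the feature stays off until the "iostat_enable" sysfs node is turned on. A minimal userspace sketch of writing a value to such a node follows. write_node is a hypothetical helper, and the node path is assumed to follow the /sys/fs/f2fs/<disk>/ convention used by the other entries in this commit.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a short string to a sysfs-style node.
 * Returns 0 on success, -1 on any error (open, short write, close).
 */
static int write_node(const char *path, const char *value)
{
	size_t len;
	ssize_t n;
	int fd, rc;

	assert(path != NULL && value != NULL);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	len = strlen(value);
	n = write(fd, value, len);
	rc = (n == (ssize_t)len) ? 0 : -1;

	if (close(fd) != 0)
		rc = -1;
	return rc;
}
```

Under those assumptions, enabling the statistics would be a call such as write_node("/sys/fs/f2fs/<disk>/iostat_enable", "1"), with <disk> replaced by the actual device name and sufficient privileges.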
+1 −0
@@ -9,3 +9,4 @@ f2fs-$(CONFIG_F2FS_FS_XATTR) += xattr.o
f2fs-$(CONFIG_F2FS_FS_POSIX_ACL) += acl.o
f2fs-$(CONFIG_FS_VERITY) += verity.o
f2fs-$(CONFIG_F2FS_FS_COMPRESSION) += compress.o
+f2fs-$(CONFIG_F2FS_IOSTAT) += iostat.o
+43 −14
@@ -18,6 +18,7 @@
#include "f2fs.h"
#include "node.h"
#include "segment.h"
+#include "iostat.h"
#include <trace/events/f2fs.h>

#define DEFAULT_CHECKPOINT_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
@@ -465,16 +466,29 @@ static void __add_ino_entry(struct f2fs_sb_info *sbi, nid_t ino,
						unsigned int devidx, int type)
{
	struct inode_management *im = &sbi->im[type];
-	struct ino_entry *e, *tmp;
+	struct ino_entry *e = NULL, *new = NULL;

+	if (type == FLUSH_INO) {
+		rcu_read_lock();
+		e = radix_tree_lookup(&im->ino_root, ino);
+		rcu_read_unlock();
+	}
+
-	tmp = f2fs_kmem_cache_alloc(ino_entry_slab, GFP_NOFS);
+retry:
+	if (!e)
+		new = f2fs_kmem_cache_alloc(ino_entry_slab,
+						GFP_NOFS, true, NULL);

	radix_tree_preload(GFP_NOFS | __GFP_NOFAIL);

	spin_lock(&im->ino_lock);
	e = radix_tree_lookup(&im->ino_root, ino);
	if (!e) {
-		e = tmp;
+		if (!new) {
+			spin_unlock(&im->ino_lock);
+			goto retry;
+		}
+		e = new;
		if (unlikely(radix_tree_insert(&im->ino_root, ino, e)))
			f2fs_bug_on(sbi, 1);

@@ -492,8 +506,8 @@ static void __add_ino_entry(struct f2fs_sb_info *sbi, nid_t ino,
	spin_unlock(&im->ino_lock);
	radix_tree_preload_end();

-	if (e != tmp)
-		kmem_cache_free(ino_entry_slab, tmp);
+	if (new && e != new)
+		kmem_cache_free(ino_entry_slab, new);
}
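
The reworked __add_ino_entry() skips the slab allocation when a lookup already finds the entry, and retries if the entry has disappeared by the time the lock is taken. The userspace sketch below mirrors that control flow only. It is a simplification under stated assumptions: a fixed array stands in for the radix tree, a C11 atomic_flag stands in for the spinlock, the pre-check takes the lock instead of using RCU, the pre-check is unconditional rather than limited to FLUSH_INO, and malloc may fail where the kernel allocation is treated as non-failing. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define TABLE_SIZE 64

struct entry {
	unsigned int ino;
};

static struct entry *table[TABLE_SIZE];            /* stand-in for the radix tree */
static atomic_flag table_lock = ATOMIC_FLAG_INIT;  /* stand-in for im->ino_lock */
static int alloc_count;                            /* counts allocations, for inspection */

static void lock(void)   { while (atomic_flag_test_and_set(&table_lock)) { } }
static void unlock(void) { atomic_flag_clear(&table_lock); }

static struct entry *add_entry(unsigned int ino)
{
	struct entry *e, *new = NULL;

	assert(ino < TABLE_SIZE);

	/* Pre-check: if the entry already exists, no allocation is needed. */
	lock();
	e = table[ino];
	unlock();
retry:
	if (!e) {
		new = malloc(sizeof(*new));
		if (!new)
			return NULL;
		alloc_count++;
	}

	lock();
	e = table[ino];
	if (!e) {
		if (!new) {
			/* Entry vanished after the pre-check: allocate and retry. */
			unlock();
			goto retry;
		}
		new->ino = ino;
		table[ino] = new;
		e = new;
	}
	unlock();

	/* Someone else inserted first, so the spare allocation is dropped. */
	if (new && e != new)
		free(new);
	return e;
}
```

Calling add_entry() twice with the same number returns the same pointer and allocates only once, which is the saving the pre-check is after.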

static void __remove_ino_entry(struct f2fs_sb_info *sbi, nid_t ino, int type)
@@ -1289,12 +1303,20 @@ static void update_ckpt_flags(struct f2fs_sb_info *sbi, struct cp_control *cpc)
	struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
	unsigned long flags;

-	spin_lock_irqsave(&sbi->cp_lock, flags);
+	if (cpc->reason & CP_UMOUNT) {
+		if (le32_to_cpu(ckpt->cp_pack_total_block_count) >
+			sbi->blocks_per_seg - NM_I(sbi)->nat_bits_blocks) {
+			clear_ckpt_flags(sbi, CP_NAT_BITS_FLAG);
+			f2fs_notice(sbi, "Disable nat_bits due to no space");
+		} else if (!is_set_ckpt_flags(sbi, CP_NAT_BITS_FLAG) &&
+						f2fs_nat_bitmap_enabled(sbi)) {
+			f2fs_enable_nat_bits(sbi);
+			set_ckpt_flags(sbi, CP_NAT_BITS_FLAG);
+			f2fs_notice(sbi, "Rebuild and enable nat_bits");
+		}
+	}

-	if ((cpc->reason & CP_UMOUNT) &&
-			le32_to_cpu(ckpt->cp_pack_total_block_count) >
-			sbi->blocks_per_seg - NM_I(sbi)->nat_bits_blocks)
-		disable_nat_bits(sbi, false);
+	spin_lock_irqsave(&sbi->cp_lock, flags);

	if (cpc->reason & CP_TRIMMED)
		__set_ckpt_flags(ckpt, CP_TRIMMED_FLAG);
@@ -1480,7 +1502,8 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc)
	start_blk = __start_cp_next_addr(sbi);

	/* write nat bits */
-	if (enabled_nat_bits(sbi, cpc)) {
+	if ((cpc->reason & CP_UMOUNT) &&
+			is_set_ckpt_flags(sbi, CP_NAT_BITS_FLAG)) {
		__u64 cp_ver = cur_cp_version(ckpt);
		block_t blk;

@@ -1639,8 +1662,11 @@ int f2fs_write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc)

	/* write cached NAT/SIT entries to NAT/SIT area */
	err = f2fs_flush_nat_entries(sbi, cpc);
-	if (err)
+	if (err) {
+		f2fs_err(sbi, "f2fs_flush_nat_entries failed err:%d, stop checkpoint", err);
+		f2fs_bug_on(sbi, !f2fs_cp_error(sbi));
		goto stop;
+	}

	f2fs_flush_sit_entries(sbi, cpc);

@@ -1648,10 +1674,13 @@ int f2fs_write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc)
	f2fs_save_inmem_curseg(sbi);

	err = do_checkpoint(sbi, cpc);
-	if (err)
+	if (err) {
+		f2fs_err(sbi, "do_checkpoint failed err:%d, stop checkpoint", err);
+		f2fs_bug_on(sbi, !f2fs_cp_error(sbi));
		f2fs_release_discard_addrs(sbi);
-	else
+	} else {
		f2fs_clear_prefree_segments(sbi, cpc);
+	}

	f2fs_restore_inmem_curseg(sbi);
stop: