  1. Sep 12, 2022
    • xfs: quiet notify_failure EOPNOTSUPP cases · b14d067e
      Dan Williams authored
      Patch series "mm, xfs, dax: Fixes for memory_failure() handling".
      
      I failed to run the memory error injection section of the ndctl test suite
      on linux-next prior to the merge window and as a result some bugs were
      missed.  Although the new enabling targeted reflink-enabled XFS
      filesystems, the bugs cropped up in the surrounding cases of DAX error
      injection on ext4-fsdax and device-dax.
      
      One new assumption / clarification in this set is the notion that if a
      filesystem's ->notify_failure() handler returns -EOPNOTSUPP, then it must
      be the case that the fsdax usage of page->index and page->mapping are
      valid.  I am fairly certain this is true for xfs_dax_notify_failure(), but
      would appreciate another set of eyes.
      
      
      This patch (of 4):
      
      XFS always registers dax_holder_operations regardless of whether the
      filesystem is capable of handling the notifications.  The expectation is
      that if the notify_failure handler cannot run then there are no scenarios
      where it needs to run.  In other words the expected semantic is that
      page->index and page->mapping are valid for memory_failure() when the
      conditions that cause -EOPNOTSUPP in xfs_dax_notify_failure() are present.
      
      A fallback to the generic memory_failure() path is expected so do not warn
      when that happens.
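
The expected semantic can be modeled in userspace as follows. This is a minimal sketch, not the kernel code: `generic_failure_path()` and `handle_failure()` are hypothetical names, and the only behavior taken from the commit message is that -EOPNOTSUPP selects the generic path without a warning.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the generic page->mapping / page->index path. */
static int generic_failure_path(void)
{
    return 0;
}

/*
 * handler_rc models the return value of a filesystem's ->notify_failure().
 * -EOPNOTSUPP is an expected outcome: fall back quietly. Any other error is
 * reported. The two flags record which path ran and whether a warning was
 * printed, so the behavior can be checked.
 */
static int handle_failure(int handler_rc, bool *used_generic, bool *warned)
{
    assert(used_generic != NULL && warned != NULL);
    *used_generic = false;
    *warned = false;

    if (handler_rc == -EOPNOTSUPP) {
        /* Expected fallback: no warning. */
        *used_generic = true;
        return generic_failure_path();
    }
    if (handler_rc < 0) {
        fprintf(stderr, "notify_failure failed: %d\n", handler_rc);
        *warned = true;
    }
    return handler_rc;
}
```

In this model only an unexpected error produces a message; the -EOPNOTSUPP case is treated as a normal control-flow branch.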
      
      Link: https://lkml.kernel.org/r/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com
      Link: https://lkml.kernel.org/r/166153427440.2758201.6709480562966161512.stgit@dwillia2-xfh.jf.intel.com
      Fixes: 6f643c57 ("xfs: implement ->notify_failure() for XFS")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_alloc: fix race condition between build_all_zonelists and page allocation · 3d36424b
      Mel Gorman authored
      Patrick Daly reported the following problem:
      
      	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
      	[0] - ZONE_MOVABLE
      	[1] - ZONE_NORMAL
      	[2] - NULL
      
      	For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
      	offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
      	memory_offline operation removes the last page from ZONE_MOVABLE,
      	build_all_zonelists() & build_zonerefs_node() will update
      	node_zonelists as shown below. Only populated zones are added.
      
      	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
      	[0] - ZONE_NORMAL
      	[1] - NULL
      	[2] - NULL
      
      The race is simple -- page allocation could be in progress when a memory
      hot-remove operation triggers a zonelist rebuild that removes zones.  The
      allocation request will still have a valid ac->preferred_zoneref that is
      now pointing to NULL and triggers an OOM kill.
      
      This problem probably always existed but may be slightly easier to trigger
      due to 6aa303de ("mm, vmscan: only allocate and reclaim from zones
      with pages managed by the buddy allocator") which distinguishes between
      zones that are completely unpopulated versus zones that have valid pages
      not managed by the buddy allocator (e.g.  reserved, memblock, ballooning
      etc).  Memory hotplug had multiple stages with timing considerations
      around managed/present page updates, the zonelist rebuild and the zone
      span updates.  As David Hildenbrand puts it:
      
      	memory offlining adjusts managed+present pages of the zone
      	essentially in one go. If after the adjustments, the zone is no
      	longer populated (present==0), we rebuild the zone lists.
      
      	Once that's done, we try shrinking the zone (start+spanned
      	pages) -- which results in zone_start_pfn == 0 if there are no
      	more pages. That happens *after* rebuilding the zonelists via
      	remove_pfn_range_from_zone().
      
      The only requirement to fix the race is that a page allocation request
      identifies when a zonelist rebuild has happened since the allocation
      request started and no page has yet been allocated.  Use a seqlock_t to
      track zonelist updates with a lockless read-side of the zonelist and
      protecting the rebuild and update of the counter with a spinlock.
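
The read-side retry check can be illustrated with a userspace model. This is a sketch under stated assumptions, not the patch itself: the kernel uses its own seqlock_t primitives, whereas the code below swaps in C11 atomics, and the function names `zonelist_read_begin()`, `zonelist_read_retry()` and `zonelist_rebuild()` are hypothetical. Only the counter name `zonelist_update_seq` appears in the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Even value: zonelists stable. Odd value: a rebuild is in progress. */
static atomic_uint zonelist_update_seq;

/* Reader: sample the counter before starting, waiting out any rebuild. */
static unsigned int zonelist_read_begin(void)
{
    unsigned int seq;

    do {
        seq = atomic_load_explicit(&zonelist_update_seq,
                                   memory_order_acquire);
    } while (seq & 1);
    return seq;
}

/* Reader: true if a rebuild started or finished since the sample. */
static bool zonelist_read_retry(unsigned int seq)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&zonelist_update_seq,
                                memory_order_relaxed) != seq;
}

/*
 * Writer: bump the counter around the rebuild. The real code serializes
 * writers with a spinlock; this model assumes a single writer.
 */
static void zonelist_rebuild(void)
{
    unsigned int prev = atomic_fetch_add_explicit(&zonelist_update_seq, 1,
                                                  memory_order_acq_rel);

    assert((prev & 1) == 0);    /* no rebuild may already be running */
    /* ... rebuild the zonelists here ... */
    atomic_fetch_add_explicit(&zonelist_update_seq, 1,
                              memory_order_release);
}
```

An allocation attempt that fails would call `zonelist_read_retry()` with the value sampled at its start and retry instead of declaring OOM when it returns true.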
      
      [akpm@linux-foundation.org: make zonelist_update_seq static]
      Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.net
      Fixes: 6aa303de ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.9+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ntfs: fix BUG_ON in ntfs_lookup_inode_by_name() · 1b513f61
      ChenXiaoSong authored
      Syzkaller reported BUG_ON as follows:
      
      ------------[ cut here ]------------
      kernel BUG at fs/ntfs/dir.c:86!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
      CPU: 3 PID: 758 Comm: a.out Not tainted 5.19.0-next-20220808 #5
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:ntfs_lookup_inode_by_name+0xd11/0x2d10
      Code: ff e9 b9 01 00 00 e8 1e fe d6 fe 48 8b 7d 98 49 8d 5d 07 e8 91 85 29 ff 48 c7 45 98 00 00 00 00 e9 5a fb ff ff e8 ff fd d6 fe <0f> 0b e8 f8 fd d6 fe 0f 0b e8 f1 fd d6 fe 48 8b b5 50 ff ff ff 4c
      RSP: 0018:ffff888079607978 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000008000 RCX: 0000000000000000
      RDX: ffff88807cf10000 RSI: ffffffff82a4a081 RDI: 0000000000000003
      RBP: ffff888079607a70 R08: 0000000000000001 R09: ffff88807a6d01d7
      R10: ffffed100f4da03a R11: 0000000000000000 R12: ffff88800f0fb110
      R13: ffff88800f0ee000 R14: ffff88800f0fb000 R15: 0000000000000001
      FS:  00007f33b63c7540(0000) GS:ffff888108580000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f33b635c090 CR3: 000000000f39e005 CR4: 0000000000770ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       load_system_files+0x1f7f/0x3620
       ntfs_fill_super+0xa01/0x1be0
       mount_bdev+0x36a/0x440
       ntfs_mount+0x3a/0x50
       legacy_get_tree+0xfb/0x210
       vfs_get_tree+0x8f/0x2f0
       do_new_mount+0x30a/0x760
       path_mount+0x4de/0x1880
       __x64_sys_mount+0x2b3/0x340
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f33b62ff9ea
      Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffd0c471aa8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f33b62ff9ea
      RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffd0c471be0
      RBP: 00007ffd0c471c60 R08: 00007ffd0c471ae0 R09: 00007ffd0c471c24
      R10: 0000000000000000 R11: 0000000000000202 R12: 000055bac5afc160
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       </TASK>
      Modules linked in:
      ---[ end trace 0000000000000000 ]---
      
      Fix this by adding a sanity check on the extended system files' directory
      inode to ensure that it is a directory, just as ntfs_extend_init() does
      when mounting ntfs3.
      
      Link: https://lkml.kernel.org/r/20220809064730.2316892-1-chenxiaosong2@huawei.com
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. Aug 29, 2022
  3. Aug 28, 2022
  4. Aug 27, 2022
    • perf stat: Capitalize topdown metrics' names · 48648548
      Zhengjun Xing authored
      
      
      Capitalize topdown metrics' names to follow the Intel SDM.
      
      Before:
      
       # ./perf stat -a  sleep 1
      
       Performance counter stats for 'system wide':
      
              228,094.05 msec cpu-clock                        #  225.026 CPUs utilized
                     842      context-switches                 #    3.691 /sec
                     224      cpu-migrations                   #    0.982 /sec
                      70      page-faults                      #    0.307 /sec
              23,164,105      cycles                           #    0.000 GHz
              29,403,446      instructions                     #    1.27  insn per cycle
               5,268,185      branches                         #   23.097 K/sec
                  33,239      branch-misses                    #    0.63% of all branches
             136,248,990      slots                            #  597.337 K/sec
              32,976,450      topdown-retiring                 #     24.2% retiring
               4,651,918      topdown-bad-spec                 #      3.4% bad speculation
              26,148,695      topdown-fe-bound                 #     19.2% frontend bound
              72,515,776      topdown-be-bound                 #     53.2% backend bound
               6,008,540      topdown-heavy-ops                #      4.4% heavy operations       #     19.8% light operations
               3,934,049      topdown-br-mispredict            #      2.9% branch mispredict      #      0.5% machine clears
              16,655,439      topdown-fetch-lat                #     12.2% fetch latency          #      7.0% fetch bandwidth
              41,635,972      topdown-mem-bound                #     30.5% memory bound           #     22.7% Core bound
      
             1.013634593 seconds time elapsed
      
      After:
      
       # ./perf stat -a  sleep 1
      
       Performance counter stats for 'system wide':
      
              228,081.94 msec cpu-clock                        #  225.003 CPUs utilized
                     824      context-switches                 #    3.613 /sec
                     224      cpu-migrations                   #    0.982 /sec
                      67      page-faults                      #    0.294 /sec
              22,647,423      cycles                           #    0.000 GHz
              28,870,551      instructions                     #    1.27  insn per cycle
               5,167,099      branches                         #   22.655 K/sec
                  32,383      branch-misses                    #    0.63% of all branches
             133,411,074      slots                            #  584.926 K/sec
              32,352,607      topdown-retiring                 #     24.3% Retiring
               4,456,977      topdown-bad-spec                 #      3.3% Bad Speculation
              25,626,487      topdown-fe-bound                 #     19.2% Frontend Bound
              70,955,316      topdown-be-bound                 #     53.2% Backend Bound
               5,834,844      topdown-heavy-ops                #      4.4% Heavy Operations       #     19.9% Light Operations
               3,738,781      topdown-br-mispredict            #      2.8% Branch Mispredict      #      0.5% Machine Clears
              16,286,803      topdown-fetch-lat                #     12.2% Fetch Latency          #      7.0% Fetch Bandwidth
              40,802,069      topdown-mem-bound                #     30.6% Memory Bound           #     22.6% Core Bound
      
             1.013683125 seconds time elapsed
      
      Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220825015458.3252239-1-zhengjun.xing@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf docs: Update the documentation for the save_type filter · 3126204c
      Kan Liang authored
      
      
      Update the documentation to reflect the kernel changes.
      
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: https://lore.kernel.org/r/20220816125612.2042397-2-kan.liang@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf sched: Fix memory leaks in __cmd_record detected with -fsanitize=address · d72e5cf3
      Ian Rogers authored
      
      
      An array of strings is passed to cmd_record but not freed. As
      cmd_record modifies the array, add another array as a copy that can be
      mutated allowing the original array contents to all be freed.
      
      Detected with -fsanitize=address.
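
The copy-then-free pattern described above can be sketched as follows. It is a minimal model, not the perf source: `mutating_callee()` stands in for a callee that rewrites its argv, and `dup_string()` and `run_with_owned_args()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callee that overwrites argv entries, as an option parser may. */
static int mutating_callee(int argc, const char **argv)
{
    for (int i = 0; i < argc; i++)
        argv[i] = NULL;
    return argc;
}

/* Portable replacement for strdup(). */
static char *dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy)
        memcpy(copy, s, len);
    return copy;
}

/*
 * Keep the allocated strings in `owned`, hand the callee a second array
 * (`scratch`) holding the same pointers, and free through `owned` afterwards
 * so nothing leaks even though the callee mutated its array.
 */
static int run_with_owned_args(const char *const *args, int argc)
{
    char **owned;
    const char **scratch;
    int ret = -1, n = 0;

    assert(argc >= 0);
    owned = calloc((size_t)argc + 1, sizeof(*owned));
    scratch = calloc((size_t)argc + 1, sizeof(*scratch));
    if (!owned || !scratch)
        goto out;

    for (n = 0; n < argc; n++) {
        owned[n] = dup_string(args[n]);
        if (!owned[n])
            goto out;
    }
    memcpy(scratch, owned, sizeof(*scratch) * (size_t)argc);
    ret = mutating_callee(argc, scratch);
out:
    for (int i = 0; i < n; i++)
        free(owned[i]);
    free(owned);
    free(scratch);
    return ret;
}
```

Freeing through the untouched array is what makes the cleanup independent of whatever the callee did to the copy.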
      
      Signed-off-by: Ian Rogers <irogers@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: https://lore.kernel.org/r/20220824145733.409005-1-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf record: Fix manpage formatting of description of support to hybrid systems · e89eaa61
      Andi Kleen authored
      
      
      The Intel hybrid description is written in a different style than the
      rest of the perf record man page. There were some new command line
      options added after it which resulted in very strange section ordering.
      Move the hybrid include last.
      
      Also, the subsections in the hybrid document don't fit the record
      manpage well (especially since it talks about all kinds of unrelated
      commands). I left this for now, but it would be better to separate this
      properly into the different man pages.
      
      It would be better to use sub sections for the other sections, but these
      don't seem to be supported in AsciiDoc?
      
      Some of the examples are still misrendered in the manpage with an
      indented troff command, but I don't know how to fix that.
      
      In any case it's now better than before.
      
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: zhengjun.xing@intel.com
      Link: https://lore.kernel.org/r/20220818100127.249401-1-ak@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>