Skip to content
  1. Feb 09, 2024
    • Masami Hiramatsu (Google)'s avatar
      ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default · a8b9cf62
      Masami Hiramatsu (Google) authored
      The commit 60c89718 ("ftrace: Make DIRECT_CALLS work WITH_ARGS
      and !WITH_REGS") changed DIRECT_CALLS to use SAVE_ARGS when there
      are multiple ftrace_ops at the same function, but since the x86 only
      support to jump to direct_call from ftrace_regs_caller, when we set
      the function tracer on the same target function on x86, ftrace-direct
      does not work as below (this actually works on arm64.)
      
      At first, insmod ftrace-direct.ko to put a direct_call on
      'wake_up_process()'.
      
       # insmod kernel/samples/ftrace/ftrace-direct.ko
       # less trace
      ...
                <idle>-0       [006] ..s1.   564.686958: my_direct_func: waking up rcu_preempt-17
                <idle>-0       [007] ..s1.   564.687836: my_direct_func: waking up kcompactd0-63
                <idle>-0       [006] ..s1.   564.690926: my_direct_func: waking up rcu_preempt-17
                <idle>-0       [006] ..s1.   564.696872: my_direct_func: waking up rcu_preempt-17
                <idle>-0       [007] ..s1.   565.191982: my_direct_func: waking up kcompactd0-63
      
      Setup a function filter to the 'wake_up_process' too, and enable it.
      
       # cd /sys/kernel/tracing/
       # echo wake_up_process > set_ftrace_filter
       # echo function > current_tracer
       # less trace
      ...
                <idle>-0       [006] ..s3.   686.180972: wake_up_process <-call_timer_fn
                <idle>-0       [006] ..s3.   686.186919: wake_up_process <-call_timer_fn
                <idle>-0       [002] ..s3.   686.264049: wake_up_process <-call_timer_fn
                <idle>-0       [002] d.h6.   686.515216: wake_up_process <-kick_pool
                <idle>-0       [002] d.h6.   686.691386: wake_up_process <-kick_pool
      
      Then, only function tracer is shown on x86.
      But if you enable 'kprobe on ftrace' event (which uses SAVE_REGS flag)
      on the same function, it is shown again.
      
       # echo 'p wake_up_process' >> dynamic_events
       # echo 1 > events/kprobes/p_wake_up_process_0/enable
       # echo > trace
       # less trace
      ...
                <idle>-0       [006] ..s2.  2710.345919: p_wake_up_process_0: (wake_up_process+0x4/0x20)
                <idle>-0       [006] ..s3.  2710.345923: wake_up_process <-call_timer_fn
                <idle>-0       [006] ..s1.  2710.345928: my_direct_func: waking up rcu_preempt-17
                <idle>-0       [006] ..s2.  2710.349931: p_wake_up_process_0: (wake_up_process+0x4/0x20)
                <idle>-0       [006] ..s3.  2710.349934: wake_up_process <-call_timer_fn
                <idle>-0       [006] ..s1.  2710.349937: my_direct_func: waking up rcu_preempt-17
      
      To fix this issue, use SAVE_REGS flag for multiple ftrace_ops flag of
      direct_call by default.
      
      Link: https://lore.kernel.org/linux-trace-kernel/170484558617.178953.1590516949390270842.stgit@devnote2
      
      Fixes: 60c89718
      
       ("ftrace: Make DIRECT_CALLS work WITH_ARGS and !WITH_REGS")
      Cc: stable@vger.kernel.org
      Cc: Florent Revest <revest@chromium.org>
      Signed-off-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Reviewed-by: default avatarMark Rutland <mark.rutland@arm.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
      Acked-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      a8b9cf62
  2. Feb 02, 2024
    • Steven Rostedt (Google)'s avatar
      eventfs: Keep all directory links at 1 · ca185770
      Steven Rostedt (Google) authored
      The directory link count in eventfs was somewhat bogus. It was only being
      updated when a directory child was being looked up and not on creation.
      
      One solution would be to update in get_attr() the link count by iterating
      the ei->children list and then adding 2. But that could slow down simple
      stat() calls, especially if it's done on all directories in eventfs.
      
      Another solution would be to add a parent pointer to the eventfs_inode
      and keep track of the number of sub directories it has on creation. But
      this adds overhead for something not really worthwhile.
      
      The solution decided upon is to keep all directory links in eventfs as 1.
      This tells user space not to rely on the hard links of directories. Which
      in this case it shouldn't.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240201002719.GS2087318@ZenIV/
      Link: https://lore.kernel.org/linux-trace-kernel/20240201161617.339968298@goodmis.org
      
      Cc: stable@vger.kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Fixes: c1504e51
      
       ("eventfs: Implement eventfs dir creation functions")
      Suggested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      ca185770
    • Steven Rostedt (Google)'s avatar
      eventfs: Remove fsnotify*() functions from lookup() · 12d823b3
      Steven Rostedt (Google) authored
      The dentries and inodes are created when referenced in the lookup code.
      There's no reason to call fsnotify_*() functions when they are created by
      a reference. It doesn't make any sense.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240201002719.GS2087318@ZenIV/
      Link: https://lore.kernel.org/linux-trace-kernel/20240201161617.166973329@goodmis.org
      
      Cc: stable@vger.kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Fixes: a3760079
      
       ("eventfs: Implement functions to create files and dirs when accessed");
      Suggested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      12d823b3
    • Steven Rostedt (Google)'s avatar
      eventfs: Restructure eventfs_inode structure to be more condensed · 264424df
      Steven Rostedt (Google) authored
      Some of the eventfs_inode structure has holes in it. Rework the structure
      to be a bit more condensed, and also remove the no longer used llist
      field.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240201161617.002321438@goodmis.org
      
      
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      264424df
    • Steven Rostedt (Google)'s avatar
      eventfs: Warn if an eventfs_inode is freed without is_freed being set · 5a49f996
      Steven Rostedt (Google) authored
      There should never be a case where an evenfs_inode is being freed without
      is_freed being set. Add a WARN_ON_ONCE() if it ever happens. That would
      mean there was one too many put_ei()s.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240201161616.843551963@goodmis.org
      
      
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      5a49f996
    • Daniel Bristot de Oliveira's avatar
      tracing/timerlat: Move hrtimer_init to timerlat_fd open() · 1389358b
      Daniel Bristot de Oliveira authored
      Currently, the timerlat's hrtimer is initialized at the first read of
      timerlat_fd, and destroyed at close(). It works, but it causes an error
      if the user program open() and close() the file without reading.
      
      Here's an example:
      
       # echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options
       # echo timerlat > /sys/kernel/debug/tracing/current_tracer
      
       # cat <<EOF > ./timerlat_load.py
       # !/usr/bin/env python3
      
       timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r')
       timerlat_fd.close();
       EOF
      
       # ./taskset -c 0 ./timerlat_load.py
      <BOOM>
      
       BUG: kernel NULL pointer dereference, address: 0000000000000010
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 1 PID: 2673 Comm: python3 Not tainted 6.6.13-200.fc39.x86_64 #1
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
       RIP: 0010:hrtimer_active+0xd/0x50
       Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 10 a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d
       RSP: 0018:ffffb031009b7e10 EFLAGS: 00010286
       RAX: 000000000002db00 RBX: ffff9118f786db08 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffff9117a0e64400 RDI: ffff9118f786db08
       RBP: ffff9118f786db80 R08: ffff9117a0ddd420 R09: ffff9117804d4f70
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff9118f786db08
       R13: ffff91178fdd5e20 R14: ffff9117840978c0 R15: 0000000000000000
       FS:  00007f2ffbab1740(0000) GS:ffff9118f7840000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000010 CR3: 00000001b402e000 CR4: 0000000000750ee0
       PKRU: 55555554
       Call Trace:
        <TASK>
        ? __die+0x23/0x70
        ? page_fault_oops+0x171/0x4e0
        ? srso_alias_return_thunk+0x5/0x7f
        ? avc_has_extended_perms+0x237/0x520
        ? exc_page_fault+0x7f/0x180
        ? asm_exc_page_fault+0x26/0x30
        ? hrtimer_active+0xd/0x50
        hrtimer_cancel+0x15/0x40
        timerlat_fd_release+0x48/0xe0
        __fput+0xf5/0x290
        __x64_sys_close+0x3d/0x80
        do_syscall_64+0x60/0x90
        ? srso_alias_return_thunk+0x5/0x7f
        ? __x64_sys_ioctl+0x72/0xd0
        ? srso_alias_return_thunk+0x5/0x7f
        ? syscall_exit_to_user_mode+0x2b/0x40
        ? srso_alias_return_thunk+0x5/0x7f
        ? do_syscall_64+0x6c/0x90
        ? srso_alias_return_thunk+0x5/0x7f
        ? exit_to_user_mode_prepare+0x142/0x1f0
        ? srso_alias_return_thunk+0x5/0x7f
        ? syscall_exit_to_user_mode+0x2b/0x40
        ? srso_alias_return_thunk+0x5/0x7f
        ? do_syscall_64+0x6c/0x90
        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
       RIP: 0033:0x7f2ffb321594
       Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d
       RSP: 002b:00007ffe8d8eef18 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
       RAX: ffffffffffffffda RBX: 00007f2ffba4e668 RCX: 00007f2ffb321594
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
       RBP: 00007ffe8d8eef40 R08: 0000000000000000 R09: 0000000000000000
       R10: 55c926e3167eae79 R11: 0000000000000202 R12: 0000000000000003
       R13: 00007ffe8d8ef030 R14: 0000000000000000 R15: 00007f2ffba4e668
        </TASK>
       CR2: 0000000000000010
       ---[ end trace 0000000000000000 ]---
      
      Move hrtimer_init to timerlat_fd open() to avoid this problem.
      
      Link: https://lore.kernel.org/linux-trace-kernel/7324dd3fc0035658c99b825204a66049389c56e3.1706798888.git.bristot@kernel.org
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: stable@vger.kernel.org
      Fixes: e88ed227
      
       ("tracing/timerlat: Add user-space interface")
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      1389358b
  3. Feb 01, 2024
  4. Jan 29, 2024
  5. Jan 27, 2024
  6. Jan 23, 2024
    • Steven Rostedt (Google)'s avatar
      eventfs: Save directory inodes in the eventfs_inode structure · 834bf76a
      Steven Rostedt (Google) authored
      The eventfs inodes and directories are allocated when referenced. But this
      leaves the issue of keeping consistent inode numbers and the number is
      only saved in the inode structure itself. When the inode is no longer
      referenced, it can be freed. When the file that the inode was representing
      is referenced again, the inode is once again created, but the inode number
      needs to be the same as it was before.
      
      Just making the inode numbers the same for all files is fine, but that
      does not work with directories. The find command will check for loops via
      the inode number and having the same inode number for directories triggers:
      
        # find /sys/kernel/tracing
      find: File system loop detected;
      '/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same file system loop as
      '/sys/kernel/debug/tracing/events/initcall'.
      [..]
      
      Linus pointed out that the eventfs_inode structure ends with a single
      32bit int, and on 64 bit machines, there's likely a 4 byte hole due to
      alignment. We can use this hole to store the inode number for the
      eventfs_inode. All directories in eventfs are represented by an
      eventfs_inode and that data structure can hold its inode number.
      
      That last int was also purposely placed at the end of the structure to
      prevent holes from within. Now that there's a 4 byte number to hold the
      inode, both the inode number and the last integer can be moved up in the
      structure for better cache locality, where the llist and rcu fields can be
      moved to the end as they are only used when the eventfs_inode is being
      deleted.
      
      Link: https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=X+c7x0Eewxn-YRdCA@mail.gmail.com/
      Link: https://lore.kernel.org/linux-trace-kernel/20240122152748.46897388@gandalf.local.home
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Fixes: 53c41052
      
       ("eventfs: Have the inodes all for files and directories all be the same")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      834bf76a
    • Petr Pavlu's avatar
      tracing: Ensure visibility when inserting an element into tracing_map · 2b447606
      Petr Pavlu authored
      Running the following two commands in parallel on a multi-processor
      AArch64 machine can sporadically produce an unexpected warning about
      duplicate histogram entries:
      
       $ while true; do
           echo hist:key=id.syscall:val=hitcount > \
             /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
           cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
           sleep 0.001
         done
       $ stress-ng --sysbadaddr $(nproc)
      
      The warning looks as follows:
      
      [ 2911.172474] ------------[ cut here ]------------
      [ 2911.173111] Duplicates detected: 1
      [ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
      [ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
      [ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
      [ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G            E      6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
      [ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
      [ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
      [ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
      [ 2911.185310] sp : ffff8000a1513900
      [ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
      [ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
      [ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
      [ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
      [ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
      [ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
      [ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
      [ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
      [ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
      [ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
      [ 2911.194259] Call trace:
      [ 2911.194626]  tracing_map_sort_entries+0x3e0/0x408
      [ 2911.195220]  hist_show+0x124/0x800
      [ 2911.195692]  seq_read_iter+0x1d4/0x4e8
      [ 2911.196193]  seq_read+0xe8/0x138
      [ 2911.196638]  vfs_read+0xc8/0x300
      [ 2911.197078]  ksys_read+0x70/0x108
      [ 2911.197534]  __arm64_sys_read+0x24/0x38
      [ 2911.198046]  invoke_syscall+0x78/0x108
      [ 2911.198553]  el0_svc_common.constprop.0+0xd0/0xf8
      [ 2911.199157]  do_el0_svc+0x28/0x40
      [ 2911.199613]  el0_svc+0x40/0x178
      [ 2911.200048]  el0t_64_sync_handler+0x13c/0x158
      [ 2911.200621]  el0t_64_sync+0x1a8/0x1b0
      [ 2911.201115] ---[ end trace 0000000000000000 ]---
      
      The problem appears to be caused by CPU reordering of writes issued from
      __tracing_map_insert().
      
      The check for the presence of an element with a given key in this
      function is:
      
       val = READ_ONCE(entry->val);
       if (val && keys_match(key, val->key, map->key_size)) ...
      
      The write of a new entry is:
      
       elt = get_free_elt(map);
       memcpy(elt->key, key, map->key_size);
       entry->val = elt;
      
      The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
      stores may become visible in the reversed order on another CPU. This
      second CPU might then incorrectly determine that a new key doesn't match
      an already present val->key and subsequently insert a new element,
      resulting in a duplicate.
      
      Fix the problem by adding a write barrier between
      "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
      good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
      element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
      and the "val->key" check which has an address dependency.
      
      The barrier is placed on a path executed when adding an element for
      a new key. Subsequent updates targeting the same key remain unaffected.
      
      From the user's perspective, the issue was introduced by commit
      c193707d ("tracing: Remove code which merges duplicates"), which
      followed commit cbf4100e ("tracing: Add support to detect and avoid
      duplicates"). The previous code operated differently; it inherently
      expected potential races which result in duplicates but merged them
      later when they occurred.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com
      
      Fixes: c193707d
      
       ("tracing: Remove code which merges duplicates")
      Signed-off-by: default avatarPetr Pavlu <petr.pavlu@suse.com>
      Acked-by: default avatarTom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      2b447606
  7. Jan 22, 2024