  1. Jul 14, 2011
    • ftrace: Fix dynamic selftest failure on some archs · 6331c28c
      Steven Rostedt authored
      
      
      Archs that do not implement CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST will
      fail the dynamic ftrace selftest.
      
      The function tracer has a quick 'off' variable that will prevent
      the callback functions from being called. This variable is called
      function_trace_stop. On x86, the check is implemented directly in the
      mcount assembly, but for other archs, an intermediate function called
      ftrace_test_stop_func() is used.
      
      In dynamic ftrace, the function pointer variable ftrace_trace_function is
      used to update the caller code in the mcount caller. But for archs that
      do not have CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST set, the mcount caller
      only calls ftrace_test_stop_func() instead, which in turn calls
      __ftrace_trace_function.
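
      The intermediate helper is essentially this (a minimal sketch of the
      function described above, not line-for-line kernel code):

        static void ftrace_test_stop_func(unsigned long ip, unsigned long parent_ip)
        {
                if (function_trace_stop)        /* the quick 'off' switch */
                        return;

                __ftrace_trace_function(ip, parent_ip);
        }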
      
      When more than one ftrace_ops is registered, the function that gets
      called is ftrace_ops_list_func(), which iterates over all registered
      ftrace_ops and calls the callbacks whose hash matches the traced function.
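
      Roughly, the list function looks like this (a sketch; RCU and recursion
      protection omitted, list names as in the ftrace code of this era):

        static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip)
        {
                struct ftrace_ops *op;

                for (op = ftrace_ops_list; op != &ftrace_list_end; op = op->next) {
                        if (ftrace_ops_test(op, ip))    /* does op's hash match ip? */
                                op->func(ip, parent_ip);
                }
        }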
      
      The issue happens when two ftrace_ops are registered for different functions
      and one is then unregistered. The __ftrace_trace_function is then pointed
      directly at the remaining ftrace_ops callback function. This means it will
      be called for all functions that either of the two registered ftrace_ops
      was set to trace.
      
      This is not an issue for archs with CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST,
      because the update of ftrace_trace_function doesn't happen until after all
      functions have been updated, and then the mcount caller is updated. But
      for those archs that do use the ftrace_test_stop_func(), the update is
      immediate.
      
      The dynamic selftest fails because it hits this situation, and the
      ftrace_ops that it registers fails to trace only what it was supposed to
      and instead traces all other functions.
      
      The solution is to delay the setting of __ftrace_trace_function until
      after all the functions have been updated according to the registered
      ftrace_ops. Also, function_trace_stop is set during the update to keep
      the function tracer from invoking callbacks while the transition is in
      progress.
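
      In outline, the update sequence becomes (a sketch of the ordering
      described above, not the exact code):

        /* Sketch of the new ordering in the (un)register path: */
        function_trace_stop++;            /* quiet the callbacks during the switch */
        /* ... update the records / mcount call sites for all ftrace_ops ...      */
        __ftrace_trace_function = func;   /* set only after the records are done   */
        function_trace_stop--;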
      
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • ftrace: Update filter when tracing enabled in set_ftrace_filter() · 072126f4
      Steven Rostedt authored
      
      
      Currently, if set_ftrace_filter() is called when the ftrace_ops is
      active, the function filters will not be updated. They will only be updated
      when tracing is disabled and re-enabled.
      
      Update the functions immediately during set_ftrace_filter().
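
      For illustration (not part of the patch), a userspace filter write that
      now takes effect immediately, assuming the usual debugfs mount point:

        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                const char *filter = "schedule";
                int fd = open("/sys/kernel/debug/tracing/set_ftrace_filter", O_WRONLY);

                if (fd < 0)
                        return 1;
                /* With this patch the new filter applies at once, even while
                 * the ftrace_ops is active; before, it applied only after the
                 * tracer was disabled and re-enabled. */
                write(fd, filter, strlen(filter));
                close(fd);
                return 0;
        }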
      
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • ftrace: Balance records when updating the hash · 41fb61c2
      Steven Rostedt authored
      
      
      Whenever the hash of the ftrace_ops is updated, the record counts
      must be balanced. This requires disabling the records that are set
      in the original hash, and then enabling the records that are set
      in the updated hash.
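
      Conceptually, ftrace_hash_move() now performs the balance itself (a
      sketch; locking and error handling omitted, helper names as in the
      ftrace code of this era):

        ftrace_hash_rec_disable(ops, enable);   /* drop counts for the old hash  */
        old_hash = *dst;
        *dst = new_hash;                        /* install the updated hash      */
        ftrace_hash_rec_enable(ops, enable);    /* raise counts for the new hash */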
      
      Moving the update into ftrace_hash_move() removes the bug where the
      hash was updated but the records were not. That bug caused ftrace to
      trigger a warning and disable itself when an ftrace_ops filter was
      updated while the ftrace_ops was registered, with the failure showing
      up when the ftrace_ops was later unregistered.
      
      The current code will not trigger this bug, but new code will.
      
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  2. Jul 08, 2011
    • ftrace: Do not disable interrupts for modules in mcount update · 4376cac6
      Steven Rostedt authored
      
      
      When I mounted an NFS directory, it caused several modules to be loaded. At the
      time I was running the preemptirqsoff tracer, and it showed the following
      output:
      
      # tracer: preemptirqsoff
      #
      # preemptirqsoff latency trace v1.1.5 on 2.6.33.9-rt30-mrg-test
      # --------------------------------------------------------------------
      # latency: 1177 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
      #    -----------------
      #    | task: modprobe-19370 (uid:0 nice:0 policy:0 rt_prio:0)
      #    -----------------
      #  => started at: ftrace_module_notify
      #  => ended at:   ftrace_module_notify
      #
      #
      #                  _------=> CPU#
      #                 / _-----=> irqs-off
      #                | / _----=> need-resched
      #                || / _---=> hardirq/softirq
      #                ||| / _--=> preempt-depth
      #                |||| /_--=> lock-depth
      #                |||||/     delay
      #  cmd     pid   |||||| time  |   caller
      #     \   /      ||||||   \   |   /
      modprobe-19370   3d....    0us!: ftrace_process_locs <-ftrace_module_notify
      modprobe-19370   3d.... 1176us : ftrace_process_locs <-ftrace_module_notify
      modprobe-19370   3d.... 1178us : trace_hardirqs_on <-ftrace_module_notify
      modprobe-19370   3d.... 1178us : <stack trace>
       => ftrace_process_locs
       => ftrace_module_notify
       => notifier_call_chain
       => __blocking_notifier_call_chain
       => blocking_notifier_call_chain
       => sys_init_module
       => system_call_fastpath
      
      That's over 1ms that interrupts are disabled on a Real-Time kernel!
      
      Looking at the cause (being the ftrace author helped), I found that
      interrupts are disabled across the code modification of mcounts into nops.
      Interrupts only need to be disabled at start up around this code, not when
      modules are being loaded.
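
      The change amounts to making the irq disabling conditional on the
      boot-time pass, roughly:

        /* Sketch of ftrace_process_locs() after the change: module code is
         * not executing yet, so only boot-time kernel-text updates need to
         * run with interrupts off. */
        if (!mod)
                local_irq_save(flags);
        ftrace_update_code(mod);
        if (!mod)
                local_irq_restore(flags);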
      
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Still trace filtered irq functions when irq trace is disabled · e4a3f541
      Steven Rostedt authored
      
      
      If a function is set to be traced by set_graph_function, but the
      option funcgraph-irqs is zero, and the traced function happens to be
      called from an interrupt, it will not be traced.

      The point of funcgraph-irqs is to not trace interrupts when we are
      preempted by an irq, not to skip functions we want to trace that
      happen to be *in* an irq.
      
      Luckily, the current->trace_recursion element is a perfect place to add
      a flag that lets us trace functions within an interrupt even when we are
      not tracing the interrupts that preempt the trace.
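
      A sketch of the idea, assuming a TRACE_IRQ_BIT-style flag (treat the
      exact names and bit position as approximate):

        #define TRACE_IRQ_BIT           (1 << 13)  /* flag in current->trace_recursion */
        #define trace_recursion_test(b) ((current)->trace_recursion & (b))

        /* Only skip irq context when no explicitly-traced function set the
         * flag on its way in. */
        static inline int ftrace_graph_ignore_irqs(void)
        {
                if (!ftrace_graph_skip_irqs || trace_recursion_test(TRACE_IRQ_BIT))
                        return 0;
                return in_irq();
        }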
      
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing, x86/irq: Do not trace arch_local_{*,irq_*}() functions · e08fbb78
      Steven Rostedt authored
      
      
      I triggered a triple fault with gcc 4.5.1 because it did not
      honor the inline annotation on the arch_local_save_flags() function,
      and that function was added to the pool of functions traced by
      the function tracer.
      
      When preempt_schedule() called arch_local_save_flags() (called
      by irqs_disabled()), it was traced, but the first thing the
      function tracer does is disable preemption. When it enables
      preemption, the NEED_RESCHED flag will not have been cleared and
      the preemption check will trigger the call to preempt_schedule()
      again.
      
      Although the dynamic function tracer crashed immediately, the
      static version of the function tracer (CONFIG_DYNAMIC_FTRACE is
      not set) actually was able to show where the problem was.
      
       swapper-1       3.N.. 103885us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103886us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103886us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103887us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103887us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103888us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103888us : arch_local_save_flags <-preempt_schedule
      
      It went on for a while before it triple faulted with a corrupted
      stack.
      
      The arch_local_save_flags and arch_local_irq_* functions should
      not be traced. Even though they are marked as inline, gcc may
      still emit them as real functions and enable tracing of them.

      The simple solution is to just mark them as notrace. I had to
      add <linux/types.h> to this file to pull in the notrace tag.
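
      The shape of the fix (a sketch; the real change annotates the
      arch_local_* helpers in the x86 irq-flags headers):

        #include <linux/types.h>        /* pulls in the notrace annotation */

        /* notrace keeps the helper out of the mcount pool even when gcc
         * declines to inline it. */
        static inline notrace unsigned long arch_local_save_flags(void)
        {
                return native_save_fl();
        }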
      
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20110702033852.733414762@goodmis.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. Jul 03, 2011
    • x86: Don't use frame pointer to save old stack on irq entry · a2bbe750
      Frederic Weisbecker authored
      
      
      rbp is used in SAVE_ARGS_IRQ to save the old stack pointer
      in order to restore it later in ret_from_intr.
      
      It is convenient because we save its value in the irq regs
      and it's easily restored using the leave instruction.
      
      However, this is a kind of abuse of the frame pointer, whose
      role is to help unwind the kernel by chaining frames together,
      each node followed by the return address into the previous frame.
      
      But although we are breaking the frame by changing the stack
      pointer, there is no preceding return address before the new
      frame. Hence using the frame pointer to link the two stacks
      breaks the stack unwinders, which find a random value instead of
      a return address here.
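
      For reference, a frame-pointer unwinder walks the chain roughly like
      this (a standalone sketch, not the kernel code; record_entry() and the
      declarations are stand-ins):

        extern int kernel_text_address(unsigned long addr);    /* stand-in decl */
        extern void record_entry(unsigned long addr);          /* stand-in decl */

        struct stack_frame {
                struct stack_frame *next_frame; /* saved rbp of the caller */
                unsigned long return_address;   /* pushed by the call insn */
        };

        static void walk_frames(struct stack_frame *fp)
        {
                while (fp && kernel_text_address(fp->return_address)) {
                        record_entry(fp->return_address);
                        /* When rbp was abused to stash the old rsp, next_frame
                         * points into a stack and return_address is garbage. */
                        fp = fp->next_frame;
                }
        }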
      
      There is no workaround that works in every case. We use the
      fixup_bp_irq_link() function to dereference that abused frame
      pointer in the case of a non-nesting interrupt (which means the
      stack changed).
      But that doesn't fix the case of interrupts that don't change the
      stack (where we still have the unconditional frame link), which is
      the case of a hardirq interrupting a softirq. We have no way to detect
      this transition, so the frame irq link is treated as a real frame
      pointer and the return address is dereferenced, but it is still a
      spurious one.
      
      There are two possible results of this: either the spurious return
      address, a random stack value, luckily belongs to the kernel text,
      in which case the unwinding can continue and we just get a weird entry
      in the stack trace; or it doesn't belong to the kernel text and
      unwinding stops there.
      
      This is the reason why stacktraces (including perf callchains) on
      irqs that interrupted softirqs don't work very well.
      
      To solve this, we don't save the old stack pointer on rbp anymore
      but we save it to a scratch register that we push on the new
      stack and that we pop back later on irq return.
      
      This preserves the whole frame chain without spurious return addresses
      in the middle and drops the need for the horrid fixup_bp_irq_link()
      workaround.
      
      And finally, irqs that interrupt softirqs are sanely unwound.
      
      Before:
      
          99.81%         perf  [kernel.kallsyms]  [k] perf_pending_event
                         |
                         --- perf_pending_event
                             irq_work_run
                             smp_irq_work_interrupt
                             irq_work_interrupt
                            |
                            |--41.60%-- __read
                            |          |
                            |          |--99.90%-- create_worker
                            |          |          bench_sched_messaging
                            |          |          cmd_bench
                            |          |          run_builtin
                            |          |          main
                            |          |          __libc_start_main
                            |           --0.10%-- [...]
      
      After:
      
           1.64%  swapper  [kernel.kallsyms]  [k] perf_pending_event
                  |
                  --- perf_pending_event
                      irq_work_run
                      smp_irq_work_interrupt
                      irq_work_interrupt
                     |
                     |--95.00%-- arch_irq_work_raise
                     |          irq_work_queue
                     |          __perf_event_overflow
                     |          perf_swevent_overflow
                     |          perf_swevent_event
                     |          perf_tp_event
                     |          perf_trace_softirq
                     |          __do_softirq
                     |          call_softirq
                     |          do_softirq
                     |          irq_exit
                     |          |
                     |          |--73.68%-- smp_apic_timer_interrupt
                     |          |          apic_timer_interrupt
                     |          |          |
                     |          |          |--96.43%-- amd_e400_idle
                     |          |          |          cpu_idle
                     |          |          |          start_secondary
      
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
    • x86: Remove useless unwinder backlink from irq regs saving · 48ffee7d
      Frederic Weisbecker authored
      
      
      The unwinder backlink in the interrupt entry is useless.
      It's not actually part of the stack frame chain and thus is
      never used.
      
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
    • x86,64: Separate arg1 from rbp handling in SAVE_REGS_IRQ · 3b99a3ef
      Frederic Weisbecker authored
      
      
      Just for clarity in the code: have a first block that handles
      the frame pointer and a separate one that handles the pt_regs
      pointer and its use.
      
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
    • x86,64: Simplify save_regs() · 1871853f
      Frederic Weisbecker authored
      
      
      The save_regs function that saves the regs on low-level
      irq entry is complicated because it changes its stack in the
      middle, and also because it manipulates data allocated in the
      caller's frame with accesses calculated directly from the callee's
      rsp value, the return address sitting in the middle of the way.

      This complicates the static stack offset calculations and requires
      more dynamic ones. It also needs a save/restore of the function's
      return address.
      
      To simplify and optimize this, turn save_regs() into a
      macro.
      
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
    • x86: Fetch stack from regs when possible in dump_trace() · 47ce11a2
      Frederic Weisbecker authored
      
      
      When regs are passed to dump_trace(), we fetch the frame
      pointer from the regs but the stack pointer is taken from
      the current frame.

      Thus the frame and stack pointers may not come from the same
      context. For example, this can result in the unwinder thinking
      the context is in irq, due to the current value of the stack,
      while the frame pointer coming from the regs points to a frame
      from another place. It then tries to fix up the irq link but
      ends up dereferencing a random frame pointer that doesn't belong
      to the irq stack:
      
      [ 9131.706906] ------------[ cut here ]------------
      [ 9131.707003] WARNING: at arch/x86/kernel/dumpstack_64.c:129 dump_trace+0x2aa/0x330()
      [ 9131.707003] Hardware name: AMD690VM-FMH
      [ 9131.707003] Perf: bad frame pointer = 0000000000000005 in callchain
      [ 9131.707003] Modules linked in:
      [ 9131.707003] Pid: 1050, comm: perf Not tainted 3.0.0-rc3+ #181
      [ 9131.707003] Call Trace:
      [ 9131.707003]  <IRQ>  [<ffffffff8104bd4a>] warn_slowpath_common+0x7a/0xb0
      [ 9131.707003]  [<ffffffff8104be21>] warn_slowpath_fmt+0x41/0x50
      [ 9131.707003]  [<ffffffff8178b873>] ? bad_to_user+0x6d/0x10be
      [ 9131.707003]  [<ffffffff8100c2da>] dump_trace+0x2aa/0x330
      [ 9131.707003]  [<ffffffff810107d3>] ? native_sched_clock+0x13/0x50
      [ 9131.707003]  [<ffffffff8101b164>] perf_callchain_kernel+0x54/0x70
      [ 9131.707003]  [<ffffffff810d391f>] perf_prepare_sample+0x19f/0x2a0
      [ 9131.707003]  [<ffffffff810d546c>] __perf_event_overflow+0x16c/0x290
      [ 9131.707003]  [<ffffffff810d5430>] ? __perf_event_overflow+0x130/0x290
      [ 9131.707003]  [<ffffffff810107d3>] ? native_sched_clock+0x13/0x50
      [ 9131.707003]  [<ffffffff8100fbb9>] ? sched_clock+0x9/0x10
      [ 9131.707003]  [<ffffffff810752e5>] ? T.375+0x15/0x90
      [ 9131.707003]  [<ffffffff81084da4>] ? trace_hardirqs_on_caller+0x64/0x180
      [ 9131.707003]  [<ffffffff810817bd>] ? trace_hardirqs_off+0xd/0x10
      [ 9131.707003]  [<ffffffff810d5764>] perf_event_overflow+0x14/0x20
      [ 9131.707003]  [<ffffffff810d588c>] perf_swevent_hrtimer+0x11c/0x130
      [ 9131.707003]  [<ffffffff817821a1>] ? error_exit+0x51/0xb0
      [ 9131.707003]  [<ffffffff81072e93>] __run_hrtimer+0x83/0x1e0
      [ 9131.707003]  [<ffffffff810d5770>] ? perf_event_overflow+0x20/0x20
      [ 9131.707003]  [<ffffffff81073256>] hrtimer_interrupt+0x106/0x250
      [ 9131.707003]  [<ffffffff812a3bfd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
      [ 9131.707003]  [<ffffffff81024833>] smp_apic_timer_interrupt+0x53/0x90
      [ 9131.707003]  [<ffffffff81789053>] apic_timer_interrupt+0x13/0x20
      [ 9131.707003]  <EOI>  [<ffffffff817821a1>] ? error_exit+0x51/0xb0
      [ 9131.707003]  [<ffffffff8178219c>] ? error_exit+0x4c/0xb0
      [ 9131.707003] ---[ end trace b2560d4876709347 ]---
      
      Fix this by simply taking the stack pointer from regs->sp
      when regs are provided.
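
      The fix boils down to this in dump_trace() (a sketch; the fallback
      helper name is illustrative, not the real one):

        unsigned long *stack;

        if (regs)
                stack = (unsigned long *)regs->sp;  /* same context as regs->bp   */
        else
                stack = get_local_stack_pointer();  /* old behavior: current frame */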
      
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    • x86: Save stack pointer in perf live regs savings · 9e46294d
      Frederic Weisbecker authored
      
      
      In order to prepare for fetching the stack pointer from the regs
      when possible in dump_trace(), instead of taking the local one,
      save the current stack pointer when perf saves the live regs.
      
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
  4. Jun 30, 2011
    • perf tools: Only display parent field if explicitly sorted · cb1955b8
      Frederic Weisbecker authored
      
      
      We don't need to display the parent field if the parent
      sorting machinery is only used for parent filtering
      (as in "-p foo").

      However, if parent filtering is used in combination with
      explicit parent sorting (-s parent), we want to
      display it.
      
      Result with:
      
        perf report -p kernel_thread -s parent
      
      Before:
      
       # Overhead  Parent symbol
       # ........  .............
       #
           0.07%
                  |
                  --- ioread8
                      ata_sff_check_status
                      ata_sff_tf_load
                      ata_sff_qc_issue
                      ata_bmdma_qc_issue
                      ata_qc_issue
                      ata_scsi_translate
                      ata_scsi_queuecmd
                      scsi_dispatch_cmd
                      scsi_request_fn
                      __blk_run_queue
                      __make_request
                      generic_make_request
                      submit_bio
                      submit_bh
                      journal_submit_commit_record
                      jbd2_journal_commit_transaction
                      kjournald2
                      kthread
                       kernel_thread_helper
      
      After:
      
       # Overhead  Parent symbol
       # ........  .............
       #
           0.07%  kernel_thread_helper
                  |
                  --- ioread8
                      ata_sff_check_status
                      ata_sff_tf_load
                      ata_sff_qc_issue
                      ata_bmdma_qc_issue
                      ata_qc_issue
                      ata_scsi_translate
                      ata_scsi_queuecmd
                      scsi_dispatch_cmd
                      scsi_request_fn
                      __blk_run_queue
                      __make_request
                      generic_make_request
                      submit_bio
                      submit_bh
                      journal_submit_commit_record
                      jbd2_journal_commit_transaction
                      kjournald2
                      kthread
                      kernel_thread_helper
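
      The behavior boils down to something like this sketch (perf sort
      entries carry an 'elide' flag; the explicit-sort test is illustrative):

        /* Hide the parent column when -p only set a filter and the user did
         * not also pass '-s parent'. */
        if (parent_pattern != default_parent_pattern && !explicit_parent_sort)
                sort_parent.elide = 1;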
      
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Sam Liao <phyomh@gmail.com>
    • perf tools: Allow sort dimensions to be registered more than once · fd8ea212
      Frederic Weisbecker authored
      
      
      So that the parent sort dimension can be registered twice: once
      when we add it as an explicit sort dimension (-s parent) and again
      when we request a parent filter (-p foo).

      We'll have only one parent sort dimension in the end, but this
      allows the default parent filter to be overridden by the one we
      gave in the -p option. The goal is to prepare for allowing the use
      of "-s parent" and "-p foo" at the same time, i.e. sort by filtered
      parent.
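
      A sketch of the shape of the change in perf's sort_dimension__add()
      (control flow approximate; treat it as illustrative, not the patch):

        /* A second registration of the same dimension (e.g. -p followed by
         * '-s parent') now succeeds as a no-op instead of being rejected. */
        if (sd->taken)
                return 0;

        list_add_tail(&sd->entry->list, &hist_entry__sort_list);
        sd->taken = 1;
        return 0;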
      
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Sam Liao <phyomh@gmail.com>