  1. Jul 31, 2018
    • tracing: Centralize preemptirq tracepoints and unify their usage · c3bc8fd6
      Joel Fernandes (Google) authored
      
      
      This patch detaches the preemptirq tracepoints from the tracers and
      keeps them separate.
      
      Advantages:
      * Lockdep and irqsoff event can now run in parallel since they no longer
      have their own calls.
      
      * This unifies the usecase of adding hooks to an irqsoff and irqson
      event, and a preemptoff and preempton event.
        3 users of the events exist:
        - Lockdep
        - irqsoff and preemptoff tracers
        - irqs and preempt trace events
      
      The unification cleans up several ifdefs and makes the code in preempt
      tracer and irqsoff tracers simpler. It gets rid of all the horrific
      ifdeferry around PROVE_LOCKING and makes configuration of the different
      users of the tracepoints easier and more understandable. It also gets rid
      of the time_* function calls from the lockdep hooks used to call into
      the preemptirq tracer, which are not needed anymore. The negative delta in
      lines of code in this patch is quite large too.
      
      In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
      as a single point for registering probes onto the tracepoints. With
      this, the web of config options for preempt/irq toggle tracepoints and
      its users becomes:
      
       PREEMPT_TRACER   PREEMPTIRQ_EVENTS  IRQSOFF_TRACER PROVE_LOCKING
             |                 |     \         |           |
             \    (selects)    /      \        \ (selects) /
            TRACE_PREEMPT_TOGGLE       ----> TRACE_IRQFLAGS
                            \                  /
                             \ (depends on)   /
                           PREEMPTIRQ_TRACEPOINTS
      
      Other than the performance tests mentioned in the previous patch, I also
      ran the locking API test suite. I verified that all test cases are
      passing.
      
      I also injected issues by not registering lockdep probes onto the
      tracepoints and I see failures to confirm that the probes are indeed
      working.
      
      This series + lockdep probes not registered (just to inject errors):
      [    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]          hard-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]          soft-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]          hard-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]          soft-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      [    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      
      With this series + lockdep probes registered, all locking tests pass:
      
      [    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]          hard-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]          soft-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]          hard-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]          soft-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      [    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      
      Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org
      
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracepoint: Make rcuidle tracepoint callers use SRCU · e6753f23
      Joel Fernandes (Google) authored
      
      
      In recent tests with IRQ on/off tracepoints, a large performance
      overhead (~10%) was noticed when running hackbench. This was root-caused
      to calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
      tracepoint code. Following a long discussion on the list [1] about this,
      we concluded that SRCU is a better alternative for use during RCU idle.
      Although it does involve extra barriers, it's lighter than the sched-RCU
      version, which has to do additional RCU calls to notify RCU idle about
      entry into RCU sections.
      
      In this patch, we change the underlying implementation of the
      trace_*_rcuidle API to use SRCU. This has been shown to improve
      performance a lot for the high-frequency irq enable/disable tracepoints.
      
      Test: Tested idle and preempt/irq tracepoints.
      
      Here are some performance numbers:
      
      With a run of the following 30 times on a single core x86 Qemu instance
      with 1GB memory:
      hackbench -g 4 -f 2 -l 3000
      
      Completion times in seconds. CONFIG_PROVE_LOCKING=y.
      
      No patches (without this series)
      Mean: 3.048
      Median: 3.025
      Std Dev: 0.064
      
      With Lockdep using irq tracepoints with RCU implementation:
      Mean: 3.451   (-11.66%)
      Median: 3.447 (-12.22%)
      Std Dev: 0.049
      
      With Lockdep using irq tracepoints with SRCU implementation (this series):
      Mean: 3.020   (I would consider the improvement against the "without
      	       this series" case as just noise).
      Median: 3.013
      Std Dev: 0.033
      
      [1] https://patchwork.kernel.org/patch/10344297/
      
      [remove rcu_read_lock_sched_notrace as it's the equivalent of
      preempt_disable_notrace and is unnecessary to call in tracepoint code]
      Link: http://lkml.kernel.org/r/20180730222423.196630-3-joel@joelfernandes.org
      
      Cleaned-up-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      [ Simplified WARN_ON_ONCE() ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • lockdep: Use this_cpu_ptr instead of get_cpu_var stats · 01f38497
      Joel Fernandes (Google) authored
      
      
      get_cpu_var disables preemption, which has the potential to call into the
      preemption disable tracepoints, causing some complications. There's also
      no need to disable preemption in uses of get_lock_stats anyway since
      preempt is already disabled. So let's simplify the code.
      
      Link: http://lkml.kernel.org/r/20180730222423.196630-2-joel@joelfernandes.org
      
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • selftests/ftrace: Fix kprobe string testcase to not probe notrace function · 6fc7c411
      Masami Hiramatsu authored
      
      
      Fix the kprobe string argument testcase so that it does not probe a
      notrace function. Instead, it probes a tracefs function, which must
      be available with ftrace.
      
      Link: http://lkml.kernel.org/r/153294607107.32740.1664854684396589624.stgit@devbox
      
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • selftest/ftrace: Move kprobe selftest function to separate compile unit · d899926f
      Francis Deslauriers authored
      
      
      Move selftest function to its own compile unit so it can be compiled
      with the ftrace cflags (CC_FLAGS_FTRACE) allowing it to be probed
      during the ftrace startup tests.
      
      Link: http://lkml.kernel.org/r/153294604271.32740.16490677128630177030.stgit@devbox
      
      Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: kprobes: Prohibit probing on notrace function · 45408c4f
      Masami Hiramatsu authored
      
      
      Prohibit kprobe-events from probing notrace functions, since probing a
      notrace function can cause a recursive event call. In most cases those are
      just skipped, but in some cases it falls into an infinite recursive call.
      
      This protection can be disabled by the kconfig
      CONFIG_KPROBE_EVENTS_ON_NOTRACE=y, but it is highly recommended to keep it
      "n" for normal kernel builds.  Note that this is only available if "kprobes on
      ftrace" has been implemented on the target arch and CONFIG_KPROBES_ON_FTRACE=y.
      
      Link: http://lkml.kernel.org/r/153294601436.32740.10557881188933661239.stgit@devbox
      
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Tested-by: Francis Deslauriers <francis.deslauriers@efficios.com>
      [ Slight grammar and spelling fixes ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  2. Jul 28, 2018
  3. Jul 27, 2018
  4. Jul 26, 2018
    • tracing: Remove orphaned function ftrace_nr_registered_ops() · 72809cbf
      Masami Hiramatsu authored
      Remove ftrace_nr_registered_ops() because it is no longer used.
      
      ftrace_nr_registered_ops() was introduced by commit ea701f11
      ("ftrace: Add selftest to test function trace recursion protection"), but
      its caller was removed by commit 05cbbf64 ("tracing: Fix selftest
      function recursion accounting"). So it is not called anymore.
      
      Link: http://lkml.kernel.org/r/153260907227.12474.5234899025934963683.stgit@devbox
      
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Remove orphaned function using_ftrace_ops_list_func(). · 7b144b6c
      Masami Hiramatsu authored
      Remove using_ftrace_ops_list_func() since it is no longer used.
      
      using_ftrace_ops_list_func() was introduced by commit 7eea4fce
      ("tracing/stack_trace: Skip 4 instead of 3 when using ftrace_ops_list_func")
      as a helper function, but its caller was removed by commit 72ac426a
      ("tracing: Clean up stack tracing and fix fentry updates").  So it is not
      called anymore.
      
      Link: http://lkml.kernel.org/r/153260904427.12474.9952096317439329851.stgit@devbox
      
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Make unregister_trigger() static · f6b7425c
      Steven Rostedt (VMware) authored
      
      
      Nothing uses unregister_trigger() outside of the trace_events_trigger.c
      file, thus it should be static. Not sure why this was ever converted,
      because its counterpart, register_trigger(), was always static.
      
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • kselftests: Add tests for the preemptoff and irqsoff tracers · 8bd1369b
      Joel Fernandes (Google) authored
      
      
      Here we add unit tests for the preemptoff and irqsoff tracer by using a
      kernel module introduced previously to trigger long preempt or irq
      disabled sections in the kernel.
      
      Link: http://lkml.kernel.org/r/20180711063540.91101-3-joel@joelfernandes.org
      
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • lib: Add module for testing preemptoff/irqsoff latency tracers · f96e8577
      Joel Fernandes (Google) authored
      
      
      Here we introduce a test module that creates a long preempt or irq
      disable delay in the kernel which the preemptoff or irqsoff tracers can
      detect. This module is to be used only for test purposes and is disabled
      by default.
      
      Following is the expected output (only briefly shown) that can be parsed
      to verify that the tracers are working correctly. We will use this from
      the kselftests in future patches.
      
      For the preemptoff tracer:
      
      echo preemptoff > /d/tracing/current_tracer
      sleep 1
      insmod ./preemptirq_delay_test.ko test_mode=preempt delay=500000
      sleep 1
      bash-4.3# cat /d/tracing/trace
      preempt -1066    2...2    0us@: preemptirq_delay_run <-preemptirq_delay_run
      preempt -1066    2...2 500002us : preemptirq_delay_run <-preemptirq_delay_run
      preempt -1066    2...2 500004us : tracer_preempt_on <-preemptirq_delay_run
      preempt -1066    2...2 500012us : <stack trace>
       => kthread
       => ret_from_fork
      
      For the irqsoff tracer:
      
      echo irqsoff > /d/tracing/current_tracer
      sleep 1
      insmod ./preemptirq_delay_test.ko test_mode=irq delay=500000
      sleep 1
      bash-4.3# cat /d/tracing/trace
      irq dis -1069    1d..1    0us@: preemptirq_delay_run
      irq dis -1069    1d..1 500001us : preemptirq_delay_run
      irq dis -1069    1d..1 500002us : tracer_hardirqs_on <-preemptirq_delay_run
      irq dis -1069    1d..1 500005us : <stack trace>
       => ret_from_fork
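
The trace output above can be checked mechanically. The following is a minimal userspace C sketch of one way to do that, not the approach the kselftests take: it scans a trace dump for "<digits>us" timestamps and returns the largest one, which a caller could compare against the delay= value passed to the module. The helper name max_trace_us() is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the largest "<digits>us" timestamp found in a trace dump,
 * or -1 if no such timestamp is present. Hypothetical helper.
 */
static long max_trace_us(const char *trace)
{
	long max = -1;
	const char *p = trace;

	assert(trace != NULL);
	while ((p = strstr(p, "us")) != NULL) {
		const char *d = p;

		/* Walk back over the digits immediately preceding "us". */
		while (d > trace && isdigit((unsigned char)d[-1]))
			d--;
		if (d < p) {
			long v = strtol(d, NULL, 10);

			if (v > max)
				max = v;
		}
		p += 2; /* continue searching after this "us" */
	}
	return max;
}
```

A test would pass if the returned value is at least the requested delay in microseconds (500000 in the runs shown above).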
      
      Link: http://lkml.kernel.org/r/20180712213611.GA8743@joelaf.mtv.corp.google.com
      
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Julia Cartwright <julia@ni.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      [ Erick is a co-developer of this commit ]
      Signed-off-by: Erick Reyes <erickreyes@google.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing/irqsoff: Split reset into separate functions · 2b27ece6
      Joel Fernandes (Google) authored
      
      
      Split reset functions into separate functions in preparation
      for future patches that need to do tracer-specific reset.
      
      Link: http://lkml.kernel.org/r/20180628182149.226164-4-joel@joelfernandes.org
      
      Reviewed-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • srcu: Add notrace variant of srcu_dereference · 0b764a6e
      Joel Fernandes (Google) authored
      
      
      In the last patch in this series, we are making lockdep register hooks
      onto the irq_{disable,enable} tracepoints. These tracepoints use the
      _rcuidle tracepoint variant. In this series we switch the _rcuidle
      tracepoint callers to use SRCU instead of sched-RCU. In order to
      dereference the pointer to the probe functions, we could call
      srcu_dereference, however this API will call back into lockdep to check
      if the lock is held *before* the lockdep probe hooks have a chance to
      run and annotate the IRQ enabled/disabled state.
      
      For this reason we need a notrace variant of srcu_dereference since
      otherwise we get lockdep splats. This patch adds the needed
      srcu_dereference_notrace variant.
      
      Link: http://lkml.kernel.org/r/20180628182149.226164-3-joel@joelfernandes.org
      
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • srcu: Add notrace variants of srcu_read_{lock,unlock} · 1f45a4db
      Paul McKenney authored
      
      
      This is needed for a future tracepoint patch that uses srcu, and to make
      sure it doesn't call into lockdep.
      
      tracepoint code already calls notrace variants for rcu_read_lock_sched
      so this patch does the same for srcu which will be used in a later
      patch. Keeps it consistent with rcu-sched.
      
      [Joel: Added commit message]
      Link: http://lkml.kernel.org/r/20180628182149.226164-2-joel@joelfernandes.org
      
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • kthread, tracing: Don't expose half-written comm when creating kthreads · 3e536e22
      Snild Dolkow authored
      There is a window for racing when printing directly to task->comm,
      allowing other threads to see a non-terminated string. The vsnprintf
      function fills the buffer, counts the truncated chars, then finally
      writes the \0 at the end.
      
      	creator                     other
      	vsnprintf:
      	  fill (not terminated)
      	  count the rest            trace_sched_waking(p):
      	  ...                         memcpy(comm, p->comm, TASK_COMM_LEN)
      	  write \0
      
      The consequences depend on how 'other' uses the string. In our case,
      it was copied into the tracing system's saved cmdlines, a buffer of
      adjacent TASK_COMM_LEN-byte buffers (note the 'n' where 0 should be):
      
      	crash-arm64> x/1024s savedcmd->saved_cmdlines | grep 'evenk'
      	0xffffffd5b3818640:     "irq/497-pwr_evenkworker/u16:12"
      
      ...and a strcpy out of there would cause stack corruption:
      
      	[224761.522292] Kernel panic - not syncing: stack-protector:
      	    Kernel stack is corrupted in: ffffff9bf9783c78
      
      	crash-arm64> kbt | grep 'comm\|trace_print_context'
      	#6  0xffffff9bf9783c78 in trace_print_context+0x18c(+396)
      	      comm (char [16]) =  "irq/497-pwr_even"
      
      	crash-arm64> rd 0xffffffd4d0e17d14 8
      	ffffffd4d0e17d14:  2f71726900000000 5f7277702d373934   ....irq/497-pwr_
      	ffffffd4d0e17d24:  726f776b6e657665 3a3631752f72656b   evenkworker/u16:
      	ffffffd4d0e17d34:  f9780248ff003231 cede60e0ffffff9b   12..H.x......`..
      	ffffffd4d0e17d44:  cede60c8ffffffd4 00000fffffffffd4   .....`..........
      
      The workaround in e09e2867 (use strlcpy in __trace_find_cmdline) was
      likely needed because of this same bug.
      
      Solved by vsnprintf:ing to a local buffer, then using set_task_comm().
      This way, there won't be a window where comm is not terminated.
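
A minimal userspace C sketch of the idea, under stated assumptions: struct task and set_comm_fmt() are hypothetical stand-ins, and the locking that the kernel's set_task_comm() performs is not modeled. The name is formatted into a private buffer first, and the copy into the shared field never writes a non-NUL byte into its last slot, so a reader always finds a terminator within TASK_COMM_LEN bytes (assuming comm starts out zeroed).

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define TASK_COMM_LEN 16

/* Hypothetical stand-in for task_struct; only comm matters here. */
struct task {
	char comm[TASK_COMM_LEN];
};

/*
 * Format into a local buffer, then publish with a bounded copy.
 * Assumes t->comm was zero-initialized, so comm[TASK_COMM_LEN - 1]
 * is always '\0' and never overwritten with a non-NUL byte.
 */
static void set_comm_fmt(struct task *t, const char *fmt, ...)
{
	char name[TASK_COMM_LEN];
	va_list args;
	size_t len;

	va_start(args, fmt);
	vsnprintf(name, sizeof(name), fmt, args); /* name is always terminated */
	va_end(args);

	len = strlen(name);
	assert(len < TASK_COMM_LEN);

	memcpy(t->comm, name, len); /* touches at most indices 0..14 */
	t->comm[len] = '\0';
}
```

Formatting directly into t->comm would instead let the formatter fill all 16 bytes before it goes back to write the terminator, which is the window the diagram above describes.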
      
      Link: http://lkml.kernel.org/r/20180726071539.188015-1-snild@sony.com
      
      Cc: stable@vger.kernel.org
      Fixes: bc0c38d1 ("ftrace: latency tracer infrastructure")
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Snild Dolkow <snild@sony.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Quiet gcc warning about maybe unused link variable · 2519c1bb
      Steven Rostedt (VMware) authored
      Commit 57ea2a34 ("tracing/kprobes: Fix trace_probe flags on
      enable_trace_kprobe() failure") added an if statement that depends on another
      if statement that gcc doesn't see will initialize the "link" variable and
      gives the warning:
      
       "warning: 'link' may be used uninitialized in this function"
      
      It is really a false positive, but to quiet the warning, and also to make
      sure that it never actually is used uninitialized, initialize the "link"
      variable to NULL and add an if (!WARN_ON_ONCE(!link)) where the compiler
      thinks it could be used uninitialized.
      
      Cc: stable@vger.kernel.org
      Fixes: 57ea2a34 ("tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure")
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Fix possible double free in event_enable_trigger_func() · 15cc7864
      Steven Rostedt (VMware) authored
      There was a case that triggered a double free in event_trigger_callback()
      due to the called reg() function freeing the trigger_data and then it
      getting freed again by the error return by the caller. The solution there
      was to up the trigger_data ref count.
      
      Code inspection found that event_enable_trigger_func() has the same issue,
      but is not as easy to trigger (it requires failures that are harder to
      provoke). It needs to be solved slightly differently, as it has more to
      clean up when the reg() function fails.
      
      Link: http://lkml.kernel.org/r/20180725124008.7008e586@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Fixes: 7862ad18 ("tracing: Add 'enable_event' and 'disable_event' event trigger commands")
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  5. Jul 25, 2018
    • tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure · 57ea2a34
      Artem Savkov authored
      If enable_trace_kprobe fails to enable the probe in enable_k(ret)probe
      it returns an error, but does not unset the tp flags it set previously.
      This results in a probe being considered enabled and failures like being
      unable to remove the probe through kprobe_events file since probes_open()
      expects every probe to be disabled.
      
      Link: http://lkml.kernel.org/r/20180725102826.8300-1-asavkov@redhat.com
      Link: http://lkml.kernel.org/r/20180725142038.4765-1-asavkov@redhat.com
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 41a7dd42 ("tracing/kprobes: Support ftrace_event_file base multibuffer")
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Artem Savkov <asavkov@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • selftests/ftrace: Add snapshot and tracing_on test case · 82f4f3e6
      Masami Hiramatsu authored
      
      
      Add a testcase for checking snapshot and tracing_on
      relationship. This ensures that the snapshotting doesn't
      affect current tracing on/off settings.
      
      Link: http://lkml.kernel.org/r/153149932412.11274.15289227592627901488.stgit@devbox
      
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Hiraku Toyooka <hiraku.toyooka@cybertrust.co.jp>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: linux-kselftest@vger.kernel.org
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ring_buffer: tracing: Inherit the tracing setting to next ring buffer · 73c8d894
      Masami Hiramatsu authored
      Maintain the tracing on/off setting of the ring_buffer when switching
      to the trace buffer snapshot.
      
      Taking a snapshot is done by swapping the backup ring buffer
      (max_tr_buffer). But since the tracing on/off setting is defined
      by the ring buffer, when swapping it, the tracing on/off setting
      can also be changed. This causes a strange result like below:
      
        /sys/kernel/debug/tracing # cat tracing_on
        1
        /sys/kernel/debug/tracing # echo 0 > tracing_on
        /sys/kernel/debug/tracing # cat tracing_on
        0
        /sys/kernel/debug/tracing # echo 1 > snapshot
        /sys/kernel/debug/tracing # cat tracing_on
        1
        /sys/kernel/debug/tracing # echo 1 > snapshot
        /sys/kernel/debug/tracing # cat tracing_on
        0
      
      We don't touch tracing_on, but snapshot changes the tracing_on
      setting each time. This is an anomaly, because the user doesn't know
      that each "ring_buffer" stores its own tracing-enable state and
      the snapshot is done by swapping ring buffers.
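
The behavior can be modeled in a few lines of userspace C. This is a simplified sketch under stated assumptions, not the kernel code: struct ring_buffer here carries only the on/off flag, and struct tracer and take_snapshot() are hypothetical names. The point it illustrates is that the flag has to be carried over to the incoming buffer when the two are swapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified buffer: only the per-buffer on/off flag. */
struct ring_buffer {
	bool on;
};

struct tracer {
	struct ring_buffer *live;     /* buffer currently being written */
	struct ring_buffer *snapshot; /* spare buffer swapped in on snapshot */
};

/* Take a snapshot by swapping buffers, inheriting the on/off setting. */
static void take_snapshot(struct tracer *tr)
{
	struct ring_buffer *old;

	assert(tr != NULL && tr->live != NULL && tr->snapshot != NULL);
	old = tr->live;
	tr->live = tr->snapshot;
	tr->snapshot = old;

	/* Without this line the setting would flip on every snapshot. */
	tr->live->on = old->on;
}
```

With the inheritance line removed, a tracer whose two buffers start with different flags reproduces the alternating 1/0 readings of tracing_on shown in the session above.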
      
      Link: http://lkml.kernel.org/r/153149929558.11274.11730609978254724394.stgit@devbox
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Hiraku Toyooka <hiraku.toyooka@cybertrust.co.jp>
      Cc: stable@vger.kernel.org
      Fixes: debdd57f ("tracing: Make a snapshot feature available from userspace")
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      [ Updated commit log and comment in the code ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Fix double free of event_trigger_data · 1863c387
      Steven Rostedt (VMware) authored
      Running the following:
      
       # cd /sys/kernel/debug/tracing
       # echo 500000 > buffer_size_kb
      [ Or some other number that takes up most of memory ]
       # echo snapshot > events/sched/sched_switch/trigger
      
      Triggers the following bug:
      
       ------------[ cut here ]------------
       kernel BUG at mm/slub.c:296!
       invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       CPU: 6 PID: 6878 Comm: bash Not tainted 4.18.0-rc6-test+ #1066
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
       RIP: 0010:kfree+0x16c/0x180
       Code: 05 41 0f b6 72 51 5b 5d 41 5c 4c 89 d7 e9 ac b3 f8 ff 48 89 d9 48 89 da 41 b8 01 00 00 00 5b 5d 41 5c 4c 89 d6 e9 f4 f3 ff ff <0f> 0b 0f 0b 48 8b 3d d9 d8 f9 00 e9 c1 fe ff ff 0f 1f 40 00 0f 1f
       RSP: 0018:ffffb654436d3d88 EFLAGS: 00010246
       RAX: ffff91a9d50f3d80 RBX: ffff91a9d50f3d80 RCX: ffff91a9d50f3d80
       RDX: 00000000000006a4 RSI: ffff91a9de5a60e0 RDI: ffff91a9d9803500
       RBP: ffffffff8d267c80 R08: 00000000000260e0 R09: ffffffff8c1a56be
       R10: fffff0d404543cc0 R11: 0000000000000389 R12: ffffffff8c1a56be
       R13: ffff91a9d9930e18 R14: ffff91a98c0c2890 R15: ffffffff8d267d00
       FS:  00007f363ea64700(0000) GS:ffff91a9de580000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055c1cacc8e10 CR3: 00000000d9b46003 CR4: 00000000001606e0
       Call Trace:
        event_trigger_callback+0xee/0x1d0
        event_trigger_write+0xfc/0x1a0
        __vfs_write+0x33/0x190
        ? handle_mm_fault+0x115/0x230
        ? _cond_resched+0x16/0x40
        vfs_write+0xb0/0x190
        ksys_write+0x52/0xc0
        do_syscall_64+0x5a/0x160
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7f363e16ab50
       Code: 73 01 c3 48 8b 0d 38 83 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 79 db 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 1e e3 01 00 48 89 04 24
       RSP: 002b:00007fff9a4c6378 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
       RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f363e16ab50
       RDX: 0000000000000009 RSI: 000055c1cacc8e10 RDI: 0000000000000001
       RBP: 000055c1cacc8e10 R08: 00007f363e435740 R09: 00007f363ea64700
       R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000009
       R13: 0000000000000001 R14: 00007f363e4345e0 R15: 00007f363e4303c0
       Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device i915 snd_pcm snd_timer i2c_i801 snd soundcore i2c_algo_bit drm_kms_helper
      86_pkg_temp_thermal video kvm_intel kvm irqbypass wmi e1000e
       ---[ end trace d301afa879ddfa25 ]---
      
      The cause is that the register_snapshot_trigger() call failed to
      allocate the snapshot buffer, and then called unregister_trigger()
      which freed the data that was passed to it. Then on return to the
      function that called register_snapshot_trigger(), as it sees it
      failed to register, it frees the trigger_data again and causes
      a double free.
      
      By calling event_trigger_init() on the trigger_data (which only ups
      the reference counter for it), and then event_trigger_free() afterward,
      the trigger_data would not get freed by the registering trigger function
      as it would only up and lower the ref count for it. If the register
      trigger function fails, then the event_trigger_free() called after it
      will free the trigger data normally.
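
The reference-counting pattern described above can be sketched in userspace C. This is an illustration under stated assumptions rather than the kernel implementation: struct trigger_data is reduced to a counter, trigger_get() and trigger_put() stand in for the roles of event_trigger_init() and event_trigger_free(), and failing_reg(), ok_reg() and trigger_callback() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for event_trigger_data. */
struct trigger_data {
	int ref;
};

static int frees;                       /* how many objects were released */
static struct trigger_data *registered; /* set by a successful registration */

/* Take a reference. */
static void trigger_get(struct trigger_data *d)
{
	d->ref++;
}

/* Drop a reference; release the object when the last one goes away. */
static void trigger_put(struct trigger_data *d)
{
	assert(d->ref > 0);
	if (--d->ref == 0) {
		free(d);
		frees++;
	}
}

/* Registration that fails part-way and undoes its own reference. */
static int failing_reg(struct trigger_data *d)
{
	trigger_get(d); /* registered ... */
	trigger_put(d); /* ... then unregistered on the failure path */
	return -1;
}

/* Registration that succeeds and keeps its reference. */
static int ok_reg(struct trigger_data *d)
{
	trigger_get(d);
	registered = d;
	return 0;
}

/* The caller holds its own reference across reg(), then drops it. */
static int trigger_callback(int (*reg)(struct trigger_data *))
{
	struct trigger_data *d = calloc(1, sizeof(*d));
	int ret;

	if (!d)
		return -1;
	trigger_get(d);  /* caller's reference */
	ret = reg(d);
	trigger_put(d);  /* frees only if reg() kept no reference */
	return ret;
}
```

On the failure path the object is released exactly once, by the caller's final put; on success the registration's reference keeps it alive after the caller returns.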
      
      Link: http://lkml.kernel.org/r/20180724191331.738eb819@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Fixes: 93e31ffb ("tracing: Add 'snapshot' event trigger command")
      Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  6. Jul 23, 2018
    • Linux 4.18-rc6 · d72e90f3
      Linus Torvalds authored
      v4.18-rc6
    • Merge tag 'nvme-for-4.18' of git://git.infradead.org/nvme · 74413084
      Linus Torvalds authored
      Pull NVMe fixes from Christoph Hellwig:
      
       - fix a regression in 4.18 that causes a memory leak on probe failure
         (Keith Busch)
      
       - fix a deadlock in the passthrough ioctl code (Scott Bauer)
      
       - don't enable AENs if not supported (Weiping Zhang)
      
       - fix an old regression in metadata handling in the passthrough ioctl
         code (Roland Dreier)
      
      * tag 'nvme-for-4.18' of git://git.infradead.org/nvme:
        nvme: fix handling of metadata_len for NVME_IOCTL_IO_CMD
        nvme: don't enable AEN if not supported
        nvme: ensure forward progress during Admin passthru
        nvme-pci: fix memory leak on probe failure
    • Linus Torvalds's avatar
      Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 165ea0d1
      Linus Torvalds authored
      Pull vfs fixes from Al Viro:
       "Fix several places that screw up cleanups after failures halfway
        through opening a file (one open-coding filp_clone_open() and getting
        it wrong, two misusing alloc_file()). That part is -stable fodder from
        the 'work.open' branch.
      
        And Christoph's regression fix for uapi breakage in aio series;
        include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
        definition of sigset_t, the reason for doing so in the first place had
        been bogus - there's no need to expose struct __aio_sigset in
        aio_abi.h at all"
      
      * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        aio: don't expose __aio_sigset in uapi
        ocxlflash_getfile(): fix double-iput() on alloc_file() failures
        cxl_getfile(): fix double-iput() on alloc_file() failures
        drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()
    • Al Viro's avatar
      alpha: fix osf_wait4() breakage · f88a333b
      Al Viro authored
      kernel_wait4() expects a userland address for status - it's only
      rusage that goes as a kernel one (and needs a copyout afterwards)
      
      [ Also, fix the prototype of kernel_wait4() to have that __user
        annotation - Linus ]
      
      Fixes: 92ebce5a ("osf_wait4: switch to kernel_wait4()")
      Cc: stable@kernel.org # v4.13+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. Jul 22, 2018
    • Linus Torvalds's avatar
      Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 45ae4df9
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
      
       - Fix interrupt type on ethernet switch for i.MX-based RDU2
      
       - GPC on i.MX exposed too large a register window which resulted in
         userspace being able to crash the machine.
      
       - Fixup of bad merge resolution moving GPIO DT nodes under pinctrl on
         droid4.
      
      * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        ARM: dts: imx6: RDU2: fix irq type for mv88e6xxx switch
        soc: imx: gpc: restrict register range for regmap access
        ARM: dts: omap4-droid4: fix dts w.r.t. pwm
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ef81e63e
      Linus Torvalds authored
      Pull x86 fix from Ingo Molnar:
       "A single fix for a MCE-polling regression, which prevented the
        disabling of polling"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/MCE: Remove min interval polling limitation
    • Linus Torvalds's avatar
      Merge branch 'x86-pti-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 43227e09
      Linus Torvalds authored
      Pull x86 pti fixes from Ingo Molnar:
       "An APM fix, and a BTS hardware-tracing fix related to PTI changes"
      
      * 'x86-pti-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/apm: Don't access __preempt_count with zeroed fs
        x86/events/intel/ds: Fix bts_interrupt_threshold alignment
    • Linus Torvalds's avatar
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 48b1db7c
      Linus Torvalds authored
      Pull scheduler fixes from Ingo Molnar:
       "Two fixes: a stop-machine preemption fix and a SCHED_DEADLINE fix"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/deadline: Fix switched_from_dl() warning
        stop_machine: Disable preemption when waking two stopper threads
    • Linus Torvalds's avatar
      Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ea75a2c7
      Linus Torvalds authored
      Pull core kernel fixes from Ingo Molnar:
       "This is mostly the copy_to_user_mcsafe() related fixes from Dan
        Williams, and an ORC fix for Clang"
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/asm/memcpy_mcsafe: Fix copy_to_user_mcsafe() exception handling
        lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()
        lib/iov_iter: Document _copy_to_iter_flushcache()
        lib/iov_iter: Document _copy_to_iter_mcsafe()
        objtool: Use '.strtab' if '.shstrtab' doesn't exist, to support ORC tables on Clang
    • Linus Torvalds's avatar
      Merge tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · ffb48e79
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "Two regression fixes, one for xmon disassembly formatting and the
        other to fix the E500 build.
      
        Two commits to fix a potential security issue in the VFIO code under
        obscure circumstances.
      
        And finally a fix to the Power9 idle code to restore SPRG3, which is
        user visible and used for sched_getcpu().
      
        Thanks to: Alexey Kardashevskiy, David Gibson, Gautham R. Shenoy,
        James Clarke"
      
      * tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/powernv: Fix save/restore of SPRG3 on entry/exit from stop (idle)
        powerpc/Makefile: Assemble with -me500 when building for E500
        KVM: PPC: Check if IOMMU page is contained in the pinned physical page
        vfio/spapr: Use IOMMU pageshift rather than pagesize
        powerpc/xmon: Fix disassembly since printf changes
    • Linus Torvalds's avatar
      Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 55b636b4
      Linus Torvalds authored
      Pull btrfs fix from David Sterba:
       "A fix of a corruption regarding fsync and clone, under some very
        specific conditions explained in the patch.
      
        The fix is marked for stable 3.16+ so I'd like to get it merged now
        given the impact"
      
      * tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        Btrfs: fix file data corruption after cloning a range and fsync
    • Linus Torvalds's avatar
      mm: make vm_area_alloc() initialize core fields · 490fc053
      Linus Torvalds authored
      
      
      Like vm_area_dup(), it initializes the anon_vma_chain head, and the
      basic mm pointer.
      
      The rest of the fields end up being different for different users,
      although the plan is to also initialize the 'vm_ops' field to a dummy
      entry.
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Linus Torvalds's avatar
      mm: make vm_area_dup() actually copy the old vma data · 95faf699
      Linus Torvalds authored
      
      
      .. and re-initialize the anon_vma_chain head.
      
      This removes some boiler-plate from the users, and also makes it clear
      why it didn't need to use the 'zalloc()' version.
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Linus Torvalds's avatar
      mm: use helper functions for allocating and freeing vm_area structs · 3928d4f5
      Linus Torvalds authored
      
      
      The vm_area_struct is one of the most fundamental memory management
      objects, but the management of it is entirely open-coded everywhere,
      ranging from allocation and freeing (using kmem_cache_[z]alloc and
      kmem_cache_free) to initializing all the fields.
      
      We want to unify this in order to end up having some unified
      initialization of the vmas, and the first step to this is to at least
      have basic allocation functions.
      
      Right now those functions are literally just wrappers around the
      kmem_cache_*() calls.  This is a purely mechanical conversion:
      
          # new vma:
          kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
      
          # copy old vma
          kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
      
          # free vma
          kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
      
      to the point where the old vma passed in to the vm_area_dup() function
      isn't even used yet (because I've left all the old manual initialization
      alone).
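
As a rough illustration of what "literally just wrappers" means at this step, here is a userspace sketch. It assumes calloc()/malloc()/free() in place of the kmem_cache_*() calls, and struct vma_model, vma_model_alloc(), vma_model_dup(), vma_model_free() and split_model() are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-in for struct vm_area_struct. */
struct vma_model {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* Stands in for the zeroing allocation: the new object starts all zero. */
static struct vma_model *vma_model_alloc(void)
{
	return calloc(1, sizeof(struct vma_model));
}

/*
 * Stands in for the non-zeroing allocation. At this step 'orig' is
 * deliberately unused: callers still copy the fields themselves.
 */
static struct vma_model *vma_model_dup(const struct vma_model *orig)
{
	(void)orig;
	return malloc(sizeof(struct vma_model));
}

static void vma_model_free(struct vma_model *vma)
{
	free(vma);
}

/* A caller that keeps its old open-coded copy after the conversion. */
static struct vma_model *split_model(const struct vma_model *old,
				     unsigned long addr)
{
	struct vma_model *copy = vma_model_dup(old);

	if (!copy)
		return NULL;
	*copy = *old;		/* manual initialization left in the caller */
	copy->vm_start = addr;
	return copy;
}
```

Keeping the old pointer in the dup signature even while it is unused means a later change can move the copy into the helper without touching any call site.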
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 191a3afa
      Linus Torvalds authored
      Merge fixes from Andrew Morton:
       "5 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: memcg: fix use after free in mem_cgroup_iter()
        mm/huge_memory.c: fix data loss when splitting a file pmd
        fat: fix memory allocation failure handling of match_strdup()
        MAINTAINERS: Peter has moved
        mm/memblock: add missing include <linux/bootmem.h>
    • Jing Xia's avatar
      mm: memcg: fix use after free in mem_cgroup_iter() · 9f15bde6
      Jing Xia authored
      It was reported that a kernel crash happened in mem_cgroup_iter(), which
      can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.
      
      Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
      ......
      Call trace:
        mem_cgroup_iter+0x2e0/0x6d4
        shrink_zone+0x8c/0x324
        balance_pgdat+0x450/0x640
        kswapd+0x130/0x4b8
        kthread+0xe8/0xfc
        ret_from_fork+0x10/0x20
      
        mem_cgroup_iter():
            ......
            if (css_tryget(css))    <-- crash here
                break;
            ......
      
      The crash happens because mem_cgroup_iter() uses the memcg object whose
      pointer is stored in iter->position, even though that object has already
      been freed and filled with POISON_FREE (0x6b).
      
      The root cause of the use-after-free issue is that
      invalidate_reclaim_iterators() fails to reset the value of iter->position
      to NULL when the css of the memcg is released in non-hierarchical mode.
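
The stale-pointer problem can be modeled in a few lines of userspace C. This is a sketch under stated assumptions: struct memcg_model and both functions are hypothetical, and the extra pass over the root is one plausible way to close the gap, not necessarily the shape of the actual patch. The model assumes that without hierarchy a group has no parent link, so a walk from the dead group towards the root never reaches the root.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a memcg with a single cached reclaim position. */
struct memcg_model {
	struct memcg_model *parent;		/* NULL when no hierarchy is used */
	struct memcg_model *iter_position;	/* cached pointer, may go stale */
};

/* Drop the cached position if it points at the group being released. */
static void clear_position(struct memcg_model *from,
			   const struct memcg_model *dead)
{
	if (from->iter_position == dead)
		from->iter_position = NULL;
}

/*
 * Walk from the dead group upwards and clear stale positions. If the
 * walk never reached the root (no parent links), clear the root too.
 */
static void invalidate_iterators_model(struct memcg_model *root,
				       struct memcg_model *dead)
{
	struct memcg_model *m, *last = NULL;

	for (m = dead; m; m = m->parent) {
		clear_position(m, dead);
		last = m;
	}
	if (last != root)
		clear_position(root, dead);
}
```

Without the final check, a root whose cached position points at a parentless child would keep that pointer after the child is released, which is the kind of dangling reference the crash above shows.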
      
      Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
      Fixes: 6df38689 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
      Signed-off-by: Jing Xia <jing.xia.mail@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <chunyan.zhang@unisoc.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Hugh Dickins's avatar
      mm/huge_memory.c: fix data loss when splitting a file pmd · e1f1b157
      Hugh Dickins authored
      __split_huge_pmd_locked() must check if the cleared huge pmd was dirty,
      and propagate that to PageDirty: otherwise, data may be lost when a huge
      tmpfs page is modified then split then reclaimed.
      
      How has this taken so long to be noticed?  Because there was no problem
      when the huge page is written by a write system call (shmem_write_end()
      calls set_page_dirty()), nor when the page is allocated for a write fault
      (fault_dirty_shared_page() calls set_page_dirty()); but when allocated for
      a read fault (which MAP_POPULATE simulates), nothing calls set_page_dirty().
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1807111741430.1106@eggly.anvils
      Fixes: d21b9e57 ("thp: handle file pages in split_huge_pmd()")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Ashwin Chaugule <ashwinch@google.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>