Skip to content
  1. Mar 16, 2017
    • Steven Rostedt (VMware)'s avatar
      sched/deadline: Use deadline instead of period when calculating overflow · 2317d5f1
      Steven Rostedt (VMware) authored
      
      
      I was testing Daniel's changes with his test case, and tweaked it a
      little. Instead of having the runtime equal to the deadline, I
      increased the deadline ten fold.
      
      Daniel's test case had:
      
      	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */
      
      To make it more interesting, I changed it to:
      
      	attr.sched_runtime  =  2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 20 * 1000 * 1000;		/* 20 ms */
      	attr.sched_period   =  2 * 1000 * 1000 * 1000;	/* 2 s */
      
      The results were rather surprising. The behavior that Daniel's patch
      was fixing came back. The task started using much more than .1% of the
      CPU. More like 20%.
      
      Looking into this I found that it was due to the dl_entity_overflow()
      constantly returning true. That's because it uses the relative period
      against relative runtime vs the absolute deadline against absolute
      runtime.
      
        runtime / (deadline - t) > dl_runtime / dl_period
      
      There's even a comment mentioning this, and saying that when relative
      deadline equals relative period, that the equation is the same as using
      deadline instead of period. That comment is backwards! What we really
      want is:
      
        runtime / (deadline - t) > dl_runtime / dl_deadline
      
      We care about if the runtime can make its deadline, not its period. And
      then we can say "when the deadline equals the period, the equation is
      the same as using dl_period instead of dl_deadline".
      
      After correcting this, now when the task gets enqueued, it can throttle
      correctly, and Daniel's fix to the throttling of sleeping deadline
      tasks works even when the runtime and deadline are not the same.
      
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      2317d5f1
    • Daniel Bristot de Oliveira's avatar
      sched/deadline: Throttle a constrained deadline task activated after the deadline · df8eac8c
      Daniel Bristot de Oliveira authored
      
      
      During the activation, CBS checks if it can reuse the current task's
      runtime and period. If the deadline of the task is in the past, CBS
      cannot use the runtime, and so it replenishes the task. This rule
      works fine for implicit deadline tasks (deadline == period), and the
      CBS was designed for implicit deadline tasks. However, a task with
      constrained deadline (deadine < period) might be awakened after the
      deadline, but before the next period. In this case, replenishing the
      task would allow it to run for runtime / deadline. As in this case
      deadline < period, CBS enables a task to run for more than the
      runtime / period. In a very loaded system, this can cause a domino
      effect, making other tasks miss their deadlines.
      
      To avoid this problem, in the activation of a constrained deadline
      task after the deadline but before the next period, throttle the
      task and set the replenishing timer to the begin of the next period,
      unless it is boosted.
      
      Reproducer:
      
       --------------- %< ---------------
        int main (int argc, char **argv)
        {
      	int ret;
      	int flags = 0;
      	unsigned long l = 0;
      	struct timespec ts;
      	struct sched_attr attr;
      
      	memset(&attr, 0, sizeof(attr));
      	attr.size = sizeof(attr);
      
      	attr.sched_policy   = SCHED_DEADLINE;
      	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */
      
      	ts.tv_sec = 0;
      	ts.tv_nsec = 2000 * 1000;			/* 2 ms */
      
      	ret = sched_setattr(0, &attr, flags);
      
      	if (ret < 0) {
      		perror("sched_setattr");
      		exit(-1);
      	}
      
      	for(;;) {
      		/* XXX: you may need to adjust the loop */
      		for (l = 0; l < 150000; l++);
      		/*
      		 * The ideia is to go to sleep right before the deadline
      		 * and then wake up before the next period to receive
      		 * a new replenishment.
      		 */
      		nanosleep(&ts, NULL);
      	}
      
      	exit(0);
        }
        --------------- >% ---------------
      
      On my box, this reproducer uses almost 50% of the CPU time, which is
      obviously wrong for a task with 2/2000 reservation.
      
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      df8eac8c
    • Daniel Bristot de Oliveira's avatar
      sched/deadline: Make sure the replenishment timer fires in the next period · 5ac69d37
      Daniel Bristot de Oliveira authored
      
      
      Currently, the replenishment timer is set to fire at the deadline
      of a task. Although that works for implicit deadline tasks because the
      deadline is equals to the begin of the next period, that is not correct
      for constrained deadline tasks (deadline < period).
      
      For instance:
      
      f.c:
       --------------- %< ---------------
      int main (void)
      {
      	for(;;);
      }
       --------------- >% ---------------
      
        # gcc -o f f.c
      
        # trace-cmd record -e sched:sched_switch                              \
      				   -e syscalls:sys_exit_sched_setattr   \
         chrt -d --sched-runtime  490000000					\
                 --sched-deadline 500000000					\
      	   --sched-period  1000000000 0 ./f
      
        # trace-cmd report | grep "{pid of ./f}"
      
      After setting parameters, the task is replenished and continue running
      until being throttled:
      
               f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0
      
      The task is throttled after running 492318 ms, as expected:
      
               f-11295 [003] 13322.606094: sched_switch:   f:11295 [-1] R ==> watchdog/3:32 [0]
      
      But then, the task is replenished 500719 ms after the first
      replenishment:
      
          <idle>-0     [003] 13322.614495: sched_switch:   swapper/3:0 [120] R ==> f:11295 [-1]
      
      Running for 490277 ms:
      
               f-11295 [003] 13323.104772: sched_switch:   f:11295 [-1] R ==>  swapper/3:0 [120]
      
      Hence, in the first period, the task runs 2 * runtime, and that is a bug.
      
      During the first replenishment, the next deadline is set one period away.
      So the runtime / period starts to be respected. However, as the second
      replenishment took place in the wrong instant, the next replenishment
      will also be held in a wrong instant of time. Rather than occurring in
      the nth period away from the first activation, it is taking place
      in the (nth period - relative deadline).
      
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarLuca Abeni <luca.abeni@santannapisa.it>
      Reviewed-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: default avatarJuri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      5ac69d37
    • Matt Fleming's avatar
      sched/loadavg: Use {READ,WRITE}_ONCE() for sample window · caeb5882
      Matt Fleming authored
      
      
      'calc_load_update' is accessed without any kind of locking and there's
      a clear assumption in the code that only a single value is read or
      written.
      
      Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
      unintentionally seeing multiple values, or having the load/stores
      split.
      
      Technically the loads in calc_global_*() don't require this since
      those are the only functions that update 'calc_load_update', but I've
      added the READ_ONCE() for consistency.
      
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.uk
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      caeb5882
    • Matt Fleming's avatar
      sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting · 6e5f32f7
      Matt Fleming authored
      If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
      the pending sample window time on exit, setting the next update not
      one window into the future, but two.
      
      This situation on exiting NO_HZ is described by:
      
        this_rq->calc_load_update < jiffies < calc_load_update
      
      In this scenario, what we should be doing is:
      
        this_rq->calc_load_update = calc_load_update		     [ next window ]
      
      But what we actually do is:
      
        this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]
      
      This has the effect of delaying load average updates for potentially
      up to ~9seconds.
      
      This can result in huge spikes in the load average values due to
      per-cpu uninterruptible task counts being out of sync when accumulated
      across all CPUs.
      
      It's safe to update the per-cpu active count if we wake between sample
      windows because any load that we left in 'calc_load_idle' will have
      been zero'd when the idle load was folded in calc_global_load().
      
      This issue is easy to reproduce before,
      
        commit 9d89c257
      
       ("sched/fair: Rewrite runnable load and utilization average tracking")
      
      just by forking short-lived process pipelines built from ps(1) and
      grep(1) in a loop. I'm unable to reproduce the spikes after that
      commit, but the bug still seems to be present from code review.
      
      Signed-off-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: commit 5167e8d5 ("sched/nohz: Rewrite and fix load-avg computation -- again")
      Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      6e5f32f7
    • Wanpeng Li's avatar
      sched/deadline: Add missing update_rq_clock() in dl_task_timer() · dcc3b5ff
      Wanpeng Li authored
      
      
      The following warning can be triggered by hot-unplugging the CPU
      on which an active SCHED_DEADLINE task is running on:
      
       ------------[ cut here ]------------
       WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
       rq->clock_update_flags < RQCF_ACT_SKIP
       CPU: 7 PID: 0 Comm: swapper/7 Tainted: G    B           4.11.0-rc1+ #24
       Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
       Call Trace:
        <IRQ>
        dump_stack+0x85/0xc4
        __warn+0x172/0x1b0
        warn_slowpath_fmt+0xb4/0xf0
        ? __warn+0x1b0/0x1b0
        ? debug_check_no_locks_freed+0x2c0/0x2c0
        ? cpudl_set+0x3d/0x2b0
        replenish_dl_entity+0x71e/0xc40
        enqueue_task_dl+0x2ea/0x12e0
        ? dl_task_timer+0x777/0x990
        ? __hrtimer_run_queues+0x270/0xa50
        dl_task_timer+0x316/0x990
        ? enqueue_task_dl+0x12e0/0x12e0
        ? enqueue_task_dl+0x12e0/0x12e0
        __hrtimer_run_queues+0x270/0xa50
        ? hrtimer_cancel+0x20/0x20
        ? hrtimer_interrupt+0x119/0x600
        hrtimer_interrupt+0x19c/0x600
        ? trace_hardirqs_off+0xd/0x10
        local_apic_timer_interrupt+0x74/0xe0
        smp_apic_timer_interrupt+0x76/0xa0
        apic_timer_interrupt+0x93/0xa0
      
      The DL task will be migrated to a suitable later deadline rq once the DL
      timer fires and currnet rq is offline. The rq clock of the new rq should
      be updated. This patch fixes it by updating the rq clock after holding
      the new rq's rq lock.
      
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      dcc3b5ff
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 69eea5a4
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Four small fixes for this cycle:
      
         - followup fix from Neil for a fix that went in before -rc2, ensuring
           that we always see the full per-task bio_list.
      
         - fix for blk-mq-sched from me that ensures that we retain similar
           direct-to-issue behavior on running the queue.
      
         - fix from Sagi fixing a potential NULL pointer dereference in blk-mq
           on spurious CPU unplug.
      
         - a memory leak fix in writeback from Tahsin, fixing a case where
           device removal of a mounted device can leak a struct
           wb_writeback_work"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()
        writeback: fix memory leak in wb_queue_work()
        blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
        blk: Ensure users for current->bio_list can see the full list.
      69eea5a4
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 95422dec
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "This is a rather large set of fixes. The bulk are for lpfc correcting
        a lot of issues in the new NVME driver code which just went in in the
        merge window.
      
        The others are:
      
         - fix a hang in the vmware paravirt driver caused by incorrect
           handling of the new MSI vector allocation
      
         - long standing bug in storvsc, which recent block changes turned
           from being a harmless annoyance into a hang
      
         - yet more fallout (in mpt3sas) from the changes to device blocking
      
        The remainder are small fixes and updates"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (34 commits)
        scsi: lpfc: Add shutdown method for kexec
        scsi: storvsc: Workaround for virtual DVD SCSI version
        scsi: lpfc: revise version number to 11.2.0.10
        scsi: lpfc: code cleanups in NVME initiator discovery
        scsi: lpfc: code cleanups in NVME initiator base
        scsi: lpfc: correct rdp diag portnames
        scsi: lpfc: remove dead sli3 nvme code
        scsi: lpfc: correct double print
        scsi: lpfc: Rename LPFC_MAX_EQ_DELAY to LPFC_MAX_EQ_DELAY_EQID_CNT
        scsi: lpfc: Rework lpfc Kconfig for NVME options
        scsi: lpfc: add transport eh_timed_out reference
        scsi: lpfc: Fix eh_deadline setting for sli3 adapters.
        scsi: lpfc: add NVME exchange aborts
        scsi: lpfc: Fix nvme allocation bug on failed nvme_fc_register_localport
        scsi: lpfc: Fix IO submission if WQ is full
        scsi: lpfc: Fix NVME CMD IU byte swapped word 1 problem
        scsi: lpfc: Fix RCTL value on NVME LS request and response
        scsi: lpfc: Fix crash during Hardware error recovery on SLI3 adapters
        scsi: lpfc: fix missing spin_unlock on sql_list_lock
        scsi: lpfc: don't dereference dma_buf->iocbq before null check
        ...
      95422dec
    • Linus Torvalds's avatar
      Merge tag 'gfs2-4.11-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · aabcf5fc
      Linus Torvalds authored
      Pull gfs2 fix from Bob Peterson:
       "This is an emergency patch for 4.11-rc3
      
        The GFS2 developers uncovered a really nasty problem that can lead to
        random corruption and kernel panic, much like the last one. Andreas
        Gruenbacher wrote a simple one-line patch to fix the problem."
      
      * tag 'gfs2-4.11-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
        gfs2: Avoid alignment hole in struct lm_lockname
      aabcf5fc
    • Linus Torvalds's avatar
      Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · defc7d75
      Linus Torvalds authored
      Pull crypto fixes from Herbert Xu:
      
       - self-test failure of crc32c on powerpc
      
       - regressions of ecb(aes) when used with xts/lrw in s5p-sss
      
       - a number of bugs in the omap RNG driver
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: s5p-sss - Fix spinlock recursion on LRW(AES)
        hwrng: omap - Do not access INTMASK_REG on EIP76
        hwrng: omap - use devm_clk_get() instead of of_clk_get()
        hwrng: omap - write registers after enabling the clock
        crypto: s5p-sss - Fix completing crypto request in IRQ handler
        crypto: powerpc - Fix initialisation of crc32c context
      defc7d75
  2. Mar 15, 2017
  3. Mar 14, 2017
    • Hannes Frederic Sowa's avatar
      dccp: fix memory leak during tear-down of unsuccessful connection request · 72ef9c41
      Hannes Frederic Sowa authored
      
      
      This patch fixes a memory leak, which happens if the connection request
      is not fulfilled between parsing the DCCP options and handling the SYN
      (because e.g. the backlog is full), because we forgot to free the
      list of ack vectors.
      
      Reported-by: default avatarJianwen Ji <jiji@redhat.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      72ef9c41
    • Hannes Frederic Sowa's avatar
      tun: fix premature POLLOUT notification on tun devices · b20e2d54
      Hannes Frederic Sowa authored
      aszlig observed failing ssh tunnels (-w) during initialization since
      commit cc9da6cc ("ipv6: addrconf: use stable address generator for
      ARPHRD_NONE"). We already had reports that the mentioned commit breaks
      Juniper VPN connections. I can't clearly say that the Juniper VPN client
      has the same problem, but it is worth a try to hint to this patch.
      
      Because of the early generation of link local addresses, the kernel now
      can start asking for routers on the local subnet much earlier than usual.
      Those router solicitation packets arrive inside the ssh channels and
      should be transmitted to the tun fd before the configuration scripts
      might have upped the interface and made it ready for transmission.
      
      ssh polls on the interface and receives back a POLL_OUT. It tries to send
      the earily router solicitation packet to the tun interface.  Unfortunately
      it hasn't been up'ed yet by config scripts, thus failing with -EIO. ssh
      doesn't retry again and considers the tun interface broken forever.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=121131
      Fixes: cc9da6cc
      
       ("ipv6: addrconf: use stable address generator for ARPHRD_NONE")
      Cc: Bjørn Mork <bjorn@mork.no>
      Reported-by: default avatarValdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Reported-by: default avatarJonas Lippuner <jonas@lippuner.ca>
      Cc: Jonas Lippuner <jonas@lippuner.ca>
      Reported-by: default avataraszlig <aszlig@redmoonstudios.org>
      Cc: aszlig <aszlig@redmoonstudios.org>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b20e2d54
    • Jon Maxwell's avatar
      dccp/tcp: fix routing redirect race · 45caeaa5
      Jon Maxwell authored
      As Eric Dumazet pointed out this also needs to be fixed in IPv6.
      v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.
      
      We have seen a few incidents lately where a dst_enty has been freed
      with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
      dst_entry. If the conditions/timings are right a crash then ensues when the
      freed dst_entry is referenced later on. A Common crashing back trace is:
      
       #8 [] page_fault at ffffffff8163e648
          [exception RIP: __tcp_ack_snd_check+74]
      .
      .
       #9 [] tcp_rcv_established at ffffffff81580b64
      #10 [] tcp_v4_do_rcv at ffffffff8158b54a
      #11 [] tcp_v4_rcv at ffffffff8158cd02
      #12 [] ip_local_deliver_finish at ffffffff815668f4
      #13 [] ip_local_deliver at ffffffff81566bd9
      #14 [] ip_rcv_finish at ffffffff8156656d
      #15 [] ip_rcv at ffffffff81566f06
      #16 [] __netif_receive_skb_core at ffffffff8152b3a2
      #17 [] __netif_receive_skb at ffffffff8152b608
      #18 [] netif_receive_skb at ffffffff8152b690
      #19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
      #20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
      #21 [] net_rx_action at ffffffff8152bac2
      #22 [] __do_softirq at ffffffff81084b4f
      #23 [] call_softirq at ffffffff8164845c
      #24 [] do_softirq at ffffffff81016fc5
      #25 [] irq_exit at ffffffff81084ee5
      #26 [] do_IRQ at ffffffff81648ff8
      
      Of course it may happen with other NIC drivers as well.
      
      It's found the freed dst_entry here:
      
       224 static bool tcp_in_quickack_mode(struct sock *sk)
       225 {
       226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);
       227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);
       228 
       229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
       230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
       231 }
      
      But there are other backtraces attributed to the same freed dst_entry in
      netfilter code as well.
      
      All the vmcores showed 2 significant clues:
      
      - Remote hosts behind the default gateway had always been redirected to a
      different gateway. A rtable/dst_entry will be added for that host. Making
      more dst_entrys with lower reference counts. Making this more probable.
      
      - All vmcores showed a postitive LockDroppedIcmps value, e.g:
      
      LockDroppedIcmps                  267
      
      A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
      regardless of whether user space has the socket locked. This can result in a
      race condition where the same dst_entry cached in sk->sk_dst_entry can be
      decremented twice for the same socket via:
      
      do_redirect()->__sk_dst_check()-> dst_release().
      
      Which leads to the dst_entry being prematurely freed with another socket
      pointing to it via sk->sk_dst_cache and a subsequent crash.
      
      To fix this skip do_redirect() if usespace has the socket locked. Instead let
      the redirect take place later when user space does not have the socket
      locked.
      
      The dccp/IPv6 code is very similar in this respect, so fixing it there too.
      
      As Eric Garver pointed out the following commit now invalidates routes. Which
      can set the dst->obsolete flag so that ipv4_dst_check() returns null and
      triggers the dst_release().
      
      Fixes: ceb33206
      
       ("ipv4: Kill routes during PMTU/redirect updates.")
      Cc: Eric Garver <egarver@redhat.com>
      Cc: Hannes Sowa <hsowa@redhat.com>
      Signed-off-by: default avatarJon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      45caeaa5
    • Zhao Qiang's avatar
      ucc/hdlc: fix two little issue · 02bb56dd
      Zhao Qiang authored
      
      
      1. modify bd_status from u32 to u16 in function hdlc_rx_done,
      because bd_status register is 16bits
      2. write bd_length register before writing bd_status register
      
      Signed-off-by: default avatarZhao Qiang <qiang.zhao@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      02bb56dd
    • Linus Torvalds's avatar
      Merge tag 'powerpc-4.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · fb5fe0fd
      Linus Torvalds authored
      Pull some more powerpc fixes from Michael Ellerman:
       "The main item is the addition of the Power9 Machine Check handler.
        This was delayed to make sure some details were correct, and is as
        minimal as possible.
      
        The rest is small fixes, two for the Power9 PMU, two dealing with
        obscure toolchain problems, two for the PowerNV IOMMU code (used by
        VFIO), and one to fix a crash on 32-bit machines with macio devices
        due to missing dma_ops.
      
        Thanks to:
          Alexey Kardashevskiy, Cyril Bur, Larry Finger, Madhavan Srinivasan,
          Nicholas Piggin"
      
      * tag 'powerpc-4.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/64s: POWER9 machine check handler
        powerpc/64s: allow machine check handler to set severity and initiator
        powerpc/64s: fix handling of non-synchronous machine checks
        powerpc/pmac: Fix crash in dma-mapping.h with NULL dma_ops
        powerpc/powernv/ioda2: Update iommu table base on ownership change
        powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
        selftests/powerpc: Replace stxvx and lxvx with stxvd2x/lxvd2x
        powerpc/perf: Handle sdar_mode for marked event in power9
        powerpc/perf: Fix perf_get_data_addr() for power9 DD1
        powerpc/boot: Fix zImage TOC alignment
      fb5fe0fd
    • Nicolas Dichtel's avatar
      vxlan: fix ovs support · c80498e3
      Nicolas Dichtel authored
      The required changes in the function vxlan_dev_create() were missing
      in commit 8bcdc4f3.
      The vxlan device is not registered anymore after this patch and the error
      path causes an stack dump:
       WARNING: CPU: 3 PID: 1498 at net/core/dev.c:6713 rollback_registered_many+0x9d/0x3f0
      
      Fixes: 8bcdc4f3
      
       ("vxlan: add changelink support")
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c80498e3
    • Andrey Vagin's avatar
      net: use net->count to check whether a netns is alive or not · 91864f58
      Andrey Vagin authored
      
      
      The previous idea was to check whether a net namespace is in
      net_exit_list or not. It doesn't work, because net->exit_list is used in
      __register_pernet_operations and __unregister_pernet_operations where
      all namespaces are added to a temporary list to make cleanup in a error
      case, so list_empty(&net->exit_list) always returns false.
      
      Reported-by: default avatarMantas Mikulėnas <grawity@gmail.com>
      Fixes: 002d8a1a
      
       ("net: skip genenerating uevents for network namespaces that are exiting")
      Signed-off-by: default avatarAndrei Vagin <avagin@openvz.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      91864f58
    • Linus Torvalds's avatar
      Merge tag 'platform-drivers-x86-v4.11-2' of git://git.infradead.org/linux-platform-drivers-x86 · 065f3e49
      Linus Torvalds authored
      Pull x86 platform driver updates from Darren Hart:
       "Asus fixes for the airplane LED and a long awaited fujitsu cleanup.
      
        asus-wmi:
         - Remove quirk_no_rfkill
         - Detect quirk_no_rfkill from the DSDT
      
        fujitsu-laptop:
         - remove redundant MODULE_ALIAS entries
         - autodetect LCD interface on all models
         - simplify acpi_bus_register_driver() error handling
         - remove redundant forward declarations
         - replace numeric values with constants
         - rename FUNC_RFKILL to FUNC_FLAGS
         - make platform-related variables match naming convention
         - replace "hotkey" with "laptop" in symbol names
         - clearly denote backlight-related symbols"
      
      * tag 'platform-drivers-x86-v4.11-2' of git://git.infradead.org/linux-platform-drivers-x86:
        platform/x86: asus-wmi: Remove quirk_no_rfkill
        platform/x86: asus-wmi: Detect quirk_no_rfkill from the DSDT
        platform/x86: fujitsu-laptop: remove redundant MODULE_ALIAS entries
        platform/x86: fujitsu-laptop: autodetect LCD interface on all models
        platform/x86: fujitsu-laptop: simplify acpi_bus_register_driver() error handling
        platform/x86: fujitsu-laptop: remove redundant forward declarations
        platform/x86: fujitsu-laptop: replace numeric values with constants
        platform/x86: fujitsu-laptop: rename FUNC_RFKILL to FUNC_FLAGS
        platform/x86: fujitsu-laptop: make platform-related variables match naming convention
        platform/x86: fujitsu-laptop: replace "hotkey" with "laptop" in symbol names
        platform/x86: fujitsu-laptop: clearly denote backlight-related symbols
      065f3e49
    • Florian Westphal's avatar
      bridge: drop netfilter fake rtable unconditionally · a13b2082
      Florian Westphal authored
      Andreas reports kernel oops during rmmod of the br_netfilter module.
      Hannes debugged the oops down to a NULL rt6info->rt6i_indev.
      
      Problem is that br_netfilter has the nasty concept of adding a fake
      rtable to skb->dst; this happens in a br_netfilter prerouting hook.
      
      A second hook (in bridge LOCAL_IN) is supposed to remove these again
      before the skb is handed up the stack.
      
      However, on module unload hooks get unregistered which means an
      skb could traverse the prerouting hook that attaches the fake_rtable,
      while the 'fake rtable remove' hook gets removed from the hooklist
      immediately after.
      
      Fixes: 34666d46
      
       ("netfilter: bridge: move br_netfilter out of the core")
      Reported-by: default avatarAndreas Karis <akaris@redhat.com>
      Debugged-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Acked-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a13b2082
    • Florian Westphal's avatar
      ipv6: avoid write to a possibly cloned skb · 79e49503
      Florian Westphal authored
      
      
      ip6_fragment, in case skb has a fraglist, checks if the
      skb is cloned.  If it is, it will move to the 'slow path' and allocates
      new skbs for each fragment.
      
      However, right before entering the slowpath loop, it updates the
      nexthdr value of the last ipv6 extension header to NEXTHDR_FRAGMENT,
      to account for the fragment header that will be inserted in the new
      ipv6-fragment skbs.
      
      In case original skb is cloned this munges nexthdr value of another
      skb.  Avoid this by doing the nexthdr update for each of the new fragment
      skbs separately.
      
      This was observed with tcpdump on a bridge device where netfilter ipv6
      reassembly is active:  tcpdump shows malformed fragment headers as
      the l4 header (icmpv6, tcp, etc). is decoded as a fragment header.
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reported-by: default avatarAndreas Karis <akaris@redhat.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      79e49503
    • Johan Hovold's avatar
      net: wimax/i2400m: fix NULL-deref at probe · 6e526fdf
      Johan Hovold authored
      Make sure to check the number of endpoints to avoid dereferencing a
      NULL-pointer or accessing memory beyond the endpoint array should a
      malicious device lack the expected endpoints.
      
      The endpoints are specifically dereferenced in the i2400m_bootrom_init
      path during probe (e.g. in i2400mu_tx_bulk_out).
      
      Fixes: f398e424
      
       ("i2400m/USB: probe/disconnect, dev init/shutdown
      and reset backends")
      Cc: Inaky Perez-Gonzalez <inaky@linux.intel.com>
      
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6e526fdf