Skip to content
  1. Feb 04, 2022
    • Florian Westphal's avatar
      netfilter: conntrack: don't refresh sctp entries in closed state · 77b33719
      Florian Westphal authored
      
      
      Vivek Thrivikraman reported:
       An SCTP server application which is accessed continuously by client
       application.
       When the session disconnects the client retries to establish a connection.
       After restart of SCTP server application the session is not established
       because of stale conntrack entry with connection state CLOSED as below.
      
       (removing this entry manually established new connection):
      
       sctp 9 CLOSED src=10.141.189.233 [..]  [ASSURED]
      
      Just skip timeout update of closed entries, we don't want them to
      stay around forever.
      
      Reported-and-tested-by: default avatarVivek Thrivikraman <vivek.thrivikraman@est.tech>
      Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1579
      
      
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      77b33719
    • Steen Hegelund's avatar
      net: sparx5: Fix get_stat64 crash in tcpdump · ed14fc7a
      Steen Hegelund authored
      This problem was found with Sparx5 when the tcpdump tool requests the
      do_get_stats64 (sparx5_get_stats64) statistic.
      
      The portstats pointer was incorrectly incremented when fetching priority
      based statistics.
      
      Fixes: af4b1102
      
       (net: sparx5: add ethtool configuration and statistics support)
      Signed-off-by: default avatarSteen Hegelund <steen.hegelund@microchip.com>
      Link: https://lore.kernel.org/r/20220203102900.528987-1-steen.hegelund@microchip.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ed14fc7a
    • Kees Cook's avatar
      gcc-plugins/stackleak: Use noinstr in favor of notrace · dcb85f85
      Kees Cook authored
      
      
      While the stackleak plugin was already using notrace, objtool is now a
      bit more picky.  Update the notrace uses to noinstr.  Silences the
      following objtool warnings when building with:
      
      CONFIG_DEBUG_ENTRY=y
      CONFIG_STACK_VALIDATION=y
      CONFIG_VMLINUX_VALIDATION=y
      CONFIG_GCC_PLUGIN_STACKLEAK=y
      
        vmlinux.o: warning: objtool: do_syscall_64()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: do_int80_syscall_32()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: exc_general_protection()+0x22: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: fixup_bad_iret()+0x20: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: do_machine_check()+0x27: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .text+0x5346e: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x143: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x10eb: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x17f9: call to stackleak_erase() leaves .noinstr.text section
      
      Note that the plugin's addition of calls to stackleak_track_stack() from
      noinstr functions is expected to be safe, as it isn't runtime
      instrumentation and is self-contained.
      
      Cc: Alexander Popov <alex.popov@linux.com>
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dcb85f85
    • Linus Torvalds's avatar
      Merge tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · eb2eb516
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf, netfilter, and ieee802154.
      
        Current release - regressions:
      
         - Partially revert "net/smc: Add netlink net namespace support", fix
           uABI breakage
      
         - netfilter:
            - nft_ct: fix use after free when attaching zone template
            - nft_byteorder: track register operations
      
        Previous releases - regressions:
      
         - ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
      
         - phy: qca8081: fix speeds lower than 2.5Gb/s
      
         - sched: fix use-after-free in tc_new_tfilter()
      
        Previous releases - always broken:
      
         - tcp: fix mem under-charging with zerocopy sendmsg()
      
         - tcp: add missing tcp_skb_can_collapse() test in
           tcp_shift_skb_data()
      
         - neigh: do not trigger immediate probes on NUD_FAILED from
           neigh_managed_work, avoid a deadlock
      
         - bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
           false-positives
      
         - netfilter: nft_reject_bridge: fix for missing reply from prerouting
      
         - smc: forward wakeup to smc socket waitqueue after fallback
      
         - ieee802154:
            - return meaningful error codes from the netlink helpers
            - mcr20a: fix lifs/sifs periods
            - at86rf230, ca8210: stop leaking skbs on error paths
      
         - macsec: add missing un-offload call for NETDEV_UNREGISTER of parent
      
         - ax25: add refcount in ax25_dev to avoid UAF bugs
      
         - eth: mlx5e:
            - fix SFP module EEPROM query
            - fix broken SKB allocation in HW-GRO
            - IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows
      
         - eth: amd-xgbe:
            - fix skb data length underflow
            - ensure reset of the tx_timer_active flag, avoid Tx timeouts
      
         - eth: stmmac: fix runtime pm use in stmmac_dvr_remove()
      
         - eth: e1000e: handshake with CSME starts from Alder Lake platforms"
      
      * tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
        ax25: fix reference count leaks of ax25_dev
        net: stmmac: ensure PTP time register reads are consistent
        net: ipa: request IPA register values be retained
        dt-bindings: net: qcom,ipa: add optional qcom,qmp property
        tools/resolve_btfids: Do not print any commands when building silently
        bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
        net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work
        tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
        net: sparx5: do not refer to skb after passing it on
        Partially revert "net/smc: Add netlink net namespace support"
        net/mlx5e: Avoid field-overflowing memcpy()
        net/mlx5e: Use struct_group() for memcpy() region
        net/mlx5e: Avoid implicit modify hdr for decap drop rule
        net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic
        net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
        net/mlx5e: Don't treat small ceil values as unlimited in HTB offload
        net/mlx5: E-Switch, Fix uninitialized variable modact
        net/mlx5e: Fix handling of wrong devices during bond netevent
        net/mlx5e: Fix broken SKB allocation in HW-GRO
        net/mlx5e: Fix wrong calculation of header index in HW_GRO
        ...
      eb2eb516
    • Linus Torvalds's avatar
      Merge tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · 551007a8
      Linus Torvalds authored
      Pull selinux fix from Paul Moore:
       "One small SELinux patch to ensure that a policy structure field is
        properly reset after freeing so that we don't inadvertently do a
        double-free on certain error conditions"
      
      * tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
        selinux: fix double free of cond_list on error paths
      551007a8
    • Linus Torvalds's avatar
      Merge tag 'linux-kselftest-fixes-5.17-rc3' of... · 25b20ae8
      Linus Torvalds authored
      Merge tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest fixes from Shuah Khan:
       "Important fixes to several tests and documentation clarification on
        running mainline kselftest on stable releases. A few notable fixes:
      
         - fix kselftest run hang due to child processes that haven't been
           terminated. Fix signals all child processes
      
         - fix false pass/fail results from vdso_test_abi, openat2, mincore
      
         - build failures when using -j (multiple jobs) option
      
         - exec test build failure due to incorrect build rule for a run-time
           created "pipe"
      
         - zram test fixes related to interaction with zram-generator to make
           sure zram test to coordinate deleted with zram-generator
      
         - zram test compression ratio calculation fix and skipping
           max_comp_streams.
      
         - increasing rtc test timeout
      
         - cpufreq test to write test results to stdout which will necessary
           on automated test systems"
      
      * tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kselftest: Fix vdso_test_abi return status
        selftests: skip mincore.check_file_mmap when fs lacks needed support
        selftests: openat2: Skip testcases that fail with EOPNOTSUPP
        selftests: openat2: Add missing dependency in Makefile
        selftests: openat2: Print also errno in failure messages
        selftests: futex: Use variable MAKE instead of make
        selftests/exec: Remove pipe from TEST_GEN_FILES
        selftests/zram: Adapt the situation that /dev/zram0 is being used
        selftests/zram01.sh: Fix compression ratio calculation
        selftests/zram: Skip max_comp_streams interface on newer kernel
        docs/kselftest: clarify running mainline tests on stables
        kselftest: signal all child processes
        selftests: cpufreq: Write test output to stdout as well
        selftests: rtc: Increase test timeout so that all tests run
      25b20ae8
    • Duoming Zhou's avatar
      ax25: fix reference count leaks of ax25_dev · 87563a04
      Duoming Zhou authored
      The previous commit d01ffb9e ("ax25: add refcount in ax25_dev
      to avoid UAF bugs") introduces refcount into ax25_dev, but there
      are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(),
      ax25_rt_add(), ax25_rt_del() and ax25_rt_opt().
      
      This patch uses ax25_dev_put() and adjusts the position of
      ax25_addr_ax25dev() to fix reference cout leaks of ax25_dev.
      
      Fixes: d01ffb9e
      
       ("ax25: add refcount in ax25_dev to avoid UAF bugs")
      Signed-off-by: default avatarDuoming Zhou <duoming@zju.edu.cn>
      Reviewed-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      87563a04
    • Yannick Vignon's avatar
      net: stmmac: ensure PTP time register reads are consistent · 80d46090
      Yannick Vignon authored
      Even if protected from preemption and interrupts, a small time window
      remains when the 2 register reads could return inconsistent values,
      each time the "seconds" register changes. This could lead to an about
      1-second error in the reported time.
      
      Add logic to ensure the "seconds" and "nanoseconds" values are consistent.
      
      Fixes: 92ba6888
      
       ("stmmac: add the support for PTP hw clock driver")
      Signed-off-by: default avatarYannick Vignon <yannick.vignon@nxp.com>
      Reviewed-by: default avatarRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Link: https://lore.kernel.org/r/20220203160025.750632-1-yannick.vignon@oss.nxp.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      80d46090
    • Jakub Kicinski's avatar
      Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 77b1b8b4
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-02-03
      
      We've added 6 non-merge commits during the last 10 day(s) which contain
      a total of 7 files changed, 11 insertions(+), 236 deletions(-).
      
      The main changes are:
      
      1) Fix BPF ringbuf to allocate its area with VM_MAP instead of VM_ALLOC
         flag which otherwise trips over KASAN, from Hou Tao.
      
      2) Fix unresolved symbol warning in resolve_btfids due to LSM callback
         rename, from Alexei Starovoitov.
      
      3) Fix a possible race in inc_misses_counter() when IRQ would trigger
         during counter update, from He Fengqing.
      
      4) Fix tooling infra for cross-building with clang upon probing whether
         gcc provides the standard libraries, from Jean-Philippe Brucker.
      
      5) Fix silent mode build for resolve_btfids, from Nathan Chancellor.
      
      6) Drop unneeded and outdated lirc.h header copy from tooling infra as
         BPF does not require it anymore, from Sean Young.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        tools/resolve_btfids: Do not print any commands when building silently
        bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
        tools: Ignore errors from `which' when searching a GCC toolchain
        tools headers UAPI: remove stale lirc.h
        bpf: Fix possible race in inc_misses_counter
        bpf: Fix renaming task_getsecid_subj->current_getsecid_subj.
      ====================
      
      Link: https://lore.kernel.org/r/20220203155815.25689-1-daniel@iogearbox.net
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      77b1b8b4
    • Mickaël Salaün's avatar
      printk: Fix incorrect __user type in proc_dointvec_minmax_sysadmin() · 1f2cfdd3
      Mickaël Salaün authored
      The move of proc_dointvec_minmax_sysadmin() from kernel/sysctl.c to
      kernel/printk/sysctl.c introduced an incorrect __user attribute to the
      buffer argument.  I spotted this change in [1] as well as the kernel
      test robot.  Revert this change to please sparse:
      
        kernel/printk/sysctl.c:20:51: warning: incorrect type in argument 3 (different address spaces)
        kernel/printk/sysctl.c:20:51:    expected void *
        kernel/printk/sysctl.c:20:51:    got void [noderef] __user *buffer
      
      Fixes: faaa357a ("printk: move printk sysctl to printk/sysctl.c")
      Link: https://lore.kernel.org/r/20220104155024.48023-2-mic@digikod.net
      
       [1]
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Xiaoming Ni <nixiaomi...
      1f2cfdd3
    • Igor Pylypiv's avatar
      Revert "module, async: async_synchronize_full() on module init iff async is used" · 67d6212a
      Igor Pylypiv authored
      This reverts commit 774a1221.
      
      We need to finish all async code before the module init sequence is
      done.  In the reverted commit the PF_USED_ASYNC flag was added to mark a
      thread that called async_schedule().  Then the PF_USED_ASYNC flag was
      used to determine whether or not async_synchronize_full() needs to be
      invoked.  This works when modprobe thread is calling async_schedule(),
      but it does not work if module dispatches init code to a worker thread
      which then calls async_schedule().
      
      For example, PCI driver probing is invoked from a worker thread based on
      a node where device is attached:
      
      	if (cpu < nr_cpu_ids)
      		error = work_on_cpu(cpu, local_pci_probe, &ddi);
      	else
      		error = local_pci_probe(&ddi);
      
      We end up in a situation where a worker thread gets the PF_USED_ASYNC
      flag set instead of the modprobe thread.  As a result,
      async_synchronize_full() is not invoked and modprobe completes without
      waiting for the async code to finish.
      
      The issue was discovered while loading the pm80xx driver:
      (scsi_mod.scan=async)
      
      modprobe pm80xx                      worker
      ...
        do_init_module()
        ...
          pci_call_probe()
            work_on_cpu(local_pci_probe)
                                           local_pci_probe()
                                             pm8001_pci_probe()
                                               scsi_scan_host()
                                                 async_schedule()
                                                 worker->flags |= PF_USED_ASYNC;
                                           ...
            < return from worker >
        ...
        if (current->flags & PF_USED_ASYNC) <--- false
        	async_synchronize_full();
      
      Commit 21c3c5d2 ("block: don't request module during elevator init")
      fixed the deadlock issue which the reverted commit 774a1221
      ("module, async: async_synchronize_full() on module init iff async is
      used") tried to fix.
      
      Since commit 0fdff3ec
      
       ("async, kmod: warn on synchronous
      request_module() from async workers") synchronous module loading from
      async is not allowed.
      
      Given that the original deadlock issue is fixed and it is no longer
      allowed to call synchronous request_module() from async we can remove
      PF_USED_ASYNC flag to make module init consistently invoke
      async_synchronize_full() unless async module probe is requested.
      
      Signed-off-by: default avatarIgor Pylypiv <ipylypiv@google.com>
      Reviewed-by: default avatarChangyuan Lyu <changyuanl@google.com>
      Reviewed-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67d6212a
    • Linus Torvalds's avatar
      Merge branch 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 305e6c42
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
      
       - Eric's fix for a long standing cgroup1 permission issue where it only
         checks for uid 0 instead of CAP which inadvertently allows
         unprivileged userns roots to modify release_agent userhelper
      
       - Fixes for the fallout from Waiman's recent cpuset work
      
      * 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
        cgroup-v1: Require capabilities to set release_agent
        cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()
        cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
      305e6c42
    • Jakub Kicinski's avatar
      Merge branch 'net-ipa-enable-register-retention' · 0166556a
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: enable register retention
      
      With runtime power management in place, we sometimes need to issue
      a command to enable retention of IPA register values before power
      collapse.  This requires a new Device Tree property, whose presence
      will also be used to signal that the command is required.
      ====================
      
      Link: https://lore.kernel.org/r/20220201150205.468403-1-elder@linaro.org
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      0166556a
    • Alex Elder's avatar
      net: ipa: request IPA register values be retained · 34a08176
      Alex Elder authored
      In some cases, the IPA hardware needs to request the always-on
      subsystem (AOSS) to coordinate with the IPA microcontroller to
      retain IPA register values at power collapse.  This is done by
      issuing a QMP request to the AOSS microcontroller.  A similar
      request ondoes that request.
      
      We must get and hold the "QMP" handle early, because we might get
      back EPROBE_DEFER for that.  But the actual request should be sent
      while we know the IPA clock is active, and when we know the
      microcontroller is operational.
      
      Fixes: 1aac309d
      
       ("net: ipa: use autosuspend")
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      34a08176
    • Alex Elder's avatar
      dt-bindings: net: qcom,ipa: add optional qcom,qmp property · ac62a017
      Alex Elder authored
      
      
      For some systems, the IPA driver must make a request to ensure that
      its registers are retained across power collapse of the IPA hardware.
      On such systems, we'll use the existence of the "qcom,qmp" property
      as a signal that this request is required.
      
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ac62a017
  2. Feb 03, 2022
  3. Feb 02, 2022
    • Dmitry V. Levin's avatar
      Partially revert "net/smc: Add netlink net namespace support" · c86d8613
      Dmitry V. Levin authored
      The change of sizeof(struct smc_diag_linkinfo) by commit 79d39fc5
      ("net/smc: Add netlink net namespace support") introduced an ABI
      regression: since struct smc_diag_lgrinfo contains an object of
      type "struct smc_diag_linkinfo", offset of all subsequent members
      of struct smc_diag_lgrinfo was changed by that change.
      
      As result, applications compiled with the old version
      of struct smc_diag_linkinfo will receive garbage in
      struct smc_diag_lgrinfo.role if the kernel implements
      this new version of struct smc_diag_linkinfo.
      
      Fix this regression by reverting the part of commit 79d39fc5 that
      changes struct smc_diag_linkinfo.  After all, there is SMC_GEN_NETLINK
      interface which is good enough, so there is probably no need to touch
      the smc_diag ABI in the first place.
      
      Fixes: 79d39fc5
      
       ("net/smc: Add netlink net namespace support")
      Signed-off-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Reviewed-by: default avatarKarsten Graul <kgraul@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220202030904.GA9742@altlinux.org
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      c86d8613
    • David S. Miller's avatar
      Merge tag 'mlx5-fixes-2022-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · c8ff576e
      David S. Miller authored
      
      
      Saeed Mahameed says:
      
      ====================
      mlx5 fixes 2022-02-01
      
      This series provides bug fixes to mlx5 driver.
      Please pull and let me know if there is any problem.
      
      Sorry about the long series, but I had to move the top two patches from
      net-next to net to help avoiding a build break when kspp branch is merged
      into linus-next on next merge window.
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c8ff576e
    • Jakub Kicinski's avatar
      Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 3aa430d3
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2022-02-01
      
      This series contains updates to e1000e driver only.
      
      Sasha removes CSME handshake with TGL platform as this is not supported
      and is causing hardware unit hangs to be reported.
      
      * '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
        e1000e: Handshake with CSME starts from ADL platforms
        e1000e: Separate ADP board type from TGP
      ====================
      
      Link: https://lore.kernel.org/r/20220201173754.580305-1-anthony.l.nguyen@intel.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      3aa430d3
    • Kees Cook's avatar
      net/mlx5e: Avoid field-overflowing memcpy() · ad518573
      Kees Cook authored
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally writing across neighboring fields.
      
      Use flexible arrays instead of zero-element arrays (which look like they
      are always overflowing) and split the cross-field memcpy() into two halves
      that can be appropriately bounds-checked by the compiler.
      
      We were doing:
      
      	#define ETH_HLEN  14
      	#define VLAN_HLEN  4
      	...
      	#define MLX5E_XDP_MIN_INLINE (ETH_HLEN + VLAN_HLEN)
      	...
              struct mlx5e_tx_wqe      *wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
      	...
              struct mlx5_wqe_eth_seg  *eseg = &wqe->eth;
              struct mlx5_wqe_data_seg *dseg = wqe->data;
      	...
      	memcpy(eseg->inline_hdr.start, xdptxd->data, MLX5E_XDP_MIN_INLINE);
      
      target is wqe->eth.inline_hdr.start (which the compiler sees as being
      2 bytes in size), but copying 18, intending to write across start
      (really vlan_tci, 2 bytes). The remaining 16 bytes get written into
      wqe->data[0], covering byte_count (4 bytes), lkey (4 bytes), and addr
      (8 bytes).
      
      struct mlx5e_tx_wqe {
              struct mlx5_wqe_ctrl_seg   ctrl;                 /*     0    16 */
              struct mlx5_wqe_eth_seg    eth;                  /*    16    16 */
              struct mlx5_wqe_data_seg   data[];               /*    32     0 */
      
              /* size: 32, cachelines: 1, members: 3 */
              /* last cacheline: 32 bytes */
      };
      
      struct mlx5_wqe_eth_seg {
              u8                         swp_outer_l4_offset;  /*     0     1 */
              u8                         swp_outer_l3_offset;  /*     1     1 */
              u8                         swp_inner_l4_offset;  /*     2     1 */
              u8                         swp_inner_l3_offset;  /*     3     1 */
              u8                         cs_flags;             /*     4     1 */
              u8                         swp_flags;            /*     5     1 */
              __be16                     mss;                  /*     6     2 */
              __be32                     flow_table_metadata;  /*     8     4 */
              union {
                      struct {
                              __be16     sz;                   /*    12     2 */
                              u8         start[2];             /*    14     2 */
                      } inline_hdr;                            /*    12     4 */
                      struct {
                              __be16     type;                 /*    12     2 */
                              __be16     vlan_tci;             /*    14     2 */
                      } insert;                                /*    12     4 */
                      __be32             trailer;              /*    12     4 */
              };                                               /*    12     4 */
      
              /* size: 16, cachelines: 1, members: 9 */
              /* last cacheline: 16 bytes */
      };
      
      struct mlx5_wqe_data_seg {
              __be32                     byte_count;           /*     0     4 */
              __be32                     lkey;                 /*     4     4 */
              __be64                     addr;                 /*     8     8 */
      
              /* size: 16, cachelines: 1, members: 3 */
              /* last cacheline: 16 bytes */
      };
      
      So, split the memcpy() so the compiler can reason about the buffer
      sizes.
      
      "pahole" shows no size nor member offset changes to struct mlx5e_tx_wqe
      nor struct mlx5e_umr_wqe. "objdump -d" shows no meaningful object
      code changes (i.e. only source line number induced differences and
      optimizations).
      
      Fixes: b5503b99
      
       ("net/mlx5e: XDP TX forwarding support")
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      ad518573
    • Kees Cook's avatar
      net/mlx5e: Use struct_group() for memcpy() region · 6d5c900e
      Kees Cook authored
      In preparation for FORTIFY_SOURCE performing compile-time and run-time
      field bounds checking for memcpy(), memmove(), and memset(), avoid
      intentionally writing across neighboring fields.
      
      Use struct_group() in struct vlan_ethhdr around members h_dest and
      h_source, so they can be referenced together. This will allow memcpy()
      and sizeof() to more easily reason about sizes, improve readability,
      and avoid future warnings about writing beyond the end of h_dest.
      
      "pahole" shows no size nor member offset changes to struct vlan_ethhdr.
      "objdump -d" shows no object code changes.
      
      Fixes: 34802a42
      
       ("net/mlx5e: Do not modify the TX SKB")
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      6d5c900e
    • Roi Dayan's avatar
      net/mlx5e: Avoid implicit modify hdr for decap drop rule · 5b209d1a
      Roi Dayan authored
      Currently the driver adds implicit modify hdr action for
      decap rules on tunnel devices if the port is an ovs port.
      This is also done if the action is drop and makes the modify
      hdr redundant and also the FW doesn't support it and will generate
      a syndrome.
      
      kernel: mlx5_core 0000:08:00.0: mlx5_cmd_check:777:(pid 102063): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x8708c3)
      
      Fix it by adding the implicit modify hdr only for fwd actions.
      
      Fixes: b16eb3c8 ("net/mlx5: Support internal port as decap route device")
      Fixes: 077cdda7
      
       ("net/mlx5e: TC, Fix memory leak with rules with internal port")
      Signed-off-by: default avatarRoi Dayan <roid@nvidia.com>
      Reviewed-by: default avatarAriel Levkovich <lariel@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      5b209d1a
    • Raed Salem's avatar
      net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic · de47db0c
      Raed Salem authored
      IPsec Tunnel mode crypto offload software parser (SWP) setting in data
      path currently always set the inner L4 offset regardless of the
      encapsulated L4 header type and whether it exists in the first place,
      this breaks non TCP/UDP traffic as such.
      
      Set the SWP inner L4 offset only when the IPsec tunnel encapsulated L4
      header protocol is TCP/UDP.
      
      While at it fix inner ip protocol read for setting MLX5_ETH_WQE_SWP_INNER_L4_UDP
      flag to address the case where the ip header protocol is IPv6.
      
      Fixes: f1267798
      
       ("net/mlx5: Fix checksum issue of VXLAN and IPsec crypto offload")
      Signed-off-by: default avatarRaed Salem <raeds@nvidia.com>
      Reviewed-by: default avatarMaor Dickman <maord@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      de47db0c
    • Raed Salem's avatar
      net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic · 5352859b
      Raed Salem authored
      IPsec crypto offload always set the ethernet segment checksum flags with
      the inner L4 header checksum flag enabled for encapsulated IPsec offloaded
      packet regardless of the encapsulated L4 header type, and even if it
      doesn't exists in the first place, this breaks non TCP/UDP traffic as
      such.
      
      Set the inner L4 checksum flag only when the encapsulated L4 header
      protocol is TCP/UDP using software parser swp_inner_l4_offset field as
      indication.
      
      Fixes: 5cfb540e
      
       ("net/mlx5e: Set IPsec WAs only in IP's non checksum partial case.")
      Signed-off-by: default avatarRaed Salem <raeds@nvidia.com>
      Reviewed-by: default avatarMaor Dickman <maord@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      5352859b
    • Maxim Mikityanskiy's avatar
      net/mlx5e: Don't treat small ceil values as unlimited in HTB offload · 736dfe4e
      Maxim Mikityanskiy authored
      The hardware spec defines max_average_bw == 0 as "unlimited bandwidth".
      max_average_bw is calculated as `ceil / BYTES_IN_MBIT`, which can become
      0 when ceil is small, leading to an undesired effect of having no
      bandwidth limit.
      
      This commit fixes it by rounding up small values of ceil to 1 Mbit/s.
      
      Fixes: 214baf22
      
       ("net/mlx5e: Support HTB offload")
      Signed-off-by: default avatarMaxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      736dfe4e
    • Maor Dickman's avatar
      net/mlx5: E-Switch, Fix uninitialized variable modact · d8e5883d
      Maor Dickman authored
      The variable modact is not initialized before used in command
      modify header allocation which can cause command to fail.
      
      Fix by initializing modact with zeros.
      
      Addresses-Coverity: ("Uninitialized scalar variable")
      Fixes: 8f1e0b97
      
       ("net/mlx5: E-Switch, Mark miss packets with new chain id mapping")
      Signed-off-by: default avatarMaor Dickman <maord@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      d8e5883d
    • Maor Dickman's avatar
      net/mlx5e: Fix handling of wrong devices during bond netevent · ec41332e
      Maor Dickman authored
      Current implementation of bond netevent handler only check if
      the handled netdev is VF representor and it missing a check if
      the VF representor is on the same phys device of the bond handling
      the netevent.
      
      Fix by adding the missing check and optimizing the check if
      the netdev is VF representor so it will not access uninitialized
      private data and crashes.
      
      BUG: kernel NULL pointer dereference, address: 000000000000036c
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP NOPTI
      Workqueue: eth3bond0 bond_mii_monitor [bonding]
      RIP: 0010:mlx5e_is_uplink_rep+0xc/0x50 [mlx5_core]
      RSP: 0018:ffff88812d69fd60 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff8881cf800000 RCX: 0000000000000000
      RDX: ffff88812d69fe10 RSI: 000000000000001b RDI: ffff8881cf800880
      RBP: ffff8881cf800000 R08: 00000445cabccf2b R09: 0000000000000008
      R10: 0000000000000004 R11: 0000000000000008 R12: ffff88812d69fe10
      R13: 00000000fffffffe R14: ffff88820c0f9000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88846fb00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000000036c CR3: 0000000103d80006 CR4: 0000000000370ea0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       mlx5e_eswitch_uplink_rep+0x31/0x40 [mlx5_core]
       mlx5e_rep_is_lag_netdev+0x94/0xc0 [mlx5_core]
       mlx5e_rep_esw_bond_netevent+0xeb/0x3d0 [mlx5_core]
       raw_notifier_call_chain+0x41/0x60
       call_netdevice_notifiers_info+0x34/0x80
       netdev_lower_state_changed+0x4e/0xa0
       bond_mii_monitor+0x56b/0x640 [bonding]
       process_one_work+0x1b9/0x390
       worker_thread+0x4d/0x3d0
       ? rescuer_thread+0x350/0x350
       kthread+0x124/0x150
       ? set_kthread_struct+0x40/0x40
       ret_from_fork+0x1f/0x30
      
      Fixes: 7e51891a
      
       ("net/mlx5e: Use netdev events to set/del egress acl forward-to-vport rule")
      Signed-off-by: default avatarMaor Dickman <maord@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      ec41332e
    • Khalid Manaa's avatar
      net/mlx5e: Fix broken SKB allocation in HW-GRO · 7957837b
      Khalid Manaa authored
      In case the HW doesn't perform header-data split, it will write the whole
      packet into the data buffer in the WQ, in this case the SHAMPO CQE handler
      couldn't use the header entry to build the SKB, instead it should allocate
      a new memory to build the SKB using the function:
      mlx5e_skb_from_cqe_mpwrq_nonlinear.
      
      Fixes: f97d5c2a
      
       ("net/mlx5e: Add handle SHAMPO cqe support")
      Signed-off-by: default avatarKhalid Manaa <khalidm@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      7957837b