  1. Dec 01, 2021
    • tls: fix replacing proto_ops · b3c37092
      Jakub Kicinski authored
      [ Upstream commit f3911f73 ]
      
      We replace proto_ops whenever TLS is configured for RX. But our
      replacement also overrides sendpage_locked, which will crash
      unless TX is also configured. Similarly we plug both of those
      in for TLS_HW (NIC crypto offload) even though TLS_HW has a completely
      different implementation for TX.
      
      Last but not least we always plug in something based on inet_stream_ops
      even though a few of the callbacks differ for IPv6 (getname, release,
      bind).
      
      Use a callback building method similar to what we do for struct proto.
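
      The table-building idea can be sketched in user space as follows. This is
      only an illustration, not the kernel code: struct ops, the enum names and
      build_ops() are hypothetical, and strings stand in for the real function
      pointers. The point is that each address family starts from its own base
      ops, and a TLS callback is plugged in only for the direction configured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct proto_ops. */
struct ops {
    const char *getname;          /* differs between IPv4 and IPv6 */
    const char *splice_read;      /* RX-side callback */
    const char *sendpage_locked;  /* TX-side callback */
};

enum { V4, V6, NUM_IP };
enum { CONF_BASE, CONF_SW, CONF_HW, NUM_CONF };

/* One entry per [address family][TX config][RX config]. */
static struct ops ops_table[NUM_IP][NUM_CONF][NUM_CONF];

/* Build every TX/RX combination from the address family's own base ops. */
static void build_ops(int ip, const struct ops *base)
{
    for (int tx = 0; tx < NUM_CONF; tx++) {
        for (int rx = 0; rx < NUM_CONF; rx++) {
            struct ops *o = &ops_table[ip][tx][rx];

            *o = *base;  /* keeps getname and friends from the base */
            if (tx == CONF_SW)
                o->sendpage_locked = "tls_sw_sendpage_locked";
            if (rx == CONF_SW)
                o->splice_read = "tls_sw_splice_read";
        }
    }
}
```

      With this shape, an RX-only socket keeps the base sendpage_locked, and a
      TLS_HW TX configuration does not inherit the software TX callback.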
      
      Fixes: c46234eb ("tls: RX path for ktls")
      Fixes: d4ffb02d ("net/tls: enable sk_msg redirect to tls socket egress")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • tls: splice_read: fix accessing pre-processed records · 6a012337
      Jakub Kicinski authored
      [ Upstream commit e062fe99 ]
      
      recvmsg() will put peek()ed and partially read records onto the rx_list.
      splice_read() needs to consult that list otherwise it may miss data.
      Align with recvmsg() and also put partially-read records onto rx_list.
      tls_sw_advance_skb() is pretty pointless now and will be removed in
      net-next.
      
      Fixes: 692d7b5d ("tls: Fix recvmsg() to be able to peek across multiple records")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • tls: splice_read: fix record type check · befe4e29
      Jakub Kicinski authored
      [ Upstream commit 520493f6 ]
      
      We don't support splicing control records. TLS 1.3 changes moved
      the record type check into the decrypt if(). The skb may already
      be decrypted and still be an alert.
      
      Note that decrypt_skb_update() is idempotent and updates ctx->decrypted
      so the if() is pointless.
      
      Reorder the check for decryption errors with the content type check
      while touching them. This part is not really a bug, because if
      decryption failed in TLS 1.3 content type will be DATA, and for
      TLS 1.2 it will be correct. Nevertheless it's strange to touch output
      before checking if the function has failed.
      
      Fixes: fedf201e ("net: tls: Refactor control message handling on recv")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • MIPS: use 3-level pgtable for 64KB page size on MIPS_VA_BITS_48 · a6a75b53
      Huang Pei authored
      [ Upstream commit 41ce097f ]
      
      Booting a Loongson 3A1000 hangs when BOTH CONFIG_PAGE_SIZE_64KB and
      CONFIG_MIPS_VA_BITS_48 are enabled, because the kernel turns out to
      use a 2-level pgtable instead of a 3-level one. With a 64KB page size,
      a 2-level pgtable only covers 42 bits of VA; use a 3-level pgtable
      to cover all 48 bits of VA (it can cover up to 55 bits).
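
      The 42-bit and 55-bit figures can be reproduced with a little arithmetic.
      The sketch below assumes that every table level occupies exactly one page
      and that each entry is 8 bytes wide; va_bits() is a hypothetical helper,
      not a kernel function.

```c
#include <assert.h>

/*
 * VA bits covered when every level is one page of 8-byte entries
 * (assumption): the page offset plus log2(entries) for each level.
 */
static unsigned int va_bits(unsigned int page_shift, unsigned int levels)
{
    unsigned int bits_per_level = page_shift - 3; /* log2(page_size / 8) */

    return page_shift + levels * bits_per_level;
}
```

      For 64KB pages (page_shift = 16) each level resolves 13 bits, so two
      levels give 16 + 2 * 13 = 42 bits and three levels give 16 + 3 * 13 = 55
      bits, which is why 48-bit VA needs the third level.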
      
      Fixes: 1e321fa9 ("MIPS64: Support of at least 48 bits of SEGBITS")
      Signed-off-by: Huang Pei <huangpei@loongson.cn>
      Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • MIPS: loongson64: fix FTLB configuration · ea3c7588
      Huang Pei authored
      [ Upstream commit 7db5e9e9 ]
      
      It turns out that 'decode_configs' -> 'set_ftlb_enable' is called while
      c->cputype is still unset, which leaves FTLB disabled on BOTH 3A2000 and
      3A3000.

      Fix it by calling 'decode_configs' after c->cputype is initialized.
      
      Fixes: da1bd297 ("MIPS: Loongson64: Probe CPU features via CPUCFG")
      Signed-off-by: Huang Pei <huangpei@loongson.cn>
      Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • igb: fix netpoll exit with traffic · 1685d666
      Jesse Brandeburg authored
      [ Upstream commit eaeace60 ]
      
      Oleksandr brought a bug report where netpoll causes trace
      messages in the log on igb.
      
      Danielle brought this back up as still occurring, so we'll try
      again.
      
      [22038.710800] ------------[ cut here ]------------
      [22038.710801] igb_poll+0x0/0x1440 [igb] exceeded budget in poll
      [22038.710802] WARNING: CPU: 12 PID: 40362 at net/core/netpoll.c:155 netpoll_poll_dev+0x18a/0x1a0
      
      As Alex suggested, change the driver to return work_done at the
      exit of napi_poll, which should be safe to do in this driver
      because it is not polling multiple queues in this single napi
      context (multiple queues attached to one MSI-X vector). Several
      other drivers contain the same simple sequence, so I hope
      this will not create new problems.
      
      Fixes: 16eb8815 ("igb: Refactor clean_rx_irq to reduce overhead and improve performance")
      Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Reported-by: Danielle Ratson <danieller@nvidia.com>
      Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Danielle Ratson <danieller@nvidia.com>
      Link: https://lore.kernel.org/r/20211123204000.1597971-1-jesse.brandeburg@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • nvmet: use IOCB_NOWAIT only if the filesystem supports it · 55850368
      Maurizio Lombardi authored
      [ Upstream commit c024b226 ]
      
      Submit I/O requests with the IOCB_NOWAIT flag set only if
      the underlying filesystem supports it.
      
      Fixes: 50a909db ("nvmet: use IOCB_NOWAIT for file-ns buffered I/O")
      Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/smc: Fix loop in smc_listen · a93af38c
      Guo DaXing authored
      [ Upstream commit 9ebb0c4b ]
      
      The kernel_listen function in smc_listen will fail when all the available
      ports are occupied.  At this point smc->clcsock->sk->sk_data_ready has
      been changed to smc_clcsock_data_ready.  When we call smc_listen again,
      now both smc->clcsock->sk->sk_data_ready and smc->clcsk_data_ready point
      to the smc_clcsock_data_ready function.
      
      The smc_clcsock_data_ready() function calls lsmc->clcsk_data_ready which
      now points to itself resulting in an infinite loop.
      
      This patch restores smc->clcsock->sk->sk_data_ready with the old value.
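
      The self-referencing callback and its fix can be modelled in a few lines
      of user-space C. Everything here is a hypothetical stand-in (sock_sketch,
      smc_sketch, listen_sketch(), and the listen_rc parameter that simulates
      the result of kernel_listen()); it only illustrates why the hook has to
      be undone on failure.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*ready_fn)(void *sk);

struct sock_sketch { ready_fn sk_data_ready; };   /* stands in for struct sock */
struct smc_sketch {
    struct sock_sketch *clcsock;
    ready_fn saved_data_ready;  /* plays the role of smc->clcsk_data_ready */
};

static void tcp_data_ready(void *sk) { (void)sk; }  /* original callback */
static void smc_data_ready(void *sk) { (void)sk; }  /* SMC's replacement */

/* listen_rc is the (simulated) result of kernel_listen(). */
static int listen_sketch(struct smc_sketch *smc, int listen_rc)
{
    /* Save the current callback, then hook in our own. */
    smc->saved_data_ready = smc->clcsock->sk_data_ready;
    smc->clcsock->sk_data_ready = smc_data_ready;

    if (listen_rc < 0) {
        /* Undo the hook so a retry does not save our own callback. */
        smc->clcsock->sk_data_ready = smc->saved_data_ready;
        return listen_rc;
    }
    return 0;
}
```

      Without the restore in the error path, a second call would save
      smc_data_ready as the "old" callback, and invoking it would call itself.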
      
      Fixes: a60a2b1e ("net/smc: reduce active tcp_listen workers")
      Signed-off-by: Guo DaXing <guodaxing@huawei.com>
      Acked-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/smc: Fix NULL pointer dereferencing in smc_vlan_by_tcpsk() · bb851d0f
      Karsten Graul authored
      [ Upstream commit 587acad4 ]
      
      Coverity reports a possible NULL dereferencing problem:
      
      in smc_vlan_by_tcpsk():
      6. returned_null: netdev_lower_get_next returns NULL (checked 29 out of 30 times).
      7. var_assigned: Assigning: ndev = NULL return value from netdev_lower_get_next.
      1623                ndev = (struct net_device *)netdev_lower_get_next(ndev, &lower);
      CID 1468509 (#1 of 1): Dereference null return value (NULL_RETURNS)
      8. dereference: Dereferencing a pointer that might be NULL ndev when calling is_vlan_dev.
      1624                if (is_vlan_dev(ndev)) {
      
      Remove the manual implementation and use netdev_walk_all_lower_dev() to
      iterate over the lower devices. While at it, remove an obsolete function
      parameter comment.
      
      Fixes: cb9d43f6 ("net/smc: determine vlan_id of stacked net_device")
      Suggested-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: phylink: Force retrigger in case of latched link-fail indicator · e85d50c4
      Russell King (Oracle) authored
      [ Upstream commit dbae3388 ]
      
      On mv88e6xxx 1G/2.5G PCS, the SerDes register 4.2001.2 has the following
      description:
        This register bit indicates when link was lost since the last
        read. For the current link status, read this register
        back-to-back.
      
      Thus to get current link state, we need to read the register twice.
      
      But doing that in the link change interrupt handler would lead to
      potentially ignoring link down events, which we really want to avoid.
      
      Thus this needs to be solved in phylink's resolve, by retriggering
      another resolve in the event when PCS reports link down and previous
      link was up, and by re-reading PCS state if the previous link was down.
      
      The wrong value is read when phylink requests change from sgmii to
      2500base-x mode, and link won't come up. This fixes the bug.
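
      The handling of a latched link-fail bit described above can be sketched
      as a small simulation. This is not the phylink code: pcs_sketch,
      read_link_bit() and resolve_sketch() are hypothetical names, and the
      latch is modelled as a flag that the first read clears.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated latched-low link bit: a link loss sticks until the next read. */
struct pcs_sketch {
    bool link_now;      /* actual current link state */
    bool latched_fail;  /* set when the link dropped since the last read */
};

static bool read_link_bit(struct pcs_sketch *pcs)
{
    bool val = pcs->link_now && !pcs->latched_fail;

    pcs->latched_fail = false;  /* reading clears the latch */
    return val;
}

/* One resolve pass; sets *retrigger when another pass should be scheduled. */
static bool resolve_sketch(struct pcs_sketch *pcs, bool prev_link, bool *retrigger)
{
    bool link = read_link_bit(pcs);

    *retrigger = false;
    if (!link) {
        if (prev_link)
            *retrigger = true;          /* report link down, look again later */
        else
            link = read_link_bit(pcs);  /* already down: re-read current state */
    }
    return link;
}
```

      A link-down event is never swallowed (the first read is reported when the
      previous state was up), while a stale latched failure no longer keeps a
      link that was already down from coming up.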
      
      Fixes: 9525ae83 ("phylink: add phylink infrastructure")
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: phylink: Force link down and retrigger resolve on interface change · d6525de2
      Russell King (Oracle) authored
      [ Upstream commit 80662f4f ]
      
      On PHY state change the phylink_resolve() function can read stale
      information from the MAC and report incorrect link speed and duplex to
      the kernel message log.
      
      Example with a Marvell 88X3310 PHY connected to a SerDes port on Marvell
      88E6393X switch:
      - PHY driver triggers state change due to PHY interface mode being
        changed from 10gbase-r to 2500base-x due to copper change in speed
        from 10Gbps to 2.5Gbps, but the PHY itself either hasn't yet changed
        its interface to the host, or the interrupt about loss of SerDes link
        hadn't arrived yet (there can be a delay of several milliseconds for
        this), so we still think that the 10gbase-r mode is up
      - phylink_resolve()
        - phylink_mac_pcs_get_state()
          - this fills in speed=10g link=up
        - interface mode is updated to 2500base-x but speed is left at 10Gbps
        - phylink_major_config()
          - interface is changed to 2500base-x
        - phylink_link_up()
          - mv88e6xxx_mac_link_up()
            - .port_set_speed_duplex()
              - speed is set to 10Gbps
          - reports "Link is Up - 10Gbps/Full" to dmesg
      
      Afterwards when the interrupt finally arrives for mv88e6xxx, another
      resolve is forced in which we get the correct speed from
      phylink_mac_pcs_get_state(), but since the interface is not being
      changed anymore, we don't call phylink_major_config() but only
      phylink_mac_config(), which does not set speed/duplex anymore.
      
      To fix this, we need to force the link down and trigger another resolve
      on PHY interface change event.
      
      Fixes: 9525ae83 ("phylink: add phylink infrastructure")
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • lan743x: fix deadlock in lan743x_phy_link_status_change() · cc164542
      Heiner Kallweit authored
      [ Upstream commit ddb826c2 ]
      
      Usage of phy_ethtool_get_link_ksettings() in the link status change
      handler isn't needed, and in combination with the referenced change
      it results in a deadlock. Simply remove the call and replace it with
      direct access to phydev->speed. The duplex argument of
      lan743x_phy_update_flowcontrol() isn't used and can be removed.
      
      Fixes: c10a485c ("phy: phy_ethtool_ksettings_get: Lock the phy for consistency")
      Reported-by: Alessandro B Maurici <abmaurici@gmail.com>
      Tested-by: Alessandro B Maurici <abmaurici@gmail.com>
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/40e27f76-0ba3-dcef-ee32-a78b9df38b0f@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows · 8165a96f
      Eric Dumazet authored
      [ Upstream commit 4e1fddc9 ]
      
      While testing BIG TCP patch series, I was expecting that TCP_RR workloads
      with 80KB requests/answers would send one 80KB TSO packet,
      then being received as a single GRO packet.
      
      It turns out this was not happening, and the root cause was that
      cubic Hystart ACK train was triggering after a few (2 or 3) rounds of RPC.
      
      Hystart was wrongly setting CWND/SSTHRESH to 30, while my RPC
      needed a budget of ~20 segments.
      
      Ideally these TCP_RR flows should not exit slow start.
      
      Cubic Hystart should reset itself at each round, instead of assuming
      every TCP flow is a bulk one.
      
      Note that even after this patch, Hystart can still trigger, depending
      on scheduling artifacts, but at a higher CWND/SSTHRESH threshold,
      keeping optimal TSO packet sizes.
      
      Tested:
      
      ip link set dev eth0 gro_ipv6_max_size 131072 gso_ipv6_max_size 131072
      nstat -n; netperf -H ... -t TCP_RR  -l 5  -- -r 80000,80000 -K cubic; nstat|egrep "Ip6InReceives|Hystart|Ip6OutRequests"
      
      Before:
      
         8605
      Ip6InReceives                   87541              0.0
      Ip6OutRequests                  129496             0.0
      TcpExtTCPHystartTrainDetect     1                  0.0
      TcpExtTCPHystartTrainCwnd       30                 0.0
      
      After:
      
        8760
      Ip6InReceives                   88514              0.0
      Ip6OutRequests                  87975              0.0
      
      Fixes: ae27e98a ("[TCP] CUBIC v2.3")
      Co-developed-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Link: https://lore.kernel.org/r/20211123202535.1843771-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • drm/amd/display: Set plane update flags for all planes in reset · 7b904ba3
      Nicholas Kazlauskas authored
      [ Upstream commit 21431f70 ]
      
      [Why]
      We're only setting the flags on stream[0]'s planes so this logic fails
      if we have more than one stream in the state.
      
      This can cause a page flip timeout with multiple displays in the
      configuration.
      
      [How]
      Index into the stream_status array using the stream index - it's a 1:1
      mapping.
      
      Fixes: cdaae837 ("drm/amd/display: Handle GPU reset for DC block")

      Reviewed-by: Harry Wentland <Harry.Wentland@amd.com>
      Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
      Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
      Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • drm/amd/display: Fix DPIA outbox timeout after GPU reset · 4da56400
      Nicholas Kazlauskas authored
      [ Upstream commit 6eff272d ]
      
      [Why]
      The HW interrupt gets disabled after GPU reset so we don't receive
      notifications for HPD or AUX from DMUB - leading to timeout and
      black screen with (or without) DPIA links connected.
      
      [How]
      Re-enable the interrupt after GPU reset like we do for the other
      DC interrupts.
      
      Fixes: 81927e28 ("drm/amd/display: Support for DMUB AUX")

      Reviewed-by: Jude Shih <Jude.Shih@amd.com>
      Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
      Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
      Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • PM: hibernate: use correct mode for swsusp_close() · c83f2757
      Thomas Zeitlhofer authored
      [ Upstream commit cefcf24b ]
      
      Commit 39fbef4b ("PM: hibernate: Get block device exclusively in
      swsusp_check()") changed the opening mode of the block device to
      (FMODE_READ | FMODE_EXCL).
      
      In the corresponding calls to swsusp_close(), the mode is still just
      FMODE_READ which triggers the warning in blkdev_flush_mapping() on
      resume from hibernate.
      
      So, use the mode (FMODE_READ | FMODE_EXCL) also when closing the
      device.
      
      Fixes: 39fbef4b ("PM: hibernate: Get block device exclusively in swsusp_check()")
      Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/ncsi : Add payload to be 32-bit aligned to fix dropped packets · fd49f1f5
      Kumar Thangavel authored
      [ Upstream commit ac132852 ]
      
      Update the NC-SI command handler (both standard and OEM) to take
      payload padding into account when allocating the skb (in case the
      payload size is not 32-bit aligned).

      The checksum field follows the payload field; not taking payload
      padding into account can cause the checksum to be truncated, leading
      to dropped packets.
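
      The padding rule amounts to rounding the payload up to a multiple of four
      bytes before placing the checksum. The sketch below is only an
      illustration with hypothetical names (ALIGN4, ncsi_body_len_sketch()); it
      assumes a 4-byte checksum and leaves the NC-SI header out of the sum.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of 4 (32-bit alignment). */
#define ALIGN4(n) (((n) + 3u) & ~(size_t)3u)

/* Bytes needed after the header: padded payload plus a 4-byte checksum. */
static size_t ncsi_body_len_sketch(size_t payload)
{
    return ALIGN4(payload) + 4;
}
```

      A 5-byte payload therefore needs 8 + 4 = 12 bytes; sizing the buffer as
      5 + 4 = 9 bytes would cut off the last three checksum bytes.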
      
      Fixes: fb4ee675 ("net/ncsi: Add NCSI OEM command support")
      Signed-off-by: Kumar Thangavel <thangavel.k@hcl.com>
      Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
      Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • arm64: uaccess: avoid blocking within critical sections · ff1a3074
      Mark Rutland authored
      [ Upstream commit 94902d84 ]
      
      As Vincent reports in:
      
        https://lore.kernel.org/r/20211118163417.21617-1-vincent.whitchurch@axis.com
      
      The put_user() in schedule_tail() can get stuck in a livelock, similar
      to a problem recently fixed on riscv in commit:
      
        285a76bb ("riscv: evaluate put_user() arg before enabling user access")
      
      In __raw_put_user() we have a critical section between
      uaccess_ttbr0_enable() and uaccess_ttbr0_disable() where we cannot
      safely call into the scheduler without having taken an exception, as
      schedule() and other scheduling functions will not save/restore the
      TTBR0 state. If either of the `x` or `ptr` arguments to __raw_put_user()
      contain a blocking call, we may call into the scheduler within the
      critical section. This can result in two problems:
      
      1) The access within the critical section will occur without the
         required TTBR0 tables installed. This will fault, and where the
         required tables permit access, the access will be retried without the
         required tables, resulting in a livelock.
      
      2) When TTBR0 SW PAN is in use, check_and_switch_context() does not
         modify TTBR0, leaving a stale value installed. The mappings of the
         blocked task will erroneously be accessible to regular accesses in
         the context of the new task. Additionally, if the tables are
         subsequently freed, local TLB maintenance required to reuse the ASID
         may be lost, potentially resulting in TLB corruption (e.g. in the
         presence of CnP).
      
      The same issue exists for __raw_get_user() in the critical section
      between uaccess_ttbr0_enable() and uaccess_ttbr0_disable().
      
      A similar issue exists for __get_kernel_nofault() and
      __put_kernel_nofault() for the critical section between
      __uaccess_enable_tco_async() and __uaccess_disable_tco_async(), as the
      TCO state is not context-switched by direct calls into the scheduler.
      Here the TCO state may be lost from the context of the current task,
      resulting in unexpected asynchronous tag check faults. It may also be
      leaked to another task, suppressing expected tag check faults.
      
      To fix all of these cases, we must ensure that we do not directly call
      into the scheduler in their respective critical sections. This patch
      reworks __raw_put_user(), __raw_get_user(), __get_kernel_nofault(), and
      __put_kernel_nofault(), ensuring that parameters are evaluated outside
      of the critical sections. To make this requirement clear, comments are
      added describing the problem, and line spaces added to separate the
      critical sections from other portions of the macros.
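
      The evaluation-order problem can be demonstrated with two toy macros.
      These are not the arm64 macros: in_critical, may_block(), PUT_UNSAFE and
      PUT_SAFE are hypothetical names, and a flag models the window in which
      user access is enabled.

```c
#include <assert.h>
#include <stdbool.h>

static bool in_critical;         /* models the uaccess-enabled window */
static int blocked_in_critical;  /* counts "blocking" calls inside the window */

/* An argument expression that may block (e.g. sleep or schedule). */
static int may_block(int v)
{
    if (in_critical)
        blocked_in_critical++;
    return v;
}

/* Unsafe shape: the argument is evaluated inside the critical section. */
#define PUT_UNSAFE(x, ptr) do {   \
    in_critical = true;           \
    *(ptr) = (x);                 \
    in_critical = false;          \
} while (0)

/* Safer shape: evaluate the arguments first, then enter the section. */
#define PUT_SAFE(x, ptr) do {     \
    int *p_ = (ptr);              \
    int v_ = (x);                 \
                                  \
    in_critical = true;           \
    *p_ = v_;                     \
    in_critical = false;          \
} while (0)
```

      Passing may_block(1) to PUT_UNSAFE runs the call inside the window, while
      PUT_SAFE has already reduced it to a plain value before the window opens.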
      
      For __raw_get_user() and __raw_put_user() the `err` parameter is
      conditionally assigned to, and we must currently evaluate this in the
      critical section. This behaviour is relied upon by the signal code,
      which uses chains of put_user_error() and get_user_error(), checking the
      return value at the end. In all cases, the `err` parameter is a plain
      int rather than a more complex expression with a blocking call, so this
      is safe.
      
      In future we should try to clean up the `err` usage to remove the
      potential for this to be a problem.
      
      Aside from the changes to time of evaluation, there should be no
      functional change as a result of this patch.
      
      Reported-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Link: https://lore.kernel.org/r/20211118163417.21617-1-vincent.whitchurch@axis.com
      Fixes: f253d827 ("arm64: uaccess: refactor __{get,put}_user")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Link: https://lore.kernel.org/r/20211122125820.55286-1-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • drm/hyperv: Fix device removal on Gen1 VMs · 85851d9f
      Mohammed Gamal authored
      [ Upstream commit e048834c ]
      
      The Hyper-V DRM driver tries to free the MMIO region on removing
      the device regardless of VM type, while Gen1 VMs don't use MMIO,
      causing the kernel to crash on a NULL pointer dereference.

      Fix this by deallocating MMIO only on Gen2 machines and implementing
      removal for Gen1.
      
      Fixes: 76c56a5a ("drm/hyperv: Add DRM driver for hyperv synthetic video device")

      Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
      Reviewed-by: Deepak Rawat <drawat.floss@gmail.com>
      Signed-off-by: Deepak Rawat <drawat.floss@gmail.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20211119112900.300537-1-mgamal@redhat.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • nvmet-tcp: fix incomplete data digest send · 63a68f37
      Varun Prakash authored
      [ Upstream commit 102110ef ]
      
      The current nvmet_try_send_ddgst() code does not check whether
      all data digest bytes are transmitted. Fix this by returning
      -EAGAIN if not all data digest bytes were transmitted.
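
      The partial-send check can be sketched as follows. This is a user-space
      model, not the driver code: ddgst_state and try_send_ddgst_sketch() are
      hypothetical, and the sent parameter simulates the return value of the
      socket send call for one attempt.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ddgst_state {
    size_t offset;  /* digest bytes already sent */
    size_t len;     /* total digest length */
};

/* sent is the (simulated) return value of the send call for this attempt. */
static int try_send_ddgst_sketch(struct ddgst_state *st, long sent)
{
    if (sent <= 0)
        return (int)sent;      /* error or nothing sent: pass it up */

    st->offset += (size_t)sent;
    if (st->offset < st->len)
        return -EAGAIN;        /* partial send: caller must retry the rest */

    return 1;                  /* digest fully sent */
}
```

      Tracking the offset across calls lets a retry resume where the previous
      attempt stopped instead of treating a short send as complete.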
      
      Fixes: 872d26a3 ("nvmet-tcp: add NVMe over TCP target driver")
      Signed-off-by: Varun Prakash <varun@chelsio.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs · d10ecfd9
      Adamos Ttofari authored
      [ Upstream commit cd23f02f ]
      
      Commit fbdc21e9 ("cpufreq: intel_pstate: Add Icelake servers
      support in no-HWP mode") enabled the use of Intel P-State driver
      for Ice Lake servers.
      
      But it doesn't cover the case when OS can't control P-States.
      
      Therefore, for Ice Lake server, if MSR_MISC_PWR_MGMT bits 8 or 18
      are enabled, then the Intel P-State driver should exit as OS can't
      control P-States.
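
      The bit test itself is simple. The sketch below takes the bit positions
      (8 and 18) from the text above; the macro names and
      os_cannot_control_pstates() are illustrative and not the driver's own
      identifiers, and the MSR value is passed in rather than read from
      hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from the commit text; the macro names are illustrative. */
#define PWR_MGMT_BIT_8   (UINT64_C(1) << 8)
#define PWR_MGMT_BIT_18  (UINT64_C(1) << 18)

/* True when either bit is set, i.e. the driver should decline to load. */
static bool os_cannot_control_pstates(uint64_t misc_pwr_mgmt)
{
    return (misc_pwr_mgmt & (PWR_MGMT_BIT_8 | PWR_MGMT_BIT_18)) != 0;
}
```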
      
      Fixes: fbdc21e9 ("cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode")
      Signed-off-by: Adamos Ttofari <attofari@amazon.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: marvell: mvpp2: increase MTU limit when XDP enabled · 57e91396
      Marek Behún authored
      [ Upstream commit 7b1b62bc ]
      
      Currently mvpp2_xdp_setup won't allow attaching XDP program if
        mtu > ETH_DATA_LEN (1500).
      
      The mvpp2_change_mtu on the other hand checks whether
        MVPP2_RX_PKT_SIZE(mtu) > MVPP2_BM_LONG_PKT_SIZE.
      
      These two checks are semantically different.
      
      Moreover this limit can be increased to MVPP2_MAX_RX_BUF_SIZE, since in
      mvpp2_rx we have
        xdp.data = data + MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM;
        xdp.frame_sz = PAGE_SIZE;
      
      Change the checks to check whether
        mtu > MVPP2_MAX_RX_BUF_SIZE
      
      Fixes: 07dd0a7a ("mvpp2: add basic XDP support")
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ipa: kill ipa_cmd_pipeline_clear() · d815f7ca
      Alex Elder authored
      [ Upstream commit e4e9bfb7 ]
      
      Calling ipa_cmd_pipeline_clear() after stopping the channel
      underlying the AP<-modem RX endpoint can lead to a deadlock.
      
      This occurs in the ->runtime_suspend device power operation for the
      IPA driver.  While this callback is in progress, any other requests
      for power will block until the callback returns.
      
      Stopping the AP<-modem RX channel does not prevent the modem from
      sending another packet to this endpoint.  If a packet arrives for an
      RX channel when the channel is stopped, a SUSPEND IPA interrupt
      condition will be pending.  Handling an IPA interrupt requires
      power, so ipa_isr_thread() calls pm_runtime_get_sync() first thing.
      
      The problem occurs because a "pipeline clear" command will not
      complete while such a SUSPEND interrupt condition exists.  So the
      SUSPEND IPA interrupt handler won't proceed until it gets power;
      that won't happen until the ->runtime_suspend callback (and its
      "pipeline clear" command) completes; and that can't happen while
      the SUSPEND interrupt condition exists.
      
      It turns out that in this case there is no need to use the "pipeline
      clear" command.  There are scenarios in which clearing the pipeline
      is required while suspending, but those are not (yet) supported
      upstream.  So a simple fix, avoiding the potential deadlock, is to
      stop calling ipa_cmd_pipeline_clear() in ipa_endpoint_suspend().
      This removes the only user of ipa_cmd_pipeline_clear(), so get rid
      of that function.  It can be restored again whenever it's needed.
      
      This is basically a manual revert along with an explanation for
      commit 6cb63ea6 ("net: ipa: introduce ipa_cmd_tag_process()").
      
      Fixes: 6cb63ea6 ("net: ipa: introduce ipa_cmd_tag_process()")
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ipa: separate disabling setup from modem stop · 740c461a
      Alex Elder authored
      [ Upstream commit 8afc7e47 ]
      
      The IPA setup_complete flag is set at the end of ipa_setup(), when
      the setup phase of initialization has completed successfully.  This
      occurs as part of driver probe processing, or (if "modem-init" is
      specified in the DTS file) it is triggered by the "ipa-setup-ready"
      SMP2P interrupt generated by the modem.
      
      In the latter case, it's possible for driver shutdown (or remove) to
      begin while setup processing is underway, and this can't be allowed.
      The problem is that the setup_complete flag is not adequate to signal
      that setup is underway.
      
      If setup_complete is set, it will never be un-set, so that case is
      not a problem.  But if setup_complete is false, there's a chance
      setup is underway.
      
      Because setup is triggered by an interrupt on a "modem-init" system,
      there is a simple way to ensure the value of setup_complete is safe
      to read.  The threaded handler--if it is executing--will complete as
      part of a request to disable the "ipa-setup-ready" interrupt.  This
      means that ipa_setup() (which is called from the handler) will run
      to completion if it was underway, or will never be called otherwise.
      
      The request to disable the "ipa-setup-ready" interrupt is currently
      made within ipa_modem_stop().  Instead, disable the interrupt
      outside that function in the two places it's called.  In the case of
      ipa_remove(), this ensures the setup_complete flag is safe to read
      before we read it.
      
      Rename ipa_smp2p_disable() to be ipa_smp2p_irq_disable_setup(), to be
      more specific about its effect.
      
      Fixes: 530f9216 ("soc: qcom: ipa: AP/modem communications")
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      740c461a
    • Alex Elder's avatar
      net: ipa: directly disable ipa-setup-ready interrupt · f38aa5cf
      Alex Elder authored
      [ Upstream commit 33a15310 ]
      
      We currently maintain a "disabled" Boolean flag to determine whether
      the "ipa-setup-ready" SMP2P IRQ handler does anything.  That flag
      must be accessed under protection of a mutex.
      
      Instead, disable the SMP2P interrupt when requested, which prevents
      the interrupt handler from ever being called.  More importantly, it
      synchronizes a thread disabling the interrupt with the completion of
      the interrupt handler in case they run concurrently.
      
      Use the IPA setup_complete flag rather than the disabled flag in the
      handler to determine whether to ignore any interrupts arriving after
      the first.
      
      Rename the "disabled" flag to be "setup_disabled", to be specific
      about its purpose.
      
      Fixes: 530f9216 ("soc: qcom: ipa: AP/modem communications")
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f38aa5cf
    • Amit Cohen's avatar
      mlxsw: spectrum: Protect driver from buggy firmware · da4d7019
      Amit Cohen authored
      [ Upstream commit 63b08b1f ]
      
      When processing port up/down events generated by the device's firmware,
      the driver protects itself from events reported for non-existent local
      ports, but not the CPU port (local port 0), which exists, but lacks a
      netdev.
      
      This can result in a NULL pointer dereference when calling
      netif_carrier_{on,off}().
      
      Fix this by bailing early when processing an event reported for the CPU
      port. Problem was only observed when running on top of a buggy emulator.
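
The early bail-out described above can be sketched as a toy model in plain C. All names here (`toy_port`, `toy_netdev`, `toy_handle_port_event`, `TOY_CPU_PORT`) are hypothetical stand-ins rather than the mlxsw driver's API; the sketch only shows local port 0 being rejected together with out-of-range ports before any netdev is touched.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CPU_PORT 0u	/* local port 0: exists, but has no netdev */

struct toy_netdev {
	int carrier_on;
};

struct toy_port {
	struct toy_netdev *dev;	/* NULL for the CPU port */
};

/* Returns 0 if the event was applied, -1 if it was ignored. */
int toy_handle_port_event(struct toy_port *ports, unsigned int max_ports,
			  unsigned int local_port, int up)
{
	/* Bail early for the CPU port and for out-of-range ports. */
	if (local_port == TOY_CPU_PORT || local_port >= max_ports)
		return -1;
	/* Ports that were never created have no netdev either. */
	if (!ports[local_port].dev)
		return -1;

	ports[local_port].dev->carrier_on = up;
	return 0;
}
```

Without the first check, an event for port 0 would reach the carrier update with a NULL `dev`, which is the dereference the commit message describes.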
      
      Fixes: 28b1987e ("mlxsw: spectrum: Register CPU port with devlink")
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      da4d7019
    • Tony Lu's avatar
      net/smc: Ensure the active closing peer first closes clcsock · 12dea26c
      Tony Lu authored
      [ Upstream commit 606a63c9 ]
      
      The clcsock of the side that actively closes the socket doesn't enter
      TIME_WAIT state, but the clcsock of the passive side does. It should
      show the same behavior as TCP sockets.
      
      Consider the case where the client actively closes the socket: the
      clcsock on the server enters TIME_WAIT state, which means the address
      is occupied and won't be reused until TIME_WAIT expires. If we
      restarted the server, the service would be unavailable for a long time.
      
      To solve this issue, shut down the clcsock at [A], performing the TCP
      active close first, before the passively closed side closes it. That
      way the actively closing side enters TIME_WAIT, not the passive one.
      
      Client                                            |  Server
      close() // client actively close                  |
        smc_release()                                   |
            smc_close_active() // PEERCLOSEWAIT1        |
                smc_close_final() // abort or closed = 1|
                    smc_cdc_get_slot_and_msg_send()     |
                [A]                                     |
                                                        |smc_cdc_msg_recv_action() // ACTIVE
                                                        |  queue_work(smc_close_wq, &conn->close_work)
                                                        |    smc_close_passive_work() // PROCESSABORT or APPCLOSEWAIT1
                                                        |      smc_close_passive_abort_received() // only in abort
                                                        |
                                                        |close() // server recv zero, close
                                                        |  smc_release() // PROCESSABORT or APPCLOSEWAIT1
                                                        |    smc_close_active()
                                                        |      smc_close_abort() or smc_close_final() // CLOSED
                                                        |        smc_cdc_get_slot_and_msg_send() // abort or closed = 1
      smc_cdc_msg_recv_action()                         |    smc_clcsock_release()
        queue_work(smc_close_wq, &conn->close_work)     |      sock_release(tcp) // actively close clc, enter TIME_WAIT
          smc_close_passive_work() // PEERCLOSEWAIT1    |    smc_conn_free()
            smc_close_passive_abort_received() // CLOSED|
            smc_conn_free()                             |
            smc_clcsock_release()                       |
              sock_release(tcp) // passive close clc    |
      
      Link: https://www.spinics.net/lists/netdev/msg780407.html
      Fixes: b38d7324 ("smc: socket closing and linkgroup cleanup")
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      12dea26c
    • Vincent Whitchurch's avatar
      i2c: virtio: disable timeout handling · cc432b07
      Vincent Whitchurch authored
      [ Upstream commit 84e1d0bf ]
      
      If a timeout is hit, it can result in incorrect data on the I2C bus
      and/or memory corruption in the guest, since the device can still be
      operating on the buffers it was given while the guest has freed them.
      
      Here is, for example, the start of a slub_debug splat which was
      triggered on the next transfer after one transfer was forced to time out
      by setting a breakpoint in the backend (rust-vmm/vhost-device):
      
       BUG kmalloc-1k (Not tainted): Poison overwritten
       First byte 0x1 instead of 0x6b
       Allocated in virtio_i2c_xfer+0x65/0x35c age=350 cpu=0 pid=29
       	__kmalloc+0xc2/0x1c9
       	virtio_i2c_xfer+0x65/0x35c
       	__i2c_transfer+0x429/0x57d
       	i2c_transfer+0x115/0x134
       	i2cdev_ioctl_rdwr+0x16a/0x1de
       	i2cdev_ioctl+0x247/0x2ed
       	vfs_ioctl+0x21/0x30
       	sys_ioctl+0xb18/0xb41
       Freed in virtio_i2c_xfer+0x32e/0x35c age=244 cpu=0 pid=29
       	kfree+0x1bd/0x1cc
       	virtio_i2c_xfer+0x32e/0x35c
       	__i2c_transfer+0x429/0x57d
       	i2c_transfer+0x115/0x134
       	i2cdev_ioctl_rdwr+0x16a/0x1de
       	i2cdev_ioctl+0x247/0x2ed
       	vfs_ioctl+0x21/0x30
       	sys_ioctl+0xb18/0xb41
      
      There is no simple fix for this (the driver would have to always create
      bounce buffers and hold on to them until the device eventually returns
      the buffers), so just disable the timeout support for now.
      
      Fixes: 3cfc8838 ("i2c: virtio: add a virtio i2c frontend driver")
      Acked-by: Jie Deng <jie.deng@intel.com>
      Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Wolfram Sang <wsa@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      cc432b07
    • Huang Jianan's avatar
      erofs: fix deadlock when shrink erofs slab · 4339cd08
      Huang Jianan authored
      [ Upstream commit 57bbeacd ]
      
      We observed the following deadlock in the stress test under low
      memory scenario:
      
      Thread A                               Thread B
      - erofs_shrink_scan
       - erofs_try_to_release_workgroup
        - erofs_workgroup_try_to_freeze -- A
                                             - z_erofs_do_read_page
                                              - z_erofs_collection_begin
                                               - z_erofs_register_collection
                                                - erofs_insert_workgroup
                                                 - xa_lock(&sbi->managed_pslots) -- B
                                                 - erofs_workgroup_get
                                                  - erofs_wait_on_workgroup_freezed -- A
        - xa_erase
         - xa_lock(&sbi->managed_pslots) -- B
      
      To fix this, xa_lock must be held before freezing the workgroup,
      since the xarray will be modified right afterwards. So let's hold the
      lock before accessing each workgroup, just as we did with the radix
      tree before.
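
The lock ordering can be illustrated with a small user-space sketch, assuming a pthread mutex in place of xa_lock and a plain integer in place of the workgroup refcount. Every name below (`toy_workgroup`, `toy_slots_lock`, `toy_shrink_one`, `TOY_FROZEN`) is hypothetical and not part of erofs; the only point is that the freeze and the erase happen under one lock hold, so a thread holding that lock can never wait on a group frozen by a shrinker that still needs the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define TOY_FROZEN (-1)

struct toy_workgroup {
	int refcount;		/* TOY_FROZEN while being released */
	bool in_slots;		/* stands in for membership in the xarray */
};

pthread_mutex_t toy_slots_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold toy_slots_lock. */
bool toy_try_to_release(struct toy_workgroup *grp)
{
	if (grp->refcount != 1)		/* still in use by someone else */
		return false;
	grp->refcount = TOY_FROZEN;	/* freeze ... */
	grp->in_slots = false;		/* ... and erase under the same lock */
	return true;
}

/* Shrinker side: take the lock first, then freeze and erase. */
bool toy_shrink_one(struct toy_workgroup *grp)
{
	bool released;

	pthread_mutex_lock(&toy_slots_lock);
	released = toy_try_to_release(grp);
	pthread_mutex_unlock(&toy_slots_lock);
	return released;
}
```

In the deadlock shown in the diagram, the freeze happened before the lock was taken, which left a window for the reader to grab the lock and then wait on the frozen group.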
      
      [ Gao Xiang: Jianhua Hao also reports this issue at
        https://lore.kernel.org/r/b10b85df30694bac8aadfe43537c897a@xiaomi.com ]
      
      Link: https://lore.kernel.org/r/20211118135844.3559-1-huangjianan@oppo.com
      Fixes: 64094a04 ("erofs: convert workstn to XArray")
      Reviewed-by: Chao Yu <chao@kernel.org>
      Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
      Signed-off-by: Huang Jianan <huangjianan@oppo.com>
      Reported-by: Jianhua Hao <haojianhua1@xiaomi.com>
      Signed-off-by: Gao Xiang <xiang@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4339cd08
    • Shin'ichiro Kawasaki's avatar
      scsi: scsi_debug: Zero clear zones at reset write pointer · 8b3b9aaa
      Shin'ichiro Kawasaki authored
      [ Upstream commit 2d62253e ]
      
      When a reset is requested the position of the write pointer is updated but
      the data in the corresponding zone is not cleared. Instead scsi_debug
      returns any data written before the write pointer was reset. This is an
      error and prevents using scsi_debug for stale page cache testing of the
      BLKRESETZONE ioctl.
      
      Zero written data in the zone when resetting the write pointer.
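
A minimal sketch of the behavior, assuming a flat byte array as the backing store and byte offsets instead of sectors. The `toy_zone` structure and `toy_zone_reset_wp()` are hypothetical names for illustration and do not mirror scsi_debug's internal types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_zone {
	size_t start;	/* first byte of the zone in the backing store */
	size_t wp;	/* write pointer, start <= wp */
};

/* Reset the write pointer and zero everything written so far. */
void toy_zone_reset_wp(unsigned char *store, struct toy_zone *z)
{
	if (z->wp > z->start)
		memset(store + z->start, 0, z->wp - z->start);
	z->wp = z->start;
}
```

Only the range between the zone start and the old write pointer is cleared, since nothing beyond the write pointer has been written in a sequential zone.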
      
      Link: https://lore.kernel.org/r/20211122061223.298890-1-shinichiro.kawasaki@wdc.com
      Fixes: f0d1cf93 ("scsi: scsi_debug: Add ZBC zone commands")
      Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Acked-by: Douglas Gilbert <dgilbert@interlog.com>
      Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8b3b9aaa
    • Mike Christie's avatar
      scsi: core: sysfs: Fix setting device state to SDEV_RUNNING · a67c045b
      Mike Christie authored
      [ Upstream commit eb97545d ]
      
      This fixes an issue added in commit 4edd8cd4 ("scsi: core: sysfs: Fix
      hang when device state is set via sysfs") where if userspace is requesting
      to set the device state to SDEV_RUNNING when the state is already
      SDEV_RUNNING, we return -EINVAL instead of count. The commit above set ret
      to count for this case, when it should have set it to 0.
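
The return-value mix-up can be modeled in a few lines, assuming the store routine ends by mapping `ret == 0` to `count` and anything else to -EINVAL, which is what the description implies. The names (`toy_store_state`, `toy_set_state`, the `fixed` switch) are hypothetical and exist only to show both behaviors side by side.

```c
#include <assert.h>
#include <errno.h>

enum toy_state { TOY_RUNNING, TOY_OFFLINE };

/* Hypothetical stand-in for the state transition; always succeeds here. */
int toy_set_state(enum toy_state *cur, enum toy_state next)
{
	*cur = next;
	return 0;
}

/* sysfs-style store: returns count on success, -EINVAL on failure. */
long toy_store_state(enum toy_state *cur, enum toy_state next, long count,
		     int fixed)
{
	long ret;

	if (*cur == TOY_RUNNING && next == TOY_RUNNING)
		ret = fixed ? 0 : count;	/* the bug set ret to count */
	else
		ret = toy_set_state(cur, next);

	return ret == 0 ? count : -EINVAL;
}
```

With `fixed` set to 0, a RUNNING to RUNNING request yields -EINVAL because a nonzero `ret` is treated as failure; with `fixed` set to 1 it yields `count`.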
      
      Link: https://lore.kernel.org/r/20211120164917.4924-1-michael.christie@oracle.com
      Fixes: 4edd8cd4 ("scsi: core: sysfs: Fix hang when device state is set via sysfs")
      Reviewed-by: Lee Duncan <lduncan@suse.com>
      Signed-off-by: Mike Christie <michael.christie@oracle.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a67c045b
    • Marta Plantykow's avatar
      ice: avoid bpf_prog refcount underflow · 1f10b09c
      Marta Plantykow authored
      [ Upstream commit f65ee535 ]
      
      Ice driver has the routines for managing XDP resources that are shared
      between ndo_bpf op and VSI rebuild flow. The latter takes place for
      example when user changes queue count on an interface via ethtool's
      set_channels().
      
      There is an issue with bpf_prog refcounting when the VSI is being
      rebuilt: since ice_prepare_xdp_rings() is called with vsi->xdp_prog as
      an argument that is later used by ice_vsi_assign_bpf_prog(), the same
      bpf_prog pointer is swapped with itself. It is then also interpreted
      as an 'old_prog', which in turn causes us to call bpf_prog_put() on
      it, decrementing its refcount.
      
      The splat below suggests that the bpf_prog, once its refcount dropped
      to zero, was freed while the kernel still tried to refer to it:
      
      [  481.069429] BUG: unable to handle page fault for address: ffffc9000640f038
      [  481.077390] #PF: supervisor read access in kernel mode
      [  481.083335] #PF: error_code(0x0000) - not-present page
      [  481.089276] PGD 100000067 P4D 100000067 PUD 1001cb067 PMD 106d2b067 PTE 0
      [  481.097141] Oops: 0000 [#1] PREEMPT SMP PTI
      [  481.101980] CPU: 12 PID: 3339 Comm: sudo Tainted: G           OE     5.15.0-rc5+ #1
      [  481.110840] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
      [  481.122021] RIP: 0010:dev_xdp_prog_id+0x25/0x40
      [  481.127265] Code: 80 00 00 00 00 0f 1f 44 00 00 89 f6 48 c1 e6 04 48 01 fe 48 8b 86 98 08 00 00 48 85 c0 74 13 48 8b 50 18 31 c0 48 85 d2 74 07 <48> 8b 42 38 8b 40 20 c3 48 8b 96 90 08 00 00 eb e8 66 2e 0f 1f 84
      [  481.148991] RSP: 0018:ffffc90007b63868 EFLAGS: 00010286
      [  481.155034] RAX: 0000000000000000 RBX: ffff889080824000 RCX: 0000000000000000
      [  481.163278] RDX: ffffc9000640f000 RSI: ffff889080824010 RDI: ffff889080824000
      [  481.171527] RBP: ffff888107af7d00 R08: 0000000000000000 R09: ffff88810db5f6e0
      [  481.179776] R10: 0000000000000000 R11: ffff8890885b9988 R12: ffff88810db5f4bc
      [  481.188026] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      [  481.196276] FS:  00007f5466d5bec0(0000) GS:ffff88903fb00000(0000) knlGS:0000000000000000
      [  481.205633] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  481.212279] CR2: ffffc9000640f038 CR3: 000000014429c006 CR4: 00000000003706e0
      [  481.220530] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  481.228771] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  481.237029] Call Trace:
      [  481.239856]  rtnl_fill_ifinfo+0x768/0x12e0
      [  481.244602]  rtnl_dump_ifinfo+0x525/0x650
      [  481.249246]  ? __alloc_skb+0xa5/0x280
      [  481.253484]  netlink_dump+0x168/0x3c0
      [  481.257725]  netlink_recvmsg+0x21e/0x3e0
      [  481.262263]  ____sys_recvmsg+0x87/0x170
      [  481.266707]  ? __might_fault+0x20/0x30
      [  481.271046]  ? _copy_from_user+0x66/0xa0
      [  481.275591]  ? iovec_from_user+0xf6/0x1c0
      [  481.280226]  ___sys_recvmsg+0x82/0x100
      [  481.284566]  ? sock_sendmsg+0x5e/0x60
      [  481.288791]  ? __sys_sendto+0xee/0x150
      [  481.293129]  __sys_recvmsg+0x56/0xa0
      [  481.297267]  do_syscall_64+0x3b/0xc0
      [  481.301395]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  481.307238] RIP: 0033:0x7f5466f39617
      [  481.311373] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bd 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2f 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
      [  481.342944] RSP: 002b:00007ffedc7f4308 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
      [  481.361783] RAX: ffffffffffffffda RBX: 00007ffedc7f5460 RCX: 00007f5466f39617
      [  481.380278] RDX: 0000000000000000 RSI: 00007ffedc7f5360 RDI: 0000000000000003
      [  481.398500] RBP: 00007ffedc7f53f0 R08: 0000000000000000 R09: 000055d556f04d50
      [  481.416463] R10: 0000000000000077 R11: 0000000000000246 R12: 00007ffedc7f5360
      [  481.434131] R13: 00007ffedc7f5350 R14: 00007ffedc7f5344 R15: 0000000000000e98
      [  481.451520] Modules linked in: ice(OE) af_packet binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp mxm_wmi mei_me coretemp mei ipmi_si ipmi_msghandler wmi acpi_pad acpi_power_meter ip_tables x_tables autofs4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ahci crypto_simd cryptd libahci lpc_ich [last unloaded: ice]
      [  481.528558] CR2: ffffc9000640f038
      [  481.542041] ---[ end trace d1f24c9ecf5b61c1 ]---
      
      Fix this by only calling ice_vsi_assign_bpf_prog() inside
      ice_prepare_xdp_rings() when current vsi->xdp_prog pointer is NULL.
      This way set_channels() flow will not attempt to swap the vsi->xdp_prog
      pointers with itself.
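
The self-swap and the guard against it can be sketched with a toy refcount in user-space C. The types and functions below (`toy_prog`, `toy_vsi`, `toy_assign_prog`, `toy_prepare_rings_buggy`, `toy_prepare_rings_fixed`) are hypothetical simplifications, not the ice driver's code; they only show why assigning a program over itself drops a reference nobody took.

```c
#include <assert.h>
#include <stddef.h>

struct toy_prog { int refcnt; };
struct toy_vsi { struct toy_prog *xdp_prog; };

void toy_prog_put(struct toy_prog *prog)
{
	prog->refcnt--;
}

/* Install prog and drop the reference held on the previous one. */
void toy_assign_prog(struct toy_vsi *vsi, struct toy_prog *prog)
{
	struct toy_prog *old = vsi->xdp_prog;

	vsi->xdp_prog = prog;
	if (old)
		toy_prog_put(old);
}

/* Rebuild path before the fix: always assigns, even prog == vsi->xdp_prog. */
void toy_prepare_rings_buggy(struct toy_vsi *vsi, struct toy_prog *prog)
{
	toy_assign_prog(vsi, prog);
}

/* After the fix: assign only when no program is installed yet. */
void toy_prepare_rings_fixed(struct toy_vsi *vsi, struct toy_prog *prog)
{
	if (!vsi->xdp_prog)
		toy_assign_prog(vsi, prog);
}
```

Calling the buggy variant with `vsi->xdp_prog` as the argument takes the refcount from 1 to 0 while the pointer stays installed; the fixed variant leaves it untouched.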
      
      Also, sprinkle around some comments that provide a reasoning about
      correlation between driver and kernel in terms of bpf_prog refcount.
      
      Fixes: efc2214b ("ice: Add support for XDP")
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Signed-off-by: Marta Plantykow <marta.a.plantykow@intel.com>
      Co-developed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1f10b09c
    • Maciej Fijalkowski's avatar
      ice: fix vsi->txq_map sizing · 992ba40a
      Maciej Fijalkowski authored
      [ Upstream commit 792b2086 ]
      
      The approach of having an XDP queue per CPU regardless of the user's
      setting exposed a hidden bug that could occur when the Rx queue count
      differs from the Tx queue count. Currently vsi->txq_map's size is equal
      to double vsi->alloc_txq, which is not correct because XDP rings were
      previously based on the Rx queue count. The splat below can be seen
      when ethtool -L is used and XDP rings are configured:
      
      [  682.875339] BUG: kernel NULL pointer dereference, address: 000000000000000f
      [  682.883403] #PF: supervisor read access in kernel mode
      [  682.889345] #PF: error_code(0x0000) - not-present page
      [  682.895289] PGD 0 P4D 0
      [  682.898218] Oops: 0000 [#1] PREEMPT SMP PTI
      [  682.903055] CPU: 42 PID: 2878 Comm: ethtool Tainted: G           OE     5.15.0-rc5+ #1
      [  682.912214] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
      [  682.923380] RIP: 0010:devres_remove+0x44/0x130
      [  682.928527] Code: 49 89 f4 55 48 89 fd 4c 89 ff 53 48 83 ec 10 e8 92 b9 49 00 48 8b 9d a8 02 00 00 48 8d 8d a0 02 00 00 49 89 c2 48 39 cb 74 0f <4c> 3b 63 10 74 25 48 8b 5b 08 48 39 cb 75 f1 4c 89 ff 4c 89 d6 e8
      [  682.950237] RSP: 0018:ffffc90006a679f0 EFLAGS: 00010002
      [  682.956285] RAX: 0000000000000286 RBX: ffffffffffffffff RCX: ffff88908343a370
      [  682.964538] RDX: 0000000000000001 RSI: ffffffff81690d60 RDI: 0000000000000000
      [  682.972789] RBP: ffff88908343a0d0 R08: 0000000000000000 R09: 0000000000000000
      [  682.981040] R10: 0000000000000286 R11: 3fffffffffffffff R12: ffffffff81690d60
      [  682.989282] R13: ffffffff81690a00 R14: ffff8890819807a8 R15: ffff88908343a36c
      [  682.997535] FS:  00007f08c7bfa740(0000) GS:ffff88a03fd00000(0000) knlGS:0000000000000000
      [  683.006910] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  683.013557] CR2: 000000000000000f CR3: 0000001080a66003 CR4: 00000000003706e0
      [  683.021819] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  683.030075] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  683.038336] Call Trace:
      [  683.041167]  devm_kfree+0x33/0x50
      [  683.045004]  ice_vsi_free_arrays+0x5e/0xc0 [ice]
      [  683.050380]  ice_vsi_rebuild+0x4c8/0x750 [ice]
      [  683.055543]  ice_vsi_recfg_qs+0x9a/0x110 [ice]
      [  683.060697]  ice_set_channels+0x14f/0x290 [ice]
      [  683.065962]  ethnl_set_channels+0x333/0x3f0
      [  683.070807]  genl_family_rcv_msg_doit+0xea/0x150
      [  683.076152]  genl_rcv_msg+0xde/0x1d0
      [  683.080289]  ? channels_prepare_data+0x60/0x60
      [  683.085432]  ? genl_get_cmd+0xd0/0xd0
      [  683.089667]  netlink_rcv_skb+0x50/0xf0
      [  683.094006]  genl_rcv+0x24/0x40
      [  683.097638]  netlink_unicast+0x239/0x340
      [  683.102177]  netlink_sendmsg+0x22e/0x470
      [  683.106717]  sock_sendmsg+0x5e/0x60
      [  683.110756]  __sys_sendto+0xee/0x150
      [  683.114894]  ? handle_mm_fault+0xd0/0x2a0
      [  683.119535]  ? do_user_addr_fault+0x1f3/0x690
      [  683.134173]  __x64_sys_sendto+0x25/0x30
      [  683.148231]  do_syscall_64+0x3b/0xc0
      [  683.161992]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fix this by taking into account the value that num_possible_cpus()
      yields in addition to vsi->alloc_txq instead of doubling the latter.
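
As a worked example of the sizing change, consider the two formulas side by side. The helper names and the sample numbers (4 Tx queues, 48 possible CPUs) are hypothetical and chosen only to show how far the old size can fall short.

```c
#include <assert.h>

/* Old sizing: assumes at most alloc_txq XDP rings. */
unsigned int toy_txq_map_size_old(unsigned int alloc_txq)
{
	return 2 * alloc_txq;
}

/* New sizing: regular Tx queues plus one XDP ring per possible CPU. */
unsigned int toy_txq_map_size_new(unsigned int alloc_txq,
				  unsigned int possible_cpus)
{
	return alloc_txq + possible_cpus;
}
```

With 4 Tx queues and 48 possible CPUs, the old formula reserves 8 entries while one XDP ring per CPU needs 52 in total.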
      
      Fixes: efc2214b ("ice: Add support for XDP")
      Fixes: 22bf877e ("ice: introduce XDP_TX fallback path")
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      992ba40a
    • Nikolay Aleksandrov's avatar
      net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group · 66521011
      Nikolay Aleksandrov authored
      [ Upstream commit 1005f19b ]
      
      When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
      the removed nexthop entries after an RCU grace period because they
      contain references to the nexthop's net device and to the fib6 info.
      With a specific series of events[1] we can reach a net device refcount
      imbalance which is unrecoverable. IPv4 is not affected because dsts
      don't take a refcount on the route.
      
      [1]
       $ ip nexthop list
        id 200 via 2002:db8::2 dev bridge.10 scope link onlink
        id 201 via 2002:db8::3 dev bridge scope link onlink
        id 203 group 201/200
       $ ip -6 route
        2001:db8::10 nhid 203 metric 1024 pref medium
           nexthop via 2002:db8::3 dev bridge weight 1 onlink
           nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
      
      Create rt6_info through one of the multipath legs, e.g.:
       $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
       (pkt_inj is just a custom packet generator, nothing special)
      
      Then remove that leg from the group by replace (let's assume it is id
      200 in this case):
       $ ip nexthop replace id 203 group 201
      
      Now remove the IPv6 route:
       $ ip -6 route del 2001:db8::10/128
      
      The route won't be really deleted due to the stale rt6_info holding 1
      refcnt in nexthop id 200.
      At this point we have the following reference count dependency:
       (deleted) IPv6 route holds 1 reference over nhid 203
       nh 203 holds 1 ref over id 201
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      Now to create circular dependency between nh 200 and the IPv6 route, and
      also to get a reference over nh 200, restore nhid 200 in the group:
       $ ip nexthop replace id 203 group 201/200
      
      And now we have a permanent circular dependency because nhid 203 holds a
      reference over nh 200 and 201, but the route holds a ref over nh 203 and
      is deleted.
      
      To trigger the bug just delete the group (nhid 203):
       $ ip nexthop del id 203
      
      It won't really be deleted due to the IPv6 route dependency, and now we
      have 2 unlinked and deleted objects that reference each other: the group
      and the IPv6 route. Since the group drops the reference it holds over its
      entries at free time (i.e. its own refcount needs to drop to 0) that will
      never happen and we get a permanent ref on them, since one of the entries
      holds a reference over the IPv6 route it will also never be released.
      
      At this point the dependencies are:
       (deleted, only unlinked) IPv6 route holds reference over group nh 203
       (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      This is the last point where it can be fixed by running traffic through
      nh 200, and specifically through the same CPU so the rt6_info (dst) will
      get released due to the IPv6 genid, that in turn will free the IPv6
      route, which in turn will free the ref count over the group nh 203.
      
      If nh 200 is deleted at this point, it will never be released due to the
      ref from the unlinked group 203, it will only be unlinked:
       $ ip nexthop del id 200
       $ ip nexthop
       $
      
      Now we can never release that stale rt6_info, we have IPv6 route with ref
      over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
      rt6_info (dst) with ref over the net device and the IPv6 route. All of
      these objects are only unlinked, and cannot be released, thus they can't
      release their ref counts.
      
       Message from syslogd@dev at Nov 19 14:04:10 ...
        kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
       Message from syslogd@dev at Nov 19 14:04:20 ...
        kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
      
      Fixes: 7bf4796d ("nexthops: add support for replace")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      66521011
    • Nikolay Aleksandrov's avatar
      net: ipv6: add fib6_nh_release_dsts stub · e085ae66
      Nikolay Aleksandrov authored
      [ Upstream commit 8837cbbf ]
      
      We need a way to release a fib6_nh's per-cpu dsts when replacing
      nexthops otherwise we can end up with stale per-cpu dsts which hold net
      device references, so add a new IPv6 stub called fib6_nh_release_dsts.
      It must be used after an RCU grace period, so no new dsts can be created
      through a group's nexthop entry.
      Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed
      so it doesn't need a dummy stub when IPv6 is not enabled.
      
      Fixes: 7bf4796d ("nexthops: add support for replace")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e085ae66
    • Holger Assmann's avatar
      net: stmmac: retain PTP clock time during SIOCSHWTSTAMP ioctls · 8d196fa5
      Holger Assmann authored
      [ Upstream commit a6da2bbb ]
      
      Currently, when user space emits SIOCSHWTSTAMP ioctl calls such as
      enabling/disabling timestamping or changing filter settings, the driver
      reads the current CLOCK_REALTIME value and programs it into the
      NIC's hardware clock. This might be necessary during system
      initialization, but at runtime, when the PTP clock has already been
      synchronized to a grandmaster, a reset of the timestamp settings might
      result in a clock jump. Furthermore, if the clock is also controlled by
      phc2sys in automatic mode (where the UTC offset is queried from ptp4l),
      that UTC-to-TAI offset (currently 37 seconds in 2021) would be
      temporarily reset to 0, and it would take a long time for phc2sys to
      readjust so that CLOCK_REALTIME and the PHC are apart by 37 seconds
      again.
      
      To address the issue, we introduce a new function called
      stmmac_init_tstamp_counter(), which gets called during ndo_open().
      It contains the code snippet moved from stmmac_hwtstamp_set() that
      manages the time synchronization. Besides, the sub second increment
      configuration is also moved here since the related values are hardware
      dependent and runtime invariant.
      
      Furthermore, the hardware clock must be kept running even when no time
      stamping mode is selected in order to retain the synchronized time base.
      That way, timestamping can be enabled again at any time, with only the
      clock's natural drift left to compensate for.
      
      As a side effect, this patch fixes the issue that ptp_clock_info::enable
      can be called before SIOCSHWTSTAMP and the driver (which looks at
      priv->systime_flags) was not prepared to handle that ordering.
      
      Fixes: 92ba6888 ("stmmac: add the support for PTP hw clock driver")
      Reported-by: Michael Olbrich <m.olbrich@pengutronix.de>
      Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
      Signed-off-by: Holger Assmann <h.assmann@pengutronix.de>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8d196fa5
    • Diana Wang's avatar
      nfp: checking parameter process for rx-usecs/tx-usecs is invalid · f6cd5768
      Diana Wang authored
      [ Upstream commit 3bd6b2a8 ]
      
      Use nn->tlv_caps.me_freq_mhz instead of nn->me_freq_mhz to check whether
      rx-usecs/tx-usecs is valid.
      
      This is because nn->tlv_caps.me_freq_mhz represents the clock_freq (MHz) of
      the flow processing cores (FPC) on the NIC, whereas nn->me_freq_mhz is not
      set.
      
      Fixes: ce991ab6 ("nfp: read ME frequency from vNIC ctrl memory")
      Signed-off-by: Diana Wang <na.wang@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f6cd5768
    • ipv6: fix typos in __ip6_finish_output() · f1f243c0
      Eric Dumazet authored
      [ Upstream commit 19d36c5f ]
      
      We deal with IPv6 packets, so we need to use IP6CB(skb)->flags and
      IP6SKB_REROUTED instead of IPCB(skb)->flags and IPSKB_REROUTED.
      
      Found by code inspection, please double check that fixing this bug
      does not surface other bugs.
      
      Fixes: 09ee9dba ("ipv6: Reinject IPv6 packets if IPsec policy matches after SNAT")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tobias Brunner <tobias@strongswan.org>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: David Ahern <dsahern@kernel.org>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Tested-by: Tobias Brunner <tobias@strongswan.org>
      Acked-by: Tobias Brunner <tobias@strongswan.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f1f243c0
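Why the IPv4 accessor gives the wrong answer on an IPv6 packet can be sketched as follows. The IPv4 and IPv6 control blocks share the same storage in the skb but lay out their fields differently. Everything named `demo_*` or `DEMO_*` here is a hypothetical, heavily simplified stand-in; the layouts and flag values are illustrative and are not the kernel's definitions.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative flag values; the v4 and v6 flags use different bits. */
#define DEMO_IPSKB_REROUTED   (1u << 4)
#define DEMO_IP6SKB_REROUTED  4u

/* Simplified control-block layouts: "flags" sits at different offsets. */
struct demo_ipv4_cb {
	int      iif;
	uint16_t flags;
};

struct demo_ipv6_cb {
	int      iif;
	uint16_t ra;
	uint16_t dst0;
	uint16_t flags;
};

/* Both views alias the same per-packet scratch area. */
struct demo_skb {
	union {
		struct demo_ipv4_cb v4;
		struct demo_ipv6_cb v6;
		unsigned char raw[48];
	} cb;
};

/* The IPv6 path marks the packet as rerouted through the v6 view. */
static void demo_mark_rerouted_v6(struct demo_skb *skb)
{
	skb->cb.v6.flags |= DEMO_IP6SKB_REROUTED;
}

/* Buggy test: reads the v4 view and the v4 bit on an IPv6 packet. */
static int demo_is_rerouted_buggy(const struct demo_skb *skb)
{
	return (skb->cb.v4.flags & DEMO_IPSKB_REROUTED) != 0;
}

/* Fixed test: reads the v6 view and the v6 bit. */
static int demo_is_rerouted_fixed(const struct demo_skb *skb)
{
	return (skb->cb.v6.flags & DEMO_IP6SKB_REROUTED) != 0;
}
```

In this sketch the buggy test inspects bytes that the IPv6 path uses for a different field, so a packet marked as rerouted is not recognized as such.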
    • firmware: smccc: Fix check for ARCH_SOC_ID not implemented · 88f6b5f1
      Michael Kelley authored
      [ Upstream commit e95d8eae ]
      
      The ARCH_FEATURES function ID is a 32-bit SMC call, which returns
      a 32-bit result per the SMCCC spec.  The current code does a 64-bit
      comparison against -1 (SMCCC_RET_NOT_SUPPORTED) to detect that the
      feature is unimplemented.  That check doesn't work in a Hyper-V VM,
      where the upper 32 bits are zero, as allowed by the spec.
      
      Cast the result to 'int' so the comparison works. The change also
      makes the code consistent with other similar checks in this file.
      
      Fixes: 821b67fa ("firmware: smccc: Add ARCH_SOC_ID support")
      Signed-off-by: Michael Kelley <mikelley@microsoft.com>
      Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      88f6b5f1
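The difference between the 64-bit and the 32-bit comparison can be reproduced in a few lines of user-space C. This is a sketch, not the kernel code: `DEMO_RET_NOT_SUPPORTED` and both helper functions are hypothetical names, and `a0` stands for a 64-bit result register whose upper half the firmware or hypervisor may leave as zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the "not supported" return value of -1. */
#define DEMO_RET_NOT_SUPPORTED (-1)

/* 64-bit comparison: only matches when all 64 bits are set. */
static bool demo_not_supported_64(uint64_t a0)
{
	return (int64_t)a0 == DEMO_RET_NOT_SUPPORTED;
}

/* 32-bit comparison: truncates to the low 32 bits first, mirroring the
 * cast to 'int' described above (relies on the usual two's-complement
 * truncation done by GCC and Clang). */
static bool demo_not_supported_32(uint64_t a0)
{
	return (int32_t)a0 == DEMO_RET_NOT_SUPPORTED;
}
```

A register value of 0x00000000FFFFFFFF, a 32-bit -1 with the upper half zeroed, is missed by the 64-bit comparison and caught by the 32-bit one. A fully sign-extended -1 is caught by both.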
    • af_unix: fix regression in read after shutdown · 80d70987
      Vincent Whitchurch authored
      [ Upstream commit f9390b24 ]
      
      On kernels before v5.15, calling read() on a unix socket after
      shutdown(SHUT_RD) or shutdown(SHUT_RDWR) would return the data
      previously written or EOF.  But now, while read() after
      shutdown(SHUT_RD) still behaves the same way, read() after
      shutdown(SHUT_RDWR) always fails with -EINVAL.
      
      This behaviour change appears to have been introduced inadvertently as
      part of a bug fix for a different regression caused by the commit adding
      sockmap support to af_unix, commit 94531cfc ("af_unix: Add
      unix_stream_proto for sockmap").  Those commits, for unclear reasons,
      started setting the socket state to TCP_CLOSE on shutdown(SHUT_RDWR),
      while this state change had previously only been done in
      unix_release_sock().
      
      Restore the original behaviour.  The sockmap tests in
      tools/testing/selftests/bpf continue to pass after this patch.
      
      Fixes: d0c6416b ("unix: Fix an issue in unix_shutdown causing the other end read/write failures")
      Link: https://lore.kernel.org/lkml/20211111140000.GA10779@axis.com/
      Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Tested-by: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      80d70987
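The behaviour described above can be probed from user space with a small helper. This is a minimal sketch using only standard socket calls; the helper name `demo_read_after_shutdown()` is hypothetical. It queues data on a unix stream socket pair, shuts down the reading end, and reports what read() returns.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the number of bytes read after shutdown(how), or -errno. */
static long demo_read_after_shutdown(int how)
{
	int sv[2];
	char buf[16];
	ssize_t n;
	long ret;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -errno;

	/* Queue data for sv[0] before it is shut down. */
	n = write(sv[1], "hello", 5);
	if (n != 5) {
		ret = n < 0 ? -errno : -EIO;
		goto out;
	}

	if (shutdown(sv[0], how) < 0) {
		ret = -errno;
		goto out;
	}

	/* Per the commit message: data or EOF on kernels with the original
	 * behaviour, -EINVAL for SHUT_RDWR on kernels with the regression. */
	n = read(sv[0], buf, sizeof(buf));
	ret = n < 0 ? -errno : n;
out:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

Calling `demo_read_after_shutdown(SHUT_RDWR)` on an affected kernel is expected to yield `-EINVAL`, while a kernel with the original behaviour should hand back the queued bytes. `demo_read_after_shutdown(SHUT_RD)` should not fail in either case.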