  1. May 27, 2024
    • af_unix: Read sk->sk_hash under bindlock during bind(). · 51d1b25a
      Kuniyuki Iwashima authored
      syzkaller reported data-race of sk->sk_hash in unix_autobind() [0],
      and the same ones exist in unix_bind_bsd() and unix_bind_abstract().
      
      The three bind() functions prefetch sk->sk_hash locklessly and
      use it later after validating that unix_sk(sk)->addr is NULL under
      unix_sk(sk)->bindlock.
      
      The prefetched sk->sk_hash is the hash value of the unbound socket, set
      in unix_create1(), and it does not change until bind() completes.
      
      There could be a chance that sk->sk_hash changes after the lockless
      read.  However, in such a case, non-NULL unix_sk(sk)->addr is visible
      under unix_sk(sk)->bindlock, and bind() returns -EINVAL without using
      the prefetched value.
      
      The KCSAN splat is false-positive, but let's silence it by reading
      sk->sk_hash under unix_sk(sk)->bindlock.
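
The pattern described above can be sketched in userspace C. This is only an illustration, not the kernel code: `struct fake_sock`, `fake_bind()` and the pthread mutex are hypothetical stand-ins for the socket, the bind() paths and unix_sk(sk)->bindlock. The point it shows is that the old hash is read only after the lock is held and the socket is confirmed unbound.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the socket fields in the commit. */
struct fake_sock {
	pthread_mutex_t bindlock; /* plays the role of unix_sk(sk)->bindlock */
	const char *addr;         /* NULL until the socket is bound */
	unsigned int hash;        /* plays the role of sk->sk_hash */
};

/*
 * Bind once. The old hash is read only after taking bindlock and
 * confirming the socket is still unbound, so no lockless read remains.
 * On success, *old_hash receives the pre-bind hash.
 */
static int fake_bind(struct fake_sock *sk, const char *addr,
		     unsigned int new_hash, unsigned int *old_hash)
{
	int err = 0;

	assert(sk && addr && old_hash);

	pthread_mutex_lock(&sk->bindlock);
	if (sk->addr) {
		err = -EINVAL; /* already bound: the hash is never touched */
		goto out;
	}
	*old_hash = sk->hash; /* read under the lock */
	sk->addr = addr;
	sk->hash = new_hash;
out:
	pthread_mutex_unlock(&sk->bindlock);
	return err;
}
```

A second call on the same socket returns -EINVAL and leaves `*old_hash` untouched, which mirrors why the original lockless read was harmless but still worth moving.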
      
      [0]:
      BUG: KCSAN: data-race in unix_autobind / unix_autobind
      
      write to 0xffff888034a9fb88 of 4 bytes by task 4468 on cpu 0:
       __unix_set_addr_hash net/unix/af_unix.c:331 [inline]
       unix_autobind+0x47a/0x7d0 net/unix/af_unix.c:1185
       unix_dgram_connect+0x7e3/0x890 net/unix/af_unix.c:1373
       __sys_connect_file+0xd7/0xe0 net/socket.c:2048
       __sys_connect+0x114/0x140 net/socket.c:2065
       __do_sys_connect net/socket.c:2075 [inline]
       __se_sys_connect net/socket.c:2072 [inline]
       __x64_sys_connect+0x40/0x50 net/socket.c:2072
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x46/0x4e
      
      read to 0xffff888034a9fb88 of 4 bytes by task 4465 on cpu 1:
       unix_autobind+0x28/0x7d0 net/unix/af_unix.c:1134
       unix_dgram_connect+0x7e3/0x890 net/unix/af_unix.c:1373
       __sys_connect_file+0xd7/0xe0 net/socket.c:2048
       __sys_connect+0x114/0x140 net/socket.c:2065
       __do_sys_connect net/socket.c:2075 [inline]
       __se_sys_connect net/socket.c:2072 [inline]
       __x64_sys_connect+0x40/0x50 net/socket.c:2072
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x46/0x4e
      
      value changed: 0x000000e4 -> 0x000001e3
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 4465 Comm: syz-executor.0 Not tainted 6.8.0-12822-gcd51db110a7e #12
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: afd20b92 ("af_unix: Replace the big lock with small locks.")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240522154218.78088-1-kuniyu@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • af_unix: Annotate data-race around unix_sk(sk)->addr. · 97e1db06
      Kuniyuki Iwashima authored
      Once unix_sk(sk)->addr is assigned under net->unx.table.locks and
      unix_sk(sk)->bindlock, *(unix_sk(sk)->addr) and unix_sk(sk)->path are
      fully set up, and unix_sk(sk)->addr is never changed.
      
      unix_getname() and unix_copy_addr() access the two fields locklessly,
      and commit ae3b5641 ("missing barriers in some of unix_sock ->addr
      and ->path accesses") added smp_store_release() and smp_load_acquire()
      pairs.
      
      In other functions, we still read unix_sk(sk)->addr locklessly to check
      if the socket is bound, and KCSAN complains about it.  [0]
      
      Given that these functions do not depend on *(unix_sk(sk)->addr) or
      unix_sk(sk)->path, READ_ONCE() is enough to annotate the data-race.
      
      Note that it is safe to access unix_sk(sk)->addr locklessly if the socket
      is found in the hash table.  For example, the lockless read of otheru->addr
      in unix_stream_connect() is safe.
      
      Note also that newu->addr there is of the child socket that is still not
      accessible from userspace, and smp_store_release() publishes the address
      in case the socket is accept()ed and unix_getname() / unix_copy_addr()
      is called.
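
The distinction between the two kinds of readers can be sketched with C11 atomics. This is a userspace analogy under stated assumptions: a relaxed load stands in for READ_ONCE(), release/acquire stand in for smp_store_release()/smp_load_acquire(), and every `fake_*` name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_addr {
	int len;
};

struct fake_sock {
	_Atomic(struct fake_addr *) addr; /* NULL until bound, then never changes */
};

/* Writer: fully set up *addr first, then publish the pointer with release. */
static void fake_publish(struct fake_sock *sk, struct fake_addr *addr, int len)
{
	assert(sk && addr);
	addr->len = len;
	atomic_store_explicit(&sk->addr, addr, memory_order_release);
}

/* Only tests the pointer value, so a relaxed load (roughly READ_ONCE()) is enough. */
static bool fake_is_bound(struct fake_sock *sk)
{
	return atomic_load_explicit(&sk->addr, memory_order_relaxed) != NULL;
}

/* Dereferences the pointer, so it pairs an acquire load with the release store. */
static int fake_addr_len(struct fake_sock *sk)
{
	struct fake_addr *addr =
		atomic_load_explicit(&sk->addr, memory_order_acquire);

	return addr ? addr->len : 0;
}
```

Readers like `fake_is_bound()` never look through the pointer, which is why they need no ordering against the initialization of the pointed-to data.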
      
      [0]:
      BUG: KCSAN: data-race in unix_bind / unix_listen
      
      write (marked) to 0xffff88805f8d1840 of 8 bytes by task 13723 on cpu 0:
       __unix_set_addr_hash net/unix/af_unix.c:329 [inline]
       unix_bind_bsd net/unix/af_unix.c:1241 [inline]
       unix_bind+0x881/0x1000 net/unix/af_unix.c:1319
       __sys_bind+0x194/0x1e0 net/socket.c:1847
       __do_sys_bind net/socket.c:1858 [inline]
       __se_sys_bind net/socket.c:1856 [inline]
       __x64_sys_bind+0x40/0x50 net/socket.c:1856
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x46/0x4e
      
      read to 0xffff88805f8d1840 of 8 bytes by task 13724 on cpu 1:
       unix_listen+0x72/0x180 net/unix/af_unix.c:734
       __sys_listen+0xdc/0x160 net/socket.c:1881
       __do_sys_listen net/socket.c:1890 [inline]
       __se_sys_listen net/socket.c:1888 [inline]
       __x64_sys_listen+0x2e/0x40 net/socket.c:1888
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x46/0x4e
      
      value changed: 0x0000000000000000 -> 0xffff88807b5b1b40
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 13724 Comm: syz-executor.4 Not tainted 6.8.0-12822-gcd51db110a7e #12
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240522154002.77857-1-kuniyu@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • selftests: hsr: Fix "File exists" errors for hsr_ping · 21a22ed6
      Geliang Tang authored
      The hsr_ping test reports the following errors:
      
       INFO: preparing interfaces for HSRv0.
       INFO: Initial validation ping.
       INFO: Longer ping test.
       INFO: Cutting one link.
       INFO: Delay the link and drop a few packages.
       INFO: All good.
       INFO: preparing interfaces for HSRv1.
       RTNETLINK answers: File exists
       RTNETLINK answers: File exists
       RTNETLINK answers: File exists
       RTNETLINK answers: File exists
       RTNETLINK answers: File exists
       RTNETLINK answers: File exists
       Error: ipv4: Address already assigned.
       Error: ipv6: address already assigned.
       Error: ipv4: Address already assigned.
       Error: ipv6: address already assigned.
       Error: ipv4: Address already assigned.
       Error: ipv6: address already assigned.
       INFO: Initial validation ping.
      
      That is because the cleanup code for the 2nd round test before
      "setup_hsr_interfaces 1" is removed incorrectly in commit 680fda4f
      ("test: hsr: Remove script code already implemented in lib.sh").
      
      This patch fixes it by setting up the namespaces again with the

      	setup_ns ns1 ns2 ns3

      command before "setup_hsr_interfaces 1". It deletes the previous
      namespaces and creates new ones.
      
      Fixes: 680fda4f ("test: hsr: Remove script code already implemented in lib.sh")
      Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
      Link: https://lore.kernel.org/r/6485d3005f467758d49f0f313c8c009759ba6b05.1716374462.git.tanggeliang@kylinos.cn
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • enic: Validate length of nl attributes in enic_set_vf_port · e8021b94
      Roded Zats authored
      enic_set_vf_port assumes that the nl attribute IFLA_PORT_PROFILE
      is of length PORT_PROFILE_MAX and that the nl attributes
      IFLA_PORT_INSTANCE_UUID, IFLA_PORT_HOST_UUID are of length PORT_UUID_MAX.
      These attributes are validated (in the function do_setlink in rtnetlink.c)
      using the nla_policy ifla_port_policy. The policy defines IFLA_PORT_PROFILE
      as NLA_STRING, IFLA_PORT_INSTANCE_UUID as NLA_BINARY and
      IFLA_PORT_HOST_UUID as NLA_STRING. That means that the policy only
      validates the maximum size of the attributes, not their exact size,
      so the length of these attributes might be less than the sizes that
      enic_set_vf_port expects. This might cause an out-of-bounds
      read access in the memcpys of the data of these
      attributes in enic_set_vf_port.
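
A minimal sketch of the exact-length check, in userspace C rather than the driver's netlink code. `struct fake_attr`, `fake_copy_uuid()` and `FAKE_UUID_MAX` are hypothetical stand-ins for a netlink attribute, the copy in enic_set_vf_port and PORT_UUID_MAX; the sketch only shows why a maximum-length policy is not enough before a fixed-size memcpy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define FAKE_UUID_MAX 16 /* hypothetical stand-in for PORT_UUID_MAX */

/* Hypothetical stand-in for a received attribute: payload plus its length. */
struct fake_attr {
	size_t len;
	const unsigned char *data;
};

/*
 * Copy a fixed-size UUID out of an attribute. A policy that only caps
 * the maximum length would let a shorter payload through, and the
 * fixed-size memcpy() would then read past its end, so require an
 * exact match first.
 */
static int fake_copy_uuid(unsigned char dst[FAKE_UUID_MAX],
			  const struct fake_attr *attr)
{
	assert(dst && attr);

	if (attr->len != FAKE_UUID_MAX)
		return -EINVAL;

	memcpy(dst, attr->data, FAKE_UUID_MAX);
	return 0;
}
```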
      
      Fixes: f8bd9091 ("net: Add ndo_{set|get}_vf_port support for enic dynamic vnics")
      Signed-off-by: Roded Zats <rzats@paloaltonetworks.com>
      Link: https://lore.kernel.org/r/20240522073044.33519-1-rzats@paloaltonetworks.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  2. May 24, 2024
    • Merge branch 'mlx5-fixes' · 0b4f5add
      David S. Miller authored
      Tariq Toukan says:
      
      ====================
      mlx5 fixes 24-05-22
      
      This patchset provides bug fixes to mlx5 core and Eth drivers.
      
      Series generated against:
      commit 9c91c7fa ("net: mana: Fix the extra HZ in mana_hwc_send_request")
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Fix UDP GSO for encapsulated packets · 83fea49f
      Gal Pressman authored
      When the skb is encapsulated, adjust the inner UDP header instead of the
      outer one, and account for UDP header (instead of TCP) in the inline
      header size calculation.
      
      Fixes: 689adf0d ("net/mlx5e: Add UDP GSO support")
      Reported-by: Jason Baron <jbaron@akamai.com>
      Closes: https://lore.kernel.org/netdev/c42961cb-50b9-4a9a-bd43-87fe48d88d29@akamai.com/
      Signed-off-by: Gal Pressman <gal@nvidia.com>
      Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Boris Pismenny <borisp@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Use rx_missed_errors instead of rx_dropped for reporting buffer exhaustion · 5c74195d
      Carolina Jubran authored
      Previously, the driver incorrectly used rx_dropped to report device
      buffer exhaustion.
      
      According to the documentation, rx_dropped should not be used to count
      packets dropped due to buffer exhaustion, which is the purpose of
      rx_missed_errors.
      
      Use rx_missed_errors as intended for counting packets dropped due to
      buffer exhaustion.
      
      Fixes: 269e6b3a ("net/mlx5e: Report additional error statistics in get stats ndo")
      Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Do not use ptp structure for tx ts stats when not initialized · f55cd312
      Rahul Rameshbabu authored
      The ptp channel instance is only initialized when ptp traffic is first
      processed by the driver. This means that there is a window in between when
      port timestamping is enabled and ptp traffic is sent where the ptp channel
      instance is not initialized. Accessing statistics during this window will
      lead to an access violation (NULL + member offset). Check the validity of
      the instance before attempting to query statistics.
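
The guard can be sketched as follows. This is not the driver's code: the `fake_*` types are hypothetical, and reporting zero while the instance is missing is an assumption made for the sketch, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the lazily created PTP channel and its stats. */
struct fake_ptp {
	uint64_t tx_pkts;
};

struct fake_priv {
	struct fake_ptp *ptp; /* NULL until the first PTP traffic is processed */
};

struct fake_ts_stats {
	uint64_t pkts;
};

/*
 * Fill timestamping stats. Without the NULL check, reading
 * priv->ptp->tx_pkts before the channel exists would dereference
 * NULL plus a member offset, as in the oops quoted below.
 */
static void fake_get_ts_stats(const struct fake_priv *priv,
			      struct fake_ts_stats *out)
{
	assert(priv && out);

	out->pkts = 0; /* assumption: report zero while uninitialized */
	if (!priv->ptp)
		return;

	out->pkts = priv->ptp->tx_pkts;
}
```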
      
        BUG: unable to handle page fault for address: 0000000000003524
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 109dfc067 P4D 109dfc067 PUD 1064ef067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 420 Comm: ethtool Not tainted 6.9.0-rc2-rrameshbabu+ #245
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/204
        RIP: 0010:mlx5e_stats_ts_get+0x4c/0x130
        <snip>
        Call Trace:
         <TASK>
         ? show_regs+0x60/0x70
         ? __die+0x24/0x70
         ? page_fault_oops+0x15f/0x430
         ? do_user_addr_fault+0x2c9/0x5c0
         ? exc_page_fault+0x63/0x110
         ? asm_exc_page_fault+0x27/0x30
         ? mlx5e_stats_ts_get+0x4c/0x130
         ? mlx5e_stats_ts_get+0x20/0x130
         mlx5e_get_ts_stats+0x15/0x20
        <snip>
      
      Fixes: 3579032c ("net/mlx5e: Implement ethtool hardware timestamping statistics")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Fix IPsec tunnel mode offload feature check · 9a52f6d4
      Rahul Rameshbabu authored
      Remove faulty check disabling checksum offload and GSO for offload of
      simple IPsec tunnel L4 traffic. Comment previously describing the deleted
      code incorrectly claimed the check prevented double tunnel (or three layers
      of ip headers).
      
      Fixes: f1267798 ("net/mlx5: Fix checksum issue of VXLAN and IPsec crypto offload")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: Use mlx5_ipsec_rx_status_destroy to correctly delete status rules · 16d66a4f
      Rahul Rameshbabu authored
      rx_create no longer allocates a modify_hdr instance that needs to be
      cleaned up. The mlx5_modify_header_dealloc call will lead to a NULL pointer
      dereference. A leak in the rules also previously occurred since there are
      now two rules populated related to status.
      
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 109907067 P4D 109907067 PUD 116890067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 1 PID: 484 Comm: ip Not tainted 6.9.0-rc2-rrameshbabu+ #254
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014
        RIP: 0010:mlx5_modify_header_dealloc+0xd/0x70
        <snip>
        Call Trace:
         <TASK>
         ? show_regs+0x60/0x70
         ? __die+0x24/0x70
         ? page_fault_oops+0x15f/0x430
         ? free_to_partial_list.constprop.0+0x79/0x150
         ? do_user_addr_fault+0x2c9/0x5c0
         ? exc_page_fault+0x63/0x110
         ? asm_exc_page_fault+0x27/0x30
         ? mlx5_modify_header_dealloc+0xd/0x70
         rx_create+0x374/0x590
         rx_add_rule+0x3ad/0x500
         ? rx_add_rule+0x3ad/0x500
         ? mlx5_cmd_exec+0x2c/0x40
         ? mlx5_create_ipsec_obj+0xd6/0x200
         mlx5e_accel_ipsec_fs_add_rule+0x31/0xf0
         mlx5e_xfrm_add_state+0x426/0xc00
        <snip>
      
      Fixes: 94af50c0 ("net/mlx5e: Unify esw and normal IPsec status table creation/destruction")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: Fix MTMP register capability offset in MCAM register · 1b9f86c6
      Gal Pressman authored
      The MTMP register (0x900a) capability offset is off-by-one, move it to
      the right place.
      
      Fixes: 1f507e80 ("net/mlx5: Expose NIC temperature via hardware monitoring kernel API")
      Signed-off-by: Gal Pressman <gal@nvidia.com>
      Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: Do not query MPIR on embedded CPU function · fca3b479
      Tariq Toukan authored
      A proper query to MPIR needs to set the correct value in the depth field.
      On embedded CPU this value is not necessarily zero. As there is no real
      use case for multi-PF netdev on the embedded CPU of the smart NIC, block
      this option.
      
      This fixes the following failure:
      ACCESS_REG(0x805) op_mod(0x1) failed, status bad system state(0x4), syndrome (0x685f19), err(-5)
      
      Fixes: 678eb448 ("net/mlx5: SD, Implement basic query and instantiation")
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: Lag, do bond only if slaves agree on roce state · 51ef9305
      Maher Sanalla authored
      Currently, the driver does not enforce that lag bond slaves must have
      matching roce capabilities. Yet, in mlx5_do_bond(), the driver attempts
      to enable roce on all vports of the bond slaves, causing the following
      syndrome when one slave has no roce fw support:
      
      mlx5_cmd_out_err:809:(pid 25427): MODIFY_NIC_VPORT_CONTEXT(0x755) op_mod(0x0)
      failed, status bad parameter(0x3), syndrome (0xc1f678), err(-22)
      
      Thus, create HW lag only if bond's slaves agree on roce state,
      either all slaves have roce support resulting in a roce lag bond,
      or none do, resulting in a raw eth bond.
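
The agreement rule can be sketched as a small predicate. The function name and the plain boolean array are hypothetical; the driver works on its own device structures, which are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true only if every slave reports the same RoCE state:
 * either all enabled (RoCE lag bond) or all disabled (raw eth bond).
 * A mixed set means the hardware lag should not be created.
 */
static bool fake_roce_state_agrees(const bool *roce_enabled, size_t n)
{
	assert(roce_enabled && n > 0);

	for (size_t i = 1; i < n; i++) {
		if (roce_enabled[i] != roce_enabled[0])
			return false;
	}
	return true;
}
```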
      
      Fixes: 7907f23a ("net/mlx5: Implement RoCE LAG feature")
      Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ8061 · 128d54fb
      Mathieu Othacehe authored
      Following a similar reinstate for the KSZ8081 and KSZ9031.
      
      Older kernels would use the genphy_soft_reset if the PHY did not implement
      a .soft_reset.
      
      The KSZ8061 errata described here:
      https://ww1.microchip.com/downloads/en/DeviceDoc/KSZ8061-Errata-DS80000688B.pdf
      and worked around with 232ba3a5 ("net: phy: Micrel KSZ8061: link failure after cable connect")
      is back again without this soft reset.
      
      Fixes: 6e2d85ec ("net: phy: Stop with excessive soft reset")
      Tested-by: Karim Ben Houcine <karim.benhoucine@landisgyr.com>
      Signed-off-by: Mathieu Othacehe <othacehe@gnu.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • connector: Fix invalid conversion in cn_proc.h · 06e785ae
      Matt Jan authored
      
      
      The implicit conversion from unsigned int to enum
      proc_cn_event is invalid, so explicitly cast it
      for compilation in a C++ compiler.
      /usr/include/linux/cn_proc.h: In function 'proc_cn_event valid_event(proc_cn_event)':
      /usr/include/linux/cn_proc.h:72:17: error: invalid conversion from 'unsigned int' to 'proc_cn_event' [-fpermissive]
         72 |         ev_type &= PROC_EVENT_ALL;
            |                 ^
            |                 |
            |                 unsigned int
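
A sketch of the explicit-cast form, written so that it is valid as both C and C++. The enum and its values are hypothetical and much smaller than the real proc_cn_event; only the shape of the fix is illustrated.

```c
#include <assert.h>

/* Hypothetical subset of event bits; the values are illustrative only. */
enum fake_event {
	FAKE_EVENT_NONE = 0x0,
	FAKE_EVENT_FORK = 0x1,
	FAKE_EVENT_EXEC = 0x2,
	FAKE_EVENT_ALL  = FAKE_EVENT_FORK | FAKE_EVENT_EXEC
};

static_assert(FAKE_EVENT_ALL == 0x3, "mask must cover both event bits");

/*
 * The bitwise AND yields an integer type, not the enum. C converts it
 * back implicitly, C++ does not, so the result is cast explicitly
 * instead of using "ev_type &= mask".
 */
static inline enum fake_event fake_valid_event(enum fake_event ev_type)
{
	return (enum fake_event)(ev_type & FAKE_EVENT_ALL);
}
```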
      
      Signed-off-by: Matt Jan <zoo868e@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 66ad4829
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Quite smaller than usual. Notably it includes the fix for the unix
        regression from the past weeks. The TCP window fix will require some
        follow-up, already queued.
      
        Current release - regressions:
      
         - af_unix: fix garbage collection of embryos
      
        Previous releases - regressions:
      
         - af_unix: fix race between GC and receive path
      
         - ipv6: sr: fix missing sk_buff release in seg6_input_core
      
         - tcp: remove 64 KByte limit for initial tp->rcv_wnd value
      
         - eth: r8169: fix rx hangup
      
         - eth: lan966x: remove ptp traps in case the ptp is not enabled
      
         - eth: ixgbe: fix link breakage vs cisco switches
      
         - eth: ice: prevent ethtool from corrupting the channels
      
        Previous releases - always broken:
      
         - openvswitch: set the skbuff pkt_type for proper pmtud support
      
         - tcp: Fix shift-out-of-bounds in dctcp_update_alpha()
      
        Misc:
      
         - a bunch of selftests stabilization patches"
      
      * tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits)
        r8169: Fix possible ring buffer corruption on fragmented Tx packets.
        idpf: Interpret .set_channels() input differently
        ice: Interpret .set_channels() input differently
        nfc: nci: Fix handling of zero-length payload packets in nci_rx_work()
        net: relax socket state check at accept time.
        tcp: remove 64 KByte limit for initial tp->rcv_wnd value
        net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe()
        tls: fix missing memory barrier in tls_init
        net: fec: avoid lock evasion when reading pps_enable
        Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI"
        testing: net-drv: use stats64 for testing
        net: mana: Fix the extra HZ in mana_hwc_send_request
        net: lan966x: Remove ptp traps in case the ptp is not enabled.
        openvswitch: Set the skbuff pkt_type for proper pmtud support.
        selftest: af_unix: Make SCM_RIGHTS into OOB data.
        af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS
        tcp: Fix shift-out-of-bounds in dctcp_update_alpha().
        selftests/net: use tc rule to filter the na packet
        ipv6: sr: fix memleak in seg6_hmac_init_algo
        af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
        ...
    • Merge tag 'trace-fixes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 404001dd
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
       "Minor last minute fixes:
      
         - Fix a very tight race between the ring buffer readers and resizing
           the ring buffer
      
         - Correct some stale comments in the ring buffer code
      
         - Fix kernel-doc in the rv code
      
         - Add a MODULE_DESCRIPTION to preemptirq_delay_test"
      
      * tag 'trace-fixes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        rv: Update rv_en(dis)able_monitor doc to match kernel-doc
        tracing: Add MODULE_DESCRIPTION() to preemptirq_delay_test
        ring-buffer: Fix a race between readers and resize checks
        ring-buffer: Correct stale comments related to non-consuming readers
    • Merge tag 'trace-tools-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · e82d2af5
      Linus Torvalds authored
      Pull tracing tool fix from Steven Rostedt:
       "Fix printf format warnings in latency-collector.
      
        Use the printf format string with %s to take a string instead of
        taking in a string directly"
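
The idea behind the fix can be sketched with snprintf() so the result is easy to inspect. The helper name is hypothetical; in the tool itself the same change would take the form `warnx("%s", no_tracer_msg)` instead of `warnx(no_tracer_msg)`.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a message that is not a string literal. Passing msg as the
 * format itself would trigger -Wformat-security and would interpret
 * any '%' sequences inside it; with a literal "%s" format, msg is
 * only ever treated as data.
 */
static int fake_format_msg(char *buf, size_t size, const char *msg)
{
	assert(buf && msg && size > 0);

	return snprintf(buf, size, "%s", msg);
}
```

With this form a message containing `%n` or `%s` is copied verbatim rather than parsed as a conversion.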
      
      * tag 'trace-tools-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        tools/latency-collector: Fix -Wformat-security compile warns
    • Merge tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · d6a326d6
      Linus Torvalds authored
      Pull tracing cleanup from Steven Rostedt:
       "Remove second argument of __assign_str()
      
        The __assign_str() macro logic of the TRACE_EVENT() macro was
        optimized so that it no longer needs the second argument. The
        __assign_str() is always matched with __string() field that takes a
        field name and the source for that field:
      
          __string(field, source)
      
        The TRACE_EVENT() macro logic will save off the source value and then
        use that value to copy into the ring buffer via the __assign_str().
      
        Before commit c1fa617c ("tracing: Rework __assign_str() and
        __string() to not duplicate getting the string"), the __assign_str()
        needed the second argument which would perform the same logic as the
        __string() source parameter did. Not only would this add overhead, but
        it was error prone as if the __assign_str() source produced something
        different, it may not have allocated enough for the string in the ring
        buffer (as the __string() source was used to determine how much to
        allocate)
      
        Now that the __assign_str() just uses the same string that was used in
        __string() it no longer needs the source parameter. It can now be
        removed"
      
      * tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        tracing/treewide: Remove second parameter of __assign_str()
    • Merge tag 'sparc-for-6.10-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc · bca2a25d
      Linus Torvalds authored
      Pull sparc updates from Andreas Larsson:
      
       - Avoid on-stack cpumask variables in a number of places
      
       - Move struct termio to asm/termios.h, matching other architectures and
         allowing certain user space applications to build also for sparc
      
       - Fix missing prototype warnings for sparc64
      
       - Fix version generation warnings for sparc32
      
       - Fix bug where non-consecutive CPU IDs lead to some CPUs not starting
      
       - Simplification using swap and cleanup using NULL for pointer
      
       - Convert sparc parport and chmc drivers to use remove callbacks
         returning void
      
      * tag 'sparc-for-6.10-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc:
        sparc/leon: Remove on-stack cpumask var
        sparc/pci_msi: Remove on-stack cpumask var
        sparc/of: Remove on-stack cpumask var
        sparc/irq: Remove on-stack cpumask var
        sparc/srmmu: Remove on-stack cpumask var
        sparc: chmc: Convert to platform remove callback returning void
        sparc: parport: Convert to platform remove callback returning void
        sparc: Compare pointers to NULL instead of 0
        sparc: Use swap() to fix Coccinelle warning
        sparc32: Fix version generation failed warnings
        sparc64: Fix number of online CPUs
        sparc64: Fix prototype warning for sched_clock
        sparc64: Fix prototype warnings in adi_64.c
        sparc64: Fix prototype warning for dma_4v_iotsb_bind
        sparc64: Fix prototype warning for uprobe_trap
        sparc64: Fix prototype warning for alloc_irqstack_bootmem
        sparc64: Fix prototype warning for vmemmap_free
        sparc64: Fix prototype warnings in traps_64.c
        sparc64: Fix prototype warning for init_vdso_image
        sparc: move struct termio to asm/termios.h
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 2b7ced10
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "The major fix here is for a filesystem corruption issue reported on
        Apple M1 as a result of buggy management of the floating point
        register state introduced in 6.8. I initially reverted one of the
        offending patches, but in the end Ard cooked a proper fix so there's a
        revert+reapply in the series.
      
        Aside from that, we've got some CPU errata workarounds and misc other
        fixes.
      
         - Fix broken FP register state tracking which resulted in filesystem
           corruption when dm-crypt is used
      
         - Workarounds for Arm CPU errata affecting the SSBS Spectre
           mitigation
      
         - Fix lockdep assertion in DMC620 memory controller PMU driver
      
         - Fix alignment of BUG table when CONFIG_DEBUG_BUGVERBOSE is
           disabled"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64/fpsimd: Avoid erroneous elide of user state reload
        Reapply "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD"
        arm64: asm-bug: Add .align 2 to the end of __BUG_ENTRY
        perf/arm-dmc620: Fix lockdep assert in ->event_init()
        Revert "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD"
        arm64: errata: Add workaround for Arm errata 3194386 and 3312417
        arm64: cputype: Add Neoverse-V3 definitions
        arm64: cputype: Add Cortex-X4 definitions
        arm64: barrier: Restore spec_bar() macro
    • Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost · 2ef32ad2
      Linus Torvalds authored
      Pull virtio updates from Michael Tsirkin:
       "Several new features here:
      
         - virtio-net is finally supported in vduse
      
         - virtio (balloon and mem) interaction with suspend is improved
      
         - vhost-scsi now handles signals better/faster
      
        And fixes, cleanups all over the place"
      
      * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits)
        virtio-pci: Check if is_avq is NULL
        virtio: delete vq in vp_find_vqs_msix() when request_irq() fails
        MAINTAINERS: add Eugenio Pérez as reviewer
        vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API
        vp_vdpa: don't allocate unused msix vectors
        sound: virtio: drop owner assignment
        fuse: virtio: drop owner assignment
        scsi: virtio: drop owner assignment
        rpmsg: virtio: drop owner assignment
        nvdimm: virtio_pmem: drop owner assignment
        wifi: mac80211_hwsim: drop owner assignment
        vsock/virtio: drop owner assignment
        net: 9p: virtio: drop owner assignment
        net: virtio: drop owner assignment
        net: caif: virtio: drop owner assignment
        misc: nsm: drop owner assignment
        iommu: virtio: drop owner assignment
        drm/virtio: drop owner assignment
        gpio: virtio: drop owner assignment
        firmware: arm_scmi: virtio: drop owner assignment
        ...
      2ef32ad2
  3. May 23, 2024
    • Shuah Khan's avatar
      tools/latency-collector: Fix -Wformat-security compile warns · df73757c
      Shuah Khan authored
      Fix the following -Wformat-security compile warnings by adding the
      missing format arguments:
      
      latency-collector.c: In function ‘show_available’:
      latency-collector.c:938:17: warning: format not a string literal and
      no format arguments [-Wformat-security]
        938 |                 warnx(no_tracer_msg);
            |                 ^~~~~
      
      latency-collector.c:943:17: warning: format not a string literal and
      no format arguments [-Wformat-security]
        943 |                 warnx(no_latency_tr_msg);
            |                 ^~~~~
      
      latency-collector.c: In function ‘find_default_tracer’:
      latency-collector.c:986:25: warning: format not a string literal and
      no format arguments [-Wformat-security]
        986 |                         errx(EXIT_FAILURE, no_tracer_msg);
            |                         ^~~~
      latency-collector.c: In function ‘scan_arguments’:
      latency-collector.c:1881:33: warning: format not a string literal and
      no format arguments [-Wformat-security]
       1881 |                                 errx(EXIT_FAILURE, no_tracer_msg);
            |                                 ^~~~
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240404011009.32945-1-skhan@linuxfoundation.org
      
      Cc: stable@vger.kernel.org
      Fixes: e23db805 ("tracing/tools: Add the latency-collector to tools directory")
      Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      df73757c
    • Ken Milmore's avatar
      r8169: Fix possible ring buffer corruption on fragmented Tx packets. · c71e3a5c
      Ken Milmore authored
      An issue was found on the RTL8125b when transmitting small fragmented
      packets, whereby invalid entries were inserted into the transmit ring
      buffer, subsequently leading to calls to dma_unmap_single() with a null
      address.
      
      This was caused by rtl8169_start_xmit() not noticing changes to nr_frags
      which may occur when small packets are padded (to work around hardware
      quirks) in rtl8169_tso_csum_v2().
      
      To fix this, postpone inspecting nr_frags until after any padding has been
      applied.
      
      Fixes: 9020845f ("r8169: improve rtl8169_start_xmit")
      Cc: stable@vger.kernel.org
      Signed-off-by: Ken Milmore <ken.milmore@gmail.com>
      Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
      Link: https://lore.kernel.org/r/27ead18b-c23d-4f49-a020-1fc482c5ac95@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      c71e3a5c
    • Paolo Abeni's avatar
      Merge branch 'intel-interpret-set_channels-input-differently' · 3d8597d8
      Paolo Abeni authored
      
      
      Jacob Keller says:
      
      ====================
      intel: Interpret .set_channels() input differently
      
      The ice and idpf drivers can trigger a crash with AF_XDP due to incorrect
      interpretation of the asymmetric Tx and Rx parameters in their
      .set_channels() implementations:
      
      1. ethtool -l <IFNAME> -> combined: 40
      2. Attach AF_XDP to queue 30
      3. ethtool -L <IFNAME> rx 15 tx 15
         combined number is not specified, so command becomes {rx_count = 15,
         tx_count = 15, combined_count = 40}.
      4. ethnl_set_channels checks whether any AF_XDP sockets are attached to
         queues between the new (combined_count + rx_count) and the old one, so
         from 55 to 40; the check does not trigger.
      5. the driver interprets `rx 15 tx 15` as 15 combined channels and deletes
         the queue that AF_XDP is attached to.
      
      This is fundamentally a problem with interpreting a request for asymmetric
      queues as symmetric combined queues.
      
      Fix the ice and idpf drivers to stop interpreting such requests as a
      request for combined queues. Due to current driver design for both ice and
      idpf, it is not possible to support requests of the same count of Tx and Rx
      queues with independent interrupts, (i.e. ethtool -L <IFNAME> rx 15 tx 15)
      so such requests are now rejected.
      
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      ====================

      Link: https://lore.kernel.org/r/20240521-iwl-net-2024-05-14-set-channels-fixes-v2-0-7aa39e2e99f1@intel.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      3d8597d8
    • Larysa Zaremba's avatar
      idpf: Interpret .set_channels() input differently · 5e7695e0
      Larysa Zaremba authored
      Unlike ice, idpf does not check whether the user has requested at least 1
      combined channel. Instead, it relies on a check in the core code.
      Unfortunately, the check does not trigger for us because of the hacky
      .set_channels() interpretation logic that is not consistent with the core
      code.

      This naturally lets a user trigger a crash with invalid input. This is
      how:
      
      1. ethtool -l <IFNAME> -> combined: 40
      2. ethtool -L <IFNAME> rx 0 tx 0
         combined number is not specified, so command becomes {rx_count = 0,
         tx_count = 0, combined_count = 40}.
      3. ethnl_set_channels checks whether there is at least 1 RX and 1 TX
         channel, comparing (combined_count + rx_count) and
         (combined_count + tx_count) to zero. Obviously, (40 + 0) is greater
         than zero, so the core code deems the input OK.
      4. idpf interprets `rx 0 tx 0` as 0 channels and tries to proceed with such
         configuration.
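
      The mismatch in the steps above can be sketched as a toy model. This is
      an illustration only, not kernel or driver code: all function names are
      hypothetical, and the pre-fix driver reading is an assumption based on
      step 4.

```python
# Toy model of the steps above. Every name here is hypothetical; the real
# checks live in ethnl_set_channels() and the driver's .set_channels().

def apply_request(current, requested):
    """Counts the user did not pass keep their current value."""
    merged = dict(current)
    merged.update(requested)
    return merged


def core_accepts(cfg):
    """Core-style sanity check: at least one RX and one TX channel in total."""
    return (cfg["combined"] + cfg["rx"]) > 0 and (cfg["combined"] + cfg["tx"]) > 0


def channels_seen_by_old_driver(cfg):
    """Assumed pre-fix driver reading: only the rx/tx counts are considered."""
    return min(cfg["rx"], cfg["tx"])


current = {"combined": 40, "rx": 0, "tx": 0}
cfg = apply_request(current, {"rx": 0, "tx": 0})  # ethtool -L <IFNAME> rx 0 tx 0

accepted_by_core = core_accepts(cfg)              # (40 + 0) > 0
driver_channels = channels_seen_by_old_driver(cfg)
```

      In this model the core check accepts the request while the driver-side
      reading ends up with zero channels, which is the inconsistency this
      patch removes.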
      
      The issue has to be solved fundamentally, as current logic is also known to
      cause AF_XDP problems in ice [0].
      
      Interpret the command in a way that is more consistent with ethtool
      manual [1] (--show-channels and --set-channels) and new ice logic.
      
      Considering that in the idpf driver only the difference between RX and TX
      queues forms dedicated channels, the correct way to set the number of
      channels is now:
      
      ethtool -L <IFNAME> combined 10 /* For symmetric queues */
      ethtool -L <IFNAME> combined 8 tx 2 rx 0 /* For asymmetric queues */
      
      [0] https://lore.kernel.org/netdev/20240418095857.2827-1-larysa.zaremba@intel.com/
      [1] https://man7.org/linux/man-pages/man8/ethtool.8.html
      
      Fixes: 02cbfba1 ("idpf: add ethtool callbacks")
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com>
      Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
      Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      5e7695e0
    • Larysa Zaremba's avatar
      ice: Interpret .set_channels() input differently · 05d6f442
      Larysa Zaremba authored
      A bug occurs because a safety check guarding AF_XDP-related queues in
      ethnl_set_channels() does not trigger. This happens because the kernel
      and the ice driver interpret the ethtool command differently.
      
      How the bug occurs:
      1. ethtool -l <IFNAME> -> combined: 40
      2. Attach AF_XDP to queue 30
      3. ethtool -L <IFNAME> rx 15 tx 15
         combined number is not specified, so command becomes {rx_count = 15,
         tx_count = 15, combined_count = 40}.
      4. ethnl_set_channels checks whether any AF_XDP sockets are attached to
         queues between the new (combined_count + rx_count) and the old one, so
         from 55 to 40; the check does not trigger.
      5. ice interprets `rx 15 tx 15` as 15 combined channels and deletes the
         queue that AF_XDP is attached to.
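
      The five steps above can be sketched as a toy model. It is a
      simplification written for illustration, not the kernel code: the
      function names are hypothetical, the range check only paraphrases
      step 4, and the pre-fix driver reading is an assumption based on step 5.

```python
# Toy model of the five steps above. All names are hypothetical; the real
# check lives in ethnl_set_channels() and is only paraphrased here.

def merge_request(current, requested):
    """Counts the user did not pass keep their current value."""
    merged = dict(current)
    merged.update(requested)
    return merged


def core_check_passes(old, new, xsk_queues):
    """Pass unless an AF_XDP socket sits on a queue in [new_max, old_max)."""
    old_max = old["combined"] + old["rx"]
    new_max = new["combined"] + new["rx"]
    return not any(q in xsk_queues for q in range(new_max, old_max))


def queues_kept_by_old_driver(new):
    """Assumed pre-fix driver reading: `rx N tx N` means N combined queues."""
    return set(range(min(new["rx"], new["tx"])))


old = {"combined": 40, "rx": 0, "tx": 0}
new = merge_request(old, {"rx": 15, "tx": 15})  # ethtool -L <IFNAME> rx 15 tx 15
xsk_queues = {30}                               # AF_XDP attached to queue 30

check_ok = core_check_passes(old, new, xsk_queues)  # range(55, 40) is empty
socket_survives = xsk_queues <= queues_kept_by_old_driver(new)
```

      In this model the core check passes because the range from 55 to 40 is
      empty, while the driver-side reading keeps only queues 0 to 14, so
      queue 30 is gone.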
      
      Interpret the command in a way that is more consistent with ethtool
      manual [0] (--show-channels and --set-channels).
      
      Considering that in the ice driver only the difference between RX and TX
      queues forms dedicated channels, the correct way to set the number of
      channels is now:
      
      ethtool -L <IFNAME> combined 10 /* For symmetric queues */
      ethtool -L <IFNAME> combined 8 tx 2 rx 0 /* For asymmetric queues */
      
      [0] https://man7.org/linux/man-pages/man8/ethtool.8.html
      
      Fixes: 87324e74 ("ice: Implement ethtool ops for channels")
      Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
      Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
      Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      05d6f442
    • Ryosuke Yasuoka's avatar
      nfc: nci: Fix handling of zero-length payload packets in nci_rx_work() · 6671e352
      Ryosuke Yasuoka authored
      When nci_rx_work() receives a zero-length payload packet, it should not
      discard the packet and exit the loop. Instead, it should continue
      processing subsequent packets.
      
      Fixes: d24b0353 ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet")
      Signed-off-by: Ryosuke Yasuoka <ryasuoka@redhat.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Link: https://lore.kernel.org/r/20240521153444.535399-1-ryasuoka@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      6671e352
    • Paolo Abeni's avatar
      net: relax socket state check at accept time. · 26afda78
      Paolo Abeni authored
      Christoph reported the following splat:
      
      WARNING: CPU: 1 PID: 772 at net/ipv4/af_inet.c:761 __inet_accept+0x1f4/0x4a0
      Modules linked in:
      CPU: 1 PID: 772 Comm: syz-executor510 Not tainted 6.9.0-rc7-g7da7119fe22b #56
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      RIP: 0010:__inet_accept+0x1f4/0x4a0 net/ipv4/af_inet.c:759
      Code: 04 38 84 c0 0f 85 87 00 00 00 41 c7 04 24 03 00 00 00 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 ec b7 da fd <0f> 0b e9 7f fe ff ff e8 e0 b7 da fd 0f 0b e9 fe fe ff ff 89 d9 80
      RSP: 0018:ffffc90000c2fc58 EFLAGS: 00010293
      RAX: ffffffff836bdd14 RBX: 0000000000000000 RCX: ffff888104668000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: dffffc0000000000 R08: ffffffff836bdb89 R09: fffff52000185f64
      R10: dffffc0000000000 R11: fffff52000185f64 R12: dffffc0000000000
      R13: 1ffff92000185f98 R14: ffff88810754d880 R15: ffff8881007b7800
      FS:  000000001c772880(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fb9fcf2e178 CR3: 00000001045d2002 CR4: 0000000000770ef0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       inet_accept+0x138/0x1d0 net/ipv4/af_inet.c:786
       do_accept+0x435/0x620 net/socket.c:1929
       __sys_accept4_file net/socket.c:1969 [inline]
       __sys_accept4+0x9b/0x110 net/socket.c:1999
       __do_sys_accept net/socket.c:2016 [inline]
       __se_sys_accept net/socket.c:2013 [inline]
       __x64_sys_accept+0x7d/0x90 net/socket.c:2013
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x58/0x100 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
      RIP: 0033:0x4315f9
      Code: fd ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 ab b4 fd ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffdb26d9c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002b
      RAX: ffffffffffffffda RBX: 0000000000400300 RCX: 00000000004315f9
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
      RBP: 00000000006e1018 R08: 0000000000400300 R09: 0000000000400300
      R10: 0000000000400300 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000040cdf0 R14: 000000000040ce80 R15: 0000000000000055
       </TASK>
      
      The reproducer invokes shutdown() before entering the listener status.
      After commit 94062790 ("tcp: defer shutdown(SEND_SHUTDOWN) for
      TCP_SYN_RECV sockets"), the above causes the child to reach the accept
      syscall in FIN_WAIT1 status.
      
      Eric noted we can relax the existing assertion in __inet_accept()
      
      Reported-by: Christoph Paasch <cpaasch@apple.com>
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/490
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Fixes: 94062790 ("tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets")
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/23ab880a44d8cfd967e84de8b93dbf48848e3d8c.1716299669.git.pabeni@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      26afda78
    • Jason Xing's avatar
      tcp: remove 64 KByte limit for initial tp->rcv_wnd value · 378979e9
      Jason Xing authored
      Recently, we upgraded some servers to the latest kernel and noticed that
      a user-side indicator showed worse results than before. The cause is the
      limit on tp->rcv_wnd.

      In 2018, commit a337531b ("tcp: up initial rmem to 128KB and SYN rwin
      to around 64KB") limited the initial value of tp->rcv_wnd to 65535. Most
      CDN teams do not benefit from this change because they cannot use a
      large window to receive big packets, which slows transfers down,
      especially over long-RTT paths. A small rcv_wnd means slow transfer
      speed, to some extent. This is a side effect for latency/time-sensitive
      users.

      To avoid future confusion: this change does not affect the initial
      receive window on the wire in a SYN or SYN+ACK packet, which stays within
      65535 bytes according to RFC 7323, because of the limit in
      __tcp_transmit_skb():
      
          th->window      = htons(min(tp->rcv_wnd, 65535U));
      
      In short, __tcp_transmit_skb() already ensures that the constraint is
      respected, no matter how large tp->rcv_wnd is. The change does not violate
      the RFC.

      Here is one example, without and with the patch:
      Before:
      client   --- SYN: rwindow=65535 ---> server
      client   <--- SYN+ACK: rwindow=65535 ----  server
      client   --- ACK: rwindow=65536 ---> server
      Note: for the last ACK, the calculation is 512 << 7.
      
      After:
      client   --- SYN: rwindow=65535 ---> server
      client   <--- SYN+ACK: rwindow=65535 ----  server
      client   --- ACK: rwindow=175232 ---> server
      Note: I use the following command to make it work:
      ip route change default via [ip] dev eth0 metric 100 initrwnd 120
      For the last ACK, the calculation is 1369 << 7.
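
      The arithmetic in the two notes above can be checked with a short
      sketch. It assumes a window scale of 7, taken from the "<< 7" in the
      notes; the function names are hypothetical, and the sketch only models
      the 16-bit window field, not the kernel's window selection logic.

```python
# Arithmetic sketch for the example above. WSCALE = 7 is an assumption taken
# from the "<< 7" in the notes; all function names are hypothetical.

U16_MAX = 65535
WSCALE = 7


def syn_window_field(rcv_wnd):
    """SYN / SYN+ACK: unscaled field clamped to 16 bits, like the min() above."""
    return min(rcv_wnd, U16_MAX)


def scaled_window_field(rcv_wnd, wscale=WSCALE):
    """After the handshake: the field carries rcv_wnd shifted right by wscale."""
    return min(rcv_wnd >> wscale, U16_MAX)


def advertised_bytes(field, wscale=WSCALE):
    """Window the peer reconstructs from the 16-bit field."""
    return field << wscale


before_field = scaled_window_field(65536)   # last ACK without the patch
after_field = scaled_window_field(175232)   # last ACK with initrwnd 120
syn_field = syn_window_field(175232)        # SYN is still clamped
```

      Under these assumptions the SYN field stays at 65535 in both cases,
      while the last ACK carries 512 (65536 bytes) before the patch and 1369
      (175232 bytes) after it.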
      
      With this patch applied, a user who tweaks this knob gets a larger
      rcv_wnd, which can help transfer data more rapidly and save some RTTs.
      
      Fixes: a337531b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
      Signed-off-by: Jason Xing <kernelxing@tencent.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20240521134220.12510-1-kerneljasonxing@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      378979e9
    • Romain Gantois's avatar
      net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe() · b31c7e78
      Romain Gantois authored
      In the prueth_probe() function, if one of the calls to emac_phy_connect()
      fails due to of_phy_connect() returning NULL, then the subsequent call to
      phy_attached_info() will dereference a NULL pointer.
      
      Check the return code of emac_phy_connect and fail cleanly if there is an
      error.
      
      Fixes: 128d5874 ("net: ti: icssg-prueth: Add ICSSG ethernet driver")
      Cc: stable@vger.kernel.org
      Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
      Link: https://lore.kernel.org/r/20240521-icssg-prueth-fix-v1-1-b4b17b1433e9@bootlin.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      b31c7e78
    • Dae R. Jeong's avatar
      tls: fix missing memory barrier in tls_init · 91e61dd7
      Dae R. Jeong authored
      In tls_init(), a write memory barrier is missing, and store-store
      reordering may cause NULL dereference in tls_{setsockopt,getsockopt}.
      
      CPU0                               CPU1
      -----                              -----
      // In tls_init()
      // In tls_ctx_create()
      ctx = kzalloc()
      ctx->sk_proto = READ_ONCE(sk->sk_prot) -(1)
      
      // In update_sk_prot()
      WRITE_ONCE(sk->sk_prot, tls_prots)     -(2)
      
                                         // In sock_common_setsockopt()
                                         READ_ONCE(sk->sk_prot)->setsockopt()
      
                                         // In tls_{setsockopt,getsockopt}()
                                         ctx->sk_proto->setsockopt()    -(3)
      
      In the above scenario, when (1) and (2) are reordered, (3) can observe
      the NULL value of ctx->sk_proto, causing NULL dereference.
      
      To fix it, we rely on rcu_assign_pointer(), which implies release
      barrier semantics. By moving rcu_assign_pointer() after ctx->sk_proto is
      initialized, we can ensure that ctx->sk_proto is visible when
      sk->sk_prot is changed.
      
      Fixes: d5bee737 ("net/tls: Annotate access to sk_prot with READ_ONCE/WRITE_ONCE")
      Signed-off-by: Yewon Choi <woni9911@gmail.com>
      Signed-off-by: Dae R. Jeong <threeearcat@gmail.com>
      Link: https://lore.kernel.org/netdev/ZU4OJG56g2V9z_H7@dragonet/T/
      Link: https://lore.kernel.org/r/Zkx4vjSFp0mfpjQ2@libra05
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      91e61dd7
    • Wei Fang's avatar
      net: fec: avoid lock evasion when reading pps_enable · 3b1c92f8
      Wei Fang authored
      The assignment of pps_enable is protected by tmreg_lock, but the read
      of pps_enable is not. The Coverity tool therefore reports a lock evasion
      warning, since a data race may occur in a multithreaded environment.
      Although this issue is very unlikely to occur in practice, it is better
      to fix it: the code becomes more logically consistent, and Coverity
      stops issuing the warning.
      
      Fixes: 278d2404 ("net: fec: ptp: Enable PPS output based on ptp clock")
      Signed-off-by: Wei Fang <wei.fang@nxp.com>
      Link: https://lore.kernel.org/r/20240521023800.17102-1-wei.fang@nxp.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      3b1c92f8
    • Jacob Keller's avatar
      Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI" · b35b1c0b
      Jacob Keller authored
      This reverts commit 56573604.
      
      According to the commit, it implements a manual AN-37 for some
      "troublesome" Juniper MX5 switches. This appears to be a workaround for a
      particular switch.
      
      It has been reported that this causes a severe breakage for other switches,
      including a Cisco 3560CX-12PD-S.
      
      The code appears to be a workaround for a specific switch which fails to
      link in SFI mode. It expects to see AN-37 auto negotiation in order to
      link. The Cisco switch is not expecting AN-37 auto negotiation. When the
      device starts the manual AN-37, the Cisco switch decides that the port is
      confused and stops attempting to link with it. This persists until a power
      cycle. A simple driver unload and reload does not resolve the issue, even
      if loading with a version of the driver which lacks this workaround.
      
      The authors of the workaround commit have not responded with
      clarifications, and the result of the workaround is complete failure to
      connect with other switches.
      
      This appears to be a case where the driver can either "correctly" link with
      the Juniper MX5 switch, at the cost of bricking the link with the Cisco
      switch, or it can behave properly for the Cisco switch, but fail to link
      with the Juniper MX5 switch. I do not know enough about the standards
      involved to clearly determine whether either switch is at fault or behaving
      incorrectly. Nor do I know whether there exists some alternative fix which
      corrects behavior with both switches.
      
      Revert the workaround for the Juniper switch.
      
      Fixes: 56573604 ("ixgbe: Manual AN-37 for troublesome link partners for X550 SFI")
      Link: https://lore.kernel.org/netdev/cbe874db-9ac9-42b8-afa0-88ea910e1e99@intel.com/T/
      Link: https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.135129/#post-612291
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Cc: Jeff Daly <jeffd@silicom-usa.com>
      Cc: kernel.org-fo5k2w@ycharbi.fr
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240520-net-2024-05-20-revert-silicom-switch-workaround-v1-1-50f80f261c94@intel.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      b35b1c0b
    • Joe Damato's avatar
      testing: net-drv: use stats64 for testing · a61a459f
      Joe Damato authored
      When testing a network device with large byte/packet counts, the values
      in the 'stats' structure may overflow. Using stats64 when comparing
      fixes this problem.
      
      I tripped on this while iterating on a qstats patch for mlx5. See below
      for confirmation without my added code that this is a bug.
      
      Before this patch (with added debugging output):
      
      $ NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
      KTAP version 1
      1..4
      ok 1 stats.check_pause
      ok 2 stats.check_fec
      rstat: 481708634 qstat: 666201639514 key: tx-bytes
      not ok 3 stats.pkt_byte_sum
      ok 4 stats.qstat_by_ifindex
      
      Note the huge delta above ^^^ in the rtnl vs qstats.
      
      After this patch:
      
      $ NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
      KTAP version 1
      1..4
      ok 1 stats.check_pause
      ok 2 stats.check_fec
      ok 3 stats.pkt_byte_sum
      ok 4 stats.qstat_by_ifindex
      
      It looks like rtnl_fill_stats in net/core/rtnetlink.c will attempt to
      copy the 64bit stats into a 32bit structure which is probably why this
      behavior is occurring.
      
      To show this is happening, you can get the underlying stats that the
      stats.py test uses like this:
      
      $ ./cli.py --spec ../../../Documentation/netlink/specs/rt_link.yaml \
                 --do getlink --json '{"ifi-index": 7}'
      
      And examine the output (heavily snipped to show relevant fields):
      
       'stats': {
                 'multicast': 3739197,
                 'rx-bytes': 1201525399,
                 'rx-packets': 56807158,
                 'tx-bytes': 492404458,
                 'tx-packets': 1200285371,
      
       'stats64': {
                   'multicast': 3739197,
                   'rx-bytes': 35561263767,
                   'rx-packets': 56807158,
                   'tx-bytes': 666212335338,
                   'tx-packets': 1200285371,
      
      The stats.py test prior to this patch was using the 'stats' structure
      above, which matches the failure output on my system.
      
      Comparing side by side, rx-bytes and tx-bytes, and getting ethtool -S
      output:
      
      rx-bytes stats:    1201525399
      rx-bytes stats64: 35561263767
      rx-bytes ethtool: 36203402638
      
      tx-bytes stats:      492404458
      tx-bytes stats64: 666212335338
      tx-bytes ethtool: 666215360113
      
      Note that the above was taken from a system with an mlx5 NIC, which only
      exposes ndo_get_stats64.
      
      Based on the ethtool output and qstat output, it appears that stats.py
      should be updated to use the 'stats64' structure for accurate
      comparisons when packet/byte counters get very large.
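
      The byte counters quoted above are consistent with the 'stats' values
      being the low 32 bits of the 'stats64' values. A minimal sketch of that
      arithmetic, using only the numbers from the output above (the helper
      name truncate_u32 is hypothetical, and this does not model
      rtnl_fill_stats itself):

```python
# Arithmetic check on the counters quoted above. truncate_u32 is a
# hypothetical helper; it only models keeping the low 32 bits of a value.

def truncate_u32(value):
    """Keep the low 32 bits, as a 64-bit to 32-bit copy would."""
    return value & 0xFFFFFFFF


stats64 = {"rx-bytes": 35561263767, "tx-bytes": 666212335338}
stats32 = {key: truncate_u32(val) for key, val in stats64.items()}

for key in stats64:
    print(f"{key}: stats64={stats64[key]} low32={stats32[key]}")
```

      The low 32 bits come out as 1201525399 for rx-bytes and 492404458 for
      tx-bytes, the same values the 'stats' structure reports.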
      
      To confirm that this was not related to the qstats code I was iterating
      on, I booted a kernel without my driver changes and re-ran the test
      which shows the qstats are skipped (as they don't exist for mlx5):
      
      NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
      KTAP version 1
      1..4
      ok 1 stats.check_pause
      ok 2 stats.check_fec
      ok 3 stats.pkt_byte_sum # SKIP qstats not supported by the device
      ok 4 stats.qstat_by_ifindex # SKIP No ifindex supports qstats
      
      But, fetching the stats using the CLI
      
      $ ./cli.py --spec ../../../Documentation/netlink/specs/rt_link.yaml \
                 --do getlink --json '{"ifi-index": 7}'
      
      Shows the same issue (heavily snipped for relevant fields only):
      
       'stats': {
                 'multicast': 105489,
                 'rx-bytes': 530879526,
                 'rx-packets': 751415,
                 'tx-bytes': 2510191396,
                 'tx-packets': 27700323,
       'stats64': {
                   'multicast': 105489,
                   'rx-bytes': 530879526,
                   'rx-packets': 751415,
                   'tx-bytes': 15395093284,
                   'tx-packets': 27700323,
      
      Comparing side by side with ethtool -S on the unmodified mlx5 driver:
      
      tx-bytes stats:    2510191396
      tx-bytes stats64: 15395093284
      tx-bytes ethtool: 17718435810
      
      Fixes: f0e6c86e ("testing: net-drv: add a driver test for stats reporting")
      Signed-off-by: Joe Damato <jdamato@fastly.com>
      Link: https://lore.kernel.org/r/20240520235850.190041-1-jdamato@fastly.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      a61a459f
    • Linus Torvalds's avatar
      Merge tag 'mm-nonmm-stable-2024-05-22-17-30' of... · c760b372
      Linus Torvalds authored
      Merge tag 'mm-nonmm-stable-2024-05-22-17-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
      
      Pull more non-mm updates from Andrew Morton:
      
       - A series ("kbuild: enable more warnings by default") from Arnd
         Bergmann which enables a number of additional build-time warnings. We
         fixed all the fallout which we could find, there may still be a few
         stragglers.
      
       - Samuel Holland has developed the series "Unified cross-architecture
         kernel-mode FPU API". This does a lot of consolidation of
         per-architecture kernel-mode FPU usage and enables the use of newer
         AMD GPUs on RISC-V.
      
       - Tao Su has fixed some selftests build warnings in the series
         "Selftests: Fix compilation warnings due to missing _GNU_SOURCE
         definition".
      
       - This pull also includes a nilfs2 fixup from Ryusuke Konishi.
      
      * tag 'mm-nonmm-stable-2024-05-22-17-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
        nilfs2: make block erasure safe in nilfs_finish_roll_forward()
        selftests/harness: use 1024 in place of LINE_MAX
        Revert "selftests/harness: remove use of LINE_MAX"
        selftests/fpu: allow building on other architectures
        selftests/fpu: move FP code to a separate translation unit
        drm/amd/display: use ARCH_HAS_KERNEL_FPU_SUPPORT
        drm/amd/display: only use hard-float, not altivec on powerpc
        riscv: add support for kernel-mode FPU
        x86: implement ARCH_HAS_KERNEL_FPU_SUPPORT
        powerpc: implement ARCH_HAS_KERNEL_FPU_SUPPORT
        LoongArch: implement ARCH_HAS_KERNEL_FPU_SUPPORT
        lib/raid6: use CC_FLAGS_FPU for NEON CFLAGS
        arm64: crypto: use CC_FLAGS_FPU for NEON CFLAGS
        arm64: implement ARCH_HAS_KERNEL_FPU_SUPPORT
        ARM: crypto: use CC_FLAGS_FPU for NEON CFLAGS
        ARM: implement ARCH_HAS_KERNEL_FPU_SUPPORT
        arch: add ARCH_HAS_KERNEL_FPU_SUPPORT
        x86/fpu: fix asm/fpu/types.h include guard
        kbuild: enable -Wcast-function-type-strict unconditionally
        kbuild: enable -Wformat-truncation on clang
        ...
      c760b372
    • Linus Torvalds's avatar
      Merge tag 'mm-stable-2024-05-22-17-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm · 5c6f4d68
      Linus Torvalds authored
      Pull more mm updates from Andrew Morton:
       "A series from Dave Chinner which cleans up and fixes the handling of
        nested allocations within stackdepot and page-owner"
      
      * tag 'mm-stable-2024-05-22-17-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        mm/page-owner: use gfp_nested_mask() instead of open coded masking
        stackdepot: use gfp_nested_mask() instead of open coded masking
        mm: lift gfp_kmemleak_mask() to gfp.h
      5c6f4d68
    • Steven Rostedt (Google)'s avatar
      tracing/treewide: Remove second parameter of __assign_str() · 2c92ca84
      Steven Rostedt (Google) authored
      
      
      With the rework of how __string() handles dynamic strings, where it
      saves off the source string in a field in the helper structure [1], the
      value to assign to the trace event field is already stored in the helper
      structure and does not need to be passed in again.
      
      This means that with:
      
        __string(field, mystring)
      
      Which used to be assigned with __assign_str(field, mystring), the second
      parameter is no longer needed and is unused. With this, __assign_str()
      will now only get a single parameter.
      
      There's over 700 users of __assign_str() and because coccinelle does not
      handle the TRACE_EVENT() macro I ended up using the following sed script:
      
        git grep -l __assign_str | while read a ; do
            sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
            mv /tmp/test-file $a;
        done
      
      I then searched for __assign_str() calls that did not end with ';', as those
      were multi-line assignments that the sed script above would fail to catch.
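
The substitution and the follow-up search described above can be sketched in Python. This is an illustrative translation of the quoted sed expression, not the script the author ran: the names PATTERN, drop_second_arg and convert are hypothetical, and the sketch works on an in-memory string rather than on the files listed by git grep.

```python
import re

# Python translation of the sed expression quoted in the commit message.
# In sed's basic regex syntax "\(" opens a group and a bare "(" is literal;
# Python reverses that, so the escapes are swapped here.
PATTERN = re.compile(r"(__assign_str\([^,]*[^ ,]) *,[^;]*")


def drop_second_arg(line: str) -> str:
    """Rewrite the first __assign_str(dst, src) on one line to __assign_str(dst)."""
    # sed without the "g" flag replaces only the first match per line.
    return PATTERN.sub(r"\1)", line, count=1)


def convert(source: str) -> tuple[str, list[int]]:
    """Return the converted text and 1-based line numbers needing manual review.

    A line is flagged when it mentions __assign_str but does not end with ';',
    which mirrors the follow-up search described in the commit message.
    """
    out, flagged = [], []
    for number, line in enumerate(source.splitlines(), start=1):
        new = drop_second_arg(line)
        if "__assign_str" in new and not new.rstrip().endswith(";"):
            flagged.append(number)
        out.append(new)
    return "\n".join(out), flagged
```

Because the pattern requires "(" directly after __assign_str, variants such as __assign_str_len() are left untouched, which is consistent with the note that they need their own update.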
      
      Note, the same updates will need to be done for:
      
        __assign_str_len()
        __assign_rel_str()
        __assign_rel_str_len()
      
      I tested this with both an allmodconfig and an allyesconfig (build only for both).
      
      [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Acked-by: Jani Nikula <jani.nikula@intel.com>
      Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
      Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
      Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
      Acked-by: Takashi Iwai <tiwai@suse.de>
      Acked-by: Darrick J. Wong <djwong@kernel.org>	# xfs
      Tested-by: Guenter Roeck <linux@roeck-us.net>
    • mm: simplify and improve print_vma_addr() output · de7e71ef
      Linus Torvalds authored
      
      
      Use '%pD' to print out the filename, and print out the actual offset
      within the file too, rather than just what the virtual address of the
      mapping is (which doesn't tell you anything about any mapping offsets).
      
      Also, use the exact vma_lookup() instead of find_vma() - the latter
      looks up any vma _after_ the address, which is of questionable value
      (yes, maybe you fell off the beginning, but you'd be more likely to fall
      off the end).
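
The difference between the two lookups can be illustrated with a toy model in Python. This is a sketch of the semantics only, not kernel code: the VMAS table and its addresses are made up, and the two functions merely borrow the kernel names to show which range each lookup would return.

```python
from bisect import bisect_right

# Toy VMAs as half-open [start, end) ranges, sorted and non-overlapping.
# The addresses are invented for illustration.
VMAS = [(0x1000, 0x3000), (0x8000, 0x9000)]


def find_vma(addr):
    """Return the first range whose end is above addr, or None.

    The returned range may start above addr, i.e. it need not contain it.
    """
    ends = [end for _, end in VMAS]
    i = bisect_right(ends, addr)
    return VMAS[i] if i < len(VMAS) else None


def vma_lookup(addr):
    """Return the range that actually contains addr, or None."""
    vma = find_vma(addr)
    if vma is not None and vma[0] <= addr:
        return vma
    return None
```

For an address in the gap between the two ranges, such as 0x5000, find_vma() in this model returns the range that starts at 0x8000, while vma_lookup() returns None.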
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge local branch 'x86-codegen' · f8a6e48c
      Linus Torvalds authored
      Merge trivial x86 code generation annoyances
      
       - Introduce helper macros for clang asm input problems
      
       - use said macros to improve trivially stupid code generation issues in
         bitops and array_index_mask_nospec
      
       - also improve codegen with 32-bit array index comparisons
      
      None of these really matter, but I look at code generation and profiles
      fairly regularly, and these misfeatures caused the generated code to
      look really odd and distract from the real issues.
      
      * branch 'x86-codegen' of local tree:
        x86: improve bitop code generation with clang
        x86: improve array_index_mask_nospec() code generation
        clang: work around asm input constraint problems