  1. Jul 29, 2022
    • net: devlink: remove region snapshots list dependency on devlink->lock · 2dec18ad
      Jiri Pirko authored
      
      
      After the mlx4 driver is converted to do a locked reload,
      devlink_region_snapshot_create() may be called from both locked and
      unlocked contexts.
      
      Note that in mlx4 region snapshots could be created on any command
      failure. That can happen in any flow that involves commands to FW,
      which means most of the driver flows.
      
      So resolve this by removing the dependency on devlink->lock for region
      snapshot list consistency and introducing a new mutex to ensure it.
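
      A minimal userspace sketch of the idea, assuming POSIX threads in
      place of kernel mutexes: the snapshot bookkeeping gets its own lock,
      so callers do not need to hold any outer lock. struct region, its
      fields and region_snapshot_create() are hypothetical stand-ins, not
      the devlink code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a region; not the kernel structure. */
struct region {
	pthread_mutex_t snapshot_lock;	/* protects snapshot_count only */
	size_t snapshot_count;
	size_t max_snapshots;
};

void region_init(struct region *r, size_t max_snapshots)
{
	assert(max_snapshots > 0);
	pthread_mutex_init(&r->snapshot_lock, NULL);
	r->snapshot_count = 0;
	r->max_snapshots = max_snapshots;
}

/* Safe to call whether or not the caller holds some outer lock,
 * because consistency relies on the dedicated mutex alone.
 * Returns 0 on success, -1 when the region is full. */
int region_snapshot_create(struct region *r)
{
	int err = 0;

	pthread_mutex_lock(&r->snapshot_lock);
	if (r->snapshot_count == r->max_snapshots)
		err = -1;
	else
		r->snapshot_count++;
	pthread_mutex_unlock(&r->snapshot_lock);

	return err;
}
```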
      
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2dec18ad
    • net: devlink: remove region snapshot ID tracking dependency on devlink->lock · 5502e871
      Jiri Pirko authored
      
      
      After the mlx4 driver is converted to do a locked reload, the functions
      to get/put a region snapshot ID may be called from both locked and
      unlocked contexts.
      
      So resolve this by removing the dependency on devlink->lock for region
      snapshot ID tracking, using the internal xa_lock() to maintain
      snapshot_ids xarray consistency.
      
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5502e871
    • Merge branch 'add-framework-for-selftests-in-devlink' · 1515a1b8
      Jakub Kicinski authored
      
      
      Vikas Gupta says:
      
      ====================
      add framework for selftests in devlink
      
      Add support for selftests in the devlink framework.
      Adds the callbacks .selftest_check and .selftest_run to devlink_ops.
      The user can pass a set of tests, which is then handed to the driver,
      and the driver can opt to run particular tests based on its capabilities.
      
      Patchset adds a flash based test for the bnxt_en driver.
      ====================
      
      Link: https://lore.kernel.org/r/20220727165721.37959-1-vikas.gupta@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1515a1b8
    • bnxt_en: implement callbacks for devlink selftests · 5b6ff128
      vikas authored
      
      
      Add callbacks
      =============
      .selftest_check: returns true for flash selftest.
      .selftest_run: runs a flash selftest.
      
      Also, refactor the NVM APIs so that they can be
      used with both devlink and ethtool.
      
      Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5b6ff128
    • devlink: introduce framework for selftests · 08f588fa
      Vikas Gupta authored
      
      
      Add a framework for running selftests.
      The framework exposes devlink commands and test suite(s) that let the
      user query the tests supported by the driver and execute them.
      
      Below are new entries in devlink_nl_ops
      devlink_nl_cmd_selftests_show_doit/dumpit: To query the supported
      selftests by the drivers.
      devlink_nl_cmd_selftests_run: To execute selftests. Users can
      provide a test mask for executing group tests or standalone tests.
      
      The Documentation/networking/devlink/ path is already part of MAINTAINERS
      and the new files come under this path. Hence no update to MAINTAINERS
      is needed.
      
      Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      08f588fa
    • Merge branch 'mlx5e-use-tls-tx-pool-to-improve-connection-rate' · 68be7b82
      Jakub Kicinski authored
      
      
      Tariq Toukan says:
      
      ====================
      mlx5e use TLS TX pool to improve connection rate
      
      To offload encryption operations, the mlx5 device maintains state and
      keeps track of every kTLS device-offloaded connection.  Two HW objects
      are used per TX context of a kTLS offloaded connection: a. Transport
      interface send (TIS) object, to reach the HW context.  b. Data Encryption
      Key (DEK) to perform the crypto operations.
      
      These two objects are created and destroyed per TLS TX context, via FW
      commands.  In total, 4 FW commands are issued per TLS TX context, which
      seriously limits the connection rate.
      
      In this series, we aim to save the creation and destruction of TIS
      objects by recycling them.  Upon recycling of a TIS, the HW still needs
      to be notified of the re-mapping between a TIS and a context. This is
      done by posting WQEs via an SQ, a significantly faster API than the FW
      command interface.
      
      A pool is used for recycling. The pool dynamically reacts to the load
      and connection rate, growing and shrinking accordingly.
      
      Saving the TIS FW commands per context increases connection rate by ~42%,
      from 11.6K to 16.5K connections per sec.
      
      Connection rate is still limited by FW bottleneck due to the remaining
      per context FW commands (DEK create/destroy). This will soon be addressed
      in a followup series.  By combining the two series, the FW bottleneck
      will be released, and a significantly higher (about 100K connections per
      sec) kTLS TX device-offloaded connection rate is reached.
      ====================
      
      Link: https://lore.kernel.org/r/20220727094346.10540-1-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      68be7b82
    • net/mlx5e: kTLS, Dynamically re-size TX recycling pool · 624bf099
      Tariq Toukan authored
      
      
      Let the TLS TX recycle pool be more flexible in size, by continuously
      and dynamically allocating and releasing HW resources in response to
      changes in the connections rate and load.
      
      Allocate and release pool entries in bulks (16). Use a workqueue to
      release/allocate in the background. Allocate a new bulk when the pool
      size goes lower than the low threshold (1K). Symmetric operation is done
      when the pool size gets greater than the upper threshold (4K).
      
      Every idle pool entry holds: 1 TIS, 1 DEK (HW resources), in addition to
      ~100 bytes in host memory.
      
      Start with an empty pool to minimize memory and HW resources waste for
      non-TLS users that have the device-offload TLS enabled.
      
      Upon a new request, in case the pool is empty, do not wait for a whole bulk
      allocation to complete.  Instead, trigger an instant allocation of a single
      resource to reduce latency.
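
      The sizing policy described above can be sketched as a pure decision
      function. This is only an illustration under stated assumptions: the
      names POOL_BULK, POOL_LOW, POOL_HIGH and pool_decide() are
      hypothetical, "1K" and "4K" are taken to mean 1024 and 4096, and the
      real driver does the work asynchronously on a workqueue rather than
      returning an action.

```c
#include <assert.h>

/* Assumed constants; the text gives a bulk of 16 and 1K/4K thresholds. */
enum { POOL_BULK = 16, POOL_LOW = 1024, POOL_HIGH = 4096 };

static_assert(POOL_LOW < POOL_HIGH, "thresholds must be ordered");

enum pool_action {
	POOL_NONE,		/* size is within [POOL_LOW, POOL_HIGH] */
	POOL_ALLOC_BULK,	/* allocate POOL_BULK entries in the background */
	POOL_RELEASE_BULK,	/* release POOL_BULK entries in the background */
	POOL_ALLOC_ONE,		/* allocate a single entry right away */
};

/* Decide what to do for a pool holding 'size' idle entries.
 * 'need_entry_now' is nonzero when a new connection is waiting. */
enum pool_action pool_decide(unsigned int size, int need_entry_now)
{
	/* Empty pool and a waiting request: do not wait for a whole bulk. */
	if (size == 0 && need_entry_now)
		return POOL_ALLOC_ONE;
	if (size < POOL_LOW)
		return POOL_ALLOC_BULK;
	if (size > POOL_HIGH)
		return POOL_RELEASE_BULK;
	return POOL_NONE;
}
```

      Starting from an empty pool, the first request gets a single instant
      allocation, and later calls keep refilling in bulks until the low
      threshold is reached.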
      
      Performance tests:
      Before: 11,684 CPS
      After:  16,556 CPS
      
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      624bf099
    • net/mlx5e: kTLS, Recycle objects of device-offloaded TLS TX connections · c4dfe704
      Tariq Toukan authored
      
      
      The transport interface send (TIS) object is responsible for performing
      all transport related operations of the transmit side.  The ConnectX HW
      uses a TIS object to save and access the TLS crypto information and state
      of an offloaded TX kTLS connection.
      
      Before this patch, we used to create a new TIS per connection and destroy
      it once it’s closed. Every create and destroy of a TIS is a FW command.
      
      Same applies for the private TLS context, where we used to dynamically
      allocate and free it per connection.
      
      Resource recycling reduces the impact of the allocation/free operations
      and helps speed up the connection rate.
      
      In this feature we maintain a pool of TX objects and use it to recycle
      the resources instead of re-creating them per connection.
      
      A cached TIS popped from the pool is updated to serve the new connection
      via the fast-path HW interface, updating the tls static and progress
      params. This is a very fast operation, significantly faster than FW
      commands.
      
      On recycling, a WQE fence is required after the context params change.
      This guarantees that the data is sent after the context has been
      successfully updated in hardware, and that the context modification
      doesn't interfere with existing traffic.
      
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c4dfe704
    • net/mlx5e: kTLS, Take stats out of OOO handler · 23b1cf1e
      Tariq Toukan authored
      
      
      Let the caller of mlx5e_ktls_tx_handle_ooo() take care of updating the
      stats, according to the returned value.  As the switch/case blocks are
      already there, this change saves unnecessary branches in the handler.
      
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      23b1cf1e
    • net/mlx5e: kTLS, Introduce TLS-specific create TIS · da6682fa
      Tariq Toukan authored
      
      
      TLS TIS objects have a defined role in mapping and reaching the HW TLS
      contexts.  Some standard TIS attributes (like LAG port affinity) are
      not relevant for them.
      
      Use a dedicated TLS TIS create function instead of the generic
      mlx5e_create_tis.
      
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      da6682fa
    • net/tls: Multi-threaded calls to TX tls_dev_del · 7adc91e0
      Tariq Toukan authored
      
      
      Multiple TLS device-offloaded contexts can be added in parallel via
      concurrent calls to .tls_dev_add, while calls to .tls_dev_del are
      sequential in tls_device_gc_task.
      
      This is not a sustainable behavior. This creates a rate gap between add
      and del operations (addition rate outperforms the deletion rate).  When
      running for enough time, the TLS device resources could get exhausted,
      failing to offload new connections.
      
      Replace the single-threaded garbage collector work with a per-context
      alternative, so they can be handled on several cores in parallel. Use
      a new dedicated destruct workqueue for this.
      
      Tested with mlx5 device:
      Before: 22141 add/sec,   103 del/sec
      After:  11684 add/sec, 11684 del/sec
      
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7adc91e0
    • net/tls: Perform immediate device ctx cleanup when possible · 113671b2
      Tariq Toukan authored
      
      
      TLS context destructor can be run in atomic context. Cleanup operations
      for device-offloaded contexts could require access and interaction with
      the device callbacks, which might sleep. Hence, the cleanup of such
      contexts must be deferred and completed inside an async work.
      
      For all others, this is not necessary, as cleanup is atomic. Invoke
      cleanup immediately for them, avoiding queueing redundant gc work.
      
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      113671b2
    • tls: rx: Fix unsigned comparison with less than zero · 8fd1e151
      Yang Li authored
      
      
      tls_rx_msg_size() returns an int, which can be a negative error
      code, but the result was being assigned to the unsigned long
      variable 'sz'. Make 'sz' an int.
      
      Eliminate the following coccicheck warning:
      ./net/tls/tls_strp.c:211:6-8: WARNING: Unsigned expression compared with zero: sz < 0
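
      A small standalone sketch of why the check never fires, not taken
      from tls_strp.c: the helper names size_as_unsigned(), size_checked()
      and size_selfcheck() are hypothetical. Storing a negative error code
      in an unsigned variable wraps it to a large positive value, so only
      a signed variable lets the "< 0" test see the error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Buggy pattern: a negative error code stored in an unsigned variable
 * wraps to a huge positive value, so a later "sz < 0" test on it can
 * never be true. */
unsigned long size_as_unsigned(int ret)
{
	unsigned long sz = (unsigned long)ret;

	return sz;
}

/* Fixed pattern: keep the result in an int so the sign survives. */
int size_checked(int ret, size_t *len)
{
	int sz = ret;

	if (sz < 0)
		return sz;	/* propagate the error code */
	*len = (size_t)sz;
	return 0;
}

/* Self-check of both patterns. */
void size_selfcheck(void)
{
	size_t len = 0;

	assert(size_as_unsigned(-EINVAL) > 0);	/* error looks like a size */
	assert(size_checked(-EINVAL, &len) == -EINVAL);
	assert(size_checked(64, &len) == 0 && len == 64);
}
```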
      
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20220728031019.32838-1-yang.lee@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8fd1e151
    • Merge branch 'tls-rx-follow-ups-to-rx-work' · 37e26188
      Jakub Kicinski authored
      
      
      Jakub Kicinski says:
      
      ====================
      tls: rx: follow ups to rx work
      
      A selection of unrelated changes. First some selftest polishing.
      Next a change to rcvtimeo handling for locking based on an exchange
      with Eric. Follow up to Paolo's comments from yesterday. Last but
      not least a fix to a false positive warning, turns out I've been
      testing with DEBUG_NET=n this whole time.
      ====================
      
      Link: https://lore.kernel.org/r/20220727031524.358216-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      37e26188
    • tls: rx: fix the false positive warning · e20691fa
      Jakub Kicinski authored
      I went too far in the accessor conversion, we can't use tls_strp_msg()
      after decryption because the message may not be ready. What we care
      about on this path is that the output skb is detached, i.e. we didn't
      somehow just turn around and use the input skb with its TCP data
      still attached. So look at the anchor directly.
      
      Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e20691fa
    • tls: strp: rename and multithread the workqueue · d11ef9cc
      Jakub Kicinski authored
      
      
      Paolo points out that there seems to be no strong reason strparser
      uses a single threaded workqueue. Perhaps there were some performance
      or pinning considerations? Since we don't know (and it's the slow path)
      let's default to the most natural, multi-threaded choice.
      
      Also rename the workqueue to "tls-".
      
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d11ef9cc
    • tls: rx: don't consider sock_rcvtimeo() cumulative · 70f03fc2
      Jakub Kicinski authored
      
      
      Eric indicates that restarting rcvtimeo on every wait may be fine.
      I thought that we should consider it cumulative, and made
      tls_rx_reader_lock() return the remaining timeo after acquiring
      the reader lock.
      
      tls_rx_rec_wait() gets its timeout passed in by value so it
      does not keep track of time previously spent.
      
      Make the lock waiting consistent with tls_rx_rec_wait() - don't
      keep track of time spent.
      
      Read the timeo fresh in tls_rx_rec_wait().
      It's unclear to me why callers are supposed to cache the value.
      
      Link: https://lore.kernel.org/all/CANn89iKcmSfWgvZjzNGbsrndmCch2HC_EPZ7qmGboDNaWoviNQ@mail.gmail.com/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      70f03fc2
    • selftests: tls: handful of memrnd() and length checks · 86c591fb
      Jakub Kicinski authored
      
      
      Add a handful of memory randomizations and precise length checks.
      Nothing is really broken here, I did this to increase confidence
      when debugging. It does fix a GCC warning, tho. Apparently GCC
      recognizes that memory needs to be initialized for send() but
      does not recognize that for write().
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      86c591fb
    • net: usb: delete extra space and tab in blank line · efe3e6b5
      Xie Shaowen authored
      
      
      Delete extra spaces and tabs in blank lines. There is no functional change.
      
      Signed-off-by: Xie Shaowen <studentxswpy@163.com>
      Link: https://lore.kernel.org/r/20220727081253.3043941-1-studentxswpy@163.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      efe3e6b5
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 272ac32f
      Jakub Kicinski authored
      
      
      No conflicts.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      272ac32f
    • Merge tag 'net-5.19-final' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 33ea1340
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bluetooth and netfilter, no known blockers for
        the release.
      
        Current release - regressions:
      
         - wifi: mac80211: do not abuse fq.lock in ieee80211_do_stop(), fix
           taking the lock before it's initialized
      
         - Bluetooth: mgmt: fix double free on error path
      
        Current release - new code bugs:
      
         - eth: ice: fix tunnel checksum offload with fragmented traffic
      
        Previous releases - regressions:
      
         - tcp: md5: fix IPv4-mapped support after refactoring, don't take the
           pure v6 path
      
         - Revert "tcp: change pingpong threshold to 3", improving detection
           of interactive sessions
      
         - mld: fix netdev refcount leak in mld_{query | report}_work() due to
           a race
      
         - Bluetooth:
            - always set event mask on suspend, avoid early wake ups
            - L2CAP: fix use-after-free caused by l2cap_chan_put
      
         - bridge: do not send empty IFLA_AF_SPEC attribute
      
        Previous releases - always broken:
      
         - ping6: fix memleak in ipv6_renew_options()
      
         - sctp: prevent null-deref caused by over-eager error paths
      
         - virtio-net: fix the race between refill work and close, resulting
           in NAPI scheduled after close and a BUG()
      
         - macsec:
            - fix three netlink parsing bugs
            - avoid breaking the device state on invalid change requests
            - fix a memleak in another error path
      
        Misc:
      
         - dt-bindings: net: ethernet-controller: rework 'fixed-link' schema
      
         - two more batches of sysctl data race adornment"
      
      * tag 'net-5.19-final' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits)
        stmmac: dwmac-mediatek: fix resource leak in probe
        ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr
        net: ping6: Fix memleak in ipv6_renew_options().
        net/funeth: Fix fun_xdp_tx() and XDP packet reclaim
        sctp: leave the err path free in sctp_stream_init to sctp_stream_free
        sfc: disable softirqs for ptp TX
        ptp: ocp: Select CRC16 in the Kconfig.
        tcp: md5: fix IPv4-mapped support
        virtio-net: fix the race between refill work and close
        mptcp: Do not return EINPROGRESS when subflow creation succeeds
        Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put
        Bluetooth: Always set event mask on suspend
        Bluetooth: mgmt: Fix double free on error path
        wifi: mac80211: do not abuse fq.lock in ieee80211_do_stop()
        ice: do not setup vlan for loopback VSI
        ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
        ice: Fix VSIs unable to share unicast MAC
        ice: Fix tunnel checksum offload with fragmented traffic
        ice: Fix max VLANs available for VF
        netfilter: nft_queue: only allow supported familes and hooks
        ...
      33ea1340
    • stmmac: dwmac-mediatek: fix resource leak in probe · 4d3d3a1b
      Dan Carpenter authored
      If mediatek_dwmac_clks_config() fails, then call stmmac_remove_config_dt()
      before returning.  Otherwise it is a resource leak.
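
      The shape of such a fix can be sketched with the usual goto-based
      unwinding in a probe function. Everything here is a hypothetical
      stand-in (struct fake_dev, get_config(), remove_config(),
      clks_config(), probe()); it is not the dwmac-mediatek code, only the
      pattern of releasing what was acquired before a later step failed.

```c
#include <assert.h>

/* Hypothetical device state: counts held config references. */
struct fake_dev {
	int config_refs;
	int clks_ok;	/* nonzero if clock setup should succeed */
};

/* Acquire the configuration resource. */
int get_config(struct fake_dev *d)
{
	d->config_refs++;
	return 0;
}

/* Release the configuration resource. */
void remove_config(struct fake_dev *d)
{
	assert(d->config_refs > 0);
	d->config_refs--;
}

/* Clock setup step that may fail. */
int clks_config(struct fake_dev *d)
{
	return d->clks_ok ? 0 : -1;
}

int probe(struct fake_dev *d)
{
	int ret;

	ret = get_config(d);
	if (ret)
		return ret;

	ret = clks_config(d);
	if (ret)
		goto err_remove_config;	/* undo get_config() before returning */

	return 0;

err_remove_config:
	remove_config(d);
	return ret;
}
```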
      
      Fixes: fa4b3ca6 ("stmmac: dwmac-mediatek: fix clock issue")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Link: https://lore.kernel.org/r/YuJ4aZyMUlG6yGGa@kili
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4d3d3a1b
    • ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr · 85f0173d
      Ziyang Xuan authored
      Changing a net device's MTU to less than IPV6_MIN_MTU, or unregistering
      the device, while routes are being matched may, with some probability,
      trigger a null-ptr-deref on ip6_ptr, as follows.
      
      =========================================================
      BUG: KASAN: null-ptr-deref in find_match.part.0+0x70/0x134
      Read of size 4 at addr 0000000000000308 by task ping6/263
      
      CPU: 2 PID: 263 Comm: ping6 Not tainted 5.19.0-rc7+ #14
      Call trace:
       dump_backtrace+0x1a8/0x230
       show_stack+0x20/0x70
       dump_stack_lvl+0x68/0x84
       print_report+0xc4/0x120
       kasan_report+0x84/0x120
       __asan_load4+0x94/0xd0
       find_match.part.0+0x70/0x134
       __find_rr_leaf+0x408/0x470
       fib6_table_lookup+0x264/0x540
       ip6_pol_route+0xf4/0x260
       ip6_pol_route_output+0x58/0x70
       fib6_rule_lookup+0x1a8/0x330
       ip6_route_output_flags_noref+0xd8/0x1a0
       ip6_route_output_flags+0x58/0x160
       ip6_dst_lookup_tail+0x5b4/0x85c
       ip6_dst_lookup_flow+0x98/0x120
       rawv6_sendmsg+0x49c/0xc70
       inet_sendmsg+0x68/0x94
      
      Reproducer:
      First, prepare the conditions:
      $ip netns add ns1
      $ip netns add ns2
      $ip link add veth1 type veth peer name veth2
      $ip link set veth1 netns ns1
      $ip link set veth2 netns ns2
      $ip netns exec ns1 ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1
      $ip netns exec ns2 ip -6 addr add 2001:0db8:0:f101::2/64 dev veth2
      $ip netns exec ns1 ifconfig veth1 up
      $ip netns exec ns2 ifconfig veth2 up
      $ip netns exec ns1 ip -6 route add 2000::/64 dev veth1 metric 1
      $ip netns exec ns2 ip -6 route add 2001::/64 dev veth2 metric 1
      
      Secondly, execute the following two commands in two ssh windows
      respectively:
      $ip netns exec ns1 sh
      $while true; do ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1; ip -6 route add 2000::/64 dev veth1 metric 1; ping6 2000::2; done
      
      $ip netns exec ns1 sh
      $while true; do ip link set veth1 mtu 1000; ip link set veth1 mtu 1500; sleep 5; done
      
      This happens because addrconf_ifdown() first sets ip6_ptr to NULL, and
      then ip6_ignore_linkdown() accesses ip6_ptr directly without a NULL check.
      
      	cpu0			cpu1
      fib6_table_lookup
      __find_rr_leaf
      			addrconf_notify [ NETDEV_CHANGEMTU ]
      			addrconf_ifdown
      			RCU_INIT_POINTER(dev->ip6_ptr, NULL)
      find_match
      ip6_ignore_linkdown
      
      So we can add NULL check for ip6_ptr before using in ip6_ignore_linkdown() to
      fix the null-ptr-deref bug.
      
      Fixes: dcd1f572 ("net/ipv6: Remove fib6_idev")
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220728013307.656257-1-william.xuanziyang@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      85f0173d
    • net: ping6: Fix memleak in ipv6_renew_options(). · e2732600
      Kuniyuki Iwashima authored
      When we close ping6 sockets, some resources are left unfreed because
      pingv6_prot is missing sk->sk_prot->destroy().  As reported by
      syzbot [0], just three syscalls leak 96 bytes and easily cause OOM.
      
          struct ipv6_sr_hdr *hdr;
          char data[24] = {0};
          int fd;
      
          hdr = (struct ipv6_sr_hdr *)data;
          hdr->hdrlen = 2;
          hdr->type = IPV6_SRCRT_TYPE_4;
      
          fd = socket(AF_INET6, SOCK_DGRAM, NEXTHDR_ICMP);
          setsockopt(fd, IPPROTO_IPV6, IPV6_RTHDR, data, 24);
          close(fd);
      
      To fix memory leaks, let's add a destroy function.
      
      Note the socket() syscall checks if the GID is within the range of
      net.ipv4.ping_group_range.  The default value is [1, 0] so that no
      GID meets the condition (1 <= GID <= 0).  Thus, the local DoS does
      not succeed until we change the default value.  However, at least
      Ubuntu/Fedora/RHEL loosen it.
      
          $ cat /usr/lib/sysctl.d/50-default.conf
          ...
          -net.ipv4.ping_group_range = 0 2147483647
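
      The range condition above can be written out as a tiny check. This
      is an illustrative sketch, not the kernel's implementation;
      ping_gid_allowed() and ping_range_selfcheck() are hypothetical
      names. With the default range [1, 0] no GID can satisfy
      low <= GID <= high, while the loosened range admits any GID up to
      2147483647.

```c
#include <assert.h>
#include <stdbool.h>

/* True if 'gid' falls inside the inclusive range [low, high]. */
bool ping_gid_allowed(unsigned int gid, unsigned int low, unsigned int high)
{
	return low <= gid && gid <= high;
}

/* Self-check of the two configurations mentioned in the text. */
void ping_range_selfcheck(void)
{
	/* Default [1, 0]: the range is empty, so every GID is rejected. */
	assert(!ping_gid_allowed(0, 1, 0));
	assert(!ping_gid_allowed(1, 1, 0));
	assert(!ping_gid_allowed(1000, 1, 0));

	/* Loosened [0, 2147483647]: ordinary GIDs are accepted. */
	assert(ping_gid_allowed(0, 0, 2147483647u));
	assert(ping_gid_allowed(1000, 0, 2147483647u));
}
```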
      
      Also, there could be another path reported with these options, and
      some of them require CAP_NET_RAW.
      
        setsockopt
            IPV6_ADDRFORM (inet6_sk(sk)->pktoptions)
            IPV6_RECVPATHMTU (inet6_sk(sk)->rxpmtu)
            IPV6_HOPOPTS (inet6_sk(sk)->opt)
            IPV6_RTHDRDSTOPTS (inet6_sk(sk)->opt)
            IPV6_RTHDR (inet6_sk(sk)->opt)
            IPV6_DSTOPTS (inet6_sk(sk)->opt)
            IPV6_2292PKTOPTIONS (inet6_sk(sk)->opt)
      
        getsockopt
            IPV6_FLOWLABEL_MGR (inet6_sk(sk)->ipv6_fl_list)
      
      For the record, here is a splat different from syzbot's.
      
        unreferenced object 0xffff888006270c60 (size 96):
          comm "repro2", pid 231, jiffies 4294696626 (age 13.118s)
          hex dump (first 32 bytes):
            01 00 00 00 44 00 00 00 00 00 00 00 00 00 00 00  ....D...........
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          backtrace:
            [<00000000f6bc7ea9>] sock_kmalloc (net/core/sock.c:2564 net/core/sock.c:2554)
            [<000000006d699550>] do_ipv6_setsockopt.constprop.0 (net/ipv6/ipv6_sockglue.c:715)
            [<00000000c3c3b1f5>] ipv6_setsockopt (net/ipv6/ipv6_sockglue.c:1024)
            [<000000007096a025>] __sys_setsockopt (net/socket.c:2254)
            [<000000003a8ff47b>] __x64_sys_setsockopt (net/socket.c:2265 net/socket.c:2262 net/socket.c:2262)
            [<000000007c409dcb>] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
            [<00000000e939c4a9>] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
      
      [0]: https://syzkaller.appspot.com/bug?extid=a8430774139ec3ab7176
      
      Fixes: 6d0bfe22 ("net: ipv6: Add IPv6 support to the ping socket.")
      Reported-by: syzbot+a8430774139ec3ab7176@syzkaller.appspotmail.com
      Reported-by: Ayushman Dutta <ayudutta@amazon.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220728012220.46918-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e2732600
    • watch_queue: Fix missing locking in add_watch_to_object() · e64ab2db
      Linus Torvalds authored
      If a watch is being added to a queue, it needs to guard against
      interference from addition of a new watch, manual removal of a watch and
      removal of a watch due to some other queue being destroyed.
      
      KEYCTL_WATCH_KEY guards against this for the same {key,queue} pair by
      holding the key->sem writelocked and by holding refs on both the key and
      the queue - but that doesn't prevent interaction from other {key,queue}
      pairs.
      
      While add_watch_to_object() does take the spinlock on the event queue,
      it doesn't take the lock on the source's watch list.  The assumption was
      that the caller would prevent that (say by taking key->sem) - but that
      doesn't prevent interference from the destruction of another queue.
      
      Fix this by locking the watcher list in add_watch_to_object().
      
      Fixes: c73be61c ("pipe: Add general notification queue support")
      Reported-by: syzbot+03d7b43290037d1f87ca@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: keyrings@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e64ab2db
    • watch_queue: Fix missing rcu annotation · e0339f03
      David Howells authored
      Since __post_watch_notification() walks wlist->watchers with only the
      RCU read lock held, we need to use RCU methods to add to the list (we
      already use RCU methods to remove from the list).
      
      Fix add_watch_to_object() to use hlist_add_head_rcu() instead of
      hlist_add_head() for that list.
      
      Fixes: c73be61c ("pipe: Add general notification queue support")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0339f03
  2. Jul 28, 2022