  1. Nov 11, 2021
    • cxgb4: fix eeprom len when diagnostics not implemented · 4ca110bf
      Rahul Lakkireddy authored
      Ensure diagnostics monitoring support is implemented for the SFF 8472
      compliant port module and set the correct length for ethtool port
      module eeprom read.
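
      The length selection can be pictured with the user-space sketch below. It
      is not the driver code. It assumes the SFF-8472 layout in which byte 92 of
      the A0h page is the Diagnostic Monitoring Type and bit 6 flags that digital
      diagnostics are implemented, and it uses the 256 and 512 byte dump lengths
      that ethtool defines for SFF-8079 and SFF-8472. All macro and function
      names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SFF_DIAG_TYPE_OFFSET 92u	/* A0h byte 92: Diagnostic Monitoring Type */
#define SFF_DIAG_IMPLEMENTED (1u << 6)	/* digital diagnostics implemented */
#define SFF_8079_LEN 256u		/* A0h page only */
#define SFF_8472_LEN 512u		/* A0h page plus A2h diagnostics page */

/* Hypothetical helper: pick the dump length from the A0h page contents. */
static unsigned int sfp_eeprom_len(const uint8_t *a0_page, size_t len)
{
	assert(a0_page != NULL && len > SFF_DIAG_TYPE_OFFSET);

	if (a0_page[SFF_DIAG_TYPE_OFFSET] & SFF_DIAG_IMPLEMENTED)
		return SFF_8472_LEN;
	return SFF_8079_LEN;	/* no diagnostics page to read */
}
```

      A module that does not advertise diagnostics is reported with the shorter
      length, so a dump never asks for a page the module does not implement.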
      
      Fixes: f56ec676 ("cxgb4: Add support for ethtool i2c dump")
      Signed-off-by: Manoj Malviya <manojmalviya@chelsio.com>
      Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix premature exit from NAPI state polling in napi_disable() · 0315a075
      Alexander Lobakin authored
      Commit 719c5719 ("net: make napi_disable() symmetric with
      enable") accidentally introduced a bug sometimes leading to a kernel
      BUG when bringing an iface up/down under heavy traffic load.
      
      Prior to this commit, napi_disable() polled n->state until neither
      NAPIF_STATE_SCHED nor NAPIF_STATE_NPSVC was set and then always set
      both. Now it is possible to leave with NAPIF_STATE_SCHED unset,
      because 'continue' drops us to the cmpxchg() call with an
      uninitialized variable, rather than straight to another round of
      the state check.
      
      Error path looks like:
      
      napi_disable():
      unsigned long val, new; /* new is uninitialized */
      
      do {
      	val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
      				      NAPIF_STATE_SCHED is set */
      	if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
      		usleep_range(20, 200);
      		continue; /* go straight to the condition check */
      	}
      	new = val | <...>
      } while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
      						  writes garbage */
      
      napi_enable():
      do {
      	val = READ_ONCE(n->state);
      	BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
      <...>
      
      while the typical BUG splat is like:
      
      [  172.652461] ------------[ cut here ]------------
      [  172.652462] kernel BUG at net/core/dev.c:6937!
      [  172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [  172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G          I       5.15.0 #42
      [  172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
      [  172.680646] RIP: 0010:napi_enable+0x5a/0xd0
      [  172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
      [  172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
      [  172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
      < snip >
      [  172.782403] Call Trace:
      [  172.784857]  <TASK>
      [  172.786963]  ice_up_complete+0x6f/0x210 [ice]
      [  172.791349]  ice_xdp+0x136/0x320 [ice]
      [  172.795108]  ? ice_change_mtu+0x180/0x180 [ice]
      [  172.799648]  dev_xdp_install+0x61/0xe0
      [  172.803401]  dev_xdp_attach+0x1e0/0x550
      [  172.807240]  dev_change_xdp_fd+0x1e6/0x220
      [  172.811338]  do_setlink+0xee8/0x1010
      [  172.814917]  rtnl_setlink+0xe5/0x170
      [  172.818499]  ? bpf_lsm_binder_set_context_mgr+0x10/0x10
      [  172.823732]  ? security_capable+0x36/0x50
      < snip >
      
      Fix this by replacing 'do { } while (cmpxchg())' with an "infinite"
      for-loop with an explicit break.
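
      A user-space sketch of the for-loop shape, using C11 atomics in place of
      READ_ONCE() and cmpxchg(). The bit names, the function name and the
      max_polls bound are assumptions made so the example terminates; the kernel
      code sleeps and retries instead of giving up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define ST_SCHED (1UL << 0)	/* stand-in for NAPIF_STATE_SCHED */
#define ST_NPSVC (1UL << 1)	/* stand-in for NAPIF_STATE_NPSVC */

/*
 * Hypothetical model of the fixed loop. Returns true once both bits
 * were claimed atomically, false if the state stayed busy.
 */
static bool model_napi_disable(atomic_ulong *state, int max_polls)
{
	unsigned long val, new;

	assert(max_polls > 0);

	for (int polls = 0; polls < max_polls; polls++) {
		val = atomic_load(state);
		if (val & (ST_SCHED | ST_NPSVC))
			continue;	/* back to the state read, never to the exchange */

		new = val | ST_SCHED | ST_NPSVC;
		if (atomic_compare_exchange_strong(state, &val, new))
			return true;	/* plays the role of the explicit break */
	}
	return false;
}
```

      Because the exchange is only reachable after 'new' has been computed, a
      busy state can no longer cause an uninitialized value to be written.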
      
      From v1 [0]:
       - just use a for-loop to simplify both the fix and the existing
         code (Eric).
      
      [0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com
      
      Fixes: 719c5719 ("net: make napi_disable() symmetric with enable")
      Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop
      Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. Nov 10, 2021
    • net/smc: fix sk_refcnt underflow on linkdown and fallback · e5d5aadc
      Dust Li authored
      We got the following WARNING when running an ab/nginx
      test with RDMA link flapping (up-down-up).
      The reason is that when smc_sock fallback and linkdown
      happen simultaneously, we may get the following situation:
      
      __smc_lgr_terminate()
       --> smc_conn_kill()
          --> smc_close_active_abort()
                 smc_sock->sk_state = SMC_CLOSED
                 sock_put(smc_sock)
      
      smc_sock was set to SMC_CLOSED and sock_put() was called
      when terminating the link group. But later the application calls
      close() on the socket, and then we get:
      
      __smc_release():
          if (smc_sock->fallback)
              smc_sock->sk_state = SMC_CLOSED
              sock_put(smc_sock)
      
      Again we set the smc_sock to CLOSED although it's already
      in the CLOSED state, and put the refcnt twice, so the following
      warning happens:
      
      refcount_t: underflow; use-after-free.
      WARNING: CPU: 5 PID: 860 at lib/refcount.c:28 refcount_warn_saturate+0x8d/0xf0
      Modules linked in:
      CPU: 5 PID: 860 Comm: nginx Not tainted 5.10.46+ #403
      Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
      RIP: 0010:refcount_warn_saturate+0x8d/0xf0
      Code: 05 5c 1e b5 01 01 e8 52 25 bc ff 0f 0b c3 80 3d 4f 1e b5 01 00 75 ad 48
      
      RSP: 0018:ffffc90000527e50 EFLAGS: 00010286
      RAX: 0000000000000026 RBX: ffff8881300df2c0 RCX: 0000000000000027
      RDX: 0000000000000000 RSI: ffff88813bd58040 RDI: ffff88813bd58048
      RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000001
      R10: ffff8881300df2c0 R11: ffffc90000527c78 R12: ffff8881300df340
      R13: ffff8881300df930 R14: ffff88810b3dad80 R15: ffff8881300df4f8
      FS:  00007f739de8fb80(0000) GS:ffff88813bd40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000a01b008 CR3: 0000000111b64003 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       smc_release+0x353/0x3f0
       __sock_release+0x3d/0xb0
       sock_close+0x11/0x20
       __fput+0x93/0x230
       task_work_run+0x65/0xa0
       exit_to_user_mode_prepare+0xf9/0x100
       syscall_exit_to_user_mode+0x27/0x190
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This patch adds a check in __smc_release() to make
      sure we won't do an extra sock_put() and set the
      socket to CLOSED when it's already in the CLOSED state.
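
      The guard can be modelled in user space as follows. This is a simplified
      sketch with hypothetical types and names, not the SMC code: it only shows
      that checking the state before the put makes a second close a no-op.

```c
#include <assert.h>

enum model_state { MODEL_ACTIVE, MODEL_CLOSED };

struct model_sock {
	enum model_state state;
	int refcnt;
};

/* Mark the socket closed and drop its reference, but only once. */
static void model_close_once(struct model_sock *sk)
{
	assert(sk);

	if (sk->state == MODEL_CLOSED)
		return;		/* already closed: no second put */

	sk->state = MODEL_CLOSED;
	sk->refcnt--;
	assert(sk->refcnt >= 0);	/* an underflow would trip here */
}
```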
      
      Fixes: 51f1de79 ("net/smc: replace sock_put worker by socket refcounting")
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: Lag, fix a potential Oops with mlx5_lag_create_definer() · c7ebe23c
      Dan Carpenter authored
      There is a minus character missing from ERR_PTR(ENOMEM) so if this
      allocation fails it will lead to an Oops in the caller.
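
      Why the sign matters can be seen with user-space stand-ins for ERR_PTR()
      and IS_ERR(). The helpers below are assumptions that mimic the kernel
      convention of encoding errors in the top 4095 addresses; they are not the
      kernel macros themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode an error code as a pointer value. */
static void *model_err_ptr(long error)
{
	assert(error != 0);	/* zero would simply be NULL */
	return (void *)(intptr_t)error;
}

/* Only values in the last MODEL_MAX_ERRNO addresses count as errors. */
static bool model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}
```

      With a positive ENOMEM the encoded pointer is a small non-NULL address, so
      a caller that tests only for an error pointer treats it as valid and
      dereferences it.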
      
      Fixes: dc48516e ("net/mlx5: Lag, add support to create definers for LAG")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: fix unmatched u64_stats_update_end() · 721111b1
      Dan Carpenter authored
      The u64_stats_update_end() call is supposed to be inside the curly
      braces so it pairs with the u64_stats_update_begin().
      
      Fixes: 37149e93 ("gve: Implement packet continuation for RX.")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: lantiq_etop: Fix compilation error · 68eabc34
      Aleksander Jan Bajkowski authored
      This fixes the error detected when compiling the driver.
      
      Fixes: 14d4e308 ("net: lantiq: configure the burst length in ethernet drivers")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: Fix packet matching in mirroring selftests · af0a5111
      Petr Machata authored
      In commit 6de6e46d ("cls_flower: Fix inability to match GRE/IPIP
      packets"), cls_flower was fixed to match an outer packet of a tunneled
      packet as would be expected, rather than dissecting to the inner packet and
      matching on that.
      
      This fix uncovered several issues in packet matching in mirroring
      selftests:
      
      - in mirror_gre_bridge_1d_vlan.sh and mirror_gre_vlan_bridge_1q.sh, the
        vlan_ethtype match is copied around as "ip", even as some of the tests
        are running over ip6gretap. This is fixed by using an "ipv6" for
        vlan_ethtype in the ip6gretap tests.
      
      - in mirror_gre_changes.sh, a filter to count GRE packets is set up to
        match TTL of 50. This used to trigger in the offloaded datapath, where
        the envelope TTL was matched, but not in the software datapath, which
        considered TTL of the inner packet. Now that both match consistently, all
        the packets were double-counted. This is fixed by marking the filter as
        skip_hw, leaving only the SW datapath component active.
      
      Fixes: 6de6e46d ("cls_flower: Fix inability to match GRE/IPIP packets")
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vsock: prevent unnecessary refcnt inc for nonblocking connect · c7cd82b9
      Eiichi Tsukata authored
      Currently vsock_connect() increments the sock refcount for a nonblocking
      socket each time it's called, which can lead to a memory leak if
      it's called multiple times, because the connect timeout function
      decrements the sock refcount only once.

      Fix it by making vsock_connect() return -EALREADY immediately when
      the sock state is already SS_CONNECTING.
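
      A toy model of the early return, with hypothetical types and names. It
      assumes one reference is taken on behalf of the pending connect timeout and
      shows that repeated nonblocking calls no longer add to it.

```c
#include <assert.h>
#include <errno.h>

enum model_ss { MODEL_UNCONNECTED, MODEL_CONNECTING };

struct model_vsock {
	enum model_ss state;
	int refcnt;
};

/* Hypothetical nonblocking connect: take the timeout's reference only once. */
static int model_connect_nonblock(struct model_vsock *vsk)
{
	assert(vsk);

	if (vsk->state == MODEL_CONNECTING)
		return -EALREADY;	/* attempt already in flight, no extra ref */

	vsk->state = MODEL_CONNECTING;
	vsk->refcnt++;			/* held for the pending connect timeout */
	return -EINPROGRESS;
}
```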
      
      Fixes: d021c344 ("VSOCK: Introduce VM Sockets")
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: marvell: mvpp2: Fix wrong SerDes reconfiguration order · bb7bbb6e
      Marek Behún authored
      Commit bfe301eb ("net: mvpp2: convert to use
      mac_prepare()/mac_finish()") introduced a bug wherein it leaves the MAC
      RESET register asserted after mac_finish(), due to wrong order of
      function calls.
      
      Before it was:
        .mac_config()
          mvpp22_mode_reconfigure()
            assert reset
          mvpp2_xlg_config()
            deassert reset
      
      Now it is:
        .mac_prepare()
        .mac_config()
          mvpp2_xlg_config()
            deassert reset
        .mac_finish()
          mvpp2_xlg_config()
            assert reset
      
      Obviously this is wrong.
      
      This bug is triggered when phylink tries to change the PHY interface
      mode from a GMAC mode (sgmii, 1000base-x, 2500base-x) to XLG mode
      (10gbase-r, xaui). The XLG mode does not work since reset is left
      asserted. Only after
        ifconfig down && ifconfig up
      is called will the XLG mode work.
      
      Move the call to mvpp22_mode_reconfigure() to .mac_prepare()
      implementation. Since some of the subsequent functions need to know
      whether the interface is being changed, we unfortunately also need to
      pass around the new interface mode before setting port->phy_interface.
      
      Fixes: bfe301eb ("net: mvpp2: convert to use mac_prepare()/mac_finish()")
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory · 7a166854
      Christophe JAILLET authored
      It is spurious to allocate a bitmap without initializing it.
      So, better safe than sorry, initialize it to 0 at least to have some known
      values.
      
      While at it, switch to the devm_bitmap_ API which is less verbose.
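
      The difference between an uninitialized and a zeroed bitmap can be
      sketched in user space with calloc(). The helper names below are
      hypothetical and only mirror the idea of a zeroing bitmap allocator; they
      are not the kernel API.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: allocate a bitmap with every bit cleared. */
static unsigned long *model_bitmap_zalloc(size_t nbits)
{
	size_t words = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;

	assert(nbits > 0);
	return calloc(words, sizeof(unsigned long));	/* calloc zero-fills */
}

/* Read one bit; only meaningful if the storage was initialized. */
static bool model_test_bit(const unsigned long *map, size_t bit)
{
	return (map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL;
}
```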
      
      Fixes: 4b41d343 ("net: ethernet: ti: cpsw: allow untagged traffic on host port")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: allow a tc-taprio base-time of zero · f64ab8e4
      Vladimir Oltean authored
      Commit fe28c53e ("net: stmmac: fix taprio configuration when
      base_time is in the past") allowed some base time values in the past,
      but apparently not all: the base-time value of 0 (Jan 1st 1970) is still
      explicitly denied by the driver.
      
      Remove the bogus check.
      
      Fixes: b60189e0 ("net: stmmac: Integrate EST with TAPRIO scheduler API")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: net: test_vxlan_under_vrf: fix HV connectivity test · e7e4785f
      Andrea Righi authored
      
      
      It looks like test_vxlan_under_vrf.sh is always failing to verify the
      connectivity test during the ping between the two simulated VMs.
      
      This is due to the fact that veth-hv in each VM should have a distinct
      MAC address.
      
      Fix by setting a unique MAC address on each simulated VM interface.
      
      Without this fix:
      
       $ sudo ./tools/testing/selftests/net/test_vxlan_under_vrf.sh
       Checking HV connectivity                                           [ OK ]
       Check VM connectivity through VXLAN (underlay in the default VRF)  [FAIL]
      
      With this fix applied:
      
       $ sudo ./tools/testing/selftests/net/test_vxlan_under_vrf.sh
       Checking HV connectivity                                           [ OK ]
       Check VM connectivity through VXLAN (underlay in the default VRF)  [ OK ]
       Check VM connectivity through VXLAN (underlay in a VRF)            [FAIL]
      
      NOTE: the connectivity test with the underlay VRF is still failing; it
      seems that ARP requests are blocked at the simulated hypervisor level,
      probably due to some missing ARP forwarding rules. This requires more
      investigation (in the meantime we may consider marking that test as an
      expected failure - XFAIL).
      
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'hns3-fixes' · 1413ff13
      David S. Miller authored
      
      
      Guangbin Huang says:
      
      ====================
      net: hns3: add some fixes for -net
      
      This series adds some fixes for the HNS3 ethernet driver.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: allow configure ETS bandwidth of all TCs · 688db0c7
      Guangbin Huang authored
      Currently, the driver only allows configuring the ETS bandwidth of TCs up
      to the max TC number queried from firmware. However, the hardware actually
      supports 8 TCs and users may need to configure the ETS bandwidth of all
      TCs, so remove the restriction.
      
      Fixes: 330baff5 ("net: hns3: add ETS TC weight setting in SSU module")
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: remove check VF uc mac exist when set by PF · 91fcc79b
      Guangbin Huang authored
      If users set unicast MAC addresses for VFs through the PF, they need to
      guarantee that all VFs' addresses are different. This patch removes the
      check for whether a VF's MAC address already exists, so users can refresh
      the MAC addresses of all VFs directly without first having to change an
      existing MAC address to some other value.
      
      Fixes: 8e6de441 ("net: hns3: add support for configuring VF MAC from the host")
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix some mac statistics is always 0 in device version V2 · 1122eac1
      Guangbin Huang authored
      When the driver queries the number of MAC statistics registers from
      firmware, the old firmware running on device version V2 returns only the
      number of valid registers, not including the three reserved registers
      among them. This causes the driver to not record the last three values
      when querying MAC statistics.

      To fix this problem, the driver never queries the register number on
      device version V2 and sets it to a fixed value which includes the three
      reserved registers.
      
      Fixes: c8af2887 ("net: hns3: add support pause/pfc durations for mac statistics")
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix kernel crash when unload VF while it is being reset · e140c798
      Yufeng Mo authored
      Fully configuring VLANs for a VF and then unloading the VF while
      triggering a reset of the PF causes a kernel crash, because the
      irq is already uninitialized.
      
      [ 293.177579] ------------[ cut here ]------------
      [ 293.183502] kernel BUG at drivers/pci/msi.c:352!
      [ 293.189547] Internal error: Oops - BUG: 0 [#1] SMP
      ......
      [ 293.390124] Workqueue: hclgevf hclgevf_service_task [hclgevf]
      [ 293.402627] pstate: 80c00009 (Nzcv daif +PAN +UAO)
      [ 293.414324] pc : free_msi_irqs+0x19c/0x1b8
      [ 293.425429] lr : free_msi_irqs+0x18c/0x1b8
      [ 293.436545] sp : ffff00002716fbb0
      [ 293.446950] x29: ffff00002716fbb0 x28: 0000000000000000
      [ 293.459519] x27: 0000000000000000 x26: ffff45b91ea16b00
      [ 293.472183] x25: 0000000000000000 x24: ffffa587b08f4700
      [ 293.484717] x23: ffffc591ac30e000 x22: ffffa587b08f8428
      [ 293.497190] x21: ffffc591ac30e300 x20: 0000000000000000
      [ 293.509594] x19: ffffa58a062a8300 x18: 0000000000000000
      [ 293.521949] x17: 0000000000000000 x16: ffff45b91dcc3f48
      [ 293.534013] x15: 0000000000000000 x14: 0000000000000000
      [ 293.545883] x13: 0000000000000040 x12: 0000000000000228
      [ 293.557508] x11: 0000000000000020 x10: 0000000000000040
      [ 293.568889] x9 : ffff45b91ea1e190 x8 : ffffc591802d0000
      [ 293.580123] x7 : ffffc591802d0148 x6 : 0000000000000120
      [ 293.591190] x5 : ffffc591802d0000 x4 : 0000000000000000
      [ 293.602015] x3 : 0000000000000000 x2 : 0000000000000000
      [ 293.612624] x1 : 00000000000004a4 x0 : ffffa58a1e0c6b80
      [ 293.623028] Call trace:
      [ 293.630340] free_msi_irqs+0x19c/0x1b8
      [ 293.638849] pci_disable_msix+0x118/0x140
      [ 293.647452] pci_free_irq_vectors+0x20/0x38
      [ 293.656081] hclgevf_uninit_msi+0x44/0x58 [hclgevf]
      [ 293.665309] hclgevf_reset_rebuild+0x1ac/0x2e0 [hclgevf]
      [ 293.674866] hclgevf_reset+0x358/0x400 [hclgevf]
      [ 293.683545] hclgevf_reset_service_task+0xd0/0x1b0 [hclgevf]
      [ 293.693325] hclgevf_service_task+0x4c/0x2e8 [hclgevf]
      [ 293.702307] process_one_work+0x1b0/0x448
      [ 293.710034] worker_thread+0x54/0x468
      [ 293.717331] kthread+0x134/0x138
      [ 293.724114] ret_from_fork+0x10/0x18
      [ 293.731324] Code: f940b000 b4ffff00 a903e7b8 f90017b6 (d4210000)
      
      This patch fixes the problem by waiting for the VF reset to complete
      while unloading the VF.
      
      Fixes: e2cb1dec ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: sync rx ring head in echo common pull · 3b6db4a0
      Yufeng Mo authored
      When the driver processes rx packets, the head pointer is updated only
      after the number of received packets reaches 16. However, hardware
      relies on the head pointer to calculate the number of FBDs. As a result,
      the hardware calculates the FBD incorrectly. Therefore, the driver
      proactively updates the head pointer in each common poll to ensure that
      the number of FBDs calculated by the hardware is correct.
      
      Fixes: 68752b24 ("net: hns3: schedule the polling again when allocation fails")
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix pfc packet number incorrect after querying pfc parameters · 0b653a81
      Jie Wang authored
      Currently, the driver sends a command to firmware to query the pfc packet
      number when the user uses the dcb tool to get pfc parameters. However, the
      periodic service task also periodically queries and records MAC
      statistics, including the pfc packet number.

      As the hardware statistics registers are cleared after reading, the pfc
      packet number in the MAC statistics is no longer correct after using the
      dcb tool to get pfc parameters.

      To fix this problem, when the user uses the dcb tool to get pfc
      parameters, the driver first updates the MAC statistics and then gets the
      pfc packet number from the MAC statistics.
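
      The interaction with clear-on-read registers can be illustrated with a toy
      device model. The struct and function names are hypothetical; the point is
      only that every reader goes through the software totals, so no reader
      discards counts that another reader needs.

```c
#include <assert.h>
#include <stdint.h>

/* Toy device: one clear-on-read hardware counter plus a software total. */
struct model_dev {
	uint64_t hw_pfc_pkts;	/* cleared by every read */
	uint64_t sw_pfc_pkts;	/* running total kept by the driver */
};

static uint64_t model_hw_read_clear(struct model_dev *d)
{
	uint64_t v = d->hw_pfc_pkts;

	d->hw_pfc_pkts = 0;
	return v;
}

/* Fold the hardware counter into the software total (periodic task). */
static void model_update_stats(struct model_dev *d)
{
	d->sw_pfc_pkts += model_hw_read_clear(d);
}

/* Query path: refresh the totals first, then report from them. */
static uint64_t model_get_pfc_pkts(struct model_dev *d)
{
	assert(d);
	model_update_stats(d);
	return d->sw_pfc_pkts;
}
```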
      
      Fixes: 64fd2300 ("net: hns3: add support for querying pfc puase packets statistic")
      Signed-off-by: Jie Wang <wangjie125@huawei.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix ROCE base interrupt vector initialization bug · beb27ca4
      Jie Wang authored
      Currently, the NIC initializes the ROCE interrupt vector with the MSI-X
      interrupt vector. But ROCE uses pci_irq_vector() to get the interrupt
      vector, which adds the relative interrupt vector again and yields the
      wrong interrupt vector.

      So fix it by assigning the relative interrupt vector to ROCE instead of
      the MSI-X interrupt vector, and delete the unused struct member
      base_msi_vector declaration of hclgevf_dev.
      
      Fixes: 46a3df9f ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
      Signed-off-by: Jie Wang <wangjie125@huawei.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix failed to add reuse multicast mac addr to hardware when mc mac table is full · 3b4c6566
      Guangbin Huang authored
      Currently, when the driver fails to add a new multicast MAC address to
      hardware because the multicast MAC table is full, it returns directly.
      In this case, if the multicast MAC list has some reuse addresses after the
      new address, those reuse addresses will never be added to hardware.

      To fix this problem, if function hclge_add_mc_addr_common() returns
      -ENOSPC, hclge_sync_vport_mac_list() should judge whether to continue or
      stop adding the next address.

      As function hclge_sync_vport_mac_list() needs a mac_type parameter to know
      whether the address is uc or mc, refine this function to add the parameter
      mac_type and remove the parameter sync. Do the same for function
      hclge_unsync_vport_mac_list().
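
      A rough model of continuing past -ENOSPC. It assumes that an address
      already present in the table can be reused without a new entry, which is
      how the text above describes reuse addresses; the table layout and all
      names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_TBL_SIZE 2

struct model_tbl {
	int addr[MODEL_TBL_SIZE];
	size_t used;
};

/* Add one address: an entry that already exists is reused at no cost. */
static int model_add_mc(struct model_tbl *t, int addr)
{
	for (size_t i = 0; i < t->used; i++)
		if (t->addr[i] == addr)
			return 0;	/* reuse the existing entry */
	if (t->used == MODEL_TBL_SIZE)
		return -ENOSPC;		/* no room for a new entry */
	t->addr[t->used++] = addr;
	return 0;
}

/* Sync a list; keep going past -ENOSPC so later reusable addresses land. */
static size_t model_sync_mc_list(struct model_tbl *t, const int *list, size_t n)
{
	size_t synced = 0;

	assert(t && list);
	for (size_t i = 0; i < n; i++) {
		int ret = model_add_mc(t, list[i]);

		if (ret == 0)
			synced++;
		else if (ret != -ENOSPC)
			break;		/* any other error still stops the walk */
	}
	return synced;
}
```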
      
      Fixes: ee4bcd3b ("net: hns3: refactor the MAC address configure")
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mana: Fix spelling mistake "calledd" -> "called" · 8f1bc38b
      Colin Ian King authored
      
      
      There is a spelling mistake in a dev_info message. Fix it.
      
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Reviewed-by: Dexuan Cui <decui@microsoft.com>
      Link: https://lore.kernel.org/r/20211108201817.43121-1-colin.i.king@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: sch_taprio: fix undefined behavior in ktime_mono_to_any · 6dc25401
      Eric Dumazet authored
      1) if q->tk_offset == TK_OFFS_MAX, then get_tcp_tstamp() calls
         ktime_mono_to_any() with an out-of-bounds value.

      2) if q->tk_offset is changed in taprio_parse_clockid(),
         taprio_get_time() might also call ktime_mono_to_any()
         with an out-of-bounds value, as syzbot found:
      
      UBSAN: array-index-out-of-bounds in kernel/time/timekeeping.c:908:27
      index 3 is out of range for type 'ktime_t *[3]'
      CPU: 1 PID: 25668 Comm: kworker/u4:0 Not tainted 5.15.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       ubsan_epilogue+0xb/0x5a lib/ubsan.c:151
       __ubsan_handle_out_of_bounds.cold+0x62/0x6c lib/ubsan.c:291
       ktime_mono_to_any+0x1d4/0x1e0 kernel/time/timekeeping.c:908
       get_tcp_tstamp net/sched/sch_taprio.c:322 [inline]
       get_packet_txtime net/sched/sch_taprio.c:353 [inline]
       taprio_enqueue_one+0x5b0/0x1460 net/sched/sch_taprio.c:420
       taprio_enqueue+0x3b1/0x730 net/sched/sch_taprio.c:485
       dev_qdisc_enqueue+0x40/0x300 net/core/dev.c:3785
       __dev_xmit_skb net/core/dev.c:3869 [inline]
       __dev_queue_xmit+0x1f6e/0x3630 net/core/dev.c:4194
       batadv_send_skb_packet+0x4a9/0x5f0 net/batman-adv/send.c:108
       batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:393 [inline]
       batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:421 [inline]
       batadv_iv_send_outstanding_bat_ogm_packet+0x6d7/0x8e0 net/batman-adv/bat_iv_ogm.c:1701
       process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
       worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
       kthread+0x405/0x4f0 kernel/kthread.c:327
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
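
      The report above shows index 3 being used on a three-element array. A
      guarded conversion could look like the user-space sketch below; the enum
      mirrors the shape of the kernel's tk_offsets, while the offset values and
      function name are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum model_tk_offsets {
	MODEL_OFFS_REAL,
	MODEL_OFFS_BOOT,
	MODEL_OFFS_TAI,
	MODEL_OFFS_MAX,		/* one past the last valid index */
};

/* Hypothetical per-clock offsets, in nanoseconds. */
static const int64_t model_offsets[MODEL_OFFS_MAX] = { 1000, 2000, 3000 };

/* Convert a monotonic time; MODEL_OFFS_MAX means "leave it monotonic". */
static int64_t model_mono_to_any(int64_t mono, enum model_tk_offsets offs)
{
	assert(offs <= MODEL_OFFS_MAX);

	if (offs == MODEL_OFFS_MAX)
		return mono;	/* never index the array with MAX */
	return mono + model_offsets[offs];
}
```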
      
      Fixes: 7ede7b03 ("taprio: make clock reference conversions easier")
      Fixes: 54002066 ("taprio: Adjust timestamps for TCP packets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Vedang Patel <vedang.patel@intel.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Link: https://lore.kernel.org/r/20211108180815.1822479-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • amt: use cancel_delayed_work() instead of flush_delayed_work() in amt_fini() · 43aa4937
      Taehee Yoo authored
      When the amt module is being removed, it calls flush_delayed_work() to
      stop source_gc_wq. But this does not stop it properly, because
      amt_source_gc_work(), the callback function of source_gc_wq,
      internally calls mod_delayed_work() again.
      So amt_source_gc_work() would be called after the amt module is removed,
      and a kernel panic would occur.
      To avoid this, cancel_delayed_work() should be used instead of
      flush_delayed_work().
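
      A deliberately tiny model of why flushing self-re-arming work leaves it
      pending while cancelling does not. It is single-threaded and uses
      hypothetical names; it does not reproduce the real workqueue semantics
      beyond this one point.

```c
#include <assert.h>
#include <stdbool.h>

struct model_dwork {
	bool pending;
	int runs;
};

/* Work function that re-arms itself, as a periodic GC callback would. */
static void model_gc_work(struct model_dwork *w)
{
	w->runs++;
	w->pending = true;
}

/* Toy flush: run the pending work once and wait for it to finish. */
static void model_flush(struct model_dwork *w)
{
	assert(w);
	if (w->pending) {
		w->pending = false;
		model_gc_work(w);	/* the callback queues itself again */
	}
}

/* Toy cancel: drop the pending work without running it. */
static void model_cancel(struct model_dwork *w)
{
	assert(w);
	w->pending = false;
}
```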
      
      Test commands:
         modprobe amt
         modprobe -rv amt
      
      Splat looks like:
       BUG: unable to handle page fault for address: fffffbfff80f50db
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 1237ee067 P4D 1237ee067 PUD 1237b2067 PMD 100c11067 PTE 0
       Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0+ #27
       5a0ebebc29fe5c40c68bea90197606c3a832b09f
       RIP: 0010:run_timer_softirq+0x221/0xfc0
       Code: 00 00 4c 89 e1 4c 8b 30 48 c1 e9 03 80 3c 29 00 0f 85 ed 0b 00 00
       4d 89 34 24 4d 85 f6 74 19 49 8d 7e 08 48 89 f9 48 c1 e9 03 <80> 3c 29 00
       0f 85 fa 0b 00 00 4d 89 66 08 83 04 24 01 49 89 d4 48
       RSP: 0018:ffff888119009e50 EFLAGS: 00010806
       RAX: ffff8881191f8a80 RBX: 00000000007ffe2a RCX: 1ffffffff80f50db
       RDX: ffff888119009ed0 RSI: 0000000000000008 RDI: ffffffffc07a86d8
       RBP: dffffc0000000000 R08: ffff8881191f8280 R09: ffffed102323f061
       R10: ffff8881191f8307 R11: ffffed102323f060 R12: ffff888119009ec8
       R13: 00000000000000c0 R14: ffffffffc07a86d0 R15: ffff8881191f82e8
       FS:  0000000000000000(0000) GS:ffff888119000000(0000)
       knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: fffffbfff80f50db CR3: 00000001062dc002 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        <IRQ>
        ? add_timer+0x650/0x650
        ? kvm_clock_read+0x14/0x30
        ? ktime_get+0xb9/0x180
        ? rcu_read_lock_held_common+0xe/0xa0
        ? rcu_read_lock_sched_held+0x56/0xc0
        ? rcu_read_lock_bh_held+0xa0/0xa0
        ? hrtimer_interrupt+0x271/0x790
        __do_softirq+0x1d0/0x88f
        irq_exit_rcu+0xe7/0x120
        sysvec_apic_timer_interrupt+0x8a/0xb0
        </IRQ>
        <TASK>
      [ ... ]
      
      Fixes: bc54e49c ("amt: add multicast(IGMP) report message handler")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Link: https://lore.kernel.org/r/20211108145340.17208-1-ap420073@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      43aa4937
    • net: dsa: mv88e6xxx: Don't support >1G speeds on 6191X on ports other than 10 · dc2fc9f0
      Marek Behún authored
      Model 88E6191X only supports >1G speeds on port 10. Ports 0 and 9 are
      1G only.
      
      Fixes: de776d0d ("net: dsa: mv88e6xxx: add support for mv88e6393x family")
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20211104171747.10509-1-kabel@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dc2fc9f0
    • Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · fceb0795
      Jakub Kicinski authored
      
      
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf 2021-11-09
      
      We've added 7 non-merge commits during the last 3 day(s) which contain
      a total of 10 files changed, 174 insertions(+), 48 deletions(-).
      
      The main changes are:
      
      1) Various sockmap fixes, from John and Jussi.
      
      2) Fix out-of-bound issue with bpf_pseudo_func, from Martin.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg
        bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding
        bpf, sockmap: Fix race in ingress receive verdict with redirect to self
        bpf, sockmap: Remove unhash handler for BPF sockmap usage
        bpf, sockmap: Use stricter sk state checks in sk_lookup_assign
        bpf: selftest: Trigger a DCE on the whole subprog
        bpf: Stop caching subprog index in the bpf_pseudo_func insn
      ====================
      
      Link: https://lore.kernel.org/r/20211109215702.38350-1-alexei.starovoitov@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fceb0795
  3. Nov 09, 2021
    • amt: add IPV6 Kconfig dependency · 9758aba8
      Arnd Bergmann authored
      This driver cannot be built-in if IPV6 is a loadable module:
      
      x86_64-linux-ld: drivers/net/amt.o: in function `amt_build_mld_gq':
      amt.c:(.text+0x2e7d): undefined reference to `ipv6_dev_get_saddr'
      
      Add the idiomatic Kconfig dependency that all such modules
      have.
      
      Fixes: b9022b53 ("amt: add control plane of amt interface")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9758aba8
    • gve: Fix off by one in gve_tx_timeout() · 1c360cc1
      Dan Carpenter authored
      The priv->ntfy_blocks[] has "priv->num_ntfy_blks" elements so this >
      needs to be >= to prevent an off by one bug.  The priv->ntfy_blocks[]
      array is allocated in gve_alloc_notify_blocks().
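The bounds check described above can be illustrated with a minimal sketch. The helper name `ntfy_idx_valid` is hypothetical and is not taken from the gve driver; it only shows why an array of n elements must reject index n as well as anything larger.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the driver's code: returns true when idx may be
 * used to index an array that holds n elements.
 */
static bool ntfy_idx_valid(size_t idx, size_t n)
{
	/*
	 * Valid indices are 0 .. n - 1, so idx == n must be rejected too.
	 * Writing "idx > n" here would let idx == n through, which is the
	 * off-by-one this kind of fix removes.
	 */
	if (idx >= n)
		return false;
	return true;
}
```

With n = 4, the sketch accepts index 3 and rejects index 4.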
      
      Fixes: 87a7f321 ("gve: Recover from queue stall due to missed IRQ")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1c360cc1
    • hamradio: defer 6pack kfree after unregister_netdev · 0b911192
      Lin Ma authored
      
      
      There is a possible race condition (use-after-free) like below
      
       (USE)                       |  (FREE)
        dev_queue_xmit             |
         __dev_queue_xmit          |
          __dev_xmit_skb           |
           sch_direct_xmit         | ...
            xmit_one               |
             netdev_start_xmit     | tty_ldisc_kill
              __netdev_start_xmit  |  6pack_close
               sp_xmit             |   kfree
                sp_encaps          |
                                   |
      
      Following the patch "defer ax25 kfree after unregister_netdev", this
      patch moves the kfree after unregister_netdev() to avoid the possible
      UAF, since unregister_netdev() is well synchronized and won't return
      while a routine is still running.
      
      Signed-off-by: Lin Ma <linma@zju.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b911192
    • hamradio: defer ax25 kfree after unregister_netdev · 3e0588c2
      Lin Ma authored
      
      
      There is a possible race condition (use-after-free) like below
      
       (USE)                       |  (FREE)
      ax25_sendmsg                 |
       ax25_queue_xmit             |
        dev_queue_xmit             |
         __dev_queue_xmit          |
          __dev_xmit_skb           |
           sch_direct_xmit         | ...
            xmit_one               |
             netdev_start_xmit     | tty_ldisc_kill
              __netdev_start_xmit  |  mkiss_close
               ax_xmit             |   kfree
                ax_encaps          |
                                   |
      
      There are two synchronization primitives before the kfree, but neither
      closes this race:
      1. wait_for_completion(&ax->dead). This prevents the race with
      routines from mkiss_ioctl. However, it cannot stop a routine coming
      from the upper layer, i.e., ax25_sendmsg.
      
      2. netif_stop_queue(ax->dev). This line aims to halt the transmit
      queue, but it fails to stop a routine that is already transmitting.
      
      This patch moves the kfree after unregister_netdev() to avoid the
      possible UAF, since unregister_netdev() is well synchronized and won't
      return while a routine is still running.
      
      Signed-off-by: Lin Ma <linma@zju.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e0588c2
    • net: sungem_phy: fix code indentation · 54f0bad6
      Jean Sacren authored
      Remove extra space in front of the return statement.
      
      Fixes: eb5b5b2f ("sungem_phy: support bcm5461 phy, autoneg.")
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54f0bad6
    • bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg · b2c46181
      Jussi Maki authored
      The current conversion of skb->data_end reads like this:
      
        ; data_end = (void*)(long)skb->data_end;
         559: (79) r1 = *(u64 *)(r2 +200)   ; r1  = skb->data
         560: (61) r11 = *(u32 *)(r2 +112)  ; r11 = skb->len
         561: (0f) r1 += r11
         562: (61) r11 = *(u32 *)(r2 +116)
         563: (1f) r1 -= r11
      
      But similar to the case in 84f44df6 ("bpf: sock_ops sk access may stomp
      registers when dst_reg = src_reg"), the code will read an incorrect skb->len
      when src == dst. In this case we end up generating this xlated code:
      
        ; data_end = (void*)(long)skb->data_end;
         559: (79) r1 = *(u64 *)(r1 +200)   ; r1  = skb->data
         560: (61) r11 = *(u32 *)(r1 +112)  ; r11 = (skb->data)->len
         561: (0f) r1 += r11
         562: (61) r11 = *(u32 *)(r1 +116)
         563: (1f) r1 -= r11
      
      ... where line 560 reads 4B from (skb->data + 112) instead of the
      intended skb->len. Here the skb pointer in r1 gets set to skb->data and the
      later deref for skb->len ends up following skb->data instead of skb.
      
      This fixes the issue similarly to the patch mentioned above by creating an
      additional temporary variable and using it to store the register when dst_reg =
      src_reg. We name the variable bpf_temp_reg and place it in the cb context for
      sk_skb. Then we restore from the temp to ensure nothing is lost.
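The register clobber and the spill-and-restore idea can be modeled in plain C. This is a toy simulation under stated assumptions, not the kernel's instruction rewriter: "addresses" are indices into a small array, all values are made up, the field read at +116 is assumed to be skb->data_len, and `data_end_naive`, `data_end_fixed` and the choice of scratch register are hypothetical.

```c
#include <stdint.h>

/* Toy model: "addresses" are indices into mem[]; every value is made up. */
enum { SKB = 0, OFF_DATA = 0, OFF_LEN = 1, OFF_DATA_LEN = 2, OFF_TEMP = 3 };

static uint64_t mem[32];

static void setup_mem(void)
{
	mem[SKB + OFF_DATA] = 8;      /* skb->data "points" at index 8 */
	mem[SKB + OFF_LEN] = 100;     /* skb->len */
	mem[SKB + OFF_DATA_LEN] = 40; /* field at +116, assumed skb->data_len */
	mem[8 + OFF_LEN] = 7;         /* junk found when following skb->data */
}

/* Mirrors the five-instruction listing; only correct when dst != src. */
static uint64_t data_end_naive(uint64_t r[], int dst, int src)
{
	uint64_t r11;

	r[dst] = mem[r[src] + OFF_DATA];  /* clobbers the skb pointer if dst == src */
	r11 = mem[r[src] + OFF_LEN];
	r[dst] += r11;
	r11 = mem[r[src] + OFF_DATA_LEN];
	r[dst] -= r11;
	return r[dst];
}

/* Spill a scratch register to a temp slot, use it to hold skb, then restore. */
static uint64_t data_end_fixed(uint64_t r[], int dst, int src)
{
	uint64_t r11;
	int tmp = (dst == 9) ? 8 : 9; /* hypothetical scratch register != dst */

	if (dst != src)
		return data_end_naive(r, dst, src);

	mem[r[src] + OFF_TEMP] = r[tmp];  /* save scratch in the temp slot */
	r[tmp] = r[src];                  /* keep the skb pointer alive */
	r[dst] = mem[r[tmp] + OFF_DATA];
	r11 = mem[r[tmp] + OFF_LEN];
	r[dst] += r11;
	r11 = mem[r[tmp] + OFF_DATA_LEN];
	r[dst] -= r11;
	r[tmp] = mem[r[tmp] + OFF_TEMP];  /* restore scratch, nothing is lost */
	return r[dst];
}
```

In this model the intended result is 8 + 100 - 40 = 68. The naive sequence with dst == src follows the clobbered pointer and returns a different value, while the fixed sequence returns 68 and leaves the scratch register unchanged.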
      
      Fixes: 16137b09 ("bpf: Compute data_end dynamically with JIT code")
      Signed-off-by: Jussi Maki <joamaki@gmail.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20211103204736.248403-6-john.fastabend@gmail.com
      b2c46181
    • bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding · e0dc3b93
      John Fastabend authored
      Strparser is reusing the qdisc_skb_cb struct to stash the skb message handling
      progress, e.g. offset and length of the skb. First this is poorly named and
      inherits a struct from qdisc that doesn't reflect the actual usage of cb[] at
      this layer.
      
      But, more importantly strparser is using the following to access its metadata.
      
        (struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data))
      
      Where _strp_msg is defined as:
      
        struct _strp_msg {
              struct strp_msg            strp;                 /*     0     8 */
              int                        accum_len;            /*     8     4 */
      
              /* size: 12, cachelines: 1, members: 2 */
              /* last cacheline: 12 bytes */
        };
      
      So we use 12 bytes of ->data[] in the struct. However, in BPF code running
      parser and verdict programs the user can read the data[] array as well. It's
      not too problematic, but we should not be exposing internal state to BPF
      programs. If it's really needed then we can use the probe_read() APIs, which
      allow reading kernel memory. And I don't believe moving this around in cb[]
      poses any API breakage, because programs can't depend on cb[] across layers.
      
      In order to fix another issue with a ctx rewrite we need to stash a temp
      variable somewhere. To make this work cleanly this patch builds a cb struct
      for sk_skb types called sk_skb_cb struct. Then we can use this consistently
      in the strparser, sockmap space. Additionally we can start allowing ->cb[]
      write access after this.
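A dedicated cb layout of the kind described above could look like the following sketch. The struct name `sk_skb_cb_sketch`, the 20-byte size of the BPF-visible area, and the accessor `strp_msg_of` are assumptions made for illustration. Only the 12-byte `_strp_msg` layout comes from the message, and the 48-byte size is that of `skb->cb[]`.

```c
#include <stddef.h>
#include <stdint.h>

#define SKB_CB_SIZE 48        /* size of skb->cb[] */
#define SK_SKB_CB_PRIV_LEN 20 /* assumed size of the BPF-visible area */

struct strp_msg {
	int full_len;
	int offset;
};

struct _strp_msg {
	struct strp_msg strp; /* 8 bytes */
	int accum_len;        /* 4 bytes */
};

/* Hypothetical layout: BPF-visible bytes first, internal state after them. */
struct sk_skb_cb_sketch {
	unsigned char data[SK_SKB_CB_PRIV_LEN];
	struct _strp_msg strp;
	uint64_t temp_reg; /* scratch slot for the ctx rewrite */
};

/* The whole struct has to fit inside the control buffer. */
_Static_assert(sizeof(struct sk_skb_cb_sketch) <= SKB_CB_SIZE,
	       "sk_skb_cb_sketch must fit in skb->cb[]");

/* Locate the strparser state by its offset in the dedicated struct. */
static struct _strp_msg *strp_msg_of(unsigned char *cb)
{
	return (struct _strp_msg *)(cb + offsetof(struct sk_skb_cb_sketch, strp));
}
```

Placing the strparser state behind the BPF-visible bytes keeps it out of the range a program can read, and naming the offset after the dedicated struct removes the dependency on `struct qdisc_skb_cb`.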
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Jussi Maki <joamaki@gmail.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20211103204736.248403-5-john.fastabend@gmail.com
      e0dc3b93
    • bpf, sockmap: Fix race in ingress receive verdict with redirect to self · c5d2177a
      John Fastabend authored
      A socket in a sockmap may have different combinations of programs attached
      depending on configuration. There can be no programs, in which case the socket
      acts as a sink only. There can be a TX program only, in which case a BPF
      program is attached to the sending side but no RX program is attached. There
      can be an RX program only, where sends have no BPF program attached but
      receives are hooked with BPF. And finally, both TX and RX programs may be
      attached, giving us the permutations:
      
       None, Tx, Rx, and TxRx
      
      To date most of our use cases have been either the TX case, used as a fast
      datapath to directly copy between a local application and a userspace proxy,
      or the Rx and TxRx cases, for applications that operate an in-kernel proxy.
      The traffic in the first case, where we hook applications into a userspace
      application, looks like this:
      
        AppA  redirect   AppB
         Tx <-----------> Rx
         |                |
         +                +
         TCP <--> lo <--> TCP
      
      In this case all traffic from AppA (after 3whs) is copied into the AppB
      ingress queue and no traffic is ever on the TCP receive_queue.
      
      In the second case the application never receives, except in some rare error
      cases, traffic on the actual user space socket. Instead the send happens in
      the kernel.
      
                 AppProxy       socket pool
             sk0 ------------->{sk1,sk2, skn}
              ^                      |
              |                      |
              |                      v
             ingress              lb egress
             TCP                  TCP
      
      Here because traffic is never read off the socket with userspace recv() APIs
      there is only ever one reader on the sk receive_queue. Namely the BPF programs.
      
      However, we've started to introduce a third configuration where the BPF program
      on receive should process the data, but then the normal case is to push the
      data into the receive queue of AppB.
      
             AppB
             recv()                (userspace)
           -----------------------
             tcp_bpf_recvmsg()     (kernel)
               |             |
               |             |
               |             |
             ingress_msgQ    |
               |             |
             RX_BPF          |
               |             |
               v             v
             sk->receive_queue
      
      This is different from the App{A,B} redirect because traffic is first received
      on the sk->receive_queue.
      
      Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
      queue for any data handled by the BPF rx program and returned with PASS code
      so that it was enqueued on the ingress msg queue. Then if no data exists on
      that queue it checks the socket receive queue. Unfortunately, this is the same
      receive_queue the BPF program is reading data off of. So we get a race. It's
      possible for the recvmsg() hook to pull data off the receive_queue before the
      BPF hook has a chance to read it. It typically happens when an application is
      banging on recv() and getting EAGAINs until it manages to race with the RX
      BPF program.
      
      To fix this we note that before this patch, at attach time when the socket is
      loaded into the map, we check if it needs a TX program or just the base set of
      proto bpf hooks. Then it uses the above general RX hook regardless of whether
      we have a BPF program attached at rx or not. This patch now extends this check
      to handle all cases enumerated above: TX, RX, TXRX, and none. And to fix the
      above race when an RX program is attached we use a new hook that is nearly
      identical to the old one except now we do not let the recv() call skip the RX
      BPF program. Now only the BPF program pulls data from sk->receive_queue and
      recv() only pulls data from the ingress msgQ post BPF program handling.
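The rule described above can be condensed into a small decision function. This is a hedged sketch, not the kernel's recvmsg hook: `pick_recv_source` and the `recv_source` enum are hypothetical names, and queue contents are reduced to booleans.

```c
#include <stdbool.h>

enum recv_source {
	RECV_NONE,          /* nothing to hand to recv() yet */
	RECV_INGRESS_MSG,   /* data already processed by the BPF program */
	RECV_RECEIVE_QUEUE, /* raw data on sk->receive_queue */
};

/* Decide where recv() is allowed to take data from. */
static enum recv_source pick_recv_source(bool rx_prog_attached,
					 bool ingress_msg_has_data,
					 bool receive_queue_has_data)
{
	if (ingress_msg_has_data)
		return RECV_INGRESS_MSG;

	/*
	 * With an RX program attached, only that program may consume the
	 * receive queue, so recv() never falls back to it.
	 */
	if (rx_prog_attached)
		return RECV_NONE;

	if (receive_queue_has_data)
		return RECV_RECEIVE_QUEUE;

	return RECV_NONE;
}
```

In the pre-fix behavior the `rx_prog_attached` check is missing, so a recv() call that finds the ingress queue empty can take data off the receive queue before the BPF program has seen it.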
      
      With this resolved our AppB from above has been up and running for many hours
      without detecting any errors. We do this by correlating counters in RX BPF
      events and the AppB to ensure data is never skipping the BPF program. Selftests
      were not able to detect this because we only run them for a short period of
      time on well-ordered send/recvs, so we don't get any of the noise we see in
      real application environments.
      
      Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Jussi Maki <joamaki@gmail.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
      c5d2177a
    • bpf, sockmap: Remove unhash handler for BPF sockmap usage · b8b8315e
      John Fastabend authored
      We do not need to handle unhash from the BPF side; we can simply wait for the
      close to happen. The original concern was that a socket could transition from
      ESTABLISHED state to a new state while the BPF hook was still attached.
      But we convinced ourselves this is no longer possible, and we also improved
      BPF sockmap to handle listen sockets, so this is no longer a problem.
      
      More importantly, though, there are cases where unhash is called when data is
      in the receive queue. The BPF unhash logic will flush this data, which is
      wrong. To be correct it should keep the data in the receive queue and allow
      a receiving application to continue reading the data. This may happen when
      tcp_abort() is received, for example. Instead of complicating the logic in
      unhash, simply moving all this to the tcp_close() hook solves this.
      
      Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Jussi Maki <joamaki@gmail.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20211103204736.248403-3-john.fastabend@gmail.com
      b8b8315e
    • bpf, sockmap: Use stricter sk state checks in sk_lookup_assign · 40a34121
      John Fastabend authored
      
      
      In order to fix an issue with sockets in TCP sockmap redirect cases we plan
      to allow CLOSE state sockets to exist in the sockmap. However, the check in
      bpf_sk_lookup_assign() currently only invalidates sockets in the
      TCP_ESTABLISHED case, relying on the checks on sockmap insert to ensure we
      never have SOCK_CLOSE state sockets in the map.
      
      To prepare for this change we flip the logic in bpf_sk_lookup_assign() to
      explicitly test for the accepted cases, namely a tcp socket in TCP_LISTEN
      or a udp socket in TCP_CLOSE state. This also makes the code more resilient
      to future changes.
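The difference between rejecting one known-bad state and accepting only known-good states can be sketched as follows. The enums and the function `lookup_assign_allowed` are hypothetical stand-ins for illustration. They do not reproduce the kernel's helpers or its TCP_* constants.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the protocol and socket state. */
enum sketch_proto { SKETCH_TCP, SKETCH_UDP, SKETCH_OTHER };
enum sketch_state { SKETCH_ESTABLISHED, SKETCH_LISTEN, SKETCH_CLOSE };

/*
 * Allow-list form of the check: name the accepted combinations and reject
 * everything else, instead of rejecting a single state.
 */
static bool lookup_assign_allowed(enum sketch_proto proto,
				  enum sketch_state state)
{
	if (proto == SKETCH_TCP)
		return state == SKETCH_LISTEN;
	if (proto == SKETCH_UDP)
		return state == SKETCH_CLOSE;
	return false;
}
```

With this form, a TCP socket in CLOSE state is rejected without having to be listed explicitly, which is what keeps the check valid once such sockets are allowed to exist in the map.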
      
      Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20211103204736.248403-2-john.fastabend@gmail.com
      40a34121
  4. Nov 08, 2021