  1. May 06, 2022
    • net: mscc: ocelot: fix VCAP IS2 filters matching on both lookups · 6741e118
      Vladimir Oltean authored
      The VCAP IS2 TCAM is looked up twice per packet, and each filter can be
      configured to match during the first lookup, the second lookup, both,
      or neither.
      
      The blamed commit wrote the code for making VCAP IS2 filters match only
      on the given lookup. But right below that code, there was another line
      that explicitly made the lookup a "don't care", and this is overwriting
      the lookup we've selected. So the code had no effect.
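
      The overwrite described above can be modelled in a few lines of
      userspace C. This is only a sketch under stated assumptions: struct
      tcam_bit, key_set(), key_matches() and the program_*() helpers are
      hypothetical names, not the driver's VCAP helpers, and a single key bit
      (1 assumed to mean "first lookup") stands in for the lookup field.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one TCAM key bit: the packet is compared against
 * value only where mask is 1; mask == 0 means "don't care". */
struct tcam_bit {
	unsigned int value;
	unsigned int mask;
};

/* Each write replaces whatever was programmed before. */
static void key_set(struct tcam_bit *f, unsigned int value, unsigned int mask)
{
	f->value = value & 1;
	f->mask = mask & 1;
}

static bool key_matches(const struct tcam_bit *f, unsigned int pkt_bit)
{
	return ((pkt_bit ^ f->value) & f->mask) == 0;
}

/* Buggy sequence: select one lookup, then overwrite with "don't care". */
static void program_buggy(struct tcam_bit *first)
{
	key_set(first, 1, 1);
	key_set(first, 0, 0);
}

/* Fixed sequence: only the selected lookup is programmed. */
static void program_fixed(struct tcam_bit *first)
{
	key_set(first, 1, 1);
}
```

      In this model the buggy sequence matches a packet on either lookup,
      while the fixed sequence matches only when the packet bit is 1.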
      
      Some of the more noticeable effects of having filters match on both
      lookups:
      
      - in "tc -s filter show dev swp0 ingress", we see each packet matching a
        VCAP IS2 filter counted twice. This throws off scripts such as
        tools/testing/selftests/net/forwarding/tc_actions.sh and makes them
        fail.
      
      - a "tc-drop" action offloaded to VCAP IS2 needs a policer as well,
        because once the CPU port becomes a member of the destination port
        mask of a packet, nothing removes it, not even a PERMIT/DENY mask mode
        with a port mask of 0. But VCAP IS2 rules with the POLICE_ENA bit in
        the action vector can only appear in the first lookup. What happens
        when a filter matches both lookups is that the action vector is
        combined, and this makes the POLICE_ENA bit ineffective, since the
        last lookup in which it has appeared is the second one. In other
        words, "tc-drop" actions do not drop packets for the CPU port: dropped
        packets are still seen by software unless there was an FDB entry that
        directed those packets somewhere other than the CPU.
      
      The last bit used to work, because in the initial commit b5962294
      ("net: mscc: ocelot: Add support for tcam"), we were writing the FIRST
      field of the VCAP IS2 half key with a 1, not with a "don't care".
      The change to "don't care" was made inadvertently by me in commit
      c1c3993e ("net: mscc: ocelot: generalize existing code for VCAP"),
      which I just realized, and which needs a separate fix from this one,
      for "stable" kernels that lack the commit blamed below.
      
      Fixes: 226e9cd8 ("net: mscc: ocelot: only install TCAM entries into a specific lookup and PAG")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: fix last VCAP IS1/IS2 filter persisting in hardware when deleted · 16bbebd3
      Vladimir Oltean authored
      ocelot_vcap_filter_del() works by moving the next filters over the
      current one, and then deleting the last filter by calling vcap_entry_set()
      with a del_filter which was specially created by memsetting its memory
      to zeroes. vcap_entry_set() then programs this to the TCAM and action
      RAM via the cache registers.
      
      The problem is that vcap_entry_set() is a dispatch function which looks
      at del_filter->block_id. But since del_filter is zeroized memory, the
      block_id is 0, or otherwise said, VCAP_ES0. So practically, what we do
      is delete the entry at the same TCAM index from VCAP ES0 instead of IS1
      or IS2.
      
      The code was not always like this. vcap_entry_set() used to simply be
      is2_entry_set(), and back then the logic worked.
      
      Restore the functionality by populating the block_id of the del_filter
      based on the VCAP block of the filter that we're deleting. This makes
      vcap_entry_set() know what to do.
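
      A minimal userspace sketch of the dispatch problem follows. The enum
      values other than "ES0 is 0", struct filter, entry_set_target() and the
      delete_*() helpers are hypothetical stand-ins for illustration, not the
      driver's code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical block ids; only "ES0 is 0" comes from the commit message. */
enum block_id { BLK_ES0 = 0, BLK_IS1, BLK_IS2 };

struct filter {
	enum block_id block_id;
	int payload;
};

/* Stand-in for the dispatch: returns the block that would be programmed. */
static enum block_id entry_set_target(const struct filter *f)
{
	return f->block_id;
}

/* Buggy delete: the zeroed del_filter always dispatches to block 0. */
static enum block_id delete_buggy(const struct filter *victim)
{
	struct filter del_filter;

	(void)victim;
	memset(&del_filter, 0, sizeof(del_filter));
	return entry_set_target(&del_filter);
}

/* Fixed delete: copy the block id of the filter being removed. */
static enum block_id delete_fixed(const struct filter *victim)
{
	struct filter del_filter;

	memset(&del_filter, 0, sizeof(del_filter));
	del_filter.block_id = victim->block_id;
	return entry_set_target(&del_filter);
}
```

      For a filter that lives in IS2, delete_buggy() reports ES0 as the
      target, whereas delete_fixed() reports IS2.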
      
      Fixes: 1397a2eb ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: mark traps with a bool instead of keeping them in a list · e1846cff
      Vladimir Oltean authored
      Since the blamed commit, VCAP filters can appear on more than one list.
      If their action is "trap", they are chained on ocelot->traps via
      filter->trap_list. This is in addition to their normal placement on the
      VCAP block->rules list head.
      
      Therefore, when we free a VCAP filter, we must remove it from all lists
      it is a member of, including ocelot->traps.
      
      There are at least 2 bugs which are direct consequences of this design
      decision.
      
      First is the incorrect usage of list_empty(), meant to denote whether
      "filter" is chained into ocelot->traps via filter->trap_list.
      This does not do the correct thing, because list_empty() checks whether
      "head->next == head", but in our case, head->next == head->prev == NULL.
      So we dereference NULL pointers and die when we call list_del().
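
      The list_empty() pitfall can be reproduced with a small userspace copy
      of the list head idea. This is a sketch, not the kernel's list.h:
      init_list_head() and list_is_empty() are local helpers written for the
      illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Minimal userspace copy of a circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An initialized, empty head points to itself. */
static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Same test as described above: "head->next == head". */
static bool list_is_empty(const struct list_head *h)
{
	return h->next == h;
}
```

      A head that was only zeroed has next == prev == NULL, so the emptiness
      test reports "not empty" and a subsequent unlink would follow NULL
      pointers. Only a head that went through the init helper reads as empty.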
      
      Second is the fact that not all places that should remove the filter
      from ocelot->traps do so. One example is ocelot_vcap_block_remove_filter(),
      which is where we have the main kfree(filter). By keeping freed filters
      in ocelot->traps we end up in a use-after-free in
      felix_update_trapping_destinations().
      
      Attempting to fix all the buggy patterns is a whack-a-mole game which
      makes the driver unmaintainable. Actually this is what the previous
      patch version attempted to do:
      https://patchwork.kernel.org/project/netdevbpf/patch/20220503115728.834457-3-vladimir.oltean@nxp.com/
      
      but it introduced another set of bugs, because there are other places
      which create VCAP filters, not just ocelot_vcap_filter_create():
      
      - ocelot_trap_add()
      - felix_tag_8021q_vlan_add_rx()
      - felix_tag_8021q_vlan_add_tx()
      
      Relying on the convention that all those code paths must call
      INIT_LIST_HEAD(&filter->trap_list) is not going to scale.
      
      So let's do what should have been done in the first place and keep a
      bool in struct ocelot_vcap_filter which denotes whether we are looking
      at a trapping rule or not. Iterating now happens over the main VCAP IS2
      block->rules. The advantage is that we no longer risk having stale
      references to a freed filter, since it is only present in that list.
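
      The flag-based approach can be sketched as follows. struct rule, its
      fields, count_traps() and remove_rule() are hypothetical and use a
      simple singly linked list, not the driver's structures. The point is
      that a rule lives on exactly one list, so unlinking it there leaves no
      second reference behind.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical rule: membership in one list plus a trap flag. */
struct rule {
	struct rule *next;
	bool is_trap;
};

/* Trap rules are found by filtering the single main list. */
static size_t count_traps(const struct rule *head)
{
	size_t n = 0;

	for (const struct rule *r = head; r; r = r->next)
		if (r->is_trap)
			n++;
	return n;
}

/* Unlink victim from the list and return the (possibly new) head. */
static struct rule *remove_rule(struct rule *head, struct rule *victim)
{
	struct rule **pp = &head;

	while (*pp && *pp != victim)
		pp = &(*pp)->next;
	if (*pp)
		*pp = victim->next;
	return head;
}
```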
      
      Fixes: e42bd4ed ("net: mscc: ocelot: keep traps in a list")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • MAINTAINERS: add missing files for bonding definition · 4e707344
      Jonathan Toppins authored
      The bonding entry did not include additional include files that have
      been added nor did it reference the documentation. Add these references
      for completeness.
      
      Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
      Link: https://lore.kernel.org/r/903ed2906b93628b38a2015664a20d2802042863.1651690748.git.jtoppins@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: Fix features skip in for_each_netdev_feature() · 85db6352
      Tariq Toukan authored
      The find_next_netdev_feature() macro gets the "remaining length",
      not bit index.
      Passing "bit - 1" for the following iteration is wrong as it skips
      the adjacent bit. Pass "bit" instead.
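
      The skipped bit can be shown with a simplified userspace model. This is
      not the kernel's find_next_netdev_feature(): find_next_feature() and
      walk() are hypothetical helpers that only share the "remaining length"
      convention described above.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: highest set bit among the low `len` bits, or -1. */
static int find_next_feature(uint64_t features, int len)
{
	for (int bit = len - 1; bit >= 0; bit--)
		if (features & (1ULL << bit))
			return bit;
	return -1;
}

/* Walk the set bits from MSB to LSB, storing them in out[]; returns the
 * number of bits visited. The next "remaining length" is bit - step:
 * step == 1 models the buggy "bit - 1", step == 0 models the fixed "bit". */
static int walk(uint64_t features, int step, int *out, int max)
{
	int n = 0;

	for (int bit = find_next_feature(features, 64);
	     bit >= 0 && n < max;
	     bit = find_next_feature(features, bit - step))
		out[n++] = bit;
	return n;
}
```

      With bits 5, 4 and 2 set, the fixed walk visits 5, 4, 2, while the
      buggy walk visits only 5 and 2 because bit 4 is adjacent to bit 5.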
      
      Fixes: 3b89ea9c ("net: Fix for_each_netdev_feature on Big endian")
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Link: https://lore.kernel.org/r/20220504080914.1918-1-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'vrf-fix-address-binding-with-icmp-socket' · 690447a2
      Jakub Kicinski authored
      Nicolas Dichtel says:
      
      ====================
      vrf: fix address binding with icmp socket
      
      The first patch fixes the issue.
      The second patch adds related tests in selftests.
      ====================
      
      Link: https://lore.kernel.org/r/20220504090739.21821-1-nicolas.dichtel@6wind.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: add ping test with ping_group_range tuned · e71b7f1f
      Nicolas Dichtel authored
      The 'ping' utility is able to manage two kinds of sockets (raw or icmp),
      depending on the sysctl ping_group_range. By default, ping_group_range is
      set to '1 0', which forces ping to use an ip raw socket.
      
      Let's replay the ping tests by allowing 'ping' to use the ip icmp socket.
      After the previous patch, ipv4 test results are the same with both kinds
      of socket. For ipv6, there are a lot of new failures (the previous patch
      fixes only two cases).
      
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ping: fix address binding wrt vrf · e1a7ac6f
      Nicolas Dichtel authored
      When ping_group_range is updated, 'ping' uses the DGRAM ICMP socket,
      instead of an IP raw socket. In this case, 'ping' is unable to bind its
      socket to a local address owned by a vrflite.
      
      Before the patch:
      $ sysctl -w net.ipv4.ping_group_range='0  2147483647'
      $ ip link add blue type vrf table 10
      $ ip link add foo type dummy
      $ ip link set foo master blue
      $ ip link set foo up
      $ ip addr add 192.168.1.1/24 dev foo
      $ ip addr add 2001::1/64 dev foo
      $ ip vrf exec blue ping -c1 -I 192.168.1.1 192.168.1.2
      ping: bind: Cannot assign requested address
      $ ip vrf exec blue ping6 -c1 -I 2001::1 2001::2
      ping6: bind icmp socket: Cannot assign requested address
      
      CC: stable@vger.kernel.org
      Fixes: 1b69c6d0 ("net: Introduce L3 Master device abstraction")
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: phy: micrel: Pass .probe for KS8737 · 15f03ffe
      Fabio Estevam authored
      Since commit f1131b9c ("net: phy: micrel: use
      kszphy_suspend()/kszphy_resume for irq aware devices") the kszphy_suspend/
      resume hooks are used.
      
      These functions require the probe function to be called so that
      priv can be allocated.
      
      Otherwise, a NULL pointer dereference happens inside
      kszphy_config_reset().
      
      Cc: stable@vger.kernel.org
      Fixes: f1131b9c ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices")
      Reported-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Fabio Estevam <festevam@denx.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220504143104.1286960-2-festevam@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: phy: micrel: Do not use kszphy_suspend/resume for KSZ8061 · e333eed6
      Fabio Estevam authored
      Since commit f1131b9c ("net: phy: micrel: use
      kszphy_suspend()/kszphy_resume for irq aware devices") the following
      NULL pointer dereference is observed on a board with KSZ8061:
      
       # udhcpc -i eth0
      udhcpc: started, v1.35.0
      8<--- cut here ---
      Unable to handle kernel NULL pointer dereference at virtual address 00000008
      pgd = f73cef4e
      [00000008] *pgd=00000000
      Internal error: Oops: 5 [#1] SMP ARM
      Modules linked in:
      CPU: 0 PID: 196 Comm: ifconfig Not tainted 5.15.37-dirty #94
      Hardware name: Freescale i.MX6 SoloX (Device Tree)
      PC is at kszphy_config_reset+0x10/0x114
      LR is at kszphy_resume+0x24/0x64
      ...
      
      The KSZ8061 phy_driver structure does not have the .probe/.driver_data
      fields, which means that priv is not allocated.
      
      This causes the NULL pointer dereference inside kszphy_config_reset().
      
      Fix the problem by using the generic suspend/resume functions as before.
      
      Another alternative would be to provide the .probe and .driver_data
      information into the structure, but to be on the safe side, let's
      just restore Ethernet functionality by using the generic suspend/resume.
      
      Cc: stable@vger.kernel.org
      Fixes: f1131b9c ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices")
      Signed-off-by: Fabio Estevam <festevam@denx.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220504143104.1286960-1-festevam@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: rds: use maybe_get_net() when acquiring refcount on TCP sockets · 6997fbd7
      Tetsuo Handa authored
      Eric Dumazet is reporting an addition-on-0 problem in rds_tcp_tune(),
      because delayed works queued in rds_wq might be invoked after a net
      namespace's refcount has already reached 0.
      
      Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq),
      it is guaranteed that we can instead use maybe_get_net() from delayed work
      functions until rds_tcp_exit_net() returns.
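
      The difference between an unconditional get and a "maybe" get can be
      sketched with C11 atomics. This is a userspace illustration only:
      struct obj and obj_get_unless_zero() are hypothetical and are not the
      kernel's refcount or netns code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object. */
struct obj {
	atomic_int refs;
};

/* Take a reference only while the count is still non-zero. Returns false
 * once the count has reached 0, so the caller must not touch the object. */
static bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refs);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
			return true;
	}
	return false;
}
```

      A caller in the style of a delayed work function would check the return
      value and skip the work when the reference could not be taken, instead
      of incrementing a count that is already 0.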
      
      Note that I'm not convinced that all works which might access a net
      namespace are already queued in rds_wq by the moment rds_tcp_exit_net()
      calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net()
      will fail to wait for work functions, and kmem_cache_free() could be
      called from net_free() before maybe_get_net() is called from
      rds_tcp_tune().
      
      Reported-by: Eric Dumazet <edumazet@google.com>
      Fixes: 3a58f13a ("net: rds: acquire refcount on TCP sockets")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/41d09faf-bc78-1a87-dfd1-c6d1b5984b61@I-love.SAKURA.ne.jp
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 68533eb1
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from can, rxrpc and wireguard.
      
        Previous releases - regressions:
      
         - igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter()
      
         - mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()
      
         - rds: acquire netns refcount on TCP sockets
      
         - rxrpc: enable IPv6 checksums on transport socket
      
         - nic: hinic: fix bug of wq out of bound access
      
         - nic: thunder: don't use pci_irq_vector() in atomic context
      
         - nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS
           flag
      
         - nic: mlx5e:
            - lag, fix use-after-free in fib event handler
            - fix deadlock in sync reset flow
      
        Previous releases - always broken:
      
         - tcp: fix insufficient TCP source port randomness
      
         - can: grcan: grcan_close(): fix deadlock
      
         - nfc: reorder destructive operations to avoid bugs
      
        Misc:
      
         - wireguard: improve selftests reliability"
      
      * tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
        NFC: netlink: fix sleep in atomic bug when firmware download timeout
        selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer
        tcp: drop the hash_32() part from the index calculation
        tcp: increase source port perturb table to 2^16
        tcp: dynamically allocate the perturb table used by source ports
        tcp: add small random increments to the source port
        tcp: resalt the secret every 10 seconds
        tcp: use different parts of the port_offset for index and offset
        secure_seq: use the 64 bits of the siphash for port offset calculation
        wireguard: selftests: set panic_on_warn=1 from cmdline
        wireguard: selftests: bump package deps
        wireguard: selftests: restore support for ccache
        wireguard: selftests: use newer toolchains to fill out architectures
        wireguard: selftests: limit parallelism to $(nproc) tests at once
        wireguard: selftests: make routing loop test non-fatal
        net/mlx5: Fix matching on inner TTC
        net/mlx5: Avoid double clear or set of sync reset requested
        net/mlx5: Fix deadlock in sync reset flow
        net/mlx5e: Fix trust state reset in reload
        net/mlx5e: Avoid checking offload capability in post_parse action
        ...
  2. May 05, 2022
  3. May 04, 2022