Skip to content
  1. Sep 23, 2021
    • Vladimir Oltean's avatar
      net: mscc: ocelot: fix forwarding from BLOCKING ports remaining enabled · acc64f52
      Vladimir Oltean authored
      The blamed commit made the fatally incorrect assumption that ports which
      aren't in the FORWARDING STP state should not have packets forwarded
      towards them, and that is all that needs to be done.
      
      However, that logic alone permits BLOCKING ports to forward to
      FORWARDING ports, which of course allows packet storms to occur when
      there is an L2 loop.
      
      The ocelot_get_bridge_fwd_mask should not only ask "what can the bridge
      do for you", but "what can you do for the bridge". This way, only
      FORWARDING ports forward to the other FORWARDING ports from the same
      bridging domain, and we are still compatible with the idea of multiple
      bridges.
      
      Fixes: df291e54
      
       ("net: ocelot: support multiple bridges")
      Suggested-by: default avatarColin Foster <colin.foster@in-advantage.com>
      Reported-by: default avatarColin Foster <colin.foster@in-advantage.com>
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarColin Foster <colin.foster@in-advantage.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      acc64f52
    • Felix Fietkau's avatar
      net: ethernet: mtk_eth_soc: avoid creating duplicate offload entries · e68daf61
      Felix Fietkau authored
      Sometimes multiple CLS_REPLACE calls are issued for the same connection.
      rhashtable_insert_fast does not check for these duplicates, so multiple
      hardware flow entries can be created.
      Fix this by checking for an existing entry early
      
      Fixes: 502e84e2
      
       ("net: ethernet: mtk_eth_soc: add flow offloading support")
      Signed-off-by: default avatarFelix Fietkau <nbd@nbd.name>
      Signed-off-by: default avatarIlya Lipnitskiy <ilya.lipnitskiy@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e68daf61
    • Mark Brown's avatar
      nfc: st-nci: Add SPI ID matching DT compatible · 31339440
      Mark Brown authored
      Currently autoloading for SPI devices does not use the DT ID table, it uses
      SPI modalises. Supporting OF modalises is going to be difficult if not
      impractical, an attempt was made but has been reverted, so ensure that
      module autoloading works for this driver by adding the part name used in
      the compatible to the list of SPI IDs.
      
      Fixes: 96c8395e
      
       ("spi: Revert modalias changes")
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      31339440
    • Guvenc Gulce's avatar
      MAINTAINERS: remove Guvenc Gulce as net/smc maintainer · 5b099870
      Guvenc Gulce authored
      
      
      Remove myself as net/smc maintainer, as I am
      leaving IBM soon and can not maintain net/smc anymore.
      
      Cc: Julian Wiedmann <jwi@linux.ibm.com>
      Acked-by: default avatarKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: default avatarGuvenc Gulce <guvenc@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5b099870
    • Ido Schimmel's avatar
      nexthop: Fix memory leaks in nexthop notification chain listeners · 3106a084
      Ido Schimmel authored
      syzkaller discovered memory leaks [1] that can be reduced to the
      following commands:
      
       # ip nexthop add id 1 blackhole
       # devlink dev reload pci/0000:06:00.0
      
      As part of the reload flow, mlxsw will unregister its netdevs and then
      unregister from the nexthop notification chain. Before unregistering
      from the notification chain, mlxsw will receive delete notifications for
      nexthop objects using netdevs registered by mlxsw or their uppers. mlxsw
      will not receive notifications for nexthops using netdevs that are not
      dismantled as part of the reload flow. For example, the blackhole
      nexthop above that internally uses the loopback netdev as its nexthop
      device.
      
      One way to fix this problem is to have listeners flush their nexthop
      tables after unregistering from the notification chain. This is
      error-prone as evident by this patch and also not symmetric with the
      registration path where a listener receives a dump of all the existing
      nexthops.
      
      Therefore, fix this problem by replaying delete notifications for the
      listener being unregistered. This is symmetric to the registration path
      and also consistent with the netdev notification chain.
      
      The above means that unregister_nexthop_notifier(), like
      register_nexthop_notifier(), will have to take RTNL in order to iterate
      over the existing nexthops and that any callers of the function cannot
      hold RTNL. This is true for mlxsw and netdevsim, but not for the VXLAN
      driver. To avoid a deadlock, change the latter to unregister its nexthop
      listener without holding RTNL, making it symmetric to the registration
      path.
      
      [1]
      unreferenced object 0xffff88806173d600 (size 512):
        comm "syz-executor.0", pid 1290, jiffies 4295583142 (age 143.507s)
        hex dump (first 32 bytes):
          41 9d 1e 60 80 88 ff ff 08 d6 73 61 80 88 ff ff  A..`......sa....
          08 d6 73 61 80 88 ff ff 01 00 00 00 00 00 00 00  ..sa............
        backtrace:
          [<ffffffff81a6b576>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
          [<ffffffff81a6b576>] slab_post_alloc_hook+0x96/0x490 mm/slab.h:522
          [<ffffffff81a716d3>] slab_alloc_node mm/slub.c:3206 [inline]
          [<ffffffff81a716d3>] slab_alloc mm/slub.c:3214 [inline]
          [<ffffffff81a716d3>] kmem_cache_alloc_trace+0x163/0x370 mm/slub.c:3231
          [<ffffffff82e8681a>] kmalloc include/linux/slab.h:591 [inline]
          [<ffffffff82e8681a>] kzalloc include/linux/slab.h:721 [inline]
          [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_group_create drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:4918 [inline]
          [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_new drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:5054 [inline]
          [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_event+0x59a/0x2910 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:5239
          [<ffffffff813ef67d>] notifier_call_chain+0xbd/0x210 kernel/notifier.c:83
          [<ffffffff813f0662>] blocking_notifier_call_chain kernel/notifier.c:318 [inline]
          [<ffffffff813f0662>] blocking_notifier_call_chain+0x72/0xa0 kernel/notifier.c:306
          [<ffffffff8384b9c6>] call_nexthop_notifiers+0x156/0x310 net/ipv4/nexthop.c:244
          [<ffffffff83852bd8>] insert_nexthop net/ipv4/nexthop.c:2336 [inline]
          [<ffffffff83852bd8>] nexthop_add net/ipv4/nexthop.c:2644 [inline]
          [<ffffffff83852bd8>] rtm_new_nexthop+0x14e8/0x4d10 net/ipv4/nexthop.c:2913
          [<ffffffff833e9a78>] rtnetlink_rcv_msg+0x448/0xbf0 net/core/rtnetlink.c:5572
          [<ffffffff83608703>] netlink_rcv_skb+0x173/0x480 net/netlink/af_netlink.c:2504
          [<ffffffff833de032>] rtnetlink_rcv+0x22/0x30 net/core/rtnetlink.c:5590
          [<ffffffff836069de>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
          [<ffffffff836069de>] netlink_unicast+0x5ae/0x7f0 net/netlink/af_netlink.c:1340
          [<ffffffff83607501>] netlink_sendmsg+0x8e1/0xe30 net/netlink/af_netlink.c:1929
          [<ffffffff832fde84>] sock_sendmsg_nosec net/socket.c:704 [inline]
          [<ffffffff832fde84>] sock_sendmsg net/socket.c:724 [inline]
          [<ffffffff832fde84>] ____sys_sendmsg+0x874/0x9f0 net/socket.c:2409
          [<ffffffff83304a44>] ___sys_sendmsg+0x104/0x170 net/socket.c:2463
          [<ffffffff83304c01>] __sys_sendmsg+0x111/0x1f0 net/socket.c:2492
          [<ffffffff83304d5d>] __do_sys_sendmsg net/socket.c:2501 [inline]
          [<ffffffff83304d5d>] __se_sys_sendmsg net/socket.c:2499 [inline]
          [<ffffffff83304d5d>] __x64_sys_sendmsg+0x7d/0xc0 net/socket.c:2499
      
      Fixes: 2a014b20
      
       ("mlxsw: spectrum_router: Add support for nexthop objects")
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarPetr Machata <petrm@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3106a084
  2. Sep 22, 2021
    • Paolo Abeni's avatar
      mptcp: ensure tx skbs always have the MPTCP ext · 977d293e
      Paolo Abeni authored
      
      
      Due to signed/unsigned comparison, the expression:
      
      	info->size_goal - skb->len > 0
      
      evaluates to true when the size goal is smaller than the
      skb size. That results in lack of tx cache refill, so that
      the skb allocated by the core TCP code lacks the required
      MPTCP skb extensions.
      
      Due to the above, syzbot is able to trigger the following WARN_ON():
      
      WARNING: CPU: 1 PID: 810 at net/mptcp/protocol.c:1366 mptcp_sendmsg_frag+0x1362/0x1bc0 net/mptcp/protocol.c:1366
      Modules linked in:
      CPU: 1 PID: 810 Comm: syz-executor.4 Not tainted 5.14.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:mptcp_sendmsg_frag+0x1362/0x1bc0 net/mptcp/protocol.c:1366
      Code: ff 4c 8b 74 24 50 48 8b 5c 24 58 e9 0f fb ff ff e8 13 44 8b f8 4c 89 e7 45 31 ed e8 98 57 2e fe e9 81 f4 ff ff e8 fe 43 8b f8 <0f> 0b 41 bd ea ff ff ff e9 6f f4 ff ff 4c 89 e7 e8 b9 8e d2 f8 e9
      RSP: 0018:ffffc9000531f6a0 EFLAGS: 00010216
      RAX: 000000000000697f RBX: 0000000000000000 RCX: ffffc90012107000
      RDX: 0000000000040000 RSI: ffffffff88eac9e2 RDI: 0000000000000003
      RBP: ffff888078b15780 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffffff88eac017 R11: 0000000000000000 R12: ffff88801de0a280
      R13: 0000000000006b58 R14: ffff888066278280 R15: ffff88803c2fe9c0
      FS:  00007fd9f866e700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007faebcb2f718 CR3: 00000000267cb000 CR4: 00000000001506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       __mptcp_push_pending+0x1fb/0x6b0 net/mptcp/protocol.c:1547
       mptcp_release_cb+0xfe/0x210 net/mptcp/protocol.c:3003
       release_sock+0xb4/0x1b0 net/core/sock.c:3206
       sk_stream_wait_memory+0x604/0xed0 net/core/stream.c:145
       mptcp_sendmsg+0xc39/0x1bc0 net/mptcp/protocol.c:1749
       inet6_sendmsg+0x99/0xe0 net/ipv6/af_inet6.c:643
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:724
       sock_write_iter+0x2a0/0x3e0 net/socket.c:1057
       call_write_iter include/linux/fs.h:2163 [inline]
       new_sync_write+0x40b/0x640 fs/read_write.c:507
       vfs_write+0x7cf/0xae0 fs/read_write.c:594
       ksys_write+0x1ee/0x250 fs/read_write.c:647
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x4665f9
      Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007fd9f866e188 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000056c038 RCX: 00000000004665f9
      RDX: 00000000000e7b78 RSI: 0000000020000000 RDI: 0000000000000003
      RBP: 00000000004bfcc4 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000056c038
      R13: 0000000000a9fb1f R14: 00007fd9f866e300 R15: 0000000000022000
      
      Fix the issue rewriting the relevant expression to avoid
      sign-related problems - note: size_goal is always >= 0.
      
      Additionally, ensure that the skb in the tx cache always carries
      the relevant extension.
      
      Reported-and-tested-by: default avatar <syzbot+263a248eec3e875baa7b@syzkaller.appspotmail.com>
      Fixes: 1094c6fe
      
       ("mptcp: fix possible divide by zero")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      977d293e
    • Shai Malin's avatar
      qed: rdma - don't wait for resources under hw error recovery flow · 1ea78123
      Shai Malin authored
      If the HW device is during recovery, the HW resources will never return,
      hence we shouldn't wait for the CID (HW context ID) bitmaps to clear.
      This fix speeds up the error recovery flow.
      
      Fixes: 64515dc8
      
       ("qed: Add infrastructure for error detection and recovery")
      Signed-off-by: default avatarMichal Kalderon <mkalderon@marvell.com>
      Signed-off-by: default avatarAriel Elior <aelior@marvell.com>
      Signed-off-by: default avatarShai Malin <smalin@marvell.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1ea78123
    • Jakub Kicinski's avatar
      Merge branch 's390-qeth-fixes-2021-09-21' · b52d3161
      Jakub Kicinski authored
      Julian Wiedmann says:
      
      ====================
      s390/qeth: fixes 2021-09-21
      
      This brings two fixes for deadlocks when a device is removed while it
      has certain types of async work pending. And one additional fix for a
      missing NULL check in an error case.
      ====================
      
      Link: https://lore.kernel.org/r/20210921145217.1584654-1-jwi@linux.ibm.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      b52d3161
    • Alexandra Winter's avatar
      s390/qeth: fix deadlock during failing recovery · d2b59bd4
      Alexandra Winter authored
      Commit 0b9902c1 ("s390/qeth: fix deadlock during recovery") removed
      taking discipline_mutex inside qeth_do_reset(), fixing potential
      deadlocks. An error path was missed though, that still takes
      discipline_mutex and thus has the original deadlock potential.
      
      Intermittent deadlocks were seen when a qeth channel path is configured
      offline, causing a race between qeth_do_reset and ccwgroup_remove.
      Call qeth_set_offline() directly in the qeth_do_reset() error case and
      then a new variant of ccwgroup_set_offline(), without taking
      discipline_mutex.
      
      Fixes: b41b554c
      
       ("s390/qeth: fix locking for discipline setup / removal")
      Signed-off-by: default avatarAlexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d2b59bd4
    • Alexandra Winter's avatar
      s390/qeth: Fix deadlock in remove_discipline · ee909d0b
      Alexandra Winter authored
      Problem: qeth_close_dev_handler is a worker that tries to acquire
      card->discipline_mutex via drv->set_offline() in ccwgroup_set_offline().
      Since commit b41b554c
      ("s390/qeth: fix locking for discipline setup / removal")
      qeth_remove_discipline() is called under card->discipline_mutex and
      cancels the work and waits for it to finish.
      
      STOPLAN reception with reason code IPA_RC_VEPA_TO_VEB_TRANSITION is the
      only situation that schedules close_dev_work. In that situation scheduling
      qeth recovery will also result in an offline interface, when resetting the
      isolation mode fails, if the external switch is still set to VEB.
      And since commit 0b9902c1 ("s390/qeth: fix deadlock during recovery")
      qeth recovery does not aquire card->discipline_mutex anymore.
      
      So we accept the longer pathlength of qeth_schedule_recovery in this
      error situation and re-use the existing function.
      
      As a side-benefit this changes the hwtrap to behave like during recovery
      instead of like during a user-triggered set_offline.
      
      Fixes: b41b554c
      
       ("s390/qeth: fix locking for discipline setup / removal")
      Signed-off-by: default avatarAlexandra Winter <wintera@linux.ibm.com>
      Acked-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ee909d0b
    • Julian Wiedmann's avatar
      s390/qeth: fix NULL deref in qeth_clear_working_pool_list() · 248f064a
      Julian Wiedmann authored
      When qeth_set_online() calls qeth_clear_working_pool_list() to roll
      back after an error exit from qeth_hardsetup_card(), we are at risk of
      accessing card->qdio.in_q before it was allocated by
      qeth_alloc_qdio_queues() via qeth_mpc_initialize().
      
      qeth_clear_working_pool_list() then dereferences NULL, and by writing to
      queue->bufs[i].pool_entry scribbles all over the CPU's lowcore.
      Resulting in a crash when those lowcore areas are used next (eg. on
      the next machine-check interrupt).
      
      Such a scenario would typically happen when the device is first set
      online and its queues aren't allocated yet. An early IO error or certain
      misconfigs (eg. mismatched transport mode, bad portno) then cause us to
      error out from qeth_hardsetup_card() with card->qdio.in_q still being
      NULL.
      
      Fix it by checking the pointer for NULL before accessing it.
      
      Note that we also have (rare) paths inside qeth_mpc_initialize() where
      a configuration change can cause us to free the existing queues,
      expecting that subsequent code will allocate them again. If we then
      error out before that re-allocation happens, the same bug occurs.
      
      Fixes: eff73e16
      
       ("s390/qeth: tolerate pre-filled RX buffer")
      Reported-by: default avatarStefan Raspl <raspl@linux.ibm.com>
      Root-caused-by: default avatarHeiko Carstens <hca@linux.ibm.com>
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Reviewed-by: default avatarAlexandra Winter <wintera@linux.ibm.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      248f064a
  3. Sep 21, 2021
    • David S. Miller's avatar
      Merge branch 'dsa-devres' · b3f98404
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Fix mdiobus users with devres
      
      Commit ac3a68d5
      
       ("net: phy: don't abuse devres in
      devm_mdiobus_register()") by Bartosz Golaszewski has introduced two
      classes of potential bugs by making the devres callback of
      devm_mdiobus_alloc stop calling mdiobus_unregister.
      
      The exact buggy circumstances are presented in the individual commit
      messages. I have searched the tree for other occurrences, but at the
      moment:
      
      - for issue (a) I have no concrete proof that other buses except SPI and
        I2C suffer from it, and the only SPI or I2C device drivers that call
        of_mdiobus_alloc are the DSA drivers that leave a NULL
        ds->slave_mii_bus and a non-NULL ds->ops->phy_read, aka ksz9477,
        ksz8795, lan9303_i2c, vsc73xx-spi.
      
      - for issue (b), all drivers which call of_mdiobus_alloc either use
        of_mdiobus_register too, or call mdiobus_unregister sometime within
        the ->remove path.
      
      Although at this point I've seen enough strangeness caused by this
      "device_del during ->shutdown" that I'm just going to copy the SPI and
      I2C subsystem maintainers to this patch series, to get their feedback
      whether they've had reports about things like this before. I don't think
      other buses behave in this way, it forces SPI and I2C devices to have to
      protect themselves from a really strange set of issues.
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b3f98404
    • Vladimir Oltean's avatar
      net: dsa: realtek: register the MDIO bus under devres · 74b6d7d1
      Vladimir Oltean authored
      The Linux device model permits both the ->shutdown and ->remove driver
      methods to get called during a shutdown procedure. Example: a DSA switch
      which sits on an SPI bus, and the SPI bus driver calls this on its
      ->shutdown method:
      
      spi_unregister_controller
      -> device_for_each_child(&ctlr->dev, NULL, __unregister);
         -> spi_unregister_device(to_spi_device(dev));
            -> device_del(&spi->dev);
      
      So this is a simple pattern which can theoretically appear on any bus,
      although the only other buses on which I've been able to find it are
      I2C:
      
      i2c_del_adapter
      -> device_for_each_child(&adap->dev, NULL, __unregister_client);
         -> i2c_unregister_device(client);
            -> device_unregister(&client->dev);
      
      The implication of this pattern is that devices on these buses can be
      unregistered after having been shut down. The drivers for these devices
      might choose to return early either from ->remove or ->shutdown if the
      other callback has already run once, and they might choose that the
      ->shutdown method should only perform a subset of the teardown done by
      ->remove (to avoid unnecessary delays when rebooting).
      
      So in other words, the device driver may choose on ->remove to not
      do anything (therefore to not unregister an MDIO bus it has registered
      on ->probe), because this ->remove is actually triggered by the
      device_shutdown path, and its ->shutdown method has already run and done
      the minimally required cleanup.
      
      This used to be fine until the blamed commit, but now, the following
      BUG_ON triggers:
      
      void mdiobus_free(struct mii_bus *bus)
      {
      	/* For compatibility with error handling in drivers. */
      	if (bus->state == MDIOBUS_ALLOCATED) {
      		kfree(bus);
      		return;
      	}
      
      	BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
      	bus->state = MDIOBUS_RELEASED;
      
      	put_device(&bus->dev);
      }
      
      In other words, there is an attempt to free an MDIO bus which was not
      unregistered. The attempt to free it comes from the devres release
      callbacks of the SPI device, which are executed after the device is
      unregistered.
      
      I'm not saying that the fact that MDIO buses allocated using devres
      would automatically get unregistered wasn't strange. I'm just saying
      that the commit didn't care about auditing existing call paths in the
      kernel, and now, the following code sequences are potentially buggy:
      
      (a) devm_mdiobus_alloc followed by plain mdiobus_register, for a device
          located on a bus that unregisters its children on shutdown. After
          the blamed patch, either both the alloc and the register should use
          devres, or none should.
      
      (b) devm_mdiobus_alloc followed by plain mdiobus_register, and then no
          mdiobus_unregister at all in the remove path. After the blamed
          patch, nobody unregisters the MDIO bus anymore, so this is even more
          buggy than the previous case which needs a specific bus
          configuration to be seen, this one is an unconditional bug.
      
      In this case, the Realtek drivers fall under category (b). To solve it,
      we can register the MDIO bus under devres too, which restores the
      previous behavior.
      
      Fixes: ac3a68d5
      
       ("net: phy: don't abuse devres in devm_mdiobus_register()")
      Reported-by: default avatarLino Sanfilippo <LinoSanfilippo@gmx.de>
      Reported-by: default avatarAlvin Šipraga <alsi@bang-olufsen.dk>
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      74b6d7d1
    • Vladimir Oltean's avatar
      net: dsa: don't allocate the slave_mii_bus using devres · 5135e96a
      Vladimir Oltean authored
      The Linux device model permits both the ->shutdown and ->remove driver
      methods to get called during a shutdown procedure. Example: a DSA switch
      which sits on an SPI bus, and the SPI bus driver calls this on its
      ->shutdown method:
      
      spi_unregister_controller
      -> device_for_each_child(&ctlr->dev, NULL, __unregister);
         -> spi_unregister_device(to_spi_device(dev));
            -> device_del(&spi->dev);
      
      So this is a simple pattern which can theoretically appear on any bus,
      although the only other buses on which I've been able to find it are
      I2C:
      
      i2c_del_adapter
      -> device_for_each_child(&adap->dev, NULL, __unregister_client);
         -> i2c_unregister_device(client);
            -> device_unregister(&client->dev);
      
      The implication of this pattern is that devices on these buses can be
      unregistered after having been shut down. The drivers for these devices
      might choose to return early either from ->remove or ->shutdown if the
      other callback has already run once, and they might choose that the
      ->shutdown method should only perform a subset of the teardown done by
      ->remove (to avoid unnecessary delays when rebooting).
      
      So in other words, the device driver may choose on ->remove to not
      do anything (therefore to not unregister an MDIO bus it has registered
      on ->probe), because this ->remove is actually triggered by the
      device_shutdown path, and its ->shutdown method has already run and done
      the minimally required cleanup.
      
      This used to be fine until the blamed commit, but now, the following
      BUG_ON triggers:
      
      void mdiobus_free(struct mii_bus *bus)
      {
      	/* For compatibility with error handling in drivers. */
      	if (bus->state == MDIOBUS_ALLOCATED) {
      		kfree(bus);
      		return;
      	}
      
      	BUG_ON(bus->state != MDIOBUS_UNREGISTERED);
      	bus->state = MDIOBUS_RELEASED;
      
      	put_device(&bus->dev);
      }
      
      In other words, there is an attempt to free an MDIO bus which was not
      unregistered. The attempt to free it comes from the devres release
      callbacks of the SPI device, which are executed after the device is
      unregistered.
      
      I'm not saying that the fact that MDIO buses allocated using devres
      would automatically get unregistered wasn't strange. I'm just saying
      that the commit didn't care about auditing existing call paths in the
      kernel, and now, the following code sequences are potentially buggy:
      
      (a) devm_mdiobus_alloc followed by plain mdiobus_register, for a device
          located on a bus that unregisters its children on shutdown. After
          the blamed patch, either both the alloc and the register should use
          devres, or none should.
      
      (b) devm_mdiobus_alloc followed by plain mdiobus_register, and then no
          mdiobus_unregister at all in the remove path. After the blamed
          patch, nobody unregisters the MDIO bus anymore, so this is even more
          buggy than the previous case which needs a specific bus
          configuration to be seen, this one is an unconditional bug.
      
      In this case, DSA falls into category (a), it tries to be helpful and
      registers an MDIO bus on behalf of the switch, which might be on such a
      bus. I've no idea why it does it under devres.
      
      It does this on probe:
      
      	if (!ds->slave_mii_bus && ds->ops->phy_read)
      		alloc and register mdio bus
      
      and this on remove:
      
      	if (ds->slave_mii_bus && ds->ops->phy_read)
      		unregister mdio bus
      
      I _could_ imagine using devres because the condition used on remove is
      different than the condition used on probe. So strictly speaking, DSA
      cannot determine whether the ds->slave_mii_bus it sees on remove is the
      ds->slave_mii_bus that _it_ has allocated on probe. Using devres would
      have solved that problem. But nonetheless, the existing code already
      proceeds to unregister the MDIO bus, even though it might be
      unregistering an MDIO bus it has never registered. So I can only guess
      that no driver that implements ds->ops->phy_read also allocates and
      registers ds->slave_mii_bus itself.
      
      So in that case, if unregistering is fine, freeing must be fine too.
      
      Stop using devres and free the MDIO bus manually. This will make devres
      stop attempting to free a still registered MDIO bus on ->shutdown.
      
      Fixes: ac3a68d5
      
       ("net: phy: don't abuse devres in devm_mdiobus_register()")
      Reported-by: default avatarLino Sanfilippo <LinoSanfilippo@gmx.de>
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Tested-by: default avatarLino Sanfilippo <LinoSanfilippo@gmx.de>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5135e96a
    • Masanari Iida's avatar
      Doc: networking: Fox a typo in ice.rst · 3e95cfa2
      Masanari Iida authored
      
      
      This patch fixes a spelling typo in ice.rst
      
      Signed-off-by: default avatarMasanari Iida <standby24x7@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3e95cfa2
    • Vladimir Oltean's avatar
      net: dsa: fix dsa_tree_setup error path · e5845aa0
      Vladimir Oltean authored
      Since the blamed commit, dsa_tree_teardown_switches() was split into two
      smaller functions, dsa_tree_teardown_switches and dsa_tree_teardown_ports.
      
      However, the error path of dsa_tree_setup stopped calling dsa_tree_teardown_ports.
      
      Fixes: a57d8c21
      
       ("net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e5845aa0
    • David S. Miller's avatar
      Merge branch 'smc-fixes' · 431db53c
      David S. Miller authored
      
      
      Karsten Graul says:
      
      ====================
      net/smc: fixes 2021-09-20
      
      Please apply the following patches for smc to netdev's net tree.
      
      The first patch adds a missing error check, and the second patch
      fixes a possible leak of a lock in a worker.
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      431db53c
    • Karsten Graul's avatar
      net/smc: fix 'workqueue leaked lock' in smc_conn_abort_work · a18cee47
      Karsten Graul authored
      The abort_work is scheduled when a connection was detected to be
      out-of-sync after a link failure. The work calls smc_conn_kill(),
      which calls smc_close_active_abort() and that might end up calling
      smc_close_cancel_work().
      smc_close_cancel_work() cancels any pending close_work and tx_work but
      needs to release the sock_lock before and acquires the sock_lock again
      afterwards. So when the sock_lock was NOT acquired before then it may
      be held after the abort_work completes. Thats why the sock_lock is
      acquired before the call to smc_conn_kill() in __smc_lgr_terminate(),
      but this is missing in smc_conn_abort_work().
      
      Fix that by acquiring the sock_lock first and release it after the
      call to smc_conn_kill().
      
      Fixes: b286a065
      
       ("net/smc: handle incoming CDC validation message")
      Signed-off-by: default avatarKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a18cee47
    • Karsten Graul's avatar
      net/smc: add missing error check in smc_clc_prfx_set() · 6c907319
      Karsten Graul authored
      Coverity stumbled over a missing error check in smc_clc_prfx_set():
      
      *** CID 1475954:  Error handling issues  (CHECKED_RETURN)
      /net/smc/smc_clc.c: 233 in smc_clc_prfx_set()
      >>>     CID 1475954:  Error handling issues  (CHECKED_RETURN)
      >>>     Calling "kernel_getsockname" without checking return value (as is done elsewhere 8 out of 10 times).
      233     	kernel_getsockname(clcsock, (struct sockaddr *)&addrs);
      
      Add the return code check in smc_clc_prfx_set().
      
      Fixes: c246d942
      
       ("net/smc: restructure netinfo for CLC proposal msgs")
      Reported-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6c907319
  4. Sep 20, 2021
    • David S. Miller's avatar
      Merge branch 'hns3-fixes' · 36747c96
      David S. Miller authored
      
      
      Guangbin Huang says:
      
      ====================
      net: hns3: add some fixes for -net
      
      This series adds some fixes for the HNS3 ethernet driver.
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      36747c96
    • Yufeng Mo's avatar
      net: hns3: fix a return value error in hclge_get_reset_status() · 5126b9d3
      Yufeng Mo authored
      hclge_get_reset_status() should return the tqp reset status.
      However, if the CMDQ fails, the caller will take it as tqp reset
      success status by mistake. Therefore, uses a parameters to get
      the tqp reset status instead.
      
      Fixes: 46a3df9f
      
       ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
      Signed-off-by: default avatarYufeng Mo <moyufeng@huawei.com>
      Signed-off-by: default avatarGuangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5126b9d3
    • liaoguojia's avatar
      net: hns3: check vlan id before using it · ef39d632
      liaoguojia authored
      The input parameters may not be reliable, so check the vlan id before
      using it, otherwise may set wrong vlan id into hardware.
      
      Fixes: dc8131d8
      
       ("net: hns3: Fix for packet loss due wrong filter config in VLAN tbls")
      Signed-off-by: default avatarliaoguojia <liaoguojia@huawei.com>
      Signed-off-by: default avatarGuangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ef39d632
    • Yufeng Mo's avatar
      net: hns3: check queue id range before using · 63b1279d
      Yufeng Mo authored
      The input parameters may not be reliable. Before using the
      queue id, we should check this parameter. Otherwise, memory
      overwriting may occur.
      
      Fixes: d3410018
      
       ("net: hns3: refactor the mailbox message between PF and VF")
      Signed-off-by: default avatarYufeng Mo <moyufeng@huawei.com>
      Signed-off-by: default avatarGuangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      63b1279d
    • Jiaran Zhang's avatar
      net: hns3: fix misuse vf id and vport id in some logs · 311c0aaa
      Jiaran Zhang authored
      vport_id include PF and VFs, vport_id = 0 means PF, other values mean VFs.
      So the actual vf id is equal to vport_id minus 1.
      
      Some VF print logs are actually vport, and logs of vf id actually use
      vport id, so this patch fixes them.
      
      Fixes: ac887be5 ("net: hns3: change print level of RAS error log from warning to error")
      Fixes: adcf738b
      
       ("net: hns3: cleanup some print format warning")
      Signed-off-by: default avatarJiaran Zhang <zhangjiaran@huawei.com>
      Signed-off-by: default avatarGuangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      311c0aaa
    • Jian Shen's avatar
      net: hns3: fix inconsistent vf id print · 91bc0d52
      Jian Shen authored
      The vf id from ethtool is added 1 before configured to driver.
      So it's necessary to minus 1 when printing it, in order to
      keep consistent with user's configuration.
      
      Fixes: dd74f815
      
       ("net: hns3: Add support for rule add/delete for flow director")
      Signed-off-by: default avatarJian Shen <shenjian15@huawei.com>
      Signed-off-by: default avatarGuangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      91bc0d52
    • Jian Shen's avatar
      net: hns3: fix change RSS 'hfunc' ineffective issue · e184cec5
      Jian Shen authored
      When user change rss 'hfunc' without set rss 'hkey' by ethtool
      -X command, the driver will ignore the 'hfunc' for the hkey is
      NULL. It's unreasonable. So fix it.
      
      Fixes: 46a3df9f ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
      Fixes: 374ad291
      
       ("net: hns3: Add RSS general configuration support for VF")
      Signed-off-by: default avatarJian Shen <shenjian15@huawei.com>
      Signed-off-by: default avatarGuangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e184cec5
    • Arnd Bergmann's avatar
      ptp: ocp: add COMMON_CLK dependency · 42a99a0b
      Arnd Bergmann authored
      Without CONFIG_COMMON_CLK, this fails to link:
      
      arm-linux-gnueabi-ld: drivers/ptp/ptp_ocp.o: in function `ptp_ocp_register_i2c':
      ptp_ocp.c:(.text+0xcc0): undefined reference to `__clk_hw_register_fixed_rate'
      arm-linux-gnueabi-ld: ptp_ocp.c:(.text+0xcf4): undefined reference to `devm_clk_hw_register_clkdev'
      arm-linux-gnueabi-ld: drivers/ptp/ptp_ocp.o: in function `ptp_ocp_detach':
      ptp_ocp.c:(.text+0x1c24): undefined reference to `clk_hw_unregister_fixed_rate'
      
      Fixes: a7e1abad
      
       ("ptp: Add clock driver for the OpenCompute TimeCard.")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      42a99a0b
    • Michael Chan's avatar
      bnxt_en: Fix TX timeout when TX ring size is set to the smallest · 5bed8b07
      Michael Chan authored
      The smallest TX ring size we support must fit a TX SKB with MAX_SKB_FRAGS
      + 1.  Because the first TX BD for a packet is always a long TX BD, we
      need an extra TX BD to fit this packet.  Define BNXT_MIN_TX_DESC_CNT with
      this value to make this more clear.  The current code uses a minimum
      that is off by 1.  Fix it using this constant.
      
      The tx_wake_thresh to determine when to wake up the TX queue is half the
      ring size but we must have at least BNXT_MIN_TX_DESC_CNT for the next
      packet which may have maximum fragments.  So the comparison of the
      available TX BDs with tx_wake_thresh should be >= instead of > in the
      current code.  Otherwise, at the smallest ring size, we will never wake
      up the TX queue and will cause TX timeout.
      
      Fixes: c0c050c5
      
       ("bnxt_en: New Broadcom ethernet driver.")
      Reviewed-by: default avatarPavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: default avatarMichael Chan <michael.chan@broadocm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5bed8b07
    • Ido Schimmel's avatar
      nexthop: Fix division by zero while replacing a resilient group · 563f23b0
      Ido Schimmel authored
      The resilient nexthop group torture tests in fib_nexthop.sh exposed a
      possible division by zero while replacing a resilient group [1]. The
      division by zero occurs when the data path sees a resilient nexthop
      group with zero buckets.
      
      The tests replace a resilient nexthop group in a loop while traffic is
      forwarded through it. The tests do not specify the number of buckets
      while performing the replacement, resulting in the kernel allocating a
      stub resilient table (i.e, 'struct nh_res_table') with zero buckets.
      
      This table should never be visible to the data path, but the old nexthop
      group (i.e., 'oldg') might still be used by the data path when the stub
      table is assigned to it.
      
      Fix this by only assigning the stub table to the old nexthop group after
      making sure the group is no longer used by the data path.
      
      Tested with fib_nexthops.sh:
      
      Tests passed: 222
      Tests failed:   0
      
      [1]
       divide error: 0000 [#1] PREEMPT SMP KASAN
       CPU: 0 PID: 1850 Comm: ping Not tainted 5.14.0-custom-10271-ga86eb53057fe #1107
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
       RIP: 0010:nexthop_select_path+0x2d2/0x1a80
      [...]
       Call Trace:
        fib_select_multipath+0x79b/0x1530
        fib_select_path+0x8fb/0x1c10
        ip_route_output_key_hash_rcu+0x1198/0x2da0
        ip_route_output_key_hash+0x190/0x340
        ip_route_output_flow+0x21/0x120
        raw_sendmsg+0x91d/0x2e10
        inet_sendmsg+0x9e/0xe0
        __sys_sendto+0x23d/0x360
        __x64_sys_sendto+0xe1/0x1b0
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Cc: stable@vger.kernel.org
      Fixes: 283a72a5
      
       ("nexthop: Add implementation of resilient next-hop groups")
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarPetr Machata <petrm@nvidia.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      563f23b0
    • Xuan Zhuo's avatar
      napi: fix race inside napi_enable · 3765996e
      Xuan Zhuo authored
      The process will cause napi.state to contain NAPI_STATE_SCHED and
      not in the poll_list, which will cause napi_disable() to get stuck.
      
      The prefix "NAPI_STATE_" is removed in the figure below, and
      NAPI_STATE_HASHED is ignored in napi.state.
      
                            CPU0       |                   CPU1       | napi.state
      ===============================================================================
      napi_disable()                   |                              | SCHED | NPSVC
      napi_enable()                    |                              |
      {                                |                              |
          smp_mb__before_atomic();     |                              |
          clear_bit(SCHED, &n->state); |                              | NPSVC
                                       | napi_schedule_prep()         | SCHED | NPSVC
                                       | napi_poll()                  |
                                       |   napi_complete_done()       |
                                       |   {                          |
                                       |      if (n->state & (NPSVC | | (1)
                                       |               _BUSY_POLL)))  |
                                       |           return false;      |
                                       |     ................         |
                                       |   }                          | SCHED | NPSVC
                                       |                              |
          clear_bit(NPSVC, &n->state); |                              | SCHED
      }                                |                              |
                                       |                              |
      napi_schedule_prep()             |                              | SCHED | MISSED (2)
      
      (1) Here return direct. Because of NAPI_STATE_NPSVC exists.
      (2) NAPI_STATE_SCHED exists. So not add napi.poll_list to sd->poll_list
      
      Since NAPI_STATE_SCHED already exists and napi is not in the
      sd->poll_list queue, NAPI_STATE_SCHED cannot be cleared and will always
      exist.
      
      1. This will cause this queue to no longer receive packets.
      2. If you encounter napi_disable under the protection of rtnl_lock, it
         will cause the entire rtnl_lock to be locked, affecting the overall
         system.
      
      This patch uses cmpxchg to implement napi_enable(), which ensures that
      there will be no race due to the separation of clear two bits.
      
      Fixes: 2d8bff12
      
       ("netpoll: Close race condition between poll_one_napi and napi_disable")
      Signed-off-by: default avatarXuan Zhuo <xuanzhuo@linux.alibaba.com>
      Reviewed-by: default avatarDust Li <dust.li@linux.alibaba.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3765996e
  5. Sep 19, 2021
    • Shuah Khan's avatar
      selftests: net: af_unix: Fix makefile to use TEST_GEN_PROGS · e30cd812
      Shuah Khan authored
      
      
      Makefile uses TEST_PROGS instead of TEST_GEN_PROGS to define
      executables. TEST_PROGS is for shell scripts that need to be
      installed and run by the common lib.mk framework. The common
      framework doesn't touch TEST_PROGS when it does build and clean.
      
      As a result "make kselftest-clean" and "make clean" fail to remove
      executables. Run and install work because the common framework runs
      and installs TEST_PROGS. Build works because the Makefile defines
      "all" rule which is unnecessary if TEST_GEN_PROGS is used.
      
      Use TEST_GEN_PROGS so the common framework can handle build/run/
      install/clean properly.
      
      Signed-off-by: default avatarShuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e30cd812
    • Lama Kayal's avatar
      net/mlx4_en: Resolve bad operstate value · 72a3c58d
      Lama Kayal authored
      Any link state change that's done prior to net device registration
      isn't reflected on the state, thus the operational state is left
      obsolete, with 'UNKNOWN' status.
      
      To resolve the issue, query link state from FW upon open operations
      to ensure operational state is updated.
      
      Fixes: c27a02cd
      
       ("mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC")
      Signed-off-by: default avatarLama Kayal <lkayal@nvidia.com>
      Signed-off-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      72a3c58d
    • Shuah Khan's avatar
      selftests: net: af_unix: Fix incorrect args in test result msg · 48514a22
      Shuah Khan authored
      
      
      Fix the args to fprintf(). Splitting the message ends up passing
      incorrect arg for "sigurg %d" and an extra arg overall. The test
      result message ends up incorrect.
      
      test_unix_oob.c: In function ‘main’:
      test_unix_oob.c:274:43: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘char *’ [-Wformat=]
        274 |   fprintf(stderr, "Test 3 failed, sigurg %d len %d OOB %c ",
            |                                          ~^
            |                                           |
            |                                           int
            |                                          %s
        275 |   "atmark %d\n", signal_recvd, len, oob, atmark);
            |   ~~~~~~~~~~~~~
            |   |
            |   char *
      test_unix_oob.c:274:19: warning: too many arguments for format [-Wformat-extra-args]
        274 |   fprintf(stderr, "Test 3 failed, sigurg %d len %d OOB %c ",
      
      Signed-off-by: default avatarShuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      48514a22
    • Christian Lamparter's avatar
      net: bgmac-bcma: handle deferred probe error due to mac-address · 029497e6
      Christian Lamparter authored
      Due to the inclusion of nvmem handling into the mac-address getter
      function of_get_mac_address() by
      commit d01f449c ("of_net: add NVMEM support to of_get_mac_address")
      it is now possible to get a -EPROBE_DEFER return code. Which did cause
      bgmac to assign a random ethernet address.
      
      This exact issue happened on my Meraki MR32. The nvmem provider is
      an EEPROM (at24c64) which gets instantiated once the module
      driver is loaded... This happens once the filesystem becomes available.
      
      With this patch, bgmac_probe() will propagate the -EPROBE_DEFER error.
      Then the driver subsystem will reschedule the probe at a later time.
      
      Cc: Petr Štetiar <ynezz@true.cz>
      Cc: Michael Walle <michael@walle.cc>
      Fixes: d01f449c
      
       ("of_net: add NVMEM support to of_get_mac_address")
      Signed-off-by: default avatarChristian Lamparter <chunkeey@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      029497e6
    • Vladimir Oltean's avatar
      net: dsa: tear down devlink port regions when tearing down the devlink port on error · fd292c18
      Vladimir Oltean authored
      Commit 86f8b1c0 ("net: dsa: Do not make user port errors fatal")
      decided it was fine to ignore errors on certain ports that fail to
      probe, and go on with the ports that do probe fine.
      
      Commit fb6ec87f ("net: dsa: Fix type was not set for devlink port")
      noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
      called, and devlink notices after a timeout of 3600 seconds and prints a
      WARN_ON. So it went ahead to unregister the devlink port. And because
      there exists an UNUSED port flavour, we actually re-register the devlink
      port as UNUSED.
      
      Commit 08156ba4 ("net: dsa: Add devlink port regions support to
      DSA") added devlink port regions, which are set up by the driver and not
      by DSA.
      
      When we trigger the devlink port deregistration and reregistration as
      unused, devlink now prints another WARN_ON, from here:
      
      devlink_port_unregister:
      	WARN_ON(!list_empty(&devlink_port->region_list));
      
      So the port still has regions, which makes sense, because they were set
      up by the driver, and the driver doesn't know we're unregistering the
      devlink port.
      
      Somebody needs to tear them down, and optionally (actually it would be
      nice, to be consistent) set them up again for the new devlink port.
      
      But DSA's layering stays in our way quite badly here.
      
      The options I've considered are:
      
      1. Introduce a function in devlink to just change a port's type and
         flavour. No dice, devlink keeps a lot of state, it really wants the
         port to not be registered when you set its parameters, so changing
         anything can only be done by destroying what we currently have and
         recreating it.
      
      2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
         and the region returned, keep those in a list, then when the devlink
         port unregister needs to take place, the existing devlink regions are
         destroyed by DSA, and we replay the creation of new regions using the
         cached parameters. Problem: mv88e6xxx keeps the region pointers in
         chip->ports[port].region, and these will remain stale after DSA frees
         them. There are many things DSA can do, but updating mv88e6xxx's
         private pointers is not one of them.
      
      3. Just let the driver do it (i.e. introduce a very specific method
         called ds->ops->port_reinit_as_unused, which unregisters its devlink
         port devlink regions, then the old devlink port, then registers the
         new one, then the devlink port regions for it). While it does work,
         as opposed to the others, it's pretty horrible from an API
         perspective and we can do better.
      
      4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
         which in the case of mv88e6xxx must register and unregister the
         devlink port regions. Call these 2 methods when the port must be
         reinitialized as unused.
      
      Naturally, I went for the 4th approach.
      
      Fixes: 08156ba4
      
       ("net: dsa: Add devlink port regions support to DSA")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fd292c18
    • Krzysztof Kozlowski's avatar
      net: freescale: drop unneeded MODULE_ALIAS · fdb47583
      Krzysztof Kozlowski authored
      
      
      The MODULE_DEVICE_TABLE already creates proper alias for platform
      driver.  Having another MODULE_ALIAS causes the alias to be duplicated.
      
      Signed-off-by: default avatarKrzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fdb47583
    • David S. Miller's avatar
      Merge branch 'ocelot-phylink-fixes' · d614489f
      David S. Miller authored
      Colin Foster says:
      
      ====================
      ocelot phylink fixes
      
      When the ocelot driver was migrated to phylink, e6e12df6
      
       ("net:
      mscc: ocelot: convert to phylink") there were two additional writes to
      registers that became stale. One write was to DEV_CLOCK_CFG and one was
      to ANA_PFC_PCF_CFG.
      
      Both of these writes referenced the variable "speed" which originally
      was set to OCELOT_SPEED_{10,100,1000,2500}. These macros expand to
      values of 3, 2, 1, or 0, respectively. After the update, the variable
      speed is set to SPEED_{10,100,1000,2500} which expand to 10, 100, 1000,
      and 2500. So invalid values were getting written to the two registers,
      which would lead to either a lack of functionality or undefined
      funcationality.
      
      Fixing these values was the intent of v1 of this patch set - submitted
      as "[PATCH v1 net] net: ethernet: mscc: ocelot: bug fix when writing MAC
      speed"
      
      During that review it was determined that both writes were actually
      unnecessary. DEV_CLOCK_CFG is a duplicate write, so can be removed
      entirely. This was accidentally submitted as as a new, lone patch titled
      "[PATCH v1 net] net: mscc: ocelot: remove buggy duplicate write to
      DEV_CLOCK_CFG". This is part of what is considered v2 of this patch set.
      
      Additionally, the write to ANA_PFC_PFC_CFG is also unnecessary. Priority
      flow contol is disabled, so configuring it is useless and should be
      removed. This was also submitted as a new, lone patch titled "[PATCH v1
      net] net: mscc: ocelot: remove buggy and useless write to ANA_PFC_PFC_CFG".
      This is the rest of what is considered v2 of this patch set.
      
      v3
      Identical to v2, but fixes the patch numbering to v3 and submitting the
      two changes as a patch set.
      
      v2
      Note: I misunderstood and submitted two new "v1" patches instead of a
      single "v2" patch set.
      - Remove the buggy writes altogher
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d614489f
    • Colin Foster's avatar
      net: mscc: ocelot: remove buggy duplicate write to DEV_CLOCK_CFG · ba68e994
      Colin Foster authored
      When updating ocelot to use phylink, a second write to DEV_CLOCK_CFG was
      mistakenly left in. It used the variable "speed" which, previously, would
      would have been assigned a value of OCELOT_SPEED_1000. In phylink the
      variable is be SPEED_1000, which is invalid for the
      DEV_CLOCK_LINK_SPEED macro. Removing it as unnecessary and buggy.
      
      Fixes: e6e12df6
      
       ("net: mscc: ocelot: convert to phylink")
      Signed-off-by: default avatarColin Foster <colin.foster@in-advantage.com>
      Reviewed-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ba68e994
    • Colin Foster's avatar
      net: mscc: ocelot: remove buggy and useless write to ANA_PFC_PFC_CFG · 163957c4
      Colin Foster authored
      A useless write to ANA_PFC_PFC_CFG was left in while refactoring ocelot to
      phylink. Since priority flow control is disabled, writing the speed has no
      effect.
      
      Further, it was using ethtool.h SPEED_ instead of OCELOT_SPEED_ macros,
      which are incorrectly offset for GENMASK.
      
      Lastly, for priority flow control to properly function, some scenarios
      would rely on the rate adaptation from the PCS while the MAC speed would
      be fixed. So it isn't used, and even if it was, neither "speed" nor
      "mac_speed" are necessarily the correct values to be used.
      
      Fixes: e6e12df6
      
       ("net: mscc: ocelot: convert to phylink")
      Signed-off-by: default avatarColin Foster <colin.foster@in-advantage.com>
      Reviewed-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      163957c4
    • Thomas Gleixner's avatar
      net: core: Correct the sock::sk_lock.owned lockdep annotations · 2dcb96ba
      Thomas Gleixner authored
      lock_sock_fast() and lock_sock_nested() contain lockdep annotations for the
      sock::sk_lock.owned 'mutex'. sock::sk_lock.owned is not a regular mutex. It
      is just lockdep wise equivalent. In fact it's an open coded trivial mutex
      implementation with some interesting features.
      
      sock::sk_lock.slock is a regular spinlock protecting the 'mutex'
      representation sock::sk_lock.owned which is a plain boolean. If 'owned' is
      true, then some other task holds the 'mutex', otherwise it is uncontended.
      As this locking construct is obviously endangered by lock ordering issues as
      any other locking primitive it got lockdep annotated via a dedicated
      dependency map sock::sk_lock.dep_map which has to be updated at the lock
      and unlock sites.
      
      lock_sock_nested() is a straight forward 'mutex' lock operation:
      
        might_sleep();
        spin_lock_bh(sock::sk_lock.slock)
        while (!try_lock(sock::sk_lock.owned)) {
            spin_unlock_bh(sock::sk_lock.slock);
            wait_for_release();
            spin_lock_bh(sock::sk_lock.slock);
        }
      
      The lockdep annotation for sock::sk_lock.owned is for unknown reasons
      _after_ the lock has been acquired, i.e. after the code block above and
      after releasing sock::sk_lock.slock, but inside the bottom halves disabled
      region:
      
        spin_unlock(sock::sk_lock.slock);
        mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
        local_bh_enable();
      
      The placement after the unlock is obvious because otherwise the
      mutex_acquire() would nest into the spin lock held region.
      
      But that's from the lockdep perspective still the wrong place:
      
       1) The mutex_acquire() is issued _after_ the successful acquisition which
          is pointless because in a dead lock scenario this point is never
          reached which means that if the deadlock is the first instance of
          exposing the wrong lock order lockdep does not have a chance to detect
          it.
      
       2) It only works because lockdep is rather lax on the context from which
          the mutex_acquire() is issued. Acquiring a mutex inside a bottom halves
          and therefore non-preemptible region is obviously invalid, except for a
          trylock which is clearly not the case here.
      
          This 'works' stops working on RT enabled kernels where the bottom halves
          serialization is done via a local lock, which exposes this misplacement
          because the 'mutex' and the local lock nest the wrong way around and
          lockdep complains rightfully about a lock inversion.
      
      The placement is wrong since the initial commit a5b5bb9a
      
       ("[PATCH]
      lockdep: annotate sk_locks") which introduced this.
      
      Fix it by moving the mutex_acquire() in front of the actual lock
      acquisition, which is what the regular mutex_lock() operation does as well.
      
      lock_sock_fast() is not that straight forward. It looks at the first glance
      like a convoluted trylock operation:
      
        spin_lock_bh(sock::sk_lock.slock)
        if (!sock::sk_lock.owned)
            return false;
        while (!try_lock(sock::sk_lock.owned)) {
            spin_unlock_bh(sock::sk_lock.slock);
            wait_for_release();
            spin_lock_bh(sock::sk_lock.slock);
        }
        spin_unlock(sock::sk_lock.slock);
        mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
        local_bh_enable();
        return true;
      
      But that's not the case: lock_sock_fast() is an interesting optimization
      for short critical sections which can run with bottom halves disabled and
      sock::sk_lock.slock held. This allows to shortcut the 'mutex' operation in
      the non contended case by preventing other lockers to acquire
      sock::sk_lock.owned because they are blocked on sock::sk_lock.slock, which
      in turn avoids the overhead of doing the heavy processing in release_sock()
      including waking up wait queue waiters.
      
      In the contended case, i.e. when sock::sk_lock.owned == true the behavior
      is the same as lock_sock_nested().
      
      Semantically this shortcut means, that the task acquired the 'mutex' even
      if it does not touch the sock::sk_lock.owned field in the non-contended
      case. Not telling lockdep about this shortcut acquisition is hiding
      potential lock ordering violations in the fast path.
      
      As a consequence the same reasoning as for the above lock_sock_nested()
      case vs. the placement of the lockdep annotation applies.
      
      The current placement of the lockdep annotation was just copied from
      the original lock_sock(), now renamed to lock_sock_nested(),
      implementation.
      
      Fix this by moving the mutex_acquire() in front of the actual lock
      acquisition and adding the corresponding mutex_release() into
      unlock_sock_fast(). Also document the fast path return case with a comment.
      
      Reported-by: default avatarSebastian Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: netdev@vger.kernel.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2dcb96ba