Skip to content
  1. Nov 29, 2021
  2. Nov 28, 2021
  3. Nov 27, 2021
    • Ye Bin's avatar
      io_uring: Fix undefined-behaviour in io_issue_sqe · f6223ff7
      Ye Bin authored
      
      
      We got issue as follows:
      ================================================================================
      UBSAN: Undefined behaviour in ./include/linux/ktime.h:42:14
      signed integer overflow:
      -4966321760114568020 * 1000000000 cannot be represented in type 'long long int'
      CPU: 1 PID: 2186 Comm: syz-executor.2 Not tainted 4.19.90+ #12
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78
       show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x170/0x1dc lib/dump_stack.c:118
       ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161
       handle_overflow+0x188/0x1dc lib/ubsan.c:192
       __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213
       ktime_set include/linux/ktime.h:42 [inline]
       timespec64_to_ktime include/linux/ktime.h:78 [inline]
       io_timeout fs/io_uring.c:5153 [inline]
       io_issue_sqe+0x42c8/0x4550 fs/io_uring.c:5599
       __io_queue_sqe+0x1b0/0xbc0 fs/io_uring.c:5988
       io_queue_sqe+0x1ac/0x248 fs/io_uring.c:6067
       io_submit_sqe fs/io_uring.c:6137 [inline]
       io_submit_sqes+0xed8/0x1c88 fs/io_uring.c:6331
       __do_sys_io_uring_enter fs/io_uring.c:8170 [inline]
       __se_sys_io_uring_enter fs/io_uring.c:8129 [inline]
       __arm64_sys_io_uring_enter+0x490/0x980 fs/io_uring.c:8129
       invoke_syscall arch/arm64/kernel/syscall.c:53 [inline]
       el0_svc_common+0x374/0x570 arch/arm64/kernel/syscall.c:121
       el0_svc_handler+0x190/0x260 arch/arm64/kernel/syscall.c:190
       el0_svc+0x10/0x218 arch/arm64/kernel/entry.S:1017
      ================================================================================
      
      As ktime_set only judge 'secs' if big than KTIME_SEC_MAX, but if we pass
      negative value maybe lead to overflow.
      To address this issue, we must check if 'sec' is negative.
      
      Signed-off-by: default avatarYe Bin <yebin10@huawei.com>
      Link: https://lore.kernel.org/r/20211118015907.844807-1-yebin10@huawei.com
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f6223ff7
    • Ye Bin's avatar
      io_uring: fix soft lockup when call __io_remove_buffers · 1d0254e6
      Ye Bin authored
      I got issue as follows:
      [ 567.094140] __io_remove_buffers: [1]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680
      [  594.360799] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
      [  594.364987] Modules linked in:
      [  594.365405] irq event stamp: 604180238
      [  594.365906] hardirqs last  enabled at (604180237): [<ffffffff93fec9bd>] _raw_spin_unlock_irqrestore+0x2d/0x50
      [  594.367181] hardirqs last disabled at (604180238): [<ffffffff93fbbadb>] sysvec_apic_timer_interrupt+0xb/0xc0
      [  594.368420] softirqs last  enabled at (569080666): [<ffffffff94200654>] __do_softirq+0x654/0xa9e
      [  594.369551] softirqs last disabled at (569080575): [<ffffffff913e1d6a>] irq_exit_rcu+0x1ca/0x250
      [  594.370692] CPU: 2 PID: 108 Comm: kworker/u32:5 Tainted: G            L    5.15.0-next-20211112+ #88
      [  594.371891] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
      [  594.373604] Workqueue: events_unbound io_ring_exit_work
      [  594.374303] RIP: 0010:_raw_spin_unlock_irqrestore+0x33/0x50
      [  594.375037] Code: 48 83 c7 18 53 48 89 f3 48 8b 74 24 10 e8 55 f5 55 fd 48 89 ef e8 ed a7 56 fd 80 e7 02 74 06 e8 43 13 7b fd fb bf 01 00 00 00 <e8> f8 78 474
      [  594.377433] RSP: 0018:ffff888101587a70 EFLAGS: 00000202
      [  594.378120] RAX: 0000000024030f0d RBX: 0000000000000246 RCX: 1ffffffff2f09106
      [  594.379053] RDX: 0000000000000000 RSI: ffffffff9449f0e0 RDI: 0000000000000001
      [  594.379991] RBP: ffffffff9586cdc0 R08: 0000000000000001 R09: fffffbfff2effcab
      [  594.380923] R10: ffffffff977fe557 R11: fffffbfff2effcaa R12: ffff8881b8f3def0
      [  594.381858] R13: 0000000000000246 R14: ffff888153a8b070 R15: 0000000000000000
      [  594.382787] FS:  0000000000000000(0000) GS:ffff888399c00000(0000) knlGS:0000000000000000
      [  594.383851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  594.384602] CR2: 00007fcbe71d2000 CR3: 00000000b4216000 CR4: 00000000000006e0
      [  594.385540] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  594.386474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  594.387403] Call Trace:
      [  594.387738]  <TASK>
      [  594.388042]  find_and_remove_object+0x118/0x160
      [  594.389321]  delete_object_full+0xc/0x20
      [  594.389852]  kfree+0x193/0x470
      [  594.390275]  __io_remove_buffers.part.0+0xed/0x147
      [  594.390931]  io_ring_ctx_free+0x342/0x6a2
      [  594.392159]  io_ring_exit_work+0x41e/0x486
      [  594.396419]  process_one_work+0x906/0x15a0
      [  594.399185]  worker_thread+0x8b/0xd80
      [  594.400259]  kthread+0x3bf/0x4a0
      [  594.401847]  ret_from_fork+0x22/0x30
      [  594.402343]  </TASK>
      
      Message from syslogd@localhost at Nov 13 09:09:54 ...
      kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
      [  596.793660] __io_remove_buffers: [2099199]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680
      
      We can reproduce this issue by follow syzkaller log:
      r0 = syz_io_uring_setup(0x401, &(0x7f0000000300), &(0x7f0000003000/0x2000)=nil, &(0x7f0000ff8000/0x4000)=nil, &(0x7f0000000280)=<r1=>0x0, &(0x7f0000000380)=<r2=>0x0)
      sendmsg$ETHTOOL_MSG_FEATURES_SET(0xffffffffffffffff, &(0x7f0000003080)={0x0, 0x0, &(0x7f0000003040)={&(0x7f0000000040)=ANY=[], 0x18}}, 0x0)
      syz_io_uring_submit(r1, r2, &(0x7f0000000240)=@IORING_OP_PROVIDE_BUFFERS={0x1f, 0x5, 0x0, 0x401, 0x1, 0x0, 0x100, 0x0, 0x1, {0xfffd}}, 0x0)
      io_uring_enter(r0, 0x3a2d, 0x0, 0x0, 0x0, 0x0)
      
      The reason above issue  is 'buf->list' has 2,100,000 nodes, occupied cpu lead
      to soft lockup.
      To solve this issue, we need add schedule point when do while loop in
      '__io_remove_buffers'.
      After add  schedule point we do regression, get follow data.
      [  240.141864] __io_remove_buffers: [1]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
      [  268.408260] __io_remove_buffers: [1]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
      [  275.899234] __io_remove_buffers: [2099199]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
      [  296.741404] __io_remove_buffers: [1]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
      [  305.090059] __io_remove_buffers: [2099199]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
      [  325.415746] __io_remove_buffers: [1]start ctx=0xffff8881b92d1000 bgid=65533 buf=0xffff8881a17d8f00
      [  333.160318] __io_remove_buffers: [2099199]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
      ...
      
      Fixes:8bab4c09
      
      ("io_uring: allow conditional reschedule for intensive iterators")
      Signed-off-by: default avatarYe Bin <yebin10@huawei.com>
      Link: https://lore.kernel.org/r/20211122024737.2198530-1-yebin10@huawei.com
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1d0254e6
    • Steven Rostedt (VMware)'s avatar
      tracing: Fix pid filtering when triggers are attached · a55f224f
      Steven Rostedt (VMware) authored
      If a event is filtered by pid and a trigger that requires processing of
      the event to happen is a attached to the event, the discard portion does
      not take the pid filtering into account, and the event will then be
      recorded when it should not have been.
      
      Cc: stable@vger.kernel.org
      Fixes: 3fdaf80f
      
       ("tracing: Implement event pid filtering")
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      a55f224f
    • Alex Williamson's avatar
      iommu/vt-d: Fix unmap_pages support · 86dc40c7
      Alex Williamson authored
      When supporting only the .map and .unmap callbacks of iommu_ops,
      the IOMMU driver can make assumptions about the size and alignment
      used for mappings based on the driver provided pgsize_bitmap.  VT-d
      previously used essentially PAGE_MASK for this bitmap as any power
      of two mapping was acceptably filled by native page sizes.
      
      However, with the .map_pages and .unmap_pages interface we're now
      getting page-size and count arguments.  If we simply combine these
      as (page-size * count) and make use of the previous map/unmap
      functions internally, any size and alignment assumptions are very
      different.
      
      As an example, a given vfio device assignment VM will often create
      a 4MB mapping at IOVA pfn [0x3fe00 - 0x401ff].  On a system that
      does not support IOMMU super pages, the unmap_pages interface will
      ask to unmap 1024 4KB pages at the base IOVA.  dma_pte_clear_level()
      will recurse down to level 2 of the page table where the first half
      of the pfn range exactly matches the entire pte ...
      86dc40c7
    • Christophe JAILLET's avatar
      iommu/vt-d: Fix an unbalanced rcu_read_lock/rcu_read_unlock() · 4e5973dd
      Christophe JAILLET authored
      If we return -EOPNOTSUPP, the rcu lock remains lock. This is spurious.
      Go through the end of the function instead. This way, the missing
      'rcu_read_unlock()' is called.
      
      Fixes: 7afd7f6a
      
       ("iommu/vt-d: Check FL and SL capability sanity in scalable mode")
      Signed-off-by: default avatarChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Link: https://lore.kernel.org/r/40cc077ca5f543614eab2a10e84d29dd190273f6.1636217517.git.christophe.jaillet@wanadoo.fr
      
      
      Signed-off-by: default avatarLu Baolu <baolu.lu@linux.intel.com>
      Link: https://lore.kernel.org/r/20211126135556.397932-2-baolu.lu@linux.intel.com
      
      
      Signed-off-by: default avatarJoerg Roedel <jroedel@suse.de>
      4e5973dd
    • Alex Bee's avatar
      iommu/rockchip: Fix PAGE_DESC_HI_MASKs for RK3568 · f7ff3cff
      Alex Bee authored
      With the submission of iommu driver for RK3568 a subtle bug was
      introduced: PAGE_DESC_HI_MASK1 and PAGE_DESC_HI_MASK2 have to be
      the other way arround - that leads to random errors, especially when
      addresses beyond 32 bit are used.
      
      Fix it.
      
      Fixes: c55356c5
      
       ("iommu: rockchip: Add support for iommu v2")
      Signed-off-by: default avatarAlex Bee <knaerzche@gmail.com>
      Tested-by: default avatarPeter Geis <pgwipeout@gmail.com>
      Reviewed-by: default avatarHeiko Stuebner <heiko@sntech.de>
      Tested-by: default avatarDan Johansen <strit@manjaro.org>
      Reviewed-by: default avatarBenjamin Gaignard <benjamin.gaignard@collabora.com>
      Link: https://lore.kernel.org/r/20211124021325.858139-1-knaerzche@gmail.com
      
      
      Signed-off-by: default avatarJoerg Roedel <jroedel@suse.de>
      f7ff3cff
    • Joerg Roedel's avatar
      iommu/amd: Clarify AMD IOMMUv2 initialization messages · 717e88aa
      Joerg Roedel authored
      
      
      The messages printed on the initialization of the AMD IOMMUv2 driver
      have caused some confusion in the past. Clarify the messages to lower
      the confusion in the future.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJoerg Roedel <jroedel@suse.de>
      Link: https://lore.kernel.org/r/20211123105507.7654-3-joro@8bytes.org
      717e88aa
    • Joerg Roedel's avatar
      iommu/vt-d: Remove unused PASID_DISABLED · 21e96a20
      Joerg Roedel authored
      The macro is unused after commit 00ecd540
      
       so it can be removed.
      
      Reported-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Fixes: 00ecd540
      
       ("iommu/vt-d: Clean up unused PASID updating functions")
      Signed-off-by: default avatarJoerg Roedel <jroedel@suse.de>
      Reviewed-by: default avatarLu Baolu <baolu.lu@linux.intel.com>
      Link: https://lore.kernel.org/r/20211123105507.7654-2-joro@8bytes.org
      21e96a20
    • Linus Torvalds's avatar
      Merge tag 'net-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · c5c17547
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Networking fixes, including fixes from netfilter.
      
        Current release - regressions:
      
         - r8169: fix incorrect mac address assignment
      
         - vlan: fix underflow for the real_dev refcnt when vlan creation
           fails
      
         - smc: avoid warning of possible recursive locking
      
        Current release - new code bugs:
      
         - vsock/virtio: suppress used length validation
      
         - neigh: fix crash in v6 module initialization error path
      
        Previous releases - regressions:
      
         - af_unix: fix change in behavior in read after shutdown
      
         - igb: fix netpoll exit with traffic, avoid warning
      
         - tls: fix splice_read() when starting mid-record
      
         - lan743x: fix deadlock in lan743x_phy_link_status_change()
      
         - marvell: prestera: fix bridge port operation
      
        Previous releases - always broken:
      
         - tcp_cubic: fix spurious Hystart ACK train detections for
           not-cwnd-limited flows
      
         - nexthop: fix refcount issues when replacing IPv6 groups
      
         - nexthop: fix null pointer dereference when IPv6 is not enabled
      
         - phylink: force link down and retrigger resolve on interface change
      
         - mptcp: fix delack timer length calculation and incorrect early
           clearing
      
         - ieee802154: handle iftypes as u32, prevent shift-out-of-bounds
      
         - nfc: virtual_ncidev: change default device permissions
      
         - netfilter: ctnetlink: fix error codes and flags used for kernel
           side filtering of dumps
      
         - netfilter: flowtable: fix IPv6 tunnel addr match
      
         - ncsi: align payload to 32-bit to fix dropped packets
      
         - iavf: fix deadlock and loss of config during VF interface reset
      
         - ice: avoid bpf_prog refcount underflow
      
         - ocelot: fix broken PTP over IP and PTP API violations
      
        Misc:
      
         - marvell: mvpp2: increase MTU limit when XDP enabled"
      
      * tag 'net-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
        net: dsa: microchip: implement multi-bridge support
        net: mscc: ocelot: correctly report the timestamping RX filters in ethtool
        net: mscc: ocelot: set up traps for PTP packets
        net: ptp: add a definition for the UDP port for IEEE 1588 general messages
        net: mscc: ocelot: create a function that replaces an existing VCAP filter
        net: mscc: ocelot: don't downgrade timestamping RX filters in SIOCSHWTSTAMP
        net: hns3: fix incorrect components info of ethtool --reset command
        net: hns3: fix one incorrect value of page pool info when queried by debugfs
        net: hns3: add check NULL address for page pool
        net: hns3: fix VF RSS failed problem after PF enable multi-TCs
        net: qed: fix the array may be out of bound
        net/smc: Don't call clcsock shutdown twice when smc shutdown
        net: vlan: fix underflow for the real_dev refcnt
        ptp: fix filter names in the documentation
        ethtool: ioctl: fix potential NULL deref in ethtool_set_coalesce()
        nfc: virtual_ncidev: change default device permissions
        net/sched: sch_ets: don't peek at classes beyond 'nbands'
        net: stmmac: Disable Tx queues when reconfiguring the interface
        selftests: tls: test for correct proto_ops
        tls: fix replacing proto_ops
        ...
      c5c17547
    • Oleksij Rempel's avatar
      net: dsa: microchip: implement multi-bridge support · b3612ccd
      Oleksij Rempel authored
      Current driver version is able to handle only one bridge at time.
      Configuring two bridges on two different ports would end up shorting this
      bridges by HW. To reproduce it:
      
      	ip l a name br0 type bridge
      	ip l a name br1 type bridge
      	ip l s dev br0 up
      	ip l s dev br1 up
      	ip l s lan1 master br0
      	ip l s dev lan1 up
      	ip l s lan2 master br1
      	ip l s dev lan2 up
      
      	Ping on lan1 and get response on lan2, which should not happen.
      
      This happened, because current driver version is storing one global "Port VLAN
      Membership" and applying it to all ports which are members of any
      bridge.
      To solve this issue, we need to handle each port separately.
      
      This patch is dropping the global port member storage and calculating
      membership dynamically depending on STP state and bridge participation.
      
      Note: STP support was broken before this patch and should be fixed
      separately.
      
      Fixes: c2e86691
      
       ("net: dsa: microchip: break KSZ9477 DSA driver into two files")
      Signed-off-by: default avatarOleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: default avatarVladimir Oltean <olteanv@gmail.com>
      Link: https://lore.kernel.org/r/20211126123926.2981028-1-o.rempel@pengutronix.de
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      b3612ccd
    • Linus Torvalds's avatar
      Merge tag 'acpi-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 5367cf1c
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "These fix a NULL pointer dereference in the CPPC library code and a
        locking issue related to printing the names of ACPI device nodes in
        the device properties framework.
      
        Specifics:
      
         - Fix NULL pointer dereference in the CPPC library code occuring on
           hybrid systems without CPPC support (Rafael Wysocki).
      
         - Avoid attempts to acquire a semaphore with interrupts off when
           printing the names of ACPI device nodes and clean up code on top of
           that fix (Sakari Ailus)"
      
      * tag 'acpi-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: CPPC: Add NULL pointer check to cppc_get_perf()
        ACPI: Make acpi_node_get_parent() local
        ACPI: Get acpi_device's parent from the parent field
      5367cf1c
    • Linus Torvalds's avatar
      Merge tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 0ce629b1
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "These address three issues in the intel_pstate driver and fix two
        problems related to hibernation.
      
        Specifics:
      
         - Make intel_pstate work correctly on Ice Lake server systems with
           out-of-band performance control enabled (Adamos Ttofari).
      
         - Fix EPP handling in intel_pstate during CPU offline and online in
           the active mode (Rafael Wysocki).
      
         - Make intel_pstate support ITMT on asymmetric systems with
           overclocking enabled (Srinivas Pandruvada).
      
         - Fix hibernation image saving when using the user space interface
           based on the snapshot special device file (Evan Green).
      
         - Make the hibernation code release the snapshot block device using
           the same mode that was used when acquiring it (Thomas Zeitlhofer)"
      
      * tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        PM: hibernate: Fix snapshot partial write lengths
        PM: hibernate: use correct mode for swsusp_close()
        cpufreq: intel_pstate: ITMT support for overclocked system
        cpufreq: intel_pstate: Fix active mode offline/online EPP handling
        cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs
      0ce629b1
    • Linus Torvalds's avatar
      Merge tag 'fuse-fixes-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse · 925c9437
      Linus Torvalds authored
      Pull fuse fix from Miklos Szeredi:
       "Fix a regression caused by a bugfix in the previous release. The
        symptom is a VM_BUG_ON triggered from splice to the fuse device.
      
        Unfortunately the original bugfix was already backported to a number
        of stable releases, so this fix-fix will need to be backported as
        well"
      
      * tag 'fuse-fixes-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
        fuse: release pipe buf after last use
      925c9437
    • Jakub Kicinski's avatar
      Merge branch 'fix-broken-ptp-over-ip-on-ocelot-switches' · 32c54497
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Fix broken PTP over IP on Ocelot switches
      
      Changes in v2: added patch 5, added Richard's ack for the whole series
      sans patch 5 which is new.
      
      Po Liu reported recently that timestamping PTP over IPv4 is broken using
      the felix driver on NXP LS1028A. This has been known for a while, of
      course, since it has always been broken. The reason is because IP PTP
      packets are currently treated as unknown IP multicast, which is not
      flooded to the CPU port in the ocelot driver design, so packets don't
      reach the ptp4l program.
      
      The series solves the problem by installing packet traps per port when
      the timestamping ioctl is called, depending on the RX filter selected
      (L2, L4 or both).
      ====================
      
      Link: https://lore.kernel.org/r/20211126172845.3149260-1-vladimir.oltean@nxp.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      32c54497
    • Vladimir Oltean's avatar
      net: mscc: ocelot: correctly report the timestamping RX filters in ethtool · c49a35ee
      Vladimir Oltean authored
      The driver doesn't support RX timestamping for non-PTP packets, but it
      declares that it does. Restrict the reported RX filters to PTP v2 over
      L2 and over L4.
      
      Fixes: 4e3b0468
      
       ("net: mscc: PTP Hardware Clock (PHC) support")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      c49a35ee
    • Vladimir Oltean's avatar
      net: mscc: ocelot: set up traps for PTP packets · 96ca08c0
      Vladimir Oltean authored
      IEEE 1588 support was declared too soon for the Ocelot switch. Out of
      reset, this switch does not apply any special treatment for PTP packets,
      i.e. when an event message is received, the natural tendency is to
      forward it by MAC DA/VLAN ID. This poses a problem when the ingress port
      is under a bridge, since user space application stacks (written
      primarily for endpoint ports, not switches) like ptp4l expect that PTP
      messages are always received on AF_PACKET / AF_INET sockets (depending
      on the PTP transport being used), and never being autonomously
      forwarded. Any forwarding, if necessary (for example in Transparent
      Clock mode) is handled in software by ptp4l. Having the hardware forward
      these packets too will cause duplicates which will confuse endpoints
      connected to these switches.
      
      So PTP over L2 barely works, in the sense that PTP packets reach the CPU
      port, but they reach it via flooding, and therefore reach lots of other
      unwanted destinations too. But PTP over IPv4/IPv6 does not work at all.
      This is because the Ocelot switch have a separate destination port mask
      for unknown IP multicast (which PTP over IP is) flooding compared to
      unknown non-IP multicast (which PTP over L2 is) flooding. Specifically,
      the driver allows the CPU port to be in the PGID_MC port group, but not
      in PGID_MCIPV4 and PGID_MCIPV6. There are several presentations from
      Allan Nielsen which explain that the embedded MIPS CPU on Ocelot
      switches is not very powerful at all, so every penny they could save by
      not allowing flooding to the CPU port module matters. Unknown IP
      multicast did not make it.
      
      The de facto consensus is that when a switch is PTP-aware and an
      application stack for PTP is running, switches should have some sort of
      trapping mechanism for PTP packets, to extract them from the hardware
      data path. This avoids both problems:
      (a) PTP packets are no longer flooded to unwanted destinations
      (b) PTP over IP packets are no longer denied from reaching the CPU since
          they arrive there via a trap and not via flooding
      
      It is not the first time when this change is attempted. Last time, the
      feedback from Allan Nielsen and Andrew Lunn was that the traps should
      not be installed by default, and that PTP-unaware switching may be
      desired for some use cases:
      https://patchwork.ozlabs.org/project/netdev/patch/20190813025214.18601-5-yangbo.lu@nxp.com/
      
      To address that feedback, the present patch adds the necessary packet
      traps according to the RX filter configuration transmitted by user space
      through the SIOCSHWTSTAMP ioctl. Trapping is done via VCAP IS2, where we
      keep 5 filters, which are amended each time RX timestamping is enabled
      or disabled on a port:
      - 1 for PTP over L2
      - 2 for PTP over IPv4 (UDP ports 319 and 320)
      - 2 for PTP over IPv6 (UDP ports 319 and 320)
      
      The cookie by which these filters (invisible to tc) are identified is
      strategically chosen such that it does not collide with the filters used
      for the ocelot-8021q tagging protocol by the Felix driver, or with the
      MRP traps set up by the Ocelot library.
      
      Other alternatives were considered, like patching user space to do
      something, but there are so many ways in which PTP packets could be made
      to reach the CPU, generically speaking, that "do what?" is a very valid
      question. The ptp4l program from the linuxptp stack already attempts to
      do something: it calls setsockopt(IP_ADD_MEMBERSHIP) (and
      PACKET_ADD_MEMBERSHIP, respectively) which translates in both cases into
      a dev_mc_add() on the interface, in the kernel:
      https://github.com/richardcochran/linuxptp/blob/v3.1.1/udp.c#L73
      https://github.com/richardcochran/linuxptp/blob/v3.1.1/raw.c
      
      Reality shows that this is not sufficient in case the interface belongs
      to a switchdev driver, as dev_mc_add() does not show the intention to
      trap a packet to the CPU, but rather the intention to not drop it (it is
      strictly for RX filtering, same as promiscuous does not mean to send all
      traffic to the CPU, but to not drop traffic with unknown MAC DA). This
      topic is a can of worms in itself, and it would be great if user space
      could just stay out of it.
      
      On the other hand, setting up PTP traps privately within the driver is
      not new by any stretch of the imagination:
      https://elixir.bootlin.com/linux/v5.16-rc2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c#L833
      https://elixir.bootlin.com/linux/v5.16-rc2/source/drivers/net/dsa/hirschmann/hellcreek.c#L1050
      https://elixir.bootlin.com/linux/v5.16-rc2/source/include/linux/dsa/sja1105.h#L21
      
      So this is the approach taken here as well. The difference here being
      that we prepare and destroy the traps per port, dynamically at runtime,
      as opposed to driver init time, because apparently, PTP-unaware
      forwarding is a use case.
      
      Fixes: 4e3b0468
      
       ("net: mscc: PTP Hardware Clock (PHC) support")
      Reported-by: default avatarPo Liu <po.liu@nxp.com>
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: default avatarRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      96ca08c0
    • Vladimir Oltean's avatar
      net: ptp: add a definition for the UDP port for IEEE 1588 general messages · ec15baec
      Vladimir Oltean authored
      
      
      As opposed to event messages (Sync, PdelayReq etc) which require
      timestamping, general messages (Announce, FollowUp etc) do not.
      In PTP they are part of different streams of data.
      
      IEEE 1588-2008 Annex D.2 "UDP port numbers" states that the UDP
      destination port assigned by IANA is 319 for event messages, and 320 for
      general messages. Yet the kernel seems to be missing the definition for
      general messages. This patch adds it.
      
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: default avatarRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ec15baec