  1. Apr 17, 2024
    • perf/x86: Fix out of range data · e8f4a290
      Namhyung Kim authored
      commit dec8ced8 upstream.
      
      On x86, each struct cpu_hw_events maintains a table for counter assignment,
      but x86_pmu_del() failed to update it for the deleted event.  This
      can make perf_clear_dirty_counters() reset a counter that is still in use
      if it's called before event scheduling or enabling.  The counter would then
      return out-of-range data, which doesn't make sense.
      
      The following code can reproduce the problem.
      
        $ cat repro.c
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <linux/perf_event.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
      
        struct perf_event_attr attr = {
        	.type = PERF_TYPE_HARDWARE,
        	.config = PERF_COUNT_HW_CPU_CYCLES,
        	.disabled = 1,
        };
      
        void *worker(void *arg)
        {
        	int cpu = (long)arg;
        	int fd1 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
        	int fd2 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
        	void *p;
      
        	do {
        		ioctl(fd1, PERF_EVENT_IOC_ENABLE, 0);
        		p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd1, 0);
        		ioctl(fd2, PERF_EVENT_IOC_ENABLE, 0);
      
        		ioctl(fd2, PERF_EVENT_IOC_DISABLE, 0);
        		munmap(p, 4096);
        		ioctl(fd1, PERF_EVENT_IOC_DISABLE, 0);
        	} while (1);
      
        	return NULL;
        }
      
        int main(void)
        {
        	int i;
        	int n = sysconf(_SC_NPROCESSORS_ONLN);
        	pthread_t *th = calloc(n, sizeof(*th));
      
        	for (i = 0; i < n; i++)
        		pthread_create(&th[i], NULL, worker, (void *)(long)i);
        	for (i = 0; i < n; i++)
        		pthread_join(th[i], NULL);
      
        	free(th);
        	return 0;
        }
      
      And you can see the out of range data using perf stat like this.
      Probably it'd be easier to see on a large machine.
      
        $ gcc -o repro repro.c -pthread
        $ ./repro &
        $ sudo perf stat -A -I 1000 2>&1 | awk '{ if (length($3) > 15) print }'
             1.001028462 CPU6   196,719,295,683,763      cycles                           # 194290.996 GHz                       (71.54%)
             1.001028462 CPU3   396,077,485,787,730      branch-misses                    # 15804359784.80% of all branches      (71.07%)
             1.001028462 CPU17  197,608,350,727,877      branch-misses                    # 14594186554.56% of all branches      (71.22%)
             2.020064073 CPU4   198,372,472,612,140      cycles                           # 194681.113 GHz                       (70.95%)
             2.020064073 CPU6   199,419,277,896,696      cycles                           # 195720.007 GHz                       (70.57%)
             2.020064073 CPU20  198,147,174,025,639      cycles                           # 194474.654 GHz                       (71.03%)
             2.020064073 CPU20  198,421,240,580,145      stalled-cycles-frontend          #  100.14% frontend cycles idle        (70.93%)
             3.037443155 CPU4   197,382,689,923,416      cycles                           # 194043.065 GHz                       (71.30%)
             3.037443155 CPU20  196,324,797,879,414      cycles                           # 193003.773 GHz                       (71.69%)
             3.037443155 CPU5   197,679,956,608,205      stalled-cycles-backend           # 1315606428.66% backend cycles idle   (71.19%)
             3.037443155 CPU5   198,571,860,474,851      instructions                     # 13215422.58  insn per cycle
      
      x86_pmu_del() should move the contents of cpuc->assign as well.
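      The idea of keeping the assignment table aligned with the event list
      can be sketched in user space as below. This is only an illustration
      under stated assumptions: struct hw_events, MAX_EVENTS and del_event()
      are hypothetical stand-ins, not the kernel's struct cpu_hw_events or
      x86_pmu_del().

```c
#include <assert.h>

#define MAX_EVENTS 8

/* Hypothetical, simplified stand-in for the per-CPU tables:
 * event_list[i] and assign[i] describe the same event, so the two
 * arrays must always stay index-aligned. */
struct hw_events {
	int n_events;
	int event_list[MAX_EVENTS];	/* event ids (stand-in for pointers) */
	int assign[MAX_EVENTS];		/* hardware counter index per event */
};

/* Remove the entry at position idx and close the gap in BOTH arrays. */
static void del_event(struct hw_events *c, int idx)
{
	assert(idx >= 0 && idx < c->n_events);

	for (int i = idx + 1; i < c->n_events; i++) {
		c->event_list[i - 1] = c->event_list[i];
		/* Without this line, assign[] would keep describing the
		 * deleted event's slot. */
		c->assign[i - 1] = c->assign[i];
	}
	c->n_events--;
}
```

      If only event_list[] were shifted, a later walk over assign[] could
      treat a counter that is still in use as free.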
      
      Fixes: 5471eea5 ("perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task")
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240306061003.1894224-1-namhyung@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • vhost: Add smp_rmb() in vhost_vq_avail_empty() · acf9b01d
      Gavin Shan authored
      commit 22e1992c upstream.
      
      A smp_rmb() is missing in vhost_vq_avail_empty(), as spotted by
      Will. Without it, there is no guarantee that the available ring entries
      pushed by the guest are observed by vhost in time, so vhost can fetch
      stale available ring entries in vhost_get_vq_desc(), as reported by
      Yihuang Yu on NVidia's grace-hopper (ARM64) platform.
      
        /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64      \
        -accel kvm -machine virt,gic-version=host -cpu host          \
        -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
        -m 4096M,slots=16,maxmem=64G                                 \
        -object memory-backend-ram,id=mem0,size=4096M                \
         :                                                           \
        -netdev tap,id=vnet0,vhost=true                              \
        -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
         :
        guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
        virtio_net virtio0: output.0:id 100 is not a head!
      
      Add the missing smp_rmb() in vhost_vq_avail_empty(). When tx_can_batch()
      returns true, there are still pending tx buffers. Since that check might
      read the indices, it can still bypass the smp_rmb() in vhost_get_vq_desc().
      Note that this was safe until vq->avail_idx was changed by commit
      275bf960 ("vhost: better detection of available buffers").
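      The ordering requirement can be sketched in user space with C11
      atomics, where an acquire fence plays a role similar to smp_rmb().
      This is an analogy only: struct avail_ring, ring_push() and ring_pop()
      are hypothetical names and this is not the vhost code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8

/* Hypothetical miniature of an available ring shared with a producer. */
struct avail_ring {
	_Atomic uint16_t idx;		/* published by the producer */
	uint16_t ring[RING_SIZE];	/* entries written before idx is bumped */
};

/* Producer: write the entry, then publish the new index. */
static void ring_push(struct avail_ring *r, uint16_t head)
{
	uint16_t idx = atomic_load_explicit(&r->idx, memory_order_relaxed);

	r->ring[idx % RING_SIZE] = head;
	atomic_store_explicit(&r->idx, (uint16_t)(idx + 1), memory_order_release);
}

/* Consumer: returns 1 and stores the entry in *head if one is pending. */
static int ring_pop(struct avail_ring *r, uint16_t *last_seen, uint16_t *head)
{
	uint16_t idx = atomic_load_explicit(&r->idx, memory_order_relaxed);

	assert(last_seen && head);
	if (idx == *last_seen)
		return 0;

	/* Read barrier: the entry must not be read before the index. */
	atomic_thread_fence(memory_order_acquire);

	*head = r->ring[*last_seen % RING_SIZE];
	(*last_seen)++;
	return 1;
}
```

      Any path that reads the index and then goes on to read ring entries
      needs the fence; a shortcut that reads the index elsewhere and skips
      the fence reintroduces the stale-entry problem.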
      
      Fixes: 275bf960 ("vhost: better detection of available buffers")
      Cc: <stable@kernel.org> # v4.11+
      Reported-by: Yihuang Yu <yihyu@redhat.com>
      Suggested-by: Will Deacon <will@kernel.org>
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Message-Id: <20240328002149.1141302-2-gshan@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drm/client: Fully protect modes[] with dev->mode_config.mutex · d2dc6600
      Ville Syrjälä authored
      commit 3eadd887 upstream.
      
      The modes[] array contains pointers to modes on the connectors'
      mode lists, which are protected by dev->mode_config.mutex.
      Thus we need to extend the same protection to modes[]; otherwise, by
      the time we use it, the elements may already be pointing to
      freed/reused memory.
      
      Cc: stable@vger.kernel.org
      Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10583
      Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240404203336.10454-2-ville.syrjala@linux.intel.com
      Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Reviewed-by: Jani Nikula <jani.nikula@intel.com>
      Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: qgroup: correctly model root qgroup rsv in convert · 773d38f4
      Boris Burkov authored
      commit 141fb8cd upstream.
      
      We use add_root_meta_rsv and sub_root_meta_rsv to track prealloc and
      pertrans reservations for subvolumes when quotas are enabled. The
      convert function does not properly increment pertrans after decrementing
      prealloc, so the count is not accurate.
      
      Note: we check that the fs is not read-only to mirror the logic in
      qgroup_convert_meta, which checks that before adding to the pertrans rsv.
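      The accounting described above can be sketched roughly as follows.
      This is a minimal model, not the btrfs code: struct root_rsv and
      convert_root_meta_rsv() are hypothetical names, and the read_only flag
      stands in for the filesystem read-only check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-root counters mirroring the two reservation types. */
struct root_rsv {
	uint64_t prealloc;
	uint64_t pertrans;
};

/* Convert 'bytes' of prealloc reservation into pertrans reservation. */
static void convert_root_meta_rsv(struct root_rsv *r, uint64_t bytes,
				  bool read_only)
{
	assert(bytes <= r->prealloc);

	r->prealloc -= bytes;
	/* The increment that was missing; skipped on a read-only fs. */
	if (!read_only)
		r->pertrans += bytes;
}
```

      With the increment in place, the sum of the two counters is preserved
      by a conversion on a writable filesystem.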
      
      Fixes: 8287475a ("btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iommu/vt-d: Allocate local memory for page request queue · 23b57c55
      Jacob Pan authored
      [ Upstream commit a34f3e20 ]
      
      The page request queue is per IOMMU, so its allocation should be made
      NUMA-aware for performance reasons.
      
      Fixes: a222a7f0 ("iommu/vt-d: Implement page request handling")
      Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
      Reviewed-by: Kevin Tian <kevin.tian@intel.com>
      Link: https://lore.kernel.org/r/20240403214007.985600-1-jacob.jun.pan@linux.intel.com
      Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • tracing: hide unused ftrace_event_id_fops · 81f3ad64
      Arnd Bergmann authored
      [ Upstream commit 5281ec83 ]
      
      When CONFIG_PERF_EVENTS is disabled, a 'make W=1' build produces a warning
      about the unused ftrace_event_id_fops variable:
      
      kernel/trace/trace_events.c:2155:37: error: 'ftrace_event_id_fops' defined but not used [-Werror=unused-const-variable=]
       2155 | static const struct file_operations ftrace_event_id_fops = {
      
      Hide this in the same #ifdef as the reference to it.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240403080702.3509288-7-arnd@kernel.org
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Zheng Yejian <zhengyejian1@huawei.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ajay Kaher <akaher@vmware.com>
      Cc: Jinjie Ruan <ruanjinjie@huawei.com>
      Cc: Clément Léger <cleger@rivosinc.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
      Fixes: 620a30e9 ("tracing: Don't pass file_operations array to event_create_dir()")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ena: Fix incorrect descriptor free behavior · fdfbf54d
      David Arinzon authored
      [ Upstream commit bf02d9fe ]
      
      ENA has two types of TX queues:
      - queues which only process TX packets arriving from the network stack
      - queues which only process TX packets forwarded to it by XDP_REDIRECT
        or XDP_TX instructions
      
      ena_free_tx_bufs() cycles through all descriptors in a TX queue
      and unmaps and frees every descriptor that hasn't been acknowledged yet
      by the device (uncompleted TX transactions).
      The function assumes that the processed TX queue is necessarily from
      the first category listed above and ends up using napi_consume_skb()
      for descriptors belonging to an XDP-specific queue.
      
      This patch solves a bug in which, in case of a VF reset, the
      descriptors aren't freed correctly, leading to crashes.
      
      Fixes: 548c4940 ("net: ena: Implement XDP_TX action")
      Signed-off-by: Shay Agroskin <shayagr@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ena: Wrong missing IO completions check order · ec25a9ce
      David Arinzon authored
      [ Upstream commit f7e41718 ]
      
      Missing IO completions check is called every second (HZ jiffies).
      This commit fixes several issues with this check:
      
      1. Duplicate queues check:
         Max of 4 queues are scanned on each check due to monitor budget.
         Once reaching the budget, this check exits under the assumption that
         the next check will continue to scan the remainder of the queues,
         but in practice, next check will first scan the last already scanned
         queue which is not necessary and may cause the full queue scan to
         last a couple of seconds longer.
         The fix is to start every check with the next queue to scan.
         For example, on 8 IO queues:
         Bug: [0,1,2,3], [3,4,5,6], [6,7]
         Fix: [0,1,2,3], [4,5,6,7]
      
      2. Unbalanced queues check:
         In case the number of active IO queues is not a multiple of budget,
         there will be checks which don't utilize the full budget
         because the full scan exits when reaching the last queue id.
         The fix is to run every TX completion check with exact queue budget
         regardless of the queue id.
         For example, on 7 IO queues:
         Bug: [0,1,2,3], [4,5,6], [0,1,2,3]
         Fix: [0,1,2,3], [4,5,6,0], [1,2,3,4]
         The budget may be lowered in case the number of IO queues is less
         than the budget (4) to make sure there are no duplicate queues on
         the same check.
         For example, on 3 IO queues:
         Bug: [0,1,2,0], [1,2,0,1]
         Fix: [0,1,2], [0,1,2]
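      The fixed scan order described above can be modelled with a small
      round-robin helper. This is a sketch for illustration only:
      MONITOR_BUDGET and scan_once() are hypothetical names, not the
      driver's identifiers.

```c
#include <assert.h>

#define MONITOR_BUDGET 4

/* Hypothetical helper: fill 'out' with the queue ids one check would scan,
 * starting at *next, and advance *next for the following check.
 * Returns the number of ids written. */
static int scan_once(int num_queues, int *next, int out[MONITOR_BUDGET])
{
	int budget;

	assert(num_queues > 0);

	/* Lower the budget when there are fewer queues than the budget,
	 * so no queue is scanned twice in the same check. */
	budget = num_queues < MONITOR_BUDGET ? num_queues : MONITOR_BUDGET;

	for (int i = 0; i < budget; i++) {
		out[i] = *next;
		/* Wrap around instead of stopping at the last queue id. */
		*next = (*next + 1) % num_queues;
	}
	return budget;
}
```

      Calling scan_once() repeatedly with 7 queues yields [0,1,2,3],
      [4,5,6,0], [1,2,3,4], matching the "Fix" rows above.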
      
      Fixes: 1738cd3e ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)")
      Signed-off-by: Amit Bernstein <amitbern@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ena: Fix potential sign extension issue · e667a05c
      David Arinzon authored
      [ Upstream commit 713a8519 ]
      
      Small unsigned types are promoted to larger signed types in
      the case of multiplication, the result of which may overflow.
      In case the result of such a multiplication has its MSB
      turned on, it will be sign extended with '1's.
      This changes the multiplication result.
      
      Code example of the phenomenon:
      -------------------------------
      u16 x, y;
      size_t z1, z2;
      
      x = y = 0xffff;
      printk("x=%x y=%x\n",x,y);
      
      z1 = x*y;
      z2 = (size_t)x*y;
      
      printk("z1=%lx z2=%lx\n", z1, z2);
      
      Output:
      -------
      x=ffff y=ffff
      z1=fffffffffffe0001 z2=fffe0001
      
      The expected result of ffff*ffff is fffe0001, and without the
      explicit casting to avoid the unwanted sign extension we got
      fffffffffffe0001.
      
      This commit adds an explicit casting to avoid the sign extension
      issue.
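      A user-space sketch of the same idea, using fixed-width types in place
      of the kernel's u16. The helper names mul_u16_wide() and
      widen_signed() are hypothetical. The second helper shows the sign
      extension step in isolation, without relying on an overflowing
      multiplication.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static_assert(sizeof(size_t) >= sizeof(uint32_t),
	      "a 16-bit by 16-bit product must fit in size_t");

/* Widen first, then multiply: the product is computed in size_t, so no
 * intermediate signed int is involved. */
static size_t mul_u16_wide(uint16_t x, uint16_t y)
{
	return (size_t)x * y;
}

/* What sign extension does to a negative 32-bit value: converting it to
 * a wider unsigned type fills the upper bits with ones. */
static uint64_t widen_signed(int32_t v)
{
	return (uint64_t)v;
}
```

      The 32-bit pattern fffe0001 read as a signed int is -131071, and
      widen_signed(-131071) gives fffffffffffe0001, the unwanted value from
      the output above.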
      
      Fixes: 689b2bda ("net: ena: add functions for handling Low Latency Queues in ena_com")
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • af_unix: Fix garbage collector racing against connect() · e76c2678
      Michal Luczaj authored
      [ Upstream commit 47d8ac01 ]
      
      The garbage collector does not take into account the risk of an embryo
      getting enqueued during garbage collection. If such an embryo has a peer
      that carries SCM_RIGHTS, two consecutive passes of scan_children() may see
      a different set of children, leading to an incorrectly elevated inflight
      count and then a dangling pointer within the gc_inflight_list.
      
      sockets are AF_UNIX/SOCK_STREAM
      S is an unconnected socket
      L is a listening in-flight socket bound to addr, not in fdtable
      V's fd will be passed via sendmsg(), gets inflight count bumped
      
      connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
      ----------------	-------------------------	-----------
      
      NS = unix_create1()
      skb1 = sock_wmalloc(NS)
      L = unix_find_other(addr)
      unix_state_lock(L)
      unix_peer(S) = NS
      			// V count=1 inflight=0
      
       			NS = unix_peer(S)
       			skb2 = sock_alloc()
      			skb_queue_tail(NS, skb2[V])
      
      			// V became in-flight
      			// V count=2 inflight=1
      
      			close(V)
      
      			// V count=1 inflight=1
      			// GC candidate condition met
      
      						for u in gc_inflight_list:
      						  if (total_refs == inflight_refs)
      						    add u to gc_candidates
      
      						// gc_candidates={L, V}
      
      						for u in gc_candidates:
      						  scan_children(u, dec_inflight)
      
      						// embryo (skb1) was not
      						// reachable from L yet, so V's
      						// inflight remains unchanged
      __skb_queue_tail(L, skb1)
      unix_state_unlock(L)
      						for u in gc_candidates:
      						  if (u.inflight)
      						    scan_children(u, inc_inflight_move_tail)
      
      						// V count=1 inflight=2 (!)
      
      If there is a GC-candidate listening socket, lock/unlock its state. This
      makes GC wait until the end of any ongoing connect() to that socket. After
      flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
      there is another embryo coming, it can not possibly carry SCM_RIGHTS. At
      this point, unix_inflight() can not happen because unix_gc_lock is already
      taken. Inflight graph remains unaffected.
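      The "lock/unlock its state" step is a lock flip: taking and
      immediately releasing a lock only to wait for whoever holds it. A
      user-space analogy with a pthread mutex might look like this;
      lock_flip() is a hypothetical helper, and the kernel code uses the
      socket state lock rather than a pthread mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: block until whoever currently holds 'lock' has
 * released it, without doing any work under the lock ourselves.
 * Returns 0 on success or a pthread error number. */
static int lock_flip(pthread_mutex_t *lock)
{
	int err;

	assert(lock != NULL);

	err = pthread_mutex_lock(lock);
	if (err)
		return err;

	/* Any critical section that was in progress when we started
	 * waiting has completed by the time we get here. */
	return pthread_mutex_unlock(lock);
}
```

      The flip protects no data of its own; it only orders the caller
      after a critical section that was already in progress.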
      
      Fixes: 1fd05ba5 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240409201047.1032217-1-mhal@rbox.co
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • af_unix: Do not use atomic ops for unix_sk(sk)->inflight. · 37120fa8
      Kuniyuki Iwashima authored
      [ Upstream commit 97af84a6 ]
      
      When touching unix_sk(sk)->inflight, we are always under
      spin_lock(&unix_gc_lock).
      
      Let's convert unix_sk(sk)->inflight to the normal unsigned long.
      
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240123170856.41348-3-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Stable-dep-of: 47d8ac01 ("af_unix: Fix garbage collector racing against connect()")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: dsa: mt7530: trap link-local frames regardless of ST Port State · 22641478
      Arınç ÜNAL authored
      [ Upstream commit 17c56011 ]
      
      In Clause 5 of IEEE Std 802-2014, two sublayers of the data link layer
      (DLL) of the Open Systems Interconnection basic reference model (OSI/RM)
      are described; the medium access control (MAC) and logical link control
      (LLC) sublayers. The MAC sublayer is the one facing the physical layer.
      
      In 8.2 of IEEE Std 802.1Q-2022, the Bridge architecture is described. A
      Bridge component comprises a MAC Relay Entity for interconnecting the Ports
      of the Bridge, at least two Ports, and higher layer entities with at least
      a Spanning Tree Protocol Entity included.
      
      Each Bridge Port also functions as an end station and shall provide the MAC
      Service to an LLC Entity. Each instance of the MAC Service is provided to a
      distinct LLC Entity that supports protocol identification, multiplexing,
      and demultiplexing, for protocol data unit (PDU) transmission and reception
      by one or more higher layer entities.
      
      It is described in 8.13.9 of IEEE Std 802.1Q-2022 that in a Bridge, the LLC
      Entity associated with each Bridge Port is modeled as being directly
      connected to the attached Local Area Network (LAN).
      
      On the switch with CPU port architecture, CPU port functions as Management
      Port, and the Management Port functionality is provided by software which
      functions as an end station. Software is connected to an IEEE 802 LAN that
      is wholly contained within the system that incorporates the Bridge.
      Software provides access to the LLC Entity associated with each Bridge Port
      by the value of the source port field on the special tag on the frame
      received by software.
      
      We call frames that carry control information to determine the active
      topology and current extent of each Virtual Local Area Network (VLAN),
      i.e., spanning tree or Shortest Path Bridging (SPB) and Multiple VLAN
      Registration Protocol Data Units (MVRPDUs), and frames from other link
      constrained protocols, such as Extensible Authentication Protocol over LAN
      (EAPOL) and Link Layer Discovery Protocol (LLDP), link-local frames. They
      are not forwarded by a Bridge. Permanently configured entries in the
      filtering database (FDB) ensure that such frames are discarded by the
      Forwarding Process. In 8.6.3 of IEEE Std 802.1Q-2022, this is described in
      detail:
      
      Each of the reserved MAC addresses specified in Table 8-1
      (01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]) shall be
      permanently configured in the FDB in C-VLAN components and ERs.
      
      Each of the reserved MAC addresses specified in Table 8-2
      (01-80-C2-00-00-[01,02,03,04,05,06,07,08,09,0A,0E]) shall be permanently
      configured in the FDB in S-VLAN components.
      
      Each of the reserved MAC addresses specified in Table 8-3
      (01-80-C2-00-00-[01,02,04,0E]) shall be permanently configured in the FDB
      in TPMR components.
      
      The FDB entries for reserved MAC addresses shall specify filtering for all
      Bridge Ports and all VIDs. Management shall not provide the capability to
      modify or remove entries for reserved MAC addresses.
      
      The addresses in Table 8-1, Table 8-2, and Table 8-3 determine the scope of
      propagation of PDUs within a Bridged Network, as follows:
      
        The Nearest Bridge group address (01-80-C2-00-00-0E) is an address that
        no conformant Two-Port MAC Relay (TPMR) component, Service VLAN (S-VLAN)
        component, Customer VLAN (C-VLAN) component, or MAC Bridge can forward.
        PDUs transmitted using this destination address, or any other addresses
        that appear in Table 8-1, Table 8-2, and Table 8-3
        (01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]), can
        therefore travel no further than those stations that can be reached via a
        single individual LAN from the originating station.
      
        The Nearest non-TPMR Bridge group address (01-80-C2-00-00-03), is an
        address that no conformant S-VLAN component, C-VLAN component, or MAC
        Bridge can forward; however, this address is relayed by a TPMR component.
        PDUs using this destination address, or any of the other addresses that
        appear in both Table 8-1 and Table 8-2 but not in Table 8-3
        (01-80-C2-00-00-[00,03,05,06,07,08,09,0A,0B,0C,0D,0F]), will be relayed
        by any TPMRs but will propagate no further than the nearest S-VLAN
        component, C-VLAN component, or MAC Bridge.
      
        The Nearest Customer Bridge group address (01-80-C2-00-00-00) is an
        address that no conformant C-VLAN component, MAC Bridge can forward;
        however, it is relayed by TPMR components and S-VLAN components. PDUs
        using this destination address, or any of the other addresses that appear
        in Table 8-1 but not in either Table 8-2 or Table 8-3
        (01-80-C2-00-00-[00,0B,0C,0D,0F]), will be relayed by TPMR components and
        S-VLAN components but will propagate no further than the nearest C-VLAN
        component or MAC Bridge.
      
      Because the LLC Entity associated with each Bridge Port is provided via CPU
      port, we must not filter these frames but forward them to CPU port.
      
      In a Bridge, the transmission Port is mainly decided by ingress and egress
      rules, FDB, and spanning tree Port State functions of the Forwarding
      Process. For link-local frames, only CPU port should be designated as
      destination port in the FDB, and the other functions of the Forwarding
      Process must not interfere with the decision of the transmission Port. We
      call this process trapping frames to CPU port.
      
      Therefore, on the switch with CPU port architecture, link-local frames must
      be trapped to CPU port, and certain link-local frames received by a Port of
      a Bridge comprising a TPMR component or an S-VLAN component must be
      excluded from it.
      
      A Bridge of the switch with CPU port architecture cannot comprise a
      Two-Port MAC Relay (TPMR) component as a TPMR component supports only a
      subset of the functionality of a MAC Bridge. A Bridge comprising two Ports
      (Management Port doesn't count) of this architecture will either function
      as a standard MAC Bridge or a standard VLAN Bridge.
      
      Therefore, a Bridge of this architecture can only comprise S-VLAN
      components, C-VLAN components, or MAC Bridge components. Since there's no
      TPMR component, we don't need to relay PDUs using the destination addresses
      specified in the Nearest non-TPMR section, and the portion of the
      Nearest Customer Bridge section where they must be relayed by TPMR
      components.
      
      One option to trap link-local frames to CPU port is to add static FDB
      entries with CPU port designated as destination port. However, because
      Independent VLAN Learning (IVL) is being used on every VID, each entry only
      applies to a single VLAN Identifier (VID). For a Bridge comprising a MAC
      Bridge component or a C-VLAN component, there would have to be 16 times
      4096 entries. This switch intellectual property can only hold a maximum of
      2048 entries. Using this option, there also isn't a mechanism to prevent
      link-local frames from being discarded when the spanning tree Port State of
      the reception Port is discarding.
      
      The remaining option is to utilise the BPC, RGAC1, RGAC2, RGAC3, and RGAC4
      registers. Whilst this applies to every VID, it doesn't contain all of the
      reserved MAC addresses without affecting the remaining Standard Group MAC
      Addresses. The REV_UN frame tag utilised using the RGAC4 register covers
      the remaining 01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F] destination
      addresses. It also includes the 01-80-C2-00-00-22 to 01-80-C2-00-00-FF
      destination addresses which may be relayed by MAC Bridges or VLAN Bridges.
      The latter option provides better but not complete conformance.
      
      This switch intellectual property also does not provide a mechanism to trap
      link-local frames with specific destination addresses to CPU port by
      Bridge, to conform to the filtering rules for the distinct Bridge
      components.
      
      Therefore, regardless of the type of the Bridge component, link-local
      frames with these destination addresses will be trapped to CPU port:
      
      01-80-C2-00-00-[00,01,02,03,0E]
      
      In a Bridge comprising a MAC Bridge component or a C-VLAN component:
      
        Link-local frames with these destination addresses won't be trapped to
        CPU port which won't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F]
      
      In a Bridge comprising an S-VLAN component:
      
        Link-local frames with these destination addresses will be trapped to CPU
        port which won't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-00
      
        Link-local frames with these destination addresses won't be trapped to
        CPU port which won't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-[04,05,06,07,08,09,0A]
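      The address sets listed above can be expressed as simple predicates.
      The following is an illustrative user-space sketch only:
      is_reserved_group_addr() and is_always_trapped() are hypothetical
      names, and the switch does this matching in hardware registers, not
      in code like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True for 01-80-C2-00-00-00 .. 01-80-C2-00-00-0F (the reserved range). */
static bool is_reserved_group_addr(const uint8_t mac[6])
{
	static const uint8_t prefix[5] = { 0x01, 0x80, 0xc2, 0x00, 0x00 };

	assert(mac != NULL);

	for (int i = 0; i < 5; i++)
		if (mac[i] != prefix[i])
			return false;
	return (mac[5] & 0xf0) == 0;
}

/* True for the subset trapped regardless of Bridge component type:
 * 01-80-C2-00-00-[00,01,02,03,0E]. */
static bool is_always_trapped(const uint8_t mac[6])
{
	if (!is_reserved_group_addr(mac))
		return false;
	return mac[5] <= 0x03 || mac[5] == 0x0e;
}
```

      An address such as 01-80-C2-00-00-04 is reserved but falls outside the
      always-trapped subset, which is the conformance gap noted above.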
      
      Currently on this switch intellectual property, if the spanning tree Port
      State of the reception Port is discarding, link-local frames will be
      discarded.
      
      To trap link-local frames regardless of the spanning tree Port State, make
      the switch regard them as Bridge Protocol Data Units (BPDUs). This switch
      intellectual property only lets the frames regarded as BPDUs bypass the
      spanning tree Port State function of the Forwarding Process.
      
      With this change, the only remaining interference is the ingress rules.
      When the reception Port has no PVID assigned on software, VLAN-untagged
      frames won't be allowed in. There doesn't seem to be a mechanism on the
      switch intellectual property to have link-local frames bypass this function
      of the Forwarding Process.
      
      Fixes: b8f126a8 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
      Reviewed-by: Daniel Golle <daniel@makrotopia.org>
      Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Link: https://lore.kernel.org/r/20240409-b4-for-net-mt7530-fix-link-local-when-stp-discarding-v2-1-07b1150164ac@arinc9.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: sparx5: fix wrong config being used when reconfiguring PCS · 26515606
      Daniel Machon authored
      [ Upstream commit 33623113 ]
      
      The wrong port config is being used if the PCS is reconfigured. Fix this
      by correctly using the new config instead of the old one.
      
      Fixes: 946e7fd5 ("net: sparx5: add port module support")
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240409-link-mode-reconfiguration-fix-v2-1-db6a507f3627@microchip.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/mlx5: Properly link new fs rules into the tree · 7aaee12b
      Cosmin Ratiu authored
      [ Upstream commit 7c6782ad ]
      
      Previously, add_rule_fg would only add newly created rules from the
      handle into the tree when they had a refcount of 1. On the other hand,
      create_flow_handle tries hard to find and reference already existing
      identical rules instead of creating new ones.
      
      These two behaviors can result in a situation where create_flow_handle
      1) creates a new rule and references it, then
      2) in a subsequent step during the same handle creation references it
         again,
      resulting in a rule with a refcount of 2 that is not linked into the
      tree. Such a rule has a NULL parent and root, and deleting the flow
      group then crashes because del_sw_hw_rule, invoked on rule deletion,
      assumes node->parent != NULL.
      
      This happened in the wild, due to another bug related to incorrect
      handling of duplicate pkt_reformat ids, which led to the code in
      create_flow_handle incorrectly referencing a just-added rule in the same
      flow handle, resulting in the problem described above. Full details are
      at [1].
      
      This patch changes add_rule_fg to add new rules without parents into
      the tree, properly initializing them and avoiding the crash. This makes
      it more consistent with how rules are added to an FTE in
      create_flow_handle.
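
The difference between the two linking conditions can be sketched with a small Python model. This is only an illustration under stated assumptions: `Node`, `Rule`, `add_rules` and `delete_rule` are hypothetical names, not the mlx5 data structures, and the `AttributeError` stands in for the NULL-parent crash described above.

```python
class Node:
    """Hypothetical stand-in for the tree node that rules are linked under."""
    def __init__(self):
        self.children = []


class Rule:
    def __init__(self, refcount):
        self.refcount = refcount
        self.parent = None  # not linked into the tree yet


def add_rules(rules, fte, link_unparented):
    """Link rules into the tree using one of the two conditions."""
    for rule in rules:
        if link_unparented:
            is_new = rule.parent is None   # condition described for the fix
        else:
            is_new = rule.refcount == 1    # condition described as "previously"
        if is_new:
            rule.parent = fte
            fte.children.append(rule)


def delete_rule(rule):
    # Assumes the rule has a parent, like the deletion path in the commit text.
    rule.parent.children.remove(rule)
```

A rule that was referenced twice during the same handle creation (refcount 2, no parent) is skipped by the old condition and picked up by the new one.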
      
      Fixes: 74491de9 ("net/mlx5: Add multi dest support")
      Link: https://lore.kernel.org/netdev/ea5264d6-6b55-4449-a602-214c6f509c1e@163.com/T/#u [1]
      Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240409190820.227554-5-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Eric Dumazet's avatar
      netfilter: complete validation of user input · 97dab36e
      Eric Dumazet authored
      [ Upstream commit 65acf6e0 ]
      
      In my recent commit, I missed that do_replace() handlers
      use copy_from_sockptr() (which I fixed), followed
      by unsafe copy_from_sockptr_offset() calls.
      
      In all functions, we can perform the @optlen validation
      before even calling xt_alloc_table_info() with the following
      check:
      
      if ((u64)optlen < (u64)tmp.size + sizeof(tmp))
              return -EINVAL;
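
Why both sides of the comparison are widened to 64 bits can be illustrated with a short Python model of 32-bit arithmetic. This is a sketch, not kernel code: `HEADER_SIZE` is a hypothetical stand-in for `sizeof(tmp)`, and the wrap-around only models builds where the addition would otherwise be carried out in 32 bits.

```python
U32_MASK = 0xFFFFFFFF
HEADER_SIZE = 96  # hypothetical stand-in for sizeof(tmp)


def optlen_ok_u32(optlen, size):
    """Naive check: the required length wraps around at 32 bits."""
    needed = (size + HEADER_SIZE) & U32_MASK
    return optlen >= needed


def optlen_ok_u64(optlen, size):
    """Check in the style of the commit: no wrap, so a huge size is rejected."""
    if optlen < size + HEADER_SIZE:
        return False  # the kernel code returns -EINVAL here
    return True
```

With a user-supplied `size` close to the 32-bit limit, the naive variant accepts a short buffer while the widened variant rejects it.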
      
      Fixes: 0c83842d ("netfilter: validate user input for expected length")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Link: https://lore.kernel.org/r/20240409120741.3538135-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Jiri Benc's avatar
      ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr · 4b19e950
      Jiri Benc authored
      [ Upstream commit 7633c4da ]
      
      Although ipv6_get_ifaddr walks inet6_addr_lst under the RCU lock,
      hlist_for_each_entry_rcu can still return an item that has been removed
      from the list. Thanks to RCU the memory of such an item is not freed,
      but nothing guarantees that the content of that memory is sane.
      
      In particular, the reference count can be zero. This can happen if
      ipv6_del_addr is called in parallel. ipv6_del_addr removes the entry
      from inet6_addr_lst (hlist_del_init_rcu(&ifp->addr_lst)) and drops all
      references (__in6_ifa_put(ifp) + in6_ifa_put(ifp)). With bad enough
      timing, this can happen:
      
      1. In ipv6_get_ifaddr, hlist_for_each_entry_rcu returns an entry.
      
      2. Then, the whole ipv6_del_addr is executed for the given entry. The
         reference count drops to zero and kfree_rcu is scheduled.
      
      3. ipv6_get_ifaddr continues and tries to increment the reference count
         (in6_ifa_hold).
      
      4. The RCU read lock is released and the entry is freed.
      
      5. The freed entry is returned.
      
      Prevent incrementing the reference count in such a case. The name
      in6_ifa_hold_safe is chosen to mimic the existing fib6_info_hold_safe.
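
The idea behind a "safe" hold can be sketched in Python as an increment that refuses to revive a count that has already reached zero. This is a single-process model under stated assumptions: `Entry`, `hold`, `hold_safe` and `lookup` are hypothetical names, and the lock merely stands in for the atomic operations the kernel would use.

```python
import threading


class Entry:
    def __init__(self):
        self.refcnt = 1
        self._lock = threading.Lock()  # stands in for atomic operations

    def put(self):
        """Drop a reference; True means the caller would schedule the free."""
        with self._lock:
            self.refcnt -= 1
            return self.refcnt == 0

    def hold(self):
        """Unconditional increment: can revive an entry that is being freed."""
        with self._lock:
            self.refcnt += 1

    def hold_safe(self):
        """Increment only if the entry is still alive."""
        with self._lock:
            if self.refcnt == 0:
                return False
            self.refcnt += 1
            return True


def lookup(entry, safe):
    """Model of a lookup that found `entry` while walking the list."""
    if safe:
        return entry if entry.hold_safe() else None
    entry.hold()
    return entry
```

In the interleaving listed above, the unconditional variant returns an entry whose count went from 0 back to 1, whereas the safe variant returns nothing.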
      
      [   41.506330] refcount_t: addition on 0; use-after-free.
      [   41.506760] WARNING: CPU: 0 PID: 595 at lib/refcount.c:25 refcount_warn_saturate+0xa5/0x130
      [   41.507413] Modules linked in: veth bridge stp llc
      [   41.507821] CPU: 0 PID: 595 Comm: python3 Not tainted 6.9.0-rc2.main-00208-g49563be82afa #14
      [   41.508479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
      [   41.509163] RIP: 0010:refcount_warn_saturate+0xa5/0x130
      [   41.509586] Code: ad ff 90 0f 0b 90 90 c3 cc cc cc cc 80 3d c0 30 ad 01 00 75 a0 c6 05 b7 30 ad 01 01 90 48 c7 c7 38 cc 7a 8c e8 cc 18 ad ff 90 <0f> 0b 90 90 c3 cc cc cc cc 80 3d 98 30 ad 01 00 0f 85 75 ff ff ff
      [   41.510956] RSP: 0018:ffffbda3c026baf0 EFLAGS: 00010282
      [   41.511368] RAX: 0000000000000000 RBX: ffff9e9c46914800 RCX: 0000000000000000
      [   41.511910] RDX: ffff9e9c7ec29c00 RSI: ffff9e9c7ec1c900 RDI: ffff9e9c7ec1c900
      [   41.512445] RBP: ffff9e9c43660c9c R08: 0000000000009ffb R09: 00000000ffffdfff
      [   41.512998] R10: 00000000ffffdfff R11: ffffffff8ca58a40 R12: ffff9e9c4339a000
      [   41.513534] R13: 0000000000000001 R14: ffff9e9c438a0000 R15: ffffbda3c026bb48
      [   41.514086] FS:  00007fbc4cda1740(0000) GS:ffff9e9c7ec00000(0000) knlGS:0000000000000000
      [   41.514726] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   41.515176] CR2: 000056233b337d88 CR3: 000000000376e006 CR4: 0000000000370ef0
      [   41.515713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   41.516252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   41.516799] Call Trace:
      [   41.517037]  <TASK>
      [   41.517249]  ? __warn+0x7b/0x120
      [   41.517535]  ? refcount_warn_saturate+0xa5/0x130
      [   41.517923]  ? report_bug+0x164/0x190
      [   41.518240]  ? handle_bug+0x3d/0x70
      [   41.518541]  ? exc_invalid_op+0x17/0x70
      [   41.520972]  ? asm_exc_invalid_op+0x1a/0x20
      [   41.521325]  ? refcount_warn_saturate+0xa5/0x130
      [   41.521708]  ipv6_get_ifaddr+0xda/0xe0
      [   41.522035]  inet6_rtm_getaddr+0x342/0x3f0
      [   41.522376]  ? __pfx_inet6_rtm_getaddr+0x10/0x10
      [   41.522758]  rtnetlink_rcv_msg+0x334/0x3d0
      [   41.523102]  ? netlink_unicast+0x30f/0x390
      [   41.523445]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
      [   41.523832]  netlink_rcv_skb+0x53/0x100
      [   41.524157]  netlink_unicast+0x23b/0x390
      [   41.524484]  netlink_sendmsg+0x1f2/0x440
      [   41.524826]  __sys_sendto+0x1d8/0x1f0
      [   41.525145]  __x64_sys_sendto+0x1f/0x30
      [   41.525467]  do_syscall_64+0xa5/0x1b0
      [   41.525794]  entry_SYSCALL_64_after_hwframe+0x72/0x7a
      [   41.526213] RIP: 0033:0x7fbc4cfcea9a
      [   41.526528] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
      [   41.527942] RSP: 002b:00007ffcf54012a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [   41.528593] RAX: ffffffffffffffda RBX: 00007ffcf5401368 RCX: 00007fbc4cfcea9a
      [   41.529173] RDX: 000000000000002c RSI: 00007fbc4b9d9bd0 RDI: 0000000000000005
      [   41.529786] RBP: 00007fbc4bafb040 R08: 00007ffcf54013e0 R09: 000000000000000c
      [   41.530375] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      [   41.530977] R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007fbc4ca85d1b
      [   41.531573]  </TASK>
      
      Fixes: 5c578aed ("IPv6: convert addrconf hash list to RCU")
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Link: https://lore.kernel.org/r/8ab821e36073a4a406c50ec83c9e8dc586c539e4.1712585809.git.jbenc@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Arnd Bergmann's avatar
      ipv4/route: avoid unused-but-set-variable warning · 6179cdbf
      Arnd Bergmann authored
      [ Upstream commit cf1b7201 ]
      
      The log_martians variable is only used in an #ifdef, causing a 'make W=1'
      warning with gcc:
      
      net/ipv4/route.c: In function 'ip_rt_send_redirect':
      net/ipv4/route.c:880:13: error: variable 'log_martians' set but not used [-Werror=unused-but-set-variable]
      
      Change the #ifdef to an equivalent IS_ENABLED() to let the compiler
      see where the variable is used.
      
      Fixes: 30038fc6 ("net: ip_rt_send_redirect() optimization")
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240408074219.3030256-2-arnd@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Arnd Bergmann's avatar
      ipv6: fib: hide unused 'pn' variable · ed94af8d
      Arnd Bergmann authored
      [ Upstream commit 74043489 ]
      
      When CONFIG_IPV6_SUBTREES is disabled, the only user of 'pn' is hidden,
      causing a 'make W=1' warning:
      
      net/ipv6/ip6_fib.c: In function 'fib6_add':
      net/ipv6/ip6_fib.c:1388:32: error: variable 'pn' set but not used [-Werror=unused-but-set-variable]
      
      Add another #ifdef around the variable declaration, matching the other
      uses in this file.
      
      Fixes: 66729e18 ("[IPV6] ROUTE: Make sure we have fn->leaf when adding a node on subtree.")
      Link: https://lore.kernel.org/netdev/20240322131746.904943-1-arnd@kernel.org/
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240408074219.3030256-1-arnd@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Geetha sowjanya's avatar
      octeontx2-af: Fix NIX SQ mode and BP config · 98b3e282
      Geetha sowjanya authored
      [ Upstream commit faf23006 ]
      
      NIX SQ mode and link backpressure configuration are required for
      all platforms, but in the current driver this code is wrongly placed
      under a platform-specific check. This patch fixes the issue by
      moving the code out of the platform check.
      
      Fixes: 5d9b976d ("octeontx2-af: Support fixed transmit scheduler topology")
      Signed-off-by: Geetha sowjanya <gakula@marvell.com>
      Link: https://lore.kernel.org/r/20240408063643.26288-1-gakula@marvell.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Kuniyuki Iwashima's avatar
      af_unix: Clear stale u->oob_skb. · b4bc99d0
      Kuniyuki Iwashima authored
      [ Upstream commit b46f4eaa ]
      
      syzkaller started to report deadlock of unix_gc_lock after commit
      4090fa37 ("af_unix: Replace garbage collection algorithm."), but
      it just uncovers the bug that has been there since commit 314001f0
      ("af_unix: Add OOB support").
      
      The repro basically does the following.
      
        from socket import *
        from array import array
      
        c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
        c1.sendmsg([b'a'], [(SOL_SOCKET, SCM_RIGHTS, array("i", [c2.fileno()]))], MSG_OOB)
        c2.recv(1)  # blocked as no normal data in recv queue
      
        c2.close()  # done async and unblock recv()
        c1.close()  # done async and trigger GC
      
      A socket sends its file descriptor to itself as OOB data and tries to
      receive normal data, but finally recv() fails due to async close().
      
      The problem here is wrong handling of OOB skb in manage_oob().  When
      recvmsg() is called without MSG_OOB, manage_oob() is called to check
      if the peeked skb is OOB skb.  In such a case, manage_oob() pops it
      out of the receive queue but does not clear unix_sk(sk)->oob_skb.
      This is wrong in terms of uAPI.
      
      Let's say we send "hello" with MSG_OOB, and "world" without MSG_OOB.
      The 'o' is handled as OOB data.  When recv() is called twice without
      MSG_OOB, the OOB data should be lost.
      
        >>> from socket import *
        >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM, 0)
        >>> c1.send(b'hello', MSG_OOB)  # 'o' is OOB data
        5
        >>> c1.send(b'world')
        5
        >>> c2.recv(5)  # OOB data is not received
        b'hell'
        >>> c2.recv(5)  # OOB data is skipped
        b'world'
        >>> c2.recv(5, MSG_OOB)  # This should return an error
        b'o'
      
      In the same situation, TCP actually returns -EINVAL for the last
      recv().
      
      Also, if we do not clear unix_sk(sk)->oob_skb, unix_poll() always sets
      EPOLLPRI even though the data has already been skipped by a previous recv().
      
      To avoid these issues, we must clear unix_sk(sk)->oob_skb when dequeuing
      it from recv queue.
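
The effect of clearing the OOB pointer at dequeue time can be modelled in a few lines of Python. This is a deliberately simplified sketch and not the af_unix implementation: `Skb`, `UnixStreamModel` and the `clear_on_dequeue` switch are hypothetical, and `recv()` always returns one whole queued buffer.

```python
import errno
from collections import deque


class Skb:
    def __init__(self, data):
        self.data = data


class UnixStreamModel:
    def __init__(self, clear_on_dequeue):
        self.queue = deque()
        self.oob_skb = None
        self.clear_on_dequeue = clear_on_dequeue

    def send(self, data, oob=False):
        if oob:
            if data[:-1]:
                self.queue.append(Skb(data[:-1]))
            mark = Skb(data[-1:])      # the last byte is the OOB byte
            self.queue.append(mark)
            self.oob_skb = mark
        else:
            self.queue.append(Skb(data))

    def recv(self):
        """Normal read: an OOB buffer at the head is dropped and skipped."""
        skb = self.queue.popleft()
        if skb is self.oob_skb:
            if self.clear_on_dequeue:
                self.oob_skb = None    # the behaviour the commit asks for
            skb = self.queue.popleft()
        return skb.data

    def recv_oob(self):
        if self.oob_skb is None:
            raise OSError(errno.EINVAL, "no OOB data pending")
        skb, self.oob_skb = self.oob_skb, None
        if skb in self.queue:
            self.queue.remove(skb)
        return skb.data

    def pollpri(self):
        """Model of the EPOLLPRI condition: OOB data is considered pending."""
        return self.oob_skb is not None
```

Replaying the "hello"/"world" session against the model with `clear_on_dequeue=False` yields the stale `b'o'`, while `clear_on_dequeue=True` raises `EINVAL` and stops reporting pending priority data.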
      
      The old GC did not trigger the deadlock because it relied on the
      receive queue to detect the loop.
      
      When it is triggered, the socket with OOB data is marked as GC candidate
      because file refcount == inflight count (1).  However, after traversing
      all inflight sockets, the socket still has a positive inflight count (1),
      thus the socket is excluded from candidates.  Then, the old GC loses the
      chance to garbage-collect the socket.
      
      With the old GC, the repro continues to create true garbage that will
      never be freed nor detected by kmemleak as it's linked to the global
      inflight list.  That's why we couldn't even notice the issue.
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Reported-by: <syzbot+7f7f201cc2668a8fd169@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=7f7f201cc2668a8fd169
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240405221057.2406-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Eric Dumazet's avatar
      geneve: fix header validation in geneve[6]_xmit_skb · 3c1ae6de
      Eric Dumazet authored
      [ Upstream commit d8a6213d ]
      
      syzbot is able to trigger an uninit-value in geneve_xmit() [1]
      
      Problem: While most ip tunnel helpers (like ip_tunnel_get_dsfield())
      use skb_protocol(skb, true), pskb_inet_may_pull() only uses
      skb->protocol.

      If anything other than ETH_P_IPV6 or ETH_P_IP is found in skb->protocol,
      pskb_inet_may_pull() does nothing at all.
      
      If a vlan tag was provided by the caller (af_packet in the syzbot case),
      the network header might not point to the correct location, and skb
      linear part could be smaller than expected.
      
      Add skb_vlan_inet_prepare() to perform a complete mac validation.
      
      Use this in geneve for the moment; I suspect we need to adopt this
      more broadly.
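
The kind of validation described here can be sketched in Python on a raw Ethernet frame: skip any VLAN tags to find the encapsulated protocol, then check that enough bytes are present for the IP header at the resulting offset. This is an illustration only and does not reproduce the logic of skb_vlan_inet_prepare(); the function name `inet_prepare` and the choice to let non-IP protocols pass are assumptions of the sketch.

```python
import struct

ETH_HLEN = 14
VLAN_HLEN = 4
ETH_P_IP = 0x0800
ETH_P_IPV6 = 0x86DD
VLAN_TYPES = (0x8100, 0x88A8)                 # 802.1Q and 802.1ad tags
MIN_L3_LEN = {ETH_P_IP: 20, ETH_P_IPV6: 40}   # fixed IP header sizes


def inet_prepare(frame):
    """Return (protocol, l3_offset), or None if the frame is too short."""
    if len(frame) < ETH_HLEN:
        return None
    (proto,) = struct.unpack_from("!H", frame, 12)
    offset = ETH_HLEN
    # Walk past VLAN tags instead of trusting the outer EtherType.
    while proto in VLAN_TYPES:
        if len(frame) < offset + VLAN_HLEN:
            return None
        (proto,) = struct.unpack_from("!H", frame, offset + 2)
        offset += VLAN_HLEN
    need = MIN_L3_LEN.get(proto, 0)  # assumption: nothing to check for non-IP
    if len(frame) < offset + need:
        return None
    return proto, offset
```

A tagged IPv4 frame is reported at offset 18 rather than 14, and a frame truncated inside the IP header is rejected instead of being read past its end.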
      
      v4 - Jakub reported v3 broke l2_tos_ttl_inherit.sh selftest
         - Only call __vlan_get_protocol() for vlan types.
      Link: https://lore.kernel.org/netdev/20240404100035.3270a7d5@kernel.org/
      
      v2,v3 - Addressed Sabrina comments on v1 and v2
      Link: https://lore.kernel.org/netdev/Zg1l9L2BNoZWZDZG@hog/
      
      [1]
      
      BUG: KMSAN: uninit-value in geneve_xmit_skb drivers/net/geneve.c:910 [inline]
       BUG: KMSAN: uninit-value in geneve_xmit+0x302d/0x5420 drivers/net/geneve.c:1030
        geneve_xmit_skb drivers/net/geneve.c:910 [inline]
        geneve_xmit+0x302d/0x5420 drivers/net/geneve.c:1030
        __netdev_start_xmit include/linux/netdevice.h:4903 [inline]
        netdev_start_xmit include/linux/netdevice.h:4917 [inline]
        xmit_one net/core/dev.c:3531 [inline]
        dev_hard_start_xmit+0x247/0xa20 net/core/dev.c:3547
        __dev_queue_xmit+0x348d/0x52c0 net/core/dev.c:4335
        dev_queue_xmit include/linux/netdevice.h:3091 [inline]
        packet_xmit+0x9c/0x6c0 net/packet/af_packet.c:276
        packet_snd net/packet/af_packet.c:3081 [inline]
        packet_sendmsg+0x8bb0/0x9ef0 net/packet/af_packet.c:3113
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg+0x30f/0x380 net/socket.c:745
        __sys_sendto+0x685/0x830 net/socket.c:2191
        __do_sys_sendto net/socket.c:2203 [inline]
        __se_sys_sendto net/socket.c:2199 [inline]
        __x64_sys_sendto+0x125/0x1d0 net/socket.c:2199
       do_syscall_64+0xd5/0x1f0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      Uninit was created at:
        slab_post_alloc_hook mm/slub.c:3804 [inline]
        slab_alloc_node mm/slub.c:3845 [inline]
        kmem_cache_alloc_node+0x613/0xc50 mm/slub.c:3888
        kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:577
        __alloc_skb+0x35b/0x7a0 net/core/skbuff.c:668
        alloc_skb include/linux/skbuff.h:1318 [inline]
        alloc_skb_with_frags+0xc8/0xbf0 net/core/skbuff.c:6504
        sock_alloc_send_pskb+0xa81/0xbf0 net/core/sock.c:2795
        packet_alloc_skb net/packet/af_packet.c:2930 [inline]
        packet_snd net/packet/af_packet.c:3024 [inline]
        packet_sendmsg+0x722d/0x9ef0 net/packet/af_packet.c:3113
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg+0x30f/0x380 net/socket.c:745
        __sys_sendto+0x685/0x830 net/socket.c:2191
        __do_sys_sendto net/socket.c:2203 [inline]
        __se_sys_sendto net/socket.c:2199 [inline]
        __x64_sys_sendto+0x125/0x1d0 net/socket.c:2199
       do_syscall_64+0xd5/0x1f0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      CPU: 0 PID: 5033 Comm: syz-executor346 Not tainted 6.9.0-rc1-syzkaller-00005-g928a87efa423 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
      
      Fixes: d13f048d ("net: geneve: modify IP header check in geneve6_xmit_skb and geneve_xmit_skb")
      Reported-by: <syzbot+9ee20ec1de7b3168db09@syzkaller.appspotmail.com>
      Closes: https://lore.kernel.org/netdev/000000000000d19c3a06152f9ee4@google.com/
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Phillip Potter <phil@philpotter.co.uk>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: Phillip Potter <phil@philpotter.co.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Eric Dumazet's avatar
      xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING · f0a068de
      Eric Dumazet authored
      [ Upstream commit 237f3cf1 ]
      
      syzbot reported an illegal copy in xsk_setsockopt() [1]
      
      Make sure to validate setsockopt() @optlen parameter.
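
The principle is the usual one for option handlers: check the caller-supplied length before copying a fixed-size value. A minimal Python sketch follows, assuming the option payload is a single 32-bit integer (the report below shows a 4-byte read from a 2-byte allocation); `parse_ring_entries` is a hypothetical name, not a kernel function.

```python
import errno
import struct

ENTRIES_FMT = "=I"                           # assumed payload: one 32-bit integer
ENTRIES_SIZE = struct.calcsize(ENTRIES_FMT)  # 4 bytes


def parse_ring_entries(optval):
    """Reject short buffers before reading the fixed-size value."""
    if len(optval) < ENTRIES_SIZE:
        raise OSError(errno.EINVAL, "optlen too small")
    (entries,) = struct.unpack_from(ENTRIES_FMT, optval)
    return entries
```

Python raises on a short buffer either way; in C the unchecked copy silently reads past the allocation, which is what the sanitizer report shows.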
      
      [1]
      
       BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
       BUG: KASAN: slab-out-of-bounds in copy_from_sockptr include/linux/sockptr.h:55 [inline]
       BUG: KASAN: slab-out-of-bounds in xsk_setsockopt+0x909/0xa40 net/xdp/xsk.c:1420
      Read of size 4 at addr ffff888028c6cde3 by task syz-executor.0/7549
      
      CPU: 0 PID: 7549 Comm: syz-executor.0 Not tainted 6.8.0-syzkaller-08951-gfe46a7dd189e #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
      Call Trace:
       <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
        print_address_description mm/kasan/report.c:377 [inline]
        print_report+0x169/0x550 mm/kasan/report.c:488
        kasan_report+0x143/0x180 mm/kasan/report.c:601
        copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
        copy_from_sockptr include/linux/sockptr.h:55 [inline]
        xsk_setsockopt+0x909/0xa40 net/xdp/xsk.c:1420
        do_sock_setsockopt+0x3af/0x720 net/socket.c:2311
        __sys_setsockopt+0x1ae/0x250 net/socket.c:2334
        __do_sys_setsockopt net/socket.c:2343 [inline]
        __se_sys_setsockopt net/socket.c:2340 [inline]
        __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
       do_syscall_64+0xfb/0x240
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      RIP: 0033:0x7fb40587de69
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007fb40665a0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 00007fb4059abf80 RCX: 00007fb40587de69
      RDX: 0000000000000005 RSI: 000000000000011b RDI: 0000000000000006
      RBP: 00007fb4058ca47a R08: 0000000000000002 R09: 0000000000000000
      R10: 0000000020001980 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000000000b R14: 00007fb4059abf80 R15: 00007fff57ee4d08
       </TASK>
      
      Allocated by task 7549:
        kasan_save_stack mm/kasan/common.c:47 [inline]
        kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
        poison_kmalloc_redzone mm/kasan/common.c:370 [inline]
        __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387
        kasan_kmalloc include/linux/kasan.h:211 [inline]
        __do_kmalloc_node mm/slub.c:3966 [inline]
        __kmalloc+0x233/0x4a0 mm/slub.c:3979
        kmalloc include/linux/slab.h:632 [inline]
        __cgroup_bpf_run_filter_setsockopt+0xd2f/0x1040 kernel/bpf/cgroup.c:1869
        do_sock_setsockopt+0x6b4/0x720 net/socket.c:2293
        __sys_setsockopt+0x1ae/0x250 net/socket.c:2334
        __do_sys_setsockopt net/socket.c:2343 [inline]
        __se_sys_setsockopt net/socket.c:2340 [inline]
        __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
       do_syscall_64+0xfb/0x240
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      The buggy address belongs to the object at ffff888028c6cde0
       which belongs to the cache kmalloc-8 of size 8
      The buggy address is located 1 bytes to the right of
       allocated 2-byte region [ffff888028c6cde0, ffff888028c6cde2)
      
      The buggy address belongs to the physical page:
      page:ffffea0000a31b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888028c6c9c0 pfn:0x28c6c
      anon flags: 0xfff00000000800(slab|node=0|zone=1|lastcpupid=0x7ff)
      page_type: 0xffffffff()
      raw: 00fff00000000800 ffff888014c41280 0000000000000000 dead000000000001
      raw: ffff888028c6c9c0 0000000080800057 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112cc0(GFP_USER|__GFP_NOWARN|__GFP_NORETRY), pid 6648, tgid 6644 (syz-executor.0), ts 133906047828, free_ts 133859922223
        set_page_owner include/linux/page_owner.h:31 [inline]
        post_alloc_hook+0x1ea/0x210 mm/page_alloc.c:1533
        prep_new_page mm/page_alloc.c:1540 [inline]
        get_page_from_freelist+0x33ea/0x3580 mm/page_alloc.c:3311
        __alloc_pages+0x256/0x680 mm/page_alloc.c:4569
        __alloc_pages_node include/linux/gfp.h:238 [inline]
        alloc_pages_node include/linux/gfp.h:261 [inline]
        alloc_slab_page+0x5f/0x160 mm/slub.c:2175
        allocate_slab mm/slub.c:2338 [inline]
        new_slab+0x84/0x2f0 mm/slub.c:2391
        ___slab_alloc+0xc73/0x1260 mm/slub.c:3525
        __slab_alloc mm/slub.c:3610 [inline]
        __slab_alloc_node mm/slub.c:3663 [inline]
        slab_alloc_node mm/slub.c:3835 [inline]
        __do_kmalloc_node mm/slub.c:3965 [inline]
        __kmalloc_node+0x2db/0x4e0 mm/slub.c:3973
        kmalloc_node include/linux/slab.h:648 [inline]
        __vmalloc_area_node mm/vmalloc.c:3197 [inline]
        __vmalloc_node_range+0x5f9/0x14a0 mm/vmalloc.c:3392
        __vmalloc_node mm/vmalloc.c:3457 [inline]
        vzalloc+0x79/0x90 mm/vmalloc.c:3530
        bpf_check+0x260/0x19010 kernel/bpf/verifier.c:21162
        bpf_prog_load+0x1667/0x20f0 kernel/bpf/syscall.c:2895
        __sys_bpf+0x4ee/0x810 kernel/bpf/syscall.c:5631
        __do_sys_bpf kernel/bpf/syscall.c:5738 [inline]
        __se_sys_bpf kernel/bpf/syscall.c:5736 [inline]
        __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5736
       do_syscall_64+0xfb/0x240
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      page last free pid 6650 tgid 6647 stack trace:
        reset_page_owner include/linux/page_owner.h:24 [inline]
        free_pages_prepare mm/page_alloc.c:1140 [inline]
        free_unref_page_prepare+0x95d/0xa80 mm/page_alloc.c:2346
        free_unref_page_list+0x5a3/0x850 mm/page_alloc.c:2532
        release_pages+0x2117/0x2400 mm/swap.c:1042
        tlb_batch_pages_flush mm/mmu_gather.c:98 [inline]
        tlb_flush_mmu_free mm/mmu_gather.c:293 [inline]
        tlb_flush_mmu+0x34d/0x4e0 mm/mmu_gather.c:300
        tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:392
        exit_mmap+0x4b6/0xd40 mm/mmap.c:3300
        __mmput+0x115/0x3c0 kernel/fork.c:1345
        exit_mm+0x220/0x310 kernel/exit.c:569
        do_exit+0x99e/0x27e0 kernel/exit.c:865
        do_group_exit+0x207/0x2c0 kernel/exit.c:1027
        get_signal+0x176e/0x1850 kernel/signal.c:2907
        arch_do_signal_or_restart+0x96/0x860 arch/x86/kernel/signal.c:310
        exit_to_user_mode_loop kernel/entry/common.c:105 [inline]
        exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
        __syscall_exit_to_user_mode_work kernel/entry/common.c:201 [inline]
        syscall_exit_to_user_mode+0xc9/0x360 kernel/entry/common.c:212
        do_syscall_64+0x10a/0x240 arch/x86/entry/common.c:89
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      Memory state around the buggy address:
       ffff888028c6cc80: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
       ffff888028c6cd00: fa fc fc fc fa fc fc fc 00 fc fc fc 06 fc fc fc
      >ffff888028c6cd80: fa fc fc fc fa fc fc fc fa fc fc fc 02 fc fc fc
                                                             ^
       ffff888028c6ce00: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
       ffff888028c6ce80: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
      
      Fixes: 423f3832 ("xsk: add umem fill queue support and mmap")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: "Björn Töpel" <bjorn@kernel.org>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/r/20240404202738.3634547-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Sebastian Andrzej Siewior's avatar
      u64_stats: Disable preemption on 32bit UP+SMP PREEMPT_RT during updates. · a9dca26b
      Sebastian Andrzej Siewior authored
      [ Upstream commit 3c118547 ]
      
      On PREEMPT_RT the seqcount_t for synchronisation is required on 32bit
      architectures even on UP because the softirq (and the threaded IRQ handler) can
      be preempted.
      
      With the seqcount_t for synchronisation, a reader with higher priority can
      preempt the writer and then spin endlessly in read_seqcount_begin() while the
      writer can't make progress.
      
      To avoid such a lock up on PREEMPT_RT the writer must disable preemption during
      the update. There is no need to disable interrupts because no writer is using
      this API in hard-IRQ context on PREEMPT_RT.
      
      Disable preemption on 32bit-RT within the u64_stats write section.
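
The lock-up can be illustrated with a single-threaded Python model of a sequence counter protecting a 64-bit value stored as two 32-bit halves. Everything here is a sketch under stated assumptions: `SeqCount`, `Stats`, the `preempted_midway` flag and the `max_spins` bound are hypothetical devices for the demonstration; a real reader has no such bound and would spin for as long as the writer stays preempted.

```python
class SeqCount:
    def __init__(self):
        self.seq = 0  # odd while a write is in progress

    def write_begin(self):
        self.seq += 1

    def write_end(self):
        self.seq += 1

    def read_begin(self, max_spins):
        """Wait for an even sequence; None means the reader gave up."""
        for _ in range(max_spins):
            if self.seq % 2 == 0:
                return self.seq
        return None

    def read_retry(self, start):
        return self.seq != start


class Stats:
    def __init__(self):
        self.sync = SeqCount()
        self.lo = 0
        self.hi = 0

    def add(self, n, preempted_midway=False):
        self.sync.write_begin()
        total = ((self.hi << 32) | self.lo) + n
        self.lo = total & 0xFFFFFFFF
        if preempted_midway:
            return False  # writer scheduled out, sequence left odd
        self.hi = total >> 32
        self.sync.write_end()
        return True

    def read(self, max_spins=1000):
        while True:
            start = self.sync.read_begin(max_spins)
            if start is None:
                return None  # a real reader would keep spinning here
            value = (self.hi << 32) | self.lo
            if not self.sync.read_retry(start):
                return value
```

If the writer cannot be preempted inside the update, the state with an odd sequence is never visible to a higher-priority reader on the same CPU, so the reader cannot get stuck.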
      
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Stable-dep-of: 38a15d0a ("u64_stats: fix u64_stats_init() for lockdep when used repeatedly in one file")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Ilya Maximets's avatar
      net: openvswitch: fix unwanted error log on timeout policy probing · 11e04135
      Ilya Maximets authored
      [ Upstream commit 4539f91f ]
      
      On startup, ovs-vswitchd probes different datapath features including
      support for timeout policies.  While probing, it tries to execute
      certain operations with OVS_PACKET_ATTR_PROBE or OVS_FLOW_ATTR_PROBE
      attributes set.  These attributes tell the openvswitch module to not
      log any errors when they occur as it is expected that some of the
      probes will fail.
      
      For some reason, setting the timeout policy ignores the PROBE attribute
      and logs a failure anyway.  This is causing the following kernel log
      on each re-start of ovs-vswitchd:
      
        kernel: Failed to associated timeout policy `ovs_test_tp'
      
      Fix that by using the same logging macro that all other messages are
      using.  The message will still be printed at info level when needed
      and will be rate limited, but with a net rate limiter instead of
      generic printk one.
      
      The nf_ct_set_timeout() itself will still print some info messages,
      but at least this change makes logging in openvswitch module more
      consistent.
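
The intended behaviour, a message that is suppressed while probing and rate limited otherwise, can be sketched as follows. This is a toy model: `NetLogger`, `nlerr`, `set_timeout_policy`, the burst counter and the `-1` return value are hypothetical and do not correspond to the openvswitch macros or to the kernel rate limiter.

```python
class NetLogger:
    def __init__(self, burst):
        self.remaining = burst  # crude stand-in for a rate limiter
        self.lines = []

    def nlerr(self, logging_allowed, message):
        """Log only when allowed and while the budget lasts."""
        if not logging_allowed:
            return False
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        self.lines.append(message)
        return True


def set_timeout_policy(logger, name, probe, succeed):
    if succeed:
        return 0
    # Probe requests are expected to fail sometimes, so they stay silent.
    logger.nlerr(not probe, "Failed to associated timeout policy `%s'" % name)
    return -1
```

A failing probe leaves the log empty, while repeated non-probe failures are recorded only until the budget is used up.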
      
      Fixes: 06bd2bdf ("openvswitch: Add timeout support to ct action")
      Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
      Acked-by: Eelco Chaudron <echaudro@redhat.com>
      Link: https://lore.kernel.org/r/20240403203803.2137962-1-i.maximets@ovn.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Dan Carpenter's avatar
      scsi: qla2xxx: Fix off by one in qla_edif_app_getstats() · 8c820f7c
      Dan Carpenter authored
      [ Upstream commit 4406e417 ]
      
      The app_reply->elem[] array is allocated earlier in this function and it
      has app_req.num_ports elements.  Thus this > comparison needs to be >= to
      prevent memory corruption.
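
The off-by-one can be reproduced in a few lines of Python, where the out-of-range write raises instead of corrupting memory. This is a sketch: `fill_stats` and its parameters are hypothetical and only mirror the shape of the loop implied by the description.

```python
def fill_stats(ports, num_ports, inclusive_check):
    """Copy per-port entries into a reply array with num_ports slots."""
    elem = [None] * num_ports
    pcnt = 0
    for port in ports:
        if inclusive_check:
            full = pcnt >= num_ports   # corrected comparison
        else:
            full = pcnt > num_ports    # original comparison, one too late
        if full:
            break
        elem[pcnt] = port  # IndexError here when pcnt == num_ports
        pcnt += 1
    return elem
```

With three ports and a two-element reply, the `>` variant attempts to write `elem[2]`, while the `>=` variant stops after the second entry.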
      
      Fixes: 7878f22a ("scsi: qla2xxx: edif: Add getfcinfo and statistic bsgs")
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Link: https://lore.kernel.org/r/5c125b2f-92dd-412b-9b6f-fc3a3207bd60@moroto.mountain
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Arnd Bergmann's avatar
      nouveau: fix function cast warning · 5562dbfc
      Arnd Bergmann authored
      [ Upstream commit 185fdb46 ]
      
      Calling a function through an incompatible pointer type breaks
      kcfi, so clang warns about the assignment:
      
      drivers/gpu/drm/nouveau/nvkm/subdev/bios/shadowof.c:73:10: error: cast from 'void (*)(const void *)' to 'void (*)(void *)' converts to incompatible function type [-Werror,-Wcast-function-type-strict]
         73 |         .fini = (void(*)(void *))kfree,
      
      Avoid this with a trivial wrapper.
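The wrapper approach can be sketched as follows. This is a userspace illustration under assumptions: `struct shadow_ops`, `release_const()` and `shadow_fini()` are hypothetical names, and `free()` stands in for `kfree()`. The idea is that the callback slot is filled with a function whose prototype matches exactly, so no function-pointer cast is needed.

```c
#include <stdlib.h>

/* Hypothetical ops table whose callback takes a non-const pointer. */
struct shadow_ops {
	void (*fini)(void *priv);
};

/* Stand-in for a release routine with a const-qualified parameter. */
static void release_const(const void *p)
{
	free((void *)p);
}

/*
 * Trivial wrapper with exactly the prototype the ops table expects.
 * The indirect call now goes through a matching function type.
 */
static void shadow_fini(void *priv)
{
	release_const(priv);
}

static const struct shadow_ops ops = {
	.fini = shadow_fini,
};
```

The cost is one extra direct call; the benefit is that indirect-call type checking sees identical types on both sides.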
      
      Fixes: c39f472e ("drm/nouveau: remove symlinks, move core/ to nvkm/ (no code changes)")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Danilo Krummrich <dakr@redhat.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240404160234.2923554-1-arnd@kernel.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Alex Constantino's avatar
      Revert "drm/qxl: simplify qxl_fence_wait" · 8d278fc3
      Alex Constantino authored
      [ Upstream commit 07ed11af ]
      
      This reverts commit 5a838e5d.
      
      Changes from commit 5a838e5d ("drm/qxl: simplify qxl_fence_wait") would
      result in a '[TTM] Buffer eviction failed' exception whenever it reached a
      timeout.
      Due to a dependency on DMA_FENCE_WARN, this also restores some code deleted
      by commit d72277b6 ("dma-buf: nuke DMA_FENCE_TRACE macros v2").
      
      Fixes: 5a838e5d ("drm/qxl: simplify qxl_fence_wait")
      Link: https://lore.kernel.org/regressions/ZTgydqRlK6WX_b29@eldamar.lan/
      Reported-by: Timo Lindfors <timo.lindfors@iki.fi>
      Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
      Signed-off-by: Alex Constantino <dreaming.about.electric.sheep@gmail.com>
      Signed-off-by: Maxime Ripard <mripard@kernel.org>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240404181448.1643-2-dreaming.about.electric.sheep@gmail.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Frank Li's avatar
      arm64: dts: imx8-ss-conn: fix usdhc wrong lpcg clock order · 42beda7d
      Frank Li authored
      [ Upstream commit c6ddd6e7 ]
      
      The actual clock shows the wrong frequency:
      
         echo on >/sys/devices/platform/bus\@5b000000/5b010000.mmc/power/control
         cat /sys/kernel/debug/mmc0/ios
      
         clock:          200000000 Hz
         actual clock:   166000000 Hz
                         ^^^^^^^^^
         .....
      
      According to
      
      sdhc0_lpcg: clock-controller@5b200000 {
                      compatible = "fsl,imx8qxp-lpcg";
                      reg = <0x5b200000 0x10000>;
                      #clock-cells = <1>;
                      clocks = <&clk IMX_SC_R_SDHC_0 IMX_SC_PM_CLK_PER>,
                               <&conn_ipg_clk>, <&conn_axi_clk>;
                      clock-indices = <IMX_LPCG_CLK_0>, <IMX_LPCG_CLK_4>,
                                      <IMX_LPCG_CLK_5>;
                      clock-output-names = "sdhc0_lpcg_per_clk",
                                           "sdhc0_lpcg_ipg_clk",
                                           "sdhc0_lpcg_ahb_clk";
                      power-domains = <&pd IMX_SC_R_SDHC_0>;
              }
      
      "per_clk" should be IMX_LPCG_CLK_0 instead of IMX_LPCG_CLK_5.
      
      After correcting the clock order:
      
         echo on >/sys/devices/platform/bus\@5b000000/5b010000.mmc/power/control
         cat /sys/kernel/debug/mmc0/ios
      
         clock:          200000000 Hz
         actual clock:   198000000 Hz
                         ^^^^^^^^
         ...
      
      Fixes: 16c4ea75 ("arm64: dts: imx8: switch to new lpcg clock binding")
      Signed-off-by: Frank Li <Frank.Li@nxp.com>
      Signed-off-by: Shawn Guo <shawnguo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Nini Song's avatar
      media: cec: core: remove length check of Timer Status · cc7b83f0
      Nini Song authored
      commit ce5d241c upstream.
      
      valid_la is used to check the length requirements, including the
      special cases of Timer Status. If the length is shorter than 5,
      meaning that no Duration Available is returned, the message is
      forced to be invalid.
      
      However, the spec describes Duration Available as a parameter that
      may be returned in these cases, or that can optionally be returned
      in these cases. The key words in the spec description leave the
      choice open.
      
      Remove the special length check of Timer Status to match the
      spec, which does not make the parameter compulsory.
      
      Signed-off-by: Nini Song <nini.song@mediatek.com>
      Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Dmitry Antipov's avatar
      Bluetooth: Fix memory leak in hci_req_sync_complete() · 75193678
      Dmitry Antipov authored
      commit 45d355a9 upstream.
      
      In 'hci_req_sync_complete()', always free the previous sync
      request state before assigning reference to a new one.
      
      Reported-by: <syzbot+39ec16ff6cc18b1d066d@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=39ec16ff6cc18b1d066d
      Cc: stable@vger.kernel.org
      Fixes: f60cb305 ("Bluetooth: Convert hci_req_sync family of function to new request API")
      Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Only update pages_touched when a new page is touched · 53e494b7
      Steven Rostedt (Google) authored
      commit ffe3986f upstream.
      
      The "buffer_percent" logic is used by the ring buffer splice code to
      wake up waiting tasks only after the buffer has filled to the percentage
      given in the "buffer_percent" file. It depends on three variables that
      determine the amount of data that is in the ring buffer:
      
       1) pages_read - incremented whenever a new sub-buffer is consumed
       2) pages_lost - incremented every time a writer overwrites a sub-buffer
       3) pages_touched - incremented when a write goes to a new sub-buffer
      
      The percentage is the calculation of:
      
        (pages_touched - (pages_lost + pages_read)) / nr_pages
      
      Basically, the amount of data is the total number of sub-bufs that have been
      touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
      divided by the total count to give the buffer percentage. When the
      percentage is greater than the value in the "buffer_percent" file, it
      wakes up splice readers waiting for that amount.
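The calculation above can be written out as a small sketch. The function names and the use of plain `size_t` counters are assumptions for illustration; the kernel's own bookkeeping is more involved. The sketch only shows the arithmetic and the "greater than" comparison against the configured percentage.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Fill level in percent, following
 *   (pages_touched - (pages_lost + pages_read)) / nr_pages
 * scaled by 100. Returns 0 for an empty configuration.
 */
static size_t buffer_fill_percent(size_t pages_touched, size_t pages_lost,
				  size_t pages_read, size_t nr_pages)
{
	size_t dirty = pages_touched - (pages_lost + pages_read);

	return nr_pages ? (dirty * 100) / nr_pages : 0;
}

/* Wake readers only when the fill level exceeds the configured percentage. */
static bool should_wake(size_t fill_percent, size_t buffer_percent)
{
	return fill_percent > buffer_percent;
}
```

If `pages_touched` is counted twice for the same sub-buffer, `dirty` is overestimated, so readers are woken before the real fill level reaches the threshold. That matches the drift described in the following paragraphs.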
      
      It was observed that over time, the amount read from the splice was
      constantly decreasing the longer the trace was running. That is, if one
      asked for 60%, it would read over 60% when it first starts tracing, but
      then it would be woken up at under 60% and would slowly decrease the
      amount of data read after being woken up, where the amount becomes much
      less than the buffer percent.
      
      This was due to an accounting error in how pages_touched is incremented. This
      value is incremented whenever a writer transfers to a new sub-buffer. But
      the place where it was incremented was incorrect. If a writer overflowed
      the current sub-buffer it would go to the next one. If it gets preempted
      by an interrupt at that time, and the interrupt performs a trace, it too
      will end up going to the next sub-buffer. But only one should increment
      the counter. Unfortunately, that was not the case.
      
      Change the cmpxchg() that does the real switch of the tail-page into a
      try_cmpxchg(), and on success, perform the increment of pages_touched. This
      increments the counter only once when a writer moves to a new sub-buffer,
      rather than once per writer when a writer and its preempting writer race
      to move to the same new sub-buffer.
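The "increment only on a successful exchange" pattern can be sketched in userspace with C11 atomics. This is an analogy, not the ring buffer code: `struct ring`, `advance_tail()` and the integer page index are hypothetical, and `atomic_compare_exchange_strong()` stands in for the kernel's `try_cmpxchg()`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical ring state: current tail sub-buffer index and a counter. */
struct ring {
	atomic_int tail_page;
	atomic_ulong pages_touched;
};

/*
 * Try to move the tail from old_tail to new_tail.
 * Only the caller whose exchange succeeds bumps pages_touched, so two
 * racing writers that both want the same move count it exactly once.
 */
static bool advance_tail(struct ring *r, int old_tail, int new_tail)
{
	int expected = old_tail;

	if (atomic_compare_exchange_strong(&r->tail_page, &expected, new_tail)) {
		atomic_fetch_add(&r->pages_touched, 1);
		return true;
	}
	return false;
}
```

A second caller that still believes the tail is `old_tail` fails the exchange and leaves the counter untouched.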
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240409151309.0d0e5056@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: 2c2b0a78 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Sven Eckelmann's avatar
      batman-adv: Avoid infinite loop trying to resize local TT · 87b6af1a
      Sven Eckelmann authored
      commit b1f532a3 upstream.
      
      If the MTU of one of the attached interfaces becomes too small to transmit
      the local translation table then it must be resized to fit inside all
      fragments (when enabled) or a single packet.
      
      But if the MTU becomes too low to transmit even the header + the VLAN
      specific part then the resizing of the local TT will never succeed. This
      can for example happen when the usable space is 110 bytes and 11 VLANs are
      on top of batman-adv. In this case, at least 116 bytes would be needed.
      There will just be an endless spam of
      
         batman_adv: batadv0: Forced to purge local tt entries to fit new maximum fragment MTU (110)
      
      in the log but the function will never finish. The problem here is that the
      timeout is halved on every pass, stagnates at 0, and is therefore never
      able to reduce the table any further.
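The termination problem can be illustrated with a simplified model. This is not the batman-adv implementation and not necessarily how the patch resolves it: `struct tt_table`, `purge()` and `resize_to_fit()` are hypothetical, and the purge behaviour is invented for the sketch. It shows one way to make such a loop finite, namely giving up once the timeout has reached 0 and the table still does not fit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical table: part of it can never be purged (header, VLANs, ...). */
struct tt_table {
	size_t fixed_bytes;
	size_t purgeable_bytes;
};

/* Invented purge model: a timeout of 0 drops everything purgeable. */
static void purge(struct tt_table *t, unsigned int timeout)
{
	if (timeout == 0)
		t->purgeable_bytes = 0;
	else
		t->purgeable_bytes /= 2;
}

/*
 * Shrink the table until it fits into max_bytes.
 * Returns false instead of spinning forever when the timeout has
 * already reached 0 and the fixed part alone is still too large.
 */
static bool resize_to_fit(struct tt_table *t, size_t max_bytes,
			  unsigned int timeout)
{
	while (t->fixed_bytes + t->purgeable_bytes > max_bytes) {
		if (timeout == 0)
			return false;
		timeout /= 2;
		purge(t, timeout);
	}
	return true;
}
```

With 116 fixed bytes and a 110 byte limit, the loop ends with `false` after the timeout reaches 0, rather than repeating the purge indefinitely.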
      
      There are other scenarios possible with a similar result. The number of
      BATADV_TT_CLIENT_NOPURGE entries in the local TT can for example be too
      high to fit inside a packet. Such a scenario can therefore also happen with
      only a single VLAN + 7 non-purgeable addresses - requiring at least 120
      bytes.
      
      While this should be handled proactively when:
      
      * interface with too low MTU is added
      * VLAN is added
      * non-purgeable local mac is added
      * MTU of an attached interface is reduced
      * fragmentation setting gets disabled (which most likely requires dropping
        attached interfaces)
      
      not all of these scenarios can be prevented because batman-adv is only
      consuming events without the possibility to prevent these actions
      (non-purgeable MAC address added, MTU of an attached interface is reduced).
      It is therefore necessary to also make sure that the code is able to handle
      situations where an incompatible system configuration is already
      present.
      
      Cc: stable@vger.kernel.org
      Fixes: a19d3d85 ("batman-adv: limit local translation table max size")
      Reported-by: <syzbot+a6a4b5bb3da165594cff@syzkaller.appspotmail.com>
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. Apr 13, 2024
    • Greg Kroah-Hartman's avatar
      Linux 5.15.155 · fa3df276
      Greg Kroah-Hartman authored
      
      
      Link: https://lore.kernel.org/r/20240411095407.982258070@linuxfoundation.org
      Tested-by: SeongJae Park <sj@kernel.org>
      Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Tested-by: kernelci.org bot <bot@kernelci.org>
      Tested-by: Ron Economos <re@w6rz.net>
      Tested-by: Jon Hunter <jonathanh@nvidia.com>
      Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Tested-by: Kelsey Steele <kelseysteele@linux.microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      v5.15.155
    • Greg Kroah-Hartman's avatar
      Revert "ACPI: CPPC: Use access_width over bit_width for system memory accesses" · b54c4632
      Greg Kroah-Hartman authored
      This reverts commit 4949affd which is
      commit 2f4a4d63 upstream.
      
      It breaks AmpereOne systems and should not have been added to the stable
      tree just yet.
      
      Link: https://lore.kernel.org/r/97d25ef7-dee9-4cc5-842a-273f565869b3@linux.microsoft.com
      Reported-by: Easwar Hariharan <eahariha@linux.microsoft.com>
      Cc: Jarred White <jarredwhite@linux.microsoft.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Vasiliy Kovalev's avatar
      VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler() · 1793e6b2
      Vasiliy Kovalev authored
      commit e606e4b7 upstream.
      
      The changes are similar to those given in the commit 19b070fe
      ("VMCI: Fix memcpy() run-time warning in dg_dispatch_as_host()").
      
      Fix filling of the msg and msg_payload in dg_info struct, which prevents a
      possible "detected field-spanning write" of memcpy warning that is issued
      by the tracking mechanism __fortify_memcpy_chk.
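The general shape of such a fix can be sketched in userspace. The structures and `copy_dgram()` below are hypothetical simplifications, not the VMCI definitions. Instead of one `memcpy()` that starts at the header field and runs on into the trailing payload (a field-spanning write), the header and the payload are copied separately, each into its own destination field.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical datagram header; payload bytes follow it in memory. */
struct dgram {
	uint64_t payload_size;
};

/* Hypothetical queue entry: header copy plus a flexible payload array. */
struct dg_info {
	struct dgram msg;
	uint8_t msg_payload[];
};

/*
 * Copy header and payload with two bounded memcpy() calls so that
 * neither write exceeds the size of the field it targets.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static struct dg_info *copy_dgram(const struct dgram *dg)
{
	struct dg_info *info = malloc(sizeof(*info) + dg->payload_size);

	if (!info)
		return NULL;
	memcpy(&info->msg, dg, sizeof(info->msg));
	if (dg->payload_size)
		memcpy(info->msg_payload, dg + 1, dg->payload_size);
	return info;
}
```

The copied bytes are the same as with a single large copy; only the way the destination is described to the compiler and to the fortify checks changes.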
      
      Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org>
      Link: https://lore.kernel.org/r/20240219105315.76955-1-kovalev@altlinux.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Luiz Augusto von Dentz's avatar
      Bluetooth: btintel: Fixe build regression · dd883e01
      Luiz Augusto von Dentz authored
      commit 6e62ebfb upstream.
      
      This fixes the following build regression:
      
      drivers-bluetooth-btintel.c-btintel_read_version()-warn:
      passing-zero-to-PTR_ERR
      
      Fixes: b79e0409 ("Bluetooth: btintel: Fix null ptr deref in btintel_read_version")
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Gwendal Grignou's avatar
      platform/x86: intel-vbtn: Update tablet mode switch at end of probe · bb6b8827
      Gwendal Grignou authored
      [ Upstream commit 434e5781 ]
      
      ACER Vivobook Flip (TP401NAS) virtual intel switch is implemented as
      follows:
      
         Device (VGBI)
         {
             Name (_HID, EisaId ("INT33D6") ...
             Name (VBDS, Zero)
             Method (_STA, 0, Serialized)  // _STA: Status ...
             Method (VBDL, 0, Serialized)
             {
                 PB1E |= 0x20
                 VBDS |= 0x40
             }
             Method (VGBS, 0, Serialized)
             {
                 Return (VBDS) /* \_SB_.PCI0.SBRG.EC0_.VGBI.VBDS */
             }
             ...
          }
      
      By default VBDS is set to 0. At boot it is set to clamshell (bit 6 set)
      only after method VBDL is executed.
      
      Since VBDL is now evaluated in the probe routine later, after the device
      is registered, the retrieved value of VBDS was still 0 ("tablet mode")
      when setting up the virtual switch.
      
      Make sure to evaluate VGBS after VBDL, to ensure the
      convertible boots in clamshell mode, the expected default.
      
      Fixes: 26173179 ("platform/x86: intel-vbtn: Eval VBDL after registering our notifier")
      Signed-off-by: Gwendal Grignou <gwendal@chromium.org>
      Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
      Reviewed-by: Hans de Goede <hdegoede@redhat.com>
      Link: https://lore.kernel.org/r/20240329143206.2977734-3-gwendal@chromium.org
      Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Kees Cook's avatar
      randomize_kstack: Improve entropy diffusion · dfb2ce95
      Kees Cook authored
      [ Upstream commit 9c573cd3 ]
      
      The kstack_offset variable was really only ever using the low bits for
      kernel stack offset entropy. Add a ror32() to increase bit diffusion.
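A 32-bit rotate-right and its use for mixing can be sketched as below. `ror32()` here is a portable userspace version of the helper named above; `mix_offset()` and the rotation amount of 5 are assumptions chosen for illustration, not a statement of what the patch uses. Rotating the previous state before XOR-ing in new randomness moves earlier bits across the whole word instead of leaving them in the low positions.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by shift bits (shift is taken modulo 32). */
static inline uint32_t ror32(uint32_t word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

/*
 * Hypothetical mixing step: rotate the accumulated state, then XOR in
 * the new random value. The rotation amount 5 is an assumption.
 */
static uint32_t mix_offset(uint32_t state, uint32_t rand)
{
	return ror32(state, 5) ^ rand;
}
```

Without the rotation, repeated XORs of values that mostly vary in their low bits would leave the high bits of the state largely unchanged.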
      
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: 39218ff4 ("stack: Optionally randomize kernel stack offset each syscall")
      Link: https://lore.kernel.org/r/20240309202445.work.165-kees@kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • David Hildenbrand's avatar
      x86/mm/pat: fix VM_PAT handling in COW mappings · 7cfee26d
      David Hildenbrand authored
      commit 04c35ab3 upstream.
      
      PAT handling won't do the right thing in COW mappings: the first PTE (or,
      in fact, all PTEs) can be replaced during write faults to point at anon
      folios.  Reliably recovering the correct PFN and cachemode using
      follow_phys() from PTEs will not work in COW mappings.
      
      Using follow_phys(), we might just get the address+protection of the anon
      folio (which is very wrong), or fail on swap/nonswap entries, failing
      follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and
      track_pfn_copy(), not properly calling free_pfn_range().
      
      In free_pfn_range(), we either wouldn't call memtype_free() or would call
      it with the wrong range, possibly leaking memory.
      
      To fix that, let's update follow_phys() to refuse returning anon folios,
      and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings
      if we run into that.
      
      We will now properly handle untrack_pfn() with COW mappings, where we
      don't need the cachemode.  We'll have to fail fork()->track_pfn_copy() if
      the first page was replaced by an anon folio, though: we'd have to store
      the cachemode in the VMA to make this work, likely growing the VMA size.
      
      For now, let's keep it simple and let track_pfn_copy() just fail in that
      case: it would have failed in the past with swap/nonswap entries already,
      and it would have done the wrong thing with anon folios.
      
      Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():
      
      <--- C reproducer --->
       #include <stdio.h>
       #include <sys/mman.h>
       #include <unistd.h>
       #include <liburing.h>
      
       int main(void)
       {
               struct io_uring_params p = {};
               int ring_fd;
               size_t size;
               char *map;
      
               ring_fd = io_uring_setup(1, &p);
               if (ring_fd < 0) {
                       perror("io_uring_setup");
                       return 1;
               }
               size = p.sq_off.array + p.sq_entries * sizeof(unsigned);
      
               /* Map the submission queue ring MAP_PRIVATE */
               map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
                          ring_fd, IORING_OFF_SQ_RING);
               if (map == MAP_FAILED) {
                       perror("mmap");
                       return 1;
               }
      
               /* We have at least one page. Let's COW it. */
               *map = 0;
               pause();
               return 0;
       }
      <--- C reproducer --->
      
      On a system with 16 GiB RAM and swap configured:
       # ./iouring &
       # memhog 16G
       # killall iouring
      [  301.552930] ------------[ cut here ]------------
      [  301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100
      [  301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g
      [  301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1
      [  301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4
      [  301.559569] RIP: 0010:untrack_pfn+0xf4/0x100
      [  301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000
      [  301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282
      [  301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047
      [  301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200
      [  301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000
      [  301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000
      [  301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000
      [  301.564186] FS:  0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000
      [  301.564773] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0
      [  301.565725] PKRU: 55555554
      [  301.565944] Call Trace:
      [  301.566148]  <TASK>
      [  301.566325]  ? untrack_pfn+0xf4/0x100
      [  301.566618]  ? __warn+0x81/0x130
      [  301.566876]  ? untrack_pfn+0xf4/0x100
      [  301.567163]  ? report_bug+0x171/0x1a0
      [  301.567466]  ? handle_bug+0x3c/0x80
      [  301.567743]  ? exc_invalid_op+0x17/0x70
      [  301.568038]  ? asm_exc_invalid_op+0x1a/0x20
      [  301.568363]  ? untrack_pfn+0xf4/0x100
      [  301.568660]  ? untrack_pfn+0x65/0x100
      [  301.568947]  unmap_single_vma+0xa6/0xe0
      [  301.569247]  unmap_vmas+0xb5/0x190
      [  301.569532]  exit_mmap+0xec/0x340
      [  301.569801]  __mmput+0x3e/0x130
      [  301.570051]  do_exit+0x305/0xaf0
      ...
      
      Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Wupeng Ma <mawupeng1@huawei.com>
      Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com
      Fixes: b1a86e15 ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines")
      Fixes: 5899329b ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3")
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • David Hildenbrand's avatar
      virtio: reenable config if freezing device failed · abfae420
      David Hildenbrand authored
      commit 310227f4 upstream.
      
      Currently, we don't reenable the config if freezing the device failed.
      
      For example, virtio-mem currently doesn't support suspend+resume, and
      trying to freeze the device will always fail. Afterwards, the device
      will no longer respond to resize requests, because it won't get notified
      about config changes.
      
      Let's fix this by re-enabling the config if freezing fails.
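The error-path idea can be sketched with a toy model. `struct vdev`, `device_freeze()` and the two driver callbacks are hypothetical and do not mirror the virtio core API. Config notifications are switched off before the driver's freeze callback runs, and the failure path switches them back on so the device keeps seeing config changes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device: a config-notification flag and a freeze callback. */
struct vdev {
	bool config_enabled;
	int (*freeze)(struct vdev *dev);
};

/* Driver that cannot suspend; -1 stands in for a real error code. */
static int freeze_unsupported(struct vdev *dev)
{
	(void)dev;
	return -1;
}

/* Driver that suspends fine. */
static int freeze_ok(struct vdev *dev)
{
	(void)dev;
	return 0;
}

/*
 * Disable config notifications, then ask the driver to freeze.
 * If freezing fails, undo the disable before returning the error.
 */
static int device_freeze(struct vdev *dev)
{
	int ret;

	dev->config_enabled = false;
	ret = dev->freeze ? dev->freeze(dev) : 0;
	if (ret) {
		dev->config_enabled = true;
		return ret;
	}
	return 0;
}
```

After a failed freeze the flag is back to `true`; after a successful one it stays `false` until a later restore step, which is outside this sketch.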
      
      Fixes: 22b7050a ("virtio: defer config changed notifications")
      Cc: <stable@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Message-Id: <20240213135425.795001-1-david@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>