- Sep 03, 2021
-
-
Riccardo Mancini authored
commit 67069a1f upstream. ASan reported a memory leak caused by info_linear not being deallocated. The info_linear was allocated during in perf_event__synthesize_one_bpf_prog(). This patch adds the corresponding free() when bpf_prog_info_node is freed in perf_env__purge_bpf(). $ sudo ./perf record -- sleep 5 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.025 MB perf.data (8 samples) ] ================================================================= ==297735==ERROR: LeakSanitizer: detected memory leaks Direct leak of 7688 byte(s) in 19 object(s) allocated from: #0 0x4f420f in malloc (/home/user/linux/tools/perf/perf+0x4f420f) #1 0xc06a74 in bpf_program__get_prog_info_linear /home/user/linux/tools/lib/bpf/libbpf.c:11113:16 #2 0xb426fe in perf_event__synthesize_one_bpf_prog /home/user/linux/tools/perf/util/bpf-event.c:191:16 #3 0xb42008 in perf_event__synthesize_bpf_events /home/user/linux/tools/perf/util/bpf-event.c:410:9 #4 0x594596 in record__synthesize /home/user/linux/tools/perf/builtin-record.c:1490:8 #5 0x58c9ac in __cmd_record /home/user/linux/tools/perf/builtin-record.c:1798:8 #6 0x58990b in cmd_record /home/user/linux/tools/perf/builtin-record.c:2901:8 #7 0x7b2a20 in run_builtin /home/user/linux/tools/perf/perf.c:313:11 #8 0x7b12ff in handle_internal_command /home/user/linux/tools/perf/perf.c:365:8 #9 0x7b2583 in run_argv /home/user/linux/tools/perf/perf.c:409:2 #10 0x7b0d79 in main /home/user/linux/tools/perf/perf.c:539:3 #11 0x7fa357ef6b74 in __libc_start_main /usr/src/debug/glibc-2.33-8.fc34.x86_64/csu/../csu/libc-start.c:332:16 Signed-off-by:
Riccardo Mancini <rickyman7@gmail.com> Acked-by:
Ian Rogers <irogers@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Link: http://lore.kernel.org/lkml/20210602224024.300485-1-rickyman7@gmail.com Signed-off-by:
Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by:
Hanjun Guo <guohanjun@huawei.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Guo Ren authored
commit 5ad84adf upstream. Just like arm64, we can't trace the function in the patch_text path. Here is the bug log: [ 45.234334] Unable to handle kernel paging request at virtual address ffffffd38ae80900 [ 45.242313] Oops [#1] [ 45.244600] Modules linked in: [ 45.247678] CPU: 0 PID: 11 Comm: migration/0 Not tainted 5.9.0-00025-g9b7db83-dirty #215 [ 45.255797] epc: ffffffe00021689a ra : ffffffe00021718e sp : ffffffe01afabb58 [ 45.262955] gp : ffffffe00136afa0 tp : ffffffe01af94d00 t0 : 0000000000000002 [ 45.270200] t1 : 0000000000000000 t2 : 0000000000000001 s0 : ffffffe01afabc08 [ 45.277443] s1 : ffffffe0013718a8 a0 : 0000000000000000 a1 : ffffffe01afabba8 [ 45.284686] a2 : 0000000000000000 a3 : 0000000000000000 a4 : c4c16ad38ae80900 [ 45.291929] a5 : 0000000000000000 a6 : 0000000000000000 a7 : 0000000052464e43 [ 45.299173] s2 : 0000000000000001 s3 : ffffffe000206a60 s4 : ffffffe000206a60 [ 45.306415] s5 : 00000000000009ec s6 : ffffffe0013718a8 s7 : c4c16ad38ae80900 [ 45.313658] s8 : 0000000000000004 s9 : 0000000000000001 s10: 0000000000000001 [ 45.320902] s11: 0000000000000003 t3 : 0000000000000001 t4 : ffffffffd192fe79 [ 45.328144] t5 : ffffffffb8f80000 t6 : 0000000000040000 [ 45.333472] status: 0000000200000100 badaddr: ffffffd38ae80900 cause: 000000000000000f [ 45.341514] ---[ end trace d95102172248fdcf ]--- [ 45.346176] note: migration/0[11] exited with preempt_count 1 (gdb) x /2i $pc => 0xffffffe00021689a <__do_proc_dointvec+196>: sd zero,0(s7) 0xffffffe00021689e <__do_proc_dointvec+200>: li s11,0 (gdb) bt 0 __do_proc_dointvec (tbl_data=0x0, table=0xffffffe01afabba8, write=0, buffer=0x0, lenp=0x7bf897061f9a0800, ppos=0x4, conv=0x0, data=0x52464e43) at kernel/sysctl.c:581 1 0xffffffe00021718e in do_proc_dointvec (data=<optimized out>, conv=<optimized out>, ppos=<optimized out>, lenp=<optimized out>, buffer=<optimized out>, write=<optimized out>, table=<optimized out>) at kernel/sysctl.c:964 2 proc_dointvec_minmax (ppos=<optimized out>, lenp=<optimized out>, buffer=<optimized out>, write=<optimized out>, table=<optimized out>) at kernel/sysctl.c:964 3 proc_do_static_key (table=<optimized out>, write=1, buffer=0x0, lenp=0x0, ppos=0x7bf897061f9a0800) at kernel/sysctl.c:1643 4 0xffffffe000206792 in ftrace_make_call (rec=<optimized out>, addr=<optimized out>) at arch/riscv/kernel/ftrace.c:109 5 0xffffffe0002c9c04 in __ftrace_replace_code (rec=0xffffffe01ae40c30, enable=3) at kernel/trace/ftrace.c:2503 6 0xffffffe0002ca0b2 in ftrace_replace_code (mod_flags=<optimized out>) at kernel/trace/ftrace.c:2530 7 0xffffffe0002ca26a in ftrace_modify_all_code (command=5) at kernel/trace/ftrace.c:2677 8 0xffffffe0002ca30e in __ftrace_modify_code (data=<optimized out>) at kernel/trace/ftrace.c:2703 9 0xffffffe0002c13b0 in multi_cpu_stop (data=0x0) at kernel/stop_machine.c:224 10 0xffffffe0002c0fde in cpu_stopper_thread (cpu=<optimized out>) at kernel/stop_machine.c:491 11 0xffffffe0002343de in smpboot_thread_fn (data=0x0) at kernel/smpboot.c:165 12 0xffffffe00022f8b4 in kthread (_create=0xffffffe01af0c040) at kernel/kthread.c:292 13 0xffffffe000201fac in handle_exception () at arch/riscv/kernel/entry.S:236 0xffffffe00020678a <+114>: auipc ra,0xffffe 0xffffffe00020678e <+118>: jalr -118(ra) # 0xffffffe000204714 <patch_text_nosync> 0xffffffe000206792 <+122>: snez a0,a0 (gdb) disassemble patch_text_nosync Dump of assembler code for function patch_text_nosync: 0xffffffe000204714 <+0>: addi sp,sp,-32 0xffffffe000204716 <+2>: sd s0,16(sp) 0xffffffe000204718 <+4>: sd ra,24(sp) 0xffffffe00020471a <+6>: addi s0,sp,32 0xffffffe00020471c <+8>: auipc ra,0x0 0xffffffe000204720 <+12>: jalr -384(ra) # 0xffffffe00020459c <patch_insn_write> 0xffffffe000204724 <+16>: beqz a0,0xffffffe00020472e <patch_text_nosync+26> 0xffffffe000204726 <+18>: ld ra,24(sp) 0xffffffe000204728 <+20>: ld s0,16(sp) 0xffffffe00020472a <+22>: addi sp,sp,32 0xffffffe00020472c <+24>: ret 0xffffffe00020472e <+26>: sd a0,-24(s0) 0xffffffe000204732 <+30>: auipc ra,0x4 0xffffffe000204736 <+34>: jalr -1464(ra) # 0xffffffe00020817a <flush_icache_all> 0xffffffe00020473a <+38>: ld a0,-24(s0) 0xffffffe00020473e <+42>: ld ra,24(sp) 0xffffffe000204740 <+44>: ld s0,16(sp) 0xffffffe000204742 <+46>: addi sp,sp,32 0xffffffe000204744 <+48>: ret (gdb) disassemble flush_icache_all-4 Dump of assembler code for function flush_icache_all: 0xffffffe00020817a <+0>: addi sp,sp,-8 0xffffffe00020817c <+2>: sd ra,0(sp) 0xffffffe00020817e <+4>: auipc ra,0xfffff 0xffffffe000208182 <+8>: jalr -1822(ra) # 0xffffffe000206a60 <ftrace_caller> 0xffffffe000208186 <+12>: ld ra,0(sp) 0xffffffe000208188 <+14>: addi sp,sp,8 0xffffffe00020818a <+0>: addi sp,sp,-16 0xffffffe00020818c <+2>: sd s0,0(sp) 0xffffffe00020818e <+4>: sd ra,8(sp) 0xffffffe000208190 <+6>: addi s0,sp,16 0xffffffe000208192 <+8>: li a0,0 0xffffffe000208194 <+10>: auipc ra,0xfffff 0xffffffe000208198 <+14>: jalr -410(ra) # 0xffffffe000206ffa <sbi_remote_fence_i> 0xffffffe00020819c <+18>: ld s0,0(sp) 0xffffffe00020819e <+20>: ld ra,8(sp) 0xffffffe0002081a0 <+22>: addi sp,sp,16 0xffffffe0002081a2 <+24>: ret (gdb) frame 5 (rec=0xffffffe01ae40c30, enable=3) at kernel/trace/ftrace.c:2503 2503 return ftrace_make_call(rec, ftrace_addr); (gdb) p /x rec->ip $2 = 0xffffffe00020817a -> flush_icache_all ! When we modified flush_icache_all's patchable-entry with ftrace_caller: - Insert ftrace_caller at flush_icache_all prologue. - Call flush_icache_all to sync I/Dcache, but flush_icache_all is just we modified by half. Link: https://lore.kernel.org/linux-riscv/CAJF2gTT=oDWesWe0JVWvTpGi60-gpbNhYLdFWN_5EbyeqoEDdw@mail.gmail.com/T/#t Signed-off-by:
Guo Ren <guoren@linux.alibaba.com> Reviewed-by:
Atish Patra <atish.patra@wdc.com> Signed-off-by:
Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by:
Dimitri John Ledkov <dimitri.ledkov@canonical.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Guo Ren authored
commit 67d94577 upstream. We must use $(CC_FLAGS_FTRACE) instead of directly using -pg. It will cause -fpatchable-function-entry error. Signed-off-by:
Guo Ren <guoren@linux.alibaba.com> Signed-off-by:
Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by:
Dimitri John Ledkov <dimitri.ledkov@canonical.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Pauli Virtanen authored
commit 55981d35 upstream. Some USB BT adapters don't satisfy the MTU requirement mentioned in commit e848dbd3 ("Bluetooth: btusb: Add support USB ALT 3 for WBS") and have ALT 3 setting that produces no/garbled audio. Some adapters with larger MTU were also reported to have problems with ALT 3. Add a flag and check it and MTU before selecting ALT 3, falling back to ALT 1. Enable the flag for Realtek, restoring the previous behavior for non-Realtek devices. Tested with USB adapters (mtu<72, no/garbled sound with ALT3, ALT1 works) BCM20702A1 0b05:17cb, CSR8510A10 0a12:0001, and (mtu>=72, ALT3 works) RTL8761BU 0bda:8771, Intel AX200 8087:0029 (after disabling ALT6). Also got reports for (mtu>=72, ALT 3 reported to produce bad audio) Intel 8087:0a2b. Signed-off-by:
Pauli Virtanen <pav@iki.fi> Fixes: e848dbd3 ("Bluetooth: btusb: Add support USB ALT 3 for WBS") Tested-by:
Michał Kępień <kernel@kempniu.pl> Tested-by:
Jonathan Lampérth <jon@h4n.dev> Signed-off-by:
Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Linus Torvalds authored
commit 2287a51b upstream. As per the long-suffering comment. Reported-by:
Minh Yuan <yuanmingbuaa@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jirislaby@kernel.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Xin Long authored
commit 7387a72c upstream. __tipc_sendmsg() is called to send SYN packet by either tipc_sendmsg() or tipc_connect(). The difference is in tipc_connect(), it will call tipc_wait_for_connect() after __tipc_sendmsg() to wait until connecting is done. So there's no need to wait in __tipc_sendmsg() for this case. This patch is to fix it by calling tipc_wait_for_connect() only when dlen is not 0 in __tipc_sendmsg(), which means it's called by tipc_connect(). Note this also fixes the failure in tipcutils/test/ptts/: # ./tipcTS & # ./tipcTC 9 (hang) Fixes: 36239dab6da7 ("tipc: fix implicit-connect for SYN+") Reported-by:
Shuang Li <shuali@redhat.com> Signed-off-by:
Xin Long <lucien.xin@gmail.com> Acked-by:
Jon Maloy <jmaloy@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Frieder Schrempf authored
The new generic NAND ECC framework stores the configuration and requirements in separate places since commit 93ef92f6 ("mtd: nand: Use the new generic ECC object"). In 5.10.x The SPI NAND layer still uses only the requirements to track the ECC properties. This mismatch leads to values of zero being used for ECC strength and step_size in the SPI NAND layer wherever nanddev_get_ecc_conf() is used and therefore breaks the SPI NAND on-die ECC support in 5.10.x. By using nanddev_get_ecc_requirements() instead of nanddev_get_ecc_conf() for SPI NAND, we make sure that the correct parameters for the detected chip are used. In later versions (5.11.x) this is fixed anyway with the implementation of the SPI NAND on-die ECC engine. Cc: stable@vger.kernel.org # 5.10.x Reported-by:
voice INTER connect GmbH <developer@voiceinterconnect.de> Signed-off-by:
Frieder Schrempf <frieder.schrempf@kontron.de> Acked-by:
Miquel Raynal <miquel.raynal@bootlin.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Linus Torvalds authored
commit fe67f4dd upstream. It turns out that the SIGIO/FASYNC situation is almost exactly the same as the EPOLLET case was: user space really wants to be notified after every operation. Now, in a perfect world it should be sufficient to only notify user space on "state transitions" when the IO state changes (ie when a pipe goes from unreadable to readable, or from unwritable to writable). User space should then do as much as possible - fully emptying the buffer or what not - and we'll notify it again the next time the state changes. But as with EPOLLET, we have at least one case (stress-ng) where the kernel sent SIGIO due to the pipe being marked for asynchronous notification, but the user space signal handler then didn't actually necessarily read it all before returning (it read more than what was written, but since there could be multiple writes, it could leave data pending). The user space code then expected to get another SIGIO for subsequent writes - even though the pipe had been readable the whole time - and would only then read more. This is arguably a user space bug - and Colin King already fixed the stress-ng code in question - but the kernel regression rules are clear: it doesn't matter if kernel people think that user space did something silly and wrong. What matters is that it used to work. So if user space depends on specific historical kernel behavior, it's a regression when that behavior changes. It's on us: we were silly to have that non-optimal historical behavior, and our old kernel behavior was what user space was tested against. Because of how the FASYNC notification was tied to wakeup behavior, this was first broken by commits f467a6a6 and 1b6b26ae ("pipe: fix and clarify pipe read/write wakeup logic"), but at the time it seems nobody noticed. Probably because the stress-ng problem case ends up being timing-dependent too. It was then unwittingly fixed by commit 3a34b13a ("pipe: make pipe writes always wake up readers") only to be broken again when by commit 3b844826 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads"). And at that point the kernel test robot noticed the performance refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag below is somewhat ad hoc, but it matches when the issue was noticed. Fix it for good (knock wood) by simply making the kill_fasync() case separate from the wakeup case. FASYNC is quite rare, and we clearly shouldn't even try to use the "avoid unnecessary wakeups" logic for it. Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/ Fixes: 3b844826 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Reported-by:
kernel test robot <oliver.sang@intel.com> Tested-by:
Oliver Sang <oliver.sang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Colin Ian King <colin.king@canonical.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Linus Torvalds authored
commit 3b844826 upstream. I had forgotten just how sensitive hackbench is to extra pipe wakeups, and commit 3a34b13a ("pipe: make pipe writes always wake up readers") ended up causing a quite noticeable regression on larger machines. Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's not clear that this matters in real life all that much, but as Mel points out, it's used often enough when comparing kernels and so the performance regression shows up like a sore thumb. It's easy enough to fix at least for the common cases where pipes are used purely for data transfer, and you never have any exciting poll usage at all. So set a special 'poll_usage' flag when there is polling activity, and make the ugly "EPOLLET has crazy legacy expectations" semantics explicit to only that case. I would love to limit it to just the broken EPOLLET case, but the pipe code can't see the difference between epoll and regular select/poll, so any non-read/write waiting will trigger the extra wakeup behavior. That is sufficient for at least the hackbench case. Apart from making the odd extra wakeup cases more explicitly about EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe write, not at the first write chunk. That is actually much saner semantics (as much as you can call any of the legacy edge-triggered expectations for EPOLLET "sane") since it means that you know the wakeup will happen once the write is done, rather than possibly in the middle of one. [ For stable people: I'm putting a "Fixes" tag on this, but I leave it up to you to decide whether you actually want to backport it or not. It likely has no impact outside of synthetic benchmarks - Linus ] Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/ Fixes: 3a34b13a ("pipe: make pipe writes always wake up readers") Reported-by:
kernel test robot <oliver.sang@intel.com> Tested-by:
Sandeep Patil <sspatil@android.com> Tested-by:
Mel Gorman <mgorman@techsingularity.net> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Filipe Manana authored
commit bc0939fc upstream. We have a race between marking that an inode needs to be logged, either at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between btrfs_sync_log(). The following steps describe how the race happens. 1) We are at transaction N; 2) Inode I was previously fsynced in the current transaction so it has: inode->logged_trans set to N; 3) The inode's root currently has: root->log_transid set to 1 root->last_log_commit set to 0 Which means only one log transaction was committed to far, log transaction 0. When a log tree is created we set ->log_transid and ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree()); 4) One more range of pages is dirtied in inode I; 5) Some task A starts an fsync against some other inode J (same root), and so it joins log transaction 1. Before task A calls btrfs_sync_log()... 6) Task B starts an fsync against inode I, which currently has the full sync flag set, so it starts delalloc and waits for the ordered extent to complete before calling btrfs_inode_in_log() at btrfs_sync_file(); 7) During ordered extent completion we have btrfs_update_inode() called against inode I, which in turn calls btrfs_set_inode_last_trans(), which does the following: spin_lock(&inode->lock); inode->last_trans = trans->transaction->transid; inode->last_sub_trans = inode->root->log_transid; inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); So ->last_trans is set to N and ->last_sub_trans set to 1. But before setting ->last_log_commit... 8) Task A is at btrfs_sync_log(): - it increments root->log_transid to 2 - starts writeback for all log tree extent buffers - waits for the writeback to complete - writes the super blocks - updates root->last_log_commit to 1 It's a lot of slow steps between updating root->log_transid and root->last_log_commit; 9) The task doing the ordered extent completion, currently at btrfs_set_inode_last_trans(), then finally runs: inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); Which results in inode->last_log_commit being set to 1. The ordered extent completes; 10) Task B is resumed, and it calls btrfs_inode_in_log() which returns true because we have all the following conditions met: inode->logged_trans == N which matches fs_info->generation && inode->last_subtrans (1) <= inode->last_log_commit (1) && inode->last_subtrans (1) <= root->last_log_commit (1) && list inode->extent_tree.modified_extents is empty And as a consequence we return without logging the inode, so the existing logged version of the inode does not point to the extent that was written after the previous fsync. It should be impossible in practice for one task be able to do so much progress in btrfs_sync_log() while another task is at btrfs_set_inode_last_trans() right after it reads root->log_transid and before it reads root->last_log_commit. Even if kernel preemption is enabled we know the task at btrfs_set_inode_last_trans() can not be preempted because it is holding the inode's spinlock. However there is another place where we do the same without holding the spinlock, which is in the memory mapped write path at: vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) { (...) BTRFS_I(inode)->last_trans = fs_info->generation; BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; (...) So with preemption happening after setting ->last_sub_trans and before setting ->last_log_commit, it is less of a stretch to have another task do enough progress at btrfs_sync_log() such that the task doing the memory mapped write ends up with ->last_sub_trans and ->last_log_commit set to the same value. It is still a big stretch to get there, as the task doing btrfs_sync_log() has to start writeback, wait for its completion and write the super blocks. So fix this in two different ways: 1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the value of ->last_sub_trans minus 1; 2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just like we do for buffered and direct writes at btrfs_file_write_iter(), which is all we need to make sure multiple writes and fsyncs to an inode in the same transaction never result in an fsync missing that the inode changed and needs to be logged. Turn this into a helper function and use it both at btrfs_page_mkwrite() and at btrfs_file_write_iter() - this also fixes the problem that at btrfs_page_mkwrite() we were setting those fields without the protection of the inode's spinlock. This is an extremely unlikely race to happen in practice. Signed-off-by:
Filipe Manana <fdmanana@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com> Signed-off-by:
Anand Jain <anand.jain@oracle.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Gerd Rausch authored
[ Upstream commit fb4b1373 ] Function "dma_map_sg" is entitled to merge adjacent entries and return a value smaller than what was passed as "nents". Subsequently "ib_map_mr_sg" needs to work with this value ("sg_dma_len") rather than the original "nents" parameter ("sg_len"). This old RDS bug was exposed and reliably causes kernel panics (using RDMA operations "rds-stress -D") on x86_64 starting with: commit c588072b ("iommu/vt-d: Convert intel iommu driver to the iommu ops") Simply put: Linux 5.11 and later. Signed-off-by:
Gerd Rausch <gerd.rausch@oracle.com> Acked-by:
Santosh Shilimkar <santosh.shilimkar@oracle.com> Link: https://lore.kernel.org/r/60efc69f-1f35-529d-a7ef-da0549cad143@oracle.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Ben Skeggs authored
[ Upstream commit e78b1b54 ] Should fix some initial modeset failures on (at least) Ampere boards. Signed-off-by:
Ben Skeggs <bskeggs@redhat.com> Reviewed-by:
Lyude Paul <lyude@redhat.com> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Ben Skeggs authored
[ Upstream commit 6eaa1f3c ] When booted with multiple displays attached, the EFI GOP driver on (at least) Ampere, can leave DP links powered up that aren't being used to display anything. This confuses our tracking of SOR routing, with the likely result being a failed modeset and display engine hang. Fix this by (ab?)using the DisableLT IED script to power-down the link, restoring HW to a state the driver expects. Signed-off-by:
Ben Skeggs <bskeggs@redhat.com> Reviewed-by:
Lyude Paul <lyude@redhat.com> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Mark Yacoub authored
[ Upstream commit fa0b1ef5 ] [Why] Userspace should get back a copy of drm_wait_vblank that's been modified even when drm_wait_vblank_ioctl returns a failure. Rationale: drm_wait_vblank_ioctl modifies the request and expects the user to read it back. When the type is RELATIVE, it modifies it to ABSOLUTE and updates the sequence to become current_vblank_count + sequence (which was RELATIVE), but now it became ABSOLUTE. drmWaitVBlank (in libdrm) expects this to be the case as it modifies the request to be Absolute so it expects the sequence to would have been updated. The change is in compat_drm_wait_vblank, which is called by drm_compat_ioctl. This change of copying the data back regardless of the return number makes it en par with drm_ioctl, which always copies the data before returning. [How] Return from the function after everything has been copied to user. Fixes IGT:kms_flip::modeset-vs-vblank-race-interruptible Tested on ChromeOS Trogdor(msm) Reviewed-by:
Michel Dänzer <mdaenzer@redhat.com> Signed-off-by:
Mark Yacoub <markyacoub@chromium.org> Signed-off-by:
Sean Paul <seanpaul@chromium.org> Link: https://patchwork.freedesktop.org/patch/msgid/20210812194917.1703356-1-markyacoub@chromium.org Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Ming Lei authored
[ Upstream commit c797b40c ] Inside blk_mq_queue_tag_busy_iter() we already grabbed request's refcount before calling ->fn(), so needn't to grab it one more time in blk_mq_check_expired(). Meantime remove extra request expire check in blk_mq_check_expired(). Cc: Keith Busch <kbusch@kernel.org> Signed-off-by:
Ming Lei <ming.lei@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210811155202.629575-1-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Kenneth Feng authored
[ Upstream commit 93c5701b ] change the workload type for some cards as it is needed. Signed-off-by:
Kenneth Feng <kenneth.feng@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Kenneth Feng authored
[ Upstream commit 2fd31689 ] This reverts commit 0979d432 . Revert this because it does not apply to all the cards. Signed-off-by:
Kenneth Feng <kenneth.feng@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Shai Malin authored
[ Upstream commit d33d19d3 ] Fix a possible null-pointer dereference in qed_rdma_create_qp(). Changes from V2: - Revert checkpatch fixes. Reported-by:
TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by:
Ariel Elior <aelior@marvell.com> Signed-off-by:
Shai Malin <smalin@marvell.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Shai Malin authored
[ Upstream commit 37110237 ] Avoiding qed ll2 race condition and NULL pointer dereference as part of the remove and recovery flows. Changes form V1: - Change (!p_rx->set_prod_addr). - qed_ll2.c checkpatch fixes. Change from V2: - Revert "qed_ll2.c checkpatch fixes". Signed-off-by:
Ariel Elior <aelior@marvell.com> Signed-off-by:
Shai Malin <smalin@marvell.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Michael S. Tsirkin authored
[ Upstream commit a24ce06c ] We use a spinlock now so add a stub. Ignore bogus uninitialized variable warnings. Signed-off-by:
Michael S. Tsirkin <mst@redhat.com> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Neeraj Upadhyay authored
[ Upstream commit e74cfa91 ] As __vringh_iov() traverses a descriptor chain, it populates each descriptor entry into either read or write vring iov and increments that iov's ->used member. So, as we iterate over a descriptor chain, at any point, (riov/wriov)->used value gives the number of descriptor enteries available, which are to be read or written by the device. As all read iovs must precede the write iovs, wiov->used should be zero when we are traversing a read descriptor. Current code checks for wiov->i, to figure out whether any previous entry in the current descriptor chain was a write descriptor. However, iov->i is only incremented, when these vring iovs are consumed, at a later point, and remain 0 in __vringh_iov(). So, correct the check for read and write descriptor order, to use wiov->used. Acked-by:
Jason Wang <jasowang@redhat.com> Reviewed-by:
Stefano Garzarella <sgarzare@redhat.com> Signed-off-by:
Neeraj Upadhyay <neeraju@codeaurora.org> Link: https://lore.kernel.org/r/1624591502-4827-1-git-send-email-neeraju@codeaurora.org Signed-off-by:
Michael S. Tsirkin <mst@redhat.com> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Vincent Whitchurch authored
[ Upstream commit cb5d2c1f ] Do not call vDPA drivers' callbacks with vq indicies larger than what the drivers indicate that they support. vDPA drivers do not bounds check the indices. Signed-off-by:
Vincent Whitchurch <vincent.whitchurch@axis.com> Link: https://lore.kernel.org/r/20210701114652.21956-1-vincent.whitchurch@axis.com Signed-off-by:
Michael S. Tsirkin <mst@redhat.com> Acked-by:
Jason Wang <jasowang@redhat.com> Reviewed-by:
Stefano Garzarella <sgarzare@redhat.com> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Parav Pandit authored
[ Upstream commit 43bb40c5 ] When a virtio pci device undergo surprise removal (aka async removal in PCIe spec), mark the device as broken so that any upper layer drivers can abort any outstanding operation. When a virtio net pci device undergo surprise removal which is used by a NetworkManager, a below call trace was observed. kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/1:1:27059] watchdog: BUG: soft lockup - CPU#1 stuck for 52s! [kworker/1:1:27059] CPU: 1 PID: 27059 Comm: kworker/1:1 Tainted: G S W I L 5.13.0-hotplug+ #8 Hardware name: Dell Inc. PowerEdge R640/0H28RR, BIOS 2.9.4 11/06/2020 Workqueue: events linkwatch_event RIP: 0010:virtnet_send_command+0xfc/0x150 [virtio_net] Call Trace: virtnet_set_rx_mode+0xcf/0x2a7 [virtio_net] ? __hw_addr_create_ex+0x85/0xc0 __dev_mc_add+0x72/0x80 igmp6_group_added+0xa7/0xd0 ipv6_mc_up+0x3c/0x60 ipv6_find_idev+0x36/0x80 addrconf_add_dev+0x1e/0xa0 addrconf_dev_config+0x71/0x130 addrconf_notify+0x1f5/0xb40 ? rtnl_is_locked+0x11/0x20 ? __switch_to_asm+0x42/0x70 ? finish_task_switch+0xaf/0x2c0 ? raw_notifier_call_chain+0x3e/0x50 raw_notifier_call_chain+0x3e/0x50 netdev_state_change+0x67/0x90 linkwatch_do_dev+0x3c/0x50 __linkwatch_run_queue+0xd2/0x220 linkwatch_event+0x21/0x30 process_one_work+0x1c8/0x370 worker_thread+0x30/0x380 ? process_one_work+0x370/0x370 kthread+0x118/0x140 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 Hence, add the ability to abort the command on surprise removal which prevents infinite loop and system lockup. Signed-off-by:
Parav Pandit <parav@nvidia.com> Link: https://lore.kernel.org/r/20210721142648.1525924-5-parav@nvidia.com Signed-off-by:
Michael S. Tsirkin <mst@redhat.com> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Parav Pandit authored
[ Upstream commit 60f07798 ] Currently vq->broken field is read by virtqueue_is_broken() in busy loop in one context by virtnet_send_command(). vq->broken is set to true in other process context by virtio_break_device(). Reader and writer are accessing it without any synchronization. This may lead to a compiler optimization which may result to optimize reading vq->broken only once. Hence, force reading vq->broken on each invocation of virtqueue_is_broken() and also force writing it so that such update is visible to the readers. It is a theoretical fix that isn't yet encountered in the field. Signed-off-by:
Parav Pandit <parav@nvidia.com> Link: https://lore.kernel.org/r/20210721142648.1525924-2-parav@nvidia.com Signed-off-by:
Michael S. Tsirkin <mst@redhat.com> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Thara Gopinath authored
[ Upstream commit 5d79e5ce ] The Qualcomm sm8150 platform uses the qcom-cpufreq-hw driver, so add it to the cpufreq-dt-platdev driver's blocklist. Signed-off-by:
Thara Gopinath <thara.gopinath@linaro.org> Reviewed-by:
Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by:
Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Michał Mirosław authored
[ Upstream commit 335ffab3 ] This WARN can be triggered per-core and the stack trace is not useful. Replace it with plain dev_err(). Fix a comment while at it. Signed-off-by:
Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by:
Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Johannes Berg authored
[ Upstream commit 0f673c16 ] Some products (So) may have two different types of products with different mac-type that are otherwise equivalent, and have the same PNVM data, so the PNVM file will contain two (or perhaps later more) HW-type TLVs. Accept the file and use the data section that contains any matching entry. Signed-off-by:
Johannes Berg <johannes.berg@intel.com> Signed-off-by:
Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20210719140154.a6a86e903035.Ic0b1b75c45d386698859f251518e8a5144431938@changeid Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Adam Ford authored
[ Upstream commit 1669a941 ] The probe was manually passing NULL instead of dev to devm_clk_hw_register. This caused a Unable to handle kernel NULL pointer dereference error. Fix this by passing 'dev'. Signed-off-by:
Adam Ford <aford173@gmail.com> Fixes: a20a40a8 ("clk: renesas: rcar-usb2-clock-sel: Fix error handling in .probe()") Reviewed-by:
Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by:
Stephen Boyd <sboyd@kernel.org> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Colin Ian King authored
[ Upstream commit 0b3a8738 ] The u32 variable pci_dword is being masked with 0x1fffffff and then left shifted 23 places. The shift is a u32 operation,so a value of 0x200 or more in pci_dword will overflow the u32 and only the bottow 32 bits are assigned to addr. I don't believe this was the original intent. Fix this by casting pci_dword to a resource_size_t to ensure no overflow occurs. Note that the mask and 12 bit left shift operation does not need this because the mask SNR_IMC_MMIO_MEM0_MASK and shift is always a 32 bit value. Fixes: ee49532b ("perf/x86/intel/uncore: Add IMC uncore support for Snow Ridge") Addresses-Coverity: ("Unintentional integer overflow") Signed-off-by:
Colin Ian King <colin.king@canonical.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20210706114553.28249-1-colin.king@canonical.com Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Rob Herring authored
[ Upstream commit 1c8094e3 ] When the schema fixups are applied to 'select' the result is a single entry is required for a match, but that will never match as there should be 2 entries. Also, a 'select' schema should have the widest possible match, so use 'contains' which matches the compatible string(s) in any position and not just the first position. Fixes: 993dcfac ("dt-bindings: riscv: sifive-l2-cache: convert bindings to json-schema") Signed-off-by:
Rob Herring <robh@kernel.org> Cc: stable@vger.kernel.org Signed-off-by:
Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Jerome Brunet authored
[ Upstream commit 068fdad2 ] If the endpoint completion callback is call right after the ep_enabled flag is cleared and before usb_ep_dequeue() is call, we could do a double free on the request and the associated buffer. Fix this by clearing ep_enabled after all the endpoint requests have been dequeued. Fixes: 7de8681b ("usb: gadget: u_audio: Free requests only after callback") Cc: stable <stable@vger.kernel.org> Reported-by:
Thinh Nguyen <Thinh.Nguyen@synopsys.com> Signed-off-by:
Jerome Brunet <jbrunet@baylibre.com> Link: https://lore.kernel.org/r/20210827092927.366482-1-jbrunet@baylibre.com Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Matthew Brost authored
[ Upstream commit a63bcf08 ] A small race exists between intel_gt_retire_requests_timeout and intel_timeline_exit which could result in the syncmap not getting free'd. Rather than work to hard to seal this race, simply cleanup the syncmap on fini. unreferenced object 0xffff88813bc53b18 (size 96): comm "gem_close_race", pid 5410, jiffies 4294917818 (age 1105.600s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 0a 00 00 00 ................ 00 00 00 00 00 00 00 00 6b 6b 6b 6b 06 00 00 00 ........kkkk.... backtrace: [<00000000120b863a>] __sync_alloc_leaf+0x1e/0x40 [i915] [<00000000042f6959>] __sync_set+0x1bb/0x240 [i915] [<0000000090f0e90f>] i915_request_await_dma_fence+0x1c7/0x400 [i915] [<0000000056a48219>] i915_request_await_object+0x222/0x360 [i915] [<00000000aaac4ee3>] i915_gem_do_execbuffer+0x1bd0/0x2250 [i915] [<000000003c9d830f>] i915_gem_execbuffer2_ioctl+0x405/0xce0 [i915] [<00000000fd7a8e68>] drm_ioctl_kernel+0xb0/0xf0 [drm] [<00000000e721ee87>] drm_ioctl+0x305/0x3c0 [drm] [<000000008b0d8986>] __x64_sys_ioctl+0x71/0xb0 [<0000000076c362a4>] do_syscall_64+0x33/0x80 [<00000000eb7a4831>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by:
Matthew Brost <matthew.brost@intel.com> Fixes: 531958f6 ("drm/i915/gt: Track timeline activeness in enter/exit") Cc: <stable@vger.kernel.org> Reviewed-by:
John Harrison <John.C.Harrison@Intel.com> Signed-off-by:
John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210730195342.110234-1-matthew.brost@intel.com (cherry picked from commit faf89098 ) Signed-off-by:
Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Wong Vee Khee authored
[ Upstream commit 82a44ae1 ] In the case of taprio offload is not enabled, the error handling path causes a kernel crash due to kernel NULL pointer deference. Fix this by adding check for NULL before attempt to access 'plat->est' on the mutex_lock() call. The following kernel panic is observed without this patch: RIP: 0010:mutex_lock+0x10/0x20 Call Trace: tc_setup_taprio+0x482/0x560 [stmmac] kmem_cache_alloc_trace+0x13f/0x490 taprio_disable_offload.isra.0+0x9d/0x180 [sch_taprio] taprio_destroy+0x6c/0x100 [sch_taprio] qdisc_create+0x2e5/0x4f0 tc_modify_qdisc+0x126/0x740 rtnetlink_rcv_msg+0x12b/0x380 _raw_spin_lock_irqsave+0x19/0x40 _raw_spin_unlock_irqrestore+0x18/0x30 create_object+0x212/0x340 rtnl_calcit.isra.0+0x110/0x110 netlink_rcv_skb+0x50/0x100 netlink_unicast+0x191/0x230 netlink_sendmsg+0x243/0x470 sock_sendmsg+0x5e/0x60 ____sys_sendmsg+0x20b/0x280 copy_msghdr_from_user+0x5c/0x90 __mod_memcg_state+0x87/0xf0 ___sys_sendmsg+0x7c/0xc0 lru_cache_add+0x7f/0xa0 _raw_spin_unlock+0x16/0x30 wp_page_copy+0x449/0x890 handle_mm_fault+0x921/0xfc0 __sys_sendmsg+0x59/0xa0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace b1f19b24368a96aa ]--- Fixes: b60189e0 ("net: stmmac: Integrate EST with TAPRIO scheduler API") Cc: <stable@vger.kernel.org> # 5.10.x Signed-off-by:
Wong Vee Khee <vee.khee.wong@linux.intel.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Xiaoliang Yang authored
[ Upstream commit b2aae654 ] Add a mutex lock to protect est structure parameters so that the EST parameters can be updated by other threads. Signed-off-by:
Xiaoliang Yang <xiaoliang.yang_1@nxp.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Pavel Machek (CIP) <pavel@denx.de> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Ulf Hansson authored
[ Upstream commit 885814a9 ] This reverts commit 419dd626 . It turned out that the change from the reverted commit breaks the ACPI based rpi's because it causes the 100Mhz max clock to be overridden to the return from sdhci_iproc_get_max_clock(), which is 0 because there isn't a OF/DT based clock device. Reported-by:
Jeremy Linton <jeremy.linton@arm.com> Fixes: 419dd626 ("mmc: sdhci-iproc: Set SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN on BCM2711") Acked-by:
Stefan Wahren <stefan.wahren@i2se.com> Signed-off-by:
Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Guangbin Huang authored
[ Upstream commit 8c1671e0 ] Currently, when query PFC configuration by dcbtool, driver will return PFC enable status based on TC. As all priorities are mapped to TC0 by default, if TC0 is enabled, then all priorities mapped to TC0 will be shown as enabled status when query PFC setting, even though some priorities have never been set. for example: $ dcb pfc show dev eth0 pfc-cap 4 macsec-bypass off delay 0 prio-pfc 0:off 1:off 2:off 3:off 4:off 5:off 6:off 7:off $ dcb pfc set dev eth0 prio-pfc 0:on 1:on 2:on 3:on $ dcb pfc show dev eth0 pfc-cap 4 macsec-bypass off delay 0 prio-pfc 0:on 1:on 2:on 3:on 4:on 5:on 6:on 7:on To fix this problem, just returns user's PFC config parameter saved in driver. Fixes: cacde272 ("net: hns3: Add hclge_dcb module for the support of DCB feature") Signed-off-by:
Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Guojia Liao authored
[ Upstream commit 94391fae ] VLAN list should not be added duplicate VLAN node, otherwise it would cause "add failed" when restore VLAN from VLAN list, so this patch adds VLAN ID check before adding node into VLAN list. Fixes: c6075b19 ("net: hns3: Record VF vlan tables") Signed-off-by:
Guojia Liao <liaoguojia@huawei.com> Signed-off-by:
Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Yufeng Mo authored
[ Upstream commit a96d9330 ] After the cmdq registers are cleared, the firmware may take time to clear out possible left over commands in the cmdq. Driver must release cmdq memory only after firmware has completed processing of left over commands. Fixes: 232d0d55 ("net: hns3: uninitialize command queue while unloading PF driver") Signed-off-by:
Yufeng Mo <moyufeng@huawei.com> Signed-off-by:
Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Yufeng Mo authored
[ Upstream commit 1a6d2819 ] If a PF is bonded to a virtual machine and the virtual machine exits unexpectedly, some hardware resource cannot be cleared. In this case, loading driver may cause exceptions. Therefore, the hardware resource needs to be cleared when the driver is loaded. Fixes: 46a3df9f ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by:
Yufeng Mo <moyufeng@huawei.com> Signed-off-by:
Salil Mehta <salil.mehta@huawei.com> Signed-off-by:
Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-
Andrey Ignatov authored
[ Upstream commit 96a6b93b ] Currently when device is moved between network namespaces using RTM_NEWLINK message type and one of netns attributes (FLA_NET_NS_PID, IFLA_NET_NS_FD, IFLA_TARGET_NETNSID) but w/o specifying IFLA_IFNAME, and target namespace already has device with same name, userspace will get EINVAL what is confusing and makes debugging harder. Fix it so that userspace gets more appropriate EEXIST instead what makes debugging much easier. Before: # ./ifname.sh + ip netns add ns0 + ip netns exec ns0 ip link add l0 type dummy + ip netns exec ns0 ip link show l0 8: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 66:90:b5:d5:78:69 brd ff:ff:ff:ff:ff:ff + ip link add l0 type dummy + ip link show l0 10: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 6e:c6:1f:15:20:8d brd ff:ff:ff:ff:ff:ff + ip link set l0 netns ns0 RTNETLINK answers: Invalid argument After: # ./ifname.sh + ip netns add ns0 + ip netns exec ns0 ip link add l0 type dummy + ip netns exec ns0 ip link show l0 8: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 1e:4a:72:e3:e3:8f brd ff:ff:ff:ff:ff:ff + ip link add l0 type dummy + ip link show l0 10: l0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether f2:fc:fe:2b:7d:a6 brd ff:ff:ff:ff:ff:ff + ip link set l0 netns ns0 RTNETLINK answers: File exists The problem is that do_setlink() passes its `char *ifname` argument, that it gets from a caller, to __dev_change_net_namespace() as is (as `const char *pat`), but semantics of ifname and pat can be different. For example, __rtnl_newlink() does this: net/core/rtnetlink.c 3270 char ifname[IFNAMSIZ]; ... 3286 if (tb[IFLA_IFNAME]) 3287 nla_strscpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ); 3288 else 3289 ifname[0] = '\0'; ... 3364 if (dev) { ... 3394 return do_setlink(skb, dev, ifm, extack, tb, ifname, status); 3395 } , i.e. do_setlink() gets ifname pointer that is always valid no matter if user specified IFLA_IFNAME or not and then do_setlink() passes this ifname pointer as is to __dev_change_net_namespace() as pat argument. But the pat (pattern) in __dev_change_net_namespace() is used as: net/core/dev.c 11198 err = -EEXIST; 11199 if (__dev_get_by_name(net, dev->name)) { 11200 /* We get here if we can't use the current device name */ 11201 if (!pat) 11202 goto out; 11203 err = dev_get_valid_name(net, dev, pat); 11204 if (err < 0) 11205 goto out; 11206 } As the result the `goto out` path on line 11202 is neven taken and instead of returning EEXIST defined on line 11198, __dev_change_net_namespace() returns an error from dev_get_valid_name() and this, in turn, will be EINVAL for ifname[0] = '\0' set earlier. Fixes: d8a5ec67 ("[NET]: netlink support for moving devices between network namespaces.") Signed-off-by:
Andrey Ignatov <rdna@fb.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Sasha Levin <sashal@kernel.org>
-