  1. May 01, 2022
  2. Apr 27, 2022
    • Linux 5.15.36 · 45451e80
      Greg Kroah-Hartman authored
      
      
      Link: https://lore.kernel.org/r/20220426081747.286685339@linuxfoundation.org
      Tested-by: Jon Hunter <jonathanh@nvidia.com>
      Tested-by: Florian Fainelli <f.fainelli@gmail.com>
      Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Shuah Khan <skhan@linuxfoundation.org>
      Tested-by: Slade Watkins <slade@sladewatkins.com>
      Tested-by: Ron Economos <re@w6rz.net>
      Tested-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      v5.15.36
    • arm64: dts: qcom: add IPA qcom,qmp property · bb906d15
      Alex Elder authored
      commit 73419e4d upstream.
      
      At least three platforms require the "qcom,qmp" property to be
      specified, so the IPA driver can request register retention across
      power collapse.  Update DTS files accordingly.
      
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
      Link: https://lore.kernel.org/r/20220201140723.467431-1-elder@linaro.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • block/compat_ioctl: fix range check in BLKGETSIZE · 1ea01e64
      Khazhismel Kumykov authored
      commit ccf16413 upstream.
      
      The kernel's unsigned long and compat_ulong_t may not be the same
      width. Use the type directly to eliminate mismatches.
      
      Without this, 32-bit mode would get a truncated size rather than EFBIG
      for large disks.
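
The width mismatch described above can be illustrated with a small user-space sketch. This is not the kernel code: the type `compat_ulong_sketch_t`, the two limit macros, and the helper `put_compat_size()` are hypothetical names chosen for illustration, and the kernel's unsigned long is assumed to be 64 bits wide.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the 32-bit compat_ulong_t of a 32-bit process. */
typedef uint32_t compat_ulong_sketch_t;

/* Largest value of an (assumed) 64-bit kernel unsigned long. */
#define KERNEL_ULONG_MAX_SKETCH ((uint64_t)-1)
/* Largest value that actually fits in the compat type. */
#define COMPAT_ULONG_MAX_SKETCH ((uint64_t)(compat_ulong_sketch_t)-1)

/*
 * Store a sector count into the narrow compat type. The result is only
 * trustworthy if `limit` matches the width of the destination type.
 */
static int put_compat_size(uint64_t nr_sectors, uint64_t limit,
                           compat_ulong_sketch_t *out)
{
    if (nr_sectors > limit)
        return -EFBIG;
    *out = (compat_ulong_sketch_t)nr_sectors; /* narrowing store */
    return 0;
}
```

Checking against `KERNEL_ULONG_MAX_SKETCH` can never fail for a 64-bit input, so an oversized count is silently truncated by the narrowing store. Checking against `COMPAT_ULONG_MAX_SKETCH` rejects it with -EFBIG instead.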
      
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Link: https://lore.kernel.org/r/20220414224056.2875681-1-khazhy@google.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • spi: atmel-quadspi: Fix the buswidth adjustment between spi-mem and controller · 6a3c609f
      Tudor Ambarus authored
      commit 8c235cc2 upstream.
      
      Use the spi_mem_default_supports_op() core helper in order to take into
      account the buswidth specified by the user in device tree.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 0e6aae08 ("spi: Add QuadSPI driver for Atmel SAMA5D2")
      Signed-off-by: Tudor Ambarus <tudor.ambarus@microchip.com>
      Link: https://lore.kernel.org/r/20220406133604.455356-1-tudor.ambarus@microchip.com
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • jbd2: fix a potential race while discarding reserved buffers after an abort · b1b8f39c
      Ye Bin authored
      commit 23e3d7f7 upstream.
      
      We got the following issue:
      [   72.796117] EXT4-fs error (device sda): ext4_journal_check_start:83: comm fallocate: Detected aborted journal
      [   72.826847] EXT4-fs (sda): Remounting filesystem read-only
      fallocate: fallocate failed: Read-only file system
      [   74.791830] jbd2_journal_commit_transaction: jh=0xffff9cfefe725d90 bh=0x0000000000000000 end delay
      [   74.793597] ------------[ cut here ]------------
      [   74.794203] kernel BUG at fs/jbd2/transaction.c:2063!
      [   74.794886] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [   74.795533] CPU: 4 PID: 2260 Comm: jbd2/sda-8 Not tainted 5.17.0-rc8-next-20220315-dirty #150
      [   74.798327] RIP: 0010:__jbd2_journal_unfile_buffer+0x3e/0x60
      [   74.801971] RSP: 0018:ffffa828c24a3cb8 EFLAGS: 00010202
      [   74.802694] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [   74.803601] RDX: 0000000000000001 RSI: ffff9cfefe725d90 RDI: ffff9cfefe725d90
      [   74.804554] RBP: ffff9cfefe725d90 R08: 0000000000000000 R09: ffffa828c24a3b20
      [   74.805471] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9cfefe725d90
      [   74.806385] R13: ffff9cfefe725d98 R14: 0000000000000000 R15: ffff9cfe833a4d00
      [   74.807301] FS:  0000000000000000(0000) GS:ffff9d01afb00000(0000) knlGS:0000000000000000
      [   74.808338] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   74.809084] CR2: 00007f2b81bf4000 CR3: 0000000100056000 CR4: 00000000000006e0
      [   74.810047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   74.810981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   74.811897] Call Trace:
      [   74.812241]  <TASK>
      [   74.812566]  __jbd2_journal_refile_buffer+0x12f/0x180
      [   74.813246]  jbd2_journal_refile_buffer+0x4c/0xa0
      [   74.813869]  jbd2_journal_commit_transaction.cold+0xa1/0x148
      [   74.817550]  kjournald2+0xf8/0x3e0
      [   74.819056]  kthread+0x153/0x1c0
      [   74.819963]  ret_from_fork+0x22/0x30
      
      The above issue may happen as follows:
              write                   truncate                   kjournald2
      generic_perform_write
       ext4_write_begin
        ext4_walk_page_buffers
         do_journal_get_write_access ->add BJ_Reserved list
       ext4_journalled_write_end
        ext4_walk_page_buffers
         write_end_fn
          ext4_handle_dirty_metadata
                      ***************JBD2 ABORT**************
           jbd2_journal_dirty_metadata
       -> return -EROFS, jh in reserved_list
                                                         jbd2_journal_commit_transaction
                                                          while (commit_transaction->t_reserved_list)
                                                            jh = commit_transaction->t_reserved_list;
                              truncate_pagecache_range
                               do_invalidatepage
      			  ext4_journalled_invalidatepage
      			   jbd2_journal_invalidatepage
      			    journal_unmap_buffer
      			     __dispose_buffer
      			      __jbd2_journal_unfile_buffer
      			       jbd2_journal_put_journal_head ->put last ref_count
      			        __journal_remove_journal_head
      				 bh->b_private = NULL;
      				 jh->b_bh = NULL;
      				                      jbd2_journal_refile_buffer(journal, jh);
      							bh = jh2bh(jh);
      							->bh is NULL, later will trigger null-ptr-deref
      				 journal_free_journal_head(jh);
      
      After commit 96f1e097, we no longer hold the j_state_lock while
      iterating over the list of reserved handles in
      jbd2_journal_commit_transaction().  This potentially allows the
      journal_head to be freed by journal_unmap_buffer while the commit
      codepath is also trying to free the BJ_Reserved buffers.  Keeping
      j_state_lock held while trying extends hold time of the lock
      minimally, and solves this issue.
      
      Fixes: 96f1e097 ("jbd2: avoid long hold times of j_state_lock while committing a transaction")
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220317142137.1821590-1-yebin10@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • netfilter: nft_ct: fix use after free when attaching zone template · 2e25c46c
      Florian Westphal authored
      commit 34243b9e upstream.
      
      The conversion erroneously removed the refcount increment.
      In case we can use the percpu template, we need to increment
      the refcount, else it will be released when the skb gets freed.
      
      In case the slowpath is taken, the new template already has a
      refcount of 1.
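
The two cases above can be modelled with a toy reference counter. This is only a sketch of the ownership rule, not the nft_ct code: `tmpl_sketch`, `pkt_sketch`, and all helper names are hypothetical, and the counter is a plain int rather than an atomic refcount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a conntrack template and a packet (skb). */
struct tmpl_sketch { int refs; bool freed; };
struct pkt_sketch  { struct tmpl_sketch *ct; };

static void tmpl_get(struct tmpl_sketch *t) { t->refs++; }

static void tmpl_put(struct tmpl_sketch *t)
{
    assert(t->refs > 0);
    if (--t->refs == 0)
        t->freed = true; /* last reference dropped */
}

/* Fast path: the template is shared, so the packet needs its own reference. */
static void attach_shared(struct pkt_sketch *p, struct tmpl_sketch *shared)
{
    tmpl_get(shared);
    p->ct = shared;
}

/* Slow path: a new template starts at 1 and that reference moves to the packet. */
static void attach_fresh(struct pkt_sketch *p, struct tmpl_sketch *fresh)
{
    p->ct = fresh;
}

/* Freeing the packet always drops exactly one reference. */
static void pkt_free(struct pkt_sketch *p)
{
    if (p->ct)
        tmpl_put(p->ct);
    p->ct = NULL;
}
```

If `attach_shared()` skipped the `tmpl_get()` call, `pkt_free()` would drop the only reference and release a template that is still in use elsewhere, which is the use-after-free pattern the commit describes.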
      
      Fixes: 71977437 ("netfilter: conntrack: convert to refcount_t api")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: force overhead calculation if the s_overhead_cluster makes no sense · 2b273d1f
      Theodore Ts'o authored
      commit 85d825db upstream.
      
      If the file system does not use bigalloc, calculating the overhead is
      cheap, so force the recalculation of the overhead so we don't have to
      trust the precalculated overhead in the superblock.
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix overhead calculation to account for the reserved gdt blocks · 52ca84a3
      Theodore Ts'o authored
      commit 10b01ee9 upstream.
      
      The kernel calculation was underestimating the overhead by not taking
      into account the reserved gdt blocks.  With this change, the overhead
      calculated by the kernel matches the overhead calculation in mke2fs.
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4, doc: fix incorrect h_reserved size · 6b952563
      wangjianjian (C) authored
      commit 7102ffe4 upstream.
      
      According to the documentation and the code, ext4_xattr_header's size
      is 32 bytes, so the h_reserved array size should be 3.
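
The size arithmetic can be checked with a compile-time assertion. The struct below is a user-space sketch that models each little-endian 32-bit on-disk field as a `uint32_t`. The field names follow the ext4 xattr header as I understand it, while `xattr_header_sketch` and `xattr_header_sketch_size()` are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the xattr block header: five 32-bit fields plus padding words. */
struct xattr_header_sketch {
    uint32_t h_magic;
    uint32_t h_refcount;
    uint32_t h_blocks;
    uint32_t h_hash;
    uint32_t h_checksum;
    uint32_t h_reserved[3]; /* 5 * 4 + 3 * 4 = 32 bytes */
};

/* With only 2 reserved words the header would be 28 bytes, not 32. */
_Static_assert(sizeof(struct xattr_header_sketch) == 32,
               "header sketch must be 32 bytes");

static size_t xattr_header_sketch_size(void)
{
    return sizeof(struct xattr_header_sketch);
}
```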
      
      Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com>
      Link: https://lore.kernel.org/r/92fcc3a6-7d77-8c09-4126-377fcb4c46a5@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: limit length to bitmap_maxbytes - blocksize in punch_hole · 9b900037
      Tadeusz Struk authored
      commit 2da37622 upstream.
      
      Syzbot found an issue [1] in ext4_fallocate().
      The C reproducer [2] calls fallocate(), passing size 0xffeffeff000ul
      and offset 0x1000000ul, which, when added together, exceed the
      bitmap_maxbytes for the inode. This triggers a BUG in
      ext4_ind_remove_space(). According to the comments in this function,
      the 'end' parameter needs to be one block after the last block to be
      removed. In the case when the BUG is triggered it points to the last
      block. Modify the ext4_punch_hole() function and add a constraint that
      caps the length to satisfy the one before the last block requirement.
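
The capping rule from the commit title (length limited to bitmap_maxbytes - blocksize) can be sketched as a standalone helper. This is an illustration under assumptions, not the ext4 code: `clamp_punch_len()` is a hypothetical name, and the limit used in the usage note is a made-up value rather than a real s_bitmap_maxbytes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: cap `length` so that offset + length never goes
 * past (bitmap_maxbytes - blocksize). The subtraction form avoids
 * overflowing offset + length. Assumes bitmap_maxbytes >= blocksize.
 */
static uint64_t clamp_punch_len(uint64_t offset, uint64_t length,
                                uint64_t bitmap_maxbytes, uint64_t blocksize)
{
    uint64_t max_end = bitmap_maxbytes - blocksize;

    if (offset >= max_end)
        return 0;                 /* nothing left to punch */
    if (length > max_end - offset)
        return max_end - offset;  /* cap at one block below the limit */
    return length;
}
```

For example, with the reproducer's offset 0x1000000 and size 0xffeffeff000, an assumed limit of 0x100000000 bytes, and 4096-byte blocks, the length is capped so that the hole ends at 0xfffff000, one block below the limit.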
      
      LINK: [1] https://syzkaller.appspot.com/bug?id=b80bd9cf348aac724a4f4dff251800106d721331
      LINK: [2] https://syzkaller.appspot.com/text?tag=ReproC&x=14ba0238700000
      
      Fixes: a4bb6b64 ("ext4: enable "punch hole" functionality")
      Reported-by: <syzbot+7a806094edd5d07ba029@syzkaller.appspotmail.com>
      Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
      Link: https://lore.kernel.org/r/20220331200515.153214-1-tadeusz.struk@linaro.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix use-after-free in ext4_search_dir · e3912775
      Ye Bin authored
      commit c186f088 upstream.
      
      We got the following issue:
      EXT4-fs (loop0): mounted filesystem without journal. Opts: ,errors=continue
      ==================================================================
      BUG: KASAN: use-after-free in ext4_search_dir fs/ext4/namei.c:1394 [inline]
      BUG: KASAN: use-after-free in search_dirblock fs/ext4/namei.c:1199 [inline]
      BUG: KASAN: use-after-free in __ext4_find_entry+0xdca/0x1210 fs/ext4/namei.c:1553
      Read of size 1 at addr ffff8881317c3005 by task syz-executor117/2331
      
      CPU: 1 PID: 2331 Comm: syz-executor117 Not tainted 5.10.0+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:83 [inline]
       dump_stack+0x144/0x187 lib/dump_stack.c:124
       print_address_description+0x7d/0x630 mm/kasan/report.c:387
       __kasan_report+0x132/0x190 mm/kasan/report.c:547
       kasan_report+0x47/0x60 mm/kasan/report.c:564
       ext4_search_dir fs/ext4/namei.c:1394 [inline]
       search_dirblock fs/ext4/namei.c:1199 [inline]
       __ext4_find_entry+0xdca/0x1210 fs/ext4/namei.c:1553
       ext4_lookup_entry fs/ext4/namei.c:1622 [inline]
       ext4_lookup+0xb8/0x3a0 fs/ext4/namei.c:1690
       __lookup_hash+0xc5/0x190 fs/namei.c:1451
       do_rmdir+0x19e/0x310 fs/namei.c:3760
       do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x445e59
      Code: 4d c7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b c7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fff2277fac8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
      RAX: ffffffffffffffda RBX: 0000000000400280 RCX: 0000000000445e59
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000200000c0
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000002
      R10: 00007fff2277f990 R11: 0000000000000246 R12: 0000000000000000
      R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000
      
      The buggy address belongs to the page:
      page:0000000048cd3304 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1317c3
      flags: 0x200000000000000()
      raw: 0200000000000000 ffffea0004526588 ffffea0004528088 0000000000000000
      raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8881317c2f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff8881317c2f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      >ffff8881317c3000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                         ^
       ffff8881317c3080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff8881317c3100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      ==================================================================
      
      ext4_search_dir:
        ...
        de = (struct ext4_dir_entry_2 *)search_buf;
        dlimit = search_buf + buf_size;
        while ((char *) de < dlimit) {
        ...
          if ((char *) de + de->name_len <= dlimit &&
      	 ext4_match(dir, fname, de)) {
      	    ...
          }
        ...
          de_len = ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize);
          if (de_len <= 0)
            return -1;
          offset += de_len;
          de = (struct ext4_dir_entry_2 *) ((char *) de + de_len);
        }
      
      Assume:
      de=0xffff8881317c2fff
      dlimit=0xffff8881317c3000
      
      Reading 'de->name_len', whose address is 0xffff8881317c3005, is
      obviously out of range and triggers the use-after-free.
      To solve this issue, 'dlimit' must reserve 8 bytes, because we read
      'de->name_len' to judge whether '(char *) de + de->name_len' is out of range.
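
The address arithmetic above follows from the layout of the fixed part of a directory entry: name_len sits 6 bytes into an 8-byte header, so 0xffff8881317c2fff + 6 = 0xffff8881317c3005. The sketch below models that header and a bounds check that reserves room for it. The struct, `HDR_LEN_SKETCH`, and `hdr_in_bounds()` are hypothetical names, and the check is a simplified illustration rather than the exact condition used in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the fixed 8-byte part of an ext4 directory entry. */
struct dirent_hdr_sketch {
    uint32_t inode;
    uint16_t rec_len;
    uint8_t  name_len;  /* offset 6: the byte read before the name check */
    uint8_t  file_type;
};

#define HDR_LEN_SKETCH sizeof(struct dirent_hdr_sketch)

/*
 * Return 1 if a header starting at byte offset `off` lies entirely inside
 * a buffer of `buf_size` bytes, so that reading name_len is safe.
 */
static int hdr_in_bounds(size_t off, size_t buf_size)
{
    return buf_size >= HDR_LEN_SKETCH && off <= buf_size - HDR_LEN_SKETCH;
}
```

A plain `off < buf_size` test accepts an entry that starts on the last byte of the block, which is the case shown in the report. Reserving the header length rejects it before any field is read.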
      
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220324064816.1209985-1-yebin10@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix symlink file size not match to file content · 8bb5676b
      Ye Bin authored
      commit a2b0b205 upstream.
      
      We got the following issue:
      [home]# fsck.ext4  -fn  ram0yb
      e2fsck 1.45.6 (20-Mar-2020)
      Pass 1: Checking inodes, blocks, and sizes
      Pass 2: Checking directory structure
      Symlink /p3/d14/d1a/l3d (inode #3494) is invalid.
      Clear? no
      Entry 'l3d' in /p3/d14/d1a (3383) has an incorrect filetype (was 7, should be 0).
      Fix? no
      
      The symlink file size does not match the file content. If the writeback
      of the symlink data block failed, ext4_finish_bio() handles the end of IO.
      However, this function fails to mark the buffer with BH_write_io_error, and
      so when unmount does a journal checkpoint it cannot detect the writeback
      error and will clean up the journal. Thus we've lost the correct data in the
      journal area. To solve this issue, mark the buffer as BH_write_io_error in
      ext4_finish_bio().
      
      Cc: stable@kernel.org
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220321144438.201685-1-yebin10@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix fallocate to use file_modified to update permissions consistently · ba50ea45
      Darrick J. Wong authored
      commit ad5cd4f4 upstream.
      
      Since the initial introduction of (posix) fallocate back at the turn of
      the century, it has been possible to use this syscall to change the
      user-visible contents of files.  This can happen by extending the file
      size during a preallocation, or through any of the newer modes (punch,
      zero, collapse, insert range).  Because the call can be used to change
      file contents, we should treat it like we do any other modification to a
      file -- update the mtime, and drop set[ug]id privileges/capabilities.
      
      The VFS function file_modified() does all this for us if we pass it a
      locked inode, so let's make fallocate drop permissions correctly.
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20220308185043.GA117678@magnolia
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • netfilter: conntrack: avoid useless indirection during conntrack destruction · 67e4860e
      Florian Westphal authored
      commit 6ae7989c upstream.
      
      nf_ct_put() results in a useless indirection:
      
      nf_ct_put -> nf_conntrack_put -> nf_conntrack_destroy -> rcu readlock +
      indirect call of ct_hooks->destroy().
      
      There are two _put helpers:
      nf_ct_put and nf_conntrack_put.  The latter is what should be used in
      code that MUST NOT cause a linker dependency on the conntrack module
      (e.g. calls from core network stack).
      
      Everyone else should call nf_ct_put() instead.
      
      A followup patch will convert a few nf_conntrack_put() calls to
      nf_ct_put(), in particular from modules that already have a conntrack
      dependency such as act_ct or even nf_conntrack itself.
      
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • netfilter: conntrack: convert to refcount_t api · bcba40bd
      Florian Westphal authored
      commit 71977437 upstream.
      
      Convert nf_conn reference counting from atomic_t to refcount_t based api.
      refcount_t api provides more runtime sanity checks and will warn on
      certain constructs, e.g. refcount_inc() on a zero reference count, which
      usually indicates use-after-free.
      
      For this reason template allocation is changed to init the refcount to
      1, and the subsequent add operations are removed.
      
      Likewise, init_conntrack() is changed to set the initial refcount to 1
      instead of calling refcount_inc().
      
      This is safe because the new entry is not (yet) visible to other cpus.
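
Why the initial count is set rather than incremented can be seen in a toy model of a checked counter. This sketch is not the kernel's refcount_t: the names are hypothetical, the counter is not atomic, and where the kernel warns about an increment from zero, this version simply refuses it and returns false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, non-atomic model of a checked reference counter. */
struct refcount_sketch { unsigned int refs; };

/* Initialise a new object that is not yet visible to anyone else. */
static void refcount_sketch_set(struct refcount_sketch *r, unsigned int n)
{
    r->refs = n;
}

/* An increment from zero usually means the object was already released. */
static bool refcount_sketch_inc(struct refcount_sketch *r)
{
    if (r->refs == 0)
        return false; /* the kernel API would warn here */
    r->refs++;
    return true;
}

/* Drop one reference and report whether it was the last one. */
static bool refcount_sketch_dec_and_test(struct refcount_sketch *r)
{
    assert(r->refs > 0);
    return --r->refs == 0;
}
```

Under this model a freshly allocated object with a zero counter cannot be brought to life with an increment, so the allocation path sets the count to 1 directly.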
      
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs · 4bbd693d
      Mingwei Zhang authored
      commit d45829b3 upstream.
      
      Use clflush_cache_range() to flush the confidential memory when
      SME_COHERENT is supported on AMD CPUs. A cache flush is still needed since
      SME_COHERENT only supports cache invalidation on the CPU side. All confidential
      cache lines are still incoherent with DMA devices.
      
      Cc: stable@vger.kernel.org
      Fixes: add5e2f0 ("KVM: SVM: Add support for the SEV-ES VMSA")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-3-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: nVMX: Defer APICv updates while L2 is active until L1 is active · 8b2da969
      Sean Christopherson authored
      commit 7c69661e upstream.
      
      Defer APICv updates that occur while L2 is active until nested VM-Exit,
      i.e. until L1 regains control.  vmx_refresh_apicv_exec_ctrl() assumes L1
      is active and (a) stomps all over vmcs02 and (b) neglects to ever update
      vmcs01.  E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
      APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
      becomes uninhibited while L2 is active, KVM will set various APICv controls
      in vmcs02 and trigger a failed VM-Entry.  The kicker is that, unless
      running with nested_early_check=1, KVM blames L1 and chaos ensues.
      
      In all cases, ignoring vmcs02 and always deferring the inhibition change
      to vmcs01 is correct (or at least acceptable).  The ABSENT and DISABLE
      inhibitions cannot truly change while L2 is active (see below).
      
      IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
      Furthermore, only L2's APIC is accelerated/virtualized to the full extent
      possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
      interception will apply to the virtual APIC managed by KVM.
      The exception is the SELF_IPI register when x2APIC is enabled, but that's
      an acceptable hole.
      
      Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
      MSRs to L2, but for that to work in any sane capacity, L1 would need to
      pass through IRQs to L2 as well, and IRQs must be intercepted to enable
      virtual interrupt delivery.  I.e. exposing Auto EOI to L2 and enabling
      VID for L2 are, for all intents and purposes, mutually exclusive.
      
      Lack of dynamic toggling is also why this scenario is all but impossible
      to encounter in KVM's current form.  But a future patch will pend an
      APICv update request _during_ vCPU creation to plug a race where a vCPU
      that's being created doesn't get included in the "all vCPUs request"
      because it's not yet visible to other vCPUs.  If userspace restores L2
      after VM creation (hello, KVM selftests), the first KVM_RUN will occur
      while L2 is active and thus service the APICv update request made during
      VM creation.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220420013732.3308816-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>