  1. Feb 12, 2022
    • fs/proc: task_mmu.c: don't read mapcount for migration entry · 24d7275c
      Yang Shi authored
      The syzbot reported the below BUG:
      
        kernel BUG at include/linux/page-flags.h:785!
        invalid opcode: 0000 [#1] PREEMPT SMP KASAN
        CPU: 1 PID: 4392 Comm: syz-executor560 Not tainted 5.16.0-rc6-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:PageDoubleMap include/linux/page-flags.h:785 [inline]
        RIP: 0010:__page_mapcount+0x2d2/0x350 mm/util.c:744
        Call Trace:
          page_mapcount include/linux/mm.h:837 [inline]
          smaps_account+0x470/0xb10 fs/proc/task_mmu.c:466
          smaps_pte_entry fs/proc/task_mmu.c:538 [inline]
          smaps_pte_range+0x611/0x1250 fs/proc/task_mmu.c:601
          walk_pmd_range mm/pagewalk.c:128 [inline]
          walk_pud_range mm/pagewalk.c:205 [inline]
          walk_p4d_range mm/pagewalk.c:240 [inline]
          walk_pgd_range mm/pagewalk.c:277 [inline]
          __walk_page_range+0xe23/0x1ea0 mm/pagewalk.c:379
          walk_page_vma+0x277/0x350 mm/pagewalk.c:530
          smap_gather_stats.part.0+0x148/0x260 fs/proc/task_mmu.c:768
          smap_gather_stats fs/proc/task_mmu.c:741 [inline]
          show_smap+0xc6/0x440 fs/proc/task_mmu.c:822
          seq_read_iter+0xbb0/0x1240 fs/seq_file.c:272
          seq_read+0x3e0/0x5b0 fs/seq_file.c:162
          vfs_read+0x1b5/0x600 fs/read_write.c:479
          ksys_read+0x12d/0x250 fs/read_write.c:619
          do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The reproducer was reading /proc/$PID/smaps while calling MADV_FREE at
      the same time.  MADV_FREE may split a THP if it is called on only part
      of it, which can trigger the race below:
      
                 CPU A                         CPU B
                 -----                         -----
        smaps walk:                      MADV_FREE:
        page_mapcount()
          PageCompound()
                                         split_huge_page()
          page = compound_head(page)
          PageDoubleMap(page)
      
      When calling PageDoubleMap() this page is not a tail page of THP anymore
      so the BUG is triggered.
      
      This could be fixed by elevating the refcount of the page before reading
      the mapcount, but that would prevent migration entries from being
      counted, and it seems like overkill: the race can only happen while the
      PMD is being split, when all PTE entries of the tail pages are actually
      migration entries, and smaps_account() already treats migration entries
      as mapcount == 1, as Kirill pointed out.
      
      Add a new parameter to smaps_account() that indicates the entry is a
      migration entry, and skip calling page_mapcount() in that case.  Don't
      skip reading the mapcount for device private entries, since they do track
      references with mapcount.
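
The accounting rule described above can be modeled in userspace. This is a minimal Python sketch, not the kernel code: the `Page` class, its `split_in_progress` flag, and the `stats` dictionary are hypothetical stand-ins, and PSS is computed in floating point rather than the kernel's fixed-point form.

```python
from dataclasses import dataclass


@dataclass
class Page:
    mapcount: int
    # Hypothetical flag: models a split_huge_page() racing with the reader.
    split_in_progress: bool = False


def page_mapcount(page: Page) -> int:
    """Stand-in for the kernel helper; raises where the kernel would BUG."""
    if page.split_in_progress:
        raise RuntimeError("mapcount read raced with THP split")
    return page.mapcount


def smaps_account(stats: dict, page: Page, size: int, migration: bool) -> None:
    """Account one mapping; migration entries never touch the mapcount."""
    mapcount = 1 if migration else page_mapcount(page)
    if mapcount >= 2:
        stats["shared"] += size
        stats["pss"] += size / mapcount
    else:
        stats["private"] += size
        stats["pss"] += size


stats = {"shared": 0, "private": 0, "pss": 0.0}
# A migration entry whose page is mid-split is accounted without a mapcount read.
smaps_account(stats, Page(mapcount=3, split_in_progress=True), 4096, migration=True)
# A present page mapped twice is shared and contributes half its size to PSS.
smaps_account(stats, Page(mapcount=2), 4096, migration=False)
print(stats)
```

Passing the flag in from the caller keeps the decision where the page table entry type is already known, so the accounting function never needs to inspect a page that may be changing underneath it.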
      
      Pagemap has a similar issue, although it was not reported.  Fix it as
      well.
      
      [shy828301@gmail.com: v4]
        Link: https://lkml.kernel.org/r/20220203182641.824731-1-shy828301@gmail.com
      [nathan@kernel.org: avoid unused variable warning in pagemap_pmd_range()]
        Link: https://lkml.kernel.org/r/20220207171049.1102239-1-nathan@kernel.org
      Link: https://lkml.kernel.org/r/20220120202805.3369-1-shy828301@gmail.com
      Fixes: e9b61f19 ("thp: reintroduce split_huge_page()")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Reported-by: <syzbot+1f52b3a18d5633fa7f82@syzkaller.appspotmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24d7275c
    • fs/binfmt_elf: fix PT_LOAD p_align values for loaders · 925346c1
      Mike Rapoport authored
      Rui Salvaterra reported that Aisleriot solitaire crashes with a "Wrong
      __data_start/_end pair" assertion from libgc after updating to v5.17-rc1.
      
      Bisection pointed to commit 9630f0d6 ("fs/binfmt_elf: use PT_LOAD
      p_align values for static PIE"), which fixed handling of static PIEs but
      made the condition that guards the load_bias calculation exclude loader
      binaries.
      
      Restoring the check for presence of interpreter fixes the problem.
      
      Link: https://lkml.kernel.org/r/20220202121433.3697146-1-rppt@kernel.org
      Fixes: 9630f0d6 ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reported-by: Rui Salvaterra <rsalvaterra@gmail.com>
      Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: "H.J. Lu" <hjl.tools@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      925346c1
  2. Feb 11, 2022
  3. Feb 10, 2022
  4. Feb 09, 2022
    • NFSD: Deprecate NFS_OFFSET_MAX · c306d737
      Chuck Lever authored
      
      
      NFS_OFFSET_MAX was introduced way back in Linux v2.3.y before there
      was a kernel-wide OFFSET_MAX value. As a clean up, replace the last
      few uses of it with its generic equivalent, and get rid of it.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      c306d737
    • NFSD: Fix offset type in I/O trace points · 6a4d333d
      Chuck Lever authored
      
      
      NFSv3 and NFSv4 use u64 offset values on the wire. Record these values
      verbatim without the implicit type cast to loff_t.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      6a4d333d
    • NFSD: COMMIT operations must not return NFS?ERR_INVAL · 3f965021
      Chuck Lever authored
      
      
      Since, well, forever, the Linux NFS server's nfsd_commit() function
      has returned nfserr_inval when the passed-in byte range arguments
      were non-sensical.
      
      However, according to RFC 1813 section 3.3.21, NFSv3 COMMIT requests
      are permitted to return only the following non-zero status codes:
      
            NFS3ERR_IO
            NFS3ERR_STALE
            NFS3ERR_BADHANDLE
            NFS3ERR_SERVERFAULT
      
      NFS3ERR_INVAL is not included in that list. Likewise, NFS4ERR_INVAL
      is not listed in the COMMIT row of Table 6 in RFC 8881.
      
      RFC 7530 does permit COMMIT to return NFS4ERR_INVAL, but does not
      specify when it can or should be used.
      
      Instead of dropping or failing a COMMIT request in a byte range that
      is not supported, turn it into a valid request by treating one or
      both arguments as zero. Offset zero means start-of-file, count zero
      means until-end-of-file, so we only ever extend the commit range.
      NFS servers are always allowed to commit more and sooner than
      requested.
      
      The range check is no longer bounded by NFS_OFFSET_MAX, but rather
      by the value that is returned in the maxfilesize field of the NFSv3
      FSINFO procedure or the NFSv4 maxfilesize file attribute.
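
The range handling described above can be sketched as a small userspace model. This Python example is an illustration under stated assumptions, not the patch itself: `commit_range` and its `maxbytes` parameter (standing in for the advertised maxfilesize) are hypothetical names, and the returned pair is an inclusive byte range.

```python
S64_MAX = 2**63 - 1


def commit_range(offset: int, count: int, maxbytes: int) -> tuple:
    """Map client COMMIT arguments to an inclusive (start, end) byte range.

    Out-of-range arguments are treated as zero, so the range only ever
    grows: offset 0 is start-of-file and count 0 is until-end-of-file.
    """
    start, end = 0, S64_MAX
    if offset < maxbytes:
        start = offset
        if count and offset + count - 1 < maxbytes:
            end = offset + count - 1
    return start, end


print(commit_range(4096, 4096, 2**40))         # in range: committed as requested
print(commit_range(2**63 + 5, 10, 2**40))      # bad offset: whole file
print(commit_range(2**40 - 10, 100, 2**40))    # bad count: from offset to end
```

Every unsupported input widens the range instead of failing, which matches the rule that a server may commit more than was requested but never less.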
      
      Note that this change results in a new pynfs failure:
      
      CMT4     st_commit.testCommitOverflow                             : RUNNING
      CMT4     st_commit.testCommitOverflow                             : FAILURE
                 COMMIT with offset + count overflow should return
                 NFS4ERR_INVAL, instead got NFS4_OK
      
      IMO the test is not correct as written: RFC 8881 does not allow the
      COMMIT operation to return NFS4ERR_INVAL.
      
      Reported-by: Dan Aloni <dan.aloni@vastdata.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Bruce Fields <bfields@fieldses.org>
      3f965021
    • NFSD: Clamp WRITE offsets · 6260d9a5
      Chuck Lever authored
      
      
      Ensure that a client cannot specify a WRITE range that falls in a
      byte range outside what the kernel's internal types (such as loff_t,
      which is signed) can represent. The kiocb iterators, invoked in
      nfsd_vfs_write(), should properly limit write operations to within
      the underlying file system's s_maxbytes.
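
As a rough illustration of the check described above, the following Python sketch tests whether a WRITE range fits in a signed 64-bit offset. The function name is hypothetical, and the sketch does not model which status code the server returns for a rejected range, since the message above does not say.

```python
OFFSET_MAX = 2**63 - 1  # largest value a signed 64-bit loff_t can hold


def write_range_representable(offset: int, length: int) -> bool:
    """Return True if both the start and the end of the range fit in loff_t.

    offset is the unsigned 64-bit value from the wire; length is the
    number of bytes the client wants to write.
    """
    return offset <= OFFSET_MAX and offset + length <= OFFSET_MAX


print(write_range_representable(0, 4096))
print(write_range_representable(OFFSET_MAX - 10, 100))
print(write_range_representable(2**63, 1))
```

Checking the end of the range as well as the start matters because an offset just below the limit can still overflow once the length is added.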
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      6260d9a5
    • NFSD: Fix NFSv3 SETATTR/CREATE's handling of large file sizes · a648fdeb
      Chuck Lever authored
      
      
      iattr::ia_size is a loff_t, so these NFSv3 procedures must be
      careful to deal with incoming client size values that are larger
      than s64_max without corrupting the value.
      
      Silently capping the value results in storing a different value
      than the client passed in, which is unexpected behavior, so remove
      the min_t() check in decode_sattr3().
      
      Note that RFC 1813 permits only the WRITE procedure to return
      NFS3ERR_FBIG. We believe that NFSv3 reference implementations
      also return NFS3ERR_FBIG when ia_size is too large.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      a648fdeb
    • NFSD: Fix ia_size underflow · e6faac3f
      Chuck Lever authored
      
      
      iattr::ia_size is a loff_t, which is a signed 64-bit type. NFSv3 and
      NFSv4 both define file size as an unsigned 64-bit type. Thus there
      is a range of valid file size values an NFS client can send that is
      already larger than Linux can handle.
      
      Currently decode_fattr4() dumps a full u64 value into ia_size. If
      that value happens to be larger than S64_MAX, then ia_size
      underflows. I'm about to fix up the NFSv3 behavior as well, so let's
      catch the underflow in the common code path: nfsd_setattr().
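
The underflow described above is easy to reproduce outside the kernel. This Python sketch models storing an unsigned 64-bit wire value in a signed 64-bit field; `to_loff_t` and `size_underflowed` are hypothetical helper names, and the sketch does not model what nfsd_setattr() does once it detects the condition.

```python
S64_MAX = 2**63 - 1
U64_MASK = 2**64 - 1


def to_loff_t(wire_value: int) -> int:
    """Reinterpret a u64 wire value as a signed 64-bit integer.

    This mirrors what happens when a u64 is stored in a loff_t field
    without a range check (two's complement reinterpretation).
    """
    wire_value &= U64_MASK
    return wire_value - 2**64 if wire_value > S64_MAX else wire_value


def size_underflowed(wire_size: int) -> bool:
    """A size larger than S64_MAX shows up as a negative ia_size."""
    return to_loff_t(wire_size) < 0


print(to_loff_t(S64_MAX))        # largest size that survives the conversion
print(to_loff_t(S64_MAX + 1))    # wraps to the most negative value
print(size_underflowed(2**64 - 1))
```

Because every value above S64_MAX becomes negative after the conversion, a single sign check in the common path is enough to catch the whole unsupported range.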
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      e6faac3f
    • NFSD: Fix the behavior of READ near OFFSET_MAX · 0cb4d23a
      Chuck Lever authored
      Dan Aloni reports:
      > Due to commit 8cfb9015 ("NFS: Always provide aligned buffers to
      > the RPC read layers") on the client, a read of 0xfff is aligned up
      > to server rsize of 0x1000.
      >
      > As a result, in a test where the server has a file of size
      > 0x7fffffffffffffff, and the client tries to read from the offset
      > 0x7ffffffffffff000, the read causes loff_t overflow in the server
      > and it returns an NFS code of EINVAL to the client. The client as
      > a result indefinitely retries the request.
      
      The Linux NFS client does not handle NFS?ERR_INVAL, even though all
      NFS specifications permit servers to return that status code for a
      READ.
      
      Instead of NFS?ERR_INVAL, have out-of-range READ requests succeed
      and return a short result. Set the EOF flag in the result to prevent
      the client from retrying the READ request. This behavior appears to
      be consistent with Solaris NFS servers.
      
      Note that NFSv3 and NFSv4 use u64 offset values on the wire. These
      must be converted to loff_t internally before use -- an implicit
      type cast is not adequate for this purpose. Otherwise VFS checks
      against sb->s_maxbytes do not work properly.
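
To make the short-read behavior concrete, here is a minimal Python sketch of clamping a READ request so that it never crosses OFFSET_MAX. The function name `clamp_read` is hypothetical and the sketch covers only the arithmetic; setting the EOF flag in the reply is not modeled.

```python
OFFSET_MAX = 0x7FFFFFFFFFFFFFFF  # largest value a signed 64-bit loff_t can hold


def clamp_read(offset: int, count: int) -> tuple:
    """Clamp an unsigned wire offset and count into the loff_t range.

    The request still succeeds; it may simply return fewer bytes
    than the client asked for.
    """
    if offset > OFFSET_MAX:
        offset = OFFSET_MAX
    if offset + count > OFFSET_MAX:
        count = OFFSET_MAX - offset
    return offset, count


# The reported case: a 0x1000-byte read starting 0xfff bytes before OFFSET_MAX.
offset, count = clamp_read(0x7FFFFFFFFFFFF000, 0x1000)
print(hex(offset), hex(count))
```

In the reported case the aligned-up count of 0x1000 is reduced to 0xfff, so the request ends exactly at OFFSET_MAX instead of overflowing.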
      
      Reported-by: Dan Aloni <dan.aloni@vastdata.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      0cb4d23a
    • Merge branch 'vlan-QinQ-leak-fix' · 3bed06e3
      David S. Miller authored
      
      
      Xin Long says:
      
      ====================
      vlan: fix a netdev refcnt leak for QinQ
      
      This issue can be simply reproduced by:
      
        # ip link add dummy0 type dummy
        # ip link add link dummy0 name dummy0.1 type vlan id 1
        # ip link add link dummy0.1 name dummy0.1.2 type vlan id 2
        # rmmod 8021q
      
       unregister_netdevice: waiting for dummy0.1 to become free. Usage count = 1
      
      To fix it, adjust vlan_dev_uninit() in Patch 1/2 so that it won't be
      called twice for the same device, then do the fix in vlan_dev_uninit()
      in Patch 2/2.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3bed06e3
    • vlan: move dev_put into vlan_dev_uninit · d6ff94af
      Xin Long authored
      Shuang Li reported a QinQ issue that is reproduced by simply doing:
      
        # ip link add dummy0 type dummy
        # ip link add link dummy0 name dummy0.1 type vlan id 1
        # ip link add link dummy0.1 name dummy0.1.2 type vlan id 2
        # rmmod 8021q
      
       unregister_netdevice: waiting for dummy0.1 to become free. Usage count = 1
      
      When 8021q is removed with rmmod, all vlan devs are deleted from their
      real_dev's vlan grp and added to list_kill by unregister_vlan_dev().
      dummy0.1 is unregistered before dummy0.1.2, because __rtnl_kill_links()
      uses for_each_netdev().

      When dummy0.1 is unregistered, dummy0.1.2 is not unregistered on the
      NETDEV_UNREGISTER event, as it has already been deleted from dummy0.1's
      vlan grp. However, because dummy0.1.2 still holds dummy0.1, dummy0.1
      keeps waiting in netdev_wait_allrefs(), while dummy0.1.2 never gets
      unregistered and never releases dummy0.1, as it delays dev_put() until
      dev->priv_destructor, vlan_dev_free(), is called.
      
      This issue was introduced by commit 563bcbae ("net: vlan: fix a UAF in
      vlan_dev_real_dev()"). This patch fixes it by moving dev_put() into
      vlan_dev_uninit(), which is called after the NETDEV_UNREGISTER event but
      before netdev_wait_allrefs().
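
The ordering problem can be seen in a toy model. The Python sketch below is not kernel code: it assumes teardown of a batch happens in two phases (uninit for every device, then a per-device wait for references followed by the destructor), and the `NetDev` class, `unregister_many`, and the `put_in_uninit` switch are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NetDev:
    name: str
    real_dev: Optional["NetDev"] = None
    refcnt: int = 0  # references held by other devices only (simplification)


def add_vlan(name: str, real_dev: NetDev) -> NetDev:
    real_dev.refcnt += 1  # the vlan dev holds its real_dev
    return NetDev(name, real_dev)


def unregister_many(devs: List[NetDev], put_in_uninit: bool) -> str:
    # Phase 1: the uninit hook runs for every device in the batch.
    for dev in devs:
        if put_in_uninit and dev.real_dev:
            dev.real_dev.refcnt -= 1
    # Phase 2: each device waits for its references, then runs its destructor.
    for dev in devs:
        if dev.refcnt:
            return f"waiting for {dev.name} to become free. Usage count = {dev.refcnt}"
        if not put_in_uninit and dev.real_dev:
            dev.real_dev.refcnt -= 1
    return "all devices freed"


def rmmod_8021q(put_in_uninit: bool) -> str:
    dummy0 = NetDev("dummy0")
    vlan1 = add_vlan("dummy0.1", dummy0)
    vlan2 = add_vlan("dummy0.1.2", vlan1)
    return unregister_many([vlan1, vlan2], put_in_uninit)


print(rmmod_8021q(put_in_uninit=False))  # put delayed until the destructor
print(rmmod_8021q(put_in_uninit=True))   # put moved into the uninit hook
```

With the put left in the destructor, the first device blocks on a reference that can only be dropped after its own wait completes. Dropping the reference in the uninit phase removes that dependency.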
      
      Fixes: 563bcbae ("net: vlan: fix a UAF in vlan_dev_real_dev()")
      Reported-by: Shuang Li <shuali@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d6ff94af
    • vlan: introduce vlan_dev_free_egress_priority · 37aa50c5
      Xin Long authored
      
      
      Introduce vlan_dev_free_egress_priority() to free the egress priority
      of a vlan dev, and keep vlan_dev_uninit() static as .ndo_uninit. This
      makes the code clearer and safer when new code is added to
      vlan_dev_uninit() in the future.
      
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      37aa50c5
    • ax25: fix UAF bugs of net_device caused by rebinding operation · feef318c
      Duoming Zhou authored
      
      
      If ax25_bind() is called before ax25_kill_by_device(), then
      ax25_kill_by_device() sets s->ax25_dev = NULL and calls
      ax25_disconnect() to change the states of ax25_cb and sock.

      However, if ax25_bind() is called again in the window between
      ax25_kill_by_device() and ax25_dev_device_down(), the values and
      states changed by ax25_kill_by_device() are reassigned.

      Finally, ax25_dev_device_down() deallocates the net_device.
      If the net_device is then dereferenced in syscall functions such as
      ax25_release(), ax25_sendmsg(), ax25_getsockopt(), ax25_getname()
      and ax25_info_show(), a UAF bug occurs.
      
      One of the possible race conditions is shown below:
      
            (USE)                   |      (FREE)
      ax25_bind()                   |
                                    |  ax25_kill_by_device()
      ax25_bind()                   |
      ax25_connect()                |    ...
                                    |  ax25_dev_device_down()
                                    |    ...
                                    |    dev_put_track(dev, ...) //FREE
      ax25_release()                |    ...
        ax25_send_control()         |
          alloc_skb()      //USE    |
      
      The corresponding failure log is shown below:
      ===============================================================
      BUG: KASAN: use-after-free in ax25_send_control+0x43/0x210
      ...
      Call Trace:
        ...
        ax25_send_control+0x43/0x210
        ax25_release+0x2db/0x3b0
        __sock_release+0x6d/0x120
        sock_close+0xf/0x20
        __fput+0x11f/0x420
        ...
      Allocated by task 1283:
        ...
        __kasan_kmalloc+0x81/0xa0
        alloc_netdev_mqs+0x5a/0x680
        mkiss_open+0x6c/0x380
        tty_ldisc_open+0x55/0x90
        ...
      Freed by task 1969:
        ...
        kfree+0xa3/0x2c0
        device_release+0x54/0xe0
        kobject_put+0xa5/0x120
        tty_ldisc_kill+0x3e/0x80
        ...
      
      In order to fix these UAF bugs caused by rebinding operation,
      this patch adds dev_hold_track() into ax25_bind() and
      corresponding dev_put_track() into ax25_kill_by_device().
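
The effect of pairing a hold in bind with a put in kill-by-device can be shown with a toy reference counter. This Python sketch is a simplified model, not the kernel code: `NetDevice`, `rebind_race`, and the `hold_in_bind` switch are hypothetical, the device starts with a single reference owned by its driver, and the final put on release is not modeled.

```python
class NetDevice:
    def __init__(self) -> None:
        self.refcnt = 1      # reference owned by the driver that registered it
        self.freed = False

    def hold(self) -> None:
        self.refcnt += 1

    def put(self) -> None:
        self.refcnt -= 1
        if self.refcnt == 0:
            self.freed = True


def rebind_race(hold_in_bind: bool) -> bool:
    """Replay bind, kill_by_device, bind, device_down.

    Returns True if a later release would touch a freed device.
    """
    dev = NetDevice()
    if hold_in_bind:
        dev.hold()           # first bind takes a reference
        dev.put()            # kill_by_device detaches the socket and drops it
        dev.hold()           # second bind, inside the race window
    dev.put()                # device_down drops the driver's reference
    return dev.freed         # release would now dereference dev


print(rebind_race(hold_in_bind=False))
print(rebind_race(hold_in_bind=True))
```

Without the hold, the rebound socket keeps a pointer that nothing accounts for, so the device is freed underneath it. With the hold, the second bind keeps the device alive past the point where the driver lets go.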
      
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      feef318c
    • net: dsa: fix panic when DSA master device unbinds on shutdown · ee534378
      Vladimir Oltean authored
      Rafael reports that on a system with LX2160A and Marvell DSA switches,
      if a reboot occurs while the DSA master (dpaa2-eth) is up, the following
      panic can be seen:
      
      systemd-shutdown[1]: Rebooting.
      Unable to handle kernel paging request at virtual address 00a0000800000041
      [00a0000800000041] address between user and kernel address ranges
      Internal error: Oops: 96000004 [#1] PREEMPT SMP
      CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 5.16.5-00042-g8f5585009b24 #32
      pc : dsa_slave_netdevice_event+0x130/0x3e4
      lr : raw_notifier_call_chain+0x50/0x6c
      Call trace:
       dsa_slave_netdevice_event+0x130/0x3e4
       raw_notifier_call_chain+0x50/0x6c
       call_netdevice_notifiers_info+0x54/0xa0
       __dev_close_many+0x50/0x130
       dev_close_many+0x84/0x120
       unregister_netdevice_many+0x130/0x710
       unregister_netdevice_queue+0x8c/0xd0
       unregister_netdev+0x20/0x30
       dpaa2_eth_remove+0x68/0x190
       fsl_mc_driver_remove+0x20/0x5c
       __device_release_driver+0x21c/0x220
       device_release_driver_internal+0xac/0xb0
       device_links_unbind_consumers+0xd4/0x100
       __device_release_driver+0x94/0x220
       device_release_driver+0x28/0x40
       bus_remove_device+0x118/0x124
       device_del+0x174/0x420
       fsl_mc_device_remove+0x24/0x40
       __fsl_mc_device_remove+0xc/0x20
       device_for_each_child+0x58/0xa0
       dprc_remove+0x90/0xb0
       fsl_mc_driver_remove+0x20/0x5c
       __device_release_driver+0x21c/0x220
       device_release_driver+0x28/0x40
       bus_remove_device+0x118/0x124
       device_del+0x174/0x420
       fsl_mc_bus_remove+0x80/0x100
       fsl_mc_bus_shutdown+0xc/0x1c
       platform_shutdown+0x20/0x30
       device_shutdown+0x154/0x330
       __do_sys_reboot+0x1cc/0x250
       __arm64_sys_reboot+0x20/0x30
       invoke_syscall.constprop.0+0x4c/0xe0
       do_el0_svc+0x4c/0x150
       el0_svc+0x24/0xb0
       el0t_64_sync_handler+0xa8/0xb0
       el0t_64_sync+0x178/0x17c
      
      It can be seen from the stack trace that the problem is that the
      deregistration of the master causes a dev_close(), which gets notified
      as NETDEV_GOING_DOWN to dsa_slave_netdevice_event().
      But dsa_switch_shutdown() has already run, and this has unregistered the
      DSA slave interfaces, and yet, the NETDEV_GOING_DOWN handler attempts to
      call dev_close_many() on those slave interfaces, leading to the problem.
      
      The previous attempt to avoid the NETDEV_GOING_DOWN on the master after
      dsa_switch_shutdown() was called seems improper. Unregistering the slave
      interfaces is unnecessary and unhelpful. Instead, after the slaves have
      stopped being uppers of the DSA master, we can now reset to NULL the
      master->dsa_ptr pointer, which will make DSA start ignoring all future
      notifier events on the master.
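
The idea of making the notifier ignore a master once its pointer is cleared can be sketched as follows. This Python model is illustrative only: the `Master` class, the two function names, and the returned strings are hypothetical and do not reproduce the real handler's logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Master:
    name: str
    dsa_ptr: Optional[object] = None  # set while the device acts as a DSA master


def netdevice_event(master: Master, event: str) -> str:
    """Toy notifier: a device without dsa_ptr is not treated as a master."""
    if master.dsa_ptr is None:
        return "ignored"
    if event == "NETDEV_GOING_DOWN":
        return "close slave interfaces"
    return "handled"


def switch_shutdown(master: Master) -> None:
    """After shutdown, stop reacting to any further events on the master."""
    master.dsa_ptr = None


eth = Master("eth0", dsa_ptr=object())
print(netdevice_event(eth, "NETDEV_GOING_DOWN"))
switch_shutdown(eth)
print(netdevice_event(eth, "NETDEV_GOING_DOWN"))
```

Clearing the pointer gives the handler one early exit for every later event, so it never reaches state that shutdown has already torn down.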
      
      Fixes: 0650bf52 ("net: dsa: be compatible with masters which unregister on shutdown")
      Reported-by: Rafael Richter <rafael.richter@gin.de>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee534378