  1. Dec 13, 2016
    • mm: make unreserve highatomic functions reliable · 29fac03b
      Minchan Kim authored
      Currently, unreserve_highatomic_pageblock bails out as soon as it finds
      a highatomic pageblock, regardless of whether it really moved any free
      pages out of it.  That can defeat the goal of the unreserve logic, which
      is to save a process from OOM.

      This patch makes the unreserve function bail out only if it has moved
      some pages to a !highatomic free list, to avoid such a false positive.

      Another potential problem is that, because of a race between page
      freeing and the reserve highatomic function, pages could be in the
      highatomic free list even though the pageblock is of !highatomic
      migratetype.  In that case, unreserve_highatomic_pageblock can be a
      no-op if the count of the highatomic reserve is less than
      pageblock_nr_pages.  We can solve it simply by draining all of the
      reserved pages before the OOM.  It acts as a safeguard that exhausts
      the reserved pages before converging to OOM.
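
The difference between the old and the patched bail-out rule can be modelled in user space. This is only a sketch under stated assumptions, not the kernel code: `struct block_model`, `move_free_pages` and `unreserve_model` are hypothetical names, and a pageblock is reduced to a flag plus a count of free pages on the highatomic list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pageblock; not a kernel structure. */
struct block_model {
    bool highatomic;      /* block is currently typed highatomic */
    int  free_highatomic; /* free pages sitting on the highatomic list */
};

/* Move the block's free pages to a normal list; return how many moved. */
int move_free_pages(struct block_model *b)
{
    int moved = b->free_highatomic;

    b->free_highatomic = 0;
    b->highatomic = false;
    return moved;
}

/*
 * stop_at_first_block = true models the old rule: report success as soon
 * as a highatomic block is found, even if nothing moved.
 * stop_at_first_block = false models the patched rule: keep scanning and
 * report success only when some pages really moved.
 */
bool unreserve_model(struct block_model *blocks, size_t n,
                     bool stop_at_first_block)
{
    assert(blocks != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!blocks[i].highatomic)
            continue;
        int moved = move_free_pages(&blocks[i]);
        if (stop_at_first_block || moved > 0)
            return true;
    }
    return false;
}
```

With one empty highatomic block followed by one holding 8 free pages, the old rule returns true while leaving the 8 pages reserved; the patched rule returns true only after those pages have been moved.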
      
      Link: http://lkml.kernel.org/r/1476259429-18279-5-git-send-email-minchan@kernel.org
      
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29fac03b
    • mm: try to exhaust highatomic reserve before the OOM · 04c8716f
      Minchan Kim authored
      I got an OOM report from the production team with a v4.4 kernel.  It had
      enough free memory but failed to allocate a GFP_KERNEL order-0 page and
      finally encountered an OOM kill.  It occurred during a QA process which
      launches several apps, switches between them and so on.  It happened
      rarely.  IOW, in a normal situation it was not a problem, but if we are
      unlucky and several apps use their peak memory at the same time, it can
      happen.  If we manage to get past that phase, the system can go on
      working well.

      I could reproduce it easily with my test (a memory spike).  Look at the
      reports below.

      The reason is that the free pages (19M) of the DMA32 zone are reserved
      for HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example, where the reserve exceeded the limit because of the race:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      It's weird that the zone shows enough free memory above the min
      watermark but still OOMed on a 4K GFP_KERNEL allocation because of the
      reserved highatomic pages.  As a last resort, try to unreserve
      highatomic pages again and, if that has moved pages to a non-highatomic
      free list, retry reclaim once more.
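
The situation in the first report can be illustrated with a small model. This is a sketch, not the allocator: `struct zone_report` and the three helpers are hypothetical, and it assumes that all of the DMA32 free memory (19404kB) sits on highatomic lists, as the (H) markers in the buddy list suggest.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone summary; values are in kB as in the report. */
struct zone_report {
    long free_kb;            /* "free:" field */
    long min_kb;             /* "min:" watermark */
    long free_highatomic_kb; /* free memory sitting on (H) lists */
};

/* The zone looks healthy if free memory is above the min watermark. */
bool above_min_watermark(const struct zone_report *z)
{
    return z->free_kb > z->min_kb;
}

/* An ordinary order-0 request can only use free pages outside the reserve. */
bool ordinary_alloc_possible(const struct zone_report *z)
{
    assert(z->free_highatomic_kb <= z->free_kb);
    return z->free_kb - z->free_highatomic_kb > 0;
}

/* Last resort before OOM: hand the whole reserve back to the normal lists. */
bool unreserve_all(struct zone_report *z)
{
    bool moved = z->free_highatomic_kb > 0;

    z->free_highatomic_kb = 0;
    return moved; /* true means a retry is worthwhile */
}
```

Filling in `{19404, 5628, 19404}` gives a zone that is above its min watermark yet cannot serve an ordinary allocation until `unreserve_all` has run, which is the case the retry is meant to cover.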
      
      Link: http://lkml.kernel.org/r/1476259429-18279-4-git-send-email-minchan@kernel.org
      
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04c8716f
    • mm: prevent double decrease of nr_reserved_highatomic · 4855e4a7
      Minchan Kim authored
      There is a race between page freeing and unreserving highatomic.
      
       CPU 0				    CPU 1
      
          free_hot_cold_page
            mt = get_pfnblock_migratetype
            set_pcppage_migratetype(page, mt)
          				    unreserve_highatomic_pageblock
          				    spin_lock_irqsave(&zone->lock)
          				    move_freepages_block
          				    set_pageblock_migratetype(page)
          				    spin_unlock_irqrestore(&zone->lock)
            free_pcppages_bulk
              __free_one_page(mt) <- mt is stale
      
      Because of the above race, a page on CPU 0 could go to a
      non-highorderatomic free list since the pageblock's type has changed.
      As a result, the highorderatomic unreserve logic can decrease the
      reserved count for the same pageblock several times, which creates a
      mismatch between nr_reserved_highatomic and the number of reserved
      pageblocks.

      So, this patch verifies whether the pageblock is highatomic or not and
      decreases the count only if the pageblock is highatomic.
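
The guard described above can be sketched as follows. This is a user-space model under stated assumptions: `struct reserve_acct`, `unreserve_block` and the pageblock size of 512 pages are made up for illustration, and clamping the decrement so the count cannot go negative is an extra precaution of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGEBLOCK_NR_PAGES 512L /* assumed size, for illustration only */

/* Hypothetical accounting state; stands in for the per-zone counter. */
struct reserve_acct {
    long nr_reserved_highatomic;
};

/*
 * Decrease the reserve count only when the block is still typed
 * highatomic. Returns true if the count was decreased.
 */
bool unreserve_block(struct reserve_acct *acct, bool block_is_highatomic)
{
    long dec = MODEL_PAGEBLOCK_NR_PAGES;

    assert(acct->nr_reserved_highatomic >= 0);

    if (!block_is_highatomic)
        return false; /* already unreserved once: do not count it twice */

    if (dec > acct->nr_reserved_highatomic)
        dec = acct->nr_reserved_highatomic;
    acct->nr_reserved_highatomic -= dec;
    return true;
}
```

Calling it a second time for a block that has already lost its highatomic type leaves the counter untouched, so the counter stays in step with the number of reserved pageblocks.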
      
      Link: http://lkml.kernel.org/r/1476259429-18279-3-git-send-email-minchan@kernel.org
      
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4855e4a7
    • mm: don't steal highatomic pageblock · 88ed365e
      Minchan Kim authored
      Patch series "use up highorder free pages before OOM", v3.
      
      I got an OOM report from the production team with a v4.4 kernel.  It had
      enough free memory but failed to allocate a GFP_KERNEL order-0 page and
      finally encountered an OOM kill.  It occurred during a QA process which
      launches several apps, switches between them and so on.  It happened
      rarely.  IOW, in a normal situation it was not a problem, but if we are
      unlucky and several apps use their peak memory at the same time, it can
      happen.  If we manage to get past that phase, the system can go on
      working well.

      I could reproduce it easily with my test (a memory spike).  Look at the
      reports below.

      The reason is that the free pages (19M) of the DMA32 zone are reserved
      for HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example, where the reserve exceeded the limit because of the race:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      During the investigation, I found some problems with highatomic, so this
      patch aims to solve them; the final goal is to unreserve every
      highatomic free page before the OOM kill.

      This patch (of 4):

      In the page freeing path, migratetype is racy, so a highorderatomic page
      could be freed into a non-highorderatomic free list.  If that page is
      allocated, the VM can change the pageblock from highorderatomic to
      something else.  In that case, highatomic pageblock accounting is broken
      and stops working (e.g., the VM cannot reserve highorderatomic
      pageblocks any more although it hasn't reached the 1% limit).

      So, this patch prohibits changing a pageblock from highatomic to another
      type.  That is no problem because MIGRATE_HIGHATOMIC is not listed in the
      fallback array, so stealing will only happen due to unexpected races,
      which are really rare.  Also, such a prohibition keeps highatomic
      pageblocks around longer, which should be better for highorderatomic
      page allocation.
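
The rule "never retype a highatomic pageblock when stealing" can be sketched like this. The enum, `struct pageblock_model` and `steal_block` are hypothetical names for a user-space model; they are not the kernel's types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical migratetype values for this model only. */
enum mt_model {
    MT_MODEL_UNMOVABLE,
    MT_MODEL_MOVABLE,
    MT_MODEL_RECLAIMABLE,
    MT_MODEL_HIGHATOMIC
};

struct pageblock_model {
    enum mt_model type;
};

/*
 * A fallback allocation of type `wanted` tries to take over the block.
 * Highatomic blocks keep their type so the reserve accounting stays valid.
 */
void steal_block(struct pageblock_model *b, enum mt_model wanted)
{
    assert(b != NULL);

    if (b->type == MT_MODEL_HIGHATOMIC)
        return; /* prohibited: leave the block reserved */
    b->type = wanted;
}
```

A movable block handed to an unmovable request changes type as before, while a highatomic block is left alone until the unreserve path releases it explicitly.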
      
      Link: http://lkml.kernel.org/r/1476259429-18279-2-git-send-email-minchan@kernel.org
      
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88ed365e
    • kmemleak: fix reference to Documentation · 22901c6c
      Andreas Platschek authored
      Documentation/kmemleak.txt was moved to Documentation/dev-tools/kmemleak.rst,
      this fixes the reference to the new location.
      
      Link: http://lkml.kernel.org/r/1476544946-18804-1-git-send-email-andreas.platschek@opentech.at
      
      
      Signed-off-by: Andreas Platschek <andreas.platschek@opentech.at>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22901c6c
    • mm/hugetlb.c: use the right pte val for compare in hugetlb_cow · 3999f52e
      Aneesh Kumar K.V authored
      We cannot use the pte value passed to set_pte_at for the pte_same
      comparison, because archs like ppc64 filter/add new pte flags in
      set_pte_at.  Instead, fetch the pte value inside hugetlb_cow.  We compare
      the pte value to make sure the pte didn't change since we dropped the
      page table lock.  hugetlb_cow gets called with the page table lock held,
      and we can take a copy of the pte value before we drop the page table
      lock.

      With hugetlbfs, we optimize the MAP_PRIVATE write fault path with no
      previous mapping (huge_pte_none entries) by forcing a cow in the fault
      path.  This avoids taking an additional fault to convert a read-only
      mapping to read/write.  Here we were comparing a recently instantiated
      pte (via set_pte_at) to the pte value from the linux page table.  As
      explained above, on ppc64 such a pte_same check returned the wrong
      result, resulting in us taking an additional fault on ppc64.
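
Why the comparison fails can be shown with a toy model. Everything here is hypothetical: `pte_model_t`, `MODEL_ARCH_FLAG`, `set_pte_model` and `pte_same_model` only imitate an architecture that adds a flag while storing the entry; they are not the kernel or ppc64 definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pte_model_t;

/* Hypothetical flag that the modelled arch adds when an entry is stored. */
#define MODEL_ARCH_FLAG (UINT64_C(1) << 62)

/* Store an entry; the arch modifies the value on the way in. */
void set_pte_model(pte_model_t *slot, pte_model_t val)
{
    assert(slot != NULL);
    *slot = val | MODEL_ARCH_FLAG;
}

/* Plain value comparison, in the spirit of a pte_same check. */
bool pte_same_model(pte_model_t a, pte_model_t b)
{
    return a == b;
}
```

Comparing the value that was passed to `set_pte_model` against the stored slot reports a mismatch even though nothing changed, whereas a copy fetched from the slot while the lock is held compares equal.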
      
      Fixes: 6a119eae ("powerpc/mm: Add a _PAGE_PTE bit")
      Link: http://lkml.kernel.org/r/20161018154245.18023-1-aneesh.kumar@linux.vnet.ibm.com
      
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Scott Wood <scottwood@freescale.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3999f52e
    • mm/gup.c: make unnecessarily global vma_permits_fault() static · 771ab430
      Tobias Klauser authored
      Make vma_permits_fault() static as it is only used in mm/gup.c
      
      This fixes a sparse warning.
      
      Link: http://lkml.kernel.org/r/20161017122353.31598-1-tklauser@distanz.ch
      
      
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      771ab430
    • mm/vmscan.c: set correct defer count for shrinker · 5f33a080
      Shaohua Li authored
      Our system uses significantly more slab memory with memcg enabled with
      the latest kernel.  With the 3.10 kernel, slab uses 2G of memory, while
      with the 4.6 kernel, 6G of memory is used.  The shrinker has a problem.
      Let's say we have two memcgs for one shrinker.  In do_shrink_slab:

      1. Check cg1.  nr_deferred = 0, assume total_scan = 700.  The batch size
         is 1024, so no memory is freed.  nr_deferred = 700.

      2. Check cg2.  nr_deferred = 700.  Assume freeable = 20, then
         total_scan = 10 or 40.  Let's assume it's 10.  No memory is freed.
         nr_deferred = 10.

      The deferred share of cg1 is lost in this case.  kswapd will free no
      memory even if it runs the above steps again and again.
      
      The fix makes sure one memcg's deferred share isn't lost.
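
The two steps above can be replayed with a simplified model. This is a sketch and not do_shrink_slab: `struct shrinker_model` and `shrink_model` are hypothetical, the clamp to half of the freeable objects is a simplification, the freeable value of 4000 for cg1 and the delta of 5 for cg2 are assumed numbers, and `keep_full_share` selects between the old bookkeeping and one possible way of preserving the deferred share.

```c
#include <assert.h>

/* Hypothetical shrinker state shared by all memcgs. */
struct shrinker_model {
    long nr_deferred;
};

/*
 * One shrink pass for one memcg. Returns the number of objects scanned.
 * keep_full_share = 0: only the clamped remainder is put back (old rule).
 * keep_full_share = 1: the unclamped remainder is put back (fixed rule).
 */
long shrink_model(struct shrinker_model *s, long delta, long freeable,
                  long batch, int keep_full_share)
{
    long total_scan, next_deferred, leftover;
    long scanned = 0;

    assert(batch > 0);

    total_scan = s->nr_deferred + delta; /* take over the deferred work */
    s->nr_deferred = 0;
    next_deferred = total_scan;          /* remember the unclamped amount */

    if (total_scan > freeable / 2)       /* simplified clamp */
        total_scan = freeable / 2;

    while (total_scan >= batch) {        /* scan only in whole batches */
        scanned += batch;
        total_scan -= batch;
    }

    leftover = keep_full_share ? next_deferred - scanned : total_scan;
    if (leftover > 0)
        s->nr_deferred += leftover;
    return scanned;
}
```

Running cg1 with a delta of 700 and then cg2 with 20 freeable objects leaves nr_deferred at 10 under the old rule, while the fixed rule keeps 705, so the work deferred on behalf of cg1 is still there for a later pass.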
      
      Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.com
      
      
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f33a080
    • mm/mprotect.c: don't touch single threaded PTEs which are on the right node · 3e321587
      Andi Kleen authored
      We had some problems with pages getting unmapped in single threaded
      affinitized processes.  It was tracked down to NUMA scanning.
      
      In this case it doesn't make any sense to unmap pages if the process is
      single threaded and the page is already on the node the process is
      running on.
      
      Add a check for this case into the numa protection code, and skip
      unmapping if true.
      
      In theory the process could be migrated later, but we will eventually
      rescan and unmap and migrate then.
      
      In theory this could be made more fancy: remembering this state per
      process or even whole mm.  However that would need extra tracking and be
      more complicated, and the simple check seems to work fine so far.
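
The check itself is small. A sketch, assuming the number of users of the address space and the two node ids are already known; `skip_numa_unmap` and its parameters are hypothetical names, not the kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether NUMA scanning should leave a page mapped: the process is
 * single threaded and the page already lives on the node it runs on.
 */
bool skip_numa_unmap(int mm_users, int page_node, int current_node)
{
    assert(mm_users >= 1);
    return mm_users == 1 && page_node == current_node;
}
```

A page on a different node, or any page of a multi-threaded process, still goes through the normal unmap path.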
      
      [ak@linux.intel.com: v3: Minor updates from Mel. Change code layout]
        Link: http://lkml.kernel.org/r/1476382117-5440-1-git-send-email-andi@firstfloor.org
      Link: http://lkml.kernel.org/r/1476288949-20970-1-git-send-email-andi@firstfloor.org
      
      
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e321587
    • mm, slab: maintain total slab count instead of active count · bf00bd34
      David Rientjes authored
      Rather than tracking the number of active slabs for each node, track the
      total number of slabs.  This is a minor improvement that avoids active
      slab tracking when a slab goes from free to partial or partial to free.
      
      For slab debugging, this also removes an explicit free count since it
      can easily be inferred by the difference in number of total objects and
      number of active objects.
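
The bookkeeping can be sketched as below. This is a model, not the slab allocator: `struct node_slab_counts` and the helpers are hypothetical, and it assumes a per-node count of free slabs is kept next to the total.

```c
#include <assert.h>

/* Hypothetical per-node counters. */
struct node_slab_counts {
    unsigned long total_slabs; /* changes only when a slab is created or destroyed */
    unsigned long free_slabs;  /* slabs with no allocated objects */
};

/* Active (full or partial) slabs are derived, not tracked. */
unsigned long active_slabs(const struct node_slab_counts *n)
{
    assert(n->free_slabs <= n->total_slabs);
    return n->total_slabs - n->free_slabs;
}

/* Free objects are likewise inferred from totals. */
unsigned long free_objects(unsigned long total_objs, unsigned long active_objs)
{
    assert(active_objs <= total_objs);
    return total_objs - active_objs;
}

/* free -> partial: the total is untouched. */
void slab_becomes_partial(struct node_slab_counts *n)
{
    assert(n->free_slabs > 0);
    n->free_slabs--;
}

/* partial -> free: the total is untouched. */
void slab_becomes_free(struct node_slab_counts *n)
{
    assert(n->free_slabs < n->total_slabs);
    n->free_slabs++;
}
```

Because the total only changes on slab creation and destruction, the free/partial transitions need no update of an active counter.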
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612042020110.115755@chino.kir.corp.google.com
      
      
      Signed-off-by: David Rientjes <rientjes@google.com>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf00bd34
    • mm, slab: faster active and free stats · f728b0a5
      Greg Thelen authored
      Reading /proc/slabinfo or monitoring slabtop(1) can become very
      expensive if there are many slab caches and if there are very lengthy
      per-node partial and/or free lists.
      
      Commit 07a63c41 ("mm/slab: improve performance of gathering slabinfo
      stats") addressed the per-node full lists which showed a significant
      improvement when no objects were freed.  This patch has the same
      motivation and optimizes the remainder of the usecases where there are
      very lengthy partial and free lists.
      
      This patch maintains per-node active_slabs (full and partial) and
      free_slabs rather than iterating the lists at runtime when reading
      /proc/slabinfo.
      
      When allocating 100GB of slab from a test cache where every slab page is
      on the partial list, reading /proc/slabinfo (includes all other slab
      caches on the system) takes ~247ms on average with 48 samples.
      
      As a result of this patch, the same read takes ~0.856ms on average.
      
      [rientjes@google.com: changelog]
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1611081505240.13403@chino.kir.corp.google.com
      
      
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f728b0a5
    • mm/slab_common.c: check kmem_create_cache flags are common · e70954fd
      Thomas Garnier authored
      Verify that kmem_cache_create flags are not allocator specific.  This is
      done before removing flags that are not available with the current
      configuration.

      The current kmem_cache_create removes incorrect flags but does not
      validate that the callers are using them right.  This change will ensure
      that callers are not trying to create caches with flags that won't be
      used because they are allocator specific.
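
The order of the two steps (validate first, then filter) can be sketched with made-up flag values. The `MODEL_*` constants and `check_cache_flags` are hypothetical, and returning -EINVAL for a rejected flag is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag bits for this model only. */
#define MODEL_FLAG_COMMON_A       0x001u
#define MODEL_FLAG_COMMON_B       0x002u
#define MODEL_FLAG_ALLOCATOR_ONLY 0x100u

/* Flags any caller may pass, whichever allocator is built in. */
#define MODEL_FLAGS_PERMITTED (MODEL_FLAG_COMMON_A | MODEL_FLAG_COMMON_B)
/* Flags that have an effect in the (assumed) current configuration. */
#define MODEL_FLAGS_AVAILABLE (MODEL_FLAG_COMMON_A)

/*
 * Reject allocator-specific flags, then drop the permitted flags that are
 * not available. On error *effective is left unchanged.
 */
int check_cache_flags(unsigned int flags, unsigned int *effective)
{
    assert(effective != NULL);

    if (flags & ~MODEL_FLAGS_PERMITTED)
        return -EINVAL;

    *effective = flags & MODEL_FLAGS_AVAILABLE;
    return 0;
}
```

Doing the validation before the filtering matters: if the mask were applied first, a caller passing an allocator-specific flag would never be told about it.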
      
      Link: http://lkml.kernel.org/r/1478553075-120242-2-git-send-email-thgarnie@google.com
      
      
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e70954fd
    • slub: avoid false-positive warning · 84582c8a
      Arnd Bergmann authored
      The slub allocator gives us some incorrect warnings when
      CONFIG_PROFILE_ANNOTATED_BRANCHES is set, as the unlikely() macro
      prevents it from seeing that the return code matches what it was before:
      
        mm/slub.c: In function `kmem_cache_free_bulk':
        mm/slub.c:262:23: error: `df.s' may be used uninitialized in this function [-Werror=maybe-uninitialized]
        mm/slub.c:2943:3: error: `df.cnt' may be used uninitialized in this function [-Werror=maybe-uninitialized]
        mm/slub.c:2933:4470: error: `df.freelist' may be used uninitialized in this function [-Werror=maybe-uninitialized]
        mm/slub.c:2943:3: error: `df.tail' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      I have not been able to come up with a perfect way for dealing with
      this, the three options I see are:
      
       - add a bogus initialization, which would increase the runtime overhead
       - replace unlikely() with unlikely_notrace()
       - remove the unlikely() annotation completely
      
      I checked the object code for a typical x86 configuration and the last
      two cases produce the same result, so I went for the last one, which is
      the simplest.
      
      Link: http://lkml.kernel.org/r/20161024155704.3114445-1-arnd@arndb.de
      
      
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84582c8a
    • Vladimir Davydov's avatar
      slub: move synchronize_sched out of slab_mutex on shrink · 89e364db
      Vladimir Davydov authored
      synchronize_sched() is a heavy operation and calling it for each cache
      owned by a memory cgroup being destroyed may take quite some time.  What
      is worse, it's currently called under the slab_mutex, stalling all work
      items doing cache creation/destruction.
      
      Actually, there isn't much point in calling synchronize_sched() for each
      cache - it's enough to call it just once - after setting cpu_partial for
      all caches and before shrinking them.  This way, we can also move it out
      of the slab_mutex, which we have to hold for iterating over the slab
      cache list.
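
      The restructuring described above can be sketched outside the kernel.
      This is a toy model in Python, not the patch itself: the dictionaries
      standing in for caches and the `sync` callback standing in for
      synchronize_sched() are assumptions made for illustration, and the
      locking is left out.

```python
def shrink_per_cache(caches, sync):
    # Old shape: one expensive wait for every cache.
    for c in caches:
        c["cpu_partial"] = 0
        sync()
        c["shrunk"] = True


def shrink_batched(caches, sync):
    # New shape: update every cache first, wait once, then shrink.
    for c in caches:
        c["cpu_partial"] = 0
    sync()  # a single wait covers all the updates above
    for c in caches:
        c["shrunk"] = True
```

      With N caches the first variant waits N times and the second waits
      once, which is the saving the commit message argues for.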
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
      Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com
      
      
      Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reported-by: Doug Smythies <dsmythies@telus.net>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89e364db
    • Vladimir Davydov's avatar
      mm: memcontrol: use special workqueue for creating per-memcg caches · 13583c3d
      Vladimir Davydov authored
      Creating a lot of cgroups at the same time might stall all worker
      threads with kmem cache creation work items, because kmem cache creation
      is done with the slab_mutex held.  The problem was amplified by commits
      801faf0d ("mm/slab: lockless decision to grow cache") in case of
      SLAB and 81ae6d03 ("mm/slub.c: replace kick_all_cpus_sync() with
      synchronize_sched() in kmem_cache_shrink()") in case of SLUB, which
      increased the maximal time the slab_mutex can be held.
      
      To prevent that from happening, let's use a special ordered single
      threaded workqueue for kmem cache creation.  This shouldn't introduce
      any functional changes regarding how kmem caches are created, as the
      work function holds the global slab_mutex during its whole runtime
      anyway, making it impossible to run more than one work item at a time.  By
      using a single threaded workqueue, we just avoid creating a thread for
      each work item.  Ordering is required to avoid a situation when a cgroup's
      work is put off indefinitely because there are other cgroups to serve,
      in other words to guarantee fairness.
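
      As a rough userspace analogy (Python, not kernel code), a one-thread
      executor behaves like an ordered single threaded workqueue: items run
      one at a time in submission order.  The names `slab_mutex`,
      `create_cache` and the cgroup labels are stand-ins chosen for this
      sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

slab_mutex = threading.Lock()  # stand-in for the global mutex
created = []


def create_cache(name):
    # Each work item holds the global lock for its whole runtime,
    # so extra worker threads could not run in parallel anyway.
    with slab_mutex:
        created.append(name)


# One worker thread: items run one at a time, in submission order.
with ThreadPoolExecutor(max_workers=1) as ordered_wq:
    for cgroup in ("cg0", "cg1", "cg2"):
        ordered_wq.submit(create_cache, cgroup)
```

      Because every item takes the same lock, a pool of many threads would
      only add idle threads; the single worker keeps first-come order, which
      is the fairness property mentioned above.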
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
      Link: http://lkml.kernel.org/r/20161004131417.GC1862@esperanza
      
      
      Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reported-by: Doug Smythies <dsmythies@telus.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      13583c3d
    • Deepa Dinamani's avatar
      ocfs2: replace CURRENT_TIME macro · c62c38f6
      Deepa Dinamani authored
      CURRENT_TIME is not y2038 safe.
      
      Use y2038 safe ktime_get_real_seconds() here for timestamps.  struct
      heartbeat_block's hb_seq and deletion time are already 64 bits wide
      and accommodate times beyond y2038.
      
      Also use y2038 safe ktime_get_real_ts64() for on disk inode timestamps.
      These are also wide enough to accommodate time64_t.
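
      For readers unfamiliar with the y2038 limit, the following Python
      sketch (an illustration only, unrelated to the ocfs2 code) shows where
      a signed 32-bit count of seconds since 1970 runs out and that a 64-bit
      field holds the next second without trouble.  The helper names are
      made up for this example.

```python
import struct
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1


def fits_in_signed_32(seconds):
    # struct.pack raises struct.error when the value needs more than 4 bytes.
    try:
        struct.pack("<i", seconds)
        return True
    except struct.error:
        return False


def fits_in_signed_64(seconds):
    # Same check against an 8-byte signed field.
    try:
        struct.pack("<q", seconds)
        return True
    except struct.error:
        return False


# Last moment representable in a signed 32-bit seconds counter.
last_32bit_moment = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)
```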
      
      Link: http://lkml.kernel.org/r/1475365298-29236-1-git-send-email-deepa.kernel@gmail.com
      
      
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c62c38f6
    • Deepa Dinamani's avatar
      ocfs2: use time64_t to represent orphan scan times · 395627b0
      Deepa Dinamani authored
      struct timespec is not y2038 safe.  Use time64_t which is y2038 safe to
      represent orphan scan times.  time64_t is sufficient here as only the
      seconds delta times are relevant.
      
      Also use appropriate time functions that return time in time64_t format.
      Time functions now return monotonic time instead of real time as only
      delta scan times are relevant and these values are not persistent across
      reboots.
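
      The reasoning for preferring monotonic time when only deltas matter
      can be shown with a small Python sketch.  It uses Python's
      time.monotonic() as an analogy for the kernel's monotonic clock; the
      function and variable names are assumptions made for this example.

```python
import time


def seconds_since(start):
    # A monotonic clock never goes backwards, so the difference cannot be
    # negative even if the wall clock is adjusted in between.
    return time.monotonic() - start


last_scan = time.monotonic()
elapsed = seconds_since(last_scan)
```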
      
      The format string for the debug print is still using long as this is
      only the time elapsed since the last scan and long is sufficient to
      represent this value.
      
      Link: http://lkml.kernel.org/r/1475365138-20567-1-git-send-email-deepa.kernel@gmail.com
      
      
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      395627b0
    • Ashish Samant's avatar
      ocfs2: fix double put of refcount tree in ocfs2_lock_refcount_tree() · 4131d538
      Ashish Samant authored
      In ocfs2_lock_refcount_tree, if ocfs2_read_refcount_block() returns an
      error, we do ocfs2_refcount_tree_put twice (once in
      ocfs2_unlock_refcount_tree and once outside it), thereby reducing the
      refcount of the refcount tree twice, but we don't delete the tree in this
      case.  This will make refcnt of the tree = 0 and the
      ocfs2_refcount_tree_put will eventually call ocfs2_mark_lockres_freeing,
      setting OCFS2_LOCK_FREEING for the refcount_tree->rf_lockres.
      
      The error returned by ocfs2_read_refcount_block is propagated all the
      way back and for next iteration of write, ocfs2_lock_refcount_tree gets
      the same tree back from ocfs2_get_refcount_tree because we haven't
      deleted the tree.  Now we have the same tree, but OCFS2_LOCK_FREEING is
      set for rf_lockres and eventually, when _ocfs2_lock_refcount_tree is
      called in this iteration, BUG_ON( __ocfs2_cluster_lock:1395 ERROR:
      Cluster lock called on freeing lockres T00000000000000000386019775b08d!
      flags 0x81) is triggered.
      
      Call stack:
      
        (loop16,11155,0):ocfs2_lock_refcount_tree:482 ERROR: status = -5
        (loop16,11155,0):ocfs2_refcount_cow_hunk:3497 ERROR: status = -5
        (loop16,11155,0):ocfs2_refcount_cow:3560 ERROR: status = -5
        (loop16,11155,0):ocfs2_prepare_inode_for_refcount:2111 ERROR: status = -5
        (loop16,11155,0):ocfs2_prepare_inode_for_write:2190 ERROR: status = -5
        (loop16,11155,0):ocfs2_file_write_iter:2331 ERROR: status = -5
        (loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: bug expression:
        lockres->l_flags & OCFS2_LOCK_FREEING
      
        (loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: Cluster lock called on
        freeing lockres T00000000000000000386019775b08d! flags 0x81
      
        kernel BUG at fs/ocfs2/dlmglue.c:1395!
      
        invalid opcode: 0000 [#1] SMP  CPU 0
        Modules linked in: tun ocfs2 jbd2 xen_blkback xen_netback xen_gntdev .. sd_mod crc_t10dif ext3 jbd mbcache
        RIP: __ocfs2_cluster_lock+0x31c/0x740 [ocfs2]
        RSP: e02b:ffff88017c0138a0  EFLAGS: 00010086
        Process loop16 (pid: 11155, threadinfo ffff88017c010000, task ffff8801b5374300)
        Call Trace:
           ocfs2_refcount_lock+0xae/0x130 [ocfs2]
           __ocfs2_lock_refcount_tree+0x29/0xe0 [ocfs2]
           ocfs2_lock_refcount_tree+0xdd/0x320 [ocfs2]
           ocfs2_refcount_cow_hunk+0x1cb/0x440 [ocfs2]
           ocfs2_refcount_cow+0xa9/0x1d0 [ocfs2]
           ocfs2_prepare_inode_for_refcount+0x115/0x200 [ocfs2]
           ocfs2_prepare_inode_for_write+0x33b/0x470 [ocfs2]
           ocfs2_file_write_iter+0x220/0x8c0 [ocfs2]
           aio_write_iter+0x2e/0x30
      
      Fix this by avoiding the second call to ocfs2_refcount_tree_put().
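
      The failure mode can be modelled in a few lines of Python.  This is a
      toy model under stated assumptions, not the ocfs2 code: the class
      `RefcountTree`, its `freeing` flag (standing in for
      OCFS2_LOCK_FREEING) and the `buggy` switch are invented for
      illustration.

```python
class RefcountTree:
    """Toy stand-in for a refcounted object; not the ocfs2 structure."""

    def __init__(self):
        self.refs = 1         # reference held by the lookup cache
        self.freeing = False  # models the "freeing" flag on the lock resource

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freeing = True


def lock_tree(tree, read_ok, buggy):
    """Return True on success; on failure, undo the reference taken here."""
    tree.get()
    if read_ok:
        return True
    tree.put()      # the unlock helper drops the reference taken above
    if buggy:
        tree.put()  # second put: also drops the cache's own reference
    return False
```

      In the buggy variant a failed read leaves the tree marked as freeing
      while it is still reachable through the cache, so the next lookup
      hands back an object that must no longer be locked.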
      
      Link: http://lkml.kernel.org/r/1473984404-32011-1-git-send-email-ashish.samant@oracle.com
      
      
      Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
      Reviewed-by: Eric Ren <zren@suse.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4131d538
    • piaojun's avatar
      ocfs2: clean up unused 'page' parameter in ocfs2_write_end_nolock() · 07f38d97
      piaojun authored
      'page' parameter in ocfs2_write_end_nolock() is never used.
      
      Link: http://lkml.kernel.org/r/582FD91A.5000902@huawei.com
      
      
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07f38d97
    • piaojun's avatar
      ocfs2/dlm: clean up deadcode in dlm_master_request_handler() · 28bb5ef4
      piaojun authored
      When 'dispatch_assert' is set, 'response' must be DLM_MASTER_RESP_YES,
      and 'res' won't be null, so execution can't reach these two branches.
      
      Link: http://lkml.kernel.org/r/58174C91.3040004@huawei.com
      
      
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28bb5ef4
    • Guozhonghua's avatar
      ocfs2: delete redundant code and set the node bit into maybe_map directly · aa7b5859
      Guozhonghua authored
      The variable `set_maybe' is redundant when the mle has been found in the
      map.  So it is ok to set the node_idx into mle's maybe_map directly.
      
      Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4A3D490DD@H3CMLB12-EX.srv.huawei-3com.com
      
      
      Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
      Reviewed-by: Mark Fasheh <mfasheh@versity.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa7b5859
    • piaojun's avatar
      ocfs2/dlm: clean up useless BUG_ON default case in dlm_finalize_reco_handler() · 46832b2d
      piaojun authored
      The value of 'stage' must be either 1 or 2, so the switch can't reach
      the default case.
      
      Link: http://lkml.kernel.org/r/57FB5EB2.7050002@huawei.com
      
      
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46832b2d
    • Sudip Mukherjee's avatar
      drivers/pcmcia/m32r_pcc.c: check return from add_pcc_socket · 3da82065
      Sudip Mukherjee authored
      If request_irq() fails it passes the error to the caller.  The caller
      now checks it and jumps to the common error path on failure.
      
      Link: http://lkml.kernel.org/r/1474237304-897-3-git-send-email-sudipm.mukherjee@gmail.com
      
      
      Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3da82065
    • Sudip Mukherjee's avatar
      drivers/pcmcia/m32r_pcc.c: check return from request_irq · 4170a20f
      Sudip Mukherjee authored
      While building m32r allmodconfig we were getting this warning:
      
        drivers/pcmcia/m32r_pcc.c:331:2: warning: ignoring return value of 'request_irq', declared with attribute warn_unused_result
      
      request_irq() can fail and we should always be checking the result from
      it. Check the result and return it to the caller.
      
      Link: http://lkml.kernel.org/r/1474237304-897-1-git-send-email-sudipm.mukherjee@gmail.com
      
      
      Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4170a20f
    • Sudip Mukherjee's avatar
      m32r: fix build warning · 17e96230
      Sudip Mukherjee authored
      While building m32r defconfig we got warnings:
      
        arch/m32r/platforms/m32700ut/setup.c:249:24: warning: 'm32700ut_lcdpld_irq_type' defined but not used [-Wunused-variable]
      
      m32700ut_lcdpld_irq_type is only used when CONFIG_USB is enabled.
      Modify the code to declare the related variables and functions only when
      CONFIG_USB is enabled.
      
      Link: http://lkml.kernel.org/r/1479244406-7507-1-git-send-email-sudipm.mukherjee@gmail.com
      
      
      Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17e96230
    • Sudip Mukherjee's avatar
      m32r: add simple dma · eb17726b
      Sudip Mukherjee authored
      Some builds of m32r were failing because they tried to build a few
      drivers which need DMA, but m32r has no DMA support.  Objections were
      raised to making those drivers depend on HAS_DMA, so the next best
      thing is to add DMA support to m32r.  dma_noop is a very simple DMA
      implementation with 1:1 memory mapping.
      
      Link: http://lkml.kernel.org/r/1475949198-31623-1-git-send-email-sudipm.mukherjee@gmail.com
      
      
      Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb17726b
    • Sam Protsenko's avatar
      scripts/tags.sh: handle OMAP platforms properly · 779d5eb3
      Sam Protsenko authored
      When SUBARCH is "omap1" or "omap2", plat-omap/ directory must be
      indexed.  Handle this special case properly.
      
      While at it, check if mach- directory exists at all.
      
      Link: http://lkml.kernel.org/r/20161202122148.15001-1-joe.skb7@gmail.com
      
      
      Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
      Cc: Michal Marek <mmarek@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      779d5eb3
    • Alexey Dobriyan's avatar
      scripts/bloat-o-meter: compile .NUMBER regex · 0d7bbb43
      Alexey Dobriyan authored
      Any frequently used regex is better off compiled in Python.
      
      Speedup is about 9.8% (whee!)
      
          $ perf stat -r 16 taskset -c 15 ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux >/dev/null
          7.091202853 seconds time elapsed                         ( +-  0.15% )
      
          +re.compile
          6.397564973 seconds time elapsed                         ( +-  0.34% )
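
      A minimal sketch of the idea, not the actual patch: the pattern and
      the helper names below are assumptions, chosen to match the ".NUMBER"
      suffix named in the commit title.

```python
import re

# Assumed pattern: a dot followed by digits, e.g. the ".1234" in "foo.1234".
RE_NUMBER = re.compile(r"\.[0-9]+")


def strip_number_uncompiled(name):
    # Passes the pattern as a string, so it is looked up again on every call.
    return re.sub(r"\.[0-9]+", "", name)


def strip_number_compiled(name):
    # Reuses the pattern object that was compiled once at import time.
    return RE_NUMBER.sub("", name)
```

      Both helpers return the same result; the compiled form only skips the
      per-call pattern lookup, which adds up when it runs once per symbol.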
      
      Link: http://lkml.kernel.org/r/20161119004417.GB1200@avx2
      
      
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d7bbb43
    • Alexey Dobriyan's avatar
      scripts/bloat-o-meter: don't use readlines() · 3af06fd9
      Alexey Dobriyan authored
      readlines() builds the whole list before doing anything, which is slower
      for big object files.  Use a per-line iterator instead.
      
      Speed up is ~2% on "allyesconfig" type of kernel.
      
          $ perf stat -r 16 taskset -c 15 ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux >/dev/null
      	...
      
        Before:  7.247708646 seconds time elapsed                ( +-  0.28% )
        After:   7.091202853 seconds time elapsed                ( +-  0.15% )
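
      The difference between the two styles can be sketched as follows.
      This is an illustration under assumptions, not the script's code: the
      sample input imitates "hex size, type, name" lines, and the function
      names are invented for the example.

```python
import io


def total_size_readlines(f):
    # readlines() materialises every line in a list before the loop starts.
    total = 0
    for line in f.readlines():
        size, _type, _name = line.split()
        total += int(size, 16)
    return total


def total_size_iter(f):
    # Iterating the file object yields one line at a time.
    total = 0
    for line in f:
        size, _type, _name = line.split()
        total += int(size, 16)
    return total


# Hypothetical input: hex size, symbol type, symbol name.
SAMPLE = "0000000a t foo\n00000010 T bar\n"
```

      Both functions compute the same total; only the second avoids holding
      the whole input in memory at once.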
      
      Link: http://lkml.kernel.org/r/20161119004143.GA1200@avx2
      
      
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3af06fd9
    • Stanislav Kinsburskiy's avatar
      prctl: remove one-shot limitation for changing exe link · 3fb4afd9
      Stanislav Kinsburskiy authored
      This limitation came with the reason to remove "another way for
      malicious code to obscure a compromised program and masquerade as a
      benign process" by allowing "security-conscious program can use this
      prctl once during its early initialization to ensure the prctl cannot
      later be abused for this purpose":
      
          http://marc.info/?l=linux-kernel&m=133160684517468&w=2
      
      This explanation doesn't look sufficient.  The only thing "exe" link is
      indicating is the file, used to execve, which is basically nothing and
      not reliable immediately after process has returned from execve system
      call.
      
      Moreover, to use this feature, all the mappings to the previous exe file
      have to be unmapped and all the new exe file permissions must be satisfied.
      
      This means that changing the exe link is very similar to calling execve on
      the binary.
      
      The need to remove this limitation comes from migration of an NFS mount
      point, which is not accessible during restore and is replaced by another
      file system.  Because of this, the exe link has to be changed twice.
      
      [akpm@linux-foundation.org: fix up comment]
      Link: http://lkml.kernel.org/r/20160927153755.9337.69650.stgit@localhost.localdomain
      
      
      Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fb4afd9
    • Nicolas Iooss's avatar
      kthread: add __printf attributes · c0b942a7
      Nicolas Iooss authored
      When commit fbae2d44 ("kthread: add kthread_create_worker*()")
      introduced some kthread_create_...() functions which were taking
      printf-like parameters, it introduced __printf attributes to some
      functions (e.g.  kthread_create_worker()).  Nevertheless some new
      functions were forgotten (they have been detected thanks to
      -Wmissing-format-attribute warning flag).
      
      Add the missing __printf attributes to the newly-introduced functions in
      order to detect formatting issues at build-time with -Wformat flag.
      
      Link: http://lkml.kernel.org/r/20161126193543.22672-1-nicolas.iooss_linux@m4x.org
      
      
      Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0b942a7
    • Linus Torvalds's avatar
      Merge branch 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · df5f0f0a
      Linus Torvalds authored
      Pull x86 RAS updates from Ingo Molnar:
       "The main changes in this development cycle were:
      
         - more AMD northbridge support work, mostly in preparation for Fam17h
           CPUs (Yazen Ghannam, Borislav Petkov)
      
         - cleanups/refactorings and fixes (Borislav Petkov, Tony Luck,
           Yinghai Lu)"
      
      * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mce: Include the PPIN in MCE records when available
        x86/mce/AMD: Add system physical address translation for AMD Fam17h
        x86/amd_nb: Add SMN and Indirect Data Fabric access for AMD Fam17h
        x86/amd_nb: Add Fam17h Data Fabric as "Northbridge"
        x86/amd_nb: Make all exports EXPORT_SYMBOL_GPL
        x86/amd_nb: Make amd_northbridges internal to amd_nb.c
        x86/mce/AMD: Reset Threshold Limit after logging error
        x86/mce/AMD: Fix HWID_MCATYPE calculation by grouping arguments
        x86/MCE: Correct TSC timestamping of error records
        x86/RAS: Hide SMCA bank names
        x86/RAS: Rename smca_bank_names to smca_names
        x86/RAS: Simplify SMCA HWID descriptor struct
        x86/RAS: Simplify SMCA bank descriptor struct
        x86/MCE: Dump MCE to dmesg if no consumers
        x86/RAS: Add TSC timestamp to the injected MCE
        x86/MCE: Do not look at panic_on_oops in the severity grading
      df5f0f0a
    • Linus Torvalds's avatar
      Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · cbaa1576
      Linus Torvalds authored
      Pull hotplug API fix from Ingo Molnar:
       "Late breaking fix from the v4.9 cycle: fix a hotplug register/
        unregister notifier API asymmetry bug that can cause kernel warnings
        (and worse) with certain Kconfig combinations"
      
      * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        hotplug: Make register and unregister notifier API symmetric
      cbaa1576
    • Linus Torvalds's avatar
      Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 92c020d0
      Linus Torvalds authored
      Pull scheduler updates from Ingo Molnar:
       "The main scheduler changes in this cycle were:
      
         - support Intel Turbo Boost Max Technology 3.0 (TBM3) by introducing a
           notion of 'better cores', which the scheduler will prefer to
           schedule single threaded workloads on. (Tim Chen, Srinivas
           Pandruvada)
      
         - enhance the handling of asymmetric capacity CPUs further (Morten
           Rasmussen)
      
         - improve/fix load handling when moving tasks between task groups
           (Vincent Guittot)
      
         - simplify and clean up the cputime code (Stanislaw Gruszka)
      
         - improve mass fork()ed task spread a.k.a. hackbench speedup (Vincent
           Guittot)
      
         - make struct kthread kmalloc()ed and related fixes (Oleg Nesterov)
      
         - add uaccess atomicity debugging (when using access_ok() in the
           wrong context), under CONFIG_DEBUG_ATOMIC_SLEEP=y (Peter Zijlstra)
      
         - implement various fixes, cleanups and other enhancements (Daniel
           Bristot de Oliveira, Martin Schwidefsky, Rafael J. Wysocki)"
      
      * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
        sched/core: Use load_avg for selecting idlest group
        sched/core: Fix find_idlest_group() for fork
        kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
        kthread: Don't use to_live_kthread() in kthread_[un]park()
        kthread: Don't use to_live_kthread() in kthread_stop()
        Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
        kthread: Make struct kthread kmalloc'ed
        x86/uaccess, sched/preempt: Verify access_ok() context
        sched/x86: Make CONFIG_SCHED_MC_PRIO=y easier to enable
        sched/x86: Change CONFIG_SCHED_ITMT to CONFIG_SCHED_MC_PRIO
        x86/sched: Use #include <linux/mutex.h> instead of #include <asm/mutex.h>
        cpufreq/intel_pstate: Use CPPC to get max performance
        acpi/bus: Set _OSC for diverse core support
        acpi/bus: Enable HWP CPPC objects
        x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
        x86/sysctl: Add sysctl for ITMT scheduling feature
        x86: Enable Intel Turbo Boost Max Technology 3.0
        x86/topology: Define x86's arch_update_cpu_topology
        sched: Extend scheduler's asym packing
        sched/fair: Clean up the tunable parameter definitions
        ...
      92c020d0
    • Linus Torvalds's avatar
      Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · bca13ce4
      Linus Torvalds authored
      Pull perf updates from Ingo Molnar:
       "This update is pretty big and almost exclusively includes tooling
        changes, because v4.9's LTS status forced to completion most of the
        pending kernel side hardware enablement work and because we tried to
        freeze core perf work a bit to give a time window for the fuzzing
        efforts.
      
        The diff is large mostly due to the JSON hardware event tables added
        for Intel and Power8 CPUs. This was a popular feature request from
        people working close to hardware and from the HPC community.
      
        Tree size is big because this added the CPU event tables for over a
        decade of Intel CPUs. Future changes for a CPU vendor already supported
        should be much smaller, as events for new models are added. The new
        events are listed in 'perf list', for the CPU model the tool is
        running on. If you find an interesting event it can be used as-is:
      
            $ perf stat -a -e l2_lines_out.pf_clean sleep 1
      
            Performance counter stats for 'system wide':
      
                  7,860,403      l2_lines_out.pf_clean
      
                 1.000624918 seconds time elapsed
      
        The event lists can be searched in the usual 'perf list' fashion for
        (case insensitive) substrings as well:
      
            $ perf list l2_lines_out
      
            List of pre-defined events (to be used in -e):
      
            cache:
              l2_lines_out.demand_clean
                   [Clean L2 cache lines evicted by demand]
              l2_lines_out.demand_dirty
                   [Dirty L2 cache lines evicted by demand]
              l2_lines_out.dirty_all
                   [Dirty L2 cache lines filling the L2]
              l2_lines_out.pf_clean
                   [Clean L2 cache lines evicted by L2 prefetch]
              l2_lines_out.pf_dirty
                   [Dirty L2 cache lines evicted by L2 prefetch]
      
        etc.
      
        There are a few high level categories as well that can be listed:
        'cache', 'floating point', 'frontend', 'memory', 'pipeline', 'virtual
        memory'.
      
        Existing generic events and workflows should work as-is.
      
        The only kernel side change is a late breaking fix for an older
        regression, related to Intel BTS, LBR and PT feature interaction.
      
        On the tooling side there are three new tools / major features:
      
         - The new 'perf c2c' tool provides means for Shared Data C2C/HITM
           analysis.
      
           This allows you to track down cacheline contention. The tool is
           based on x86's load latency and precise store facility events
           provided by Intel CPUs.
      
           It was tested by Joe Mario and has proven to be useful, finding
           some cacheline contentions. Joe also wrote a blog about c2c tool
           with examples:
      
              https://joemario.github.io/blog/2016/09/01/c2c-blog/
      
           excerpt of the content on this site:
      
               At a high level, “perf c2c” will show you:
      
                * The cachelines where false sharing was detected.
                * The readers and writers to those cachelines, and the offsets where those accesses occurred.
                * The pid, tid, instruction addr, function name, binary object name for those readers and writers.
                * The source file and line number for each reader and writer.
                * The average load latency for the loads to those cachelines.
                * Which numa nodes the samples a cacheline came from and which CPUs were involved.
      
               Using perf c2c is similar to using the Linux perf tool today.
               First collect data with “perf c2c record”, then generate a
               report output with “perf c2c report”
      
           The post gives extensive details on using the tool, with tips on
           reducing the volume of samples while still capturing enough for
           the analysis. (Dick Fowles, Joe Mario, Don Zickus, Jiri Olsa)
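
           To illustrate the kind of layout problem that a false-sharing
           report points at, here is a minimal sketch in Python using
           ctypes to inspect field offsets. It is not part of perf; the
           struct names, the field names and the 64-byte line size are
           assumptions made for the example.

```python
import ctypes

CACHELINE_BYTES = 64  # assumed line size; common on x86, not universal


class Counters(ctypes.Structure):
    # Two counters that different threads would update independently.
    _fields_ = [
        ("reader_hits", ctypes.c_uint64),
        ("writer_hits", ctypes.c_uint64),
    ]


class PaddedCounters(ctypes.Structure):
    # Same counters, padded so that each one starts on its own line.
    _fields_ = [
        ("reader_hits", ctypes.c_uint64),
        ("_pad", ctypes.c_uint8 * (CACHELINE_BYTES - 8)),
        ("writer_hits", ctypes.c_uint64),
    ]


def cacheline_index(struct_type, field_name):
    """Cacheline (relative to the struct start) in which a field begins."""
    return getattr(struct_type, field_name).offset // CACHELINE_BYTES


def share_cacheline(struct_type, field_a, field_b):
    """True if both fields begin in the same line of a line-aligned struct."""
    return (cacheline_index(struct_type, field_a)
            == cacheline_index(struct_type, field_b))
```

           In the unpadded struct both counters begin in line 0, so
           writes to one can invalidate the line holding the other; in
           the padded struct the second counter begins at offset 64.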
      
         - The new 'perf sched timehist' tool provides tailored analysis of
           scheduling events.
      
           Example usage:
      
                perf sched record -- sleep 1
                perf sched timehist
      
           By default it shows the individual schedule events, including the
           wait time (time between sched-out and next sched-in events for the
           task), the task scheduling delay (time between wakeup and actually
           running) and run time for the task:
      
                  time    cpu  task name         wait time  sch delay  run time
                               [tid/pid]            (msec)     (msec)    (msec)
              -------- ------  ----------------  ---------  ---------  --------
              1.874569 [0011]  gcc[31949]            0.014      0.000     1.148
              1.874591 [0010]  gcc[31951]            0.000      0.000     0.024
              1.874603 [0010]  migration/10[59]      3.350      0.004     0.011
              1.874604 [0011]  <idle>                1.148      0.000     0.035
              1.874723 [0005]  <idle>                0.016      0.000     1.383
              1.874746 [0005]  gcc[31949]            0.153      0.078     0.022
            ...
      
           Times are in msec.usec. (David Ahern, Namhyung Kim)
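
           The three columns can be derived from four timestamps per run
           period, following the definitions given above. The sketch
           below is only an illustration of that arithmetic, not perf's
           implementation; the RunPeriod type, its field names and the
           sample timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunPeriod:
    # All timestamps are integer microseconds (hypothetical input format).
    prev_sched_out: int  # when the task last left the CPU
    wakeup: int          # when the task was woken up
    sched_in: int        # when the task actually started running
    sched_out: int       # when the task left the CPU again


def timehist_columns(period):
    """Return (wait time, sch delay, run time) in msec for one run period."""
    wait = (period.sched_in - period.prev_sched_out) / 1000
    delay = (period.sched_in - period.wakeup) / 1000
    run = (period.sched_out - period.sched_in) / 1000
    return wait, delay, run


# Made-up timestamps: off the CPU at 1000 us, woken at 1400 us,
# running from 1450 us until 2000 us.
print(timehist_columns(RunPeriod(1000, 1400, 1450, 2000)))
```

           With these made-up timestamps the task waited 0.45 msec, of
           which 0.05 msec was scheduling delay, and then ran for
           0.55 msec.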
      
         - Add CPU vendor hardware event tables:
      
           Add JSON files with vendor event naming for Intel and Power8
           processors, allowing users of tools like oprofile to keep using the
           event names they are used to, as well as people reading vendor
           documentation, where such naming is used. (Andi Kleen, Sukadev
           Bhattiprolu)
      
           You should see all the new events with 'perf list' and you should
           be able to search them; for example, 'perf list miss' will list
           the myriad miss events.
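
           The search behaviour described here, together with the case
           insensitive matching of vendor event names mentioned in this
           pull, can be modelled roughly as below. This is a sketch
           only: a plain substring match is assumed, perf's own matching
           rules may differ, and the event names in the usage lines are
           sample input rather than output of 'perf list'.

```python
def search_events(event_names, pattern):
    """Return the names that contain pattern, ignoring case."""
    needle = pattern.casefold()
    return [name for name in event_names if needle in name.casefold()]


# Sample names for illustration only.
events = [
    "LONGEST_LAT_CACHE.MISS",
    "LONGEST_LAT_CACHE.REFERENCE",
    "cache-misses",
    "instructions",
]
print(search_events(events, "miss"))
print(search_events(events, "longest_lat") == search_events(events, "LONGEST_LAT"))
```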
      
        Other tooling features added were:
      
         - Cross-arch annotation support:
      
           o Improve ARM support in the annotation code, affecting 'perf
             annotate', 'perf report' and live annotation in 'perf top' (Kim
             Phillips)
      
           o Initial support for PowerPC in the annotation code (Ravi
             Bangoria)
      
           o Support AArch64 in the 'annotate' code, native/local and
             cross-arch/remote (Kim Phillips)
      
         - Allow considering just events in a given time interval, via the
           '--time start.s.ms,end.s.ms' command line, added to 'perf kmem',
           'perf report', 'perf sched timehist' and 'perf script' (David
           Ahern)
      
         - Add option to stop printing a callchain at one of a given group of
           symbol names (David Ahern)
      
         - Track memory freed in 'perf kmem stat' (David Ahern)
      
         - Allow querying and setting .perfconfig variables (Taeung Song)
      
         - Show branch information in callchains (predicted, TSX aborts, loop
           iterations, etc) (Jin Yao)
      
         - Dynamically change verbosity level by pressing 'V' in the 'perf
           top/report' hists TUI browser (Alexis Berlemont)
      
         - Implement 'perf trace --delay' in the same fashion as in 'perf
           record --delay', to skip sampling workload initialization events
           (Alexis Berlemont)
      
         - Make vendor named events case insensitive in 'perf list', i.e.
           'perf list LONGEST_LAT' works just the same as 'perf list
           longest_lat' (Andi Kleen)
      
         - Add unwinding support for jitdump (Stefano Sanfilippo)
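
        The time interval filter listed above amounts to keeping only the
        samples whose timestamp falls between two bounds. A minimal
        sketch of that idea follows; it is not perf's parser. The
        function names are hypothetical, timestamps are taken as
        fractional seconds, and treating an empty side of the comma as
        "unbounded" is an assumption of the sketch.

```python
def parse_time_range(spec):
    """Parse 'start,end' (fractional seconds) into a (start, end) pair.

    An empty side is treated as unbounded (an assumption of this sketch).
    """
    start_text, _, end_text = spec.partition(",")
    start = float(start_text) if start_text else float("-inf")
    end = float(end_text) if end_text else float("inf")
    if start > end:
        raise ValueError("start must not be after end")
    return start, end


def filter_samples(samples, spec):
    """Keep (timestamp, payload) samples whose timestamp is inside spec."""
    start, end = parse_time_range(spec)
    return [sample for sample in samples if start <= sample[0] <= end]


# Made-up samples: (timestamp in seconds, task name).
samples = [(1.874569, "gcc"), (1.874603, "migration/10"), (1.874723, "<idle>")]
print(filter_samples(samples, "1.874600,1.874700"))
```

        Both bounds are inclusive here, which is a design choice of the
        sketch rather than a statement about perf.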
      
        Tooling infrastructure changes:
      
         - Support linking perf with clang and LLVM libraries. Linking is
           initially static, but this limitation will be lifted: shared
           libraries, when available, will be preferred to the static
           build, which, as with other features, should be enabled
           explicitly (Wang Nan)
      
         - Add initial support (and perf test entry) for tooling hooks,
           starting with 'record_start' and 'record_end', that will have as
           its initial user the eBPF infrastructure, where perf_ prefixed
           functions will be JITed and run when such hooks are called (Wang
           Nan)
      
         - Implement assorted libbpf improvements (Wang Nan)
      
        ... and lots of other changes, features, cleanups and refactorings I
        did not list, see the shortlog and the git log for details"
      
      * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (220 commits)
        perf/x86: Fix exclusion of BTS and LBR for Goldmont
        perf tools: Explicitly document that --children is enabled by default
        perf sched timehist: Cleanup idle_max_cpu handling
        perf sched timehist: Handle zero sample->tid properly
        perf callchain: Introduce callchain_cursor__copy()
        perf sched: Cleanup option processing
        perf sched timehist: Improve error message when analyzing wrong file
        perf tools: Move perf build related variables under non fixdep leg
        perf tools: Force fixdep compilation at the start of the build
        perf tools: Move PERF-VERSION-FILE target into rules area
        perf build: Check LLVM version in feature check
        perf annotate: Show raw form for jump instruction with indirect target
        perf tools: Add non config targets
        perf tools: Cleanup build directory before each test
        perf tools: Move python/perf.so target into rules area
        perf tools: Move install-gtk target into rules area
        tools build: Move tabs to spaces where suitable
        tools build: Make the .cmd file more readable
        perf clang: Compile BPF script using builtin clang support
        perf clang: Support compile IR to BPF object and add testcase
        ...
      bca13ce4
    • Merge branch 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 0719dbf5
      Linus Torvalds authored
      Pull mm/PAT cleanup from Ingo Molnar:
       "A single cleanup for a generic interface that was originally
        introduced for PAT"
      
      * 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/pat, mm: Make track_pfn_insert() return void
      0719dbf5
    • Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 6cdf89b1
      Linus Torvalds authored
      Pull locking updates from Ingo Molnar:
       "The tree got pretty big in this development cycle, but the net effect
        is pretty good:
      
          115 files changed, 673 insertions(+), 1522 deletions(-)
      
        The main changes were:
      
         - Rework and generalize the mutex code to remove per arch mutex
           primitives. (Peter Zijlstra)
      
         - Add vCPU preemption support: add an interface to query the
           preemption status of vCPUs and use it in locking primitives - this
           optimizes paravirt performance. (Pan Xinhui, Juergen Gross,
           Christian Borntraeger)
      
         - Introduce cpu_relax_yield() and remove cpu_relax_lowlatency() to
           clean up and improve the s390 lock yielding machinery and its core
           kernel impact. (Christian Borntraeger)
      
         - Micro-optimize mutexes some more. (Waiman Long)
      
         - Reluctantly add the to-be-deprecated mutex_trylock_recursive()
           interface on a temporary basis, to give the DRM code more time to
           get rid of its locking hacks. Any other users will be NAK-ed on
           sight. (We turned off the deprecation warning for the time being to
           not pollute the build log.) (Peter Zijlstra)
      
         - Improve the rtmutex code a bit, in light of recent long lived
           bugs/races. (Thomas Gleixner)
      
         - Misc fixes, cleanups"
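
      The vCPU preemption item above can be pictured with a toy model: a
      waiter spins while the lock owner is running, but gives up
      spinning as soon as the owner's vCPU is reported as preempted,
      since the owner cannot release the lock until it runs again. This
      is a userspace Python sketch, not kernel code; ToyLock,
      optimistic_spin and the callback parameters are hypothetical names
      that stand in for the kernel's lock and its preemption query.

```python
class ToyLock:
    """Toy lock that only records which CPU its owner runs on (None = free)."""

    def __init__(self, owner_cpu=None):
        self.owner_cpu = owner_cpu


def optimistic_spin(lock, vcpu_is_preempted, on_spin=lambda: None,
                    max_spins=1000):
    """Spin while the owner runs; return what the waiter should do next.

    vcpu_is_preempted is a caller-supplied predicate taking a CPU number.
    on_spin is called once per spin iteration (used here only for testing).
    """
    for _ in range(max_spins):
        owner_cpu = lock.owner_cpu
        if owner_cpu is None:
            return "try_acquire"   # lock was released while we were spinning
        if vcpu_is_preempted(owner_cpu):
            return "sleep"         # owner is not running; spinning is wasted
        on_spin()
    return "sleep"                 # spun long enough; stop burning CPU


lock = ToyLock(owner_cpu=3)
print(optimistic_spin(lock, lambda cpu: cpu == 3))
```

      Without the preemption check the waiter in this model would spin
      for the full max_spins budget before sleeping.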
      
      * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
        x86/paravirt: Fix bool return type for PVOP_CALL()
        x86/paravirt: Fix native_patch()
        locking/ww_mutex: Use relaxed atomics
        locking/rtmutex: Explain locking rules for rt_mutex_proxy_unlock()/init_proxy_locked()
        locking/rtmutex: Get rid of RT_MUTEX_OWNER_MASKALL
        x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()
        locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted
        locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()
        Documentation/virtual/kvm: Support the vCPU preemption check
        x86/xen: Support the vCPU preemption check
        x86/kvm: Support the vCPU preemption check
        x86/kvm: Support the vCPU preemption check
        kvm: Introduce kvm_write_guest_offset_cached()
        locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests
        locking/spinlocks, s390: Implement vcpu_is_preempted(cpu)
        locking/core, powerpc: Implement vcpu_is_preempted(cpu)
        sched/core: Introduce the vcpu_is_preempted(cpu) interface
        sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_Q
        locking/core: Provide common cpu_relax_yield() definition
        locking/mutex: Don't mark mutex_trylock_recursive() as deprecated, temporarily
        ...
      6cdf89b1
    • Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3940cf0b
      Linus Torvalds authored
      Pull EFI updates from Ingo Molnar:
       "The main changes in this development cycle were:
      
         - Implement an EFI device path parser and other changes to fully
           support Thunderbolt devices on Apple MacBooks (Lukas Wunner)
      
         - Add RNG seeding via the EFI stub, on ARM/arm64 (Ard Biesheuvel)
      
         - Expose EFI framebuffer configuration to user-space, to improve
           tooling (Peter Jones)
      
         - Misc fixes and cleanups (Ivan Hu, Wei Yongjun, Yisheng Xie, Dan
           Carpenter, Roy Franz)"
      
      * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi/libstub: Make efi_random_alloc() allocate below 4 GB on 32-bit
        thunderbolt: Compile on x86 only
        thunderbolt, efi: Fix Kconfig dependencies harder
        thunderbolt, efi: Fix Kconfig dependencies
        thunderbolt: Use Device ROM retrieved from EFI
        x86/efi: Retrieve and assign Apple device properties
        efi: Allow bitness-agnostic protocol calls
        efi: Add device path parser
        efi/arm*/libstub: Invoke EFI_RNG_PROTOCOL to seed the UEFI RNG table
        efi/libstub: Add random.c to ARM build
        efi: Add support for seeding the RNG from a UEFI config table
        MAINTAINERS: Add ARM and arm64 EFI specific files to EFI subsystem
        efi/libstub: Fix allocation size calculations
        efi/efivar_ssdt_load: Don't return success on allocation failure
        efifb: Show framebuffer layout as device attributes
        efi/efi_test: Use memdup_user() as a cleanup
        efi/efi_test: Fix uninitialized variable 'rv'
        efi/efi_test: Fix uninitialized variable 'datasize'
        efi/arm*: Fix efi_init() error handling
        efi: Remove unused include of <linux/version.h>
      3940cf0b