Skip to content
  1. Feb 23, 2018
    • Daniel Borkmann's avatar
      bpf, arm64: fix out of bounds access in tail call · 16338a9b
      Daniel Borkmann authored
      I recently noticed a crash on arm64 when feeding a bogus index
      into BPF tail call helper. The crash would not occur when the
      interpreter is used, but only in case of JIT. Output looks as
      follows:
      
        [  347.007486] Unable to handle kernel paging request at virtual address fffb850e96492510
        [...]
        [  347.043065] [fffb850e96492510] address between user and kernel address ranges
        [  347.050205] Internal error: Oops: 96000004 [#1] SMP
        [...]
        [  347.190829] x13: 0000000000000000 x12: 0000000000000000
        [  347.196128] x11: fffc047ebe782800 x10: ffff808fd7d0fd10
        [  347.201427] x9 : 0000000000000000 x8 : 0000000000000000
        [  347.206726] x7 : 0000000000000000 x6 : 001c991738000000
        [  347.212025] x5 : 0000000000000018 x4 : 000000000000ba5a
        [  347.217325] x3 : 00000000000329c4 x2 : ffff808fd7cf0500
        [  347.222625] x1 : ffff808fd7d0fc00 x0 : ffff808fd7cf0500
        [  347.227926] Process test_verifier (pid: 4548, stack limit = 0x000000007467fa61)
        [  347.235221] Call trace:
        [  347.237656]  0xffff000002f3a4fc
        [  347.240784]  bpf_test_run+0x78/0xf8
        [  347.244260]  bpf_prog_test_run_skb+0x148/0x230
        [  347.248694]  SyS_bpf+0x77c/0x1110
        [  347.251999]  el0_svc_naked+0x30/0x34
        [  347.255564] Code: 9100075a d280220a 8b0a002a d37df04b (f86b694b)
        [...]
      
      In this case the index used in BPF r3 is the same as in r1
      at the time of the call, meaning we fed a pointer as index;
      here, it had the value 0xffff808fd7cf0500 which sits in x2.
      
      While I found tail calls to be working in general (also for
      hitting the error cases), I noticed the following in the code
      emission:
      
        # bpftool p d j i 988
        [...]
        38:   ldr     w10, [x1,x10]
        3c:   cmp     w2, w10
        40:   b.ge    0x000000000000007c              <-- signed cmp
        44:   mov     x10, #0x20                      // #32
        48:   cmp     x26, x10
        4c:   b.gt    0x000000000000007c
        50:   add     x26, x26, #0x1
        54:   mov     x10, #0x110                     // #272
        58:   add     x10, x1, x10
        5c:   lsl     x11, x2, #3
        60:   ldr     x11, [x10,x11]                  <-- faulting insn (f86b694b)
        64:   cbz     x11, 0x000000000000007c
        [...]
      
      Meaning, the tests passed because commit ddb55992 ("arm64:
      bpf: implement bpf_tail_call() helper") was using signed compares
      instead of unsigned which as a result had the test wrongly passing.
      
      Change this but also the tail call count test both into unsigned
      and cap the index as u32. Latter we did as well in 90caccdd
      ("bpf: fix bpf_tail_call() x64 JIT") and is needed in addition here,
      too. Tested on HiSilicon Hi1616.
      
      Result after patch:
      
        # bpftool p d j i 268
        [...]
        38:	ldr	w10, [x1,x10]
        3c:	add	w2, w2, #0x0
        40:	cmp	w2, w10
        44:	b.cs	0x0000000000000080
        48:	mov	x10, #0x20                  	// #32
        4c:	cmp	x26, x10
        50:	b.hi	0x0000000000000080
        54:	add	x26, x26, #0x1
        58:	mov	x10, #0x110                 	// #272
        5c:	add	x10, x1, x10
        60:	lsl	x11, x2, #3
        64:	ldr	x11, [x10,x11]
        68:	cbz	x11, 0x0000000000000080
        [...]
      
      Fixes: ddb55992
      
       ("arm64: bpf: implement bpf_tail_call() helper")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      16338a9b
    • Daniel Borkmann's avatar
      bpf, x64: implement retpoline for tail call · a493a87f
      Daniel Borkmann authored
      Implement a retpoline [0] for the BPF tail call JIT'ing that converts
      the indirect jump via jmp %rax that is used to make the long jump into
      another JITed BPF image. Since this is subject to speculative execution,
      we need to control the transient instruction sequence here as well
      when CONFIG_RETPOLINE is set, and direct it into a pause + lfence loop.
      The latter aligns also with what gcc / clang emits (e.g. [1]).
      
      JIT dump after patch:
      
        # bpftool p d x i 1
         0: (18) r2 = map[id:1]
         2: (b7) r3 = 0
         3: (85) call bpf_tail_call#12
         4: (b7) r0 = 2
         5: (95) exit
      
      With CONFIG_RETPOLINE:
      
        # bpftool p d j i 1
        [...]
        33:	cmp    %edx,0x24(%rsi)
        36:	jbe    0x0000000000000072  |*
        38:	mov    0x24(%rbp),%eax
        3e:	cmp    $0x20,%eax
        41:	ja     0x0000000000000072  |
        43:	add    $0x1,%eax
        46:	mov    %eax,0x24(%rbp)
        4c:	mov    0x90(%rsi,%rdx,8),%rax
        54:	test   %rax,%rax
        57:	je     0x0000000000000072  |
        59:	mov    0x28(%rax),%rax
        5d:	add    $0x25,%rax
        61:	callq  0x000000000000006d  |+
        66:	pause                      |
        68:	lfence                     |
        6b:	jmp    0x0000000000000066  |
        6d:	mov    %rax,(%rsp)         |
        71:	retq                       |
        72:	mov    $0x2,%eax
        [...]
      
        * relative fall-through jumps in error case
        + retpoline for indirect jump
      
      Without CONFIG_RETPOLINE:
      
        # bpftool p d j i 1
        [...]
        33:	cmp    %edx,0x24(%rsi)
        36:	jbe    0x0000000000000063  |*
        38:	mov    0x24(%rbp),%eax
        3e:	cmp    $0x20,%eax
        41:	ja     0x0000000000000063  |
        43:	add    $0x1,%eax
        46:	mov    %eax,0x24(%rbp)
        4c:	mov    0x90(%rsi,%rdx,8),%rax
        54:	test   %rax,%rax
        57:	je     0x0000000000000063  |
        59:	mov    0x28(%rax),%rax
        5d:	add    $0x25,%rax
        61:	jmpq   *%rax               |-
        63:	mov    $0x2,%eax
        [...]
      
        * relative fall-through jumps in error case
        - plain indirect jump as before
      
        [0] https://support.google.com/faqs/answer/7625886
        [1] https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b
      
      
      
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a493a87f
    • Yonghong Song's avatar
      bpf: fix rcu lockdep warning for lpm_trie map_free callback · 6c5f6102
      Yonghong Song authored
      Commit 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function")
      fixed a memory leak and removed unnecessary locks in map_free callback function.
      Unfortrunately, it introduced a lockdep warning. When lockdep checking is turned on,
      running tools/testing/selftests/bpf/test_lpm_map will have:
      
        [   98.294321] =============================
        [   98.294807] WARNING: suspicious RCU usage
        [   98.295359] 4.16.0-rc2+ #193 Not tainted
        [   98.295907] -----------------------------
        [   98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage!
        [   98.297657]
        [   98.297657] other info that might help us debug this:
        [   98.297657]
        [   98.298663]
        [   98.298663] rcu_scheduler_active = 2, debug_locks = 1
        [   98.299536] 2 locks held by kworker/2:1/54:
        [   98.300152]  #0:  ((wq_completion)"events"){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
        [   98.301381]  #1:  ((work_completion)(&map->work)){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
      
      Since actual trie tree removal happens only after no other
      accesses to the tree are possible, replacing
        rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock))
      with
        rcu_dereference_protected(*slot, 1)
      fixed the issue.
      
      Fixes: 9a3efb6b
      
       ("bpf: fix memory leak in lpm_trie map_free callback function")
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Suggested-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      6c5f6102
    • Eric Dumazet's avatar
      bpf: add schedule points in percpu arrays management · 32fff239
      Eric Dumazet authored
      syszbot managed to trigger RCU detected stalls in
      bpf_array_free_percpu()
      
      It takes time to allocate a huge percpu map, but even more time to free
      it.
      
      Since we run in process context, use cond_resched() to yield cpu if
      needed.
      
      Fixes: a10423b8
      
       ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      32fff239
  2. Feb 22, 2018
    • Li Zhijian's avatar
      selftests/bpf/test_maps: exit child process without error in ENOMEM case · 80475c48
      Li Zhijian authored
      
      
      test_maps contains a series of stress tests, and previously it will break the
      rest tests when it failed to alloc memory.
      -----------------------
      Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
      Failed to create hashmap key=16 value=262144 'Cannot allocate memory'
      Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
      Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
      test_maps: test_maps.c:955: run_parallel: Assertion `status == 0' failed.
      Aborted
      not ok 1..3 selftests:  test_maps [FAIL]
      -----------------------
      after this patch, the rest tests will be continue when it occurs an ENOMEM failure
      
      CC: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      CC: Philip Li <philip.li@intel.com>
      Suggested-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarLi Zhijian <zhijianx.li@intel.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      80475c48
    • Anders Roxell's avatar
      selftests/bpf: update gitignore with test_libbpf_open · 31a8260d
      Anders Roxell authored
      
      
      bpf builds a test program for loading BPF ELF files. Add the executable
      to the .gitignore list.
      
      Signed-off-by: default avatarAnders Roxell <anders.roxell@linaro.org>
      Tested-by: default avatarDaniel Díaz <daniel.diaz@linaro.org>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarShuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      31a8260d
    • Anders Roxell's avatar
      selftests/bpf: tcpbpf_kern: use in6_* macros from glibc · b52db43a
      Anders Roxell authored
      
      
      Both glibc and the kernel have in6_* macros definitions. Build fails
      because it picks up wrong in6_* macro from the kernel header and not the
      header from glibc.
      
      Fixes build error below:
      clang -I. -I./include/uapi -I../../../include/uapi
           -Wno-compare-distinct-pointer-types \
               -O2 -target bpf -emit-llvm -c test_tcpbpf_kern.c -o - |      \
      llc -march=bpf -mcpu=generic -filetype=obj
           -o .../tools/testing/selftests/bpf/test_tcpbpf_kern.o
      In file included from test_tcpbpf_kern.c:12:
      .../netinet/in.h:101:5: error: expected identifier
          IPPROTO_HOPOPTS = 0,   /* IPv6 Hop-by-Hop options.  */
          ^
      .../linux/in6.h:131:26: note: expanded from macro 'IPPROTO_HOPOPTS'
                                      ^
      In file included from test_tcpbpf_kern.c:12:
      /usr/include/netinet/in.h:103:5: error: expected identifier
          IPPROTO_ROUTING = 43,  /* IPv6 routing header.  */
          ^
      .../linux/in6.h:132:26: note: expanded from macro 'IPPROTO_ROUTING'
                                      ^
      In file included from test_tcpbpf_kern.c:12:
      .../netinet/in.h:105:5: error: expected identifier
          IPPROTO_FRAGMENT = 44, /* IPv6 fragmentation header.  */
          ^
      
      Since both glibc and the kernel have in6_* macros definitions, use the
      one from glibc.  Kernel headers will check for previous libc definitions
      by including include/linux/libc-compat.h.
      
      Reported-by: default avatarDaniel Díaz <daniel.diaz@linaro.org>
      Signed-off-by: default avatarAnders Roxell <anders.roxell@linaro.org>
      Tested-by: default avatarDaniel Díaz <daniel.diaz@linaro.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      b52db43a
    • Arnd Bergmann's avatar
      bpf: clean up unused-variable warning · a7dcdf6e
      Arnd Bergmann authored
      The only user of this variable is inside of an #ifdef, causing
      a warning without CONFIG_INET:
      
      net/core/filter.c: In function '____bpf_sock_ops_cb_flags_set':
      net/core/filter.c:3382:6: error: unused variable 'val' [-Werror=unused-variable]
        int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
      
      This replaces the #ifdef with a nicer IS_ENABLED() check that
      makes the code more readable and avoids the warning.
      
      Fixes: b13d8807
      
       ("bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      a7dcdf6e
    • Tom Lendacky's avatar
      amd-xgbe: Restore PCI interrupt enablement setting on resume · cfd092f2
      Tom Lendacky authored
      
      
      After resuming from suspend, the PCI device support must re-enable the
      interrupt setting so that interrupts are actually delivered.
      
      Signed-off-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cfd092f2
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · bf006d18
      David S. Miller authored
      
      
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2018-02-20
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Fix a memory leak in LPM trie's map_free() callback function, where
         the trie structure itself was not freed since initial implementation.
         Also a synchronize_rcu() was needed in order to wait for outstanding
         programs accessing the trie to complete, from Yonghong.
      
      2) Fix sock_map_alloc()'s error path in order to correctly propagate
         the -EINVAL error in case of too large allocation requests. This
         was just recently introduced when fixing close hooks via ULP layer,
         fix from Eric.
      
      3) Do not use GFP_ATOMIC in __cpu_map_entry_alloc(). Reason is that this
         will not work with the recent __ptr_ring_init_queue_alloc() conversion
         to kvmalloc_array(), where in case of fallback to vmalloc() that GFP
         flag is invalid, from Jason.
      
      4) Fix two recent syzkaller warnings: i) fix bpf_prog_array_copy_to_user()
         when a prog query with a big number of ids was performed where we'd
         otherwise trigger a warning from allocator side, ii) fix a missing
         mlock precharge on arraymaps, from Daniel.
      
      5) Two fixes for bpftool in order to avoid breaking JSON output when used
         in batch mode, from Quentin.
      
      6) Move a pr_debug() in libbpf in order to avoid having an otherwise
         uninitialized variable in bpf_program__reloc_text(), from Jeremy.
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bf006d18
    • David S. Miller's avatar
      Merge branch 'virtio_net-XDP-fixes' · 6c4df17c
      David S. Miller authored
      
      
      Jesper Dangaard Brouer says:
      
      ====================
      virtio_net: several bugs in XDP code for driver virtio_net
      
      The virtio_net driver actually violates the original memory model of
      XDP causing hard to debug crashes.  Per request of John Fastabend,
      instead of removing the XDP feature I'm fixing as much as possible.
      While testing virtio_net with XDP_REDIRECT I found 4 different bugs.
      
      Patch-1: not enough tail-room for build_skb in receive_mergeable()
       only option is to disable XDP_REDIRECT in receive_mergeable()
      
      Patch-2: XDP in receive_small() basically never worked (check wrong flag)
      
      Patch-3: fix memory leak for XDP_REDIRECT in error cases
      
      Patch-4: avoid crash when ndo_xdp_xmit is called on dev not ready for XDP
      
      In the longer run, we should consider introducing a separate receive
      function when attaching an XDP program, and also change the memory
      model to be compatible with XDP when attaching an XDP prog.
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6c4df17c
    • Jesper Dangaard Brouer's avatar
      virtio_net: fix ndo_xdp_xmit crash towards dev not ready for XDP · 8dcc5b0a
      Jesper Dangaard Brouer authored
      
      
      When a driver implements the ndo_xdp_xmit() function, there is
      (currently) no generic way to determine whether it is safe to call.
      
      It is e.g. unsafe to call the drivers ndo_xdp_xmit, if it have not
      allocated the needed XDP TX queues yet.  This is the case for
      virtio_net, which first allocates the XDP TX queues once an XDP/bpf
      prog is attached (in virtnet_xdp_set()).
      
      Thus, a crash will occur for virtio_net when redirecting to another
      virtio_net device's ndo_xdp_xmit, which have not attached a XDP prog.
      The sample xdp_redirect_map tries to attach a dummy XDP prog to take
      this into account, but it can also easily fail if the virtio_net (or
      actually underlying vhost driver) have not allocated enough extra
      queues for the device.
      
      Allocating more queue this is currently a manual config.
      Hint for libvirt XML add:
      
        <driver name='vhost' queues='16'>
          <host mrg_rxbuf='off'/>
          <guest tso4='off' tso6='off' ecn='off' ufo='off'/>
        </driver>
      
      The solution in this patch is to check that the device have loaded an
      XDP/bpf prog before proceeding.  This is similar to the check
      performed in driver ixgbe.
      
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8dcc5b0a
    • Jesper Dangaard Brouer's avatar
      virtio_net: fix memory leak in XDP_REDIRECT · 11b7d897
      Jesper Dangaard Brouer authored
      XDP_REDIRECT calling xdp_do_redirect() can fail for multiple reasons
      (which can be inspected by tracepoints). The current semantics is that
      on failure the driver calling xdp_do_redirect() must handle freeing or
      recycling the page associated with this frame.  This can be seen as an
      optimization, as drivers usually have an optimized XDP_DROP code path
      for frame recycling in place already.
      
      The virtio_net driver didn't handle when xdp_do_redirect() failed.
      This caused a memory leak as the page refcnt wasn't decremented on
      failures.
      
      The function __virtnet_xdp_xmit() did handle one type of failure,
      when the xmit queue virtqueue_add_outbuf() is full, which "hides"
      releasing a refcnt on the page.  Instead the function __virtnet_xdp_xmit()
      must follow API of xdp_do_redirect(), which on errors leave it up to
      the caller to free the page, of the failed send operation.
      
      Fixes: 186b3c99
      
       ("virtio-net: support XDP_REDIRECT")
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      11b7d897
    • Jesper Dangaard Brouer's avatar
      virtio_net: fix XDP code path in receive_small() · 95dbe9e7
      Jesper Dangaard Brouer authored
      When configuring virtio_net to use the code path 'receive_small()',
      in-order to get correct XDP_REDIRECT support, I discovered TCP packets
      would get silently dropped when loading an XDP program action XDP_PASS.
      
      The bug seems to be that receive_small() when XDP is loaded check that
      hdr->hdr.flags is zero, which seems wrong as hdr.flags contains the
      flags VIRTIO_NET_HDR_F_* :
       #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* Use csum_start, csum_offset */
       #define VIRTIO_NET_HDR_F_DATA_VALID 2 /* Csum is valid */
      
      TCP got dropped as it had the VIRTIO_NET_HDR_F_DATA_VALID flag set.
      
      The flags that are relevant here are the VIRTIO_NET_HDR_GSO_* flags
      stored in hdr->hdr.gso_type. Thus, the fix is just check that none of
      the gso_type flags have been set.
      
      Fixes: bb91accf
      
       ("virtio-net: XDP support for small buffers")
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      95dbe9e7
    • Jesper Dangaard Brouer's avatar
      virtio_net: disable XDP_REDIRECT in receive_mergeable() case · 7324f539
      Jesper Dangaard Brouer authored
      The virtio_net code have three different RX code-paths in receive_buf().
      Two of these code paths can handle XDP, but one of them is broken for
      at least XDP_REDIRECT.
      
      Function(1): receive_big() does not support XDP.
      Function(2): receive_small() support XDP fully and uses build_skb().
      Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb().
      
      The simple explanation is that receive_mergeable() is broken because
      it uses napi_alloc_skb(), which violates XDP given XDP assumes packet
      header+data in single page and enough tail room for skb_shared_info.
      
      The longer explaination is that receive_mergeable() tries to
      work-around and satisfy these XDP requiresments e.g. by having a
      function xdp_linearize_page() that allocates and memcpy RX buffers
      around (in case packet is scattered across multiple rx buffers).  This
      does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because
      we have not implemented bpf_xdp_adjust_tail yet).
      
      The XDP_REDIRECT action combined with cpumap is broken, and cause hard
      to debug crashes.  The main issue is that the RX packet does not have
      the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing
      skb_shared_info to overlap the next packets head-room (in which cpumap
      stores info).
      
      Reproducing depend on the packet payload length and if RX-buffer size
      happened to have tail-room for skb_shared_info or not.  But to make
      this even harder to troubleshoot, the RX-buffer size is runtime
      dynamically change based on an Exponentially Weighted Moving Average
      (EWMA) over the packet length, when refilling RX rings.
      
      This patch only disable XDP_REDIRECT support in receive_mergeable()
      case, because it can cause a real crash.
      
      IMHO we should consider NOT supporting XDP in receive_mergeable() at
      all, because the principles behind XDP are to gain speed by (1) code
      simplicity, (2) sacrificing memory and (3) where possible moving
      runtime checks to setup time.  These principles are clearly being
      violated in receive_mergeable(), that e.g. runtime track average
      buffer size to save memory consumption.
      
      In the longer run, we should consider introducing a separate receive
      function when attaching an XDP program, and also change the memory
      model to be compatible with XDP when attaching an XDP prog.
      
      Fixes: 186b3c99
      
       ("virtio-net: support XDP_REDIRECT")
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7324f539
    • David S. Miller's avatar
      Merge tag 'mlx5-fixes-2018-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 9c4ff2a9
      David S. Miller authored
      
      
      Saeed Mahameed says:
      
      ====================
      Mellanox, mlx5 fixes 2018-02-20
      
      The following pull request includes some fixes for the mlx5 core and
      netdevice driver.
      
      Please pull and let me know if there's any issue.
      
      -stable 4.10.y:
      ('net/mlx5e: Fix loopback self test when GRO is off')
      
      -stable 4.12.y:
      ('net/mlx5e: Specify numa node when allocating drop rq')
      
      -stable 4.13.y:
      ('net/mlx5e: Verify inline header size do not exceed SKB linear size')
      
      -stable 4.15.y:
      ('net/mlx5e: Fix TCP checksum in LRO buffers')
      ('net/mlx5: Fix error handling when adding flow rules')
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9c4ff2a9
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 943a0d4a
      David S. Miller authored
      
      
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains large batch with Netfilter fixes for
      your net tree, mostly due to syzbot report fixups and pr_err()
      ratelimiting, more specifically, they are:
      
      1) Get rid of superfluous unnecessary check in x_tables before vmalloc(),
         we don't hit BUG there anymore, patch from Michal Hock, suggested by
         Andrew Morton.
      
      2) Race condition in proc file creation in ipt_CLUSTERIP, from Cong Wang.
      
      3) Drop socket lock that results in circular locking dependency, patch
         from Paolo Abeni.
      
      4) Drop packet if case of malformed blob that makes backpointer jump
         in x_tables, from Florian Westphal.
      
      5) Fix refcount leak due to race in ipt_CLUSTERIP in
         clusterip_config_find_get(), from Cong Wang.
      
      6) Several patches to ratelimit pr_err() for x_tables since this can be
         a problem where CAP_NET_ADMIN semantics can protect us in untrusted
         namespace, from Florian Westphal.
      
      7) Missing .gitignore update for new autogenerated asn1 state machine
         for the SNMP NAT helper, from Zhu Lingshan.
      
      8) Missing timer initialization in xt_LED, from Paolo Abeni.
      
      9) Do not allow negative port range in NAT, also from Paolo.
      
      10) Lock imbalance in the xt_hashlimit rate match mode, patch from
          Eric Dumazet.
      
      11) Initialize workqueue before timer in the idletimer match,
          from Eric Dumazet.
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      943a0d4a
  3. Feb 21, 2018
  4. Feb 20, 2018
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 79c0ef3e
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Prevent index integer overflow in ptr_ring, from Jason Wang.
      
       2) Program mvpp2 multicast filter properly, from Mikulas Patocka.
      
       3) The bridge brport attribute file is write only and doesn't have a
          ->show() method, don't blindly invoke it. From Xin Long.
      
       4) Inverted mask used in genphy_setup_forced(), from Ingo van Lil.
      
       5) Fix multiple definition issue with if_ether.h UAPI header, from
          Hauke Mehrtens.
      
       6) Fix GFP_KERNEL usage in atomic in RDS protocol code, from Sowmini
          Varadhan.
      
       7) Revert XDP redirect support from thunderx driver, it is not
          implemented properly. From Jesper Dangaard Brouer.
      
       8) Fix missing RTNL protection across some tipc operations, from Ying
          Xue.
      
       9) Return the correct IV bytes in the TLS getsockopt code, from Boris
          Pismenny.
      
      10) Take tclassid into consideration properly when doing FIB rule
          matching. From Stefano Brivio.
      
      11) cxgb4 device needs more PCI VPD quirks, from Casey Leedom.
      
      12) TUN driver doesn't align frags properly, and we can end up doing
          unaligned atomics on misaligned metadata. From Eric Dumazet.
      
      13) Fix various crashes found using DEBUG_PREEMPT in rmnet driver, from
          Subash Abhinov Kasiviswanathan.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (56 commits)
        tg3: APE heartbeat changes
        mlxsw: spectrum_router: Do not unconditionally clear route offload indication
        net: qualcomm: rmnet: Fix possible null dereference in command processing
        net: qualcomm: rmnet: Fix warning seen with 64 bit stats
        net: qualcomm: rmnet: Fix crash on real dev unregistration
        sctp: remove the left unnecessary check for chunk in sctp_renege_events
        rxrpc: Work around usercopy check
        tun: fix tun_napi_alloc_frags() frag allocator
        udplite: fix partial checksum initialization
        skbuff: Fix comment mis-spelling.
        dn_getsockoptdecnet: move nf_{get/set}sockopt outside sock lock
        PCI/cxgb4: Extend T3 PCI quirk to T4+ devices
        cxgb4: fix trailing zero in CIM LA dump
        cxgb4: free up resources of pf 0-3
        fib_semantics: Don't match route with mismatching tclassid
        NFC: llcp: Limit size of SDP URI
        tls: getsockopt return record sequence number
        tls: reset the crypto info if copy_from_user fails
        tls: retrun the correct IV in getsockopt
        docs: segmentation-offloads.txt: add SCTP info
        ...
      79c0ef3e
    • Prashant Sreedharan's avatar
      tg3: APE heartbeat changes · 506b0a39
      Prashant Sreedharan authored
      
      
      In ungraceful host shutdown or driver crash case BMC connectivity is
      lost. APE firmware is missing the driver state in this
      case to keep the BMC connectivity alive.
      This patch has below change to address this issue.
      
      Heartbeat mechanism with APE firmware. This heartbeat mechanism
      is needed to notify the APE firmware about driver state.
      
      This patch also has the change in wait time for APE event from
      1ms to 20ms as there can be some delay in getting response.
      
      v2: Drop inline keyword as per David suggestion.
      
      Signed-off-by: default avatarPrashant Sreedharan <prashant.sreedharan@broadcom.com>
      Signed-off-by: default avatarSatish Baddipadige <satish.baddipadige@broadcom.com>
      Signed-off-by: default avatarSiva Reddy Kallam <siva.kallam@broadcom.com>
      Acked-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      506b0a39
    • Eric Dumazet's avatar
      netfilter: IDLETIMER: be syzkaller friendly · cfc2c740
      Eric Dumazet authored
      We had one report from syzkaller [1]
      
      First issue is that INIT_WORK() should be done before mod_timer()
      or we risk timer being fired too soon, even with a 1 second timer.
      
      Second issue is that we need to reject too big info->timeout
      to avoid overflows in msecs_to_jiffies(info->timeout * 1000), or
      risk looping, if result after overflow is 0.
      
      [1]
      WARNING: CPU: 1 PID: 5129 at kernel/workqueue.c:1444 __queue_work+0xdf4/0x1230 kernel/workqueue.c:1444
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 5129 Comm: syzkaller159866 Not tainted 4.16.0-rc1+ #230
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       panic+0x1e4/0x41c kernel/panic.c:183
       __warn+0x1dc/0x200 kernel/panic.c:547
       report_bug+0x211/0x2d0 lib/bug.c:184
       fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
       fixup_bug arch/x86/kernel/traps.c:247 [inline]
       do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
       invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:988
      RIP: 0010:__queue_work+0xdf4/0x1230 kernel/workqueue.c:1444
      RSP: 0018:ffff8801db507538 EFLAGS: 00010006
      RAX: ffff8801aeb46080 RBX: ffff8801db530200 RCX: ffffffff81481404
      RDX: 0000000000000100 RSI: ffffffff86b42640 RDI: 0000000000000082
      RBP: ffff8801db507758 R08: 1ffff1003b6a0de5 R09: 000000000000000c
      R10: ffff8801db5073f0 R11: 0000000000000020 R12: 1ffff1003b6a0eb6
      R13: ffff8801b1067ae0 R14: 00000000000001f8 R15: dffffc0000000000
       queue_work_on+0x16a/0x1c0 kernel/workqueue.c:1488
       queue_work include/linux/workqueue.h:488 [inline]
       schedule_work include/linux/workqueue.h:546 [inline]
       idletimer_tg_expired+0x44/0x60 net/netfilter/xt_IDLETIMER.c:116
       call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
       expire_timers kernel/time/timer.c:1363 [inline]
       __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
       run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
       __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
       invoke_softirq kernel/softirq.c:365 [inline]
       irq_exit+0x1cc/0x200 kernel/softirq.c:405
       exiting_irq arch/x86/include/asm/apic.h:541 [inline]
       smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
       apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:829
       </IRQ>
      RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:777 [inline]
      RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160 [inline]
      RIP: 0010:_raw_spin_unlock_irqrestore+0x5e/0xba kernel/locking/spinlock.c:184
      RSP: 0018:ffff8801c20173c8 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff12
      RAX: dffffc0000000000 RBX: 0000000000000282 RCX: 0000000000000006
      RDX: 1ffffffff0d592cd RSI: 1ffff10035d68d23 RDI: 0000000000000282
      RBP: ffff8801c20173d8 R08: 1ffff10038402e47 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8820e5c8
      R13: ffff8801b1067ad8 R14: ffff8801aea7c268 R15: ffff8801aea7c278
       __debug_object_init+0x235/0x1040 lib/debugobjects.c:378
       debug_object_init+0x17/0x20 lib/debugobjects.c:391
       __init_work+0x2b/0x60 kernel/workqueue.c:506
       idletimer_tg_create net/netfilter/xt_IDLETIMER.c:152 [inline]
       idletimer_tg_checkentry+0x691/0xb00 net/netfilter/xt_IDLETIMER.c:213
       xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:850
       check_target net/ipv6/netfilter/ip6_tables.c:533 [inline]
       find_check_entry.isra.7+0x935/0xcf0 net/ipv6/netfilter/ip6_tables.c:575
       translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:744
       do_replace net/ipv6/netfilter/ip6_tables.c:1160 [inline]
       do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1686
       nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
       nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
       ipv6_setsockopt+0x10b/0x130 net/ipv6/ipv6_sockglue.c:927
       udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
       sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2976
       SYSC_setsockopt net/socket.c:1850 [inline]
       SyS_setsockopt+0x189/0x360 net/socket.c:1829
       do_syscall_64+0x282/0x940 arch/x86/entry/common.c:287
      
      Fixes: 0902b469
      
       ("netfilter: xtables: idletimer target implementation")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzkaller <syzkaller@googlegroups.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      cfc2c740
    • Ido Schimmel's avatar
      mlxsw: spectrum_router: Do not unconditionally clear route offload indication · d1c95af3
      Ido Schimmel authored
      When mlxsw replaces (or deletes) a route it removes the offload
      indication from the replaced route. This is problematic for IPv4 routes,
      as the offload indication is stored in the fib_info which is usually
      shared between multiple routes.
      
      Instead of unconditionally clearing the offload indication, only clear
      it if no other route is using the fib_info.
      
      Fixes: 3984d1a8
      
       ("mlxsw: spectrum_router: Provide offload indication using nexthop flags")
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Reported-by: default avatarAlexander Petrovskiy <alexpe@mellanox.com>
      Tested-by: default avatarAlexander Petrovskiy <alexpe@mellanox.com>
      Signed-off-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d1c95af3
    • David S. Miller's avatar
      Merge branch 'qualcomm-rmnet-Fix-issues-with-CONFIG_DEBUG_PREEMPT-enabled' · cae69256
      David S. Miller authored
      
      
      Subash Abhinov Kasiviswanathan says:
      
      ====================
      net: qualcomm: rmnet: Fix issues with CONFIG_DEBUG_PREEMPT enabled
      
      Patch 1 and 2 fixes issues identified when CONFIG_DEBUG_PREEMPT was
      enabled. These involve APIs which were called in invalid contexts.
      
      Patch 3 is a null derefence fix identified by code inspection.
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cae69256
    • Subash Abhinov Kasiviswanathan's avatar
      net: qualcomm: rmnet: Fix possible null dereference in command processing · f57bbaae
      Subash Abhinov Kasiviswanathan authored
      If a command packet with invalid mux id is received, the packet would
      not have a valid endpoint. This invalid endpoint maybe dereferenced
      leading to a crash. Identified by manual code inspection.
      
      Fixes: 3352e6c4
      
       ("net: qualcomm: rmnet: Convert the muxed endpoint to hlist")
      Signed-off-by: default avatarSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f57bbaae
    • Subash Abhinov Kasiviswanathan's avatar
      net: qualcomm: rmnet: Fix warning seen with 64 bit stats · 4dba8bbc
      Subash Abhinov Kasiviswanathan authored
      With CONFIG_DEBUG_PREEMPT enabled, a warning was seen on device
      creation. This occurs due to the incorrect cpu API usage in
      ndo_get_stats64 handler.
      
      BUG: using smp_processor_id() in preemptible [00000000] code: rmnetcli/5743
      caller is debug_smp_processor_id+0x1c/0x24
      Call trace:
      [<ffffff9d48c8967c>] dump_backtrace+0x0/0x2a8
      [<ffffff9d48c89bbc>] show_stack+0x20/0x28
      [<ffffff9d4901fff8>] dump_stack+0xa8/0xe0
      [<ffffff9d490421e0>] check_preemption_disabled+0x104/0x108
      [<ffffff9d49042200>] debug_smp_processor_id+0x1c/0x24
      [<ffffff9d494a36b0>] rmnet_get_stats64+0x64/0x13c
      [<ffffff9d49b014e0>] dev_get_stats+0x68/0xd8
      [<ffffff9d49d58df8>] rtnl_fill_stats+0x54/0x140
      [<ffffff9d49b1f0b8>] rtnl_fill_ifinfo+0x428/0x9cc
      [<ffffff9d49b23834>] rtmsg_ifinfo_build_skb+0x80/0xf4
      [<ffffff9d49b23930>] rtnetlink_event+0x88/0xb4
      [<ffffff9d48cd21b4>] raw_notifier_call_chain+0x58/0x78
      [<ffffff9d49b028a4>] call_netdevice_notifiers_info+0x48/0x78
      [<ffffff9d49b08bf8>] __netdev_upper_dev_link+0x290/0x5e8
      [<ffffff9d49b08fcc>] netdev_master_upper_dev_link+0x3c/0x48
      [<ffffff9d494a2e74>] rmnet_newlink+0xf0/0x1c8
      [<ffffff9d49b23360>] rtnl_newlink+0x57c/0x6c8
      [<ffffff9d49b2355c>] rtnetlink_rcv_msg+0xb0/0x244
      [<ffffff9d49b5230c>] netlink_rcv_skb+0xb4/0xdc
      [<ffffff9d49b204f4>] rtnetlink_rcv+0x34/0x44
      [<ffffff9d49b51af0>] netlink_unicast+0x1ec/0x294
      [<ffffff9d49b51fdc>] netlink_sendmsg+0x320/0x390
      [<ffffff9d49ae6858>] sock_sendmsg+0x54/0x60
      [<ffffff9d49ae91bc>] SyS_sendto+0x1a0/0x1e4
      [<ffffff9d48c83770>] el0_svc_naked+0x24/0x28
      
      Fixes: 192c4b5d
      
       ("net: qualcomm: rmnet: Add support for 64 bit stats")
      Signed-off-by: default avatarSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4dba8bbc
    • Subash Abhinov Kasiviswanathan's avatar
      net: qualcomm: rmnet: Fix crash on real dev unregistration · b37f78f2
      Subash Abhinov Kasiviswanathan authored
      With CONFIG_DEBUG_PREEMPT enabled, a crash with the following call
      stack was observed when removing a real dev which had rmnet devices
      attached to it.
      To fix this, remove the netdev_upper link APIs and instead use the
      existing information in rmnet_port and rmnet_priv to get the
      association between real and rmnet devs.
      
      BUG: sleeping function called from invalid context
      in_atomic(): 0, irqs_disabled(): 0, pid: 5762, name: ip
      Preemption disabled at:
      [<ffffff9d49043564>] debug_object_active_state+0xa4/0x16c
      Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      Modules linked in:
      PC is at ___might_sleep+0x13c/0x180
      LR is at ___might_sleep+0x17c/0x180
      [<ffffff9d48ce0924>] ___might_sleep+0x13c/0x180
      [<ffffff9d48ce09c0>] __might_sleep+0x58/0x8c
      [<ffffff9d49d6253c>] mutex_lock+0x2c/0x48
      [<ffffff9d48ed4840>] kernfs_remove_by_name_ns+0x48/0xa8
      [<ffffff9d48ed6ec8>] sysfs_remove_link+0x30/0x58
      [<ffffff9d49b05840>] __netdev_adjacent_dev_remove+0x14c/0x1e0
      [<ffffff9d49b05914>] __netdev_adjacent_dev_unlink_lists+0x40/0x68
      [<ffffff9d49b08820>] netdev_upper_dev_unlink+0xb4/0x1fc
      [<ffffff9d494a29f0>] rmnet_dev_walk_unreg+0x6c/0xc8
      [<ffffff9d49b00b40>] netdev_walk_all_lower_dev_rcu+0x58/0xb4
      [<ffffff9d494a30fc>] rmnet_config_notify_cb+0xf4/0x134
      [<ffffff9d48cd21b4>] raw_notifier_call_chain+0x58/0x78
      [<ffffff9d49b028a4>] call_netdevice_notifiers_info+0x48/0x78
      [<ffffff9d49b0b568>] rollback_registered_many+0x230/0x3c8
      [<ffffff9d49b0b738>] unregister_netdevice_many+0x38/0x94
      [<ffffff9d49b1e110>] rtnl_delete_link+0x58/0x88
      [<ffffff9d49b201dc>] rtnl_dellink+0xbc/0x1cc
      [<ffffff9d49b2355c>] rtnetlink_rcv_msg+0xb0/0x244
      [<ffffff9d49b5230c>] netlink_rcv_skb+0xb4/0xdc
      [<ffffff9d49b204f4>] rtnetlink_rcv+0x34/0x44
      [<ffffff9d49b51af0>] netlink_unicast+0x1ec/0x294
      [<ffffff9d49b51fdc>] netlink_sendmsg+0x320/0x390
      [<ffffff9d49ae6858>] sock_sendmsg+0x54/0x60
      [<ffffff9d49ae6f94>] ___sys_sendmsg+0x298/0x2b0
      [<ffffff9d49ae98f8>] SyS_sendmsg+0xb4/0xf0
      [<ffffff9d48c83770>] el0_svc_naked+0x24/0x28
      
      Fixes: ceed73a2 ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation")
      Fixes: 60d58f97
      
       ("net: qualcomm: rmnet: Implement bridge mode")
      Signed-off-by: default avatarSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b37f78f2
  5. Feb 19, 2018