  1. Sep 18, 2020
    • bpf, x64: rework pro/epilogue and tailcall handling in JIT · ebf7d1f5
      Maciej Fijalkowski authored
      
      
      This commit serves two purposes:
      1) it optimizes BPF prologue/epilogue generation
      2) it makes it possible to have tailcalls within a BPF subprogram
      
      The two points are related, since 2) could not be achieved without 1).
      
      In [1], Alexei says:
      "The prologue will look like:
      nop5
      xor eax,eax  // two new bytes if bpf_tail_call() is used in this
                   // function
      push rbp
      mov rbp, rsp
      sub rsp, rounded_stack_depth
      push rax // zero init tail_call counter
      variable number of push rbx,r13,r14,r15
      
      Then bpf_tail_call will pop variable number rbx,..
      and final 'pop rax'
      Then 'add rsp, size_of_current_stack_frame'
      jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
      rbp, rsp'
      
      This way new function will set its own stack size and will init tail
      call counter with whatever value the parent had.
      
      If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
      Instead it would need to have 'nop2' in there."
      
      Implement that suggestion.
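
As a rough illustration of the quoted layout, the sketch below writes the prologue bytes into a buffer. The helper names (`emit`, `emit_prologue_sketch`) and the `use_rbx` flag are hypothetical and are not the kernel's emit_prologue(). The sketch also assumes that `sub rsp` is always emitted with a 32-bit immediate and that only rbx is pushed as a callee-saved register.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append n raw bytes to the output buffer and return the new length. */
static size_t emit(uint8_t *buf, size_t len, const uint8_t *bytes, size_t n)
{
	memcpy(buf + len, bytes, n);
	return len + n;
}

/*
 * Hypothetical helper (not the kernel's emit_prologue()): writes the quoted
 * prologue layout into buf and returns its length in bytes.
 */
static size_t emit_prologue_sketch(uint8_t *buf, uint32_t stack_depth,
				   bool uses_tail_call, bool use_rbx)
{
	static const uint8_t nop5[] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
	static const uint8_t xor_eax[] = { 0x31, 0xc0 };	/* xor eax,eax */
	static const uint8_t nop2[] = { 0x66, 0x90 };
	static const uint8_t push_rbp[] = { 0x55 };
	static const uint8_t mov_rbp_rsp[] = { 0x48, 0x89, 0xe5 };
	static const uint8_t push_rax[] = { 0x50 };
	static const uint8_t push_rbx[] = { 0x53 };
	uint32_t rounded = (stack_depth + 7) & ~7u;	/* round up to 8 bytes */
	uint8_t sub_rsp[7] = { 0x48, 0x81, 0xec };	/* sub rsp, imm32 */
	size_t len = 0;

	len = emit(buf, len, nop5, sizeof(nop5));
	len = emit(buf, len, uses_tail_call ? xor_eax : nop2, 2);
	len = emit(buf, len, push_rbp, sizeof(push_rbp));
	len = emit(buf, len, mov_rbp_rsp, sizeof(mov_rbp_rsp));
	/* The 11 bytes emitted so far are what a tail call jumps over. */

	/* Store the immediate in little-endian order, byte by byte. */
	sub_rsp[3] = rounded & 0xff;
	sub_rsp[4] = (rounded >> 8) & 0xff;
	sub_rsp[5] = (rounded >> 16) & 0xff;
	sub_rsp[6] = (rounded >> 24) & 0xff;
	len = emit(buf, len, sub_rsp, sizeof(sub_rsp));
	len = emit(buf, len, push_rax, sizeof(push_rax));	/* tail call counter */
	if (use_rbx)
		len = emit(buf, len, push_rbx, sizeof(push_rbx));
	return len;
}
```

Calling `emit_prologue_sketch(buf, 13, true, true)` on a 32-byte buffer fills 20 bytes. A program without bpf_tail_call() gets the 2-byte nop in place of `xor eax,eax`, so the skipped region keeps the same length in both cases.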
      
      Since the stack layout has changed, tail call counter handling can no
      longer rely on popping the counter into rbx and later overwriting rbx
      with its actual value pushed to the stack, as was done for the
      constant prologue case. Therefore, use one of the registers that is
      considered volatile/caller-saved (%rcx) and pop the value of the tail
      call counter into it in the epilogue.
      
      Drop the BUILD_BUG_ON in emit_prologue and in
      emit_bpf_tail_call_indirect where instruction layout is not constant
      anymore.
      
      Introduce a new poke target, 'tailcall_bypass', in the poke descriptor.
      It is dedicated to skipping the register pops and stack unwind that
      are generated right before the actual jump to the target program.
      When the target program is not present, the BPF program will skip
      the pop instructions and the nop5 dedicated for jmpq $target. An
      example of such a state, when R6 is the only callee-saved register
      used by the program:
      
      ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
      ffffffffc0513aa6:       5b                      pop    %rbx
      ffffffffc0513aa7:       58                      pop    %rax
      ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
      
      When target program is inserted, the jump that was there to skip
      pops/nop5 will become the nop5, so CPU will go over pops and do the
      actual tailcall.
      
      One might ask why there simply can not be pushes after the nop5?
      In the following example snippet:
      
      ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
      (...)
      ffffffffc0370332:       5b                      pop    %rbx
      ffffffffc0370333:       58                      pop    %rax
      ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
      ffffffffc0370347:       50                      push   %rax
      ffffffffc0370348:       53                      push   %rbx
      ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
      ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
      
      There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall,
      and the jump target is not present. ctx is in the %rbx register, and
      the BPF subprogram that we call into at ffffffffc037034c relies on it,
      e.g. it will pick ctx from there. Such a code layout is therefore
      broken, as we would overwrite the content of %rbx with the value that
      was pushed in the prologue. That is the reason for the 'bypass'
      approach.
      
      Special care needs to be taken during the install/update/remove of
      tailcall target. In case when target program is not present, the CPU
      must not execute the pop instructions that precede the tailcall.
      
      To address that, the following states can be defined:
      A nop, unwind, nop
      B nop, unwind, tail
      C skip, unwind, nop
      D skip, unwind, tail
      
      A is forbidden (it leads to incorrectness). The state transitions between
      tailcall install/update/remove will work as follows:
      
      First install tail call f: C->D->B(f)
       * poke the tailcall, after that get rid of the skip
      Update tail call f to f': B(f)->B(f')
       * poke the tailcall (poke->tailcall_target) and do NOT touch the
         poke->tailcall_bypass
      Remove tail call: B(f')->C(f')
       * poke->tailcall_bypass is poked back to jump, then we wait for the RCU
         grace period so that other programs finish their execution, and
         after that we are safe to remove the poke->tailcall_target
      Install new tail call (f''): C(f')->D(f'')->B(f'').
       * same as first step
      
      This way CPU can never be exposed to "unwind, tail" state.
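
The ordering rules above can be modelled with a toy state machine. This is only a sketch: `struct poke_model` and its helpers are made-up names, each "poke" is a plain assignment instead of live text patching, and the RCU wait is represented by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one poke descriptor; the field names are illustrative. */
struct poke_model {
	bool bypass_is_jump;	/* true: "skip" over the unwind, false: nop */
	const void *target;	/* NULL: nop5, otherwise jmp to this program */
};

/* State A (nop, unwind, nop) unwinds the stack and then falls through. */
static bool poke_state_is_forbidden(const struct poke_model *p)
{
	return !p->bypass_is_jump && p->target == NULL;
}

/* Install or update: poke the target first, then drop the bypass jump. */
static void poke_install(struct poke_model *p, const void *prog)
{
	p->target = prog;		/* C->D, or B(f)->B(f') */
	assert(!poke_state_is_forbidden(p));
	p->bypass_is_jump = false;	/* D->B, a no-op when already in B */
	assert(!poke_state_is_forbidden(p));
}

/* Remove: restore the bypass jump first, only then clear the target. */
static void poke_remove(struct poke_model *p)
{
	p->bypass_is_jump = true;	/* B(f')->C(f') in the commit's notation */
	assert(!poke_state_is_forbidden(p));
	/* The real code waits for an RCU grace period at this point. */
	p->target = NULL;
	assert(!poke_state_is_forbidden(p));
}
```

Starting from `{ .bypass_is_jump = true, .target = NULL }` (state C), any sequence of `poke_install()` and `poke_remove()` calls passes the assertions, because each helper changes only one of the two fields at a time and in the order the commit prescribes.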
      
      Last but not least, when tailcalls get mixed with bpf2bpf calls, it
      would be possible to hit an endless loop caused by clearing the
      tailcall counter. This would happen, for example, with a
      subprogram-based variant of the tailcall3 program from BPF selftests,
      i.e. one where the tailcall is present within the BPF subprogram.
      
      This test, broken down to particular steps, would do:
      entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
      func0 -> call subprog_tail
      (we are NOT skipping the first 11 bytes of prologue and this subprogram
      has a tailcall, therefore we clear the counter...)
      subprog -> do the same thing as entry
      
      and then loop forever.
      
      To address this, the idea is to go through the call chain of bpf2bpf progs
      and look for the presence of a tailcall anywhere in the chain. If we see a
      single tail call, then each node in this call chain needs to be marked as a
      subprog that can reach the tailcall. We later feed the JIT with this info
      and:
      - set eax to 0 only when tailcall is reachable and this is the entry prog
      - if tailcall is reachable but there's no tailcall in insns of currently
        JITed prog then push rax anyway, so that it will be possible to
        propagate further down the call chain
      - finally if tailcall is reachable, then we need to precede the 'call'
        insn with mov rax, [rbp - (stack_depth + 8)]
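
One way to picture the marking step is a fixed-point pass over a small call graph. The sketch below is a simplification under stated assumptions: `struct call_graph` and `mark_tail_call_reachable()` are hypothetical names, the graph is an adjacency matrix, and the verifier's real traversal is organized differently.

```c
#include <stdbool.h>

#define MAX_SUBPROGS 8	/* small illustrative bound */

/* Illustrative call graph: calls[i][j] is true when subprog i calls j. */
struct call_graph {
	int nr;
	bool calls[MAX_SUBPROGS][MAX_SUBPROGS];
	bool has_tail_call[MAX_SUBPROGS];
	bool tail_call_reachable[MAX_SUBPROGS];	/* output */
};

/* Mark every subprog that contains a tail call or can call into one. */
static void mark_tail_call_reachable(struct call_graph *g)
{
	bool changed = true;
	int i, j;

	for (i = 0; i < g->nr; i++)
		g->tail_call_reachable[i] = g->has_tail_call[i];

	/* Propagate from callees to callers until nothing changes. */
	while (changed) {
		changed = false;
		for (i = 0; i < g->nr; i++) {
			if (g->tail_call_reachable[i])
				continue;
			for (j = 0; j < g->nr; j++) {
				if (g->calls[i][j] && g->tail_call_reachable[j]) {
					g->tail_call_reachable[i] = true;
					changed = true;
					break;
				}
			}
		}
	}
}
```

With a chain 0 -> 1 -> 2 where only subprog 2 contains a tail call, all three end up marked, while an unrelated subprog 3 called from 0 stays unmarked.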
      
      Tail call related cases from test_verifier kselftest are also working
      fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
      work properly as well.
      
      [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
      
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Limit caller's stack depth 256 for subprogs with tailcalls · 7f6e4312
      Maciej Fijalkowski authored
      
      
      Protect against a potential stack overflow that might happen when bpf2bpf
      calls get combined with tailcalls. Limit the caller's stack depth in
      such a case to 256 so that the worst case scenario results in an 8k
      stack size (tailcall limit of 32 * 256 = 8k).
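
The arithmetic behind the 8k figure can be written out as a short sketch. The macro and function names are made up for illustration, and whether a depth of exactly 256 is accepted is an assumption here, not something the commit message states.

```c
#include <stdbool.h>

#define TAIL_CALL_LIMIT		32	/* tail call limit quoted in the commit */
#define CALLER_STACK_LIMIT	256	/* per-caller limit described above */

/* Worst case: each tail call in the chain leaves one caller frame behind. */
static unsigned int worst_case_stack(unsigned int caller_depth)
{
	return TAIL_CALL_LIMIT * caller_depth;
}

/* Hypothetical check; the boundary condition is an assumption. */
static bool caller_depth_ok(unsigned int caller_depth)
{
	return caller_depth <= CALLER_STACK_LIMIT;
}
```

`worst_case_stack(256)` evaluates to 8192, which is the 8k bound from the commit message.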
      
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: rename poke descriptor's 'ip' member to 'tailcall_target' · cf71b174
      Maciej Fijalkowski authored
      
      
      Reflect the actual purpose of poke->ip and rename it to
      poke->tailcall_target so that it will not be confused with another
      poke target that will be introduced in the next commit.
      
      While at it, do the same thing with poke->ip_stable - rename it to
      poke->tailcall_target_stable.
      
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: propagate poke descriptors to subprograms · a748c697
      Maciej Fijalkowski authored
      
      
      Previously, there was no need for poke descriptors being present in
      subprogram's bpf_prog_aux struct since tailcalls were simply not allowed
      in them. Each subprog is JITed independently so in order to enable
      JITing subprograms that use tailcalls, do the following:
      
      - in fixup_bpf_calls() store the index of tailcall insn onto the generated
        poke descriptor,
      - in case when insn patching occurs, adjust the tailcall insn idx from
        bpf_patch_insn_data,
      - then in jit_subprogs() check whether the given poke descriptor belongs
        to the current subprog by checking if that previously stored absolute
        index of tail call insn is in the scope of the insns of given subprog,
      - update the insn->imm with new poke descriptor slot so that while JITing
        the proper poke descriptor will be grabbed
      
      This way each of the main program's poke descriptors is distributed
      across the subprograms' poke descriptor arrays, so the main program's
      descriptors can be untracked from the prog array map.
      
      Also add the subprog's aux struct to the BPF map poke_progs list by
      calling map_poke_track() on it.
      
      In case of any error, call the map_poke_untrack() on subprog's aux
      structs that have already been registered to prog array map.
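
The ownership test described in the list above (is the stored tail call instruction index inside a given subprog?) can be sketched as follows. The types and the helper `pick_pokes_for_subprog()` are illustrative stand-ins; the kernel's structures carry many more fields and jit_subprogs() does considerably more work.

```c
#include <stddef.h>

/* Illustrative types, not the kernel's definitions. */
struct poke_desc {
	unsigned int insn_idx;	/* absolute index of the tail call insn */
};

struct subprog_range {
	unsigned int start;	/* first insn of the subprog */
	unsigned int end;	/* one past the last insn */
};

/*
 * Copy into out[] the descriptors whose tail call instruction lies inside
 * the given subprog. Returns the number of descriptors copied; out[] must
 * have room for n entries.
 */
static size_t pick_pokes_for_subprog(const struct poke_desc *all, size_t n,
				     struct subprog_range sp,
				     struct poke_desc *out)
{
	size_t i, cnt = 0;

	for (i = 0; i < n; i++)
		if (all[i].insn_idx >= sp.start && all[i].insn_idx < sp.end)
			out[cnt++] = all[i];
	return cnt;
}
```

The index of a descriptor within `out[]` plays the role of the "new poke descriptor slot" that the commit writes back into insn->imm.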
      
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, x64: use %rcx instead of %rax for tail call retpolines · 0d4ddce3
      Maciej Fijalkowski authored
      
      
      Currently, %rax is used to store the jump target when BPF program is
      emitting the retpoline instructions that are handling the indirect
      tailcall.
      
      There is a plan to use %rax for different purpose, which is storing the
      tail call counter. In order to preserve this value across the tailcalls,
      adjust the BPF indirect tailcalls so that the target program will reside
      in %rcx and teach the retpoline instructions about new location of jump
      target.
      
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. Sep 16, 2020
    • selftests/bpf: Merge most of test_btf into test_progs · c64779e2
      Andrii Nakryiko authored
      
      
      Merge 183 tests from test_btf into test_progs framework to be exercised
      regularly. All the test_btf tests that were moved are modeled as proper
      sub-tests in test_progs framework for ease of debugging and reporting.
      
      No functional or behavioral changes were intended, I tried to preserve
      original behavior as much as possible. E.g., `test_progs -v` will activate
      "always_log" flag to emit BTF validation log.
      
      The only difference is in reducing the max_entries limit for pretty-printing
      tests from (128 * 1024) to just 128 to reduce tests running time without
      reducing the coverage.
      
      Example test run:
      
        $ sudo ./test_progs -n 8
        ...
        #8 btf:OK
        Summary: 1/183 PASSED, 0 SKIPPED, 0 FAILED
      
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200916004819.3767489-1-andriin@fb.com
    • Merge branch 'bpf_metadata' · ffa915f4
      Alexei Starovoitov authored
      
      
      Stanislav Fomichev says:
      
      ====================
      Currently, if a user wants to store arbitrary metadata for an eBPF
      program, for example, the program build commit hash or version, they
      could store it in a map, and conveniently libbpf uses .data section to
      populate an internal map. However, if the program does not actually
      reference the map, then the map would be de-refcounted and freed.
      
      This patch set introduces a new syscall BPF_PROG_BIND_MAP to add a map
      to a program's used_maps, even if the program instructions do not
      reference the map.
      
      libbpf is extended to always BPF_PROG_BIND_MAP .rodata section so the
      metadata is kept in place.
      bpftool is also extended to print metadata in the 'bpftool prog' list.
      
      The variable is considered metadata if it starts with the
      magic 'bpf_metadata_' prefix; everything after the prefix is the
      metadata name.
      
      An example use of this would be BPF C file declaring:
      
        volatile const char bpf_metadata_commit_hash[] SEC(".rodata") = "abcdef123456";
      
      and bpftool would emit:
      
        $ bpftool prog
        [...]
              metadata:
                      commit_hash = "abcdef123456"
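
The prefix rule can be illustrated with a small helper. `metadata_name()` is a hypothetical function written for this example and is not bpftool's code. The `strncmp` comparison and the `sizeof`-based BPF_METADATA_PREFIX_LEN follow the v5/v6 change notes below.

```c
#include <stddef.h>
#include <string.h>

#define BPF_METADATA_PREFIX "bpf_metadata_"
#define BPF_METADATA_PREFIX_LEN (sizeof(BPF_METADATA_PREFIX) - 1)

/*
 * Hypothetical helper: returns the metadata name that follows the prefix,
 * or NULL when the variable name does not start with the prefix.
 */
static const char *metadata_name(const char *var_name)
{
	if (strncmp(var_name, BPF_METADATA_PREFIX, BPF_METADATA_PREFIX_LEN) != 0)
		return NULL;
	return var_name + BPF_METADATA_PREFIX_LEN;
}
```

For the declaration above, `metadata_name("bpf_metadata_commit_hash")` yields `"commit_hash"`. A name that merely contains the prefix somewhere in the middle is rejected, which is the practical difference between a `strncmp` check and a `strstr` search.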
      
      v6 changes:
      * libbpf: drop FEAT_GLOBAL_DATA from probe_prog_bind_map (Andrii Nakryiko)
      * bpftool: combine find_metadata_map_id & find_metadata;
        drops extra bpf_map_get_fd_by_id and bpf_map_get_fd_by_id (Andrii Nakryiko)
      * bpftool: use strncmp instead of strstr (Andrii Nakryiko)
      * bpftool: memset(map_info) and extra empty line (Andrii Nakryiko)
      
      v5 changes:
      * selftest: verify that prog holds rodata (Andrii Nakryiko)
      * selftest: use volatile for metadata (Andrii Nakryiko)
      * bpftool: use sizeof in BPF_METADATA_PREFIX_LEN (Andrii Nakryiko)
      * bpftool: new find_metadata that does map lookup (Andrii Nakryiko)
      * libbpf: don't generalize probe_create_global_data (Andrii Nakryiko)
      * libbpf: use OPTS_VALID in bpf_prog_bind_map (Andrii Nakryiko)
      * libbpf: keep LIBBPF_0.2.0 sorted (Andrii Nakryiko)
      
      v4 changes:
      * Don't return EEXIST from syscall if already bound (Andrii Nakryiko)
      * Removed --metadata argument (Andrii Nakryiko)
      * Removed custom .metadata section (Alexei Starovoitov)
      * Addressed Andrii's suggestions about btf helpers and vsi (Andrii Nakryiko)
      * Moved bpf_prog_find_metadata into bpftool (Alexei Starovoitov)
      
      v3 changes:
      * API changes for bpf_prog_find_metadata (Toke Høiland-Jørgensen)
      
      v2 changes:
      * Made struct bpf_prog_bind_opts in libbpf so flags is optional.
      * Deduped probe_kern_global_data and probe_prog_bind_map into a common
        helper.
      * Added comment regarding why EEXIST is ignored in libbpf bind map.
      * Froze all LIBBPF_MAP_METADATA internal maps.
      * Moved bpf_prog_bind_map into new LIBBPF_0.1.1 in libbpf.map.
      * Added p_err() calls on error cases in bpftool show_prog_metadata.
      * Reverse christmas tree coding style in bpftool show_prog_metadata.
      * Made bpftool gen skeleton recognize .metadata as an internal map and
        generate datasec definition in skeleton.
      * Added C test using skeleton to assert that the metadata is what we
        expect and rebinding causes EEXIST.
      
      v1 changes:
      * Fixed a few missing unlocks, and missing close while iterating map fds.
      * Move mutex initialization to right after prog aux allocation, and mutex
        destroy to right after prog aux free.
      * s/ADD_MAP/BIND_MAP/
      * Use mutex only instead of RCU to protect the used_map array & count.
      
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      ====================
      
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test load and dump metadata with btftool and skel · d42d1cc4
      YiFei Zhu authored
      
      
      This is a simple test to check that loading and dumping metadata
      in bpftool works, whether or not metadata contents are used by the
      program.
      
      A C test is also added to make sure the skeleton code can read the
      metadata values.
      
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      Link: https://lore.kernel.org/bpf/20200915234543.3220146-6-sdf@google.com
    • bpftool: Support dumping metadata · aff52e68
      YiFei Zhu authored
      
      
      Dump metadata in the 'bpftool prog' list if it's present.
      For formatting, some BTF code is put directly in the metadata
      dumping. Sanity checks on the map and the kind of the btf_type
      make sure we are actually dumping what we are expecting.
      
      A helper jsonw_reset is added to json writer so we can reuse the same
      json writer without having extraneous commas.
      
      Sample output:
      
        $ bpftool prog
        6: cgroup_skb  name prog  tag bcf7977d3b93787c  gpl
        [...]
        	btf_id 4
        	metadata:
        		a = "foo"
        		b = 1
      
        $ bpftool prog --json --pretty
        [{
                "id": 6,
        [...]
                "btf_id": 4,
                "metadata": {
                    "a": "foo",
                    "b": 1
                }
            }
        ]
      
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      Link: https://lore.kernel.org/bpf/20200915234543.3220146-5-sdf@google.com
    • libbpf: Add BPF_PROG_BIND_MAP syscall and use it on .rodata section · 5d23328d
      YiFei Zhu authored
      
      
      The patch adds a simple wrapper bpf_prog_bind_map around the syscall.
      When libbpf loads a program, it probes the kernel for support of
      this syscall and unconditionally binds the .rodata section to the
      program.
      
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      Link: https://lore.kernel.org/bpf/20200915234543.3220146-4-sdf@google.com
    • bpf: Add BPF_PROG_BIND_MAP syscall · ef15314a
      YiFei Zhu authored
      
      
      This syscall binds a map to a program. Returns success if the map is
      already bound to the program.
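
The "success if already bound" behaviour amounts to an idempotent insert into the program's used_maps array. The sketch below models only that property in user space. `struct prog_model`, `prog_bind_map_sketch()`, the array bound and the -E2BIG return value are all assumptions made for illustration, not the kernel implementation.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_USED_MAPS 4	/* small illustrative bound */

/* Toy stand-in for the program's used_maps array and its count. */
struct prog_model {
	const void *used_maps[MAX_USED_MAPS];
	size_t used_map_cnt;
};

/* Returns 0 on success, including "already bound"; -E2BIG when full. */
static int prog_bind_map_sketch(struct prog_model *prog, const void *map)
{
	size_t i;

	for (i = 0; i < prog->used_map_cnt; i++)
		if (prog->used_maps[i] == map)
			return 0;	/* binding twice is not an error */

	if (prog->used_map_cnt == MAX_USED_MAPS)
		return -E2BIG;

	prog->used_maps[prog->used_map_cnt++] = map;
	return 0;
}
```

Binding the same map twice leaves `used_map_cnt` unchanged, so a caller does not need to track whether an earlier bind already happened.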
      
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      Link: https://lore.kernel.org/bpf/20200915234543.3220146-3-sdf@google.com
    • bpf: Mutex protect used_maps array and count · 984fe94f
      YiFei Zhu authored
      
      
      To support modifying the used_maps array, we use a mutex to protect
      the use of the counter and the array. The mutex is initialized right
      after the prog aux is allocated, and destroyed right before prog
      aux is freed. This way we guarantee it's initialized for both cBPF
      and eBPF.
      
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      Link: https://lore.kernel.org/bpf/20200915234543.3220146-2-sdf@google.com
  3. Sep 15, 2020
    • libbpf: Fix a compilation error with xsk.c for ubuntu 16.04 · d317b0a8
      Yonghong Song authored
      When syncing latest libbpf repo to bcc, ubuntu 16.04 (4.4.0 LTS kernel)
      failed compilation for xsk.c:
        In file included from /tmp/debuild.0jkauG/bcc/src/cc/libbpf/src/xsk.c:23:0:
        /tmp/debuild.0jkauG/bcc/src/cc/libbpf/src/xsk.c: In function ‘xsk_get_ctx’:
        /tmp/debuild.0jkauG/bcc/src/cc/libbpf/include/linux/list.h:81:9: warning: implicit
        declaration of function ‘container_of’ [-Wimplicit-function-declaration]
                 container_of(ptr, type, member)
                 ^
        /tmp/debuild.0jkauG/bcc/src/cc/libbpf/include/linux/list.h:83:9: note: in expansion
        of macro ‘list_entry’
                 list_entry((ptr)->next, type, member)
        ...
        src/cc/CMakeFiles/bpf-static.dir/build.make:209: recipe for target
        'src/cc/CMakeFiles/bpf-static.dir/libbpf/src/xsk.c.o' failed
      
      Commit 2f6324a3 ("libbpf: Support shared umems between queues and devices")
      added include file <linux/list.h>, which uses macro "container_of".
      xsk.c file also includes <linux/ethtool.h> before <linux/list.h>.
      
      In a more recent distro kernel, <linux/ethtool.h> includes <linux/kernel.h>
      which contains the macro definition for "container_of". So compilation is all fine.
      But in ubuntu 16.04 kernel, <linux/ethtool.h> does not include <linux/kernel.h>
      which caused the above compilation error.
      
      Let's explicitly include <linux/kernel.h> in xsk.c to avoid the compilation
      error on old distros.
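
For readers unfamiliar with the macro, the sketch below shows a simplified `container_of` built on `offsetof`, without the type checking that kernel headers add. The `struct item`, `struct list_node` and `item_from_node()` names are made up for this example.

```c
#include <stddef.h>

/* Simplified container_of; kernel headers add type checking on top. */
#ifndef container_of
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#endif

/* Minimal intrusive list node embedded in a larger structure. */
struct list_node {
	struct list_node *next;
};

struct item {
	int value;
	struct list_node node;
};

/* Recover the enclosing item from a pointer to its embedded node. */
static struct item *item_from_node(struct list_node *n)
{
	return container_of(n, struct item, node);
}
```

When no header provides the macro, a call such as `container_of(ptr, type, member)` is parsed as an implicit function declaration, which is the warning shown in the build log above.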
      
      Fixes: 2f6324a3 ("libbpf: Support shared umems between queues and devices")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200914223210.1831262-1-yhs@fb.com
    • bpftool: Fix build failure · 63bea244
      Yonghong Song authored
      When building bpf selftests like
        make -C tools/testing/selftests/bpf -j20
      I hit the following errors:
        ...
        GEN      /net-next/tools/testing/selftests/bpf/tools/build/bpftool/Documentation/bpftool-gen.8
        <stdin>:75: (WARNING/2) Block quote ends without a blank line; unexpected unindent.
        <stdin>:71: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        <stdin>:85: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        <stdin>:57: (WARNING/2) Block quote ends without a blank line; unexpected unindent.
        <stdin>:66: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        <stdin>:109: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        <stdin>:175: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        <stdin>:273: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        make[1]: *** [/net-next/tools/testing/selftests/bpf/tools/build/bpftool/Documentation/bpftool-perf.8] Error 12
        make[1]: *** Waiting for unfinished jobs....
        make[1]: *** [/net-next/tools/testing/selftests/bpf/tools/build/bpftool/Documentation/bpftool-iter.8] Error 12
        make[1]: *** [/net-next/tools/testing/selftests/bpf/tools/build/bpftool/Documentation/bpftool-struct_ops.8] Error 12
        ...
      
      I am using:
        -bash-4.4$ rst2man --version
        rst2man (Docutils 0.11 [repository], Python 2.7.5, on linux2)
        -bash-4.4$
      
      The Makefile generated final .rst file (e.g., bpftool-cgroup.rst) looks like
        ...
            ID       AttachType      AttachFlags     Name
        \n SEE ALSO\n========\n\t**bpf**\ (2),\n\t**bpf-helpers**\
        (7),\n\t**bpftool**\ (8),\n\t**bpftool-btf**\
        (8),\n\t**bpftool-feature**\ (8),\n\t**bpftool-gen**\
        (8),\n\t**bpftool-iter**\ (8),\n\t**bpftool-link**\
        (8),\n\t**bpftool-map**\ (8),\n\t**bpftool-net**\
        (8),\n\t**bpftool-perf**\ (8),\n\t**bpftool-prog**\
        (8),\n\t**bpftool-struct_ops**\ (8)\n
      
      The rst2man generated .8 file looks like
      Literal block ends without a blank line; unexpected unindent.
       .sp
       n SEEALSOn========nt**bpf**(2),nt**bpf\-helpers**(7),nt**bpftool**(8),nt**bpftool\-btf**(8),nt**
       bpftool\-feature**(8),nt**bpftool\-gen**(8),nt**bpftool\-iter**(8),nt**bpftool\-link**(8),nt**
       bpftool\-map**(8),nt**bpftool\-net**(8),nt**bpftool\-perf**(8),nt**bpftool\-prog**(8),nt**
       bpftool\-struct_ops**(8)n
      
      It looks like that particular version of rst2man prefers to have actual
      newlines instead of \n.
      
      Since `echo -e` may not be available in some environments, let us use `printf`.
      Format string "%b" is used for `printf` to ensure all escape characters are
      interpreted properly.
      
      Fixes: 18841da9 ("tools: bpftool: Automate generation for "SEE ALSO" sections in man pages")
      Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200914183110.999906-1-yhs@fb.com
    • xsk: Fix refcount warning in xp_dma_map · bf74a370
      Magnus Karlsson authored
      Fix a potential refcount warning that a zero value is increased to one
      in xp_dma_map, by initializing the refcount to one to start with,
      instead of zero plus a refcount_inc().
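
The difference between the two initialization patterns can be shown with a toy counter that flags an increment from zero. `struct toy_refcount` and its helpers are invented for this sketch and only model the check described above; they are not the kernel's refcount_t API.

```c
#include <stdbool.h>

/* Toy refcount that only models the "increment from zero" check. */
struct toy_refcount {
	unsigned int refs;
	bool warned;
};

static void toy_refcount_set(struct toy_refcount *r, unsigned int n)
{
	r->refs = n;
	r->warned = false;
}

static void toy_refcount_inc(struct toy_refcount *r)
{
	if (r->refs == 0)
		r->warned = true;	/* zero normally means "already released" */
	r->refs++;
}
```

Setting the counter to zero and then incrementing it trips the check, while setting it to one directly reaches the same count without a warning.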
      
      Fixes: 921b6869 ("xsk: Enable sharing of dma mappings")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/1600095036-23868-1-git-send-email-magnus.karlsson@gmail.com
    • samples/bpf: Add quiet option to xdpsock · 74e00676
      Magnus Karlsson authored
      
      
      Add a quiet option (-Q) that disables the statistics printouts of
      xdpsock. This is good to have when measuring 0% loss rate performance,
      as the results will be quite terrible if the application uses printfs.
      
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1599726666-8431-4-git-send-email-magnus.karlsson@gmail.com
    • samples/bpf: Fix possible deadlock in xdpsock · 5a2a0dd8
      Magnus Karlsson authored
      
      
      Fix a possible deadlock in the l2fwd application in xdpsock that can
      occur when there is no space in the Tx ring. There are two ways to get
      the kernel to consume entries in the Tx ring: calling sendto() to make
      it send packets and freeing entries from the completion ring, as the
      kernel will not send a packet if there is no space for it to add a
      completion entry in the completion ring. The Tx loop in l2fwd only
      used to call sendto(). This patch adds cleaning the completion ring
      in that loop.
      
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1599726666-8431-3-git-send-email-magnus.karlsson@gmail.com
    • samples/bpf: Fix one packet sending in xdpsock · 3131cf66
      Magnus Karlsson authored
      
      
      Fix the sending of a single packet (or small burst) in xdpsock when
      executing in copy mode. Currently, the l2fwd application in xdpsock
      only transmits the packets after a batch of them has been received,
      which might be confusing if you only send one packet and expect that
      it is returned pronto. Fix this by calling sendto() more often and add
      a comment in the code that states that this can be optimized if
      needed.
      
      Reported-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1599726666-8431-2-git-send-email-magnus.karlsson@gmail.com
    • s390/bpf: Fix multiple tail calls · d72714c1
      Ilya Leoshkevich authored
      In order to branch around tail calls (due to out-of-bounds index,
      exceeding tail call count or missing tail call target), JIT uses
      label[0] field, which contains the address of the instruction following
      the tail call. When there are multiple tail calls, label[0] value comes
      from handling of a previous tail call, which is incorrect.
      
      Fix by getting rid of label array and resolving the label address
      locally: for all 3 branches that jump to it, emit 0 offsets at the
      beginning, and then backpatch them with the correct value.
      
      Also, do not use the long jump infrastructure: the tail call sequence
      is known to be short, so make all 3 jumps short.
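
The "emit zero offsets, then backpatch" approach can be sketched with a toy code buffer. The encoding below is made up (one 16-bit slot per instruction, with a branch slot holding a relative offset) and has nothing to do with real s390 opcodes. `struct toy_jit` and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy code buffer: each "instruction" occupies one 16-bit slot. */
struct toy_jit {
	int16_t code[32];
	size_t len;
};

/* Emit a branch with a zero offset and return its slot for later patching. */
static size_t emit_branch_placeholder(struct toy_jit *jit)
{
	jit->code[jit->len] = 0;
	return jit->len++;
}

/* Emit any non-branch instruction; its value is opaque in this model. */
static void emit_other(struct toy_jit *jit, int16_t insn)
{
	jit->code[jit->len++] = insn;
}

/* Point every recorded branch at target, relative to the branch itself. */
static void backpatch(struct toy_jit *jit, const size_t *slots, size_t n,
		      size_t target)
{
	size_t i;

	for (i = 0; i < n; i++)
		jit->code[slots[i]] = (int16_t)(target - slots[i]);
}
```

Because the slots are recorded per tail call sequence and patched as soon as the end of that sequence is known, a second tail call in the same program cannot pick up a stale target from the first one.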
      
      Fixes: 6651ee07 ("s390/bpf: implement bpf_tail_call() helper")
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200909232141.3099367-1-iii@linux.ibm.com
  4. Sep 11, 2020
    • Merge branch 'improve-bpf-tcp-cc-init' · 2bab48c5
      Alexei Starovoitov authored
      
      
      Neal Cardwell says:
      
      ====================
      This patch series reorganizes TCP congestion control initialization so that if
      EBPF code called by tcp_init_transfer() sets the congestion control algorithm
      by calling setsockopt(TCP_CONGESTION) then the TCP stack initializes the
      congestion control module immediately, instead of having tcp_init_transfer()
      later initialize the congestion control module.
      
      This increases flexibility for the EBPF code that runs at connection
      establishment time, and simplifies the code.
      
      This has the following benefits:
      
      (1) This allows CC module customizations made by the EBPF called in
          tcp_init_transfer() to persist, and not be wiped out by a later
          call to tcp_init_congestion_control() in tcp_init_transfer().
      
      (2) Does not flip the order of EBPF and CC init, to avoid causing bugs
          for existing code upstream that depends on the current order.
      
      (3) Does not cause 2 initializations for CC in the case where the
          EBPF called in tcp_init_transfer() wants to set the CC to a new CC
          algorithm.
      
      (4) Allows follow-on simplifications to the code in net/core/filter.c
          and net/ipv4/tcp_cong.c, which currently both have some complexity
          to special-case CC initialization to avoid double CC
          initialization if EBPF sets the CC.
      
      changes in v2:
      
      o rebase onto bpf-next
      
      o add another follow-on simplification suggested by Martin KaFai Lau:
         "tcp: simplify tcp_set_congestion_control() load=false case"
      
      changes in v3:
      
      o no change in commits
      
      o resent patch series from @gmail.com, since mail from ncardwell@google.com
        stopped being accepted at netdev@vger.kernel.org mid-way through processing
        the v2 patch series (between patches 2 and 3), confusing patchwork about
        which patches belonged to the v2 patch series
      ====================
      
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      2bab48c5
    • Neal Cardwell's avatar
      tcp: Simplify tcp_set_congestion_control() load=false case · 5050bef8
      Neal Cardwell authored
      Simplify tcp_set_congestion_control() by removing the initialization
      code path for the !load case.
      
      There are only two call sites for tcp_set_congestion_control(). The
      EBPF call site is the only one that passes load=false; it also passes
      cap_net_admin=true. Because of that, the exact same behavior can be
      achieved by removing the special if (!load) branch of the logic. Both
      before and after this commit, the EBPF case will call
      bpf_try_module_get(), and if that succeeds then call
      tcp_reinit_congestion_control() or if that fails then return EBUSY.
      
      Note that this returns the logic to a structure very similar to the
      structure before:
        commit 91b5b21c ("bpf: Add support for changing congestion control")
      except that the CAP_NET_ADMIN status is passed in as a function
      argument.
      
      This clean-up was suggested by Martin KaFai Lau.
      
      Suggested-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Kevin Yang <yyd@google.com>
      5050bef8
    • Neal Cardwell's avatar
      tcp: simplify _bpf_setsockopt(): Remove flags argument · 5cdc744c
      Neal Cardwell authored
      
      
      Now that the previous patches have removed the code that uses the
      flags argument to _bpf_setsockopt(), we can remove that argument.
      
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Kevin Yang <yyd@google.com>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      5cdc744c
    • Neal Cardwell's avatar
      tcp: simplify tcp_set_congestion_control(): Always reinitialize · 29a94932
      Neal Cardwell authored
      
      
      Now that the previous patches ensure that all call sites for
      tcp_set_congestion_control() want to initialize congestion control, we
      can simplify tcp_set_congestion_control() by removing the reinit
      argument and the code to support it.
      
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Kevin Yang <yyd@google.com>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      29a94932
    • Neal Cardwell's avatar
      tcp: Simplify EBPF TCP_CONGESTION to always init CC · e7b10a4d
      Neal Cardwell authored
      
      
      Now that the previous patch ensures we don't initialize the congestion
      control twice, when EBPF sets the congestion control algorithm at
      connection establishment we can simplify the code by simply
      initializing the congestion control module at that time.
      
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Kevin Yang <yyd@google.com>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      e7b10a4d
    • Neal Cardwell's avatar
      tcp: Only init congestion control if not initialized already · 8919a9b3
      Neal Cardwell authored
      
      
      Change tcp_init_transfer() to only initialize congestion control if it
      has not been initialized already.
      
      With this new approach, we can arrange things so that if the EBPF code
      sets the congestion control by calling setsockopt(TCP_CONGESTION) then
      tcp_init_transfer() will not re-initialize the CC module.
      
      This is an approach that has the following beneficial properties:
      
      (1) This allows CC module customizations made by the EBPF called in
          tcp_init_transfer() to persist, and not be wiped out by a later
          call to tcp_init_congestion_control() in tcp_init_transfer().
      
      (2) Does not flip the order of EBPF and CC init, to avoid causing bugs
          for existing code upstream that depends on the current order.
      
      (3) Does not cause 2 initializations for CC in the case where the
          EBPF called in tcp_init_transfer() wants to set the CC to a new CC
          algorithm.
      
      (4) Allows follow-on simplifications to the code in net/core/filter.c
          and net/ipv4/tcp_cong.c, which currently both have some complexity
          to special-case CC initialization to avoid double CC
          initialization if EBPF sets the CC.
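
      The "initialize only if not initialized already" idea can be
      illustrated with a small userspace model. Everything below is a
      hypothetical stand-in (the toy_sock struct, the ca_initialized flag
      and the toy_* functions are not kernel code); it only shows how a
      per-socket flag lets the later init step become a no-op when the BPF
      hook has already set and initialized the congestion control.

```c
#include <stdbool.h>

/* Hypothetical model of the per-socket congestion control state. */
struct toy_sock {
	bool ca_initialized;  /* assumed flag: has CC init already run? */
	int ca_init_calls;    /* counts init invocations, for illustration */
	const char *ca_name;  /* currently selected CC algorithm */
};

/* Stand-in for the CC module's init hook. */
static void toy_init_congestion_control(struct toy_sock *sk)
{
	sk->ca_init_calls++;
	sk->ca_initialized = true;
}

/* Stand-in for the BPF program calling setsockopt(TCP_CONGESTION):
 * the new CC is selected and initialized immediately. */
static void toy_bpf_set_congestion(struct toy_sock *sk, const char *name)
{
	sk->ca_name = name;
	toy_init_congestion_control(sk);
}

/* Stand-in for tcp_init_transfer(): run the BPF hook first (same order
 * as before), then initialize CC only if nobody has done so yet. */
static void toy_init_transfer(struct toy_sock *sk, bool bpf_sets_cc)
{
	if (bpf_sets_cc)
		toy_bpf_set_congestion(sk, "bpf_chosen_cc");

	if (!sk->ca_initialized)
		toy_init_congestion_control(sk);
}
```

      In both cases (BPF sets the CC, or it does not) the init hook runs
      exactly once, which is property (3) above.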
      
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Kevin Yang <yyd@google.com>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      8919a9b3
    • Quentin Monnet's avatar
      tools: bpftool: Automate generation for "SEE ALSO" sections in man pages · 18841da9
      Quentin Monnet authored
      
      
      The "SEE ALSO" sections of bpftool's manual pages refer to bpf(2),
      bpf-helpers(7), then all existing bpftool man pages (save the current
      one).
      
      This leads to nearly-identical lists being duplicated in all manual
      pages. Ideally, when a new page is created, all lists should be updated
      accordingly, but this has led to omissions and inconsistencies multiple
      times in the past.
      
      Let's take it out of the RST files and generate the "SEE ALSO" sections
      automatically in the Makefile when generating the man pages. The lists
      are not really useful in the RST anyway because all other pages are
      available in the same directory.
      
      v3:
      - Fix conflict with a previous patchset that introduced RST2MAN_OPTS
        variable passed to rst2man.
      
      v2:
      - Use "echo -n" instead of "printf" in Makefile, to avoid any risk of
        passing a format string directly to the command.
      
      Signed-off-by: Quentin Monnet <quentin@isovalent.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200910203935.25304-1-quentin@isovalent.com
      18841da9
    • Song Liu's avatar
      bpf: Fix comment for helper bpf_current_task_under_cgroup() · 1aef5b43
      Song Liu authored
      This should be "current" not "skb".
      
      Fixes: c6b5fb86 ("bpf: add documentation for eBPF helpers (42-50)")
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/bpf/20200910203314.70018-1-songliubraving@fb.com
      1aef5b43
    • Yonghong Song's avatar
      selftests/bpf: Define string const as global for test_sysctl_prog.c · 6e057fc1
      Yonghong Song authored
      When tweaking llvm optimizations, I found that selftest build failed
      with the following error:
        libbpf: elf: skipping unrecognized data section(6) .rodata.str1.1
        libbpf: prog 'sysctl_tcp_mem': bad map relo against '.L__const.is_tcp_mem.tcp_mem_name'
                in section '.rodata.str1.1'
        Error: failed to open BPF object file: Relocation failed
        make: *** [/work/net-next/tools/testing/selftests/bpf/test_sysctl_prog.skel.h] Error 255
        make: *** Deleting file `/work/net-next/tools/testing/selftests/bpf/test_sysctl_prog.skel.h'
      
      The local string constant "tcp_mem_name" is put into '.rodata.str1.1' section
      which libbpf cannot handle. Using untweaked upstream llvm, "tcp_mem_name"
      is completely inlined after loop unrolling.
      
      Commit 7fb5eefd ("selftests/bpf: Fix test_sysctl_loop{1, 2}
      failure due to clang change") solved a similar problem by defining
      the string const as a global. Let us do the same here
      for test_sysctl_prog.c so it can weather future potential llvm changes.
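
      A minimal sketch of the pattern, assuming a simplified is_tcp_mem()
      helper (the signature and the string value here are illustrative, not
      copied from the selftest). The string is declared at file scope
      instead of inside the function, so the compiler has no local constant
      to place in a mergeable string section.

```c
#include <stddef.h>

/* File-scope constant instead of a function-local one. */
const char tcp_mem_name[] = "net/ipv4/tcp_mem";

/* Returns 1 if name (name_len bytes, including the trailing NUL)
 * matches tcp_mem_name exactly, 0 otherwise. */
static int is_tcp_mem(const char *name, size_t name_len)
{
	size_t i;

	if (name_len < sizeof(tcp_mem_name))
		return 0;

	for (i = 0; i < sizeof(tcp_mem_name); ++i)
		if (name[i] != tcp_mem_name[i])
			return 0;

	return 1;
}
```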
      
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200910202718.956042-1-yhs@fb.com
      6e057fc1
    • Ilya Leoshkevich's avatar
      selftests/bpf: Fix test_ksyms on non-SMP kernels · 90a1deda
      Ilya Leoshkevich authored
      
      
      On non-SMP kernels __per_cpu_start is not 0, so look it up in kallsyms.
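
      Looking a symbol up in kallsyms amounts to scanning
      "address type name" lines for an exact name match. The helper below
      is a hypothetical sketch that works on an in-memory copy of that
      text; find_ksym() is not the selftest's actual function, and reading
      /proc/kallsyms itself is left out.

```c
#include <stdio.h>
#include <string.h>

/* Scan kallsyms-formatted text ("<hex addr> <type> <name>" per line)
 * for sym. On success store the address in *addr and return 0;
 * return -1 if the symbol is not present. */
static int find_ksym(const char *kallsyms, const char *sym,
		     unsigned long long *addr)
{
	const char *line = kallsyms;
	unsigned long long value;
	char name[256];
	char type;

	while (line && *line) {
		if (sscanf(line, "%llx %c %255s", &value, &type, name) == 3 &&
		    strcmp(name, sym) == 0) {
			*addr = value;
			return 0;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

      The test can then compare the value reported for __per_cpu_start
      against whatever the lookup returns, rather than assuming 0.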
      
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200910171336.3161995-1-iii@linux.ibm.com
      90a1deda
    • Lorenz Bauer's avatar
      bpf: Plug hole in struct bpf_sk_lookup_kern · d66423fb
      Lorenz Bauer authored
      As Alexei points out, struct bpf_sk_lookup_kern has two 4-byte holes.
      This leads to suboptimal instructions being generated (IPv4, x86):
      
          1372                    struct bpf_sk_lookup_kern ctx = {
             0xffffffff81b87f30 <+624>:   xor    %eax,%eax
             0xffffffff81b87f32 <+626>:   mov    $0x6,%ecx
             0xffffffff81b87f37 <+631>:   lea    0x90(%rsp),%rdi
             0xffffffff81b87f3f <+639>:   movl   $0x110002,0x88(%rsp)
             0xffffffff81b87f4a <+650>:   rep stos %rax,%es:(%rdi)
             0xffffffff81b87f4d <+653>:   mov    0x8(%rsp),%eax
             0xffffffff81b87f51 <+657>:   mov    %r13d,0x90(%rsp)
             0xffffffff81b87f59 <+665>:   incl   %gs:0x7e4970a0(%rip)
             0xffffffff81b87f60 <+672>:   mov    %eax,0x8c(%rsp)
             0xffffffff81b87f67 <+679>:   movzwl 0x10(%rsp),%eax
             0xffffffff81b87f6c <+684>:   mov    %ax,0xa8(%rsp)
             0xffffffff81b87f74 <+692>:   movzwl 0x38(%rsp),%eax
             0xffffffff81b87f79 <+697>:   mov    %ax,0xaa(%rsp)
      
      Fix this by moving around sport and dport. pahole confirms there
      are no more holes:
      
          struct bpf_sk_lookup_kern {
              u16                        family;       /*     0     2 */
              u16                        protocol;     /*     2     2 */
              __be16                     sport;        /*     4     2 */
              u16                        dport;        /*     6     2 */
              struct {
                      __be32             saddr;        /*     8     4 */
                      __be32             daddr;        /*    12     4 */
              } v4;                                    /*     8     8 */
              struct {
                      const struct in6_addr  * saddr;  /*    16     8 */
                      const struct in6_addr  * daddr;  /*    24     8 */
              } v6;                                    /*    16    16 */
              struct sock *              selected_sk;  /*    32     8 */
              bool                       no_reuseport; /*    40     1 */
      
              /* size: 48, cachelines: 1, members: 8 */
              /* padding: 7 */
              /* last cacheline: 48 bytes */
          };
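
      The effect of the reordering can be reproduced in userspace with
      offsetof(). The "after" struct follows the pahole output above; the
      "before" ordering (ports placed after the v6 pointers) is an
      assumption chosen to produce two 4-byte holes, and the kernel types
      are replaced by fixed-width stand-ins. The numbers hold on a typical
      LP64 target with 8-byte pointers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* End offset of a member: where the next member could start hole-free. */
#define MEMBER_END(type, m) (offsetof(type, m) + sizeof(((type *)0)->m))

/* Assumed "before" ordering, with the ports after the v6 pointers. */
struct lookup_before {
	uint16_t family;
	uint16_t protocol;
	struct { uint32_t saddr, daddr; } v4;
	struct { const void *saddr, *daddr; } v6;
	uint16_t sport;
	uint16_t dport;
	void *selected_sk;
	bool no_reuseport;
};

/* "After" ordering, matching the pahole listing. */
struct lookup_after {
	uint16_t family;
	uint16_t protocol;
	uint16_t sport;
	uint16_t dport;
	struct { uint32_t saddr, daddr; } v4;
	struct { const void *saddr, *daddr; } v6;
	void *selected_sk;
	bool no_reuseport;
};

/* Sum of the gaps in front of the 8-byte aligned members. */
static size_t hole_bytes_before(void)
{
	return (offsetof(struct lookup_before, v6) -
		MEMBER_END(struct lookup_before, v4)) +
	       (offsetof(struct lookup_before, selected_sk) -
		MEMBER_END(struct lookup_before, dport));
}

/* Same computation for the reordered struct. */
static size_t hole_bytes_after(void)
{
	return (offsetof(struct lookup_after, v4) -
		MEMBER_END(struct lookup_after, dport)) +
	       (offsetof(struct lookup_after, v6) -
		MEMBER_END(struct lookup_after, v4)) +
	       (offsetof(struct lookup_after, selected_sk) -
		MEMBER_END(struct lookup_after, v6));
}
```

      Only the trailing padding after no_reuseport remains in the
      reordered struct.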
      
      The assembly also doesn't contain the pesky rep stos anymore:
      
          1372                    struct bpf_sk_lookup_kern ctx = {
             0xffffffff81b87f60 <+624>:   movzwl 0x10(%rsp),%eax
             0xffffffff81b87f65 <+629>:   movq   $0x0,0xa8(%rsp)
             0xffffffff81b87f71 <+641>:   movq   $0x0,0xb0(%rsp)
             0xffffffff81b87f7d <+653>:   mov    %ax,0x9c(%rsp)
             0xffffffff81b87f85 <+661>:   movzwl 0x38(%rsp),%eax
             0xffffffff81b87f8a <+666>:   movq   $0x0,0xb8(%rsp)
             0xffffffff81b87f96 <+678>:   mov    %ax,0x9e(%rsp)
             0xffffffff81b87f9e <+686>:   mov    0x8(%rsp),%eax
             0xffffffff81b87fa2 <+690>:   movq   $0x0,0xc0(%rsp)
             0xffffffff81b87fae <+702>:   movl   $0x110002,0x98(%rsp)
             0xffffffff81b87fb9 <+713>:   mov    %eax,0xa0(%rsp)
             0xffffffff81b87fc0 <+720>:   mov    %r13d,0xa4(%rsp)
      
      1: https://lore.kernel.org/bpf/CAADnVQKE6y9h2fwX6OS837v-Uf+aBXnT_JXiN_bbo2gitZQ3tA@mail.gmail.com/
      
      Fixes: e9ddbb77 ("bpf: Introduce SK_LOOKUP program type with a dedicated attach point")
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20200910110248.198326-1-lmb@cloudflare.com
      d66423fb
    • Quentin Monnet's avatar
      tools: bpftool: Add "inner_map" to "bpftool map create" outer maps · e3b9626f
      Quentin Monnet authored
      
      
      There is no support for creating maps of types array-of-map or
      hash-of-map in bpftool. This is because the kernel needs an inner_map_fd
      to collect metadata on the inner maps to be supported by the new map,
      but bpftool does not provide a way to pass this file descriptor.
      
      Add a new optional "inner_map" keyword that can be used to pass a
      reference to a map, retrieve a fd to that map, and pass it as the
      inner_map_fd.
      
      Add related documentation and bash completion. Note that we can
      reference the inner map by its name, meaning we can have several times
      the keyword "name" with different meanings (mandatory outer map name,
      and possibly a name to use to find the inner_map_fd). The bash
      completion will offer it just once, and will not suggest "name" on the
      following command:
      
          # bpftool map create /sys/fs/bpf/my_outer_map type hash_of_maps \
              inner_map name my_inner_map [TAB]
      
      Fixing that specific case seems too convoluted. Completion will work as
      expected, however, if the outer map name comes first and the "inner_map
      name ..." is passed second.
      
      Signed-off-by: Quentin Monnet <quentin@isovalent.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200910102652.10509-4-quentin@isovalent.com
      e3b9626f
    • Quentin Monnet's avatar
      tools: bpftool: Keep errors for map-of-map dumps if distinct from ENOENT · 86233ce3
      Quentin Monnet authored
      
      
      When dumping outer maps or prog_array maps, and on lookup failure,
      bpftool simply skips the entry with no error message. This is because
      the kernel returns non-zero when no value is found for the provided key,
      which frequently happens for those maps if they have not been filled.
      
      When such a case occurs, errno is set to ENOENT. It seems unlikely we
      could receive other error codes at this stage (we successfully retrieved
      map info just before), but to be on the safe side, let's skip the entry
      only if errno was ENOENT, and not for the other errors.
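
      The resulting decision can be summarized in a few lines. The enum and
      the classify_lookup() helper are hypothetical names used only to
      illustrate the rule; they are not bpftool's actual code.

```c
#include <errno.h>

/* What to do with a map entry after attempting a lookup. */
enum dump_action {
	DUMP_ENTRY,    /* lookup succeeded: print the value */
	SKIP_ENTRY,    /* no value for this key: skip silently */
	REPORT_ERROR,  /* anything else: keep the error visible */
};

/* ret is the lookup's return value, err the errno it left behind. */
static enum dump_action classify_lookup(int ret, int err)
{
	if (ret == 0)
		return DUMP_ENTRY;
	if (err == ENOENT)
		return SKIP_ENTRY;
	return REPORT_ERROR;
}
```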
      
      v3: New patch
      
      Signed-off-by: Quentin Monnet <quentin@isovalent.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200910102652.10509-3-quentin@isovalent.com
      86233ce3
    • Quentin Monnet's avatar
      tools: bpftool: Clean up function to dump map entry · a20693b6
      Quentin Monnet authored
      
      
      The function used to dump a map entry in bpftool is a bit difficult to
      follow, as a consequence of earlier refactorings. There is a variable
      ("num_elems") which does not appear to be necessary, and the error
      handling would look cleaner if moved to its own function. Let's clean it
      up. No functional change.
      
      v2:
      - v1 was erroneously removing the check on fd maps in an attempt to get
        support for outer map dumps. This is already working. Instead, v2
        focuses on cleaning up the dump_map_elem() function, to avoid
        similar confusion in the future.
      
      Signed-off-by: Quentin Monnet <quentin@isovalent.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200910102652.10509-2-quentin@isovalent.com
      a20693b6
    • Lorenz Bauer's avatar
      selftests: bpf: Test iterating a sockmap · 2f7de986
      Lorenz Bauer authored
      
      
      Add a test that exercises a basic sockmap / sockhash iteration. For
      now we simply count the number of elements seen. Once sockmap update
      from iterators works we can extend this to perform a full copy.
      
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200909162712.221874-4-lmb@cloudflare.com
      2f7de986
    • Lorenz Bauer's avatar
      net: Allow iterating sockmap and sockhash · 03653515
      Lorenz Bauer authored
      
      
      Add bpf_iter support for sockmap / sockhash, based on the bpf_sk_storage and
      hashtable implementation. sockmap and sockhash share the same iteration
      context: a pointer to an arbitrary key and a pointer to a socket. Both
      pointers may be NULL, and so BPF has to perform a NULL check before accessing
      them. Technically it's not possible for sockhash iteration to yield a NULL
      socket, but we ignore this to be able to use a single iteration point.
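
      A plain C model of what an iterator program has to do with this
      context, assuming a hypothetical toy_iter_ctx struct in place of the
      real BPF context type: both pointers are checked before use, and the
      same callback serves sockmap and sockhash.

```c
#include <stddef.h>

/* Hypothetical stand-in for the shared iteration context. */
struct toy_iter_ctx {
	const void *key;  /* may be NULL */
	const void *sk;   /* may be NULL (e.g. an empty sockmap slot) */
};

/* Count visited keys and the sockets found behind them.
 * Each pointer is NULL-checked before it is considered. */
static void count_elem(const struct toy_iter_ctx *ctx,
		       unsigned int *elems, unsigned int *socks)
{
	if (ctx->key)
		(*elems)++;
	if (ctx->sk)
		(*socks)++;
}
```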
      
      Iteration will visit all keys that remain unmodified during the lifetime of
      the iterator. It may or may not visit newly added ones.
      
      Switch from using rcu_dereference_raw to plain rcu_dereference, so we gain
      another guard rail if CONFIG_PROVE_RCU is enabled.
      
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200909162712.221874-3-lmb@cloudflare.com
      03653515
    • Lorenz Bauer's avatar
      net: sockmap: Remove unnecessary sk_fullsock checks · 654785a1
      Lorenz Bauer authored
      
      
      The lookup paths for sockmap and sockhash currently include a check
      that returns NULL if the socket we just found is not a full socket.
      However, this check is not necessary. On insertion we ensure that
      we have a full socket (caveat around sock_ops), so request sockets
      are not a problem. Time-wait sockets are allocated separately from
      the original socket and then fed into the hashdance. They don't
      affect the sockets already stored in the sockmap.
      
      Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200909162712.221874-2-lmb@cloudflare.com
      654785a1
    • Quentin Monnet's avatar
      tools: bpftool: Include common options from separate file · f28ef96d
      Quentin Monnet authored
      
      
      Nearly all man pages for bpftool have the same common set of option
      flags (--help, --version, --json, --pretty, --debug). The description is
      duplicated across all the pages, which is more difficult to maintain if
      the description of an option changes. It may also be confusing to sort
      out what options are not "common" and should not be copied when creating
      new manual pages.
      
      Let's move the description for those common options to a separate file,
      which is included with a RST directive when generating the man pages.
      
      Signed-off-by: Quentin Monnet <quentin@isovalent.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200909162500.17010-3-quentin@isovalent.com
      f28ef96d
    • Quentin Monnet's avatar
      tools: bpftool: Print optional built-in features along with version · 82b8cf0a
      Quentin Monnet authored
      
      
      Bpftool has a number of features that can be included or left aside
      during compilation. This includes:
      
      - Support for libbfd, providing the disassembler for JIT-compiled
        programs.
      - Support for BPF skeletons, used for profiling programs or iterating on
        the PIDs of processes associated with BPF objects.
      
      In order to make it easy for users to understand what features were
      compiled for a given bpftool binary, print the status of the two
      features above when showing the version number for bpftool ("bpftool -V"
      or "bpftool version"). Document this in the main manual page. Example
      invocations:
      
          $ bpftool version
          ./bpftool v5.9.0-rc1
          features: libbfd, skeletons
      
          $ bpftool -p version
          {
              "version": "5.9.0-rc1",
              "features": {
                  "libbfd": true,
                  "skeletons": true
              }
          }
      
      Some other parameters are optional at compilation
      ("DISASM_FOUR_ARGS_SIGNATURE", LIBCAP support) but they do not
      significantly impact bpftool's behaviour from a user's point of view, so their
      status is not reported.
      
      Available commands and supported program types depend on the version
      number, and are therefore not reported either. Note that they are
      already available, albeit without JSON, via bpftool's help messages.
      
      v3:
      - Use a simple list instead of boolean values for plain output.
      
      v2:
      - Fix JSON (object instead of array for the features).
      
      Signed-off-by: Quentin Monnet <quentin@isovalent.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200909162500.17010-2-quentin@isovalent.com
      82b8cf0a
    • Quentin Monnet's avatar
      selftests, bpftool: Add bpftool (and eBPF helpers) documentation build · 41d5c37b
      Quentin Monnet authored
      
      
      eBPF selftests include a script to check that bpftool builds correctly
      with different command lines. Let's add one build for bpftool's
      documentation so as to detect errors or warnings reported by rst2man when
      compiling the man pages. Also add a build to the selftests Makefile to
      make sure we build bpftool documentation along with bpftool when
      building the selftests.
      
      This also builds and checks warnings for the man page for eBPF helpers,
      which is built along bpftool's documentation.
      
      This change adds rst2man as a dependency for selftests (it comes with
      Python's "docutils").
      
      v2:
      - Use "--exit-status=1" option for rst2man instead of counting lines
        from stderr.
      - Also build bpftool as part of the selftests build (and not only when
        the tests are actually run).
      
      Signed-off-by: Quentin Monnet <quentin@isovalent.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200909162251.15498-3-quentin@isovalent.com
      41d5c37b
    • Quentin Monnet's avatar
      tools: bpftool: Log info-level messages when building bpftool man pages · 16f3ddfb
      Quentin Monnet authored
      
      
      To build man pages for bpftool (and for eBPF helper functions), rst2man
      can log different levels of information. Let's make it log all levels
      to keep the RST files clean.
      
      Doing so, rst2man complains about double colons, used for literal
      blocks, that look like underlines for section titles. Let's add the
      necessary blank lines.
      
      v2:
      - Use "--verbose" instead of "-r 1" (same behaviour but more readable).
      - Pass it through a RST2MAN_OPTS variable so we can easily pass other
        options too.
      
      Signed-off-by: Quentin Monnet <quentin@isovalent.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200909162251.15498-2-quentin@isovalent.com
      16f3ddfb