  1. Nov 02, 2021
    • tools, build: Add RISC-V to HOSTARCH parsing · b390d698
      Björn Töpel authored
      
      
      Add RISC-V to the HOSTARCH parsing, so that ARCH is "riscv", and not
      "riscv32" or "riscv64".
      
      This affects the perf and libbpf builds, so that arch specific
      includes are correctly picked up for RISC-V.
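As a rough illustration of the normalization described above, the mapping can be sketched in C. This is not the build logic itself, which lives in the tools makefiles. The function name `toy_normalize_arch` and its buffer handling are hypothetical; only the riscv32/riscv64 to riscv mapping comes from the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: map a machine name such as "riscv64" to the
 * kernel-style ARCH name "riscv". Any other name is copied unchanged.
 */
static const char *toy_normalize_arch(const char *machine, char *buf, size_t len)
{
	assert(machine && buf && len > 0);

	if (strncmp(machine, "riscv", 5) == 0)
		snprintf(buf, len, "riscv");	/* riscv32, riscv64 -> riscv */
	else
		snprintf(buf, len, "%s", machine);

	return buf;
}
```

With a single "riscv" name, arch-specific include directories can be selected the same way for 32-bit and 64-bit hosts.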
      
      Signed-off-by: Björn Töpel <bjorn@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20211028161057.520552-3-bjorn@kernel.org
    • riscv, bpf: Increase the maximum number of iterations · 4b54214f
      Björn Töpel authored
      
      
      Now that BPF programs can be up to 1M instructions, it is not uncommon
      that a program requires more than the current 16 iterations to
      converge.
      
      Bump it to 32, which is enough for selftests/bpf, and test_bpf.ko.
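To show why a JIT may need several passes before it converges, here is a minimal toy model, not the RISC-V JIT itself. It assumes a made-up encoding in which a jump takes 4 bytes when its target is within `TOY_SHORT_RANGE` bytes and 8 bytes otherwise. All names (`toy_insn`, `toy_jit_converge`) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_SHORT_RANGE 16	/* max byte distance a 4-byte jump can encode */

/* Hypothetical toy instruction: a jump to index `target`, or a plain op if target < 0. */
struct toy_insn {
	int target;
};

/*
 * Lay out `n` instructions repeatedly until no offset changes.
 * off[] must have n + 1 slots; off[n] ends up as the total image size.
 * Returns the number of passes used, or -1 if max_passes was not enough.
 */
static int toy_jit_converge(const struct toy_insn *prog, int n, int *off,
			    int max_passes)
{
	assert(prog && off && n >= 0);

	/* Pessimistic start: assume every instruction takes 8 bytes. */
	for (int i = 0; i <= n; i++)
		off[i] = i * 8;

	for (int pass = 1; pass <= max_passes; pass++) {
		int changed = 0, pos = 0;

		for (int i = 0; i < n; i++) {
			int size = 4;

			if (prog[i].target >= 0 &&
			    abs(off[prog[i].target] - off[i]) > TOY_SHORT_RANGE)
				size = 8;	/* far jump needs the long form */
			if (off[i] != pos)
				changed = 1;
			off[i] = pos;
			pos += size;
		}
		if (off[n] != pos)
			changed = 1;
		off[n] = pos;
		if (!changed)
			return pass;
	}
	return -1;
}
```

Each pass can shrink some jumps, which moves later offsets and may let further jumps shrink, so larger programs tend to need more passes. A cap that is too low makes the function give up, which is the situation the commit addresses by raising the limit.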
      
      Signed-off-by: Björn Töpel <bjorn@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20211028161057.520552-2-bjorn@kernel.org
    • selftests, bpf: Add one test for sockmap with strparser · d6967214
      Liu Jian authored
      
      
      Add a test to check that sockmap with strparser works correctly.
      
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20211029141216.211899-3-liujian56@huawei.com
    • selftests, bpf: Fix test_txmsg_ingress_parser error · b556c3fd
      Liu Jian authored
      
      
      After "skmsg: lose offset info in sk_psock_skb_ingress", the test case
      with ktls failed. This is because the ktls parser (tls_read_size) returns
      285, not 256.

      The case looks like this:

      	tls_sk1 --> redir_sk --> tls_sk2

      tls_sk1 sent out 512 bytes of data; after TLS-related processing, redir_sk
      received 570 bytes of data and redirected 512 (skb_use_parser) bytes to
      tls_sk2. But tls_sk2 needs 285 * 2 bytes of data, so a receive timeout
      occurred.
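The byte counts above can be checked with a small calculation. The sketch assumes the 512 bytes are split into two records of 256 plaintext bytes, each occupying 285 bytes after TLS processing. The split is inferred from the figures in the message, and `toy_wire_bytes` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes seen on the wire when `payload` bytes are split into records that
 * carry `rec_payload` plaintext bytes and occupy `rec_wire` bytes each.
 */
static size_t toy_wire_bytes(size_t payload, size_t rec_payload, size_t rec_wire)
{
	assert(rec_payload > 0);

	size_t records = (payload + rec_payload - 1) / rec_payload; /* round up */

	return records * rec_wire;
}
```

Calling `toy_wire_bytes(512, 256, 285)` gives 570, the amount redir_sk received. Redirecting only 512 bytes therefore leaves tls_sk2 short of the two full records it expects.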
      
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20211029141216.211899-2-liujian56@huawei.com
    • skmsg: Lose offset info in sk_psock_skb_ingress · 7303524e
      Liu Jian authored
      If sockmap enables strparser, the offset info is lost in
      sk_psock_skb_ingress(). If the length determined by the parse_msg function
      is not skb->len, the skb will be converted to sk_msg multiple times, and the
      userspace app will get the data multiple times.

      Fix this by getting the offset and length from strp_msg. And as Cong suggested,
      add one bit in skb->_sk_redir to distinguish whether strparser is enabled or
      disabled.
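The idea of carrying a flag in spare bits of a pointer-sized field can be sketched as follows. This is an illustration only: the `TOY_*` names and helpers are hypothetical and do not reproduce the kernel's definitions. The sketch assumes the pointed-to object is at least 4-byte aligned, so the two low bits are free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_F_INGRESS	((uintptr_t)1 << 0)
#define TOY_F_STRPARSER	((uintptr_t)1 << 1)
#define TOY_F_PTR_MASK	(~(TOY_F_INGRESS | TOY_F_STRPARSER))

/* Pack a socket pointer together with two flag bits. */
static uintptr_t toy_redir_pack(void *sk, bool ingress, bool strparser)
{
	uintptr_t v = (uintptr_t)sk;

	assert((v & ~TOY_F_PTR_MASK) == 0);	/* low two bits must be free */
	if (ingress)
		v |= TOY_F_INGRESS;
	if (strparser)
		v |= TOY_F_STRPARSER;
	return v;
}

/* Recover the original pointer by masking off the flag bits. */
static void *toy_redir_ptr(uintptr_t v)
{
	return (void *)(v & TOY_F_PTR_MASK);
}

/* Report whether the strparser flag was recorded. */
static bool toy_redir_strparser(uintptr_t v)
{
	return (v & TOY_F_STRPARSER) != 0;
}
```

A consumer can then check the strparser flag and, only when it is set, take the offset and length from the parser's metadata instead of assuming the whole buffer.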
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Cong Wang <cong.wang@bytedance.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20211029141216.211899-1-liujian56@huawei.com
    • selftests/bpf: Fix strobemeta selftest regression · 0133c204
      Andrii Nakryiko authored
      
      
      After the most recent nightly Clang update, the strobemeta selftests started
      failing with the following error (relevant portion of assembly included):
      
        1624: (85) call bpf_probe_read_user_str#114
        1625: (bf) r1 = r0
        1626: (18) r2 = 0xfffffffe
        1628: (5f) r1 &= r2
        1629: (55) if r1 != 0x0 goto pc+7
        1630: (07) r9 += 104
        1631: (6b) *(u16 *)(r9 +0) = r0
        1632: (67) r0 <<= 32
        1633: (77) r0 >>= 32
        1634: (79) r1 = *(u64 *)(r10 -456)
        1635: (0f) r1 += r0
        1636: (7b) *(u64 *)(r10 -456) = r1
        1637: (79) r1 = *(u64 *)(r10 -368)
        1638: (c5) if r1 s< 0x1 goto pc+778
        1639: (bf) r6 = r8
        1640: (0f) r6 += r7
        1641: (b4) w1 = 0
        1642: (6b) *(u16 *)(r6 +108) = r1
        1643: (79) r3 = *(u64 *)(r10 -352)
        1644: (79) r9 = *(u64 *)(r10 -456)
        1645: (bf) r1 = r9
        1646: (b4) w2 = 1
        1647: (85) call bpf_probe_read_user_str#114
      
        R1 unbounded memory access, make sure to bounds check any such access
      
      In the above code r0 and r1 are implicitly related. Clang knows that,
      but the verifier isn't able to infer this relationship.

      Yonghong Song narrowed down this "regression" in code generation to
      a recent Clang optimization change ([0]), which for the BPF target generates
      a code pattern that the BPF verifier can't handle, so it loses track of
      register boundaries.

      This patch works around the issue by adding a BPF assembly-based helper
      that helps to prove to the verifier that the upper bound of the register is
      a given constant by controlling the exact shape of the generated BPF
      instruction sequence. This fixes the immediate issue for the strobemeta
      selftest.
      
        [0] https://github.com/llvm/llvm-project/commit/acabad9ff6bf13e00305d9d8621ee8eafc1f8b08
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20211029182907.166910-1-andrii@kernel.org
    • bpf: Disallow unprivileged bpf by default · 8a03e56b
      Pawan Gupta authored
      Disabling unprivileged BPF would help prevent unprivileged users from
      creating certain conditions required for potential speculative execution
      side-channel attacks on unmitigated affected hardware.
      
      A deep dive on such attacks and current mitigations is available here [0].
      
      Sync with what many distros are currently applying already, and disable
      unprivileged BPF by default. An admin can enable this at runtime, if
      necessary, as described in 08389d88 ("bpf: Add kconfig knob for
      disabling unpriv bpf by default").
      
        [0] "BPF and Spectre: Mitigating transient execution attacks", Daniel Borkmann, eBPF Summit '21
            https://ebpf.io/summit-2021-slides/eBPF_Summit_2021-Keynote-Daniel_Borkmann-BPF_and_Spectre.pdf
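For readers who want to inspect the runtime setting, here is a minimal sketch that reads the `kernel.unprivileged_bpf_disabled` sysctl through procfs and classifies its value. The helper and enum names are hypothetical. The value meanings follow the kernel sysctl documentation as I understand it: 0 enabled, 1 disabled until reboot, 2 disabled but changeable by an admin.

```c
#include <assert.h>
#include <stdio.h>

enum toy_unpriv_state {
	TOY_UNPRIV_ENABLED,		/* 0: unprivileged bpf() allowed */
	TOY_UNPRIV_DISABLED_LOCKED,	/* 1: disabled, cannot be re-enabled */
	TOY_UNPRIV_DISABLED_ADMIN,	/* 2: disabled, admin may change it */
	TOY_UNPRIV_UNKNOWN,
};

/* Classify a raw sysctl value. */
static enum toy_unpriv_state toy_unpriv_classify(int value)
{
	switch (value) {
	case 0:
		return TOY_UNPRIV_ENABLED;
	case 1:
		return TOY_UNPRIV_DISABLED_LOCKED;
	case 2:
		return TOY_UNPRIV_DISABLED_ADMIN;
	default:
		return TOY_UNPRIV_UNKNOWN;
	}
}

/* Returns the sysctl value, or -1 if it cannot be read (e.g. file absent). */
static int toy_read_unpriv_bpf_sysctl(void)
{
	FILE *f = fopen("/proc/sys/kernel/unprivileged_bpf_disabled", "r");
	int value = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

On a kernel built with the new default, `toy_unpriv_classify(toy_read_unpriv_bpf_sysctl())` would be expected to report the admin-changeable disabled state, unless a distro or admin has overridden it.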
      
      Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Link: https://lore.kernel.org/bpf/0ace9ce3f97656d5f62d11093ad7ee81190c3c25.1635535215.git.pawan.kumar.gupta@linux.intel.com
  2. Oct 29, 2021
    • selftests/bpf: Fix fclose/pclose mismatch in test_progs · f48ad690
      Andrea Righi authored
      Make sure to use pclose() to properly close the pipe opened by popen().
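A minimal sketch of the correct pairing, independent of the selftest code: a stream obtained from popen() is closed with pclose(), which also waits for the child process. The function name `toy_count_output_lines` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

/* Count the lines printed by a shell command; returns -1 on failure. */
static int toy_count_output_lines(const char *cmd)
{
	FILE *pipe = popen(cmd, "r");
	int lines = 0, c;

	if (!pipe)
		return -1;
	while ((c = fgetc(pipe)) != EOF)
		if (c == '\n')
			lines++;
	/*
	 * popen() streams must be closed with pclose(), not fclose():
	 * pclose() also waits for the child and returns its exit status.
	 */
	if (pclose(pipe) == -1)
		return -1;
	return lines;
}
```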
      
      Fixes: 81f77fd0 ("bpf: add selftest for stackmap with BPF_F_STACK_BUILD_ID")
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20211026143409.42666-1-andrea.righi@canonical.com
    • Merge branch 'Typeless/weak ksym for gen_loader + misc fixups' · b9989b59
      Alexei Starovoitov authored
      
      
      Kumar Kartikeya says:
      
      ====================
      
      Patches (1,2,3,6) add typeless and weak ksym support to gen_loader. It is a
      follow-up to the recent kfunc from modules series.

      The later patches (7,8) are misc fixes for selftests, and patch 4 is for
      libbpf, where we try to be careful to not end up with fds == 0, as libbpf
      assumes in various places that they are greater than 0. Patch 5 fixes up
      missing O_CLOEXEC in libbpf.
      
      Changelog:
      ----------
      v4 -> v5
      v4: https://lore.kernel.org/bpf/20211020191526.2306852-1-memxor@gmail.com
      
       * Address feedback from Andrii
         * Drop use of ensure_good_fd in unneeded call sites
         * Add sys_bpf_fd
         * Add _lskel suffix to all light skeletons and change all current selftests
         * Drop early break in close loop for sk_lookup
         * Fix other nits
      
      v3 -> v4
      v3: https://lore.kernel.org/bpf/20211014205644.1837280-1-memxor@gmail.com
      
       * Remove gpl_only = true from bpf_kallsyms_lookup_name (Alexei)
       * Add bpf_dump_raw_ok check to ensure kptr_restrict isn't bypassed (Alexei)
      
      v2 -> v3
      v2: https://lore.kernel.org/bpf/20211013073348.1611155-1-memxor@gmail.com
      
       * Address feedback from Song
         * Move ksym logging to separate helper to avoid code duplication
         * Move src_reg mask stuff to separate helper
         * Fix various other nits, add acks
           * __builtin_expect is used instead of likely(), as skel_internal.h is
             included in isolation.
      
      v1 -> v2
      v1: https://lore.kernel.org/bpf/20211006002853.308945-1-memxor@gmail.com
      
       * Remove redundant OOM checks in emit_bpf_kallsyms_lookup_name
       * Use designated initializer for sk_lookup fd array (Jakub)
       * Do fd check for all fd returning low level APIs (Andrii, Alexei)
       * Make Fixes: tag quote commit message, use selftests/bpf prefix (Song, Andrii)
       * Split typeless and weak ksym support into separate patches, expand commit
         message (Song)
       * Fix duplication in selftests stemming from use of LSKELS_EXTRA (Song)
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Fix memory leak in test_ima · efadf2ad
      Kumar Kartikeya Dwivedi authored
      The allocated ring buffer is never freed, do so in the cleanup path.
      
      Fixes: f446b570 ("bpf/selftests: Update the IMA test to use BPF ring buffer")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20211028063501.2239335-9-memxor@gmail.com
    • selftests/bpf: Fix fd cleanup in sk_lookup test · c3fc706e
      Kumar Kartikeya Dwivedi authored
      Similar to the fix in commit:
      e31eec77 ("bpf: selftests: Fix fd cleanup in get_branch_snapshot")
      
      We use designated initializer to set fds to -1 without breaking on
      future changes to MAX_SERVER constant denoting the array size.
      
      The particular close(0) occurs on non-reuseport tests, so it can be seen
      with -n 115/{2,3} but not 115/4. This can cause problems with future
      tests if they depend on BTF fd never being acquired as fd 0, breaking
      internal libbpf assumptions.
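The pattern described above can be sketched as follows. The `[first ... last]` range designator is a GNU C extension accepted by GCC and Clang. `TOY_MAX_SERVERS` and the helper names are hypothetical stand-ins, not the selftest's code.

```c
#include <assert.h>
#include <unistd.h>

#define TOY_MAX_SERVERS 2	/* illustrative array size */

/* Close every valid fd in the array; returns how many were closed. */
static int toy_close_fds(int *fds, int n)
{
	int closed = 0;

	for (int i = 0; i < n; i++) {
		if (fds[i] < 0)
			continue;	/* slot never opened: skip, do not close(0) */
		close(fds[i]);
		fds[i] = -1;
		closed++;
	}
	return closed;
}

/* Cleanup on an array in which no server was ever opened. */
static int toy_cleanup_without_open(void)
{
	/* Every slot starts at -1, whatever TOY_MAX_SERVERS is (GNU C range designator). */
	int server_fds[TOY_MAX_SERVERS] = { [0 ... TOY_MAX_SERVERS - 1] = -1 };

	return toy_close_fds(server_fds, TOY_MAX_SERVERS);
}
```

Had the array been zero-initialized instead, unused slots would hold 0, so a cleanup loop that treats 0 as a valid fd would close fd 0, the problem this commit describes.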
      
      Fixes: 0ab5539f ("selftests/bpf: Tests for BPF_SK_LOOKUP attach point")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20211028063501.2239335-8-memxor@gmail.com
    • selftests/bpf: Add weak/typeless ksym test for light skeleton · 087cba79
      Kumar Kartikeya Dwivedi authored
      Also, avoid using CO-RE features, as lskel doesn't support CO-RE, yet.
      Include both light and libbpf skeleton in same file to test both of them
      together.
      
      In c48e51c8 ("bpf: selftests: Add selftests for module kfunc support"),
      I added support for generating both lskel and libbpf skel for a BPF
      object, however the name parameter for bpftool caused collisions when
      included in same file together. This meant that every test needed a
      separate file for a libbpf/light skeleton separation instead of
      subtests.
      
      Change that by appending a "_lskel" suffix to the name for files using
      light skeleton, and convert all existing users.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211028063501.2239335-7-memxor@gmail.com
    • libbpf: Use O_CLOEXEC uniformly when opening fds · 92274e24
      Kumar Kartikeya Dwivedi authored
      
      
      There are some instances where we don't use O_CLOEXEC when opening an
      fd, fix these up. Otherwise, it is possible that a parallel fork causes
      these fds to leak into a child process on execve.
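A minimal sketch of the technique, not taken from libbpf: passing O_CLOEXEC to open() sets the close-on-exec flag atomically, so there is no window in which a concurrent fork() plus execve() could inherit the fd. The helper names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open read-only with close-on-exec set atomically at open time. */
static int toy_open_cloexec(const char *path)
{
	return open(path, O_RDONLY | O_CLOEXEC);
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if not, -1 on error. */
static int toy_fd_is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

Setting the flag later with fcntl(fd, F_SETFD, FD_CLOEXEC) is not equivalent in multithreaded programs, because another thread may fork between the open() and the fcntl().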
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211028063501.2239335-6-memxor@gmail.com
    • libbpf: Ensure that BPF syscall fds are never 0, 1, or 2 · 549a6323
      Kumar Kartikeya Dwivedi authored
      Add a simple wrapper for passing an fd and getting a new one >= 3 if it
      is one of 0, 1, or 2. There are two primary reasons to make this change:
      First, libbpf relies on the assumption a certain BPF fd is never 0 (e.g.
      most recently noticed in [0]). Second, Alexei pointed out in [1] that
      some environments reset stdin, stdout, and stderr if they notice an
      invalid fd at these numbers. To protect against both these cases, switch
      all internal BPF syscall wrappers in libbpf to always return an fd >= 3.
      We only need to modify the syscall wrappers and not other code that
      assumes a valid fd by doing >= 0, to avoid pointless churn, and because
      it is still a valid assumption. The cost paid is two additional syscalls
      if fd is in range [0, 2].
      
        [0]: e31eec77 ("bpf: selftests: Fix fd cleanup in get_branch_snapshot")
        [1]: https://lore.kernel.org/bpf/CAADnVQKVKY8o_3aU8Gzke443+uHa-eGoM0h7W4srChMXU1S4Bg@mail.gmail.com
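The wrapper described in this commit can be sketched roughly as below. This is an illustration under assumptions, not libbpf's code: the name `toy_ensure_good_fd` is hypothetical, and the sketch uses fcntl(F_DUPFD_CLOEXEC) followed by close(), which matches the "two additional syscalls" cost mentioned above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Return an fd >= 3 that refers to the same open file, or pass errors through. */
static int toy_ensure_good_fd(int fd)
{
	int old_fd = fd, saved_errno;

	if (fd < 0 || fd >= 3)
		return fd;	/* errors and already-good fds are unchanged */

	fd = fcntl(old_fd, F_DUPFD_CLOEXEC, 3);	/* syscall 1: lowest free fd >= 3 */
	saved_errno = errno;
	close(old_fd);				/* syscall 2: release 0, 1, or 2 */
	errno = saved_errno;			/* keep the fcntl() error, if any */
	return fd;
}
```

Callers that only check `fd >= 0` keep working unchanged, which is why the commit limits the change to the syscall wrappers.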
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211028063501.2239335-5-memxor@gmail.com
    • libbpf: Add weak ksym support to gen_loader · 585a3571
      Kumar Kartikeya Dwivedi authored
      
      
      This extends existing ksym relocation code to also support relocating
      weak ksyms. Care needs to be taken to zero out the src_reg (currently
      BPF_PSEUDO_BTF_ID, always set for gen_loader by bpf_object__relocate_data)
      when the BTF ID lookup fails at runtime.  This is not a problem for
      libbpf as it only sets ext->is_set when BTF ID lookup succeeds (and only
      proceeds in case of failure if ext->is_weak, leading to src_reg
      remaining as 0 for weak unresolved ksym).
      
      A pattern similar to emit_relo_kfunc_btf is followed of first storing
      the default values and then jumping over actual stores in case of an
      error. For src_reg adjustment, we also need to perform it when copying
      the populated instruction, so depending on if copied insn[0].imm is 0 or
      not, we decide to jump over the adjustment.
      
      We cannot reach that point unless the ksym was weak and resolved and
      zeroed out, as the emit_check_err will cause us to jump to cleanup
      label, so we do not need to recheck whether the ksym is weak before
      doing the adjustment after copying BTF ID and BTF FD.
      
      This is consistent with how libbpf relocates weak ksym. Logging
      statements are added to show the relocation result and aid debugging.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211028063501.2239335-4-memxor@gmail.com
    • libbpf: Add typeless ksym support to gen_loader · c24941cd
      Kumar Kartikeya Dwivedi authored
      
      
      This uses the bpf_kallsyms_lookup_name helper added in previous patches
      to relocate typeless ksyms. The return value ENOENT can be ignored, and
      the value written to 'res' can be directly stored to the insn, as it is
      overwritten to 0 on lookup failure. For repeating symbols, we can simply
      copy the previously populated bpf_insn.
      
      Also, we need to take care to not close fds for typeless ksym_desc, so
      reuse the 'off' member's space to add a marker for typeless ksym and use
      that to skip them in cleanup_relos.
      
      We add an emit_ksym_relo_log helper that avoids duplicating common
      logging instructions between typeless and weak ksym (for future commit).
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211028063501.2239335-3-memxor@gmail.com
    • bpf: Add bpf_kallsyms_lookup_name helper · d6aef08a
      Kumar Kartikeya Dwivedi authored
      
      
      This helper allows us to get the address of a kernel symbol from inside
      a BPF_PROG_TYPE_SYSCALL prog (used by gen_loader), so that we can
      relocate typeless ksym vars.
      
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20211028063501.2239335-2-memxor@gmail.com
    • Merge branch 'Implement bloom filter map' · 2895f48f
      Alexei Starovoitov authored
      
      
      Joanne Koong says:
      
      ====================
      
      This patchset adds a new kind of bpf map: the bloom filter map.
      Bloom filters are a space-efficient probabilistic data structure
      used to quickly test whether an element exists in a set.
      For a brief overview about how bloom filters work,
      https://en.wikipedia.org/wiki/Bloom_filter
      may be helpful.
      
      One example use-case is an application leveraging a bloom filter
      map to determine whether a computationally expensive hashmap
      lookup can be avoided. If the element was not found in the bloom
      filter map, the hashmap lookup can be skipped.
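To make the use case concrete, here is a small user-space bloom filter sketch. It is not the kernel map implementation: the bit count, the seeded FNV-1a-style hash (swapped in for whatever hash the kernel uses), and all `toy_*` names are assumptions chosen for illustration. The count of 5 hash functions mirrors the default mentioned in the changelog further down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOOM_BITS   1024u	/* illustrative bitmap size */
#define TOY_BLOOM_HASHES 5u	/* illustrative number of hash functions */

struct toy_bloom {
	uint8_t bits[TOY_BLOOM_BITS / 8];
};

/* Seeded FNV-1a-style hash over the 4 bytes of the key. */
static uint32_t toy_hash(uint32_t key, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (int i = 0; i < 4; i++) {
		h ^= (key >> (8 * i)) & 0xffu;
		h *= 16777619u;
	}
	return h;
}

/* Set one bit per hash function. */
static void toy_bloom_add(struct toy_bloom *b, uint32_t key)
{
	for (uint32_t i = 0; i < TOY_BLOOM_HASHES; i++) {
		uint32_t bit = toy_hash(key, i) % TOY_BLOOM_BITS;

		b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}

/* false: key was definitely never added; true: key may have been added. */
static bool toy_bloom_maybe_contains(const struct toy_bloom *b, uint32_t key)
{
	for (uint32_t i = 0; i < TOY_BLOOM_HASHES; i++) {
		uint32_t bit = toy_hash(key, i) % TOY_BLOOM_BITS;

		if (!(b->bits[bit / 8] & (1u << (bit % 8))))
			return false;
	}
	return true;
}
```

A caller would run the expensive hashmap lookup only when `toy_bloom_maybe_contains()` returns true. A false result is always reliable, while a true result may be a false positive.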
      
      This patchset includes benchmarks for testing the performance of
      the bloom filter for different entry sizes and different number of
      hash functions used, as well as comparisons for hashmap lookups
      with vs. without the bloom filter.
      
      A high level overview of this patchset is as follows:
      1/5 - kernel changes for adding bloom filter map
      2/5 - libbpf changes for adding map_extra flags
      3/5 - tests for the bloom filter map
      4/5 - benchmarks for bloom filter lookup/update throughput and false positive
      rate
      5/5 - benchmarks for how hashmap lookups perform with vs. without the bloom
      filter
      
      v5 -> v6:
      * in 1/5: remove "inline" from the hash function, add check in syscall to
      fail out in cases where map_extra is not 0 for non-bloom-filter maps,
      fix alignment matching issues, move "map_extra flags" comments to inside
      the bpf_attr struct, add bpf_map_info map_extra changes here, add map_extra
      assignment in bpf_map_get_info_by_fd, change hash value_size to u32 instead of
      a u64
      * in 2/5: remove bpf_map_info map_extra changes, remove TODO comment about
      extending BTF arrays to cover u64s, cast to unsigned long long for %llx when
      printing out map_extra flags
      * in 3/5: use __type(value, ...) instead of __uint(value_size, ...) for values
      and keys
      * in 4/5: fix wrong bounds for the index when iterating through random values,
      update commit message to include update+lookup benchmark results for 8 byte
      and 64-byte value sizes, remove explicit global bool initialization to false
      for hashmap_use_bloom and count_false_hits variables
      
      v4 -> v5:
      * Change the "bitset map with bloom filter capabilities" to a bloom filter map
      with max_entries signifying the number of unique entries expected in the bloom
      filter, remove bitset tests
      * Reduce verbiage by changing "bloom_filter" to "bloom", and renaming progs to
      more concise names.
      * in 2/5: remove "map_extra" from struct definitions that are frozen, create a
      "bpf_create_map_params" struct to propagate map_extra to the kernel at map
      creation time, change map_extra to __u64
      * in 4/5: check pthread condition variable in a loop when generating initial
      map data, remove "err" checks where not pragmatic, generate random values
      for the hashmap in the setup() instead of in the bpf program, add check_args()
      for checking that there aren't more requested entries than possible unique
      entries for the specified value size
      * in 5/5: Update commit message with updated benchmark data
      
      v3 -> v4:
      * Generalize the bloom filter map to be a bitset map with bloom filter
      capabilities
      * Add map_extra flags; pass in nr_hash_funcs through lower 4 bits of map_extra
      for the bitset map
      * Add tests for the bitset map (non-bloom filter) functionality
      * In the benchmarks, stats are computed only as monotonic increases, and place
      stats in a struct instead of as a percpu_array bpf map
      
      v2 -> v3:
      * Add libbpf changes for supporting nr_hash_funcs, instead of passing the
      number of hash functions through map_flags.
      * Separate the hashing logic in kernel/bpf/bloom_filter.c into a helper
      function
      
      v1 -> v2:
      * Remove libbpf changes, and pass the number of hash functions through
      map_flags instead.
      * Default to using 5 hash functions if no number of hash functions
      is specified.
      * Use set_bit instead of spinlocks in the bloom filter bitmap. This
      improved the speed significantly. For example, using 5 hash functions
      with 100k entries, there was roughly a 35% speed increase.
      * Use jhash2 (instead of jhash) for u32-aligned value sizes. This
      increased the speed by roughly 5 to 15%. When using jhash2 on value
      sizes non-u32 aligned (truncating any remainder bits), there was not
      a noticeable difference.
      * Add test for using the bloom filter as an inner map.
      * Reran the benchmarks, updated the commit messages to correspond to
      the new results.
      ====================
      
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf/benchs: Add benchmarks for comparing hashmap lookups w/ vs. w/out bloom filter · f44bc543
      Joanne Koong authored
      
      
      This patch adds benchmark tests for comparing the performance of hashmap
      lookups without the bloom filter vs. hashmap lookups with the bloom filter.
      
      Checking the bloom filter first for whether the element exists should
      overall enable a higher throughput for hashmap lookups, since if the
      element does not exist in the bloom filter, we can avoid a costly lookup in
      the hashmap.
      
      On average, using 5 hash functions in the bloom filter tended to perform
      the best across the widest range of different entry sizes. The benchmark
      results using 5 hash functions (running on 8 threads on a machine with one
      numa node, and taking the average of 3 runs) were roughly as follows:
      
      value_size = 4 bytes -
      	10k entries: 30% faster
      	50k entries: 40% faster
      	100k entries: 40% faster
      	500k entries: 70% faster
      	1 million entries: 90% faster
      	5 million entries: 140% faster
      
      value_size = 8 bytes -
      	10k entries: 30% faster
      	50k entries: 40% faster
      	100k entries: 50% faster
      	500k entries: 80% faster
      	1 million entries: 100% faster
      	5 million entries: 150% faster
      
      value_size = 16 bytes -
      	10k entries: 20% faster
      	50k entries: 30% faster
      	100k entries: 35% faster
      	500k entries: 65% faster
      	1 million entries: 85% faster
      	5 million entries: 110% faster
      
      value_size = 40 bytes -
      	10k entries: 5% faster
      	50k entries: 15% faster
      	100k entries: 20% faster
      	500k entries: 65% faster
      	1 million entries: 75% faster
      	5 million entries: 120% faster
      
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211027234504.30744-6-joannekoong@fb.com
    • bpf/benchs: Add benchmark tests for bloom filter throughput + false positive · 57fd1c63
      Joanne Koong authored
      
      
      This patch adds benchmark tests for the throughput (for lookups + updates)
      and the false positive rate of bloom filter lookups, as well as some
      minor refactoring of the bash script for running the benchmarks.
      
      These benchmarks show that as the number of hash functions increases,
      the throughput and the false positive rate of the bloom filter decrease.
      From the benchmark data, the approximate average false-positive rates
      are roughly as follows:
      
      1 hash function = ~30%
      2 hash functions = ~15%
      3 hash functions = ~5%
      4 hash functions = ~2.5%
      5 hash functions = ~1%
      6 hash functions = ~0.5%
      7 hash functions = ~0.35%
      8 hash functions = ~0.15%
      9 hash functions = ~0.1%
      10 hash functions = ~0%
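For orientation only, the standard textbook approximation for the false-positive rate of a bloom filter with k hash functions, n inserted entries, and m bits is shown below. The commit message does not state the bitmap size m used in the benchmarks, so the measured figures above cannot be reproduced from this formula alone.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}
```

For a fixed m, the formula improves with k only up to a point, so the steady decrease in the measured rates suggests that the bitmap size was not held constant across runs. That is an inference, not something the message states.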
      
      For reference data, the benchmarks run on one thread on a machine
      with one numa node for 1 to 5 hash functions for 8-byte and 64-byte
      values are as follows:
      
      1 hash function:
        50k entries
      	8-byte value
      	    Lookups - 51.1 M/s operations
      	    Updates - 33.6 M/s operations
      	    False positive rate: 24.15%
      	64-byte value
      	    Lookups - 15.7 M/s operations
      	    Updates - 15.1 M/s operations
      	    False positive rate: 24.2%
        100k entries
      	8-byte value
      	    Lookups - 51.0 M/s operations
      	    Updates - 33.4 M/s operations
      	    False positive rate: 24.04%
      	64-byte value
      	    Lookups - 15.6 M/s operations
      	    Updates - 14.6 M/s operations
      	    False positive rate: 24.06%
        500k entries
      	8-byte value
      	    Lookups - 50.5 M/s operations
      	    Updates - 33.1 M/s operations
      	    False positive rate: 27.45%
      	64-byte value
      	    Lookups - 15.6 M/s operations
      	    Updates - 14.2 M/s operations
      	    False positive rate: 27.42%
        1 mil entries
      	8-byte value
      	    Lookups - 49.7 M/s operations
      	    Updates - 32.9 M/s operations
      	    False positive rate: 27.45%
      	64-byte value
      	    Lookups - 15.4 M/s operations
      	    Updates - 13.7 M/s operations
      	    False positive rate: 27.58%
        2.5 mil entries
      	8-byte value
      	    Lookups - 47.2 M/s operations
      	    Updates - 31.8 M/s operations
      	    False positive rate: 30.94%
      	64-byte value
      	    Lookups - 15.3 M/s operations
      	    Updates - 13.2 M/s operations
      	    False positive rate: 30.95%
        5 mil entries
      	8-byte value
      	    Lookups - 41.1 M/s operations
      	    Updates - 28.1 M/s operations
      	    False positive rate: 31.01%
      	64-byte value
      	    Lookups - 13.3 M/s operations
      	    Updates - 11.4 M/s operations
      	    False positive rate: 30.98%
      
      2 hash functions:
        50k entries
      	8-byte value
      	    Lookups - 34.1 M/s operations
      	    Updates - 20.1 M/s operations
      	    False positive rate: 9.13%
      	64-byte value
      	    Lookups - 8.4 M/s operations
      	    Updates - 7.9 M/s operations
      	    False positive rate: 9.21%
        100k entries
      	8-byte value
      	    Lookups - 33.7 M/s operations
      	    Updates - 18.9 M/s operations
      	    False positive rate: 9.13%
      	64-byte value
      	    Lookups - 8.4 M/s operations
      	    Updates - 7.7 M/s operations
      	    False positive rate: 9.19%
        500k entries
      	8-byte value
      	    Lookups - 32.7 M/s operations
      	    Updates - 18.1 M/s operations
      	    False positive rate: 12.61%
      	64-byte value
      	    Lookups - 8.4 M/s operations
      	    Updates - 7.5 M/s operations
      	    False positive rate: 12.61%
        1 mil entries
      	8-byte value
      	    Lookups - 30.6 M/s operations
      	    Updates - 18.9 M/s operations
      	    False positive rate: 12.54%
      	64-byte value
      	    Lookups - 8.0 M/s operations
      	    Updates - 7.0 M/s operations
      	    False positive rate: 12.52%
        2.5 mil entries
      	8-byte value
      	    Lookups - 25.3 M/s operations
      	    Updates - 16.7 M/s operations
      	    False positive rate: 16.77%
      	64-byte value
      	    Lookups - 7.9 M/s operations
      	    Updates - 6.5 M/s operations
      	    False positive rate: 16.88%
        5 mil entries
      	8-byte value
      	    Lookups - 20.8 M/s operations
      	    Updates - 14.7 M/s operations
      	    False positive rate: 16.78%
      	64-byte value
      	    Lookups - 7.0 M/s operations
      	    Updates - 6.0 M/s operations
      	    False positive rate: 16.78%
      
      3 hash functions:
        50k entries
      	8-byte value
      	    Lookups - 25.1 M/s operations
      	    Updates - 14.6 M/s operations
      	    False positive rate: 7.65%
      	64-byte value
      	    Lookups - 5.8 M/s operations
      	    Updates - 5.5 M/s operations
      	    False positive rate: 7.58%
        100k entries
      	8-byte value
      	    Lookups - 24.7 M/s operations
      	    Updates - 14.1 M/s operations
      	    False positive rate: 7.71%
      	64-byte value
      	    Lookups - 5.8 M/s operations
      	    Updates - 5.3 M/s operations
      	    False positive rate: 7.62%
        500k entries
      	8-byte value
      	    Lookups - 22.9 M/s operations
      	    Updates - 13.9 M/s operations
      	    False positive rate: 2.62%
      	64-byte value
      	    Lookups - 5.6 M/s operations
      	    Updates - 4.8 M/s operations
      	    False positive rate: 2.7%
        1 mil entries
      	8-byte value
      	    Lookups - 19.8 M/s operations
      	    Updates - 12.6 M/s operations
      	    False positive rate: 2.60%
      	64-byte value
      	    Lookups - 5.3 M/s operations
      	    Updates - 4.4 M/s operations
      	    False positive rate: 2.69%
        2.5 mil entries
      	8-byte value
      	    Lookups - 16.2 M/s operations
      	    Updates - 10.7 M/s operations
      	    False positive rate: 4.49%
      	64-byte value
      	    Lookups - 4.9 M/s operations
      	    Updates - 4.1 M/s operations
      	    False positive rate: 4.41%
        5 mil entries
      	8-byte value
      	    Lookups - 18.8 M/s operations
      	    Updates - 9.2 M/s operations
      	    False positive rate: 4.45%
      	64-byte value
      	    Lookups - 5.2 M/s operations
      	    Updates - 3.9 M/s operations
      	    False positive rate: 4.54%
      
      4 hash functions:
        50k entries
      	8-byte value
      	    Lookups - 19.7 M/s operations
      	    Updates - 11.1 M/s operations
      	    False positive rate: 1.01%
      	64-byte value
      	    Lookups - 4.4 M/s operations
      	    Updates - 4.0 M/s operations
      	    False positive rate: 1.00%
        100k entries
      	8-byte value
      	    Lookups - 19.5 M/s operations
      	    Updates - 10.9 M/s operations
      	    False positive rate: 1.00%
      	64-byte value
      	    Lookups - 4.3 M/s operations
      	    Updates - 3.9 M/s operations
      	    False positive rate: 0.97%
        500k entries
      	8-byte value
      	    Lookups - 18.2 M/s operations
      	    Updates - 10.6 M/s operations
      	    False positive rate: 2.05%
      	64-byte value
      	    Lookups - 4.3 M/s operations
      	    Updates - 3.7 M/s operations
      	    False positive rate: 2.05%
        1 mil entries
      	8-byte value
      	    Lookups - 15.5 M/s operations
      	    Updates - 9.6 M/s operations
      	    False positive rate: 1.99%
      	64-byte value
      	    Lookups - 4.0 M/s operations
      	    Updates - 3.4 M/s operations
      	    False positive rate: 1.99%
        2.5 mil entries
      	8-byte value
      	    Lookups - 13.8 M/s operations
      	    Updates - 7.7 M/s operations
      	    False positive rate: 3.91%
      	64-byte value
      	    Lookups - 3.7 M/s operations
      	    Updates - 3.6 M/s operations
      	    False positive rate: 3.78%
        5 mil entries
      	8-byte value
      	    Lookups - 13.0 M/s operations
      	    Updates - 6.9 M/s operations
      	    False positive rate: 3.93%
      	64-byte value
      	    Lookups - 3.5 M/s operations
      	    Updates - 3.7 M/s operations
      	    False positive rate: 3.39%
      
      5 hash functions:
        50k entries
      	8-byte value
      	    Lookups - 16.4 M/s operations
      	    Updates - 9.1 M/s operations
      	    False positive rate: 0.78%
      	64-byte value
      	    Lookups - 3.5 M/s operations
      	    Updates - 3.2 M/s operations
      	    False positive rate: 0.77%
        100k entries
      	8-byte value
      	    Lookups - 16.3 M/s operations
      	    Updates - 9.0 M/s operations
      	    False positive rate: 0.79%
      	64-byte value
      	    Lookups - 3.5 M/s operations
      	    Updates - 3.2 M/s operations
      	    False positive rate: 0.78%
        500k entries
      	8-byte value
      	    Lookups - 15.1 M/s operations
      	    Updates - 8.8 M/s operations
      	    False positive rate: 1.82%
      	64-byte value
      	    Lookups - 3.4 M/s operations
      	    Updates - 3.0 M/s operations
      	    False positive rate: 1.78%
        1 mil entries
      	8-byte value
      	    Lookups - 13.2 M/s operations
      	    Updates - 7.8 M/s operations
      	    False positive rate: 1.81%
      	64-byte value
      	    Lookups - 3.2 M/s operations
      	    Updates - 2.8 M/s operations
      	    False positive rate: 1.80%
        2.5 mil entries
      	8-byte value
      	    Lookups - 10.5 M/s operations
      	    Updates - 5.9 M/s operations
      	    False positive rate: 0.29%
      	64-byte value
      	    Lookups - 3.2 M/s operations
      	    Updates - 2.4 M/s operations
      	    False positive rate: 0.28%
        5 mil entries
      	8-byte value
      	    Lookups - 9.6 M/s operations
      	    Updates - 5.7 M/s operations
      	    False positive rate: 0.30%
      	64-byte value
      	    Lookups - 3.2 M/s operations
      	    Updates - 2.7 M/s operations
      	    False positive rate: 0.30%
      
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211027234504.30744-5-joannekoong@fb.com
    • selftests/bpf: Add bloom filter map test cases · ed9109ad
      Joanne Koong authored
      
      
      This patch adds test cases for bpf bloom filter maps. They include tests
      checking against invalid operations by userspace, tests for using the
      bloom filter map as an inner map, and a bpf program that queries the
      bloom filter map for values added by a userspace program.
      
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211027234504.30744-4-joannekoong@fb.com
    • libbpf: Add "map_extra" as a per-map-type extra flag · 47512102
      Joanne Koong authored
      
      
      This patch adds the libbpf infrastructure for supporting a
      per-map-type "map_extra" field, whose meaning depends on the
      map type.

      For example, for the bloom filter map, the lower 4 bits of
      map_extra are used to denote the number of hash functions.
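
The bit layout described above can be sketched with two small helpers. Both function names are hypothetical and are not part of libbpf. The fallback to 5 hash functions when the field is zero is an assumption based on the default stated in the kernel-side patch of this series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack a hash count into the lower 4 bits of map_extra. */
static inline uint64_t bloom_map_extra(uint32_t nr_hashes)
{
	/* Only 4 bits are available, so the count must fit in 0..15. */
	assert(nr_hashes <= 0xf);
	return (uint64_t)nr_hashes;
}

/* Hypothetical helper: recover the hash count from map_extra. */
static inline uint32_t bloom_nr_hashes(uint64_t map_extra)
{
	uint32_t nr = (uint32_t)(map_extra & 0xf);

	/* Assumption: 0 means "not specified", so use the default of 5. */
	return nr ? nr : 5;
}
```

For instance, bloom_nr_hashes(bloom_map_extra(3)) yields 3, while a map_extra of 0 yields the assumed default of 5.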
      
      Please note that until libbpf 1.0 is here, the
      "bpf_create_map_params" struct is used as a temporary
      means for propagating the map_extra field to the kernel.
      
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211027234504.30744-3-joannekoong@fb.com
    • bpf: Add bloom filter map implementation · 9330986c
      Joanne Koong authored
      
      
      This patch adds the kernel-side changes for the implementation of
      a bpf bloom filter map.
      
      The bloom filter map supports peek (determining whether an element
      is present in the map) and push (adding an element to the map)
      operations. These operations are exposed to userspace applications
      through the already existing syscalls in the following way:
      
      BPF_MAP_LOOKUP_ELEM -> peek
      BPF_MAP_UPDATE_ELEM -> push
      
      The bloom filter map does not have keys, only values. In light of
      this, the bloom filter map's API matches that of queue stack maps:
      user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
      which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
      and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
      APIs to query or add an element to the bloom filter map. When the
      bloom filter map is created, it must be created with a key_size of 0.
      
      For updates, the user will pass in the element to add to the map
      as the value, with a NULL key. For lookups, the user will pass in the
      element to query in the map as the value, with a NULL key. In the
      verifier layer, this requires us to modify the argument type of
      a bloom filter's BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE;
      likewise, in the syscall layer, we need to copy over the user value
      so that in bpf_map_peek_elem, we know which specific value to query.
      
      A few things to note:
       * If there are any concurrent lookups and updates, the user is
         responsible for synchronizing them to ensure that no false
         negative lookups occur.
       * The number of hashes to use for the bloom filter is configurable
         from userspace. If no number is specified, the default is 5 hash
         functions. The benchmarks later in this patchset can help compare
         the performance of different numbers of hashes on different entry
         sizes. In general, using more hashes decreases both the false
         positive rate and the speed of a lookup.
       * Deleting an element in the bloom filter map is not supported.
       * The bloom filter map may be used as an inner map.
       * The "max_entries" size that is specified at map creation time is
         used to approximate a reasonable bitmap size for the bloom filter,
         and is not otherwise strictly enforced. If the user wishes to
         insert more entries into the bloom filter than "max_entries", they
         may do so, but they should be aware that this may lead to a higher
         false positive rate.
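
The push and peek semantics described above can be illustrated with a minimal userspace sketch. This is not the kernel implementation: the bitmap size, the seeded FNV-1a hash, and all names are assumptions chosen for illustration. The count of 5 hash functions follows the default stated in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOOM_BITS   1024u /* illustrative bitmap size, a power of two */
#define BLOOM_HASHES 5u    /* default number of hash functions per the text */

struct bloom {
	uint8_t bits[BLOOM_BITS / 8];
};

/* Stand-in hash (seeded FNV-1a); the commit message does not name the
 * hash the kernel uses. Returns a bit index in [0, BLOOM_BITS). */
static uint32_t bloom_hash(const void *val, size_t len, uint32_t seed)
{
	const uint8_t *p = val;
	uint32_t h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h & (BLOOM_BITS - 1);
}

/* "push": set one bit per hash function. There is no inverse operation,
 * which is why deletion is not supported. */
static void bloom_push(struct bloom *b, const void *val, size_t len)
{
	assert(b && val);
	for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
		uint32_t bit = bloom_hash(val, len, i);

		b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}

/* "peek": false means definitely absent, true means possibly present. */
static bool bloom_peek(const struct bloom *b, const void *val, size_t len)
{
	assert(b && val);
	for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
		uint32_t bit = bloom_hash(val, len, i);

		if (!(b->bits[bit / 8] & (1u << (bit % 8))))
			return false;
	}
	return true;
}
```

A value that has been pushed always peeks as present. A value that was never pushed can still peek as present when all of its bits happen to be set by other values, which is the false positive case the benchmarks measure.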
      
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211027234504.30744-2-joannekoong@fb.com
  3. Oct 28, 2021
    • bpf, tests: Add module parameter test_suite to test_bpf module · b066abba
      Tiezhu Yang authored
      After commit 9298e63e ("bpf/tests: Add exhaustive tests of ALU
      operand magnitudes"), loading test_bpf.ko with JIT enabled on mips64
      triggers a segmentation fault:
      
        [...]
        ALU64_MOV_X: all register value magnitudes jited:1
        Break instruction in kernel code[#1]
        [...]
      
      It seems that the JIT implementations of some test cases in
      test_bpf() have problems. For now, I am not concerned with the
      segmentation fault; I only want to verify the tail call test cases.

      Based on the above background and motivation, add the following
      module parameter test_suite to test_bpf.ko:
      
        test_suite=<string>: only the specified test suite will be run, the
        string can be "test_bpf", "test_tail_calls" or "test_skb_segment".
      
      If test_suite is not specified, but test_id, test_name or test_range
      is specified, 'test_bpf' is used as the default test suite. Specifying
      a valid test_suite string runs only the corresponding test suite.

      Any invalid test suite will result in -EINVAL being returned and no
      tests being run. If test_suite is not specified, or is specified as
      an empty string, the current logic is unchanged and all of the test
      cases will be run.
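
The selection rules above can be summarized in a small standalone sketch. The enum, the function name, and the has_test_filter flag (non-zero when test_id, test_name or test_range is given) are hypothetical and do not come from test_bpf.c.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result values for illustration only. */
enum suite {
	SUITE_ALL,
	SUITE_BPF,
	SUITE_TAIL_CALLS,
	SUITE_SKB_SEGMENT,
};

/* Returns the suite to run, or -EINVAL for an unknown suite name. */
static int pick_suite(const char *test_suite, int has_test_filter)
{
	/* Unspecified or empty: default to test_bpf only when a test
	 * filter was given, otherwise run everything. */
	if (!test_suite || !*test_suite)
		return has_test_filter ? SUITE_BPF : SUITE_ALL;

	if (!strcmp(test_suite, "test_bpf"))
		return SUITE_BPF;
	if (!strcmp(test_suite, "test_tail_calls"))
		return SUITE_TAIL_CALLS;
	if (!strcmp(test_suite, "test_skb_segment"))
		return SUITE_SKB_SEGMENT;

	return -EINVAL;
}
```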
      
      Here are some test results:
      
       # dmesg -c
       # modprobe test_bpf
       # dmesg | grep Summary
       test_bpf: Summary: 1009 PASSED, 0 FAILED, [0/997 JIT'ed]
       test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
       test_bpf: test_skb_segment: Summary: 2 PASSED, 0 FAILED
      
       # rmmod test_bpf
       # dmesg -c
       # modprobe test_bpf test_suite=test_bpf
       # dmesg | tail -1
       test_bpf: Summary: 1009 PASSED, 0 FAILED, [0/997 JIT'ed]
      
       # rmmod test_bpf
       # dmesg -c
       # modprobe test_bpf test_suite=test_tail_calls
       # dmesg
       test_bpf: #0 Tail call leaf jited:0 21 PASS
       [...]
       test_bpf: #7 Tail call error path, index out of range jited:0 32 PASS
       test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
      
       # rmmod test_bpf
       # dmesg -c
       # modprobe test_bpf test_suite=test_skb_segment
       # dmesg
       test_bpf: #0 gso_with_rx_frags PASS
       test_bpf: #1 gso_linear_no_head_frag PASS
       test_bpf: test_skb_segment: Summary: 2 PASSED, 0 FAILED
      
       # rmmod test_bpf
       # dmesg -c
       # modprobe test_bpf test_id=1
       # dmesg
       test_bpf: test_bpf: set 'test_bpf' as the default test_suite.
       test_bpf: #1 TXA jited:0 54 51 50 PASS
       test_bpf: Summary: 1 PASSED, 0 FAILED, [0/1 JIT'ed]
      
       # rmmod test_bpf
       # dmesg -c
       # modprobe test_bpf test_suite=test_bpf test_name=TXA
       # dmesg
       test_bpf: #1 TXA jited:0 54 50 51 PASS
       test_bpf: Summary: 1 PASSED, 0 FAILED, [0/1 JIT'ed]
      
       # rmmod test_bpf
       # dmesg -c
       # modprobe test_bpf test_suite=test_tail_calls test_range=6,7
       # dmesg
       test_bpf: #6 Tail call error path, NULL target jited:0 41 PASS
       test_bpf: #7 Tail call error path, index out of range jited:0 32 PASS
       test_bpf: test_tail_calls: Summary: 2 PASSED, 0 FAILED, [0/2 JIT'ed]
      
       # rmmod test_bpf
       # dmesg -c
       # modprobe test_bpf test_suite=test_skb_segment test_id=1
       # dmesg
       test_bpf: #1 gso_linear_no_head_frag PASS
       test_bpf: test_skb_segment: Summary: 1 PASSED, 0 FAILED
      
      By the way, the above segmentation fault has been fixed in the latest
      bpf-next tree, which contains the mips64 JIT rework.
      
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Link: https://lore.kernel.org/bpf/1635384321-28128-1-git-send-email-yangtiezhu@loongson.cn
    • riscv, bpf: Add BPF exception tables · 252c765b
      Tong Tiangen authored
      When a tracing BPF program attempts to read memory without using the
      bpf_probe_read() helper, the verifier marks the load instruction with
      the BPF_PROBE_MEM flag. Since the riscv JIT does not currently recognize
      this flag, it falls back to the interpreter.
      
      Add support for BPF_PROBE_MEM, by appending an exception table to the
      BPF program. If the load instruction causes a data abort, the fixup
      infrastructure finds the exception table and fixes up the fault, by
      clearing the destination register and jumping over the faulting
      instruction.
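
The fixup step described above can be modeled in plain C as follows. The entry layout, the register file, and every name here are illustrative assumptions. The commit message does not give the actual riscv exception table format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 32

/* Illustrative table entry; not the real riscv layout. */
struct ex_entry {
	uintptr_t insn;       /* address of the load that may fault */
	uintptr_t fixup;      /* address of the instruction after it */
	unsigned int dst_reg; /* destination register of the load */
};

/* Illustrative saved CPU state at the time of the fault. */
struct cpu_state {
	uintptr_t pc;
	uint64_t regs[NUM_REGS];
};

/* Look up the faulting pc. On a match, clear the destination register
 * and resume after the faulting instruction. Returns false when the
 * fault does not belong to a BPF_PROBE_MEM load. */
static bool fixup_fault(const struct ex_entry *tbl, size_t n,
			struct cpu_state *st)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].insn != st->pc)
			continue;
		assert(tbl[i].dst_reg < NUM_REGS);
		st->regs[tbl[i].dst_reg] = 0;
		st->pc = tbl[i].fixup;
		return true;
	}
	return false;
}
```

From the BPF program's point of view, a faulting load under this model simply reads zero and execution continues.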
      
      A more generic solution would add a "handler" field to the table entry,
      like on x86 and s390. The same issue in ARM64 is fixed in 80083428
      ("bpf, arm64: Add BPF exception tables").
      
      Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Pu Lehui <pulehui@huawei.com>
      Tested-by: Björn Töpel <bjorn@kernel.org>
      Acked-by: Björn Töpel <bjorn@kernel.org>
      Link: https://lore.kernel.org/bpf/20211027111822.3801679-1-tongtiangen@huawei.com
    • Merge branch 'selftests/bpf: parallel mode improvement' · 03e6a7a9
      Andrii Nakryiko authored
      
      
      Yucong Sun says:
      
      ====================
      
      Several patches to improve parallel execution mode, update vmtest.sh,
      and fix two previously dropped patches according to feedback.
      ====================
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • selftests/bpf: Adding a namespace reset for tc_redirect · e1ef62a4
      Yucong Sun authored
      
      
      This patch deletes the ns_src/ns_dst/ns_redir namespaces before
      recreating them, making the test more robust.
      
      Signed-off-by: Yucong Sun <sunyucong@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211025223345.2136168-5-fallentree@fb.com
    • selftests/bpf: Fix attach_probe in parallel mode · 9e7240fb
      Yucong Sun authored
      
      
      This patch makes attach_probe use its own method as the attach point,
      avoiding conflicts with other tests like bpf_cookie.
      
      Signed-off-by: Yucong Sun <sunyucong@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211025223345.2136168-4-fallentree@fb.com
    • selftests/bpf: Update vmtest.sh defaults · 547208a3
      Yucong Sun authored
      
      
      Increase memory to 4G and use 8 SMP cores with host CPU passthrough.
      This makes it run faster in parallel mode and more likely to succeed.
      
      Signed-off-by: Yucong Sun <sunyucong@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211025223345.2136168-2-fallentree@fb.com
    • Merge branch 'bpf: use 32bit safe version of u64_stats' · f9d532fc
      Alexei Starovoitov authored
      
      
      Eric Dumazet says:
      
      ====================
      
      From: Eric Dumazet <edumazet@google.com>
      
      The first two patches fix bugs added in 5.1 and 5.5.

      The third patch replaces the u64 fields in struct bpf_prog_stats
      with u64_stats_t ones to avoid possible sampling errors
      in case of load/store tearing.
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Use u64_stats_t in struct bpf_prog_stats · 61a0abae
      Eric Dumazet authored
      Commit 316580b6 ("u64_stats: provide u64_stats_t type")
      fixed possible load/store tearing on 64bit arches.
      
      For instance, the following C code

        stats->nsecs += sched_clock() - start;

      could legitimately be implemented like this by a compiler,
      confusing concurrent readers a lot:

        stats->nsecs += sched_clock();
        // arbitrary delay
        stats->nsecs -= start;
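
As a userspace analogy only, the idea can be sketched with C11 atomics: compute the delta locally and publish it in a single indivisible addition, so a reader never observes the intermediate value. This is not the kernel's u64_stats_t API, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a per-program stats counter. */
struct prog_stats {
	_Atomic uint64_t nsecs;
};

/* Add the elapsed time in one atomic step instead of "+= now; -= start". */
static void stats_add_runtime(struct prog_stats *s, uint64_t start,
			      uint64_t now)
{
	/* The clock is assumed monotonic, so the delta cannot be negative. */
	assert(now >= start);
	uint64_t delta = now - start;

	atomic_fetch_add_explicit(&s->nsecs, delta, memory_order_relaxed);
}

/* Readers get either the old or the new total, never a partial update. */
static uint64_t stats_read(struct prog_stats *s)
{
	return atomic_load_explicit(&s->nsecs, memory_order_relaxed);
}
```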
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211026214133.3114279-4-eric.dumazet@gmail.com
    • bpf: Fixes possible race in update_prog_stats() for 32bit arches · d979617a
      Eric Dumazet authored
      It seems update_prog_stats() suffers from the same issue fixed
      in the prior patch:

      As it can run while interrupts are enabled, it could
      be re-entered and the u64_stats syncp could be mangled.
      
      Fixes: fec56f58 ("bpf: Introduce BPF trampoline")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211026214133.3114279-3-eric.dumazet@gmail.com
    • bpf: Avoid races in __bpf_prog_run() for 32bit arches · f941eadd
      Eric Dumazet authored
      __bpf_prog_run() can run from non-IRQ contexts, meaning
      it could be re-entered if interrupted.

      This calls for the irq-safe variant of u64_stats_update_{begin|end};
      otherwise we risk a deadlock.
      
      This patch is a nop on 64bit arches, fortunately.
      
      syzbot report:
      
      WARNING: inconsistent lock state
      5.12.0-rc3-syzkaller #0 Not tainted
      --------------------------------
      inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
      udevd/4013 [HC0[0]:SC0[0]:HE1:SE1] takes:
      ff7c9dec (&(&pstats->syncp)->seq){+.?.}-{0:0}, at: sk_filter include/linux/filter.h:867 [inline]
      ff7c9dec (&(&pstats->syncp)->seq){+.?.}-{0:0}, at: do_one_broadcast net/netlink/af_netlink.c:1468 [inline]
      ff7c9dec (&(&pstats->syncp)->seq){+.?.}-{0:0}, at: netlink_broadcast_filtered+0x27c/0x4fc net/netlink/af_netlink.c:1520
      {IN-SOFTIRQ-W} state was registered at:
        lock_acquire.part.0+0xf0/0x41c kernel/locking/lockdep.c:5510
        lock_acquire+0x6c/0x74 kernel/locking/lockdep.c:5483
        do_write_seqcount_begin_nested include/linux/seqlock.h:520 [inline]
        do_write_seqcount_begin include/linux/seqlock.h:545 [inline]
        u64_stats_update_begin include/linux/u64_stats_sync.h:129 [inline]
        bpf_prog_run_pin_on_cpu include/linux/filter.h:624 [inline]
        bpf_prog_run_clear_cb+0x1bc/0x270 include/linux/filter.h:755
        run_filter+0xa0/0x17c net/packet/af_packet.c:2031
        packet_rcv+0xc0/0x3e0 net/packet/af_packet.c:2104
        dev_queue_xmit_nit+0x2bc/0x39c net/core/dev.c:2387
        xmit_one net/core/dev.c:3588 [inline]
        dev_hard_start_xmit+0x94/0x518 net/core/dev.c:3609
        sch_direct_xmit+0x11c/0x1f0 net/sched/sch_generic.c:313
        qdisc_restart net/sched/sch_generic.c:376 [inline]
        __qdisc_run+0x194/0x7f8 net/sched/sch_generic.c:384
        qdisc_run include/net/pkt_sched.h:136 [inline]
        qdisc_run include/net/pkt_sched.h:128 [inline]
        __dev_xmit_skb net/core/dev.c:3795 [inline]
        __dev_queue_xmit+0x65c/0xf84 net/core/dev.c:4150
        dev_queue_xmit+0x14/0x18 net/core/dev.c:4215
        neigh_resolve_output net/core/neighbour.c:1491 [inline]
        neigh_resolve_output+0x170/0x228 net/core/neighbour.c:1471
        neigh_output include/net/neighbour.h:510 [inline]
        ip6_finish_output2+0x2e4/0x9fc net/ipv6/ip6_output.c:117
        __ip6_finish_output net/ipv6/ip6_output.c:182 [inline]
        __ip6_finish_output+0x164/0x3f8 net/ipv6/ip6_output.c:161
        ip6_finish_output+0x2c/0xb0 net/ipv6/ip6_output.c:192
        NF_HOOK_COND include/linux/netfilter.h:290 [inline]
        ip6_output+0x74/0x294 net/ipv6/ip6_output.c:215
        dst_output include/net/dst.h:448 [inline]
        NF_HOOK include/linux/netfilter.h:301 [inline]
        NF_HOOK include/linux/netfilter.h:295 [inline]
        mld_sendpack+0x2a8/0x7e4 net/ipv6/mcast.c:1679
        mld_send_cr net/ipv6/mcast.c:1975 [inline]
        mld_ifc_timer_expire+0x1e8/0x494 net/ipv6/mcast.c:2474
        call_timer_fn+0xd0/0x570 kernel/time/timer.c:1431
        expire_timers kernel/time/timer.c:1476 [inline]
        __run_timers kernel/time/timer.c:1745 [inline]
        run_timer_softirq+0x2e4/0x384 kernel/time/timer.c:1758
        __do_softirq+0x204/0x7ac kernel/softirq.c:345
        do_softirq_own_stack include/asm-generic/softirq_stack.h:10 [inline]
        invoke_softirq kernel/softirq.c:228 [inline]
        __irq_exit_rcu+0x1d8/0x200 kernel/softirq.c:422
        irq_exit+0x10/0x3c kernel/softirq.c:446
        __handle_domain_irq+0xb4/0x120 kernel/irq/irqdesc.c:692
        handle_domain_irq include/linux/irqdesc.h:176 [inline]
        gic_handle_irq+0x84/0xac drivers/irqchip/irq-gic.c:370
        __irq_svc+0x5c/0x94 arch/arm/kernel/entry-armv.S:205
        debug_smp_processor_id+0x0/0x24 lib/smp_processor_id.c:53
        rcu_read_lock_held_common kernel/rcu/update.c:108 [inline]
        rcu_read_lock_sched_held+0x24/0x7c kernel/rcu/update.c:123
        trace_lock_acquire+0x24c/0x278 include/trace/events/lock.h:13
        lock_acquire+0x3c/0x74 kernel/locking/lockdep.c:5481
        rcu_lock_acquire include/linux/rcupdate.h:267 [inline]
        rcu_read_lock include/linux/rcupdate.h:656 [inline]
        avc_has_perm_noaudit+0x6c/0x260 security/selinux/avc.c:1150
        selinux_inode_permission+0x140/0x220 security/selinux/hooks.c:3141
        security_inode_permission+0x44/0x60 security/security.c:1268
        inode_permission.part.0+0x5c/0x13c fs/namei.c:521
        inode_permission fs/namei.c:494 [inline]
        may_lookup fs/namei.c:1652 [inline]
        link_path_walk.part.0+0xd4/0x38c fs/namei.c:2208
        link_path_walk fs/namei.c:2189 [inline]
        path_lookupat+0x3c/0x1b8 fs/namei.c:2419
        filename_lookup+0xa8/0x1a4 fs/namei.c:2453
        user_path_at_empty+0x74/0x90 fs/namei.c:2733
        do_readlinkat+0x5c/0x12c fs/stat.c:417
        __do_sys_readlink fs/stat.c:450 [inline]
        sys_readlink+0x24/0x28 fs/stat.c:447
        ret_fast_syscall+0x0/0x2c arch/arm/mm/proc-v7.S:64
        0x7eaa4974
      irq event stamp: 298277
      hardirqs last  enabled at (298277): [<802000d0>] no_work_pending+0x4/0x34
      hardirqs last disabled at (298276): [<8020c9b8>] do_work_pending+0x9c/0x648 arch/arm/kernel/signal.c:676
      softirqs last  enabled at (298216): [<8020167c>] __do_softirq+0x584/0x7ac kernel/softirq.c:372
      softirqs last disabled at (298201): [<8024dff4>] do_softirq_own_stack include/asm-generic/softirq_stack.h:10 [inline]
      softirqs last disabled at (298201): [<8024dff4>] invoke_softirq kernel/softirq.c:228 [inline]
      softirqs last disabled at (298201): [<8024dff4>] __irq_exit_rcu+0x1d8/0x200 kernel/softirq.c:422
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&pstats->syncp)->seq);
        <Interrupt>
          lock(&(&pstats->syncp)->seq);
      
       *** DEADLOCK ***
      
      1 lock held by udevd/4013:
       #0: 82b09c5c (rcu_read_lock){....}-{1:2}, at: sk_filter_trim_cap+0x54/0x434 net/core/filter.c:139
      
      stack backtrace:
      CPU: 1 PID: 4013 Comm: udevd Not tainted 5.12.0-rc3-syzkaller #0
      Hardware name: ARM-Versatile Express
      Backtrace:
      [<81802550>] (dump_backtrace) from [<818027c4>] (show_stack+0x18/0x1c arch/arm/kernel/traps.c:252)
       r7:00000080 r6:600d0093 r5:00000000 r4:82b58344
      [<818027ac>] (show_stack) from [<81809e98>] (__dump_stack lib/dump_stack.c:79 [inline])
      [<818027ac>] (show_stack) from [<81809e98>] (dump_stack+0xb8/0xe8 lib/dump_stack.c:120)
      [<81809de0>] (dump_stack) from [<81804a00>] (print_usage_bug.part.0+0x228/0x230 kernel/locking/lockdep.c:3806)
       r7:86bcb768 r6:81a0326c r5:830f96a8 r4:86bcb0c0
      [<818047d8>] (print_usage_bug.part.0) from [<802bb1b8>] (print_usage_bug kernel/locking/lockdep.c:3776 [inline])
      [<818047d8>] (print_usage_bug.part.0) from [<802bb1b8>] (valid_state kernel/locking/lockdep.c:3818 [inline])
      [<818047d8>] (print_usage_bug.part.0) from [<802bb1b8>] (mark_lock_irq kernel/locking/lockdep.c:4021 [inline])
      [<818047d8>] (print_usage_bug.part.0) from [<802bb1b8>] (mark_lock.part.0+0xc34/0x136c kernel/locking/lockdep.c:4478)
       r10:83278fe8 r9:82c6d748 r8:00000000 r7:82c6d2d4 r6:00000004 r5:86bcb768
       r4:00000006
      [<802ba584>] (mark_lock.part.0) from [<802bc644>] (mark_lock kernel/locking/lockdep.c:4442 [inline])
      [<802ba584>] (mark_lock.part.0) from [<802bc644>] (mark_usage kernel/locking/lockdep.c:4391 [inline])
      [<802ba584>] (mark_lock.part.0) from [<802bc644>] (__lock_acquire+0x9bc/0x3318 kernel/locking/lockdep.c:4854)
       r10:86bcb768 r9:86bcb0c0 r8:00000001 r7:00040000 r6:0000075a r5:830f96a8
       r4:00000000
      [<802bbc88>] (__lock_acquire) from [<802bfb90>] (lock_acquire.part.0+0xf0/0x41c kernel/locking/lockdep.c:5510)
       r10:00000000 r9:600d0013 r8:00000000 r7:00000000 r6:828a2680 r5:828a2680
       r4:861e5bc8
      [<802bfaa0>] (lock_acquire.part.0) from [<802bff28>] (lock_acquire+0x6c/0x74 kernel/locking/lockdep.c:5483)
       r10:8146137c r9:00000000 r8:00000001 r7:00000000 r6:00000000 r5:00000000
       r4:ff7c9dec
      [<802bfebc>] (lock_acquire) from [<81381eb4>] (do_write_seqcount_begin_nested include/linux/seqlock.h:520 [inline])
      [<802bfebc>] (lock_acquire) from [<81381eb4>] (do_write_seqcount_begin include/linux/seqlock.h:545 [inline])
      [<802bfebc>] (lock_acquire) from [<81381eb4>] (u64_stats_update_begin include/linux/u64_stats_sync.h:129 [inline])
      [<802bfebc>] (lock_acquire) from [<81381eb4>] (__bpf_prog_run_save_cb include/linux/filter.h:727 [inline])
      [<802bfebc>] (lock_acquire) from [<81381eb4>] (bpf_prog_run_save_cb include/linux/filter.h:741 [inline])
      [<802bfebc>] (lock_acquire) from [<81381eb4>] (sk_filter_trim_cap+0x26c/0x434 net/core/filter.c:149)
       r10:a4095dd0 r9:ff7c9dd0 r8:e44be000 r7:8146137c r6:00000001 r5:8611ba80
       r4:00000000
      [<81381c48>] (sk_filter_trim_cap) from [<8146137c>] (sk_filter include/linux/filter.h:867 [inline])
      [<81381c48>] (sk_filter_trim_cap) from [<8146137c>] (do_one_broadcast net/netlink/af_netlink.c:1468 [inline])
      [<81381c48>] (sk_filter_trim_cap) from [<8146137c>] (netlink_broadcast_filtered+0x27c/0x4fc net/netlink/af_netlink.c:1520)
       r10:00000001 r9:833d6b1c r8:00000000 r7:8572f864 r6:8611ba80 r5:8698d800
       r4:8572f800
      [<81461100>] (netlink_broadcast_filtered) from [<81463e60>] (netlink_broadcast net/netlink/af_netlink.c:1544 [inline])
      [<81461100>] (netlink_broadcast_filtered) from [<81463e60>] (netlink_sendmsg+0x3d0/0x478 net/netlink/af_netlink.c:1925)
       r10:00000000 r9:00000002 r8:8698d800 r7:000000b7 r6:8611b900 r5:861e5f50
       r4:86aa3000
      [<81463a90>] (netlink_sendmsg) from [<81321f54>] (sock_sendmsg_nosec net/socket.c:654 [inline])
      [<81463a90>] (netlink_sendmsg) from [<81321f54>] (sock_sendmsg+0x3c/0x4c net/socket.c:674)
       r10:00000000 r9:861e5dd4 r8:00000000 r7:86570000 r6:00000000 r5:86570000
       r4:861e5f50
      [<81321f18>] (sock_sendmsg) from [<813234d0>] (____sys_sendmsg+0x230/0x29c net/socket.c:2350)
       r5:00000040 r4:861e5f50
      [<813232a0>] (____sys_sendmsg) from [<8132549c>] (___sys_sendmsg+0xac/0xe4 net/socket.c:2404)
       r10:00000128 r9:861e4000 r8:00000000 r7:00000000 r6:86570000 r5:861e5f50
       r4:00000000
      [<813253f0>] (___sys_sendmsg) from [<81325684>] (__sys_sendmsg net/socket.c:2433 [inline])
      [<813253f0>] (___sys_sendmsg) from [<81325684>] (__do_sys_sendmsg net/socket.c:2442 [inline])
      [<813253f0>] (___sys_sendmsg) from [<81325684>] (sys_sendmsg+0x58/0xa0 net/socket.c:2440)
       r8:80200224 r7:00000128 r6:00000000 r5:7eaa541c r4:86570000
      [<8132562c>] (sys_sendmsg) from [<80200060>] (ret_fast_syscall+0x0/0x2c arch/arm/mm/proc-v7.S:64)
      Exception stack(0x861e5fa8 to 0x861e5ff0)
      5fa0:                   00000000 00000000 0000000c 7eaa541c 00000000 00000000
      5fc0: 00000000 00000000 76fbf840 00000128 00000000 0000008f 7eaa541c 000563f8
      5fe0: 00056110 7eaa53e0 00036cec 76c9bf44
       r6:76fbf840 r5:00000000 r4:00000000
      
      Fixes: 492ecee8 ("bpf: enable program stats")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211026214133.3114279-2-eric.dumazet@gmail.com
    • libbpf: Deprecate bpf_objects_list · 689624f0
      Joe Burton authored
      
      
      Add a flag to `enum libbpf_strict_mode' to disable the global
      `bpf_objects_list', preventing race conditions when concurrent threads
      call bpf_object__open() or bpf_object__close().
      
      bpf_object__next() will return NULL if this option is set.
      
      Callers may achieve the same workflow by tracking bpf_objects in
      application code.
      
        [0] Closes: https://github.com/libbpf/libbpf/issues/293
      
      Signed-off-by: Joe Burton <jevburton@google.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211026223528.413950-1-jevburton.kernel@gmail.com
  4. Oct 26, 2021