Skip to content
  1. Nov 21, 2022
    • Yonghong Song's avatar
      bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx · fd264ca0
      Yonghong Song authored
      
      
      Implement bpf_cast_to_kern_ctx() kfunc which does a type cast
      of a uapi ctx object to the corresponding kernel ctx. Previously
      if users want to access some data available in kctx but not
      in uapi ctx, bpf_probe_read_kernel() helper is needed.
      The introduction of bpf_cast_to_kern_ctx() allows direct
      memory access which makes code simpler and easier to understand.
      
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20221120195432.3113982-1-yhs@fb.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      fd264ca0
    • Yonghong Song's avatar
      bpf: Add support for kfunc set with common btf_ids · cfe14564
      Yonghong Song authored
      
      
      Later on, we will introduce kfuncs bpf_cast_to_kern_ctx() and
      bpf_rdonly_cast() which apply to all program types. Currently kfunc set
      only supports individual prog types. This patch added support for kfunc
      applying to all program types.
      
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20221120195426.3113828-1-yhs@fb.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      cfe14564
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Disallow bpf_obj_new_impl call when bpf_mem_alloc_init fails · e181d3f1
      Kumar Kartikeya Dwivedi authored
      
      
      In the unlikely event that bpf_global_ma is not correctly initialized,
      instead of checking the boolean everytime bpf_obj_new_impl is called,
      simply check it while loading the program and return an error if
      bpf_global_ma_set is false.
      
      Suggested-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221120212610.2361700-1-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      e181d3f1
    • Alexei Starovoitov's avatar
      Merge branch 'Support storing struct task_struct objects as kptrs' · efc1970d
      Alexei Starovoitov authored
      
      
      David Vernet says:
      
      ====================
      
      Now that BPF supports adding new kernel functions with kfuncs, and
      storing kernel objects in maps with kptrs, we can add a set of kfuncs
      which allow struct task_struct objects to be stored in maps as
      referenced kptrs.
      
      The possible use cases for doing this are plentiful.  During tracing,
      for example, it would be useful to be able to collect some tasks that
      performed a certain operation, and then periodically summarize who they
      are, which cgroup they're in, how much CPU time they've utilized, etc.
      Doing this now would require storing the tasks' pids along with some
      relevant data to be exported to user space, and later associating the
      pids to tasks in other event handlers where the data is recorded.
      Another useful by-product of this is that it allows a program to pin a
      task in a BPF program, and by proxy therefore also e.g. pin its task
      local storage.
      
      In order to support this, we'll need to expand KF_TRUSTED_ARGS to
      support receiving trusted, non-refcounted pointers. It currently only
      supports either PTR_TO_CTX pointers, or refcounted pointers. What this
      means in terms of the implementation is that check_kfunc_args() would
      have to also check for the PTR_TRUSTED or MEM_ALLOC type modifiers when
      determining if a trusted KF_ARG_PTR_TO_ALLOC_BTF_ID or
      KF_ARG_PTR_TO_BTF_ID pointer requires a refcount.
      
      Note that PTR_UNTRUSTED is insufficient for this purpose, as it does not
      cover all of the possible types of potentially unsafe pointers. For
      example, a pointer obtained from walking a struct is not PTR_UNTRUSTED.
      To account for this and enable us to expand KF_TRUSTED_ARGS to include
      allow-listed arguments such as those passed by the kernel to tracepoints
      and struct_ops callbacks, this patch set also introduces a new
      PTR_TRUSTED type flag modifier which records if a pointer was obtained
      passed from the kernel in a trusted context.
      
      Currently, both PTR_TRUSTED and MEM_ALLOC are used to imply that a
      pointer is trusted. Longer term, PTR_TRUSTED should be the sole source
      of truth for whether a pointer is trusted. This requires us to set
      PTR_TRUSTED when appropriate (e.g. when setting MEM_ALLOC), and unset it
      when appropriate (e.g. when setting PTR_UNTRUSTED). We don't do that in
      this patch, as we need to do more clean up before this can be done in a
      clear and well-defined manner.
      
      In closing, this patch set:
      
      1. Adds the new PTR_TRUSTED register type modifier flag, and updates the
         verifier and existing selftests accordingly. Also expands
         KF_TRUSTED_ARGS to also include trusted pointers that were not obtained
         from walking structs.
      2. Adds a new set of kfuncs that allows struct task_struct* objects to be
         used as kptrs.
      3. Adds a new selftest suite to validate these new task kfuncs.
      ---
      Changelog:
      v8 -> v9:
      - Moved check for release register back to where we check for
        !PTR_TO_BTF_ID || socket. Change the verifier log message to
        reflect really what's being tested (the presence of unsafe
        modifiers) (Alexei)
      - Fix verifier_test error tests to reflect above changes
      - Remove unneeded parens around bitwise operator checks (Alexei)
      - Move updates to reg_type_str() which allow multiple type modifiers
        to be present in the prefix string, to a separate patch (Alexei)
      - Increase TYPE_STR_BUF_LEN size to 128 to reflect larger prefix size
        in reg_type_str().
      
      v7 -> v8:
      - Rebased onto Kumar's latest patch set which, adds a new MEM_ALLOC reg
        type modifier for bpf_obj_new() calls.
      - Added comments to bpf_task_kptr_get() describing some of the subtle
        races we're protecting against (Alexei and John)
      - Slightly rework process_kf_arg_ptr_to_btf_id(), and add a new
        reg_has_unsafe_modifiers() function which validates that a register
        containing a kfunc release arg doesn't have unsafe modifiers. Note
        that this is slightly different than the check for KF_TRUSTED_ARGS.
        An alternative here would be to treat KF_RELEASE as implicitly
        requiring KF_TRUSTED_ARGS.
      - Export inline bpf_type_has_unsafe_modifiers() function from
        bpf_verifier.h so that it can be used from bpf_tcp_ca.c. Eventually this
        function should likely be changed to bpf_type_is_trusted(), once
        PTR_TRUSTED is the real source of truth.
      
      v6 -> v7:
      - Removed the PTR_WALKED type modifier, and instead define a new
        PTR_TRUSTED type modifier which is set on registers containing
        pointers passed from trusted contexts (i.e. as tracepoint or
        struct_ops callback args) (Alexei)
      - Remove the new KF_OWNED_ARGS kfunc flag. This can be accomplished
        by defining a new type that wraps an existing type, such as with
        struct nf_conn___init (Alexei)
      - Add a test_task_current_acquire_release testcase which verifies we can
        acquire a task struct returned from bpf_get_current_task_btf().
      - Make bpf_task_acquire() no longer return NULL, as it can no longer be
        called with a NULL task.
      - Removed unnecessary is_test_kfunc_task() checks from failure
        testcases.
      
      v5 -> v6:
      - Add a new KF_OWNED_ARGS kfunc flag which may be used by kfuncs to
        express that they require trusted, refcounted args (Kumar)
      - Rename PTR_NESTED -> PTR_WALKED in the verifier (Kumar)
      - Convert reg_type_str() prefixes to use snprintf() instead of strncpy()
        (Kumar)
      - Add PTR_TO_BTF_ID | PTR_WALKED to missing struct btf_reg_type
        instances -- specifically btf_id_sock_common_types, and
        percpu_btf_ptr_types.
      - Add a missing PTR_TO_BTF_ID | PTR_WALKED switch case entry in
        check_func_arg_reg_off(), which is required when validating helper
        calls (Kumar)
      - Update reg_type_mismatch_ok() to check base types for the registers
        (i.e. to accommodate type modifiers). Additionally, add a lengthy
        comment that explains why this is being done (Kumar)
      - Update convert_ctx_accesses() to also issue probe reads for
        PTR_TO_BTF_ID | PTR_WALKED (Kumar)
      - Update selftests to expect new prefix reg type strings.
      - Rename task_kfunc_acquire_trusted_nested testcase to
        task_kfunc_acquire_trusted_walked, and fix a comment (Kumar)
      - Remove KF_TRUSTED_ARGS from bpf_task_release(), which already includes
        KF_RELEASE (Kumar)
      - Add bpf-next in patch subject lines (Kumar)
      
      v4 -> v5:
      - Fix an improperly formatted patch title.
      
      v3 -> v4:
      - Remove an unnecessary check from my repository that I forgot to remove
        after debugging something.
      
      v2 -> v3:
      - Make bpf_task_acquire() check for NULL, and include KF_RET_NULL
        (Martin)
      - Include new PTR_NESTED register modifier type flag which specifies
        whether a pointer was obtained from walking a struct. Use this to
        expand the meaning of KF_TRUSTED_ARGS to include trusted pointers that
        were passed from the kernel (Kumar)
      - Add more selftests to the task_kfunc selftest suite which verify that
        you cannot pass a walked pointer to bpf_task_acquire().
      - Update bpf_task_acquire() to also specify KF_TRUSTED_ARGS.
      
      v1 -> v2:
      - Rename tracing_btf_ids to generic_kfunc_btf_ids, and add the new
        kfuncs to that list instead of making a separate btf id list (Alexei).
      - Don't run the new selftest suite on s390x, which doesn't appear to
        support invoking kfuncs.
      - Add a missing __diag_ignore block for -Wmissing-prototypes
        (lkp@intel.com).
      - Fix formatting on some of the SPDX-License-Identifier tags.
      - Clarified the function header comment a bit on bpf_task_kptr_get().
      ====================
      
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      efc1970d
    • David Vernet's avatar
      bpf/selftests: Add selftests for new task kfuncs · fe147956
      David Vernet authored
      
      
      A previous change added a series of kfuncs for storing struct
      task_struct objects as referenced kptrs. This patch adds a new
      task_kfunc test suite for validating their expected behavior.
      
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20221120051004.3605026-5-void@manifault.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      fe147956
    • David Vernet's avatar
      bpf: Add kfuncs for storing struct task_struct * as a kptr · 90660309
      David Vernet authored
      
      
      Now that BPF supports adding new kernel functions with kfuncs, and
      storing kernel objects in maps with kptrs, we can add a set of kfuncs
      which allow struct task_struct objects to be stored in maps as
      referenced kptrs. The possible use cases for doing this are plentiful.
      During tracing, for example, it would be useful to be able to collect
      some tasks that performed a certain operation, and then periodically
      summarize who they are, which cgroup they're in, how much CPU time
      they've utilized, etc.
      
      In order to enable this, this patch adds three new kfuncs:
      
      struct task_struct *bpf_task_acquire(struct task_struct *p);
      struct task_struct *bpf_task_kptr_get(struct task_struct **pp);
      void bpf_task_release(struct task_struct *p);
      
      A follow-on patch will add selftests validating these kfuncs.
      
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20221120051004.3605026-4-void@manifault.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      90660309
    • David Vernet's avatar
      bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs · 3f00c523
      David Vernet authored
      
      
      Kfuncs currently support specifying the KF_TRUSTED_ARGS flag to signal
      to the verifier that it should enforce that a BPF program passes it a
      "safe", trusted pointer. Currently, "safe" means that the pointer is
      either PTR_TO_CTX, or is refcounted. There may be cases, however, where
      the kernel passes a BPF program a safe / trusted pointer to an object
      that the BPF program wishes to use as a kptr, but because the object
      does not yet have a ref_obj_id from the perspective of the verifier, the
      program would be unable to pass it to a KF_ACQUIRE | KF_TRUSTED_ARGS
      kfunc.
      
      The solution is to expand the set of pointers that are considered
      trusted according to KF_TRUSTED_ARGS, so that programs can invoke kfuncs
      with these pointers without getting rejected by the verifier.
      
      There is already a PTR_UNTRUSTED flag that is set in some scenarios,
      such as when a BPF program reads a kptr directly from a map
      without performing a bpf_kptr_xchg() call. These pointers of course can
      and should be rejected by the verifier. Unfortunately, however,
      PTR_UNTRUSTED does not cover all the cases for safety that need to
      be addressed to adequately protect kfuncs. Specifically, pointers
      obtained by a BPF program "walking" a struct are _not_ considered
      PTR_UNTRUSTED according to BPF. For example, say that we were to add a
      kfunc called bpf_task_acquire(), with KF_ACQUIRE | KF_TRUSTED_ARGS, to
      acquire a struct task_struct *. If we only used PTR_UNTRUSTED to signal
      that a task was unsafe to pass to a kfunc, the verifier would mistakenly
      allow the following unsafe BPF program to be loaded:
      
      SEC("tp_btf/task_newtask")
      int BPF_PROG(unsafe_acquire_task,
                   struct task_struct *task,
                   u64 clone_flags)
      {
              struct task_struct *acquired, *nested;
      
              nested = task->last_wakee;
      
              /* Would not be rejected by the verifier. */
              acquired = bpf_task_acquire(nested);
              if (!acquired)
                      return 0;
      
              bpf_task_release(acquired);
              return 0;
      }
      
      To address this, this patch defines a new type flag called PTR_TRUSTED
      which tracks whether a PTR_TO_BTF_ID pointer is safe to pass to a
      KF_TRUSTED_ARGS kfunc or a BPF helper function. PTR_TRUSTED pointers are
      passed directly from the kernel as a tracepoint or struct_ops callback
      argument. Any nested pointer that is obtained from walking a PTR_TRUSTED
      pointer is no longer PTR_TRUSTED. From the example above, the struct
      task_struct *task argument is PTR_TRUSTED, but the 'nested' pointer
      obtained from 'task->last_wakee' is not PTR_TRUSTED.
      
      A subsequent patch will add kfuncs for storing a task kfunc as a kptr,
      and then another patch will add selftests to validate.
      
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20221120051004.3605026-3-void@manifault.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      3f00c523
    • David Vernet's avatar
      bpf: Allow multiple modifiers in reg_type_str() prefix · ef66c547
      David Vernet authored
      
      
      reg_type_str() in the verifier currently only allows a single register
      type modifier to be present in the 'prefix' string which is eventually
      stored in the env type_str_buf. This currently works fine because there
      are no overlapping type modifiers, but once PTR_TRUSTED is added, that
      will no longer be the case. This patch updates reg_type_str() to support
      having multiple modifiers in the prefix string, and updates the size of
      type_str_buf to be 128 bytes.
      
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20221120051004.3605026-2-void@manifault.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ef66c547
  2. Nov 19, 2022
  3. Nov 18, 2022
    • Alexei Starovoitov's avatar
      Merge branch 'Allocated objects, BPF linked lists' · db6bf999
      Alexei Starovoitov authored
      
      
      Kumar Kartikeya Dwivedi says:
      
      ====================
      
      This series introduces user defined BPF objects of a type in program
      BTF. This allows BPF programs to allocate their own objects, build their
      own object hierarchies, and use the basic building blocks provided by
      BPF runtime to build their own data structures flexibly.
      
      Then, we introduce the support for single ownership BPF linked lists,
      which can be put inside BPF maps, or allocated objects, and hold such
      allocated objects as elements. It works as an instrusive collection,
      which is done to allow making allocated objects part of multiple data
      structures at the same time in the future.
      
      The eventual goal of this and future patches is to allow one to do some
      limited form of kernel style programming in BPF C, and allow programmers
      to build their own complex data structures flexibly out of basic
      building blocks.
      
      The key difference will be that such programs are verified to be safe,
      preserve runtime integrity of the system, and are proven to be bug free
      as far as the invariants of BPF specific APIs are concerned.
      
      One immediate use case that will be using the entire infrastructure this
      series is introducing will be managing percpu NMI safe linked lists
      inside BPF programs.
      
      The other use case this will serve in the near future will be linking
      kernel structures like XDP frame and sk_buff directly into user data
      structures (rbtree, pifomap, etc.) for packet queueing. This will follow
      single ownership concept included in this series.
      
      The user has complete control of the internal locking, and hence also
      the batching of operations for each critical section.
      
      The features are:
      - Allocated objects.
      - bpf_obj_new, bpf_obj_drop to allocate and free them.
      - Single ownership BPF linked lists.
        - Support for them in BPF maps.
        - Support for them in allocated objects.
      - Global spin locks.
      - Spin locks inside allocated objects.
      
      Some other notable things:
      - Completely static verification of locking.
      - Kfunc argument handling has been completely reworked.
      - Argument rewriting support for kfuncs.
      - A new bpf_experimental.h header as a dumping ground for these APIs.
      
      Any functionality exposed in this series is NOT part of UAPI. It is only
      available through use of kfuncs, and structs that can be added to map
      value may also change their size or name in the future. Hence, every
      feature in this series must be considered experimental.
      
      Follow-ups:
      -----------
       * Support for kptrs (local and kernel) in local storage and percpu maps + kptr tests
       * Fixes for helper access checks rebasing on top of this series
      
      Next steps:
      -----------
       * NMI safe percpu single ownership linked lists (using local_t protection).
       * Lockless linked lists.
       * Allow RCU protected BPF allocated objects. This then allows RCU
         protected list lookups, since spinlock protection for readers does
         not scale.
       * Introduce bpf_refcount for local kptrs, shared ownership.
       * Introduce shared ownership linked lists.
       * Documentation.
      
      Changelog:
      ----------
       v9 -> v10
       v9: https://lore.kernel.org/bpf/20221117225510.1676785-1-memxor@gmail.com
      
        * Deduplicate code to find btf_record of reg (Alexei)
        * Add linked_list test to DENYLIST.aarch64 (Alexei)
        * Disable some linked list tests for now so that they compile with
          clang nightly (Alexei)
      
       v8 -> v9
       v8: https://lore.kernel.org/bpf/20221117162430.1213770-1-memxor@gmail.com
      
        * Fix up commit log of patch 2, Simplify patch 3
        * Explain the implicit requirement of bpf_list_head requiring map BTF
          to match in btf_record_equal in a separate patch.
      
       v7 -> v8
       v7: https://lore.kernel.org/bpf/20221114191547.1694267-1-memxor@gmail.com
      
        * Fix early return in map_check_btf (Dan Carpenter)
        * Fix two memory leak bugs in local storage maps, outer maps
        * Address comments from Alexei and Dave
         * More local kptr -> allocated object renaming
         * Use krealloc with NULL instead kmalloc + krealloc
         * Drop WARN_ON_ONCE for field_offs parsing
         * Combine kfunc add + remove patches into one
         * Drop STRONG suffix from KF_ARG_PTR_TO_KPTR
         * Rename is_kfunc_arg_ret_buf_size to is_kfunc_arg_scalar_with_name
         * Remove redundant check for reg->type and arg type in it
         * Drop void * ret type check
         * Remove code duplication in checks for NULL pointer with offset != 0
         * Fix two bpf_list_node typos
         * Improve log message for bpf_list_head operations
         * Improve comments for active_lock struct
         * Improve comments for Implementation details of process_spin_lock
        * Add Dave's acks
      
       v6 -> v7
       v6: https://lore.kernel.org/bpf/20221111193224.876706-1-memxor@gmail.com
      
        * Fix uninitialized variable warning (Dan Carpenter, Kernel Test Robot)
        * One more local_kptr renaming
      
       v5 -> v6
       v5: https://lore.kernel.org/bpf/20221107230950.7117-1-memxor@gmail.com
      
        * Replace (i && !off) check with next_off, include test (Andrii)
        * Drop local kptrs naming (Andrii, Alexei)
        * Drop reg->precise == EXACT patch (Andrii)
        * Add comment about ptr member of struct active_lock (Andrii)
        * Use btf__new_empty + btf__add_xxx APIs (Andrii)
        * Address other misc nits from Andrii
      
       v4 -> v5
       v4: https://lore.kernel.org/bpf/20221103191013.1236066-1-memxor@gmail.com
      
        * Add a lot more selftests (failure, success, runtime, BTF)
        * Make sure series is bisect friendly
        * Move list draining out of spin lock
          * This exposed an issue where bpf_mem_free can now be called in
            map_free path without migrate_disable, also fixed that.
        * Rename MEM_ALLOC -> MEM_RINGBUF, MEM_TYPE_LOCAL -> MEM_ALLOC (Alexei)
        * Group lock identity into a struct active_lock { ptr, id } (Dave)
        * Split set_release_on_unlock logic into separate patch (Alexei)
      
       v3 -> v4
       v3: https://lore.kernel.org/bpf/20221102202658.963008-1-memxor@gmail.com
      
        * Fix compiler error for !CONFIG_BPF_SYSCALL (Kernel Test Robot)
        * Fix error due to BUILD_BUG_ON on 32-bit platforms (Kernel Test Robot)
      
       v2 -> v3
       v2: https://lore.kernel.org/bpf/20221013062303.896469-1-memxor@gmail.com
      
        * Add ack from Dave for patch 5
        * Rename btf_type_fields -> btf_record, btf_type_fields_off ->
          btf_field_offs, rename functions similarly (Alexei)
        * Remove 'kind' component from contains declaration tag (Alexei)
        * Move bpf_list_head, bpf_list_node definitions to UAPI bpf.h (Alexei)
        * Add note in commit log about modifying btf_struct_access API (Dave)
        * Downgrade WARN_ON_ONCE to verbose(env, "...") and return -EFAULT (Dave)
        * Add type_is_local_kptr wrapper to avoid noisy checks (Dave)
        * Remove unused flags parameter from bpf_kptr_new (Alexei)
        * Rename bpf_kptr_new -> bpf_obj_new, bpf_kptr_drop -> bpf_obj_drop (Alexei)
        * Reword comment in ref_obj_id_set_release_on_unlock (Dave)
        * Fix return type of ref_obj_id_set_release_on_unlock (Dave)
        * Introduce is_bpf_list_api_kfunc to dedup checks (Dave)
        * Disallow BPF_WRITE to untrusted local kptrs
        * Add details about soundness of check_reg_allocation_locked logic
        * List untrusted local kptrs for PROBE_MEM handling
      
       v1 -> v2
       v1: https://lore.kernel.org/bpf/20221011012240.3149-1-memxor@gmail.com
      
        * Rebase on bpf-next to resolve merge conflict in DENYLIST.s390x
        * Fix a couple of mental lapses in bpf_list_head_free
      
       RFC v1 -> v1
       RFC v1: https://lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com
      
        * Mostly a complete rewrite of BTF parsing, refactor existing code (Kartikeya)
        * Rebase kfunc rewrite for bpf-next, add support for more changes
        * Cache type metadata in BTF to avoid recomputation inside verifier (Kartikeya)
        * Remove __kernel tag, make things similar to map values, reserve bpf_ prefix
        * bpf_kptr_new, bpf_kptr_drop
        * Rename precision state enum values (Alexei)
        * Drop explicit constructor/destructor support (Alexei)
        * Rewrite code for constructing/destructing objects and offload to runtime
        * Minimize duplication in bpf_map_value_off_desc handling (Alexei)
        * Expose global memory allocator (Alexei)
        * Address other nits from Alexei
        * Split out local kptrs in maps, more kptrs in maps support into a follow up
      
      Links:
      ------
       * Dave's BPF RB-Tree RFC series
         v1 (Discussion thread)
           https://lore.kernel.org/bpf/20220722183438.3319790-1-davemarchevsky@fb.com
         v2 (With support for static locks)
           https://lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
       * BPF Linked Lists Discussion
         https://lore.kernel.org/bpf/CAP01T74U30+yeBHEgmgzTJ-XYxZ0zj71kqCDJtTH9YQNfTK+Xw@mail.gmail.com
       * BPF Memory Allocator from Alexei
         https://lore.kernel.org/bpf/20220902211058.60789-1-alexei.starovoitov@gmail.com
       * BPF Memory Allocator UAPI Discussion
         https://lore.kernel.org/bpf/d3f76b27f4e55ec9e400ae8dcaecbb702a4932e8.camel@fb.com
      ====================
      
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      db6bf999
    • Kumar Kartikeya Dwivedi's avatar
      selftests/bpf: Temporarily disable linked list tests · 0a2f85a1
      Kumar Kartikeya Dwivedi authored
      
      
      The latest clang nightly as of writing crashes with the given test case
      for BPF linked lists wherever global glock, ghead, glock2 are used,
      hence comment out the parts that cause the crash, and prepare this commit
      so that it can be reverted when the fix has been made. More context in [0].
      
       [0]: https://lore.kernel.org/bpf/d56223f9-483e-fbc1-4564-44c0858a1e3e@meta.com
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-25-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      0a2f85a1
    • Kumar Kartikeya Dwivedi's avatar
      selftests/bpf: Add BTF sanity tests · dc2df7bf
      Kumar Kartikeya Dwivedi authored
      
      
      Preparing the metadata for bpf_list_head involves a complicated parsing
      step and type resolution for the contained value. Ensure that corner
      cases are tested against and invalid specifications in source are duly
      rejected. Also include tests for incorrect ownership relationships in
      the BTF.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-24-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      dc2df7bf
    • Kumar Kartikeya Dwivedi's avatar
      selftests/bpf: Add BPF linked list API tests · 300f19dc
      Kumar Kartikeya Dwivedi authored
      
      
      Include various tests covering the success and failure cases. Also, run
      the success cases at runtime to verify correctness of linked list
      manipulation routines, in addition to ensuring successful verification.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-23-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      300f19dc
    • Kumar Kartikeya Dwivedi's avatar
      selftests/bpf: Add failure test cases for spin lock pairing · c48748ae
      Kumar Kartikeya Dwivedi authored
      
      
      First, ensure that whenever a bpf_spin_lock is present in an allocation,
      the reg->id is preserved. This won't be true for global variables
      however, since they have a single map value per map, hence the verifier
      harcodes it to 0 (so that multiple pseudo ldimm64 insns can yield the
      same lock object per map at a given offset).
      
      Next, add test cases for all possible combinations (kptr, global, map
      value, inner map value). Since we lifted restriction on locking in inner
      maps, also add test cases for them. Currently, each lookup into an inner
      map gets a fresh reg->id, so even if the reg->map_ptr is same, they will
      be treated as separate allocations and the incorrect unlock pairing will
      be rejected.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-22-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c48748ae
    • Kumar Kartikeya Dwivedi's avatar
      selftests/bpf: Update spinlock selftest · d85aedac
      Kumar Kartikeya Dwivedi authored
      
      
      Make updates in preparation for adding more test cases to this selftest:
      - Convert from CHECK_ to ASSERT macros.
      - Use BPF skeleton
      - Fix typo sping -> spin
      - Rename spinlock.c -> spin_lock.c
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-21-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d85aedac
    • Kumar Kartikeya Dwivedi's avatar
      selftests/bpf: Add __contains macro to bpf_experimental.h · 64069c72
      Kumar Kartikeya Dwivedi authored
      
      
      Add user facing __contains macro which provides a convenient wrapper
      over the verbose kernel specific BTF declaration tag required to
      annotate BPF list head structs in user types.
      
      Acked-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-20-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      64069c72
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Add comments for map BTF matching requirement for bpf_list_head · c22dfdd2
      Kumar Kartikeya Dwivedi authored
      
      
      The old behavior of bpf_map_meta_equal was that it compared timer_off
      to be equal (but not spin_lock_off, because that was not allowed), and
      did memcmp of kptr_off_tab.
      
      Now, we memcmp the btf_record of two bpf_map structs, which has all
      fields.
      
      We preserve backwards compat as we kzalloc the array, so if only spin
      lock and timer exist in map, we only compare offset while the rest of
      unused members in the btf_field struct are zeroed out.
      
      In case of kptr, btf and everything else is of vmlinux or module, so as
      long type is same it will match, since kernel btf, module, dtor pointer
      will be same across maps.
      
      Now with list_head in the mix, things are a bit complicated. We
      implicitly add a requirement that both BTFs are same, because struct
      btf_field_list_head has btf and value_rec members.
      
      We obviously shouldn't force BTFs to be equal by default, as that breaks
      backwards compatibility.
      
      Currently it is only implicitly required due to list_head matching
      struct btf and value_rec member. value_rec points back into a btf_record
      stashed in the map BTF (btf member of btf_field_list_head). So that
      pointer and btf member has to match exactly.
      
      Document all these subtle details so that things don't break in the
      future when touching this code.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-19-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c22dfdd2
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Add 'release on unlock' logic for bpf_list_push_{front,back} · 534e86bc
      Kumar Kartikeya Dwivedi authored
      
      
      This commit implements the delayed release logic for bpf_list_push_front
      and bpf_list_push_back.
      
      Once a node has been added to the list, it's pointer changes to
      PTR_UNTRUSTED. However, it is only released once the lock protecting the
      list is unlocked. For such PTR_TO_BTF_ID | MEM_ALLOC with PTR_UNTRUSTED
      set but an active ref_obj_id, it is still permitted to read them as long
      as the lock is held. Writing to them is not allowed.
      
      This allows having read access to push items we no longer own until we
      release the lock guarding the list, allowing a little more flexibility
      when working with these APIs.
      
      Note that enabling write support has fairly tricky interactions with
      what happens inside the critical section. Just as an example, currently,
      bpf_obj_drop is not permitted, but if it were, being able to write to
      the PTR_UNTRUSTED pointer while the object gets released back to the
      memory allocator would violate safety properties we wish to guarantee
      (i.e. not crashing the kernel). The memory could be reused for a
      different type in the BPF program or even in the kernel as it gets
      eventually kfree'd.
      
      Not enabling bpf_obj_drop inside the critical section would appear to
      prevent all of the above, but that is more of an artifical limitation
      right now. Since the write support is tangled with how we handle
      potential aliasing of nodes inside the critical section that may or may
      not be part of the list anymore, it has been deferred to a future patch.
      
      Acked-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-18-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      534e86bc
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Introduce single ownership BPF linked list API · 8cab76ec
      Kumar Kartikeya Dwivedi authored
      
      
      Add a linked list API for use in BPF programs, where it expects
      protection from the bpf_spin_lock in the same allocation as the
      bpf_list_head. For now, only one bpf_spin_lock can be present hence that
      is assumed to be the one protecting the bpf_list_head.
      
      The following functions are added to kick things off:
      
      // Add node to beginning of list
      void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node);
      
      // Add node to end of list
      void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node);
      
      // Remove node at beginning of list and return it
      struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head);
      
      // Remove node at end of list and return it
      struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head);
      
      The lock protecting the bpf_list_head needs to be taken for all
      operations. The verifier ensures that the lock that needs to be taken is
      always held, and only the correct lock is taken for these operations.
      These checks are made statically by relying on the reg->id preserved for
      registers pointing into regions having both bpf_spin_lock and the
      objects protected by it. The comment over check_reg_allocation_locked in
      this change describes the logic in detail.
      
      Note that bpf_list_push_front and bpf_list_push_back are meant to
      consume the object containing the node in the 1st argument, however that
      specific mechanism is intended to not release the ref_obj_id directly
      until the bpf_spin_unlock is called. In this commit, nothing is done,
      but the next commit will be introducing logic to handle this case, so it
      has been left as is for now.
      
      bpf_list_pop_front and bpf_list_pop_back delete the first or last item
      of the list respectively, and return pointer to the element at the
      list_node offset. The user can then use container_of style macro to get
      the actual entry type. The verifier however statically knows the actual
      type, so the safety properties are still preserved.
      
      With these additions, programs can now manage their own linked lists and
      store their objects in them.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-17-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      8cab76ec
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Permit NULL checking pointer with non-zero fixed offset · df57f38a
      Kumar Kartikeya Dwivedi authored
      
      
      Pointer increment on seeing PTR_MAYBE_NULL is already protected against,
      hence make an exception for PTR_TO_BTF_ID | MEM_ALLOC while still
      keeping the warning for other unintended cases that might creep in.
      
      bpf_list_pop_{front,_back} helpers planned to be introduced in next
      commit will return a MEM_ALLOC register with incremented offset pointing
      to bpf_list_node field. The user is supposed to then obtain the pointer
      to the entry using container_of after NULL checking it. The current
      restrictions trigger a warning when doing the NULL checking. Revisiting
      the reason, it is meant as an assertion which seems to actually work and
      catch the bad case.
      
      Hence, under no other circumstances can reg->off be non-zero for a
      register that has the PTR_MAYBE_NULL type flag set.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-16-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      df57f38a
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Introduce bpf_obj_drop · ac9f0605
      Kumar Kartikeya Dwivedi authored
      
      
      Introduce bpf_obj_drop, which is the kfunc used to free allocated
      objects (allocated using bpf_obj_new). Pairing with bpf_obj_new, it
      implicitly destructs the fields part of object automatically without
      user intervention.
      
      Just like the previous patch, btf_struct_meta that is needed to free up
      the special fields is passed as a hidden argument to the kfunc.
      
      For the user, a convenience macro hides over the kernel side kfunc which
      is named bpf_obj_drop_impl.
      
      Continuing the previous example:
      
      void prog(void) {
      	struct foo *f;
      
      	f = bpf_obj_new(typeof(*f));
      	if (!f)
      		return;
      	bpf_obj_drop(f);
      }
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-15-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ac9f0605
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Introduce bpf_obj_new · 958cf2e2
      Kumar Kartikeya Dwivedi authored
      
      
      Introduce type safe memory allocator bpf_obj_new for BPF programs. The
      kernel side kfunc is named bpf_obj_new_impl, as passing hidden arguments
      to kfuncs still requires having them in prototype, unlike BPF helpers
      which always take 5 arguments and have them checked using bpf_func_proto
      in verifier, ignoring unset argument types.
      
      Introduce __ign suffix to ignore a specific kfunc argument during type
      checks, then use this to introduce support for passing type metadata to
      the bpf_obj_new_impl kfunc.
      
      The user passes BTF ID of the type it wants to allocates in program BTF,
      the verifier then rewrites the first argument as the size of this type,
      after performing some sanity checks (to ensure it exists and it is a
      struct type).
      
      The second argument is also fixed up and passed by the verifier. This is
      the btf_struct_meta for the type being allocated. It would be needed
      mostly for the offset array which is required for zero initializing
      special fields while leaving the rest of storage in unitialized state.
      
      It would also be needed in the next patch to perform proper destruction
      of the object's special fields.
      
      Under the hood, bpf_obj_new will call bpf_mem_alloc and bpf_mem_free,
      using the any context BPF memory allocator introduced recently. To this
      end, a global instance of the BPF memory allocator is initialized on
      boot to be used for this purpose. This 'bpf_global_ma' serves all
      allocations for bpf_obj_new. In the future, bpf_obj_new variants will
      allow specifying a custom allocator.
      
      Note that now that bpf_obj_new can be used to allocate objects that can
      be linked to BPF linked list (when future linked list helpers are
      available), we need to also free the elements using bpf_mem_free.
      However, since the draining of elements is done outside the
      bpf_spin_lock, we need to do migrate_disable around the call since
      bpf_list_head_free can be called from map free path where migration is
      enabled. Otherwise, when called from BPF programs migration is already
      disabled.
      
      A convenience macro is included in the bpf_experimental.h header to hide
      over the ugly details of the implementation, leading to user code
      looking similar to a language level extension which allocates and
      constructs fields of a user type.
      
      struct bar {
      	struct bpf_list_node node;
      };
      
      struct foo {
      	struct bpf_spin_lock lock;
      	struct bpf_list_head head __contains(bar, node);
      };
      
      void prog(void) {
      	struct foo *f;
      
      	f = bpf_obj_new(typeof(*f));
      	if (!f)
      		return;
      	...
      }
      
      A key piece of this story is still missing, i.e. the free function,
      which will come in the next patch.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-14-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      958cf2e2
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Support constant scalar arguments for kfuncs · a50388db
      Kumar Kartikeya Dwivedi authored
      
      
      Allow passing known constant scalars as arguments to kfuncs that do not
      represent a size parameter. We use mark_chain_precision for the constant
      scalar argument to mark it precise. This makes the search pruning
      optimization of verifier more conservative for such kfunc calls, and
      each non-distinct argument is considered unequivalent.
      
      We will use this support to then expose a bpf_obj_new function where it
      takes the local type ID of a type in program BTF, and returns a
      PTR_TO_BTF_ID | MEM_ALLOC to the local type, and allows programs to
      allocate their own objects.
      
      Each type ID resolves to a distinct type with a possibly distinct size,
      hence the type ID constant matters in terms of program safety and its
      precision needs to be checked between old and cur states inside regsafe.
      The use of mark_chain_precision enables this.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-13-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a50388db
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Rewrite kfunc argument handling · 00b85860
      Kumar Kartikeya Dwivedi authored
      
      
      As we continue to add more features, argument types, kfunc flags, and
      different extensions to kfuncs, the code to verify the correctness of
      the kfunc prototype wrt the passed in registers has become ad-hoc and
      ugly to read. To make life easier, and make a very clear split between
      different stages of argument processing, move all the code into
      verifier.c and refactor into easier to read helpers and functions.
      
      This also makes sharing code within the verifier easier with kfunc
      argument processing. This will be more and more useful in later patches
      as we are now moving to implement very core BPF helpers as kfuncs, to
      keep them experimental before baking into UAPI.
      
      Remove all kfunc related bits now from btf_check_func_arg_match, as
      users have been converted away to refactored kfunc argument handling.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-12-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      00b85860
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Allow locking bpf_spin_lock in inner map values · b7ff9792
      Kumar Kartikeya Dwivedi authored
      
      
      There is no need to restrict users from locking bpf_spin_lock in map
      values of inner maps. Each inner map lookup gets a unique reg->id
      assigned to the returned PTR_TO_MAP_VALUE which will be preserved after
      the NULL check. Distinct lookups into different inner map get unique
      IDs, and distinct lookups into same inner map also get unique IDs.
      
      Hence, lift the restriction by removing the check return -ENOTSUPP in
      map_in_map.c. Later commits will add comprehensive test cases to ensure
      that invalid cases are rejected.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-11-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      b7ff9792
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Allow locking bpf_spin_lock global variables · d0d78c1d
      Kumar Kartikeya Dwivedi authored
      
      
      Global variables reside in maps accessible using direct_value_addr
      callbacks, so giving each load instruction's rewrite a unique reg->id
      disallows us from holding locks which are global.
      
      The reason for preserving reg->id as a unique value for registers that
      may point to spin lock is that two separate lookups are treated as two
      separate memory regions, and any possible aliasing is ignored for the
      purposes of spin lock correctness.
      
      This is not great especially for the global variable case, which are
      served from maps that have max_entries == 1, i.e. they always lead to
      map values pointing into the same map value.
      
      So refactor the active_spin_lock into a 'active_lock' structure which
      represents the lock identity, and instead of the reg->id, remember two
      fields, a pointer and the reg->id. The pointer will store reg->map_ptr
      or reg->btf. It's only necessary to distinguish for the id == 0 case of
      global variables, but always setting the pointer to a non-NULL value and
      using the pointer to check whether the lock is held simplifies code in
      the verifier.
      
      This is generic enough to allow it for global variables, map lookups,
      and allocated objects at the same time.
      
      Note that while whether a lock is held can be answered by just comparing
      active_lock.ptr to NULL, to determine whether the register is pointing
      to the same held lock requires comparing _both_ ptr and id.
      
      Finally, as a result of this refactoring, pseudo load instructions are
      not given a unique reg->id, as they are doing lookup for the same map
      value (max_entries is never greater than 1).
      
      Essentially, we consider that the tuple of (ptr, id) will always be
      unique for any kind of argument to bpf_spin_{lock,unlock}.
      
      Note that this can be extended in the future to also remember offset
      used for locking, so that we can introduce multiple bpf_spin_lock fields
      in the same allocation.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-10-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d0d78c1d
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Allow locking bpf_spin_lock in allocated objects · 4e814da0
      Kumar Kartikeya Dwivedi authored
      
      
      Allow locking a bpf_spin_lock in an allocated object, in addition to
      already supported map value pointers. The handling is similar to that of
      map values, by just preserving the reg->id of PTR_TO_BTF_ID | MEM_ALLOC
      as well, and adjusting process_spin_lock to work with them and remember
      the id in verifier state.
      
      Refactor the existing process_spin_lock to work with PTR_TO_BTF_ID |
      MEM_ALLOC in addition to PTR_TO_MAP_VALUE. We need to update the
      reg_may_point_to_spin_lock which is used in mark_ptr_or_null_reg to
      preserve reg->id, that will be used in env->cur_state->active_spin_lock
      to remember the currently held spin lock.
      
      Also update the comment describing bpf_spin_lock implementation details
      to also talk about PTR_TO_BTF_ID | MEM_ALLOC type.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-9-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      4e814da0
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Verify ownership relationships for user BTF types · 865ce09a
      Kumar Kartikeya Dwivedi authored
      
      
      Ensure that there can be no ownership cycles among different types by
      way of having owning objects that can hold some other type as their
      element. For instance, a map value can only hold allocated objects, but
      these are allowed to have another bpf_list_head. To prevent unbounded
      recursion while freeing resources, elements of bpf_list_head in local
      kptrs can never have a bpf_list_head which are part of list in a map
      value. Later patches will verify this by having dedicated BTF selftests.
      
      Also, to make runtime destruction easier, once btf_struct_metas is fully
      populated, we can stash the metadata of the value type directly in the
      metadata of the list_head fields, as that allows easier access to the
      value type's layout to destruct it at runtime from the btf_field entry
      of the list head itself.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-8-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      865ce09a
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Recognize lock and list fields in allocated objects · 8ffa5cc1
      Kumar Kartikeya Dwivedi authored
      
      
      Allow specifying bpf_spin_lock, bpf_list_head, bpf_list_node fields in a
      allocated object.
      
      Also update btf_struct_access to reject direct access to these special
      fields.
      
      A bpf_list_head allows implementing map-in-map style use cases, where an
      allocated object with bpf_list_head is linked into a list in a map
      value. This would require embedding a bpf_list_node, support for which
      is also included. The bpf_spin_lock is used to protect the bpf_list_head
      and other data.
      
      While we strictly don't require to hold a bpf_spin_lock while touching
      the bpf_list_head in such objects, as when have access to it, we have
      complete ownership of the object, the locking constraint is still kept
      and may be conditionally lifted in the future.
      
      Note that the specification of such types can be done just like map
      values, e.g.:
      
      struct bar {
      	struct bpf_list_node node;
      };
      
      struct foo {
      	struct bpf_spin_lock lock;
      	struct bpf_list_head head __contains(bar, node);
      	struct bpf_list_node node;
      };
      
      struct map_value {
      	struct bpf_spin_lock lock;
      	struct bpf_list_head head __contains(foo, node);
      };
      
      To recognize such types in user BTF, we build a btf_struct_metas array
      of metadata items corresponding to each BTF ID. This is done once during
      the btf_parse stage to avoid having to do it each time during the
      verification process's requirement to inspect the metadata.
      
      Moreover, the computed metadata needs to be passed to some helpers in
      future patches which requires allocating them and storing them in the
      BTF that is pinned by the program itself, so that valid access can be
      assumed to such data during program runtime.
      
      A key thing to note is that once a btf_struct_meta is available for a
      type, both the btf_record and btf_field_offs should be available. It is
      critical that btf_field_offs is available in case special fields are
      present, as we extensively rely on special fields being zeroed out in
      map values and allocated objects in later patches. The code ensures that
      by bailing out in case of errors and ensuring both are available
      together. If the record is not available, the special fields won't be
      recognized, so not having both is also fine (in terms of being a
      verification error and not a runtime bug).
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-7-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      8ffa5cc1
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Introduce allocated objects support · 282de143
      Kumar Kartikeya Dwivedi authored
      
      
      Introduce support for representing pointers to objects allocated by the
      BPF program, i.e. PTR_TO_BTF_ID that point to a type in program BTF.
      This is indicated by the presence of MEM_ALLOC type flag in reg->type to
      avoid having to check btf_is_kernel when trying to match argument types
      in helpers.
      
      Whenever walking such types, any pointers being walked will always yield
      a SCALAR instead of pointer. In the future we might permit kptr inside
      such allocated objects (either kernel or program allocated), and it will
      then form a PTR_TO_BTF_ID of the respective type.
      
      For now, such allocated objects will always be referenced in verifier
      context, hence ref_obj_id == 0 for them is a bug. It is allowed to write
      to such objects, as long fields that are special are not touched
      (support for which will be added in subsequent patches). Note that once
      such a pointer is marked PTR_UNTRUSTED, it is no longer allowed to write
      to it.
      
      No PROBE_MEM handling is therefore done for loads into this type unless
      PTR_UNTRUSTED is part of the register type, since they can never be in
      an undefined state, and their lifetime will always be valid.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-6-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      282de143
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Populate field_offs for inner_map_meta · f73e601a
      Kumar Kartikeya Dwivedi authored
      
      
      Far too much code simply assumes that both btf_record and btf_field_offs
      are set to valid pointers together, or both are unset. They go together
      hand in hand as btf_record describes the special fields and
      btf_field_offs is compact representation for runtime copying/zeroing.
      
      It is very difficult to make this clear in the code when the only
      exception to this universal invariant is inner_map_meta which is used
      as reg->map_ptr in the verifier. This is simply a bug waiting to happen,
      as in verifier context we cannot easily distinguish if PTR_TO_MAP_VALUE
      is coming from an inner map, and if we ever end up using field_offs for
      any reason in the future, we will silently ignore the special fields for
      inner map case (as NULL is not an error but unset field_offs).
      
      Hence, simply copy field_offs from inner map together with btf_record.
      
      While at it, refactor code to unwind properly on errors with gotos.
      
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-5-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f73e601a
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Free inner_map_meta when btf_record_dup fails · d4899572
      Kumar Kartikeya Dwivedi authored
      Whenever btf_record_dup fails, we must free inner_map_meta that was
      allocated before.
      
      This fixes a memory leak (in case of errors) during inner map creation.
      
      Fixes: aa3496ac
      
       ("bpf: Refactor kptr_off_tab into btf_record")
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-4-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d4899572
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Do btf_record_free outside map_free callback · d7f5ef65
      Kumar Kartikeya Dwivedi authored
      Since the commit being fixed, we now miss freeing btf_record for local
      storage maps which will have a btf_record populated in case they have
      bpf_spin_lock element.
      
      This was missed because I made the choice of offloading the job to free
      kptr_off_tab (now btf_record) to the map_free callback when adding
      support for kptrs.
      
      Revisiting the reason for this decision, there is the possibility that
      the btf_record gets used inside map_free callback (e.g. in case of maps
      embedding kptrs) to iterate over them and free them, hence doing it
      before the map_free callback would be leaking special field memory, and
      do invalid memory access. The btf_record keeps module references which
      is critical to ensure the dtor call made for referenced kptr is safe to
      do.
      
      If doing it after map_free callback, the map area is already freed, so
      we cannot access bpf_map structure anymore.
      
      To fix this and prevent such lapses in future, move bpf_map_free_record
      out of the map_free callback, and do it after map_free by remembering
      the btf_record pointer. There is no need to access bpf_map structure in
      that case, and we can avoid missing this case when support for new map
      types is added for other special fields.
      
      Since a btf_record and its btf_field_offs are used together, for
      consistency delay freeing of field_offs as well. While not a problem
      right now, a lot of code assumes that either both record and field_offs
      are set or none at once.
      
      Note that in case of map of maps (outer maps), inner_map_meta->record is
      only used during verification, not to free fields in map value, hence we
      simply keep the bpf_map_free_record call as is in bpf_map_meta_free and
      never touch map->inner_map_meta in bpf_map_free_deferred.
      
      Add a comment making note of these details.
      
      Fixes: db559117
      
       ("bpf: Consolidate spin_lock, timer management into btf_record")
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-3-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d7f5ef65
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Fix early return in map_check_btf · c237bfa5
      Kumar Kartikeya Dwivedi authored
      Instead of returning directly with -EOPNOTSUPP for the timer case, we
      need to free the btf_record before returning to userspace.
      
      Fixes: db559117
      
       ("bpf: Consolidate spin_lock, timer management into btf_record")
      Reported-by: default avatarDan Carpenter <error27@gmail.com>
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221118015614.2013203-2-memxor@gmail.com
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c237bfa5
    • Björn Töpel's avatar
      selftests/bpf: Pass target triple to get_sys_includes macro · 98b2afc8
      Björn Töpel authored
      
      
      When cross-compiling [1], the get_sys_includes make macro should use
      the target system include path, and not the build hosts system include
      path.
      
      Make clang honor the CROSS_COMPILE triple.
      
      [1] e.g. "ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- make"
      
      Signed-off-by: default avatarBjörn Töpel <bjorn@rivosinc.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Tested-by: default avatarAnders Roxell <anders.roxell@linaro.org>
      Link: https://lore.kernel.org/bpf/20221115182051.582962-2-bjorn@kernel.org
      98b2afc8
    • Björn Töpel's avatar
      selftests/bpf: Explicitly pass RESOLVE_BTFIDS to sub-make · c4525f05
      Björn Töpel authored
      
      
      When cross-compiling selftests/bpf, the resolve_btfids binary end up
      in a different directory, than the regular resolve_btfids
      builds. Populate RESOLVE_BTFIDS for sub-make, so it can find the
      binary.
      
      Signed-off-by: default avatarBjörn Töpel <bjorn@rivosinc.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarJiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/bpf/20221115182051.582962-1-bjorn@kernel.org
      c4525f05
    • Hou Tao's avatar
      bpf: Pass map file to .map_update_batch directly · 3af43ba4
      Hou Tao authored
      
      
      Currently bpf_map_do_batch() first invokes fdget(batch.map_fd) to get
      the target map file, then it invokes generic_map_update_batch() to do
      batch update. generic_map_update_batch() will get the target map file
      by using fdget(batch.map_fd) again and pass it to bpf_map_update_value().
      
      The problem is map file returned by the second fdget() may be NULL or a
      totally different file compared by map file in bpf_map_do_batch(). The
      reason is that the first fdget() only guarantees the liveness of struct
      file instead of file descriptor and the file description may be released
      by concurrent close() through pick_file().
      
      It doesn't incur any problem as for now, because maps with batch update
      support don't use map file in .map_fd_get_ptr() ops. But it is better to
      fix the potential access of an invalid map file.
      
      Using __bpf_map_get() again in generic_map_update_batch() can not fix
      the problem, because batch.map_fd may be closed and reopened, and the
      returned map file may be different with map file got in
      bpf_map_do_batch(), so just passing the map file directly to
      .map_update_batch() in bpf_map_do_batch().
      
      Signed-off-by: default avatarHou Tao <houtao1@huawei.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20221116075059.1551277-1-houtao@huaweicloud.com
      3af43ba4