  1. Dec 07, 2023
    • bpf: rename MAX_BPF_LINK_TYPE into __MAX_BPF_LINK_TYPE for consistency · 7065eefb
      Andrii Nakryiko authored
      
      
      To stay consistent with the naming pattern used for similar cases in BPF
      UAPI (__MAX_BPF_ATTACH_TYPE, etc), rename MAX_BPF_LINK_TYPE into
      __MAX_BPF_LINK_TYPE.
      
      Also similar to MAX_BPF_ATTACH_TYPE and MAX_BPF_REG, add:
      
        #define MAX_BPF_LINK_TYPE __MAX_BPF_LINK_TYPE
      
      Not all __MAX_xxx enums have such #define, so I'm not sure if we should
      add it or not, but I figured I'll start with a completely backwards
      compatible way, and we can drop that, if necessary.
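      For reference, the resulting shape in the UAPI header would look roughly
      like the sketch below (only the first enumerator is shown; the rest are
      elided):

        enum bpf_link_type {
                BPF_LINK_TYPE_UNSPEC = 0,
                /* ... */
                __MAX_BPF_LINK_TYPE,
        };
        #define MAX_BPF_LINK_TYPE __MAX_BPF_LINK_TYPE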
      
      Also adjust a selftest that used MAX_BPF_LINK_TYPE enum.
      
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20231206190920.1651226-1-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpf-token-and-bpf-fs-based-delegation' · c35919dc
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      BPF token and BPF FS-based delegation
      
      This patch set introduces an ability to delegate a subset of BPF subsystem
      functionality from privileged system-wide daemon (e.g., systemd or any other
      container manager) through special mount options for userns-bound BPF FS to
      a *trusted* unprivileged application. Trust is the key here. This
      functionality is not about allowing unconditional unprivileged BPF usage.
      Establishing trust, though, is completely up to the discretion of respective
      privileged application that would create and mount a BPF FS instance with
      delegation enabled, as different production setups can and do achieve it
      through a combination of different means (signing, LSM, code reviews, etc),
      and it's undesirable and infeasible for kernel to enforce any particular way
      of validating trustworthiness of particular process.
      
      The main motivation for this work is a desire to enable containerized BPF
      applications to be used together with user namespaces. This is currently
      impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
      or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
      helpers like bpf_probe_read_kernel() and bpf_probe_read_user(), can safely
      read arbitrary memory, and it's impossible to ensure that they only read
      memory of processes belonging to any given namespace. This means that it's
      impossible to have a mechanically verifiable namespace-aware CAP_BPF
      capability, and as such another mechanism to allow safe usage of BPF
      functionality is necessary. BPF FS delegation mount options and a BPF token
      derived from such a BPF FS instance are such a mechanism. The kernel makes
      no assumption about what "trusted" constitutes in any particular case, and
      it's up to specific privileged applications and their surrounding
      infrastructure to decide that. What the kernel provides is a set of APIs to
      set up and mount special BPF FS instances and derive BPF tokens from them.
      BPF FS and BPF token are both bound to their owning userns and in this way
      are constrained inside the intended container. Users can then pass the BPF
      token FD to privileged bpf() syscall commands, like BPF map creation and
      BPF program loading, to perform such operations without holding init userns
      privileges.
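      As a rough, hedged sketch of this flow (not the exact UAPI of this series;
      attribute field names like token_create.bpffs_fd and map_token_fd are
      assumptions based on the description above, and BPF_TOKEN_CREATE requires
      headers that define it), an application inside the container could do
      something along these lines:

        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
        {
                return syscall(__NR_bpf, cmd, attr, size);
        }

        /* derive a token from a delegated BPF FS mount and use it for map creation */
        int create_map_with_token(const char *bpffs_path)
        {
                union bpf_attr attr;
                int bpffs_fd, token_fd, map_fd;

                bpffs_fd = open(bpffs_path, O_RDONLY);
                if (bpffs_fd < 0)
                        return -1;

                memset(&attr, 0, sizeof(attr));
                attr.token_create.bpffs_fd = bpffs_fd;      /* assumed attr layout */
                token_fd = sys_bpf(BPF_TOKEN_CREATE, &attr, sizeof(attr));
                close(bpffs_fd);
                if (token_fd < 0)
                        return -1;

                memset(&attr, 0, sizeof(attr));
                attr.map_type = BPF_MAP_TYPE_ARRAY;
                attr.key_size = 4;
                attr.value_size = 8;
                attr.max_entries = 1;
                attr.map_token_fd = token_fd;               /* assumed attr field */
                map_fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
                close(token_fd);
                return map_fd;
        }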
      
      This version incorporates feedback and suggestions ([3]) received on v3 of
      this patch set: instead of allowing BPF tokens to be created directly
      assuming capable(CAP_SYS_ADMIN), we enhance BPF FS to accept a few new
      delegation mount options. If these options are used and BPF FS itself is
      properly created, set up, and mounted inside the user namespaced container,
      the user application is able to derive a BPF token object from the BPF FS
      instance and pass that token to the bpf() syscall. As explained in patch #3,
      the BPF token itself doesn't grant access to BPF functionality, but instead
      allows the kernel to do namespaced capability checks (ns_capable() vs
      capable()) for CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, and CAP_SYS_ADMIN, as
      applicable. So it forms one half of a puzzle and allows container managers
      and sysadmins to have safe and flexible configuration options: determining
      which containers get delegation of BPF functionality through BPF FS, and
      then which applications within such containers are allowed to perform bpf()
      commands, based on namespaced capabilities.
      
      A previous attempt at addressing this very same problem ([0]) used an
      authoritative LSM approach, but was conclusively rejected by upstream
      LSM maintainers. The BPF token concept doesn't change anything about the
      LSM approach, but can be combined with LSM hooks for very fine-grained
      security policy. Some ideas about making BPF token more convenient to use
      with LSM (in particular custom BPF LSM programs) were briefly described in
      the recent LSF/MM/BPF 2023 presentation ([1]): e.g., an ability to specify
      user-provided data (context), which in combination with BPF LSM would allow
      implementing very dynamic and fine-grained custom security policies on top
      of BPF token. In the interest of minimizing API surface area and discussions
      this was relegated to follow-up patches, as it's not essential to the
      fundamental concept of a delegatable BPF token.
      
      It should be noted that BPF token is conceptually quite similar to the idea of
      /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
      difference is the idea of using virtual anon_inode file to hold BPF token and
      allowing multiple independent instances of them, each (potentially) with its
      own set of restrictions. And also, crucially, BPF token approach is not using
      any special stateful task-scoped flags. Instead, bpf() syscall accepts
      token_fd parameters explicitly for each relevant BPF command. This addresses
      main concerns brought up during the /dev/bpf discussion, and fits better with
      overall BPF subsystem design.
      
      This patch set adds a basic minimum of functionality to make the BPF token
      idea useful and to discuss API and functionality. Currently only low-level
      libbpf APIs support creating and passing BPF token around, which allows
      testing kernel functionality, but for the most part is not sufficient for
      real-world applications, which typically use high-level libbpf APIs based
      on the `struct bpf_object` type. This was done with the intent to limit the
      size of the patch set and concentrate mostly on kernel-side changes. All
      the necessary plumbing for libbpf will be sent as a separate follow-up
      patch set once kernel support makes it upstream.
      
      Another part that should happen once kernel-side BPF token is established is
      a set of conventions between applications (e.g., systemd), tools (e.g.,
      bpftool), and libraries (e.g., libbpf) on exposing delegatable BPF FS
      instance(s) at well-defined locations to allow applications to take
      advantage of this in an automatic fashion without explicit code changes on
      the BPF application's side. But I'd like to postpone this discussion until
      after the BPF token concept lands.
      
        [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
        [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
        [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
        [3] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
      
      v11->v12:
        - enforce exact userns match in bpf_token_capable() and
          bpf_token_allow_cmd() checks, for added strictness (Christian);
      v10->v11:
        - fix BPF FS root check to disallow using bind-mounted subdirectory of BPF
          FS instance (Christian);
        - further restrict BPF_TOKEN_CREATE command to be executed from inside
          exactly the same user namespace as the one used to create BPF FS instance
          (Christian);
      v9->v10:
        - slight adjustments in LSM parts (Paul);
        - setting delegate_xxx options requires capable(CAP_SYS_ADMIN) (Christian);
        - simplify BPF_TOKEN_CREATE UAPI by accepting BPF FS FD directly (Christian);
      v8->v9:
        - fix issue in selftests due to sys/mount.h header (Jiri);
        - fix warning in doc comments in LSM hooks (kernel test robot);
      v7->v8:
        - add bpf_token_allow_cmd and bpf_token_capable hooks (Paul);
        - inline bpf_token_alloc() into bpf_token_create() to prevent accidental
          divergence with security_bpf_token_create() hook (Paul);
      v6->v7:
        - separate patches to refactor bpf_prog_alloc/bpf_map_alloc LSM hooks, as
          discussed with Paul, and now they also accept struct bpf_token;
        - added bpf_token_create/bpf_token_free to allow LSMs (SELinux,
          specifically) to set up security LSM blob (Paul);
        - last patch also wires bpf_security_struct setup by SELinux, similar to how
          it's done for BPF map/prog, though I'm not sure if that's enough, so worst
          case it's easy to drop this patch if more full fledged SELinux
          implementation will be done separately;
        - small fixes for issues caught by code reviews (Jiri, Hou);
        - fix for test_maps test that doesn't use LIBBPF_OPTS() macro (CI);
      v5->v6:
        - fix possible use of uninitialized variable in selftests (CI);
        - don't use anon_inode, instead create one from BPF FS instance (Christian);
        - don't store bpf_token inside struct bpf_map, instead pass it explicitly to
          map_check_btf(). We do store bpf_token inside prog->aux, because it's used
          during verification and even can be checked during attach time for some
          program types;
        - LSM hooks are left intact pending the conclusion of discussion with Paul
          Moore; I'd prefer to do LSM-related changes as a follow up patch set
          anyways;
      v4->v5:
        - add pre-patch unifying CAP_NET_ADMIN handling inside kernel/bpf/syscall.c
          (Paul Moore);
        - fix build warnings and errors in selftests and kernel, detected by CI and
          kernel test robot;
      v3->v4:
        - add delegation mount options to BPF FS;
        - BPF token is derived from the instance of BPF FS and associates itself
          with BPF FS' owning userns;
        - BPF token doesn't grant BPF functionality directly, it just turns
          capable() checks into ns_capable() checks within BPF FS' owning userns;
        - BPF token cannot be pinned;
      v2->v3:
        - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
          BPF_OBJ_PIN for BPF token;
      v1->v2:
        - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
        - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
      ====================
      
      Link: https://lore.kernel.org/r/20231130185229.2688956-1-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf,selinux: allocate bpf_security_struct per BPF token · 36fb9494
      Andrii Nakryiko authored
      
      
      Utilize newly added bpf_token_create/bpf_token_free LSM hooks to
      allocate struct bpf_security_struct for each BPF token object in
      SELinux. This just follows similar pattern for BPF prog and map.
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-18-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add BPF token-enabled tests · dc5196fa
      Andrii Nakryiko authored
      Add a selftest that attempts to conceptually replicate intended BPF
      token use cases inside a user namespaced container.

      A child process is forked and put into its own userns and mountns. The
      child creates a BPF FS context object, which ensures the child userns is
      captured as the owning userns for this instance of BPF FS. Given that
      setting delegation mount options is a privileged operation, we verify
      that the child cannot set them.

      This context object is passed back to the privileged parent process
      through a Unix socket; the parent sets up delegation options, then
      creates and mounts it as a detached mount. This mount FD is passed back
      to the child to be used for BPF token creation, which allows otherwise
      privileged BPF operations to succeed inside the userns.
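      A minimal sketch of the parent-side setup under the workflow above
      (function and variable names are illustrative; it assumes a glibc new
      enough to expose fsopen()/fsconfig()/fsmount() in <sys/mount.h>):

        #include <sys/mount.h>

        /* fs_fd was created by the child via fsopen("bpf", 0) inside its own
         * userns and passed back over a Unix socket (SCM_RIGHTS, not shown) */
        int setup_delegated_bpffs(int fs_fd)
        {
                /* setting delegate_* options requires a privileged caller */
                fsconfig(fs_fd, FSCONFIG_SET_STRING, "delegate_cmds", "any", 0);
                fsconfig(fs_fd, FSCONFIG_SET_STRING, "delegate_maps", "any", 0);
                fsconfig(fs_fd, FSCONFIG_SET_STRING, "delegate_progs", "any", 0);
                fsconfig(fs_fd, FSCONFIG_SET_STRING, "delegate_attachs", "any", 0);
                fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

                /* detached mount FD, passed back to the child for BPF_TOKEN_CREATE */
                return fsmount(fs_fd, 0, 0);
        }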
      
      We validate that all token-enabled privileged commands (BPF_BTF_LOAD,
      BPF_MAP_CREATE, and BPF_PROG_LOAD) work as intended. They should only
      succeed inside the userns if a) a BPF token is provided with the proper
      allowed sets of commands and types; and b) namespaced CAP_BPF and other
      privileges are set. Lacking a) or b) should lead to -EPERM failures.
      
      Based on suggested workflow by Christian Brauner ([0]).
      
        [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
      
      
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-17-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • 1571740a
    • libbpf: add BPF token support to bpf_btf_load() API · 1a8df7fa
      Andrii Nakryiko authored
      
      
      Allow user to specify token_fd for bpf_btf_load() API that wraps
      kernel's BPF_BTF_LOAD command. This allows loading BTF from unprivileged
      process as long as it has BPF token allowing BPF_BTF_LOAD command, which
      can be created and delegated by privileged process.
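      A hedged usage sketch, assuming bpf_btf_load_opts gains the token_fd field
      this patch describes:

        #include <bpf/bpf.h>

        int load_btf_with_token(const void *btf_data, size_t btf_size, int token_fd)
        {
                LIBBPF_OPTS(bpf_btf_load_opts, opts, .token_fd = token_fd);

                return bpf_btf_load(btf_data, btf_size, &opts);
        }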
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-15-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: add BPF token support to bpf_map_create() API · 37891cea
      Andrii Nakryiko authored
      
      
      Add ability to provide token_fd for BPF_MAP_CREATE command through
      bpf_map_create() API.
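      A hedged usage sketch, assuming bpf_map_create_opts gains the token_fd
      field this patch describes:

        #include <bpf/bpf.h>

        int create_array_map_with_token(int token_fd)
        {
                LIBBPF_OPTS(bpf_map_create_opts, opts, .token_fd = token_fd);

                return bpf_map_create(BPF_MAP_TYPE_ARRAY, "token_arr",
                                      sizeof(int), sizeof(long), 1, &opts);
        }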
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-14-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: add bpf_token_create() API · ecd43514
      Andrii Nakryiko authored
      
      
      Add low-level wrapper API for BPF_TOKEN_CREATE command in bpf() syscall.
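      A hedged usage sketch (the opts argument is assumed from this description;
      passing NULL opts is taken to mean defaults):

        #include <fcntl.h>
        #include <unistd.h>
        #include <bpf/bpf.h>

        int get_bpf_token(const char *bpffs_path)
        {
                int bpffs_fd, token_fd;

                bpffs_fd = open(bpffs_path, O_RDONLY);
                if (bpffs_fd < 0)
                        return -1;

                token_fd = bpf_token_create(bpffs_fd, NULL);
                close(bpffs_fd);
                return token_fd;
        }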
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-13-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf,lsm: add BPF token LSM hooks · d734ca7b
      Andrii Nakryiko authored
      
      
      Wire up bpf_token_create and bpf_token_free LSM hooks, which make it
      possible to allocate an LSM security blob (we add a `void *security`
      field to struct bpf_token for that), but also to control who can
      instantiate a BPF token. This follows the existing pattern for BPF map
      and BPF prog.
      
      Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
      LSM hooks that allow LSM implementation to control and negate (if
      necessary) BPF token's delegation of a specific bpf_cmd and capability,
      respectively.
      
      Acked-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-12-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf,lsm: refactor bpf_map_alloc/bpf_map_free LSM hooks · 66d636d7
      Andrii Nakryiko authored
      
      
      Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc
      hook into bpf_map_create, taking not just struct bpf_map, but also
      bpf_attr and bpf_token, to give a fuller context to LSMs.
      
      Unlike bpf_prog_alloc, there is no need to move the hook around, as it
      currently is firing right before allocating BPF map ID and FD, which
      seems to be a sweet spot.
      
      But like the bpf_prog_alloc/bpf_prog_free combo, make sure that the
      bpf_map_free LSM hook is called even if the bpf_map_create hook returned
      an error: if a few LSMs are combined together, it could be that one LSM
      successfully allocated a security blob for its needs, while a subsequent
      LSM rejected BPF map creation. The former LSM would still need to free
      up its LSM blob, so we need to ensure security_bpf_map_free() is called
      regardless of the outcome.
      
      Acked-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-11-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf,lsm: refactor bpf_prog_alloc/bpf_prog_free LSM hooks · c3dd6e94
      Andrii Nakryiko authored
      Based on upstream discussion ([0]), rework existing
      bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
      of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
      program struct. Also, we pass bpf_attr union with all the user-provided
      arguments for BPF_PROG_LOAD command.  This will give LSMs as much
      information as we can basically provide.
      
      The hook is also BPF token-aware now, and optional bpf_token struct is
      passed as a third argument. bpf_prog_load LSM hook is called after
      a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
      allocated and filled out, but right before performing full-fledged BPF
      verification step.
      
      bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
      consistency. SELinux code is adjusted to all new names, types, and
      signatures.
      
      Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
      used by some LSMs to allocate extra security blob, but also by other
      LSMs to reject BPF program loading, we need to make sure that
      bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
      *even* if the hook itself returned error. If we don't do that, we run
      the risk of leaking memory. This seems to be possible today when
      combining SELinux and BPF LSM, as one example, depending on their
      relative ordering.
      
      Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to the
      sleepable LSM hooks list, as they are both executed in sleepable
      context. Also drop the bpf_prog_load hook from the untrusted list, as
      there is no longer any issue with refcounting or anything else that
      originally forced us to add it to the untrusted list in c0c852dd ("bpf:
      Do not mark certain LSM hook arguments as trusted"). We now trigger this
      hook much later and it should not be an issue anymore.
      
        [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/
      
      
      
      Acked-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-10-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: consistently use BPF token throughout BPF verifier logic · 8062fb12
      Andrii Nakryiko authored
      
      
      Remove remaining direct queries to perfmon_capable() and bpf_capable()
      in BPF verifier logic and instead use BPF token (if available) to make
      decisions about privileges.
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-9-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: take into account BPF token when fetching helper protos · 4cbb270e
      Andrii Nakryiko authored
      
      
      Instead of performing unconditional system-wide bpf_capable() and
      perfmon_capable() calls inside bpf_base_func_proto() function (and other
      similar ones) to determine eligibility of a given BPF helper for a given
      program, use previously recorded BPF token during BPF_PROG_LOAD command
      handling to inform the decision.
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-8-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add BPF token support to BPF_PROG_LOAD command · e1cef620
      Andrii Nakryiko authored
      
      
      Add basic support of BPF token to BPF_PROG_LOAD. Wire through a set of
      allowed BPF program types and attach types, derived from BPF FS at BPF
      token creation time. Then make sure we perform bpf_token_capable()
      checks everywhere where it's relevant.
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-7-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add BPF token support to BPF_BTF_LOAD command · ee54b1a9
      Andrii Nakryiko authored
      
      
      Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
      through delegated BPF token. BTF loading is a pretty straightforward
      operation, so as long as BPF token is created with allow_cmds granting
      BPF_BTF_LOAD command, kernel proceeds to parsing BTF data and creating
      BTF object.
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-6-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add BPF token support to BPF_MAP_CREATE command · 688b7270
      Andrii Nakryiko authored
      
      
      Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
      BPF map creation from unprivileged process through delegated BPF token.
      
      Wire through a set of allowed BPF map types to BPF token, derived from
      BPF FS at BPF token creation time. This, in combination with allowed_cmds,
      allows creating a narrowly focused BPF token (controlled by a privileged
      agent) with a restricted set of BPF map types that the application can
      attempt to create.
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-5-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: introduce BPF token object · 4527358b
      Andrii Nakryiko authored
      
      
      Add new kind of BPF kernel object, BPF token. BPF token is meant to
      allow delegating privileged BPF functionality, like loading a BPF
      program or creating a BPF map, from privileged process to a *trusted*
      unprivileged process, all while having a good amount of control over which
      privileged operations could be performed using provided BPF token.
      
      This is achieved through mounting BPF FS instance with extra delegation
      mount options, which determine what operations are delegatable, and also
      constraining it to the owning user namespace (as mentioned in the
      previous patch).
      
      BPF token itself is just a derivative from BPF FS and can be created
      through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
      FS FD, which can be attained through open() API by opening BPF FS mount
      point. Currently, BPF token "inherits" the delegated command, map type,
      prog type, and attach type bit sets from BPF FS as is. In the future,
      with a BPF token being a separate object with its own FD, we can allow
      further restricting a BPF token's allowable set of things either at
      creation time or after the fact, allowing the process to guard itself
      further from unintentionally trying to load undesired kinds of BPF
      programs. But for now we keep things simple and just copy the bit sets as is.
      
      When BPF token is created from BPF FS mount, we take reference to the
      BPF super block's owning user namespace, and then use that namespace for
      checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
      capabilities that are normally only checked against init userns (using
      capable()), but now we check them using ns_capable() instead (if BPF
      token is provided). See bpf_token_capable() for details.
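      A conceptual sketch of that check (not the exact kernel code; the struct
      layout and fallback behavior are inferred from the description in this
      series):

        #include <linux/capability.h>
        #include <linux/user_namespace.h>
        #include <linux/bpf.h>

        static bool token_capable_sketch(const struct bpf_token *token, int cap)
        {
                struct user_namespace *userns;

                /* with a token, check capabilities against its owning userns;
                 * without one, fall back to the usual init-userns semantics */
                userns = token ? token->userns : &init_user_ns;
                if (ns_capable(userns, cap))
                        return true;
                /* CAP_SYS_ADMIN still acts as a catch-all, mirroring capable() */
                return cap != CAP_SYS_ADMIN && ns_capable(userns, CAP_SYS_ADMIN);
        }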
      
      Such setup means that BPF token in itself is not sufficient to grant BPF
      functionality. User namespaced process has to *also* have necessary
      combination of capabilities inside that user namespace. So while
      previously CAP_BPF was useless when granted within user namespace, now
      it gains a meaning and allows container managers and sys admins to have
      a flexible control over which processes can and need to use BPF
      functionality within the user namespace (i.e., container in practice).
      And BPF FS delegation mount options and derived BPF tokens serve as
      a per-container "flag" to grant the overall ability to use bpf() (plus
      further restrict which parts of the bpf() syscall are treated as namespaced).
      
      Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
      within the BPF FS owning user namespace, rounding up the ns_capable()
      story of BPF token.
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add BPF token delegation mount options to BPF FS · 40bba140
      Andrii Nakryiko authored
      Add a few new mount options to BPF FS that allow specifying that a given
      BPF FS instance allows creation of a BPF token (added in the next patch),
      and what sort of operations are allowed under the BPF token. As such, we
      get 4 new mount options, each a bit mask:
        - `delegate_cmds` allows specifying which bpf() syscall commands are
          allowed with a BPF token derived from this BPF FS instance;
        - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
          a set of allowable BPF map types that could be created with BPF token;
        - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
          a set of allowable BPF program types that could be loaded with BPF token;
        - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
          a set of allowable BPF program attach types that could be loaded with
          BPF token; delegate_progs and delegate_attachs are meant to be used
          together, as full BPF program type is, in general, determined
          through both program type and program attach type.
      
      Currently, these mount options accept the following forms of values:
        - a special value "any", that enables all possible values of a given
        bit set;
        - numeric value (decimal or hexadecimal, determined by kernel
        automatically) that specifies a bit mask value directly;
        - all the values for a given mount option are combined, if specified
        multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
        delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
        mask.
      
      Ideally, more convenient (for humans) symbolic form derived from
      corresponding UAPI enums would be accepted (e.g., `-o
      delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
      it requires a bunch of UAPI header churn, so I postponed it until this
      feature lands upstream or at least there is a definite consensus that
      this feature is acceptable and is going to make it, just to minimize
      amount of wasted effort and not increase amount of non-essential code to
      be reviewed.
      
      An attentive reader will notice that BPF FS is now marked as
      FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
      user namespace as long as the process has sufficient *namespaced*
      capabilities within that user namespace. But in reality we still
      restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
      init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
      to allow creating BPF FS context object (i.e., fsopen("bpf")) from
      inside unprivileged process inside non-init userns, to capture that
      userns as the owning userns. It will still be required to pass this
      context object back to privileged process to instantiate and mount it.
      
      This manipulation is important, because capturing a non-init userns as the
      owning userns of a BPF FS instance (super block) allows using that userns
      to constrain the BPF token to that userns later on (see next patch). So
      creating BPF FS with delegation inside an unprivileged userns will restrict
      derived BPF token objects to only "work" inside that intended userns,
      making it scoped to an intended "container". Also, setting these
      delegation options requires capable(CAP_SYS_ADMIN), so an unprivileged
      process cannot set this up without the involvement of a privileged process.
      
      There is a set of selftests at the end of the patch set that simulates
      this sequence of steps and validates that everything works as intended.
      But careful review is requested to make sure there are no missed gaps in
      the implementation and testing.
      
      This somewhat subtle set of aspects is the result of previous
      discussions ([0]) about various user namespace implications and
      interactions with BPF token functionality and is necessary to contain
      BPF token inside intended user namespace.
      
        [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
      
      
      
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-3-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: align CAP_NET_ADMIN checks with bpf_capable() approach · 909fa05d
      Andrii Nakryiko authored
      
      
      Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
      compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
      CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
      takes over and grants whatever part of BPF syscall is required.
      
      Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
      One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
      BPF_PROG_LOAD, if the type of BPF program is "network-related" either
      CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.
      
      But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
      is set:
        - when creating DEVMAP/XSKMAP/CPUMAP maps;
        - when attaching CGROUP_SKB programs;
        - when handling BPF_PROG_QUERY command.
      
      This patch is changing the latter three cases to follow BPF_PROG_LOAD
      model, that is allowing to proceed under either CAP_NET_ADMIN or
      CAP_SYS_ADMIN.
      
      This also makes it cleaner in subsequent BPF token patches to switch
      wholesale to a generic bpf_token_capable(int cap) check that always
      falls back to CAP_SYS_ADMIN if the requested capability is missing.
      
      Cc: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-2-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. Dec 06, 2023
    • Merge branch 'complete-bpf-verifier-precision-tracking-support-for-register-spills' · 3aee2bf9
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      Complete BPF verifier precision tracking support for register spills
      
      Add support to the BPF verifier to track and support register spill/fill
      to/from the stack regardless of whether it was done through the read-only
      R10 register (which is the only form supported today) or through a general
      register after copying R10 into it, while also potentially modifying the
      offset.
      
      Once we add this generic register spill/fill support to precision
      backtracking, we can take advantage of it to stop doing eager STACK_ZERO
      conversion on register spill. Instead we can rely on (im)precision of spilled
      const zero register to improve verifier state pruning efficiency. This
      situation of using const zero register to initialize stack slots is very
      common with __builtin_memset() usage or just zero-initializing variables on
      the stack, and it causes unnecessary state duplication, as that STACK_ZERO
      knowledge is often not necessary for correctness, as those zero values are
      never used in precise context. Thus, relying on register imprecision helps
      tremendously, especially in real-world BPF programs.
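      For instance, a stack-zeroing pattern like the illustrative one below
      (names are made up) ends up spilling known-zero registers into stack
      slots:

        struct stats { long pkts; long bytes; long flags; };

        static int handle_event(void)
        {
                struct stats s;

                /* typically compiled into spills of a known-zero register */
                __builtin_memset(&s, 0, sizeof(s));
                /* fields are usually overwritten before any precision-sensitive use */
                s.pkts = 1;
                return 0;
        }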
      
      To make a spilled const zero register behave completely equivalently to
      STACK_ZERO, we need to improve a few other small pieces, which is done in
      the second part of the patch set. See individual patches for details. There are
      also two small bug fixes spotted during STACK_ZERO debugging.
      
      The patch set consists of logically three changes:
        - patch #1 (and corresponding tests in patch #2) is fixing/improving precision
          propagation for stack spills/fills. This can be landed as a stand-alone
          improvement;
        - patches #3 through #9 are improving verification scalability by utilizing
          register (im)precision instead of eager STACK_ZERO. These changes depend
          on patch #1.
        - patch #10 is a memory efficiency improvement to how instruction/jump
          history is tracked and maintained. It depends on patch #1, but is not
          strictly speaking required, even though I believe it's a good long-term
          solution to have path-dependent per-instruction information. Kind
          of like a path-dependent counterpart to the path-agnostic insn_aux array.
      
      v3->v4:
        - fixed up Fixes tag (Alexei);
        - fixed few more selftests to not use BPF_ST instruction in inline asm
          directly, checked with CI, it was happy (CI);
      v2->v3:
        - BPF_ST instruction workaround (Eduard);
        - force dereference in added tests to catch problems (Eduard);
        - some commit message massaging (Alexei);
      v1->v2:
        - clean ups, WARN_ONCE(), insn_flags helpers added (Eduard);
        - added more selftests for STACK_ZERO/STACK_MISC cases (Eduard);
        - a bit more detailed explanation of effect of avoiding STACK_ZERO in favor
          of register spill in patch #8 commit (Alexei);
        - global shared instruction history refactoring moved to be the last patch
          in the series to make it easier to revert it, if applied (Alexei).
      ====================
      
      Link: https://lore.kernel.org/r/20231205184248.1502704-1-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: validate precision logic in partial_stack_load_preserves_zeros · 064e0bea
      Andrii Nakryiko authored
      
      
      Enhance partial_stack_load_preserves_zeros subtest with detailed
      precision propagation log checks. We now expect fp-16 to be a spilled,
      initially imprecise, zero const register, which is later marked as
      precise even when a partial stack slot load is performed, even if it's
      not a register fill (!).
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-10-andrii@kernel.org
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: track aligned STACK_ZERO cases as imprecise spilled registers · 18a433b6
      Andrii Nakryiko authored
      
      
      Now that precision backtracking supports register spill/fill to/from
      the stack, there is another opportunity to be exploited here: minimizing
      precise STACK_ZERO cases. With a simple code change we can rely on
      initially imprecise register spill tracking for cases when register
      spilled to stack was a known zero.
      
      This is a very common case for initializing variables on the stack,
      including rather large structures. Oftentimes zero has no special
      meaning for the subsequent BPF program logic and is often overwritten
      with non-zero values soon afterwards. But due to STACK_ZERO vs
      STACK_MISC tracking, such initial zero initialization actually causes
      duplication of verifier states as STACK_ZERO is clearly different than
      STACK_MISC or spilled SCALAR_VALUE register.
      
      The effect of this (now) trivial change is huge, as can be seen below.
      These are differences between BPF selftests, Cilium, and Meta-internal
      BPF object files relative to previous patch in this series. You can see
      improvements ranging from single-digit percentage improvement for
      instructions and states, all the way to 50-60% reduction for some of
      Meta-internal host agent programs, and even some Cilium programs.
      
      For Meta-internal ones I left only the differences for the largest BPF
      object files by states/instructions, as there were too many differences
      in the overall output. All the differences were improvements, reducing
      the number of states and thus instructions validated.
      
      Note, Meta-internal BPF object file names are not printed below.
      Many copies of balancer_ingress are actually many different
      configurations of Katran, so they are different BPF programs, which
      explains state reduction going from -16% all the way to 31%, depending
      on BPF program logic complexity.
      
      I also took a closer look at a few small-ish BPF programs to validate
      the behavior. Let's take bpf_iter_netlink.bpf.o (first row below).
      While it's just 8 vs 5 states, the verifier log is still too long to
      include it here. But the reduction in states is due to the following
      piece of C code:
      
              unsigned long ino;

              ...

              sk = s->sk_socket;
              if (!sk) {
                      ino = 0;
              } else {
                      inode = SOCK_INODE(sk);
                      bpf_probe_read_kernel(&ino, sizeof(ino), &inode->i_ino);
              }
              BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n", s->sk_drops.counter, ino);
              return 0;
      
      You can see that in some situations `ino` is zero-initialized, while in
      others it's an unknown value filled out by bpf_probe_read_kernel(). Before
      this change, code after the if/else branches had to be validated twice:
      once with (precise) ino == 0, due to eager STACK_ZERO logic, and then
      again for when ino is just STACK_MISC. But BPF_SEQ_PRINTF() doesn't care
      about the precise value of ino, so with the change in this patch the
      verifier is able to prune states from after one of the branches, reducing
      the total number of states (and instructions) required for successful
      validation.
      
      Similar principle applies to bigger real-world applications, just at
      a much larger scale.
      
      SELFTESTS
      =========
      File                                     Program                  Insns (A)  Insns (B)  Insns    (DIFF)  States (A)  States (B)  States (DIFF)
      ---------------------------------------  -----------------------  ---------  ---------  ---------------  ----------  ----------  -------------
      bpf_iter_netlink.bpf.linked3.o           dump_netlink                   148        104    -44 (-29.73%)           8           5   -3 (-37.50%)
      bpf_iter_unix.bpf.linked3.o              dump_unix                     8474       8404     -70 (-0.83%)         151         147    -4 (-2.65%)
      bpf_loop.bpf.linked3.o                   stack_check                    560        324   -236 (-42.14%)          42          24  -18 (-42.86%)
      local_storage_bench.bpf.linked3.o        get_local                      120         77    -43 (-35.83%)           9           6   -3 (-33.33%)
      loop6.bpf.linked3.o                      trace_virtqueue_add_sgs      10167       9868    -299 (-2.94%)         226         206   -20 (-8.85%)
      pyperf600_bpf_loop.bpf.linked3.o         on_event                      4872       3423  -1449 (-29.74%)         322         229  -93 (-28.88%)
      strobemeta.bpf.linked3.o                 on_event                    180697     176036   -4661 (-2.58%)        4780        4734   -46 (-0.96%)
      test_cls_redirect.bpf.linked3.o          cls_redirect                 65594      65401    -193 (-0.29%)        4230        4212   -18 (-0.43%)
      test_global_func_args.bpf.linked3.o      test_cls                       145        136      -9 (-6.21%)          10           9   -1 (-10.00%)
      test_l4lb.bpf.linked3.o                  balancer_ingress              4760       2612  -2148 (-45.13%)         113         102   -11 (-9.73%)
      test_l4lb_noinline.bpf.linked3.o         balancer_ingress              4845       4877     +32 (+0.66%)         219         221    +2 (+0.91%)
      test_l4lb_noinline_dynptr.bpf.linked3.o  balancer_ingress              2072       2087     +15 (+0.72%)          97          98    +1 (+1.03%)
      test_seg6_loop.bpf.linked3.o             __add_egr_x                  12440       9975  -2465 (-19.82%)         364         353   -11 (-3.02%)
      test_tcp_hdr_options.bpf.linked3.o       estab                         2558       2572     +14 (+0.55%)         179         180    +1 (+0.56%)
      test_xdp_dynptr.bpf.linked3.o            _xdp_tx_iptunnel               645        596     -49 (-7.60%)          26          24    -2 (-7.69%)
      test_xdp_noinline.bpf.linked3.o          balancer_ingress_v6           3520       3516      -4 (-0.11%)         216         216    +0 (+0.00%)
      xdp_synproxy_kern.bpf.linked3.o          syncookie_tc                 82661      81241   -1420 (-1.72%)        5073        5155   +82 (+1.62%)
      xdp_synproxy_kern.bpf.linked3.o          syncookie_xdp                84964      82297   -2667 (-3.14%)        5130        5157   +27 (+0.53%)
      
      META-INTERNAL
      =============
      Program                                 Insns (A)  Insns (B)  Insns      (DIFF)  States (A)  States (B)  States   (DIFF)
      --------------------------------------  ---------  ---------  -----------------  ----------  ----------  ---------------
      balancer_ingress                            27925      23608    -4317 (-15.46%)        1488        1482      -6 (-0.40%)
      balancer_ingress                            31824      27546    -4278 (-13.44%)        1658        1652      -6 (-0.36%)
      balancer_ingress                            32213      27935    -4278 (-13.28%)        1689        1683      -6 (-0.36%)
      balancer_ingress                            32213      27935    -4278 (-13.28%)        1689        1683      -6 (-0.36%)
      balancer_ingress                            31824      27546    -4278 (-13.44%)        1658        1652      -6 (-0.36%)
      balancer_ingress                            38647      29562    -9085 (-23.51%)        2069        1835   -234 (-11.31%)
      balancer_ingress                            38647      29562    -9085 (-23.51%)        2069        1835   -234 (-11.31%)
      balancer_ingress                            40339      30792    -9547 (-23.67%)        2193        1934   -259 (-11.81%)
      balancer_ingress                            37321      29055    -8266 (-22.15%)        1972        1795    -177 (-8.98%)
      balancer_ingress                            38176      29753    -8423 (-22.06%)        2008        1831    -177 (-8.81%)
      balancer_ingress                            29193      20910    -8283 (-28.37%)        1599        1422   -177 (-11.07%)
      balancer_ingress                            30013      21452    -8561 (-28.52%)        1645        1447   -198 (-12.04%)
      balancer_ingress                            28691      24290    -4401 (-15.34%)        1545        1531     -14 (-0.91%)
      balancer_ingress                            34223      28965    -5258 (-15.36%)        1984        1875    -109 (-5.49%)
      balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
      balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
      balancer_ingress                            35868      26455    -9413 (-26.24%)        2140        1827   -313 (-14.63%)
      balancer_ingress                            35868      26455    -9413 (-26.24%)        2140        1827   -313 (-14.63%)
      balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
      balancer_ingress                            35481      26158    -9323 (-26.28%)        2095        1806   -289 (-13.79%)
      balancer_ingress                            34844      29485    -5359 (-15.38%)        2036        1918    -118 (-5.80%)
      fbflow_egress                                3256       2652     -604 (-18.55%)         218         192    -26 (-11.93%)
      fbflow_ingress                               1026        944       -82 (-7.99%)          70          63     -7 (-10.00%)
      sslwall_tc_egress                            8424       7360    -1064 (-12.63%)         498         458     -40 (-8.03%)
      syar_accept_protect                         15040       9539    -5501 (-36.58%)         364         220   -144 (-39.56%)
      syar_connect_tcp_v6                         15036       9535    -5501 (-36.59%)         360         216   -144 (-40.00%)
      syar_connect_udp_v4                         15039       9538    -5501 (-36.58%)         361         217   -144 (-39.89%)
      syar_connect_connect4_protect4              24805      15833    -8972 (-36.17%)         756         480   -276 (-36.51%)
      syar_lsm_file_open                         167772     151813    -15959 (-9.51%)        1836        1667    -169 (-9.20%)
      syar_namespace_create_new                   14805       9304    -5501 (-37.16%)         353         209   -144 (-40.79%)
      syar_python3_detect                         17531      12030    -5501 (-31.38%)         391         247   -144 (-36.83%)
      syar_ssh_post_fork                          16412      10911    -5501 (-33.52%)         405         261   -144 (-35.56%)
      syar_enter_execve                           14728       9227    -5501 (-37.35%)         345         201   -144 (-41.74%)
      syar_enter_execveat                         14728       9227    -5501 (-37.35%)         345         201   -144 (-41.74%)
      syar_exit_execve                            16622      11121    -5501 (-33.09%)         376         232   -144 (-38.30%)
      syar_exit_execveat                          16622      11121    -5501 (-33.09%)         376         232   -144 (-38.30%)
      syar_syscalls_kill                          15288       9787    -5501 (-35.98%)         398         254   -144 (-36.18%)
      syar_task_enter_pivot_root                  14898       9397    -5501 (-36.92%)         357         213   -144 (-40.34%)
      syar_syscalls_setreuid                      16678      11177    -5501 (-32.98%)         429         285   -144 (-33.57%)
      syar_syscalls_setuid                        16678      11177    -5501 (-32.98%)         429         285   -144 (-33.57%)
      syar_syscalls_process_vm_readv              14959       9458    -5501 (-36.77%)         364         220   -144 (-39.56%)
      syar_syscalls_process_vm_writev             15757      10256    -5501 (-34.91%)         390         246   -144 (-36.92%)
      do_uprobe                                   15519      10018    -5501 (-35.45%)         373         229   -144 (-38.61%)
      edgewall                                   179715      55783  -123932 (-68.96%)       12607        3999  -8608 (-68.28%)
      bictcp_state                                 7570       4131    -3439 (-45.43%)         496         269   -227 (-45.77%)
      cubictcp_state                               7570       4131    -3439 (-45.43%)         496         269   -227 (-45.77%)
      tcp_rate_skb_delivered                        447        272     -175 (-39.15%)          29          18    -11 (-37.93%)
      kprobe__bbr_set_state                        4566       2615    -1951 (-42.73%)         209         124    -85 (-40.67%)
      kprobe__bictcp_state                         4566       2615    -1951 (-42.73%)         209         124    -85 (-40.67%)
      inet_sock_set_state                          1501       1337     -164 (-10.93%)          93          85      -8 (-8.60%)
      tcp_retransmit_skb                           1145        981     -164 (-14.32%)          67          59     -8 (-11.94%)
      tcp_retransmit_synack                        1183        951     -232 (-19.61%)          67          55    -12 (-17.91%)
      bpf_tcptuner                                 1459       1187     -272 (-18.64%)          99          80    -19 (-19.19%)
      tw_egress                                     801        776       -25 (-3.12%)          69          66      -3 (-4.35%)
      tw_ingress                                    795        770       -25 (-3.14%)          69          66      -3 (-4.35%)
      ttls_tc_ingress                             19025      19383      +358 (+1.88%)         470         465      -5 (-1.06%)
      ttls_nat_egress                               490        299     -191 (-38.98%)          33          20    -13 (-39.39%)
      ttls_nat_ingress                              448        285     -163 (-36.38%)          32          21    -11 (-34.38%)
      tw_twfw_egress                             511127     212071  -299056 (-58.51%)       16733        8504  -8229 (-49.18%)
      tw_twfw_ingress                            500095     212069  -288026 (-57.59%)       16223        8504  -7719 (-47.58%)
      tw_twfw_tc_eg                              511113     212064  -299049 (-58.51%)       16732        8504  -8228 (-49.18%)
      tw_twfw_tc_in                              500095     212069  -288026 (-57.59%)       16223        8504  -7719 (-47.58%)
      tw_twfw_egress                              12632      12435      -197 (-1.56%)         276         260     -16 (-5.80%)
      tw_twfw_ingress                             12631      12454      -177 (-1.40%)         278         261     -17 (-6.12%)
      tw_twfw_tc_eg                               12595      12435      -160 (-1.27%)         274         259     -15 (-5.47%)
      tw_twfw_tc_in                               12631      12454      -177 (-1.40%)         278         261     -17 (-6.12%)
      tw_xdp_dump                                   266        209      -57 (-21.43%)           9           8     -1 (-11.11%)
      
      CILIUM
      =========
      File           Program                           Insns (A)  Insns (B)  Insns     (DIFF)  States (A)  States (B)  States  (DIFF)
      -------------  --------------------------------  ---------  ---------  ----------------  ----------  ----------  --------------
      bpf_host.o     cil_to_netdev                          6047       4578   -1469 (-24.29%)         362         249  -113 (-31.22%)
      bpf_host.o     handle_lxc_traffic                     2227       1585    -642 (-28.83%)         156         103   -53 (-33.97%)
      bpf_host.o     tail_handle_ipv4_from_netdev           2244       1458    -786 (-35.03%)         163         106   -57 (-34.97%)
      bpf_host.o     tail_handle_nat_fwd_ipv4              21022      10479  -10543 (-50.15%)        1289         670  -619 (-48.02%)
      bpf_host.o     tail_handle_nat_fwd_ipv6              15433      11375   -4058 (-26.29%)         905         643  -262 (-28.95%)
      bpf_host.o     tail_ipv4_host_policy_ingress          2219       1367    -852 (-38.40%)         161          96   -65 (-40.37%)
      bpf_host.o     tail_nodeport_nat_egress_ipv4         22460      19862   -2598 (-11.57%)        1469        1293  -176 (-11.98%)
      bpf_host.o     tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
      bpf_host.o     tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
      bpf_host.o     tail_nodeport_nat_ipv6_egress          3702       3542     -160 (-4.32%)         215         205    -10 (-4.65%)
      bpf_lxc.o      tail_handle_nat_fwd_ipv4              21022      10479  -10543 (-50.15%)        1289         670  -619 (-48.02%)
      bpf_lxc.o      tail_handle_nat_fwd_ipv6              15433      11375   -4058 (-26.29%)         905         643  -262 (-28.95%)
      bpf_lxc.o      tail_ipv4_ct_egress                    5073       3374   -1699 (-33.49%)         262         172   -90 (-34.35%)
      bpf_lxc.o      tail_ipv4_ct_ingress                   5093       3385   -1708 (-33.54%)         262         172   -90 (-34.35%)
      bpf_lxc.o      tail_ipv4_ct_ingress_policy_only       5093       3385   -1708 (-33.54%)         262         172   -90 (-34.35%)
      bpf_lxc.o      tail_ipv6_ct_egress                    4593       3878    -715 (-15.57%)         194         151   -43 (-22.16%)
      bpf_lxc.o      tail_ipv6_ct_ingress                   4606       3891    -715 (-15.52%)         194         151   -43 (-22.16%)
      bpf_lxc.o      tail_ipv6_ct_ingress_policy_only       4606       3891    -715 (-15.52%)         194         151   -43 (-22.16%)
      bpf_lxc.o      tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
      bpf_lxc.o      tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
      bpf_overlay.o  tail_handle_nat_fwd_ipv4              20524      10114  -10410 (-50.72%)        1271         638  -633 (-49.80%)
      bpf_overlay.o  tail_nodeport_nat_egress_ipv4         22718      19490   -3228 (-14.21%)        1475        1275  -200 (-13.56%)
      bpf_overlay.o  tail_nodeport_nat_ingress_ipv4         5526       3534   -1992 (-36.05%)         366         243  -123 (-33.61%)
      bpf_overlay.o  tail_nodeport_nat_ingress_ipv6         5132       4256    -876 (-17.07%)         241         219    -22 (-9.13%)
      bpf_overlay.o  tail_nodeport_nat_ipv6_egress          3638       3548      -90 (-2.47%)         209         203     -6 (-2.87%)
      bpf_overlay.o  tail_rev_nodeport_lb4                  4368       3820    -548 (-12.55%)         248         215   -33 (-13.31%)
      bpf_overlay.o  tail_rev_nodeport_lb6                  2867       2428    -439 (-15.31%)         167         140   -27 (-16.17%)
      bpf_sock.o     cil_sock6_connect                      1718       1703      -15 (-0.87%)         100          99     -1 (-1.00%)
      bpf_xdp.o      tail_handle_nat_fwd_ipv4              12917      12443     -474 (-3.67%)         875         849    -26 (-2.97%)
      bpf_xdp.o      tail_handle_nat_fwd_ipv6              13515      13264     -251 (-1.86%)         715         702    -13 (-1.82%)
      bpf_xdp.o      tail_lb_ipv4                          39492      36367    -3125 (-7.91%)        2430        2251   -179 (-7.37%)
      bpf_xdp.o      tail_lb_ipv6                          80441      78058    -2383 (-2.96%)        3647        3523   -124 (-3.40%)
      bpf_xdp.o      tail_nodeport_ipv6_dsr                 1038        901    -137 (-13.20%)          61          55     -6 (-9.84%)
      bpf_xdp.o      tail_nodeport_nat_egress_ipv4         13027      12096     -931 (-7.15%)         868         809    -59 (-6.80%)
      bpf_xdp.o      tail_nodeport_nat_ingress_ipv4         7617       5900   -1717 (-22.54%)         522         413  -109 (-20.88%)
      bpf_xdp.o      tail_nodeport_nat_ingress_ipv6         7575       7395     -180 (-2.38%)         383         374     -9 (-2.35%)
      bpf_xdp.o      tail_rev_nodeport_lb4                  6808       6739      -69 (-1.01%)         403         396     -7 (-1.74%)
      bpf_xdp.o      tail_rev_nodeport_lb6                 16173      15847     -326 (-2.02%)        1010         990    -20 (-1.98%)
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-9-andrii@kernel.org


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      18a433b6
    • Andrii Nakryiko's avatar
      selftests/bpf: validate zero preservation for sub-slot loads · add1cd7f
      Andrii Nakryiko authored
      
      
      Validate that 1-, 2-, and 4-byte loads from stack slots not aligned on
      an 8-byte boundary still preserve zero, both when loading from
      all-STACK_ZERO sub-slots and when the stack sub-slots are covered by a
      spilled register with a known constant zero value.
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-8-andrii@kernel.org


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      add1cd7f
    • Andrii Nakryiko's avatar
      bpf: preserve constant zero when doing partial register restore · e322f0bc
      Andrii Nakryiko authored
      
      
      Similar to the special handling of STACK_ZERO, when reading 1/2/4 bytes
      from a stack slot that has a register spilled into it, and that register
      has a constant value of zero, preserve that zero and mark the spilled
      register as precise for it. This makes the spilled-const-zero-register
      and STACK_ZERO cases equivalent in their behavior.
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-7-andrii@kernel.org


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e322f0bc
    • Andrii Nakryiko's avatar
      selftests/bpf: validate STACK_ZERO is preserved on subreg spill · b33ceb6a
      Andrii Nakryiko authored
      
      
      Add tests validating that STACK_ZERO slots are preserved when a slot is
      partially overwritten by a subregister spill.
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-6-andrii@kernel.org


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b33ceb6a
    • Andrii Nakryiko's avatar
      bpf: preserve STACK_ZERO slots on partial reg spills · eaf18feb
      Andrii Nakryiko authored
      
      
      Instead of always forcing STACK_ZERO slots to STACK_MISC, preserve them
      in situations where this is possible. E.g., when spilling a register as
      a 1/2/4-byte subslot on the stack, the remaining bytes in the stack slot
      do not automatically become unknown. If we know they contained zeroes,
      we can preserve those STACK_ZERO markers.

      Add a helper mark_stack_slot_misc(), similar to scrub_spilled_slot(),
      but one that overwrites neither STACK_INVALID nor STACK_ZERO. Note that
      we need to take into account the possibility of being in unprivileged
      mode, in which case STACK_INVALID is forced to STACK_MISC for
      correctness, as treating STACK_INVALID as equivalent to STACK_MISC is
      only enabled in privileged mode.
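
      A minimal sketch of the intended behavior (simplified, not the exact
      verifier code; the privileged-mode flag used below is an assumption):

        /* Keep STACK_ZERO; keep STACK_INVALID only in privileged mode;
         * everything else becomes STACK_MISC. */
        static void mark_stack_slot_misc(struct bpf_verifier_env *env, u8 *stype)
        {
                if (*stype == STACK_ZERO)
                        return;
                if (*stype == STACK_INVALID && env->allow_ptr_leaks) /* assumed flag */
                        return;
                *stype = STACK_MISC;
        }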
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-5-andrii@kernel.org


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      eaf18feb
    • Andrii Nakryiko's avatar
      bpf: fix check for attempt to corrupt spilled pointer · ab125ed3
      Andrii Nakryiko authored
      When a register is spilled onto the stack as a 1/2/4-byte register, we
      set slot_type[BPF_REG_SIZE - 1] (plus potentially a few more below it,
      depending on the actual spill size). So to check whether some stack slot
      has a spilled register, we need to consult slot_type[7], not
      slot_type[0].

      To avoid the need to remember and double-check this in the future, just
      use the is_spilled_reg() helper.
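
      For reference, a sketch of that helper and the slot layout it relies on
      (simplified):

        /* A spill marks slot_type[] starting from the highest byte of the
         * 8-byte slot, so the marker to test lives at BPF_REG_SIZE - 1. */
        static bool is_spilled_reg(const struct bpf_stack_state *stack)
        {
                return stack->slot_type[BPF_REG_SIZE - 1] == STACK_SPILL;
        }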
      
      Fixes: 27113c59 ("bpf: Check the other end of slot_type for STACK_SPILL")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-4-andrii@kernel.org


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ab125ed3
    • Andrii Nakryiko's avatar
      selftests/bpf: add stack access precision test · 87630188
      Andrii Nakryiko authored
      
      
      Add a new selftest that validates precision tracking for stack access
      instructions, using both r10-based and non-r10-based accesses. For the
      non-r10 ones we also make sure to have a non-zero var_off to validate
      that the final stack offset is tracked properly in the instruction
      history information inside the verifier.
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-3-andrii@kernel.org


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      87630188
    • Andrii Nakryiko's avatar
      bpf: support non-r10 register spill/fill to/from stack in precision tracking · 41f6f64e
      Andrii Nakryiko authored
      Use instruction (jump) history to record instructions that performed
      register spill/fill to/from stack, regardless of whether this was done
      through the read-only r10 register or through any other register after
      copying r10 into it *and* potentially adjusting the offset.

      To make this work reliably, we push extra per-instruction flags into the
      instruction history, encoding the stack slot index (spi) and the stack
      frame number in 10 extra bits of flags that we take away from prev_idx
      in the instruction history. We don't touch the idx field for maximum
      performance, as it's checked most frequently during backtracking.
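
      A sketch of the flag packing described above (the macro and helper names
      here are illustrative, following the description rather than quoting the
      patch):

        /* 10 bits carved out of prev_idx: 3 bits of stack frame number,
         * 6 bits of stack slot index, and one "insn accessed the stack" bit. */
        #define INSN_F_FRAMENO_MASK  0x7
        #define INSN_F_SPI_SHIFT     3
        #define INSN_F_SPI_MASK      0x3f
        #define INSN_F_STACK_ACCESS  (1 << 9)

        static u32 insn_stack_access_flags(int frameno, int spi)
        {
                return INSN_F_STACK_ACCESS |
                       ((u32)spi << INSN_F_SPI_SHIFT) |
                       (u32)frameno;
        }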
      
      This change removes basically the last remaining practical limitation of
      precision backtracking logic in BPF verifier. It fixes known
      deficiencies, but also opens up new opportunities to reduce number of
      verified states, explored in the subsequent patches.
      
      There are only three differences in selftests' BPF object files
      according to veristat, all in the positive direction (less states).
      
      File                                    Program        Insns (A)  Insns (B)  Insns  (DIFF)  States (A)  States (B)  States (DIFF)
      --------------------------------------  -------------  ---------  ---------  -------------  ----------  ----------  -------------
      test_cls_redirect_dynptr.bpf.linked3.o  cls_redirect        2987       2864  -123 (-4.12%)         240         231    -9 (-3.75%)
      xdp_synproxy_kern.bpf.linked3.o         syncookie_tc       82848      82661  -187 (-0.23%)        5107        5073   -34 (-0.67%)
      xdp_synproxy_kern.bpf.linked3.o         syncookie_xdp      85116      84964  -152 (-0.18%)        5162        5130   -32 (-0.62%)
      
      Note, I avoided renaming jmp_history to more generic insn_hist to
      minimize number of lines changed and potential merge conflicts between
      bpf and bpf-next trees.
      
      Notice also the cur_hist_entry pointer reset to NULL at the beginning of
      the instruction verification loop. This pointer avoids the problem of
      relying on the last jump history entry's insn_idx to determine whether
      we already have an entry for the current instruction or not. It can
      happen that we added a jump history entry because the current
      instruction is_jmp_point(), but we also need to add instruction flags
      for stack access. In this case, we don't want two entries, so we need to
      reuse the last added entry, if it is present.
      
      Relying on insn_idx comparison has the same ambiguity problem as the one
      that was fixed recently in [0], so we avoid that.
      
        [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231110002638.4168352-3-andrii@kernel.org/
      
      
      
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Reported-by: Tao Lyu <tao.lyu@epfl.ch>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231205184248.1502704-2-andrii@kernel.org


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      41f6f64e
  3. Dec 05, 2023
    • Stanislav Fomichev's avatar
      selftests/bpf: Make sure we trigger metadata kfuncs for dst 8080 · 5ffb260f
      Stanislav Fomichev authored
      
      
      xdp_metadata test is flaky sometimes:
      
        verify_xsk_metadata:FAIL:rx_hash_type unexpected rx_hash_type: actual 8 != expected 0
      
      Where 8 means XDP_RSS_TYPE_L4_ANY, which the veth driver exports only
      when the 'skb->l4_hash' condition is met. This makes me think that the
      program is triggered again for some other packet.

      Let's add a filter, similar to xdp_hw_metadata, so that we trigger the
      XDP kfuncs only for UDP packets destined to port 8080.
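
      A rough sketch of such a filter (the program and map names here are
      illustrative, not the exact selftest code):

        #include <linux/bpf.h>
        #include <linux/if_ether.h>
        #include <linux/ip.h>
        #include <linux/udp.h>
        #include <linux/in.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_endian.h>

        struct {
                __uint(type, BPF_MAP_TYPE_XSKMAP);
                __uint(max_entries, 4);
                __type(key, __u32);
                __type(value, __u32);
        } xsk SEC(".maps");

        SEC("xdp")
        int rx(struct xdp_md *ctx)
        {
                void *data = (void *)(long)ctx->data;
                void *data_end = (void *)(long)ctx->data_end;
                struct ethhdr *eth = data;
                struct iphdr *iph = (void *)(eth + 1);
                struct udphdr *udp = (void *)(iph + 1);

                /* bounds-check the whole eth+ip+udp header chain at once;
                 * assumes no IP options, which holds for the test traffic */
                if ((void *)(udp + 1) > data_end)
                        return XDP_PASS;

                if (eth->h_proto != bpf_htons(ETH_P_IP) ||
                    iph->protocol != IPPROTO_UDP ||
                    udp->dest != bpf_htons(8080))
                        return XDP_PASS;

                /* only for this packet: query the rx metadata kfuncs and
                 * redirect to the AF_XDP socket */
                return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
        }

        char _license[] SEC("license") = "GPL";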
      
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20231204174423.3460052-1-sdf@google.com
      5ffb260f
    • Stanislav Fomichev's avatar
      xsk: Add missing SPDX to AF_XDP TX metadata documentation · 5c399ae0
      Stanislav Fomichev authored
      Not sure how I missed that. I even acknowledged it explicitly
      in the changelog [0]. Add the tag for real now.
      
        [0] https://lore.kernel.org/bpf/20231127190319.1190813-1-sdf@google.com/
      
      Fixes: 11614723 ("xsk: Add option to calculate TX checksum in SW")
      Suggested-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20231204174231.3457705-1-sdf@google.com
      5c399ae0
    • Dave Marchevsky's avatar
      selftests/bpf: Test bpf_kptr_xchg stashing of bpf_rb_root · 1b4c7e20
      Dave Marchevsky authored
      
      
      There was some confusion amongst Meta sched_ext folks regarding whether
      stashing bpf_rb_root - the tree itself, rather than a single node - was
      supported. This patch adds a small test which demonstrates this
      functionality: a local kptr with rb_root is created, a node is created
      and added to the tree, then the tree is kptr_xchg'd into a mapval.
      
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/bpf/20231204211722.571346-1-davemarchevsky@fb.com
      1b4c7e20
    • Alexei Starovoitov's avatar
      Merge branch 'bpf-fix-the-release-of-inner-map' · ce3c49da
      Alexei Starovoitov authored
      Hou Tao says:
      
      ====================
      bpf: Fix the release of inner map
      
      From: Hou Tao <houtao1@huawei.com>
      
      Hi,
      
      The patchset aims to fix the release of inner maps in a map array or map
      htab. The release of an inner map differs from that of a normal map. A
      normal map is released after the bpf program which uses it is destroyed,
      because the bpf program tracks the maps it uses. However, a bpf program
      cannot track used inner maps, because inner maps may be updated or
      deleted dynamically, and for now the ref-counter of an inner map is
      decreased right after the inner map is removed from the outer map. As a
      result, the inner map may be freed before the bpf program which is
      accessing it exits, and there will be a use-after-free problem, as
      demonstrated by patch #6.
      
      The patchset fixes the problem by deferring the release of inner map.
      The freeing of inner map is deferred according to the sleepable
      attributes of the bpf programs which own the outer map. Patch #1 fixes
      the warning when running the newly-added selftest under interpreter
      mode. Patch #2 adds more parameters to .map_fd_put_ptr() to prepare for
      the fix. Patch #3 fixes the incorrect value of need_defer when freeing
      the fd array. Patch #4 fixes the potential use-after-free problem by
      using call_rcu_tasks_trace() and call_rcu() to wait for one tasks trace
      RCU GP and one RCU GP unconditionally. Patch #5 optimizes the free of
      inner map by removing the unnecessary RCU GP waiting. Patch #6 adds a
      selftest to demonstrate the potential use-after-free problem. Patch #7
      updates a selftest to update outer map in syscall bpf program.
      
      Please see individual patches for more details. And comments are always
      welcome.
      
      Change Log:
      v5:
       * patch #3: rename fd_array_map_delete_elem_with_deferred_free() to
                   __fd_array_map_delete_elem() (Alexei)
       * patch #5: use atomic64_t instead of atomic_t to prevent potential
                   overflow (Alexei)
       * patch #7: use ptr_to_u64() helper instead of force casting to initialize
                   pointers in bpf_attr (Alexei)
      
      v4: https://lore.kernel.org/bpf/20231130140120.1736235-1-houtao@huaweicloud.com
        * patch #2: don't use "deferred", use "need_defer" uniformly
        * patch #3: newly-added, fix the incorrect value of need_defer during
                    fd array free.
        * patch #4: doesn't consider the case in which bpf map is not used by
                    any bpf program and only use sleepable_refcnt to remove
                    unnecessary tasks trace RCU GP (Alexei)
        * patch #4: remove memory barriers added due to cautiousness (Alexei)
      
      v3: https://lore.kernel.org/bpf/20231124113033.503338-1-houtao@huaweicloud.com
        * multiple variable renamings (Martin)
        * define BPF_MAP_RCU_GP/BPF_MAP_RCU_TT_GP as bit (Martin)
        * use call_rcu() and its variants instead of synchronize_rcu() (Martin)
        * remove unnecessary mask in bpf_map_free_deferred() (Martin)
        * place atomic_or() and the related smp_mb() together (Martin)
        * add patch #6 to demonstrate that updating outer map in syscall
          program is dead-lock free (Alexei)
        * update comments about the memory barrier in bpf_map_fd_put_ptr()
        * update commit message for patch #3 and #4 to describe more details
      
      v2: https://lore.kernel.org/bpf/20231113123324.3914612-1-houtao@huaweicloud.com
        * defer the invocation of ops->map_free() instead of bpf_map_put() (Martin)
        * update selftest to make it being reproducible under JIT mode (Martin)
        * remove unnecessary preparatory patches
      
      v1: https://lore.kernel.org/bpf/20231107140702.1891778-1-houtao@huaweicloud.com
      ====================
      
      Link: https://lore.kernel.org/r/20231204140425.1480317-1-houtao@huaweicloud.com
      
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ce3c49da
    • Hou Tao's avatar
      selftests/bpf: Test outer map update operations in syscall program · e3dd4082
      Hou Tao authored
      
      
      A syscall program runs with rcu_read_lock_trace being held, so if
      bpf_map_update_elem() or bpf_map_delete_elem() invokes
      synchronize_rcu_tasks_trace() when operating on an outer map, there will
      be a dead-lock. Add a test to guarantee that the operation is dead-lock
      free.
      
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-8-houtao@huaweicloud.com


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e3dd4082
    • Hou Tao's avatar
      selftests/bpf: Add test cases for inner map · 1624918b
      Hou Tao authored
      
      
      Add test cases to exercise the race between the destruction of an inner
      map due to a map-in-map update and the access of the inner map from a
      bpf program. The following 4 combinations are added:
      (1) array map in map array + bpf program
      (2) array map in map array + sleepable bpf program
      (3) array map in map htab + bpf program
      (4) array map in map htab + sleepable bpf program
      
      Before applying the fixes, running `./test_progs -a map_in_map` reported
      the following error:
      
        ==================================================================
        BUG: KASAN: slab-use-after-free in array_map_update_elem+0x48/0x3e0
        Read of size 4 at addr ffff888114f33824 by task test_progs/1858
      
        CPU: 1 PID: 1858 Comm: test_progs Tainted: G           O     6.6.0+ #7
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
        Call Trace:
         <TASK>
         dump_stack_lvl+0x4a/0x90
         print_report+0xd2/0x620
         kasan_report+0xd1/0x110
         __asan_load4+0x81/0xa0
         array_map_update_elem+0x48/0x3e0
         bpf_prog_be94a9f26772f5b7_access_map_in_array+0xe6/0xf6
         trace_call_bpf+0x1aa/0x580
         kprobe_perf_func+0xdd/0x430
         kprobe_dispatcher+0xa0/0xb0
         kprobe_ftrace_handler+0x18b/0x2e0
         0xffffffffc02280f7
        RIP: 0010:__x64_sys_getpgid+0x1/0x30
        ......
         </TASK>
      
        Allocated by task 1857:
         kasan_save_stack+0x26/0x50
         kasan_set_track+0x25/0x40
         kasan_save_alloc_info+0x1e/0x30
         __kasan_kmalloc+0x98/0xa0
         __kmalloc_node+0x6a/0x150
         __bpf_map_area_alloc+0x141/0x170
         bpf_map_area_alloc+0x10/0x20
         array_map_alloc+0x11f/0x310
         map_create+0x28a/0xb40
         __sys_bpf+0x753/0x37c0
         __x64_sys_bpf+0x44/0x60
         do_syscall_64+0x36/0xb0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
        Freed by task 11:
         kasan_save_stack+0x26/0x50
         kasan_set_track+0x25/0x40
         kasan_save_free_info+0x2b/0x50
         __kasan_slab_free+0x113/0x190
         slab_free_freelist_hook+0xd7/0x1e0
         __kmem_cache_free+0x170/0x260
         kfree+0x9b/0x160
         kvfree+0x2d/0x40
         bpf_map_area_free+0xe/0x20
         array_map_free+0x120/0x2c0
         bpf_map_free_deferred+0xd7/0x1e0
         process_one_work+0x462/0x990
         worker_thread+0x370/0x670
         kthread+0x1b0/0x200
         ret_from_fork+0x3a/0x70
         ret_from_fork_asm+0x1b/0x30
      
        Last potentially related work creation:
         kasan_save_stack+0x26/0x50
         __kasan_record_aux_stack+0x94/0xb0
         kasan_record_aux_stack_noalloc+0xb/0x20
         __queue_work+0x331/0x950
         queue_work_on+0x75/0x80
         bpf_map_put+0xfa/0x160
         bpf_map_fd_put_ptr+0xe/0x20
         bpf_fd_array_map_update_elem+0x174/0x1b0
         bpf_map_update_value+0x2b7/0x4a0
         __sys_bpf+0x2551/0x37c0
         __x64_sys_bpf+0x44/0x60
         do_syscall_64+0x36/0xb0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-7-houtao@huaweicloud.com


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1624918b
    • Hou Tao's avatar
      bpf: Optimize the free of inner map · af66bfd3
      Hou Tao authored
      
      
      When removing an inner map from the outer map, the inner map will be
      freed after one RCU grace period and one RCU tasks trace grace period,
      so it is certain that any bpf program which may access the inner map has
      exited before the inner map is freed.

      However, there is no need to wait for an RCU tasks trace grace period if
      the outer map is only accessed by non-sleepable programs. So add
      sleepable_refcnt to bpf_map and increase sleepable_refcnt when adding
      the outer map into env->used_maps for a sleepable program. Although the
      max number of bpf programs is INT_MAX - 1, the number of bpf programs
      which are being loaded may be greater than INT_MAX, so use atomic64_t
      instead of atomic_t for sleepable_refcnt. When removing the inner map
      from the outer map, use sleepable_refcnt to decide whether or not an RCU
      tasks trace grace period is needed before freeing the inner map.
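
      A condensed sketch of the resulting logic (the callback and field names
      below are illustrative, not the exact code paths):

        /* at load time: count sleepable programs that use the outer map */
        if (prog->aux->sleepable)
                atomic64_inc(&map->sleepable_refcnt);

        /* when an inner map is removed from the outer map and its last
         * reference is dropped: */
        if (atomic64_read(&map->sleepable_refcnt))
                /* sleepable progs may still run: tasks trace RCU GP first */
                call_rcu_tasks_trace(&inner_map->rcu, bpf_map_free_mult_rcu_gp);
        else
                /* only non-sleepable progs: one normal RCU GP is enough */
                call_rcu(&inner_map->rcu, bpf_map_free_rcu_gp);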
      
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-6-houtao@huaweicloud.com


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      af66bfd3
    • Hou Tao's avatar
      bpf: Defer the free of inner map when necessary · 87667336
      Hou Tao authored
      When updating or deleting an inner map in a map array or map htab, the
      inner map may still be accessed by a non-sleepable or sleepable program.
      However, bpf_map_fd_put_ptr() decreases the ref-counter of the inner map
      directly through bpf_map_put(); if that ref-counter is the last one
      (which is true for most cases), the inner map will be freed by
      ops->map_free() in a kworker. But for now, most .map_free() callbacks
      don't use synchronize_rcu() or its variants to wait for the elapse of an
      RCU grace period, so after the invocation of ops->map_free completes, a
      bpf program which is still accessing the inner map may run into a
      use-after-free problem.

      Fix the free of the inner map by invoking bpf_map_free_deferred() after
      both one RCU grace period and one tasks trace RCU grace period, if the
      inner map has been removed from the outer map before. The deferment is
      accomplished by using call_rcu() or call_rcu_tasks_trace() when
      releasing the last ref-counter of the bpf map. The newly-added rcu_head
      field in bpf_map shares the same storage space with the work field to
      reduce the size of bpf_map.
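
      A sketch of the layout change and the chained grace periods described
      above (simplified; the exact callback names may differ in the patch):

        /* in struct bpf_map: rcu_head reuses the storage of the existing
         * work field, so the struct does not grow */
        union {
                struct work_struct work;  /* existing deferred free in kworker */
                struct rcu_head rcu;      /* new: free after RCU grace period(s) */
        };

        /* first wait for a tasks trace RCU GP, then chain a normal RCU GP,
         * and only then schedule bpf_map_free_deferred() */
        static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
        {
                struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);

                call_rcu(&map->rcu, bpf_map_free_rcu_gp);
        }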
      
      Fixes: bba1dc0b ("bpf: Remove redundant synchronize_rcu.")
      Fixes: 638e4b82 ("bpf: Allows per-cpu maps and map-in-map in sleepable programs")
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.com


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      87667336
    • Hou Tao's avatar
      bpf: Set need_defer as false when clearing fd array during map free · 79d93b3c
      Hou Tao authored
      
      
      The map deletion operation, map release and map free operation all use
      fd_array_map_delete_elem() to remove an element from the fd array, and
      need_defer is always true in fd_array_map_delete_elem(). For the map
      deletion and map release operations, need_defer=true is necessary,
      because the bpf program which accesses the element in the fd array may
      still be alive. However, for the map free operation, it is certain that
      the bpf program which owned the fd array has already exited, so setting
      need_defer to false is appropriate for the map free operation.

      So fix it by adding a need_defer parameter to bpf_fd_array_map_clear()
      and adding a new helper __fd_array_map_delete_elem() to handle the map
      deletion, map release and map free operations accordingly.
      
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-4-houtao@huaweicloud.com


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      79d93b3c
    • Hou Tao's avatar
      bpf: Add map and need_defer parameters to .map_fd_put_ptr() · 20c20bd1
      Hou Tao authored
      
      
      map is a pointer to the outer map, and need_defer needs some
      explanation. need_defer tells the implementation to defer the release of
      the reference to the passed element and to ensure that the element is
      still alive before the bpf program, which may manipulate it, exits.

      The following three cases will invoke map_fd_put_ptr(), and different
      need_defer values will be passed by these callers:
      
      1) release the reference of the old element in the map during map update
         or map deletion. The release must be deferred, otherwise the bpf
         program may run into a use-after-free problem, so need_defer needs to
         be true.
      2) release the reference of the to-be-added element in the error path of
         map update. The to-be-added element is not visible to any bpf
         program, so it is OK to pass false for the need_defer parameter.
      3) release the references of all elements in the map during map release.
         Any bpf program which has access to the map must have exited and been
         released, so need_defer=false will be OK.
      
      These two parameters will be used by the following patches to fix the
      potential use-after-free problem for map-in-map.
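
      For illustration, the resulting callback signature roughly looks like
      this (a sketch based on the description above; other members omitted):

        struct bpf_map_ops {
                /* ... other callbacks ... */
                void (*map_fd_put_ptr)(struct bpf_map *map, void *ptr,
                                       bool need_defer);
        };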
      
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      20c20bd1
    • Hou Tao's avatar
      bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers · 169410eb
      Hou Tao authored
      
      
      These three bpf_map_{lookup,update,delete}_elem() helpers are also
      available to sleepable bpf programs, so add the corresponding lock
      assertion for sleepable bpf programs; otherwise the following warning
      will be reported when a sleepable bpf program manipulates a bpf map
      under interpreter mode (aka bpf_jit_enable=0):
      
        WARNING: CPU: 3 PID: 4985 at kernel/bpf/helpers.c:40 ......
        CPU: 3 PID: 4985 Comm: test_progs Not tainted 6.6.0+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
        RIP: 0010:bpf_map_lookup_elem+0x54/0x60
        ......
        Call Trace:
         <TASK>
         ? __warn+0xa5/0x240
         ? bpf_map_lookup_elem+0x54/0x60
         ? report_bug+0x1ba/0x1f0
         ? handle_bug+0x40/0x80
         ? exc_invalid_op+0x18/0x50
         ? asm_exc_invalid_op+0x1b/0x20
         ? __pfx_bpf_map_lookup_elem+0x10/0x10
         ? rcu_lockdep_current_cpu_online+0x65/0xb0
         ? rcu_is_watching+0x23/0x50
         ? bpf_map_lookup_elem+0x54/0x60
         ? __pfx_bpf_map_lookup_elem+0x10/0x10
         ___bpf_prog_run+0x513/0x3b70
         __bpf_prog_run32+0x9d/0xd0
         ? __bpf_prog_enter_sleepable_recur+0xad/0x120
         ? __bpf_prog_enter_sleepable_recur+0x3e/0x120
         bpf_trampoline_6442580665+0x4d/0x1000
         __x64_sys_getpgid+0x5/0x30
         ? do_syscall_64+0x36/0xb0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
         </TASK>
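
      A sketch of the strengthened assertion in the lookup helper (the update
      and delete helpers get the same treatment):

        BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
        {
                /* sleepable programs run under rcu_read_lock_trace(), so any
                 * of the three RCU read-side flavors may legitimately be held */
                WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
                             !rcu_read_lock_bh_held());
                return (unsigned long) map->ops->map_lookup_elem(map, key);
        }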
      
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-2-houtao@huaweicloud.com


      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      169410eb