  1. Mar 31, 2020
    • bpf: Verifier, do explicit ALU32 bounds tracking · 3f50f132
      John Fastabend authored
      
      
      It is not possible for the current verifier to track ALU32 and JMP ops
      correctly. This can result in the verifier aborting with errors even though
      the program should be verifiable. BPF programs that hit this can work around
      it by changing int variables to 64-bit types, marking variables volatile,
      etc. But this is all very ugly so it would be better to avoid these tricks.
      
      But, the main reason to address this now is that do_refine_retval_range()
      was assuming return values could not be negative. Once we fix this, code
      that was previously working will no longer work. See the
      do_refine_retval_range() patch for details. And we don't want to suddenly
      cause programs that used to work to fail.
      
      The simplest example code snippet that illustrates the problem is likely
      this,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
      
      The expected 64-bit and 32-bit bounds after each line are shown on the
      right. The current issue is that without the w* bounds we are forced to use
      the worst-case bound of [0, U32_MAX]. To resolve this type of case, where
      jmp32 creates divergent 32-bit bounds from the 64-bit bounds, we add explicit
      32-bit register bounds s32_{min|max}_value and u32_{min|max}_value. Then
      from the branch_taken logic creating new bounds we can track 32-bit bounds
      explicitly.
      
      The next case we observed is ALU ops after the jmp32,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
       55: w8 += 1                    // r8 <- [0, U32_MAX+1]
                                      // w8 <- [0, X+1]
      
      In order to keep the bounds accurate at this point we also need to track
      ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU
      ops, mov, add, sub, etc.
      
      Finally there is a question of how and when to merge bounds. The cases
      are enumerated here:
      
      1. MOV ALU32   - zext 32-bit -> 64-bit
      2. MOV ALU64   - copy 64-bit -> 32-bit
      3. op  ALU32   - zext 32-bit -> 64-bit
      4. op  ALU64   - n/a
      5. jmp ALU32   - 64-bit: var32_off | upper_32_bits(var64_off)
      6. jmp ALU64   - 32-bit: (>> (<< var64_off))
      
      Details for each case,
      
      For "MOV ALU32" BPF arch zero extends so we simply copy the bounds
      from 32-bit into 64-bit ensuring we truncate var_off and 64-bit
      bounds correctly. See zext_32_to_64.
      
      For "MOV ALU64" copy all bounds including 32-bit into new register. If
      the src register had 32-bit bounds the dst register will as well.
      
      For "op ALU32" zero extend 32-bit into 64-bit the same as move,
      see zext_32_to_64.
      
      For "op ALU64" calculate both 32-bit and 64-bit bounds no merging
      is done here. Except we have a special case. When RSH or ARSH is
      done we can't simply ignore shifting bits from 64-bit reg into the
      32-bit subreg. So currently just push bounds from 64-bit into 32-bit.
      This will be correct in the sense that they will represent a valid
      state of the register. However we could lose some accuracy if an
      ARSH is following a jmp32 operation. We can handle this special
      case in a follow up series.
      
      For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds
      from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We
      special case if 64-bit bounds has zero'd upper 32bits at which point
      we can simply copy 32-bit bounds into 64-bit register. This catches
      a common compiler trick where upper 32-bits are zeroed and then
      32-bit ops are used followed by a 64-bit compare or 64-bit op on
      a pointer. See __reg_combine_64_into_32().
      
      For "jmp ALU64" cast the bounds of the 64bit to their 32-bit
      counterpart. For example s32_min_value = (s32)reg->smin_value. For
      tnum use only the lower 32bits via, (>>(<<var_off)). See
      __reg_combine_64_into_32().
      
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
      3f50f132
    • bpf: Verifier, do_refine_retval_range may clamp umin to 0 incorrectly · 10060503
      John Fastabend authored
      do_refine_retval_range() is called to refine return values from specified
      helpers, probe_read_str and get_stack at the moment. The reasoning is that
      both have a max value as part of their input arguments, and because the
      helpers ensure the return value will not be larger than this we can set
      the smax value of the return register, r0.
      
      However, the return value is a signed integer, so setting umax is incorrect.
      It leads to further confusion when do_refine_retval_range() then calls
      __reg_deduce_bounds(), which will see a umax value as meaning the value is
      unsigned and then, assuming it is unsigned, set smin = umin, which in this
      case results in 'smin = 0' and an 'smax = X' where X is the input argument
      from the helper call.
      
      Here are the comments from __reg_deduce_bounds() on why this would be safe
      to do.
      
       /* Learn sign from unsigned bounds.  Signed bounds cross the sign
        * boundary, so we must be careful.
        */
       if ((s64)reg->umax_value >= 0) {
      	/* Positive.  We can't learn anything from the smin, but smax
      	 * is positive, hence safe.
      	 */
      	reg->smin_value = reg->umin_value;
      	reg->smax_value = reg->umax_value = min_t(u64, reg->smax_value,
      						  reg->umax_value);
      
      But now we incorrectly have a return value with type int with the
      signed bounds (0, X). Suppose the return value is negative, which is
      possible; then we have the verifier and reality out of sync. Among other
      things this may result in error handling code being falsely detected
      as dead code and removed. For instance the example below shows how using
      bpf_probe_read_str() causes the error path to be identified as dead
      code and removed.
      
      From the 'llvm-objdump -S' dump,
      
       r2 = 100
       call 45
       if r0 s< 0 goto +4
       r4 = *(u32 *)(r7 + 0)
      
      But from the xlated dump,
      
        (b7) r2 = 100
        (85) call bpf_probe_read_compat_str#-96768
        (61) r4 = *(u32 *)(r7 +0)  <-- dropped if goto
      
      This is due to the verifier state after the call being
      
       R0=inv(id=0,umax_value=100,var_off=(0x0; 0x7f))
      
      To fix this, omit setting the umax value because it's not safe. The only
      actual bound we know is the smax. This results in the correct bounds
      (SMIN, X) where X is the max length from the helper. After this, the
      new verifier state after call 45 looks like the following.
      
      R0=inv(id=0,smax_value=100)
      
      The xlated version then no longer removes the dead code, giving the
      expected result,
      
        (b7) r2 = 100
        (85) call bpf_probe_read_compat_str#-96768
        (c5) if r0 s< 0x0 goto pc+4
        (61) r4 = *(u32 *)(r7 +0)
      
      Note, bpf_probe_read_* calls are root only so we won't hit this case
      with non-root bpf users.
      
      v3: the comment had some documentation about the meta-set-to-null case
      which is not relevant here and confusing to include, so it was dropped.
      
      v2 note: In the original version we set msize_smax_value from
      check_func_arg() and propagated this into the smax of the retval. The
      logic was that smax is the bound on the retval we set, and because the
      type in the helper is ARG_CONST_SIZE we know that the reg is a positive
      tnum_const(), so umax=smax. Alexei pointed out though that this is a bit
      odd to read because the register in check_func_arg() has a C type of u32
      and the umax bound would be the normally relevant bound here. Pulling in
      extra knowledge about future checks makes reading the code a bit tricky.
      Further, having signed metadata that can only ever be positive is also a
      bit odd. So we dropped the msize_smax_value metadata and made it a u64
      msize_max_value to indicate it's unsigned. Additionally we save the bound
      from the umax value in check_func_arg(), which is the same as smax due to
      the tnum_const() and negative check noted above, but reads better. By my
      analysis nothing functionally changes in v2, but it does get easier to
      read, so that is a win.
      
      Fixes: 849fa506 ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560417900.10843.14351995140624628941.stgit@john-Precision-5820-Tower
      10060503
    • bpf, lsm: Make BPF_LSM depend on BPF_EVENTS · 4edf16b7
      KP Singh authored
      LSM and tracing programs share their helpers with bpf_tracing_func_proto
      which is only defined (in bpf_trace.c) when BPF_EVENTS is enabled.
      
      Instead of adding a __weak symbol, make BPF_LSM depend on BPF_EVENTS so
      that both tracing and LSM programs can actually share helpers.
      
      Fixes: fc611f47 ("bpf: Introduce BPF_PROG_TYPE_LSM")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200330204059.13024-1-kpsingh@chromium.org
      4edf16b7
    • Merge branch 'bpf_sk_assign' · c58b1558
      Alexei Starovoitov authored
      
      
      Joe Stringer says:
      
      ====================
      Introduce a new helper that allows assigning a previously-found socket
      to the skb as the packet is received towards the stack, to cause the
      stack to guide the packet towards that socket subject to local routing
      configuration. The intention is to support TProxy use cases more
      directly from eBPF programs attached at TC ingress, to simplify and
      streamline Linux stack configuration in scale environments with Cilium.
      
      Normally in ip{,6}_rcv_core(), the skb will be orphaned, dropping any
      existing socket reference associated with the skb. Existing tproxy
      implementations in netfilter get around this restriction by running the
      tproxy logic after ip_rcv_core() in the PREROUTING table. However, this
      is not an option for TC-based logic (including eBPF programs attached at
      TC ingress).
      
      This series introduces the BPF helper bpf_sk_assign() to associate the
      socket with the skb on the ingress path as the packet is passed up the
      stack. The initial patch in the series simply takes a reference on the
      socket to ensure safety, but later patches relax this for listen
      sockets.
      
      To ensure delivery to the relevant socket, we still consult the routing
      table; for full examples of how to configure it, see the tests in patch #5.
      The simplest form of the route would look like this:
      
        $ ip route add local default dev lo
      
      This series is laid out as follows:
      * Patch 1 extends the eBPF API to add sk_assign() and defines a new
        socket free function to allow the later paths to understand when the
        socket associated with the skb should be kept through receive.
      * Patches 2-3 optimize the receive path to avoid taking a reference on
        listener sockets during receive.
      * Patches 4-5 extend the selftests with examples of the new
        functionality and validation of correct behaviour.
      
      Changes since v4:
      * Fix build with CONFIG_INET disabled
      * Rebase
      
      Changes since v3:
      * Use sock_gen_put() directly instead of sock_edemux() from sock_pfree()
      * Commit message wording fixups
      * Add acks from Martin, Lorenz
      * Rebase
      
      Changes since v2:
      * Add selftests for UDP socket redirection
      * Drop the early demux optimization patch (defer for more testing)
      * Fix check for orphaning after TC act return
      * Tidy up the tests to clean up properly and be less noisy.
      
      Changes since v1:
      * Replace the metadata_dst approach with using the skb->destructor to
        determine whether the socket has been prefetched. This is much
        simpler.
      * Avoid taking a reference on listener sockets during receive
      * Restrict assigning sockets across namespaces
      * Restrict assigning SO_REUSEPORT sockets
      * Fix cookie usage for socket dst check
      * Rebase the tests against test_progs infrastructure
      * Tidy up commit messages
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c58b1558
    • selftests: bpf: Extend sk_assign tests for UDP · 8a02a170
      Joe Stringer authored
      
      
      Add support for testing UDP sk_assign to the existing tests.
      
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Lorenz Bauer <lmb@cloudflare.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-6-joe@wand.net.nz
      8a02a170
    • selftests: bpf: Add test for sk_assign · 2d7824ff
      Lorenz Bauer authored
      
      
      Attach a tc direct-action classifier to lo in a fresh network
      namespace, and rewrite all connection attempts to localhost:4321
      to localhost:1234 (for port tests) and connections to unreachable
      IPv4/IPv6 IPs to the local socket (for address tests). Includes
      implementations for both TCP and UDP.
      
      Keep in mind that both client-to-server and server-to-client traffic
      pass through the classifier.
      
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-5-joe@wand.net.nz
      
      Co-authored-by: Joe Stringer <joe@wand.net.nz>
      2d7824ff
    • bpf: Don't refcount LISTEN sockets in sk_assign() · 7ae215d2
      Joe Stringer authored
      
      
      Avoid taking a reference on listen sockets by checking the socket type
      in sk_assign and in the corresponding skb_steal_sock() code in the
      transport layer, and by ensuring that the prefetch free (sock_pfree)
      function uses the same logic to check whether the socket is refcounted.
      
      Suggested-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz
      7ae215d2
    • net: Track socket refcounts in skb_steal_sock() · 71489e21
      Joe Stringer authored
      
      
      Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make
      the determination of whether the socket is reference counted in the case
      where it is prefetched by earlier logic such as early_demux.
      
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-3-joe@wand.net.nz
      71489e21
    • bpf: Add socket assign support · cf7fbe66
      Joe Stringer authored
      
      
      Add support for TPROXY via a new bpf helper, bpf_sk_assign().
      
      This helper requires the BPF program to discover the socket via a call
      to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
      helper takes its own reference to the socket in addition to any existing
      reference that may or may not currently be obtained for the duration of
      BPF processing. For the destination socket to receive the traffic, the
      traffic must be routed towards that socket via a local route. The
      simplest example route is below, but in practice you may want to route
      traffic more narrowly (e.g. by CIDR):
      
        $ ip route add local default dev lo
      
      This patch avoids trying to introduce an extra bit into the skb->sk, as
      that would require more invasive changes to all code interacting with
      the socket to ensure that the bit is handled correctly, such as all
      error-handling cases along the path from the helper in BPF through to
      the orphan path in the input. Instead, we opt to use the destructor
      variable to switch on the prefetch of the socket.
      
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
      cf7fbe66
    • bpf, doc: Add John as official reviewer to BPF subsystem · b49e42a2
      Daniel Borkmann authored
      
      
      We've added John Fastabend to our weekly BPF patch review rotation over
      the last months, where he has provided excellent and timely feedback on
      BPF patches. Therefore, add him to the BPF core reviewer team in the
      MAINTAINERS file to reflect that.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/0e9a74933b3f21f4c5b5a3bc7f8e900b39805639.1585556231.git.daniel@iogearbox.net
      b49e42a2
    • bpf: btf: Fix arg verification in btf_ctx_access() · f50b49a0
      KP Singh authored
      The bounds checking for the arguments accessed in the BPF program breaks
      when the expected_attach_type is not BPF_TRACE_FEXIT, BPF_LSM_MAC or
      BPF_MODIFY_RETURN, resulting in no check being done for the default case
      (programs which do not receive the return value of the attached
      function in their arguments) when the index of the argument being accessed
      is equal to the number of arguments (nr_args).
      
      This was a result of a misplaced "else if" block introduced by
      commit 6ba43b76 ("bpf: Attachment verification for
      BPF_MODIFY_RETURN").
      
      Fixes: 6ba43b76 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330144246.338-1-kpsingh@chromium.org
      f50b49a0
    • bpf: Simplify reg_set_min_max_inv handling · 0fc31b10
      Jann Horn authored
      
      
      reg_set_min_max_inv() contains exactly the same logic as reg_set_min_max(),
      just flipped around. While this makes sense in a cBPF verifier (where ALU
      operations are not symmetric), it does not make sense for eBPF.
      
      Replace reg_set_min_max_inv() with a helper that flips the opcode around,
      then lets reg_set_min_max() do the complicated work.
      
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330160324.15259-4-daniel@iogearbox.net
      0fc31b10
    • bpf: Fix tnum constraints for 32-bit comparisons · 604dca5e
      Jann Horn authored
      The BPF verifier tried to track values based on 32-bit comparisons by
      (ab)using the tnum state via 581738a6 ("bpf: Provide better register
      bounds after jmp32 instructions"). The idea is that after a check like
      this:
      
          if ((u32)r0 > 3)
            exit
      
      We can't meaningfully constrain the arithmetic-range-based tracking, but
      we can update the tnum state to (value=0,mask=0xffff'ffff'0000'0003).
      However, the implementation from 581738a6 didn't compute the tnum
      constraint based on the fixed operand, but instead derives it from the
      arithmetic-range-based tracking. This means that after the following
      sequence of operations:
      
          if (r0 >= 0x1'0000'0001)
            exit
          if ((u32)r0 > 7)
            exit
      
      The verifier assumed that the lower half of r0 is in the range (0, 0)
      and applied the tnum constraint (value=0,mask=0xffff'ffff'0000'0000), thus
      causing the overall tnum to be (value=0,mask=0x1'0000'0000), which is
      incorrect. Provide a fixed implementation.
      
      Fixes: 581738a6 ("bpf: Provide better register bounds after jmp32 instructions")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330160324.15259-3-daniel@iogearbox.net
      604dca5e
    • bpf: Undo incorrect __reg_bound_offset32 handling · f2d67fec
      Daniel Borkmann authored
      Anatoly has been fuzzing with the kBdysch harness and reported a hang in
      one of the outcomes:
      
        0: (b7) r0 = 808464432
        1: (7f) r0 >>= r0
        2: (14) w0 -= 808464432
        3: (07) r0 += 808464432
        4: (b7) r1 = 808464432
        5: (de) if w1 s<= w0 goto pc+0
         R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0
        6: (07) r0 += -2144337872
        7: (14) w0 -= -1607454672
        8: (25) if r0 > 0x30303030 goto pc+0
         R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0
        9: (76) if w0 s>= 0x303030 goto pc+2
        12: (95) exit
      
        from 8 to 9: safe
      
        from 5 to 6: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0
        6: (07) r0 += -2144337872
        7: (14) w0 -= -1607454672
        8: (25) if r0 > 0x30303030 goto pc+0
         R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0
        9: safe
      
        from 8 to 9: safe
        verification time 589 usec
        stack depth 0
        processed 17 insns (limit 1000000) [...]
      
      The underlying program was xlated as follows:
      
        # bpftool p d x i 9
         0: (b7) r0 = 808464432
         1: (7f) r0 >>= r0
         2: (14) w0 -= 808464432
         3: (07) r0 += 808464432
         4: (b7) r1 = 808464432
         5: (de) if w1 s<= w0 goto pc+0
         6: (07) r0 += -2144337872
         7: (14) w0 -= -1607454672
         8: (25) if r0 > 0x30303030 goto pc+0
         9: (76) if w0 s>= 0x303030 goto pc+2
        10: (05) goto pc-1
        11: (05) goto pc-1
        12: (95) exit
      
      The verifier rewrote original instructions it recognized as dead code with
      'goto pc-1', but reality differs from verifier simulation in that we're
      actually able to trigger a hang due to hitting the 'goto pc-1' instructions.
      
      Taking a different example to make the issue more obvious: here we're
      probing bounds on a completely unknown scalar variable in r1:
      
        [...]
        5: R0_w=inv1 R1_w=inv(id=0) R10=fp0
        5: (18) r2 = 0x4000000000
        7: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R10=fp0
        7: (18) r3 = 0x2000000000
        9: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R10=fp0
        9: (18) r4 = 0x400
        11: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R10=fp0
        11: (18) r5 = 0x200
        13: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        13: (2d) if r1 > r2 goto pc+4
         R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        14: R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        14: (ad) if r1 < r3 goto pc+3
         R0_w=inv1 R1_w=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        15: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        15: (2e) if w1 > w4 goto pc+2
         R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        16: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        16: (ae) if w1 < w5 goto pc+1
         R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        [...]
      
      We're first probing lower/upper bounds via jmp64, later we do a similar
      check via jmp32 and examine the resulting var_off there. After fall-through
      in insn 14, we get the following bounded r1 with 0x7fffffffff unknown marked
      bits in the variable section.
      
      Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000:
      
        max: 0b100000000000000000000000000000000000000 / 0x4000000000
        var: 0b111111111111111111111111111111111111111 / 0x7fffffffff
        min: 0b010000000000000000000000000000000000000 / 0x2000000000
      
      Now, in insn 15 and 16, we perform a similar probe with lower/upper bounds
      in jmp32.
      
      Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and
                          w1 <= 0x400        and w1 >= 0x200:
      
        max: 0b100000000000000000000000000000000000000 / 0x4000000000
        var: 0b111111100000000000000000000000000000000 / 0x7f00000000
        min: 0b010000000000000000000000000000000000000 / 0x2000000000
      
      The lower/upper bounds haven't changed since they have high bits set in
      u64 space and the jmp32 tests can only refine bounds in the low bits.
      
      However, for the var part the expectation would have been 0x7f000007ff
      or something less precise, up to 0x7fffffffff. An outcome of 0x7f00000000
      is not correct since it would contradict the earlier probed bounds,
      where we know that the result should have been in [0x200,0x400] in u32
      space. Therefore, tests with such info will lead to wrong verifier
      assumptions later on, like falsely predicting conditional jumps to be
      always taken, etc.
      
      The issue here is that __reg_bound_offset32()'s implementation from
      commit 581738a6 ("bpf: Provide better register bounds after jmp32
      instructions") makes an incorrect range assumption:
      
        static void __reg_bound_offset32(struct bpf_reg_state *reg)
        {
              u64 mask = 0xffffFFFF;
              struct tnum range = tnum_range(reg->umin_value & mask,
                                             reg->umax_value & mask);
              struct tnum lo32 = tnum_cast(reg->var_off, 4);
              struct tnum hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32);
      
              reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range));
        }
      
      In the above walk-through example, __reg_bound_offset32() as-is chose
      a range after masking with 0xffffffff of [0x0,0x0], since umin:0x2000000000
      and umax:0x4000000000, and therefore the lo32 part was clamped to 0x0 as
      well. However, for the umin:0x2000000000 and umax:0x4000000000 range above
      we'd actually end up with a possible interval of [0x0,0xffffffff] in u32
      space instead.
      
      In the case of the original reproducer, the situation looked as follows
      at insn 5 for r0:
      
        [...]
        5: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0
                                     0x30303030           0x13030302f
        5: (de) if w1 s<= w0 goto pc+0
         R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020; 0x10000001f)) R1_w=invP808464432 R10=fp0
                                   0x30303030           0x13030302f
        [...]
      
      After the fall-through, we similarly forced the var_off result into
      the wrong range [0x30303030,0x3030302f], suggesting later on that the
      fixed bits must be 0x30303020 with 0x10000001f unknowns, whereas such an
      assumption can only be made when both bounds in the hi32 range match.
      
      Originally, I was thinking to fix this by moving reg into a temp reg and
      using the proper coerce_reg_to_size() helper on the temp reg, where we
      can then, based on that, define the range tnum for later intersection:
      
        static void __reg_bound_offset32(struct bpf_reg_state *reg)
        {
              struct bpf_reg_state tmp = *reg;
              struct tnum lo32, hi32, range;
      
              coerce_reg_to_size(&tmp, 4);
              range = tnum_range(tmp.umin_value, tmp.umax_value);
              lo32 = tnum_cast(reg->var_off, 4);
              hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32);
              reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range));
        }
      
      In the case of the concrete example, this gives us a more conservative unknown
      section. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and
                                   w1 <= 0x400        and w1 >= 0x200:
      
        max: 0b100000000000000000000000000000000000000 / 0x4000000000
        var: 0b111111111111111111111111111111111111111 / 0x7fffffffff
        min: 0b010000000000000000000000000000000000000 / 0x2000000000
      
      However, the above new __reg_bound_offset32() has no effect on refining
      the knowledge of the register contents. Meaning, if the bounds in the
      hi32 range mismatch, we get the identity function, given the range reg
      spans [0x0,0xffffffff] and we cast var_off into lo32 only to later
      binary-or it again with the hi32.
      
      Likewise, if the bounds in the hi32 range match, then we mask both
      bounds with 0xffffffff and use the resulting umin/umax as the range to
      later intersect the lo32 with. However, the previously called
      __reg_bound_offset() already performed that intersection on the full
      reg, so we would only repeat the same operation on the lo32 part.
      
      Given this has no effect and the original commit rested on false
      assumptions, this patch reverts the code entirely, which is also more
      straightforward for stable trees: apparently 581738a6 got auto-selected
      by Sasha's ML system and was misclassified as a fix, so it got sucked
      into v5.4 where it should never have landed. A revert is also low-risk
      from a user PoV, since it requires a recent kernel and llc opting into
      the -mcpu=v3 BPF CPU to generate jmp32 instructions. A proper bounds
      refinement would need a significantly more complex approach, which is
      currently being worked on, but is no stable material [0]. Hence the
      revert is the best option for stable. After the revert, the originally
      reported program gets rejected as follows:
      
        1: (7f) r0 >>= r0
        2: (14) w0 -= 808464432
        3: (07) r0 += 808464432
        4: (b7) r1 = 808464432
        5: (de) if w1 s<= w0 goto pc+0
         R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0
        6: (07) r0 += -2144337872
        7: (14) w0 -= -1607454672
        8: (25) if r0 > 0x30303030 goto pc+0
         R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x3fffffff)) R1_w=invP808464432 R10=fp0
        9: (76) if w0 s>= 0x303030 goto pc+2
         R0=invP(id=0,umax_value=3158063,var_off=(0x0; 0x3fffff)) R1=invP808464432 R10=fp0
        10: (30) r0 = *(u8 *)skb[808464432]
        BPF_LD_[ABS|IND] uses reserved fields
        processed 11 insns (limit 1000000) [...]
      
        [0] https://lore.kernel.org/bpf/158507130343.15666.8018068546764556975.stgit@john-Precision-5820-Tower/T/
      
      Fixes: 581738a6 ("bpf: Provide better register bounds after jmp32 instructions")
      Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330160324.15259-2-daniel@iogearbox.net
      f2d67fec
  2. Mar 30, 2020
    • Merge branch 'bpf-lsm' · 641cd7b0
      Daniel Borkmann authored
      
      
      KP Singh says:
      
      ====================
      ** Motivation
      
      Google does analysis of rich runtime security data to detect and thwart
      threats in real-time. Currently, this is done in custom kernel modules
      but we would like to replace this with something that's upstream and
      useful to others.
      
      The current kernel infrastructure for providing telemetry (Audit, Perf
      etc.) is disjoint from access enforcement (i.e. LSMs).  Augmenting the
      information provided by audit requires kernel changes to audit, its
      policy language and user-space components. Furthermore, building a MAC
      policy based on the newly added telemetry data requires changes to
      various LSMs and their respective policy languages.
      
      This patchset allows BPF programs to be attached to LSM hooks. This
      facilitates a unified and dynamic audit and MAC policy that does not
      require re-compilation of the kernel.
      
      ** Why an LSM?
      
      Linux Security Modules target security behaviours rather than the
      kernel's API. For example, it's easy to miss a newly added system
      call for executing processes (e.g. execve, execveat, etc.), but the
      LSM framework ensures that all process executions trigger the
      relevant hooks irrespective of how the process was executed.
      
      Allowing users to implement LSM hooks at runtime also benefits the LSM
      eco-system by enabling a quick feedback loop from the security community
      about the kind of behaviours that the LSM Framework should be targeting.
      
      ** How does it work?
      
      The patchset introduces a new eBPF (https://docs.cilium.io/en/v1.6/bpf/)
      program type BPF_PROG_TYPE_LSM which can only be attached to LSM hooks.
      Loading and attachment of BPF programs requires CAP_SYS_ADMIN.
      
      The new LSM registers nop functions (bpf_lsm_<hook_name>) as LSM hook
      callbacks. Their purpose is to provide a definite point where BPF
      programs can be attached as BPF_TRAMP_MODIFY_RETURN trampoline programs
      for hooks that return an int, and BPF_TRAMP_FEXIT trampoline programs
      for void LSM hooks.
      
      Audit logs can be written using a format chosen by the eBPF program to
      the perf events buffer or to global eBPF variables or maps and can be
      further processed in user-space.
      
      ** BTF Based Design
      
      The current design uses BTF:
      
        * https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html
        * https://lwn.net/Articles/803258
      
      which allows verifiable read-only structure accesses by field names
      rather than fixed offsets. This allows accessing the hook parameters
      using a dynamically created context which provides a certain degree of
      ABI stability:
      
        // Only declare the structure and fields intended to be used
        // in the program
        struct vm_area_struct {
          unsigned long vm_start;
        } __attribute__((preserve_access_index));
      
        // Declare the eBPF program mprotect_audit which attaches
        // to the file_mprotect LSM hook and accepts three arguments.
        SEC("lsm/file_mprotect")
        int BPF_PROG(mprotect_audit, struct vm_area_struct *vma,
               unsigned long reqprot, unsigned long prot, int ret)
        {
          unsigned long vm_start = vma->vm_start;
          return 0;
        }
      
      By relocating field offsets, BTF makes a large portion of kernel data
      structures readily accessible across kernel versions without requiring
      a large corpus of BPF helper functions or recompilation with every
      kernel version. The BTF type information is also used by the BPF
      verifier to validate memory accesses within the BPF program and to
      prevent arbitrary writes to kernel memory.
      
      The limitations of BTF compatibility are described in BPF CO-RE
      (http://vger.kernel.org/bpfconf2019_talks/bpf-core.pdf), i.e. field
      renames, #defines and changes to the signature of LSM hooks. This
      design imposes that the MAC policy (eBPF programs) be updated when the
      inspected kernel structures change outside of BTF compatibility
      guarantees. In practice, this is only required when a structure field
      used by a current policy is removed (or renamed) or when the used LSM
      hooks change. We expect the maintenance cost of these changes to be
      acceptable compared to the design presented in the RFC
      (https://lore.kernel.org/bpf/20190910115527.5235-1-kpsingh@chromium.org/).
      
      ** Usage Examples
      
      A simple example and some documentation is included in the patchset.
      In order to better illustrate the capabilities of the framework some
      more advanced prototype (not-ready for review) code has also been
      published separately:
      
      * Logging execution events (including environment variables and
        arguments)
        https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_audit_env.c
      
      * Detecting deletion of running executables:
        https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_detect_exec_unlink.c
      
      * Detection of writes to /proc/<pid>/mem:
        https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_audit_env.c
      
      We have updated Google's internal telemetry infrastructure and have
      started deploying this LSM on our Linux Workstations. This gives us more
      confidence in the real-world applications of such a system.
      
      ** Changelog:
      
      - v8 -> v9:
        https://lore.kernel.org/bpf/20200327192854.31150-1-kpsingh@chromium.org/
      * Fixed a selftest crash when CONFIG_LSM doesn't have "bpf".
      * Added James' Ack.
      * Rebase.
      
      - v7 -> v8:
        https://lore.kernel.org/bpf/20200326142823.26277-1-kpsingh@chromium.org/
      * Removed CAP_MAC_ADMIN check from bpf_lsm_verify_prog. LSMs can add it
        in their own bpf_prog hook. This can be revisited as a separate patch.
      * Added Andrii and James' Ack/Review tags.
      * Fixed an indentation issue and missing newlines in selftest error
        cases.
      * Updated a comment as suggested by Alexei.
      * Updated the documentation to use the newer libbpf API and some other
        fixes.
      * Rebase
      
      - v6 -> v7:
        https://lore.kernel.org/bpf/20200325152629.6904-1-kpsingh@chromium.org/
      * Removed __weak from the LSM attachment nops per Kees' suggestion.
        Will send a separate patch (if needed) to update the noinline
        definition in include/linux/compiler_attributes.h.
      * waitpid to wait specifically for the forked child in selftests.
      * Comment format fixes in security/... as suggested by Casey.
      * Added Acks from Kees and Andrii and Casey's Reviewed-by: tags to
        the respective patches.
      * Rebase
      
      - v5 -> v6:
        https://lore.kernel.org/bpf/20200323164415.12943-1-kpsingh@chromium.org/
      * Updated LSM_HOOK macro to define a default value and cleaned up the
        BPF LSM hook declarations.
      * Added Yonghong's Acks and Kees' Reviewed-by tags.
      * Simplification of the selftest code.
      * Rebase and fixes suggested by Andrii and Yonghong and some other minor
        fixes noticed in internal review.
      
      - v4 -> v5:
        https://lore.kernel.org/bpf/20200220175250.10795-1-kpsingh@chromium.org/
      * Removed static keys and special casing of BPF calls from the LSM
        framework.
      * Initialized the BPF callbacks (nops) as proper LSM hooks.
      * Updated to use the newly introduced BPF_TRAMP_MODIFY_RETURN
        trampolines in https://lkml.org/lkml/2020/3/4/877
      * Addressed Andrii's feedback and rebased.
      
      - v3 -> v4:
      * Moved away from allocating a separate security_hook_heads and adding a
        new special case for arch_prepare_bpf_trampoline to using BPF fexit
        trampolines called from the right place in the LSM hook and toggled by
        static keys based on the discussion in:
        https://lore.kernel.org/bpf/CAG48ez25mW+_oCxgCtbiGMX07g_ph79UOJa07h=o_6B6+Q-u5g@mail.gmail.com/
      * Since the code does not deal with security_hook_heads anymore, it goes
        from "being a BPF LSM" to "BPF program attachment to LSM hooks".
      * Added a new test case which ensures that the BPF programs' return value
        is reflected by the LSM hook.
      
      - v2 -> v3 does not change the overall design and has some minor fixes:
      * LSM_ORDER_LAST is introduced to represent the behaviour of the BPF LSM
      * Fixed the inadvertent clobbering of the LSM Hook error codes
      * Added GPL license requirement to the commit log
      * The lsm_hook_idx is now the more conventional 0-based index
      * Some changes were split into a separate patch ("Load btf_vmlinux only
        once per object")
        https://lore.kernel.org/bpf/20200117212825.11755-1-kpsingh@chromium.org/
      * Addressed Andrii's feedback on the BTF implementation
      * Documentation update for using generated vmlinux.h to simplify
        programs
      * Rebase
      
      - Changes since v1:
        https://lore.kernel.org/bpf/20191220154208.15895-1-kpsingh@chromium.org
      * Eliminate the requirement to maintain LSM hooks separately in
        security/bpf/hooks.h. Use BPF trampolines to dynamically allocate
        security hooks.
      * Drop the use of securityfs as bpftool provides the required
        introspection capabilities.  Update the tests to use the bpf_skeleton
        and global variables
      * Use O_CLOEXEC anonymous fds to represent BPF attachment in line with
        the other BPF programs with the possibility to use bpf program pinning
        in the future to provide "permanent attachment".
      * Drop the logic based on prog names for handling re-attachment.
      * Drop bpf_lsm_event_output from this series and send it as a separate
        patch.
      ====================
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: lsm: Add Documentation · 4dece7f3
      KP Singh authored
      
      
      Document how eBPF programs (BPF_PROG_TYPE_LSM) can be loaded and
      attached (BPF_LSM_MAC) to the LSM hooks.
      
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Brendan Jackman <jackmanb@google.com>
      Reviewed-by: Florent Revest <revest@google.com>
      Reviewed-by: Thomas Garnier <thgarnie@google.com>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Link: https://lore.kernel.org/bpf/20200329004356.27286-9-kpsingh@chromium.org
    • bpf: lsm: Add selftests for BPF_PROG_TYPE_LSM · 03e54f10
      KP Singh authored
      
      
      * Load/attach a BPF program that hooks to file_mprotect (int)
        and bprm_committed_creds (void).
      * Perform an action that triggers the hook.
      * Verify that the audit event was received using the shared global
        variables for the process executed.
      * Verify that mprotect() returns -EPERM.
      
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Brendan Jackman <jackmanb@google.com>
      Reviewed-by: Florent Revest <revest@google.com>
      Reviewed-by: Thomas Garnier <thgarnie@google.com>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200329004356.27286-8-kpsingh@chromium.org
    • tools/libbpf: Add support for BPF_PROG_TYPE_LSM · 1e092a03
      KP Singh authored
      
      
      Since BPF_PROG_TYPE_LSM uses the same attaching mechanism as
      BPF_PROG_TYPE_TRACING, the common logic is refactored into a static
      function bpf_program__attach_btf_id.
      
      A new API call bpf_program__attach_lsm is still added to avoid userspace
      conflicts if this ever changes in the future.
      
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Brendan Jackman <jackmanb@google.com>
      Reviewed-by: Florent Revest <revest@google.com>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200329004356.27286-7-kpsingh@chromium.org
    • bpf: lsm: Initialize the BPF LSM hooks · 520b7aa0
      KP Singh authored
      
      
      * The hooks are initialized using the definitions in
        include/linux/lsm_hook_defs.h.
      * The LSM can be enabled / disabled with CONFIG_BPF_LSM.
      
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Brendan Jackman <jackmanb@google.com>
      Reviewed-by: Florent Revest <revest@google.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Link: https://lore.kernel.org/bpf/20200329004356.27286-6-kpsingh@chromium.org
    • bpf: lsm: Implement attach, detach and execution · 9e4e01df
      KP Singh authored
      
      
      JITed BPF programs are dynamically attached to the LSM hooks
      using BPF trampolines. The trampoline prologue generates code to handle
      conversion of the signature of the hook to the appropriate BPF context.
      
      The allocated trampoline programs are attached to the nop functions
      initialized as LSM hooks.
      
      BPF_PROG_TYPE_LSM programs must have a GPL compatible license and
      need CAP_SYS_ADMIN (required for loading eBPF programs).
      
      Upon attachment:
      
      * A BPF fexit trampoline is used for LSM hooks with a void return type.
      * A BPF fmod_ret trampoline is used for LSM hooks which return an
        int. The attached programs can override the return value of the
        bpf LSM hook to indicate a MAC Policy decision.
      
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Brendan Jackman <jackmanb@google.com>
      Reviewed-by: Florent Revest <revest@google.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Link: https://lore.kernel.org/bpf/20200329004356.27286-5-kpsingh@chromium.org
    • bpf: lsm: Provide attachment points for BPF LSM programs · 9d3fdea7
      KP Singh authored
      
      
      When CONFIG_BPF_LSM is enabled, nop functions, bpf_lsm_<hook_name>, are
      generated for each LSM hook. These functions are initialized as LSM
      hooks in a subsequent patch.
      
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Brendan Jackman <jackmanb@google.com>
      Reviewed-by: Florent Revest <revest@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Link: https://lore.kernel.org/bpf/20200329004356.27286-4-kpsingh@chromium.org
    • security: Refactor declaration of LSM hooks · 98e828a0
      KP Singh authored
      
      
      The information about the different types of LSM hooks is scattered
      across two locations, i.e. union security_list_options and
      struct security_hook_heads. Rather than duplicating this information
      even further for BPF_PROG_TYPE_LSM, define all the hooks with the
      LSM_HOOK macro in lsm_hook_defs.h, which is then used to generate all
      the data structures required by the LSM framework.
      
      The LSM hooks are defined as:
      
        LSM_HOOK(<return_type>, <default_value>, <hook_name>, args...)
      
      with <default_value> accessible in security.c as:
      
        LSM_RET_DEFAULT(<hook_name>)
      
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Brendan Jackman <jackmanb@google.com>
      Reviewed-by: Florent Revest <revest@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Link: https://lore.kernel.org/bpf/20200329004356.27286-3-kpsingh@chromium.org
    • bpf: Introduce BPF_PROG_TYPE_LSM · fc611f47
      KP Singh authored
      
      
      Introduce types and configs for bpf programs that can be attached to
      LSM hooks. The programs can be enabled by the config option
      CONFIG_BPF_LSM.
      
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Brendan Jackman <jackmanb@google.com>
      Reviewed-by: Florent Revest <revest@google.com>
      Reviewed-by: Thomas Garnier <thgarnie@google.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Link: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org
    • selftests: Add test for overriding global data value before load · e5fb60ee
      Toke Høiland-Jørgensen authored
      
      
      This adds a test to exercise the new bpf_map__set_initial_value() function.
      The test simply overrides the global data section with all zeroes, and
      checks that the new value makes it into the kernel map on load.
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200329132253.232541-2-toke@redhat.com
    • libbpf: Add setter for initial value for internal maps · e2842be5
      Toke Høiland-Jørgensen authored
      
      
      For internal maps (most notably the maps backing global variables), libbpf
      uses an internal mmaped area to store the data after opening the object.
      This data is subsequently copied into the kernel map when the object is
      loaded.
      
      This adds a function to set a new value for that data before it is
      loaded into the kernel. This is especially relevant for RODATA maps,
      since those are frozen on load.
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200329132253.232541-1-toke@redhat.com
    • bpf, net: Fix build issue when net ns not configured · 5a95cbb8
      Daniel Borkmann authored
      Fix a redefinition of 'net_gen_cookie' error that was overlooked
      when net ns is not configured.
      
      Fixes: f318903c ("bpf: Add netns cookie and enable it for bpf cgroup hooks")
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. Mar 29, 2020
  4. Mar 28, 2020
    • Merge branch 'cgroup-helpers' · 2cf69d3f
      Alexei Starovoitov authored
      
      
      Daniel Borkmann says:
      
      ====================
      This adds various straightforward helper improvements and additions to
      the BPF cgroup based connect(), sendmsg(), recvmsg() and bind-related
      hooks, which allow implementing more fine-grained policies and address
      current load balancer limitations we're seeing. For details please see
      the individual patches. I've tested them on Kubernetes & Cilium and
      also added selftests for the small verifier extension. Thanks!
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add selftest cases for ctx_or_null argument type · 23599ada
      Daniel Borkmann authored
      
      
      Add various tests to make sure the verifier keeps catching these cases:
      
        # ./test_verifier
        [...]
        #230/p pass ctx or null check, 1: ctx OK
        #231/p pass ctx or null check, 2: null OK
        #232/p pass ctx or null check, 3: 1 OK
        #233/p pass ctx or null check, 4: ctx - const OK
        #234/p pass ctx or null check, 5: null (connect) OK
        #235/p pass ctx or null check, 6: null (bind) OK
        #236/p pass ctx or null check, 7: ctx (bind) OK
        #237/p pass ctx or null check, 8: null (bind) OK
        [...]
        Summary: 1595 PASSED, 0 SKIPPED, 0 FAILED
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c74758d07b1b678036465ef7f068a49e9efd3548.1585323121.git.daniel@iogearbox.net
    • bpf: Enable retrieval of pid/tgid/comm from bpf cgroup hooks · 834ebca8
      Daniel Borkmann authored
      
      
      We already have the bpf_get_current_uid_gid() helper enabled, and
      given we now have perf event RB output available for connect(),
      sendmsg(), recvmsg() and bind-related hooks, add a trivial change
      to enable bpf_get_current_pid_tgid() and bpf_get_current_comm()
      as well.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/18744744ed93c06343be8b41edcfd858706f39d7.1585323121.git.daniel@iogearbox.net
    • bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id · 0f09abd1
      Daniel Borkmann authored
      Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
      recvmsg() and bind-related hooks in order to retrieve the cgroup v2
      context which can then be used as part of the key for BPF map lookups,
      for example. Given these hooks operate in process context, 'current' is
      always valid and points to the app that is performing the mentioned
      syscalls if it's subject to a v2 cgroup. Also, with the same motivation
      as commit 77236281 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper"),
      enable retrieval of the ancestor from current so the cgroup id can be
      used for policy lookups, which can then forbid connect() / bind(), for
      example.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
    • bpf: Allow to retrieve cgroup v1 classid from v2 hooks · 5a52ae4e
      Daniel Borkmann authored
      
      
      Today, Kubernetes is still operating on cgroups v1. However, in a
      mixed environment, orchestrators like Cilium that attach to the root
      cgroup v2 hook can retrieve the task's classid based on 'current' out
      of the connect(), sendmsg(), recvmsg() and bind-related hooks, in
      order to correlate certain pod traffic and use it as part of the key
      for BPF map lookups.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/555e1c69db7376c0947007b4951c260e1074efc3.1585323121.git.daniel@iogearbox.net
    • bpf: Add netns cookie and enable it for bpf cgroup hooks · f318903c
      Daniel Borkmann authored
      
      
      In Cilium we're mainly using BPF cgroup hooks today in order to implement
      kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
      ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
      between Cilium managed nodes. While this works in its current shape and avoids
      packet-level NAT for inter Cilium managed node traffic, there is one major
      limitation we're facing today, that is, lack of netns awareness.
      
      In Kubernetes, the concept of Pods (which hold one or multiple containers)
      has been built around network namespaces, so while we can also use the
      global scope of attaching to root BPF cgroup hooks to our advantage (e.g.
      for exposing NodePort ports on loopback addresses), we also need to
      differentiate between initial network namespaces and non-initial ones.
      For example, ExternalIP services mandate that non-local service IPs are
      not to be translated from the host (initial) network namespace. Right
      now, we have an ugly work-around in place where non-local service IPs for
      ExternalIP services are not xlated from connect() and friends BPF hooks,
      but instead via less efficient packet-level NAT on the veth tc ingress
      hook for Pod traffic.
      
      On top of determining whether we're in the initial or a non-initial
      network namespace, we also need a socket-cookie-like mechanism at
      network namespace scope. Socket cookies have the nice property that
      they can be combined as part of the key structure, e.g. for BPF LRU
      maps, without having to worry that the cookie could be recycled. We
      are planning to use this for our sessionAffinity implementation for
      services. Therefore, add a new bpf_get_netns_cookie() helper which
      resolves both use cases at once: bpf_get_netns_cookie(NULL) provides
      the cookie for the initial network namespace, while passing the context
      instead of NULL provides the cookie from the application's network
      namespace. We're using a hole, so there is no size increase; the
      assignment happens only once. This therefore allows for a comparison on
      the initial namespace as well as regular cookie usage as we have today
      with socket cookies. We could later enable this helper for other
      program types as well as the need arises.
      
        (*) Both externalTrafficPolicy={Local|Cluster} types
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
    • bpf: Enable perf event rb output for bpf cgroup progs · fcf752ea
      Daniel Borkmann authored
      Currently, connect(), sendmsg(), recvmsg() and bind-related hooks
      are all lacking perf event RB output in order to push notifications
      or monitoring events up to user space. Back in commit a5a3a828
      ("bpf: add perf event notificaton support for sock_ops"), I worked
      with Sowmini to enable them for sock_ops where the context part is
      not used (as opposed to skbs, for example, where the packet data can
      be appended). Make the bpf_sockopt_event_output() helper generic and
      enable it for the mentioned hooks.
      
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/69c39daf87e076b31e52473c902e9bfd37559124.1585323121.git.daniel@iogearbox.net
      fcf752ea