Skip to content
  1. May 02, 2018
  2. Apr 27, 2018
    • John Fastabend's avatar
      bpf: fix uninitialized variable in bpf tools · 81542556
      John Fastabend authored
      Here the variable cont is used as the saved_pointer for a call to
      strtok_r(). It is safe to use the value uninitialized in this
      context however and the later reference is only ever used if
      the strtok_r is successful. But, 'gcc-5' at least doesn't have all
      this knowledge so initialize cont to NULL. Additionally, do the
      natural NULL check before accessing just for completness.
      
      The warning is the following:
      
      ./bpf/tools/bpf/bpf_dbg.c: In function ‘cmd_load’:
      ./bpf/tools/bpf/bpf_dbg.c:1077:13: warning: ‘cont’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        } else if (matches(subcmd, "pcap") == 0) {
      
      Fixes: fd981e3c
      
       "filter: bpf_dbg: add minimal bpf debugger"
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      81542556
  3. Apr 26, 2018
  4. Apr 25, 2018
    • Gianluca Borello's avatar
      bpf, x64: fix JIT emission for dead code · 1612a981
      Gianluca Borello authored
      Commit 2a5418a1 ("bpf: improve dead code sanitizing") replaced dead
      code with a series of ja-1 instructions, for safety. That made JIT
      compilation much more complex for some BPF programs. One instance of such
      programs is, for example:
      
      bool flag = false
      ...
      /* A bunch of other code */
      ...
      if (flag)
              do_something()
      
      In some cases llvm is not able to remove at compile time the code for
      do_something(), so the generated BPF program ends up with a large amount
      of dead instructions. In one specific real life example, there are two
      series of ~500 and ~1000 dead instructions in the program. When the
      verifier replaces them with a series of ja-1 instructions, it causes an
      interesting behavior at JIT time.
      
      During the first pass, since all the instructions are estimated at 64
      bytes, the ja-1 instructions end up being translated as 5 bytes JMP
      instructions (0xE9), since the jump offsets become increasingly large (>
      127) as each instruction gets discovered to be 5 bytes instead of the
      estimated 64.
      
      Starting from the second pass, the first N instructions of the ja-1
      sequence get translated into 2 bytes JMPs (0xEB) because the jump offsets
      become <= 127 this time. In particular, N is defined as roughly 127 / (5
      - 2) ~= 42. So, each further pass will make the subsequent N JMP
      instructions shrink from 5 to 2 bytes, making the image shrink every time.
      This means that in order to have the entire program converge, there need
      to be, in the real example above, at least ~1000 / 42 ~= 24 passes just
      for translating the dead code. If we add this number to the passes needed
      to translate the other non dead code, it brings such program to 40+
      passes, and JIT doesn't complete. Ultimately the userspace loader fails
      because such BPF program was supposed to be part of a prog array owner
      being JITed.
      
      While it is certainly possible to try to refactor such programs to help
      the compiler remove dead code, the behavior is not really intuitive and it
      puts further burden on the BPF developer who is not expecting such
      behavior. To make things worse, such programs are working just fine in all
      the kernel releases prior to the ja-1 fix.
      
      A possible approach to mitigate this behavior consists into noticing that
      for ja-1 instructions we don't really need to rely on the estimated size
      of the previous and current instructions, we know that a -1 BPF jump
      offset can be safely translated into a 0xEB instruction with a jump offset
      of -2.
      
      Such fix brings the BPF program in the previous example to complete again
      in ~9 passes.
      
      Fixes: 2a5418a1
      
       ("bpf: improve dead code sanitizing")
      Signed-off-by: default avatarGianluca Borello <g.borello@gmail.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      1612a981
    • William Tu's avatar
      bpf: clear the ip_tunnel_info. · 5540fbf4
      William Tu authored
      
      
      The percpu metadata_dst might carry the stale ip_tunnel_info
      and cause incorrect behavior.  When mixing tests using ipv4/ipv6
      bpf vxlan and geneve tunnel, the ipv6 tunnel info incorrectly uses
      ipv4's src ip addr as its ipv6 src address, because the previous
      tunnel info does not clean up.  The patch zeros the fields in
      ip_tunnel_info.
      
      Signed-off-by: default avatarWilliam Tu <u9012063@gmail.com>
      Reported-by: default avatarYifeng Sun <pkusunyifeng@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      5540fbf4
    • Linus Torvalds's avatar
      Merge branch 'userns-linus' of... · 3be4aaf4
      Linus Torvalds authored
      Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
      
      Pull userns bug fix from Eric Biederman:
       "Just a small fix to properly set the return code on error"
      
      * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
        commoncap: Handle memory allocation failure.
      3be4aaf4
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 24cac700
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix rtnl deadlock in ipvs, from Julian Anastasov.
      
       2) s390 qeth fixes from Julian Wiedmann (control IO completion stalls,
          bad MAC address update sequence, request side races on command IO
          timeouts).
      
       3) Handle seq_file overflow properly in l2tp, from Guillaume Nault.
      
       4) Fix VLAN priority mappings in cpsw driver, from Ivan Khoronzhuk.
      
       5) Packet scheduler ife action fixes (malformed TLV lengths, etc.) from
          Alexander Aring.
      
       6) Fix out of bounds access in tcp md5 option parser, from Jann Horn.
      
       7) Missing netlink attribute policies in rtm_ipv6_policy table, from
          Eric Dumazet.
      
       8) Missing socket address length checks in l2tp and pppoe connect, from
          Guillaume Nault.
      
       9) Fix netconsole over team and bonding, from Xin Long.
      
      10) Fix race with AF_PACKET socket state bitfields, from Willem de
          Bruijn.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (51 commits)
        ice: Fix insufficient memory issue in ice_aq_manage_mac_read
        sfc: ARFS filter IDs
        net: ethtool: Add missing kernel doc for FEC parameters
        packet: fix bitfield update race
        ice: Do not check INTEVENT bit for OICR interrupts
        ice: Fix incorrect comment for action type
        ice: Fix initialization for num_nodes_added
        igb: Fix the transmission mode of queue 0 for Qav mode
        ixgbevf: ensure xdp_ring resources are free'd on error exit
        team: fix netconsole setup over team
        amd-xgbe: Only use the SFP supported transceiver signals
        amd-xgbe: Improve KR auto-negotiation and training
        amd-xgbe: Add pre/post auto-negotiation phy hooks
        pppoe: check sockaddr length in pppoe_connect()
        l2tp: check sockaddr length in pppol2tp_connect()
        net: phy: marvell: clear wol event before setting it
        ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy
        bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave
        tcp: don't read out-of-bounds opsize
        ibmvnic: Clean actual number of RX or TX pools
        ...
      24cac700
    • David S. Miller's avatar
      Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue · d19efb72
      David S. Miller authored
      
      
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2018-04-24
      
      This series contains fixes to ixgbevf, igb and ice drivers.
      
      Colin Ian King fixes the return value on error for the new XDP support
      that went into ixgbevf for 4.17.
      
      Vinicius provides a fix for queue 0 for igb, which was not receiving all
      the credits it needed when QAV mode was enabled.
      
      Anirudh provides several fixes for the new ice driver, starting with
      properly initializing num_nodes_added to zero.  Fixed up a code comment
      to better reflect what is really going on in the code.  Fixed how to
      detect if an OICR interrupt has occurred to a more reliable method.
      
      Md Fahad fixes the ice driver to allocate the right amount of memory
      when reading and storing the devices MAC addresses.  The device can have
      up to 2 MAC addresses (LAN and WoL), while WoL is currently not
      supported, we need to ensure it can be properly handled when support is
      added.
      ====================
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d19efb72
    • Md Fahad Iqbal Polash's avatar
      ice: Fix insufficient memory issue in ice_aq_manage_mac_read · d6fef10c
      Md Fahad Iqbal Polash authored
      For the MAC read operation, the device can return up to two (LAN and WoL)
      MAC addresses. Without access to adequate memory, the device will return
      an error. Fixed this by allocating the right amount of memory. Also, logic
      to detect and copy the LAN MAC address into the port_info structure has
      been added. Note that the WoL MAC address is ignored currently as the WoL
      feature isn't supported yet.
      
      Fixes: dc49c772
      
       ("ice: Get MAC/PHY/link info and scheduler topology")
      Signed-off-by: default avatarMd Fahad Iqbal Polash <md.fahad.iqbal.polash@intel.com>
      Signed-off-by: default avatarAnirudh Venkataramanan <anirudh.venkataramanan@intel.com>
      Tested-by: default avatarTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      d6fef10c
    • Edward Cree's avatar
      sfc: ARFS filter IDs · f8d62037
      Edward Cree authored
      Associate an arbitrary ID with each ARFS filter, allowing to properly query
       for expiry.  The association is maintained in a hash table, which is
       protected by a spinlock.
      
      v3: fix build warnings when CONFIG_RFS_ACCEL is disabled (thanks lkp-robot).
      v2: fixed uninitialised variable (thanks davem and lkp-robot).
      
      Fixes: 3af0f342
      
       ("sfc: replace asynchronous filter operations")
      Signed-off-by: default avatarEdward Cree <ecree@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f8d62037
    • Florian Fainelli's avatar
      net: ethtool: Add missing kernel doc for FEC parameters · d805c520
      Florian Fainelli authored
      While adding support for ethtool::get_fecparam and set_fecparam, kernel
      doc for these functions was missed, add those.
      
      Fixes: 1a5f3da2
      
       ("net: ethtool: add support for forward error correction modes")
      Signed-off-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Acked-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d805c520
    • Willem de Bruijn's avatar
      packet: fix bitfield update race · a6361f0c
      Willem de Bruijn authored
      Updates to the bitfields in struct packet_sock are not atomic.
      Serialize these read-modify-write cycles.
      
      Move po->running into a separate variable. Its writes are protected by
      po->bind_lock (except for one startup case at packet_create). Also
      replace a textual precondition warning with lockdep annotation.
      
      All others are set only in packet_setsockopt. Serialize these
      updates by holding the socket lock. Analogous to other field updates,
      also hold the lock when testing whether a ring is active (pg_vec).
      
      Fixes: 8dc41944
      
       ("[PACKET]: Add optional checksum computation for recvmsg")
      Reported-by: default avatarDaeRyong Jeong <threeearcat@gmail.com>
      Reported-by: default avatarByoungyoung Lee <byoungyoung@purdue.edu>
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a6361f0c
    • Ben Shelton's avatar
      ice: Do not check INTEVENT bit for OICR interrupts · 30d84397
      Ben Shelton authored
      According to the hardware spec, checking the INTEVENT bit isn't a
      reliable way to detect if an OICR interrupt has occurred. This is
      because this bit can be cleared by the hardware/firmware before the
      interrupt service routine has run. So instead, just check for OICR
      events every time.
      
      Fixes: 940b61af
      
       ("ice: Initialize PF and setup miscellaneous interrupt")
      Signed-off-by: default avatarBen Shelton <benjamin.h.shelton@intel.com>
      Signed-off-by: default avatarAnirudh Venkataramanan <anirudh.venkataramanan@intel.com>
      Tested-by: default avatarTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      30d84397
  5. Apr 24, 2018
  6. Apr 23, 2018
    • Xin Long's avatar
      bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave · ddea788c
      Xin Long authored
      After Commit 8a8efa22 ("bonding: sync netpoll code with bridge"), it
      would set slave_dev npinfo in slave_enable_netpoll when enslaving a dev
      if bond->dev->npinfo was set.
      
      However now slave_dev npinfo is set with bond->dev->npinfo before calling
      slave_enable_netpoll. With slave_dev npinfo set, __netpoll_setup called
      in slave_enable_netpoll will not call slave dev's .ndo_netpoll_setup().
      It causes that the lower dev of this slave dev can't set its npinfo.
      
      One way to reproduce it:
      
        # modprobe bonding
        # brctl addbr br0
        # brctl addif br0 eth1
        # ifconfig bond0 192.168.122.1/24 up
        # ifenslave bond0 eth2
        # systemctl restart netconsole
        # ifenslave bond0 br0
        # ifconfig eth2 down
        # systemctl restart netconsole
      
      The netpoll won't really work.
      
      This patch is to remove that slave_dev npinfo setting in bond_enslave().
      
      Fixes: 8a8efa22
      
       ("bonding: sync netpoll code with bridge")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ddea788c
    • Jann Horn's avatar
      tcp: don't read out-of-bounds opsize · 7e5a206a
      Jann Horn authored
      The old code reads the "opsize" variable from out-of-bounds memory (first
      byte behind the segment) if a broken TCP segment ends directly after an
      opcode that is neither EOL nor NOP.
      
      The result of the read isn't used for anything, so the worst thing that
      could theoretically happen is a pagefault; and since the physmap is usually
      mostly contiguous, even that seems pretty unlikely.
      
      The following C reproducer triggers the uninitialized read - however, you
      can't actually see anything happen unless you put something like a
      pr_warn() in tcp_parse_md5sig_option() to print the opsize.
      
      ====================================
      #define _GNU_SOURCE
      #include <arpa/inet.h>
      #include <stdlib.h>
      #include <errno.h>
      #include <stdarg.h>
      #include <net/if.h>
      #include <linux/if.h>
      #include <linux/ip.h>
      #include <linux/tcp.h>
      #include <linux/in.h>
      #include <linux/if_tun.h>
      #include <err.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <string.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <assert.h>
      
      void systemf(const char *command, ...) {
        char *full_command;
        va_list ap;
        va_start(ap, command);
        if (vasprintf(&full_command, command, ap) == -1)
          err(1, "vasprintf");
        va_end(ap);
        printf("systemf: <<<%s>>>\n", full_command);
        system(full_command);
      }
      
      char *devname;
      
      int tun_alloc(char *name) {
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd == -1)
          err(1, "open tun dev");
        static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI };
        strcpy(req.ifr_name, name);
        if (ioctl(fd, TUNSETIFF, &req))
          err(1, "TUNSETIFF");
        devname = req.ifr_name;
        printf("device name: %s\n", devname);
        return fd;
      }
      
      #define IPADDR(a,b,c,d) (((a)<<0)+((b)<<8)+((c)<<16)+((d)<<24))
      
      void sum_accumulate(unsigned int *sum, void *data, int len) {
        assert((len&2)==0);
        for (int i=0; i<len/2; i++) {
          *sum += ntohs(((unsigned short *)data)[i]);
        }
      }
      
      unsigned short sum_final(unsigned int sum) {
        sum = (sum >> 16) + (sum & 0xffff);
        sum = (sum >> 16) + (sum & 0xffff);
        return htons(~sum);
      }
      
      void fix_ip_sum(struct iphdr *ip) {
        unsigned int sum = 0;
        sum_accumulate(&sum, ip, sizeof(*ip));
        ip->check = sum_final(sum);
      }
      
      void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) {
        unsigned int sum = 0;
        struct {
          unsigned int saddr;
          unsigned int daddr;
          unsigned char pad;
          unsigned char proto_num;
          unsigned short tcp_len;
        } fakehdr = {
          .saddr = ip->saddr,
          .daddr = ip->daddr,
          .proto_num = ip->protocol,
          .tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4)
        };
        sum_accumulate(&sum, &fakehdr, sizeof(fakehdr));
        sum_accumulate(&sum, tcp, tcp->doff*4);
        tcp->check = sum_final(sum);
      }
      
      int main(void) {
        int tun_fd = tun_alloc("inject_dev%d");
        systemf("ip link set %s up", devname);
        systemf("ip addr add 192.168.42.1/24 dev %s", devname);
      
        struct {
          struct iphdr ip;
          struct tcphdr tcp;
          unsigned char tcp_opts[20];
        } __attribute__((packed)) syn_packet = {
          .ip = {
            .ihl = sizeof(struct iphdr)/4,
            .version = 4,
            .tot_len = htons(sizeof(syn_packet)),
            .ttl = 30,
            .protocol = IPPROTO_TCP,
            /* FIXUP check */
            .saddr = IPADDR(192,168,42,2),
            .daddr = IPADDR(192,168,42,1)
          },
          .tcp = {
            .source = htons(1),
            .dest = htons(1337),
            .seq = 0x12345678,
            .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4,
            .syn = 1,
            .window = htons(64),
            .check = 0 /*FIXUP*/
          },
          .tcp_opts = {
            /* INVALID: trailing MD5SIG opcode after NOPs */
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 19
          }
        };
        fix_ip_sum(&syn_packet.ip);
        fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp);
        while (1) {
          int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet));
          if (write_res != sizeof(syn_packet))
            err(1, "packet write failed");
        }
      }
      ====================================
      
      Fixes: cfb6eeb4
      
       ("[TCP]: MD5 Signature Option (RFC2385) support.")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7e5a206a