  1. Nov 05, 2019
  2. Nov 04, 2019
    • Merge tag 'mlx5-updates-2019-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 1574cf83
      David S. Miller authored
      
      
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2019-11-01
      
      Misc updates for mlx5 netdev and core driver
      
      1) Steering Core: Replace CRC32 internal implementation with standard
         kernel lib.
      2) Steering Core: Support IPv4 and IPv6 mixed matcher.
      3) Steering Core: Lockless FTE read lookups
      4) TC: Bit sized fields rewrite support.
      5) FPGA: Standalone FPGA support.
      6) SRIOV: Reset VF parameters configurations on SRIOV disable.
      7) netdev: Dump WQs wqe descriptors on CQE with error events.
      8) MISC Cleanups.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mISDN: remove unused variable 'faxmodulation_s' · a37ac8ae
      YueHaibing authored
      
      
      drivers/isdn/hardware/mISDN/mISDNisar.c:30:17:
       warning: faxmodulation_s defined but not used [-Wunused-const-variable=]
      
      It is never used, so can be removed.
      
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptp: Add a ptp clock driver for IDT ClockMatrix. · 3a6ba7dc
      Vincent Cheng authored
      
      
      The IDT ClockMatrix (TM) family includes integrated devices that provide
      eight PLL channels.  Each PLL channel can be independently configured as a
      frequency synthesizer, jitter attenuator, digitally controlled
      oscillator (DCO), or a digital phase lock loop (DPLL).  Typically
      these devices are used as timing references and clock sources for PTP
      applications.  This patch adds support for the device.
      
      Co-developed-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Vincent Cheng <vincent.cheng.xh@renesas.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: ptp: Add device tree binding for IDT ClockMatrix based PTP clock · 5c5e7aac
      Vincent Cheng authored
      
      
      Add device tree binding doc for the IDT ClockMatrix PTP clock.
      
      Signed-off-by: Vincent Cheng <vincent.cheng.xh@renesas.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: icmp6: provide input address for traceroute6 · fac6fce9
      Francesco Ruggeri authored
      
      
      traceroute6 output can be confusing, in that it shows the address
      that a router would use to reach the sender, rather than the address
      the packet used to reach the router.
      Consider this case:
      
              ------------------------ N2
               |                    |
             ------              ------  N3  ----
             | R1 |              | R2 |------|H2|
             ------              ------      ----
               |                    |
              ------------------------ N1
                        |
                       ----
                       |H1|
                       ----
      
      where H1's default route is through R1, and R1's default route is
      through R2 over N2.
      traceroute6 from H1 to H2 shows R2's address on N1 rather than on N2.
      
      The script below can be used to reproduce this scenario.
      
      traceroute6 output without this patch:
      
      traceroute to 2000:103::4 (2000:103::4), 30 hops max, 80 byte packets
       1  2000:101::1 (2000:101::1)  0.036 ms  0.008 ms  0.006 ms
       2  2000:101::2 (2000:101::2)  0.011 ms  0.008 ms  0.007 ms
       3  2000:103::4 (2000:103::4)  0.013 ms  0.010 ms  0.009 ms
      
      traceroute6 output with this patch:
      
      traceroute to 2000:103::4 (2000:103::4), 30 hops max, 80 byte packets
       1  2000:101::1 (2000:101::1)  0.056 ms  0.019 ms  0.006 ms
       2  2000:102::2 (2000:102::2)  0.013 ms  0.008 ms  0.008 ms
       3  2000:103::4 (2000:103::4)  0.013 ms  0.009 ms  0.009 ms
      
      #!/bin/bash
      #
      #        ------------------------ N2
      #         |                    |
      #       ------              ------  N3  ----
      #       | R1 |              | R2 |------|H2|
      #       ------              ------      ----
      #         |                    |
      #        ------------------------ N1
      #                  |
      #                 ----
      #                 |H1|
      #                 ----
      #
      # N1: 2000:101::/64
      # N2: 2000:102::/64
      # N3: 2000:103::/64
      #
      # R1's host part of address: 1
      # R2's host part of address: 2
      # H1's host part of address: 3
      # H2's host part of address: 4
      #
      # For example:
      # the IPv6 address of R1's interface on N2 is 2000:102::1/64
      #
      # Nets are implemented by macvlan interfaces (bridge mode) over
      # dummy interfaces.
      #
      
      # Create net namespaces
      ip netns add host1
      ip netns add host2
      ip netns add rtr1
      ip netns add rtr2
      
      # Create nets
      ip link add net1 type dummy; ip link set net1 up
      ip link add net2 type dummy; ip link set net2 up
      ip link add net3 type dummy; ip link set net3 up
      
      # Add interfaces to net1, move them to their namespaces
      ip link add link net1 dev host1net1 type macvlan mode bridge
      ip link set host1net1 netns host1
      ip link add link net1 dev rtr1net1 type macvlan mode bridge
      ip link set rtr1net1 netns rtr1
      ip link add link net1 dev rtr2net1 type macvlan mode bridge
      ip link set rtr2net1 netns rtr2
      
      # Add interfaces to net2, move them to their namespaces
      ip link add link net2 dev rtr1net2 type macvlan mode bridge
      ip link set rtr1net2 netns rtr1
      ip link add link net2 dev rtr2net2 type macvlan mode bridge
      ip link set rtr2net2 netns rtr2
      
      # Add interfaces to net3, move them to their namespaces
      ip link add link net3 dev rtr2net3 type macvlan mode bridge
      ip link set rtr2net3 netns rtr2
      ip link add link net3 dev host2net3 type macvlan mode bridge
      ip link set host2net3 netns host2
      
      # Configure interfaces and routes in host1
      ip netns exec host1 ip link set lo up
      ip netns exec host1 ip link set host1net1 up
      ip netns exec host1 ip -6 addr add 2000:101::3/64 dev host1net1
      ip netns exec host1 ip -6 route add default via 2000:101::1
      
      # Configure interfaces and routes in rtr1
      ip netns exec rtr1 ip link set lo up
      ip netns exec rtr1 ip link set rtr1net1 up
      ip netns exec rtr1 ip -6 addr add 2000:101::1/64 dev rtr1net1
      ip netns exec rtr1 ip link set rtr1net2 up
      ip netns exec rtr1 ip -6 addr add 2000:102::1/64 dev rtr1net2
      ip netns exec rtr1 ip -6 route add default via 2000:102::2
      ip netns exec rtr1 sysctl net.ipv6.conf.all.forwarding=1
      
      # Configure interfaces and routes in rtr2
      ip netns exec rtr2 ip link set lo up
      ip netns exec rtr2 ip link set rtr2net1 up
      ip netns exec rtr2 ip -6 addr add 2000:101::2/64 dev rtr2net1
      ip netns exec rtr2 ip link set rtr2net2 up
      ip netns exec rtr2 ip -6 addr add 2000:102::2/64 dev rtr2net2
      ip netns exec rtr2 ip link set rtr2net3 up
      ip netns exec rtr2 ip -6 addr add 2000:103::2/64 dev rtr2net3
      ip netns exec rtr2 sysctl net.ipv6.conf.all.forwarding=1
      
      # Configure interfaces and routes in host2
      ip netns exec host2 ip link set lo up
      ip netns exec host2 ip link set host2net3 up
      ip netns exec host2 ip -6 addr add 2000:103::4/64 dev host2net3
      ip netns exec host2 ip -6 route add default via 2000:103::2
      
      # Ping host2 from host1
      ip netns exec host1 ping6 -c5 2000:103::4
      
      # Traceroute host2 from host1
      ip netns exec host1 traceroute6 2000:103::4
      
      # Delete nets
      ip link del net3
      ip link del net2
      ip link del net1
      
      # Delete namespaces
      ip netns del rtr2
      ip netns del rtr1
      ip netns del host2
      ip netns del host1
      
      Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
      Original-patch-by: Honggang Xu <hxu@arista.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: improve message bundling algorithm · 06e7c70c
      Tuong Lien authored
      As mentioned in commit e95584a8 ("tipc: fix unlimited bundling of
      small messages"), the current message bundling algorithm is inefficient:
      it can generate bundles containing only one payload message, which causes
      unnecessary overhead for both the sender and the receiver.
      
      This commit redesigns the 'tipc_msg_make_bundle()' function (now named
      'tipc_msg_try_bundle()') so that when the first message arrives, we
      only check whether it is suitable for bundling and, if so, keep a
      reference to it. The message buffer is put into the link backlog queue
      and processed as normal. Later, when another message arrives, we make a
      bundle with the first message if possible, and so on. This way, a
      bundle, if really needed, always consists of at least two payload
      messages. Otherwise, we let the first buffer go its way without any
      bundling, which reduces the overhead to zero.

      Moreover, since we now have both messages in hand, we can also optimize
      the 'tipc_msg_bundle()' function to bundle a very large message
      (size ~ MSS) with a small one, which is not possible with the current
      algorithm, e.g. [1400-byte message] + [10-byte message] (MTU = 1500).
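
The deferred bundling idea described above can be illustrated with a small model. This is only a sketch, not the TIPC implementation: the `Backlog` class, its `candidate` field, and the 40-byte bundle header are hypothetical choices made for illustration, and alignment and message-type checks are ignored.

```python
class Backlog:
    """Toy model of a link backlog queue with deferred bundling."""

    def __init__(self, mtu=1500, hdr=40):
        self.mtu = mtu         # MTU = 1500 is taken from the example in the text
        self.hdr = hdr         # hypothetical bundle header size
        self.queue = []        # each entry is a list of message sizes sent as one buffer
        self.candidate = None  # entry kept as a bundling target, or None

    def wire_size(self, entry):
        # A lone message travels as-is; only a real bundle pays for the header.
        return entry[0] if len(entry) == 1 else self.hdr + sum(entry)

    def enqueue(self, size):
        target = self.candidate
        if target is not None and self.hdr + sum(target) + size <= self.mtu:
            # A later message arrived and fits: form or extend the bundle.
            target.append(size)
            return
        # The message is queued unbundled, exactly as it would be without bundling.
        entry = [size]
        self.queue.append(entry)
        # Keep a reference only if another message could still fit beside it.
        self.candidate = entry if self.hdr + size < self.mtu else None
```

With these assumptions, a 1400-byte message followed by a 10-byte message ends up in one buffer, while a message that never gets a companion is sent with no bundling overhead at all.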
      
      Acked-by: Ying Xue <ying.xue@windreiver.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: icmp: use input address in traceroute · 2adf81c0
      Francesco Ruggeri authored
      
      
      Even with icmp_errors_use_inbound_ifaddr set, traceroute returns the
      primary address of the interface the packet was received on, even if
      the path goes through a secondary address. In the example:
      
                          1.0.3.1/24
       ---- 1.0.1.3/24    1.0.1.1/24 ---- 1.0.2.1/24    1.0.2.4/24 ----
       |H1|--------------------------|R1|--------------------------|H2|
       ----            N1            ----            N2            ----
      
      where 1.0.3.1/24 is R1's primary address on N1, traceroute from
      H1 to H2 returns:
      
      traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
       1  1.0.3.1 (1.0.3.1)  0.018 ms  0.006 ms  0.006 ms
       2  1.0.2.4 (1.0.2.4)  0.021 ms  0.007 ms  0.007 ms
      
      After applying this patch, it returns:
      
      traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
       1  1.0.1.1 (1.0.1.1)  0.033 ms  0.007 ms  0.006 ms
       2  1.0.2.4 (1.0.2.4)  0.011 ms  0.007 ms  0.007 ms
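
The address choice in the example can be modelled in a few lines. This is a sketch under an assumption the commit message does not spell out: that the reply address is the interface address sharing a subnet with the probe's source, with the primary address as fallback. The function name `select_icmp_source` is hypothetical.

```python
import ipaddress


def select_icmp_source(if_addrs, packet_src):
    """Pick the address an ICMP error would be sent from.

    if_addrs: addresses configured on the receiving interface, primary
    first, written as 'a.b.c.d/len'. packet_src: source of the probe.
    """
    src = ipaddress.ip_address(packet_src)
    ifaces = [ipaddress.ip_interface(a) for a in if_addrs]
    # Assumed rule: prefer an address on the same subnet as the sender.
    for iface in ifaces:
        if src in iface.network:
            return str(iface.ip)
    # No address shares a subnet with the sender: use the primary address.
    return str(ifaces[0].ip)
```

For R1's interface on N1 in the diagram, `select_icmp_source(["1.0.3.1/24", "1.0.1.1/24"], "1.0.1.3")` picks 1.0.1.1, which matches the first hop shown after the patch.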
      
      Original-patch-by: Bill Fenner <fenner@arista.com>
      Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'optimize-openvswitch-flow-looking-up' · c219a166
      David S. Miller authored
      
      
      Tonghao Zhang says:
      
      ====================
      optimize openvswitch flow looking up
      
      This patch series optimizes openvswitch for performance and simplifies
      the code.
      
      Patches 1, 2, 4: Port Pravin B Shelar's patches to
      Linux upstream with small changes.
      
      Patches 5, 6, 7: Optimize the flow lookup and
      simplify the flow hash.
      
      Patches 8, 9: Bug fixes.
      
      The performance test was run on an Intel Xeon E5-2630 v4.
      The test topology is shown below:
      
      +-----------------------------------+
      |   +---------------------------+   |
      |   | eth0   ovs-switch    eth1 |   | Host0
      |   +---------------------------+   |
      +-----------------------------------+
            ^                       |
            |                       |
            |                       |
            |                       |
            |                       v
      +-----+----+             +----+-----+
      | netperf  | Host1       | netserver| Host2
      +----------+             +----------+
      
      We use netperf to send 64B packets and insert 255+ flow-masks:
      $ ovs-dpctl add-flow ovs-switch "in_port(1),eth(dst=00:01:00:00:00:00/ff:ff:ff:ff:ff:01),eth_type(0x0800),ipv4(frag=no)" 2
      ...
      $ ovs-dpctl add-flow ovs-switch "in_port(1),eth(dst=00:ff:00:00:00:00/ff:ff:ff:ff:ff:ff),eth_type(0x0800),ipv4(frag=no)" 2
      $
      $ netperf -t UDP_STREAM -H 2.2.2.200 -l 40 -- -m 18
      
      * Without the patch series, throughput: 8.28Mbps
      * With the patch series, throughput: 46.05Mbps
      
      v6:
      some coding style fixes
      
      v5:
      rewrite patch 8, release flow-mask when freeing flow
      
      v4:
      access ma->count with the READ_ONCE/WRITE_ONCE API. For more
      information, see the comments in patch 5.
      
      v3:
      update the ma pointer when reallocating mask_array in patch 5
      
      v2:
      simplify the code, e.g. use kfree_rcu instead of call_rcu
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: openvswitch: simplify the ovs_dp_cmd_new · eec62ead
      Tonghao Zhang authored
      
      
      Use dedicated functions to initialize resources.
      
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Tested-by: Greg Rose <gvrose8192@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: openvswitch: don't unlock mutex when changing the user_features fails · 4c76bf69
      Tonghao Zhang authored
      Unlocking a mutex that is not locked is not allowed.
      Another kernel thread may be in the critical section when
      we unlock it because setting user_features failed.
      
      Fixes: 95a7233c ("net: openvswitch: Set OvS recirc_id from tc chain index")
      Cc: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Tested-by: Greg Rose <gvrose8192@gmail.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: openvswitch: fix possible memleak on destroy flow-table · 50b0e61b
      Tonghao Zhang authored
      
      
      The flow tables being destroyed may still contain flow_masks,
      so release the flow mask structs as well.
      
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Tested-by: Greg Rose <gvrose8192@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: openvswitch: add likely in flow_lookup · 0a3e0137
      Tonghao Zhang authored
      
      
      In most cases *index < ma->max and the flow-mask is not NULL.
      Add likely()/unlikely() annotations for performance.
      
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Tested-by: Greg Rose <gvrose8192@gmail.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: openvswitch: simplify the flow_hash · 515b65a4
      Tonghao Zhang authored
      
      
      Simplify the code and remove the unnecessary BUILD_BUG_ON.
      
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Tested-by: Greg Rose <gvrose8192@gmail.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: openvswitch: optimize flow-mask looking up · 57f7d7b9
      Tonghao Zhang authored
      
      
      A full lookup on the flow table traverses the whole mask array.
      If the mask array is large and the number of invalid flow-masks
      increases, performance drops.
      
      One bad case, for example (M means the flow-mask is valid; NULL
      means the flow-mask was deleted):
      
      +-------------------------------------------+
      | M | NULL | ...                  | NULL | M|
      +-------------------------------------------+
      
      In that case, without this patch, openvswitch traverses the whole
      mask array because there is one flow-mask at the tail. This patch
      changes the way flow-masks are inserted and deleted so that the
      mask array is kept as shown below, without NULL holes. In the
      fast path, we can "break" out of the "for" loop (instead of
      "continue") in flow_lookup when we reach a NULL flow-mask.
      
               "break"
                  v
      +-------------------------------------------+
      | M | M |  NULL |...           | NULL | NULL|
      +-------------------------------------------+
      
      This patch doesn't optimize the slow or control path, which still uses
      ma->max to traverse. Slow path:
      * tbl_mask_array_realloc
      * ovs_flow_tbl_lookup_exact
      * flow_mask_find
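
The hole-free layout described above can be modelled as follows. This is a sketch of the idea, not the kernel code: `MaskArray` and its methods are hypothetical names, masks are plain Python values, and RCU and locking are left out. Deletion moves the last valid entry into the freed slot, which is one way to keep the valid masks contiguous.

```python
class MaskArray:
    """Toy mask array that keeps all valid masks at the front."""

    def __init__(self, size):
        self.masks = [None] * size  # None plays the role of a NULL flow-mask
        self.count = 0              # number of valid masks

    def insert(self, mask):
        if self.count == len(self.masks):
            raise OverflowError("mask array full; a real table would realloc")
        self.masks[self.count] = mask   # always append after the last valid mask
        self.count += 1

    def delete(self, mask):
        i = self.masks.index(mask, 0, self.count)
        last = self.count - 1
        self.masks[i] = self.masks[last]  # fill the hole with the tail entry
        self.masks[last] = None
        self.count -= 1

    def lookup(self, matches):
        """Return (mask, slots_visited); stop at the first empty slot."""
        visited = 0
        for mask in self.masks:
            if mask is None:
                break               # no valid mask can follow an empty slot
            visited += 1
            if matches(mask):
                return mask, visited
        return None, visited
```

After inserting three masks into an eight-slot array and deleting the first, a failed lookup visits only the two remaining masks instead of all eight slots.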
      
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Tested-by: Greg Rose <gvrose8192@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: openvswitch: optimize flow mask cache hash collision · a7f35e78
      Tonghao Zhang authored
      
      
      Port the code to Linux upstream with small changes.
      
      Pravin B Shelar says:
      | In case hash collision on mask cache, OVS does extra flow
      | lookup. Following patch avoid it.
      
      Link: https://github.com/openvswitch/ovs/commit/0e6efbe2712da03522532dc5e84806a96f6a0dd1
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Tested-by: Greg Rose <gvrose8192@gmail.com>
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: openvswitch: shrink the mask array if necessary · 1689754d
      Tonghao Zhang authored
      
      
      When creating and inserting a flow-mask, if there is no free slot,
      we realloc the mask array. When removing a flow-mask, we shrink the
      mask array if necessary.
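
A resize policy of this kind might look like the sketch below. The commit message gives no numbers, so everything concrete here is an assumption made for illustration: the minimum size of 16, growing by one minimum-sized step, and halving once the array is at most one third full. The name `next_capacity` is hypothetical.

```python
MIN_SIZE = 16  # hypothetical minimum capacity, not taken from the commit message


def next_capacity(capacity, count, inserting):
    """Return the capacity the mask array should have for the next operation.

    capacity: current number of slots; count: number of valid masks.
    inserting: True before an insert, False after a removal.
    """
    if inserting:
        # Grow only when every slot is already occupied.
        return capacity + MIN_SIZE if count >= capacity else capacity
    # Shrink only when the array is mostly empty, and never below MIN_SIZE.
    if capacity >= 2 * MIN_SIZE and count <= capacity // 3:
        return capacity // 2
    return capacity
```

Using different conditions for growing and shrinking leaves a gap between the two thresholds, so a workload that adds and removes one mask repeatedly does not reallocate on every operation.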
      
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Tested-by: Greg Rose <gvrose8192@gmail.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: openvswitch: convert mask list in mask array · 4bc63b1b
      Tonghao Zhang authored
      
      
      Port the code to Linux upstream with small changes.
      
      Pravin B Shelar says:
      | mask caches index of mask in mask_list. On packet recv OVS
      | need to traverse mask-list to get cached mask. Therefore array
      | is better for retrieving cached mask. This also allows better
      | cache replacement algorithm by directly checking mask's existence.
      
      Link: https://github.com/openvswitch/ovs/commit/d49fc3ff53c65e4eca9cabd52ac63396746a7ef5
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Tested-by: Greg Rose <gvrose8192@gmail.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: openvswitch: add flow-mask cache for performance · 04b7d136
      Tonghao Zhang authored
      
      
      The idea for this optimization comes from a patch by Pravin B Shelar
      that was committed in the openvswitch community repository in 2014.
      In order to get high performance, I implement it again here. Later
      patches will use it.
      
      Pravin B Shelar says:
      | On every packet OVS needs to lookup flow-table with every
      | mask until it finds a match. The packet flow-key is first
      | masked with mask in the list and then the masked key is
      | looked up in flow-table. Therefore number of masks can
      | affect packet processing performance.
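
The per-mask lookup quoted above, and the effect of caching the mask that matched last time, can be sketched as follows. This is an illustrative model only: `FlowTable`, the tuple-based keys, and the dictionary used as a cache are hypothetical, whereas the kernel works on binary flow keys and a cache indexed by the packet hash.

```python
def masked(key, mask):
    """Apply a mask to a flow key, field by field."""
    return tuple(k & m for k, m in zip(key, mask))


class FlowTable:
    def __init__(self):
        self.masks = []   # every distinct mask in use
        self.flows = {}   # (mask index, masked key) -> action
        self.cache = {}   # packet hash -> index of the mask that matched last

    def add_flow(self, key, mask, action):
        if mask not in self.masks:
            self.masks.append(mask)
        idx = self.masks.index(mask)
        self.flows[(idx, masked(key, mask))] = action

    def lookup(self, key, pkt_hash):
        """Return (action, masks_tried); try the cached mask first."""
        order = list(range(len(self.masks)))
        hint = self.cache.get(pkt_hash)
        if hint is not None:
            order.remove(hint)
            order.insert(0, hint)
        tried = 0
        for idx in order:
            tried += 1
            action = self.flows.get((idx, masked(key, self.masks[idx])))
            if action is not None:
                self.cache[pkt_hash] = idx  # remember the mask for this hash
                return action, tried
        return None, tried
```

Without a cache entry, the cost of a lookup grows with the number of masks tried before the match; with one, a repeated packet is resolved on the first mask.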
      
      Link: https://github.com/openvswitch/ovs/commit/5604935e4e1cbc16611d2d97f50b717aa31e8ec5
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Tested-by: Greg Rose <gvrose8192@gmail.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. Nov 03, 2019
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · ae8a76fb
      David S. Miller authored
      
      
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2019-11-02
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      We've added 30 non-merge commits during the last 7 day(s) which contain
      a total of 41 files changed, 1864 insertions(+), 474 deletions(-).
      
      The main changes are:
      
      1) Fix long standing user vs kernel access issue by introducing
         bpf_probe_read_user() and bpf_probe_read_kernel() helpers, from Daniel.
      
      2) Accelerated xskmap lookup, from Björn and Maciej.
      
      3) Support for automatic map pinning in libbpf, from Toke.
      
      4) Cleanup of BTF-enabled raw tracepoints, from Alexei.
      
      5) Various fixes to libbpf and selftests.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · d31e9558
      David S. Miller authored
      
      
      The only slightly tricky merge conflict was the netdevsim because the
      mutex locking fix overlapped a lot of driver reload reorganization.
      
      The rest were (relatively) trivial in nature.
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf_probe_read_user' · 358fdb45
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      This set adds probe_read_{user,kernel}(), probe_read_str_{user,kernel}()
      helpers, fixes probe_write_user() helper and selftests. For details please
      see individual patches.
      
      Thanks!
      
      v2 -> v3:
        - noticed two more things that are fixed in here:
         - bpf uapi helper description used 'int size' for *_str helpers, now u32
         - we need TASK_SIZE_MAX + guard page on x86-64 in patch 2 otherwise
           we'll trigger the 00c42373 warn as well, so full range covered now
      v1 -> v2:
        - standardize unsafe_ptr terminology in uapi header comment (Andrii)
        - probe_read_{user,kernel}[_str] naming scheme (Andrii)
        - use global data in last test case, remove relaxed_maps (Andrii)
        - add strict non-pagefault kernel read funcs to avoid warning in
          kernel probe read helpers (Alexei)
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, testing: Add selftest to read/write sockaddr from user space · fa553d9b
      Daniel Borkmann authored
      
      
      Tested on x86-64 and Ilya was also kind enough to give it a spin on
      s390x, both passing with probe_user:OK there. The test is using the
      newly added bpf_probe_read_user() to dump sockaddr from connect call
      into .bss BPF map and overrides the user buffer via bpf_probe_write_user():
      
        # ./test_progs
        [...]
        #17 pkt_md_access:OK
        #18 probe_user:OK
        #19 prog_run_xattr:OK
        [...]
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/90f449d8af25354e05080e82fc6e2d3179da30ea.1572649915.git.daniel@iogearbox.net
    • bpf, testing: Convert prog tests to probe_read_{user, kernel}{, _str} helper · 50f9aa44
      Daniel Borkmann authored
      
      
      Use probe read *_{kernel,user}{,_str}() helpers instead of bpf_probe_read()
      or bpf_probe_read_user_str() for program tests where appropriate.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/4a61d4b71ce3765587d8ef5cb93afa18515e5b3e.1572649915.git.daniel@iogearbox.net
    • bpf, samples: Use bpf_probe_read_user where appropriate · 251e2d33
      Daniel Borkmann authored
      
      
      Use bpf_probe_read_user() helper instead of bpf_probe_read() for samples that
      attach to kprobes probing on user addresses.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/5b0144b3f8e031ec5e2438bd7de8d7877e63bf2f.1572649915.git.daniel@iogearbox.net
    • bpf: Switch BPF probe insns to bpf_probe_read_kernel · 6e07a634
      Daniel Borkmann authored
      Commit 2a02759e ("bpf: Add support for BTF pointers to interpreter")
      explicitly states that the pointer to BTF object is a pointer to a kernel
      object or NULL. Therefore we should also switch to using the strict kernel
      probe helper which is restricted to kernel addresses only when architectures
      have non-overlapping address spaces.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/d2b90827837685424a4b8008dfe0460558abfada.1572649915.git.daniel@iogearbox.net
    • bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers · 6ae08ae3
      Daniel Borkmann authored
      The current bpf_probe_read() and bpf_probe_read_str() helpers are broken
      in that they assume they can be used for probing memory access for kernel
      space addresses /as well as/ user space addresses.
      
      However, plain use of probe_kernel_read() for both cases will attempt to
      always access kernel space address space given access is performed under
      KERNEL_DS and some archs in-fact have overlapping address spaces where a
      kernel pointer and user pointer would have the /same/ address value and
      therefore accessing application memory via bpf_probe_read{,_str}() would
      read garbage values.
      
      Let's fix the BPF side by making use of the recently added 3d708182 ("uaccess:
      Add non-pagefault user-space read functions"). Unfortunately, the only way
      to fix this status quo is to add dedicated bpf_probe_read_{user,kernel}()
      and bpf_probe_read_{user,kernel}_str() helpers. The bpf_probe_read{,_str}()
      helpers are kept as-is to retain their current behavior.
      
      The two *_user() variants attempt the access always under USER_DS set, the
      two *_kernel() variants will -EFAULT when accessing user memory if the
      underlying architecture has non-overlapping address ranges, also avoiding
      throwing the kernel warning via 00c42373 ("x86-64: add warning for
      non-canonical user access address dereferences").
      
      Fixes: a5e8c070 ("bpf: add bpf_probe_read_str helper")
      Fixes: 2541517c ("tracing, perf: Implement BPF programs attached to kprobes")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649915.git.daniel@iogearbox.net
    • bpf: Make use of probe_user_write in probe write helper · eb1b6688
      Daniel Borkmann authored
      Convert the bpf_probe_write_user() helper to probe_user_write() such that
      writes are not attempted under KERNEL_DS anymore which is buggy as kernel
      and user space pointers can have overlapping addresses. Also, given we have
      the access_ok() check inside probe_user_write(), the helper doesn't need
      to do it twice.
      
      Fixes: 96ae5227 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/841c461781874c07a0ee404a454c3bc0459eed30.1572649915.git.daniel@iogearbox.net
    • uaccess: Add strict non-pagefault kernel-space read function · 75a1a607
      Daniel Borkmann authored
      Add two new probe_kernel_read_strict() and strncpy_from_unsafe_strict()
      helpers which by default alias to the __probe_kernel_read() and the
      __strncpy_from_unsafe(), respectively, but can be overridden by archs
      which have non-overlapping address ranges for kernel space and user
      space in order to bail out with -EFAULT when attempting to probe user
      memory including non-canonical user access addresses [0]:
      
        4-level page tables:
          user-space mem: 0x0000000000000000 - 0x00007fffffffffff
          non-canonical:  0x0000800000000000 - 0xffff7fffffffffff
      
        5-level page tables:
          user-space mem: 0x0000000000000000 - 0x00ffffffffffffff
          non-canonical:  0x0100000000000000 - 0xfeffffffffffffff
      
      The idea is that these helpers are complementary to the probe_user_read()
      and strncpy_from_unsafe_user() which probe user-only memory. Both added
      helpers here do the same, but for kernel-only addresses.
      
      Both sets of helpers are going to be used for BPF tracing. They also
      explicitly avoid throwing the splat for non-canonical user addresses from
      00c42373 ("x86-64: add warning for non-canonical user access address
      dereferences").
      
      For compat, the current probe_kernel_read() and strncpy_from_unsafe() are
      left as-is.
      
        [0] Documentation/x86/x86_64/mm.txt
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: x86@kernel.org
      Link: https://lore.kernel.org/bpf/eefeefd769aa5a013531f491a71f0936779e916b.1572649915.git.daniel@iogearbox.net
      75a1a607
    • Daniel Borkmann's avatar
      uaccess: Add non-pagefault user-space write function · 1d1585ca
      Daniel Borkmann authored
      Commit 3d708182 ("uaccess: Add non-pagefault user-space read functions")
      did not add a probe write function. Therefore, factor out a probe_write_common()
      helper with most of the logic of probe_kernel_write() except setting KERNEL_DS, and
      add a new probe_user_write() helper so it can be used from the BPF side.
      
      Again, on some archs, the user address space and kernel address space can
      co-exist and overlap, so in such a case, setting KERNEL_DS would mean
      that the given address is treated as being in kernel address space.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/bpf/9df2542e68141bfa3addde631441ee45503856a8.1572649915.git.daniel@iogearbox.net
      1d1585ca
    • Alexei Starovoitov's avatar
      Merge branch 'map-pinning' · e1cb7d2d
      Alexei Starovoitov authored
      
      
      Toke Høiland-Jørgensen says:
      
      ====================
      This series adds support to libbpf for reading 'pinning' settings from BTF-based
      map definitions. It introduces a new open option which can set the pinning path;
      if no path is set, /sys/fs/bpf is used as the default. Callers can customise the
      pinning between open and load by setting the pin path per map, and still get the
      automatic reuse feature.
      
      The semantics of the pinning is similar to the iproute2 "PIN_GLOBAL" setting,
      and the eventual goal is to move the iproute2 implementation to be based on
      libbpf and the functions introduced in this series.
      
      Changelog:
      
      v6:
        - Fix leak of struct bpf_object in selftest
        - Make struct bpf_map arg const in bpf_map__is_pinned() and bpf_map__get_pin_path()
      
      v5:
        - Don't pin maps with pinning set, but with a value of LIBBPF_PIN_NONE
        - Add a few more selftests:
          - Should not pin map with pinning set, but value LIBBPF_PIN_NONE
          - Should fail to load a map with an invalid pinning value
          - Should fail to re-use maps with parameter mismatch
        - Alphabetise libbpf.map
        - Whitespace and typo fixes
      
      v4:
        - Don't check key_type_id and value_type_id when checking for map reuse
          compatibility.
        - Move building of map->pin_path into init_user_btf_map()
        - Get rid of 'pinning' attribute in struct bpf_map
        - Make sure we also create parent directory on auto-pin (new patch 3).
        - Abort the selftest on error instead of attempting to continue.
        - Support unpinning all pinned maps with bpf_object__unpin_maps(obj, NULL)
        - Support pinning at map->pin_path with bpf_object__pin_maps(obj, NULL)
        - Make re-pinning a map at the same path a noop
        - Rename the open option to pin_root_path
        - Add a bunch more self-tests for pin_maps(NULL) and unpin_maps(NULL)
        - Fix a couple of smaller nits
      
      v3:
        - Drop bpf_object__pin_maps_opts() and just use an open option to customise
          the pin path; also don't touch bpf_object__{un,}pin_maps()
        - Integrate pinning and reuse into bpf_object__create_maps() instead of having
          multiple loops through the map structure
        - Make errors in map reuse and pinning fatal to the load procedure
        - Add selftest to exercise pinning feature
        - Rebase series to latest bpf-next
      
      v2:
        - Drop patch that adds mounting of bpffs
        - Only support a single value of the pinning attribute
        - Add patch to fixup error handling in reuse_fd()
        - Implement the full automatic pinning and map reuse logic on load
      ====================
      
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e1cb7d2d
    • Toke Høiland-Jørgensen's avatar
      selftests: Add tests for automatic map pinning · 2f4a32cc
      Toke Høiland-Jørgensen authored
      
      
      This adds a new BPF selftest to exercise the new automatic map pinning
      code.
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/157269298209.394725.15420085139296213182.stgit@toke.dk
      2f4a32cc
    • Toke Høiland-Jørgensen's avatar
      libbpf: Add auto-pinning of maps when loading BPF objects · 57a00f41
      Toke Høiland-Jørgensen authored
      
      
      This adds support to libbpf for setting map pinning information as part of
      the BTF map declaration, to get automatic map pinning (and reuse) on load.
      The pinning type currently only supports a single PIN_BY_NAME mode, where
      each map will be pinned by its name in a path that can be overridden, but
      defaults to /sys/fs/bpf.
      
      Since auto-pinning only does something if any maps actually have a
      'pinning' BTF attribute set, we default the new option to enabled, on the
      assumption that seamless pinning is what most callers want.
      
      When a map has a pin_path set at load time, libbpf will compare the map
      pinned at that location (if any), and if the attributes match, will re-use
      that map instead of creating a new one. If no existing map is found, the
      newly created map will instead be pinned at the location.
      
      Programs wanting to customise the pinning can override the pinning paths
      using bpf_map__set_pin_path() before calling bpf_object__load() (including
      setting it to NULL to disable pinning of a particular map).
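
      The pin-by-name rule described above (each map pinned under its own
      name below a root path that defaults to /sys/fs/bpf) can be sketched
      as a simple path builder. This is only an illustration under stated
      assumptions: build_pin_path() and DEFAULT_PIN_ROOT are hypothetical
      names, not libbpf API, and the sketch does not touch bpffs at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Default root taken from the commit message; the macro name is made up. */
#define DEFAULT_PIN_ROOT "/sys/fs/bpf"

/*
 * Hypothetical helper: write "<root>/<map_name>" into buf.
 * A NULL root selects the default. Returns 0 on success and -1 if the
 * map name is empty or the result does not fit into buf.
 */
static int build_pin_path(char *buf, size_t len, const char *root,
			  const char *map_name)
{
	int n;

	assert(buf != NULL && map_name != NULL);

	if (strlen(map_name) == 0)
		return -1;
	if (root == NULL)
		root = DEFAULT_PIN_ROOT;

	n = snprintf(buf, len, "%s/%s", root, map_name);
	if (n < 0 || (size_t)n >= len)
		return -1; /* output error or truncated path */
	return 0;
}
```

      For a map named my_map and no custom root, the sketch yields
      /sys/fs/bpf/my_map; a caller that overrides the root gets the same
      name under its own directory.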
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/157269298092.394725.3966306029218559681.stgit@toke.dk
      57a00f41
    • Toke Høiland-Jørgensen's avatar
      libbpf: Move directory creation into _pin() functions · 196f8487
      Toke Høiland-Jørgensen authored
      
      
      The existing pin_*() functions all try to create the parent directory
      before pinning. Move this check into the per-object _pin() functions
      instead. This ensures consistent behaviour when auto-pinning is
      added (which doesn't go through the top-level pin_maps() function), at the
      cost of a few more calls to mkdir().
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/157269297985.394725.5882630952992598610.stgit@toke.dk
      196f8487
    • Toke Høiland-Jørgensen's avatar
      libbpf: Store map pin path and status in struct bpf_map · 4580b25f
      Toke Høiland-Jørgensen authored
      
      
      Support storing and setting a pin path in struct bpf_map, which can be used
      for automatic pinning. Also store the pin status so we can avoid attempts
      to re-pin a map that has already been pinned (or reused from a previous
      pinning).
      
      The behaviour of bpf_object__{un,}pin_maps() is changed so that if it is
      called with a NULL path argument (which was previously illegal), it will
      (un)pin only those maps that have a pin_path set.
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/157269297876.394725.14782206533681896279.stgit@toke.dk
      4580b25f
    • Toke Høiland-Jørgensen's avatar
      libbpf: Fix error handling in bpf_map__reuse_fd() · d1b4574a
      Toke Høiland-Jørgensen authored
      
      
      bpf_map__reuse_fd() was calling close() in the error path before returning
      an error value based on errno. However, close can change errno, so that can
      lead to potentially misleading error messages. Instead, explicitly store
      errno in the err variable before each goto.
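
      The pattern described above, saving errno before close() has a chance
      to overwrite it, can be shown in isolation. This is a minimal sketch,
      not the libbpf code: try_write_read_end() is a hypothetical function
      that provokes a failure by writing to the read end of a pipe, which
      on Linux fails with EBADF.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical example: force a failing syscall, then clean up.
 * Returns 0 on success or a negative errno value on failure.
 */
static int try_write_read_end(void)
{
	int fds[2];
	int err = 0;

	if (pipe(fds) < 0)
		return -errno;

	/* Writing to the read end fails; capture errno right away. */
	if (write(fds[0], "x", 1) < 0)
		err = -errno; /* saved before close() can change errno */

	close(fds[0]);
	close(fds[1]);

	assert(err <= 0); /* contract: 0 or a negative errno */
	return err;
}
```

      Had the function returned -errno after the close() calls instead, the
      reported error could reflect close() rather than the write that
      actually failed.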
      
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/157269297769.394725.12634985106772698611.stgit@toke.dk
      d1b4574a
  4. Nov 02, 2019
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 1204c70d
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix free/alloc races in batmanadv, from Sven Eckelmann.
      
       2) Several leaks and other fixes in kTLS support of mlx5 driver, from
          Tariq Toukan.
      
       3) BPF devmap_hash cost calculation can overflow on 32-bit, from Toke
          Høiland-Jørgensen.
      
       4) Add an r8152 device ID, from Kazutoshi Noguchi.
      
       5) Missing include in ipv6's addrconf.c, from Ben Dooks.
      
       6) Use siphash in flow dissector, from Eric Dumazet. Attackers can
          easily infer the 32-bit secret otherwise etc.
      
       7) Several netdevice nesting depth fixes from Taehee Yoo.
      
       8) Fix several KCSAN reported errors, from Eric Dumazet. For example,
          when doing lockless skb_queue_empty() checks, and accessing
          sk_napi_id/sk_incoming_cpu lockless as well.
      
       9) Fix jumbo packet handling in RXRPC, from David Howells.
      
      10) Bump SOMAXCONN and tcp_max_syn_backlog values, from Eric Dumazet.
      
      11) Fix DMA synchronization in gve driver, from Yangchun Fu.
      
      12) Several bpf offload fixes, from Jakub Kicinski.
      
      13) Fix sk_page_frag() recursion during memory reclaim, from Tejun Heo.
      
      14) Fix ping latency during high traffic rates in hisilicon driver, from
          Jiangfent Xiao.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits)
        net: fix installing orphaned programs
        net: cls_bpf: fix NULL deref on offload filter removal
        selftests: bpf: Skip write only files in debugfs
        selftests: net: reuseport_dualstack: fix uninitalized parameter
        r8169: fix wrong PHY ID issue with RTL8168dp
        net: dsa: bcm_sf2: Fix IMP setup for port different than 8
        net: phylink: Fix phylink_dbg() macro
        gve: Fixes DMA synchronization.
        inet: stop leaking jiffies on the wire
        ixgbe: Remove duplicate clear_bit() call
        Documentation: networking: device drivers: Remove stray asterisks
        e1000: fix memory leaks
        i40e: Fix receive buffer starvation for AF_XDP
        igb: Fix constant media auto sense switching when no cable is connected
        net: ethernet: arc: add the missed clk_disable_unprepare
        igb: Enable media autosense for the i350.
        igb/igc: Don't warn on fatal read failures when the device is removed
        tcp: increase tcp_max_syn_backlog max value
        net: increase SOMAXCONN to 4096
        netdevsim: Fix use-after-free during device dismantle
        ...
      1204c70d
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs · 372bf6c1
      Linus Torvalds authored
      Pull NFS client bugfixes from Anna Schumaker:
       "This contains two delegation fixes (with the RCU lock leak fix marked
        for stable), and three patches to fix destroying the sunrpc back
        channel.
      
        Stable bugfixes:
      
         - Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
      
        Other fixes:
      
         - The TCP back channel mustn't disappear while requests are
           outstanding
      
         - The RDMA back channel mustn't disappear while requests are
           outstanding
      
         - Destroy the back channel when we destroy the host transport
      
         - Don't allow a cached open with a revoked delegation"
      
      * tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
        NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
        NFSv4: Don't allow a cached open with a revoked delegation
        SUNRPC: Destroy the back channel when we destroy the host transport
        SUNRPC: The RDMA back channel mustn't disappear while requests are outstanding
        SUNRPC: The TCP back channel mustn't disappear while requests are outstanding
      372bf6c1
    • Linus Torvalds's avatar
      Merge tag 'for-linus-20191101' of git://git.kernel.dk/linux-block · 0821de28
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - Two small nvme fixes, one is a fabrics connection fix, the other one
         a cleanup made possible by that fix (Anton, via Keith)
      
       - Fix requeue handling in umb ubd (Anton)
      
       - Fix spin_lock_irq() nesting in blk-iocost (Dan)
      
       - Three small io_uring fixes:
           - Install io_uring fd after done with ctx (me)
           - Clear ->result before every poll issue (me)
           - Fix leak of shadow request on error (Pavel)
      
      * tag 'for-linus-20191101' of git://git.kernel.dk/linux-block:
        iocost: don't nest spin_lock_irq in ioc_weight_write()
        io_uring: ensure we clear io_kiocb->result before each issue
        um-ubd: Entrust re-queue to the upper layers
        nvme-multipath: remove unused groups_only mode in ana log
        nvme-multipath: fix possible io hang after ctrl reconnect
        io_uring: don't touch ctx in setup after ring fd install
        io_uring: Fix leaked shadow_req
      0821de28
    • Linus Torvalds's avatar
      Merge tag 'riscv/for-v5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · e5897c7d
      Linus Torvalds authored
      Pull RISC-V fixes from Paul Walmsley:
       "One fix for PCIe users:
      
         - Fix legacy PCI I/O port access emulation
      
        One set of cleanups:
      
         - Resolve most of the warnings generated by sparse across arch/riscv.
           No functional changes
      
        And one MAINTAINERS update:
      
         - Update Palmer's E-mail address"
      
      * tag 'riscv/for-v5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        MAINTAINERS: Change to my personal email address
        RISC-V: Add PCIe I/O BAR memory mapping
        riscv: for C functions called only from assembly, mark with __visible
        riscv: fp: add missing __user pointer annotations
        riscv: add missing header file includes
        riscv: mark some code and data as file-static
        riscv: init: merge split string literals in preprocessor directive
        riscv: add prototypes for assembly language functions from head.S
      e5897c7d