  1. Nov 06, 2014
    • fou: Fix typo in returning flags in netlink · e1b2cb65
      Tom Herbert authored
      
      
      When filling netlink info, dport is being returned as flags. Fix
      the affected instances to return the correct value.
      
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: disable the tasklet by default · 93ffbeab
      hayeswang authored
      
      
      Enable the tasklet only after open(), and keep it disabled in
      every other situation. The tasklet is only needed for tx/rx after
      open(), so it can stay disabled by default.
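
      A minimal sketch of the pattern, with hypothetical struct and
      function names (the driver's real probe/open/close paths do more
      work than this):

        static void rtl_setup_tasklet(struct r8152 *tp)
        {
                tasklet_init(&tp->tl, r8152_bottom_half, (unsigned long)tp);
                tasklet_disable(&tp->tl);  /* no tx/rx work before open() */
        }

        static int rtl8152_open(struct net_device *netdev)
        {
                struct r8152 *tp = netdev_priv(netdev);

                tasklet_enable(&tp->tl);   /* tx/rx bottom half live from now on */
                return 0;
        }

        static int rtl8152_close(struct net_device *netdev)
        {
                struct r8152 *tp = netdev_priv(netdev);

                tasklet_disable(&tp->tl);  /* back to the disabled default */
                return 0;
        }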
      
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs · 4c672e4b
      Daniel Borkmann authored
      It has been reported that generating an MLD listener report on
      devices with large MTUs (e.g. 9000) and a high number of IPv6
      addresses can trigger a skb_over_panic():
      
      skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
      head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
      dev:port1
       ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:100!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: ixgbe(O)
      CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff80578226>] ? skb_put+0x3a/0x3b
       [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
       [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
       [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
       [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
       [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
       [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
       [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
       [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
       [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
       [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
      
      mld_newpack() skb allocations are usually requested with dev->mtu
      in size. Since commit 72e09ad1 ("ipv6: avoid high order
      allocations") we have changed the limit in order to be less likely
      to fail.
      
      However, in the MLD/IGMP code we have some rather ugly
      AVAILABLE(skb) macros, which determine whether we may end up doing
      an skb_put() to add another record. To avoid possible
      fragmentation, we check the skb's tailroom as skb->dev->mtu -
      skb->len, which is a wrong assumption, as the actual maximum
      allocation size can be much smaller.
      
      The IGMP case doesn't have this issue, as commit 57e1ab6e
      ("igmp: refine skb allocations") stores the allocation size in
      the cb[].
      
      Set a reserved_tailroom to make the skb fit into the MTU and use
      the skb_availroom() helper instead. This also allows us to get rid
      of igmp_skb_size().
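
      Condensed, the shape of the fix in mld_newpack() is roughly as
      follows (error handling elided, 'mtu' being the capped allocation
      size):

        skb = sock_alloc_send_skb(sk, hlen + tlen + size, 1, &err);
        if (skb) {
                skb_reserve(skb, hlen);
                /* Everything beyond the MTU is off-limits for appends. */
                skb->reserved_tailroom = skb_end_offset(skb) -
                                         min(mtu, skb_end_offset(skb));
        }

        /* The AVAILABLE() macros then test skb_availroom(skb), which
         * subtracts reserved_tailroom from the tailroom, instead of the
         * bogus skb->dev->mtu - skb->len estimate. */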
      
      Reported-by: Wei Liu <lw1a2.jing@gmail.com>
      Fixes: 72e09ad1 ("ipv6: avoid high order allocations")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: David L Stevens <david.stevens@oracle.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Convert SEQ_START_TOKEN/seq_printf to seq_puts · 1744bea1
      Joe Perches authored
      
      
      Using a single fixed string produces smaller code than using a
      format and many string arguments.
      
      Reduces overall code size a little.
      
      $ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o*
         text	   data	    bss	    dec	    hex	filename
        34269	   7012	  14824	  56105	   db29	net/ipv4/igmp.o.new
        34315	   7012	  14824	  56151	   db57	net/ipv4/igmp.o.old
        30078	   7869	  13200	  51147	   c7cb	net/ipv6/mcast.o.new
        30105	   7869	  13200	  51174	   c7e6	net/ipv6/mcast.o.old
        11434	   3748	   8580	  23762	   5cd2	net/ipv6/ip6_flowlabel.o.new
        11491	   3748	   8580	  23819	   5d0b	net/ipv6/ip6_flowlabel.o.old
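
      The conversion itself is mechanical; an illustrative (not actual)
      call site:

        /* Before: a format string with no arguments */
        seq_printf(seq, "Idx Device    Address         Flags\n");

        /* After: a fixed string, smaller call site */
        seq_puts(seq, "Idx Device    Address         Flags\n");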
      
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fast_hash: avoid indirect function calls · e5a2c899
      Hannes Frederic Sowa authored
      
      
      By default the arch_fast_hash hashing function pointers are
      initialized to jhash(2). If a CPU with SSE4.2 is detected during
      boot-up, they get updated to the CRC32 ones. This dispatching
      scheme incurs a function pointer lookup and an indirect call for
      every hashing operation.
      
      rhashtable, as one user of arch_fast_hash, additionally stores
      pointers to the hashing functions in its own structure, causing
      two indirect branches per hashing operation.
      
      Using alternative_call we can get rid of one of those indirect
      branches.
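
      For contrast, the boot-time dispatch scheme this replaces looks
      roughly like the following; the CRC32 helper name is illustrative:

        #include <linux/jhash.h>

        /* One pointer load plus an indirect call on every hash. */
        u32 (*arch_fast_hash2)(const u32 *data, u32 len, u32 seed) = jhash2;

        void __init setup_arch_fast_hash(void)
        {
                if (boot_cpu_has(X86_FEATURE_XMM4_2))   /* SSE4.2 present */
                        arch_fast_hash2 = intel_crc4_2_hash2;
        }

        /* With alternative_call(), the call site itself is patched at
         * boot to call the CRC32 variant directly, so the pointer load
         * and the indirect branch disappear. */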
      
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'amd-xgbe-next' · 2c99cd91
      David S. Miller authored
      
      
      Tom Lendacky says:
      
      ====================
      amd-xgbe: AMD XGBE driver updates 2014-11-04
      
      The following series of patches includes functional updates to the
      driver as well as some trivial changes for function renaming and
      spelling fixes.
      
      - Move channel and ring structure allocation into the device open path
      - Rename the pre_xmit function to dev_xmit
      - Explicitly use the u32 data type for the device descriptors
      - Use page allocation for the receive buffers
      - Add support for split header/payload receive
      - Add support for per DMA channel interrupts
      - Add support for receive side scaling (RSS)
      - Add support for ethtool receive side scaling commands
      - Fix the spelling of descriptors
      - After a PCS reset, sync the PCS and PHY modes
      - Add dependency on HAS_IOMEM to both the amd-xgbe and amd-xgbe-phy
        drivers
      
      This patch series is based on net-next.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe-phy: Let AMD_XGBE_PHY depend on HAS_IOMEM · 5cdec679
      Lendacky, Thomas authored
      
      
      The amd-xgbe-phy driver needs to perform ioremap calls, so add HAS_IOMEM
      to its build dependency.
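
      In Kconfig terms the change has roughly this shape (the
      surrounding entry text is illustrative):

        config AMD_XGBE_PHY
                tristate "Driver for the AMD 10GbE (amd-xgbe) PHYs"
                depends on OF_MDIO && HAS_IOMEM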
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe: Let AMD_XGBE depend on HAS_IOMEM · 474809b9
      Lendacky, Thomas authored
      
      
      The amd-xgbe driver needs to perform ioremap calls, so add HAS_IOMEM
      to its build dependency.
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe-phy: Sync PCS and PHY modes after reset · 0c95a1fa
      Lendacky, Thomas authored
      
      
      This patch adds support for syncing the states of the PCS and the
      PHY after a reset is performed. If the PCS and the PHY are not in
      the same state after reset, an extra mode change would be
      performed. This extra mode change is unnecessary if the PCS and
      the PHY are synced up after reset.
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe: Fix a spelling error · a7beaf23
      Lendacky, Thomas authored
      
      
      This patch fixes the spelling of the word "descriptor" in a couple
      of locations.
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe: Add receive side scaling ethtool support · f6ac8628
      Lendacky, Thomas authored
      
      
      This patch adds support for ethtool receive side scaling (RSS) commands.
      Support is added to get/set the RSS hash key and the RSS lookup table.
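
      This maps onto the standard ethtool rxfh operations; a sketch of
      the wiring, with assumed handler names:

        static const struct ethtool_ops xgbe_ethtool_ops = {
                /* ... existing operations ... */
                .get_rxfh_key_size      = xgbe_get_rxfh_key_size,
                .get_rxfh_indir_size    = xgbe_get_rxfh_indir_size,
                .get_rxfh               = xgbe_get_rxfh,
                .set_rxfh               = xgbe_set_rxfh,
        };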
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe: Provide support for receive side scaling · 5b9dfe29
      Lendacky, Thomas authored
      
      
      This patch provides support for receive side scaling (RSS). RSS allows
      for spreading incoming network packets across the Rx queues.  When used
      in conjunction with the per DMA channel interrupt support, this allows
      the receive processing to be spread across multiple processors.
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe: Add support for per DMA channel interrupts · 9227dc5e
      Lendacky, Thomas authored
      
      
      This patch provides support for interrupts that are generated by the
      Tx/Rx DMA channel pairs of the device.  This allows for Tx and Rx
      processing to run across multiple processors.
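
      Per DMA channel interrupts typically mean one IRQ requested per
      Tx/Rx channel pair, along these lines (structure and handler names
      assumed):

        for (i = 0; i < pdata->channel_count; i++) {
                struct xgbe_channel *channel = &pdata->channel[i];

                ret = devm_request_irq(pdata->dev, channel->dma_irq,
                                       xgbe_dma_isr, 0, channel->name,
                                       channel);
                if (ret)
                        return ret;
        }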
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe: Implement split header receive support · 174fd259
      Lendacky, Thomas authored
      
      
      Provide support for splitting IP packets so that the header and
      payload can be sent to different DMA addresses.  This will allow
      the IP header to be put into the linear part of the skb while the
      payload can be added as frags.
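
      Assembled into an skb, that looks roughly like this (buffer names
      assumed from the DMA setup):

        struct sk_buff *skb = netdev_alloc_skb_ip_align(netdev, hdr_len);

        if (skb) {
                /* IP header into the linear part... */
                memcpy(skb_put(skb, hdr_len), hdr_buf, hdr_len);
                /* ...payload page attached as a fragment. */
                skb_add_rx_frag(skb, 0, payload_page, payload_offset,
                                payload_len, buf_truesize);
        }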
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe: Use page allocations for Rx buffers · 08dcc47c
      Lendacky, Thomas authored
      
      
      Use page allocations for Rx buffers instead of pre-allocating skbs
      of a set size.
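
      A sketch of the allocation side (order and flags illustrative):

        /* Allocate a (possibly multi-page) buffer and map it for DMA,
         * rather than preallocating a fixed-size skb. */
        struct page *page = alloc_pages(GFP_ATOMIC | __GFP_COMP, order);
        dma_addr_t dma;

        if (page)
                dma = dma_map_page(dev, page, 0, PAGE_SIZE << order,
                                   DMA_FROM_DEVICE);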
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe: Use the u32 data type for descriptors · aa96bd3c
      Lendacky, Thomas authored
      
      
      The Tx and Rx descriptors are unsigned 32-bit values. Use the u32
      type, rather than unsigned int, to map these descriptors.
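
      That is, the descriptor mapping becomes explicit about width, as
      in a typical four-word descriptor layout:

        struct xgbe_ring_desc {
                u32 desc0;      /* u32 makes the 32-bit hardware word */
                u32 desc1;      /* width explicit, where unsigned int */
                u32 desc2;      /* merely implies it */
                u32 desc3;
        };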
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe: Rename pre_xmit function to dev_xmit · a9d41981
      Lendacky, Thomas authored
      
      
      The pre_xmit function name implies that it performs operations
      prior to transmitting the packet when in fact it is responsible
      for setting up the descriptors and initiating the transmit. Rename
      this function from pre_xmit to dev_xmit, which is consistent with
      the name used during receive processing - dev_read.
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe: Move ring allocation to device open · 4780b7ca
      Lendacky, Thomas authored
      
      
      Move the allocation of the channel and ring tracking structures
      to device open. This will allow future support for varying the
      number of Tx/Rx queues without unloading the module.
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: move INET6_MATCH() to include/net/inet6_hashtables.h · 25de4668
      WANG Cong authored
      
      
      It is only used in net/ipv6/inet6_hashtables.c.
      
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b
      David S. Miller authored
      
      
      This encapsulates all of the skb_copy_datagram_iovec() callers
      with call argument signature "skb, offset, msghdr->msg_iov, length".
      
      When we move to iov_iters in the networking, the iov_iter object will
      sit in the msghdr.
      
      Having a helper like this means there will be fewer places to
      touch during that transformation.
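
      The helper is presumably a thin static inline of this shape:

        static inline int skb_copy_datagram_msg(const struct sk_buff *from,
                                                int offset, struct msghdr *msg,
                                                int size)
        {
                return skb_copy_datagram_iovec(from, offset, msg->msg_iov,
                                               size);
        }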
      
      Based upon descriptions and patch from Al Viro.
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'gue-next' · 1d76c1d0
      David S. Miller authored
      Tom Herbert says:
      
      ====================
      gue: Remote checksum offload
      
      This patch set implements remote checksum offload for GUE, a
      mechanism that provides checksum offload of encapsulated packets
      using the rudimentary offload capabilities found in most Network
      Interface Card (NIC) devices. The outer UDP header checksum is
      enabled in packets and, with some additional meta information in
      the GUE header, a receiver is able to deduce the checksum to be
      set for an inner encapsulated packet. Effectively, this offloads
      the computation of the inner checksum. Enabling the outer checksum
      in encapsulation has the additional advantage that it covers more
      of the packet than the inner checksum would, including the
      encapsulation headers.
      
      Remote checksum offload is described in:
      http://tools.ietf.org/html/draft-herbert-remotecsumoffload-01
      
      
      
      The GUE transmit and receive paths are modified to support the
      remote checksum offload option. The option contains a checksum
      offset and a checksum start, which are directly derived from the
      values set in the stack when doing CHECKSUM_PARTIAL. On receipt of
      the option, the operation is to calculate the packet checksum from
      "start" to the end of the packet (as normally derived for checksum
      complete), and then set the resultant value at checksum "offset"
      (the checksum field has already been primed with the pseudo
      header). This emulates a NIC that implements NETIF_F_HW_CSUM.
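
      In pseudo-C, the receive-side fixup amounts to the following, with
      'start' and 'offset' taken relative to the start of the packet
      data (the real code also has to handle GRO and checksum-complete
      input):

        /* Sum [start, end); the primed pseudo-header value sitting at
         * 'offset' is included in the sum, exactly as a NETIF_F_HW_CSUM
         * device would include it. */
        __wsum sum = csum_partial(skb->data + start, skb->len - start, 0);

        /* Fold to 16 bits and write the result into the inner checksum
         * field. */
        *(__sum16 *)(skb->data + offset) = csum_fold(sum);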
      
      The primary purpose of this feature is to eliminate the cost of
      performing the checksum calculation over a packet when
      encapsulating.
      
      In this patch set:
        - Move fou_build_header into fou.c and split it into a couple of
          functions
        - Enable offloading of outer UDP checksum in encapsulation
        - Change udp_offload to support remote checksum offload; this
          includes a new GSO type and ensures that encapsulated layers
          (TCP) don't try to set a checksum covered by RCO
        - TX support for RCO with GUE. This is configured through
          ip_tunnel and sets the option on transmit when the packet
          being encapsulated is CHECKSUM_PARTIAL
        - RX support for RCO with GUE for normal and GRO paths. Includes
          resolving the offloaded checksum
      
      v2:
        Address comments from davem: Move accounting for private option
        field in gue_encap_hlen to patch in which we add the remote checksum
        offload option.
      
      Testing:
      
      I ran performance numbers using netperf TCP_STREAM and TCP_RR with 200
      streams, comparing GUE with and without remote checksum offload (doing
      checksum-unnecessary to complete conversion in both cases). These
      were run on mlnx4 and bnx2x. Some mlnx4 results are below.
      
      GRE/GUE
          TCP_STREAM
            IPv4, with remote checksum offload
              9.71% TX CPU utilization
              7.42% RX CPU utilization
              36380 Mbps
            IPv4, without remote checksum offload
              12.40% TX CPU utilization
              7.36% RX CPU utilization
              36591 Mbps
          TCP_RR
            IPv4, with remote checksum offload
              77.79% CPU utilization
              91/144/216 90/95/99% latencies
              1.95127e+06 tps
            IPv4, without remote checksum offload
              78.70% CPU utilization
              89/152/297 90/95/99% latencies
              1.95458e+06 tps
      
      IPIP/GUE
          TCP_STREAM
            With remote checksum offload
              10.30% TX CPU utilization
              7.43% RX CPU utilization
              36486 Mbps
            Without remote checksum offload
              12.47% TX CPU utilization
              7.49% RX CPU utilization
              36694 Mbps
          TCP_RR
            With remote checksum offload
              77.80% CPU utilization
              87/153/270 90/95/99% latencies
              1.98735e+06 tps
            Without remote checksum offload
              77.98% CPU utilization
              87/150/287 90/95/99% latencies
              1.98737e+06 tps
      
      SIT/GUE
          TCP_STREAM
            With remote checksum offload
              9.68% TX CPU utilization
              7.36% RX CPU utilization
              35971 Mbps
            Without remote checksum offload
              12.95% TX CPU utilization
              8.04% RX CPU utilization
              36177 Mbps
          TCP_RR
            With remote checksum offload
              79.32% CPU utilization
              94/158/295 90/95/99% latencies
              1.88842e+06 tps
            Without remote checksum offload
              80.23% CPU utilization
              94/149/226 90/95/99% latencies
              1.90338e+06 tps
      
      VXLAN
          TCP_STREAM
              35.03% TX CPU utilization
              20.85% RX CPU utilization
              36230 Mbps
          TCP_RR
              77.36% CPU utilization
              84/146/270 90/95/99% latencies
              2.08063e+06 tps
      
      We can also look at CPU time in csum_partial using perf (with bnx2x
      setup). For GRE with TCP_STREAM I see:
      
          With remote checksum offload
              0.33% TX
              1.81% RX
          Without remote checksum offload
              6.00% TX
              0.51% RX
      
      I suspect that the noticeable increase of time in csum_partial
      with remote checksum offload for RX is due to taking the cache
      miss on the encapsulated header in that function. By similar
      reasoning, if on the TX side the packet were not in cache (say we
      did a splice from a file whose data was never touched by the CPU),
      the CPU savings for TX would probably be more pronounced.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>