  1. Sep 22, 2016
  2. Sep 21, 2016
    • David S. Miller's avatar
      Merge branch 'mlxse-resource-query' · 2d7a8926
      David S. Miller authored
      
      
      Jiri Pirko says:
      
      ====================
      mlxsw: Replace Hw related const with resource query results
      
      Nogah says:
      
      Many of the ASIC's properties can be read from the HW with a resource query.
      This patchset adds new resources to the resource query and uses them
      in place of the constants that we currently use.
      Those resources are lag, kvd and router related.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d7a8926
    • Nogah Frankel's avatar
      mlxsw: spectrum: Implement max rif resource · 8f8a62d4
      Nogah Frankel authored
      
      
      Replace the max rif const with the result from the resource query.
      
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8f8a62d4
    • Nogah Frankel's avatar
      mlxsw: pci: Add max router interface resource · 274df7fb
      Nogah Frankel authored
      
      
      Add the max number of rif (router interfaces) to resource query.
      
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      274df7fb
    • Nogah Frankel's avatar
      mlxsw: pci: Add some miscellaneous resources · e44d49cb
      Nogah Frankel authored
      
      
      Add max system ports, max regions and max vlan groups to resource query.
      
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e44d49cb
    • Nogah Frankel's avatar
      mlxsw: spectrum: Implement max virtual routers resource · 9497c042
      Nogah Frankel authored
      
      
      Replace max virtual routers const with the result from
      the resource query.
      
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9497c042
    • Nogah Frankel's avatar
      mlxsw: pci: Add max virtual routers resource · b8a09f0a
      Nogah Frankel authored
      
      
      Add the max number of virtual routers to resource query.
      
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8a09f0a
    • Nogah Frankel's avatar
      mlxsw: profile: Add KVD resources to profile config · 403547d3
      Nogah Frankel authored
      
      
      Use resources from the resource query to determine values for
      the profile configuration.
      Add the KVD-determined section sizes to the resources struct.
      Change the profile struct and values to match these changes.
      
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      403547d3
    • Nogah Frankel's avatar
      mlxsw: pci: Add KVD size relate resources · 2acd10c5
      Nogah Frankel authored
      
      
      Add the KVD size and the minimum sizes of the single and double
      sections to the resource query.
      
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2acd10c5
    • Nogah Frankel's avatar
      mlxsw: spectrum: lag resources- use resources data instead of consts · ce0bd2b0
      Nogah Frankel authored
      
      
      Take the max lag and max ports in lag values from the result of the
      resource query instead of storing them in consts.
      
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce0bd2b0
    • Nogah Frankel's avatar
      mlxsw: pci: Add lag related resources to resources query · 9f7f797c
      Nogah Frankel authored
      
      
      Add max lag and max ports in lag resources to resources query.
      
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f7f797c
    • Or Gerlitz's avatar
      mlxsw: spectrum: Make offloads stats functions static · 4bdcc6ca
      Or Gerlitz authored
      The offloads stats functions are local to this file, so make them static.
      
      Fixes: fc1bbb0f ('mlxsw: spectrum: Implement offload stats ndo [..]')
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bdcc6ca
    • David S. Miller's avatar
      Merge branch 'tcp-bbr' · a624f93c
      David S. Miller authored
      
      
      Neal Cardwell says:
      
      ====================
      tcp: BBR congestion control algorithm
      
      This patch series implements a new TCP congestion control algorithm:
      BBR (Bottleneck Bandwidth and RTT). A paper with a detailed
      description of BBR will be published in ACM Queue, September-October
      2016, as "BBR: Congestion-Based Congestion Control". BBR is widely
      deployed in production at Google.
      
      The patch series starts with a set of supporting infrastructure
      changes, including a few that extend the congestion control
      framework. The last patch adds BBR as a TCP congestion control
      module. Please see individual patches for the details.
      
      - v3 -> v4:
       - Updated tcp_bbr.c in "tcp_bbr: add BBR congestion control"
         to use const to qualify all the constant parameters.
         Thanks to Stephen Hemminger.
       - In "tcp_bbr: add BBR congestion control", remove the bbr_rate_kbps()
         function, which had a 64-bit divide that would be problematic on some
         architectures, and just use bbr_rate_bytes_per_sec() directly.
         Thanks to Kenneth Klette Jonassen for suggesting this.
       - In "tcp: switch back to proper tcp_skb_cb size check in tcp_init()",
         switched from sizeof(skb->cb) to FIELD_SIZEOF.
         Thanks to Lance Richardson for suggesting this.
       - Updated "tcp_bbr: add BBR congestion control" commit message with
         performance data, more details about deployment at Google, and
         another reminder to use fq with BBR.
       - Updated tcp_bbr.c in "tcp_bbr: add BBR congestion control"
         to use MODULE_LICENSE("Dual BSD/GPL").
      
      - v2 -> v3: fix another issue caught by build bots:
       - adjust rate_sample struct initialization syntax to allow gcc-4.4 to compile
         the "tcp: track data delivery rate for a TCP connection" patch; also
         adjusted some similar syntax in "tcp_bbr: add BBR congestion control"
      
      - v1 -> v2: fix issues caught by build bots:
       - fix "tcp: export data delivery rate" to use rate64 instead of rate,
         so there is a 64-bit numerator for the do_div call
       - fix conflicting definitions for minmax caused by
         "tcp: use windowed min filter library for TCP min_rtt estimation"
         with a new commit:
         tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict
       - fix warning about the use of __packed in
         "tcp: track data delivery rate for a TCP connection",
         which involves the addition of a new commit:
         tcp: switch back to proper tcp_skb_cb size check in tcp_init()
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a624f93c
    • Neal Cardwell's avatar
      tcp_bbr: add BBR congestion control · 0f8782ea
      Neal Cardwell authored
      This commit implements a new TCP congestion control algorithm: BBR
      (Bottleneck Bandwidth and RTT). A detailed description of BBR will be
      published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
      "BBR: Congestion-Based Congestion Control".
      
      BBR has significantly increased throughput and reduced latency for
      connections on Google's internal backbone networks and google.com and
      YouTube Web servers.
      
      BBR requires only changes on the sender side, not in the network or
      the receiver side. Thus it can be incrementally deployed on today's
      Internet, or in datacenters.
      
      The Internet has predominantly used loss-based congestion control
      (largely Reno or CUBIC) since the 1980s, relying on packet loss as the
      signal to slow down. While this worked well for many years, loss-based
      congestion control is unfortunately out-dated in today's networks. On
      today's Internet, loss-based congestion control causes the infamous
      bufferbloat problem, often causing seconds of needless queuing delay,
      since it fills the bloated buffers in many last-mile links. On today's
      high-speed long-haul links using commodity switches with shallow
      buffers, loss-based congestion control has abysmal throughput because
      it over-reacts to losses caused by transient traffic bursts.
      
      In 1981 Kleinrock and Gale showed that the optimal operating point for
      a network maximizes delivered bandwidth while minimizing delay and
      loss, not only for single connections but for the network as a
      whole. Finding that optimal operating point has been elusive, since
      any single network measurement is ambiguous: network measurements are
      the result of both bandwidth and propagation delay, and those two
      cannot be measured simultaneously.
      
      While it is impossible to disambiguate any single bandwidth or RTT
      measurement, a connection's behavior over time tells a clearer
      story. BBR uses a measurement strategy designed to resolve this
      ambiguity. It combines these measurements with a robust servo loop
      using recent control systems advances to implement a distributed
      congestion control algorithm that reacts to actual congestion, not
      packet loss or transient queue delay, and is designed to converge with
      high probability to a point near the optimal operating point.
      
      In a nutshell, BBR creates an explicit model of the network pipe by
      sequentially probing the bottleneck bandwidth and RTT. On the arrival
      of each ACK, BBR derives the current delivery rate of the last round
      trip, and feeds it through a windowed max-filter to estimate the
      bottleneck bandwidth. Conversely it uses a windowed min-filter to
      estimate the round trip propagation delay. The max-filtered bandwidth
      and min-filtered RTT estimates form BBR's model of the network pipe.
      
      Using its model, BBR sets control parameters to govern sending
      behavior. The primary control is the pacing rate: BBR applies a gain
      multiplier to transmit faster or slower than the observed bottleneck
      bandwidth. The conventional congestion window (cwnd) is now the
      secondary control; the cwnd is set to a small multiple of the
      estimated BDP (bandwidth-delay product) in order to allow full
      utilization and bandwidth probing while bounding the potential amount
      of queue at the bottleneck.
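
The two controls described above can be sketched in a few lines of user-space C. This is an illustrative sketch only, not the code in tcp_bbr.c: the struct, the function names, and the use of percentage gains (to avoid floating point) are all assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical model of the network pipe; names are illustrative. */
struct pipe_model {
    uint64_t btl_bw_bytes_per_sec; /* max-filtered bottleneck bandwidth */
    uint32_t min_rtt_us;           /* min-filtered round-trip propagation delay */
};

/* Bandwidth-delay product in bytes: bandwidth * RTT. */
static uint64_t bdp_bytes(const struct pipe_model *m)
{
    return m->btl_bw_bytes_per_sec * m->min_rtt_us / 1000000u;
}

/* Primary control: pacing rate = gain * estimated bottleneck bandwidth.
 * The gain is given in percent (125 means 1.25x). */
static uint64_t pacing_rate(const struct pipe_model *m, uint32_t gain_pct)
{
    return m->btl_bw_bytes_per_sec * gain_pct / 100u;
}

/* Secondary control: cwnd = a small multiple of the BDP,
 * converted to packets and rounded up. mss must be nonzero. */
static uint32_t cwnd_packets(const struct pipe_model *m, uint32_t gain_pct,
                             uint32_t mss)
{
    uint64_t bytes = bdp_bytes(m) * gain_pct / 100u;

    return (uint32_t)((bytes + mss - 1) / mss);
}
```

For a 10 Mbps bottleneck (1,250,000 bytes/sec) with a 40 ms min RTT, the BDP in this sketch is 50,000 bytes, so a cwnd gain of 2x with a 1500-byte MSS gives 67 packets.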
      
      When a BBR connection starts, it enters STARTUP mode and applies a
      high gain to perform an exponential search to quickly probe the
      bottleneck bandwidth (doubling its sending rate each round trip, like
      slow start). However, instead of continuing until it fills up the
      buffer (i.e. a loss), or until delay or ACK spacing reaches some
      threshold (like Hystart), it uses its model of the pipe to estimate
      when that pipe is full: it estimates the pipe is full when it notices
      the estimated bandwidth has stopped growing. At that point it exits
      STARTUP and enters DRAIN mode, where it reduces its pacing rate to
      drain the queue it estimates it has created.
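
A minimal sketch of the "bandwidth has stopped growing" test follows. The growth threshold (25%) and the number of rounds (3) are assumed tuning values chosen for illustration, and the struct and function names are hypothetical; see tcp_bbr.c for the actual logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed tuning values, for illustration only. */
#define FULL_BW_GROWTH_PCT 125u /* "still growing" means >= 1.25x the baseline */
#define FULL_BW_ROUNDS     3u   /* rounds without growth before the pipe is "full" */

struct startup_state {
    uint64_t full_bw;     /* baseline bandwidth from the last growth step */
    uint32_t full_bw_cnt; /* consecutive rounds without significant growth */
    bool     pipe_full;   /* true once STARTUP should be exited */
};

/* Call once per round trip with the current max-filtered bandwidth. */
static void check_pipe_full(struct startup_state *s, uint64_t bw)
{
    if (s->pipe_full)
        return;

    if (bw * 100u >= s->full_bw * FULL_BW_GROWTH_PCT) {
        /* Bandwidth is still growing: record a new baseline. */
        s->full_bw = bw;
        s->full_bw_cnt = 0;
        return;
    }

    /* No significant growth this round. */
    if (++s->full_bw_cnt >= FULL_BW_ROUNDS)
        s->pipe_full = true;
}
```

Requiring several flat rounds, rather than one, keeps a single noisy bandwidth sample from ending the exponential search early.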
      
      Then BBR enters steady state. In steady state, PROBE_BW mode cycles
      between first pacing faster to probe for more bandwidth, then pacing
      slower to drain any queue that was created if no more bandwidth was
      available, and then cruising at the estimated bandwidth to utilize the
      pipe without creating excess queue. Occasionally, on an as-needed
      basis, it sends significantly slower to probe for RTT (PROBE_RTT
      mode).
      
      BBR has been fully deployed on Google's wide-area backbone networks
      and we're experimenting with BBR on Google.com and YouTube on a global
      scale.  Replacing CUBIC with BBR has resulted in significant
      improvements in network latency and application (RPC, browser, and
      video) metrics. For more details please refer to our upcoming ACM
      Queue publication.
      
      Example performance results, to illustrate the difference between BBR
      and CUBIC:
      
      Resilience to random loss (e.g. from shallow buffers):
        Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
        path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
        rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).
      
      Low latency with the bloated buffers common in today's last-mile links:
        Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
        path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
        buffer. Both fully utilize the bottleneck bandwidth, but BBR
        achieves this with a median RTT 25x lower (43 ms instead of 1.09
        secs).
      
      Our long-term goal is to improve the congestion control algorithms
      used on the Internet. We are hopeful that BBR can help advance the
      efforts toward this goal, and motivate the community to do further
      research.
      
      Test results, performance evaluations, feedback, and BBR-related
      discussions are very welcome in the public e-mail list for BBR:
      
        https://groups.google.com/forum/#!forum/bbr-dev
      
      
      
      NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
      enabled, since pacing is integral to the BBR design and
      implementation. BBR without pacing would not function properly, and
      may incur unnecessarily high packet loss rates.
      
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0f8782ea
    • Neal Cardwell's avatar
      tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88 · 7e744171
      Neal Cardwell authored
      
      
      The TCP CUBIC module already uses 64 bytes.
      The upcoming TCP BBR module uses 88 bytes.
      
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e744171
    • Yuchung Cheng's avatar
      tcp: new CC hook to set sending rate with rate_sample in any CA state · c0402760
      Yuchung Cheng authored
      
      
      This commit introduces an optional new "omnipotent" hook,
      cong_control(), for congestion control modules. The cong_control()
      function is called at the end of processing an ACK (i.e., after
      updating sequence numbers, the SACK scoreboard, and loss
      detection). At that moment we have precise delivery rate information
      the congestion control module can use to control the sending behavior
      (using cwnd, TSO skb size, and pacing rate) in any CA state.
      
      This function can also be used by a congestion control that prefers
      not to use the default cwnd reduction approach (i.e., the PRR
      algorithm) during CA_Recovery to control the cwnd and sending rate
      during loss recovery.
      
      We take advantage of the fact that recent changes defer the
      retransmission or transmission of new data (e.g. by F-RTO) in recovery
      until the new tcp_cong_control() function is run.
      
      With this commit, we only run tcp_update_pacing_rate() if the
      congestion control is not using this new API. New congestion controls
      which use the new API do not want the TCP stack to run the default
      pacing rate calculation and overwrite whatever pacing rate they have
      chosen at initialization time.
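
The dispatch described above can be sketched in user space as follows. Everything here is a hypothetical stand-in (the structs, ack_done(), the example callbacks, and the toy default pacing formula); only the shape of the decision, "call the optional hook if present, otherwise run the classic path plus the default pacing update", is taken from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

struct conn {                  /* hypothetical stand-in for connection state */
    uint32_t cwnd;
    uint64_t pacing_rate;
};

struct rate_sample_stub {      /* hypothetical stand-in for a rate sample */
    uint32_t delivered;        /* packets delivered over the interval */
    uint32_t interval_us;      /* length of the interval */
};

struct cc_ops {                /* hypothetical stand-in for the ops table */
    /* Optional: if set, the module controls cwnd and pacing itself. */
    void (*cong_control)(struct conn *c, const struct rate_sample_stub *rs);
    /* Classic hook, used only when cong_control is not set. */
    void (*cong_avoid)(struct conn *c, uint32_t acked);
};

/* Toy default pacing computation, standing in for the stack's own. */
static void default_update_pacing_rate(struct conn *c)
{
    c->pacing_rate = (uint64_t)c->cwnd * 1000u;
}

/* Runs at the end of ACK processing. */
static void ack_done(struct conn *c, const struct cc_ops *ops,
                     const struct rate_sample_stub *rs, uint32_t acked)
{
    if (ops->cong_control) {
        ops->cong_control(c, rs); /* module owns cwnd and pacing rate */
        return;                   /* default pacing update is skipped */
    }
    if (ops->cong_avoid)
        ops->cong_avoid(c, acked);
    default_update_pacing_rate(c);
}

/* Example module using the new hook: derives both controls from the sample. */
static void example_cong_control(struct conn *c,
                                 const struct rate_sample_stub *rs)
{
    if (rs->interval_us == 0)
        return;
    c->pacing_rate = (uint64_t)rs->delivered * 1000000u / rs->interval_us;
    c->cwnd = rs->delivered * 2u;
}

/* Example classic module: grows cwnd by the number of packets acked. */
static void example_cong_avoid(struct conn *c, uint32_t acked)
{
    c->cwnd += acked;
}
```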
      
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0402760
    • Yuchung Cheng's avatar
      tcp: allow congestion control to expand send buffer differently · 77bfc174
      Yuchung Cheng authored
      
      
      Currently the TCP send buffer expands to twice cwnd, in order to allow
      limited transmits in the CA_Recovery state. This assumes that cwnd
      does not increase in the CA_Recovery.
      
      For some congestion control algorithms, like the upcoming BBR module,
      if the losses in recovery do not indicate congestion then we may
      continue to raise cwnd multiplicatively in recovery. In such cases the
      current multiplier will falsely limit the sending rate, much as if it
      were limited by the application.
      
      This commit adds an optional congestion control callback to use a
      different multiplier to expand the TCP send buffer. For congestion
      control modules that do not specify this callback, TCP continues to
      use the previous default of 2.
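
As a rough sketch of the optional-callback pattern, assuming a simplified send-buffer formula of multiplier * cwnd * bytes-per-packet: the struct, the function names, and the example multiplier of 3 are illustrative assumptions, not the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_SNDBUF_EXPAND 2u /* the previous default named above */

struct cc_sndbuf_ops {           /* hypothetical stand-in for the ops table */
    uint32_t (*sndbuf_expand)(void); /* optional multiplier callback */
};

/* Target send buffer size in bytes for the given cwnd (in packets). */
static uint64_t sndbuf_target(const struct cc_sndbuf_ops *ops, uint32_t cwnd,
                              uint32_t bytes_per_packet)
{
    uint32_t factor = ops->sndbuf_expand ? ops->sndbuf_expand()
                                         : DEFAULT_SNDBUF_EXPAND;

    return (uint64_t)factor * cwnd * bytes_per_packet;
}

/* Example callback for a module whose cwnd may keep growing in recovery.
 * The value 3 is an illustrative choice. */
static uint32_t example_sndbuf_expand(void)
{
    return 3u;
}
```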
      
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      77bfc174
    • Neal Cardwell's avatar
      tcp: export tcp_mss_to_mtu() for congestion control modules · 556c6b46
      Neal Cardwell authored
      
      
      Export tcp_mss_to_mtu(), so that congestion control modules can use
      this to help calculate a pacing rate.
      
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      556c6b46
    • Neal Cardwell's avatar
      tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments · 1b3878ca
      Neal Cardwell authored
      
      
      To allow congestion control modules to use the default TSO auto-sizing
      algorithm as one of the ingredients in their own decision about TSO sizing:
      
      1) Export tcp_tso_autosize() so that CC modules can use it.
      
      2) Change tcp_tso_autosize() to allow callers to specify a minimum
         number of segments per TSO skb, in case the congestion control
         module has a different notion of the best floor for TSO skbs for
         the connection right now. For very low-rate paths or policed
         connections it can be appropriate to use smaller TSO skbs.
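
A simplified sketch of an auto-sizing function with a caller-supplied floor is shown below. It assumes the skb is sized to roughly one millisecond of data at the pacing rate (approximated by a right shift of 10) and capped by a maximum segment count; the function name and parameters are hypothetical and header overhead is ignored.

```c
#include <stdint.h>

/* Pick the number of segments per TSO skb.
 * pacing_rate is in bytes per second; mss must be nonzero.
 * min_tso_segs is the caller's floor, max_segs the device cap. */
static uint32_t tso_autosize_sketch(uint64_t pacing_rate, uint32_t mss,
                                    uint32_t min_tso_segs, uint32_t max_segs)
{
    uint64_t bytes = pacing_rate >> 10; /* about 1 ms worth of data */
    uint64_t segs = bytes / mss;

    if (segs < min_tso_segs)
        segs = min_tso_segs;            /* caller-chosen floor */
    if (segs > max_segs)
        segs = max_segs;                /* never exceed the cap */
    return (uint32_t)segs;
}
```

With a 1500-byte MSS, a 10 Mbps flow (1,250,000 bytes/sec) falls below one segment per millisecond, so the floor decides the result: a caller passing a floor of 1 gets smaller skbs than one passing 2.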
      
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b3878ca
    • Neal Cardwell's avatar
      tcp: allow congestion control module to request TSO skb segment count · ed6e7268
      Neal Cardwell authored
      
      
      Add the tso_segs_goal() function in tcp_congestion_ops to allow the
      congestion control module to specify the number of segments that
      should be in a TSO skb sent by tcp_write_xmit() and
      tcp_xmit_retransmit_queue(). The congestion control module can either
      request a particular number of segments in each TSO skb that we transmit,
      or return 0 if it doesn't care.
      
      This allows the upcoming BBR congestion control module to select small
      TSO skb sizes if the module detects that the bottleneck bandwidth is
      very low, or that the connection is policed to a low rate.
      
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ed6e7268
    • Yuchung Cheng's avatar
      tcp: export data delivery rate · eb8329e0
      Yuchung Cheng authored
      
      
      This commit exports two new fields in struct tcp_info:
      
        tcpi_delivery_rate: The most recent goodput, as measured by
          tcp_rate_gen(). If the socket is limited by the sending
          application (e.g., no data to send), it reports the highest
          measurement instead of the most recent. The unit is bytes per
          second (like other rate fields in tcp_info).
      
        tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
          was measured when the socket's throughput was limited by the
          sending application.
      
      This delivery rate information can be useful for applications that
      want to know the current throughput the TCP connection is seeing,
      e.g. adaptive bitrate video streaming. It can also be very useful for
      debugging or troubleshooting.
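
An application could read the two fields with getsockopt(TCP_INFO), roughly as sketched here. The helper name read_delivery_rate() is hypothetical, and the sketch assumes Linux UAPI headers recent enough to define tcpi_delivery_rate. Because older kernels return a shorter struct, the returned length is checked before the field is trusted.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

/* Read the delivery rate (bytes per second) of a TCP socket.
 * Returns 0 on success, -1 if the call fails or the running kernel
 * does not report the field. */
static int read_delivery_rate(int fd, unsigned long long *rate,
                              int *app_limited)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    memset(&info, 0, sizeof(info));
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) != 0)
        return -1;

    /* The kernel copies only as many bytes as it knows about. */
    if (len < offsetof(struct tcp_info, tcpi_delivery_rate) +
              sizeof(info.tcpi_delivery_rate))
        return -1;

    *rate = info.tcpi_delivery_rate;
    *app_limited = info.tcpi_delivery_rate_app_limited;
    return 0;
}
```

An adaptive bitrate sender might poll this periodically and discount samples where app_limited is set, since those reflect the application's own sending pattern rather than path capacity.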
      
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eb8329e0
    • Soheil Hassas Yeganeh's avatar
      tcp: track application-limited rate samples · d7722e85
      Soheil Hassas Yeganeh authored
      
      
      This commit adds code to track whether the delivery rate represented
      by each rate_sample was limited by the application.
      
      Upon each transmit, we store in the is_app_limited field in the skb a
      boolean bit indicating whether there is a known "bubble in the pipe":
      a point in the rate sample interval where the sender was
      application-limited, and did not transmit even though the cwnd and
      pacing rate allowed it.
      
      This logic marks the flow app-limited on a write if *all* of the
      following are true:
      
        1) There is less than 1 MSS of unsent data in the write queue
           available to transmit.
      
        2) There is no packet in the sender's queues (e.g. in fq or the NIC
           tx queue).
      
        3) The connection is not limited by cwnd.
      
        4) There are no lost packets to retransmit.
      
      The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
      the connection is application-limited at the moment. If the flow is
      application-limited, it sets the tp->app_limited field. If the flow is
      application-limited then that means there is effectively a "bubble" of
      silence in the pipe now, and this silence will be reflected in a lower
      bandwidth sample for any rate samples from now until we get an ACK
      indicating this bubble has exited the pipe: specifically, until we get
      an ACK for the next packet we transmit.
      
      Each time we send an skb, we record in scb->tx.is_app_limited whether the
      resulting rate sample will be application-limited.
      
      The code in tcp_rate_gen() checks to see when it is safe to mark all
      known application-limited bubbles of silence as having exited the
      pipe. It does this by checking to see when the delivered count moves
      past the tp->app_limited marker. At this point it zeroes the
      tp->app_limited marker, as all known bubbles are out of the pipe.
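
The four conditions and the marker handling can be sketched as below. The struct and field names are hypothetical simplifications, the marker value (delivered count plus packets in flight, with 0 reserved for "not limited") is an assumption for this example, and sequence wrap-around is ignored.

```c
#include <stdint.h>

struct flow {                 /* hypothetical, simplified connection state */
    uint32_t unsent_bytes;    /* unsent data in the write queue */
    uint32_t mss;
    uint32_t queued_below;    /* bytes still in the sender's queues (fq, NIC) */
    uint32_t packets_out;     /* packets currently in flight */
    uint32_t cwnd;
    uint32_t lost_out;        /* packets marked lost */
    uint32_t retrans_out;     /* lost packets already retransmitted */
    uint32_t delivered;       /* total packets delivered so far */
    uint32_t app_limited;     /* 0: not limited; else delivered count at
                                 which the last known bubble exits */
};

/* Called on a write: mark the flow app-limited if all four conditions hold. */
static void check_app_limited(struct flow *f)
{
    if (f->unsent_bytes < f->mss &&       /* 1) less than 1 MSS to send */
        f->queued_below == 0 &&           /* 2) nothing queued below TCP */
        f->packets_out < f->cwnd &&       /* 3) not limited by cwnd */
        f->lost_out <= f->retrans_out) {  /* 4) nothing left to retransmit */
        uint32_t marker = f->delivered + f->packets_out;

        f->app_limited = marker ? marker : 1u;
    }
}

/* Called per ACK: clear the marker once the delivered count moves past it. */
static void on_ack(struct flow *f, uint32_t newly_delivered)
{
    f->delivered += newly_delivered;
    if (f->app_limited && f->delivered > f->app_limited)
        f->app_limited = 0;
}
```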
      
      We make room for the tx.is_app_limited bit in the skb by borrowing a
      bit from the in_flight field used by NV to record the number of bytes
      in flight. The receive window in the TCP header is 16 bits, and the
      max receive window scaling shift factor is 14 (RFC 1323). So the max
      receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
      only need 30 bits for the tx.in_flight used by NV.
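
The bit budget can be checked with a small sketch. The struct layout and names below are illustrative assumptions, not the kernel's definition of the skb control block.

```c
#include <stdint.h>

#define WINDOW_FIELD_MAX 65535u /* largest value of the 16-bit window field */
#define MAX_WSCALE       14u    /* maximum window scale shift */

/* Hypothetical packing of the per-skb transmit info into one 32-bit word. */
struct tx_info_sketch {
    unsigned int in_flight:30;      /* bytes in flight when the skb was sent */
    unsigned int is_app_limited:1;  /* the borrowed bit */
    unsigned int unused:1;
};

/* Largest receive window the TCP header can express, in bytes. */
static uint32_t max_receive_window(void)
{
    return WINDOW_FIELD_MAX << MAX_WSCALE;
}
```

The largest expressible window is 65535 * 2^14 = 1,073,725,440 bytes, which is just under 2^30 = 1,073,741,824, so it fits in the 30-bit field.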
      
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d7722e85
    • Yuchung Cheng's avatar
      tcp: track data delivery rate for a TCP connection · b9f64820
      Yuchung Cheng authored
      
      
      This patch generates data delivery rate (throughput) samples on a
      per-ACK basis. These rate samples can be used by congestion control
      modules, and specifically will be used by TCP BBR in later patches in
      this series.
      
      Key state:
      
      tp->delivered: Tracks the total number of data packets (original or not)
      	       delivered so far. This is an already-existing field.
      
      tp->delivered_mstamp: the last time tp->delivered was updated.
      
      Algorithm:
      
      A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:
      
        d1: the current tp->delivered after processing the ACK
        t1: the current time after processing the ACK
      
        d0: the prior tp->delivered when the acked skb was transmitted
        t0: the prior tp->delivered_mstamp when the acked skb was transmitted
      
      When an skb is transmitted, we snapshot d0 and t0 in its control
      block in tcp_rate_skb_sent().
      
      When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
      or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
      to reflect the latest (d0, t0).
      
      Finally, tcp_rate_gen() generates a rate sample by storing
      (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.
      
      One caveat: if an skb was sent with no packets in flight, then
      tp->delivered_mstamp may be either invalid (if the connection is
      starting) or outdated (if the connection was idle). In that case,
      we'll re-stamp tp->delivered_mstamp.
      
      At first glance it seems t0 should always be the time when an skb was
      transmitted, but actually this could over-estimate the rate due to
      phase mismatch between transmit and ACK events. To track the delivery
      rate, we ensure that if packets are in flight then t0 and t1 are
      times at which packets were marked delivered.
      
      If the initial and final RTTs are different then one may be corrupted
      by some sort of noise. The noise we see most often is sending gaps
      caused by delayed, compressed, or stretched acks. This either affects
      both RTTs equally or artificially reduces the final RTT. We approach
      this by recording the info we need to compute the initial RTT
      (duration of the "send phase" of the window) when we recorded the
      associated inflight. Then, for a filter to avoid bandwidth
      overestimates, we generalize the per-sample bandwidth computation
      from:
      
          bw = delivered / ack_phase_rtt
      
      to the following:
      
          bw = delivered / max(send_phase_rtt, ack_phase_rtt)
      
      In large-scale experiments, this filtering approach incorporating
      send_phase_rtt is effective at avoiding bandwidth overestimates due to
      ACK compression or stretched ACKs.
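
The per-ACK computation, including the max() filter, can be sketched as follows. The function signatures and the packets-per-second helper are assumptions made for the example; only the formulas (d1 - d0), (t1 - t0), and max(send_phase_rtt, ack_phase_rtt) come from the text above.

```c
#include <stdint.h>

struct rate_sample_sketch {
    uint32_t delivered;   /* d1 - d0, in packets */
    uint64_t interval_us; /* max(send phase, ACK phase), in microseconds */
};

/* Build a rate sample for one ACK.
 * d0/t0_us: delivered count and delivery timestamp snapshotted when the
 *           acked skb was sent; d1/t1_us: the same values after this ACK.
 * send_phase_us: time over which the corresponding data was sent.
 * Returns 0 on success, -1 if the interval is empty. */
static int rate_gen_sketch(struct rate_sample_sketch *rs,
                           uint32_t d0, uint32_t d1,
                           uint64_t t0_us, uint64_t t1_us,
                           uint64_t send_phase_us)
{
    uint64_t ack_phase_us = t1_us - t0_us;

    rs->delivered = d1 - d0;
    /* Use the longer phase so compressed ACKs cannot inflate the rate. */
    rs->interval_us = send_phase_us > ack_phase_us ? send_phase_us
                                                   : ack_phase_us;
    return rs->interval_us ? 0 : -1;
}

/* Delivery rate in packets per second; interval_us must be nonzero. */
static uint64_t rate_pkts_per_sec(const struct rate_sample_sketch *rs)
{
    return (uint64_t)rs->delivered * 1000000u / rs->interval_us;
}
```

For example, if 10 packets were sent over 10 ms but their ACKs arrived compressed into 5 ms, the sample reports 1000 packets/sec rather than the 2000 packets/sec the ACK phase alone would suggest.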
      
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9f64820
    • Neal Cardwell's avatar
      tcp: count packets marked lost for a TCP connection · 0682e690
      Neal Cardwell authored
      
      
      Count the number of packets that a TCP connection marks lost.
      
      Congestion control modules can use this loss rate information for more
      intelligent decisions about how fast to send.
      
      Specifically, this is used in TCP BBR policer detection. BBR uses a
      high packet loss rate as one signal in its policer detection and
      policer bandwidth estimation algorithm.
      
      The BBR policer detection algorithm cannot simply track retransmits,
      because a retransmit can be (and often is) an indicator of packets
      lost long, long ago. This is particularly true in a long CA_Loss
      period that repairs the initial massive losses when a policer kicks
      in.
      
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0682e690
    • Eric Dumazet's avatar
      tcp: switch back to proper tcp_skb_cb size check in tcp_init() · b2d3ea4a
      Eric Dumazet authored
      Revert to the tcp_skb_cb size check that tcp_init() had before commit
      b4772ef8 ("net: use common macro for assering skb->cb[] available
      size in protocol families"). As related commit 744d5a3e ("net:
      move skb->dropcount to skb->cb[]") explains, the
      sock_skb_cb_check_size() mechanism was added to ensure that there is
      space for dropcount, "for protocol families using it". But TCP is not
      a protocol using dropcount, so tcp_init() doesn't need to provision
      space for dropcount in the skb->cb[], and thus we can revert to the
      older form of the tcp_skb_cb size check. Doing so allows TCP to use 4
      more bytes of the skb->cb[] space.
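
The difference between the two checks can be illustrated with stub types. All names below are hypothetical, and the 48-byte control block and 4-byte dropcount are assumptions chosen to match the "4 more bytes" stated above.

```c
#include <stddef.h>
#include <stdint.h>

/* Size of a struct member without needing an instance. */
#define FIELD_SIZEOF_SKETCH(t, f) (sizeof(((t *)0)->f))

struct sk_buff_stub     { char cb[48]; };        /* assumed cb[] size */
struct sock_skb_cb_stub { uint32_t dropcount; }; /* assumed 4-byte field */

/* Budget under the common check, which reserves room for dropcount. */
static size_t cb_budget_with_dropcount(void)
{
    return FIELD_SIZEOF_SKETCH(struct sk_buff_stub, cb) -
           sizeof(struct sock_skb_cb_stub);
}

/* Budget under the plain size check, which uses all of cb[]. */
static size_t cb_budget_plain(void)
{
    return FIELD_SIZEOF_SKETCH(struct sk_buff_stub, cb);
}

/* A protocol control block that needs the full budget. */
struct proto_cb_stub { char bytes[48]; };

/* Compile-time form of the plain check. */
_Static_assert(sizeof(struct proto_cb_stub) <=
               FIELD_SIZEOF_SKETCH(struct sk_buff_stub, cb),
               "protocol control block must fit in cb[]");
```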
      
      Fixes: b4772ef8 ("net: use common macro for assering skb->cb[] available size in protocol families")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b2d3ea4a