  Feb 08, 2023
    • Merge branch 'taprio-auto-qmaxsdu-new-tx' · e6ebe6c1
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      taprio automatic queueMaxSDU and new TXQ selection procedure
      
      This patch set addresses 2 design limitations in the taprio software scheduler:
      
      1. Software scheduling fundamentally prioritizes traffic incorrectly,
         in a way which was inspired by the Intel igb/igc drivers and does
         not follow the inputs user space gives (traffic classes and the
         TC to TXQ mapping). Patch 05/15 handles this; patches 01/15 -
         04/15 are preparations for this work.
      
      2. Software scheduling assumes that the gate for a traffic class
         closes as soon as the next interval begins. But this isn't true:
         if consecutive schedule entries have that traffic class gate
         open, there is no "gate close" event and taprio should keep
         dequeuing from that TC without interruptions (see the
         illustrative schedule below). Patches 06/15 - 15/15 handle this;
         patch 10/15 is a generic Qdisc change required for this to work.
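
      For illustration (this schedule is not from the cover letter),
      consider consecutive entries whose gate masks overlap: TC0's gate
      stays open across entries 1 and 2 with no "gate close" event in
      between, and TC1's across entries 2 and 3:

      tc qdisc replace dev $eth root taprio \
      	num_tc 2 \
      	map 0 1 \
      	queues 1@0 1@1 \
      	base-time 0 \
      	sched-entry S 0x1 3000000 \
      	sched-entry S 0x3 3000000 \
      	sched-entry S 0x2 3000000 \
      	clockid CLOCK_TAI \
      	flags 0x0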
      
      Future development directions which depend on this patch set are:
      
      - Propagating the automatic queueMaxSDU calculation down to offloading
        device drivers, instead of letting them calculate this, as
        vsc9959_tas_guard_bands_update() does today.
      
      - A software data path for tc-taprio with preemptible traffic and
        Hold/Release events.
      
      v1 at:
      https://patchwork.kernel.org/project/netdevbpf/cover/20230128010719.2182346-1-vladimir.oltean@nxp.com/
      
      
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: don't segment unnecessarily · 39b02d6d
      Vladimir Oltean authored
      
      
      Improve commit 497cc002 ("taprio: Handle short intervals and large
      packets") to only perform segmentation when skb->len exceeds what
      taprio_dequeue() expects.
      
      In practice, this will make the biggest difference when a traffic class
      gate is always open in the schedule. This is because the max_frm_len
      will be U32_MAX, and such large skb->len values as Kurt reported will be
      sent just fine unsegmented.
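
      A minimal sketch (hypothetical names, not the exact kernel code) of
      the sharpened condition: a GSO skb is split only if it is longer
      than the operational max frame length that taprio_dequeue() would
      enforce for its traffic class, and an always-open gate maps to a
      U32_MAX limit:

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      /* UINT32_MAX stands in for an always-open gate */
      static bool needs_segmentation(uint32_t skb_len, bool is_gso,
      		uint32_t max_frm_len)
      {
      	return is_gso && skb_len > max_frm_len;
      }

      int main(void)
      {
      	/* 64 kB GSO skb, always-open gate: sent just fine unsegmented */
      	printf("%d\n", needs_segmentation(65536, true, UINT32_MAX));
      	/* same skb against a 3000-byte frame limit: must be segmented */
      	printf("%d\n", needs_segmentation(65536, true, 3000));
      	return 0;
      }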
      
      What I haven't found a way to handle is ensuring that the segmented
      skbs themselves end up smaller than the maximum frame size given by
      the current queueMaxSDU[tc]. Nonetheless, we still need to drop such
      oversized segments, otherwise the Qdisc will hang.
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: split segmentation logic from qdisc_enqueue() · 2d5e8071
      Vladimir Oltean authored
      
      
      The majority of the taprio_enqueue() function is spent doing TCP
      segmentation, which doesn't look right to me. Compilers shouldn't
      have a problem inlining code no matter how we write it, so move the
      segmentation logic to a separate function.
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations · fed87cc6
      Vladimir Oltean authored
      
      
      taprio today has a huge problem with small TC gate durations, because it
      might accept packets in taprio_enqueue() which will never be sent by
      taprio_dequeue().
      
      Since not much infrastructure was available, a kludge was added in
      commit 497cc002 ("taprio: Handle short intervals and large
      packets"), which segmented large TCP segments, but the fact of the
      matter is that the issue isn't specific to large TCP segments (and even
      worse, the performance penalty in segmenting those is absolutely huge).
      
      In commit a54fc09e ("net/sched: taprio: allow user input of per-tc
      max SDU"), taprio gained support for queueMaxSDU, which is precisely the
      mechanism through which packets should be dropped at qdisc_enqueue() if
      they cannot be sent.
      
      After that patch, it was necessary for the user to manually limit the
      maximum MTU per TC. This change adds the necessary logic for taprio to
      further limit the values specified (or not specified) by the user to
      some minimum values which never allow oversized packets to be sent.
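
      As a hedged userspace illustration (not the kernel implementation),
      here is the core computation: how many bytes fit through a gate that
      stays open for a given duration at a given link speed, and what
      queueMaxSDU that implies once a 24-byte L1 overhead (preamble + SFD
      + FCS + IPG, matching "stab overhead 24") and the Ethernet header
      are subtracted. All names and the exact overhead accounting are
      illustrative assumptions:

      #include <stdint.h>
      #include <stdio.h>

      #define L1_OVERHEAD	24	/* preamble + SFD + FCS + IPG */
      #define ETH_HLEN	14	/* L2 header, not part of the SDU */

      static uint32_t gate_duration_to_max_sdu(uint64_t gate_ns, uint64_t link_kbps)
      {
      	/* bits = link_kbps * 1000 * gate_ns / 1e9, bytes = bits / 8 */
      	uint64_t wire_bytes = link_kbps * gate_ns / 8000000;

      	if (wire_bytes <= L1_OVERHEAD + ETH_HLEN)
      		return 0;	/* gate too short for any payload at all */
      	if (wire_bytes > UINT32_MAX)
      		wire_bytes = UINT32_MAX;	/* effectively always open */

      	return wire_bytes - L1_OVERHEAD - ETH_HLEN;
      }

      int main(void)
      {
      	/* 1 ms gate at 1 Gbps: 125000 wire bytes -> 124962-byte SDU */
      	printf("%u\n", gate_duration_to_max_sdu(1000000, 1000000));
      	/* 10 us gate at 100 Mbps: 125 wire bytes -> 87-byte SDU */
      	printf("%u\n", gate_duration_to_max_sdu(10000, 100000));
      	return 0;
      }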
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: keep the max_frm_len information inside struct sched_gate_list · a878fd46
      Vladimir Oltean authored
      
      
      I have one practical reason for doing this and one concerning correctness.
      
      The practical reason has to do with a follow-up patch, which aims to
      mix 2 sources of max_sdu (one coming from the user and the other
      automatically calculated based on TC gate durations at the current
      link speed). Among those 2 sources of input, we must always select
      the smaller max_sdu value, but this can change at various link
      speeds. So the max_sdu coming from the user must be kept separate
      from the value that is operationally used (the minimum of the 2),
      because otherwise we would overwrite it and forget what the user
      asked us to do.
      
      To solve that, this patch proposes that struct sched_gate_list contains
      the operationally active max_frm_len, and q->max_sdu contains just what
      was requested by the user.
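
      A minimal userspace sketch (illustrative names, not the kernel code)
      of why the two values live apart: the user's requested max_sdu stays
      untouched in the qdisc, while each sched_gate_list carries the
      operational frame limit, i.e. the smaller of the user's request and
      the auto-calculated value, plus the L2 header:

      #include <stdint.h>
      #include <stdio.h>

      static uint32_t operational_max_frm_len(uint32_t user_max_sdu, /* 0 = unset */
      		uint32_t auto_max_sdu, uint32_t hard_header_len)
      {
      	uint32_t max_sdu = user_max_sdu ? user_max_sdu : UINT32_MAX;

      	if (auto_max_sdu < max_sdu)
      		max_sdu = auto_max_sdu;

      	if (max_sdu == UINT32_MAX)
      		return UINT32_MAX;	/* always-open gate and no user cap */

      	return max_sdu + hard_header_len;
      }

      int main(void)
      {
      	/* the link slows down and the auto value drops below the user's
      	 * 1500: the operational limit follows, but the user's request
      	 * survives for when the link speeds up again */
      	printf("%u\n", operational_max_frm_len(1500, 800, 14));	/* 814 */
      	printf("%u\n", operational_max_frm_len(1500, UINT32_MAX, 14));	/* 1514 */
      	return 0;
      }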
      
      The reason having to do with correctness is based on the following
      observation: the admin sched_gate_list becomes operational at a given
      base_time in the future. Until then, it is inactive and applies no
      shaping, all gates are open, etc. So the queueMaxSDU dropping
      shouldn't apply either (this is a mechanism to ensure that packets
      which cannot fit within the largest gate duration for that TC don't
      hang the port; clearly it makes little sense if the gates are always
      open).
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: warn about missing size table · a3d91b2c
      Vladimir Oltean authored
      Vinicius intended taprio to take the L1 overhead into account when
      estimating packet transmission time through user input, specifically
      through the qdisc size table (man tc-stab).
      
      Something like this:
      
      tc qdisc replace dev $eth root stab overhead 24 taprio \
      	num_tc 8 \
      	map 0 1 2 3 4 5 6 7 \
      	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
      	base-time 0 \
      	sched-entry S 0x7e 9000000 \
      	sched-entry S 0x82 1000000 \
      	max-sdu 0 0 0 0 0 0 0 200 \
      	flags 0x0 clockid CLOCK_TAI
      
      Without the overhead being specified, transmission times will be
      underestimated and will cause late transmissions. For an offloading
      driver, it might even cause TX hangs if there is no open gate large
      enough to send the maximum sized packets for that TC (including L1
      overhead). Properly knowing the L1 overhead will ensure that we are able
      to auto-calculate the queueMaxSDU per traffic class just right, and
      avoid these hangs due to head-of-line blocking.
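
      To put a number on the underestimation (illustrative figures,
      assuming the 24-byte overhead above): at 1 Gbps, a 200-byte frame
      really occupies (200 + 24) * 8 = 1792 bit times, i.e. 1792 ns of
      wire time, while a stab-less estimate yields only 200 * 8 = 1600 ns.
      Each packet is thus underestimated by 192 ns, and the error
      accumulates over a gate interval.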
      
      We can't make the stab mandatory due to existing setups, but we can
      warn the user that it's important via a netlink extack warning.
      
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220505160357.298794-1-vladimir.oltean@nxp.com/
      
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: make stab available before ops->init() call · 1f62879e
      Vladimir Oltean authored
      
      
      Some qdiscs, like taprio, turn out to be pretty reliant on a
      well-configured stab in order to not underestimate the skb
      transmission time (by properly accounting for L1 overhead).
      
      In a future change, taprio will need the stab, if configured by the
      user, to be available at ops->init() time. It will become even more
      important in upcoming work, when the overhead will be used for the
      queueMaxSDU calculation that is passed to an offloading driver.
      
      However, rcu_assign_pointer(sch->stab, stab) is called right after
      ops->init(), making it unavailable, and I don't really see a good reason
      for that.
      
      Move it earlier, which nicely seems to simplify the error handling path
      as well.
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: calculate guard band against actual TC gate close time · a1e6ad30
      Vladimir Oltean authored
      
      
      taprio_dequeue_from_txq() looks at the entry->end_time to determine
      whether the skb will overrun its traffic class gate, as if at the end of
      the schedule entry there surely is a "gate close" event for it. Hint:
      maybe there isn't.
      
      For each schedule entry, introduce an array of kernel times which
      tracks when in the future there will be an *actual* gate close event
      for that traffic class, and use that in the guard band overrun
      calculation.
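
      The guard band check itself then reduces to something like this
      sketch (hypothetical names, simplified from the idea above): a
      packet may be dequeued only if its transmission finishes before the
      *actual* gate close time for its TC, which may lie several schedule
      entries in the future:

      #include <stdbool.h>
      #include <stdint.h>

      bool fits_before_gate_close(int64_t now_ns, int64_t tx_time_ns,
      		int64_t gate_close_ns /* per-entry, per-TC kernel time */)
      {
      	return now_ns + tx_time_ns <= gate_close_ns;
      }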
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: calculate budgets per traffic class · d2ad689d
      Vladimir Oltean authored
      
      
      Currently taprio assumes that the budget for a traffic class expires at
      the end of the current interval as if the next interval contains a "gate
      close" event for this traffic class.
      
      This is, however, an unfounded assumption. Allow schedule entry
      intervals to be fused together for a particular traffic class by
      calculating the budget until the gate *actually* closes.
      
      This means we need to keep budgets per traffic class, and we also need
      to update the budget consumption procedure.
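
      A userspace sketch (not the kernel code) of the consumption side:
      each per-TC budget is refilled when that TC's gate (re)opens, sized
      for the whole fused open window rather than a single entry, and is
      then decremented packet by packet on dequeue:

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define TC_MAX	8

      struct budgets {
      	uint32_t bytes[TC_MAX];	/* refilled at each per-TC gate open */
      };

      static bool consume_budget(struct budgets *b, int tc, uint32_t frame_len)
      {
      	if (frame_len > b->bytes[tc])
      		return false;	/* would overrun the real gate close */
      	b->bytes[tc] -= frame_len;
      	return true;
      }

      int main(void)
      {
      	struct budgets b = { .bytes = { [5] = 3000 } };

      	printf("%d\n", consume_budget(&b, 5, 1514));	/* 1: fits */
      	printf("%d\n", consume_budget(&b, 5, 1514));	/* 0: only 1486 left */
      	return 0;
      }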
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: rename close_time to end_time · e5517551
      Vladimir Oltean authored
      
      
      There is a confusion of terms in taprio: what is called "close_time"
      is actually used for 2 things:
      
      1. determining when an entry "closes" such that transmitted skbs are
         never allowed to overrun that time (?!)
      2. an aid for determining when to advance and/or restart the schedule
         using the hrtimer
      
      It makes more sense to call this so-called "close_time" "end_time",
      because it's not clear at all to me what "closes". Future patches will
      hopefully make better use of the term "to close".
      
      This is an absolutely mechanical change.
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: calculate tc gate durations · a306a90c
      Vladimir Oltean authored
      
      
      Current taprio code operates on a very simplistic (and incorrect)
      assumption: that egress scheduling for a traffic class can only take
      place for the duration of the current interval, or in other words,
      it assumes that at the end of each schedule entry, there is a "gate
      close" event for all traffic classes.
      
      As an example, traffic sent with the schedule below will be jumpy, even
      though all 8 TC gates are open, so there is absolutely no "gate close"
      event (effectively a transition from BIT(tc)==1 to BIT(tc)==0 in
      consecutive schedule entries):
      
      tc qdisc replace dev veth0 parent root taprio \
      	num_tc 2 \
      	map 0 1 \
      	queues 1@0 1@1 \
      	base-time 0 \
      	sched-entry S 0xff 4000000000 \
      	clockid CLOCK_TAI \
      	flags 0x0
      
      This qdisc simply does not have what it takes in terms of logic to
      *actually* compute the durations of traffic classes. Also, it does not
      recognize the need to use this information on a per-traffic-class basis:
      it always looks at entry->interval and entry->close_time.
      
      This change proposes that each schedule entry has an array called
      tc_gate_duration[tc]. This holds the information: "for how long will
      this traffic class gate remain open, starting from *this* schedule
      entry". If the traffic class gate is always open, that value is equal to
      the cycle time of the schedule.
      
      We'll also need to keep track, for the purpose of the queueMaxSDU[tc]
      calculation, of the maximum duration for which a traffic class has
      an open gate. This directly gives us the maximum-sized packet that
      this traffic class will have to accept; anything larger has to be
      dropped with qdisc_drop() in qdisc_enqueue().
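
      A userspace model of the proposed computation (illustrative, not the
      kernel code): starting from each schedule entry, a TC's gate
      duration accumulates the intervals of consecutive entries (wrapping
      around the schedule) for as long as that TC's gate bit stays set,
      clamped to the cycle time for an always-open gate:

      #include <stdint.h>
      #include <stdio.h>

      #define TC_MAX	8

      struct entry {
      	uint32_t gate_mask;	/* bit tc set = gate open */
      	uint64_t interval_ns;
      	uint64_t tc_gate_duration[TC_MAX];
      };

      static void compute_gate_durations(struct entry *e, int n, uint64_t cycle_ns)
      {
      	for (int i = 0; i < n; i++) {
      		for (int tc = 0; tc < TC_MAX; tc++) {
      			uint64_t d = 0;

      			/* walk forward from entry i while the gate stays open */
      			for (int k = 0; k < n; k++) {
      				const struct entry *cur = &e[(i + k) % n];

      				if (!(cur->gate_mask & (1U << tc)))
      					break;
      				d += cur->interval_ns;
      			}
      			if (d > cycle_ns)	/* gate open in every entry */
      				d = cycle_ns;
      			e[i].tc_gate_duration[tc] = d;
      		}
      	}
      }

      int main(void)
      {
      	/* overlapping-mask schedule: TC0 open in entries 0-1,
      	 * TC1 open in entries 1-2 */
      	struct entry sched[] = {
      		{ .gate_mask = 0x1, .interval_ns = 3000000 },
      		{ .gate_mask = 0x3, .interval_ns = 3000000 },
      		{ .gate_mask = 0x2, .interval_ns = 3000000 },
      	};

      	compute_gate_durations(sched, 3, 9000000);
      	/* entry 0, TC0: 6000000 ns, entries 0 and 1 fused */
      	printf("%llu\n", (unsigned long long)sched[0].tc_gate_duration[0]);
      	return 0;
      }

      The longest such window per TC, taken over all entries, is then the
      input to the queueMaxSDU[tc] calculation.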
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: give higher priority to higher TCs in software dequeue mode · 2f530df7
      Vladimir Oltean authored
      
      
      The current taprio software implementation is haunted by the shadow
      of the igb/igc hardware model. It iterates over child qdiscs in
      increasing order of TXQ index, therefore giving higher xmit priority
      to TXQ 0 and lower to TXQ N. According to discussions with Vinicius,
      that is the default (perhaps even unchangeable) prioritization
      scheme used for the NICs that taprio was first written for (igb,
      igc), and we have a case of two bugs canceling out, resulting in a
      functional setup on igb/igc, but a less sane one on other NICs.
      
      To the best of my understanding, taprio should prioritize based on the
      traffic class, so it should really dequeue starting with the highest
      traffic class and going down from there. We get to the TXQ using the
      tc_to_txq[] netdev property.
      
      TXQs within the same TC have the same (strict) priority, so we should
      pick from them as fairly as we can. We can achieve that by implementing
      something very similar to q->curband from multiq_dequeue().
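
      An illustrative sketch of the proposed dequeue order (hypothetical
      helpers, not the kernel code): scan traffic classes from highest to
      lowest, and within a TC rotate the starting TXQ, multiq-style, so
      that same-priority TXQs are served fairly:

      #include <stdio.h>

      #define TC_MAX	8

      struct txq_range {
      	int offset;	/* first TXQ of this TC, as in tc_to_txq[] */
      	int count;	/* number of TXQs in this TC */
      };

      struct sched_state {
      	struct txq_range tc_to_txq[TC_MAX];
      	int num_tc;
      	int cur_txq[TC_MAX];	/* per-TC round-robin position */
      };

      /* stub for illustration: whether this TXQ had a packet to give */
      static int dequeue_from(int txq)
      {
      	return txq == 2;
      }

      static int dequeue_highest_tc_first(struct sched_state *s)
      {
      	for (int tc = s->num_tc - 1; tc >= 0; tc--) {
      		const struct txq_range *r = &s->tc_to_txq[tc];

      		for (int i = 0; i < r->count; i++) {
      			int txq = r->offset + s->cur_txq[tc];

      			/* advance the rotor so the next dequeue starts
      			 * from the following same-priority TXQ */
      			if (++s->cur_txq[tc] >= r->count)
      				s->cur_txq[tc] = 0;

      			if (dequeue_from(txq))
      				return txq;
      		}
      	}
      	return -1;	/* nothing available in any TC */
      }

      int main(void)
      {
      	struct sched_state s = {
      		.tc_to_txq = { { 0, 2 }, { 2, 2 } },	/* TC0: 0-1, TC1: 2-3 */
      		.num_tc = 2,
      	};

      	printf("%d\n", dequeue_highest_tc_first(&s));	/* 2: TC1 served first */
      	return 0;
      }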
      
      Since igb/igc really do have TXQ 0 of higher hardware priority than
      TXQ 1 etc, we need to preserve the behavior for them as well. We really
      have no choice, because in txtime-assist mode, taprio is essentially a
      software scheduler towards offloaded child tc-etf qdiscs, so the TXQ
      selection really does matter (not all igb TXQs support ETF/SO_TXTIME,
      says Kurt Kanzenbach).
      
      To preserve the behavior, we need a capability bit so that taprio can
      determine if it's running on igb/igc, or on something else. Because igb
      doesn't offload taprio at all, we can't piggyback on the
      qdisc_offload_query_caps() call from taprio_enable_offload(), but
      instead we need a separate call which is also made for software
      scheduling.
      
      Introduce two static keys to minimize the performance penalty on systems
      which only have igb/igc NICs, and on systems which only have other NICs.
      For mixed systems, taprio will have to dynamically check whether to
      dequeue using one prioritization algorithm or using the other.
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>