Skip to content
  1. Apr 22, 2015
    • shli@kernel.org's avatar
      RAID5: batch adjacent full stripe write · 59fc630b
      shli@kernel.org authored
      
      
      stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
      unit. Idealy we should use big size for adjacent full stripe writes. Bigger
      stripe cache size means less stripes runing in the state machine so can reduce
      cpu overhead. And also bigger size can cause bigger IO size dispatched to under
      layer disks.
      
      With below patch, we will automatically batch adjacent full stripe write
      together. Such stripes will be added to the batch list. Only the first stripe
      of the list will be put to handle_list and so run handle_stripe(). Some steps
      of handle_stripe() are extended to cover all stripes of the list, including
      ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes
      running in handle_stripe() and we send IO of whole stripe list together to
      increase IO size.
      
      Stripes added to a batch list have some limitations. A batch list can only
      include full stripe write and can't cross chunk boundary to make sure stripes
      have the same parity disks. Stripes in a batch list must be in the same state
      (no written, toread and so on). If a stripe is in a batch list, all new
      read/write to add_stripe_bio will be blocked to overlap conflict till the batch
      list is handled. The limitations will make sure stripes in a batch list be in
      exactly the same state in the life circly.
      
      I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
      PCIe SSD. This patch improves around 30% performance and IO size to under layer
      disk is exactly 32k. I also run a 4k randwrite test in the same array to make
      sure the performance isn't changed with the patch.
      
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      59fc630b
    • shli@kernel.org's avatar
      raid5: track overwrite disk count · 7a87f434
      shli@kernel.org authored
      
      
      Track overwrite disk count, so we can know if a stripe is a full stripe write.
      
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      7a87f434
    • shli@kernel.org's avatar
      raid5: add a new flag to track if a stripe can be batched · da41ba65
      shli@kernel.org authored
      
      
      A freshly new stripe with write request can be batched. Any time the stripe is
      handled or new read is queued, the flag will be cleared.
      
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      da41ba65
    • shli@kernel.org's avatar
      raid5: use flex_array for scribble data · 46d5b785
      shli@kernel.org authored
      
      
      Use flex_array for scribble data. Next patch will batch several stripes
      together, so scribble data should be able to cover several stripes, so this
      patch also allocates scribble data for stripes across a chunk.
      
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      46d5b785
    • Heinz Mauelshagen's avatar
      md raid0: access mddev->queue (request queue member) conditionally because it... · 753f2856
      Heinz Mauelshagen authored
      
      md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
      
      The patch makes 3 references to mddev->queue in the raid0 personality
      conditional in order to allow for it to be accessed from dm-raid.
      Mandatory, because md instances underneath dm-raid don't manage
      a request queue of their own which'd lead to oopses without the patch.
      
      Signed-off-by: default avatarHeinz Mauelshagen <heinzm@redhat.com>
      Tested-by: default avatarHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      753f2856
    • NeilBrown's avatar
      md: allow resync to go faster when there is competing IO. · ac8fa419
      NeilBrown authored
      
      
      When md notices non-sync IO happening while it is trying
      to resync (or reshape or recover) it slows down to the
      set minimum.
      
      The default minimum might have made sense many years ago
      but the drives have become faster.  Changing the default
      to match the times isn't really a long term solution.
      
      This patch changes the code so that instead of waiting until the speed
      has dropped to the target, it just waits until pending requests
      have completed.
      This means that the delay inserted is a function of the speed
      of the devices.
      
      Testing shows that:
       - for some loads, the resync speed is unchanged.  For those loads
         increasing the minimum doesn't change the speed either.
         So this is a good result.  To increase resync speed under such
         loads we would probably need to increase the resync window
         size.
      
       - for other loads, resync speed does increase to a reasonable
         fraction (e.g. 20%) of maximum possible, and throughput of
         the load only drops a little bit (e.g. 10%)
      
       - for other loads, throughput of the non-sync load drops quite a bit
         more.  These seem to be latency-sensitive loads.
      
      So it isn't a perfect solution, but it is mostly an improvement.
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      ac8fa419
    • NeilBrown's avatar
      md: remove 'go_faster' option from ->sync_request() · 09314799
      NeilBrown authored
      
      
      This option is not well justified and testing suggests that
      it hardly ever makes any difference.
      
      The comment suggests there might be a need to wait for non-resync
      activity indicated by ->nr_waiting, however raise_barrier()
      already waits for all of that.
      
      So just remove it to simplify reasoning about speed limiting.
      
      This allows us to remove a 'FIXME' comment from raid5.c as that
      never used the flag.
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      09314799
    • NeilBrown's avatar
      md: don't require sync_min to be a multiple of chunk_size. · 50c37b13
      NeilBrown authored
      
      
      There is really no need for sync_min to be a multiple of
      chunk_size, and values read from here often aren't.
      That means you cannot read a value and expect to be able
      to write it back later.
      
      So remove the chunk_size check, and round down to a multiple
      of 4K, to be sure everything works with 4K-sector devices.
      
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      50c37b13
    • NeilBrown's avatar
      Merge branch 'cluster' into for-next · d51e4fe6
      NeilBrown authored
      d51e4fe6
    • Goldwyn Rodrigues's avatar
      md-cluster: re-add capabilities · 97f6cd39
      Goldwyn Rodrigues authored
      
      
      When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state,
      the clustered md:
      
      1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
         clear the Faulty bit in their respective rdev->flags.
      2. The node initiating re-add, gathers the bitmaps of all nodes
         and copies them into the local bitmap. It does not clear the bitmap
         from which it is copying.
      3. Initiating node schedules a md recovery to sync the devices.
      
      Signed-off-by: default avatarGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      97f6cd39
    • Goldwyn Rodrigues's avatar
      md: re-add a failed disk · a6da4ef8
      Goldwyn Rodrigues authored
      
      
      This adds the capability of re-adding a failed disk by
      writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.
      
      This facilitates adding disks which have encountered a temporary
      error such as a network disconnection/hiccup in an iSCSI device,
      or a SAN cable disconnection which has been restored. In such
      a situation, you do not need to remove and re-add the device.
      Writing re-add to the failed device's state would add it again
      to the array and perform the recovery of only the blocks which
      were written after the device failed.
      
      This works for generic md, and is not related to clustering. However,
      this patch is to ease re-add operations listed above in clustering
      environments.
      
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      a6da4ef8
    • Goldwyn Rodrigues's avatar
      md-cluster: remove capabilities · 88bcfef7
      Goldwyn Rodrigues authored
      
      
      This adds "remove" capabilities for the clustered environment.
      When a user initiates removal of a device from the array, a
      REMOVE message with disk number in the array is sent to all
      the nodes which kick the respective device in their own array.
      
      This facilitates the removal of failed devices.
      
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      88bcfef7
    • Goldwyn Rodrigues's avatar
      md: Export and rename find_rdev_nr_rcu · 57d051dc
      Goldwyn Rodrigues authored
      
      
      This is required by the clustering module (patches to follow) to
      find the device to remove or re-add.
      
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      57d051dc
    • Goldwyn Rodrigues's avatar
      md: Export and rename kick_rdev_from_array · fb56dfef
      Goldwyn Rodrigues authored
      
      
      This export is required for clustering module in order to
      co-ordinate remove/readd a rdev from all nodes.
      
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      fb56dfef
    • Guoqing Jiang's avatar
      md-cluster: correct the num for comparison · 8c58f02e
      Guoqing Jiang authored
      
      
      
      Since the node num of md-cluster is from zero, and
      cinfo->slot_number represents the slot num of dlm,
      no need to check for equality.
      
      Signed-off-by: default avatarGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      8c58f02e
  2. Apr 10, 2015
  3. Apr 08, 2015
    • Gu Zheng's avatar
      md: fix md io stats accounting broken · 74672d06
      Gu Zheng authored
      
      
      Simon reported the md io stats accounting issue:
      "
      I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
      It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
      the other two being write-mostly:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00   345.00    0.00    0.00    0.00   0.00 100.00
      md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00 58779.00    0.00    0.00    0.00   0.00 100.00
      md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00    12.00    0.00    0.00    0.00   0.00 100.00
      "
      The cause is commit "18c0b223" uses the
      generic_start_io_acct to account the disk stats rather than the open code,
      but it also introduced the increase to .in_flight[rw] which is needless to
      md. So we re-use the open code here to fix it.
      
      Reported-by: default avatarSimon Kirby <sim@hostway.ca>
      Cc: <stable@vger.kernel.org> 3.19
      Signed-off-by: default avatarGu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      74672d06
  4. Apr 07, 2015
    • Linus Torvalds's avatar
      Linux 4.0-rc7 · f22e6e84
      Linus Torvalds authored
      v4.0-rc7
      f22e6e84
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 442bb4ba
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) In TCP, don't register an FRTO for cumulatively ACK'd data that was
          previously SACK'd, from Neal Cardwell.
      
       2) Need to hold RNL mutex in ipv4 multicast code namespace cleanup,
          from Cong WANG.
      
       3) Similarly we have to hold RNL mutex for fib_rules_unregister(), also
          from Cong WANG.
      
       4) Revert and rework netns nsid allocation fix, from Nicolas Dichtel.
      
       5) When we encapsulate for a tunnel device, skb->sk still points to the
          user socket.  So this leads to cases where we retraverse the
          ipv4/ipv6 output path with skb->sk being of some other address
          family (f.e. AF_PACKET).  This can cause things to crash since the
          ipv4 output path is dereferencing an AF_PACKET socket as if it were
          an ipv4 one.
      
          The short term fix for 'net' and -stable is to elide these socket
          checks once we've entered an encapsulation sequence by testing
          xmit_recursion.
      
          Longer term we have a better solution wherein we pass the tunnel's
          socket down through the output paths, but that is way too invasive
          for 'net' and -stable.
      
          From Hannes Frederic Sowa.
      
       6) l2tp_init() failure path forgets to unregister per-net ops, from
          Cong WANG.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net/mlx4_core: Fix error message deprecation for ConnectX-2 cards
        net: dsa: fix filling routing table from OF description
        l2tp: unregister l2tp_net_ops on failure path
        mvneta: dont call mvneta_adjust_link() manually
        ipv6: protect skb->sk accesses from recursive dereference inside the stack
        netns: don't allocate an id for dead netns
        Revert "netns: don't clear nsid too early on removal"
        ip6mr: call del_timer_sync() in ip6mr_free_table()
        net: move fib_rules_unregister() under rtnl lock
        ipv4: take rtnl_lock and mark mrt table as freed on namespace cleanup
        tcp: fix FRTO undo on cumulative ACK of SACKed range
        xen-netfront: transmit fully GSO-sized packets
      442bb4ba
    • Jack Morgenstein's avatar
      net/mlx4_core: Fix error message deprecation for ConnectX-2 cards · fde913e2
      Jack Morgenstein authored
      
      
      Commit 1daa4303 ("net/mlx4_core: Deprecate error message at
      ConnectX-2 cards startup to debug") did the deprecation only for port 1
      of the card. Need to deprecate for port 2 as well.
      
      Fixes: 1daa4303 ("net/mlx4_core: Deprecate error message at ConnectX-2 cards startup to debug")
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarAmir Vadai <amirv@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fde913e2
    • Pavel Nakonechny's avatar
      net: dsa: fix filling routing table from OF description · 30303813
      Pavel Nakonechny authored
      
      
      According to description in 'include/net/dsa.h', in cascade switches
      configurations where there are more than one interconnected devices,
      'rtable' array in 'dsa_chip_data' structure is used to indicate which
      port on this switch should be used to send packets to that are destined
      for corresponding switch.
      
      However, dsa_of_setup_routing_table() fills 'rtable' with port numbers
      of the _target_ switch, but not current one.
      
      This commit removes redundant devicetree parsing and adds needed port
      number as a function argument. So dsa_of_setup_routing_table() now just
      looks for target switch number by parsing parent of 'link' device node.
      
      To remove possible misunderstandings with the way of determining target
      switch number, a corresponding comment was added to the source code and
      to the DSA device tree bindings documentation file.
      
      This was tested on a custom board with two Marvell 88E6095 switches with
      following corresponding routing tables: { -1, 10 } and { 8, -1 }.
      
      Signed-off-by: default avatarPavel Nakonechny <pavel.nakonechny@skitlab.ru>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      30303813
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 9e441639
      Linus Torvalds authored
      Pull input fixes from Dmitry Torokhov:
       "Updates for the input subsystem - two more tweaks for ALPS driver to
        work out kinks after splitting the touchpad, trackstick, and potential
        external PS/2 mouse into separate input devices.
      
        Changes to support ALPS SS4 devices (protocol V8) will be coming in
        4.1..."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: alps - document stick behavior for protocol V2
        Input: alps - report V2 Dualpoint Stick events via the right evdev node
        Input: alps - report interleaved bare PS/2 packets via dev3
      9e441639
    • WANG Cong's avatar
      67e04c29
    • Stas Sergeev's avatar
      mvneta: dont call mvneta_adjust_link() manually · ecf7b361
      Stas Sergeev authored
      
      
      mvneta_adjust_link() is a callback for of_phy_connect() and should
      not be called directly. The result of calling it directly is as below:
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ecf7b361
    • hannes@stressinduktion.org's avatar
      ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990
      hannes@stressinduktion.org authored
      
      
      We should not consult skb->sk for output decisions in xmit recursion
      levels > 0 in the stack. Otherwise local socket settings could influence
      the result of e.g. tunnel encapsulation process.
      
      ipv6 does not conform with this in three places:
      
      1) ip6_fragment: we do consult ipv6_npinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket ontop of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f60e5990
  5. Apr 06, 2015
  6. Apr 05, 2015
  7. Apr 04, 2015
  8. Apr 03, 2015