  1. Jul 24, 2012
    • sched: Fix race in task_group() · 8323f26c
      Peter Zijlstra authored
      Stefan reported a crash on a kernel before a3e5d109 ("sched:
      Don't call task_group() too many times in set_task_rq()"), he
      found the reason to be that the multiple task_group()
      invocations in set_task_rq() returned different values.
      
      Looking at all that I found a lack of serialization and plain
      wrong comments.
      
      The below tries to fix it using an extra pointer which is
      updated under the appropriate scheduler locks. It's not pretty,
      but I can't really see another way given how all the cgroup
      stuff works.
      
      Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task · 88b8dac0
      Srivatsa Vaddagiri authored
      
      
      Current load balance scheme requires only one cpu in a
      sched_group (balance_cpu) to look at other peer sched_groups for
      imbalance and pull tasks towards itself from a busy cpu. Tasks
      thus pulled by balance_cpu could later get picked up by cpus
      that are in the same sched_group as that of balance_cpu.
      
      This scheme however fails to pull tasks that are not allowed to
      run on balance_cpu (but are allowed to run on other cpus in its
      sched_group). That can affect fairness and in some worst case
      scenarios cause starvation.
      
      Consider a two core (2 threads/core) system running tasks as
      below:
      
                Core0            Core1
               /     \          /     \
              C0     C1        C2     C3
              |      |         |      |
              v      v         v      v
              F0     T1        F1     [idle]
                               T2
      
       F0 = SCHED_FIFO task (pinned to C0)
       F1 = SCHED_FIFO task (pinned to C2)
       T1 = SCHED_OTHER task (pinned to C1)
       T2 = SCHED_OTHER task (pinned to C1 and C2)
      
      F1 could become a cpu hog, which will starve T2 unless C1 pulls
      it. Between C0 and C1 however, C0 is required to look for
      imbalance between cores, which will fail to pull T2 towards
      Core0. T2 will starve eternally in this case. The same scenario
      can arise in presence of non-rt tasks as well (say we replace F1
      with high irq load).
      
      We tackle this problem by having balance_cpu move pinned tasks
      to one of its sibling cpus (where they can run). We first check
      if load balance goal can be met by ignoring pinned tasks,
      failing which we retry move_tasks() with a new env->dst_cpu.
      
      This patch modifies load balance semantics on who can move load
      towards a given cpu in a given sched_domain.
      
      Before this patch, a given_cpu or an ilb_cpu acting on behalf of
      an idle given_cpu is responsible for moving load to given_cpu.
      
      With this patch applied, balance_cpu can in addition decide on
      moving some load to a given_cpu.
      
      There is a remote possibility that excess load could get moved
      as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
      *independently* and at *same* time to move some load to a
      given_cpu). However we should see less of such conflicting
      decisions in practice and moreover subsequent load balance
      cycles should correct the excess load moved to given_cpu.
      
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4FE06CDB.2060605@linux.vnet.ibm.com
      [ minor edits ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
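The retry described in this commit message can be sketched in userspace C. This is only an illustration under stated assumptions: `struct task`, `pick_dst_cpu()` and the 32-bit CPU bitmasks are hypothetical stand-ins, not the kernel's `move_tasks()` or `struct lb_env`.

```c
#include <assert.h>

/* Hypothetical stand-in for a task; `allowed` is a bitmask of CPUs. */
struct task {
    const char *name;
    unsigned int allowed;
};

/*
 * Return the CPU the task could be pulled to, or -1 if there is none.
 * First try balance_cpu itself; if the task is pinned away from it,
 * retry with another CPU of the same group (group_mask).
 */
static int pick_dst_cpu(const struct task *t, int balance_cpu,
                        unsigned int group_mask)
{
    assert(balance_cpu >= 0 && balance_cpu < 32);

    if (t->allowed & (1u << balance_cpu))
        return balance_cpu;

    /* Pinned task: look for a sibling CPU where it is allowed to run. */
    unsigned int candidates = t->allowed & group_mask;
    for (int cpu = 0; cpu < 32; cpu++)
        if (candidates & (1u << cpu))
            return cpu;
    return -1;
}
```

With the example from the message, T2 is allowed on C1 and C2 (mask 0x6), balance_cpu is C0 and Core0 is the group {C0, C1} (mask 0x3), so the sketch falls back to C1 instead of giving up.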
    • sched: Reset loop counters if all tasks are pinned and we need to redo load balance · bbf18b19
      Prashanth Nageshappa authored
      
      
      While load balancing, if all tasks on the source runqueue are pinned,
      we retry after excluding the corresponding source cpu. However, loop counters
      env.loop and env.loop_break are not reset before retrying, which can lead
      to failure in moving the tasks. In this patch we reset env.loop and
      env.loop_break to their initial values before we retry.
      
      Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4FE06EEF.2090709@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
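Why stale counters defeat the retry can be shown with a small model. This is a sketch only: `struct env_sketch`, `scan_tasks()`, `reset_for_retry()` and the value of `LOOP_BREAK_INIT` are assumptions made for illustration and do not reproduce the kernel's load balancing code.

```c
#include <assert.h>

#define LOOP_BREAK_INIT 32u  /* assumed initial value, for illustration only */

struct env_sketch {
    unsigned int loop;       /* tasks examined so far */
    unsigned int loop_break; /* next point at which scanning would pause */
    unsigned int loop_max;   /* give up after examining this many tasks */
};

/* Examine up to nr_tasks tasks; return how many were actually looked at. */
static unsigned int scan_tasks(struct env_sketch *env, unsigned int nr_tasks)
{
    unsigned int seen = 0;

    assert(env->loop <= env->loop_max);
    while (seen < nr_tasks && env->loop < env->loop_max) {
        env->loop++;
        seen++;
    }
    return seen;
}

/* What the retry path needs to do before scanning another runqueue. */
static void reset_for_retry(struct env_sketch *env)
{
    env->loop = 0;
    env->loop_break = LOOP_BREAK_INIT;
}
```

In this model a second `scan_tasks()` call without `reset_for_retry()` examines nothing, because `loop` has already reached `loop_max`.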
    • sched: Reorder 'struct lb_env' members to reduce its size · 85c1e7da
      Prashanth Nageshappa authored
      
      
      Members of 'struct lb_env' are not in appropriate order to reuse compiler
      added padding on 64bit architectures. In this patch we reorder those struct
      members and help reduce the size of the structure from 96 bytes to 80
      bytes on 64 bit architectures.
      
      Suggested-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4FE06DDE.7000403@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
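The effect of member ordering on padding can be illustrated with two hypothetical structs. These are NOT the real 'struct lb_env' layouts; the member names and types are assumptions chosen only to show the technique of grouping same-sized members.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: 4-byte members interleaved with 8-byte ones. */
struct env_before {
    void *sd;
    int src_cpu;        /* typically followed by 4 bytes of padding on LP64 */
    void *src_rq;
    int dst_cpu;        /* padding again */
    void *dst_rq;
    int idle;           /* padding again */
    long imbalance;
    unsigned int flags; /* tail padding */
};

/* Same members, with pointer-sized members first and ints packed together. */
struct env_after {
    void *sd;
    void *src_rq;
    void *dst_rq;
    long imbalance;
    int src_cpu;
    int dst_cpu;
    int idle;
    unsigned int flags;
};

/* Number of bytes saved by the reordering on the current target. */
static size_t bytes_saved(void)
{
    assert(sizeof(struct env_after) <= sizeof(struct env_before));
    return sizeof(struct env_before) - sizeof(struct env_after);
}
```

On a typical LP64 target one would expect the first layout to need more space than the second; on a 32-bit target, where all members here are usually 4 bytes, the two sizes would normally be equal.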
    • sched: Improve scalability via 'CPU buddies', which withstand random perturbations · 970e1789
      Mike Galbraith authored
      
      
      Traversing an entire package is not only expensive, it also leads to tasks
      bouncing all over a partially idle and possibly quite large package.  Fix
      that up by assigning a 'buddy' CPU to try to motivate.  Each buddy may try
      to motivate that one other CPU; if it's busy, tough, it may then try its
      SMT sibling, but that's all this optimization is allowed to cost.
      
      Sibling cache buddies are cross-wired to prevent bouncing.
      
      4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:
      
       clients     1       2       4        8       16       32       64      128
       ..........................................................................
       pre        30      41     118      645     3769     6214    12233    14312
       post      299     603    1211     2418     4697     6847    11606    14557
      
      A nice increase in performance.
      
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1339471112.7352.32.camel@marge.simpson.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cpusets: Remove/update outdated comments · a1cd2b13
      Srivatsa S. Bhat authored
      
      
      cpuset_track_online_cpus() is no longer present. So remove the
      outdated comment and replace it with reference to cpuset_update_active_cpus()
      which is its equivalent.
      
      Also, we don't lack memory hot-unplug anymore. And David Rientjes pointed
      out how it is dealt with. So update that comment as well.
      
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20120524141700.3692.98192.stgit@srivatsabhat.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cpusets, hotplug: Restructure functions that are invoked during hotplug · 7ddf96b0
      Srivatsa S. Bhat authored
      
      
      Separate out the cpuset related handling for CPU/Memory online/offline.
      This also helps us exploit the most obvious and basic level of optimization
      that any notification mechanism (CPU/Mem online/offline) has to offer us:
      "We *know* why we have been invoked. So stop pretending that we are lost,
      and do only the necessary amount of processing!".
      
      And while at it, rename scan_for_empty_cpusets() to
      scan_cpusets_upon_hotplug(), which is more appropriate considering how
      it is restructured.
      
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cpusets, hotplug: Implement cpuset tree traversal in a helper function · 80d1fa64
      Srivatsa S. Bhat authored
      
      
      At present, the functions that deal with cpusets during CPU/Mem hotplug
      are quite messy, since a lot of the functionality is mixed up without clear
      separation. And this takes a toll on optimization as well. For example,
      the function cpuset_update_active_cpus() is called on both CPU offline and CPU
      online events; and it invokes scan_for_empty_cpusets(), which makes sense
      only for CPU offline events. And hence, the current code ends up unnecessarily
      traversing the cpuset tree during CPU online also.
      
      As a first step towards cleaning up those functions, encapsulate the cpuset
      tree traversal in a helper function, so as to facilitate upcoming changes.
      
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20120524141635.3692.893.stgit@srivatsabhat.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume · d35be8ba
      Srivatsa S. Bhat authored
      
      
      In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
      masks as and when necessary to ensure that the tasks belonging to the cpusets
      have some place (online CPUs) to run on. And regular CPU hotplug is
      destructive in the sense that the kernel doesn't remember the original cpuset
      configurations set by the user, across hotplug operations.
      
      However, suspend/resume (which uses CPU hotplug) is a special case in which
      the kernel has the responsibility to restore the system (during resume), to
      exactly the same state it was in before suspend.
      
      In order to achieve that, do the following:
      
      1. Don't modify cpusets during suspend/resume. At all.
         In particular, don't move the tasks from one cpuset to another, and
         don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
         during the CPU hotplug operations that are carried out in the
         suspend/resume path.
      
      2. However, cpusets and sched domains are related. We just want to avoid
         altering cpusets alone. So, to keep the sched domains updated, build
         a single sched domain (containing all active cpus) during each of the
         CPU hotplug operations carried out in s/r path, effectively ignoring
         the cpusets' cpus_allowed masks.
      
         (Since userspace is frozen while doing all this, it will go unnoticed.)
      
      3. During the last CPU online operation during resume, build the sched
         domains by looking up the (unaltered) cpusets' cpus_allowed masks.
         That will bring back the system to the same original state as it was in
         before suspend.
      
      Ultimately, this will not only solve the cpuset problem related to suspend
      resume (i.e., restore the cpusets to exactly what they were before suspend, by
      not touching them at all) but also speed up suspend/resume because we avoid
      running cpuset update code for every CPU being offlined/onlined.
      
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/x86: Remove broken power estimation · ee08d128
      Peter Zijlstra authored
      
      
      The x86 sched power implementation has been broken forever and gets in
      the way of other stuff, remove it.
      
      [ For archaeological interest, fixing this code would require dealing
        with the cross-cpu calling of these functions and more importantly, we
        need to filter idle time out of the a/m-perf stuff because the ratio
        will go down to 0 when idle, giving a 0 capacity which is not what
        we'd want. ]
      
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Link: http://lkml.kernel.org/r/1339594110.8980.38.camel@twins
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. Jul 22, 2012
  3. Jul 21, 2012
    • Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · d75e2c9a
      Linus Torvalds authored
      Pull late MIPS fixes from Ralf Baechle:
       "This fixes a number of loose ends in the MIPS code and various bug
        fixes.

        Aside from dropping some patch that should not be in this pull request
        everything has sat in -next for quite a while and there are no known
        issues.
      
        The biggest patch in this patch set moves the allocation of an array
        that is aliased to a function (for runtime generated code) to
        assembler code.  This avoids an issue with certain toolchains when
        building for microMIPS."
      
      * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (35 commits)
        MIPS: PCI: Move fixups from __init to __devinit.
        MIPS: Fix bug.h MIPS build regression
        MIPS: sync-r4k: remove redundant irq operation
        MIPS: smp: Warn on too early irq enable
        MIPS: call set_cpu_online() on cpu being brought up with irq disabled
        MIPS: call ->smp_finish() a little late
        MIPS: Yosemite: delay irq enable to ->smp_finish()
        MIPS: SMTC: delay irq enable to ->smp_finish()
        MIPS: BMIPS: delay irq enable to ->smp_finish()
        MIPS: Octeon: delay enable irq to ->smp_finish()
        MIPS: Oprofile: Fix build as a module.
        MIPS: BCM63XX: Fix BCM6368 IPSec clock bit
        MIPS: perf: Fix build error caused by unused counters_per_cpu_to_total()
        MIPS: Fix Magic SysRq L kernel crash.
        MIPS: BMIPS: Fix duplicate header inclusion.
        mips: mark const init data with __initconst instead of __initdata
        MIPS: cmpxchg.h: Add missing include
        MIPS: Malta may also be equipped with MIPS64 R2 processors.
        MIPS: Fix typo multipy -> multiply
        MIPS: Cavium: Fix duplicate ARCH_SPARSEMEM_ENABLE in kconfig.
        ...
    • Merge tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm · 93517374
      Linus Torvalds authored
      Pull device-mapper discard fixes from Alasdair G Kergon:
        - avoid a crash in dm-raid1 when discards coincide with mirror
          recovery;
        - avoid discarding shared data that's still needed in dm-thin;
        - don't guarantee that discarded blocks will be wiped in dm-raid1.
      
      * tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
        dm raid1: set discard_zeroes_data_unsupported
        dm thin: do not send discards to shared blocks
        dm raid1: fix crash with mirror recovery and discard
    • Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd · ce9f8d6b
      Linus Torvalds authored
      Pull pnfs/ore fixes from Boaz Harrosh:
       "These are catastrophic fixes to the pnfs objects-layout that were just
        discovered.  They are also destined for @stable.
      
        I have found these and worked on them at around RC1 time but
        unfortunately went to the hospital for kidney stones and had a very
        slow recovery.  I refrained from sending them as is, before proper
        testing, and surely I have found a bug just yesterday.

        So now they are all well tested, and have my sign-off.  Other than
        fixing the problem at hand, and assuming there are no bugs in the new
        code, there is low risk to any surrounding code.  And in any case they
        affect only these paths that are now broken.  That is RAID5 in pnfs
        objects-layout code.  It does also affect exofs (which was not broken)
        but I have tested exofs and it is lower priority than objects-layout
        because no one is using exofs, but objects-layout has lots of users."
      
      * 'for-linus' of git://git.open-osd.org/linux-open-osd:
        pnfs-obj: Fix __r4w_get_page when offset is beyond i_size
        pnfs-obj: don't leak objio_state if ore_write/read fails
        ore: Unlock r4w pages in exact reverse order of locking
        ore: Remove support of partial IO request (NFS crash)
        ore: Fix NFS crash by supporting any unaligned RAID IO
    • Merge tag 'upstream-3.5-rc8' of git://git.infradead.org/linux-ubifs · 17934162
      Linus Torvalds authored
      Pull UBIFS free space fix-up bugfix from Artem Bityutskiy:
       "It's been reported already twice recently:
      
          http://lists.infradead.org/pipermail/linux-mtd/2012-May/041408.html
          http://lists.infradead.org/pipermail/linux-mtd/2012-June/042422.html
      
        and we finally have the fix.  I am quite confident the fix is correct
        because I could reproduce the problem with nandsim and verify the fix.
        It was also verified by Iwo (the reporter).
      
        I am also confident that this is OK to merge the fix so late because
        this patch affects only the fixup functionality, which is not used by
        most users."
      
      * tag 'upstream-3.5-rc8' of git://git.infradead.org/linux-ubifs:
        UBIFS: fix a bug in empty space fix-up
  4. Jul 20, 2012
    • dm raid1: set discard_zeroes_data_unsupported · 7c8d3a42
      Mikulas Patocka authored
      We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if
      the underlying disks support zero on discard.  So this patch sets
      ti->discard_zeroes_data_unsupported.
      
      For example, if the mirror is in the process of resynchronizing, it may
      happen that kcopyd reads a piece of data, then discard is sent on the
      same area and then kcopyd writes the piece of data to another leg.
      Consequently, the data is not zeroed.
      
      The flag was made available by commit 983c7db3
      (dm crypt: always disable discard_zeroes_data).
      
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: do not send discards to shared blocks · 650d2a06
      Mikulas Patocka authored
      When process_discard receives a partial discard that doesn't cover a
      full block, it sends this discard down to that block. Unfortunately, the
      block can be shared and the discard would corrupt the other snapshots
      sharing this block.
      
      This patch detects block sharing and ends the discard with success when
      sending it to the shared block.
      
      The above change means that if the device supports discard it can't be
      guaranteed that a discard request zeroes data. Therefore, we set
      ti->discard_zeroes_data_unsupported.
      
      Thin target discard support with this bug arrived in commit
      104655fd (dm thin: support discards).
      
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid1: fix crash with mirror recovery and discard · 751f188d
      Mikulas Patocka authored
      This patch fixes a crash when a discard request is sent during mirror
      recovery.
      
      Firstly, some background.  Generally, the following sequence happens during
      mirror synchronization:
      - function do_recovery is called
      - do_recovery calls dm_rh_recovery_prepare
      - dm_rh_recovery_prepare uses a semaphore to limit the number of
        simultaneously recovered regions (by default the semaphore value is 1,
        so only one region at a time is recovered)
      - dm_rh_recovery_prepare calls __rh_recovery_prepare,
        __rh_recovery_prepare asks the log driver for the next region to
        recover. Then, it sets the region state to DM_RH_RECOVERING. If there
        are no pending I/Os on this region, the region is added to
        quiesced_regions list. If there are pending I/Os, the region is not
        added to any list. It is added to the quiesced_regions list later (by
        dm_rh_dec function) when all I/Os finish.
      - when the region is on quiesced_regions list, there are no I/Os in
        flight on this region. The region is popped from the list in
        dm_rh_recovery_start function. Then, a kcopyd job is started in the
        recover function.
      - when the kcopyd job finishes, recovery_complete is called. It calls
        dm_rh_recovery_end. dm_rh_recovery_end adds the region to
        recovered_regions or failed_recovered_regions list (depending on
        whether the copy operation was successful or not).
      
      The above mechanism assumes that if the region is in DM_RH_RECOVERING
      state, no new I/Os are started on this region. When I/O is started,
      dm_rh_inc_pending is called, which increases reg->pending count. When
      I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
      If the count is zero and the region was in DM_RH_RECOVERING state,
      dm_rh_dec adds it to the quiesced_regions list.
      
      Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
      in DM_RH_RECOVERING state, it could be added to quiesced_regions list
      multiple times or it could be added to this list when kcopyd is copying
      data (it is assumed that the region is not on any list while kcopyd does
      its jobs). This results in memory corruption and crash.
      
      There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
      do not belong to any region, so they are always added to the sync list
      in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
      requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
      requests. These bypasses avoid the crash possibility described above.
      
      These bypasses were improperly implemented for REQ_DISCARD when
      the mirror target gained discard support in commit
      5fc2ffea (dm raid1: support discard).
      
      In do_writes, REQ_DISCARD requests are always added to the sync queue and
      immediately dispatched (even if the region is in DM_RH_RECOVERING).  However,
      dm_rh_inc and dm_rh_dec are called for REQ_DISCARD requests.  So it violates the
      rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list
      corruption described above.
      
      This patch changes it so that REQ_DISCARD requests follow the same path
      as REQ_FLUSH. This avoids the crash.
      
      Reference: https://bugzilla.redhat.com/837607
      
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
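The double-queueing hazard described in this commit message can be modelled in a few lines. This is a simplified sketch, not dm-region-hash code: `struct region_sketch`, the `times_queued` counter and the helper names are hypothetical, and list handling is reduced to counting how often a region would be put on the quiesced list.

```c
#include <assert.h>
#include <stdbool.h>

enum region_state { RH_CLEAN, RH_RECOVERING };

struct region_sketch {
    enum region_state state;
    int pending;       /* I/Os in flight on this region */
    int times_queued;  /* how often the region was put on the quiesced list */
};

static void inc_pending(struct region_sketch *r)
{
    r->pending++;
}

/* As described: the last I/O finishing on a recovering region queues it. */
static void dec_pending(struct region_sketch *r)
{
    assert(r->pending > 0);
    if (--r->pending == 0 && r->state == RH_RECOVERING)
        r->times_queued++;
}

/* Start recovery: queue the region at once if nothing is in flight. */
static void recovery_prepare(struct region_sketch *r)
{
    r->state = RH_RECOVERING;
    if (r->pending == 0)
        r->times_queued++;
}

/* Submit and complete one request; bypass=true skips the accounting. */
static void do_request(struct region_sketch *r, bool bypass)
{
    if (!bypass)
        inc_pending(r);
    /* ... the I/O itself would happen here ... */
    if (!bypass)
        dec_pending(r);
}
```

In this model, a request that goes through the pending accounting while the region is recovering queues the region a second time, whereas a request that takes the bypass path (as REQ_FLUSH does, and as REQ_DISCARD does after the fix) leaves the count at one.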
    • pnfs-obj: Fix __r4w_get_page when offset is beyond i_size · c999ff68
      Boaz Harrosh authored
      
      
      It is very common for the end of the file to be unaligned on
      stripe size. But since we know it's beyond the file's end,
      the XOR should be performed with all zeros.

      Old code used to just read zeros out of the OSD devices, which is a great
      waste. But what scares me more about this situation is that we now have
      pages attached to the file's mapping that are beyond i_size. I don't
      like the kind of bugs this calls for.

      Fix both by returning a global zero_page if the offset is beyond
      i_size.
      
      TODO:
      	Change the API to ->__r4w_get_page() so a NULL can be
      	returned without being considered as error, since XOR API
      	treats NULL entries as zero_pages.
      
      [Bug since 3.2. Should apply the same way to all Kernels since]
      CC: Stable Tree <stable@kernel.org>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
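The idea of handing back a shared zero page for offsets past the end of the file can be sketched as follows. This is a userspace illustration only: `PAGE_SZ`, `get_page_sketch()` and `xor_into()` are hypothetical names, the page size is deliberately tiny, and the file buffer is assumed to be padded to a whole number of pages.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 16  /* tiny page size, for illustration only */

static const unsigned char zero_page[PAGE_SZ] = {0};  /* shared, all zeros */

/* Return the data page, or the shared zero page when offset is past EOF. */
static const unsigned char *get_page_sketch(const unsigned char *file_data,
                                            size_t i_size, size_t offset)
{
    assert(offset % PAGE_SZ == 0);
    if (offset >= i_size)
        return zero_page;       /* no I/O and no page attached to the file */
    return file_data + offset;
}

/* Fold one page into the running parity. */
static void xor_into(unsigned char *parity, const unsigned char *page)
{
    for (size_t i = 0; i < PAGE_SZ; i++)
        parity[i] ^= page[i];
}
```

Since XOR with zero leaves a value unchanged, folding the zero page into the parity gives the same result as reading zeros from the device, without the read.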
    • pnfs-obj: don't leak objio_state if ore_write/read fails · 9909d45a
      Boaz Harrosh authored
      
      
      [Bug since 3.2 Kernel]
      CC: Stable Tree <stable@kernel.org>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: Unlock r4w pages in exact reverse order of locking · 537632e0
      Boaz Harrosh authored
      
      
      The read-4-write pages are locked in address ascending order,
      but were unlocked in whatever way was easiest to code. Fix that:
      locks should be released in the opposite order of locking, i.e.
      descending address order.

      I have not hit this dead-lock. It was found by inspecting the
      debug print-outs. I suspect there is a higher lock at the caller that
      protects us, but fix it regardless.
      
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
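The ordering rule can be sketched without real locks by recording the order in which pages are released. `struct page_set` and both helpers are hypothetical and only model the ordering; they are not the ORE read-4-write code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PAGES 8

struct page_set {
    bool locked[MAX_PAGES];
    size_t unlock_order[MAX_PAGES];  /* records the order of unlocks */
    size_t nr_unlocked;
};

/* Take the locks in ascending index (address) order. */
static void lock_ascending(struct page_set *ps, size_t n)
{
    assert(n <= MAX_PAGES);
    for (size_t i = 0; i < n; i++)
        ps->locked[i] = true;
}

/* Release them in exactly the reverse order: highest index first. */
static void unlock_descending(struct page_set *ps, size_t n)
{
    assert(n <= MAX_PAGES && ps->nr_unlocked + n <= MAX_PAGES);
    for (size_t i = n; i-- > 0; ) {
        ps->locked[i] = false;
        ps->unlock_order[ps->nr_unlocked++] = i;
    }
}
```

The `for (size_t i = n; i-- > 0; )` form walks an unsigned index downwards without wrapping around.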
    • ore: Remove support of partial IO request (NFS crash) · 62b62ad8
      Boaz Harrosh authored
      
      
      Due to OOM situations the ore might fail to allocate all resources
      needed for IO of the full request. If some progress was possible
      it would proceed with a partial/short request, for the sake of
      forward progress.

      Since this crashes NFS-core and exofs is just fine without it, just
      remove this contraption, and fail.
      
      TODO:
      	Support real forward progress with some reserved allocations
      	of resources, such as mem pools and/or bio_sets
      
      [Bug since 3.2 Kernel]
      CC: Stable Tree <stable@kernel.org>
      CC: Benny Halevy <bhalevy@tonian.com>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ore: Fix NFS crash by supporting any unaligned RAID IO · 9ff19309
      Boaz Harrosh authored
      
      
      In RAID_5/6 we used to not permit an IO whose end
      byte is not stripe_size aligned and which spans more than one stripe,
      i.e. the caller must check whether the number of bytes actually
      transferred after submission is shorter, and would need to resubmit
      a new IO with the remainder.

      Exofs supports this, and NFS was supposed to support this
      as well with its short write mechanism. But late testing has
      exposed a CRASH when this is used with non-RPC layout-drivers.

      The change at NFS is deep and risky; in its place the fix
      at ORE to lift the limitation is actually clean and simple.
      So here it is below.
      
      The principle here is that in the case of unaligned IO on
      both ends, beginning and end, we will send two read requests:
      one like old code, before the calculation of the first stripe,
      and also a new site, before the calculation of the last stripe.
      If any "boundary" is aligned or the complete IO is within a single
      stripe, we do a single read like before.

      The code is clean and simple by splitting the old _read_4_write
      into 3 even parts:
      1. _read_4_write_first_stripe
      2. _read_4_write_last_stripe
      3. _read_4_write_execute

      And calling 1+3 at the same place as before, 2+3 before the last
      stripe, and in the case of all in a single stripe then 1+2+3
      is performed additively.
      
      Why did I not think of it before? Well, I had a stroke of
      genius because I have stared at this code for 2 years, and did
      not find this simple solution until today. Not that I did not try.

      This solution is much better for NFS than the previous supposed
      solution because the short write was dealt with out-of-band after
      IO_done, which would cause a seeky IO pattern, whereas here
      we execute in order. In both solutions we do 2 separate reads, only
      here we do it within a single IO request. (And actually combine two
      writes into a single submission.)

      NFS/exofs code need not change since the ORE API communicates the new
      shorter length on return; what will happen is that this case would not
      occur anymore.
      
      hurray!!
      
      [Stable: this is an NFS bug since the 3.2 kernel; should apply cleanly]
      CC: Stable Tree <stable@kernel.org>
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
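One reading of the rule above, counting how many read-for-write submissions an IO needs, can be sketched as a small helper. `count_r4w_reads()` is a hypothetical function, not part of the ORE API, and the case where both ends are stripe aligned is assumed here to need no read at all, which the message does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of read-for-write submissions for an IO of `length` bytes at
 * `offset`, following the rule in the commit message (sketch only).
 */
static int count_r4w_reads(uint64_t offset, uint64_t length,
                           uint64_t stripe_size)
{
    assert(stripe_size > 0 && length > 0);

    uint64_t end = offset + length;  /* one past the last byte */
    int head_unaligned = (offset % stripe_size) != 0;
    int tail_unaligned = (end % stripe_size) != 0;
    int single_stripe = (offset / stripe_size) == ((end - 1) / stripe_size);

    if (!head_unaligned && !tail_unaligned)
        return 0;  /* assumption: full stripes need nothing read back */
    if (single_stripe)
        return 1;  /* first and last stripe coincide: one combined read */
    return head_unaligned + tail_unaligned;
}
```

With a stripe size of 64, an IO at offset 10 of length 100 is unaligned at both ends and spans two stripes, so the helper reports two reads; the same offset with length 118 ends on a stripe boundary and needs only one.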
    • UBIFS: fix a bug in empty space fix-up · c6727932
      Artem Bityutskiy authored
      
      
      UBIFS has a feature called "empty space fix-up" which is a quirk to work around
      limitations of dumb flasher programs, namely those flashers that are unable
      to skip NAND pages full of 0xFFs while flashing, which leaves the empty space at
      the end of half-filled eraseblocks unusable for UBIFS. This feature is
      relatively new (introduced in v3.0).
      
      The fix-up routine (fixup_free_space()) is executed only once at the very first
      mount if the superblock has the 'space_fixup' flag set (can be done with -F
      option of mkfs.ubifs). It basically reads all the UBIFS data and metadata and
      writes it back to the same LEB. The routine assumes the image is pristine and
      does not have anything in the journal.
      
      There was a bug in 'fixup_free_space()' where it fixed up the log incorrectly.
      All but one LEB of the log of a pristine file-system are empty, and that one
      contains just a commit start node. 'fixup_free_space()' just unmapped this
      LEB, which wiped the commit start node. As a result, some users
      were unable to mount the file-system next time with the following symptom:
      
      UBIFS error (pid 1): replay_log_leb: first log node at LEB 3:0 is not CS node
      UBIFS error (pid 1): replay_log_leb: log error detected while replaying the log at LEB 3:0
      
      The root cause of this bug was that 'fixup_free_space()' wrongly assumed
      that the beginning of empty space in the log head (c->lhead_offs) was known
      on mount. However, this is not the case: it was always 0. UBIFS does not store
      it in the master node and instead finds it out by scanning the log on every
      mount.
      
      The fix is simple - just pass commit start node size instead of 0 to
      'fixup_leb()'.
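
      The effect of the length argument can be illustrated with a toy model.
      This is only a sketch and not UBIFS code: the names toy_fixup_leb(),
      toy_make_log_leb() and toy_has_cs_node(), the 64-byte eraseblock, the
      8-byte commit start node and the 0xC5 fill pattern are all made up for
      illustration. It assumes, as described above, that a fix-up with
      length 0 amounts to unmapping the LEB.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_LEB_SIZE   64  /* hypothetical eraseblock size */
#define TOY_CS_NODE_SZ 8   /* hypothetical commit start node size */
#define TOY_CS_BYTE    0xC5 /* hypothetical fill pattern for the CS node */

/* Toy stand-in for a LEB fix-up: keep the first len bytes, erase the rest. */
void toy_fixup_leb(unsigned char *leb, size_t len)
{
	unsigned char saved[TOY_LEB_SIZE];

	assert(len <= TOY_LEB_SIZE);
	memcpy(saved, leb, len);          /* read back the used part */
	memset(leb, 0xFF, TOY_LEB_SIZE);  /* with len == 0 this is a plain unmap */
	memcpy(leb, saved, len);          /* rewrite only the used part */
}

/* Build a pristine log LEB: a commit start node followed by empty space. */
void toy_make_log_leb(unsigned char *leb)
{
	memset(leb, 0xFF, TOY_LEB_SIZE);
	memset(leb, TOY_CS_BYTE, TOY_CS_NODE_SZ);
}

/* Return 1 if the commit start node bytes are still in place, else 0. */
int toy_has_cs_node(const unsigned char *leb)
{
	size_t i;

	for (i = 0; i < TOY_CS_NODE_SZ; i++)
		if (leb[i] != TOY_CS_BYTE)
			return 0;
	return 1;
}
```

      In this model, calling toy_fixup_leb() with length 0 on a pristine log
      LEB wipes the commit start node, while passing TOY_CS_NODE_SZ keeps it.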
      
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@linux.intel.com>
      Cc: stable@vger.kernel.org [v3.0+]
      Reported-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
      Tested-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
      Reported-by: James Nute <newten82@gmail.com>
      c6727932
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client · 85efc72a
      Linus Torvalds authored
      Pull last minute Ceph fixes from Sage Weil:
       "The important one fixes a bug in the socket failure handling behavior
        that was turned up in some recent failure injection testing.  The
        other two are minor bug fixes."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
        rbd: endian bug in rbd_req_cb()
        rbd: Fix ceph_snap_context size calculation
        libceph: fix messenger retry
      85efc72a
  5. Jul 19, 2012
    • Linus Torvalds's avatar
      Merge tag 'md-3.5-fixes' of git://neil.brown.name/md · 3e4b9459
      Linus Torvalds authored
      Pull three md bugfixes from NeilBrown:
       "One of the bugs was introduced in 3.5-rc1.  Others have been there for
        longer."
      
      * tag 'md-3.5-fixes' of git://neil.brown.name/md:
        md/raid1: close some possible races on write errors during resync
        md: avoid crash when stopping md array races with closing other open fds.
        md: fix bug in handling of new_data_offset
      3e4b9459
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 309d4b00
      Linus Torvalds authored
      Pull networking changes from David Miller:
       "Ok, we should be good to go now"
      
      1) We have to statically initialize the init_net device list head rather
         than do so in an initcall, otherwise netprio_cgroup crashes if it's
         built statically rather than modular (Mark D.  Rustad)
      
      2) Fix SKB null oopser in CIPSO ipv4 option processing (Paul Moore)
      
      3) Qlogic maintainers update (Anirban Chakraborty)
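
      Item 1 can be illustrated with a minimal sketch of why a statically
      initialized list head is safe to use before any init function has
      run. The toy_* types, the TOY_LIST_HEAD_INIT macro and both objects
      are hypothetical stand-ins, not the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical circular doubly linked list head. */
struct toy_list_head {
	struct toy_list_head *next, *prev;
};

/* Static initializer: an empty list points at itself in both directions. */
#define TOY_LIST_HEAD_INIT(name) { &(name), &(name) }

/* Hypothetical namespace object holding a device list. */
struct toy_net {
	struct toy_list_head dev_base_head;
};

/* Initialized at compile/link time, so valid before any code has run. */
struct toy_net toy_init_net = {
	.dev_base_head = TOY_LIST_HEAD_INIT(toy_init_net.dev_base_head),
};

/* Zero-filled until some later init function sets it up. */
struct toy_net toy_late_net;

/* An empty circular list is one whose head points back at itself. */
int toy_list_empty(const struct toy_list_head *head)
{
	assert(head != NULL);
	return head->next == head;
}
```

      Here toy_init_net is a valid empty list from the start, whereas
      toy_late_net holds NULL pointers until it is initialized at run
      time, so an early user walking that list would follow a NULL pointer.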
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net: Statically initialize init_net.dev_base_head
        MAINTAINERS: Changes in qlcnic and qlge maintainers list
        cipso: don't follow a NULL pointer when setsockopt() is called
      309d4b00
    • Linus Torvalds's avatar
      Merge branch 'upstream-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid · 61c901c5
      Linus Torvalds authored
      Pull HID update from Jiri Kosina:
       "A final round of changes for HID for 3.5: just device ID additions."
      
      * 'upstream-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
        HID: hid-multitouch: add support for Zytronic panels
        HID: add Sennheiser BTD500USB device support
        HID: add battery quirk for Apple Wireless ANSI
      61c901c5
    • Ezequiel Garcia's avatar
      cx25821: Remove bad strcpy to read-only char* · 380e99fc
      Ezequiel Garcia authored
      
      
      The strcpy was being used to set the name of the board. Since the
      destination char* was read-only and the name is set statically at
      compile time, this was both wrong and redundant.
      
      The type of char* is changed to const char* to prevent future errors.
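
      A minimal sketch of the idea, using a hypothetical toy_board
      structure and made-up board names rather than the driver's real
      definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical board descriptor, not the cx25821 driver's real structure. */
struct toy_board {
	const char *name;  /* const: points at a read-only string literal */
};

/* The names are fixed at compile time by the static initializer. */
static const struct toy_board toy_boards[] = {
	{ .name = "Toy Board A" },
	{ .name = "Toy Board B" },
};

/* Look up a board name; no copy is needed because the name is already set. */
const char *toy_board_name(size_t idx)
{
	assert(idx < sizeof(toy_boards) / sizeof(toy_boards[0]));
	/*
	 * strcpy(toy_boards[idx].name, "x") now draws a compiler diagnostic
	 * because it discards the const qualifier. With a plain char * member
	 * it compiled silently and wrote into a string literal, which is
	 * undefined behaviour.
	 */
	return toy_boards[idx].name;
}
```

      With the const qualifier in place, the mistake is flagged at build
      time instead of surfacing as a fault at run time.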
      
      Reported-by: Radek Masin <radek@masin.eu>
      Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
      [ Taking directly due to vacations - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      380e99fc
    • Sebastian Andrzej Siewior's avatar
      MIPS: PCI: Move fixups from __init to __devinit. · 85a053fa
      Sebastian Andrzej Siewior authored
      
      
      Fixups are executed once the PCI device is found, which is during the
      boot process, so __init seems fine as long as the platform does not
      support hotplug.
      However, it is possible to remove the PCI bus at run time and have it
      rediscovered via "echo 1 > /sys/bus/pci/rescan", and this will call
      the fixups again, after the __init sections have already been freed.
      
      [ralf@linux-mips.org: Made piixirqmap[] in malta_piix_func0_fixup()
      __initdata.]
      
      Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: linux-mips@linux-mips.org
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      85a053fa
    • Yoichi Yuasa's avatar
      MIPS: Fix bug.h MIPS build regression · 3592c3cd
      Yoichi Yuasa authored
      Commit 37778088 [bug.h: need linux/kernel.h for TAINT_WARN.]
      breaks all MIPS builds.
      
        CC      arch/mips/kernel/machine_kexec.o
      In file included from include/linux/kernel.h:20:0,
                       from include/asm-generic/bug.h:35,
                       from /home/yuasa/src/linux/kernel/git/linux-2.6/arch/mips/include/asm/bug.h:41,
                       from /home/yuasa/src/linux/kernel/git/linux-2.6/arch/mips/include/asm/bitops.h:20,
                       from include/linux/bitops.h:22,
                       from include/linux/signal.h:38,
                       from include/linux/elfcore.h:5,
                       from include/linux/kexec.h:60,
                       from arch/mips/kernel/machine_kexec.c:9:
      include/linux/log2.h: In function '__ilog2_u32':
      include/linux/log2.h:34:2: error: implicit declaration of function 'fls' [-Werror=implicit-function-declaration]
      include/linux/log2.h: In function '__ilog2_u64':
      include/linux/log2.h:42:2: error: implicit declaration of function 'fls64' [-Werror=implicit-function-declaration]
      include/linux/log2.h: In function '__roundup_pow_of_two':
      include/linux/log2.h:63:2: error: implicit declaration of function 'fls_long' [-Werror=implicit-function-declaration]
      In file included from include/linux/bitops.h:22:0,
                       from include/linux/signal.h:38,
                       from include/linux/elfcore.h:5,
                       from include/linux/kexec.h:60,
                       from arch/mips/kernel/machine_kexec.c:9:
      /home/yuasa/src/linux/kernel/git/linux-2.6/arch/mips/include/asm/bitops.h: At top level:
      /home/yuasa/src/linux/kernel/git/linux-2.6/arch/mips/include/asm/bitops.h:615:19: error: static declaration of 'fls' follows non-static declaration
      include/linux/log2.h:34:9: note: previous implicit declaration of 'fls' was here
      In file included from /home/yuasa/src/linux/kernel/git/linux-2.6/arch/mips/include/asm/bitops.h:651:0,
                       from include/linux/bitops.h:22,
                       from include/linux/signal.h:38,
                       from include/linux/elfcore.h:5,
                       from include/linux/kexec.h:60,
                       from arch/mips/kernel/machine_kexec.c:9:
      include/asm-generic/bitops/fls64.h:18:28: error: static declaration of 'fls64' follows non-static declaration
      include/linux/log2.h:42:9: note: previous implicit declaration of 'fls64' was here
      In file included from include/linux/signal.h:38:0,
                       from include/linux/elfcore.h:5,
                       from include/linux/kexec.h:60,
                       from arch/mips/kernel/machine_kexec.c:9:
      include/linux/bitops.h:160:24: error: conflicting types for 'fls_long'
      include/linux/log2.h:63:16: note: previous implicit declaration of 'fls_long' was here
      cc1: all warnings being treated as errors
      
      make[2]: *** [arch/mips/kernel/machine_kexec.o] Error 1
      
      Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: yuasa@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Cc: Linuxppc-dev <linuxppc-dev@ozlabs.org>
      Cc: Linux MIPS Mailing List <linux-mips@linux-mips.org>
      Cc: Linux-sh list <linux-sh@vger.kernel.org>
      Cc: Chris Zankel <chris@zankel.net>
      Patchwork: https://patchwork.linux-mips.org/patch/4000/
      Tested-by: John Crispin <blogic@openwrt.org>
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      3592c3cd
    • Yong Zhang's avatar
      MIPS: sync-r4k: remove redundant irq operation · f2b88d65
      Yong Zhang authored
      
      
      The irq operation here is redundant, since irq enabling has been
      delayed to ->smp_finish().
      
      Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
      Cc: Sergei Shtylyov <sshtylyov@mvista.com>
      Cc: David Daney <david.daney@cavium.com>
      Acked-by: David Daney <david.daney@cavium.com>
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      f2b88d65
    • Yong Zhang's avatar
      MIPS: smp: Warn on too early irq enable · b789ad63
      Yong Zhang authored
      
      
      Just to catch a potential issue.
      
      Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
      Cc: Sergei Shtylyov <sshtylyov@mvista.com>
      Cc: David Daney <david.daney@cavium.com>
      Acked-by: David Daney <david.daney@cavium.com>
      Patchwork: https://patchwork.linux-mips.org/patch/3852/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      b789ad63