- Aug 09, 2014
-
-
Josh Triplett authored
The new mergeconfig helper makes it easier to add other partial configurations similar to kvmconfig. Architecture-independent portions of those partial configurations should go in kernel/configs/${name}.config, and architecture-dependent portions should go in arch/${arch}/configs/${name}.config. Based on a patch by Luis R. Rodriguez <mcgrof@suse.com>. Originally-Signed-off-by:
Luis R. Rodriguez <mcgrof@suse.com> Modified to make the helper name more general than just virtualization, support architecture-dependent and architecture-independent partial configurations, move the helper and kvmconfig to scripts/kconfig/Makefile, and factor out more of the common file path. Signed-off-by:
Josh Triplett <josh@joshtriplett.org>
-
- Aug 06, 2014
-
-
David S. Miller authored
Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Thomas Gleixner authored
Commit ea431643 ("x86/mce: Fix CMCI preemption bugs") breaks RT by the completely unrelated conversion of the cmci_discover_lock to a regular (non raw) spinlock. This lock was annotated in commit 59d958d2 ("locking, x86: mce: Annotate cmci_discover_lock as raw") with a proper explanation why. The argument for converting the lock back to a regular spinlock was: - it does percpu ops without disabling preemption. Preemption is not disabled due to the mistaken use of a raw spinlock. Which is complete nonsense. The raw_spinlock is disabling preemption in the same way as a regular spinlock. In mainline spinlock maps to raw_spinlock, in RT spinlock becomes a "sleeping" lock. raw_spinlock has on RT exactly the same semantics as in mainline. And because this lock is taken in non preemptible context it must be raw on RT. Undo the locking brainfart. Reported-by:
Clark Williams <williams@redhat.com> Reported-by:
Steven Rostedt <rostedt@goodmis.org> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Theodore Ts'o authored
The getrandom(2) system call was requested by the LibreSSL Portable developers. It is analoguous to the getentropy(2) system call in OpenBSD. The rationale of this system call is to provide resiliance against file descriptor exhaustion attacks, where the attacker consumes all available file descriptors, forcing the use of the fallback code where /dev/[u]random is not available. Since the fallback code is often not well-tested, it is better to eliminate this potential failure mode entirely. The other feature provided by this new system call is the ability to request randomness from the /dev/urandom entropy pool, but to block until at least 128 bits of entropy has been accumulated in the /dev/urandom entropy pool. Historically, the emphasis in the /dev/urandom development has been to ensure that urandom pool is initialized as quickly as possible after system boot, and preferably before the init scripts start execution. This is because changing /dev/urandom reads to block represents an interface change that could potentially break userspace which is not acceptable. In practice, on most x86 desktop and server systems, in general the entropy pool can be initialized before it is needed (and in modern kernels, we will printk a warning message if not). However, on an embedded system, this may not be the case. And so with this new interface, we can provide the functionality of blocking until the urandom pool has been initialized. Any userspace program which uses this new functionality must take care to assure that if it is used during the boot process, that it will not cause the init scripts or other portions of the system startup to hang indefinitely. SYNOPSIS #include <linux/random.h> int getrandom(void *buf, size_t buflen, unsigned int flags); DESCRIPTION The system call getrandom() fills the buffer pointed to by buf with up to buflen random bytes which can be used to seed user space random number generators (i.e., DRBG's) or for other cryptographic uses. It should not be used for Monte Carlo simulations or other programs/algorithms which are doing probabilistic sampling. If the GRND_RANDOM flags bit is set, then draw from the /dev/random pool instead of the /dev/urandom pool. The /dev/random pool is limited based on the entropy that can be obtained from environmental noise, so if there is insufficient entropy, the requested number of bytes may not be returned. If there is no entropy available at all, getrandom(2) will either block, or return an error with errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags. If the GRND_RANDOM bit is not set, then the /dev/urandom pool will be used. Unlike using read(2) to fetch data from /dev/urandom, if the urandom pool has not been sufficiently initialized, getrandom(2) will block (or return -1 with the errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags). The getentropy(2) system call in OpenBSD can be emulated using the following function: int getentropy(void *buf, size_t buflen) { int ret; if (buflen > 256) goto failure; ret = getrandom(buf, buflen, 0); if (ret < 0) return ret; if (ret == buflen) return 0; failure: errno = EIO; return -1; } RETURN VALUE On success, the number of bytes that was filled in the buf is returned. This may not be all the bytes requested by the caller via buflen if insufficient entropy was present in the /dev/random pool, or if the system call was interrupted by a signal. On error, -1 is returned, and errno is set appropriately. ERRORS EINVAL An invalid flag was passed to getrandom(2) EFAULT buf is outside the accessible address space. EAGAIN The requested entropy was not available, and getentropy(2) would have blocked if the GRND_NONBLOCK flag was not set. EINTR While blocked waiting for entropy, the call was interrupted by a signal handler; see the description of how interrupted read(2) calls on "slow" devices are handled with and without the SA_RESTART flag in the signal(7) man page. NOTES For small requests (buflen <= 256) getrandom(2) will not return EINTR when reading from the urandom pool once the entropy pool has been initialized, and it will return all of the bytes that have been requested. This is the recommended way to use getrandom(2), and is designed for compatibility with OpenBSD's getentropy() system call. However, if you are using GRND_RANDOM, then getrandom(2) may block until the entropy accounting determines that sufficient environmental noise has been gathered such that getrandom(2) will be operating as a NRBG instead of a DRBG for those people who are working in the NIST SP 800-90 regime. Since it may block for a long time, these guarantees do *not* apply. The user may want to interrupt a hanging process using a signal, so blocking until all of the requested bytes are returned would be unfriendly. For this reason, the user of getrandom(2) MUST always check the return value, in case it returns some error, or if fewer bytes than requested was returned. In the case of !GRND_RANDOM and small request, the latter should never happen, but the careful userspace code (and all crypto code should be careful) should check for this anyway! Finally, unless you are doing long-term key generation (and perhaps not even then), you probably shouldn't be using GRND_RANDOM. The cryptographic algorithms used for /dev/urandom are quite conservative, and so should be sufficient for all purposes. The disadvantage of GRND_RANDOM is that it can block, and the increased complexity required to deal with partially fulfilled getrandom(2) requests. Signed-off-by:
Theodore Ts'o <tytso@mit.edu> Reviewed-by:
Zach Brown <zab@zabbo.net>
-
- Aug 05, 2014
-
-
David L Stevens authored
This patches adds an "install" target to install kernel builds for SPARC, modeled after the i386 script. Signed-off-by:
David L Stevens <david.stevens@oracle.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Andrey Utkin authored
This commit is a guesswork, but it seems to make sense to drop this break, as otherwise the following line is never executed and becomes dead code. And that following line actually saves the result of local calculation by the pointer given in function argument. So the proposed change makes sense if this code in the whole makes sense (but I am unable to analyze it in the whole). Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81641 Reported-by:
David Binderman <dcb314@hotmail.com> Signed-off-by:
Andrey Utkin <andrey.krieger.utkin@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
The LDC handshake could have been asynchronously triggered after ldc_bind() enables the ldc_rx() receive interrupt-handler (and thus intercepts incoming control packets) and before vio_port_up() calls ldc_connect(). If that is the case, ldc_connect() should return 0 and let the state-machine progress. Signed-off-by:
Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by:
Karl Volz <karl.volz@oracle.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Based almost entirely upon a patch by Christopher Alexander Tobias Schulze. In commit db64fe02 ("mm: rewrite vmap layer") lazy VMAP tlb flushing was added to the vmalloc layer. This causes problems on sparc64. Sparc64 has two VMAP mapped regions and they are not contiguous with eachother. First we have the malloc mapping area, then another unrelated region, then the vmalloc region. This "another unrelated region" is where the firmware is mapped. If the lazy TLB flushing logic in the vmalloc code triggers after we've had both a module unload and a vfree or similar, it will pass an address range that goes from somewhere inside the malloc region to somewhere inside the vmalloc region, and thus covering the openfirmware area entirely. The sparc64 kernel learns about openfirmware's dynamic mappings in this region early in the boot, and then services TLB misses in this area. But openfirmware has some locked TLB entries which are not mentioned in those dynamic mappings and we should thus not disturb them. These huge lazy TLB flush ranges causes those openfirmware locked TLB entries to be removed, resulting in all kinds of problems including hard hangs and crashes during reboot/reset. Besides causing problems like this, such huge TLB flush ranges are also incredibly inefficient. A plea has been made with the author of the VMAP lazy TLB flushing code, but for now we'll put a safety guard into our flush_tlb_kernel_range() implementation. Since the implementation has become non-trivial, stop defining it as a macro and instead make it a function in a C source file. Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David S. Miller authored
The assumption was that update_mmu_cache() (and the equivalent for PMDs) would only be called when the PTE being installed will be accessible by the user. This is not true for code paths originating from remove_migration_pte(). There are dire consequences for placing a non-valid PTE into the TSB. The TLB miss frramework assumes thatwhen a TSB entry matches we can just load it into the TLB and return from the TLB miss trap. So if a non-valid PTE is in there, we will deadlock taking the TLB miss over and over, never satisfying the miss. Just exit early from update_mmu_cache() and friends in this situation. Based upon a report and patch from Christopher Alexander Tobias Schulze. Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Aug 03, 2014
-
-
Dan Carpenter authored
I don't know if we really need 64 bits here but these variables are declared as u64 and it can't hurt to cast this so we prevent any shift wrapping. Signed-off-by:
Dan Carpenter <dan.carpenter@oracle.com> Acked-by:
Aubrey Li <aubrey.li@linux.intel.com> Link: http://lkml.kernel.org/r/20140801082715.GE28869@mwanda Signed-off-by:
H. Peter Anvin <hpa@zytor.com>
-
Alexei Starovoitov authored
clean up names related to socket filtering and bpf in the following way: - everything that deals with sockets keeps 'sk_*' prefix - everything that is pure BPF is changed to 'bpf_*' prefix split 'struct sk_filter' into struct sk_filter { atomic_t refcnt; struct rcu_head rcu; struct bpf_prog *prog; }; and struct bpf_prog { u32 jited:1, len:31; struct sock_fprog_kern *orig_prog; unsigned int (*bpf_func)(const struct sk_buff *skb, const struct bpf_insn *filter); union { struct sock_filter insns[0]; struct bpf_insn insnsi[0]; struct work_struct work; }; }; so that 'struct bpf_prog' can be used independent of sockets and cleans up 'unattached' bpf use cases split SK_RUN_FILTER macro into: SK_RUN_FILTER to be used with 'struct sk_filter *' and BPF_PROG_RUN to be used with 'struct bpf_prog *' __sk_filter_release(struct sk_filter *) gains __bpf_prog_release(struct bpf_prog *) helper function also perform related renames for the functions that work with 'struct bpf_prog *', since they're on the same lines: sk_filter_size -> bpf_prog_size sk_filter_select_runtime -> bpf_prog_select_runtime sk_filter_free -> bpf_prog_free sk_unattached_filter_create -> bpf_prog_create sk_unattached_filter_destroy -> bpf_prog_destroy sk_store_orig_filter -> bpf_prog_store_orig_filter sk_release_orig_filter -> bpf_release_orig_filter __sk_migrate_filter -> bpf_migrate_filter __sk_prepare_filter -> bpf_prepare_filter API for attaching classic BPF to a socket stays the same: sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *) and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program which is used by sockets, tun, af_packet API for 'unattached' BPF programs becomes: bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *) and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf Signed-off-by:
Alexei Starovoitov <ast@plumgrid.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Alexei Starovoitov authored
to indicate that this function is converting classic BPF into eBPF and not related to sockets Signed-off-by:
Alexei Starovoitov <ast@plumgrid.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Aug 02, 2014
-
-
Omar Sandoval authored
The kgdb breakpoint hooks (kgdb_brk_fn and kgdb_compiled_brk_fn) should only be entered when a kgdb break instruction is executed from the kernel. Otherwise, if kgdb is enabled, a userspace program can cause the kernel to drop into the debugger by executing either KGDB_BREAKINST or KGDB_COMPILED_BREAK. Acked-by:
Will Deacon <will.deacon@arm.com> Signed-off-by:
Omar Sandoval <osandov@osandov.com> Signed-off-by:
Russell King <rmk+kernel@arm.linux.org.uk>
-
Russell King authored
Add a note about the usage of the identity mapping; we do not support accesses outside of the identity map region and kernel image while a CPU is using the identity map. This is because the identity mapping may overwrite vmalloc space, IO mappings, the vectors pages, etc. Acked-by:
Will Deacon <will.deacon@arm.com> Signed-off-by:
Russell King <rmk+kernel@arm.linux.org.uk>
-
Russell King authored
Add further comments to the early page table remap code to explain what the code is doing, why it is doing it, but more importantly to explain that the code is not architecturally compliant and is squarely in "UNPREDICTABLE" behaviour territory. Add a warning and tainting of the kernel too. Signed-off-by:
Russell King <rmk+kernel@arm.linux.org.uk>
-
Shawn Guo authored
With SCU standby enabled, SCU CLK will be turned off when all processors are in WFI mode. And the clock will be turned on when any processor leaves WFI mode. This behavior should be preferable in terms of power efficiency of system idle. So let's set the SCU standby bit to enable the support in function scu_enable(). Cortex-A9 earlier than r2p0 has no standby bit in SCU, so we need to skip setting the bit for those. Signed-off-by:
Shawn Guo <shawn.guo@freescale.com> Acked-by:
Soren Brinkmann <soren.brinkmann@xilinx.com> Signed-off-by:
Russell King <rmk+kernel@arm.linux.org.uk>
-
Shawn Guo authored
Use macro instead of magic number for SCU enable bit. Signed-off-by:
Shawn Guo <shawn.guo@freescale.com> Acked-by:
Soren Brinkmann <soren.brinkmann@xilinx.com> Signed-off-by:
Russell King <rmk+kernel@arm.linux.org.uk>
-
Jussi Kivilinna authored
This patch adds ARM NEON assembly implementation of SHA-512 and SHA-384 algorithms. tcrypt benchmark results on Cortex-A8, sha512-generic vs sha512-neon-asm: block-size bytes/update old-vs-new 16 16 2.99x 64 16 2.67x 64 64 3.00x 256 16 2.64x 256 64 3.06x 256 256 3.33x 1024 16 2.53x 1024 256 3.39x 1024 1024 3.52x 2048 16 2.50x 2048 256 3.41x 2048 1024 3.54x 2048 2048 3.57x 4096 16 2.49x 4096 256 3.42x 4096 1024 3.56x 4096 4096 3.59x 8192 16 2.48x 8192 256 3.42x 8192 1024 3.56x 8192 4096 3.60x 8192 8192 3.60x Acked-by:
Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by:
Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by:
Jussi Kivilinna <jussi.kivilinna@iki.fi> Signed-off-by:
Russell King <rmk+kernel@arm.linux.org.uk>
-
Jussi Kivilinna authored
This patch adds ARM NEON assembly implementation of SHA-1 algorithm. tcrypt benchmark results on Cortex-A8, sha1-arm-asm vs sha1-neon-asm: block-size bytes/update old-vs-new 16 16 1.04x 64 16 1.02x 64 64 1.05x 256 16 1.03x 256 64 1.04x 256 256 1.30x 1024 16 1.03x 1024 256 1.36x 1024 1024 1.52x 2048 16 1.03x 2048 256 1.39x 2048 1024 1.55x 2048 2048 1.59x 4096 16 1.03x 4096 256 1.40x 4096 1024 1.57x 4096 4096 1.62x 8192 16 1.03x 8192 256 1.40x 8192 1024 1.58x 8192 4096 1.63x 8192 8192 1.63x Acked-by:
Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by:
Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by:
Jussi Kivilinna <jussi.kivilinna@iki.fi> Signed-off-by:
Russell King <rmk+kernel@arm.linux.org.uk>
-
Jussi Kivilinna authored
Common SHA-1 structures are defined in <crypto/sha.h> for code sharing. This patch changes SHA-1/ARM glue code to use these structures. Acked-by:
Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by:
Jussi Kivilinna <jussi.kivilinna@iki.fi> Signed-off-by:
Russell King <rmk+kernel@arm.linux.org.uk>
-
Pawel Moll authored
The bus devices created to be parents for other peripherals were using platform_bus as a parent, not being platform devices themselves. Remove the references, making them virtual devices instead. Cc: Sascha Hauer <kernel@pengutronix.de> Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by:
Pawel Moll <pawel.moll@arm.com> Acked-by:
Shawn Guo <shawn.guo@freescale.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- Aug 01, 2014
-
-
Mark Rutland authored
Due to a missing newline in the I-cache policy detection log output, it's possible to get some ratehr unfortunate output at boot time: CPU1: Booted secondary processor Detected VIPT I-cache on CPU1CPU2: Booted secondary processor Detected VIPT I-cache on CPU2CPU3: Booted secondary processor Detected VIPT I-cache on CPU3CPU4: Booted secondary processor Detected PIPT I-cache on CPU4CPU5: Booted secondary processor Detected PIPT I-cache on CPU5Brought up 6 CPUs SMP: Total of 6 processors activated. This patch adds the missing newline to the format string, cleaning up the output. Fixes: 59ccc0d4 ("arm64: cachetype: report weakest cache policy") Signed-off-by:
Mark Rutland <mark.rutland@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Vince Bridgers authored
This patch adds socfpga Ethernet filter attributes for multicast and unicast filters per Synopsys Ethernet IP configuration chosen by Altera for the Cyclone 5 and Arria SOC FPGAs. Signed-off-by:
Vince Bridgers <vbridgers2013@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jul 31, 2014
-
-
Dave Hansen authored
This has been run through Intel's LKP tests across a wide range of modern sytems and workloads and it wasn't shown to make a measurable performance difference positive or negative. Now that we have some shiny new tracepoints, we can actually figure out what the heck is going on. During a kernel compile, 60% of the flush_tlb_mm_range() calls are for a single page. It breaks down like this: size percent percent<= V V V GLOBAL: 2.20% 2.20% avg cycles: 2283 1: 56.92% 59.12% avg cycles: 1276 2: 13.78% 72.90% avg cycles: 1505 3: 8.26% 81.16% avg cycles: 1880 4: 7.41% 88.58% avg cycles: 2447 5: 1.73% 90.31% avg cycles: 2358 6: 1.32% 91.63% avg cycles: 2563 7: 1.14% 92.77% avg cycles: 2862 8: 0.62% 93.39% avg cycles: 3542 9: 0.08% 93.47% avg cycles: 3289 10: 0.43% 93.90% avg cycles: 3570 11: 0.20% 94.10% avg cycles: 3767 12: 0.08% 94.18% avg cycles: 3996 13: 0.03% 94.20% avg cycles: 4077 14: 0.02% 94.23% avg cycles: 4836 15: 0.04% 94.26% avg cycles: 5699 16: 0.06% 94.32% avg cycles: 5041 17: 0.57% 94.89% avg cycles: 5473 18: 0.02% 94.91% avg cycles: 5396 19: 0.03% 94.95% avg cycles: 5296 20: 0.02% 94.96% avg cycles: 6749 21: 0.18% 95.14% avg cycles: 6225 22: 0.01% 95.15% avg cycles: 6393 23: 0.01% 95.16% avg cycles: 6861 24: 0.12% 95.28% avg cycles: 6912 25: 0.05% 95.32% avg cycles: 7190 26: 0.01% 95.33% avg cycles: 7793 27: 0.01% 95.34% avg cycles: 7833 28: 0.01% 95.35% avg cycles: 8253 29: 0.08% 95.42% avg cycles: 8024 30: 0.03% 95.45% avg cycles: 9670 31: 0.01% 95.46% avg cycles: 8949 32: 0.01% 95.46% avg cycles: 9350 33: 3.11% 98.57% avg cycles: 8534 34: 0.02% 98.60% avg cycles: 10977 35: 0.02% 98.62% avg cycles: 11400 We get in to dimishing returns pretty quickly. On pre-IvyBridge CPUs, we used to set the limit at 8 pages, and it was set at 128 on IvyBrige. That 128 number looks pretty silly considering that less than 0.5% of the flushes are that large. The previous code tried to size this number based on the size of the TLB. Good idea, but it's error-prone, needs maintenance (which it didn't get up to now), and probably would not matter in practice much. Settting it to 33 means that we cover the mallopt M_TRIM_THRESHOLD, which is the most universally common size to do flushes. That's the short version. Here's the long one for why I chose 33: 1. These numbers have a constant bias in the timestamps from the tracing. Probably counts for a couple hundred cycles in each of these tests, but it should be fairly _even_ across all of them. The smallest delta between the tracepoints I have ever seen is 335 cycles. This is one reason the cycles/page cost goes down in general as the flushes get larger. The true cost is nearer to 100 cycles. 2. A full flush is more expensive than a single invlpg, but not by much (single percentages). 3. A dtlb miss is 17.1ns (~45 cycles) and a itlb miss is 13.0ns (~34 cycles). At those rates, refilling the 512-entry dTLB takes 22,000 cycles. 4. 22,000 cycles is approximately the equivalent of doing 85 invlpg operations. But, the odds are that the TLB can actually be filled up faster than that because TLB misses that are close in time also tend to leverage the same caches. 6. ~98% of flushes are <=33 pages. There are a lot of flushes of 33 pages, probably because libc's M_TRIM_THRESHOLD is set to 128k (32 pages) 7. I've found no consistent data to support changing the IvyBridge vs. SandyBridge tunable by a factor of 16 I used the performance counters on this hardware (IvyBridge i5-3320M) to figure out the tlb miss costs: ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush 7,720,030,970 dtlb_load_misses_walk_duration [57.13%] 169,856,353 dtlb_load_misses_walk_completed [57.15%] 708,832,859 dtlb_store_misses_walk_duration [57.17%] 19,346,823 dtlb_store_misses_walk_completed [57.17%] 2,779,687,402 itlb_misses_walk_duration [57.15%] 82,241,148 itlb_misses_walk_completed [57.13%] 770,717 itlb_itlb_flush [57.11%] Show that a dtlb miss is 17.1ns (~45 cycles) and a itlb miss is 13.0ns (~34 cycles). At those rates, refilling the 512-entry dTLB takes 22,000 cycles. On a SandyBridge system with more cores and larger caches, those are dtlb=13.4ns and itlb=9.5ns. cat perf.stat.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, " ----- ", icyc,imiss, dcyc,dmiss } On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any The assumptions that this code went in under: https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill are about 100ns. Being generous, that is over by a factor of 6 on the refill side, although it is fairly close on the cost of an invlpg. An increase of a single invlpg operation seems to lengthen the flush range operation by about 200 cycles. Here is one example of the data collected for flushing 10 and 11 pages (full data are below): 10: 0.43% 93.90% avg cycles: 3570 cycles/page: 357 samples: 4714 11: 0.20% 94.10% avg cycles: 3767 cycles/page: 342 samples: 2145 How to generate this table: echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb echo x86-tsc > /sys/kernel/debug/tracing/trace_clock echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable Pipe the trace output in to this script: http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt Note that these data were gathered with the invlpg threshold set to 150 pages. Only data points with >=50 of samples were printed: Flush % of %<= in flush this pages es size ------------------------------------------------------------------------------ -1: 2.20% 2.20% avg cycles: 2283 cycles/page: xxxx samples: 23960 1: 56.92% 59.12% avg cycles: 1276 cycles/page: 1276 samples: 620895 2: 13.78% 72.90% avg cycles: 1505 cycles/page: 752 samples: 150335 3: 8.26% 81.16% avg cycles: 1880 cycles/page: 626 samples: 90131 4: 7.41% 88.58% avg cycles: 2447 cycles/page: 611 samples: 80877 5: 1.73% 90.31% avg cycles: 2358 cycles/page: 471 samples: 18885 6: 1.32% 91.63% avg cycles: 2563 cycles/page: 427 samples: 14397 7: 1.14% 92.77% avg cycles: 2862 cycles/page: 408 samples: 12441 8: 0.62% 93.39% avg cycles: 3542 cycles/page: 442 samples: 6721 9: 0.08% 93.47% avg cycles: 3289 cycles/page: 365 samples: 917 10: 0.43% 93.90% avg cycles: 3570 cycles/page: 357 samples: 4714 11: 0.20% 94.10% avg cycles: 3767 cycles/page: 342 samples: 2145 12: 0.08% 94.18% avg cycles: 3996 cycles/page: 333 samples: 864 13: 0.03% 94.20% avg cycles: 4077 cycles/page: 313 samples: 289 14: 0.02% 94.23% avg cycles: 4836 cycles/page: 345 samples: 236 15: 0.04% 94.26% avg cycles: 5699 cycles/page: 379 samples: 390 16: 0.06% 94.32% avg cycles: 5041 cycles/page: 315 samples: 643 17: 0.57% 94.89% avg cycles: 5473 cycles/page: 321 samples: 6229 18: 0.02% 94.91% avg cycles: 5396 cycles/page: 299 samples: 224 19: 0.03% 94.95% avg cycles: 5296 cycles/page: 278 samples: 367 20: 0.02% 94.96% avg cycles: 6749 cycles/page: 337 samples: 185 21: 0.18% 95.14% avg cycles: 6225 cycles/page: 296 samples: 1964 22: 0.01% 95.15% avg cycles: 6393 cycles/page: 290 samples: 83 23: 0.01% 95.16% avg cycles: 6861 cycles/page: 298 samples: 61 24: 0.12% 95.28% avg cycles: 6912 cycles/page: 288 samples: 1307 25: 0.05% 95.32% avg cycles: 7190 cycles/page: 287 samples: 533 26: 0.01% 95.33% avg cycles: 7793 cycles/page: 299 samples: 94 27: 0.01% 95.34% avg cycles: 7833 cycles/page: 290 samples: 66 28: 0.01% 95.35% avg cycles: 8253 cycles/page: 294 samples: 73 29: 0.08% 95.42% avg cycles: 8024 cycles/page: 276 samples: 846 30: 0.03% 95.45% avg cycles: 9670 cycles/page: 322 samples: 296 31: 0.01% 95.46% avg cycles: 8949 cycles/page: 288 samples: 79 32: 0.01% 95.46% avg cycles: 9350 cycles/page: 292 samples: 60 33: 3.11% 98.57% avg cycles: 8534 cycles/page: 258 samples: 33936 34: 0.02% 98.60% avg cycles: 10977 cycles/page: 322 samples: 268 35: 0.02% 98.62% avg cycles: 11400 cycles/page: 325 samples: 177 36: 0.01% 98.63% avg cycles: 11504 cycles/page: 319 samples: 161 37: 0.02% 98.65% avg cycles: 11596 cycles/page: 313 samples: 182 38: 0.02% 98.66% avg cycles: 11850 cycles/page: 311 samples: 195 39: 0.01% 98.68% avg cycles: 12158 cycles/page: 311 samples: 128 40: 0.01% 98.68% avg cycles: 11626 cycles/page: 290 samples: 78 41: 0.04% 98.73% avg cycles: 11435 cycles/page: 278 samples: 477 42: 0.01% 98.73% avg cycles: 12571 cycles/page: 299 samples: 74 43: 0.01% 98.74% avg cycles: 12562 cycles/page: 292 samples: 78 44: 0.01% 98.75% avg cycles: 12991 cycles/page: 295 samples: 108 45: 0.01% 98.76% avg cycles: 13169 cycles/page: 292 samples: 78 46: 0.02% 98.78% avg cycles: 12891 cycles/page: 280 samples: 261 47: 0.01% 98.79% avg cycles: 13099 cycles/page: 278 samples: 67 48: 0.01% 98.80% avg cycles: 13851 cycles/page: 288 samples: 77 49: 0.01% 98.80% avg cycles: 13749 cycles/page: 280 samples: 66 50: 0.01% 98.81% avg cycles: 13949 cycles/page: 278 samples: 73 52: 0.00% 98.82% avg cycles: 14243 cycles/page: 273 samples: 52 54: 0.01% 98.83% avg cycles: 15312 cycles/page: 283 samples: 87 55: 0.01% 98.84% avg cycles: 15197 cycles/page: 276 samples: 109 56: 0.02% 98.86% avg cycles: 15234 cycles/page: 272 samples: 208 57: 0.00% 98.86% avg cycles: 14888 cycles/page: 261 samples: 53 58: 0.01% 98.87% avg cycles: 15037 cycles/page: 259 samples: 59 59: 0.01% 98.87% avg cycles: 15752 cycles/page: 266 samples: 63 62: 0.00% 98.89% avg cycles: 16222 cycles/page: 261 samples: 54 64: 0.02% 98.91% avg cycles: 17179 cycles/page: 268 samples: 248 65: 0.12% 99.03% avg cycles: 18762 cycles/page: 288 samples: 1324 85: 0.00% 99.10% avg cycles: 21649 cycles/page: 254 samples: 50 127: 0.01% 99.18% avg cycles: 32397 cycles/page: 255 samples: 75 128: 0.13% 99.31% avg cycles: 31711 cycles/page: 247 samples: 1466 129: 0.18% 99.49% avg cycles: 33017 cycles/page: 255 samples: 1927 181: 0.33% 99.84% avg cycles: 2489 cycles/page: 13 samples: 3547 256: 0.05% 99.91% avg cycles: 2305 cycles/page: 9 samples: 550 512: 0.03% 99.95% avg cycles: 2133 cycles/page: 4 samples: 304 1512: 0.01% 99.99% avg cycles: 3038 cycles/page: 2 samples: 65 Here are the tlb counters during a 10-second slice of a kernel compile for a SandyBridge system. It's better than IvyBridge, but probably due to the larger caches since this was one of the 'X' extreme parts. 10,873,007,282 dtlb_load_misses_walk_duration 250,711,333 dtlb_load_misses_walk_completed 1,212,395,865 dtlb_store_misses_walk_duration 31,615,772 dtlb_store_misses_walk_completed 5,091,010,274 itlb_misses_walk_duration 163,193,511 itlb_misses_walk_completed 1,321,980 itlb_itlb_flush 10.008045158 seconds time elapsed # cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss/3.3, " dtlb cyc/miss: ", dcyc/dmiss/3.3, " ----- ", icyc,imiss, dcyc,dmiss }' itlb ns/miss: 9.45338 dtlb ns/miss: 12.9716 Signed-off-by:
Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154103.10C1115E@viggo.jf.intel.com Acked-by:
Rik van Riel <riel@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
Dave Hansen authored
Most of the logic here is in the documentation file. Please take a look at it. I know we've come full-circle here back to a tunable, but this new one is *WAY* simpler. I challenge anyone to describe in one sentence how the old one worked. Here's the way the new one works: If we are flushing more pages than the ceiling, we use the full flush, otherwise we use per-page flushes. Signed-off-by:
Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154101.12B52CAF@viggo.jf.intel.com Acked-by:
Rik van Riel <riel@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
Dave Hansen authored
We don't have any good way to figure out what kinds of flushes are being attempted. Right now, we can try to use the vm counters, but those only tell us what we actually did with the hardware (one-by-one vs full) and don't tell us what was actually _requested_. This allows us to select out "interesting" TLB flushes that we might want to optimize (like the ranged ones) and ignore the ones that we have very little control over (the ones at context switch). Signed-off-by:
Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154059.4C96CBA5@viggo.jf.intel.com Acked-by:
Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
Dave Hansen authored
There are currently three paths through the remote flush code: 1. full invalidation 2. single page invalidation using invlpg 3. ranged invalidation using invlpg This takes 2 and 3 and combines them in to a single path by making the single-page one just be the start and end be start plus a single page. This makes placement of our tracepoint easier. Signed-off-by:
Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154058.E0F90408@viggo.jf.intel.com Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
Dave Hansen authored
If we take the if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { local_flush_tlb(); goto out; } path out of flush_tlb_mm_range(), we will have flushed the tlb, but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the way out of the function so that we always take a single path when doing a full tlb flush. Signed-off-by:
Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154056.FF763B76@viggo.jf.intel.com Acked-by:
Rik van Riel <riel@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
Dave Hansen authored
I think the flush_tlb_mm_range() code that tries to tune the flush sizes based on the CPU needs to get ripped out for several reasons: 1. It is obviously buggy. It uses mm->total_vm to judge the task's footprint in the TLB. It should certainly be using some measure of RSS, *NOT* ->total_vm since only resident memory can populate the TLB. 2. Haswell, and several other CPUs are missing from the intel_tlb_flushall_shift_set() function. Thus, it has been demonstrated to bitrot quickly in practice. 3. It is plain wrong in my vm: [ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0 [ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0 [ 0.037444] tlb_flushall_shift: 6 Which leads to it to never use invlpg. 4. The assumptions about TLB refill costs are wrong: http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com (more on this in later patches) 5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59 I believe the sample times were too short. Running the benchmark in a loop yields times that vary quite a bit. Note that this leaves us with a static ceiling of 1 page. This is a conservative, dumb setting, and will be revised in a later patch. This also removes the code which attempts to predict whether we are flushing data or instructions. We expect instruction flushes to be relatively rare and not worth tuning for explicitly. Signed-off-by:
Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154055.ABC88E89@viggo.jf.intel.com Acked-by:
Rik van Riel <riel@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
Dave Hansen authored
The if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) line of code is not exactly the easiest to audit, especially when it ends up at two different indentation levels. This eliminates one of the the copy-n-paste versions. It also gives us a unified exit point for each path through this function. We need this in a minute for our tracepoint. Signed-off-by:
Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154054.44F1CDDC@viggo.jf.intel.com Acked-by:
Rik van Riel <riel@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Signed-off-by:
H. Peter Anvin <hpa@linux.intel.com>
-
Mark D Rustad authored
Resolve shadow warnings that appear in W=2 builds. Instead of using ret to hold the return pointer, save the length in a new variable saved_len and compute the pointer on exit. This also resolves a very technical error, in that ret was declared as a const char *, when it really was a char * const. Signed-off-by:
Mark Rustad <mark.d.rustad@intel.com> Signed-off-by:
Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Will Deacon authored
This reverts commit a28e3f4b . Ard and Yi Li report that this patch is broken by design, so revert it and let them sort it out for 3.18 instead. Reported-by:
Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
byungchul.park authored
Commit 190f1ca8 ("arm64: add support for kernel mode NEON in interrupt context") introduced a typing error in fpsimd_save_partial_state ENDPROC. This patch fixes the typing error. Acked-by:
Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by:
byungchul.park <byungchul.park@lge.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Will Deacon authored
Our break hooks are used to handle brk exceptions from kgdb (and potentially kprobes if that code ever resurfaces), so don't bother calling them if the BRK exception comes from userspace. This prevents userspace from trapping to a kdb shell on systems where kgdb is enabled and active. Cc: <stable@vger.kernel.org> Reported-by:
Omar Sandoval <osandov@osandov.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
David Hildenbrand authored
A VCPU might never stop if it intercepts (for whatever reason) between "fake interrupt delivery" and execution of the stop function. Heart of the problem is that SIGP STOP is an interrupt that has to be processed on every SIE entry until the VCPU finally executes the stop function. This problem was made apparent by commit 7dfc63cf (KVM: s390: allow only one SIGP STOP (AND STORE STATUS) at a time). With the old code, the guest could (incorrectly) inject SIGP STOPs multiple times. The bug of losing a sigp stop exists in KVM before 7dfc63cf , but it was hidden by Linux guests doing a sigp stop loop. The new code (rightfully) returns CC=2 and does not queue a new interrupt. This patch is a simple fix of the problem. Longterm we are going to rework that code - e.g. get rid of the action bits and so on. Signed-off-by:
David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by:
Christian Borntraeger <borntraeger@de.ibm.com> Acked-by:
Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by:
Christian Borntraeger <borntraeger@de.ibm.com> [some additional patch description]
-
Linus Walleij authored
The GPIO pin connected to card detect was inverted twice: once by the argument to the GPIO line itself where it was magically marked as active low by the flag GPIO_ACTIVE_LOW (0x01) in the third cell, and also marked active low AGAIN by explicitly stating "cd-inverted" (a deprecated method). After commit 78f87df2 "mmc: mmci: Use the common mmc DT parser" this results in the line being inverted twice so it was effectively uninverted, while the old code would not have this effect, instead disregarding the flag on the GPIO line altogether, which is a bug. I admit the semantics may be unclear but inverting twice is as good a definition as any on how this should work. So fix up the buggy device tree. Use proper #includes so the DTS is clear and readable. Cc: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by:
Linus Walleij <linus.walleij@linaro.org> Signed-off-by:
Olof Johansson <olof@lixom.net>
-
- Jul 30, 2014
-
-
Robert Richter authored
Matching x86 and making it more convenient to run the arm64 default kernel as distros like Ubuntu need this option. Signed-off-by:
Robert Richter <rrichter@cavium.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Chris J Arges authored
Remove a prototype which was added by both 93c4adc7 and 36be0b9d . Signed-off-by:
Chris J Arges <chris.j.arges@canonical.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Arun Chandran authored
Building a kernel with CPU_BIG_ENDIAN fails if there are stale objects from a !CPU_BIG_ENDIAN build. Due to a missing FORCE prerequisite on an if_changed rule in the VDSO Makefile, we attempt to link a stale LE object into the new BE kernel. According to Documentation/kbuild/makefiles.txt, FORCE is required for if_changed rules and forgetting it is a common mistake, so fix it by 'Forcing' the build of vdso. This patch fixes build errors like these: arch/arm64/kernel/vdso/note.o: compiled for a little endian system and target is big endian failed to merge target specific data of file arch/arm64/kernel/vdso/note.o arch/arm64/kernel/vdso/sigreturn.o: compiled for a little endian system and target is big endian failed to merge target specific data of file arch/arm64/kernel/vdso/sigreturn.o Tested-by:
Mark Rutland <mark.rutland@arm.com> Signed-off-by:
Arun Chandran <achandran@mvista.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Christian Borntraeger authored
commit 7dfc63cf (KVM: s390: allow only one SIGP STOP (AND STORE STATUS) at a time) introduced a memory leak if a sigp stop is already pending. Free the allocated inti structure. Signed-off-by:
Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by:
David Hildenbrand <dahi@linux.vnet.ibm.com>
-