mm, hugetlb: improve page-fault scalability
The kernel can currently only handle a single hugetlb page fault at a time. This is due to a single mutex that serializes the entire path. This lock protects from spurious OOM errors under conditions of low availability of free hugepages. This problem is specific to hugepages, because it is normal to want to use every single hugepage in the system - with normal pages we simply assume there will always be a few spare pages which can be used temporarily until the race is resolved. Address this problem by using a table of mutexes, allowing a better chance of parallelization, where each hugepage is individually serialized. The hash key is selected depending on the mapping type. For shared ones it consists of the address space and file offset being faulted; while for private ones the mm and virtual address are used. The size of the table is selected based on a compromise of collisions and memory footprint of a series of database workloads. Large database workloads that make heavy use of hugepages can be particularly exposed to this issue, causing start-up times to be painfully slow. This patch reduces the startup time of a 10 Gb Oracle DB (with ~5000 faults) from 37.5 secs to 25.7 secs. Larger workloads will naturally benefit even more. NOTE: The only downside to this patch, detected by Joonsoo Kim, is that a small race is possible in private mappings: A child process (with its own mm, after cow) can instantiate a page that is already being handled by the parent in a cow fault. When low on pages, can trigger spurious OOMs. I have not been able to think of a efficient way of handling this... but do we really care about such a tiny window? We already maintain another theoretical race with normal pages. If not, one possible way to is to maintain the single hash for private mappings -- any workloads that *really* suffer from this scaling problem should already use shared mappings. [akpm@linux-foundation.org: remove stray + characters, go BUG if hugetlb_init() kmalloc fails] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Please register or sign in to comment