Commit 9b624988 authored by Eric W. Biederman's avatar Eric W. Biederman
Browse files

ucounts: Count rlimits in each user namespace

Preface
-------
These patches are for binding the rlimit counters to a user in user namespace.
This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.12-rc4

Problem
-------
The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE rlimits
implementation places the counters in user_struct [1]. These limits are global
between processes and persists for the lifetime of the process, even if
processes are in different user namespaces.

To illustrate the impact of rlimits, let's say there is a program that does not
fork. Some service-A wants to run this program as user X in multiple containers.
Since the program never fork the service wants to set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the
service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails
since user X already has one running process.

The problem is not that the limit from container1 affects container2. The
problem is that limit is verified against the global counter that reflects
the number of processes in all containers.

This problem can be worked around by using different users for each container
but in this case we face a different problem of uid mapping when transferring
files from one container to another.

Eric W. Biederman mentioned this issue [2][3].

Introduced changes
------------------
To address the problem, we bind rlimit counters to user namespace. Each counter
reflects the number of processes in a given uid in a given user namespace. The
result is a tree of rlimit counters with the biggest value at the root (aka
init_user_ns). The limit is considered exceeded if it's exceeded up in the tree.

[1]: https://lore.kernel.org/containers/87imd2incs.fsf@x220.int.ebiederm.org/
[2]: https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html
[3]: https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html

Changelog
---------
v11:
* Revert most of changes in signal.c to fix performance issues and remove
  unnecessary memory allocations.
* Fixed issue found by lkp robot (again).

v10:
* Fixed memory leak in __sigqueue_alloc.
* Handled an unlikely situation when all consumers will return ucounts at once.
* Addressed other review comments from Eric W. Biederman.

v9:
* Used a negative value to check that the ucounts->count is close to overflow.
* Rebased onto v5.12-rc4.

v8:
* Used atomic_t for ucounts reference counting. Also added counter overflow
  check (thanks to Linus Torvalds for the idea).
* Fixed other issues found by lkp-tests project in the patch that Reimplements
  RLIMIT_MEMLOCK on top of ucounts.

v7:
* Fixed issues found by lkp-tests project in the patch that Reimplements
  RLIMIT_MEMLOCK on top of ucounts.

v6:
* Fixed issues found by lkp-tests project.
* Rebased onto v5.11.

v5:
* Split the first commit into two commits: change ucounts.count type to atomic_long_t
  and add ucounts to cred. These commits were merged by mistake during the rebase.
* The __get_ucounts() renamed to alloc_ucounts().
* The cred.ucounts update has been moved from commit_creds() as it did not allow
  to handle errors.
* Added error handling of set_cred_ucounts().

v4:
* Reverted the type change of ucounts.count to refcount_t.
* Fixed typo in the kernel/cred.c

v3:
* Added get_ucounts() function to increase the reference count. The existing
  get_counts() function renamed to __get_ucounts().
* The type of ucounts.count changed from atomic_t to refcount_t.
* Dropped 'const' from set_cred_ucounts() arguments.
* Fixed a bug with freeing the cred structure after calling cred_alloc_blank().
* Commit messages have been updated.
* Added selftest.

v2:
* RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts.
* Added ucounts for pair uid and user namespace into cred.
* Added the ability to increase ucount by more than 1.

v1:
* After discussion with Eric W. Biederman, I increased the size of ucounts to
  atomic_long_t.
* Added ucount_max to avoid the fork bomb.

--

Alexey Gladkov (9):
  Increase size of ucounts to atomic_long_t
  Add a reference to ucounts for each cred
  Use atomic_t for ucounts reference counting
  Reimplement RLIMIT_NPROC on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  kselftests: Add test to check for rlimit changes in different user
    namespaces
  ucounts: Set ucount_max to the largest positive value the type can
    hold

 fs/exec.c                                     |   6 +-
 fs/hugetlbfs/inode.c                          |  16 +-
 fs/proc/array.c                               |   2 +-
 include/linux/cred.h                          |   4 +
 include/linux/hugetlb.h                       |   4 +-
 include/linux/mm.h                            |   4 +-
 include/linux/sched/user.h                    |   7 -
 include/linux/shmem_fs.h                      |   2 +-
 include/linux/signal_types.h                  |   4 +-
 include/linux/user_namespace.h                |  31 +++-
 ipc/mqueue.c                                  |  40 ++---
 ipc/shm.c                                     |  26 +--
 kernel/cred.c                                 |  50 +++++-
 kernel/exit.c                                 |   2 +-
 kernel/fork.c                                 |  18 +-
 kernel/signal.c                               |  25 +--
 kernel/sys.c                                  |  14 +-
 kernel/ucount.c                               | 116 ++++++++++---
 kernel/user.c                                 |   3 -
 kernel/user_namespace.c                       |   9 +-
 mm/memfd.c                                    |   4 +-
 mm/mlock.c                                    |  22 ++-
 mm/mmap.c                                     |   4 +-
 mm/shmem.c                                    |  10 +-
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/rlimits/.gitignore    |   2 +
 tools/testing/selftests/rlimits/Makefile      |   6 +
 tools/testing/selftests/rlimits/config        |   1 +
 .../selftests/rlimits/rlimits-per-userns.c    | 161 ++++++++++++++++++
 29 files changed, 467 insertions(+), 127 deletions(-)
 create mode 100644 tools/testing/selftests/rlimits/.gitignore
 create mode 100644 tools/testing/selftests/rlimits/Makefile
 create mode 100644 tools/testing/selftests/rlimits/config
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c

Link: https://lkml.kernel.org/r/cover.1619094428.git.legion@kernel.org
parents 9f4ad9e4 c1ada3dc
Loading
Loading
Loading
Loading
+5 −1
Original line number Diff line number Diff line
@@ -1360,6 +1360,10 @@ int begin_new_exec(struct linux_binprm * bprm)
	WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
	flush_signal_handlers(me, 0);

	retval = set_cred_ucounts(bprm->cred);
	if (retval < 0)
		goto out_unlock;

	/*
	 * install the new credentials for this executable
	 */
@@ -1874,7 +1878,7 @@ static int do_execveat_common(int fd, struct filename *filename,
	 * whether NPROC limit is still exceeded.
	 */
	if ((current->flags & PF_NPROC_EXCEEDED) &&
	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
	    is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
		retval = -EAGAIN;
		goto out_ret;
	}
+8 −8
Original line number Diff line number Diff line
@@ -1443,7 +1443,7 @@ static int get_hstate_idx(int page_size_log)
 * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
 */
struct file *hugetlb_file_setup(const char *name, size_t size,
				vm_flags_t acctflag, struct user_struct **user,
				vm_flags_t acctflag, struct ucounts **ucounts,
				int creat_flags, int page_size_log)
{
	struct inode *inode;
@@ -1455,20 +1455,20 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
	if (hstate_idx < 0)
		return ERR_PTR(-ENODEV);

	*user = NULL;
	*ucounts = NULL;
	mnt = hugetlbfs_vfsmount[hstate_idx];
	if (!mnt)
		return ERR_PTR(-ENOENT);

	if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
		*user = current_user();
		if (user_shm_lock(size, *user)) {
		*ucounts = current_ucounts();
		if (user_shm_lock(size, *ucounts)) {
			task_lock(current);
			pr_warn_once("%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n",
				current->comm, current->pid);
			task_unlock(current);
		} else {
			*user = NULL;
			*ucounts = NULL;
			return ERR_PTR(-EPERM);
		}
	}
@@ -1495,9 +1495,9 @@ struct file *hugetlb_file_setup(const char *name, size_t size,

	iput(inode);
out:
	if (*user) {
		user_shm_unlock(size, *user);
		*user = NULL;
	if (*ucounts) {
		user_shm_unlock(size, *ucounts);
		*ucounts = NULL;
	}
	return file;
}
+1 −1
Original line number Diff line number Diff line
@@ -284,7 +284,7 @@ static inline void task_sig(struct seq_file *m, struct task_struct *p)
		collect_sigign_sigcatch(p, &ignored, &caught);
		num_threads = get_nr_threads(p);
		rcu_read_lock();  /* FIXME: is this correct? */
		qsize = atomic_read(&__task_cred(p)->user->sigpending);
		qsize = get_ucounts_value(task_ucounts(p), UCOUNT_RLIMIT_SIGPENDING);
		rcu_read_unlock();
		qlim = task_rlimit(p, RLIMIT_SIGPENDING);
		unlock_task_sighand(p, &flags);
+4 −0
Original line number Diff line number Diff line
@@ -144,6 +144,7 @@ struct cred {
#endif
	struct user_struct *user;	/* real user ID subscription */
	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
	struct ucounts *ucounts;
	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
	/* RCU deletion */
	union {
@@ -170,6 +171,7 @@ extern int set_security_override_from_ctx(struct cred *, const char *);
extern int set_create_files_as(struct cred *, struct inode *);
extern int cred_fscmp(const struct cred *, const struct cred *);
extern void __init cred_init(void);
extern int set_cred_ucounts(struct cred *);

/*
 * check for validity of credentials
@@ -370,6 +372,7 @@ static inline void put_cred(const struct cred *_cred)

#define task_uid(task)		(task_cred_xxx((task), uid))
#define task_euid(task)		(task_cred_xxx((task), euid))
#define task_ucounts(task)	(task_cred_xxx((task), ucounts))

#define current_cred_xxx(xxx)			\
({						\
@@ -386,6 +389,7 @@ static inline void put_cred(const struct cred *_cred)
#define current_fsgid() 	(current_cred_xxx(fsgid))
#define current_cap()		(current_cred_xxx(cap_effective))
#define current_user()		(current_cred_xxx(user))
#define current_ucounts()	(current_cred_xxx(ucounts))

extern struct user_namespace init_user_ns;
#ifdef CONFIG_USER_NS
+2 −2
Original line number Diff line number Diff line
@@ -434,7 +434,7 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
extern const struct file_operations hugetlbfs_file_operations;
extern const struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
				struct user_struct **user, int creat_flags,
				struct ucounts **ucounts, int creat_flags,
				int page_size_log);

static inline bool is_file_hugepages(struct file *file)
@@ -454,7 +454,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
#define is_file_hugepages(file)			false
static inline struct file *
hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
		struct user_struct **user, int creat_flags,
		struct ucounts **ucounts, int creat_flags,
		int page_size_log)
{
	return ERR_PTR(-ENOSYS);
Loading