Illustrate runC Escape Vulnerability CVE-2024-21626
runC, a container runtime component, released version 1.1.12 on 31 Jan 2024 to fix CVE-2024-21626, a vulnerability that allows escaping from containers. The affected versions are >= v1.0.0-rc93 and <= 1.1.11. For containerd, the fixed versions are 1.6.28 and 1.7.13, and the affected versions are 1.4.7 to 1.6.27 and 1.7.0 to 1.7.12. For Docker, the fixed version is 25.0.2.
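The affected range for runC can be turned into a small check. The following is a minimal Python sketch, not an official tool: the function names parse_version and is_affected are made up for illustration, and the range constants are simply the ones stated above.

```python
import re

# Affected runC range as stated in this article: >= v1.0.0-rc93 and <= 1.1.11.
FIRST_AFFECTED = "1.0.0-rc93"
LAST_AFFECTED = "1.1.11"


def parse_version(version: str) -> tuple:
    """Turn '1.0.0-rc93' or 'v1.1.9' into a sortable tuple.

    A release candidate sorts before the final release of the same version,
    so a missing '-rcN' suffix is mapped to infinity.
    """
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-rc\.?(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized runc version: {version!r}")
    major, minor, patch, rc = match.groups()
    rc_rank = float("inf") if rc is None else int(rc)
    return (int(major), int(minor), int(patch), rc_rank)


def is_affected(version: str) -> bool:
    """Return True if the runC version falls inside the affected range."""
    low = parse_version(FIRST_AFFECTED)
    high = parse_version(LAST_AFFECTED)
    return low <= parse_version(version) <= high
```

For the environment used below, is_affected("1.1.9") returns True, while is_affected("1.1.12") returns False.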
Reproduce
My environment for reproducing it is:
- Linux distro: Arch Linux
- Linux kernel: 6.4.12-arch1-1
- Docker version: 24.0.6
- runc version: 1.1.9
According to the root cause of the vulnerability, attackers can exploit it in two different ways:
- Set the working directory of the container to /proc/self/fd/<fd> when running a container, where <fd> stands for the file descriptor obtained by opening /sys/fs/cgroup in the host filesystem (usually 7 or 8).
- Create a symlink to /proc/self/fd/<fd>, where <fd> stands for the file descriptor obtained by opening /sys/fs/cgroup in the host filesystem (usually 7 or 8). When users execute commands inside the container via docker exec or kubectl exec with the working directory set to the symlink, attackers can access the host filesystem through /proc/<PID>/cwd, where <PID> stands for the PID of the process created by the docker exec or kubectl exec command.
Exploit via Setting Working Directory to /proc/self/fd/
Just run the following command:
docker run -w /proc/self/fd/8 --name cve-2024-21626 --rm -it debian:bookworm
Exploit via docker exec
Start a container and create a symlink to /proc/self/fd/8.
Execute the docker exec command with the -w parameter to run a sleep command in the container.
Inside the container, find the PID of the sleep command, then access the host filesystem via /proc/<PID>/cwd. For example, execute cat /proc/<PID>/cwd/../../../../../etc/shadow to read the shadow file of the host filesystem.
Analyze CVE-2024-21626
In this section, I’ll first describe the calling relationship among the components, then reproduce the vulnerability again with the runc run command and explain how it happens, and finally analyze the code that fixes it.
How Docker Engine Calls runC
When running a container with the docker run command, the calling relationship among dockerd, containerd, containerd-shim-runc-v2 and runc is:
- Docker Engine (dockerd) calls RPC methods of containerd via /run/containerd/containerd.sock to create and run a container.
- containerd executes the containerd-shim-runc-v2 command to run a standalone RPC service over a UNIX domain socket, whose path is stored in the file /run/containerd/io.containerd.runtime.v2.task/moby/<containerID>/address by default. The RPC service is defined in the file /api/runtime/task/v3/shim.proto.
- When containerd calls the Create method of containerd-shim-runc-v2 to create a container, containerd-shim-runc-v2 executes the runc create command. When containerd calls the Start method of containerd-shim-runc-v2 to start a container, containerd-shim-runc-v2 executes the runc start command.
By the way, containerd maintains a package called github.com/containerd/go-runc to encapsulate operations on runC.
Reproduce via runC Itself
Use Docker to run a container with the alpine image, then export it as a tar archive. We’ll use it as the rootfs of our container later.
Execute the runc spec command to generate a default config file, config.json, and change the value of the key cwd to /proc/self/fd/7.
Run a container by executing the runc run command to create an exploitable container. Note that the --log parameter is necessary!
How the Vulnerability Happens
The runc run command first creates a libcontainer.linuxContainer object. To create it, runC needs to create an interface object called cgroups.Manager, which is used to manage cgroupfs. It opens /sys/fs/cgroup in the host filesystem, and subsequent operations on cgroup files are based on the openat2(2) system call and the file descriptor of /sys/fs/cgroup. But runC doesn’t close the file descriptor of /sys/fs/cgroup in time when forking child processes, so child processes can access the host filesystem through /proc/self/fd/<fdnum>.
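The leak can be imitated outside of runC. The following Python sketch is only an analogy under stated assumptions: a temporary directory stands in for /sys/fs/cgroup, a file named secret.txt stands in for host data, and the function name read_through_leaked_fd is made up. The parent opens the directory, the child inherits only the descriptor number, and that is enough for the child to make the directory its working directory.

```python
import os
import subprocess
import sys
import tempfile

# Code run by the child: it never receives a path, only an inherited fd number.
CHILD_CODE = """
import os, sys
os.fchdir(int(sys.argv[1]))   # make the inherited directory the cwd
print(open("secret.txt").read(), end="")
"""


def read_through_leaked_fd() -> str:
    """Open a directory in the parent, leak its fd, and read a file from the child."""
    with tempfile.TemporaryDirectory() as host_dir:  # stand-in for /sys/fs/cgroup
        with open(os.path.join(host_dir, "secret.txt"), "w") as f:
            f.write("host data")
        dir_fd = os.open(host_dir, os.O_RDONLY | os.O_DIRECTORY)
        try:
            # pass_fds keeps dir_fd open in the child, mimicking the leak.
            result = subprocess.run(
                [sys.executable, "-c", CHILD_CODE, str(dir_fd)],
                pass_fds=(dir_fd,),
                capture_output=True,
                text=True,
                check=True,
            )
        finally:
            os.close(dir_fd)
        return result.stdout
```

In the real exploit the same effect is reached without any code in the child: setting the working directory to /proc/self/fd/<fdnum> resolves to the leaked directory.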
Note that if the openat2(2) system call fails (for example, because openat2(2) doesn’t exist), runC calls the function openFallback() to open cgroup files with absolute paths.
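The two code paths can be pictured with a short Python sketch. It is not runC’s implementation: Python exposes openat(2) through the dir_fd parameter of os.open but has no openat2(2) wrapper, so the relative open only stands in for it, and the function name open_cgroup_file is hypothetical.

```python
import os


def open_cgroup_file(dir_fd, root: str, name: str) -> int:
    """Open `name` below the cgroup root and return a file descriptor.

    dir_fd: descriptor of the already opened root directory, or None when
            the directory-relative path is unavailable.
    root:   absolute path of the same directory, used only by the fallback.
    """
    if dir_fd is None:
        # Fallback: no usable directory fd, so build an absolute path.
        return os.open(os.path.join(root, name), os.O_RDONLY)
    # Directory-relative open; this maps to openat(2), not openat2(2).
    return os.open(name, os.O_RDONLY, dir_fd=dir_fd)
```

Only the directory-relative path needs a long-lived descriptor of the root directory, which is the descriptor that ends up being leaked.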
Why runC Decides to Use openat2(2)
runC added support for openat2(2) on 4 Dec 2020, first released in version v1.0.0-rc93. The reason, in brief, is to prevent potential security risks when mounting directories of the host filesystem into the mount namespace of containers. A detailed explanation is a long story and deserves another article. For now you can refer to this article and the manual of openat2(2).
Why the File Descriptor of /sys/fs/cgroup is 7
Well, it’s related to the Go runtime. First, there is no doubt that file descriptors 0, 1 and 2 stand for stdin, stdout and stderr. The file descriptor of the log file specified by the --log parameter is 3. The Go runtime subsequently calls epoll_create(2) to create file descriptor 4 and pipe2(2) to create file descriptors 5 and 6. Now, opening /sys/fs/cgroup creates file descriptor 7.
Why the log file is opened first and the Go runtime calls epoll_create(2) and pipe2(2) afterwards is related to the implementation of the Go runtime; it’s a long story again.
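The descriptor numbers are predictable because the kernel always hands out the lowest-numbered descriptor that is not currently open. A minimal Python sketch of that rule (the function name lowest_free_fd_demo is made up, and the concrete numbers depend on what the calling process already has open):

```python
import os


def lowest_free_fd_demo():
    """Show that a new descriptor gets the lowest unused number."""
    first = os.open(os.devnull, os.O_RDONLY)
    second = os.open(os.devnull, os.O_RDONLY)
    os.close(first)                          # free the lower number again
    reused = os.open(os.devnull, os.O_RDONLY)  # takes the freed slot
    os.close(second)
    os.close(reused)
    return first, second, reused
```

In a single-threaded process, reused equals first. The same rule is why one extra descriptor opened early in the process shifts the number of /sys/fs/cgroup by one.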
Why the File Descriptor of /sys/fs/cgroup is 8 when Running a Container with docker exec
From the first part of this section, we know that it’s containerd-shim-runc-v2 that calls the runc command, and containerd-shim-runc-v2 provides an RPC service over a UNIX domain socket before executing runC, so the file descriptor of the UNIX domain socket is passed to the runC process by mistake.
We can prove this as follows. Add a line that calls the sleep() function at the very beginning of the nsexec() function in the file nsexec.c. We can then observe the relationship of file descriptors between containerd-shim-runc-v2 and runc create.
The runc create process is blocked immediately after it’s created because of the added sleep call. In the screenshot above, the runc create process has 4 file descriptors:
- 0 stands for stdin. It’s redirected to /dev/null because containerd-shim-runc-v2 doesn’t need to send any input data to runC.
- 1 and 2 stand for stdout and stderr. They refer to the same pipe from containerd-shim-runc-v2, because containerd-shim-runc-v2 wants to collect and store them.
- 3 stands for the UNIX domain socket used to provide the RPC service.
PID 1374988 in the screenshot above is the runc:[2:INIT] process, which in turn becomes the container process after calling execve(2). We can see that the fd of /sys/fs/cgroup is 8, just because of the UNIX domain socket offering the RPC service!
It’s still unclear why the fd of /sys/fs/cgroup is sometimes still 7 when running containers via docker exec. My guess is that it’s still related to the Go runtime.
The Magic --log Parameter
When reproducing the vulnerability via runC itself, if we didn’t specify the --log parameter, the fd of /sys/fs/cgroup would become 3, and the exploit would not work because the Go runtime would close it. The reason is related to the implementation of the Go runtime.
As we all know, we use fork(2) to create a child process on Linux, and the child process inherits all open file descriptors from the parent process. When a new program is executed with execve(2), the Linux kernel closes the file descriptors that have the O_CLOEXEC flag set, but the remaining open file descriptors are still accessible to the newly loaded program. This may lead to potential security risks.
To solve the problem, Go is designed to prevent child processes from inheriting file descriptors by default. Developers need to explicitly pass the file descriptors to be inherited. Specifically, the Go runtime sets the O_CLOEXEC flag on each file descriptor. For those to be inherited, the Go runtime uses dup3(2) to duplicate new file descriptors. Note that file descriptors created by calling raw syscalls are not flagged as O_CLOEXEC by default, so whether they can be inherited by child processes depends on the actual situation. If the value of the file descriptor is greater than len(Cmd.ExtraFiles) + 3, it can be inherited successfully. Otherwise it’ll be closed. You can read the source code of Go for more detail.
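The effect of the close-on-exec flag is easy to observe. The sketch below uses Python rather than Go, so it only illustrates the kernel behavior, not the Go runtime: Python also creates descriptors as non-inheritable by default, and os.set_inheritable toggles the flag. The names PROBE and survives_exec are made up for this example.

```python
import os
import subprocess
import sys

# The child exits with 0 if the fd number is open, 1 if it is not.
PROBE = """
import os, sys
try:
    os.fstat(int(sys.argv[1]))
except OSError:
    sys.exit(1)
"""


def survives_exec(inheritable: bool) -> bool:
    """Report whether a descriptor is still open in a child after fork+exec."""
    fd = os.open(os.devnull, os.O_RDONLY)     # close-on-exec by default in Python
    try:
        os.set_inheritable(fd, inheritable)   # clears or sets the close-on-exec flag
        child = subprocess.run(
            [sys.executable, "-c", PROBE, str(fd)],
            close_fds=False,                  # leave the decision to the flag
        )
        return child.returncode == 0
    finally:
        os.close(fd)
```

survives_exec(False) reports that the descriptor is gone in the child, while survives_exec(True) reports that it survived, which is the situation the leaked /sys/fs/cgroup descriptor was in.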
How the Official Fix Works
There are four commits in the repo that fix the vulnerability: 8e1cd2, f2f162, 89c93d and ee7309. The last three commits aim to close unnecessary file descriptors before forking children. The first commit verifies whether the current working directory is inside the container.
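Both ideas can be sketched in a few lines of Python. This is a rough analogue and not the code from the commits: the function names close_exec_from and verify_cwd are chosen for illustration, the descriptor list is read from /proc/self/fd (with /dev/fd as an assumed alternative), and the working directory check relies on getcwd failing when the directory cannot be reached from the current root.

```python
import os


def open_fds() -> list:
    """List the descriptor numbers currently open in this process."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        try:
            return sorted(int(name) for name in os.listdir(fd_dir))
        except OSError:
            continue
    raise RuntimeError("no fd directory available on this system")


def close_exec_from(min_fd: int) -> None:
    """Mark every open descriptor >= min_fd as close-on-exec."""
    for fd in open_fds():
        if fd < min_fd:
            continue
        try:
            os.set_inheritable(fd, False)
        except OSError:
            pass  # already closed, e.g. the fd used to list the directory


def verify_cwd() -> None:
    """Refuse to continue if the working directory cannot be resolved."""
    try:
        os.getcwd()
    except FileNotFoundError as err:
        raise RuntimeError("working directory is not reachable from the root") from err
```

Calling close_exec_from(3) just before executing the container process would leave only stdin, stdout and stderr inheritable, and verify_cwd() would reject a working directory that points outside the container.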
How to Detect
The exploits have the following characteristics:
- A container will execve(2) a process with a special working directory that starts with /proc/self/fd/.
- A container will create symbolic links via symlink(2) or symlinkat(2) with a special target that starts with /proc/self/fd/.
- A container will open files via open(2), openat(2) or openat2(2) with filenames like /proc/\d+/cwd/.*.
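The three characteristics translate directly into a matching function. The following Python sketch assumes a hypothetical event shape (a syscall name plus optional cwd, path and target strings); a real detector would take these fields from its own event source.

```python
import re

FD_PREFIX = "/proc/self/fd/"
CWD_PATTERN = re.compile(r"^/proc/\d+/cwd/.*")


def classify_event(syscall: str, cwd: str = "", path: str = "", target: str = ""):
    """Return a short reason if the event matches one of the three
    characteristics above, otherwise None. The event fields are hypothetical."""
    if syscall == "execve" and cwd.startswith(FD_PREFIX):
        return "execve with cwd under /proc/self/fd/"
    if syscall in ("symlink", "symlinkat") and target.startswith(FD_PREFIX):
        return "symlink pointing into /proc/self/fd/"
    if syscall in ("open", "openat", "openat2") and CWD_PATTERN.match(path):
        return "open through /proc/<PID>/cwd/"
    return None
```

For example, classify_event("openat", path="/proc/1234/cwd/../../etc/shadow") returns a reason string, while an ordinary open of /etc/hostname returns None.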
Leaky Vessels Dynamic Detector from Snyk
Snyk offers a tool to detect this vulnerability. It’s implemented with eBPF, but the eBPF code is not open-source. I’ll write an article later to reverse-engineer it.
Falco
Here is the custom Falco rule:
- macro: container
  condition: (container.id != host and container.name exists)
- rule: CVE-2024-21626 (runC escape through /proc/[PID]/cwd) exploited
  desc: >
    Detect CVE-2024-21626, runC escape vulnerability through /proc/[PID]/cwd.
  condition: >
    container and ((evt.type = execve and proc.cwd startswith "/proc/self/fd") or (evt.type in (open, openat, openat2) and fd.name glob "/proc/*/cwd/*") or (evt.type in (symlink, symlinkat) and fs.path.target startswith "/proc/self/fd/")) and proc.name != "runc:[1:CHILD]"
  output: CVE-2024-21626 exploited (%container.info evt_type=%evt.type process=%proc.name command=%proc.cmdline target=%fs.path.targetraw)
  priority: CRITICAL
But filtering false positives with proc.name is not a good idea.
References
https://github.com/opencontainers/runc/security/advisories/GHSA-xr7r-f8xq-vfvv
https://github.com/opencontainers/runc/commit/8e1cd2f56d518f8d6292b8bb39f0d0932e4b6c2a
https://github.com/opencontainers/runc/commit/f2f16213e174fb63e931fe0546bbbad1d9bbed6f
https://github.com/opencontainers/runc/commit/89c93ddf289437d5c8558b37047c54af6a0edb48
https://github.com/opencontainers/runc/commit/ee73091a8d28692fa4868bac81aa40a0b05f9780
https://snyk.io/blog/cve-2024-21626-runc-process-cwd-container-breakout/