Illustrate runC Escape Vulnerability CVE-2024-21626

runC, a container runtime component, released version 1.1.12 on January 31, 2024 to fix CVE-2024-21626, a vulnerability that allows attackers to escape from containers. The affected versions are >= v1.0.0-rc93 and <= 1.1.11. For containerd, the fixed versions are 1.6.28 and 1.7.13, and the affected versions are 1.4.7 to 1.6.27 and 1.7.0 to 1.7.12. For Docker, the fixed version is 25.0.2.

Reproduce

My environment for reproducing it is:

  • Linux distro: Arch Linux
  • Linux kernel: 6.4.12-arch1-1
  • Docker version: 24.0.6
  • runc version: 1.1.9

Based on the root cause of the vulnerability, attackers can exploit it in two different ways:

  • Set the working directory of the container to /proc/self/fd/<fd> when running a container, where <fd> is the file descriptor obtained by opening /sys/fs/cgroup in the host filesystem (usually 7 or 8).
  • Create a symlink to /proc/self/fd/<fd>, where <fd> has the same meaning as above. When users execute commands inside the container via docker exec or kubectl exec with the working directory set to the symlink, attackers can access the host filesystem through /proc/<PID>/cwd, where <PID> is the PID of the process created by the docker exec or kubectl exec command.

For the first approach, just run the following command:

docker run -w /proc/self/fd/8 --name cve-2024-21626 --rm -it debian:bookworm

/images/cve-2024-21626-escape-via-crafted-image.gif

For the second approach, start a container and create a symlink to /proc/self/fd/8.

Execute docker exec with the -w parameter set to the symlink to run the sleep command in the container.

Inside the container, find the PID of the sleep command, then access the host filesystem via /proc/<PID>/cwd. For example, execute cat /proc/<PID>/cwd/../../../../../etc/shadow to read the shadow file of the host filesystem.

/images/cve-2024-21626-escape-via-exec.gif

Analyze CVE-2024-21626

In this section, I’ll first describe the calling relationship among the components. Then I’ll reproduce the vulnerability again with the runc run command and explain how it happens. Lastly, I’ll analyze the code that fixes the vulnerability.

When running a container with the docker run command, the calling relationship among dockerd, containerd, containerd-shim-runc-v2 and runc is:

  • Docker Engine (dockerd) calls RPC methods of containerd via /run/containerd/containerd.sock to create and run a container.
  • containerd executes the containerd-shim-runc-v2 command to run a standalone RPC service over a UNIX domain socket, whose path is stored in the file /run/containerd/io.containerd.v2.task/moby/<containerID>/address by default. The RPC service is defined in the file /api/runtime/task/v3/shim.proto.
  • When containerd calls the Create method of containerd-shim-runc-v2 to create a container, containerd-shim-runc-v2 executes the runc create command. When containerd calls the Start method of containerd-shim-runc-v2 to start a container, containerd-shim-runc-v2 executes the runc start command.

By the way, containerd maintains a package called github.com/containerd/go-runc that wraps the operations of runC.

Use Docker to run a container from the alpine image, then export it as a tar archive. We’ll use it as the rootfs of our container later.

Execute the runc spec command to generate a default config file, config.json, and change the value of the cwd key to /proc/self/fd/7.

Run a container with the runc run command to create an exploitable container. Note that the --log parameter is necessary!

/images/cve-2024-21626-reproduce-via-runc.gif

The runc run command first creates a libcontainer.linuxContainer object. To create this object, runC needs to create an interface object called cgroups.Manager, which is used to manage cgroupfs. It opens /sys/fs/cgroup in the host filesystem, and subsequent operations on cgroup files are based on the openat2(2) system call and the file descriptor of /sys/fs/cgroup. But runC doesn’t close the file descriptor of /sys/fs/cgroup in time when forking child processes, so child processes can access the host filesystem through /proc/self/fd/<fdnum>.

Note that if the openat2(2) system call fails (for example, because the kernel doesn’t provide openat2(2)), runC calls the function openFallback() to open cgroup files with absolute paths.
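To make the leak more concrete, here is a minimal sketch of how an open directory handle stays reachable through /proc/self/fd/<fd>. It is not runC's code: the helper name leakedDirTarget is made up for illustration, it uses plain open(2) from Go's syscall package instead of openat2(2), and it assumes a Linux system with /proc mounted.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// leakedDirTarget (hypothetical helper) opens dir as a directory handle and
// reports what /proc/self/fd/<fd> resolves to while the handle is still open.
func leakedDirTarget(dir string) (string, error) {
	fd, err := syscall.Open(dir, syscall.O_RDONLY|syscall.O_DIRECTORY|syscall.O_CLOEXEC, 0)
	if err != nil {
		return "", err
	}
	defer syscall.Close(fd)
	// Any process that still holds fd can reach dir through this magic link,
	// regardless of what its own root directory is.
	return os.Readlink(fmt.Sprintf("/proc/self/fd/%d", fd))
}

func main() {
	target, err := leakedDirTarget("/sys/fs/cgroup")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("open handle resolves to:", target)
}
```

If a process inherits such a handle and then changes its working directory to /proc/self/fd/<fd>, its cwd points at the original directory in the host filesystem, which is the situation the exploit relies on.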

Why does runC use openat2(2)? runC added support for openat2(2) on December 4, 2020, and the change was released in version v1.0.0-rc93. In brief, the purpose is to prevent potential security risks when mounting directories of the host filesystem into the mount namespace of containers. A detailed explanation is a long story that deserves another article. For now you can refer to this article and the manual of openat2(2).

Why is the file descriptor 7? It’s related to the Go runtime. First, there is no doubt that file descriptors 0, 1 and 2 stand for stdin, stdout and stderr. The file descriptor of the log file specified by the --log parameter is 3. The Go runtime subsequently calls epoll_create(2) to create file descriptor 4 and pipe2(2) to create the two file descriptors 5 and 6. Opening /sys/fs/cgroup therefore yields file descriptor 7.

Why the log file is opened first, and why the Go runtime then calls epoll_create(2) and pipe2(2), comes down to the implementation of the Go runtime, which is a long story again.
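The numbering above follows from the kernel always handing out the lowest free descriptor number. The following sketch only demonstrates that rule; it does not reproduce runC's exact descriptor layout, and the function name openThree is an illustrative choice.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// openThree opens /dev/null twice, closes the first descriptor, and opens
// /dev/null again. It returns the three descriptor numbers it observed.
func openThree() (first, second, third int, err error) {
	flags := syscall.O_RDONLY | syscall.O_CLOEXEC
	if first, err = syscall.Open("/dev/null", flags, 0); err != nil {
		return
	}
	if second, err = syscall.Open("/dev/null", flags, 0); err != nil {
		syscall.Close(first)
		return
	}
	defer syscall.Close(second)
	syscall.Close(first) // frees the lower number again
	if third, err = syscall.Open("/dev/null", flags, 0); err != nil {
		return
	}
	syscall.Close(third)
	return
}

func main() {
	first, second, third, err := openThree()
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	// second is allocated above first; third reuses the number freed by first.
	fmt.Println(first, second, third)
}
```

Because of this rule, every descriptor opened earlier in the process (log file, epoll instance, pipe, inherited sockets) shifts the number that /sys/fs/cgroup ends up with.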

Why is it 8 when running containers via Docker? From the first part of this section, we know that it’s containerd-shim-runc-v2 that calls the runc command, and that containerd-shim-runc-v2 provides an RPC service over a UNIX domain socket before executing runC. As a result, the file descriptor of the UNIX domain socket is unintentionally passed to the runC process.

We can verify this as follows. Add a line that calls the sleep() function at the very beginning of the nsexec() function in the file nsexec.c. Then we can observe the relationship of file descriptors between containerd-shim-runc-v2 and runc create.

/images/rpc-socket-passed-to-runc.png

The runc create process blocks immediately after being created because of the added sleep call. In the screenshot above, the runc create process has 4 file descriptors:

  • 0 stands for stdin. It’s redirected to /dev/null because containerd-shim-runc-v2 doesn’t need to send any input data to runC.
  • 1 and 2 stand for stdout and stderr. They refer to the same pipe from containerd-shim-runc-v2, because containerd-shim-runc-v2 wants to collect and store their output.
  • 3 stands for the UNIX domain socket used to provide the RPC service.

/images/sys-fs-cgroup-with-fd-8.png

PID 1374988 in the screenshot above is the runc:[2:INIT] process, which in turn becomes the container process after calling execve(2). We can see that the fd of /sys/fs/cgroup is 8, precisely because of the UNIX domain socket offering the RPC service!

It’s still unclear why the fd of /sys/fs/cgroup is sometimes still 7 when running containers via docker exec. My guess is that it’s also related to the Go runtime.

When reproducing the vulnerability via runC itself, if we didn’t specify the --log parameter, the fd of /sys/fs/cgroup would become 3 and the exploit would not work, because the Go runtime would close it. The reason is related to the implementation of the Go runtime.

As we all know, fork(2) creates a child process on Linux, and the child process inherits all open file descriptors from the parent process. When a new program is executed with execve(2), the Linux kernel closes the file descriptors that have the O_CLOEXEC flag set, but the remaining open file descriptors stay accessible to the newly loaded program. This can be a security risk.

To solve the problem, Go is designed to prevent child processes from inheriting file descriptors by default. Developers need to explicitly pass the file descriptors that should be inherited. Specifically, the Go runtime sets the O_CLOEXEC flag on each file descriptor it opens. For those that should be inherited, the Go runtime uses dup3(2) to duplicate them into new file descriptors. Note that file descriptors created by calling raw syscalls are not flagged as O_CLOEXEC by default, so whether they are inherited by child processes depends on the actual situation. If the value of the file descriptor is greater than or equal to len(Cmd.ExtraFiles) + 3, it is inherited successfully. Otherwise it is closed or overwritten. You can read the source code of Go for more detail.
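The explicit hand-over described above can be sketched with os/exec. This is an illustration rather than anything taken from runC: the function name childWritesToFD3 and the MSG environment variable are made up, and the example assumes that sh is available on the system.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
)

// childWritesToFD3 starts a shell child that writes msg to its file
// descriptor 3 and returns what the parent reads from the other pipe end.
func childWritesToFD3(msg string) (string, error) {
	r, w, err := os.Pipe() // Go creates both ends with close-on-exec set
	if err != nil {
		return "", err
	}
	defer r.Close()

	cmd := exec.Command("sh", "-c", `printf '%s' "$MSG" >&3`)
	cmd.Env = append(os.Environ(), "MSG="+msg)
	cmd.ExtraFiles = []*os.File{w} // ExtraFiles[0] becomes fd 3 in the child
	if err := cmd.Start(); err != nil {
		w.Close()
		return "", err
	}
	w.Close() // drop the parent's copy so the read below sees EOF

	out, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	if err := cmd.Wait(); err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := childWritesToFD3("hello from fd 3")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println(out)
}
```

Only the write end listed in ExtraFiles reaches the child. The read end, which also carries the close-on-exec flag, is closed by the kernel during execve(2).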

The vulnerability is fixed by four commits in the repo: 8e1cd2, f2f162, 89c93d and ee7309. The first commit verifies whether the current working directory is inside the container. The last three commits close unnecessary file descriptors before forking children.
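The working directory check can be sketched roughly as follows. This is not the actual patch: the function name checkCwd and the error messages are illustrative, and the sketch assumes that a working directory outside the current root shows up either as an ENOENT error from getcwd(2) or as a path that is not absolute (for example one prefixed with "(unreachable)").

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// checkCwd (hypothetical helper) decides whether the result of a getcwd call
// describes a working directory that is reachable from the current root.
func checkCwd(wd string, err error) error {
	switch {
	case errors.Is(err, syscall.ENOENT):
		return errors.New("current working directory is outside of the container root")
	case err != nil:
		return fmt.Errorf("cannot verify current working directory: %w", err)
	case !filepath.IsAbs(wd):
		return fmt.Errorf("current working directory %q is not absolute", wd)
	}
	return nil
}

func main() {
	// syscall.Getwd wraps getcwd(2) directly and does not consult $PWD.
	if err := checkCwd(syscall.Getwd()); err != nil {
		fmt.Fprintln(os.Stderr, "refusing to continue:", err)
		os.Exit(1)
	}
	fmt.Println("current working directory looks safe")
}
```

A check like this has to run after the process has entered the container root and changed into the configured working directory, and before execve(2) is called.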

How to detect

The exploits have the following characteristics:

  • A container will execve(2) a process with a special working directory which starts with /proc/self/fd/.
  • A container will create symbolic links via symlink(2) or symlinkat(2) with a special target directory link which starts with /proc/self/fd/.
  • A container will open files via open(2), openat(2) or openat2(2) with filenames like /proc/\d+/cwd/.*.
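The three characteristics can be written down as simple predicates. The sketch below only shows the string matching; the function names are made up for illustration, and collecting the events themselves (via eBPF, Falco or another tracer) is outside its scope.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const procSelfFD = "/proc/self/fd/"

// procCwdPath matches paths such as /proc/1234/cwd/../../etc/shadow.
var procCwdPath = regexp.MustCompile(`^/proc/\d+/cwd/.*`)

// suspiciousExecCwd reports whether a process was executed with a working
// directory under /proc/self/fd/.
func suspiciousExecCwd(cwd string) bool {
	return strings.HasPrefix(cwd, procSelfFD)
}

// suspiciousSymlinkTarget reports whether a symlink points under /proc/self/fd/.
func suspiciousSymlinkTarget(target string) bool {
	return strings.HasPrefix(target, procSelfFD)
}

// suspiciousOpenPath reports whether a file is opened through /proc/<PID>/cwd/.
func suspiciousOpenPath(path string) bool {
	return procCwdPath.MatchString(path)
}

func main() {
	fmt.Println(suspiciousExecCwd("/proc/self/fd/8"))
	fmt.Println(suspiciousSymlinkTarget("/proc/self/fd/7"))
	fmt.Println(suspiciousOpenPath("/proc/42/cwd/../../../etc/shadow"))
	fmt.Println(suspiciousOpenPath("/proc/self/cwd/file"))
}
```

These predicates only apply to events that originate from inside a container. On the host, similar paths can appear legitimately.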

Snyk offers a tool to detect this vulnerability. It’s implemented with eBPF, but the eBPF code is not open source. I’ll write an article later to reverse-engineer it.

Here is the custom Falco rule:

- macro: container
  condition: (container.id != host and container.name exists)

- rule: CVE-2024-21626 (runC escape through /proc/[PID]/cwd) exploited
  desc: >
    Detect CVE-2024-21626, runC escape vulnerability through /proc/[PID]/cwd.
  condition: >
    container and ((evt.type = execve and proc.cwd startswith "/proc/self/fd") or (evt.type in (open, openat, openat2) and fd.name glob "/proc/*/cwd/*") or (evt.type in (symlink, symlinkat) and fs.path.target startswith "/proc/self/fd/")) and proc.name != "runc:[1:CHILD]"    
  output: CVE-2024-21626 exploited (%container.info evt_type=%evt.type process=%proc.name command=%proc.cmdline target=%fs.path.targetraw)
  priority: CRITICAL

However, filtering false positives by proc.name is not a good idea, because a process can change its own name.

/images/cve-2024-21626-detect-via-falco.gif

References