Skip to content

Commit

Permalink
Add linux containers namespaces
Browse files Browse the repository at this point in the history
  • Loading branch information
yarkhinephyo committed Feb 3, 2024
1 parent 1929571 commit b4254b3
Showing 1 changed file with 139 additions and 0 deletions.
139 changes: 139 additions & 0 deletions _posts/2024-02-02-containers-under-the-hood.markdown
Original file line number Diff line number Diff line change
@@ -0,0 +1,139 @@
---
layout: post
title: "Learning Points: Containers Under the Hood"
date: 2024-02-02 15:00:00 +0800
category: [Tech]
tags: [Operating-System, Learning-Points]
---

In this [video](https://youtu.be/BM3aH-wultc?si=RTcj8mXG19HxQw2R), Gerlof Langeveld from AT Computing talks about how Linux containers work under the hood. The focus is mostly on the Linux namespaces and not Cgroups.

### Conventional Unix Approach

All processes run in one ecosystem, which includes:

- Hostname.
- PID numbers.
- Mounted filesystems.
- Network stack.
- IPC objects (semaphores, pipes).
- Users.

Even before containers, every process could have its own root directory (chroot) which it is limited to for usage.

If a process assumes the root identity means all the privileged actions are allowed. Non-root identity meant no privileged actions at all.

There were no tools to control resource consumption for each process.

### Containerized Approach

Processes are isolated from other processes in the host. Containers are implemented by administering the root directory, namespaces and control groups of a child process differently from the parent.

- Private filesystem - chroot.
- Isolated hostname - namespace 'uts'.
- Isolated IPC (shmem, semaphores, message queues) - namespace 'ipc'. Processes in the same namespace can share the objects.
- Isolated PID numbering - namespace 'pid'. The process will have different PID in ancestor namespace and its own namespace. Process with PID 1 in any namespace reaps the orphaned children in the namespace. The `/proc` has to be mounted accordingly in the new PID namespace too.
- Isolated users - namespace 'user'.
- Isolated mount - namespace 'mnt'. Processes connected to the same namespace share mountpoints. When a new namespace is created, the mount structure is inherited. However, when there are updates to the mount points, there is no impact on the original namespaces.
- Private network stack - namespace 'net'. A new network namespace initially has only the loopback interface but more can be added. Physical devices can only be in one namespace. Network namespaces can be connected via veth pairs.
- Limited privilege under root identity - capabilities.
- Limited utilization of CPU - cgroup 'cpu'.
- Limited utilization of memory - cgroup 'memory'.
- Limited utilization of disk - cgroup 'blkio'.

### Namespaces

Every process refers to namespaces. When there is a fork, the child processes inherit the binding of namespaces by default. Processes can unshare namespaces.

The namespace details are in the pseudo-filesystem `/proc`. The `$$` refers to the current process. The number in the bracket refers to the inode representing the namespace.

```
$ ls -l /proc/$$/ns
lrwxrwxrwx 1 kyar users 0 Feb 2 02:15 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 kyar users 0 Feb 2 02:15 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 kyar users 0 Feb 2 02:15 net -> 'net:[4026531840]'
lrwxrwxrwx 1 kyar users 0 Feb 2 02:15 pid -> 'pid:[4026531836]'
...
```

The `unshare` command executes a specified program in new namespaces. It uses the system call `unshare`. The `nsenter` command connects with existing namespaces of other processes. It uses the system call `setns`.


### Docker Namespaces

Docker allows sharing of namespaces. For example, `docker run --pid=host` shares the PID namespace with the host and `docker run --pid=container:CID` shares the PID namespace with anothe container.

### Modified Root Directory

Even before containers were prevalent, every process has own root directory. Usually, all the processes inherit the root directory of the entire filesystem from systemd. Use `chroot` to use a prepared directory as root. Use `pivot_root` to change root directory for all processes in the mount namespace.

```
sudo chroot topdir bash --login
```

### Capabilities

Traditional Unix privilege scheme only checks whether UID = 0. Linux privilege scheme has a collection of distinct privileges that can be set for each process. For example: CAP_CHOWN, CAP_KILL, CAP_SYS_BOOT etc. Thread running with effective UID = 0 initially has all the capabilities set.

```
docker run --cap-add foo --cap-drop bar ...
```

### Build a Container Step by Step (Without Cgroups)

In the `step1.sh`, a new hostname namespace is created.


```
#!/bin/bash
unshare -u bash step2.sh
```

In `step2.sh`, the hostname for the child process is changed. Then, `unshare` is run again to create a new PID namespace. The current process' environment is not modified. The forked child process will be in the new namespace with PID 1.

```
hostname mycontainer
unshare -p --fork --mount-proc bash step3.sh
```

In `step3.sh`, a new network namespace is created.

```
unshare -n bash step4.sh
```

In `step4.sh`, the local loopback link is set to up. Then, the script uses `nsenter` to set up veth pairing interface `mybr0` in the parent process's network stack namespace. Then the script sets up `mybr1` device for its own namespace. Now packets transmitted on one device in the pair are immediately received on the other device.

At the end, a new mount namespace is created.

```
ip link set dev lo up
nsenter -n -t 1 ip link add name mybr0 type veth peer name mybr1 netns $$
nsenter -n -t 1 ip addr add 192.168.47.11/24 dev mybr0
nsenter -n -t 1 ip link set dev mybr0 up
ip addr add 192.168.47.12/24 dev my br1
ip link set dev mybr1 up
unshare -m bash step5.sh
```

In `step5.sh`, a new directory for the container's root is created. A tmpfs filesystem (in-memory filesystem) of 50MB is mounted to the new root's directory. This mount point will be not visible within the parent's process. The contents from a skeleton root folder is copied into the mount point. `pivot_root` is used to change the root directory.

```
ROOTDIR=${pwd}/newroot
[ -d ${ROOTDIR} ] || mkdir "$ROOTDIR"
mount -n -t tmpfs -i size=50M none "${ROOTDIR}"
rsync -a skeletonfs/ "${ROOTDIR}"
cd "${ROOTDIR}"
pivot_root . oldroot
mount -t proc proc /proc
export PS1="[\u$\h \W]# "
bash
```

0 comments on commit b4254b3

Please sign in to comment.