Unprivileged sandboxing of popular script languages.
| Unprivileged | sandboxing | script languages |
|---|---|---|
| Alice: "Why not Docker?" Bob: "Because `sudo docker`." | landlock, seccomp | lua -> hlua, python -> hpython, node -> hnode, bash -> hsh |
This project is a work in progress and has not been audited by security experts.
However, I think it remains useful for educational purposes: exploring Linux's
sometimes daunting security features, and using strace to illustrate how a
program written in a high-level language is translated into syscalls to obtain
its desired (or undesired) effects.
... and it is certainly better than nothing as I will try to exemplify in the following section. But as always, remember that sandboxing and containerization only limit the extent of a successful attack, and don't give you carte blanche to willy-nilly execute untrusted code.
So given that disclaimer, why did I write this?
Showcasing Linux's security features is only a secondary goal; my primary goal
is for the reader to add strace
to her list of favorite tools.
Alice's game
Assume Alice is a game designer with malicious intent and you are her intended
victim.
Being a fan of indie games you of course accept to be a beta-tester for her
latest creation.
She sends you the fun.lua
game and hidden within is the statement:
os.execute("sudo rm -rf --no-preserve-root /")
(or she'll try sudo --askpass
if the credentials aren't cached).
A diligent code-reviewer might catch such an obviously malicious
statement, but it can be surprisingly easy to miss in a hurried
glance; try to allow yourself only a few seconds to read the following:
```lua
function run(cmdline)
  local s = os.getenv("SUDO")
  if not s then
    cmdline = "sudo -A " .. cmdline
  end
  os.execute(cmdline)
end

function clean_cache()
  local project = os.getenv("PROJECT_ROOT") or ".."
  local cache = os.getenv("CACHE") or "/tmp"
  run(string.format("rm -r %s/%s", cache, project))
end
```
Did you spot the malicious or unintended transposition?
This is the hardship presented to us by PR culture: a cursory review can
provide a false sense of security.
There are also programming languages
designed to be difficult to read.
And speaking of programming languages:
"the greatest thing about Lua is that you don't have to write Lua."
Meaning that it's very feasible to bundle a compiler for another language,
however non-esoteric (check out: fennel and
Amulet).
But Lua (as well as Python, Node.js, C and many, many more) is an
any-effect-at-any-time language.
This in contrast with Haskell
(check out Learn You a Haskell for Great Good!)
or maybe eff if you're feeling adventurous.
That means that an expected pure/side-effect free operation such as compiling a
piece of source code can include an obfuscated os.execute
-attack or worse
if the attacker has a more insidious mind.
Considering that compilers are usually quite extensive pieces of software
they provide ample forestry to hide a malicious tree.
Alice, I suggest you split your malicious code into several commits and PR:s
(preferably large ones close to a deadline).
For the victim, I recommend Ken Thompson's "Reflections on Trusting Trust",
which if you haven't read I expect will shatter any trust you might have
imagined you had in any binary executable
(going all the way back to punchcards and the PDP-1).
This may seem ridiculous, but OCaml
(my yardstick language of languages)
still bundles a bootstrapping binary compiler
to build subsequent compilers: this is very much
"trusting trust".
Even more so since Coq
is implemented
in and thus compiled by OCaml; now your trust stack ends in a binary blob:
do you trust it? And do you have the time for the incredibly time-consuming
task of verifying that no malicious opcodes hide within?
So the world is a scary and unsatisfactory environment; let's consider mitigating the consequences of malicious and/or incompetently written code.
Enter no new privileges
Alice's sudo-based rm-attack can be mitigated by a one-liner:
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0).
This call is not expected to fail, but being a conscientious developer it never
hurts to crash-don't-thrash and I present a
copy-pastable snippet:
```c
#include <sys/prctl.h>
#include <stdlib.h>

void no_new_privs(void)
{
	if (0 != prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
		abort();
	}
}
```
You might prefer exit. I don't: libc:s commonly provide atexit,
which in my opinion is contrary to a fail-early/crash-don't-thrash philosophy:
the operating system already has to assume the responsibility of cleaning up
after a failing process.
(Ever noticed that C coders don't free their allocations when exiting?)
Using exit
and atexit
reminds me of languages with exceptions and the
nightmare when exception handlers raise exceptions ad infinitum.
Instead consider programming models where failure-is-always-an-option thinking
is prevalent, such as the actor model:
where the non-delivery of a message is a scenario
brought to the forefront (with the real-world scenario of the fallibility of
network connections).
If you are curious I recommend Erlang
(check out Learn You Some Erlang for great good!).
(Erlang does not implement the Actor model in a strict way, but provides a very
enjoyable way to explore its concepts while writing highly concurrent
applications.)
Back to mitigating Alice's attacks: the above
no_new_privs
call is so simple it should always, always be used,
unless it is explicitly necessary to gain new privileges.
This is the Principle of least privilege:
if the functionality you intend to provide does not require privileges your
process should not have any privileges, and this is the red thread in this
attempted raison d'être.
But in the RealWorld,
processes inherit quite a handful of privileges that Alice can still abuse,
as we shall see.
So Alice can't sudo anymore thanks to PR_SET_NO_NEW_PRIVS,
but even a sneaky os.execute("rm -r ~") would still
be a major buzzkill.
The naive Lua specific mitigation is to os.execute = nil
before running the
entrypoint of Alice's game.
Well, that may be good enough
(I haven't figured out a way around that mitigation, but I'm reasonably sure
there is an exploit and would be interested in seeing it).
Continuing this idea we can tweak it into at least making this first naive
mitigation useful:
os.execute = function() error("not allowed") end
Especially since the mitigation I suggest below does not even allow the program to try to provide a user-friendly error message.
Enter seccomp
Seccomp is Linux's way of filtering syscalls and so limiting the exposed kernel surface.
Fancy words aside, this means that when you receive in your email inbox the notification of a new vulnerability you can feel certain that you are not affected because the vulnerable syscalls are rejected by your program. If you aren't subscribed to any CVE mailing lists I recommend:
- Arch Linux's,
- Ubuntu's or
- OpenBSD's mailing lists.
The simplest seccomp filters are essentially accept/reject lists, but they can do more complex things. But as always when it comes to security: easily understandable code is always preferred.
Back to Alice's os.execute
-based attacks:
with seccomp enabled with a filter that forbids exec
:s,
the kernel will politely kill your process and suggest to the rest of the
system that you received a SIGSYS
signal.
In practice this means that your process immediately vanishes, so without a
syscall inspection tool such as strace
one is reduced to debugging by:
thou shalt printf("are we nearly here yet?");
Enter strace
If you haven't invoked strace before, or you are curious what syscalls are being used by a program, then try:
strace lua -e 'print("hello")'
strace python -c 'print("hello")'
The output of strace
can be quite extensive (and therefore strace
provides
sophisticated ways to filter what is traced).
For our hello world example
the interesting syscall can be found towards the end:
write(1, "hello\n", 6) = 6
Other interesting syscalls to look for are memory allocating syscalls such as
brk
and
mmap
:
mmap(NULL, 151552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f55a8b74000
Continuing our investigation of Alice's os.execute
-attack, we can trace an
almost as trivial piece of code:
strace -f lua -e 'os.execute("echo hello")'
strace -f python -c 'import os; os.system("echo hello")'
Note the -f
(--follow-forks
) option that tells strace
to continue tracing
spawned child processes.
And now we look for
clone
(the syscall that implements
fork
)
and execve
.
From the trace in the parent:
clone3({flags=CLONE_VM|CLONE_VFORK, exit_signal=SIGCHLD, stack=0x7f8160953000, stack_size=0x9000}, 88) = 2304236
and in the child:
execve("/bin/sh", ["sh", "-c", "echo hello"], 0x7ffebf549aa8 /* 52 vars */) = 0
So why not reject clone
as well?
Remember, in Linux threads and processes are the same abstraction:
essentially one with shared virtual memory space and the other without
(but even a casual glance at clone's options makes the difference no longer so
clear cut).
Now with clone
rejected: both threads as well as processes are no longer
things you need to reason about.
So how do we actually tell seccomp what to accept and what to reject?
Enter Berkeley Packet Filters
Seccomp filters are expected to be binary representations of cBPF programs;
the "c" stands for "classic" BPF (in contrast with extended BPF, eBPF).
cBPF is not Turing complete for the lack of infinite memory:
it is restricted to the scratch memory uint32_t M[16].
That only presents an interesting challenge:
which Project Euler or
Advent of Code problems can be solved in cBPF?
Working with raw cBPF can therefore provide somewhat of a challenge, so you may want to use an assembler and a preprocessor (I've bundled them together as bpfc) that can interpret the constants commonly used when making syscalls.
I always start with a "reject everything" filter:
```
bad: ret #$SECCOMP_RET_KILL_THREAD
good: ret #$SECCOMP_RET_ALLOW
```
then run a test under strace
, look for SIGSYS
, reason about the offending
syscall, reluctantly add it to the allowed list and iterate:
```
ld [$$offsetof(struct seccomp_data, nr)$$]
jeq #$__NR_brk, good
bad: ret #$SECCOMP_RET_KILL_THREAD
good: ret #$SECCOMP_RET_ALLOW
```
Eventually when the test passes you have achieved a list of syscalls living up to the principle of least privilege.
The filter you produce might appear very long, but remember that Linux has a
massive amount of syscalls.
Viewed from a security perspective even a moderately long filter is still
a huge reduction of the exposed kernel surface.
But a simple (but still very effective) yes/no approach to filtering
syscalls falls short when it encounters the
"functionality-grouping" syscalls such as
fcntl, ioctl and prctl (which we encountered above).
For these syscalls it becomes necessary to inspect the call arguments
(from hlua
's filter):
```
jne #$__NR_fcntl, fcntl_end
ld [$$offsetof(struct seccomp_data, args[1])$$]
jeq #$F_GETFL, good
jmp bad
fcntl_end:
```
Sometimes it may be useful to "tamper" with a syscall instead of rejecting it
outright:
return -1
and set errno
to EPERM
or ENOSYS
to allow a child to recover:
see for example the prlimit
check in
hnode
's seccomp filter.
Doing the "test-n-strace" dance for a non-trivial test-case you quickly end up
with a filter usually including the
read
, write
and close
syscalls.
(Unsurprisingly these have syscall numbers 0, 1 and 3.)
write
is particularly fun to think about: without it
how can you communicate the result of any computation
in an "everything is a file" system?
The syscall filtering way of expressing this is
seccomp's strict mode:
only allow write
and exit
. The reasoning is that you are only allowed
to write to already opened file descriptors (since in this setting
open is forbidden, or more accurately, not expressly allowed).
But even moderately interesting Lua applications enjoy using
require
.
So it's not unreasonable to allow Lua to open files
(which fills the number 2 slot in the syscall numbering).
But then Alice changes her fun.lua
game to include (obfuscated of course):
io.open(os.getenv("HOME") .. "/.aws/credentials", "r"):read("*a")
Now Alice has to get this information back to her, but maybe it's a multiplayer game? Or she obfuscates it in the game's log file and exclaims: "Oh the game crashed, why don't you send me the logs?"
Alice's intentions might only go as far as
griefing, and she will try to
os.remove
your access tokens.
Alice, try removing Chrome/Firefox cookies as well.
This would definitely lose me my sunny disposition.
Removing files maps to the unlink syscall.
Certainly it commonly makes sense to reject it,
but a plausible legitimate use of unlink is to remove intermediate
files (created during compilation, maybe).
So what do we do about Alice's intent to remove your
with-blood-sweat-and-tears.doc
file?
Enter landlock
Landlock is a fairly recently added security feature, meant to restrict filesystem access for unprivileged processes, in addition to the standard UNIX file permissions. (I will argue landlock is fairly recent since its syscalls have, at the time of writing, the highest syscall numbers.)
In essence landlock
grants or restricts rights to filesystem operations
on whole filesystem hierarchies. (Note that a single file is a trivial
hierarchy.)
So we can grant read access to /usr/lib
only and mitigate Alice's attack on
your access tokens in your home directory. And maybe allow both read and write
to /tmp
, and maybe allow removing (i.e. unlinking).
Unless you allow open's O_TMPFILE flag
in your seccomp filter, of course.
The reason this section is bare of example code is that I found, and hope you will too, the concepts behind landlock easily understandable and yet very powerful. The sample code provided with landlock is excellent, and is relatively verbatim what I use. My experienced ratio of positive security impact versus time spent learning the feature is huge.
My one criticism of the current implementation of landlock is the inability
to hide files: that is, even though landlock restricts access to a
file, for example /etc/passwd
, then stat
(or similar) responds with
EACCES instead of ENOENT.
The knowledge that a Linux installation has a /etc/passwd
may be of limited
value, but revealing that ~/.aws/credentials exists
can enable an attacker to target her attack more effectively against the
discovered files.
Furthermore an attacker can, given enough access to
(lots, lots and lots of)
system time, enumerate your entire file tree.
This is of course a ridiculous endeavour;
#define NAME_MAX 255
and ('z'-'a')+('Z'-'A')+('9'-'0') says that the number of
filenames to try is bounded from below by 59^255:
which evaluates to an integer that starts with 36920
and ends with 89299
(not mentioning the other 442 digits).
The counter-argument is that there are other, perhaps better, ways of achieving
this functionality
(chroot
maybe),
reducing my criticism to a mere down-prioritized item on a wishlist.
The wrinkle in our concrete setting of providing script hosts is that sometimes
the interpreters want to dynamically load shared libraries which boast a
notorious elusiveness and never appear in the same place twice
(which implies we at least know their velocity).
Hence I have added a set of tools to, at compile time,
tell hpython
's landlock rules to allow read access to the path
where the embedded Python instance will look for, say, the libz
library.
This is the functionality exercised in
hpython
's import
test.
The journey starts with the paths
utility:
paths --python-site -lz
which for my system suggests that these file system trees are of particular interest:
```
/usr/lib/python3.10
/usr/lib/libz.so.1
```
Behind the scenes dlinfo
is used to
resolve the shared libraries.
The paths are then
inspected and converted into a relevant landlock rules code snippet
which is then included and applied in the main program.
Carrying the same wrinkle into the hsh project,
which executes a bash in the same security setting as the other script hosts,
produces another complication.
Being a fully-fledged shell it desires to link quite a bit
of dynamic libraries, which in turn desire even more of that shared binary
goodness.
ldd /bin/bash exposes the extent of their desire:
```
linux-vdso.so.1 (0x00007ffe3bed2000)
libreadline.so.8 => /usr/lib/libreadline.so.8 (0x00007f45e8bcc000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f45e8bc7000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f45e89e0000)
libncursesw.so.6 => /usr/lib/libncursesw.so.6 (0x00007f45e896c000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f45e8d44000)
```
That indeed is a wrinkle to iron out given the
diversity of Linux distributions.
My awk front-legs and sed mid-legs twitch,
but there is a better approach using objdump:
objdump -p /path/to/program | grep NEEDED
which said insect has bundled into the poor_ldd utility
(using the above-mentioned dlinfo-based lib utility).
Now poor_ldd /bin/bash produces similar output to ldd:
```
/lib64/ld-linux-x86-64.so.2
/usr/lib/libc.so.6
/usr/lib/libdl.so.2
/usr/lib/libncursesw.so.6
/usr/lib/libreadline.so.8
```
which can then be handed off to landlockc, granting a very limited set of read-access rules.
Alice's set of attack vectors is now quite diminished, but we can do even better.
Enter drop capabilities
I have included a code snippet to drop capabilities.
This is a Linux feature I previously hadn't had the need to explore (so take
that code and what comes next with a grain of salt and always:
"trust, but verify").
The classic selling point of capabilities is the scenario to allow unprivileged
users to run ping
.
In a pre-capabilities world one would have to obtain the full
power of the privileged user (root) in order to use ping.
Of course the setuid bit reduces the mess of every user su:ing, but still
provides a nice potential attack vector on the ping binary.
Capabilities are basically the idea of splitting root into separate, well,
capabilities that can be granted independently.
(ping
requires the CAP_NET_RAW
capability).
In this project this scenario isn't really applicable (since we start out as unprivileged users). But what may be applicable is the functionality to relinquish granted capabilities from the current process. Maybe this sounds convoluted, but in our current Dockerized world I would say it's fairly common to see images invoke executables in a privileged mode (i.e. not setting another user).
And a noteworthy configuration option of Linux is that you don't have to include the bothersome userland. Here I imagine a barebones server setup: the kernel, a single stand-alone server executable (serving as the init process) and nothing else. In that setting dropping capabilities could be useful.
But even with these restrictions Alice can cause quite a bother:
Enter rlimits
Now Alice is restricted to using a pre-approved set of syscalls and restricted to a pre-approved set of file-system operations on an equally pre-approved subset of the file-system tree.
Her last-ditch effort is to execute a
Denial-of-service attack.
I suggest Alice tries to while(1)
allocate at least one page of memory
(getconf PAGESIZE
: 4096 bytes),
write a single pseudo-randomly generated
byte to each allocation: forcing the kernel to
copy-on-write.
This will quickly exhaust all available memory, and any unfortunate Linux user
will attest to the ensuing misery.
The mitigation is to apply strict
rlimits.
In this attack RLIMIT_AS
might be the most efficient mitigation.
The common way of applying rlimits
is by using the shell's
ulimit
command.
Alice then tries a fork bomb.
Rejecting the clone syscall will of course mitigate such an attack, but for
instance node is determined to spawn worker threads, making such a
mitigation ineffective.
Once more rlimits come to the rescue:
RLIMIT_NPROC
restricts the number of processes that may exist for the user
(threads included, of course).
Alice, your next attack vector should be to exhaust any available block-devices
by creating huge files with your pseudo-random generator.
But again rlimits provides the mitigation:
RLIMIT_FSIZE
.
The pattern should be obvious: restrict all available rlimits
to the minimum
required to make the intended functionality succeed.
The code-snippet used to restrict the rlimits zeroes any resource limit
not expressly required to be non-zero.
Check the #define RLIMIT_DEFAULT_
:s at the top of hlua,
hpython and hnode.
Again we encounter the principle of least privilege.
For instance, this approach guarantees that a file-descriptor-exhaustion
attack is no longer viable:
RLIMIT_NOFILE
.
So given Alice's game you're itching to play regardless of her malicious intent: do you now feel safe enough to evaluate her code?
- We can enforce a list of allowed syscalls and their arguments using seccomp
- We can impose an additional layer of access restriction upon the file system hierarchy using landlock
- We can enforce strict resource usage limits on: memory usage, file-descriptor and thread/processes allocation
You might feel safe enough: but what surreal thing will she think of next‽
- "No, but seriously, why not sudo docker?"

Yes, and seriously: no sudo, but yes Docker. My opinion is that Docker is great (for me Docker := cgroups+overlayfs packaged into a sleek product, but that's fine), especially in professional CI/Kubernetes/what-have-you settings. With this project I want to showcase a Linux way of sandboxing applications unprivileged: hence no sudo, but yes Docker. Also my guiding principle (other than "trust, but verify" that is) is the principle of least privilege: why require privileges to do something that can be achieved without?

- "Why no binary packages?"

Because each embedded script host may have a different license, and I do not want to spend the time to study each of them and mess up anyway. Also my aim is for this project to be an educational showcase and a sandbox letting users experiment and get hands-on experience with the Linux security features in a non-toy setting: reducing the value of a pre-built binary package distributed without the sources and tools. The laziness argument coupled with this argument guides me to only offer source packages: mostly as a guide for users who do not feel comfortable jumping into the deep end with git clone and make.
This project is intended to be built using Makefile
:s and common C build
tools, except for the bpf_asm
tool (found in the
Linux kernel sources).
Arch Linux users can use the bpf
package, but other distributions might have to build their own copies.
I have prepared a build script which is used when bpf_asm
is not found (the script is used in the Ubuntu workflow job).
The steps to build the project are then:

```
make tools
make build
make check
```
If these steps fail because of missing dependencies you may consult the following table (derived from the packages installed during the Build and test workflow).
| | runtime | build | check |
|---|---|---|---|
| Ubuntu 22.04 | libcap2 lua5.4 python3 libnode72 bash | make pkg-config gcc libcap-dev wget ca-certificates bison flex liblua5.4-dev python3 libpython3-dev libnode-dev | python3-toml |
| Arch Linux | lua python nodejs bash | make gcc pkgconf bpf | python-toml |
Pick a release and download the Ubuntu source package asset. Included within are the sources and two helper scripts:

- build-package runs dpkg-buildpackage, as well as checking for missing build-time dependencies
- install-package installs the built package using apt-get, but note that you can try out the built binaries without a system-wide installation
These scripts are intended to be running as an unprivileged user, but might
need sudo
access to apt-get
in order to install missing dependencies.
Both scripts accept an -s
option for this case, or you can set the SUDO
environment variable (e.g. SUDO="sudo --askpass"
).
Pick a release and download the Arch Linux
PKGBUILD
asset, place it in a suitably empty directory and invoke
makepkg
,
possibly with --syncdeps
and/or --install
options when desired.
Note that you can try out the built binaries (found in the created src
subfolder) without a system-wide installation.
- use strace statistics to sort seccomp filters with respect to number of calls
- landlock ABI=2 (see the sandbox example)
- readline or naive REPL with rlwrap
- reference OpenBSD's pledge(2)