In recent years, since the Internet has become available to almost anyone, application and runtime security has become more important than ever. Whether it is an unknown application you download and run from the Internet or a server application you expose to the Internet, it’s almost certainly a bad idea to run apps without any security restrictions applied:
Unknown (untrusted) applications from the Internet could well include malware trying to steal data from you. Server applications can even be attacked remotely by triggering security vulnerabilities such as buffer overflows, file inclusion bugs and the like.
Some solutions, such as using multiple devices (or a dedicated device for each process you would normally run on your PC), are impractical and cumbersome to use. Other techniques exist, such as application sandboxing, which we will explore later on.
The problem here is that applications (by default) can access data and use operating system functionality they should not be able to use. The trojan horse you just downloaded can easily access your photos and other kinds of sensitive data stored on your PC, or inject itself into your browser and steal money while you’re performing bank transactions there. The server application you expose to the Internet can run arbitrary code after a buffer overflow was triggered in it and cause pretty much the same harm as any other kind of malware.
(Source: https://xkcd.com/1957/)
This is the first of two posts on the security of modern operating systems. This part lays out and explains the groundwork of security in the Linux Kernel.
Part two shows how the security mechanisms introduced in this post can be combined to create containerization platforms such as Docker and OS-level application isolation techniques such as LXC. It also introduces other kinds of isolation techniques such as virtualization.
Sandboxing
The idea behind application sandboxing is simple: A sandboxed application (which in fact is a process) is isolated from all other processes running on a PC. It can only cause harm to anything inside its sandbox and this sandbox should only include the bare-minimum data and functionality required to run the application.
Probably the sandbox in most common use (and even active on your PC right now) is inside your browser! Modern browsers such as Firefox isolate all tabs from one another and only allow communication between them through specially crafted communication interfaces.
Sandboxing, if implemented correctly, is a great solution to the problem from our introduction in that it allows a single PC to run multiple processes while confining the harm each of them can cause to its own sandbox.
But how is sandboxing actually achieved in modern operating systems? Let’s find out!
Security in the Linux Kernel
Most servers on the Internet use Linux as their operating system kernel, and modern containerization platforms such as Docker currently only run on Linux. Linux is also open source and has the most active development community and companies behind it (source 1, source 2). That’s why we want to take a deeper look into its inner workings and explore the security mechanisms provided by Linux.
chroot — change root
One feature for sandboxing applications has been in Linux since its inception and is supported by almost all UNIX descendants, including BSD and System V: chroot.
chroot is a system call that changes the apparent root directory of a process; chroot-jailed processes see a different file system view than other processes.
Let’s take a look at the following file system view to better illustrate how a chroot actually works. Suppose you have the following files on your drive:
In order to create a chroot jail for a process with a different file system view, such a “view” must exist. This simply means all necessary files required to launch an operating system such as Debian must exist in some sub-folder on the file system. This might look as follows (see the chroot folder):
Let’s spawn a new /bin/sh shell and chroot it to /chroot:
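The original shell session is not preserved here, so the following is only a minimal Python sketch of the same idea. It forks a child, jails it with `os.chroot`, and reports what the child sees under `/`. The function name `list_root_inside_chroot` is made up for this illustration, and the call only succeeds when the process is allowed to use `chroot()` (typically root).

```python
import os


def list_root_inside_chroot(new_root):
    """Return the entries a chroot-jailed child process sees under "/".

    Returns None if the chroot could not be entered, for example because
    the caller lacks the privilege to call chroot().
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: this is the process that gets jailed
        status = 1
        try:
            os.close(read_fd)
            os.chroot(new_root)  # from now on "/" refers to new_root
            os.chdir("/")        # move the working directory into the jail
            listing = "\n".join(sorted(os.listdir("/")))
            os.write(write_fd, listing.encode())
            status = 0
        finally:
            os._exit(status)     # never return into the parent's code path
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read().decode()
    _, wait_status = os.waitpid(pid, 0)
    if wait_status != 0:
        return None
    return data.split("\n") if data else []
```

Calling `list_root_inside_chroot("/chroot")` as root would list the contents of the `/chroot` folder, because that folder is the child's `/`. The parent process is not affected.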
A chroot-jailed process can create new files only in its jail:
Our file system now looks as follows:
chroot-capable directories can either be created by hand or automatically by tools like debootstrap. debootstrap can be used to easily set up a Debian system:
chroot forms the basis of every application sandbox in that it enables different processes to see different files. However, chroots by themselves do not provide any security against malicious attacks. This is due to the fact that the root user inside the jail has the same user ID as the root user outside of it. Escaping a chroot jail without any additional restrictions is quite easy!
If implemented correctly, a chrooted process cannot see files outside of its chroot, yet a malicious or misbehaving application can still harm the system, e.g. by exhausting the system’s hardware resources or accessing network interfaces when it shouldn’t. Additional security mechanisms are required to prevent that!
namespaces
Support for namespaces was added to the Linux Kernel back in 2002. Namespaces affect which system resources a process can see and interact with. As of Kernel v5.8 (the most recent at the time of writing), this includes the following 8 system resource types:
- Interprocess Communication (IPC)
  - Controls which processes can communicate with each other via IPC (e.g. using shared memory, SHM)
- Network (net)
  - Controls which network interfaces a namespace uses
- Mount (mnt)
  - Controls which mounts are available to a namespace
- Process ID (PID)
  - Controls which processes a namespace can see and interact with
- UNIX Time-Sharing (UTS)
  - Controls which system hostname a namespace uses
- User ID (user)
  - Controls which users a namespace can interact with
- Time (time)
  - Controls which offsets to the monotonic and boot-time clocks a namespace sees
- Control group (cgroup)
  - Controls which part of the cgroup hierarchy a namespace can see
Every process on a Linux system belongs to exactly one namespace of each type and can use all system resources which are available to those namespaces.
Consider the following graphical illustration where we have two net namespaces: one of them, aptly named No_Network, has access to no network interface, and the other, named With_Network, has access to the lo0 and eth0 network interfaces:
Processes A and B use the No_Network namespace, so they cannot perform any network-related activities.
Processes C and D, on the other hand, are in the With_Network namespace, so they do have access to the lo0 and eth0 network interfaces.
Internally, in the Kernel, each namespace is identified by a namespace ID. This ID can be shown for each process by inspecting the ns directory in the proc file system (/proc/<PID>/ns):
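As a rough sketch of how these IDs can be read programmatically: each entry in `/proc/<PID>/ns` is a symbolic link whose target looks like `uts:[4026531838]`. The helper names `parse_ns_link` and `namespaces_of` below are invented for this example.

```python
import os
import re


def parse_ns_link(target):
    """Split a link target such as 'uts:[4026531838]' into (type, id)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Map each entry of /proc/<pid>/ns to its namespace ID."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        _, ns_id = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = ns_id
    return result
```

Two processes share a namespace exactly when the corresponding IDs are equal, so comparing `namespaces_of()` with `namespaces_of(1)` (which may require root) shows which namespaces the current process shares with the init process.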
Let’s create our own UTS namespace to use a different system hostname for a new process, and only for this process!
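The original commands are not preserved here. One possible way to try this from Python is sketched below, assuming a Linux system where the `unshare()` system call is reachable through the C library via `ctypes`. The function name `hostname_in_new_uts_namespace` is hypothetical, and the call needs sufficient privileges (typically root); otherwise the sketch returns `None`.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000  # flag value from <linux/sched.h>


def hostname_in_new_uts_namespace(new_name):
    """Rename a child process inside its own UTS namespace.

    Returns (child_hostname, parent_hostname), or None if creating the
    namespace was not permitted.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        status = 1
        try:
            os.close(read_fd)
            # Detach the child from the parent's UTS namespace.
            if libc.unshare(CLONE_NEWUTS) == 0:
                socket.sethostname(new_name)  # only affects the new namespace
                os.write(write_fd, socket.gethostname().encode())
                status = 0
        finally:
            os._exit(status)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        child_name = pipe.read().decode()
    os.waitpid(pid, 0)
    if not child_name:
        return None
    return child_name, socket.gethostname()
```

When it succeeds, the first element of the result is the new hostname while the second one still shows the unchanged hostname of the parent.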
cgroup — Control groups
Control groups (abbreviated cgroups) are often mentioned in the same breath as namespaces, but they are a separate and very powerful mechanism, so they deserve their own section! Support for cgroups was added to the Linux Kernel in 2008.
cgroups allow you to specify how much of each hardware resource a process can use. This includes the following hardware resource types:
- CPU
- RAM
- Storage I/O
- Network I/O
- etc.
Additionally, certain cgroups can be given a higher priority than others (comparable to what the nice CLI app does for individual processes). cgroups also allow you to measure their resource usage, which can be used for billing of shared computation resources, e.g. as done by VPS providers.
(Source: https://mairin.wordpress.com/2011/05/13/ideas-for-a-cgroups-ui/)
Let’s see them in action and create our own cgroup!
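The original walkthrough is not preserved here, so what follows is a minimal sketch that assumes the cgroup v2 hierarchy is mounted at `/sys/fs/cgroup` and that the memory controller is enabled for child groups. In cgroup v2, a group is created by making a directory, limits are set by writing to files such as `memory.max`, and a process is moved by writing its PID to `cgroup.procs`. The function name `create_cgroup` and its `root` parameter are made up for this example.

```python
import os


def create_cgroup(name, memory_max_bytes, pid, root="/sys/fs/cgroup"):
    """Create a cgroup, cap its memory, and move one process into it."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)  # a new directory is a new cgroup
    with open(os.path.join(path, "memory.max"), "w") as limit_file:
        limit_file.write(str(memory_max_bytes))  # hard memory limit in bytes
    with open(os.path.join(path, "cgroup.procs"), "w") as procs_file:
        procs_file.write(str(pid))  # writing a PID migrates that process
    return path
```

Run as root, `create_cgroup("demo", 64 * 1024 * 1024, os.getpid())` would limit the calling process to 64 MiB of memory. The `root` parameter exists so the logic can be tried against an ordinary directory first.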
seccomp — Secure computing mode
The secure computing mode (introduced to Linux in 2005) allows processes to make a one-way transition into a “secure” mode. During this transition, the process relinquishes the right to use certain system calls and from then on is no longer able to use them (the process is killed by the Kernel if it still tries to). This makes it possible to follow the principle of least privilege, where an application can only do what it must be allowed to do to complete its intended task.
Applications written in memory-unsafe programming languages such as C are often vulnerable to buffer overflows where an attacker can execute arbitrary code in the context of the application and can therefore run system calls which are not even used by the application under normal conditions.
seccomp is the only security feature we will explore which the application itself has to implement; it’s a feature which protects the application from itself.
In its strictest form, the strict mode, seccomp only allows the following 4 system calls to be made:
- exit
- sigreturn
- read
- write
Processes using the strict seccomp mode need to make all required file descriptors available before performing the one-way transition to their locked-down version.
If the strict mode is too strict, for example if a call to the open system call is still required after the transition, an allow list of valid system calls can be specified instead (filter mode).
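Whether a process has made this transition can be checked from the outside: the `Seccomp:` field in `/proc/<PID>/status` reports 0 (disabled), 1 (strict mode) or 2 (filter mode). The sketch below only reads that field and does not enable seccomp itself. The names `seccomp_mode` and `current_seccomp_mode` are invented for this illustration.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Extract the seccomp mode from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            return SECCOMP_MODES[int(line.split(":", 1)[1])]
    return None  # field missing, e.g. a kernel built without seccomp


def current_seccomp_mode():
    """Report the seccomp mode of the calling process."""
    with open("/proc/self/status") as status_file:
        return seccomp_mode(status_file.read())
```

Inside a Docker container with the default seccomp profile, `current_seccomp_mode()` would be expected to report filter mode, since Docker applies an allow list of system calls.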
In C, seccomp can be used as follows:
Capabilities
Capabilities split the all-or-nothing privileges of the root user into distinct units which can be granted to or withheld from a process individually. As of Kernel v5.8, the list of capabilities includes:
- CAP_NET_BIND_SERVICE: Allow binding to TCP/UDP ports below 1024
- CAP_CHOWN: Allow the use of the chown() system call to change file and group ownership
- CAP_SYS_CHROOT: Allow the use of the chroot() system call
- CAP_SYS_PTRACE: Allow ptrace() on any process
- CAP_NET_BROADCAST: Allow broadcasting and listening to multicast
- CAP_NET_RAW: Allow the use of RAW sockets
- CAP_SYS_BOOT: Allow the use of the reboot() system call to reboot the system
- …
Capabilities in action:
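The original demonstration is not preserved here. As a small stand-in, the sketch below decodes the effective capability set of a process, which the Kernel exposes as a hexadecimal bit mask in the `CapEff:` line of `/proc/<PID>/status`. The bit numbers are taken from `include/uapi/linux/capability.h`; only the capabilities listed above are included, and the function names are made up for this example.

```python
# Bit positions from include/uapi/linux/capability.h (subset only).
CAPABILITY_BITS = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    11: "CAP_NET_BROADCAST",
    13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT",
    19: "CAP_SYS_PTRACE",
    22: "CAP_SYS_BOOT",
}


def decode_capabilities(mask_hex):
    """Return the known capability names set in a hexadecimal bit mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in sorted(CAPABILITY_BITS.items())
            if mask & (1 << bit)]


def effective_capabilities(pid="self"):
    """Decode the CapEff line of /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as status_file:
        for line in status_file:
            if line.startswith("CapEff:"):
                return decode_capabilities(line.split(":", 1)[1].strip())
    return []
```

For an unprivileged user, `effective_capabilities()` usually returns an empty list, while a root shell has every bit set.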
LSMs — Linux Security Modules
In order to better understand the concept of Linux Security Modules (LSMs), we first need to talk about Linux Kernel Modules.
Linux is essentially a large application written in C which, once compiled, can no longer be extended with new functionality. That’s where Linux Kernel Modules (LKMs) come into play. Through special interfaces provided by Linux, LKMs make it possible to modify the Linux Kernel at runtime.
A simple Linux Kernel Module might look as follows:
Running make load and make unload adds the following output to the Kernel ring buffer (visible via the dmesg command):
That’s it for our short excursion into Linux Kernel Modules (LKMs).
Linux Security Modules (LSMs) are very similar to LKMs, although they can’t be loaded and unloaded at runtime (after all, what’s the point of a security module if a malicious program can simply unload it?).
LSMs extend the Linux Kernel with additional security features whose use should not be mandatory. Instead, LSMs can be enabled and disabled through the bootloader configuration (for GRUB this is as easy as adding apparmor=1 security=apparmor to the GRUB_CMDLINE_LINUX_DEFAULT config option to enable the AppArmor LSM, which we will get to know later on).
Several LSMs are already included in the Linux Kernel source tree, some of them even enabled by default:
- Yama
  - Yama restricts the use of ptrace()
- LoadPin
  - LoadPin ensures all kernel-loaded files (modules, firmware, etc.) originate from the same file system and not some external one
- SafeSetID
  - SafeSetID restricts UID/GID process transitions by a system-wide whitelist.
  - Example: Using SafeSetID, one can specify the following:
    - “User1 may start process as User2”
    - “User1 may NOT start process as User3”
- SELinux
  - SELinux implements fine-grained Mandatory Access Control (MAC). To achieve this, it labels objects and defines who is allowed to do what (e.g. User1 is allowed to modify the system timezone)
  - Initially developed by the NSA in 2000 in the form of Kernel patches. Only later became an LSM and even part of the Linux source tree
  - Enabled by default on Android since v4.3
  - Comes with a GUI which is more or less user-friendly
- Smack
  - Smack is similar to SELinux but much easier to use
- AppArmor
  - AppArmor implements a MAC for confining applications. Compared to SELinux it uses no object-labeling; instead, the security policy is applied to pathnames
  - Enabled by default in Debian 10 (Buster)
  - Explained in more detail further below
- TOMOYO
  - TOMOYO is similar to AppArmor but “domains” (trees of process invocation) are targeted instead of pathnames
  - Example: Using TOMOYO, one can specify the following:
    - The call chain boot -> init -> sh -> ping is allowed
    - The call chain boot -> init -> sh -> bash -> ping is not allowed
    - Therefore, bash is not allowed to launch the ping binary
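To find out which of these LSMs are active on a running system, the Kernel exposes a comma-separated list in `/sys/kernel/security/lsm` (this assumes securityfs is mounted, which is the case on most distributions). A minimal sketch, with the function names `parse_lsm_list` and `active_lsms` invented for this example:

```python
def parse_lsm_list(text):
    """Split the contents of /sys/kernel/security/lsm into LSM names."""
    return [name for name in text.strip().split(",") if name]


def active_lsms(path="/sys/kernel/security/lsm"):
    """Return the LSMs that are active on the running Kernel."""
    with open(path) as lsm_file:
        return parse_lsm_list(lsm_file.read())
```

On a system booted with AppArmor enabled, the name `apparmor` would be expected to appear in the result of `active_lsms()`.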
AppArmor
AppArmor is one of the better-known Linux Security Modules and has been part of the mainline Linux Kernel since version 2.6.36 (2010). It is similar to SELinux in that it implements Mandatory Access Control, but it identifies files by their path rather than by labels attached to their inode, which makes configuration profiles easier to create.
AppArmor comes with three modes of operation:
- audit mode (verification mode)
  - log all actions
- complain mode (learning mode)
  - log but do not block restricted actions
- enforce mode (enforcement mode)
  - log and block restricted actions
AppArmor relies on configuration profiles to limit the actions certain binaries are allowed to perform. Such a configuration profile might look as follows:
AppArmor by default comes with a variety of profiles for all kinds of applications and makes it easy to create profiles for other applications as well. Let’s walk through the process.
Say we have a small app (apparmor-demo) which tries to read the file /bin/ping and prints either success or failure depending on whether the read succeeded:
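The source of the demo app is not preserved here. The logic it describes is small enough to sketch in a few lines of Python. This is only an illustration of the behavior, not the original program (which is referred to as a binary), and the function name `try_read` is made up.

```python
def try_read(path):
    """Return 'success' if the file can be opened and read, else 'failure'."""
    try:
        with open(path, "rb") as target:
            target.read(1)  # one byte is enough to prove read access
    except OSError:         # includes permission errors raised by an LSM
        return "failure"
    return "success"


if __name__ == "__main__":
    print(try_read("/bin/ping"))
```

A read that is denied by an LSM surfaces in user space as an ordinary permission error, which is why catching `OSError` is sufficient here.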
Let’s try to launch the app:
Obviously this succeeds.
Now we want to create an AppArmor profile for our binary to explicitly grant read-access to that file.
First, create a new empty profile:
Enable the new profile using apparmor_parser:
And try to launch our binary again:
This time it failed to read /bin/ping because AppArmor works with an allow list instead of a deny list: all allowed actions must be explicitly specified, rather than specifying which actions are not allowed.
As we’re too lazy to edit the profile ourselves, let’s make use of some of AppArmor’s handy utilities which allow us to update profiles interactively.
We first need to set our profile to complain mode so all actions are allowed and logged:
Running our app succeeds again and produces the expected log entries:
aa-logprof can be used to read those log messages and create an entry in the profile to either allow or deny the action:
And indeed, our profile was updated!
Now that we explicitly allow read access to /bin/ping, let’s put our application in enforce mode again:
Our app now works even with an active AppArmor profile!
LSMs such as AppArmor might seem incredibly complicated to implement, but they are actually quite simple:
In its init function, AppArmor registers for all kinds of Kernel hooks such as file_open, which is fired before a file is opened by the Linux Kernel.
(Source: linux-1b50440210/security/apparmor/lsm.c#L1194)
AppArmor’s apparmor_file_open function then decides, based on the relevant AppArmor profile, whether the call should be aborted (if the action is not allowed) or whether the Kernel shall continue to open the file.
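The hook pattern itself can be modeled in a few lines. The following is a toy user-space model in Python, not Kernel code: `HookRegistry` and `make_file_open_hook` are hypothetical names, and the "profile" is reduced to a set of allowed paths. It only illustrates the idea that a module registers a callback for a named hook and that a non-zero return value aborts the operation.

```python
import errno


class HookRegistry:
    """Toy model of a hook list: callbacks are registered per hook name."""

    def __init__(self):
        self._hooks = {}

    def register(self, hook_name, callback):
        self._hooks.setdefault(hook_name, []).append(callback)

    def call(self, hook_name, *args):
        """Run all callbacks; the first non-zero result aborts the call."""
        for callback in self._hooks.get(hook_name, []):
            result = callback(*args)
            if result != 0:
                return result
        return 0


def make_file_open_hook(allowed_paths):
    """Build a file_open callback from a simplified allow-list 'profile'."""
    def file_open(path):
        return 0 if path in allowed_paths else -errno.EACCES
    return file_open
```

With `registry.register("file_open", make_file_open_hook({"/bin/ping"}))`, a call to `registry.call("file_open", "/bin/ping")` returns 0 (continue), while any other path yields a negative error code, mirroring the allow-list behavior seen in the walkthrough above.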
Where to go from here?
This was our introduction to some of Linux’s security features. And let us tell you what: these are pretty much the only security features needed to make containers (as used by Docker et al.) possible and secure enough to even run untrusted code. You should now be able to implement the core of Docker yourself!
All of these security features are useful, yet they require manual work of either the application developer or an (experienced) computer user to enable. For server applications, when set up by a system administrator, this knowledge probably exists, but for the larger part of the computer user base, the home user, it doesn’t. Security features which are available but not enabled by default are pretty useless.
This brings us to part two of this blog post which explains more about containers, virtualization and security solutions for the end-user.
Sources
All links were last accessed on 2020-09-08.
- https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
- https://github.com/torvalds/linux/blob/1b5044021070efa3259f3e9548dc35d1eb6aa844/Documentation/admin-guide/cgroup-v2.rst
- https://github.com/torvalds/linux/blob/1b5044021070efa3259f3e9548dc35d1eb6aa844/include/uapi/linux/capability.h
- https://github.com/torvalds/linux/tree/1b5044021070efa3259f3e9548dc35d1eb6aa844/Documentation/admin-guide/LSM
- https://man7.org/linux/man-pages/man7/capabilities.7.html
- https://man7.org/linux/man-pages/man2/seccomp.2.html
- https://man7.org/linux/man-pages/man2/syscalls.2.html
- https://www.kernel.org/doc/html/latest/admin-guide/LSM/index.html
- https://www.linux.com/training-tutorials/overview-linux-kernel-security-features/
- https://ajxchapman.github.io/linux/2016/08/31/seccomp-and-seccomp-bpf.html
- https://en.wikipedia.org/wiki/Cgroups
- https://en.wikipedia.org/wiki/Security-Enhanced_Linux
- https://gitlab.com/apparmor/apparmor/-/wikis/AppArmor_Failures
- https://debian-handbook.info/browse/en-US/stable/sect.apparmor.html
- https://wiki.ubuntuusers.de/AppArmor/
- https://wiki.archlinux.org/index.php/Chroot
- https://github.com/moby/moby/blob/master/profiles/seccomp/default.json