Implementing fast lightweight containers in Go with bst and btrfs (Part 1)
Containers, namespaces, and execution model
This is the first part of a series to build a toy container system for fun. In this segment, we’ll work on the initial setup to run arbitrary commands on an Alpine Linux image.
Containers are not exactly trivial to implement from scratch. It takes a lot of work to understand and use namespaces, and getting the semantics right is even harder – so let’s do it! What could possibly go wrong?
Snark aside, we’ll try to implement a toy container system in this article, and see what we learn. However, the fact remains that the namespacing part alone would probably take months to do, so what can we do? Well, it turns out that at Arista Networks, we released a tool called bst, which makes that work significantly easier. It is still not completely trivial, but it becomes possible to write a toy container system over a few days.
What do we want out of a container system? A few things come to mind:
- We need it to manage container images.
- We need it to manage per-user containers.
- We’d like it to be fast.
- We’d like it to be usable unprivileged (i.e. the user should not have to use sudo)
The setup
For this setup, we’ll be needing a few things. First and foremost, a Linux system. Containers are mostly an abstraction over a root filesystem and Linux namespaces. We won’t go into the specifics of namespaces here, but if you’re curious, the manual page for namespaces(7)[1] contains a nice overview of the existing namespace types, with links to more detailed explanations for each type.
As previously mentioned, the other tool we’ll need is bst. Grab a release and either build it from source or install the static binaries.
We’ll eventually be using BTRFS for our container image management. We don’t really need any specific tool, but installing btrfs-progs is typically a good idea. However, we won’t be using BTRFS in this first part – this will happen in the next parts.
For our programming language, we’ll be using Go. You can pretty much pick whatever language you fancy, but I’m choosing Go because it’s relatively easy to read and understand unknown Go code, and it’s fast to build.
You might be asking yourselves why we’re using a third-party utility like bst when we could technically just use Go to set up our namespaces. There are a few reasons[2], but the short of it is that Go is good at a lot of things, but not the particular use-case that led to bst. But that’s okay – we can use Go to build the unprivileged part of our container system, and it shines way better there.
First steps: know thy tools
Let’s take some time to understand what we can do with bst. The first visible side-effect seems to be that bst drops us in a root shell:
This is not the host root user – by virtue of user namespaces, this is our user ID being mapped as the root user. To demonstrate:
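The same mapping can be demonstrated with util-linux’s unshare(1), which relies on the exact same kernel mechanism: -U creates a fresh user namespace, and -r maps the current user to root inside it.

```shell
# Map ourselves to root inside a new user namespace, then inspect the mapping.
unshare -U -r sh -c 'id -u; cat /proc/self/uid_map'
```

`id -u` prints 0, and uid_map shows a single `<inner> <outer> <count>` line mapping our real UID to 0.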
You might have noticed the two blaring warnings about not having enough IDs to map. This happens because our real UID and GID do not have any associated sub[ug]id. Let’s allocate ourselves 100000 of them, and confirm that we have the right IDs allocated:
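On most distributions the allocation goes through usermod from shadow-utils (requires root; the 1000000 start offset here is an assumption matching the ranges shown below):

```shell
# Allocate 100000 subordinate UIDs and GIDs for our user.
sudo usermod --add-subuids 1000000-1099999 --add-subgids 1000000-1099999 "$USER"
# Confirm the allocation: each file carries one <user>:<start>:<count> line.
grep "$USER" /etc/subuid /etc/subgid
```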
We can indeed see that our current UID (1000) is mapped to UID 0, while UIDs 1000000 to 1100000 are mapped starting from UID 1. More information about these mappings can be found in the manual page for subuid(5)[3] and related pages[4][5].
Being the root user of a user namespace does give us full control over other namespaces created under it. For instance, bst gives us our own mount namespace, so we are able to mount certain things (but not everything) unprivileged.
We couldn’t mount /dev/sda1 onto /mnt because doing so would have security implications – if we could, a malicious user could mount and change the host root filesystem.
You might now be thinking: “OK, that’s great and all, but all this gives me is access to my own rootfs. Can’t we use something else?”
We can indeed! Let’s try to download an Alpine Linux minirootfs, and extract it somewhere we can use:
Alpine Linux is a light, minimalist distribution that is fairly popular in the Docker world, which makes it a great distribution to hack on.
Let’s try using our newfangled rootfs:
Looks good! How about installing some package?
Uh oh. It turns out that we haven’t configured any network access whatsoever, so we can’t install anything. Let’s exit the shell and pass some more flags to bst:
This does a few things:
First, we add some very lazy networking support – actually bridging the inner network namespace to the outside is non-trivial and out of scope for this article. We do this by sharing the network namespace with the host, and bind-mounting the host /etc/resolv.conf to the inner /etc/resolv.conf to get DNS to work.
Second, we add default mounts for /proc, /dev, /tmp, and /run. A lot of programs expect these mounts to be present, so we oblige.
With the new shell, sure enough:
Victory!
At this point, we should have a pretty good idea on how to use bst. It’s high time we started to build the framework of our container system – our bst command-line was starting to get pretty big and cumbersome to type.
Writing the PoC
Let’s start writing some Go. The first few things we need to do are writing the data types describing our containers, and a simple CLI to start a container.
What does a container need? If you think about it, a container has pretty much the following properties:
- A name
- A root filesystem
- A startup program
- System resource definitions (mounts, interfaces, …)
For now, let’s go with the following data types:
// container.go
package main

// Container represents the properties of a container.
type Container struct {
	// Name is the name of this container.
	Name string

	// Root is the path to the root filesystem of this container.
	Root string

	// Argv is the argv array of the command executed by this container.
	Argv []string

	// Mounts is the list of mounts to perform at the container startup.
	Mounts []MountEntry
}

// MountEntry represents a container mount operation.
type MountEntry struct {
	// Source is the mount source. If a path is specified, it is interpreted
	// relative to the host filesystem.
	Source string

	// Target is the mount target path. It is always interpreted relative to the
	// container root filesystem.
	Target string

	// Type is the mount type. Defaults to "none".
	Type string

	// Options is the list of mount options.
	Options []string
}
This should get us pretty far. Let’s write a function that goes from a Container to a runnable exec.Cmd:
import (
	"fmt"
	"os/exec"
	"strings"
)

...

func (c *Container) Command() (*exec.Cmd, error) {
	args := []string{
		"-r", c.Root,
	}
	for _, mount := range c.Mounts {
		if mount.Target == "" {
			return nil, fmt.Errorf("mount entry must have a non-empty target")
		}
		if mount.Source == "" {
			mount.Source = "none"
		}
		if mount.Type == "" {
			mount.Type = "none"
		}
		mountArg := fmt.Sprintf("source=%s,target=%s,type=%s,%s",
			mount.Source,
			mount.Target,
			mount.Type,
			strings.Join(mount.Options, ","))
		args = append(args, "--mount", mountArg)
	}
	args = append(args, "--workdir", "/", "--")
	args = append(args, c.Argv...)
	return exec.Command("bst", args...), nil
}
With this in hand, let’s use it in a dumb main function to see if we can re-enter our alpine rootfs:
// main.go
package main

import (
	"fmt"
	"os"
)

func fatalf(exit int, msg string, args ...interface{}) {
	fmt.Fprintf(os.Stderr, "%s: ", os.Args[0])
	fmt.Fprintf(os.Stderr, msg, args...)
	fmt.Fprintf(os.Stderr, "\n")
	os.Exit(exit)
}

func main() {
	ctnr := Container{
		Name: "alpine",
		Root: "./rootfs",
		Mounts: []MountEntry{
			{
				Source: "proc",
				Target: "/proc",
				Type:   "proc",
			},
			{
				Source: "dev",
				Target: "/dev",
				Type:   "devtmpfs",
				Options: []string{
					"mode=755",
				},
			},
			{
				Source: "run",
				Target: "/run",
				Type:   "tmpfs",
				Options: []string{
					"mode=755",
				},
			},
			{
				Source: "tmp",
				Target: "/tmp",
				Type:   "tmpfs",
			},
		},
		Argv: os.Args[1:],
	}

	cmd, err := ctnr.Command()
	if err != nil {
		fatalf(1, "%v", err)
	}
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.Env = []string{
		"TERM=" + os.Getenv("TERM"),
		"PATH=/bin:/usr/bin:/sbin:/usr/sbin",
	}
	if err := cmd.Run(); err != nil {
		fatalf(1, "%v", err)
	}
}
After building this with go build -o poc ., lo and behold:
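A reconstructed session (exact output depends on your rootfs):

```
$ go build -o poc .
$ ./poc /bin/sh
/ # id -u
0
```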
Persisting the container
Now that we have a PoC, we have to formulate a way to persist information about the container, as well as its current runtime state. After all, we’d want to discover running containers and possibly execute new programs in them, and right now, we not only have no way to do this, but we also lose our containers altogether when the program exits.
The first straightforward thing to do is to marshal the Container object somewhere. We can JSON-encode it and store it under a directory that the user controls, for instance $XDG_STATE_HOME[6]. This is also a good place to store the container rootfs.
Let’s set up our alpine container by hand:
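A sketch of that hand setup (the directory layout matches the paths that exec reads below, and the JSON fields mirror the Container struct; Root is taken relative to the container directory):

```shell
# Create the container directory under the state home and drop the metadata in.
dir="${XDG_STATE_HOME:-$HOME/.local/state}/toyc/containers/alpine"
mkdir -p "$dir"
cat > "$dir/container.json" <<'EOF'
{
  "Name": "alpine",
  "Root": "rootfs",
  "Mounts": [
    { "Source": "proc", "Target": "/proc", "Type": "proc" },
    { "Source": "dev", "Target": "/dev", "Type": "devtmpfs", "Options": ["mode=755"] },
    { "Source": "run", "Target": "/run", "Type": "tmpfs", "Options": ["mode=755"] },
    { "Source": "tmp", "Target": "/tmp", "Type": "tmpfs" }
  ]
}
EOF
# Then move the rootfs we extracted earlier next to it:
# mv rootfs "$dir/rootfs"
```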
Let’s think about what we would like our CLI to look like. We need a command to create the above container metadata and setup the rootfs, and we need one to execute a command on it, one to kill a container, one to remove it entirely, and one to list all of them.
$ toyc create <archive> <name>
$ toyc exec <name> <args...>
$ toyc kill <name>
$ toyc rm <name>
$ toyc ps
We won’t be implementing all of them in this part alone, but this should give us a good idea of where we want to go.
We can define the commandline boilerplate with something like Cobra:
// main.go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func fatalf(exit int, msg string, args ...interface{}) {
	fmt.Fprintf(os.Stderr, "%s: ", os.Args[0])
	fmt.Fprintf(os.Stderr, msg, args...)
	fmt.Fprintf(os.Stderr, "\n")
	os.Exit(exit)
}

var root = &cobra.Command{
	Use:   "toyc <command>",
	Short: "toyc is a fast, lightweight toy container system.",
}

func main() {
	if err := root.Execute(); err != nil {
		fatalf(2, "%v", err)
	}
}
Since we just hand-crafted a container, we can start by porting the PoC to the exec subcommand:
// exec.go
package main

import (
	"os"
	"os/exec"
	"path/filepath"
	"strings"
	"syscall"

	"github.com/spf13/cobra"
)

func init() {
	cmd := cobra.Command{
		Use:   "exec [options] [--] <name> <program> [args...]",
		Short: "execute a program in the named container.",
		Run:   execCmd,
		Args:  cobra.MinimumNArgs(2),
	}
	root.AddCommand(&cmd)
}

func execCmd(_ *cobra.Command, args []string) {
	var (
		name = args[0]
		argv = args[1:]
	)

	stateHome := os.Getenv("XDG_STATE_HOME")
	if stateHome == "" {
		home := os.Getenv("HOME")
		if home == "" {
			fatalf(1, "no state home configured -- set the XDG_STATE_HOME "+
				"or HOME environment variable.")
		}
		stateHome = filepath.Join(home, ".local", "state")
	}

	// LoadContainerConfig appends container.json itself, so pass it the
	// container directory.
	path := filepath.Join(stateHome, "toyc", "containers", name)

	ctnr, err := LoadContainerConfig(path) // Not implemented yet
	if err != nil {
		fatalf(1, "exec %s %s: loading container: %v", name, strings.Join(argv, " "), err)
	}
	ctnr.Argv = argv

	cmd, err := ctnr.Command()
	if err != nil {
		fatalf(1, "exec %s %s: preparing command: %v", name, strings.Join(argv, " "), err)
	}
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.Env = []string{
		"TERM=" + os.Getenv("TERM"),
		"PATH=/bin:/usr/bin:/sbin:/usr/sbin",
	}
	if err := cmd.Run(); err != nil {
		if err, ok := err.(*exec.ExitError); ok {
			// Propagate a sensible exit status.
			status := err.Sys().(syscall.WaitStatus)
			switch {
			case status.Exited():
				os.Exit(status.ExitStatus())
			case status.Signaled():
				os.Exit(128 + int(status.Signal()))
			}
		}
		fatalf(1, "exec %s %s: running command: %v", name, strings.Join(argv, " "), err)
	}
}
toyc exec is fairly simple: first, construct the real path behind ${XDG_STATE_HOME:-$HOME/.local/state}/toyc/containers/<name>, then load the config from that path into a viable Container object, before finally using the same logic as the PoC to execute a command in that container.
We just need to implement LoadContainerConfig:
// container.go

import (
	"encoding/json"
	"errors"
	"os"
	"path/filepath"
)

...

var (
	ErrContainerNotExist = errors.New("container does not exist")
)

...

// LoadContainerConfig loads a Container from the specified container directory.
func LoadContainerConfig(path string) (Container, error) {
	f, err := os.Open(filepath.Join(path, "container.json"))
	if os.IsNotExist(err) {
		return Container{}, ErrContainerNotExist
	}
	if err != nil {
		return Container{}, err
	}
	defer f.Close()

	var ctnr Container
	err = json.NewDecoder(f).Decode(&ctnr)

	// Resolve the container root relative to the container directory.
	if !filepath.IsAbs(ctnr.Root) {
		ctnr.Root = filepath.Join(path, ctnr.Root)
	}
	return ctnr, err
}
LoadContainerConfig is mostly an open-read-decode boilerplate function, but adds clearer error messaging when the container does not exist. It also resolves the rootfs path relative to the container directory.
Let’s see if that works!
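A reconstructed session (the exact error wording depends on the cobra version):

```
$ go build -o toyc .
$ ./toyc exec alpine sh -c 'echo hello from alpine'
Error: unknown shorthand flag: 'c' in -c
```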
Oh. Of course, -c gets interpreted as a flag for toyc exec. Let’s tell cobra to stop processing command-line arguments when it encounters the first non-flag argument.
// exec.go
...

func init() {
	cmd := cobra.Command{
		Use:   "exec [options] [--] <name> <program> [args...]",
		Short: "execute a program in the named container.",
		Run:   execCmd,
		Args:  cobra.MinimumNArgs(2),
	}
	// Disable flag parsing after the first non-flag argument. This allows us
	// to type commands like `toyc exec ls -l` instead of `toyc exec -- ls -l`.
	cmd.Flags().SetInterspersed(false)
	root.AddCommand(&cmd)
}
Let’s try this again:
Nice. Our hand-prepared container config got loaded and used to run our little greeting.
We’re pretty much done, right? We can execute commands for a container, and they all live within the same context, right?
Not so fast:
Where did sleep infinity go? Well, as far as the current logic is concerned, bst creates a new set of namespaces for every invocation of toyc. We need any subsequent toyc exec to somehow join the same namespaces as the first one.
Fortunately, bst has our back once again:
In this example, we asked bst to persist the namespaces of the first invocation into ./ns, and in the second invocation we ask it to share these namespaces before running the commands. The end result shows that we’ve joined the mount and pid namespaces of the first command, as we can see the contents of the tmpfs and the sleep process.
The ./ns directory contains one associated nsfs file for each namespace:
The way this works is that a namespace cannot be freed while there is at least one active reference to it. Processes that have joined a namespace count towards that reference count, but another way to keep a namespace alive is to bind-mount the relevant /proc/<pid>/ns/<ns> file onto any destination, which keeps the namespace alive for as long as the mount is present. We can, in fact, check this:
Once done, we can kill the bst we’ve put in the background and unpersist the namespace files:
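bst ships a companion bst-unpersist utility for exactly this:

```
$ kill %1
$ bst-unpersist ./ns
```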
Using this feature, we can change the exec logic of our containers so that the container incorporates a runtime directory. If that directory does not exist, we create it and persist the namespaces in it. Conversely, if it exists, we join the namespaces persisted in it.
Let’s make a small change to our container.json for alpine:
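Concretely, that change could look like this, adding a RuntimePath field next to Root (the field name matches what the new Command reads; resolving it like Root, relative to the container directory, is an assumption of this sketch):

```shell
dir="${XDG_STATE_HOME:-$HOME/.local/state}/toyc/containers/alpine"
mkdir -p "$dir"
cat > "$dir/container.json" <<'EOF'
{
  "Name": "alpine",
  "Root": "rootfs",
  "RuntimePath": "ns",
  "Mounts": [
    { "Source": "proc", "Target": "/proc", "Type": "proc" },
    { "Source": "dev", "Target": "/dev", "Type": "devtmpfs", "Options": ["mode=755"] },
    { "Source": "run", "Target": "/run", "Type": "tmpfs", "Options": ["mode=755"] },
    { "Source": "tmp", "Target": "/tmp", "Type": "tmpfs" }
  ]
}
EOF
```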
That seems good, let’s try it:
Ah. It turns out that if the init of a PID namespace dies, the PID namespace becomes defunct and cannot be used anymore. Let’s try keeping init alive this time:
Right, there are certain things that we don’t want to perform again (like mounts) when entering an existing container. Fixing the Command function once more:
func (c *Container) Command(init bool) (*exec.Cmd, error) {
	var args []string

	// Init determines whether or not we are the init process of this
	// container. The init process always gets started with --persist.
	if init {
		args = append(args,
			"-r", c.Root,
			"--persist", c.RuntimePath)

		for _, mount := range c.Mounts {
			if mount.Target == "" {
				return nil, fmt.Errorf("mount entry must have a non-empty target")
			}
			if mount.Source == "" {
				mount.Source = "none"
			}
			if mount.Type == "" {
				mount.Type = "none"
			}
			mountArg := fmt.Sprintf("source=%s,target=%s,type=%s,%s",
				mount.Source,
				mount.Target,
				mount.Type,
				strings.Join(mount.Options, ","))
			args = append(args, "--mount", mountArg)
		}
	} else {
		args = append(args, "--share", c.RuntimePath)
	}
	args = append(args, "--workdir", "/", "--")
	args = append(args, c.Argv...)
	return exec.Command("bst", args...), nil
}
Let’s rebuild and test it once again:
Okay, things seem to be working, but we still have the dead namespace problem. The issue is that there’s just no way to recover from a dead PID namespace. Even if we tried to unshare a new PID namespace while keeping the persisted other namespaces, we would still end up in a bad state – for instance, we would have a mounted /proc that reflects the processes in the defunct PID namespace, which isn’t great.
Since the namespaces are pretty much in a bad state after init exits, we can take this as hint that we should just call bst-unpersist on the runtime directory after init exits, and let subsequent execs re-start the container.
In theory, this works. In practice, a simple Ctrl-C will not run this defer function, because the program just SIGINTs. Let’s support that through context cancellation instead:
We create a cancellable context that we cancel upon hitting a known terminating signal, then wait for bst to complete. Once done, if this was the init process, we go ahead and unpersist the runtime directory. This seems to work reasonably well:
At this point, we are done with our goal for this part. We have a small container execution system that works. Let’s try to benchmark it for posterity:
That’s fast! Here’s Docker for a (flawed) comparison:
Docker is one order of magnitude slower, but to be fair, it does a lot of things that our puny toy container system does not. One of the big features in particular is image management. We’ll see how we can address that while still trying to be fast in part 2. Stay tuned!
The full code for this series can be found on GitHub, and the exact state of the implementation at this point is available under the part-1 tag.
- [2] If you’re interested in the specific reasons:
First, it’s not possible to set up user namespaces in pure Go. The creation of a user namespace systematically fails with EINVAL if the program is threaded, and there is no way to unshare the user namespace before Go creates a thread. We could do it via cgo, but it’s still bad form, and makes things more complex than they should be.
Second, some operations during the setup of namespaces require special privileges, which means that the program has to have some file capabilities set. Unfortunately, Go is not very good at safely manipulating capability sets to raise our own effective capabilities, because of the way goroutines are scheduled.
Third, Go is still significantly slower than C, which is not where you want to be when designing a helper to set up namespacing. ↩
- [6] Note that $XDG_STATE_HOME is more of a de-facto standard introduced by Debian rather than something actually defined by the XDG base directory specification. However, its semantics are exactly the ones we want, as $XDG_CACHE_HOME represents cache that is safe for the user to blow away, and $XDG_DATA_HOME represents data files that should be version-controlled. ↩