Debugging why ping was Broken in Docker Images
04 March 2016
Every once in a while you find a bug that just sucks you into a deep, dark hole of weird things you wish you never knew about. I recently saw a fairly innocent-looking bug report which led me down such a rabbit hole, and I thought I’d like to share the experience with you.
The Report
The bug in question was quite simple, and looked like an unusual bug (although similar bugs have been reported and fixed in Docker before). It essentially reads as follows:
If you try to use the openSUSE images, you’ll find that ping doesn’t work. You can reproduce this using the following steps: [insert steps]
The steps to reproduce it are fairly straightforward:
% docker run -it opensuse:13.2 sh
sh-4.2# ping -c1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.026 ms
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.026/0.026/0.026/0.000 ms
sh-4.2# useradd user
sh-4.2# su user
sh-4.2$ ping -c1 127.0.0.1
ping: icmp open socket: Operation not permitted
The first thing to check is whether this is openSUSE specific. I could reproduce this with Alpine Linux, Debian and a few other base images. Weirdly, Ubuntu didn’t appear to have this bug. But before we start tackling the actual bug, it’s time for a quick recap of the past 40 years of Unix history. A permission related error tells us that something is very wrong in the animal brain.
Unix Privilege Model
Most people agree that the Unix privilege model is a hold-over from an older time. Concepts like “binding to a lower port requires root” are warts of the original design of Unix. One of these historical warts is that the creation of raw sockets, which is how ping sends ICMP packets, requires root. Thus, in order to use ping you need to have the process run as root.
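If you want to poke at this restriction yourself, here is a minimal sketch in Python (my own illustration with a made-up function name, not anything ping itself does) that tries to open the kind of raw ICMP socket described above. Run it as root and as a normal user and compare the results.

```python
import socket


def can_open_raw_icmp_socket():
    """Return True if this process is allowed to create a raw ICMP socket."""
    try:
        # SOCK_RAW + IPPROTO_ICMP is the classic socket type used for ping.
        sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
    except OSError:
        # PermissionError (EPERM) is the usual failure for unprivileged users.
        # Any other OSError (e.g. an unsupported platform) is treated the same
        # way here, which is a simplification.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("raw ICMP socket allowed:", can_open_raw_icmp_socket())
```

As an unprivileged user with no extra capabilities, I would expect this to report that the socket is not allowed, which is the same failure ping hits.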
Now, I know what you’re thinking. “But I don’t need to be root in order to run ping, what’s going on here?” Hold up, Sparky. The history lesson isn’t over yet.
Very early in the development of the privilege model, it became clear that you sometimes needed to grant a user the right to execute, as root, code that the user couldn’t modify. passwd is a great example: you need root in order to read and modify /etc/shadow, but allowing the user to read and modify /etc/shadow directly would be a security vulnerability. Thus, several special bits in file modes were created (the setuid and setgid bits, allowing an executable file to be executed as the owner’s user and group respectively).
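As a rough illustration of those special bits, the sketch below uses Python’s standard stat module to pick the setuid and setgid bits out of a file mode. The helper name special_bits is hypothetical; the constants are the standard ones.

```python
import stat


def special_bits(mode):
    """Return the names of the set-uid/set-gid bits present in a file mode."""
    bits = []
    if mode & stat.S_ISUID:
        bits.append("setuid")
    if mode & stat.S_ISGID:
        bits.append("setgid")
    return bits


# 0o4755 is the mode behind an "-rwsr-xr-x" listing; 0o0755 is a plain executable.
print(special_bits(0o4755))                  # ['setuid']
print(special_bits(0o0755))                  # []
print(stat.filemode(stat.S_IFREG | 0o4755))  # -rwsr-xr-x
```

To check a real file, you could pass in os.stat(path).st_mode for something like the passwd binary.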
You probably knew all of the above; it’s Unix 101. However, the story doesn’t end there. Several different Unixes have moved on from the antiquated, binary root-or-nothing approach to privileges. On Linux, this mechanism is known as “capabilities” (there is a similar system with the same name which predates Linux’s on a few BSDs). Essentially, the concept of “UID 0” has been broken up into a bitwise mask of a set of “capabilities” that a process can have (and another mask that defines which of those capabilities can be inherited by children). Examples of capabilities include things like CAP_NET_BIND_SERVICE, which allows you to bind to low port numbers. The capability we’re interested in with ping is CAP_NET_RAW.
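To make the “bitwise mask” idea concrete, here is a small Python sketch that decodes such a mask into names. The bit numbers follow my reading of linux/capability.h, and the example mask 0xa80425fb is my own calculation of what you get by setting the bits for the 14 capabilities a default Docker container reports; treat both as assumptions rather than gospel. The function names are made up.

```python
# Capability names indexed by bit number, following <linux/capability.h>.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read",
]


def decode_caps(mask):
    """Turn a capability bitmask into a list of capability names."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def has_cap(mask, name):
    """Check whether the named capability's bit is set in the mask."""
    return bool(mask & (1 << CAP_NAMES.index(name)))


DOCKER_DEFAULT = 0xA80425FB  # assumed mask for a default container
print(decode_caps(DOCKER_DEFAULT))
print(has_cap(DOCKER_DEFAULT, "cap_net_raw"))    # True
print(has_cap(DOCKER_DEFAULT, "cap_sys_admin"))  # False
```

On a live system, the CapEff, CapPrm and CapInh lines in /proc/self/status hold masks in this format as hexadecimal strings.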
This was all just to bring you up to speed with all of the stuff I recalled when I took a look at this bug report. We’ll return to this history in a minute.
Where Did All the Capabilities Go?
When you start a Docker container, the “init” process has a certain capability set by default. This isn’t the full capability set, so root inside the container doesn’t have all the capabilities of real root. This explains why ping works before switching users with su: the shell has CAP_NET_RAW but su removes all of the capabilities.
So, that all sounds fairly fine. However, if you think about it for a moment, why does “su removing the capabilities” break ping? If you try to do the exact same set of steps on your host, you’ll find something like this:
# ping -c1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.031 ms
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.031/0.031/0.031/0.000 ms
# su user
% ping -c1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.033 ms
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.033/0.033/0.033/0.000 ms
So, there appears to be a discontinuity between how my host and my container are acting. Linux “containers” (scare quotes because the kernel doesn’t actually understand the concept of a “container”) are precisely identical to normal processes; the only real difference is which namespaces they execute within (which changes their perceived layout of the system). Since this was before Docker 1.10 was released, I’m not running with user namespaces so there should be no discontinuity regarding permissions.
Just to double-check my sanity (I was thinking it was a bug in su at this point for dropping capabilities it shouldn’t), I decided to run capsh on both my host and inside my container to compare what happens after su. Inside the container:
% docker run -it opensuse:13.2 sh
sh-4.2# zypper in libcap-progs
[...]
sh-4.2# capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
[...]
sh-4.2# su user
sh-4.2$ capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+i
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
[...]
Uhm, okay that’s a bit weird. The capabilities aren’t being dropped from the “inheritable” set (note that there’s a +eip at the end of the first output and +i at the end of the second). But they are being dropped from the “effective” (+e) and “permitted” (+p) sets. That’s basically what we expected: the capabilities are dropped when we do su. Now, if we try the same for our host, we should see something different:
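The +eip to +i change inside the container lines up with the rule that, as far as I understand it from capabilities(7), the kernel applies when a process gives up all of its root UIDs. The following Python model is a simplification under that assumption (it ignores the ambient set and the “keep capabilities” securebit), and the function name is made up.

```python
def caps_after_uid_change(old_uids, new_uids, permitted, effective, inheritable):
    """Model how capability sets change when a process changes its UIDs.

    old_uids and new_uids are (real, effective, saved) triples. This is a
    simplified reading of capabilities(7): SECBIT_KEEP_CAPS and the ambient
    set are deliberately left out.
    """
    had_root = 0 in old_uids
    no_root_left = all(uid != 0 for uid in new_uids)
    if had_root and no_root_left:
        # Dropping root entirely clears permitted and effective...
        permitted = 0
        effective = 0
    elif old_uids[1] == 0 and new_uids[1] != 0:
        # ...while only changing the effective UID clears just the effective set.
        effective = 0
    # The inheritable set is untouched, which is why "+i" survives su.
    return permitted, effective, inheritable


DOCKER_DEFAULT = 0xA80425FB  # assumed mask for the container's default set
after = caps_after_uid_change((0, 0, 0), (1000, 1000, 1000),
                              DOCKER_DEFAULT, DOCKER_DEFAULT, DOCKER_DEFAULT)
print([hex(mask) for mask in after])  # ['0x0', '0x0', '0xa80425fb']
```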
# capsh --print
Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
[...]
# su user
% capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
[...]
… wait, what? ping works even though the su’d user doesn’t have the right capabilities! And if you try the same check for your own user (just using a standard login shell), you’ll see that you never had the capabilities needed for ping to work:
cyphar@majora :: ~ % capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
[...]
At this point I’m thinking “what the hell is going on here?” Because this all seemed quite unexpected. A couple more bits of debugging caused me more and more confusion. For example, I tried copying my host’s ping binary into the container (maybe the code was different; I was grasping at straws here):
% container=$(docker run -dit opensuse:13.2 sh)
% docker cp $(which ping) $container:$(which ping)
% docker attach $container
/ # su user
/ $ ping -c1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.023 ms
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.023/0.023/0.023/0.000 ms
Whoa. Okay, so maybe the code is different? After running ldd, I noticed an interestingly named libcap.so (“cap” for “capability”, not “pcap”). I then tried to replace the version of libcap.so on the container with my host’s version:
% container=$(docker run -dit opensuse:13.2 sh)
% docker cp -L /lib64/libcap.so.2 $container:/lib64/libcap.so.2
% docker attach $container
Okay, so that doesn’t work. Maybe the code inside ping is different, but why would it be? I was very confused. After a few hours of Googling, I decided to take a look at how ping was packaged by openSUSE (you can find out by looking in the OBS). I noticed a very peculiar line:
%post
%set_permissions %{_bindir}/ping %{_bindir}/ping6
The %set_permissions is an rpm macro defined somewhere else. However, the name gives you a hint as to what it does: it “sets the permissions of the binary”. This doesn’t make sense, because in the %install section, they are using install which can set permissions for them. Something seemed fishy, and searching for %set_permissions gives you a page on the openSUSE wiki about rpm macros.
%set_permissions needs to be called in the %post script of packages that install files handled by /etc/permissions.*. The permissions package needs to be in PreReq so chkstat is guaranteed to be available at install time. The parameter is the name of the permissions config file of the package (usually identical to the package name).
Looking at the OBS package for permissions, and the source code, you find several permission profiles. After looking through a few of them, I found permissions.secure, which had the following lines:
#
# networking (need root for the privileged socket)
#
/usr/bin/ping root:root 0755
+capabilities cap_net_raw=ep
/usr/bin/ping6 root:root 0755
+capabilities cap_net_raw=ep
Okay, so you can … set capabilities on a file? How does that work? What the hell is going on?
Extended Attributes
Some time ago, a filesystem developer got bored and decided that classic (portable) filesystem mode bits are too mainstream. They then created “extended attributes”, which are a set of additional, non-portable file mode bits. This is traditionally used to make files “immutable” or “invisible” or other seemingly odd features.
However, when capabilities were being designed, someone noticed that the new concept of “sets of capabilities” didn’t map well to a single-bit flag in the file mode of an executable. They then decided to add “set capability flags” to extended attributes. Naturally, since this is a very strong Linux-ism and doesn’t even work on all modern filesystems Linux supports, this can present some problems.
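To see why a capability stored on the file matters, here is a rough Python model of the calculation the kernel does when a non-root process executes a binary, based on my reading of capabilities(7). It leaves out the ambient set and the special handling of root, and the function and parameter names are my own.

```python
def caps_after_exec(proc_inheritable, bounding, file_permitted=0,
                    file_inheritable=0, file_effective=False):
    """Simplified execve() capability transformation for a non-root process.

    Returns the new (permitted, effective) masks. Ambient capabilities and
    the special treatment of UID 0 are deliberately ignored.
    """
    permitted = (proc_inheritable & file_inheritable) | (file_permitted & bounding)
    effective = permitted if file_effective else 0
    return permitted, effective


CAP_NET_RAW = 1 << 13        # bit 13, per <linux/capability.h>
DOCKER_DEFAULT = 0xA80425FB  # assumed mask for the container's default set

# A ping binary with no file capabilities: the process's inheritable set is
# not enough on its own, so nothing ends up permitted.
print(caps_after_exec(DOCKER_DEFAULT, DOCKER_DEFAULT))  # (0, 0)

# A ping binary carrying cap_net_raw=ep: the file grants the capability.
print(caps_after_exec(DOCKER_DEFAULT, DOCKER_DEFAULT,
                      file_permitted=CAP_NET_RAW, file_effective=True))  # (8192, 8192)
```

The first call is the situation inside the broken container, and the second is what happens on a host where the packaged ping still has its file capabilities.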
I didn’t know this prior to finding this bug, so maybe you made this leap faster than me (in which case I apologise for dancing around the cause of the issue, but I love a good twisting and turning story). The weird thing is what I discovered later.
You can check (and set) the capabilities on a file using the following commands, if you want to follow along on this journey:
% getcap <file>
% setcap <caps> <file>
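If you are curious what getcap is actually reading, the sketch below decodes the raw bytes of the security.capability extended attribute. The layout (a little-endian magic word followed by permitted/inheritable pairs) is my understanding of the kernel’s vfs_cap_data structure, and the sample bytes are what I would expect cap_net_raw=ep to look like; both are assumptions, and the function name is hypothetical.

```python
import struct

VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001


def parse_file_caps(raw):
    """Decode the raw bytes of a security.capability extended attribute."""
    magic = struct.unpack_from("<I", raw, 0)[0]
    revision = (magic & VFS_CAP_REVISION_MASK) >> 24
    # Revision 1 stores one 32-bit (permitted, inheritable) pair; later
    # revisions store two pairs to cover capabilities above bit 31.
    pairs = 1 if revision == 1 else 2
    words = struct.unpack_from("<%dI" % (2 * pairs), raw, 4)
    permitted = 0
    inheritable = 0
    for i in range(pairs):
        permitted |= words[2 * i] << (32 * i)
        inheritable |= words[2 * i + 1] << (32 * i)
    return {
        "revision": revision,
        "effective": bool(magic & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": permitted,
        "inheritable": inheritable,
    }


# Assumed encoding of "cap_net_raw=ep": revision 2, effective flag set,
# bit 13 in the permitted mask.
sample = bytes.fromhex("01000002" "00200000" "00000000" "00000000" "00000000")
print(parse_file_caps(sample))
```

On Linux you could feed it os.getxattr("/usr/bin/ping", "security.capability"), which raises OSError when the attribute is missing, as it would be in the broken images.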
Who is Stripping the Damn Extended Attributes?
My first thought was that Docker was accidentally stripping these capability flags from its images. This was a worrying thought, because depending on where in the codebase the bug lay, it could cause issues from every Docker image being invalid all the way through to all running Docker containers being invalid. Luckily, this isn’t the case. Some rudimentary testing showed that Docker dealt with extended attributes perfectly fine when creating, loading, saving, spawning, pushing and pulling images. So clearly our issue is somewhere else.
I took a look at the raw image tar archives that we generate and automatically package (or push to the Docker Hub), and it turns out that the actual images are missing the extended attributes. The tar format supports extended attributes perfectly fine (and Docker does too, as I tested earlier). So it’s clearly a problem with what we’re using to generate the tar archives, which is a tool known as kiwi.
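For the curious, here is a small Python sketch of how an extended attribute can ride along inside a tar archive. As far as I know, GNU tar stores xattrs as pax extended header records with a SCHILY.xattr. prefix, so the example writes one by hand with the standard tarfile module and reads it back. The helper names and file contents are made up, and the toy only handles attribute values whose bytes are all ASCII.

```python
import io
import tarfile

# Assumed pax record name GNU tar uses for the security.capability xattr.
XATTR_KEY = "SCHILY.xattr.security.capability"


def make_archive(name, data, xattr_value=None):
    """Build an in-memory pax tar with one file and an optional xattr record."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        info.mode = 0o755
        if xattr_value is not None:
            # tarfile wants str values; fine here because the bytes are ASCII.
            info.pax_headers = {XATTR_KEY: xattr_value.decode("ascii")}
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_xattr(archive, name):
    """Return the stored capability xattr for a member, or None if absent."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        value = tar.getmember(name).pax_headers.get(XATTR_KEY)
    return None if value is None else value.encode("ascii")


caps = bytes.fromhex("0100000200200000000000000000000000000000")
with_xattr = make_archive("usr/bin/ping", b"fake binary", caps)
without_xattr = make_archive("usr/bin/ping", b"fake binary")
print(read_xattr(with_xattr, "usr/bin/ping") == caps)  # True
print(read_xattr(without_xattr, "usr/bin/ping"))       # None
```

The second archive is the interesting one: nothing about it looks broken, it just silently lacks the record.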
This is as far as I got in the day or two I’d been working on this issue. I then put it on the backburner (we had other things to worry about, which will be the topic of their own blog post). We figured that it probably wasn’t an issue with kiwi, and that it’s some issue with our packaging scripts. If it wasn’t for what happened next, I probably wouldn’t have ever made a blog post about this.
Kiwi
Then about two weeks ago, I was in Nürnberg for a team get-together. We’d gone out for a few beers with a few of the people at SUSE. I was talking to Richard Brown over a beer and we were swapping “horrible bug stories”. He mentioned there were some very, very unholy things about how kiwi does its packaging of virtual machines. I reckoned that it’s possible that there are equally unholy things going on with the packaging of Docker containers (because I know that the actual packages kiwi installs have the right set-capability bits set).
Fortunately, the way kiwi packages Docker images is much less crazy than the way it packages virtual machines. All you need to do to create a Docker image is to create a tar archive of a rootfs, and then use docker import. Essentially, the process is something like this:
- Create a directory for the rootfs image, bootstrap it and then install all of the packages specified in the kiwi configuration file.
- Use rsync to copy the rootfs directory to somewhere else. This is done because kiwi allows you to build many different formats of OS images (VMs, etc) from the same rootfs.
- Replace a bunch of files in the new rootfs directory that are specific to Docker and LXC.
- Use tar to create an <image>.tar.xz file.
Now, I know what you’re thinking. “Aha! They’re not using the right rsync flag to preserve extended attributes!” Well, actually they were using the right flag for the job (-X), as you can see in this function inside modules/KIWIContainerBuilder.pm:
#==========================================
# __copyUnpackedTreeContent
#------------------------------------------
sub __copyUnpackedTreeContent {
    # ...
    # Copy the unpacked image tree content to the given target directory
    # ---
    my $this = shift;
    my $targetDir = shift;
    my $cmdL = $this->{cmdL};
    my $kiwi = $this->{kiwi};
    my $locator = $this->{locator};
    $kiwi -> info('Copy unpacked image tree');
    my $origin = $cmdL -> getConfigDir();
    my $tar = $locator -> getExecPath('tar');
    my $cmd = "rsync -aHXA --one-file-system $origin/ $targetDir 2>&1";
    my $data = KIWIQX::qxx ($cmd);
    my $code = $? >> 8;
    if ($code != 0) {
        $kiwi -> failed();
        $kiwi -> error('Could not copy the unpacked image tree data');
        $kiwi -> failed();
        return;
    }
    $kiwi -> done();
    return 1;
}
Oh, didn’t I mention that it was written in Perl? Yes. It’s written in Perl (although there is another SUSE project that implements it in Python, and has many improvements to the original, but it is not what we currently use to generate Docker images for openSUSE and SLE). Anyway, that’s not where the problem lay. As it turns out, tar doesn’t preserve extended attributes by default. You have to use the flag --xattrs, which has been available since GNU tar 1.27 (2013).
So the diff ended up being quite small:
commit 419d55400edf800527b2cd4836e94190326bd10f
Author: Aleksa Sarai <asarai@suse.com>
Date: Fri Mar 4 16:42:05 2016 +1100
modules: KIWIContainerBuilder: preserve xattrs
tar doesn't preserve extended attributes by default, causing Docker
images to not have any correct set-capabilities bits set on binaries
such as ping. Fix this by adding the --xattrs flag to the tar command
run to generate the root filesystem image.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
diff --git a/modules/KIWIContainerBuilder.pm b/modules/KIWIContainerBuilder.pm
index 305ecf024da9..5672c870ef12 100644
--- a/modules/KIWIContainerBuilder.pm
+++ b/modules/KIWIContainerBuilder.pm
@@ -367,7 +367,7 @@ sub __createContainerBundle {
return;
}
my $data = KIWIQX::qxx (
- "$tar -C $origin -cJf $baseBuildDir/$imgFlName @dirlist 2>&1"
+ "$tar --xattrs -C $origin -cJf $baseBuildDir/$imgFlName @dirlist 2>&1"
);
my $code = $? >> 8;
if ($code != 0) {
Naturally there were some outstanding problems with the CI (such as it running on Ubuntu 12.04 which packages GNU tar 1.26, which is from 2011). All of those issues aside, this problem was finally fixed. The code was merged a few hours after I opened the pull request. The maintainer Marcus Schäfer also ported my fix to kiwi-ng. Phew.
Time to go grab a beer.
UPDATE: Since posting this blog post, I found out that you need to also apply an extra flag (--xattrs-include=*) which instructs GNU tar to include all of the extended attributes (including security.capability). This has also been fixed in KIWI.
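Putting both flags together, a tar invocation along these lines is what you end up wanting. The Python sketch below is only an illustration with hypothetical names and paths (it is not the KIWI code, which is Perl); it builds the command as an argument list so that no shell gets a chance to expand the * in --xattrs-include=*.

```python
def build_tar_command(tar, origin, output, dirlist):
    """Build an argv list for archiving a rootfs with extended attributes.

    --xattrs turns on xattr support, and --xattrs-include=* asks GNU tar to
    keep every attribute namespace, including security.capability.
    """
    return [
        tar,
        "--xattrs",
        "--xattrs-include=*",
        "-C", origin,
        "-cJf", output,
    ] + list(dirlist)


# Hypothetical paths, purely for illustration.
cmd = build_tar_command("tar", "/tmp/rootfs", "/tmp/image.tar.xz", ["bin", "etc", "usr"])
print(cmd)
```

You could hand the resulting list to subprocess.run(cmd, check=True); because no shell is involved, the pattern reaches tar untouched.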
Loose Ends
I can hear you shouting “but wait, why does Ubuntu work?” Well, imaginary reader, it’s all down to how Ubuntu packages ping. And yes, this applies to the Ubuntu that you have installed on your servers, desktop machines or the laptop you gave your mum. If you do a simple ls -la $(which ping), you’ll notice the following:
% ls -la $(which ping)
-rwsr-xr-x 1 root root 44168 May 7 2014 /bin/ping
I don’t know about you, but having ping be a set-uid binary definitely gives me the chills. Luckily, if you actually read the code (don’t worry, I’ve done it for you so you don’t have to), they do all of the right dropping of privileges. As long as there isn’t another set-uid vulnerability, this should be okay. So it’s not that bad, it was just a bit shocking to see that’s why Ubuntu images don’t suffer from this problem.
Anyway, that’s all folks!
Unless otherwise stated, all of the opinions in the above post are solely my own and do not necessarily represent the views of anyone else. This post is released under the Creative Commons BY-SA 4.0 license.