Notes about PID 1

As already mentioned in this topic, a process that runs with PID 1 in its own pid namespace gets special treatment for signals such as SIGINT and SIGTERM: unless it has installed a handler for them, they are ignored. This is precisely what happens when running a docker container, but it is not limited to that case.

For example, run this command in a shell as root:

# unshare --pid --fork --mount-proc sleep infinity

This runs a sleep infinity command in its own pid namespace. You can verify it by running the lsns command in another shell.

# lsns
        NS TYPE   NPROCS   PID USER COMMAND
4026532363 pid         1   292 root sleep infinity

If you try to send a SIGINT to this process (with Ctrl+C in the first shell, or with the kill -s SIGINT <PID> command in the second shell), it will have no effect. If you want to get rid of this process, you have to hard-kill it with the kill -s SIGKILL <PID> command in the second shell.

You can check that this process was running with PID 1 in its pid namespace by running the ps command the same way.

# unshare --pid --fork --mount-proc ps
  PID TTY          TIME CMD
    1 pts/0    00:00:00 ps
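
To see why the signal is dropped, you can check whether the process has installed a handler for it. The sketch below is only an illustration: it assumes a Linux host where /proc/<PID>/status exposes the SigCgt field (a hexadecimal mask of caught signals, lowest bit = signal 1), and the function names mask_has_signal and process_catches are made up for this example.

```bash
#!/bin/bash
# Hypothetical helpers, assuming Linux and the SigCgt field of /proc/<PID>/status.

# Succeeds if the hexadecimal signal mask $1 has the bit for signal number $2 set.
mask_has_signal() {
    local mask="$1" signum="$2"
    (( (16#$mask >> (signum - 1)) & 1 ))
}

# Succeeds if process $1 has installed a handler for signal number $2.
process_catches() {
    local pid="$1" signum="$2" key value
    while read -r key value; do
        if [[ $key == SigCgt: ]]; then
            mask_has_signal "$value" "$signum"
            return
        fi
    done < "/proc/$pid/status"
    return 1
}

# Pure example: mask 0x4002 has the bits for SIGINT (2) and SIGTERM (15) set.
mask_has_signal 0000000000004002 15 && echo "SIGTERM is caught"
```

From the second shell, process_catches <PID> 15 (with the host PID shown by lsns) should fail for sleep infinity, since sleep installs no SIGTERM handler.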

With docker containers

Essentially, you can observe the same thing.

# docker run -d --rm --name ubuntu ubuntu sleep infinity
d13fc1da3609407332c511f68d5b0513b31fa55df2e9b545044f53bfd0b2dc4b
# docker ps 
CONTAINER ID  IMAGE                            COMMAND         CREATED        STATUS        PORTS       NAMES
d13fc1da3609  docker.io/library/ubuntu:latest  sleep infinity  2 seconds ago  Up 3 seconds              ubuntu
# lsns
4026532384 pid         1  1062 root sleep infinity

Trying to kill the sleep infinity process with SIGHUP, SIGTERM and SIGKILL results in the same behaviour as previously explained (the first two are ignored, only SIGKILL terminates it), because this process is running with PID 1 in its own pid namespace.

# docker exec ubuntu ps x
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 sleep infinity
    2 ?        R      0:00 ps x

What does `docker stop` do?

Without any fancy option, it sends a SIGTERM to the process running with PID 1 in the container pid namespace. If the process is still running after a 10-second timeout, it sends a SIGKILL. This is why a container that runs a process that does not handle signals properly is slow to stop: the first signal is ignored, the second is not. Documentation: Docker stop docs

You can verify it with these commands:

# TIMEFORMAT="==> Execution time = %Rs"
# time docker stop ubuntu
ubuntu
==> Execution time = 10.518s
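
The SIGTERM, timeout, SIGKILL sequence can be mimicked on an ordinary process with a few lines of bash. This is only a sketch of the idea, not how Docker implements it: the function name stop_with_timeout, its return codes and the one-second polling interval are assumptions made for the example.

```bash
#!/bin/bash
# Hypothetical helper mimicking the stop sequence described above.
# Usage: stop_with_timeout <pid> [timeout_in_seconds]
# Returns 0 if the process went away after SIGTERM, 1 if SIGKILL was needed.
stop_with_timeout() {
    local pid="$1" timeout="${2:-10}" waited=0
    # Polite request first; if the process is already gone, there is nothing to do.
    kill -s TERM "$pid" 2>/dev/null || return 0
    # kill -0 sends no signal, it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if (( waited >= timeout )); then
            kill -s KILL "$pid" 2>/dev/null   # hard kill after the timeout
            return 1
        fi
        sleep 1
        (( waited += 1 ))
    done
    return 0
}
```

For instance, after sleep 30 &, calling stop_with_timeout "$!" 3 should return 0 quickly, whereas a process that ignores SIGTERM only goes away once the timeout has elapsed.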

How to handle signals properly?

The simplest way is to use the --init option when creating the container. It adds a binary built from the tini GitHub project to the newly created container, runs it with PID 1 in the container pid namespace, and has it run the container's command as a child process.

Running the same commands as before shows this:

# docker run --init -d --rm --name ubuntu ubuntu sleep infinity
27fc4026c264f48c8ee148796f77e7705411691845e4267467b5bc9f2aba609a
# docker exec ubuntu ps x
    PID TTY      STAT   TIME COMMAND
      1 ?        Ss     0:00 /sbin/docker-init -- sleep infinity
      7 ?        S      0:00 sleep infinity
      8 ?        Rs     0:00 ps x

A simple docker stop is now very quick, showing that the SIGTERM signal is handled by the docker-init process, which forwards it to its child and stops gracefully.

# time docker stop ubuntu
ubuntu
==> Execution time = 0.501s

What if I do not want to use the `docker run --init` option?

A dirty bash script for educational purposes

You want to make sure that your init process declares its own signal handlers. If you're planning to run a simple sleep infinity command in your container, you can wrap it in a bash script that runs the trap command first.

BUT when bash runs the sleep command in the foreground, it blocks until sleep finishes, and a trapped signal is only handled once that command has returned. Using exec sleep does not help either: the shell is replaced by the sleep binary, and the handlers set with trap are lost. As a consequence, the trap command becomes ineffective.
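
This deferral can be observed outside of any container with a small experiment. The sketch below is only illustrative (the function name and the 1s/3s timings are arbitrary choices): a subshell traps SIGTERM, gets the signal after about one second, but is busy waiting for a foreground sleep 3.

```bash
#!/bin/bash
# Illustrative experiment: bash defers a trap while it waits for a foreground command.
deferred_trap_demo() (
    handled_at="never"
    trap 'handled_at=$SECONDS' SIGTERM   # record when the handler actually runs
    me=$BASHPID                          # PID of this subshell
    SECONDS=0
    ( sleep 1; kill -s TERM "$me" ) &    # signal ourselves after about 1s
    sleep 3                              # foreground command: the trap has to wait
    echo "signal sent at about 1s, trap ran at ${handled_at}s"
)

result=$(deferred_trap_demo)
echo "$result"
```

The reported time should be around 3 seconds rather than 1, because the handler only runs once the foreground sleep has returned.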

A workaround could consist in using a waiting command that stays responsive to signals, like read, reading from a Unix pipe created with mkfifo and opened read/write. Note that you can open a file descriptor on this pipe file (and even delete the file afterwards!) so that read keeps working without polluting your container with unnecessary pipe files.

This is an example:

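#!/bin/bash

# Exit cleanly on SIGINT or SIGTERM.
trap "exit 0" SIGINT SIGTERM
# Create a named pipe in a private temporary directory.
tmpdir="$(mktemp -d)"
mkfifo "$tmpdir/pipe"
# Open the pipe read/write on fd 3, so that the open itself does not block.
exec 3<>"$tmpdir/pipe"
# The open descriptor keeps the pipe usable; the file is no longer needed.
rm -r "$tmpdir"
# Wait here until a signal arrives; nothing ever writes to the pipe.
read -u3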

Put this content in a scripts/run.sh file on your docker host and do not forget to chmod +x it.

And now, let's run the whole bunch of commands previously mentioned, using this script as the "init" program, with PID 1 in the container.

# docker run -d --rm -v "$PWD/scripts:/scripts" --name ubuntu ubuntu /scripts/run.sh
8d947443ae6eaf0093378ffb4480c3a67ea221ff240bab251d9f92c9216385f6
# docker ps 
CONTAINER ID  IMAGE                            COMMAND          CREATED        STATUS        PORTS       NAMES
8d947443ae6e  docker.io/library/ubuntu:latest  /scripts/run.sh  2 seconds ago  Up 3 seconds              ubuntu
# lsns
4026532384 pid         1  2551 root /bin/bash /scripts/run.sh
# TIMEFORMAT="==> Execution time = %Rs"
# time docker stop ubuntu
ubuntu
==> Execution time = 0.441s

Here's a quick docker stop, without the --init option, mimicking the sleep command with bash, with the necessary signal handling to stop without hard kill. :-)

Should I use this script in a production environment?

Short answer: not a good idea. It is the responsibility of the init process (with PID 1 in its pid namespace) on a Linux system to reap zombie processes, including orphaned processes that get re-parented to it. Of course, the minimalistic bash script above does not do this. More information about zombie processes: this link

You can spawn a zombie process that lasts 100 seconds by adding the (sleep 1 & exec sleep 101) & command before the read command in the previous bash script, and display it with docker exec ubuntu ps fx.
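
The same trick can be tried outside of a container, with shorter delays. The sketch below assumes a Linux host with /proc mounted; the helper name list_zombie_children is made up for this example, and it reads the state and parent PID fields of /proc/<PID>/stat instead of calling ps.

```bash
#!/bin/bash
# Hypothetical helper, assuming Linux /proc: print the PIDs of zombie children of process $1.
list_zombie_children() {
    local parent="$1" f stat rest state ppid
    for f in /proc/[0-9]*/stat; do
        # A process may vanish between the glob and the read: skip it silently.
        { stat=$(<"$f"); } 2>/dev/null || continue
        # Format is "pid (comm) state ppid ..."; drop everything up to the last ") ".
        rest=${stat##*) }
        read -r state ppid _ <<< "$rest"
        if [[ $state == Z && $ppid == "$parent" ]]; then
            echo "${f//[!0-9]/}"
        fi
    done
}

# Same construct as above with shorter delays: "sleep 1" exits after one second,
# but its parent has become "sleep 3", which never reaps it.
( sleep 1 & exec sleep 3 ) &
parent=$!
sleep 2
zombies=$(list_zombie_children "$parent")
echo "zombie children of $parent: ${zombies:-none}"
```

The zombie goes away once its parent exits and the system's init process reaps it, which is exactly the job the minimal script above does not do.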

Your init process in your container must handle signals properly and reap zombie processes. The --init option in the docker command line ensures that.
