How to Reload Caddy Configuration Without Breaking All Your Sites

The creator of Caddy, Matt Holt, and I had a harsh exchange of words in the Caddy support community. This is why:

I asked him if there was a way to make Caddy not break all sites if one of the sites in the Caddyfile is misconfigured or if a deployment action (like retrieving a Let's Encrypt certificate) fails.

His reply was that Caddy does not break the other sites when one fails, as long as you update the Caddyfile the proper way.

It turns out the proper way is to update the Caddyfile live in production and then send the Caddy process a SIGUSR1. If you do that, the Caddy process will not terminate and will keep serving the rest of the configured sites even if one fails. Matt's really upset that people keep saying that a misconfigured site will break all sites (and he will hate the title of this post), when that isn't true.

I tried arguing that the proper way is not easily applicable to an orchestrated Docker deployment, where updating a container actually means killing the old container and its processes and starting a completely new one. On startup, Caddy will not serve any sites if one fails. So updating the Caddy configuration through a Docker deployment is doing it the improper way. In a lot of setups it's not really practical to manipulate the Caddy process directly. A secure setup would not give everyone who is allowed to update the Caddyfile full terminal access to the Docker container or the Docker host.

Workaround

What I do now is nowhere near ideal, but at least it really minimizes risk of downtime:

SSH into the Caddy container
Edit the Caddyfile inside the container (installing vi if needed)
Reload the Caddy config: kill -SIGUSR1 <caddypid>
If everything works, update the Caddyfile in my repo
Commit and redeploy the usual way.
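
The reload step above can be wrapped in a small helper so a typo in the PID doesn't signal the wrong thing. This is only a sketch of how I might script it, not anything Caddy ships: reload_caddy is a name I made up, and the pgrep example assumes the process is called "caddy" and that pgrep is installed in the container.

```shell
#!/bin/sh
# Sketch: ask a running Caddy process to reload its Caddyfile via SIGUSR1.
# reload_caddy is a hypothetical helper name; the caller supplies the PID.

reload_caddy() {
    pid="$1"

    # Refuse to continue without a PID argument.
    if [ -z "$pid" ]; then
        echo "usage: reload_caddy <caddypid>" >&2
        return 2
    fi

    # kill -0 sends no signal; it only checks that the process exists
    # and that we are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no signalable process with PID $pid" >&2
        return 1
    fi

    # SIGUSR1 is the reload signal used in the steps above.
    kill -s USR1 "$pid"
}

# Example (assumption: pgrep is available and the process is named "caddy"):
#   reload_caddy "$(pgrep -o -x caddy)"
```

The helper only sends the signal. Whether the new configuration was actually accepted still has to be checked by hand, for example by looking at the Caddy log and requesting the affected sites.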

The deployment can still fail, but having tried the changes first live makes the risk a lot smaller.

I would really like a startup option that would allow Caddy to serve all properly configured sites and ignore any failing ones. The reason there is no such option is apparently that "an admin should really fix broken site configurations right away". While I agree with that, I'm not sure an admin will work better under the stress of 100 sites being offline rather than one.

Caveat

For the record: I love Caddy, and I'm really thankful to Matt for having made it. It has revolutionized encryption for websites, making it freely available to a broad public by combining a simple-to-use proxy with automatic support for Let's Encrypt.