nixos/postgresql: implement auto-restart & rework dependencies of postgresql.target

On my employer's NixOS-based platform, PostgreSQL is configured with
`Restart=always`, which unfortunately was never upstreamed.

This, however, revealed an interesting problem with bi-directional
BindsTo: when killing `postgresql.service`, sometimes both the service &
target start back up and sometimes they don't. According to an upstream
bug report[1] this is a known problem, because you have two conflicting
operations scheduled in a single transaction, namely

* When (auto-)restarting, a restart job for all units bound to the
  restarting unit is immediately scheduled[2].

* Due to the `BindsTo` relationship, a stop-job for `postgresql.target`
  is scheduled immediately by the manager loop[3]. This is caused by the
  `UNIT_ATOM_CANNOT_BE_ACTIVE_WITHOUT` "atom" which is ONLY set for a
  BindsTo relationship[4].

  When this is processed first, the restart is inhibited:

      Jul 12 13:25:51 nixos systemd[1]: postgresql.service: Main process exited, code=killed, status=9/KILL
      Jul 12 13:25:51 nixos systemd[1]: postgresql.service: Changed running -> stop-sigterm
      Jul 12 13:25:51 nixos systemd[1]: postgresql.target: Trying to enqueue job postgresql.target/stop/replace
      Jul 12 13:25:51 nixos systemd[1]: postgresql.service: Installed new job postgresql.service/stop as 80053
      Jul 12 13:25:51 nixos systemd[1]: postgresql.target: Installed new job postgresql.target/stop as 80052
      Jul 12 13:25:51 nixos systemd[1]: postgresql.target: Enqueued job postgresql.target/stop as 80052
      [...]
      Jul 12 13:25:51 nixos systemd[1]: postgresql.service: Service restart not allowed.
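
For reference, the mutual binding that triggers this race can be sketched as
the NixOS module fragment below. It is reduced to the dependency options
discussed here; `Restart = "always"` stands in for the downstream setting
mentioned at the top, and all other unit options are omitted.

```nix
{
  # The target binds to the service(s) ...
  systemd.targets.postgresql.bindsTo = [
    "postgresql.service"
    "postgresql-setup.service"
  ];

  systemd.services.postgresql = {
    # ... and the service binds back to the target. Together with
    # auto-restart, this is the racy combination described above.
    bindsTo = [ "postgresql.target" ];
    serviceConfig.Restart = "always";
  };
}
```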

It's subtle and non-obvious from the man page, but units are stopped
differently when using `PartOf=` or `Requires=`: these don't have the
`UNIT_ATOM_CANNOT_BE_ACTIVE_WITHOUT` atom. Instead, the stop/start of the
target is scheduled AFTER the stop-job of postgresql.service, which
is turned into a start-job because of Restart=always:

    Jul 12 13:33:00 nixos systemd[1]: postgresql.service: Main process exited, code=killed, status=9/KILL
    [...]
    Jul 12 13:33:00 nixos systemd[1]: postgresql.service: Failed with result 'signal'.
    Jul 12 13:33:00 nixos systemd[1]: postgresql.service: Service will restart (restart setting)
    [...]
    Jul 12 13:33:00 nixos systemd[1]: postgresql.target: Installed new job postgresql.target/restart as 80996
    Jul 12 13:33:00 nixos systemd[1]: postgresql.service: Installed new job postgresql.service/restart as 80907
    [...]
    Jul 12 13:33:00 nixos systemd[1]: postgresql.service: Scheduled restart job, restart counter is at 1.
    [...]
    Jul 12 13:33:00 nixos systemd[1]: Stopped target postgresql.target.
    Jul 12 13:33:00 nixos systemd[1]: postgresql.target: Converting job postgresql.target/restart -> postgresql.target/start
    Jul 12 13:33:00 nixos systemd[1]: Stopping postgresql.target...
    [...]
    Jul 12 13:33:00 nixos systemd[1]: Stopped postgresql.service.
    Jul 12 13:33:00 nixos systemd[1]: postgresql.service: Converting job postgresql.service/restart -> postgresql.service/start
    [...]
    Jul 12 13:33:00 nixos systemd[1]: postgresql.service: Changed dead -> running
    Jul 12 13:33:00 nixos systemd[1]: postgresql.service: Job 80907 postgresql.service/start finished, result=done
    Jul 12 13:33:00 nixos systemd[1]: Started postgresql.service.
    Jul 12 13:33:00 nixos systemd[1]: postgresql.target: Changed dead -> active
    [...]
    Jul 12 13:33:00 nixos systemd[1]: Reached target postgresql.target.

Do note that the stop job (including the restart) of postgresql.service
is fully processed here before dealing with PartOf/ConsistsOf
relationships.
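
The resulting dependency layout can be sketched as the following module
fragment. It only mirrors the options touched by this change; the exact
`Restart` value and any related settings (e.g. restart delays) are assumptions
here and may be set differently in the actual module.

```nix
{
  # The target pulls in the service(s) via Requires= instead of BindsTo=.
  systemd.targets.postgresql.requires = [
    "postgresql.service"
    "postgresql-setup.service"
  ];

  systemd.services.postgresql = {
    # Starting the service also starts the target ...
    wants = [ "postgresql.target" ];
    # ... and stop/restart of the target propagates to the service.
    partOf = [ "postgresql.target" ];
    # Assumed value, taken from the description above.
    serviceConfig.Restart = "always";
  };
}
```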

I tested this against the following cases:

    | Unit               | Action       | Propagates to      |
    | ------------------ | ------------ | ------------------ |
    | postgresql.target  | restart      | postgresql.service |
    | postgresql.target  | start        | postgresql.service |
    | postgresql.target  | stop         | postgresql.service |
    | postgresql.service | start        | postgresql.target  |
    | postgresql.service | restart      | postgresql.target  |
    | postgresql.service | stop         | postgresql.target  |
    | postgresql.service | auto-restart | postgresql.target  |
    | postgresql.service | failure      | postgresql.target  |
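
The auto-restart row could be exercised with a NixOS VM test roughly like the
sketch below. This is a hypothetical test, not the one shipped in nixpkgs: the
test name is made up, and it assumes a nixpkgs checkout that already contains
this change, so that `postgresql.service` is restarted automatically.

```nix
# Hypothetical test definition, e.g. as the argument to pkgs.testers.runNixOSTest.
{
  name = "postgresql-auto-restart-sketch";

  nodes.machine = {
    services.postgresql.enable = true;
  };

  testScript = ''
    machine.wait_for_unit("postgresql.target")

    # Remember the current main PID so the restart can be observed.
    old_pid = machine.succeed("systemctl show -p MainPID --value postgresql.service").strip()

    # Simulate a crash of the service.
    machine.succeed("systemctl kill --signal=KILL postgresql.service")

    # Wait until the old main process is gone, then expect both units to come back.
    machine.wait_until_succeeds(f'test "$(systemctl show -p MainPID --value postgresql.service)" != "{old_pid}"')
    machine.wait_for_unit("postgresql.service")
    machine.wait_for_unit("postgresql.target")
  '';
}
```

If the restart were inhibited as in the first log excerpt, the service would
remain stopped and the later `wait_for_unit` calls would be expected to fail.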

[1] e.g. systemd issue 8374
[2] https://github.com/systemd/systemd/blob/v256-stable/src/core/service.c#L2535-L2542
[3] https://github.com/systemd/systemd/blob/v256-stable/src/core/manager.c#L1611-L1626
[4] https://github.com/systemd/systemd/blob/v256-stable/src/core/unit-dependency-atom.c#L30-L35
Maximilian Bosch
2025-07-26 19:09:48 +02:00
parent f63f8b2373
commit 03d0fed6f8


@@ -769,7 +769,7 @@ in
systemd.targets.postgresql = {
description = "PostgreSQL";
wantedBy = [ "multi-user.target" ];
-bindsTo = [
+requires = [
"postgresql.service"
"postgresql-setup.service"
];
@@ -780,8 +780,13 @@ in
after = [ "network.target" ];
-# To trigger the .target also on "systemctl start postgresql".
-bindsTo = [ "postgresql.target" ];
+# To trigger the .target also on "systemctl start postgresql" as well as on
+# restarts & stops.
+# Please note that postgresql.service & postgresql.target binding to
+# each other makes the Restart=always rule racy and results
+# in sometimes the service not being restarted.
+wants = [ "postgresql.target" ];
+partOf = [ "postgresql.target" ];
environment.PGDATA = cfg.dataDir;