I have a relatively busy host system that has a few OmniOS VMs. When the host is booting, sometimes the guests (which are a little oversubscribed) compete for CPU and disk I/O resources as they're getting out of bed. The PostgreSQL server service frequently fails to start under these conditions, because:
- the start method invokes pg_ctl with the -w (--wait) flag, causing it to wait for the server to be ready to accept SQL connections, which it may not do for some time if it needs to replay the WAL and there was a lot of activity since the last checkpoint
- the SMF start method has a timeout of 60 seconds, which is far too short
Sadly, once we hit the timeout we kill the pg_ctl process, but that apparently doesn't necessarily immediately result in the postgres process being killed. This is probably compounded by a long-standing but as-yet unfixed SMF bug, 13091 process contract escaped SMF, where the restarter does not always completely clean up the entire contract of a failed method prior to proceeding with more actions, such as kicking off another instance of the start method. This can result in hitting the maintenance state relatively reliably after a single timeout failure like this, e.g.,
[ Aug 14 01:21:34 Executing start method ("/opt/ooce/pgsql-15/bin/pg_ctl -D /var/opt/ooce/pgsql/pgsql-15 -w start"). ]
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start...........................2024-08-14 01:21:59.831 UTC [406] LOG: starting PostgreSQL 15.7 on x86_64-pc-solaris2.11, compiled by gcc (OmniOS 151046/12.2.0-il-0) 12.2.0, 64-bit
...............................2024-08-14 01:22:30.844 UTC [406] LOG: listening on IPv6 address "::1", port 5432
2024-08-14 01:22:30.844 UTC [406] LOG: listening on IPv4 address "127.0.0.1", port 5432
....[ Aug 14 01:22:34 Method or service exit timed out. Killing contract 59. ]
[ Aug 14 01:22:34 Method "start" failed due to signal KILL. ]
[ Aug 14 01:22:34 Executing start method ("/opt/ooce/pgsql-15/bin/pg_ctl -D /var/opt/ooce/pgsql/pgsql-15 -w start"). ]
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2024-08-14 01:22:34.625 UTC [413] FATAL: lock file "postmaster.pid" already exists
2024-08-14 01:22:34.625 UTC [413] HINT: Is another postmaster (PID 406) running in data directory "/var/opt/ooce/pgsql/pgsql-15"?
stopped waiting
pg_ctl: could not start server
Examine the log output.
[ Aug 14 01:22:34 Method "start" exited with status 1. ]
[ Aug 14 01:22:34 Executing start method ("/opt/ooce/pgsql-15/bin/pg_ctl -D /var/opt/ooce/pgsql/pgsql-15 -w start"). ]
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2024-08-14 01:22:34.836 UTC [418] FATAL: lock file "postmaster.pid" already exists
2024-08-14 01:22:34.836 UTC [418] HINT: Is another postmaster (PID 406) running in data directory "/var/opt/ooce/pgsql/pgsql-15"?
stopped waiting
pg_ctl: could not start server
Examine the log output.
[ Aug 14 01:22:34 Method "start" exited with status 1. ]
I think it's important to note that PostgreSQL can, by design, take a totally arbitrary and often surprisingly long time between starting the database and being ready to accept SQL connections due to WAL recovery. I don't believe it is appropriate for the SMF method to wait for this; we should be using pg_ctl --no-wait to start the database. We should also increase the timeout to something more like 300 or even 600 seconds. If the database gets going and then something catastrophic occurs, it will almost certainly exit and the empty contract will cause SMF to note a fault and try to restart it anyway.
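As a rough sketch of what I mean (not the packaged method script), the start method could reduce to something like the following. The paths are taken from the log above; the `PG_CTL`, `SVCCFG`, and `SVCADM` variables and both function names are my own illustrative choices, and the FMRI is left as a parameter because I have not checked what the package actually ships.

```sh
#!/bin/sh
# Hypothetical sketch only; function and variable names are assumptions.
PG_CTL="${PG_CTL:-/opt/ooce/pgsql-15/bin/pg_ctl}"
PGDATA="${PGDATA:-/var/opt/ooce/pgsql/pgsql-15}"
SVCCFG="${SVCCFG:-/usr/sbin/svccfg}"
SVCADM="${SVCADM:-/usr/sbin/svcadm}"

# Launch the postmaster and return immediately. -W (--no-wait) tells
# pg_ctl not to wait for WAL recovery to finish before exiting.
start_postgres() {
    "$PG_CTL" -D "$PGDATA" -W start
}

# Raise the start method timeout on an existing instance.
#   $1: service instance FMRI (look it up with svcs; not hard-coded here)
#   $2: timeout in seconds, e.g. 300 or 600
set_start_timeout() {
    "$SVCCFG" -s "$1" setprop start/timeout_seconds = count: "$2" &&
        "$SVCADM" refresh "$1"
}
```

With this, the method returns as soon as the postmaster has been launched, and SMF only has to track whether the process contract is still populated. The longer timeout is then just a safety margin rather than something recovery has to race against.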
Monitoring the active/healthy state of a PostgreSQL instance (which may, for instance, be part of a cluster of instances anyway) has to be something that higher level site-specific software does outside of the context of the lower-level process supervision that SMF provides.
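For illustration, that kind of higher-level readiness check could be a small polling loop around `pg_isready`, run by whatever site-specific tooling cares about it, not by the SMF method. This is a minimal sketch: the `pg_isready` path is assumed to sit next to `pg_ctl`, the host and port are the ones in the log above, and the function name and arguments are made up.

```sh
#!/bin/sh
# Hypothetical sketch only; the pg_isready location is an assumption.
PG_ISREADY="${PG_ISREADY:-/opt/ooce/pgsql-15/bin/pg_isready}"

# Poll until the server accepts connections or we run out of attempts.
#   $1: maximum number of attempts
#   $2: seconds to sleep between attempts
# Returns 0 once pg_isready reports the server is accepting connections,
# 1 if every attempt failed (for example, still replaying the WAL).
wait_until_ready() {
    max_tries="$1"
    interval="$2"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        if "$PG_ISREADY" -q -h 127.0.0.1 -p 5432; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    return 1
}
```

Something like `wait_until_ready 120 5` would then allow up to ten minutes for recovery without involving the restarter at all.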