Description
Hi!
postgres-operator: v1.8.1
spilo: 2.1-p6
Patroni: 2.1.4
synchronous_mode: true
synchronous_mode_strict: true
We use "ConfigMap" setup instead of endpoints setup.
System: OpenShift
Are you running Postgres Operator in production? A: Yes, in the production pipeline.
Type of issue? Bug
pods:
rdbms-pg-cluster-0 Replica (we broke this pod intentionally, by moving WAL files away etc.)
rdbms-pg-cluster-1 Leader
rdbms-pg-cluster-2 Sync Standby
After breaking rdbms-pg-cluster-0, we deleted all 3 pods. The DB cluster then does not start up.
It ends up in the following state:
kubectl get pods -A | grep rdbms
namespace0 rdbms-pg-cluster-0 1/2 Running 0 43m
namespace0 rdbms-pg-operator-6c8b55d586-c28xp 1/1 Running 0 21h
So, the problem:
Only one pod is shown / launching.
Neither rdbms-pg-cluster-1 nor rdbms-pg-cluster-2 appears in the list of pods at all.
So it seems that High Availability does not work in this kind of scenario.
Q: How should this be solved (or fixed)?
In postgres-operator, or with pod_management_policy?
Reference:
pod_management_policy: ordered_ready (default) or parallel.
https://opensource.zalando.com/postgres-operator/docs/reference/operator_parameters.html
We are hesitant to use "parallel" in pod_management_policy.
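For illustration, here is a toy Python model of how the two policies might differ in the scenario above. It assumes the standard Kubernetes StatefulSet rule that, with OrderedReady, a pod is only created once all lower-ordinal pods are Running and Ready. It also assumes that pods 1 and 2 would become Ready if they were started. The function name and the readiness map are hypothetical; this is a sketch, not operator or Kubernetes code.

```python
from typing import Dict, List


def pods_created(policy: str, replicas: int, can_become_ready: Dict[int, bool]) -> List[int]:
    """Return the pod ordinals a StatefulSet controller would create (toy model).

    can_become_ready[i] is an assumption about whether pod i ever reaches Ready.
    """
    if policy not in ("ordered_ready", "parallel"):
        raise ValueError(f"unknown policy: {policy}")
    created: List[int] = []
    for ordinal in range(replicas):
        created.append(ordinal)
        # With ordered_ready, the next pod is not created until this one is Ready.
        if policy == "ordered_ready" and not can_become_ready.get(ordinal, False):
            break
    return created


# Scenario from this report: pod 0 is the broken replica and stays at 1/2 (not Ready).
readiness = {0: False, 1: True, 2: True}
print(pods_created("ordered_ready", 3, readiness))  # [0]
print(pods_created("parallel", 3, readiness))       # [0, 1, 2]
```

In this model, ordered_ready leaves the cluster with only rdbms-pg-cluster-0, which matches the pod list we see, while parallel would at least let the former leader and sync standby start.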
Logs from postgres-operator:
time="2022-08-17T04:14:57Z" level=info msg="SYNC event has been queued" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T04:14:57Z" level=info msg="there are 1 clusters running" pkg=controller
time="2022-08-17T04:14:57Z" level=info msg="Creating the role binding "postgres-pod" in the "namespace0" namespace" pkg=controller
time="2022-08-17T04:14:57Z" level=warning msg="pods and/or Patroni may misfunction due to the lack of permissions: could not create role binding "postgres-pod" : cannot bind the pod service account "postgres-pod" defined in the configuration to the cluster role in the "namespace0" namespace: clusterroles.rbac.authorization.k8s.io "postgres-pod" not found" pkg=controller
time="2022-08-17T04:14:57Z" level=info msg="syncing of the cluster started" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T04:14:57Z" level=debug msg="team API is disabled" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=info msg="syncing secrets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=debug msg="syncing master service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=debug msg="syncing replica service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=debug msg="syncing volumes using "pvc" storage resize mode" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=info msg="volume claims do not require changes" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=debug msg="syncing statefulsets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=debug msg="making GET http request: http://10.129.x3.y3:8008/config" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="making GET http request: http://10.129.x3.y3:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="making GET http request: http://10.129.x4.y4:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="making GET http request: http://10.129.x5.y5:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="syncing pod disruption budgets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
W0817 04:15:09.577175 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
time="2022-08-17T04:15:09Z" level=debug msg="syncing roles" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="closing database connection" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="syncing databases" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="closing database connection" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="syncing prepared databases with schemas" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="syncing connection pooler (master, replica) from (false, nil) to (false, nil)" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=info msg="cluster has been synced" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T04:44:57Z" level=info msg="SYNC event has been queued" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T04:44:57Z" level=info msg="there are 1 clusters running" pkg=controller
time="2022-08-17T04:44:57Z" level=info msg="Creating the role binding "postgres-pod" in the "namespace0" namespace" pkg=controller
time="2022-08-17T04:44:57Z" level=warning msg="pods and/or Patroni may misfunction due to the lack of permissions: could not create role binding "postgres-pod" : cannot bind the pod service account "postgres-pod" defined in the configuration to the cluster role in the "namespace0" namespace: clusterroles.rbac.authorization.k8s.io "postgres-pod" not found" pkg=controller
time="2022-08-17T04:44:57Z" level=info msg="syncing of the cluster started" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T04:44:57Z" level=debug msg="team API is disabled" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=info msg="syncing secrets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=debug msg="syncing master service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=debug msg="syncing replica service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=debug msg="syncing volumes using "pvc" storage resize mode" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=info msg="volume claims do not require changes" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=debug msg="syncing statefulsets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/config" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:09Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:11Z" level=debug msg="syncing pod disruption budgets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
W0817 04:45:11.038395 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
time="2022-08-17T04:45:11Z" level=debug msg="syncing roles" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:56Z" level=warning msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:56Z" level=error msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T05:14:57Z" level=info msg="SYNC event has been queued" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T05:14:57Z" level=info msg="there are 1 clusters running" pkg=controller
time="2022-08-17T05:14:57Z" level=info msg="Creating the role binding "postgres-pod" in the "namespace0" namespace" pkg=controller
time="2022-08-17T05:14:57Z" level=warning msg="pods and/or Patroni may misfunction due to the lack of permissions: could not create role binding "postgres-pod" : cannot bind the pod service account "postgres-pod" defined in the configuration to the cluster role in the "namespace0" namespace: clusterroles.rbac.authorization.k8s.io "postgres-pod" not found" pkg=controller
time="2022-08-17T05:14:57Z" level=info msg="syncing of the cluster started" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T05:14:57Z" level=debug msg="team API is disabled" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=info msg="syncing secrets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=debug msg="syncing master service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=debug msg="syncing replica service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=debug msg="syncing volumes using "pvc" storage resize mode" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=info msg="volume claims do not require changes" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=debug msg="syncing statefulsets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/config" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:09Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:11Z" level=debug msg="syncing pod disruption budgets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
W0817 05:15:11.047404 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
time="2022-08-17T05:15:11Z" level=debug msg="syncing roles" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:56Z" level=warning msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:56Z" level=error msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
Does this have anything in common with the following older change?
#1765 Fixed: Rolling upgrade does not proceed anymore, if pod ends up in unhealthy state during the rolling upgrade.
===========================
I am not sure whether there are workarounds (WAs) to fix the situation manually. Among other things, we tried removing the PVC of the rdbms-pg-cluster-0 pod, but the other 2 pods still did not launch at all. The following text was also shown in the logs of the rdbms-pg-cluster-0 pod:
2022-08-22 06:03:21,105 INFO: waiting for leader to bootstrap
(The pod stays in this state.)
===========================
This seems like quite a serious problem at the moment, and a fix is urgent for us.