
pod-0 unhealthy --> Restart all 3 pods --> 2 healthy pods are not being launched/started at all --> DB cluster remains "stuck"/down #2003

Open
@Samusername


Hi!
postgres-operator: v1.8.1
spilo: 2.1-p6
Patroni: 2.1.4
synchronous_mode: true
synchronous_mode_strict: true
We use "ConfigMap" setup instead of endpoints setup.
System: OpenShift
Are you running Postgres Operator in production? A: Yes, in production pipe.
Type of issue? Bug

pods:
rdbms-pg-cluster-0 Replica --- We broke this pod intentionally (by moving WAL files away, etc.).
rdbms-pg-cluster-1 Leader
rdbms-pg-cluster-2 Sync Standby

After we broke rdbms-pg-cluster-0:
If we delete all 3 pods, the DB cluster does not start up.
It ends up in the following state:

kubectl get pods -A | grep rdbms
namespace0 rdbms-pg-cluster-0 1/2 Running 0 43m
namespace0 rdbms-pg-operator-6c8b55d586-c28xp 1/1 Running 0 21h

So, the problem:

Only one pod is shown / launching.
Neither rdbms-pg-cluster-1 nor rdbms-pg-cluster-2 is even shown in the list of pods.
So High Availability does not seem to work in this kind of scenario.

Q: How should this be solved (or fixed)?

In postgres-operator or with pod_management_policy?

Reference:
pod_management_policy, ordered_ready (default), (or parallel).
https://opensource.zalando.com/postgres-operator/docs/reference/operator_parameters.html

We are hesitant to use "parallel" in pod_management_policy.
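
For context on why only one pod appears: as far as we understand the Kubernetes StatefulSet documentation, the OrderedReady policy creates pods one at a time in ordinal order and waits for each pod to be Running and Ready before creating the next one, while Parallel creates all of them without waiting. The toy model below is only a sketch of that rule, not operator or Kubernetes code; the function `pods_created` and the `becomes_ready` mapping are hypothetical names made up for illustration.

```python
def pods_created(policy, replicas, becomes_ready, name="rdbms-pg-cluster"):
    """Return the pod names a StatefulSet-like controller would create.

    policy        -- "ordered_ready" or "parallel" (operator parameter values)
    replicas      -- desired number of pods
    becomes_ready -- hypothetical mapping ordinal -> bool; False means the pod
                     never reports Ready. Missing ordinals default to True.
    """
    if policy not in ("ordered_ready", "parallel"):
        raise ValueError(f"unknown policy: {policy!r}")

    created = []
    for ordinal in range(replicas):
        created.append(f"{name}-{ordinal}")
        # With ordered_ready the next ordinal is only created once this
        # pod is Ready, so an unready pod blocks all later ordinals.
        if policy == "ordered_ready" and not becomes_ready.get(ordinal, True):
            break
    return created


# Assumption for this sketch: the broken pod-0 never becomes Ready.
broken_pod_0 = {0: False}
print(pods_created("ordered_ready", 3, broken_pod_0))
print(pods_created("parallel", 3, broken_pod_0))
```

Under this model, the state we see (only rdbms-pg-cluster-0 listed, at 1/2 Ready) matches ordered_ready with a pod-0 that never passes its readiness check. Whether parallel would be safe together with synchronous_mode_strict is a separate question that this sketch does not answer.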

Logs from postgres-operator:

time="2022-08-17T04:14:57Z" level=info msg="SYNC event has been queued" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T04:14:57Z" level=info msg="there are 1 clusters running" pkg=controller
time="2022-08-17T04:14:57Z" level=info msg="Creating the role binding "postgres-pod" in the "namespace0" namespace" pkg=controller
time="2022-08-17T04:14:57Z" level=warning msg="pods and/or Patroni may misfunction due to the lack of permissions: could not create role binding "postgres-pod" : cannot bind the pod service account "postgres-pod" defined in the configuration to the cluster role in the "namespace0" namespace: clusterroles.rbac.authorization.k8s.io "postgres-pod" not found" pkg=controller
time="2022-08-17T04:14:57Z" level=info msg="syncing of the cluster started" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T04:14:57Z" level=debug msg="team API is disabled" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=info msg="syncing secrets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=debug msg="syncing master service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=debug msg="syncing replica service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=debug msg="syncing volumes using "pvc" storage resize mode" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=info msg="volume claims do not require changes" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=debug msg="syncing statefulsets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:14:57Z" level=debug msg="making GET http request: http://10.129.x3.y3:8008/config" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="making GET http request: http://10.129.x3.y3:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="making GET http request: http://10.129.x4.y4:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="making GET http request: http://10.129.x5.y5:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="syncing pod disruption budgets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
W0817 04:15:09.577175 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
time="2022-08-17T04:15:09Z" level=debug msg="syncing roles" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="closing database connection" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="syncing databases" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="closing database connection" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="syncing prepared databases with schemas" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=debug msg="syncing connection pooler (master, replica) from (false, nil) to (false, nil)" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:15:09Z" level=info msg="cluster has been synced" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T04:44:57Z" level=info msg="SYNC event has been queued" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T04:44:57Z" level=info msg="there are 1 clusters running" pkg=controller
time="2022-08-17T04:44:57Z" level=info msg="Creating the role binding "postgres-pod" in the "namespace0" namespace" pkg=controller
time="2022-08-17T04:44:57Z" level=warning msg="pods and/or Patroni may misfunction due to the lack of permissions: could not create role binding "postgres-pod" : cannot bind the pod service account "postgres-pod" defined in the configuration to the cluster role in the "namespace0" namespace: clusterroles.rbac.authorization.k8s.io "postgres-pod" not found" pkg=controller
time="2022-08-17T04:44:57Z" level=info msg="syncing of the cluster started" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T04:44:57Z" level=debug msg="team API is disabled" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=info msg="syncing secrets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=debug msg="syncing master service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=debug msg="syncing replica service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=debug msg="syncing volumes using "pvc" storage resize mode" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=info msg="volume claims do not require changes" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=debug msg="syncing statefulsets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:44:57Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/config" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:09Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:11Z" level=debug msg="syncing pod disruption budgets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
W0817 04:45:11.038395 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
time="2022-08-17T04:45:11Z" level=debug msg="syncing roles" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:45:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:56Z" level=warning msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T04:46:56Z" level=error msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T05:14:57Z" level=info msg="SYNC event has been queued" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T05:14:57Z" level=info msg="there are 1 clusters running" pkg=controller
time="2022-08-17T05:14:57Z" level=info msg="Creating the role binding "postgres-pod" in the "namespace0" namespace" pkg=controller
time="2022-08-17T05:14:57Z" level=warning msg="pods and/or Patroni may misfunction due to the lack of permissions: could not create role binding "postgres-pod" : cannot bind the pod service account "postgres-pod" defined in the configuration to the cluster role in the "namespace0" namespace: clusterroles.rbac.authorization.k8s.io "postgres-pod" not found" pkg=controller
time="2022-08-17T05:14:57Z" level=info msg="syncing of the cluster started" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
time="2022-08-17T05:14:57Z" level=debug msg="team API is disabled" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=info msg="syncing secrets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=debug msg="syncing master service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=debug msg="syncing replica service" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=debug msg="syncing volumes using "pvc" storage resize mode" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=info msg="volume claims do not require changes" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=debug msg="syncing statefulsets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:14:57Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/config" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:09Z" level=debug msg="making GET http request: http://10.129.x2.y2:8008/patroni" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:11Z" level=debug msg="syncing pod disruption budgets" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
W0817 05:15:11.047404 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
time="2022-08-17T05:15:11Z" level=debug msg="syncing roles" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:15:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:41Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:56Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:56Z" level=warning msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0
time="2022-08-17T05:16:56Z" level=error msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=namespace0/rdbms-pg-cluster pkg=controller worker=0
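
Since the excerpt above is long and repetitive, a small script can condense it. The following is a minimal sketch, assuming the operator lines keep the `time=... level=... msg="..." [cluster-name=...] pkg=... [worker=...]` layout seen above; the regular expression and the `summarize` helper are our own hypothetical names, not part of postgres-operator. Lines in a different format (such as the `W0817 ...` deprecation warnings) are skipped.

```python
import re
from collections import Counter

# msg may contain unescaped inner quotes (e.g. "postgres-pod"), so match it
# greedily and anchor on the trailing key=value fields instead.
LINE_RE = re.compile(
    r'^time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>.*)"'
    r'(?: cluster-name=(?P<cluster>\S+))? pkg=(?P<pkg>\S+)'
    r'(?: worker=(?P<worker>\d+))?$'
)


def summarize(lines):
    """Count occurrences of each (level, message) pair in operator log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not in the expected format, e.g. klog-style warnings
        counts[(match["level"], match["msg"])] += 1
    return counts


# A few lines copied from the excerpt above.
SAMPLE = [
    'time="2022-08-17T04:44:57Z" level=info msg="there are 1 clusters running" pkg=controller',
    'time="2022-08-17T04:44:57Z" level=info msg="Creating the role binding "postgres-pod" in the "namespace0" namespace" pkg=controller',
    'W0817 04:45:11.038395 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget',
    'time="2022-08-17T04:45:11Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0',
    'time="2022-08-17T04:45:26Z" level=warning msg="could not connect to Postgres database: dial tcp 172.30.x.y:5432: connect: connection refused" cluster-name=namespace0/rdbms-pg-cluster pkg=cluster worker=0',
]

for (level, msg), n in summarize(SAMPLE).most_common():
    print(f"{n:3d}  {level:8s} {msg}")
```

Run against the full excerpt, this makes the pattern easier to see: each SYNC cycle after the pods were deleted repeats the same "connection refused" warning until the operator gives up after 8 retries.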

Does this have anything in common with the following older change?
#1765 Fixed: Rolling upgrade does not proceed anymore, if pod ends up in unhealthy state during the rolling upgrade.

===========================
I am not sure whether there are workarounds to fix the situation manually. We tried removing the PVC of the rdbms-pg-cluster-0 pod, among other things, but the other 2 pods still did not launch at all. The following text was also shown in the logs of the rdbms-pg-cluster-0 pod:
2022-08-22 06:03:21,105 INFO: waiting for leader to bootstrap
(Stays there.)

===========================
This seems like quite a serious problem currently. (Urgent to try to fix.)
