
Cluster Service Didn’t Work After Switching to External ETCD #2503

Open
@prasadkris

Description

We are managing a handful of PostgreSQL clusters using the Zalando operator. Generally, it functions well 👍🏻 but we occasionally encounter problems during brief disruptions of the Kubernetes API service. These disruptions lead to adverse effects in the clusters, impacting our apps, and sometimes require a PostgreSQL cluster restart to fix. We see the ERROR: Error communicating with DCS message in the cluster logs when this happens. I believe this is because the operator uses the Kubernetes API as a Distributed Configuration Store (DCS) by default, which can cause problems during connectivity issues with the Kubernetes API service, as described in postgres-operator/issues/354 and postgres-operator/issues/1703.

To get rid of this, we decided to switch the operator to an external etcd and updated the PostgreSQL operator configuration with the etcd_host option:

  # etcd connection string for Patroni. Empty uses K8s-native DCS.
  etcd_host: "etcd.postgres-operator.svc.cluster.local"
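Before relying on this setting, it may be worth confirming that the etcd endpoint resolves and accepts connections from inside the cluster; Patroni's etcd client defaults to port 2379 when the connection string omits one. A minimal reachability sketch (the helper name is mine, and a plain TCP connect does not prove the etcd member itself is healthy):

```python
import socket


def etcd_reachable(host: str, port: int = 2379, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only checks network reachability; it does not verify that the
    process listening there is a healthy etcd member.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from a debug pod in the same namespace, `etcd_reachable("etcd.postgres-operator.svc.cluster.local")` returning False would point at a DNS or network problem rather than at the operator.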

This worked and triggered a rolling restart of the existing clusters, but our applications were unable to connect to the database service afterward 🙁

We noticed the following errors in the operator logs (the operator logs are available here), and upon investigation I found that the master service is missing its endpoint, even though there is a pod with the spilo-role=master label:

time="2023-12-19T09:30:33Z" level=warning msg="could not connect to Postgres database: dial tcp 10.111.90.85:5432: i/o timeout" cluster-name=test/ops-sentry-postgresql pkg=cluster
time="2023-12-19T09:30:48Z" level=warning msg="could not connect to Postgres database: dial tcp 10.111.90.85:5432: i/o timeout" cluster-name=test/ops-sentry-postgresql pkg=cluster
time="2023-12-19T09:31:03Z" level=warning msg="could not connect to Postgres database: dial tcp 10.111.90.85:5432: i/o timeout" cluster-name=test/ops-sentry-postgresql pkg=cluster
time="2023-12-19T09:31:18Z" level=warning msg="could not connect to Postgres database: dial tcp 10.111.90.85:5432: i/o timeout" cluster-name=test/ops-sentry-postgresql pkg=cluster
Name:              ops-sentry-postgresql
Namespace:         test
Labels:            application=spilo
                   cluster-name=ops-sentry-postgresql
                   spilo-role=master
                   team=ops-sentry
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.111.90.85
IPs:               10.111.90.85
Port:              postgresql  5432/TCP
TargetPort:        5432/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

kubectl get pods -l spilo-role=master
NAME                      READY   STATUS    RESTARTS   AGE
ops-sentry-postgresql-0   2/2     Running   0          46m
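To spot this mismatch across several clusters without eyeballing `kubectl describe` output, the JSON from `kubectl get endpoints <svc> -o json` can be fed into a small check. A sketch, under the assumption that an empty or absent `subsets` list is what shows up as `Endpoints: <none>` above (the function name is mine, not from the operator):

```python
import json


def has_no_ready_addresses(endpoints_json: str) -> bool:
    """Return True if an Endpoints object carries no ready addresses.

    Mirrors the 'Endpoints: <none>' line in `kubectl describe svc`:
    an Endpoints object with no subsets, or whose subsets have empty
    `addresses` lists, means nothing is backing the Service.
    """
    obj = json.loads(endpoints_json)
    subsets = obj.get("subsets") or []
    return not any(subset.get("addresses") for subset in subsets)
```

Piping `kubectl -n test get endpoints ops-sentry-postgresql -o json` into this check would flag the broken service shown above.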

The only ways we have found to fix this are either to roll back to the Kubernetes-native DCS, or to manually delete the database services and then restart the postgres-operator deployment, which recreates the database service with the correct endpoint IP pointing to the master pod:

kubectl -n test delete svc ops-sentry-postgresql
kubectl -n postgres-operator rollout restart deployment postgres-operator
  • Which image of the operator are you using?: registry.opensource.zalan.do/acid/postgres-operator:v1.10.0
  • Where do you run it: Bare Metal K8s
  • Are you running Postgres Operator in production?: Yes
  • Type of issue?: Bug report

Any help with this would be much appreciated. We could switch to external etcd and live with the workaround (manually deleting the database services and then restarting the postgres-operator deployment), but that involves downtime, which is not desirable. Please let me know if you need any further details. Thanks! 🙏🏻
