
Cluster Service Didn’t Work After Switching to External ETCD #2503

Open
@prasadkris

Description

We are managing a handful of PostgreSQL clusters using the Zalando operator. Generally, it functions well 👍🏻 but we occasionally encounter problems during brief disruptions of the Kubernetes API service. These disruptions lead to adverse effects in the clusters, impacting our apps, and sometimes require a PostgreSQL cluster restart to fix. We see the ERROR: Error communicating with DCS message in the cluster logs when this happens. I believe this is because the operator uses the Kubernetes API as a Distributed Configuration Store (DCS) by default, which can cause problems during connectivity issues with the Kubernetes API service, as described in postgres-operator/issues/354 and postgres-operator/issues/1703.

To get rid of this, we decided to switch the operator to an external etcd and updated the PostgreSQL operator configuration with the etcd_host option:

  # etcd connection string for Patroni. Empty uses K8s-native DCS.
  etcd_host: "etcd.postgres-operator.svc.cluster.local"
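Before relying on this setting, it may be worth confirming that the etcd endpoint resolves and accepts connections from inside the cluster; Patroni's etcd client defaults to port 2379 when the connection string omits one. A minimal reachability sketch (the helper name is mine, and a plain TCP connect does not prove the etcd member itself is healthy):

```python
import socket


def etcd_reachable(host: str, port: int = 2379, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only checks network reachability; it does not verify that the
    process listening there is a healthy etcd member.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from a debug pod in the same namespace, `etcd_reachable("etcd.postgres-operator.svc.cluster.local")` returning False would point at a DNS or network problem rather than at the operator.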

This worked and triggered a rolling restart of the existing clusters, but our applications were unable to connect to the database service afterward 🙁

We noticed the following errors in the operator logs (the operator logs are available here), and upon investigation I found that the master service is missing its endpoint, even though there is a pod with the spilo-role=master label:

time="2023-12-19T09:30:33Z" level=warning msg="could not connect to Postgres database: dial tcp 10.111.90.85:5432: i/o timeout" cluster-name=test/ops-sentry-postgresql pkg=cluster
time="2023-12-19T09:30:48Z" level=warning msg="could not connect to Postgres database: dial tcp 10.111.90.85:5432: i/o timeout" cluster-name=test/ops-sentry-postgresql pkg=cluster
time="2023-12-19T09:31:03Z" level=warning msg="could not connect to Postgres database: dial tcp 10.111.90.85:5432: i/o timeout" cluster-name=test/ops-sentry-postgresql pkg=cluster
time="2023-12-19T09:31:18Z" level=warning msg="could not connect to Postgres database: dial tcp 10.111.90.85:5432: i/o timeout" cluster-name=test/ops-sentry-postgresql pkg=cluster
Name:              ops-sentry-postgresql
Namespace:         test
Labels:            application=spilo
                   cluster-name=ops-sentry-postgresql
                   spilo-role=master
                   team=ops-sentry
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.111.90.85
IPs:               10.111.90.85
Port:              postgresql  5432/TCP
TargetPort:        5432/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

kubectl get pods -l spilo-role=master
NAME                      READY   STATUS    RESTARTS   AGE
ops-sentry-postgresql-0   2/2     Running   0          46m
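To spot this mismatch across several clusters without eyeballing `kubectl describe` output, the JSON from `kubectl get endpoints <svc> -o json` can be fed into a small check. A sketch, under the assumption that an empty or absent `subsets` list is what shows up as `Endpoints: <none>` above (the function name is mine, not from the operator):

```python
import json


def has_no_ready_addresses(endpoints_json: str) -> bool:
    """Return True if an Endpoints object carries no ready addresses.

    Mirrors the 'Endpoints: <none>' line in `kubectl describe svc`:
    an Endpoints object with no subsets, or whose subsets have empty
    `addresses` lists, means nothing is backing the Service.
    """
    obj = json.loads(endpoints_json)
    subsets = obj.get("subsets") or []
    return not any(subset.get("addresses") for subset in subsets)
```

Piping `kubectl -n test get endpoints ops-sentry-postgresql -o json` into this check would flag the broken service shown above.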

The only ways we have found to fix this are either to roll back to the Kubernetes-native DCS, or to manually delete the database services and then restart the postgres-operator deployment, which recreates the database service with the correct endpoint IP pointing to the master pod:

kubectl -n test delete svc ops-sentry-postgresql
kubectl -n postgres-operator rollout restart deployment postgres-operator
  • Which image of the operator are you using?: registry.opensource.zalan.do/acid/postgres-operator:v1.10.0
  • Where do you run it: Bare Metal K8s
  • Are you running Postgres Operator in production?: Yes
  • Type of issue?: Bug report

Any help with this would be much appreciated. We could switch to external etcd and live with the workaround (manually deleting the database services and then restarting the postgres-operator deployment), but that involves downtime, which is not desirable. Please let me know if you need any further details. Thanks! 🙏🏻
