Description
Please, answer some short questions which should help us to understand your problem / question better?
- Which image of the operator are you using? e.g. registry.opensource.zalan.do/acid/postgres-operator:v1.9.0
- Where do you run it - cloud or metal? Kubernetes or OpenShift? Bare Metal K8s
- Are you running Postgres Operator in production? yes
- Type of issue? Bug report / question
Hello dear maintainers! Before we get into the details of what brings us here, we'd like to thank you for sharing this quite useful project and for your availability when considering our problems with it (see #1983 and #1984). That was really appreciated on our side ❤️
Context
We have a three-node K8s cluster (bare-metal) on which we run a three-node PostgreSQL cluster managed by Postgres Operator.
The configuration includes Patroni settings to enforce that a `sync_standby` replica is always available:
```yaml
spec:
  # [...]
  patroni:
    synchronous_mode: true
    synchronous_node_count: 1
```
We are running a rolling update of our K8s cluster by draining nodes in sequence before upgrading the control plane services. During the drain, Postgres Operator takes care of switching over the master to a selected candidate.
Problem
We have seen a situation where, for some reason, Patroni returns a member list that does not include a single `sync_standby` replica, even though the current master had logged that it assigned the role to a replica before we requested this list.
Relevant logs can be found at the bottom of this ticket description.
The candidate selected by `getSwitchoverCandidate` is unfortunately not the one that was promoted by the master. The failover request sent to the master is therefore rejected, since the candidate is not a `sync_standby` even though `synchronous_mode` is enabled in the configuration. All subsequent retries are made with the same candidate and keep failing.
We believe there are two issues:

- `getSwitchoverCandidate` loosely assumes that if `patroni.GetClusterMembers` does not include any `sync_standby` replica, then `synchronous_mode` is not enabled; our problematic situation shows that this assumption may be wrong (see the sketch below this list).
- `MigrateMasterPod` retries the actual `Switchover` call to Patroni, but always re-uses the same candidate until it stops with an error. This whole method doesn't seem to be retried, which may be another problem in itself.
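To make the first point concrete, here is a minimal, self-contained sketch of the decision we are describing; the names, types, and signature below are ours for illustration, not the operator's actual code:

```go
package selection

import "fmt"

// pickCandidate receives the cluster members already bucketed by role, which is
// roughly how we understand getSwitchoverCandidate to behave (a simplification,
// not the real implementation).
func pickCandidate(syncCandidates, candidates []string) (string, error) {
	// Preferred path: at least one sync_standby is reported, so it is chosen.
	if len(syncCandidates) > 0 {
		return syncCandidates[0], nil
	}
	// Problematic fallback: taken whenever Patroni momentarily reports no
	// sync_standby member, even on clusters where synchronous_mode is enabled.
	if len(candidates) > 0 {
		return candidates[0], nil
	}
	return "", fmt.Errorf("no switchover candidate found")
}
```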
Proposal
- Bail out when no sync replicas can be found if `synchronous_mode` is enabled
  - We could change `getSwitchoverCandidate` to read from `c.Spec.SynchronousMode` instead of assuming anything, and return an error in case `len(syncCandidates) == 0`
- Include the candidate selection in the retried loop within `MigrateMasterPod` to allow recovering from such failures (a rough sketch of both changes follows this list)
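As a rough illustration of both proposed changes (a sketch only: the names, signatures, and retry cadence below are hypothetical, not a patch against the operator), the selection could honour the `synchronous_mode` flag from the spec and be re-run on every retry:

```go
package migration

import (
	"fmt"
	"time"
)

type member struct {
	Name string
	Role string // as reported by Patroni: "sync_standby", "replica", ...
}

// pickCandidate fails fast when synchronous mode is enabled but Patroni does
// not report any sync_standby, instead of silently falling back to an async
// replica.
func pickCandidate(members []member, synchronousMode bool) (string, error) {
	var syncCandidates, candidates []string
	for _, m := range members {
		switch m.Role {
		case "sync_standby":
			syncCandidates = append(syncCandidates, m.Name)
		case "replica", "running":
			candidates = append(candidates, m.Name)
		}
	}
	if synchronousMode {
		if len(syncCandidates) == 0 {
			return "", fmt.Errorf("synchronous_mode is enabled but no sync_standby member was reported")
		}
		return syncCandidates[0], nil
	}
	if len(candidates) == 0 {
		return "", fmt.Errorf("no switchover candidate found")
	}
	return candidates[0], nil
}

// migrateMaster re-fetches the member list and re-selects the candidate on
// every attempt, so a transiently missing sync_standby can be recovered from.
func migrateMaster(getMembers func() ([]member, error), switchover func(candidate string) error, synchronousMode bool) error {
	const retries = 5
	var lastErr error
	for i := 0; i < retries; i++ {
		if i > 0 {
			time.Sleep(5 * time.Second)
		}
		members, err := getMembers()
		if err != nil {
			lastErr = err
			continue
		}
		candidate, err := pickCandidate(members, synchronousMode)
		if err != nil {
			lastErr = err
			continue
		}
		if err := switchover(candidate); err != nil {
			lastErr = err
			continue
		}
		return nil
	}
	return fmt.Errorf("still failing after %d retries: %w", retries, lastErr)
}
```

The behavioural change would be that a transient "no sync_standby" answer from Patroni leads to a freshly selected candidate on the next attempt, instead of five failover requests for the same rejected candidate.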
We would be happy to submit a PR if you agree with this proposal. We are also unsure whether this behavior from Patroni is expected or if it's a bug (see Logs section below for details), in which case we would also submit an issue in their project.
Logs
- `postgres-operator` logs:
  - When draining the first node:

    ```
    time="2023-02-21T14:21:06Z" level=info msg="moving pods: node \"/node01\" became unschedulable and does not have a ready label: map[]" pkg=controller
    time="2023-02-21T14:21:06Z" level=info msg="starting process to migrate master pod \"my-ns/postgres-2\"" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:21:06Z" level=debug msg="making GET http request: http://10.233.3.147:8008/cluster" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:21:06Z" level=info msg="check failed: pod \"my-ns/postgres-1\" is already on a live node" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:21:06Z" level=debug msg="switching over from \"postgres-2\" to \"my-ns/postgres-1\"" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:21:06Z" level=debug msg="subscribing to pod \"my-ns/postgres-1\"" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:21:06Z" level=debug msg="making POST http request: http://10.233.3.147:8008/failover" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:21:08Z" level=debug msg="successfully switched over from \"postgres-2\" to \"my-ns/postgres-1\"" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:21:09Z" level=debug msg="unsubscribing from pod \"my-ns/postgres-1\" events" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:21:09Z" level=info msg="1/1 master pods have been moved out from the \"/node01\" node" pkg=controller
    ```

  - When draining the second node, we see similar logs as for the first one.
  - When draining the last node (and failing):

    ```
    time="2023-02-21T14:31:07Z" level=info msg="moving pods: node \"/node03\" became unschedulable and does not have a ready label: map[]" pkg=controller
    time="2023-02-21T14:31:07Z" level=info msg="starting process to migrate master pod \"my-ns/postgres-0\"" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:31:07Z" level=debug msg="making GET http request: http://10.233.46.63:8008/cluster" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:31:07Z" level=info msg="check failed: pod \"my-ns/postgres-1\" is already on a live node" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:31:07Z" level=debug msg="switching over from \"postgres-0\" to \"my-ns/postgres-1\"" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:31:07Z" level=debug msg="subscribing to pod \"my-ns/postgres-1\"" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:31:07Z" level=debug msg="making POST http request: http://10.233.46.63:8008/failover" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:31:08Z" level=debug msg="unsubscribing from pod \"my-ns/postgres-1\" events" cluster-name=my-ns/postgres pkg=cluster
    time="2023-02-21T14:31:08Z" level=error msg="could not failover to pod \"my-ns/postgres-1\": could not switch over from \"postgres-0\" to \"my-ns/postgres-1\": patroni returned 'Failover failed'" cluster-name=my-ns/postgres pkg=cluster
    [... more attempts]
    time="2023-02-21T14:35:08Z" level=error msg="could not move master pod \"my-ns/postgres-0\": could not migrate master pod: still failing after 5 retries" pkg=controller
    time="2023-02-21T14:35:08Z" level=info msg="0/1 master pods have been moved out from the \"/node03\" node" pkg=controller
    time="2023-02-21T14:35:08Z" level=warning msg="failed to move master pods from the node \"node03\": could not move master 1/1 pods from the \"/node03\" node" pkg=controller
    ```

- `postgres-0` (master) logs:
  - after failover from `postgres-1` to `postgres-0` succeeded:

    ```
    2023-02-21 14:26:07,349 INFO: Cleaning up failover key after acquiring leader lock...
    2023-02-21 14:26:07,516 INFO: promoted self to leader by acquiring session lock
    2023-02-21 14:26:07,524 INFO: cleared rewind state after becoming the leader
    2023-02-21 14:26:07,517 INFO: Lock owner: postgres-0; I am postgres-0
    2023-02-21 14:26:07,652 INFO: updated leader lock during promote
    2023-02-21 14:26:19,358 INFO: Lock owner: postgres-0; I am postgres-0
    2023-02-21 14:26:19,412 INFO: Assigning synchronous standby status to ['postgres-2']
    2023-02-21 14:26:21,477 INFO: Synchronous standby status assigned to ['postgres-2']
    ```

  - when trying to failover to `postgres-1` again (because we're draining `node03`):

    ```
    2023-02-21 14:31:07,812 INFO: received failover request with leader=postgres-0 candidate=postgres-1 scheduled_at=None
    2023-02-21 14:31:07,825 INFO: Got response from postgres-1 http://10.233.213.196:8008/patroni: {"state": "running", "postmaster_start_time": "2023-02-21 14:30:25.180516+00:00", "role": "replica", "server_version": 140004, "xlog": {"received_location": 1595645080, "replayed_location": 1595645080, "replayed_timestamp": "2023-02-21 14:31:00.143559+00:00", "paused": false}, "timeline": 21, "dcs_last_seen": 1676989859, "database_system_identifier": "7174390007193890885", "patroni": {"version": "2.1.4", "scope": "postgres"}}
    2023-02-21 14:31:07,968 INFO: Lock owner: postgres-0; I am postgres-0
    2023-02-21 14:31:08,018 WARNING: Failover candidate=postgres-1 does not match with sync_standbys=None
    2023-02-21 14:31:08,018 WARNING: manual failover: members list is empty
    2023-02-21 14:31:08,018 WARNING: manual failover: no healthy members found, failover is not possible
    2023-02-21 14:31:08,019 INFO: Cleaning up failover key
    2023-02-21 14:31:08,068 INFO: Assigning synchronous standby status to ['postgres-2']
    2023-02-21 14:31:08,116 INFO: Synchronous standby status assigned to ['postgres-2']
    ```
Note: it appears that after the second failover attempt, the master reassigned the sync standby status to `postgres-2`, even though it had already reported doing so after becoming master. This may be an issue in itself, which we haven't been able to understand yet.