OCPBUGS-57456: podman-etcd should keep the container for crash debugging #2062

Open
wants to merge 4 commits into
base: main
from

Conversation

clobrano
Collaborator

@clobrano clobrano commented Jul 23, 2025

This change modifies the podman-etcd resource agent to conditionally reuse existing containers, preventing their removal and preserving historical logs.

As a preliminary change, it replaces inline configuration arguments with an external file:

  • etcd configuration variables are passed via the file /var/lib/etcd/config.yaml

This allows configuration changes to be applied via container restart rather than requiring full container recreation.
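
A minimal sketch of the file-based approach, assuming a handful of standard etcd configuration keys. The function name `write_etcd_config`, the node name, the IP address, and the data directory are hypothetical and not taken from this PR:

```sh
# Hypothetical helper: write a small etcd config file.
# Only a few standard etcd config keys are shown; the agent's
# real file would carry more settings.
write_etcd_config() {
    config_file="$1"   # e.g. /var/lib/etcd/config.yaml
    node_name="$2"     # assumed node name
    node_ip="$3"       # assumed node IP address

    cat > "$config_file" <<EOF
name: ${node_name}
data-dir: /var/lib/etcd
listen-client-urls: https://${node_ip}:2379
advertise-client-urls: https://${node_ip}:2379
listen-peer-urls: https://${node_ip}:2380
initial-advertise-peer-urls: https://${node_ip}:2380
EOF
}
```

If the container runs `etcd --config-file /var/lib/etcd/config.yaml`, a changed file would presumably be picked up by restarting the existing container, without creating a new one.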

The agent can then:

  • Compare the Cluster Etcd Operator (CEO) pod.yaml manifest to decide whether to reuse an existing container or create a new one.
  • Retain containers after stops to ensure logs are always available for debugging.
  • Keep the old container on disk with a -previous suffix when it is replaced, together with the corresponding config.yaml (renamed config-previous.yaml), to allow further debugging.
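
The reuse decision above could be sketched as follows. This is only an illustration: the PR does not say how the manifest is compared, so the checksum file, the function name `decide_container_action`, and its arguments are all assumptions:

```sh
# Hypothetical helper: print "create", "reuse" or "recreate".
# Assumes a checksum of the pod manifest was saved when the
# container was created (this file is an assumption).
decide_container_action() {
    manifest="$1"          # current pod manifest
    saved_sum_file="$2"    # checksum saved at container creation
    container_exists="$3"  # "true" or "false"

    # Nothing to reuse: no container, or no recorded checksum.
    if [ "$container_exists" != "true" ] || [ ! -f "$saved_sum_file" ]; then
        echo "create"
        return 0
    fi

    current_sum=$(sha256sum "$manifest" | cut -d' ' -f1)
    saved_sum=$(cat "$saved_sum_file")

    if [ "$current_sum" = "$saved_sum" ]; then
        echo "reuse"      # manifest unchanged: restart the old container
    else
        echo "recreate"   # manifest changed: archive and create a new one
    fi
}
```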

clobrano added 3 commits July 23, 2025 10:33
Replace embedded configuration arguments with external files:
- etcd config file (/var/lib/etcd/config.yaml)
- podman env file (/var/lib/etcd/env.yaml)

This allows configuration changes to be applied via container restart
rather than requiring full container recreation.

This commit modifies the podman-etcd resource agent to conditionally
reuse existing containers, preventing their removal and preserving
historical logs.

The agent now:
* Compares etcd-pod.yaml changes to decide whether to reuse an existing
  container or create a new one.
* Retains containers after stops to ensure logs are always available for
  debugging.

No need to move further in the podman_start function if a container is
already running.

Signed-off-by: Carlo Lobrano <[email protected]>

knet-jenkins bot commented Jul 23, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/1/input

@jaypoulz jaypoulz left a comment

Definitely seems like a step forward for debugging purposes. I am concerned about ending up with an infinite list of old containers though. Is it possible to just keep the previous and current ones?

@clobrano
Collaborator Author

The existing container is deleted right before starting a new one. I could improve it to keep only the last one.

knet-jenkins bot commented Jul 29, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/2/input

@clobrano clobrano changed the title from "[WIP] OCPBUGS-57456: podman-etcd should keep the container for crash debugging" to "OCPBUGS-57456: podman-etcd should keep the container for crash debugging" Jul 30, 2025
@clobrano clobrano marked this pull request as ready for review July 30, 2025 07:33
@clobrano
Collaborator Author

clobrano commented Aug 1, 2025

As this PR also introduces configuration files for podman (env.yaml) and etcd (config.yaml), I also need to back up those files together with the podman container; otherwise we will be unable to check what data etcd was started with.

/hold

@clobrano clobrano marked this pull request as draft August 1, 2025 09:15

@jaypoulz jaypoulz left a comment

I really like the new file-based configuration options. I think it makes it easier to follow where the configuration options are being sourced and propagated. Excited to also have these logs preserved.

FORCE_NEW_CLUSTER=false
fi

cat > "$OCF_RESKEY_podman_env_file" << EOF

For the collection of commented-out options, are these things we're leaving here because we expect to enable them down the line?

Collaborator Author

Those were leftovers I forgot to remove. However, I noticed that some env variables (e.g. etcd_data) were not used correctly and, thinking more about it, the env file for podman is not necessary anyway, as the only thing that can really change is the etcd command line. This is to say that I need to push a version without the podman env file :)

@clobrano clobrano force-pushed the ocpbugs-57456-tnf-podman-etcd-keep-container-for-crash-debugging branch from 48e121c to d5da5f6 Compare August 12, 2025 09:52

knet-jenkins bot commented Aug 12, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/3/input

@clobrano clobrano force-pushed the ocpbugs-57456-tnf-podman-etcd-keep-container-for-crash-debugging branch from d5da5f6 to e6f68d0 Compare August 12, 2025 10:06

knet-jenkins bot commented Aug 12, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/4/input

@clobrano clobrano force-pushed the ocpbugs-57456-tnf-podman-etcd-keep-container-for-crash-debugging branch from e6f68d0 to 4c021cc Compare August 12, 2025 14:25
When recreating the container due to configuration changes, archive the
current stopped container and its config file by renaming them to
`${CONTAINER}-previous` and `/var/lib/etcd/config-previous.yaml`. This
preserves the container state for debugging while the new instance is
created.

Falls back to deletion if archiving fails. Only one archived copy is
maintained to limit disk usage.

Note: the use of an environment file for podman was removed as it is
an unnecessary complication. If the environment changes we are forced to
create a new container anyway.
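
The archive-or-delete behaviour described in this commit message might look roughly like the sketch below. It is not the agent's actual code: the function name `archive_or_delete` and the `PODMAN_BIN` variable are hypothetical, and only the `podman rename` and `podman rm` subcommands are used:

```sh
# Hypothetical sketch; PODMAN_BIN lets the podman binary be swapped out.
PODMAN_BIN="${PODMAN_BIN:-podman}"

# Archive a stopped container and its config file, or delete the
# container if archiving fails. Prints "archived" or "deleted".
archive_or_delete() {
    name="$1"      # container name
    config="$2"    # e.g. /var/lib/etcd/config.yaml
    prev_name="${name}-previous"
    prev_config="${config%.yaml}-previous.yaml"

    # Keep a single archived copy: drop any older archive first.
    "$PODMAN_BIN" rm --force --ignore "$prev_name" >/dev/null 2>&1

    if "$PODMAN_BIN" rename "$name" "$prev_name" >/dev/null 2>&1; then
        # Keep the matching config next to the archived container.
        mv -f "$config" "$prev_config" 2>/dev/null
        echo "archived"
    else
        # Fallback: remove the container so a new one can be created.
        # The config file is left in place to be overwritten.
        "$PODMAN_BIN" rm --force "$name" >/dev/null 2>&1
        echo "deleted"
    fi
}
```

Removing any existing `-previous` container before renaming is what bounds disk usage to one archived copy in this sketch.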

knet-jenkins bot commented Aug 12, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/5/input

@clobrano clobrano force-pushed the ocpbugs-57456-tnf-podman-etcd-keep-container-for-crash-debugging branch from 4c021cc to 1353f02 Compare August 12, 2025 14:26
@clobrano clobrano marked this pull request as ready for review August 12, 2025 14:27

knet-jenkins bot commented Aug 12, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/6/input

@clobrano
Collaborator Author

Let's keep it on hold again. Notice this line from etcd, which should be investigated further:

{"level":"info","ts":"2025-08-14T12:14:04.430103Z","caller":"etcdmain/config.go:370","msg":"loaded server configuration, other configuration command line flags and environment variables will be ignored if provided","path":"/var/lib/etcd/config.yaml"}
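
Taking that log message at face value, a small check like the following could list the settings that would be ignored once a config file is loaded. This is a sketch only: the function name `ignored_etcd_settings` is hypothetical, and it assumes etcd's `--config-file` flag and `ETCD_`-prefixed environment variables:

```sh
# Hypothetical helper: given an etcd argument list, print the flags
# and ETCD_* environment variables that would be ignored because
# --config-file is also present. Prints nothing otherwise.
ignored_etcd_settings() {
    has_config_file=false
    others=""
    for arg in "$@"; do
        case "$arg" in
            --config-file|--config-file=*) has_config_file=true ;;
            --*) others="$others ${arg%%=*}" ;;   # keep the flag name only
        esac
    done

    # Without a config file, flags and environment are honoured.
    [ "$has_config_file" = true ] || return 0

    for flag in $others; do
        echo "$flag"
    done
    # Names of ETCD_* variables present in the environment.
    env | sed -n 's/^\(ETCD_[A-Za-z0-9_]*\)=.*/\1/p'
}
```

For example, `ignored_etcd_settings --config-file /var/lib/etcd/config.yaml --name=node1` would report `--name` as a flag to move into the config file.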

@clobrano
Collaborator Author

/hold
