Bugfixes #167
mladjan-gadzic
commented
Sep 4, 2025
- UFM Cleanup on Pod Lifecycle Events: Fixed issue where GUIDs weren't properly removed from UFM (Unified Fabric Manager) when pods completed (success/error) or were deleted.
- GUID Reallocation Conflicts: Added logic to remove existing GUID allocations from UFM before assigning a new partition (PKey) to prevent conflicts when the same GUID is reused.
- Pod State Handling: Improved pod lifecycle management by treating finished pods (succeeded/failed) the same as deleted pods for cleanup purposes.
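As a rough illustration of the last two points, the following Go sketch models the fabric as an in-memory map. It shows a GUID being removed from any existing partition before it joins a new one, plus a predicate that treats finished pods like deleted ones. All names here (`fabric`, `removeGUID`, `assignGUID`, `needsCleanup`, the local `PodPhase` type) are hypothetical and are not taken from the PR or from the UFM API; only the phase strings match the Kubernetes values.

```go
package main

import "fmt"

// PodPhase mirrors the Kubernetes pod phase strings. It is defined locally
// (assumption) so the sketch has no external dependencies.
type PodPhase string

const (
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// needsCleanup reports whether a pod's GUIDs should be released: the pod was
// deleted, or it has finished (succeeded or failed).
func needsCleanup(deleted bool, phase PodPhase) bool {
	return deleted || phase == PodSucceeded || phase == PodFailed
}

// fabric is a hypothetical in-memory stand-in for the subnet manager:
// it maps a PKey to the set of GUIDs that are members of that partition.
type fabric map[int]map[string]bool

// removeGUID drops guid from every partition and returns the PKeys it was in.
func (f fabric) removeGUID(guid string) []int {
	var removed []int
	for pkey, members := range f {
		if members[guid] {
			delete(members, guid)
			removed = append(removed, pkey)
		}
	}
	return removed
}

// assignGUID removes any existing allocation of guid and then adds it to
// pkey, so a reused GUID never ends up in two partitions.
func (f fabric) assignGUID(guid string, pkey int) {
	f.removeGUID(guid)
	if f[pkey] == nil {
		f[pkey] = map[string]bool{}
	}
	f[pkey][guid] = true
}

func main() {
	f := fabric{0x7001: {"guid-a": true}}
	f.assignGUID("guid-a", 0x7002)
	fmt.Println(f[0x7001]["guid-a"], f[0x7002]["guid-a"])
	fmt.Println(needsCleanup(false, PodSucceeded))
}
```

The real daemon talks to UFM through its subnet manager plugin; the map only stands in for that state so the ordering (remove first, then assign) is visible.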
35562ac to 70eace6
Pull Request Test Coverage Report for Build 18346261465 (Coveralls)
Hi @mladjan-gadzic, thank you for your contribution. Left some comments. Could you also cover the changed functionality with unit tests?
pkg/daemon/daemon.go
Outdated
log.Info().Msgf("matched guid %s to pod %s, removing", guidAddr, guidPodEntry)
guidList = append(guidList, guidAddr)
} else {
log.Warn().Msgf("guid %s is allocated to another pod %s not %s, not removing",
When can this happen?
Added comment for this.
Could you point me to the comment? I can't find it
Hi @mladjan-gadzic, there is also a failing DCO check: https://github.com/Mellanox/ib-kubernetes/pull/167/checks?check_run_id=49668982984 Please sign off the commits in your PR and re-push them to the branch.
46ab9d1 to 8d8eafc
@mladjan-gadzic can you rebase and handle the conflict?
54de066 to c9dee03
@almaslennikov thanks for the review. i've pushed the required changes. once you're okay with those, please resolve the comments. @rollandf thanks for the comment. i've squashed everything into one commit. i am going to resolve the conflicts soon.
c9dee03 to a72cdef
pkg/daemon/daemon.go
Outdated
// Remove stale GUIDs that are no longer in use by the subnet manager
// This handles cleanup of GUIDs from deleted/finished pods
for allocatedGUID, podNetworkID := range d.guidPodNetworkMap {
At the time when syncGuidPool is called, the d.guidPodNetworkMap is empty because it's populated in the d.initPool method. Since we run both only once, let's combine them into a single function and properly handle the order of operations.
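The ordering concern can be sketched as follows: populate the map first, then walk it to drop stale entries. This is only a minimal model under stated assumptions. The `daemon` struct, the function signature, and the two input maps are hypothetical; only the names `guidPodNetworkMap` and `initGUIDPool` come from the diff in this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// daemon is a stripped-down stand-in for the real daemon type.
type daemon struct {
	// guidPodNetworkMap maps an allocated GUID to its pod network ID.
	guidPodNetworkMap map[string]string
}

// initGUIDPool (hypothetical signature) combines the two steps in one place:
// step 1 populates the map from existing pods, step 2 removes entries whose
// GUID the subnet manager no longer reports as in use. It returns the stale
// GUIDs in sorted order.
func (d *daemon) initGUIDPool(podGUIDs map[string]string, inUseBySM map[string]bool) []string {
	// Step 1: populate. If step 2 ran first, the loop below would see an
	// empty map and remove nothing.
	d.guidPodNetworkMap = make(map[string]string, len(podGUIDs))
	for guid, podNetworkID := range podGUIDs {
		d.guidPodNetworkMap[guid] = podNetworkID
	}

	// Step 2: sync. Deleting from a map while ranging over it is safe in Go.
	var stale []string
	for guid := range d.guidPodNetworkMap {
		if !inUseBySM[guid] {
			stale = append(stale, guid)
			delete(d.guidPodNetworkMap, guid)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	d := &daemon{}
	stale := d.initGUIDPool(
		map[string]string{"guid-1": "pod-a_net1", "guid-2": "pod-b_net1"},
		map[string]bool{"guid-1": true},
	)
	fmt.Println(stale, len(d.guidPodNetworkMap))
}
```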
0301ca3 to e12ba14
@almaslennikov, please, let me know if any additional changes are needed.
@mladjan-gadzic Please, resolve the merge conflicts. After merging #168 I see there are several places where
e12ba14 to 7b07865
@mladjan-gadzic could you test that the changes in this PR work? The current version doesn't build
pkg/daemon/daemon.go
Outdated
}

// Initialize guid pool with existing pods and sync with subnet manager
err = daemonInstance.initGUIDPool()
I suggest not running it here anymore, since we should only start the actual logic once the instance becomes the leader. The becomeLeader function already runs initGUIDPool.
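A minimal sketch of that suggestion, assuming a constructor that only builds the daemon and a leadership callback that does the pool initialization. `newDaemon` and the `poolInitialized` field are hypothetical; the bodies of `becomeLeader` and `initGUIDPool` here are placeholders, not the project's implementations.

```go
package main

import "fmt"

// daemon is a stripped-down stand-in for the real daemon type.
type daemon struct {
	poolInitialized bool
}

// newDaemon (hypothetical) returns a fully formed daemon but does no pool
// work: a non-leader instance should not touch the GUID pool.
func newDaemon() *daemon {
	return &daemon{}
}

// initGUIDPool is a placeholder for the real initialization and sync logic.
func (d *daemon) initGUIDPool() error {
	d.poolInitialized = true
	return nil
}

// becomeLeader is the single place where the pool is initialized, so the
// work runs only once this instance has won leadership.
func (d *daemon) becomeLeader() error {
	if err := d.initGUIDPool(); err != nil {
		return fmt.Errorf("init guid pool: %w", err)
	}
	return nil
}

func main() {
	d := newDaemon()
	fmt.Println("before leadership:", d.poolInitialized)
	if err := d.becomeLeader(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("after leadership:", d.poolInitialized)
}
```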
Let's refactor this function to return the daemon fully formed at the end
done.
pkg/daemon/daemon.go
Outdated
log.Info().Msg("delete periodic update finished")
}

<<<<<<< HEAD
There are merge artifacts left here
fixed.
UFM Cleanup on Pod Lifecycle Events: Fixed issue where GUIDs weren't properly removed from UFM (Unified Fabric Manager) when pods completed (success/error) or were deleted.
GUID Reallocation Conflicts: Added logic to remove existing GUID allocations from UFM before assigning a new partition (PKey) to prevent conflicts when the same GUID is reused.
Pod State Handling: Improved pod lifecycle management by treating finished pods (succeeded/failed) the same as deleted pods for cleanup purposes.
Signed-off-by: Mladjan Gadzic <[email protected]>
7b07865 to 9e8319e
i've run make build/all/test/lint locally and it succeeds.
GO = go

-TIMEOUT = 15
+TIMEOUT = 30
Do we need this change?
i've noticed that the tests were flaky without this. it might be something local to my env.
@almaslennikov thanks for merging this!