Skip to content

Conversation

mladjan-gadzic
Copy link
Contributor

  • UFM Cleanup on Pod Lifecycle Events: Fixed issue where GUIDs weren't properly removed from UFM (Unified Fabric Manager) when pods completed (success/error) or were deleted.
  • GUID Reallocation Conflicts: Added logic to remove existing GUID allocations from UFM before assigning a new partition (PKey) to prevent conflicts when the same GUID is reused.
  • Pod State Handling: Improved pod lifecycle management by treating finished pods (succeeded/failed) the same as deleted pods for cleanup purposes.

@coveralls
Copy link

coveralls commented Sep 5, 2025

Pull Request Test Coverage Report for Build 18346261465

Details

  • 103 of 137 (75.18%) changed or added relevant lines in 6 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+6.4%) to 46.579%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/sm/plugins/noop/noop.go 0 2 0.0%
pkg/daemon/daemon.go 58 90 64.44%
Files with Coverage Reduction New Missed Lines %
pkg/daemon/daemon.go 1 15.58%
Totals Coverage Status
Change from base Build 18335558999: 6.4%
Covered Lines: 708
Relevant Lines: 1520

💛 - Coveralls

Copy link
Collaborator

@almaslennikov almaslennikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mladjan-gadzic, thank you for your contribution. Left some comments. Could you also cover the changed functionality with unit tests?

log.Info().Msgf("matched guid %s to pod %s, removing", guidAddr, guidPodEntry)
guidList = append(guidList, guidAddr)
} else {
log.Warn().Msgf("guid %s is allocated to another pod %s not %s, not removing",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When can this happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment for this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you point me to the comment? I can't find it

@almaslennikov
Copy link
Collaborator

Hi @mladjan-gadzic there is also a failing DCO check: https://github.com/Mellanox/ib-kubernetes/pull/167/checks?check_run_id=49668982984

Please, sign the commits in your PR and re-push them to the branch

@rollandf
Copy link
Member

rollandf commented Sep 8, 2025

@mladjan-gadzic can you rebase and handle conflict?
Also maybe stash the commits to one?

@mladjan-gadzic
Copy link
Contributor Author

mladjan-gadzic commented Sep 8, 2025

@almaslennikov thanks for the review. i've pushed required changes. once you're okay with those, please resolve comments.

@rollandf thanks for the comment. i've squashed everything under one commit. i am going to resolve conflicts soon.


// Remove stale GUIDs that are no longer in use by the subnet manager
// This handles cleanup of GUIDs from deleted/finished pods
for allocatedGUID, podNetworkID := range d.guidPodNetworkMap {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the time when syncGuidPool is called, the d.guidPodNetworkMap is empty because it's populated in the d.initPool method. Since we run both only once, let's combine them into a single function and properly handle the order of operations

@mladjan-gadzic mladjan-gadzic force-pushed the bugfixes branch 2 times, most recently from 0301ca3 to e12ba14 Compare September 23, 2025 11:25
@mladjan-gadzic
Copy link
Contributor Author

@almaslennikov, please, let me know if any additional changes are needed.

@almaslennikov
Copy link
Collaborator

almaslennikov commented Oct 7, 2025

@mladjan-gadzic Please, resolve the merge conflicts. After merging #168 I see there are several places where syncGuidPool is called. Could you align your PR with the new logic?

Copy link
Collaborator

@almaslennikov almaslennikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mladjan-gadzic could you test that the changes in this PR work? The current version doesn't build

}

// Initialize guid pool with existing pods and sync with subnet manager
err = daemonInstance.initGUIDPool()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to not run it here anymore since we only should start the actual logic if the instance becomes a leader. In the becomeLeader function we are already running initGUIDPool

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's refactor this function to return the daemon fully formed at the end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

log.Info().Msg("delete periodic update finished")
}

<<<<<<< HEAD
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are merge artifacts left here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

UFM Cleanup on Pod Lifecycle Events: Fixed issue where GUIDs weren't properly removed from UFM (Unified Fabric Manager) when pods completed (success/error) or were deleted.

GUID Reallocation Conflicts: Added logic to remove existing GUID allocations from UFM before assigning a new partition (PKey) to prevent conflicts when the same GUID is reused.

Pod State Handling: Improved pod lifecycle management by treating finished pods (succeeded/failed) the same as deleted pods for cleanup purposes.

Signed-off-by: Mladjan Gadzic <[email protected]>
@mladjan-gadzic
Copy link
Contributor Author

@mladjan-gadzic could you test that the changes in this PR work? The current version doesn't build

i've run make build/all/test/lint locally and it succeeds.

GO = go

TIMEOUT = 15
TIMEOUT = 30
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've noticed that tests were flaky without this. it might be something locally to my env.

@almaslennikov almaslennikov merged commit 20e4e6b into Mellanox:master Oct 14, 2025
7 of 8 checks passed
@mladjan-gadzic
Copy link
Contributor Author

@almaslennikov thanks for merging this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants