
Rocm plugin update to accommodate for changes in 7.0 #31


Merged 8 commits on Aug 18, 2025



@alexandraBara alexandraBara commented Aug 6, 2025

Error observed when running node-scraper inside a container with ROCm 7.0.0.
Previous behavior:

(venv) root@TheraC61:/mnt/kgr/node-scraper/node-scraper# node-scraper run-plugins RocmPlugin
  2025-08-06 19:37:37 UTC       INFO               nodescraper | Log path: ./scraper_logs_therac61_2025_08_06-07_37_37_PM
  2025-08-06 19:37:37 UTC       INFO               nodescraper | System Name: TheraC61
  2025-08-06 19:37:37 UTC       INFO               nodescraper | System SKU: None
  2025-08-06 19:37:37 UTC       INFO               nodescraper | System Platform: None
  2025-08-06 19:37:37 UTC       INFO               nodescraper | System location: SystemLocation.LOCAL
  2025-08-06 19:37:37 UTC       INFO               nodescraper | Initializing connection manager for InBandConnectionManager with default args
  2025-08-06 19:37:37 UTC       INFO               nodescraper | --------------------------------------------------
  2025-08-06 19:37:37 UTC       INFO               nodescraper | Running plugin RocmPlugin
  2025-08-06 19:37:37 UTC       INFO               nodescraper | Initializing connection: InBandConnectionManager
  2025-08-06 19:37:37 UTC       INFO               nodescraper | Using local shell
  2025-08-06 19:37:37 UTC       INFO               nodescraper | Checking OS family
  2025-08-06 19:37:37 UTC       INFO               nodescraper | OS Family: LINUX
  2025-08-06 19:37:37 UTC       INFO               nodescraper | Running data collector: RocmCollector
  2025-08-06 19:37:37 UTC   CRITICAL               nodescraper | Pydantic validation error
  2025-08-06 19:37:37 UTC   CRITICAL               nodescraper | (RocmPlugin) task failed to run (1 errors)
  2025-08-06 19:37:37 UTC       INFO               nodescraper | Closing connections
  2025-08-06 19:37:37 UTC       INFO               nodescraper | Running result collators
  2025-08-06 19:37:37 UTC       INFO               nodescraper | Running TableSummary result collator
  2025-08-06 19:37:37 UTC       INFO               nodescraper |

+-------------------------+--------+-----------------------------+
| Connection              | Status | Message                     |
+-------------------------+--------+-----------------------------+
| InBandConnectionManager | UNSET  | task completed successfully |
+-------------------------+--------+-----------------------------+

+------------+-------------------+----------------------------------------------------------------------------------------------------------------------+
| Plugin     | Status            | Message                                                                                                              |
+------------+-------------------+----------------------------------------------------------------------------------------------------------------------+
| RocmPlugin | EXECUTION_FAILURE | Collection error: Unhandled exception running data collector: Unable to serialize unknown type: <class 'ValueError'> |
+------------+-------------------+----------------------------------------------------------------------------------------------------------------------+

Current behavior:

(venv) root@TheraC61:/mnt/kgr/node-scraper/node-scraper# node-scraper run-plugins RocmPlugin
  2025-08-06 19:46:05 UTC       INFO               nodescraper | Log path: ./scraper_logs_therac61_2025_08_06-07_46_05_PM
  2025-08-06 19:46:05 UTC       INFO               nodescraper | System Name: TheraC61
  2025-08-06 19:46:05 UTC       INFO               nodescraper | System SKU: None
  2025-08-06 19:46:05 UTC       INFO               nodescraper | System Platform: None
  2025-08-06 19:46:05 UTC       INFO               nodescraper | System location: SystemLocation.LOCAL
  2025-08-06 19:46:05 UTC       INFO               nodescraper | Initializing connection manager for InBandConnectionManager with default args
  2025-08-06 19:46:05 UTC       INFO               nodescraper | --------------------------------------------------
  2025-08-06 19:46:05 UTC       INFO               nodescraper | Running plugin RocmPlugin
  2025-08-06 19:46:05 UTC       INFO               nodescraper | Initializing connection: InBandConnectionManager
  2025-08-06 19:46:05 UTC       INFO               nodescraper | Using local shell
  2025-08-06 19:46:05 UTC       INFO               nodescraper | Checking OS family
  2025-08-06 19:46:05 UTC       INFO               nodescraper | OS Family: LINUX
  2025-08-06 19:46:05 UTC       INFO               nodescraper | Running data collector: RocmCollector
  2025-08-06 19:46:05 UTC       INFO               nodescraper | (RocmPlugin) ROCm: {'rocm_version': '7.0.0'}
  2025-08-06 19:46:05 UTC       INFO               nodescraper | Running data analyzer: RocmAnalyzer
  2025-08-06 19:46:05 UTC       INFO               nodescraper | (RocmPlugin) Expected ROCm not provided
  2025-08-06 19:46:05 UTC       INFO               nodescraper | Closing connections
  2025-08-06 19:46:05 UTC       INFO               nodescraper | Running result collators
  2025-08-06 19:46:05 UTC       INFO               nodescraper | Running TableSummary result collator
  2025-08-06 19:46:05 UTC       INFO               nodescraper |

+-------------------------+--------+-----------------------------+
| Connection              | Status | Message                     |
+-------------------------+--------+-----------------------------+
| InBandConnectionManager | UNSET  | task completed successfully |
+-------------------------+--------+-----------------------------+

+------------+--------+-------------------------------------+
| Plugin     | Status | Message                             |
+------------+--------+-------------------------------------+
| RocmPlugin | OK     | Plugin tasks completed successfully |
+------------+--------+-------------------------------------+

  2025-08-06 19:46:05 UTC       INFO               nodescraper | Data written to csv file: ./scraper_logs_therac61_2025_08_06-07_46_05_PM/nodescraper.csv


cmcknigh commented Aug 7, 2025

ROCm has changed the way it provides the version number.
The build number is no longer included in the version file;
you need to check the version-rocm or version-utils file to get the full M.m.p-build info from ROCm.
I believe some 6.x builds have those files, but it's a recent addition.
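
To illustrate the two formats described in this comment, here is a minimal sketch that parses a version string with or without a trailing build number. The regex, the `RocmVersion` type, and the `parse_rocm_version` helper are assumptions made for this example and are not the plugin's actual code; the build number in the second sample string is made up.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern: major.minor.patch with an optional "-build" suffix.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-(\d+))?$")


class RocmVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    build: Optional[int]  # None when the file carries no build number


def parse_rocm_version(raw: str) -> Optional[RocmVersion]:
    """Return the parsed version, or None if the text does not match."""
    match = _VERSION_RE.match(raw.strip())
    if match is None:
        return None
    major, minor, patch, build = match.groups()
    return RocmVersion(
        int(major),
        int(minor),
        int(patch),
        int(build) if build is not None else None,
    )


# Sample inputs: the first matches the log above, the second is invented.
print(parse_rocm_version("7.0.0\n"))
print(parse_rocm_version("6.2.0-66"))
```

Returning `None` for unrecognized text, rather than raising, lets the caller decide whether to try another file or report an error.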

@alexandraBara alexandraBara changed the title Rocm plugin update: updated regex for rocm version to accept 7.0.0 Rocm plugin update to accommodate for changes in 7.0 Aug 7, 2025
@alexandraBara


@cmcknigh I added version-rocm; should I be adding version-utils as well?

@cmcknigh cmcknigh left a comment


That looks like it should cover the case of pre and post 7.0

@alexandraBara alexandraBara marked this pull request as draft August 12, 2025 19:53
@alexandraBara alexandraBara marked this pull request as ready for review August 13, 2025 14:34

rocm_data = None
for path in version_paths:
    if Path(path).exists():

We should not use path.exists() as this would only be supported for local execution. _run_sut_cmd should be used for anything that interacts with the shell of the target node.
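
One way to picture the difference: `Path(path).exists()` inspects the filesystem of the machine running node-scraper, while a shell command travels over whatever connection reaches the target node. The sketch below uses a hypothetical `run_cmd` callable and `CmdResult` type as stand-ins. The real `_run_sut_cmd` signature and return type are not shown in this thread, so every name here is an assumption.

```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CmdResult:
    """Hypothetical stand-in for a command result object."""
    exit_code: int
    stdout: str


def local_runner(command: str) -> CmdResult:
    """Example runner for a local shell; a remote runner would use SSH instead."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return CmdResult(proc.returncode, proc.stdout)


def read_file_via_shell(
    run_cmd: Callable[[str], CmdResult], path: str
) -> Optional[str]:
    """Read a file through the command runner instead of the local filesystem.

    Returns the stripped contents, or None if the command failed
    (for example because the file does not exist on the target).
    """
    res = run_cmd(f"cat {shlex.quote(path)}")
    if res.exit_code != 0:
        return None
    return res.stdout.strip()
```

Because the file access goes through `run_cmd`, the same function works for local and remote targets; only the runner changes.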

Comment on lines 66 to 71
self._log_event(
    category=EventCategory.OS,
    description=f"Could not get ROCm version format from {path}",
    data={"raw_output": res.stdout},
    priority=EventPriority.ERROR,
)

May be better to log this at the end using for...else if no break was hit. Otherwise, we will always log an error event if "/opt/rocm/.info/version-rocm" does not exist, even if we are able to read the version from "/opt/rocm/.info/version". When we successfully read the version from either of the paths, we should not be logging error events.
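
The `for...else` pattern suggested here can be sketched as follows. The `first_readable_version` function and the `read_version` callable are hypothetical names used only for illustration, and a plain list of strings stands in for the plugin's event logging.

```python
from typing import Callable, List, Optional, Tuple


def first_readable_version(
    version_paths: List[str],
    read_version: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], List[str]]:
    """Try each path in order; record an error only if none of them worked."""
    events: List[str] = []  # stand-in for logged error events
    version: Optional[str] = None
    for path in version_paths:
        version = read_version(path)
        if version is not None:
            break  # success: the else clause below is skipped
    else:
        # Runs only when the loop finished without hitting break.
        events.append(f"Unable to read ROCm version from {version_paths}")
    return version, events


paths = ["/opt/rocm/.info/version-rocm", "/opt/rocm/.info/version"]

# Only the older file is present: no error event should be recorded.
fake_files = {"/opt/rocm/.info/version": "7.0.0"}
print(first_readable_version(paths, fake_files.get))

# Neither file is present: exactly one error event is recorded.
print(first_readable_version(paths, {}.get))
```

With this structure, a missing first path is not an error by itself; the error is raised once, after every candidate has been tried.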

else:
    self._log_event(
        category=EventCategory.OS,
        description=f"Could not get ROCm version format from {path}",

Since this is being logged at the end, we should reference all the paths that were checked rather than just the last one.

f"Unable to read ROCm version from {version_paths}"

@alexandraBara alexandraBara merged commit 5da03b4 into development Aug 18, 2025
5 checks passed
@alexandraBara alexandraBara deleted the alex_rocm_fix branch August 18, 2025 17:21
3 participants