Open
Labels
- bug: Something that is supposed to be working; but isn't
- community-backlog
- core: Issues that should be addressed in Ray Core
- observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
- regression
- stability
- triage: Needs triage (eg: priority, bug/not-bug, and owning component)
Description
What happened + What you expected to happen
The issue described in #53466 occurs in versions 2.45.0 and 2.48.0 when I try to start a Ray cluster using Slurm; the same setup works with 2.44.1.
There seem to be two kinds of errors when starting the dashboard, both caused by commits since 2.44.1. I'm not sure whether the two are related, so I'm filing one issue for both. Let me know if I should split them into separate issues instead.
First issue: dashboard fails to start despite submodules being loaded properly
This issue happens from this commit onwards.
Logs (Slurm output file)
2025-08-05 10:11:28,096 ERROR services.py:1351 -- Failed to start the dashboard
2025-08-05 10:11:28,097 ERROR services.py:1376 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure' to find where the log file is.
2025-08-05 10:11:28,097 ERROR services.py:1420 --
The last 20 lines of /path/to/ray/session_xxx/logs/dashboard.log (it contains the error message from the dashboard):
2025-08-05 10:11:09,051 INFO utils.py:326 -- Get all modules by type: DashboardHeadModule
2025-08-05 10:11:09,773 INFO utils.py:359 -- Available modules: [<class 'ray.dashboard.modules.actor.actor_head.ActorHead'>, <class 'ray.dashboard.modules.metrics.metrics_head.MetricsHead'>, <class 'ray.dashboard.modules.data.data_head.DataHead'>, <class 'ray.dashboard.modules.event.event_head.EventHead'>, <class 'ray.dashboard.modules.job.job_head.JobHead'>, <class 'ray.dashboard.modules.node.node_head.NodeHead'>, <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>, <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>, <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>, <class 'ray.dashboard.modules.state.state_head.StateHead'>, <class 'ray.dashboard.modules.train.train_head.TrainHead'>, <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>]
2025-08-05 10:11:09,773 INFO head.py:261 -- DashboardHeadModules to load: None.
2025-08-05 10:11:09,773 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.actor.actor_head.ActorHead'>.
2025-08-05 10:11:09,773 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.metrics.metrics_head.MetricsHead'>.
2025-08-05 10:11:09,773 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.data.data_head.DataHead'>.
2025-08-05 10:11:09,774 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.event.event_head.EventHead'>.
2025-08-05 10:11:09,774 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.job.job_head.JobHead'>.
2025-08-05 10:11:09,774 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.node.node_head.NodeHead'>.
2025-08-05 10:11:09,774 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>.
2025-08-05 10:11:09,774 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>.
2025-08-05 10:11:09,774 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>.
2025-08-05 10:11:09,774 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.state.state_head.StateHead'>.
2025-08-05 10:11:09,774 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.train.train_head.TrainHead'>.
2025-08-05 10:11:09,774 INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>.
2025-08-05 10:11:09,774 INFO head.py:268 -- Loaded 12 dashboard head modules: [<ray.dashboard.modules.actor.actor_head.ActorHead object at 0x7fffda6b7610>, <ray.dashboard.modules.metrics.metrics_head.MetricsHead object at 0x7fffdb3fe150>, <ray.dashboard.modules.data.data_head.DataHead object at 0x7ffff2369010>, <ray.dashboard.modules.event.event_head.EventHead object at 0x7ffff2368f50>, <ray.dashboard.modules.job.job_head.JobHead object at 0x7ffff2369f10>, <ray.dashboard.modules.node.node_head.NodeHead object at 0x7fffdae2f310>, <ray.dashboard.modules.reporter.reporter_head.ReportHead object at 0x7fffd9f9c710>, <ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl object at 0x7fffd9db5f90>, <ray.dashboard.modules.snapshot.snapshot_head.APIHead object at 0x7fffd9db5fd0>, <ray.dashboard.modules.state.state_head.StateHead object at 0x7ffff2368c90>, <ray.dashboard.modules.train.train_head.TrainHead object at 0x7fffd9db64d0>, <ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead object at 0x7fffd9db6650>].
2025-08-05 10:11:09,774 INFO utils.py:326 -- Get all modules by type: SubprocessModule
2025-08-05 10:11:09,776 INFO utils.py:359 -- Available modules: [<class 'ray.dashboard.modules.healthz.healthz_head.HealthzHead'>]
2025-08-05 10:11:09,776 INFO head.py:315 -- Loading SubprocessModule: <class 'ray.dashboard.modules.healthz.healthz_head.HealthzHead'>.
2025-08-05 10:11:07,812 INFO usage_lib.py:472 -- Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-08-05 10:11:07,813 INFO scripts.py:861 -- Local node IP: xxx
2025-08-05 10:11:29,318 SUCC scripts.py:897 -- --------------------
2025-08-05 10:11:29,318 SUCC scripts.py:898 -- Ray runtime started.
2025-08-05 10:11:29,318 SUCC scripts.py:899 -- --------------------
2025-08-05 10:11:29,318 INFO scripts.py:901 -- Next steps
2025-08-05 10:11:29,318 INFO scripts.py:904 -- To add another node to this Ray cluster, run
2025-08-05 10:11:29,318 INFO scripts.py:907 --  ray start --address='xxx'
2025-08-05 10:11:29,318 INFO scripts.py:916 -- To connect to this Ray cluster:
2025-08-05 10:11:29,318 INFO scripts.py:918 -- import ray
2025-08-05 10:11:29,318 INFO scripts.py:919 -- ray.init()
2025-08-05 10:11:29,318 INFO scripts.py:950 -- To terminate the Ray runtime, run
2025-08-05 10:11:29,319 INFO scripts.py:951 --  ray stop
2025-08-05 10:11:29,319 INFO scripts.py:954 -- To view the status of the cluster, use
2025-08-05 10:11:29,319 INFO scripts.py:955 -- ray status
2025-08-05 10:11:29,319 INFO scripts.py:1071 -- --block
2025-08-05 10:11:29,319 INFO scripts.py:1072 -- This command will now block forever until terminated by a signal.
2025-08-05 10:11:29,319 INFO scripts.py:1075 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
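In this first case the Slurm output contains both "Failed to start the dashboard" and "Ray runtime started.", so the failure is easy to miss. A minimal sketch of a check that could be run against the Slurm output file is shown here. The function name `check_ray_start_log` and its output strings are hypothetical and not part of Ray; it only greps for the two messages that appear in the log above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Ray): inspect a Slurm output file and
# report whether the dashboard failure message from the log above is present.
check_ray_start_log() {
    local log_file="$1"
    if grep -q 'Failed to start the dashboard' "$log_file"; then
        # The runtime can still report success even though the dashboard failed.
        if grep -q 'Ray runtime started' "$log_file"; then
            echo "runtime started, dashboard failed"
        else
            echo "dashboard failed"
        fi
        return 1
    fi
    echo "no dashboard failure reported"
}
```

For example, `check_ray_start_log slurm-<jobid>.out` (the file name depends on your Slurm configuration) returns a non-zero status if the dashboard error was logged.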
Second issue: Module MetricsHead fails to start (as in #53466)
This issue happens from this commit onwards.
Logs
2025-08-05 10:57:47,739 ERROR services.py:1362 -- Failed to start the dashboard
2025-08-05 10:57:47,739 ERROR services.py:1387 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure' to find where the log file is.
2025-08-05 10:57:47,739 ERROR services.py:1431 --
The last 20 lines of /path/to/ray/session_xxx/logs/dashboard.log (it contains the error message from the dashboard):
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/path/to/uv/python/cpython-3.11.13-linux-x86_64-gnu/lib/python3.11/multiprocessing/connection.py", line 399, in _recv
    raise EOFError
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/.venv/lib/python3.11/site-packages/ray/dashboard/dashboard.py", line 294, in <module>
    loop.run_until_complete(dashboard.run())
  File "/path/to/uv/python/cpython-3.11.13-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.11/site-packages/ray/dashboard/dashboard.py", line 95, in run
    await self.dashboard_head.run()
  File "/path/to/.venv/lib/python3.11/site-packages/ray/dashboard/head.py", line 405, in run
    handle.wait_for_module_ready()
  File "/path/to/.venv/lib/python3.11/site-packages/ray/dashboard/subprocesses/handle.py", line 145, in wait_for_module_ready
    raise RuntimeError(
RuntimeError: Module MetricsHead failed to start. Received EOF from pipe.
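To tell the two failure modes apart quickly, it may help to pull the failing module name out of dashboard.log. The sketch below assumes the message keeps the "Module <Name> failed to start" wording seen in the traceback above. The function name `failed_dashboard_modules` is hypothetical and not a Ray command.

```shell
#!/bin/bash
# Hypothetical helper (not part of Ray): list the dashboard modules that a
# log file reports as "Module <Name> failed to start", one name per line.
failed_dashboard_modules() {
    local log_file="$1"
    # Keep only the module name from matching lines, then de-duplicate.
    sed -n 's/.*Module \([A-Za-z0-9_]*\) failed to start.*/\1/p' "$log_file" | sort -u
}
```

Running `failed_dashboard_modules /path/to/ray/session_xxx/logs/dashboard.log` on the log from the second issue should print the module named in the RuntimeError line. On the log from the first issue it prints nothing, since no module failure is reported there.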
Versions / Dependencies
- Ray version: >=2.45.0, including 2.48.0
- Python version: 3.11.13
- OS: RHEL 9.4
Reproduction script
test.sh
#!/bin/bash
# shellcheck disable=SC2206
#SBATCH --job-name=test
#SBATCH --nodelist=your-node # Change this
#SBATCH --time=1-00:00:00
PATH_TO_VENV=/path/to/venv/bin/activate
source "$PATH_TO_VENV"
# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# Start the head node
srun --nodes=1 --ntasks=1 -w "$head_node" bash -c \
"source $PATH_TO_VENV && ray start --head --node-ip-address=$head_node_ip --block"
Repro for 2.48.0
pyproject.toml
[project]
name = "test"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
"ray[default]>=2.48.0",
]
Running the script
uv sync
sbatch test.sh
Repro for the two specific commits linked above
pyproject.toml
[project]
name = "test"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = []
Ray installation
Replace commit-sha in the URL below as needed.
uv add "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/master/commit-sha/ray-3.0.0.dev0-cp311-cp311-manylinux2014_x86_64.whl"
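When testing several commits, substituting the SHA by hand is error-prone. Here is a small sketch that builds the wheel URL from a commit SHA. The function name `ray_wheel_url` is hypothetical; the URL pattern is copied from the command above and assumes the same Python 3.11 / manylinux2014 x86_64 wheel.

```shell
#!/bin/bash
# Hypothetical helper: build the wheel URL used above for a given commit SHA.
# Only the commit SHA is substituted; the rest of the URL is taken verbatim.
ray_wheel_url() {
    local commit_sha="$1"
    if [[ -z "$commit_sha" ]]; then
        echo "usage: ray_wheel_url <commit-sha>" >&2
        return 1
    fi
    printf 'https://s3-us-west-2.amazonaws.com/ray-wheels/master/%s/ray-3.0.0.dev0-cp311-cp311-manylinux2014_x86_64.whl\n' "$commit_sha"
}
```

It can then be used as `uv add "ray[default] @ $(ray_wheel_url commit-sha)"`, with commit-sha replaced as before.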
Running the script
sbatch test.sh
Issue Severity
High: It blocks me from completing my task.