[Core] Ray 2.45.0+ dashboard fails to start #55255

@justinkeung1018

What happened + What you expected to happen

The dashboard failure reported in #53466 occurs in versions 2.45.0 and 2.48.0 when I start a Ray cluster using Slurm, but not in 2.44.1.

There seem to be two kinds of errors when starting the dashboard, both caused by commits made since 2.44.1. I'm not sure whether the two are related, so I'm filing one issue for both. Let me know if I should split them into separate issues instead.

First issue: dashboard fails to start despite submodules being loaded properly

This issue happens from this commit onwards.

Logs (Slurm output file)

2025-08-05 10:11:28,096	ERROR services.py:1351 -- Failed to start the dashboard 
2025-08-05 10:11:28,097	ERROR services.py:1376 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure' to find where the log file is.
2025-08-05 10:11:28,097	ERROR services.py:1420 -- 
The last 20 lines of /path/to/ray/session_xxx/logs/dashboard.log (it contains the error message from the dashboard): 
2025-08-05 10:11:09,051	INFO utils.py:326 -- Get all modules by type: DashboardHeadModule
2025-08-05 10:11:09,773	INFO utils.py:359 -- Available modules: [<class 'ray.dashboard.modules.actor.actor_head.ActorHead'>, <class 'ray.dashboard.modules.metrics.metrics_head.MetricsHead'>, <class 'ray.dashboard.modules.data.data_head.DataHead'>, <class 'ray.dashboard.modules.event.event_head.EventHead'>, <class 'ray.dashboard.modules.job.job_head.JobHead'>, <class 'ray.dashboard.modules.node.node_head.NodeHead'>, <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>, <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>, <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>, <class 'ray.dashboard.modules.state.state_head.StateHead'>, <class 'ray.dashboard.modules.train.train_head.TrainHead'>, <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>]
2025-08-05 10:11:09,773	INFO head.py:261 -- DashboardHeadModules to load: None.
2025-08-05 10:11:09,773	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.actor.actor_head.ActorHead'>.
2025-08-05 10:11:09,773	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.metrics.metrics_head.MetricsHead'>.
2025-08-05 10:11:09,773	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.data.data_head.DataHead'>.
2025-08-05 10:11:09,774	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.event.event_head.EventHead'>.
2025-08-05 10:11:09,774	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.job.job_head.JobHead'>.
2025-08-05 10:11:09,774	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.node.node_head.NodeHead'>.
2025-08-05 10:11:09,774	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>.
2025-08-05 10:11:09,774	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>.
2025-08-05 10:11:09,774	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>.
2025-08-05 10:11:09,774	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.state.state_head.StateHead'>.
2025-08-05 10:11:09,774	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.train.train_head.TrainHead'>.
2025-08-05 10:11:09,774	INFO head.py:264 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>.
2025-08-05 10:11:09,774	INFO head.py:268 -- Loaded 12 dashboard head modules: [<ray.dashboard.modules.actor.actor_head.ActorHead object at 0x7fffda6b7610>, <ray.dashboard.modules.metrics.metrics_head.MetricsHead object at 0x7fffdb3fe150>, <ray.dashboard.modules.data.data_head.DataHead object at 0x7ffff2369010>, <ray.dashboard.modules.event.event_head.EventHead object at 0x7ffff2368f50>, <ray.dashboard.modules.job.job_head.JobHead object at 0x7ffff2369f10>, <ray.dashboard.modules.node.node_head.NodeHead object at 0x7fffdae2f310>, <ray.dashboard.modules.reporter.reporter_head.ReportHead object at 0x7fffd9f9c710>, <ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl object at 0x7fffd9db5f90>, <ray.dashboard.modules.snapshot.snapshot_head.APIHead object at 0x7fffd9db5fd0>, <ray.dashboard.modules.state.state_head.StateHead object at 0x7ffff2368c90>, <ray.dashboard.modules.train.train_head.TrainHead object at 0x7fffd9db64d0>, <ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead object at 0x7fffd9db6650>].
2025-08-05 10:11:09,774	INFO utils.py:326 -- Get all modules by type: SubprocessModule
2025-08-05 10:11:09,776	INFO utils.py:359 -- Available modules: [<class 'ray.dashboard.modules.healthz.healthz_head.HealthzHead'>]
2025-08-05 10:11:09,776	INFO head.py:315 -- Loading SubprocessModule: <class 'ray.dashboard.modules.healthz.healthz_head.HealthzHead'>.
2025-08-05 10:11:07,812	INFO usage_lib.py:472 -- Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-08-05 10:11:07,813	INFO scripts.py:861 -- Local node IP: xxx
2025-08-05 10:11:29,318	SUCC scripts.py:897 -- --------------------
2025-08-05 10:11:29,318	SUCC scripts.py:898 -- Ray runtime started.
2025-08-05 10:11:29,318	SUCC scripts.py:899 -- --------------------
2025-08-05 10:11:29,318	INFO scripts.py:901 -- Next steps
2025-08-05 10:11:29,318	INFO scripts.py:904 -- To add another node to this Ray cluster, run
2025-08-05 10:11:29,318	INFO scripts.py:907 --   ray start --address='xxx'
2025-08-05 10:11:29,318	INFO scripts.py:916 -- To connect to this Ray cluster:
2025-08-05 10:11:29,318	INFO scripts.py:918 -- import ray
2025-08-05 10:11:29,318	INFO scripts.py:919 -- ray.init()
2025-08-05 10:11:29,318	INFO scripts.py:950 -- To terminate the Ray runtime, run
2025-08-05 10:11:29,319	INFO scripts.py:951 --   ray stop
2025-08-05 10:11:29,319	INFO scripts.py:954 -- To view the status of the cluster, use
2025-08-05 10:11:29,319	INFO scripts.py:955 --   ray status
2025-08-05 10:11:29,319	INFO scripts.py:1071 -- --block
2025-08-05 10:11:29,319	INFO scripts.py:1072 -- This command will now block forever until terminated by a signal.
2025-08-05 10:11:29,319	INFO scripts.py:1075 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
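
The dashboard.log tail above ends without any error line, so the other file named in the message, dashboard.err, may be where the actual failure is recorded. Below is a minimal sketch for printing the last 20 lines of both files. It assumes the logs directory is the placeholder path shown in the output; the helper names `last_lines` and `dashboard_tails` are hypothetical and not part of Ray.

```python
from collections import deque
from pathlib import Path


def last_lines(lines, n=20):
    """Return the final n items of an iterable of lines, without trailing newlines."""
    return [line.rstrip("\n") for line in deque(lines, maxlen=n)]


def dashboard_tails(log_dir, names=("dashboard.log", "dashboard.err"), n=20):
    """Map each log file that exists in log_dir to its last n lines."""
    tails = {}
    for name in names:
        path = Path(log_dir) / name
        if path.is_file():
            with open(path, "r", errors="replace") as handle:
                tails[name] = last_lines(handle, n)
    return tails


if __name__ == "__main__":
    # Placeholder path copied from the log output; replace it with the real session directory.
    for name, lines in dashboard_tails("/path/to/ray/session_xxx/logs").items():
        print(f"==> {name} <==")
        print("\n".join(lines))
```

Files that do not exist are skipped, so the same call works whether or not dashboard.err was created.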

Second issue: Module MetricsHead fails to start (as in #53466)

This issue happens from this commit onwards.

Logs

2025-08-05 10:57:47,739	ERROR services.py:1362 -- Failed to start the dashboard 
2025-08-05 10:57:47,739	ERROR services.py:1387 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure' to find where the log file is.
2025-08-05 10:57:47,739	ERROR services.py:1431 -- 
The last 20 lines of /path/to/ray/session_xxx/logs/dashboard.log (it contains the error message from the dashboard): 
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/path/to/uv/python/cpython-3.11.13-linux-x86_64-gnu/lib/python3.11/multiprocessing/connection.py", line 399, in _recv
    raise EOFError
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/.venv/lib/python3.11/site-packages/ray/dashboard/dashboard.py", line 294, in <module>
    loop.run_until_complete(dashboard.run())
  File "/path/to/uv/python/cpython-3.11.13-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/path/to/.venv/lib/python3.11/site-packages/ray/dashboard/dashboard.py", line 95, in run
    await self.dashboard_head.run()
  File "/path/to/.venv/lib/python3.11/site-packages/ray/dashboard/head.py", line 405, in run
    handle.wait_for_module_ready()
  File "/path/to/.venv/lib/python3.11/site-packages/ray/dashboard/subprocesses/handle.py", line 145, in wait_for_module_ready
    raise RuntimeError(
RuntimeError: Module MetricsHead failed to start. Received EOF from pipe.
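
For context on the `EOFError` in the traceback: a `multiprocessing` connection raises it from `recv()` when the other end of the pipe is closed before anything was sent, which is what would happen if the subprocess exited before reporting that it was ready. The sketch below reproduces only that mechanism with the standard library. It is not Ray's implementation; `wait_for_ready` and `demo` are hypothetical names, and closing the child end in the same process stands in for a subprocess that died early.

```python
import multiprocessing as mp


def wait_for_ready(parent_conn, name):
    """Block until the other end sends a message; turn EOF into a descriptive error."""
    try:
        return parent_conn.recv()
    except EOFError:
        # Raising inside the except block chains the two exceptions, which yields the
        # "During handling of the above exception, another exception occurred" text.
        raise RuntimeError(f"Module {name} failed to start. Received EOF from pipe.")


def demo():
    parent_conn, child_conn = mp.Pipe()
    # Stand-in for a crashed subprocess: the writing end closes without sending anything.
    child_conn.close()
    try:
        wait_for_ready(parent_conn, "ExampleModule")
    except RuntimeError as err:
        return str(err)
    finally:
        parent_conn.close()
    return "ready"


if __name__ == "__main__":
    print(demo())
```

The `RuntimeError` therefore only says that the module's process went away; the reason it exited has to come from that subprocess's own output.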

Versions / Dependencies

  • Ray version: >=2.45.0, including 2.48.0
  • Python version: 3.11.13
  • OS: RHEL 9.4
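
To confirm which side of the reported boundary an environment is on (2.44.1 works, 2.45.0 and later fail), the installed release can be compared against 2.45.0. This is a rough sketch with hypothetical helper names. It only inspects the version string, so it cannot distinguish nightly wheels, which all report 3.0.0.dev0 regardless of commit.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Turn '2.48.0' (or '2.48.0rc1') into a tuple of leading integers, e.g. (2, 48, 0)."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_affected(version_string, first_bad=(2, 45, 0)):
    """True if the release is at or after the first version reported as failing."""
    return parse_release(version_string) >= first_bad


if __name__ == "__main__":
    try:
        installed = version("ray")
    except PackageNotFoundError:
        print("ray is not installed in this environment")
    else:
        print(installed, "affected" if is_affected(installed) else "not affected")
```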

Reproduction script

test.sh

#!/bin/bash
# shellcheck disable=SC2206
#SBATCH --job-name=test
#SBATCH --nodelist=your-node # Change this
#SBATCH --time=1-00:00:00

PATH_TO_VENV=/path/to/venv/bin/activate
source $PATH_TO_VENV

# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)

head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Start the head node
srun --nodes=1 --ntasks=1 -w "$head_node" bash -c \
    "source $PATH_TO_VENV && ray start --head --node-ip-address=$head_node_ip --block"

Repro for 2.48.0

pyproject.toml

[project]
name = "test"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "ray[default]>=2.48.0",
]

Running the script

uv sync
sbatch test.sh

Repro for the two specific commits linked above

pyproject.toml

[project]
name = "test"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = []

Ray installation

Replace commit-sha as needed.

uv add "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/master/commit-sha/ray-3.0.0.dev0-cp311-cp311-manylinux2014_x86_64.whl"

Running the script

sbatch test.sh

Issue Severity

High: It blocks me from completing my task.

Metadata

    Labels

    bug: Something that is supposed to be working; but isn't
    community-backlog
    core: Issues that should be addressed in Ray Core
    observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
    regression
    stability
    triage: Needs triage (eg: priority, bug/not-bug, and owning component)