Optimize GPU usage in reward models #82

Open
@p-ferreira

Description

Some of the validators are intermittently hitting CUDA out-of-memory (OOM) errors, including the test validator.

https://wandb.ai/opentensor-dev/openvalidators/runs/7p6prmo1/logs?workspace=user-opentensor-pedro

My initial hypothesis is that tensors are accumulating in GPU memory over time until they hit the device limit. Since a validator is expected to run for days, it would be worth identifying the places where GPU memory management can be improved so we never reach the OOM point.
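
If the hypothesis holds, the usual culprits are reward scores that stay on the GPU (and possibly remain attached to the autograd graph) across many scoring steps. As a starting point, here is a minimal sketch of the pattern I have in mind, assuming a PyTorch/transformers-style reward model; `reward_model`, `tokenizer`, and `score_completions` are hypothetical names, not our actual API:

```python
import torch

def score_completions(reward_model, tokenizer, completions, device="cuda"):
    """Score a batch of completions without letting tensors pile up on the GPU."""
    inputs = tokenizer(
        completions, return_tensors="pt", padding=True, truncation=True
    ).to(device)

    # inference_mode() keeps autograd from retaining activations for backward.
    with torch.inference_mode():
        scores = reward_model(**inputs).logits.squeeze(-1)

    # Move results off the GPU before storing them anywhere long-lived,
    # so the only tensors left resident on the device are the model weights.
    scores = scores.float().cpu()

    # Drop references to the batch tensors so their memory can be reused.
    del inputs
    return scores
```

To confirm or rule out the hypothesis, the main loop could also log `torch.cuda.memory_allocated()` every N steps: if it grows monotonically between steps, something is still holding device references. Note that `torch.cuda.empty_cache()` only returns cached blocks to the driver and won't fix a real reference leak, but it can help distinguish fragmentation from accumulation.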
