Got loss=nan when training on gpu #730

@fkurushin

Description

I get loss=nan, and sometimes a CUDA error (again caused by the loss calculation), when training on GPU. With calculate_training_loss=False the model trains absolutely fine. With calculate_training_loss=True, I get:

Using GPU: 3
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [03:02<00:00, 12.19s/it, loss=nan]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25728193/25728193 [07:42<00:00, 55622.56it/s]
NDCG for qvec 0: 1.26 %
CuPy cache cleared on GPU 3
Using GPU: 4
  7%|█████████████▍                                                                                                                                                                                           | 1/15 [00:12<02:52, 12.33s/it]
Traceback (most recent call last):
  File "/home/fkurushin/personal-query-recommendations-research/ndcg_vs_sample_srategy.py", line 74, in <module>
    main(args.gpus, args.sparse_matrix_paths, args.n_factors)
  File "/home/fkurushin/personal-query-recommendations-research/ndcg_vs_sample_srategy.py", line 53, in main
    model.fit(train)
  File "/home/fkurushin/venv/implicit/lib/python3.11/site-packages/implicit/gpu/als.py", line 166, in fit
    loss = self.solver.calculate_loss(Cui, X, Y, self.regularization)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "_cuda.pyx", line 265, in implicit.gpu._cuda.LeastSquaresSolver.calculate_loss
RuntimeError: Cuda Error: an illegal memory access was encountered (/home/fkurushin/implicit/implicit/gpu/als.cu:276)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Cuda Error: an illegal memory access was encountered (/home/fkurushin/implicit/implicit/gpu/matrix.cu:246)
Aborted

I am pretty sure the problem is not caused by data types or variable overflow.
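
A minimal sketch of how the input could be checked to back up that claim. `check_interactions` is a hypothetical helper and not part of implicit. It assumes the training matrix is a scipy sparse matrix, and it looks only at the input data, not at the factors computed during training:

```python
import numpy as np
from scipy import sparse


def check_interactions(matrix):
    """Return a list of problems found in a sparse user-item matrix.

    Hypothetical helper for illustration only; not part of implicit.
    """
    csr = sparse.csr_matrix(matrix)
    problems = []

    # NaN or inf in the stored values would propagate into the loss.
    finite = np.isfinite(csr.data)
    if not finite.all():
        problems.append("non-finite values (NaN or inf) in data")

    # Values that do not fit into float32 would overflow when cast.
    finite_values = csr.data[finite]
    if finite_values.size:
        if np.abs(finite_values).max() > np.finfo(np.float32).max:
            problems.append("values exceed the float32 range")

    # Column indices outside the matrix shape can cause out-of-bounds reads.
    if csr.indices.size:
        if csr.indices.min() < 0 or csr.indices.max() >= csr.shape[1]:
            problems.append("column indices out of bounds")

    # The number of stored entries should fit into a 32-bit index.
    if csr.nnz > np.iinfo(np.int32).max:
        problems.append("nnz does not fit into int32")

    return problems
```

An empty list means none of these four checks fired; it does not rule out a problem inside the GPU loss kernel itself.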

Additional Information:

  • implicit: 0.7.2 (built from source)
  • Python: 3.11.2
  • CUDA: 12.3
  • OS: Debian GNU/Linux 12
  • Scipy: 1.14.0
