Got loss=nan, and training on GPU sometimes fails with a CUDA error (again, the loss calculation causes it). When I set calculate_training_loss=False, the model trains absolutely fine. If calculate_training_loss=True, then:
Using GPU: 3
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [03:02<00:00, 12.19s/it, loss=nan]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25728193/25728193 [07:42<00:00, 55622.56it/s]
NDCG for qvec 0: 1.26 %
CuPy cache cleared on GPU 3
Using GPU: 4
7%|█████████████▍ | 1/15 [00:12<02:52, 12.33s/it]
Traceback (most recent call last):
File "/home/fkurushin/personal-query-recommendations-research/ndcg_vs_sample_srategy.py", line 74, in <module>
main(args.gpus, args.sparse_matrix_paths, args.n_factors)
File "/home/fkurushin/personal-query-recommendations-research/ndcg_vs_sample_srategy.py", line 53, in main
model.fit(train)
File "/home/fkurushin/venv/implicit/lib/python3.11/site-packages/implicit/gpu/als.py", line 166, in fit
loss = self.solver.calculate_loss(Cui, X, Y, self.regularization)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "_cuda.pyx", line 265, in implicit.gpu._cuda.LeastSquaresSolver.calculate_loss
RuntimeError: Cuda Error: an illegal memory access was encountered (/home/fkurushin/implicit/implicit/gpu/als.cu:276)
terminate called after throwing an instance of 'std::runtime_error'
what(): Cuda Error: an illegal memory access was encountered (/home/fkurushin/implicit/implicit/gpu/matrix.cu:246)
Aborted
And I am pretty sure this problem is not caused by data types or variable overflow.
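For reference, a minimal sketch of the kind of setup that triggers this (the original ndcg_vs_sample_srategy.py is not shown here, so the matrix sizes, factors value, and random data are assumptions, not the real inputs):

```python
# Minimal sketch (assumed setup, not the original script):
# GPU ALS with calculate_training_loss=True on a random sparse matrix.
import numpy as np
import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares

rng = np.random.default_rng(42)
n_users, n_items, nnz = 100_000, 50_000, 1_000_000  # assumed sizes, not from the report

rows = rng.integers(0, n_users, nnz)
cols = rng.integers(0, n_items, nnz)
vals = rng.random(nnz).astype(np.float32)
train = sp.csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))

model = AlternatingLeastSquares(
    factors=64,                    # assumed; the real run passes n_factors via CLI
    regularization=0.01,
    iterations=15,
    calculate_training_loss=True,  # with False, the same run finishes fine
    use_gpu=True,
)
model.fit(train)                   # fails inside LeastSquaresSolver.calculate_loss on GPU
```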
Additional Information:
- implicit: 0.7.2 (built from source)
- Python: 3.11.2
- CUDA: 12.3
- OS: Debian GNU/Linux 12
- Scipy: 1.14.0