Skip to content

Conversation

gassmoeller
Copy link
Member

Related to #6663.

I would like to discuss replacing all instances of TimerOutput::Scope with timer.enter_subsection/leave_subsection in ASPECT. The problem with TimerOutput::Scope was discussed in dealii/dealii#12248, but I dont see a good path to fix this consistently inside deal.II (at least not without changing the behavior of TimerOutput::Scope). In essence, because the destructor of TimerOutput::Scope triggers MPI communication it is very prone to deadlocks if an exception is not triggered on all ranks, a scenario that is pretty common in ASPECT. In such a case the unwinding of the stack of the throwing MPI rank needs to reach an MPI_abort statement without triggering MPI communication otherwise we deadlock. Using computing_timer.leave_subsection(); means the throwing MPI rank will not leave the subsection and at least have the possibility to unwind the stack.

I am aware that this undoes #2087, but I feel the current situation is not sustainable. About once a year I spend a day chasing down a deadlock, which would usually be fixed in minutes if I had the correct error message on the screen. Additionally, we occasionally have users reporting stalling simulations in the forum (e.g. https://community.geodynamics.org/t/aspect-hangs-at-same-timesteps-when-using-isosurfaces-stratagey/3917/2). Of course not all the reports will be caused by this, but I suspect a number are, and this is almost impossible to debug for a new user.

@tjhei
Copy link
Member

tjhei commented Sep 5, 2025

Considering we have about 40 Scopes in ASPECT, I don't think it is a significant effort/headache to convert them to enter/leave. We have functions that have complicated return patterns, though. There we need to be careful to handle this correctly. But I am not opposed to just making this change for now.

That said, this is not only an ASPECT related issue but affects any deal.II code. Maybe there is a way to fix this the right way in deal.II instead? For example, one could argue that the MPI communication for timing section would not be necessary to happen at this point in time but only before the user asks for the TimerOutput text output. This might be more complicated of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants