Replace TimerOutput::Scope to avoid deadlocks #6664
Related to #6663.
I would like to discuss replacing all instances of `TimerOutput::Scope` with `timer.enter_subsection`/`leave_subsection` in ASPECT. The problem with `TimerOutput::Scope` was discussed in dealii/dealii#12248, but I don't see a good path to fix this consistently inside deal.II (at least not without changing the behavior of `TimerOutput::Scope`). In essence, because the destructor of `TimerOutput::Scope` triggers MPI communication, it is very prone to deadlocks if an exception is not triggered on all ranks, a scenario that is pretty common in ASPECT. In such a case, the stack unwinding on the throwing MPI rank needs to reach an `MPI_Abort` statement without triggering MPI communication; otherwise we deadlock. Using `computing_timer.leave_subsection();` means the throwing MPI rank will not leave the subsection and at least has the possibility to unwind the stack.

I am aware that this undoes #2087, but I feel the current situation is not sustainable. About once a year I spend a day chasing down a deadlock, which would usually be fixed in minutes if I had the correct error message on the screen. Additionally, we occasionally have users reporting stalling simulations in the forum (e.g. https://community.geodynamics.org/t/aspect-hangs-at-same-timesteps-when-using-isosurfaces-stratagey/3917/2). Of course not all the reports will be caused by this, but I suspect a number are, and this is almost impossible to debug for a new user.
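To illustrate the difference between the two patterns, here is a minimal sketch that uses neither deal.II nor MPI. `MockTimer` and `MockScope` are hypothetical stand-ins invented for this example: the mock `leave_subsection()` only writes a log entry where the real call would communicate. The sketch shows that a scope-style destructor runs while the stack unwinds after an exception, whereas an explicit `leave_subsection()` call placed after the throwing code is skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a timer (NOT the deal.II class). Where the real
// leave_subsection() would communicate, this mock only logs the call.
struct MockTimer
{
  std::vector<std::string> log;

  void enter_subsection(const std::string &name)
  {
    log.push_back("enter " + name);
  }

  void leave_subsection()
  {
    log.push_back("leave");
  }
};

// Hypothetical RAII helper that follows the scope pattern: enter in the
// constructor, leave in the destructor.
struct MockScope
{
  MockTimer &timer;

  MockScope(MockTimer &t, const std::string &name) : timer(t)
  {
    timer.enter_subsection(name);
  }

  ~MockScope()
  {
    timer.leave_subsection(); // also runs during stack unwinding
  }
};

// Models an error that occurs on one rank only.
void failing_work()
{
  throw std::runtime_error("error on this rank only");
}

void with_scope(MockTimer &timer)
{
  MockScope scope(timer, "assemble");
  failing_work();
}

void with_explicit_calls(MockTimer &timer)
{
  timer.enter_subsection("assemble");
  failing_work();
  timer.leave_subsection(); // skipped when failing_work() throws
}

// Runs one variant, swallows the exception, and counts how many "leave"
// calls were made before the exception reached the handler.
std::size_t leaves_after_throw(void (*run)(MockTimer &))
{
  MockTimer timer;
  try
  {
    run(timer);
  }
  catch (const std::exception &)
  {
    // A real program would report the error and abort here.
  }

  std::size_t count = 0;
  for (const std::string &entry : timer.log)
    if (entry == "leave")
      ++count;
  return count;
}
```

In this mock, `leaves_after_throw(with_scope)` returns 1 and `leaves_after_throw(with_explicit_calls)` returns 0. With a real timer, that single destructor call is where the throwing rank would start communication that the other ranks never join.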