Skip to content

Improve other_operation_in_progress error users, fix handling of its delays #6588

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

last-genius
Copy link
Contributor

@last-genius last-genius commented Jul 14, 2025

other_operation_in_progress is reported inconsistently, with some users only specifying [class; class_ref], and others also op_type. Make it consistent, changing all users to report [class; class_ref; op_type; op_ref]. This makes it much easier to navigate the logs, since the operation blocking progress is clearly referenced.

Compare before, where it's unclear which operation is precluding progress:

Caught exception while marking SR for VDI.clone in message forwarder:
OTHER_OPERATION_IN_PROGRESS:
[ SR; OpaqueRef:d019f26a-b2e8-529b-bb66-fd4f008d4f82; VDI.clone ]

And after, with the ID of the operation present:

Caught exception while marking SR for VDI.clone in message forwarder:
OTHER_OPERATION_IN_PROGRESS:
[ SR; OpaqueRef:b0f54a40-485f-f48f-fdc7-6db28a64d3fd;
  scan; OpaqueRef:832fb22d-2ce5-0d26-8a00-a4165e78b34d ]

In addition, fix incorrect operation reporting, so that when progress is blocked only when particular types of operations are present in current_operations, these specific operations are reported, and not just any current operation.

Also fix handling of other_operation_in_progress delays - some of the users would miss the match arm and use the non-interruptible Thread.sleep instead of the interruptible Delay.sleep that provides a mechanism for the blocking task to wake us up on its way out.

…ion_in_progress error

Compare before, where it's unclear which operation is precluding progress:
```
Caught exception while marking SR for VDI.clone in message forwarder:
OTHER_OPERATION_IN_PROGRESS:
[ SR; OpaqueRef:d019f26a-b2e8-529b-bb66-fd4f008d4f82; VDI.clone ]
```

And after, where the ID of the operation makes it easier to navigate the logs:
```
Caught exception while marking SR for VDI.clone in message forwarder:
OTHER_OPERATION_IN_PROGRESS:
[ SR; OpaqueRef:b0f54a40-485f-f48f-fdc7-6db28a64d3fd;
  scan; OpaqueRef:832fb22d-2ce5-0d26-8a00-a4165e78b34d ]
```

Also, avoid iterating over all the current_operations when looking for
non-parallelisable ops, stop on the first one. Report the non-parallelisable op
instead of the first current operation as the reason for the error as well.

Signed-off-by: Andrii Sultanov <[email protected]>
@last-genius last-genius force-pushed the asv/improve-other-operation-logs branch from b122c68 to b332213 Compare July 14, 2025 09:47
set_errors Api_errors.other_operation_in_progress
["VBD"; _ref; vbd_operations_to_string (List.hd current_ops)]
[`unpause] ;
let set_difference a b = List.filter (fun (_, x) -> not (List.mem x b)) a in
Copy link
Member

@psafont psafont Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Listext.List.set_difference is used further up ahead: you could replace all users, or keep using listext here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a reimplementation of set_difference for this particular use case - I need to ignore the reference, only comparing operation types.

@@ -277,7 +282,15 @@ let check_operation_error ~__context ?sr_records:_ ?(pbd_records = [])
mechanism of message forwarding and only use the event loop. *)
my_has_current_operation_vbd_records <> [] && op <> `data_destroy
then
Error (Api_errors.other_operation_in_progress, ["VDI"; _ref])
let op_ref, op_type =
List.hd
Copy link
Member

@psafont psafont Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two more list.hd, a bit awkward that is has to be coded this way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but sadly replacing this with match over options as elsewhere makes the logic more inscrutable, so I've kept it as is.

Copy link
Contributor

@changlei-li changlei-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little big, I haven't gone through it.

other_operation_in_progress has separate fields for the main class of the
object ("Cluster" or "Cluster_host") and the operation that's blocking
progress, do not concatenate these into one string. Also report the task ref
that's blocking progress.

Signed-off-by: Andrii Sultanov <[email protected]>
other_operation_in_progress has separate fields for the main class of the
object ("VM") and the operation that's blocking progress, do not concatenate
these into one string. Also report the task ref that's blocking progress.

Signed-off-by: Andrii Sultanov <[email protected]>
…n_in_progress

Fix incorrect operations being reported - e.g. when other_operation_in_progress
is reported only when there are non-parallelisable operations currently
happening, we can't just report the first current op, we need to report the
first *non-parallelisable* op. This required some refactoring, moving to
options instead of bools, etc.

Additionally report the blocking operation's reference when raising the error.

Signed-off-by: Andrii Sultanov <[email protected]>
…other_operation_in_progress

Instead of logging all the blocking operations with 'debug', just report them
with the error.

Signed-off-by: Andrii Sultanov <[email protected]>
…ation_in_progress error

Signed-off-by: Andrii Sultanov <[email protected]>
Not all of other_operation_in_progress error contain two objects in the list
([cls; objref]), some contain three or four, but the match would miss them and
they would be put into a non-interruptible Thread.sleep instead of the
interruptible Delay.sleep that provides a mechanism for the other task to wake
us up on its way out.

Signed-off-by: Andrii Sultanov <[email protected]>
…progress

Most of the users report these when appropriate, so document them here.

Signed-off-by: Andrii Sultanov <[email protected]>
quality-gate: Reduce expected List.hd count
test_clustering: test for the blocking operation to be reported

Signed-off-by: Andrii Sultanov <[email protected]>
@last-genius last-genius force-pushed the asv/improve-other-operation-logs branch from b332213 to 9be1b7d Compare July 16, 2025 10:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants