
Do not auto-retry gRPC-message-size-too-large errors #2604


Open · wants to merge 6 commits into master from grpc-message-too-large

Conversation

maciejdudko

What was changed

  • GrpcRetryer no longer retries a gRPC call when the failure is caused by the message size limit; instead it wraps the error in the new GrpcMessageTooLargeException and throws it (see the sketch after this list).
  • WorkflowWorker catches GrpcMessageTooLargeException when reporting workflow task completion or failure, and reports the task as failed with this error. This does NOT prevent server-side retry of the workflow task, but it does record the error in event history for easier debugging.
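
A minimal sketch of the retry decision described in the first bullet. `GrpcMessageTooLargeException` is the type added by this PR; the retry loop and the detection heuristic below are illustrative assumptions, not the SDK's actual `GrpcRetryer` code:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.function.Supplier;

final class NonRetryableSizeLimitSketch {
  // Calls the supplier, retrying most gRPC failures but surfacing
  // message-size-limit errors immediately instead of retrying them.
  static <T> T callWithRetry(Supplier<T> call, int maxAttempts) {
    for (int attempt = 1; ; attempt++) {
      try {
        return call.get();
      } catch (StatusRuntimeException e) {
        if (isMessageTooLarge(e) || attempt >= maxAttempts) {
          // The PR wraps the error in GrpcMessageTooLargeException at this point
          // and throws it without further retries.
          throw e;
        }
        // Otherwise fall through to the normal retry policy (backoff omitted).
      }
    }
  }

  static boolean isMessageTooLarge(StatusRuntimeException e) {
    String description = e.getStatus().getDescription();
    return e.getStatus().getCode() == Status.Code.RESOURCE_EXHAUSTED
        && description != null
        && (description.startsWith("grpc: received message larger than max")
            || description.startsWith(
                "grpc: received message after decompression larger than max"));
  }
}
```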

Why?

Feature request: temporalio/features#624

Checklist

  1. Closes Do not auto-retry gRPC-message-size-too-large errors #1585

  2. How was this tested: added tests to GrpcSyncRetryerTest, GrpcAsyncRetryerTest and GrpcMessageTooLargeTest.

@maciejdudko requested a review from a team as a code owner on July 21, 2025 22:01
@@ -236,7 +236,9 @@ public static final class Builder {
private Duration nextRetryDelay;
private ApplicationErrorCategory category;

private Builder() {}
Contributor:

I already fixed this in a different PR :)

Author:

Undid the change.

@@ -0,0 +1,7 @@
package io.temporal.internal.retryer;

public class GrpcMessageTooLargeException extends RuntimeException {
Contributor:

Is this type really internal? If a user makes a client call with a payload that is too large, won't that client call throw this exception?

Author:

As per the conversation, the exception is internal, but I changed it to derive from StatusRuntimeException so the user can catch it that way.
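
Based on this discussion, the revised exception might look roughly like the following; the constructor signature is an assumption inferred from the usage shown further down (`new GrpcMessageTooLargeException(status, trailers)`):

```java
import io.grpc.Metadata;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

// Still lives in an internal package, but extends StatusRuntimeException so that
// user code catching StatusRuntimeException also catches this error.
public class GrpcMessageTooLargeException extends StatusRuntimeException {
  public GrpcMessageTooLargeException(Status status, Metadata trailers) {
    super(status, trailers);
  }
}
```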

result.getRequestRetryOptions(),
workflowTypeScope);
-      } else if (queryCompleted != null) {
+      if (queryCompleted != null) {
sendDirectQueryCompletedResponse(
Contributor:

The direct query response can also fail due to gRPC message size too large, so I think that needs to be under the try as well, no?

Author:

Done.

@maciejdudko force-pushed the grpc-message-too-large branch from b3bad78 to 6444df3 on July 28, 2025 16:40
.startsWith("grpc: received message after decompression larger than max"))) {
return new GrpcMessageTooLargeException(status.withCause(exception), exception.getTrailers());
} else {
return null;
Contributor:

nit: you could just change the return type and return the original StatusRuntimeException, since all the callers just do that anyway.

Author:

The logic is different depending on whether the wrap was successful or not: GrpcMessageTooLargeException is sent as a failure to the server, while any other StatusRuntimeException is just rethrown. Technically it could be an instanceof check, but I think a null check is cleaner.
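
To illustrate the null-check pattern described here, a small sketch follows; the helper and method names are hypothetical (not the PR's actual identifiers), and it reuses the GrpcMessageTooLargeException type sketched above:

```java
import io.grpc.StatusRuntimeException;

final class WrapOrRethrowSketch {
  // Hypothetical helper mirroring the diff above: returns a wrapped exception
  // for message-size-limit errors, otherwise null.
  static GrpcMessageTooLargeException wrapIfMessageTooLarge(StatusRuntimeException e) {
    return null; // detection logic elided; see the startsWith checks in the diff
  }

  static void handleReportingError(StatusRuntimeException e) {
    GrpcMessageTooLargeException tooLarge = wrapIfMessageTooLarge(e);
    if (tooLarge != null) {
      reportWorkflowTaskFailure(tooLarge); // recorded in event history for debugging
    } else {
      throw e; // any other gRPC error keeps the existing rethrow behavior
    }
  }

  static void reportWorkflowTaskFailure(GrpcMessageTooLargeException cause) {
    // Placeholder for the WorkflowWorker task-failure reporting call.
  }
}
```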

@Quinn-With-Two-Ns (Contributor):

Looks good! Did we plan to add a Temporal Cloud test as part of this PR? If not, that is okay, but can we at least manually test against it?

workflowExecution.getWorkflowId(),
tooLargeException,
"Failed to send query response");
RespondQueryTaskCompletedRequest.Builder queryFailedBuilder =
Contributor:

Sorry I missed this in the initial review; we should be failing the workflow task here.

Contributor:

Talked offline: in Java we seem to always use RespondQueryTaskCompletedRequest for queries, even if there was something that would normally fail the workflow task.
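
A rough sketch of answering a query with a failure (rather than failing the workflow task); the builder methods are assumed from the standard generated RespondQueryTaskCompletedRequest API and may not match the PR's exact code:

```java
import com.google.protobuf.ByteString;
import io.temporal.api.enums.v1.QueryResultType;
import io.temporal.api.workflowservice.v1.RespondQueryTaskCompletedRequest;

final class QueryFailureResponseSketch {
  // Builds a "query failed" response carrying the error message, so the query
  // fails cleanly instead of the workflow task being failed.
  static RespondQueryTaskCompletedRequest failedQueryResponse(
      String namespace, ByteString taskToken, Throwable cause) {
    return RespondQueryTaskCompletedRequest.newBuilder()
        .setNamespace(namespace)
        .setTaskToken(taskToken)
        .setCompletedType(QueryResultType.QUERY_RESULT_TYPE_FAILED)
        .setErrorMessage(String.valueOf(cause.getMessage()))
        .build();
  }
}
```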
