Do not auto-retry gRPC-message-size-too-large errors #2604
base: master
Conversation
@@ -236,7 +236,9 @@ public static final class Builder {
    private Duration nextRetryDelay;
    private ApplicationErrorCategory category;

    private Builder() {}
I already fixed this in a different PR :)
Undid the change.
@@ -0,0 +1,7 @@
package io.temporal.internal.retryer;

public class GrpcMessageTooLargeException extends RuntimeException {
Is this type really internal? If a user makes a client call with a payload that is too large, won't that client call throw this exception?
As per conversation, the exception is internal, but I changed it so it derives from StatusRuntimeException so the user can catch it that way.
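For context, a minimal sketch of what deriving from StatusRuntimeException could look like; the PR's actual constructor and any extra fields may differ:

package io.temporal.internal.retryer;

import io.grpc.Metadata;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

// Sketch only: an internal exception type that still surfaces to users as a
// StatusRuntimeException, so existing catch blocks keep working.
public class GrpcMessageTooLargeException extends StatusRuntimeException {
  public GrpcMessageTooLargeException(Status status, Metadata trailers) {
    super(status, trailers);
  }
}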
          result.getRequestRetryOptions(),
          workflowTypeScope);
-     } else if (queryCompleted != null) {
+     if (queryCompleted != null) {
        sendDirectQueryCompletedResponse(
The direct query response can also fail because the gRPC message is too large, so I think that needs to be under the try as well, no?
Done.
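A rough sketch of the placement being discussed; every name below is an illustrative stand-in rather than WorkflowWorker's actual code:

// Illustrative stand-ins only; not the SDK's real methods.
abstract class ReplySketch {
  abstract void sendTaskCompletedResponse(Object taskCompleted);
  abstract void sendDirectQueryCompletedResponse(Object queryCompleted);
  abstract void reportReplyTooLarge(GrpcMessageTooLargeException e);

  // Both response paths sit inside the same try, so a response that exceeds
  // the gRPC message size limit is reported instead of escaping the handler.
  void reply(Object taskCompleted, Object queryCompleted) {
    try {
      if (taskCompleted != null) {
        sendTaskCompletedResponse(taskCompleted);
      } else if (queryCompleted != null) {
        sendDirectQueryCompletedResponse(queryCompleted);
      }
    } catch (GrpcMessageTooLargeException e) {
      reportReplyTooLarge(e);
    }
  }
}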
b3bad78 to 6444df3 (compare)
.startsWith("grpc: received message after decompression larger than max"))) { | ||
return new GrpcMessageTooLargeException(status.withCause(exception), exception.getTrailers()); | ||
} else { | ||
return null; |
nit: could just change the return type and return the original StatusRuntimeException, since all the callers just do that as well.
The logic is different depending on whether the wrap was successful or not: GrpcMessageTooLargeException is sent as a failure to the server, while any other StatusRuntimeException is just rethrown. Technically it could be an instanceof check, but I think the null check is cleaner.
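A hedged sketch of that wrap-or-null shape, based on the snippet above; the PR's actual helper, its name, and the exact status checks may differ:

import io.grpc.Status;
import io.grpc.StatusRuntimeException;

final class MessageTooLargeWrapper {
  // Returns the wrapped exception, or null when the failure is not a
  // message-size error (the real check may also look at the status code
  // and other message prefixes).
  static GrpcMessageTooLargeException tryWrap(StatusRuntimeException exception) {
    Status status = exception.getStatus();
    String description = status.getDescription();
    if (description != null
        && description.startsWith("grpc: received message after decompression larger than max")) {
      return new GrpcMessageTooLargeException(status.withCause(exception), exception.getTrailers());
    }
    return null;
  }

  private MessageTooLargeWrapper() {}
}

A caller then null-checks the result: if the wrap succeeded, the error is reported to the server as a task failure; otherwise the original StatusRuntimeException is rethrown.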
Looks good! Did we plan to add a Temporal Cloud test as part of this PR? If not that is okay, but can we at least manually test against it?
          workflowExecution.getWorkflowId(),
          tooLargeException,
          "Failed to send query response");
      RespondQueryTaskCompletedRequest.Builder queryFailedBuilder =
Sorry I missed this in the initial review; we should be failing the workflow task here.
Talked offline. In Java we seem to always use RespondQueryTaskCompletedRequest for queries, even if there was something that would normally fail the workflow task.
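For illustration, a hedged sketch of reporting a failed query through that request type, assuming the generated RespondQueryTaskCompletedRequest and QueryResultType classes from the Temporal API protos; the builder code in the PR may set different fields:

import com.google.protobuf.ByteString;
import io.temporal.api.enums.v1.QueryResultType;
import io.temporal.api.workflowservice.v1.RespondQueryTaskCompletedRequest;

final class QueryFailureSketch {
  // Builds a "query failed" response; the failure is still reported through
  // RespondQueryTaskCompletedRequest rather than by failing the workflow task.
  static RespondQueryTaskCompletedRequest failedQueryResponse(
      ByteString taskToken, String namespace, Throwable cause) {
    return RespondQueryTaskCompletedRequest.newBuilder()
        .setTaskToken(taskToken)
        .setNamespace(namespace)
        .setCompletedType(QueryResultType.QUERY_RESULT_TYPE_FAILED)
        .setErrorMessage(String.valueOf(cause.getMessage()))
        .build();
  }

  private QueryFailureSketch() {}
}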
What was changed
- GrpcRetryer does not retry a gRPC error if the cause is the message size limit; instead it wraps the error in the new custom GrpcMessageTooLargeException and throws (a simplified sketch follows after this list).
- WorkflowWorker catches GrpcMessageTooLargeException when reporting workflow task completion/failure and reports task failure with this error. This does NOT prevent server-side retry of the workflow task, but it does record the error in event history for easier debugging.
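As mentioned above, a simplified sketch of what the non-retry path could look like, reusing the hypothetical tryWrap helper sketched earlier in the review thread; GrpcRetryer's real backoff and retry-policy handling is more involved:

import io.grpc.StatusRuntimeException;
import java.time.Duration;
import java.util.function.Supplier;

final class RetrySketch {
  // Retries gRPC failures with a fixed delay, but a message-size failure is
  // wrapped and rethrown immediately instead of consuming further attempts.
  static <R> R callWithRetry(Supplier<R> call, int maxAttempts, Duration delay)
      throws InterruptedException {
    StatusRuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.get();
      } catch (StatusRuntimeException e) {
        GrpcMessageTooLargeException tooLarge = MessageTooLargeWrapper.tryWrap(e);
        if (tooLarge != null) {
          throw tooLarge; // never retried; reported as a workflow task failure
        }
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(delay.toMillis());
        }
      }
    }
    throw last;
  }

  private RetrySketch() {}
}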
Why?
Feature request: temporalio/features#624
Checklist
Closes #1585 (Do not auto-retry gRPC-message-size-too-large errors)
How was this tested: added tests to GrpcSyncRetryerTest, GrpcAsyncRetryerTest, and GrpcMessageTooLargeTest.