-
Notifications
You must be signed in to change notification settings - Fork 92
Fix autoscaler returning fatal error on GOAWAY if no scaling seen #973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -110,7 +110,7 @@ struct GenericService<F> { | |
} | ||
impl<F> Service<tonic::codegen::http::Request<Body>> for GenericService<F> | ||
where | ||
F: FnMut() -> Response<Body>, | ||
F: FnMut() -> BoxFuture<'static, Response<Body>>, | ||
{ | ||
type Response = Response<Body>; | ||
type Error = Infallible; | ||
|
@@ -133,7 +133,7 @@ where | |
) | ||
.unwrap(); | ||
let r = (self.response_maker)(); | ||
async move { Ok(r) }.boxed() | ||
async move { Ok(r.await) }.boxed() | ||
} | ||
} | ||
impl<F> NamedService for GenericService<F> { | ||
|
@@ -144,12 +144,12 @@ struct FakeServer { | |
addr: std::net::SocketAddr, | ||
shutdown_tx: oneshot::Sender<()>, | ||
header_rx: tokio::sync::mpsc::UnboundedReceiver<String>, | ||
server_handle: tokio::task::JoinHandle<()>, | ||
pub server_handle: tokio::task::JoinHandle<()>, | ||
} | ||
|
||
async fn fake_server<F>(response_maker: F) -> FakeServer | ||
where | ||
F: FnMut() -> Response<Body> + Clone + Send + Sync + 'static, | ||
F: FnMut() -> BoxFuture<'static, Response<Body>> + Clone + Send + Sync + 'static, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I made this change while trying to get the fake server to create the goaway. Couldn't do it, but, it's probably useful for the fake responses to be async anyway. |
||
{ | ||
let (shutdown_tx, shutdown_rx) = oneshot::channel::<()>(); | ||
let (header_tx, header_rx) = tokio::sync::mpsc::unbounded_channel(); | ||
|
@@ -191,7 +191,7 @@ impl FakeServer { | |
|
||
#[tokio::test] | ||
async fn timeouts_respected_one_call_fake_server() { | ||
let mut fs = fake_server(|| Response::new(Body::empty())).await; | ||
let mut fs = fake_server(|| async { Response::new(Body::empty()) }.boxed()).await; | ||
let header_rx = &mut fs.header_rx; | ||
|
||
let mut opts = get_integ_server_options(); | ||
|
@@ -260,7 +260,11 @@ async fn non_retryable_errors() { | |
Code::Unauthenticated, | ||
Code::Unimplemented, | ||
] { | ||
let mut fs = fake_server(move || Status::new(code, "bla").into_http()).await; | ||
let mut fs = fake_server(move || { | ||
let s = Status::new(code, "bla").into_http(); | ||
async { s }.boxed() | ||
}) | ||
.await; | ||
|
||
let mut opts = get_integ_server_options(); | ||
let uri = format!("http://localhost:{}", fs.addr.port()) | ||
|
@@ -292,13 +296,13 @@ async fn retryable_errors() { | |
{ | ||
let count = Arc::new(AtomicUsize::new(0)); | ||
let mut fs = fake_server(move || { | ||
dbg!("Making resp"); | ||
let prev = count.fetch_add(1, Ordering::Relaxed); | ||
if prev < 3 { | ||
let r = if prev < 3 { | ||
Status::new(code, "bla").into_http() | ||
} else { | ||
make_ok_response(RespondActivityTaskCanceledResponse::default()) | ||
} | ||
}; | ||
async { r }.boxed() | ||
}) | ||
.await; | ||
|
||
|
@@ -335,7 +339,7 @@ async fn namespace_header_attached_to_relevant_calls() { | |
.add_service(GenericService { | ||
header_to_parse: "Temporal-Namespace", | ||
header_tx, | ||
response_maker: || Response::new(Body::empty()), | ||
response_maker: || async { Response::new(Body::empty()) }.boxed(), | ||
}) | ||
.serve_with_incoming_shutdown( | ||
tokio_stream::wrappers::TcpListenerStream::new(listener), | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I get some background here? I expect Tonic to hide go-away and implicitly handle reconnects. We send a go away every 5m on average. Is there a situation where a regular non-worker client use may see some go-away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a bug in hyper or tonic that was worked around here: #811
But, the short-circuit that the autoscaler turns on so it can scale better on otherwise non-fatal errors makes this come through as fatal - so it gets ignored specifically here.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(marking approved anyways, just trying to understand some strangeness here)
So IIUC the server by default sends a
GoAway
after 5m (aka soft connection close to tell you to stop making more calls after whatever is in flight) and then hard-closes the TCP connection after 7m (because 2m is enough time between soft and hard for you to never hit this in a properly behaving client).So somehow we're sending RPC calls even after soft close and therefore hitting the 7m limit? If that is the case, there can be data loss in a rare/racy way if hard TCP death occurs during gRPC call (in our case it'd be a task timeout because server might send us task and then close conn). Or maybe Tonic is the one that's eagerly returning a
Code::Cancelled
w/"connection closed"
before it ever even makes the call? Obviously not important if it all works today, but it is a bit confusing to me.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the latter thing is the explanation, but yeah I agree it's confusing.