Fix panic handling in detached mode to return 500 instead of 499 #1359


Open: nshalman wants to merge 1 commit into main

Conversation

@nshalman (Contributor)

Handler panics in HandlerTaskMode::Detached were incorrectly being reported as 499 (Client Closed Request) instead of 500 (Internal Server Error) in telemetry and traces.

This occurred because the panic handling code called panic::resume_unwind(), which caused the request handling task to panic and trigger the disconnect guard, leading to incorrect status code assignment.

Changes:

  • Extract panic message from JoinError and convert to HttpError
  • Return 500 Internal Server Error instead of propagating panic
  • Add integration test to verify panic handling returns proper 500 status
  • Preserve existing behavior for HandlerTaskMode::CancelOnDisconnect

🤖 Generated with Claude Code
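
As a rough illustration of the direction (a minimal sketch, not the actual patch; the helper name is hypothetical, though `JoinError::is_panic`/`into_panic` and `HttpError::for_internal_error` are real tokio/Dropshot APIs):

```rust
use dropshot::HttpError;
use tokio::task::JoinError;

/// Hypothetical helper: instead of calling panic::resume_unwind() on the
/// JoinError from the detached handler task (which unwinds through the
/// disconnect guard and gets recorded as a 499), turn the panic into a 500.
fn handler_panic_to_500(join_error: JoinError) -> HttpError {
    let message = if join_error.is_panic() {
        // The panic payload is a Box<dyn Any + Send>; it is usually a
        // &'static str or a String, but not always.
        let payload = join_error.into_panic();
        payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| String::from("non-string panic payload"))
    } else {
        String::from("handler task failed without panicking")
    };
    HttpError::for_internal_error(format!("request handler panicked: {}", message))
}
```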

@nshalman marked this pull request as ready for review June 26, 2025 18:29
@nshalman (Contributor Author) commented Jul 7, 2025

I know you folks are busy, so these pings are not intended to get an immediate response, just to get this on your radar for feedback. This 499 on a handler panic has seemed wrong to me for a long time. @steveklabnik @davepacheco if you could throw this in your backlog to review I'd appreciate it. Rejection is totally acceptable and I can keep this patch in a branch of my own if this isn't a good fit for all the downstream software you folks have.

@steveklabnik (Contributor)

500 feels correct to me here too; I haven't run into this in the wild, but at a high level that feels like the right status code, personally.

@davepacheco (Collaborator)

Sorry for the delay on this! I agree that reporting a 499 is confusing at best. It looks like that's an accident of the current implementation rather than intentional. If I'm understanding right, the code in the guard (L799-L817 on "main") is assuming that if the guard got dropped, it's because the future was cancelled; but it can also get dropped if the handler panics. Ideally, it would somehow account for that. I'm not yet sure how.
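
As a standalone illustration of why the guard can't tell the cases apart on its own (this is not Dropshot's actual guard code): a Drop-based guard fires whether the request future was cancelled or a panic is tearing the request down. For a plain synchronous unwind, `std::thread::panicking()` is one way a guard could distinguish the two, though it's less clear that this carries over cleanly to a guard stored inside a spawned future:

```rust
// Illustrative only; Dropshot's real guard tracks more state than this.
struct DisconnectGuard {
    disarmed: bool,
}

impl Drop for DisconnectGuard {
    fn drop(&mut self) {
        if self.disarmed {
            return; // normal completion path: the handler finished and disarmed us
        }
        if std::thread::panicking() {
            // Dropped because a panic is unwinding through us: a server-side
            // bug, which should not be recorded as a client disconnect (499).
            eprintln!("request aborted by panic");
        } else {
            // Dropped with no panic in flight: the future was cancelled,
            // i.e. the client went away before we produced a response.
            eprintln!("client disconnected");
        }
    }
}

fn main() {
    let _guard = DisconnectGuard { disarmed: false };
    panic!("handler bug"); // the guard's Drop sees thread::panicking() == true
}
```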

I don't love the idea of treating a panic as a handleable error, though. In the "programmer error" / "operational error" parlance, our general disposition among most of the Dropshot consumers I'm familiar with is to treat panics as programmer errors that reflect something deeply wrong, whose blast radius could include the whole process. I know this isn't quite fair, but I think of a panic handler in our programs as analogous to a SIGSEGV handler in a C program that reports the error and lets the program continue. That'd be pretty dangerous -- often much worse than crashing. Folks have made the case that Rust provides enough safety that the blast radius of a panic might well be more contained than "the whole process", and I could see that. But in practice, I think it's been quite rare for us to see panics in request handlers, which both validates the idea that these really shouldn't happen but also means we don't have a lot of experience to say that the blast radius is likely to be pretty contained.

I'm not suggesting you (or anybody else) should adopt this same approach, only that I don't think we'd want that to be the default Dropshot behavior. If this is important, maybe it could be an option?

@ahl any thoughts?

@nshalman (Contributor Author) commented Jul 7, 2025

> I don't love the idea of treating a panic as a handleable error, though. In the "programmer error" / "operational error" parlance, our general disposition among most of the Dropshot consumers I'm familiar with is to treat panics as programmer errors that reflect something deeply wrong, whose blast radius could include the whole process. I know this isn't quite fair, but I think of a panic handler in our programs as analogous to a SIGSEGV handler in a C program that reports the error and lets the program continue. That'd be pretty dangerous -- often much worse than crashing. Folks have made the case that Rust provides enough safety that the blast radius of a panic might well be more contained than "the whole process", and I could see that. But in practice, I think it's been quite rare for us to see panics in request handlers, which both validates the idea that these really shouldn't happen but also means we don't have a lot of experience to say that the blast radius is likely to be pretty contained.

Agreed! So the remaining question on my mind is: why, when a handler panics in detached mode, does dropshot not find a way to panic the entire server? It seems to me that either a panic does not crash the server, in which case the user deserves a 500 error, or else the entire process should abort and dump core. Currently it seems like dropshot does neither. But maybe I am not using it quite right?

Edited to add: I'd be happy to rework this into a change that makes a panic in a handler properly propagate out. I initially used a panic in a handler to exercise various failure modes in my opentelemetry tracing work, which is when I originally stumbled over a handler panic logging a 499. I can then rework my error tests to use software-generated errors rather than panics.

@davepacheco (Collaborator)

Your question has caused me to re-think a lot of this from first principles (probably too far). Sorry in advance this is a little rambly.

> Agreed! So the remaining question on my mind is: why, when a handler panics in detached mode, does dropshot not find a way to panic the entire server? It seems to me that either a panic does not crash the server, in which case the user deserves a 500 error, or else the entire process should abort and dump core. Currently it seems like dropshot does neither. But maybe I am not using it quite right?

Dropshot doesn't do this because it can't know what the consumer wants. I assume it's the consumer's responsibility to set things up so that panics are handled the way that they want, especially if their goal is to abort on panic since that's easy to do globally. For our own stuff, we set panic = abort in release builds (which, incidentally, I think means that it doesn't matter what Dropshot does -- the program will call abort(3C) directly from the panic subsystem in Rust).

I believe Dropshot propagates this particular panic because it seemed like the thing that would let consumers decide how they wanted to handle it. In retrospect, that approach works fine for ordinary synchronous dependencies, but it's a lot less clear what should be done in dependencies like hyper or Dropshot that spawn background threads or tasks that themselves could panic. These crates should probably translate that panic into an error that they can propagate to the consumer (and, ideally, force the consumer to have a handler for such errors so that they can't go silently unnoticed -- my big fear about this sort of thing is that some critical background thread/task simply stops running with little more than a note in the log and the program limps along in a permanent state of implicit, non-fatal failure). There's probably a gap in Dropshot here.
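
A rough sketch of that shape, assuming tokio and nothing that exists in Dropshot today (the helper and its name are hypothetical): the library awaits the background task's `JoinHandle` and makes the consumer supply a panic handler, so a panicked task can't disappear with just a log line.

```rust
use tokio::task::{JoinError, JoinHandle};

/// Hypothetical supervision helper: the caller must supply an on_panic
/// handler, so a panicked background task cannot go silently unnoticed.
async fn supervise<T>(
    handle: JoinHandle<T>,
    on_panic: impl FnOnce(JoinError) -> T,
) -> T {
    match handle.await {
        Ok(value) => value,
        // is_panic() distinguishes a panicked task from a cancelled one.
        Err(join_error) if join_error.is_panic() => on_panic(join_error),
        // Cancellation: re-raise loudly rather than limping along.
        Err(join_error) => panic!("background task cancelled: {}", join_error),
    }
}
```

Whether `on_panic` aborts the process, raises an alarm, or restarts the task would then be the consumer's call, but it has to exist.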

But even if we had that, we'd still want to clean up the state of the request, which means figuring out what to report for the result and how to respond to the request. 499 is clearly wrong since this is not any kind of client error. I still don't love 500 for a few reasons:

  • it implies a clean failure with no consequences for other Dropshot requests, and we can't really know that if we panicked inside Dropshot
  • it implies that the client got the 500, or at least that it didn't get something else. But it's possible we sent even a 200 header to the client and panicked after that. It'd be really confusing to see a 500 on the server for the same request that the client saw some other code for.

It's always possible that if Dropshot panics at the wrong time, nothing gets reported for the request and the client might see an abruptly terminated TCP connection. I think that's what will happen if Dropshot panics before setting up the scope guard or from the code that runs in the guard. So I think one option is: if Dropshot panics, it just doesn't run that guard at all. Another option is that we report "no status". When an HTTP request completes, it could have an ordinary HTTP status, or it may never complete (if we panicked at a bad time), or it completes with an indicator that the handler panicked. I'd view this as analogous to the question of "what's the exit code of a Unix process that was terminated by SIGSEGV?" The answer is it doesn't have one -- exit codes are only for processes not terminated by signals.

That'd be fine, but unless you have a compelling reason to handle these panics (and are fine with some panics not being handleable), I'd be inclined to just have the scope guard not run at all if the task panicked (if that's even possible...). But I'm interested to know more about your use case to see if that makes sense!

@steveklabnik (Contributor)

> I don't love the idea of treating a panic as a handleable error, though. In the "programmer error" / "operational error" parlance, our general disposition among most of the Dropshot consumers I'm familiar with is to treat panics as programmer errors that reflect something deeply wrong, whose blast radius could include the whole process. [...]

For whatever it's worth, "a web server which has a handler that panics doesn't take down the whole server" was one of the direct motivations for being able to catch panics in the first place, as it's a common user expectation.

@davepacheco (Collaborator)

>> I don't love the idea of treating a panic as a handleable error, though. In the "programmer error" / "operational error" parlance, our general disposition among most of the Dropshot consumers I'm familiar with is to treat panics as programmer errors that reflect something deeply wrong, whose blast radius could include the whole process. [...]
>
> For whatever it's worth, "a web server which has a handler that panics doesn't take down the whole server" was one of the direct motivations for being able to catch panics in the first place, as it's a common user expectation.

I'm not surprised, as I've seen this approach used in other environments. Joyent wound up deploying it in Node.js (via an uncaughtException handler). It led to a number of serious outages requiring operator intervention that would have been inconsequential (and automatically resolved) had the process just crashed instead of trying to stay up. We learned from those incidents that this pattern creates considerable risk that the server remains up when it's not actually healthy, when crashing and restarting could have restored service immediately. Rust is undoubtedly in a better place to handle this pattern than Node.js, but the pattern remains fraught for the same reasons, and Rust is also in a much better position to avoid panicking so much in the first place. That said, I don't care about convincing Dropshot consumers about any of this. My goal is to make sure that we avoid that problem and ideally avoid leading new users towards it who might otherwise choose differently.

@nshalman (Contributor Author) commented Jul 8, 2025

> For our own stuff, we set panic = abort in release builds (which, incidentally, I think means that it doesn't matter what Dropshot does -- the program will call abort(3C) directly from the panic subsystem in Rust).

I assume this is https://doc.rust-lang.org/book/ch09-01-unrecoverable-errors-with-panic.html

[profile.release]
panic = 'abort'

This is helpful context for me to experiment with. Thank you!

An additional point that I think might be interesting is whether we can detect the difference between a panic originating in the handler vs a panic originating elsewhere in dropshot itself. As long as we can detect the difference, I think it's reasonable to give the user control over what happens when a handler panics while a dropshot panic should continue to do whatever it does.

Alternatively, it's possible that the mechanism offered should simply be panic = abort.

So I do think that perhaps this change is salvageable if I can prove it has no real impact when panic = abort and thus creates no behavior change for existing Oxide software. But per Steve's comment, if the server isn't going to crash and dropshot has detected that a panic specifically came from handler code and not its own code, dropshot should be allowed to return the 500 error to the user and log it with some appropriately critical error message.

Oh! That was the other thing. Is it really possible for a handler to return a status code of 200 to the user and then proceed to panic? I think the handler has to successfully return a value for dropshot to send a response to the user. I agree that a generic panic within dropshot in general could potentially trip over returning the 200 to the user and then crashing, but at that point you're already pretty screwed because 200 made it out.

I'm tempted to add a tiny philosophical note about panic=abort somewhere in the README so that those who follow the same philosophy will know what to put into their dropshot software.

@davepacheco (Collaborator)

>> For our own stuff, we set panic = abort in release builds (which, incidentally, I think means that it doesn't matter what Dropshot does -- the program will call abort(3C) directly from the panic subsystem in Rust).
>
> I assume this is https://doc.rust-lang.org/book/ch09-01-unrecoverable-errors-with-panic.html
>
> [profile.release]
> panic = 'abort'

Yup.

> An additional point that I think might be interesting is whether we can detect the difference between a panic originating in the handler vs a panic originating elsewhere in dropshot itself. As long as we can detect the difference, I think it's reasonable to give the user control over what happens when a handler panics while a dropshot panic should continue to do whatever it does.

I would phrase this as a new feature request, which is: allow a Dropshot server to treat [consumer] handler panics as 500s [without propagating them]. If we want to go down this road, I would move the panic check as close to the point where we invoke the user handler as possible so that we know that if we caught a panic, it was their code, not Dropshot's. Then either turn it into the appropriate HttpError immediately or propagate a type that conveys which case we were in so that this code can do that.
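
A sketch of what that boundary check could look like (hypothetical, not current Dropshot code), assuming the futures crate's `FutureExt::catch_unwind`: because only the consumer's handler future is inside the catch, anything caught here is known to be the handler's panic rather than Dropshot's.

```rust
use std::panic::AssertUnwindSafe;

use dropshot::HttpError;
use futures::FutureExt;

/// Hypothetical wrapper invoked right where the user handler is called.
async fn invoke_handler<F, T>(handler_future: F) -> Result<T, HttpError>
where
    F: std::future::Future<Output = Result<T, HttpError>>,
{
    match AssertUnwindSafe(handler_future).catch_unwind().await {
        // The handler ran to completion (successfully or with its own error).
        Ok(result) => result,
        // The handler panicked; only the handler future is inside the catch,
        // so this is known to be consumer code.
        Err(_panic_payload) => Err(HttpError::for_internal_error(String::from(
            "request handler panicked",
        ))),
    }
}
```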

> Oh! That was the other thing. Is it really possible for a handler to return a status code of 200 to the user and then proceed to panic? I think the handler has to successfully return a value for dropshot to send a response to the user.

Agreed -- it'd have to be a panic after the handler function returns.

> I agree that a generic panic within dropshot in general could potentially trip over returning the 200 to the user and then crashing, but at that point you're already pretty screwed because 200 made it out.

Well, there's different degrees of "screwed". If the program crashes, then:

  • The client can (and should) detect this because they won't get a well-formed HTTP response. They'll get, say, a non-zero content-length but then EOF before they've read all the bytes.
  • Instrumentation will reflect a request started but never finished (this can be detected, though I don't know if existing tools do this well).

These are ordinary operational errors that can happen and are handleable. The thing I was worried about is reporting that request as a 500, and now the instrumentation is wrong about what actually happened.

> I'm tempted to add a tiny philosophical note about panic=abort somewhere in the README so that those who follow the same philosophy will know what to put into their dropshot software.

That's okay but I don't think this is really Dropshot-specific.
