Fix panic handling in detached mode to return 500 instead of 499 #1359
base: main
Conversation
Handler panics in HandlerTaskMode::Detached were incorrectly being reported as 499 (Client Closed Request) instead of 500 (Internal Server Error) in telemetry and traces. This occurred because the panic handling code called panic::resume_unwind(), which caused the request handling task to panic and trigger the disconnect guard, leading to incorrect status code assignment.

Changes:
- Extract the panic message from the JoinError and convert it to an HttpError
- Return 500 Internal Server Error instead of propagating the panic
- Add an integration test to verify that panic handling returns a proper 500 status
- Preserve existing behavior for HandlerTaskMode::CancelOnDisconnect

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
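The diff itself isn't shown on this page, but a minimal sketch of the approach the summary describes might look like the following. The function name, return type, and message formatting are hypothetical; `JoinError::is_panic`/`into_panic` (tokio) and `HttpError::for_internal_error` (dropshot) are real APIs.

```rust
use dropshot::HttpError;

// Hypothetical sketch: instead of panic::resume_unwind() on a panicked
// detached handler task, extract the panic payload from the JoinError and
// report it as a 500 so the disconnect guard never misreports a 499.
async fn await_detached_handler(
    handler_task: tokio::task::JoinHandle<Result<dropshot::HttpResponseOk<()>, HttpError>>,
) -> Result<dropshot::HttpResponseOk<()>, HttpError> {
    match handler_task.await {
        Ok(result) => result,
        Err(join_err) if join_err.is_panic() => {
            // Panic payloads are usually &str or String; anything else gets
            // a generic message.
            let payload = join_err.into_panic();
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "handler panicked".to_string());
            Err(HttpError::for_internal_error(format!("handler panicked: {}", msg)))
        }
        Err(join_err) => Err(HttpError::for_internal_error(format!(
            "handler task failed: {}",
            join_err
        ))),
    }
}
```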
I know you folks are busy, so these pings are not intended to get an immediate response, just to get this on your radar for feedback. This 499 on a handler panic has seemed wrong to me for a long time. @steveklabnik @davepacheco if you could throw this in your backlog to review I'd appreciate it. Rejection is totally acceptable and I can keep this patch in a branch of my own if this isn't a good fit for all the downstream software you folks have.
500 feels correct to me here too; I haven't run into this in the wild, but at a high level that feels like the right status code, personally.
Sorry for the delay on this! I agree that reporting a 499 is confusing at best. It looks like that's an accident of the current implementation rather than intentional. If I'm understanding right, the code in the guard (L799-L817 on "main") is assuming that if the guard got dropped, it's because the future was cancelled; but it can also get dropped if the handler panics. Ideally, it would somehow account for that. I'm not yet sure how.

I don't love the idea of treating a panic as a handleable error, though. In the "programmer error" / "operational error" parlance, our general disposition among most of the Dropshot consumers I'm familiar with is to treat panics as programmer errors that reflect something deeply wrong, whose blast radius could include the whole process. I know this isn't quite fair, but I think of a panic handler in our programs as analogous to a SIGSEGV handler in a C program that reports the error and lets the program continue. That'd be pretty dangerous -- often much worse than crashing.

Folks have made the case that Rust provides enough safety that the blast radius of a panic might well be more contained than "the whole process", and I could see that. But in practice, I think it's been quite rare for us to see panics in request handlers, which both validates the idea that these really shouldn't happen but also means we don't have a lot of experience to say that the blast radius is likely to be pretty contained.

I'm not suggesting you (or anybody else) should adopt this same approach, only that I don't think we'd want that to be the default Dropshot behavior. If this is important, maybe it could be an option? @ahl any thoughts?
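To make the failure mode concrete, here is a small self-contained sketch (not Dropshot's actual guard code) of why a drop-based "client disconnected" guard fires on a handler panic just as it does on cancellation:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// A guard whose Drop runs both when the future is cancelled (the case it was
// written for) and when the task unwinds from a panic (the case that produced
// the surprising 499).
struct DisconnectGuard {
    reported_client_disconnect: Arc<AtomicBool>,
}

impl Drop for DisconnectGuard {
    fn drop(&mut self) {
        // From here we cannot tell whether we were dropped by cancellation or
        // by a panic unwinding through the task.
        self.reported_client_disconnect.store(true, Ordering::SeqCst);
    }
}

#[tokio::main]
async fn main() {
    let flag = Arc::new(AtomicBool::new(false));
    let task_flag = flag.clone();
    let task = tokio::spawn(async move {
        let _guard = DisconnectGuard { reported_client_disconnect: task_flag };
        panic!("handler bug"); // unwinding drops the guard, just like cancellation would
    });
    let _ = task.await; // JoinError::is_panic() would be true here
    assert!(flag.load(Ordering::SeqCst)); // guard fired even though no client disconnected
}
```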
Agreed! So the remaining question on my mind is: why, when a handler panics in detached mode, does dropshot not find a way to panic the entire server? It seems to me that either a panic does not crash the server, in which case the user deserves a 500 error, or else the entire process should abort and dump core. Currently it seems like dropshot does neither. But maybe I am not using it quite right?

Edited to add: I'd be happy to rework this into a change that makes a panic in a handler properly propagate out. I initially used a panic in a handler to exercise various failure modes in my opentelemetry tracing work, which is when I originally stumbled over a handler panic logging a 499. I can then rework my error tests to use software-generated errors rather than panics.
Your question has caused me to re-think a lot of this from first principles (probably too far). Sorry in advance this is a little rambly.
Dropshot doesn't do this because it can't know what the consumer wants. I assume it's the consumer's responsibility to set things up so that panics are handled the way that they want, especially if their goal is to abort on panic, since that's easy to do globally. For our own stuff, we set things up to abort on panic.

I believe Dropshot propagates this particular panic because it seemed like the thing that would let consumers decide how they wanted to handle it. In retrospect, that approach works fine for ordinary synchronous dependencies, but it's a lot less clear what should be done in dependencies like hyper or Dropshot that spawn background threads or tasks that themselves could panic. These crates should probably translate that panic into an error that they can propagate to the consumer (and, ideally, force the consumer to have a handler for such errors so that they can't go silently unnoticed -- my big fear about this sort of thing is that some critical background thread/task simply stops running with little more than a note in the log and the program limps along in a permanent state of implicit, non-fatal failure). There's probably a gap in Dropshot here.

But even if we had that, we'd still want to clean up the state of the request, which means figuring out what to report for the result and how to respond to the request. 499 is clearly wrong since this is not any kind of client error. I still don't love 500 for a few reasons:
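For readers wondering what "set things up so that panics are handled the way they want" can look like in practice, one common pattern (an assumption here, not necessarily what any particular Dropshot consumer does) is to build with `panic = "abort"` or to install a panic hook that aborts the process:

```rust
// One way a consumer can make any panic fatal to the whole process,
// regardless of where tokio or a library would otherwise catch it
// (an alternative to building with `panic = "abort"` in Cargo.toml).
fn install_abort_on_panic_hook() {
    let default_hook = std::panic::take_hook();
    std::panic::set_hook(Box::new(move |info| {
        // Log via the default hook first, then take the whole process down.
        default_hook(info);
        std::process::abort();
    }));
}
```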
It's always possible that if Dropshot panics at the wrong time, nothing gets reported for the request and the client might see an abruptly terminated TCP connection. I think that's what will happen if Dropshot panics before setting up the scope guard or from the code that runs in the guard. So I think one option is: if Dropshot panics, it just doesn't run that guard at all.

Another option is that we report "no status". When an HTTP request completes, it could have an ordinary HTTP status, or it may never complete (if we panicked at a bad time), or it completes with an indicator that the handler panicked. I'd view this as analogous to the question of "what's the exit code of a Unix process that was terminated by SIGSEGV?" The answer is it doesn't have one -- exit codes are only for processes not terminated by signals.

That'd be fine, but unless you have a compelling reason to handle these panics (and are fine with some panics not being handleable), I'd be inclined to just have the scope guard not run at all if the task panicked (if that's even possible...). But I'm interested to know more about your use case to see if that makes sense!
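On "have the scope guard not run at all if the task panicked (if that's even possible)": one primitive that could be relevant is `std::thread::panicking()`, which is true while a panic is unwinding the current thread, including while destructors run during that unwind. Whether this applies cleanly to Dropshot's detached-task arrangement depends on which task actually holds the guard, so treat this only as a sketch of the primitive:

```rust
// A guard that skips its "client disconnected" bookkeeping when it is being
// dropped because of a panic rather than because of cancellation.
struct RequestGuard;

impl Drop for RequestGuard {
    fn drop(&mut self) {
        if std::thread::panicking() {
            // The code holding the guard panicked; don't record a disconnect.
            return;
        }
        println!("guard: request cancelled by the client");
    }
}

fn main() {
    // The panic unwinds through the guard: the disconnect path above is skipped.
    let _ = std::panic::catch_unwind(|| {
        let _guard = RequestGuard;
        panic!("handler bug");
    });
}
```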
For whatever it's worth, "a web server which has a handler that panics doesn't take down the whole server" was one of the direct motivations for being able to catch panics in the first place, as it's a common user expectation.
I'm not surprised, as I've seen this approach used in other environments. Joyent wound up deploying it in Node.js (via an
I assume this is https://doc.rust-lang.org/book/ch09-01-unrecoverable-errors-with-panic.html
This is helpful context for me to experiment with. Thank you!

An additional point that I think might be interesting is whether we can detect the difference between a panic originating in the handler vs a panic originating elsewhere in dropshot itself. As long as we can detect the difference, I think it's reasonable to give the user control over what happens when a handler panics, while a dropshot panic should continue to do whatever it does. Alternately, it's possible that the mechanism offered should simply be panic = abort. So I do think that perhaps this change is salvageable if I can prove it has no real impact when panic = abort and thus creates no behavior change for existing Oxide software. But per Steve's comment, if the server isn't going to crash and dropshot has detected that a panic specifically came from handler code and not its own code, dropshot should be allowed to return the 500 error to the user and log it with some appropriately critical error message.

Oh! That was the other thing. Is it really possible for a handler to return a status code of 200 to the user and then proceed to panic? I think the handler has to successfully return a value for dropshot to send a response to the user. I agree that a generic panic within dropshot in general could potentially trip over returning the 200 to the user and then crashing, but at that point you're already pretty screwed because the 200 made it out.

I'm tempted to add a tiny philosophical note about panic=abort somewhere in the README so that those who follow the same philosophy will know what to put into their dropshot software.
Yup.
I would phrase this as a new feature request, which is: allow a Dropshot server to treat [consumer] handler panics as 500s [without propagating them]. If we want to go down this road, I would move the panic check as close to the point where we invoke the user handler as possible so that we know that if we caught a panic, it was their code, not Dropshot's. Then either turn it into the appropriate
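A sketch of what "move the panic check as close to the point where we invoke the user handler as possible" could look like, using `futures::FutureExt::catch_unwind` with `AssertUnwindSafe`; the wrapper function and its signature here are hypothetical, not Dropshot's actual internals:

```rust
use futures::FutureExt;
use std::panic::AssertUnwindSafe;

// Wrap only the user handler future in catch_unwind, so a caught panic is
// known to come from consumer code rather than from Dropshot itself.
async fn run_user_handler<F, T>(handler: F) -> Result<T, dropshot::HttpError>
where
    F: std::future::Future<Output = Result<T, dropshot::HttpError>>,
{
    match AssertUnwindSafe(handler).catch_unwind().await {
        Ok(result) => result,
        Err(_panic_payload) => Err(dropshot::HttpError::for_internal_error(
            "request handler panicked".to_string(),
        )),
    }
}
```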
Agreed -- it'd have to be a panic after the handler function returns.
Well, there's different degrees of "screwed". If the program crashes, then:
These are ordinary operational errors that can happen and are handleable. The thing I was worried about is reporting that request as a 500, and now the instrumentation is wrong about what actually happened.
That's okay but I don't think this is really Dropshot-specific.