Error Handling: refactor XlaCoordinator to use status types. #9386

Open: wants to merge 3 commits into base: master

Conversation

@ysiraichi (Collaborator) commented Jun 18, 2025

This PR does 3 different things:

  1. Adds functions and macros to be used with Status and StatusOr<T> constructs, so as to make it easier to propagate non-ok status.
  2. Introduces XLA_SHOW_CPP_ERROR_CONTEXT environment variable for toggling error context.
  3. Refactors XlaCoordinator to use the new status propagation constructs as an example

Inspired by the TensorFlow implementation, it introduces the following status propagation macros:

  • XLA_ASSIGN_OR_RETURN(LHS, REXPR, ...), which assigns the value held by REXPR to LHS if REXPR holds an ok status, and otherwise returns the non-ok status.
  • XLA_RETURN_IF_ERROR(EXPR, ...), which early-returns if EXPR is a non-ok status, propagating the error to the caller.
  • XLA_ERROR_WITH_LOCATION(STATUS), which builds a new absl::Status from STATUS, appending the current file location when error context is enabled.
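To make the control flow of these macros concrete, here is a minimal sketch of the pattern. It does not use Abseil: `MiniStatus`, `MiniStatusOr`, the `SKETCH_*` macros, and the helper functions are all hypothetical stand-ins invented for illustration, and the real macros in this PR also accept an optional message argument that this sketch leaves out.

```cpp
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for absl::Status: an empty message means OK.
struct MiniStatus {
  std::string message;
  bool ok() const { return message.empty(); }
};

// Hypothetical stand-in for absl::StatusOr<T>: holds a value or an error.
template <typename T>
class MiniStatusOr {
 public:
  MiniStatusOr(T value) : value_(std::move(value)) {}
  MiniStatusOr(MiniStatus status) : status_(std::move(status)) {}
  bool ok() const { return value_.has_value(); }
  const MiniStatus& status() const { return status_; }
  T& value() { return *value_; }

 private:
  MiniStatus status_;
  std::optional<T> value_;
};

// Early-return when `expr` evaluates to a non-ok status.
#define SKETCH_RETURN_IF_ERROR(expr)                 \
  do {                                               \
    MiniStatus sketch_status_ = (expr);              \
    if (!sketch_status_.ok()) return sketch_status_; \
  } while (false)

#define SKETCH_CONCAT_INNER(a, b) a##b
#define SKETCH_CONCAT(a, b) SKETCH_CONCAT_INNER(a, b)

// Assign the held value to `lhs` on success, otherwise return the status.
// Expands to several statements, so do not use it as an unbraced `if` body.
#define SKETCH_ASSIGN_OR_RETURN(lhs, rexpr) \
  SKETCH_ASSIGN_OR_RETURN_IMPL(SKETCH_CONCAT(sketch_statusor_, __LINE__), lhs, rexpr)

#define SKETCH_ASSIGN_OR_RETURN_IMPL(tmp, lhs, rexpr) \
  auto tmp = (rexpr);                                 \
  if (!tmp.ok()) return tmp.status();                 \
  lhs = std::move(tmp.value())

// Hypothetical helpers used only to exercise the macros.
MiniStatusOr<int> ParsePort(const std::string& text) {
  if (text.empty()) return MiniStatus{"port is empty"};
  int port = 0;
  for (char c : text) {
    if (c < '0' || c > '9') return MiniStatus{"port is not a number: " + text};
    port = port * 10 + (c - '0');
  }
  return port;
}

MiniStatus CheckRank(int rank, int world_size) {
  if (rank < 0 || rank >= world_size) return MiniStatus{"rank out of range"};
  return MiniStatus{};
}

// Any failure in a step is returned to the caller without explicit branching.
MiniStatusOr<int> Connect(int rank, int world_size, const std::string& port_text) {
  SKETCH_RETURN_IF_ERROR(CheckRank(rank, world_size));
  int port = 0;
  SKETCH_ASSIGN_OR_RETURN(port, ParsePort(port_text));
  return port + rank;
}
```

With this sketch, `Connect(1, 4, "8000")` yields a value, while `Connect(5, 4, "8000")` returns the status produced by `CheckRank` unchanged.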

C++ Error Handling

Idea: by default, print only the user-targeted message, with no extra C++ details. When the XLA_SHOW_CPP_ERROR_CONTEXT environment variable is set, also print extra context information, such as the C++ source location and other contextual messages.

  • Callers may overwrite the error message by specifying the last optional argument of XLA_ASSIGN_OR_RETURN and XLA_RETURN_IF_ERROR. Rationale: they might have more context to create a more user-friendly message.
  • If XLA_SHOW_CPP_ERROR_CONTEXT is set, the callee's error messages that were overwritten are shown on the lines that follow the final message.

Example errors:

  • When XLA_SHOW_CPP_ERROR_CONTEXT=0 (default):
RuntimeError: <message>
  • When XLA_SHOW_CPP_ERROR_CONTEXT=1:
RuntimeError: <message> (at <file>:<line>)
From Error: <previous-message> (at <file>:<line>)
...
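As a rough sketch of how the two output modes above could be produced, the function below renders a chain of messages in the format shown. `ErrorFrame`, `WithLocation`, and `RenderError` are hypothetical names, not part of the PR, and the `RuntimeError:` prefix is assumed to be added on the Python side, so it is left out here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one message and the C++ location that produced it.
struct ErrorFrame {
  std::string message;
  std::string file;
  int line;
};

// Formats "<message> (at <file>:<line>)".
std::string WithLocation(const ErrorFrame& frame) {
  return frame.message + " (at " + frame.file + ":" +
         std::to_string(frame.line) + ")";
}

// frames[0] is the most recent (user-targeted) message; later entries are
// the messages it replaced, newest first.
std::string RenderError(const std::vector<ErrorFrame>& frames,
                        bool show_context) {
  if (frames.empty()) return "";
  // Default mode: only the user-targeted message, no C++ details.
  if (!show_context) return frames.front().message;
  // Context mode: add locations and list the overwritten messages.
  std::string out = WithLocation(frames.front());
  for (std::size_t i = 1; i < frames.size(); ++i) {
    out += "\nFrom Error: " + WithLocation(frames[i]);
  }
  return out;
}
```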

@ysiraichi (Collaborator, Author)

Should be merged only after #9384.

@ysiraichi ysiraichi marked this pull request as draft June 18, 2025 20:45
@ysiraichi ysiraichi force-pushed the ysiraichi/status-qol-functions branch 3 times, most recently from 3bd1d2c to 1b2ebca Compare June 19, 2025 16:15
@ysiraichi ysiraichi force-pushed the ysiraichi/status-for-getcomputationclient branch from a351a69 to 7e0da02 Compare June 19, 2025 16:34
@ysiraichi ysiraichi force-pushed the ysiraichi/status-qol-functions branch 5 times, most recently from eaac1c3 to 87d2d12 Compare June 21, 2025 16:08
@zhanyong-wan (Collaborator)

This PR is a draft? Is it ready for review? If yes, please take it out of draft. Thanks!

@ysiraichi (Collaborator, Author)

This is a draft. I'm still working on it.

@ysiraichi ysiraichi force-pushed the ysiraichi/status-qol-functions branch 3 times, most recently from 510dcce to a1b26a4 Compare June 23, 2025 18:31
@ysiraichi ysiraichi force-pushed the ysiraichi/status-qol-functions branch 2 times, most recently from a78bc1a to 4b71836 Compare June 25, 2025 14:56
@ysiraichi ysiraichi changed the base branch from ysiraichi/status-for-getcomputationclient to master June 25, 2025 14:56
@ysiraichi ysiraichi changed the title Add Status and StatusOr<T> QOL functions. Error Handling: refactor XlaCoordinator to use status types. Jun 25, 2025
Add new Status and StatusOr utility functions that provide better error
handling and contextual information:
- `XLA_ERROR_WITH_LOCATION`: Creates Status with file location info
- `XLA_RETURN_IF_ERROR` and `XLA_ASSIGN_OR_RETURN`: Enhanced macros for
status propagation with optional custom error messages
- `MaybeWithLocation` and `MaybeWithNewMessage`: Functions for enriching
error messages
- `XLA_SHOW_CPP_ERROR_CONTEXT` environment variable to control error
context display
Replace direct constructor with `Create()` factory method that returns
`StatusOr<unique_ptr<XlaCoordinator>>`:
- Uses new `XLA_ASSIGN_OR_RETURN` and `XLA_RETURN_IF_ERROR` macros
- Provides better error handling with contextual messages
- Maintains same functionality with improved error propagation
@ysiraichi ysiraichi force-pushed the ysiraichi/status-qol-functions branch from 4b71836 to 55214ac Compare June 25, 2025 15:38
Comment on lines +30 to +33
XLA_ASSIGN_OR_RETURN(
    dist_runtime_service_,
    xla::GetDistributedRuntimeService(dist_service_addr, service_options),
    "Failed to initialize distributed runtime service.");
Collaborator Author

An error status from xla::GetDistributedRuntimeService() call will get its error message overwritten by this macro. In the end, the user will see:

RuntimeError: Failed to initialize distributed runtime service.

@ysiraichi ysiraichi marked this pull request as ready for review June 25, 2025 16:51
@zhanyong-wan (Collaborator) left a comment

Nice work!

@@ -37,6 +37,7 @@ extern const char* const kEnvPjrtDynamicPlugins;
extern const char* const kEnvDistSvcHeartbeatIntervalInSec;
extern const char* const kEnvDistSvcMaxMissingHeartbeats;
extern const char* const kEnvDistSvcShutdownTimeoutInMin;
extern const char* const kEnvShowCppErrorContext;
Collaborator

Document how this env var affects the torch_xla behavior?

private:
// Convenience function called by `Create()` that initializes the current
// XlaCoordinator.
absl::Status Initialize(int global_rank, int world_size,
Collaborator

Document the parameters?

@@ -41,7 +48,17 @@ class XlaCoordinator {
// false otherwise.
bool ReachedSyncPoint(int step);

// Creates a new instance of XlaCoordinator, and initializes it.
static absl::StatusOr<absl_nonnull std::unique_ptr<XlaCoordinator>> Create(
Collaborator

Document the parameters?

@@ -12,11 +13,17 @@ namespace runtime {
// XlaCoordinator serves as the point of entry for all operations which
// required the XLA distributed runtime, such as preemption coordination.
class XlaCoordinator {
private:
// Private struct for making the constructor private, but still callable
// with std::make_unique<T>() function.
Collaborator

Replace with

// with std::make_unique<XlaCoordinator>(PrivateUse()) within this class?

@@ -0,0 +1,81 @@
#ifndef XLA_TORCH_XLA_CSRC_STATUS_H_
Collaborator

Unit tests for this library?


namespace torch_xla {

bool showCppErrorContext() {
Collaborator

Style: CamelCase function name.


namespace torch_xla {

bool showCppErrorContext() {
Collaborator

Style: mark this static.

namespace torch_xla {

bool showCppErrorContext() {
return runtime::sys_util::GetEnvBool(runtime::env::kEnvShowCppErrorContext,
Collaborator

This is called at every Status error construction site. It must be efficient. Use a local static const bool variable to remember the result so that we only call GetEnvBool once?
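A minimal sketch of the caching suggested here, assuming a simple boolean parser: `ReadBoolEnv` is a hypothetical stand-in for `runtime::sys_util::GetEnvBool`, whose actual parsing rules are not shown in this thread. The function-local `static const` is initialized once, on first call, and C++11 and later guarantee that this initialization is thread-safe.

```cpp
#include <cstdlib>
#include <string_view>

// Hypothetical parser: treats "1" and "true" as enabled. The real
// GetEnvBool may accept other spellings.
bool ReadBoolEnv(const char* name, bool default_value) {
  const char* raw = std::getenv(name);
  if (raw == nullptr) return default_value;
  std::string_view value(raw);
  return value == "1" || value == "true";
}

// The environment is read only on the first call; later calls return the
// cached result without touching the environment again.
bool ShowCppErrorContext() {
  static const bool show_context =
      ReadBoolEnv("XLA_SHOW_CPP_ERROR_CONTEXT", /*default_value=*/false);
  return show_context;
}
```

One consequence of this design is that changing the variable after the first call has no effect for the rest of the process, which may matter for tests that toggle it.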

//
// The idea is that whenever `new_message` is given, it should have more
// context to give a better error message to the user.
std::string_view message = (new_message.empty()) ? old_message : new_message;
Collaborator

Nit: remove () around new_message.empty().

absl::StatusOr<absl_nonnull std::unique_ptr<XlaCoordinator>>
XlaCoordinator::Create(int global_rank, int world_size, std::string master_addr,
std::string port) {
auto coordinator = std::make_unique<XlaCoordinator>(PrivateUse());
Collaborator

Can this be achieved with

auto coordinator = std::make_unique<XlaCoordinator>(new XlaCoordinator());

and place the default constructor privately?

private:
  XlaCoordinator();

Collaborator

It cannot. (The code won't compile as there's no XlaCoordinator(XlaCoordinator*) ctor.)
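For context on why a tag type is used: `std::make_unique` constructs the object from inside the standard library, so it can only call a public constructor. A private default constructor would therefore not work either. The sketch below shows the general "passkey" pattern under that constraint. The `Coordinator` class and the layout of `PrivateUse` are assumptions made for illustration, not the PR's actual definitions.

```cpp
#include <memory>

class Coordinator {
 private:
  // Passkey type: only members of Coordinator can name it. The explicit
  // constructor keeps outside code from creating a key implicitly with `{}`.
  struct PrivateUse {
    explicit PrivateUse() = default;
  };

 public:
  // Public so that std::make_unique can call it, but effectively private
  // because callers outside the class cannot create a PrivateUse.
  explicit Coordinator(PrivateUse) {}

  // Factory function: the only way for outside code to get an instance.
  static std::unique_ptr<Coordinator> Create(int global_rank) {
    auto coordinator = std::make_unique<Coordinator>(PrivateUse());
    coordinator->global_rank_ = global_rank;
    return coordinator;
  }

  int global_rank() const { return global_rank_; }

 private:
  int global_rank_ = 0;
};
```

Outside the class, `Coordinator::Create(3)` works, while `Coordinator c{Coordinator::PrivateUse()};` fails to compile because `PrivateUse` is private.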

// This function also appends file location information to the error message, if
// `kEnvShowCppErrorContext` is set.
absl::Status MaybeWithNewMessage(absl::Status&& status, const char* file,
const int32_t line,
Collaborator

const in const int32_t is redundant. Similarly, std::string_view is read-only so const keyword isn't needed.

3 participants