-
Notifications
You must be signed in to change notification settings - Fork 544
Error Handling: refactor XlaCoordinator
to use status types.
#9386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Should be merged only after #9384. |
3bd1d2c
to
1b2ebca
Compare
a351a69
to
7e0da02
Compare
eaac1c3
to
87d2d12
Compare
This PR is a draft? Is it ready for review? If yes, please take it out of draft. Thanks! |
This is a draft. I'm still working on it. |
510dcce
to
a1b26a4
Compare
a78bc1a
to
4b71836
Compare
Status
and StatusOr<T>
QOL functions.XlaCoordinator
to use status types.
Add new Status and StatusOr utility functions that provide better error handling and contextual information: - `XLA_ERROR_WITH_LOCATION`: Creates Status with file location info - `XLA_RETURN_IF_ERROR` and `XLA_ASSIGN_OR_RETURN`: Enhanced macros for status propagation with optional custom error messages - `MaybeWithLocation` and `MaybeWithNewMessage`: Functions for enriching error messages - `XLA_SHOW_CPP_ERROR_CONTEXT` environment variable to control error context display
Replace direct constructor with `Create()` factory method that returns `StatusOr<unique_ptr<XlaCoordinator>>`: - Uses new `XLA_ASSIGN_OR_RETURN` and `XLA_RETURN_IF_ERROR` macros - Provides better error handling with contextual messages - Maintains same functionality with improved error propagation
4b71836
to
55214ac
Compare
XLA_ASSIGN_OR_RETURN( | ||
dist_runtime_service_, | ||
xla::GetDistributedRuntimeService(dist_service_addr, service_options), | ||
"Failed to initialize distributed runtime service."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An error status from xla::GetDistributedRuntimeService()
call will get its error message overwritten by this macro. In the end, the user will see:
RuntimeError: Failed to initialize distributed runtime service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
@@ -37,6 +37,7 @@ extern const char* const kEnvPjrtDynamicPlugins; | |||
extern const char* const kEnvDistSvcHeartbeatIntervalInSec; | |||
extern const char* const kEnvDistSvcMaxMissingHeartbeats; | |||
extern const char* const kEnvDistSvcShutdownTimeoutInMin; | |||
extern const char* const kEnvShowCppErrorContext; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document how this env var affects the torch_xla behavior?
private: | ||
// Convenience function called by `Create()` that initializes the current | ||
// XlaCoordinator. | ||
absl::Status Initialize(int global_rank, int world_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document the parameters?
@@ -41,7 +48,17 @@ class XlaCoordinator { | |||
// false otherwise. | |||
bool ReachedSyncPoint(int step); | |||
|
|||
// Creates a new instance of XlaCoordinator, and initializes it. | |||
static absl::StatusOr<absl_nonnull std::unique_ptr<XlaCoordinator>> Create( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document the parameters?
@@ -12,11 +13,17 @@ namespace runtime { | |||
// XlaCoordinator serves as the point of entry for all operations which | |||
// required the XLA distributed runtime, such as preemption coordination. | |||
class XlaCoordinator { | |||
private: | |||
// Private struct for making the constructor private, but still callable | |||
// with std::make_unique<T>() function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace with
// with std::make_unique<XlaCoordinator>(PrivateUse()) within this class?
@@ -0,0 +1,81 @@ | |||
#ifndef XLA_TORCH_XLA_CSRC_STATUS_H_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit tests for this library?
|
||
namespace torch_xla { | ||
|
||
bool showCppErrorContext() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: CamelCase
function name.
|
||
namespace torch_xla { | ||
|
||
bool showCppErrorContext() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: mark this static
.
namespace torch_xla { | ||
|
||
bool showCppErrorContext() { | ||
return runtime::sys_util::GetEnvBool(runtime::env::kEnvShowCppErrorContext, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called at every Status error construction site. It must be efficient. Use a local static const bool
variable to remember the result so that we only call GetEnvBool
once?
// | ||
// The idea is that whenever `new_message` is given, it should have more | ||
// context to give a better error message to the user. | ||
std::string_view message = (new_message.empty()) ? old_message : new_message; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: remove ()
around new_message.empty()
.
absl::StatusOr<absl_nonnull std::unique_ptr<XlaCoordinator>> | ||
XlaCoordinator::Create(int global_rank, int world_size, std::string master_addr, | ||
std::string port) { | ||
auto coordinator = std::make_unique<XlaCoordinator>(PrivateUse()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be achieved with
auto coordinator = std::make_unique<XlaCoordinator>(new XlaCoordinator());
and place the default constructor privately?
private:
XlaCoordinator();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It cannot. (The code won't compile as there's no XlaCoordinator(XlaCoordinator*)
ctor.)
// This function also appends file location information to the error message, if | ||
// `kEnvShowCppErrorContext` is set. | ||
absl::Status MaybeWithNewMessage(absl::Status&& status, const char* file, | ||
const int32_t line, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
in const int32_t
is redundant. Similarly, std::string_view
is read-only so const keyword isn't needed.
This PR does 3 different things:
Status
andStatusOr<T>
constructs, so as to make it easier to propagate non-ok status.XLA_SHOW_CPP_ERROR_CONTEXT
environment variable for toggling error context.XlaCoordinator
to use the new status propagation constructs as an exampleMainly, inspired by tensorflow implementation, it introduces the following status propagation functions:
XLA_ASSIGN_OR_RETURN(LHS, REXPR, ...)
, which either assigns the value held byREXPR
toLHS
if it holds a non-ok status code, or return the non-ok status.XLA_RETURN_IF_ERROR(EXPR, ...)
, which early-returns ifEXPR
is a non-ok status, propagating the error to the caller.XLA_ERROR_WITH_LOCATION(STATUS)
, which builds a newabsl::Status
fromSTATUS
, by maybe appending the current file locationC++ Error Handling
Idea: by default, print only the user targeted message, with no extra C++ details. Then, by setting the
XLA_SHOW_CPP_ERROR_CONTEXT
environment variable, also print extra context information, such as C++ source location, and other contextual messages.XLA_ASSIGN_OR_RETURN
andXLA_RETURN_IF_ERROR
. Rationale: they might have more context to create a more user-friendly message.XLA_SHOW_CPP_ERROR_CONTEXT
is set, the callee overwritten error messages are shown in the following linesExample errors:
XLA_SHOW_CPP_ERROR_CONTEXT=0
(default):XLA_SHOW_CPP_ERROR_CONTEXT=1
: