Added f32/fp64 specializations for complex exp function. #4928
Updated the complex exp function to avoid under/overflow issues.

There are a few issues with the bounds in the current `cexp` implementation relating to under/overflow. The result is computed as

`(e^real * cos(imag), e^real * sin(imag))`

and the `e^real` factor can overflow even though the subsequent multiplication by `sin`/`cos` would bring the product back into floating-point range; in that case we incorrectly return `INF`. The naive formula can also produce `NaN` results: when `imag == 0.0` we have `sin(imag) == 0`, and if `exp` overflows the product is `inf * 0 == NaN`, even though the correct imaginary part is simply `0`.

This new version fixes all of these possible under/overflow issues, and by inlining some of the function calls and stripping out unneeded checks it does so with minimal perf disruption.
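For illustration, the general shape of such a fix is sketched below. This is a minimal hypothetical sketch: `cexp_scaled`, the `709.0` cutoff, and the `2^1024` scale factor are illustrative assumptions, not the actual code in this PR.

```cpp
#include <cmath>
#include <complex>

// Hypothetical sketch of an overflow-safe complex exp for fp64.
std::complex<double> cexp_scaled(double x, double y) {
    if (y == 0.0) {
        // Naive formula: exp(x) * sin(0) == inf * 0 == NaN when exp
        // overflows, but the correct imaginary part is just 0.
        return {std::exp(x), y};
    }
    if (x <= 709.0) {
        // exp(709) is still finite in fp64, so the naive formula is safe.
        double ex = std::exp(x);
        return {ex * std::cos(y), ex * std::sin(y)};
    }
    // Factor out 2^k so the intermediate exp stays finite; a tiny
    // sin/cos can then cancel the huge e^x before the scale is
    // reapplied, instead of the product overflowing to inf early.
    const int    k   = 1024;               // hypothetical scale choice
    const double ln2 = 0.6931471805599453;
    double ex = std::exp(x - k * ln2);     // == e^x / 2^k, finite
    return {std::scalbn(ex * std::cos(y), k),
            std::scalbn(ex * std::sin(y), k)};
}
```

The scaling step mirrors the well-known `__ldexp_cexp` trick used in FreeBSD-derived libm implementations.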
Perf
For fp64 precision, performance was faster on both H100 and Quadro RTX 8000. For fp32 precision, performance was only faster on the Quadro RTX 8000.

Using the math team's standard `math_bench` test, we measured operations/SM/cycle on H100 (averaged summary). (See https://docs.google.com/spreadsheets/d/1TdOpEbLgoL1QOWKjeO3pwFDDMoPr4iqnPjz_XrL7UEA for the raw/non-averaged data. NV internal.)
Correctness GPU/CPU
Before/after accuracy is similar in regions where the current function does not have issues. The two versions were compared with an intensive bracket-and-bisect search, along with testing of special hard values.
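For context, a bracket-and-bisect search of this kind can be sketched as follows. This is a minimal hypothetical harness, not the team's actual tooling; `ulp_error`, `worst_error`, and the sample interval are assumptions. It compares an fp32 implementation against an fp64 reference and repeatedly bisects toward the half-interval showing the larger error.

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical: error of an fp32 result, in ulps of the fp64 reference
// rounded to fp32.
static double ulp_error(float got, double ref) {
    float ref_f = static_cast<float>(ref);
    double ulp  = std::fabs(std::nextafterf(ref_f, INFINITY) - ref_f);
    return std::fabs(static_cast<double>(got) - ref) / ulp;
}

// Bracket [lo, hi], sample the error at the midpoint of each half, and
// keep bisecting toward the worse half, tracking the maximum seen.
template <class F, class Ref>
double worst_error(F f, Ref ref, double lo, double hi, int depth = 60) {
    double worst = 0.0;
    while (depth-- > 0 && lo < hi) {
        double mid  = 0.5 * (lo + hi);
        double a    = 0.5 * (lo + mid), b = 0.5 * (mid + hi);
        double e_lo = ulp_error(f(a), ref(a));
        double e_hi = ulp_error(f(b), ref(b));
        worst = std::fmax(worst, std::fmax(e_lo, e_hi));
        if (e_lo > e_hi) hi = mid; else lo = mid;
    }
    return worst;
}

int main() {
    // Example: probe fp32 exp against an fp64 reference near overflow.
    double w = worst_error(
        [](double x) { return std::exp(static_cast<float>(x)); },
        [](double x) { return std::exp(x); },
        80.0, 88.0);
    std::printf("worst observed error: %.2f ulp\n", w);
}
```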
Note on __half/__nv_bfloat16
If the header is hacked so that the `__half`/`__nv_bfloat16` versions call the new fp32 version, the result is nearly correctly-rounded functions that are also (perhaps surprisingly) faster than the template-generated functions here. However, due to the way the header is structured, `exp` cannot easily be specialized for these types: support for them is only included at the end of this file, after the generic template, so no template specialization is (easily) possible.
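A minimal hypothetical sketch of what routing `__half` through the fp32 path might look like, assuming the convert-call-convert shape described above (`cexp_via_fp32` is an illustrative name, not the header change itself):

```cpp
#include <cuda_fp16.h>
#include <cuda/std/complex>

// Hypothetical: promote a __half complex value to fp32, evaluate the
// new fp32 specialization, and demote the result back to __half.
__device__ cuda::std::complex<__half>
cexp_via_fp32(cuda::std::complex<__half> z) {
    cuda::std::complex<float> zf(__half2float(z.real()),
                                 __half2float(z.imag()));
    cuda::std::complex<float> rf = cuda::std::exp(zf);
    return {__float2half(rf.real()), __float2half(rf.imag())};
}
```

Since every `__half` value round-trips exactly through fp32, the only rounding beyond the fp32 computation is the final demotion, which is consistent with the nearly correctly-rounded results observed.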