Added f32/fp64 specializations for complex exp function. #4928

Open · wants to merge 2 commits into main

Conversation

@s-oboyle commented Jun 6, 2025

Updated the complex exp function to avoid under/overflow issues.

There are a few under/overflow issues with the bounds used in the cexp function.

In the current version, the result is computed as

(e^real * cos(imag), e^real * sin(imag)),

and the e^real factor can overflow even though the subsequent multiplication by sin/cos would bring the value back into the floating-point range; in that case we incorrectly return INF.

It can also generate NaN results: when imag == 0.0, sin(imag) == 0, and if exp overflows the naive implementation computes INF * 0 = NaN, even though the correct imaginary result is 0.
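
As a minimal sketch of both failure modes (naive_cexp is a hypothetical stand-in mirroring the structure of the current implementation, not the actual libcu++ code):

```cpp
#include <cmath>
#include <complex>
#include <cstdio>

// Hypothetical stand-in: compute e^real once, then multiply by cos/sin.
static std::complex<double> naive_cexp(std::complex<double> z)
{
  double e = std::exp(z.real()); // overflows to INF for real > ~709.78
  return {e * std::cos(z.imag()), e * std::sin(z.imag())};
}

int main()
{
  // Failure 1: e^710 overflows, but the true real part is finite
  // because cos(pi/2) is ~6.1e-17; the naive code still prints inf.
  std::complex<double> a = naive_cexp({710.0, 1.5707963267948966});
  std::printf("(%g, %g)\n", a.real(), a.imag());

  // Failure 2: imag == 0, so the imaginary part should be exactly 0,
  // but INF * 0 yields NaN.
  std::complex<double> b = naive_cexp({710.0, 0.0});
  std::printf("(%g, %g)\n", b.real(), b.imag());
}
```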

This new version fixes all of these under/overflow issues and, by inlining some of the function calls and stripping out unneeded checks, does so with minimal performance impact.
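
For context, the usual way to avoid the spurious overflow is to peel a power of two out of the exponential and reapply it after the trig multiplication. The following is a sketch of that general technique under assumed constants, not necessarily what this PR does internally:

```cpp
#include <cmath>
#include <complex>

// Sketch of the rescaling trick for the overflow range only; the
// ordinary range would use the direct formula. The constant k = 512 is
// an assumed fixed shift; production code would also split ln(2) into
// hi/lo parts to avoid the rounding error of computing k*ln2 in one step.
static std::complex<double> scaled_cexp(double x, double y)
{
  const int    k    = 512;
  const double kln2 = k * 0.6931471805599453; // ~354.89
  double r = std::exp(x - kln2); // finite for x up to ~1064
  // Multiply by cos/sin while the magnitude is safely finite, then
  // reapply the 2^k scale; scalbn overflows or underflows only when
  // the true result does.
  return {std::scalbn(r * std::cos(y), k), std::scalbn(r * std::sin(y), k)};
}
```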

Perf

For fp64 precision, performance was faster on both the H100 and the Quadro RTX 8000.
For fp32 precision, performance was faster only on the Quadro RTX 8000.

Using the math team's standard math_bench test, we have the following on H100 (averaged summary), in operations/SM/cycle:

H100   old       new
fp64   0.66082   0.70204
fp32   1.61528   1.53386

(See https://docs.google.com/spreadsheets/d/1TdOpEbLgoL1QOWKjeO3pwFDDMoPr4iqnPjz_XrL7UEA for raw/non-averaged info. NV internal.)

Correctness GPU/CPU

The before/after accuracy is similar in regions where the current function does not have issues.
An intensive bracket-and-bisect search, along with testing of special hard values, gives the following maximum errors (a sketch of a ulp-distance measure appears after the listings):

GPU fp64:
Max ulp real error (2.98,0.9139) @ (5.217369075,0.849946261)    (0x4014de95ffae7653,0x3feb32c280518909)
        Ours = (121.7401997,138.5658332)    Ref = (121.7401997,138.5658332)
        Ours = (0x405e6f5f6e98d3e1,0x4061521b4e34f7e3)               Ref = (0x405e6f5f6e98d3de,0x4061521b4e34f7e2)

Max ulp imag error (0.666,2.852) @ (8.78730102e-12,3.14673129e+293)     (0x3da352cf8a03ca22,0x7cdf886be6d24ce9)
        Ours = (-0.9685123725,-0.2489654278)    Ref = (-0.9685123725,-0.2489654278)
        Ours = (0xbfeefe0da8ba09b5,0xbfcfde1961370bc6)               Ref = (0xbfeefe0da8ba09b6,0xbfcfde1961370bc9)
GPU fp32:
Max ulp real error (3.206,1.588) @ (39.83263779,7.382916255e+37)        (0x421f549f,0x7e5e2beb)
        Ours = (-1.419121705e+17,1.396640729e+17)    Ref = (-1.419121447e+17,1.396640558e+17)
        Ours = (0xdbfc162f,0x5bf817de)               Ref = (0xdbfc162c,0x5bf817dc)

Max ulp imag error (1.272,3.172) @ (5.828636646,5.381953992e+32)        (0x40ba8431,0x75d447d9)
        Ours = (228.6749878,-251.4684296)    Ref = (228.6749725,-251.4683838)
        Ours = (0x4364accc,0xc37b77eb)               Ref = (0x4364accb,0xc37b77e8)
CPU fp64:
Max ulp real error (2.322,1.704) @ (256.7589006,4.832923854e+61)        (0x40700c2474f00000,0x4cbe134ab9e75400)
        Ours = (-2.361656648e+111,-2.201020209e+111)    Ref = (-2.361656648e+111,-2.201020209e+111)
        Ours = (0xd70f6cabdc4ed10f,0xd70d497c09efb5e3)               Ref = (0xd70f6cabdc4ed111,0xd70d497c09efb5e5)

Max ulp imag error (0.8213,2.311) @ (-3.834169936,8.576369563e+12)      (0xc00eac6149836800,0x429f335dd850e400)
        Ours = (-0.0202071154,-0.007685414766)    Ref = (-0.0202071154,-0.007685414766)
        Ours = (0xbf94b12c8f1cc18c,0xbf7f7abdd1402e07)               Ref = (0xbf94b12c8f1cc18b,0xbf7f7abdd1402e05)
CPU fp32:
Max ulp real error (2.307,0.9254) @ (5.177964211,45.91706467)   (0x40a5b1e2,0x4237ab13)
        Ours = (-63.12431335,165.7052002)    Ref = (-63.12432098,165.7052155)
        Ours = (0xc27c7f4c,0x4325b488)               Ref = (0xc27c7f4e,0x4325b489)

Max ulp imag error (1.298,2.302) @ (-3.816472054,1.394906292e+38)       (0xc0744114,0x7ed1e1d6)
        Ours = (-0.02060239203,0.007731408812)    Ref = (-0.02060239017,0.007731407881)
        Ours = (0xbca8c659,0x3bfd57c2)               Ref = (0xbca8c658,0x3bfd57c0)
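
The fractional ulp figures above come from comparison against a higher-precision reference. As a rough sketch (not the math_bench harness itself), a whole-number ulp distance between two finite doubles can be measured via their ordered integer representations:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Map a double onto a monotonic integer scale so that adjacent
// representable values differ by exactly 1 (finite inputs assumed).
static std::int64_t to_ordered(double x)
{
  std::int64_t i;
  std::memcpy(&i, &x, sizeof i);
  return i >= 0 ? i : std::numeric_limits<std::int64_t>::min() - i;
}

// Whole-number ulp distance between a computed value and a reference.
static double ulp_distance(double computed, double reference)
{
  return std::fabs(static_cast<double>(to_ordered(computed) - to_ordered(reference)));
}
```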

Note on __half/__nv_bfloat16

If the header is hacked so that the __half/__nv_bfloat16 versions call the new fp32 version, the result is nearly correctly-rounded functions that are also (perhaps surprisingly) faster than the template-generated functions here.
However, due to the way the header is structured, it is not possible to easily specialize exp for these types: type support for them is only included at the end of the file, so no template specialization is (easily) possible.

```cpp
...
#if _LIBCUDACXX_HAS_NVFP16()
#  include <cuda/std/__complex/nvfp16.h>
#endif // _LIBCUDACXX_HAS_NVFP16()

#if _LIBCUDACXX_HAS_NVBF16()
#  include <cuda/std/__complex/nvbf16.h>
#endif // _LIBCUDACXX_HAS_NVBF16()
```
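
For illustration, the kind of shim described above would look roughly like the following (exp_via_fp32 is a hypothetical helper showing the conversion route, not code from this PR):

```cpp
#include <cuda/std/complex>
#include <cuda_fp16.h>

// Hypothetical shim: route the __half overload through the new fp32
// path by converting up, evaluating, and converting back down.
__host__ __device__ inline cuda::std::complex<__half>
exp_via_fp32(cuda::std::complex<__half> z)
{
  cuda::std::complex<float> zf(__half2float(z.real()), __half2float(z.imag()));
  cuda::std::complex<float> rf = cuda::std::exp(zf); // the new fp32 version
  return {__float2half(rf.real()), __float2half(rf.imag())};
}
```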

@s-oboyle requested a review from a team as a code owner June 6, 2025 15:29
@s-oboyle requested a review from wmaxey June 6, 2025 15:29
@github-project-automation bot moved this to Todo in CCCL Jun 6, 2025

@copy-pr-bot commented Jun 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app bot moved this from Todo to In Review in CCCL Jun 6, 2025