Added f32/fp64 specializations for complex exp function. #4928
Updated the complex exp function to avoid under/overflow issues.

There are a few issues with the bounds in the current `cexp` implementation relating to under/overflow. The result is computed as

`(e^real * cos(imag), e^real * sin(imag))`

and the `e^real` factor can overflow even though the subsequent multiplication by `sin`/`cos` would bring the product back into floating-point range; in that case we incorrectly return `INF`. The naive formula can also produce `NaN` results: when `imag == 0.0` we have `sin(imag) == 0`, and if `exp` overflows the product is `inf * 0 == NaN`, even though the correct imaginary part is simply `0`.

This new version fixes all of these possible under/overflow issues, and by inlining some of the function calls and stripping out unneeded checks it does so with minimal perf disruption.
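For illustration, the general shape of such a fix is sketched below. This is a minimal hypothetical sketch: `cexp_scaled`, the `709.0` cutoff, and the `2^1024` scale factor are illustrative assumptions, not the actual code in this PR.

```cpp
#include <cmath>
#include <complex>

// Hypothetical sketch of an overflow-safe complex exp for fp64.
std::complex<double> cexp_scaled(double x, double y) {
    if (y == 0.0) {
        // Naive formula: exp(x) * sin(0) == inf * 0 == NaN when exp
        // overflows, but the correct imaginary part is just 0.
        return {std::exp(x), y};
    }
    if (x <= 709.0) {
        // exp(709) is still finite in fp64, so the naive formula is safe.
        double ex = std::exp(x);
        return {ex * std::cos(y), ex * std::sin(y)};
    }
    // Factor out 2^k so the intermediate exp stays finite; a tiny
    // sin/cos can then cancel the huge e^x before the scale is
    // reapplied, instead of the product overflowing to inf early.
    const int    k   = 1024;               // hypothetical scale choice
    const double ln2 = 0.6931471805599453;
    double ex = std::exp(x - k * ln2);     // == e^x / 2^k, finite
    return {std::scalbn(ex * std::cos(y), k),
            std::scalbn(ex * std::sin(y), k)};
}
```

The scaling step mirrors the well-known `__ldexp_cexp` trick used in FreeBSD-derived libm implementations.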
Perf
For fp64 precision, performance was faster on both H100 and Quadro RTX 8000. For fp32 precision, performance was only faster on the Quadro RTX 8000.

Using the math team's standard `math_bench` test, we measured operations/SM/cycle on H100 (averaged summary). (See https://docs.google.com/spreadsheets/d/1TdOpEbLgoL1QOWKjeO3pwFDDMoPr4iqnPjz_XrL7UEA for the raw/non-averaged data. NV internal.)
Correctness GPU/CPU
Before/after accuracy is similar in regions where the current function does not have issues. The two versions were compared with an intensive bracket-and-bisect search, along with testing of special hard values.
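For context, a bracket-and-bisect search of this kind can be sketched as follows. This is a minimal hypothetical harness, not the team's actual tooling; `ulp_error`, `worst_error`, and the sample interval are assumptions. It compares an fp32 implementation against an fp64 reference and repeatedly bisects toward the half-interval showing the larger error.

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical: error of an fp32 result, in ulps of the fp64 reference
// rounded to fp32.
static double ulp_error(float got, double ref) {
    float ref_f = static_cast<float>(ref);
    double ulp  = std::fabs(std::nextafterf(ref_f, INFINITY) - ref_f);
    return std::fabs(static_cast<double>(got) - ref) / ulp;
}

// Bracket [lo, hi], sample the error at the midpoint of each half, and
// keep bisecting toward the worse half, tracking the maximum seen.
template <class F, class Ref>
double worst_error(F f, Ref ref, double lo, double hi, int depth = 60) {
    double worst = 0.0;
    while (depth-- > 0 && lo < hi) {
        double mid  = 0.5 * (lo + hi);
        double a    = 0.5 * (lo + mid), b = 0.5 * (mid + hi);
        double e_lo = ulp_error(f(a), ref(a));
        double e_hi = ulp_error(f(b), ref(b));
        worst = std::fmax(worst, std::fmax(e_lo, e_hi));
        if (e_lo > e_hi) hi = mid; else lo = mid;
    }
    return worst;
}

int main() {
    // Example: probe fp32 exp against an fp64 reference near overflow.
    double w = worst_error(
        [](double x) { return std::exp(static_cast<float>(x)); },
        [](double x) { return std::exp(x); },
        80.0, 88.0);
    std::printf("worst observed error: %.2f ulp\n", w);
}
```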
Note on __half/__nv_bfloat16
If the header is hacked so that the `__half`/`__nv_bfloat16` versions call the new fp32 version, the result is nearly correctly-rounded functions that are also (perhaps surprisingly) faster than the template-generated functions here. However, due to the way the header is structured, `exp` cannot easily be specialized for these types: support for them is only included at the end of this file, after the generic template, so no template specialization is (easily) possible.
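A minimal hypothetical sketch of what routing `__half` through the fp32 path might look like, assuming the convert-call-convert shape described above (`cexp_via_fp32` is an illustrative name, not the header change itself):

```cpp
#include <cuda_fp16.h>
#include <cuda/std/complex>

// Hypothetical: promote a __half complex value to fp32, evaluate the
// new fp32 specialization, and demote the result back to __half.
__device__ cuda::std::complex<__half>
cexp_via_fp32(cuda::std::complex<__half> z) {
    cuda::std::complex<float> zf(__half2float(z.real()),
                                 __half2float(z.imag()));
    cuda::std::complex<float> rf = cuda::std::exp(zf);
    return {__float2half(rf.real()), __float2half(rf.imag())};
}
```

Since every `__half` value round-trips exactly through fp32, the only rounding beyond the fp32 computation is the final demotion, which is consistent with the nearly correctly-rounded results observed.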