
Conversation

@fegin (Contributor) commented Aug 27, 2025

This PR refactors activation checkpointing by moving apply_ac() out of the llama3 parallelize.py module. Additionally, it introduces a warning about configuration combinations involving SAC, torch.compile, and flex_attention to inform users of potential issues.

This PR depends on pytorch/pytorch#161541

cc @drisspg @bdhirsh @soulitzer
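
For readers skimming the thread, here is a minimal sketch of the shape of the change: a shared apply_ac() that the per-model parallelize.py files can import, plus the new warning. The module layout, function signatures, the ac_config.mode field, and the _apply_ac_to_transformer_block helper are assumptions for illustration, not the exact code landed in this PR.

```python
# Hypothetical sketch of a shared activation-checkpoint helper; names and the
# warning condition are illustrative assumptions, not the merged code.
import torch
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper as ptd_checkpoint_wrapper,
)

from torchtitan.tools.logging import logger


def _apply_ac_to_transformer_block(block: torch.nn.Module, ac_config) -> torch.nn.Module:
    # Full AC shown for brevity; the selective-op path would instead build a
    # checkpoint context_fn from a save-list of ops.
    return ptd_checkpoint_wrapper(block, preserve_rng_state=False)


def apply_ac(model: torch.nn.Module, ac_config, *, model_compile_enabled: bool = False) -> None:
    """Wrap each transformer block with activation checkpointing."""
    # The kind of warning this PR introduces: selective AC (SAC) together with
    # FlexAttention is only expected to behave well under torch.compile.
    if ac_config.mode == "selective" and not model_compile_enabled:
        logger.warning(
            "Selective activation checkpointing with FlexAttention is only "
            "tested with torch.compile; without compile it may be slow or incorrect."
        )

    for layer_id, block in model.layers.named_children():
        model.layers.register_module(
            layer_id, _apply_ac_to_transformer_block(block, ac_config)
        )
```

A per-model parallelize.py would then call this shared apply_ac() instead of defining its own copy.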

@meta-cla bot added the CLA Signed label on Aug 27, 2025
@tianyu-l (Contributor) left a comment

  1. shall we update all the parallelize.py files to depend on this file?
  2. @xmfan mentioned that there could be silent numerical incorrectness when we compile MoE, which sounds concerning if we always recommend compiling to work with FlexAttention in this PR.

The inline review comment below was left on this hunk from the relocated activation-checkpoint code:

```python
from torchtitan.tools.logging import logger

# for selective op activation checkpointing
_save_list = {
```

I think it's preferable to create a customized list for each individual model where necessary, in addition to some default save_list. E.g. MoE and dense models may need different save_lists, and it seems bad to just mix everything together.

This refactor can happen in a separate PR.
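
One possible shape of that split, sketched under assumed names (_default_save_list, get_save_list, and the op choices here are illustrative, not torchtitan's actual lists):

```python
# Hypothetical split of the SAC save-list into a shared default plus optional
# per-model additions; the op choices are illustrative only.
import torch

_default_save_list = {
    torch.ops.aten.mm.default,
    torch.ops.aten._scaled_dot_product_efficient_attention.default,
    torch.ops.aten._scaled_dot_product_flash_attention.default,
    torch.ops._c10d_functional.reduce_scatter_tensor.default,
}


def get_save_list(extra_ops=None):
    """Return the default save-list, optionally extended by a specific model."""
    save_list = set(_default_save_list)
    if extra_ops:
        save_list |= set(extra_ops)
    return save_list


# A dense model could use the defaults, while an MoE model registers the extra
# ops it wants saved (e.g. the all-to-all used for token dispatch).
dense_save_list = get_save_list()
moe_save_list = get_save_list(
    extra_ops={torch.ops._c10d_functional.all_to_all_single.default}
)
```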

@fegin (Contributor, Author) commented Aug 27, 2025

> shall we update all the parallelize.py files to depend on this file?

I did, but somehow the change was not uploaded. It should be good now.

> @xmfan mentioned that there could be silent numerical incorrectness when we compile MoE, which sounds concerning if we always recommend compiling to work with FlexAttention in this PR.

Yes, I also don't want to force users to always use torch.compile. But even a hack to enable SAC + FlexAttention requires some discussion. For now, I would keep this suggestion since we are mainly focused on performance at the moment. We should make the hack work soon.

cc @drisspg @soulitzer @bdhirsh

@tianyu-l (Contributor) left a comment

sgtm

@fegin (Contributor, Author) commented Aug 27, 2025

uh, there are some conflicts, let me fix it

fegin added 3 commits August 27, 2025 14:42
@fegin force-pushed the chienchin/flex_sac branch from 9cc3dcf to 42a7cae on August 27, 2025 at 21:49
@fegin merged commit 8a6c9fe into main on Aug 29, 2025 (8 checks passed)
@fegin deleted the chienchin/flex_sac branch on August 29, 2025 at 04:43
@xmfan (Member) commented Aug 29, 2025

> @xmfan mentioned that there could be silent numerical incorrectness when we compile MoE

@tianyu-l Right now, I think this only applies when you set torch._dynamo.config.capture_scalar_outputs=True. Without it, we have small graphs, and I don't think we have any in-place ops + autograd functions in the same graph. For context, the issue is pytorch/pytorch#161275.
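
For context on the flag: capture_scalar_outputs controls whether Dynamo graph-breaks on scalar reads like Tensor.item(); turning it on is what produces the "bigger graphs" discussed here. A small self-contained illustration (not torchtitan code; the function f is made up):

```python
# Minimal illustration of torch._dynamo.config.capture_scalar_outputs.
import torch
import torch._dynamo


def f(x):
    n = x.sum().item()  # scalar read: normally forces a graph break here
    return x * n


x = torch.arange(4)

# Default (False): the .item() call splits f into several small graphs.
torch.compile(f)(x)

# With the flag on, the scalar is captured as an unbacked symbol and Dynamo
# can keep one larger graph for the whole function.
torch._dynamo.config.capture_scalar_outputs = True
torch._dynamo.reset()
torch.compile(f)(x)
```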

@bdhirsh commented Aug 29, 2025

@xmfan so just to confirm:

  1. capture_scalar_outputs=True gave us bigger graphs (fewer graph breaks).

  2. Those bigger graphs caused us to capture an in-place op + autograd.Function in the same dynamo region, causing the correctness issue linked above.

  3. Now that @soulitzer has removed the autograd.Function for all2all (see #1580, "Add config to AC to toggle early-stop and revert A2A autograd.Function workaround"), I would imagine it's safe to add back capture_scalar_outputs=True in titan if it gives meaningful perf wins, no?

@xmfan (Member) commented Aug 29, 2025

Yes on 1 and 2. For 3, there are other things broken with capture_scalar_outputs=True; it's still broken on main right now: #1649.
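
For readers unfamiliar with the failure mode referenced above: the risky shape is a custom autograd.Function whose output is then mutated in place, with both captured in one compiled region. A stripped-down, hypothetical stand-in for that pattern (not the actual all-to-all code, and not a reproducer for pytorch/pytorch#161275):

```python
# Hypothetical stand-in for the "in-place op + autograd.Function in the same
# graph" pattern discussed in this thread; illustrative only.
import torch


class AllToAllLike(torch.autograd.Function):
    """Placeholder for the all-to-all autograd.Function mentioned above."""

    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out


def moe_like_block(x):
    y = AllToAllLike.apply(x)
    y.mul_(2)  # in-place op on the autograd.Function's output
    return y.sum()


x = torch.randn(8, requires_grad=True)
# With graph breaks the two pieces land in separate regions; in the MoE case,
# larger graphs made it possible for both to be captured together.
torch.compile(moe_like_block)(x).backward()
```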

wwwjn pushed a commit that referenced this pull request Sep 3, 2025