From NVIDIA Megatron-LM for visibility #18


Open
wants to merge 4,906 commits into
base: multi-query-attention
Commits
23e6471
ADLR/megatron-lm!2949 - perf(mla, experimental): MLA RoPE fusion and …
hxbai Jun 2, 2025
9c1a535
Merge branch 'hongxiaob/mla_rope' into 'main'
ko3n1g Jun 2, 2025
da3f0ff
ADLR/megatron-lm!3280 - Fix custom FSDP float8 tensor set_item
shjwudp Jun 3, 2025
549d637
Merge branch 'fix_cfsdp_fp8_param_load' into 'main'
chtruong814 Jun 3, 2025
24c60db
ADLR/megatron-lm!3401 - ci: Move queue blocker
ko3n1g Jun 3, 2025
cfea2ea
Merge branch 'ko3n1g/ci/move-queue-blocker' into 'main'
ko3n1g Jun 3, 2025
37b0afd
ADLR/megatron-lm!3400 - ci: Improve error-handling of missing logs
ko3n1g Jun 4, 2025
6a62a54
Merge branch 'ko3n1g/ci/better-log-failure-handling' into 'main'
ko3n1g Jun 4, 2025
4648912
ADLR/megatron-lm!3408 - ci: Control job concurrency
ko3n1g Jun 4, 2025
cde60ce
Merge branch 'ko3n1g/ci/job-concurrency' into 'main'
ko3n1g Jun 4, 2025
eab047c
ADLR/megatron-lm!3412 - ci: Catch missing logs
ko3n1g Jun 4, 2025
25a26ca
Merge branch 'ko3n1g/ci/fix-no-log' into 'main'
ko3n1g Jun 4, 2025
9bdfe31
ADLR/megatron-lm!3411 - ci: Remove tests from A100
ko3n1g Jun 4, 2025
ff64f96
Merge branch 'ko3n1g/ci/move-tests' into 'main'
ko3n1g Jun 4, 2025
d960800
ADLR/megatron-lm!3393 - Add an option to skip counting zeros in grad …
erhoo82 Jun 5, 2025
b47a9bb
Merge branch 'no_count_zeros' into 'main'
ko3n1g Jun 5, 2025
bc80491
ADLR/megatron-lm!3326 - Add an interface to set high priority stream …
youngeunkwon0405 Jun 5, 2025
957f348
Merge branch 'comm-priority-setting' into 'main'
ko3n1g Jun 5, 2025
7af72f9
ADLR/megatron-lm!3241 - Llama4 inference
wdykas Jun 6, 2025
4eb36f8
Merge branch 'llama4-inference' into 'main'
chtruong814 Jun 6, 2025
61a42f6
ADLR/megatron-lm!3421 - Change default value of high_priority_stream_…
youngeunkwon0405 Jun 6, 2025
7c64be3
Merge branch 'comm-priority-patch' into 'main'
jaredcasper Jun 6, 2025
92d68da
ADLR/megatron-lm!3170 - [feat, moe]: FP8 padding optimization of MoE …
Victarry Jun 9, 2025
140dce2
Merge branch 'denliu/router_pad' into 'main'
ko3n1g Jun 9, 2025
9e3adb5
ADLR/megatron-lm!3306 - Remove deprecated alltoall_seq dispatcher.
Victarry Jun 9, 2025
823466e
Merge branch 'denliu/remove_alltoall_seq_dispatcher' into 'main'
ko3n1g Jun 9, 2025
db07e3f
ADLR/megatron-lm!3347 - Fix flash decode bug caused by unnecessary ro…
santhnm2 Jun 9, 2025
2e15d12
Merge branch 'hybrid_example' into 'main'
ko3n1g Jun 9, 2025
1589517
ADLR/megatron-lm!3404 - Fix perf issues with NVTX range profiling
Jun 9, 2025
b04c901
Merge branch 'nvtx_perf_fix' into 'main'
ko3n1g Jun 9, 2025
791454d
ADLR/megatron-lm!3385 - Enforce param group ordering after checkpoint…
skierat Jun 9, 2025
40cb6e7
Merge branch 'skierat/fix_param_groups' into 'main'
ko3n1g Jun 9, 2025
54cdc7a
ADLR/megatron-lm!3399 - [MM] [Bug Fix] model parameter dtype, embeddi…
cuichenx Jun 10, 2025
d1409db
Merge branch 'chcui/llama-nemotron-nano-vl-8b' into 'main'
ko3n1g Jun 10, 2025
629b615
Revert "Merge branch 'chcui/llama-nemotron-nano-vl-8b' into 'main'"
ko3n1g Jun 10, 2025
50a1247
Reapply "Merge branch 'chcui/llama-nemotron-nano-vl-8b' into 'main'"
ko3n1g Jun 10, 2025
5ae21f8
Revert "ADLR/megatron-lm!3399 - [MM] [Bug Fix] model parameter dtype,…
ko3n1g Jun 10, 2025
62e7e60
ADLR/megatron-lm!3332 - fix(mtp): Fix issue with MTP+VPP after !3108 …
shifangx Jun 11, 2025
ad36348
Merge branch 'shifang/fix_vp_stage' into 'main'
ko3n1g Jun 11, 2025
0f4f095
ADLR/megatron-lm!3384 - Expose TE fused MLP with module spec
timmoon10 Jun 11, 2025
0595ef2
Merge branch 'mfutrega/fused_swiglu' into 'main'
ko3n1g Jun 11, 2025
9e5fe7a
ADLR/megatron-lm!3403 - Moe inference functional tests
wdykas Jun 12, 2025
0dea9a5
Merge branch 'moe-tests' into 'main'
ko3n1g Jun 12, 2025
80d66ec
ADLR/megatron-lm!3458 - ci: Benchmark release tests suite with TE2.2 …
ko3n1g Jun 12, 2025
a3e2222
Merge branch 'ko3n1g/chore/release-benchmarks-dev' into 'main'
ko3n1g Jun 12, 2025
15e4446
ADLR/megatron-lm!3371 - Move data to GPU for TP data processing
parthmannan Jun 12, 2025
d58f062
Merge branch 'pmannan/improve_data_processing' into 'main'
ko3n1g Jun 12, 2025
f5cfc10
Reapply "ADLR/megatron-lm!3399 - [MM] [Bug Fix] model parameter dtype…
ko3n1g Jun 12, 2025
5bb6cf3
update golden values
ko3n1g Jun 12, 2025
603592a
ADLR/megatron-lm!3366 - Optimize dummy weight tensors for cudagraph a…
gdengk Jun 12, 2025
40bfaf5
Merge branch 'gaod/llama4/cudagraph_optimize' into 'main'
ko3n1g Jun 12, 2025
6782fe4
ADLR/megatron-lm!3377 - Add --enable-experimental to args.
Victarry Jun 12, 2025
32737be
Merge branch 'denliu/add_enable_experimental_flag' into 'main'
ko3n1g Jun 12, 2025
e63aee4
ADLR/megatron-lm!3281 - perf(MLA): MLA down proj switch back to TELinear
yuzhongw-nvidia Jun 13, 2025
ae63c41
Merge branch 'mla_down_proj_telinear' into 'main'
ko3n1g Jun 13, 2025
9042182
ADLR/megatron-lm!3463 - ci: Retry on network errors
ko3n1g Jun 13, 2025
819f752
Merge branch 'ko3n1g/ci/wait-resources-resiliency' into 'main'
ko3n1g Jun 13, 2025
b8605c6
ADLR/megatron-lm!3361 - Add TE functional tests
ko3n1g Jun 13, 2025
107fc72
Merge branch 'ko3n1g/guyueh/te_functional_tests' into 'main'
ko3n1g Jun 13, 2025
effa991
revert
ko3n1g Jun 13, 2025
ad7d1df
ci: Restart on cuda error
ko3n1g Jun 13, 2025
f21a28b
Revert "ADLR/megatron-lm!3281 - perf(MLA): MLA down proj switch back …
ko3n1g Jun 13, 2025
a4fc916
Merge branch 'ko3n1g/ci/restart-on-cuda'
ko3n1g Jun 13, 2025
7f7ffcf
Merge branch 'ko3n1g/chore/re-apply-3399'
ko3n1g Jun 13, 2025
73558db
ci: Set gpt-nemo tests as allowed to fail
ko3n1g Jun 13, 2025
42f7f7f
ci: Fix while loop
ko3n1g Jun 13, 2025
0bbcbb1
ADLR/megatron-lm!3024 - Added support for offloading Swiglu activatio…
sanandaraj5597 Jun 13, 2025
fdcf52b
Merge branch 'swiglu_offload' into 'main'
ericharper Jun 13, 2025
cfe7b06
ADLR/megatron-lm!3279 - Fix MoE Aux loss
aklife97 Jun 13, 2025
aaddc23
Merge branch 'akhattar/auxloss_fix' into 'main'
ko3n1g Jun 13, 2025
db8cd9a
ADLR/megatron-lm!3429 - llama 3p1 nemotron nano vl 8b v1 instructions
Jun 13, 2025
dca59c6
Merge branch 'matthieul/llama_3p1_nemotron_nano_vl_8b_v1' into 'main'
ko3n1g Jun 13, 2025
9caa5d3
ADLR/megatron-lm!3289 - Fix attention unit test
santhnm2 Jun 14, 2025
8a03b29
Merge branch 'attention_unit_test_fix' into 'main'
ko3n1g Jun 14, 2025
04c93ae
ADLR/megatron-lm!3265 - Handle strict argument for local checkpointing
Jun 14, 2025
59ae4e3
Merge branch 'jszulc/local-ckpt-strict-loading' into 'main'
ko3n1g Jun 14, 2025
77732c3
ADLR/megatron-lm!2795 - feat(Pipeline Parallel, MoE): Flexible Asymme…
Shunkangz Jun 14, 2025
aec50ee
Merge branch 'flexible_vpp' into 'main'
ko3n1g Jun 14, 2025
19d30fa
ADLR/megatron-lm!3317 - Fix version check of test_fp8_param.py
kunlunl Jun 14, 2025
48396b2
Merge branch 'kunlunl/fix_fp8_param_ut_version_check' into 'main'
ko3n1g Jun 14, 2025
0d549aa
ADLR/megatron-lm!3461 - Fix common state comparison primitive
mikolajblaz Jun 14, 2025
de3da90
Merge branch 'mblaz/fix-dict-utils-diff' into 'main'
ko3n1g Jun 14, 2025
f2116e2
ADLR/megatron-lm!3153 - Update inference README
mathemakitten Jun 14, 2025
a981bf8
Merge branch 'helenn-update-inference-readme' into 'main'
jaredcasper Jun 14, 2025
d920c0d
ADLR/megatron-lm!3345 - M4 Taskforce: update get_rank & get_size of PG
yaoyu-33 Jun 14, 2025
fabb0a0
Merge branch 'yuya/m4_get_rank_get_size_of_pg_update' into 'main'
ko3n1g Jun 14, 2025
03322c1
ADLR/megatron-lm!3448 - CRADIO-g support
Jun 14, 2025
c85b6e7
Merge branch 'tpoon/cradio-g-mr' into 'main'
ko3n1g Jun 14, 2025
9d509a0
ADLR/megatron-lm!3127 - feat(optimizer): Support bf16 dtype for optim…
BestJuly Jun 14, 2025
083b1dc
Merge branch 'lit/support_bf16_optimzer_states' into 'main'
ko3n1g Jun 14, 2025
9900d9a
ADLR/megatron-lm!3379 - Megatron SFT
Jun 14, 2025
775a1d1
Merge branch 'megatron-main-sft' into 'main'
ko3n1g Jun 14, 2025
ee56591
ADLR/megatron-lm!3376 - Fix cuda graph for MambaLayer
guyueh1 Jun 14, 2025
5b4e466
Merge branch 'fix_cuda_graph_for_ssm' into 'main'
ko3n1g Jun 14, 2025
e3ec174
ADLR/megatron-lm!2276 - Add Mamba context parallel
duncanriach Jun 14, 2025
55080a3
Merge branch 'duncan/mamba-context-parallel' into 'main'
ericharper Jun 14, 2025
d559555
ADLR/megatron-lm!3415 - [MXFP8]Reduce memory footprint by initializin…
Jun 14, 2025
bcf96e3
Merge branch 'qiyuw/mxfp8-param' into 'main'
ko3n1g Jun 14, 2025
66194b7
ADLR/megatron-lm!3462 - Add hybrid functional inference test
wdykas Jun 14, 2025
d738935
Merge branch 'mamba-inference-test' into 'main'
ko3n1g Jun 14, 2025
bf6e998
ADLR/megatron-lm!3316 - added llama model training example with FP8
sbhavani Jun 14, 2025
38e30f5
Merge branch 'main' into 'main'
ko3n1g Jun 14, 2025
0f05866
ADLR/megatron-lm!3387 - feat(MoE): Using `te_general_gemm` to handle …
hxbai Jun 14, 2025
dc8372b
Merge branch 'hongxiaob/custom_router_gating' into 'main'
ko3n1g Jun 14, 2025
1674ce3
ADLR/megatron-lm!3190 - Mark weights from vision encoder to be non-te…
wdykas Jun 14, 2025
a165235
Merge branch 'hf-diverge-fix' into 'main'
ko3n1g Jun 14, 2025
0431153
ADLR/megatron-lm!2850 - Granular upcycling implementation
shifangx Jun 15, 2025
c2fb1de
Merge branch 'shifang/granular_upcycling' into 'main'
ko3n1g Jun 15, 2025
a0937dd
ADLR/megatron-lm!3424 - Add GPU energy (and ~power) monitoring for tr…
Jun 15, 2025
cca17b7
Merge branch 'energy-monitoring' into 'main'
ko3n1g Jun 15, 2025
8333bd5
ADLR/megatron-lm!3217 - feat(MoE): Support ep a2a overlap - (01) Add …
Wohox Jun 16, 2025
3e55583
Merge branch 'pingtianl/fine_grained_transformer_layer_submodules' in…
ko3n1g Jun 16, 2025
5005416
ADLR/megatron-lm!3397 - build: Switch to uv
ko3n1g Jun 16, 2025
0df9325
Merge branch 'ko3n1g/build/refactor-setup' into 'main'
ko3n1g Jun 16, 2025
59f2093
ADLR/megatron-lm!3468 - build: Simplify nemo image
ko3n1g Jun 16, 2025
df7401b
Merge branch 'ko3n1g/build/simplify-nemo-image' into 'main'
ko3n1g Jun 16, 2025
2b1c2d6
ADLR/megatron-lm!3272 - Make completions endpoint use MCore inference…
santhnm2 Jun 16, 2025
c40f31f
Merge branch 'completions_endpoint_fix' into 'main'
ko3n1g Jun 16, 2025
2b11af0
ADLR/megatron-lm!3420 - Implement dist-ckpt content versioning
mikolajblaz Jun 16, 2025
83a0f5a
Merge branch 'mblaz/dist-ckpt-content-versioning' into 'main'
ko3n1g Jun 16, 2025
8c1d0c7
ADLR/megatron-lm!3451 - fix (ckpt): Fix `_extra_state` for TE 2.5
yaox12 Jun 16, 2025
6bf889f
Merge branch 'xiny/fix_extra_state' into 'main'
ko3n1g Jun 16, 2025
6dc6050
ADLR/megatron-lm!3081 - Add Hybrid Shard Data-Parallel Support for Cu…
shjwudp Jun 16, 2025
aad967f
Merge branch 'custom_fsdp_hsdp_support' into 'main'
ko3n1g Jun 16, 2025
c7cf075
ADLR/megatron-lm!3450 - Revert `fork` to `spawn` based on stability i…
sbak5 Jun 16, 2025
c8f2f56
Merge branch 'sbak/ckpt_manager_fix' into 'main'
jaredcasper Jun 16, 2025
f7e4641
ADLR/megatron-lm!3301 - Add kitchen extension with per-layer configur…
kwyss-nvidia Jun 16, 2025
8c15450
Merge branch 'kwyss/megatron_kitchen_extension' into 'main'
jaredcasper Jun 16, 2025
1e8e9a4
ADLR/megatron-lm!3474 - Add deprecation warning for legacy inference
santhnm2 Jun 17, 2025
b87f147
Merge branch 'legacy_deprecation_warning' into 'main'
ko3n1g Jun 17, 2025
ab77e52
ADLR/megatron-lm!3181 - Change naming of original_max_position_embedd…
BoxiangW Jun 17, 2025
2386c6c
Merge branch 'boxiangw/mla-yarn-change-option-name' into 'main'
ericharper Jun 17, 2025
fee5600
ADLR/megatron-lm!3472 - Make cudagraph replay check more descriptive …
mathemakitten Jun 17, 2025
c3dc507
Merge branch 'helenn-flag-specific-error-for-cudagraph-replay' into '…
ericharper Jun 17, 2025
db70ed4
ADLR/megatron-lm!3414 - M4 Taskforce: Disable T5 and encoder_and_deco…
yaoyu-33 Jun 17, 2025
5615930
Merge branch 'yuya/m4_remove_encoder_pp_tests_ci_add_deprecation' int…
ko3n1g Jun 17, 2025
e0b2c60
ADLR/megatron-lm!3444 - Quick fix for NeMo: handle alternate key name…
skierat Jun 17, 2025
bfa39e8
Merge branch 'skierat/quick_nemo_fix' into 'main'
ko3n1g Jun 17, 2025
0e3af7e
ADLR/megatron-lm!3477 - chore: Bump version 0.14.0
ko3n1g Jun 17, 2025
27c9b6c
Merge branch 'ko3n1g/chore/release-version-0.14.0' into 'main'
ericharper Jun 17, 2025
3987e89
ADLR/megatron-lm!3071 - Added offloading support for MCore layers
sanandaraj5597 Jun 17, 2025
4a91173
Merge branch 'lora_offload' into 'main'
ericharper Jun 17, 2025
115785f
ADLR/megatron-lm!3437 - Bug fix to reset kv chunks assigned to -1 and…
shanmugamr1992 Jun 18, 2025
3b0f763
Merge branch 'bugFixDE' into 'main'
shanmugamr1992 Jun 18, 2025
642a181
ADLR/megatron-lm!3483 - chore: Add init to tools
ko3n1g Jun 18, 2025
0710137
Merge branch 'ko3n1g/chore/tool-init' into 'main'
ko3n1g Jun 18, 2025
171c351
ADLR/megatron-lm!3480 - Fix unit test test_fp8_param.py blockwise sca…
guyueh1 Jun 18, 2025
57082f9
Merge branch 'fix_2425' into 'main'
ko3n1g Jun 18, 2025
9f1c4b2
ADLR/megatron-lm!3492 - chore: Add init to examples
ko3n1g Jun 18, 2025
6ac5633
Merge branch 'ko3n1g/chore/examples-init' into 'main'
ko3n1g Jun 18, 2025
2074d19
ADLR/megatron-lm!3493 - build: Force pin down setuptools
ko3n1g Jun 18, 2025
0600a3c
Merge branch 'ko3n1g/build/fix-setuptools-version' into 'main'
ko3n1g Jun 18, 2025
a002d50
ADLR/megatron-lm!3341 - Pad input tensors and enable fp8 weights for …
santhnm2 Jun 18, 2025
6a6cd47
Merge branch 'fp8_inference' into 'main'
ko3n1g Jun 18, 2025
2151c65
ADLR/megatron-lm!3398 - M4 Taskforce: Add HyperCommGrid: N-Dimensiona…
yaoyu-33 Jun 26, 2025
45400df
Merge branch 'yuya/m4_hyper_comm_grid' into 'main'
chtruong814 Jun 26, 2025
db59202
ADLR/megatron-lm!3508 - Pass strict=False to load_checkpoint in infer…
mathemakitten Jun 26, 2025
1ab876d
Merge branch 'helenn-allow-loading-unstrict-checkpoint' into 'main'
deepakn94 Jun 26, 2025
9964092
ADLR/megatron-lm!3526 - Skip fused rope check if te version < 1.4.0
BoxiangW Jun 27, 2025
878d65f
Merge branch 'boxiangw/skip-te-fused-rope-test' into 'main'
ko3n1g Jun 27, 2025
e2d16c0
ADLR/megatron-lm!3529 - ci: Misc refactorings
ko3n1g Jun 27, 2025
cc3ed64
Merge branch 'ko3n1g/chore/some-fixes' into 'main'
ko3n1g Jun 27, 2025
1e42279
ADLR/megatron-lm!3284 - Add option to load main params from checkpoin…
kunlunl Jun 27, 2025
c203e6a
Merge branch 'kunlunl/load_main_params_from_ckpt' into 'main'
ko3n1g Jun 27, 2025
881dfe4
ADLR/megatron-lm!3328 - MiMO VLM training example and functional tests
yashaswikarnati Jun 28, 2025
6b70889
Merge branch 'yash/mimo_train_loop_mr' into 'main'
ko3n1g Jun 28, 2025
4ba4542
ADLR/megatron-lm!3539 - test: Disable apex tests
ko3n1g Jun 30, 2025
d125627
Merge branch 'ko3n1g/test/disable-apex-tests' into 'main'
ko3n1g Jun 30, 2025
5e34e9c
ADLR/megatron-lm!3533 - Added double buffering switch for offloading
sanandaraj5597 Jun 30, 2025
8a416d0
Merge branch 'double_buffering_interface' into 'main'
jaredcasper Jun 30, 2025
7fd003f
ADLR/megatron-lm!3440 - Add vp_stage attr to FSDP wrapper.
cspades Jul 1, 2025
5e0e2c7
Merge branch 'cye/fsdp-vp-stage-fix' into 'main'
ericharper Jul 1, 2025
6d5670e
ADLR/megatron-lm!3544 - tests: Disable Apex tests (part 2)
ko3n1g Jul 1, 2025
805f3b8
Merge branch 'ko3n1g/tests/disable-apex-tests-2' into 'main'
ko3n1g Jul 1, 2025
e392d40
ADLR/megatron-lm!3456 - Fix num_warmup_microbatches for PP=1 CUDA gra…
buptzyb Jul 1, 2025
c237a3d
Merge branch 'robinz/fix_schedule' into 'main'
ko3n1g Jul 1, 2025
8e7428e
ADLR/megatron-lm!3547 - tests: Remove multimodal test
ko3n1g Jul 1, 2025
720ea36
Merge branch 'ko3n1g/ci/nightlies' into 'main'
ko3n1g Jul 1, 2025
f06fa41
ADLR/megatron-lm!3549 - build: Guard modelopt on macOS
ko3n1g Jul 1, 2025
76144fe
Merge branch 'ko3n1g/build/guard-modelopt' into 'main'
ko3n1g Jul 1, 2025
4c092ba
ADLR/megatron-lm!3525 - Fix TE version change on rope_fusion
BoxiangW Jul 2, 2025
683895b
Merge branch 'boxiangw/te-rope-fusion-fix' into 'main'
ko3n1g Jul 2, 2025
106ca9b
ADLR/megatron-lm!3554 - ci: Retry on `Call to CUDA function failed.`
ko3n1g Jul 2, 2025
809aab6
Merge branch 'ko3n1g/ci/restart-cuda-error' into 'main'
ko3n1g Jul 2, 2025
915ae4c
tests(hotfix): Update golden values file
ko3n1g Jul 2, 2025
6d1e2d7
ADLR/megatron-lm!3545 - Fix FSDP-double-buffer
youngeunkwon0405 Jul 2, 2025
6f6968f
Merge branch 'fix_fsdp_double_buffer' into 'main'
ko3n1g Jul 2, 2025
f7ba245
ADLR/megatron-lm!3557 - Fix 'apex.contrib.nccl_allocator' has no attr…
youngeunkwon0405 Jul 2, 2025
b61e211
Merge branch 'fix_nccl_allocator_error' into 'main'
ko3n1g Jul 2, 2025
a82fa72
ADLR/megatron-lm!3478 - Fix zero grad_norm when enabling precision-aw…
BestJuly Jul 2, 2025
dc65034
Merge branch 'lit/fix_zero_grad_norm' into 'main'
ko3n1g Jul 2, 2025
d11ac8b
ADLR/megatron-lm!3561 - fix: md5 on FIPS enabled systems
ko3n1g Jul 3, 2025
da4f3d2
Merge branch 'ko3n1g/fix/md5-fips' into 'main'
ko3n1g Jul 3, 2025
1a0c467
ADLR/megatron-lm!3570 - Improve multimodal dataloader
duncanriach Jul 3, 2025
9b82cb1
Merge branch 'duncan/improve-multimodal-dataloader' into 'main'
trintamaki Jul 3, 2025
13a1c0f
ADLR/megatron-lm!3575 - ci: Flaky failure detection for unit tests
ko3n1g Jul 3, 2025
4573497
Merge branch 'ko3n1g/ci/flaky-failure-restart' into 'main'
ko3n1g Jul 3, 2025
2075e27
ADLR/megatron-lm!3574 - ci: Allow automated weekly prereleases
ko3n1g Jul 3, 2025
7e59266
Merge branch 'ko3n1g/ci/weekly-prerelease' into 'main'
ko3n1g Jul 3, 2025
3aec857
ADLR/megatron-lm!3510 - Megatron inference test case: 583m Transforme…
mathemakitten Jul 4, 2025
522ce2c
Merge branch 'helenn-test-tiny-transformer-cudagraphs' into 'main'
ko3n1g Jul 4, 2025
468ed28
ADLR/megatron-lm!3511 - Megatron inference test case: 2B hybrid + cud…
mathemakitten Jul 4, 2025
1b6d55d
Merge branch 'helenn-hybrid-cudagraphs-testcase' into 'main'
ko3n1g Jul 4, 2025
4003d70
ADLR/megatron-lm!3439 - M4 Taskforce: Remove Encoder PP related Funct…
yaoyu-33 Jul 4, 2025
6551940
Merge branch 'yuya/m4_remove_encoder_pp' into 'main'
ko3n1g Jul 4, 2025
0b9cc0e
ADLR/megatron-lm!3579 - Add new parameters to unit test
Jul 4, 2025
e664319
Merge branch 'rprenger/update_unit_test' into 'main'
ko3n1g Jul 4, 2025
78bba82
ci(hotfix): Exit code on integration tests
ko3n1g Jul 4, 2025
509384d
Revert "ADLR/megatron-lm!3439 - M4 Taskforce: Remove Encoder PP relat…
ko3n1g Jul 4, 2025
333dfd9
ci(hotfix): No false restart
ko3n1g Jul 4, 2025
bf02bd6
ADLR/megatron-lm!3187 - Add async functionality to DynamicInferenceEn…
santhnm2 Jul 4, 2025
05079d5
Merge branch 'add_request_dynamic' into 'main'
ko3n1g Jul 4, 2025
9f67938
ADLR/megatron-lm!3552 - Fix TE version for interleaved fused RoPE
tomlifu Jul 5, 2025
94fcc71
Merge branch 'fix_rope_TE_version_lifuz' into 'main'
ko3n1g Jul 5, 2025
6eaf754
ADLR/megatron-lm!3506 - refactor: Safe imports
ko3n1g Jul 5, 2025
351107d
Merge branch 'ko3n1g/refactor/safe-imports' into 'main'
ko3n1g Jul 5, 2025
2795ebd
ADLR/megatron-lm!3221 - build: Upgrade to `25.05-py3-devel` image
ko3n1g Jul 5, 2025
cc0bdfb
Merge branch 'ko3n1g/build/bump-pyt-25.05' into 'main'
ko3n1g Jul 5, 2025
65bd6f1
ADLR/megatron-lm!3591 - tests: Update golden values
ko3n1g Jul 6, 2025
6a1b515
Merge branch 'ko3n1g/tests/update-nightlies' into 'main'
ko3n1g Jul 6, 2025
aa8c311
ADLR/megatron-lm!3592 - build: Add pytest-asyncio
ko3n1g Jul 6, 2025
8f4d909
Merge branch 'ko3n1g/tests/fix-asyncio' into 'main'
ko3n1g Jul 6, 2025
3450806
chore: Version bump
Jul 7, 2025
0df8ee9
ADLR/megatron-lm!3597 - ci: Comment out outdated test
ko3n1g Jul 7, 2025
7b296e0
Merge branch 'ko3n1g/ci/disable-outdated-test' into 'main'
ko3n1g Jul 7, 2025
97c0766
ADLR/megatron-lm!3598 - ci: Disable flaky test
ko3n1g Jul 7, 2025
80f88df
Merge branch 'ko3n1g/ci/flaky-test-2' into 'main'
ko3n1g Jul 7, 2025
2621b4f
ADLR/megatron-lm!3501 - Add default values for Fp8Padding and Fp8Unpa…
santhnm2 Jul 7, 2025
42b2b1d
Merge branch 'fp8_fix' into 'main'
deepakn94 Jul 7, 2025
aa67beb
ADLR/megatron-lm!3509 - Add flag to disable early termination for sta…
santhnm2 Jul 7, 2025
efcabfc
Merge branch 'no_early_termination' into 'main'
deepakn94 Jul 7, 2025
5413c63
ci(hotfix): Disable more flaky tests
ko3n1g Jul 8, 2025
6fb6d14
ci(hotfix): Disable flaky tests
ko3n1g Jul 8, 2025
1169ce2
Reapply "ADLR/megatron-lm!3439 - M4 Taskforce: Remove Encoder PP rela…
ko3n1g Jul 8, 2025
9e49aa4
update "ModelType.encoder_and_decoder" to "ModelType.encoder_or_decod…
yaoyu-33 Jul 7, 2025
d235851
ci(hotfix): Release wheel workflow
ko3n1g Jul 8, 2025
436691d
Revert "Reapply "ADLR/megatron-lm!3439 - M4 Taskforce: Remove Encoder…
ko3n1g Jul 8, 2025
7a8b745
Revert "update "ModelType.encoder_and_decoder" to "ModelType.encoder_…
ko3n1g Jul 8, 2025
abb1704
ADLR/megatron-lm!3356 - Fix OOM when merging text datasets
verdimrc Jul 9, 2025
082373e
Merge branch 'vmarch/fix-oom-preproc' into 'main'
ko3n1g Jul 9, 2025
0059345
ADLR/megatron-lm!3531 - Adding CUDA Graph Support for Frozen Transfor…
tomlifu Jul 9, 2025
288dbf0
Merge branch 'frozen_layer_cuda_graph_support_lifuz' into 'main'
ko3n1g Jul 9, 2025
a0ac48d
ADLR/megatron-lm!3599 - Add assertion to PYTORCH_CUDA_ALLOC_CONF=expa…
youngeunkwon0405 Jul 9, 2025
e11d285
Merge branch 'exp_seg_assertion' into 'main'
ko3n1g Jul 9, 2025
c7ad90a
ADLR/megatron-lm!3527 - Miscellaneous timing fixes for inference scripts
santhnm2 Jul 10, 2025
3a2a972
Merge branch 'timing_fixes' into 'main'
ko3n1g Jul 10, 2025
14bfcc0
ADLR/megatron-lm!3520 - Fix issues from cpu init when parallel state …
yaoyu-33 Jul 12, 2025
ee082bf
Merge branch 'yuya/fix_cpu_init_pg_issue' into 'main'
ko3n1g Jul 12, 2025
d2c2210
ADLR/megatron-lm!3338 - Add TE 2.0 check for FSDP2 with fp8-param-gather
BoxiangW Jul 13, 2025
44fa0ea
Merge branch 'boxiangw/fsdp2-te2-fp8-warning' into 'main'
ko3n1g Jul 13, 2025
5 changes: 0 additions & 5 deletions .coveragerc

This file was deleted.

4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
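
As a minimal sketch of what the `.flake8` settings above contain, the snippet below parses the same text with Python's standard `configparser`. The helper name `load_flake8_settings` and the returned dictionary keys are hypothetical and exist only for illustration; flake8 does its own option parsing. Note that `E501` (line too long) is listed in `extend-ignore`, so the `max-line-length = 100` limit would presumably not be reported by that check.

```python
import configparser

# Verbatim copy of the .flake8 contents added in this diff.
FLAKE8_TEXT = """\
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
"""


def load_flake8_settings(text: str) -> dict:
    """Hypothetical helper: parse the [flake8] section into plain Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["flake8"]
    return {
        "max_line_length": section.getint("max-line-length"),
        # Comma-separated error codes that are ignored everywhere.
        "extend_ignore": [code.strip() for code in section["extend-ignore"].split(",")],
        # Whitespace-separated "pattern:codes" entries.
        "per_file_ignores": dict(
            entry.split(":", 1) for entry in section["per-file-ignores"].split()
        ),
    }


settings = load_flake8_settings(FLAKE8_TEXT)
print(settings["max_line_length"])
print(settings["extend_ignore"])
print(settings["per_file_ignores"])
```
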
32 changes: 32 additions & 0 deletions .github/ISSUE_TEMPLATE/bug.md
@@ -0,0 +1,32 @@
---
name: BUG
about: Report a bug that needs attention
title: "[BUG]"
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Stack trace/logs**
If applicable, add the stack trace or logs from the time of the error.

**Environment (please complete the following information):**
- Megatron-LM commit ID
- PyTorch version
- CUDA version
- NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
23 changes: 23 additions & 0 deletions .github/ISSUE_TEMPLATE/enhancement.md
@@ -0,0 +1,23 @@
---
name: ENHANCEMENT
about: Suggest an idea to improve this project
title: "[ENHANCEMENT]"
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Proposed implementation**
If you have a proposed implementation for the feature state it here or link to a PR.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,12 @@
---
name: QUESTION
about: Ask a question about Megatron-LM that is not a bug, regression or enhancement
  request
title: "[QUESTION]"
labels: ''
assignees: ''

---

**Your question**
Ask a clear and concise question about Megatron-LM.
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/regression.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
---
name: REGRESSION
about: Report a regression in speed or accuracy due to a Megatron-LM update
title: "[REGRESSION]"
labels: ''
assignees: ''

---

**Describe the regression**
A clear and concise description of what the regression is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Previous performance**
What speed or accuracy did you previously see.

**New performance**
What speed or accuracy do you see after the update.

**Stack trace/logs**
If applicable, add the stack trace or logs related to the regression.

**Environment (please complete the following information):**
- Previous Megatron-LM commit ID
- New Megatron-LM commit ID
- Previous PyTorch version
- New PyTorch version
- Previous CUDA version
- New CUDA version
- Previous NCCL version
- New NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
31 changes: 31 additions & 0 deletions .github/workflows/stale.yml
@@ -0,0 +1,31 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests

on:
  schedule:
  - cron: '15 18 * * *'

jobs:
  stale:

    runs-on: ubuntu-latest
    permissions:
      issues: write
      pull-requests: write

    steps:
    - uses: actions/stale@v5
      with:
        repo-token: ${{ secrets.GITHUB_TOKEN }}
        days-before-stale: 60
        stale-issue-message: 'Marking as stale. No activity in 60 days.'
        stale-pr-message: 'Marking as stale. No activity in 60 days.'
        stale-issue-label: 'stale'
        stale-pr-label: 'stale'
        remove-stale-when-updated: true
        operations-per-run: 1000
        days-before-close: -1
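
The thresholds in the workflow above can be read as follows. This is a simplified model under stated assumptions, not the actual logic of `actions/stale`: the function names `should_mark_stale` and `auto_close_enabled` are hypothetical, and the inactivity check is reduced to a plain date comparison. The cron expression `'15 18 * * *'` schedules one run per day at 18:15 (GitHub Actions schedules use UTC), and a negative `days-before-close` means stale items are labeled but never closed automatically.

```python
from datetime import datetime, timedelta, timezone

DAYS_BEFORE_STALE = 60  # from `days-before-stale`
DAYS_BEFORE_CLOSE = -1  # from `days-before-close`; negative disables auto-close


def should_mark_stale(last_activity: datetime, now: datetime) -> bool:
    """Hypothetical check: has the item been idle for at least the stale threshold?"""
    return now - last_activity >= timedelta(days=DAYS_BEFORE_STALE)


def auto_close_enabled() -> bool:
    """Hypothetical check: a negative close threshold means 'never close'."""
    return DAYS_BEFORE_CLOSE >= 0


# Example evaluation time chosen to match the daily 18:15 UTC schedule.
now = datetime(2025, 7, 13, 18, 15, tzinfo=timezone.utc)
print(should_mark_stale(now - timedelta(days=61), now))  # idle 61 days
print(should_mark_stale(now - timedelta(days=10), now))  # idle 10 days
print(auto_close_enabled())
```

Under this reading, the workflow only applies the `stale` label and message; closing would require setting `days-before-close` to a non-negative value.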
7 changes: 7 additions & 0 deletions .gitignore
@@ -6,3 +6,10 @@ build
*~
slurm*
logs
.vscode
local/
.gitmodules
wandb/
onelogger.log
onelogger.err
.venv