
[WIP][Dion Official Optimizer, Muon] Integrate official Dion, and high speed Muon, optimizer impl with TorchTitan and Optimizer component class #1521


Draft: wants to merge 18 commits into main

Conversation


@lessw2020 (Contributor) commented on Aug 4, 2025

The Dion authors have released their official Dion implementation here.
This lets us move from my previous unofficial implementation here to their official one.

This PR:

  • integrates their three main optimizer files with the Titan Optimizer class and Titan configs to make them available.
  • places the Dion optimizer files under experiments/dion_optimizer.
  • wires them directly into build_optimizer for now; proper subclassing will be looked at later.
  • adds a parameterization file that classifies the lm head, embeddings, and 2D matrices so each is routed to the appropriate optimizer with the appropriate scaling factor (see the sketch after this list).
  • adds logging for the located lm head and embeddings so the user can verify them:
[screenshot: logging output showing the detected lm head and embedding layers]
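As a rough illustration of the routing idea (a hypothetical sketch only; the function name, group labels, and name-matching rules below are assumptions, not the PR's actual parameterization file):

```python
# Hypothetical sketch: bucket parameters by name/shape, then attach a
# per-group algorithm and learning-rate scaling before building the optimizer.
import math
import torch.nn as nn


def classify_params(model: nn.Module, base_lr: float):
    groups = {"matrix": [], "embedding": [], "head": [], "scalar": []}
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "embed" in name:                           # token embeddings -> AdamW
            groups["embedding"].append(p)
        elif "output" in name or "lm_head" in name:   # final projection -> AdamW
            groups["head"].append(p)
        elif p.ndim >= 2:                             # 2D weight matrices -> Dion/Muon
            groups["matrix"].append(p)
        else:                                         # biases, norms -> AdamW
            groups["scalar"].append(p)

    head_dim = groups["head"][0].shape[-1] if groups["head"] else 1
    return [
        {"algorithm": "dion",  "lr": base_lr, "params": groups["matrix"]},
        {"algorithm": "adamw", "lr": base_lr, "params": groups["embedding"]},
        # e.g. 1/sqrt(dim) head scaling, in the spirit of the head_lr_scaling option below
        {"algorithm": "adamw", "lr": base_lr / math.sqrt(head_dim), "params": groups["head"]},
        {"algorithm": "adamw", "lr": base_lr, "params": groups["scalar"]},
    ]
```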

Testing:
8B llama3 trains nicely; more debugging is needed to verify that the head, embeddings, etc. are all being properly found.

[screenshot: 8B llama3 training run]
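A quick way to check that the head and embeddings landed in the right buckets is to log each group before training starts; a minimal sketch, assuming the group structure from the classification sketch above:

```python
import logging

logger = logging.getLogger(__name__)


def log_param_groups(param_groups) -> None:
    # Print each group's algorithm, LR, and parameter count so the user can
    # confirm the lm head / embeddings were routed to the intended optimizer.
    for g in param_groups:
        n = sum(p.numel() for p in g["params"])
        logger.info("algorithm=%s lr=%.2e num_params=%s", g["algorithm"], g["lr"], f"{n:,}")
```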

meta-cla bot added the CLA Signed label (This label is managed by the Meta Open Source bot.) on Aug 4, 2025
@lessw2020 marked this pull request as draft on August 4, 2025 04:19
@lessw2020 changed the title from "[WIP][Dion Official Optimizer] Integrate official Dion optimizer impl with TorchTitan and Optimizer component class" to "[WIP][Dion Official Optimizer] Integrate official Dion, and high speed Muon, optimizer impl with TorchTitan and Optimizer component class" on Aug 13, 2025
@lessw2020 changed the title from "[WIP][Dion Official Optimizer] Integrate official Dion, and high speed Muon, optimizer impl with TorchTitan and Optimizer component class" to "[WIP][Dion Official Optimizer, Muon] Integrate official Dion, and high speed Muon, optimizer impl with TorchTitan and Optimizer component class" on Aug 13, 2025
@agi-is-coming

Thank you for your work, but when I used this Muon configuration in my experiment, I found that it converges more slowly than the Adam optimizer.


My config TOML:

```toml
# torchtitan debug

[job]
dump_folder = ""
description = "Hybrid debug training"
print_args = true
use_for_integration_test = false

[profiling]
enable_profiling = false
save_traces_folder = "profile_trace"
profile_freq = 10
enable_memory_snapshot = false
save_memory_snapshot_folder = "memory_snapshot"

[comm]
init_timeout_seconds = 3600
train_timeout_seconds = 3600

[metrics]
log_freq = 1
disable_color_printing = false
enable_tensorboard = true ## TODO: whether to enable TensorBoard
save_tb_folder = "tb"
enable_wandb = false

[model]
name = "hybrid_tie"
flavor = "4B_transformer_varlen"
ssm_cp_version = 1

# test tokenizer.model, for debug purpose only

tokenizer_type = ""
tokenizer_path = ""

converters = ["float8"]

[optimizer]
name = "Muon"
lr = 3e-4
weight_decay = 0.1
beta1 = 0.9
beta2 = 0.95
eps = 1e-8

# Muon-specific parameters

mu = 0.95 # Momentum factor for Muon
algorithm = "muon" # Main algorithm to use for 2D matrices
nesterov = false # Whether to use Nesterov momentum
adjust_lr = "spectral_norm" # How to adjust LR: "spectral_norm", "rms_norm", or null
flatten = false # Whether to flatten 3D+ tensors to 2D
use_triton = true # Whether to use Triton kernel for Newton-Schulz

# Parameter-specific optimizer selection

scalar_optimizer = "adamw" # For 1D parameters (biases, layer norms)
embedding_optimizer = "adamw" # For embedding layers
head_optimizer = "adamw" # For model head/output layers
head_lr_scaling = true # Apply 1/sqrt(dim) scaling to head layers

# Learning rate scaling factors

scalar_lr_factor = 1.0 # LR multiplier for scalar parameters
embedding_lr_factor = 1.0 # LR multiplier for embedding parameters
head_lr_factor = 1.0 # LR multiplier for head parameters (after head_lr_scaling)
routing_lr_factor = 1.0 # LR multiplier for routing parameters

[lr_scheduler]
warmup_steps = 2000 # lr scheduler warm up, normally 20% of the train steps
decay_ratio = 0 # lr scheduler decay ratio, 80% of the train steps
decay_type = "cosine"
lr_min = 0.1

[training]
local_batch_size = 2
global_batch_size = 256
seq_len = 4096
max_norm = 1.0 # grad norm clipping
steps = 10000
compile = true
dataset_type = "mmap" # mmap for megatron style
dataset = ""
dataset_path =
loss_function_type = "cross_entropy"
seed = 1234
rope_theta = 10000
mmap_dataloader_num_workers = 2

[inference]

# checkpoint_path = None

mixed_precision_param = "float32"
use_inference_caches = true

[parallelism]
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1
fsdp_reshard_after_forward = "never" # default / never / always
tensor_parallel_degree = 1
enable_async_tensor_parallel = false
pipeline_parallel_degree = 1
pipeline_parallel_microbatch_size = 1
context_parallel_degree = 1
disable_loss_parallel = true

[checkpoint]
enable_checkpoint = true
folder = "checkpoint" # make sure is a shared folder
interval = 2000
last_save_model_weights_only = false
export_dtype = "float32"
async_mode = "disabled" # ["disabled", "async", "async_with_pinned_mem"]
keep_latest_k = 15

[activation_checkpoint]
mode = "none" # ["none", "selective", "full"]
selective_ac_option = '2' # 'int' = ac every positive int layer or 'op', ac based on ops policy

[float8]
enable_fsdp_float8_all_gather = false
precompute_float8_dynamic_scale_for_fsdp = false
filter_fqns = ["output", "router.gate"]
moe_fqns = ["experts"]
```
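For context on the `use_triton` flag in the `[optimizer]` section above: Muon orthogonalizes each 2D momentum matrix with a Newton-Schulz iteration, and the Triton kernel is just a faster implementation of that step. A minimal PyTorch sketch of the commonly used quintic iteration (coefficients as in the public Muon reference implementation; illustrative only, not this PR's kernel):

```python
import torch


@torch.no_grad()
def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize G (compute its polar/orthogonal factor)
    # with a quintic Newton-Schulz iteration, as used by Muon.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)
    transposed = X.size(-2) > X.size(-1)
    if transposed:                     # iterate on the wide orientation
        X = X.mT
    # Frobenius-normalize so the spectral norm is <= 1 before iterating.
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```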
