
[WIP][Dion Official Optimizer, Muon] Integrate official Dion, and high speed Muon, optimizer impl with TorchTitan and Optimizer component class #1521


Draft: wants to merge 18 commits into main

Conversation


@lessw2020 (Contributor) commented on Aug 4, 2025

The Dion authors have released their official Dion implementation here.
This lets us move from my previous unofficial implementation here to their official one.

This PR:

  • integrates their three main optimizer files with the Titan Optimizer class and Titan configs to make them available.
  • places the Dion optimizer files under experiments/dion_optimizer.
  • wires them directly into build_optimizer for now; proper subclassing will be looked at later.
  • adds a parameterization file that classifies the lm head, embeddings, and 2D matrices so each is routed to the appropriate optimizer with the appropriate scaling factor (see the sketch after this list).
  • adds logging for the located lm head and embeddings so the user can verify them:
[screenshot: logging output showing the detected lm head and embedding layers]
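As a rough illustration of the routing idea (a hypothetical sketch only; the function name, group labels, and name-matching rules below are assumptions, not the PR's actual parameterization file):

```python
# Hypothetical sketch: bucket parameters by name/shape, then attach a
# per-group algorithm and learning-rate scaling before building the optimizer.
import math
import torch.nn as nn


def classify_params(model: nn.Module, base_lr: float):
    groups = {"matrix": [], "embedding": [], "head": [], "scalar": []}
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "embed" in name:                           # token embeddings -> AdamW
            groups["embedding"].append(p)
        elif "output" in name or "lm_head" in name:   # final projection -> AdamW
            groups["head"].append(p)
        elif p.ndim >= 2:                             # 2D weight matrices -> Dion/Muon
            groups["matrix"].append(p)
        else:                                         # biases, norms -> AdamW
            groups["scalar"].append(p)

    head_dim = groups["head"][0].shape[-1] if groups["head"] else 1
    return [
        {"algorithm": "dion",  "lr": base_lr, "params": groups["matrix"]},
        {"algorithm": "adamw", "lr": base_lr, "params": groups["embedding"]},
        # e.g. 1/sqrt(dim) head scaling, in the spirit of the head_lr_scaling option below
        {"algorithm": "adamw", "lr": base_lr / math.sqrt(head_dim), "params": groups["head"]},
        {"algorithm": "adamw", "lr": base_lr, "params": groups["scalar"]},
    ]
```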

Testing:
8B llama3 trains nicely; more debugging is needed to verify that the head, embeddings, etc. are all being properly found.

[screenshot: 8B llama3 training run]
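A quick way to check that the head and embeddings landed in the right buckets is to log each group before training starts; a minimal sketch, assuming the group structure from the classification sketch above:

```python
import logging

logger = logging.getLogger(__name__)


def log_param_groups(param_groups) -> None:
    # Print each group's algorithm, LR, and parameter count so the user can
    # confirm the lm head / embeddings were routed to the intended optimizer.
    for g in param_groups:
        n = sum(p.numel() for p in g["params"])
        logger.info("algorithm=%s lr=%.2e num_params=%s", g["algorithm"], g["lr"], f"{n:,}")
```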

meta-cla bot added the CLA Signed label (This label is managed by the Meta Open Source bot.) on Aug 4, 2025
@lessw2020 marked this pull request as draft on August 4, 2025 04:19
@lessw2020 changed the title from "[WIP][Dion Official Optimizer] Integrate official Dion optimizer impl with TorchTitan and Optimizer component class" to "[WIP][Dion Official Optimizer] Integrate official Dion, and high speed Muon, optimizer impl with TorchTitan and Optimizer component class" on Aug 13, 2025
@lessw2020 changed the title from "[WIP][Dion Official Optimizer] Integrate official Dion, and high speed Muon, optimizer impl with TorchTitan and Optimizer component class" to "[WIP][Dion Official Optimizer, Muon] Integrate official Dion, and high speed Muon, optimizer impl with TorchTitan and Optimizer component class" on Aug 13, 2025
@agi-is-coming

Thank you for your work, but when I used this Muon configuration in my experiment, I found that it converges more slowly than the Adam optimizer.


My config TOML:

```toml
# torchtitan debug

[job]
dump_folder = ""
description = "Hybrid debug training"
print_args = true
use_for_integration_test = false

[profiling]
enable_profiling = false
save_traces_folder = "profile_trace"
profile_freq = 10
enable_memory_snapshot = false
save_memory_snapshot_folder = "memory_snapshot"

[comm]
init_timeout_seconds = 3600
train_timeout_seconds = 3600

[metrics]
log_freq = 1
disable_color_printing = false
enable_tensorboard = true ## TODO: whether to enable TensorBoard
save_tb_folder = "tb"
enable_wandb = false

[model]
name = "hybrid_tie"
flavor = "4B_transformer_varlen"
ssm_cp_version = 1

# test tokenizer.model, for debug purpose only

tokenizer_type = ""
tokenizer_path = ""

converters = ["float8"]

[optimizer]
name = "Muon"
lr = 3e-4
weight_decay = 0.1
beta1 = 0.9
beta2 = 0.95
eps = 1e-8

# Muon-specific parameters

mu = 0.95 # Momentum factor for Muon
algorithm = "muon" # Main algorithm to use for 2D matrices
nesterov = false # Whether to use Nesterov momentum
adjust_lr = "spectral_norm" # How to adjust LR: "spectral_norm", "rms_norm", or null
flatten = false # Whether to flatten 3D+ tensors to 2D
use_triton = true # Whether to use Triton kernel for Newton-Schulz

# Parameter-specific optimizer selection

scalar_optimizer = "adamw" # For 1D parameters (biases, layer norms)
embedding_optimizer = "adamw" # For embedding layers
head_optimizer = "adamw" # For model head/output layers
head_lr_scaling = true # Apply 1/sqrt(dim) scaling to head layers

# Learning rate scaling factors

scalar_lr_factor = 1.0 # LR multiplier for scalar parameters
embedding_lr_factor = 1.0 # LR multiplier for embedding parameters
head_lr_factor = 1.0 # LR multiplier for head parameters (after head_lr_scaling)
routing_lr_factor = 1.0 # LR multiplier for routing parameters

[lr_scheduler]
warmup_steps = 2000 # lr scheduler warm up, normally 20% of the train steps
decay_ratio = 0 # lr scheduler decay ratio, 80% of the train steps
decay_type = "cosine"
lr_min = 0.1

[training]
local_batch_size = 2
global_batch_size = 256
seq_len = 4096
max_norm = 1.0 # grad norm clipping
steps = 10000
compile = true
dataset_type = "mmap" # mmap for megatron style
dataset = ""
dataset_path =
loss_function_type = "cross_entropy"
seed = 1234
rope_theta = 10000
mmap_dataloader_num_workers = 2

[inference]

# checkpoint_path = None

mixed_precision_param = "float32"
use_inference_caches = true

[parallelism]
data_parallel_replicate_degree = 1
data_parallel_shard_degree = -1
fsdp_reshard_after_forward = "never" # default / never / always
tensor_parallel_degree = 1
enable_async_tensor_parallel = false
pipeline_parallel_degree = 1
pipeline_parallel_microbatch_size = 1
context_parallel_degree = 1
disable_loss_parallel = true

[checkpoint]
enable_checkpoint = true
folder = "checkpoint" # make sure is a shared folder
interval = 2000
last_save_model_weights_only = false
export_dtype = "float32"
async_mode = "disabled" # ["disabled", "async", "async_with_pinned_mem"]
keep_latest_k = 15

[activation_checkpoint]
mode = "none" # ["none", "selective", "full"]
selective_ac_option = '2' # 'int' = ac every positive int layer or 'op', ac based on ops policy

[float8]
enable_fsdp_float8_all_gather = false
precompute_float8_dynamic_scale_for_fsdp = false
filter_fqns = ["output", "router.gate"]
moe_fqns = ["experts"]
```
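For context on the `use_triton` flag in the `[optimizer]` section above: Muon orthogonalizes each 2D momentum matrix with a Newton-Schulz iteration, and the Triton kernel is just a faster implementation of that step. A minimal PyTorch sketch of the commonly used quintic iteration (coefficients as in the public Muon reference implementation; illustrative only, not this PR's kernel):

```python
import torch


@torch.no_grad()
def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize G (compute its polar/orthogonal factor)
    # with a quintic Newton-Schulz iteration, as used by Muon.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)
    transposed = X.size(-2) > X.size(-1)
    if transposed:                     # iterate on the wide orientation
        X = X.mT
    # Frobenius-normalize so the spectral norm is <= 1 before iterating.
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```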
