@jlamypoirier jlamypoirier commented Aug 27, 2025

✨ Description

Not really complete by itself, but extracted as a separate PR to limit PR scope.

New(ish) concepts:

  • Parameters are created through ParameterConfig.get_parameter, and most layers (more to come) are created through [LayerConfig].get_layer. This ensures correct, standardized creation and leaves more room for new additions (ex. dynamic types).
  • Parameter and linear configs will support fine-grained customization, but their options are typically set by the parent layer. Because of this, several options are left "unset" by default (None or a special default marker), and get_parameter/get_layer take default values as arguments. This keeps existing behaviour as the default and makes the new options truly optional and opt-in. (Otherwise, things like disabling biases or setting an initialization or lr scale would have required manually setting every single parameter.)
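The "unset by default, parent supplies the default" scheme can be sketched roughly as follows. This is a simplified illustration only: the field name lr_scale, the tuple shape, and the dict return value are placeholders standing in for the real Fast-LLM ParameterConfig/get_parameter machinery.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class ParameterConfig:
    # None means "unset": the parent layer supplies the value via get_parameter.
    lr_scale: Optional[float] = None

    def get_parameter(self, shape: tuple, *, default_lr_scale: float = 1.0) -> dict:
        # An explicitly configured value wins; otherwise fall back to the
        # parent layer's default, so existing behaviour is preserved unless
        # the user opts in to a custom value.
        lr_scale = self.lr_scale if self.lr_scale is not None else default_lr_scale
        return {"shape": shape, "lr_scale": lr_scale}  # stand-in for a real parameter
```

The point of routing defaults through the call site is that a parent layer can set a sensible default for all of its parameters in one place, while each individual parameter config can still override it.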

Main changes:

  • Add ParameterConfig as the new standard way to configure and instantiate (get_parameter) every parameter. Currently a placeholder config; standard options (lr scale, initialization, maybe more) will be added in upcoming PRs.
  • Add OptionalParameterConfig for weights that may be enabled or disabled (ex. biases). It comes with an enabled option, with default provided by the parent layer.
  • Add (mostly empty) configuration for linear layers. Distinguish pure linear (no bias) from affine linear (optional bias). Linear layers are created through get_layer, which takes non-config arguments as well as defaults for bias.enabled (default_add_bias) and initialization (customizable initialization will come later).
  • Add CausalConv1d layer (based on Mamba 2 and Discrete Mamba 2 implementations) and config. Config is similar to AffineLinearConfig, but also supports custom activation, with default set by the parent layer.
  • Rework all layers and their configs to use the new configs.
  • Make the SSM config dynamic, separating it into MambaConfig, Mamba2Config and DiscreteMamba2Config. Things are a bit awkward for now because of the double configuration (hybrid_block_layout, ssm.type), but this will be addressed in an upcoming PR.
  • Remove auto_grad_accumulation arguments, as things work without it, and removing it allows mixing auto and non-auto accumulation (dt bias).
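A dynamic, type-dispatched config of the kind described above can be sketched like this. The registry, decorator, and field names below are hypothetical illustrations of the pattern, not the actual Fast-LLM classes; only the class names MambaConfig/Mamba2Config and the d_xb option come from the PR.

```python
import dataclasses

# Hypothetical registry mapping an explicit "type" string to a config class.
_SSM_CONFIG_CLASSES = {}


def register_ssm_config(type_name):
    def wrapper(cls):
        _SSM_CONFIG_CLASSES[type_name] = cls
        return cls
    return wrapper


@register_ssm_config("mamba")
@dataclasses.dataclass
class MambaConfig:
    state_size: int = 16


@register_ssm_config("mamba_2")
@dataclasses.dataclass
class Mamba2Config:
    state_size: int = 16
    d_xb: int = 64  # a Mamba-2-only option


def ssm_config_from_dict(data: dict):
    data = dict(data)
    cls = _SSM_CONFIG_CLASSES[data.pop("type")]  # type must be set explicitly
    # Options that don't belong to the selected type (ex. d_xb on "mamba")
    # raise a TypeError instead of being silently accepted.
    return cls(**data)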

Config/breaking changes:

  • SSM arguments removed for SSM types that don't use them. (Ex. setting Mamba 2 option d_xb in a Mamba 1 layer will cause a crash.)
  • [Temporary hack] SSM configs need an explicitly set type.
  • Remove ssm.add_bias_linear, AddLinearBiasChoices. add_linear_biases: bool is kept as the only global option for biases, at least for now. Other options may be achieved through individual layer configs.
  • ssm.expansion_factor removed (redundant)
  • ssm.conv_kernel_dimension -> ssm.convolution_layer.kernel_size
  • ssm.activation_type -> ssm.convolution_layer.activation
  • Renamed conv1d_weight -> convolution.weight
  • Renamed conv1d_bias -> convolution.bias
  • Option to enable or disable all supported linear and convolution biases (unchanged defaults)
  • Mamba:
    • Added support for convolution bias
    • Renamed dt_proj_weight -> dt_proj.weight
    • Renamed dt_proj_bias -> dt_proj.bias
  • Mamba 2:
    • dt_proj_bias -> dt_proj.bias
    • Add support for custom convolution activation, with a fallback torch implementation
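For reference, the causality property that CausalConv1d provides (the real layer is based on the Mamba 2 / Discrete Mamba 2 convolution implementations and runs on tensors) can be demonstrated with a minimal pure-Python sketch; the function name and list-based signature here are illustrative only.

```python
def causal_conv1d(x, weight, bias=0.0):
    """Causal 1D convolution: output[t] depends only on x[0..t], never on the future.

    Achieved by left-padding the sequence with kernel_size - 1 zeros, the same
    trick used by causal convolutions in sequence models.
    """
    kernel_size = len(weight)
    padded = [0.0] * (kernel_size - 1) + list(x)
    return [
        sum(weight[j] * padded[t + j] for j in range(kernel_size)) + bias
        for t in range(len(x))
    ]
```

With weight = [0.0, 1.0] the convolution is the identity, and with weight = [1.0, 0.0] it shifts the sequence right by one step, which makes the "no lookahead" behaviour easy to check by hand.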

TODO:

  • Allow separate configuration for concatenated layers (ex. key_value, ssm in_proj)
  • Deal with conversion.
  • Review global bias options.

@jlamypoirier jlamypoirier changed the title Block interface: parameter and linear config Block interface: parameter and linear config, separate SSM config. Aug 27, 2025