
Steering MoE LLMs via Expert (De)Activation

arXiv: https://arxiv.org/abs/2509.09660

[Main figure animation]

Abstract

Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFNs), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. We detect key experts by comparing how often they activate across paired inputs that demonstrate opposite behaviors. By selectively activating or deactivating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. Alternatively, under unsafe steering, safety drops by -41% alone, and by -100% when combined with existing jailbreak methods, bypassing all safety guardrails. Overall, SteerMoE offers a lightweight, effective, and widely applicable test-time control, while revealing unique vulnerabilities in MoE LLMs.

Read the paper: https://arxiv.org/abs/2509.09660

How SteerMoE Works

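In short: for each expert, SteerMoE compares how often the router selects it on inputs exhibiting one behavior versus paired inputs exhibiting the opposite, then ranks experts by that activation-frequency gap. The sketch below illustrates the ranking idea on hypothetical boolean router-selection arrays; it is a minimal reconstruction, not the repository's actual detection code (which works from routing logits saved by src/modeling_vllm_save).

```python
import numpy as np

def rank_experts(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Rank experts by activation-frequency gap between two behaviors.

    pos_acts / neg_acts: boolean arrays of shape (num_tokens, num_experts);
    entry [t, e] is True if the router selected expert e for token t.
    (Hypothetical format, assumed for illustration.)
    """
    gap = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)  # per-expert frequency difference
    return np.argsort(gap)[::-1]  # expert indices, most behavior-linked first

# Toy check: 8 experts; expert 3 fires far more often on the "positive" side.
rng = np.random.default_rng(0)
pos = rng.random((100, 8)) < 0.25
pos[:, 3] = rng.random(100) < 0.9
neg = rng.random((100, 8)) < 0.25
print(rank_experts(pos, neg)[0])  # -> 3
```

Deactivating the top-ranked experts (or forcing them on) at inference time is then what steers the behavior.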

Environment Setup

```bash
cd SteerMoE/
conda create --name steermoe-env -y
conda activate steermoe-env
conda install python=3.12 -y
pip install -r requirements.txt --no-input
```

Demo Notebooks

  • demo.ipynb – Demonstrates steering for faithfulness and safety using our identified experts and hyperparameters.
  • custom_steering.ipynb – Walks through detecting behavior-linked experts with your own contrastive dataset, e.g., steering between generating digits (1, 2, 3) and spelled-out numbers (one, two, three); a hypothetical example of such a dataset follows below.
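
For a sense of what such a contrastive dataset can look like, here is a hypothetical example in the spirit of the digit-vs-word task above; the pair structure and field names are illustrative, not the notebook's actual schema.

```python
# Hypothetical contrastive pairs for digit-vs-word number steering.
# Field names are illustrative; see custom_steering.ipynb for the real format.
contrastive_pairs = [
    {
        "positive": "Count to three with digits: 1, 2, 3.",
        "negative": "Count to three with words: one, two, three.",
    },
    {
        "positive": "The meeting starts at 9.",
        "negative": "The meeting starts at nine.",
    },
]
```

Routing logits are saved for both sides of each pair (via src/modeling_vllm_save), and experts are ranked by how differently they activate across the two sides.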

Files

  • src/modeling_vllm – Modified LLM code that overrides expert routing at inference time.
  • src/modeling_vllm_save – Modified LLM code that saves routing logits for expert detection.
  • activations/num_experts.jsonl – Number of experts modified per task and model in our experiments.
  • activations/*.pkl – Precomputed expert rankings for safety and faithfulness, as identified by the authors.
  • src/utils.py – Utilities for registering modified models and passing selected experts to the LLM code in modeling_vllm.

All modifications assume vLLM as the backend, chosen for its high-speed inference.
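
To show how these files might fit together at inference time, here is a hedged sketch: load a precomputed ranking from activations/, truncate it to the expert count recorded in num_experts.jsonl, and hand the selection to the modified model via src/utils.py. The file name, JSONL schema, pickle layout, and register_steered_model helper are all assumptions for illustration; demo.ipynb shows the actual API.

```python
import json
import pickle

# Hypothetical file names and data layouts throughout; see demo.ipynb and
# src/utils.py for the repository's real API.

# Precomputed expert ranking (assumed: (layer, expert) pairs, strongest first).
with open("activations/safety.pkl", "rb") as f:
    ranked_experts = pickle.load(f)

# Number of experts to steer for this task/model (schema assumed).
with open("activations/num_experts.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("task") == "safety" and rec.get("model") == "my-moe-model":
            k = rec["num_experts"]
            break

selected = ranked_experts[:k]

# The selection would then be registered with the modified vLLM model, e.g.:
# from src.utils import register_steered_model   # hypothetical helper name
# llm = register_steered_model("my-moe-model", deactivate=selected)
```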

Citation

@misc{fayyaz2025steeringmoellmsexpert,
      title={Steering MoE LLMs via Expert (De)Activation}, 
      author={Mohsen Fayyaz and Ali Modarressi and Hanieh Deilamsalehy and Franck Dernoncourt and Ryan Rossi and Trung Bui and Hinrich Schütze and Nanyun Peng},
      year={2025},
      eprint={2509.09660},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.09660}, 
}
