
shared-subspaces

This repository contains research code exploring the use of shared subspaces in Transformer attention and feed-forward networks. The core of this work investigates the impact of adding a shared output latent space to Multihead Latent Attention (MLA), a parameter-efficient attention mechanism used in models like DeepSeek and Kimi.

The projects here include Singular Value Decomposition (SVD) analysis of pre-trained models to motivate the research, as well as experiments with custom, latent-space-efficient Transformer variants.

The Research Question: Constraining the Residual Stream

State-of-the-art Multihead Latent Attention (MLA) models like DeepSeek-V3 aggressively bottleneck the inputs to the attention layer. For instance, they project the model's hidden dimension (e.g., 7,168) down to much smaller latent spaces for the query (e.g., 1,536) and key/value pairs (e.g., 512).
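
For intuition, here is a minimal PyTorch sketch of those input bottlenecks, using the DeepSeek-V3 sizes quoted above. The real MLA layer also includes per-head up-projections, RoPE decoupling, and normalization, all of which are omitted here:

```python
import torch
import torch.nn as nn

# Simplified sketch of MLA's input bottlenecks (DeepSeek-V3-like sizes).
# Omits per-head up-projections, RoPE decoupling, and normalization.
d_model, d_q_latent, d_kv_latent = 7168, 1536, 512

q_down  = nn.Linear(d_model, d_q_latent, bias=False)   # down-project for queries
kv_down = nn.Linear(d_model, d_kv_latent, bias=False)  # down-project for keys/values

x = torch.randn(2, 16, d_model)   # (batch, seq_len, hidden) token representations
q_latent  = q_down(x)             # (2, 16, 1536)
kv_latent = kv_down(x)            # (2, 16, 512) -- this compressed latent is what MLA caches
```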

This raises a key question: If these input bottlenecks are effective, what is the impact of adding a similar bottleneck to the output of the attention layer?

Using the language of mechanistic interpretability, we can think of a token's vector representation as a "residual stream"—a communication bus that all model components read from and write to. In this framing, MLA's input projections constrain how much information each head can read from the stream. This project explores constraining where they can write to.

[Figure: simple block diagram of the attention heads, with the shared spaces illustrated as trapezoids.]

The trapezoids in this illustration represent projections shared by all heads in a layer. Multihead Latent Attention defines a shared Key-Value latent projection (bottom) and a larger, shared Query latent projection (top). We're proposing a shared Output latent projection (right).

A shared output subspace, where the output matrix $W^O$ is decomposed into a per-head projection $W^{OA}_i$ and a shared projection $W^{OB}$, could have competing effects:

  • Potential Benefits: It could encourage shared learning and feature reuse, as the shared projection receives gradient updates from every token.
  • Potential Risks: It could reduce head diversity or lead to destructive interference as heads compete for representation capacity in a smaller space.
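
As a concrete sketch of the decomposition (not the repository's implementation), here is a hypothetical PyTorch version using the 8-head, 256-dimensional encoder configuration reported later; all dimension choices are illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: factor the output projection W^O into W^{OA} (per-head
# writes into a shared output latent) and W^{OB} (a shared write back to the
# residual stream). Dimensions are illustrative.
n_heads, d_head, d_model, d_out_latent = 8, 32, 256, 64

# Standard attention output projection: concat(heads) -> hidden.
w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

# Decomposed variant: concat(heads) -> shared output latent -> hidden.
w_oa = nn.Linear(n_heads * d_head, d_out_latent, bias=False)  # rows split per head: W^{OA}_i
w_ob = nn.Linear(d_out_latent, d_model, bias=False)           # shared across all heads: W^{OB}

head_outputs = torch.randn(2, 16, n_heads * d_head)  # (batch, seq_len, concatenated heads)
full_rank = w_o(head_outputs)             # (2, 16, 256)
low_rank  = w_ob(w_oa(head_outputs))      # (2, 16, 256), write rank limited to d_out_latent
```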

This repository documents the investigation into whether, under the right conditions, an output latent space can be beneficial.

Projects

This repository is organized into three main projects that follow the research narrative.

1. fused_attn_svd/

Before building new models, we first analyzed existing ones. This project performs a Singular Value Decomposition (SVD) analysis on the attention weight matrices of large, pre-trained MLA models (DeepSeek-V3, Kimi-K2) to measure their "effective rank." The primary goal was to see if the output heads already exhibit a low-rank structure that would suggest a shared subspace is feasible.

The analysis reveals that while there is some potential for rank reduction, especially in the early layers, simply decomposing the weights of a pre-trained model might not be the most effective approach. This motivated pre-training a model with the output subspace constraint from the beginning.
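
For illustration, the sketch below computes one common notion of effective rank (a cumulative-energy threshold on the singular values); the notebooks may use a different criterion, and the random matrix here is only a stand-in for a real attention weight matrix:

```python
import torch

def effective_rank(weight: torch.Tensor, energy: float = 0.99) -> int:
    """Smallest number of singular values whose cumulative squared magnitude
    captures the given fraction of the matrix's total energy."""
    s = torch.linalg.svdvals(weight.float())
    cum_energy = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
    return int((cum_energy < energy).sum().item()) + 1

# Stand-in for a fused attention weight matrix (e.g., an output projection).
w = torch.randn(1024, 512)
print(effective_rank(w))   # a Gaussian random matrix is close to full rank (512)
```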

Dive into the analysis in the fused_attn_svd/README.md.

2. subspace_encoder/

This project implements a custom Transformer encoder from scratch to experimentally validate the impact of a shared output latent space. We train small (6-layer, 13M parameter) models on WikiText-103 and evaluate them on the SST-2 GLUE task.

The core experiments compare three architectures:

  1. MHA: A standard Multihead Attention baseline.
  2. MLA: Our implementation of Multihead Latent Attention.
  3. MLA-o: Our proposed variant, MLA with a shared output latent space.

Find the implementation, usage, and full experimental details in the subspace_encoder/README.md.

3. subspace_decoder/

Building on the encoder experiments, this project implements and evaluates the shared output latent space using a decoder architecture based on HuggingFace's DeepSeek-V3 implementation. Rather than building a custom model from scratch, this approach patches the existing DeepseekV3ForCausalLM to add the output subspace decomposition.
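
A minimal sketch of the patching idea is shown below. It assumes the standard HuggingFace module layout (model.model.layers[i].self_attn.o_proj) and randomly initialized replacement layers; the actual patch in subspace_decoder/layers/ may differ in its details:

```python
import torch.nn as nn

def add_output_latent(model, d_out_latent: int):
    """Replace each attention layer's output projection with a two-stage,
    low-rank version: per-head W^{OA} into a shared latent, then shared W^{OB}."""
    for layer in model.model.layers:
        o_proj = layer.self_attn.o_proj            # Linear: (n_heads * v_head_dim) -> hidden_size
        layer.self_attn.o_proj = nn.Sequential(
            nn.Linear(o_proj.in_features, d_out_latent, bias=False),   # W^{OA}
            nn.Linear(d_out_latent, o_proj.out_features, bias=False),  # W^{OB}
        )
    return model
```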

The core experiments compare the same three architectures as the encoder project:

  1. MHA: Standard Multihead Attention baseline.
  2. MLA: Multihead Latent Attention (DeepSeek-V3's approach).
  3. MLA-o: MLA with shared output latent space.

Models are pre-trained on WikiText-103 and fine-tuned on SST-2, with experiments conducted at both short (128 tokens) and longer (1,024 tokens) sequence lengths to evaluate the impact of context length on the shared output space.

Find the implementation details, experimental results, and usage instructions in the subspace_decoder/README.md.

Current Status & Preliminary Results

We have conducted experiments with both encoder and decoder architectures using 6-layer models with a hidden dimension of 256 and 8 attention heads.

Encoder Results (SubspaceEncoder)

The table below shows the best-performing encoder configurations evaluated on SST-2 test accuracy:

| # | Attention | Test Accuracy | Parameters | Query Latent | Key-Value Latent | Output Latent | Position Encoding | # of RoPE Dims |
|:-:|:---------:|:-------------:|:----------:|:------------:|:----------------:|:-------------:|:-----------------:|:--------------:|
| 1 | MHA   | 85.67 | 13.50M | n/a | n/a | n/a | RoPE | 32 |
| 2 | MLA   | 84.75 | 12.67M | 64  | 32  | n/a | RoPE | 16 |
| 3 | MLA-o | 84.63 | 12.48M | 64  | 32  | 64  | RoPE | 32 |

Decoder Results (DeepSeek-V3 based)

The decoder experiments, using patched DeepSeek-V3 models, show performance at different sequence lengths:

SST-2 Accuracy (Sequence Length 1,024)

| Attention | Test Accuracy | Parameters | Query Latent | Key-Value Latent | Output Latent | Perplexity (WikiText-103) |
|:---------:|:-------------:|:----------:|:------------:|:----------------:|:-------------:|:-------------------------:|
| MLA   | 87.96 | 16.26M | 96 | 64 | n/a | 28.89 |
| MLA-o | 86.24 | 16.17M | 96 | 64 | 96  | 29.33 |

Key Observations:

  • Encoder vs. Decoder: The decoder models achieve higher SST-2 accuracy (~87-88%) compared to encoders (~84-86%), likely due to the Mixture of Experts architecture.
  • Consistency Across Architectures: Both the encoder and decoder experiments show MLA-o underperforming standard MLA (by roughly 0.1 to 2 accuracy points across our runs) while reducing the parameter count.
  • Scale Effects: At a sequence length of 1,024, the performance gap between MLA and MLA-o is similar to the gap observed at shorter sequence lengths.
  • Throughput: At current model scales, MLA-o does not yet show the expected throughput improvements, likely requiring larger models or more attention heads to become beneficial.

These results are preliminary. Further exploration is needed to understand the trade-offs and identify scenarios where an output latent space could be advantageous, particularly at larger scales where the computational benefits may become more apparent.

Future Directions & Collaboration

This is an active research project, and I welcome feedback, discussion, and collaboration! Some potential next steps include:

  • Scaling Experiments: Test the output subspace at larger model scales with more attention heads and larger hidden dimensions to identify the point where computational benefits become apparent.
  • Throughput Analysis: Systematically benchmark the performance (samples/sec) of an isolated attention layer to find the model/latent sizes where MLA-o becomes more computationally efficient (a rough benchmarking sketch follows this list).
  • Hyperparameter Sweeps: Thoroughly explore the impact of different latent space sizes for the query, key-value, and output projections.
  • Subspace Alignment: An interpretability tangent to investigate if the output heads align with other subspaces in the model.
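
As a starting point for that throughput analysis, here is a rough, hypothetical benchmarking sketch. It assumes an attention module that accepts a single (batch, seq_len, hidden) tensor and a CUDA device, which may not match the signatures of the layers in this repo:

```python
import time
import torch

@torch.no_grad()
def samples_per_sec(attn, batch_size=32, seq_len=1024, d_model=256,
                    n_warmup=5, n_iters=50, device="cuda"):
    """Time repeated forward passes of an isolated attention module."""
    attn = attn.to(device).eval()
    x = torch.randn(batch_size, seq_len, d_model, device=device)
    for _ in range(n_warmup):          # warm up kernels before timing
        attn(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        attn(x)
    torch.cuda.synchronize()
    return n_iters * batch_size / (time.perf_counter() - start)
```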

If you are interested in these ideas, please feel free to open an issue or a pull request to discuss them further, or join the discussion in the Community Projects channel of the EleutherAI Discord server.

Repository Structure

```
.
├── fused_attn_svd/      # SVD analysis of pre-trained models.
│   ├── Calculating Singular Values in Large MLA Models.ipynb
│   └── Plotting Effective Rank of Attention Matrices.ipynb
│
├── subspace_encoder/    # Experimental encoder model implementation.
│   ├── configs/         # Model and training hyperparameters.
│   ├── scripts/         # Scripts for pre-training and fine-tuning.
│   ├── models/          # The SharedSubspaceEncoder model definition.
│   └── run_experiments.ipynb  # Notebook for running experiments and analyzing results.
│
├── subspace_decoder/    # Decoder experiments using patched DeepSeek-V3.
│   ├── configs/         # Model and training configurations.
│   ├── layers/          # Output subspace patching and implementations.
│   ├── scripts/         # Training and fine-tuning scripts.
│   └── run_experiments.ipynb  # Notebook for decoder experiments.
│
├── journals/            # Research notes and experiment documentation.
│   └── 2025-09-02 - Initial Decoder Experiments.ipynb
│
├── .gitignore
└── README.md            # You are here!
```

Getting Started

Welcome to the project!

A great way to get started is to head to subspace_encoder/ and try running the run_experiments.ipynb Notebook. You can run it in Google Colab, and it will clone this repository for you and kick off a pre-training run of the encoder.

No need to run it to completion (it takes 1.5 hours on an A100), but it will give you a starting point for exploring. You can:

  • Check out your run on wandb to watch the metrics in real time.
  • Try modifying one or more of the hyperparameters by setting up a new config (see the tool at the end of the Notebook).

If anything's confusing or you run into problems, head to this issue to discuss it, and we'll work on smoothing out the process.

Next, check out the Issues for ways to contribute. At this early stage, there's still a lot that can be done just by working alongside an AI, so even if you're a beginner you may be able to lend a hand.

We'll use Issues to discuss specific work, but if you're a member of the EleutherAI Discord server we have a 'community project' post here for more general discussion.

Thanks for your interest--looking forward to seeing where this goes!
