LLM Kernel Foundry is a project dedicated to researching and implementing high-performance CUDA kernels for Large Language Models (LLMs).
Our goal is to create custom operators through low-level optimizations that match or exceed the performance of the native operators in major deep learning frameworks like PyTorch, thereby accelerating LLM training and inference pipelines.
Currently, the following kernel has been successfully implemented and validated:
- Layer Normalization: A critical component of the Transformer architecture, essential for stabilizing the training process. Our implementation supports both `float32` and `float16` data types.
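For reference, LayerNorm normalizes each row of the input to zero mean and unit variance, then applies a learned per-element scale and shift. The pure-Python sketch below shows only the math the kernel implements; the function and parameter names are illustrative, not the project's API.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Reference LayerNorm over a single row:
    y = (x - mean) / sqrt(var + eps) * gamma + beta."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # biased variance, as in torch
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(x, [1.0] * 4, [0.0] * 4)  # normalized row, mean ~0
```

A CUDA implementation computes the same two row-wise reductions (mean and variance) per block before the elementwise normalization.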
Below is a performance comparison between our custom kernel and the native PyTorch implementation. All tests have passed numerical correctness checks.
- Test Environment:
  - GPU: NVIDIA GeForce RTX 4090
  - CUDA: 12.9
  - PyTorch: 2.7.0a0+79aa17489c.nv25.04
- Evaluation Date: September 21, 2025
| Kernel | Input Shape | Precision | PyTorch (Native) | LKF (Ours) | Status / Note |
|---|---|---|---|---|---|
| LayerNorm | (16, 1024, 768) | `torch.float32` | 104.89 us | 103.83 us | ✅ Comparable |
| LayerNorm | (16, 1024, 768) | `torch.float16` | 28.60 us | 34.52 us | ⚠️ Performance gap |
| LayerNorm | (4, 4096, 2048) | `torch.float16` | 147.38 us | 145.82 us | ✅ Comparable |
Our current LayerNorm implementation achieves performance on par with PyTorch at `float32` precision. At `float16` precision, numerical correctness is confirmed, but a performance gap remains on the smaller input shape. This indicates that our baseline implementation is robust and stable, and the next priority is to introduce advanced optimization techniques to close this gap.
- Ensure you have the NVIDIA CUDA Toolkit and a compatible version of PyTorch installed.
- Clone this repository:

```bash
git clone https://github.com/<Your-Username>/llm-kernel-foundry.git
cd llm-kernel-foundry
```
- Navigate to the `python` directory and install the package in editable mode; this will compile the CUDA kernels:

```bash
cd python
pip install -e .
```
To validate the correctness and benchmark the performance of the implemented kernels, run the corresponding test script from the root directory of the project.
For example, to test the LayerNorm kernel:
```bash
python3 benchmarks/test_layernorm.py
```
This script will execute correctness checks against the native PyTorch implementation and then run a performance benchmark.
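The check-then-benchmark flow of such a script can be sketched in pure Python. The `allclose` and `bench` helpers below are illustrative stand-ins (the real script would compare the custom kernel against `torch.nn.functional.layer_norm` and time it on the GPU), as are the two toy implementations being compared.

```python
import time

def allclose(a, b, rtol=1e-3, atol=1e-5):
    """Elementwise tolerance check, mirroring torch.allclose semantics."""
    return all(abs(u - v) <= atol + rtol * abs(v) for u, v in zip(a, b))

def bench(fn, *args, iters=100):
    """Average wall-clock time per call, in microseconds."""
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters * 1e6

# Hypothetical stand-ins for the reference op and the custom kernel:
ref = lambda xs: [x * 2.0 for x in xs]
fast = lambda xs: [x + x for x in xs]

data = [0.1 * i for i in range(1024)]
assert allclose(fast(data), ref(data))  # correctness gate first
us = bench(fast, data)                  # then measure performance
```

Gating the benchmark behind the correctness check ensures a reported speedup is never measured on a numerically wrong kernel.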