LLM Kernel Foundry is a project dedicated to researching and implementing high-performance CUDA kernels for Large Language Models (LLMs).
Our goal is to create custom operators through low-level optimizations that match or exceed the performance of the native operators in major deep learning frameworks like PyTorch, thereby accelerating LLM training and inference pipelines.
Currently, the following kernel has been successfully implemented and validated:
- Layer Normalization: A critical component of the Transformer architecture, essential for stabilizing the training process. Our implementation supports both `float32` and `float16` data types.
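For reference, LayerNorm normalizes each row of the input to zero mean and unit variance, then applies a learned per-element scale and shift. The pure-Python sketch below shows only the math the kernel implements; the function and parameter names are illustrative, not the project's API.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Reference LayerNorm over a single row:
    y = (x - mean) / sqrt(var + eps) * gamma + beta."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # biased variance, as in torch
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(x, [1.0] * 4, [0.0] * 4)  # normalized row, mean ~0
```

A CUDA implementation computes the same two row-wise reductions (mean and variance) per block before the elementwise normalization.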
Below is a performance comparison between our custom kernel and the native PyTorch implementation. All tests have passed numerical correctness checks.
- Test Environment:
  - GPU: NVIDIA GeForce RTX 4090
  - CUDA: 12.9
  - PyTorch: 2.7.0a0+79aa17489c.nv25.04
- Evaluation Date: September 21, 2025
| Kernel | Input Shape | Precision | PyTorch (Native) | LKF (Ours) | Status / Note |
|---|---|---|---|---|---|
| LayerNorm | (16, 1024, 768) | `torch.float32` | 104.89 us | 103.83 us | ✅ Comparable |
| LayerNorm | (16, 1024, 768) | `torch.float16` | 28.60 us | 34.52 us | ⚠️ Performance gap |
| LayerNorm | (4, 4096, 2048) | `torch.float16` | 147.38 us | 145.82 us | ✅ Comparable |
Our current LayerNorm implementation achieves performance on par with PyTorch at `float32` precision. At `float16` precision, numerical correctness is confirmed, but a performance gap remains on the smaller input shape. This indicates that our baseline implementation is robust and stable, and the next priority is to introduce advanced optimization techniques to close this gap.
- Ensure you have the NVIDIA CUDA Toolkit and a compatible version of PyTorch installed.
- Clone this repository:

```bash
git clone https://github.com/<Your-Username>/llm-kernel-foundry.git
cd llm-kernel-foundry
```
- Navigate to the `python` directory and install the package in editable mode; this will compile the CUDA kernels:

```bash
cd python
pip install -e .
```
To validate the correctness and benchmark the performance of the implemented kernels, run the corresponding test script from the root directory of the project.
For example, to test the LayerNorm kernel:
```bash
python3 benchmarks/test_layernorm.py
```
This script will execute correctness checks against the native PyTorch implementation and then run a performance benchmark.
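The check-then-benchmark flow of such a script can be sketched in pure Python. The `allclose` and `bench` helpers below are illustrative stand-ins (the real script would compare the custom kernel against `torch.nn.functional.layer_norm` and time it on the GPU), as are the two toy implementations being compared.

```python
import time

def allclose(a, b, rtol=1e-3, atol=1e-5):
    """Elementwise tolerance check, mirroring torch.allclose semantics."""
    return all(abs(u - v) <= atol + rtol * abs(v) for u, v in zip(a, b))

def bench(fn, *args, iters=100):
    """Average wall-clock time per call, in microseconds."""
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters * 1e6

# Hypothetical stand-ins for the reference op and the custom kernel:
ref = lambda xs: [x * 2.0 for x in xs]
fast = lambda xs: [x + x for x in xs]

data = [0.1 * i for i in range(1024)]
assert allclose(fast(data), ref(data))  # correctness gate first
us = bench(fast, data)                  # then measure performance
```

Gating the benchmark behind the correctness check ensures a reported speedup is never measured on a numerically wrong kernel.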