ccs96307/llm-kernel-foundry

LLM Kernel Foundry

LLM Kernel Foundry is a project dedicated to researching and implementing high-performance CUDA kernels for Large Language Models (LLMs).

Our goal is to create custom operators through low-level optimizations that match or exceed the performance of major deep learning frameworks like PyTorch, thereby accelerating LLM training and inference pipelines.

Implemented Kernels

Currently, the following kernel has been successfully implemented and validated:

  • Layer Normalization: A critical component in the Transformer architecture, essential for stabilizing the training process. Our implementation supports both float32 and float16 data types.
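The operation the kernel must reproduce normalizes each feature vector to zero mean and unit variance, then applies a learned affine transform. As a minimal pure-Python reference of the math (not the CUDA kernel itself, and not code from this repository):

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Reference LayerNorm over a single feature vector (a flat list).

    Computes (x - mean) / sqrt(var + eps) * gamma + beta, the same
    formula the CUDA kernel evaluates per row, in parallel.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n   # biased variance, as in torch.nn.LayerNorm
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]
```

With `gamma = 1` and `beta = 0`, the output has mean ~0 and variance ~1; this is the invariant the correctness tests check against PyTorch.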

Performance Evaluation

Below is a performance comparison between our custom kernel and the native PyTorch implementation. All tests have passed numerical correctness checks.

  • Test Environment:
    • GPU: NVIDIA GeForce RTX 4090
    • CUDA: 12.9
    • PyTorch: 2.7.0a0+79aa17489c.nv25.04
  • Evaluation Date: September 21, 2025
| Kernel    | Input Shape     | Precision     | PyTorch (Native) | LKF (Ours) | Status / Note          |
|-----------|-----------------|---------------|------------------|------------|------------------------|
| LayerNorm | (16, 1024, 768) | torch.float32 | 104.89 us        | 103.83 us  | Comparable             |
| LayerNorm | (16, 1024, 768) | torch.float16 | 28.60 us         | 34.52 us   | ⚠️ Optimization Needed |
| LayerNorm | (4, 4096, 2048) | torch.float16 | 147.38 us        | 145.82 us  | Comparable             |

Analysis

Our current LayerNorm implementation achieves performance on par with PyTorch for float32, and for float16 at the larger (4, 4096, 2048) shape. For float16 at (16, 1024, 768), numerical correctness is confirmed, but our kernel is roughly 20% slower than PyTorch (34.52 us vs 28.60 us). This indicates the baseline implementation is robust and stable; the next priority is to introduce advanced optimization techniques to close this gap.

Installation

  1. Ensure you have the NVIDIA CUDA Toolkit and a compatible version of PyTorch installed.
  2. Clone this repository:
    git clone https://github.com/<Your-Username>/llm-kernel-foundry.git
    cd llm-kernel-foundry
  3. Navigate to the python directory and install the package in editable mode. This will compile the CUDA kernels.
    cd python
    pip install -e .
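The editable install compiles the CUDA sources at install time. The repository's actual build file is in the python directory; purely as an illustration of how that step is typically wired up for a PyTorch CUDA extension (module and source file names below are assumptions, not this repository's layout), a setup.py sketch:

```python
# Hypothetical setup.py sketch for a PyTorch CUDA extension.
# Names ("lkf", "csrc/layernorm.cu") are illustrative only.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="lkf",
    ext_modules=[
        CUDAExtension(
            name="lkf._C",                     # compiled extension module
            sources=["csrc/layernorm.cu"],     # hypothetical kernel source
            extra_compile_args={"nvcc": ["-O3"]},
        )
    ],
    cmdclass={"build_ext": BuildExtension},    # invokes nvcc during build
)
```

`pip install -e .` then triggers `BuildExtension`, which runs nvcc and links the extension against the installed PyTorch.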

Running Benchmarks and Tests

To validate the correctness and benchmark the performance of the implemented kernels, run the corresponding test script from the root directory of the project.

For example, to test the LayerNorm kernel:

python3 benchmarks/test_layernorm.py

This script will execute correctness checks against the native PyTorch implementation and then run a performance benchmark.
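The general pattern of such a script, correctness first and timing second, can be sketched with stdlib tools. The actual script presumably uses torch.allclose and CUDA events for device-accurate timing, so everything below is an illustrative stand-in:

```python
import time

def check_and_benchmark(ours, reference, inputs, rtol=1e-3, iters=100):
    """Verify `ours` against `reference` on `inputs`, then time `ours`.

    Returns the mean wall-clock seconds per call. The tolerance check is a
    stand-in for torch.allclose; the timing loop stands in for CUDA events.
    """
    expected = reference(*inputs)
    actual = ours(*inputs)
    for a, e in zip(actual, expected):
        assert abs(a - e) <= rtol * max(abs(e), 1e-8), "mismatch"
    # Warm-up to exclude one-time costs, then the timed loop.
    for _ in range(10):
        ours(*inputs)
    start = time.perf_counter()
    for _ in range(iters):
        ours(*inputs)
    return (time.perf_counter() - start) / iters
```

For GPU kernels the timed section would additionally synchronize the device, since CUDA launches are asynchronous and wall-clock timing without synchronization undercounts.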
