
Ditto - Direct Torch to TensorRT-LLM Optimizer

Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines. Conventionally, building a TensorRT-LLM engine consists of two steps - checkpoint conversion and trtllm-build - both of which rely on pre-defined model architectures. As a result, converting a novel model requires porting the model with TensorRT-LLM's Python API and writing a custom checkpoint conversion script. By automating these tedious procedures, Ditto aims to make TensorRT-LLM more accessible to the broader AI community.
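
To make the contrast concrete, here is a minimal sketch of the conventional two-step flow next to Ditto's single command. The script path and flags follow TensorRT-LLM's Llama example; the model id and directory paths are placeholders.

    # Conventional TensorRT-LLM flow: convert the checkpoint, then build the engine.
    # (Script path and flags follow TensorRT-LLM's Llama example; paths are placeholders.)
    python examples/llama/convert_checkpoint.py \
        --model_dir ./Llama-3.1-8B-Instruct \
        --output_dir ./ckpt \
        --dtype float16
    trtllm-build \
        --checkpoint_dir ./ckpt \
        --output_dir ./engine \
        --gemm_plugin float16 \
        --gpt_attention_plugin float16

    # Ditto: one command from the HuggingFace model to the engine.
    ditto build meta-llama/Llama-3.1-8B-Instruct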

Latest News

  • [2025/02] Blog post introducing Ditto is published! [Blog]
  • [2025/02] Ditto 0.1.0 released!
  • [2025/04] Ditto 0.2.0 released with new features - MoE, Quantization
  • [2025/07] Ditto 0.3.0 released with new features - Vision Language Model, Draft-Target Model

Getting Started

Key Advantages

  • Ease-of-use: Ditto enables users to convert models with a single command.
    ditto build <huggingface-model-name>
    
  • Enables conversion of novel model architectures into TensorRT-LLM engines, including models that are not supported in TensorRT-LLM due to the absence of checkpoint conversion scripts.
    • For example, as of the publication date of this document (February 10, 2025), Helium is supported in Ditto while it is not in TensorRT-LLM. (Note that you need to re-install the transformers nightly build after installing Ditto via pip install git+https://github.com/huggingface/transformers.git; see the sketch after this list.)
  • Directly converts quantized HuggingFace models.
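
As a concrete walkthrough of the Helium example above, a minimal sketch follows; the HuggingFace model id kyutai/helium-1-preview-2b is assumed for illustration and may differ.

    # Re-install the transformers nightly build after installing Ditto
    # (required for Helium support, as noted above).
    pip install git+https://github.com/huggingface/transformers.git

    # Build a TensorRT-LLM engine directly from the HuggingFace model.
    # The model id below is an assumption for illustration.
    ditto build kyutai/helium-1-preview-2b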

Benchmarks

We have conducted comprehensive benchmarks of both output quality and inference performance to validate Ditto's conversion process. Llama3.3-70B-Instruct, Llama3.1-8B-Instruct, and Helium1-preview-2B were used for the benchmarks, and all benchmarks were run with both the GEMM and GPT attention plugins enabled.

Quality

We used the TensorRT-LLM llmapi integrated with lm-evaluation-harness for quality evaluation. For the Helium model, the ifeval task was excluded since it is not an instruction-tuned model.
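
For reference, a comparable 0-shot run over the same task set can be launched with the lm-evaluation-harness CLI. The sketch below uses the plain HuggingFace backend and an example model id rather than the exact TensorRT-LLM llmapi integration used for the numbers in the table.

    # 0-shot evaluation over the same task set (HF backend shown for illustration).
    lm_eval --model hf \
        --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
        --tasks mmlu,wikitext,gpqa_main_zeroshot,arc_challenge,ifeval \
        --num_fewshot 0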

Model                  Engine    MMLU        wikitext2   gpqa_main_zeroshot   arc_challenge   ifeval
                                 (Accuracy)  (PPL)       (Accuracy)           (Accuracy)      (Accuracy)
Llama3.3-70B-Instruct  Ditto     0.819       3.96        0.507                0.928           0.915
                       TRT-LLM   0.819       3.96        0.507                0.928           0.915
Llama3.1-8B-Instruct   Ditto     0.680       8.64        0.350                0.823           0.815
                       TRT-LLM   0.680       8.64        0.350                0.823           0.815
Helium1-preview-2B     Ditto     0.486       11.37       0.263                0.578           -
                       TRT-LLM   Not Supported

NOTE: All tasks were evaluated in a 0-shot setting.

Throughput

Performance benchmarks were conducted using TensorRT-LLM's gptManagerBenchmark. "A100" in the table denotes an A100-SXM4-80GB.
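
For orientation, a typical invocation looks like the sketch below. The flag names follow TensorRT-LLM's C++ benchmark documentation; the engine and dataset paths are placeholders (the dataset is usually prepared with TensorRT-LLM's prepare_dataset.py), and the exact flags may vary by TensorRT-LLM version.

    # Inflight-batching throughput benchmark on a prebuilt engine
    # (paths are placeholders; see TensorRT-LLM's benchmarks/cpp README).
    ./benchmarks/gptManagerBenchmark \
        --engine_dir ./engine \
        --type IFB \
        --dataset ./preprocessed_dataset.json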

Model                  Engine    TP   A100          A6000        L40
                                      (token/sec)   (token/sec)  (token/sec)
Llama3.3-70B-Instruct  Ditto     4    1759.2        -            -
                       TRT-LLM   4    1751.6        -            -
Llama3.1-8B-Instruct   Ditto     1    3357.9        1479.8       1085.2
                       TRT-LLM   1    3318.0        1508.6       1086.5
Helium1-preview-2B     Ditto     1    -             1439.5       1340.5
                       TRT-LLM   1    Not Supported

Support Matrix

Models

  • Llama2-7B
  • Llama3-8B
  • Llama3.1-8B
  • Llama3.2
  • Llama3.3-70B
  • Mistral-7B
  • Gemma2-9B
  • Phi4
  • Phi3.5-mini
  • Qwen2-7B
  • Codellama
  • Codestral
  • ExaOne3.5-8B
  • aya-expanse-8B
  • Llama-DNA-1.0-8B
  • SOLAR-10.7B
  • Falcon
  • Nemotron
  • 42dot_LLM-SFT-1.3B
  • Helium1-2B
  • Sky-T1-32B
  • SmolLM2-1.7B
  • Mixtral-8x7B
  • Qwen-MoE
  • DeepSeek-V1
  • Qwen3, Qwen3-MoE
  • and many others that we haven't tested yet

Features

  • Multi LoRA
  • Tensor Parallelism / Pipeline Parallelism
  • Mixture of Experts
  • Quantization - Weight-only & FP8 (AutoAWQ, Compressed Tensors)
  • Multimodal (Vision Language Models)
  • Speculative Decoding (Draft-Target Model)

What's Next?

The features below are planned for upcoming Ditto releases. Feel free to reach out if you have any questions or suggestions.

  • Additional Quantization Support
  • Expert Parallelism
  • Multimodal
  • Speculative Decoding
  • Prefix Caching
  • State Space Model
  • Encoder-Decoder Model
