PTQ for LLMs covers how to use Post-training quantization (PTQ) on popular pre-trained models from supported frameworks and export them to TensorRT-LLM for deployment; a minimal PTQ sketch follows the quantization examples below.
PTQ for DeepSeek shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.
PTQ for Diffusers walks through how to quantize a diffusion model with FP8 or INT8, export to ONNX, and deploy with TensorRT. The Diffusers example in this repo complements the demoDiffusion example in the TensorRT repo and includes FP8 plugins as well as the latest updates on INT8 quantization.
PTQ for VLMs covers how to use Post-training quantization (PTQ) and export to TensorRT-LLM for deployment of popular Vision Language Models (VLMs).
PTQ for ONNX Models shows how to quantize ONNX models in INT4 or INT8 mode. The examples also cover deploying the quantized ONNX models with TensorRT.
QAT for LLMs demonstrates the recipe and workflow for Quantization-aware Training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or FP4 on the NVIDIA Blackwell platform).
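The following is a minimal sketch of the PTQ-then-export workflow the quantization examples above describe, using the `modelopt.torch.quantization` API. The model name, calibration texts, and export arguments are placeholders, and exact signatures may vary across ModelOpt versions:

```python
# Hedged sketch of PTQ + TensorRT-LLM export with ModelOpt.
# Assumptions: nvidia-modelopt and transformers are installed; the model name
# and calibration texts below are placeholders.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
calib_texts = ["Example calibration sentence."] * 128  # placeholder calibration set

def forward_loop(m):
    # Run calibration data through the model so ModelOpt can collect
    # activation statistics for the quantizers it inserted.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Insert quantizers and calibrate; swap FP8_DEFAULT_CFG for e.g. INT4_AWQ_CFG
# to target a different precision.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a TensorRT-LLM checkpoint for engine build and deployment.
export_tensorrt_llm_checkpoint(model, decoder_type="llama", export_dir="llama_fp8_ckpt")

# For QAT, keep training the quantized model with the usual fine-tuning loop;
# the (fake-)quantized ops stay in place and the weights adapt to low precision.
```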
Pruning demonstrates how to optimally prune Linear and Conv layers, as well as Transformer attention heads, MLP width, and depth, using Model Optimizer across the supported frameworks.
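Below is a hedged sketch of structured pruning with `modelopt.torch.prune`, assuming the FastNAS mode on a torchvision classifier; the FLOPs constraint, data loader, and score function are placeholders, and argument names may differ by version:

```python
# Hedged FastNAS pruning sketch; the data loader and score function are
# placeholders you must supply with real data and a real quality metric.
import torch
import torchvision
import modelopt.torch.prune as mtp
from torch.utils.data import DataLoader, TensorDataset

model = torchvision.models.resnet50().cuda()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")

# Placeholder data loader (random tensors) used to score candidate subnets.
train_loader = DataLoader(
    TensorDataset(torch.randn(32, 3, 224, 224), torch.randint(0, 1000, (32,))),
    batch_size=8,
)

def score_func(m):
    # Placeholder quality metric (e.g., top-1 validation accuracy) used to rank
    # pruned subnets during the search.
    return 0.0

pruned_model, _ = mtp.prune(
    model=model,
    mode="fastnas",
    constraints={"flops": "60%"},  # keep at most ~60% of the original FLOPs
    dummy_input=dummy_input,
    config={"data_loader": train_loader, "score_func": score_func},
)
# Fine-tune pruned_model afterwards to recover accuracy.
```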
Distillation for LLMs demonstrates how to use Knowledge Distillation, which can increase accuracy and/or convergence speed for fine-tuning / QAT.
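A hedged sketch of the knowledge-distillation setup with `modelopt.torch.distill` follows; the teacher and student checkpoints are placeholders, and the `kd_loss` config keys reflect the documented pattern as understood here, so verify against your ModelOpt version:

```python
# Hedged KD sketch: wrap a small student so its forward also runs the teacher,
# then combine the task loss with a distillation loss during fine-tuning.
import modelopt.torch.distill as mtd
from transformers import AutoModelForCausalLM

teacher = AutoModelForCausalLM.from_pretrained("path/to/large-teacher")   # placeholder
student = AutoModelForCausalLM.from_pretrained("path/to/small-student")   # placeholder

kd_config = {
    "teacher_model": teacher,
    "criterion": mtd.LogitsDistillationLoss(),  # soft-label loss on output logits
    "loss_balancer": mtd.StaticLossBalancer(),  # fixed weighting of KD vs. task loss
}
student = mtd.convert(student, mode=[("kd_loss", kd_config)])

# In the training loop (schematic):
#   outputs = student(**batch)
#   loss = student.compute_kd_loss(student_loss=outputs.loss)
#   loss.backward()
```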
Speculative Decoding
Speculative Decoding demonstrates how to use speculative decoding to accelerate the text generation of large language models.
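A hedged sketch of attaching Medusa draft heads with `modelopt.torch.speculative` follows; the config keys and head counts are assumptions for illustration and may not match the example's exact settings:

```python
# Hedged Medusa conversion sketch: extra heads are added to the base LLM and
# later fine-tuned to propose draft tokens that the base model verifies.
import modelopt.torch.speculative as mtsp
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder
medusa_config = {"medusa_num_heads": 4, "medusa_num_layers": 1}  # assumed keys
model = mtsp.convert(model, [("medusa", medusa_config)])

# Fine-tune the converted model (typically with the base weights frozen) so the
# draft heads learn to predict several future tokens per step; at inference the
# verified drafts accelerate generation without changing the output distribution.
```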
Sparsity
Sparsity for LLMs shows how to perform Post-training Sparsification and Sparsity-aware fine-tuning on a pre-trained Hugging Face model.
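The sketch below illustrates post-training 2:4 sparsification with `modelopt.torch.sparsity` in SparseGPT mode; the calibration loader is a synthetic placeholder, and the config keys (e.g., whether a `collect_func` is needed to map batches to model inputs) may differ by version:

```python
# Hedged SparseGPT sparsification sketch; calib_loader is a placeholder built
# from random token IDs and should be replaced with real tokenized samples.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForCausalLM
import modelopt.torch.sparsity as mts

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").cuda()
calib_loader = DataLoader(TensorDataset(torch.randint(0, 32000, (64, 512))), batch_size=4)

# Apply 2:4 structured sparsity to the Linear weights using calibration data.
model = mts.sparsify(model, mode="sparsegpt", config={"data_loader": calib_loader})

# Sparsity-aware fine-tuning: continue training as usual; ModelOpt keeps the
# sparsity masks fixed so the 2:4 pattern survives the weight updates.
```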
Evaluation
Evaluation for LLMs shows how to evaluate the performance of LLMs on popular benchmarks for quantized models or TensorRT-LLM engines.
Evaluation for VLMs shows how to evaluate the performance of VLMs on popular benchmarks for quantized models or TensorRT-LLM engines.
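One common route for benchmarking quantized LLM checkpoints, as in the evaluation examples above, is the lm-evaluation-harness Python API; the checkpoint path and task list below are placeholders, and the repo's evaluation scripts may wrap different tooling:

```python
# Hedged benchmark sketch with lm-evaluation-harness (pip install lm-eval);
# the checkpoint path and tasks are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                              # Hugging Face backend
    model_args="pretrained=path/to/quantized-checkpoint",    # placeholder path
    tasks=["mmlu", "gsm8k"],
    batch_size=8,
)
print(results["results"])
```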
Chaining
Chained Optimizations shows how to chain multiple optimizations together (e.g. Pruning + Distillation + Quantization).
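A hedged, toy-scale sketch of chaining two optimizations on one model follows: knowledge distillation, then INT8 PTQ, with `modelopt.torch.opt` saving the combined state. The module shapes and configs are made up for illustration, and the helper names should be checked against your ModelOpt version:

```python
# Hedged chaining sketch on toy modules: distill, then quantize, then save the
# stacked ModelOpt state in a single checkpoint. Configs are illustrative only.
import torch
import torch.nn as nn
import modelopt.torch.distill as mtd
import modelopt.torch.quantization as mtq
import modelopt.torch.opt as mto

teacher = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

# Step 1: wrap the student for distillation from the teacher.
student = mtd.convert(student, mode=[("kd_loss", {
    "teacher_model": teacher,
    "criterion": mtd.LogitsDistillationLoss(),
})])
# ... fine-tune here using student.compute_kd_loss(...), then drop the teacher:
student = mtd.export(student)

# Step 2: INT8 PTQ on the distilled student.
calib_batches = [torch.randn(8, 64) for _ in range(16)]
def forward_loop(m):
    for x in calib_batches:
        m(x)
student = mtq.quantize(student, mtq.INT8_DEFAULT_CFG, forward_loop)

# Step 3: one checkpoint captures all chained ModelOpt modifications.
mto.save(student, "chained_student.pth")
```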
Model Hub
Model Hub provides an example of deploying and running the quantized Llama 3.1 8B Instruct model from NVIDIA's Hugging Face model hub on both TensorRT-LLM and vLLM.
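Below is a hedged sketch of serving one of these pre-quantized checkpoints with vLLM; the model ID is a placeholder (check NVIDIA's Hugging Face collection for the exact name), and newer vLLM versions may auto-detect the ModelOpt quantization without the explicit flag:

```python
# Hedged vLLM deployment sketch for a ModelOpt-quantized checkpoint;
# the model ID below is a placeholder from NVIDIA's Hugging Face collection.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",  # placeholder model ID
    quantization="modelopt",                   # vLLM's ModelOpt (FP8) path
)
outputs = llm.generate(
    ["Explain post-training quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```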
Windows
Windows contains examples for Model Optimizer on Windows.