PTQ for LLMs covers how to use Post-training quantization (PTQ) on popular pre-trained models from supported frameworks and export them to TensorRT-LLM for deployment; a minimal PTQ sketch follows the quantization examples below.
PTQ for DeepSeek shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.
PTQ for Diffusers walks through how to quantize a diffusion model with FP8 or INT8, export to ONNX, and deploy with TensorRT. The Diffusers example in this repo complements the demoDiffusion example in the TensorRT repo and includes FP8 plugins as well as the latest updates on INT8 quantization.
PTQ for VLMs covers how to use Post-training quantization (PTQ) and export to TensorRT-LLM for deployment of popular Vision Language Models (VLMs).
PTQ for ONNX Models shows how to quantize ONNX models in INT4 or INT8 mode. The examples also cover deploying the quantized ONNX models with TensorRT.
QAT for LLMs demonstrates the recipe and workflow for Quantization-aware Training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or FP4 on the NVIDIA Blackwell platform).
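The following is a minimal sketch of the PTQ-then-export workflow the quantization examples above describe, using the `modelopt.torch.quantization` API. The model name, calibration texts, and export arguments are placeholders, and exact signatures may vary across ModelOpt versions:

```python
# Hedged sketch of PTQ + TensorRT-LLM export with ModelOpt.
# Assumptions: nvidia-modelopt and transformers are installed; the model name
# and calibration texts below are placeholders.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
calib_texts = ["Example calibration sentence."] * 128  # placeholder calibration set

def forward_loop(m):
    # Run calibration data through the model so ModelOpt can collect
    # activation statistics for the quantizers it inserted.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Insert quantizers and calibrate; swap FP8_DEFAULT_CFG for e.g. INT4_AWQ_CFG
# to target a different precision.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a TensorRT-LLM checkpoint for engine build and deployment.
export_tensorrt_llm_checkpoint(model, decoder_type="llama", export_dir="llama_fp8_ckpt")

# For QAT, keep training the quantized model with the usual fine-tuning loop;
# the (fake-)quantized ops stay in place and the weights adapt to low precision.
```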
Pruning demonstrates how to optimally prune Linear and Conv layers, as well as Transformer attention heads, MLP width, and depth, using Model Optimizer across the supported frameworks.
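Below is a hedged sketch of structured pruning with `modelopt.torch.prune`, assuming the FastNAS mode on a torchvision classifier; the FLOPs constraint, data loader, and score function are placeholders, and argument names may differ by version:

```python
# Hedged FastNAS pruning sketch; the data loader and score function are
# placeholders you must supply with real data and a real quality metric.
import torch
import torchvision
import modelopt.torch.prune as mtp
from torch.utils.data import DataLoader, TensorDataset

model = torchvision.models.resnet50().cuda()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")

# Placeholder data loader (random tensors) used to score candidate subnets.
train_loader = DataLoader(
    TensorDataset(torch.randn(32, 3, 224, 224), torch.randint(0, 1000, (32,))),
    batch_size=8,
)

def score_func(m):
    # Placeholder quality metric (e.g., top-1 validation accuracy) used to rank
    # pruned subnets during the search.
    return 0.0

pruned_model, _ = mtp.prune(
    model=model,
    mode="fastnas",
    constraints={"flops": "60%"},  # keep at most ~60% of the original FLOPs
    dummy_input=dummy_input,
    config={"data_loader": train_loader, "score_func": score_func},
)
# Fine-tune pruned_model afterwards to recover accuracy.
```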
Distillation for LLMs demonstrates how to use Knowledge Distillation, which can increase accuracy and/or convergence speed for fine-tuning / QAT.
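A hedged sketch of the knowledge-distillation setup with `modelopt.torch.distill` follows; the teacher and student checkpoints are placeholders, and the `kd_loss` config keys reflect the documented pattern as understood here, so verify against your ModelOpt version:

```python
# Hedged KD sketch: wrap a small student so its forward also runs the teacher,
# then combine the task loss with a distillation loss during fine-tuning.
import modelopt.torch.distill as mtd
from transformers import AutoModelForCausalLM

teacher = AutoModelForCausalLM.from_pretrained("path/to/large-teacher")   # placeholder
student = AutoModelForCausalLM.from_pretrained("path/to/small-student")   # placeholder

kd_config = {
    "teacher_model": teacher,
    "criterion": mtd.LogitsDistillationLoss(),  # soft-label loss on output logits
    "loss_balancer": mtd.StaticLossBalancer(),  # fixed weighting of KD vs. task loss
}
student = mtd.convert(student, mode=[("kd_loss", kd_config)])

# In the training loop (schematic):
#   outputs = student(**batch)
#   loss = student.compute_kd_loss(student_loss=outputs.loss)
#   loss.backward()
```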
Speculative Decoding
Speculative Decoding demonstrates how to use speculative decoding to accelerate the text generation of large language models.
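A hedged sketch of attaching Medusa draft heads with `modelopt.torch.speculative` follows; the config keys and head counts are assumptions for illustration and may not match the example's exact settings:

```python
# Hedged Medusa conversion sketch: extra heads are added to the base LLM and
# later fine-tuned to propose draft tokens that the base model verifies.
import modelopt.torch.speculative as mtsp
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder
medusa_config = {"medusa_num_heads": 4, "medusa_num_layers": 1}  # assumed keys
model = mtsp.convert(model, [("medusa", medusa_config)])

# Fine-tune the converted model (typically with the base weights frozen) so the
# draft heads learn to predict several future tokens per step; at inference the
# verified drafts accelerate generation without changing the output distribution.
```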
Sparsity
Sparsity for LLMs shows how to perform Post-training Sparsification and Sparsity-aware fine-tuning on a pre-trained Hugging Face model.
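The sketch below illustrates post-training 2:4 sparsification with `modelopt.torch.sparsity` in SparseGPT mode; the calibration loader is a synthetic placeholder, and the config keys (e.g., whether a `collect_func` is needed to map batches to model inputs) may differ by version:

```python
# Hedged SparseGPT sparsification sketch; calib_loader is a placeholder built
# from random token IDs and should be replaced with real tokenized samples.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForCausalLM
import modelopt.torch.sparsity as mts

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").cuda()
calib_loader = DataLoader(TensorDataset(torch.randint(0, 32000, (64, 512))), batch_size=4)

# Apply 2:4 structured sparsity to the Linear weights using calibration data.
model = mts.sparsify(model, mode="sparsegpt", config={"data_loader": calib_loader})

# Sparsity-aware fine-tuning: continue training as usual; ModelOpt keeps the
# sparsity masks fixed so the 2:4 pattern survives the weight updates.
```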
Evaluation
Evaluation for LLMs shows how to evaluate the performance of LLMs on popular benchmarks for quantized models or TensorRT-LLM engines.
Evaluation for VLMs shows how to evaluate the performance of VLMs on popular benchmarks for quantized models or TensorRT-LLM engines.
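One common route for benchmarking quantized LLM checkpoints, as in the evaluation examples above, is the lm-evaluation-harness Python API; the checkpoint path and task list below are placeholders, and the repo's evaluation scripts may wrap different tooling:

```python
# Hedged benchmark sketch with lm-evaluation-harness (pip install lm-eval);
# the checkpoint path and tasks are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                              # Hugging Face backend
    model_args="pretrained=path/to/quantized-checkpoint",    # placeholder path
    tasks=["mmlu", "gsm8k"],
    batch_size=8,
)
print(results["results"])
```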
Chaining
Chained Optimizations shows how to chain multiple optimizations together (e.g. Pruning + Distillation + Quantization).
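A hedged, toy-scale sketch of chaining two optimizations on one model follows: knowledge distillation, then INT8 PTQ, with `modelopt.torch.opt` saving the combined state. The module shapes and configs are made up for illustration, and the helper names should be checked against your ModelOpt version:

```python
# Hedged chaining sketch on toy modules: distill, then quantize, then save the
# stacked ModelOpt state in a single checkpoint. Configs are illustrative only.
import torch
import torch.nn as nn
import modelopt.torch.distill as mtd
import modelopt.torch.quantization as mtq
import modelopt.torch.opt as mto

teacher = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

# Step 1: wrap the student for distillation from the teacher.
student = mtd.convert(student, mode=[("kd_loss", {
    "teacher_model": teacher,
    "criterion": mtd.LogitsDistillationLoss(),
})])
# ... fine-tune here using student.compute_kd_loss(...), then drop the teacher:
student = mtd.export(student)

# Step 2: INT8 PTQ on the distilled student.
calib_batches = [torch.randn(8, 64) for _ in range(16)]
def forward_loop(m):
    for x in calib_batches:
        m(x)
student = mtq.quantize(student, mtq.INT8_DEFAULT_CFG, forward_loop)

# Step 3: one checkpoint captures all chained ModelOpt modifications.
mto.save(student, "chained_student.pth")
```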
Model Hub
Model Hub provides an example of deploying and running the quantized Llama 3.1 8B Instruct model from NVIDIA's Hugging Face model hub on both TensorRT-LLM and vLLM.
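Below is a hedged sketch of serving one of these pre-quantized checkpoints with vLLM; the model ID is a placeholder (check NVIDIA's Hugging Face collection for the exact name), and newer vLLM versions may auto-detect the ModelOpt quantization without the explicit flag:

```python
# Hedged vLLM deployment sketch for a ModelOpt-quantized checkpoint;
# the model ID below is a placeholder from NVIDIA's Hugging Face collection.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",  # placeholder model ID
    quantization="modelopt",                   # vLLM's ModelOpt (FP8) path
)
outputs = llm.generate(
    ["Explain post-training quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```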
Windows
Windows contains examples for Model Optimizer on Windows.