Model performance
How to build function calling and JSON mode for open-source and fine-tuned LLMs
Use a state machine to generate token masks for logit biasing to enable function calling and structured output at the model server level.
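The mechanism is simple to sketch: at each decoding step, the current state of the grammar determines which tokens are legal, and every other logit is biased to negative infinity before sampling, so the model can only emit valid output. Here is a minimal sketch with a toy vocabulary and a hypothetical hand-written state machine; a real server would compile the grammar from a JSON Schema or function signature and mask over the model's full tokenizer vocabulary.

```python
import math

# Toy vocabulary; a real server masks over the model's full tokenizer vocabulary.
VOCAB = ['{', '"name"', ':', '"tool_a"', '"tool_b"', '}']

# Hypothetical grammar: each state maps to the set of tokens that are legal next.
ALLOWED = {
    "start": {'{'},
    "open":  {'"name"'},
    "key":   {':'},
    "colon": {'"tool_a"', '"tool_b"'},
    "value": {'}'},
    "done":  set(),
}

# State transitions taken after a legal token is emitted.
NEXT_STATE = {
    ("start", '{'): "open",
    ("open", '"name"'): "key",
    ("key", ':'): "colon",
    ("colon", '"tool_a"'): "value",
    ("colon", '"tool_b"'): "value",
    ("value", '}'): "done",
}

def mask_logits(logits, state):
    """Bias every grammar-illegal token's logit to -inf so it cannot be sampled."""
    allowed = ALLOWED[state]
    return [l if VOCAB[i] in allowed else -math.inf for i, l in enumerate(logits)]

# Greedy decode with masking: whatever the raw logits say, the output is valid.
state, output = "start", []
while state != "done":
    raw_logits = [0.0] * len(VOCAB)          # stand-in for real model logits
    masked = mask_logits(raw_logits, state)
    token = VOCAB[max(range(len(VOCAB)), key=lambda i: masked[i])]
    output.append(token)
    state = NEXT_STATE[(state, token)]

print("".join(output))   # {"name":"tool_a"}
```

Because the mask is applied to the logits inside the model server, the guarantee holds for any open-source or fine-tuned model, not only models trained for function calling.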
How to double tokens per second for Llama 3 with Medusa
We observe up to a 122% increase in tokens per second for Llama 3 after training custom Medusa heads and running the updated model with TensorRT-LLM.
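Medusa adds extra lightweight language-model heads that each guess a token one more step into the future; the base model then verifies those guesses in a single forward pass and keeps the longest agreeing prefix, so several tokens can be emitted per pass. A minimal sketch of that accept step, simplified to greedy verification and ignoring Medusa's tree attention, with all names hypothetical:

```python
def accept_draft(draft_tokens, verified_tokens):
    """Keep the longest prefix of the Medusa draft that the base model agrees with.

    draft_tokens:    tokens proposed by the extra Medusa heads for positions t+1..t+k
    verified_tokens: the base model's own greedy predictions for those same positions,
                     obtained from a single forward pass over the drafted sequence
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    # The verification pass also yields one "free" correct token: the base
    # model's prediction at the first position where the draft was rejected.
    bonus = verified_tokens[len(accepted)] if len(accepted) < len(verified_tokens) else None
    return accepted, bonus

# Example: three heads drafted [42, 7, 19]; the base model agreed on 42 and 7,
# so this step emits three tokens (42, 7, and the corrected 23) instead of one.
print(accept_draft([42, 7, 19], [42, 7, 23]))   # ([42, 7], 23)
```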
How to serve 10,000 fine-tuned LLMs from a single GPU
LoRA swapping with TRT-LLM supports in-flight batching and loads LoRA weights in 1-2 ms, enabling each request to hit a different fine-tune.
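Per-request swapping is cheap because a LoRA adapter only adds two small low-rank matrices per weight: the base weights stay resident on the GPU and the output is computed as W·x plus B·(A·x). A minimal NumPy sketch of the idea, with a hypothetical adapter cache keyed by customer (TensorRT-LLM implements this natively and batches requests that use different adapters together):

```python
import numpy as np

d_model, rank = 4096, 16
W = np.random.randn(d_model, d_model).astype(np.float32)   # shared base weight, stays on GPU

# Hypothetical adapter cache: each LoRA pair holds only 2 * d_model * rank values
# for this projection, which is why loading one takes milliseconds, not seconds.
adapters = {
    "customer-a": (np.random.randn(rank, d_model).astype(np.float32) * 0.01,   # A
                   np.random.randn(d_model, rank).astype(np.float32) * 0.01),  # B
    "customer-b": (np.random.randn(rank, d_model).astype(np.float32) * 0.01,
                   np.random.randn(d_model, rank).astype(np.float32) * 0.01),
}

def lora_forward(x, adapter_id):
    """Base projection plus the per-request low-rank correction."""
    A, B = adapters[adapter_id]
    return W @ x + B @ (A @ x)

# Two requests in the same batch can hit two different fine-tunes.
x = np.random.randn(d_model).astype(np.float32)
print(lora_forward(x, "customer-a")[:3])
print(lora_forward(x, "customer-b")[:3])
```

At rank 16, each adapter pair in this sketch is roughly half a megabyte per projection, versus tens of gigabytes for the shared base model.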
Benchmarking fast Mistral 7B inference
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.
33% faster LLM inference with FP8 quantization
Quantizing open-source LLMs to FP8 left perplexity nearly unchanged while yielding material performance improvements across latency, throughput, and cost.
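Perplexity is the yardstick behind that claim: it is the exponential of the average per-token negative log-likelihood on held-out text, so evaluating the FP16 and FP8 builds of the same model on the same text directly measures any quality change from quantization. A minimal sketch of the comparison, using made-up per-token log-probs as stand-ins for real model outputs:

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood over a held-out text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs from the same evaluation text.
fp16_logprobs = [-1.91, -2.03, -1.76, -2.12, -1.88]
fp8_logprobs  = [-1.92, -2.04, -1.77, -2.13, -1.89]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_fp8 = perplexity(fp8_logprobs)
print(f"FP16 perplexity: {ppl_fp16:.3f}")
print(f"FP8 perplexity:  {ppl_fp8:.3f}   (change: {ppl_fp8 - ppl_fp16:+.3f})")
```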
High performance ML inference with NVIDIA TensorRT
Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.
40% faster Stable Diffusion XL inference with NVIDIA TensorRT
Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.
Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT
Double or triple throughput at same-or-better latencies by switching to H100 GPUs from A100s for model inference with TensorRT/TensorRT-LLM.
Faster Mixtral inference with TensorRT-LLM and quantization
Mixtral 8x7B's sparse mixture-of-experts architecture gives it structurally faster inference than the similarly powerful Llama 2 70B, and we can make it even faster with TensorRT-LLM and int8 quantization.
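The structural advantage comes from sparsity: Mixtral routes each token through only two of its eight experts, so although it has roughly 47B parameters in total, only about 13B participate per token, and autoregressive decoding is dominated by streaming those active weights from GPU memory. A rough back-of-the-envelope comparison of weight bytes read per decoded token (parameter counts are approximate and the KV cache is ignored):

```python
# Decode speed is roughly proportional to the bytes of weights that must be
# streamed from GPU memory for each generated token.
llama2_70b_active = 70e9          # dense: every weight participates per token
mixtral_active    = 13e9          # ~47B total, but only 2 of 8 experts are active

for name, params in [("Llama 2 70B", llama2_70b_active), ("Mixtral 8x7B", mixtral_active)]:
    for dtype, bytes_per_param in [("fp16", 2), ("int8", 1)]:
        gb_per_token = params * bytes_per_param / 1e9
        print(f"{name:13s} {dtype}: ~{gb_per_token:.0f} GB of weights read per token")
```

Int8 quantization halves the bytes read again, which is where the additional speedup on top of the architectural advantage comes from.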
A guide to LLM inference and performance
Learn whether LLM inference is compute bound or memory bound, and how to use that insight to get better utilization out of your GPUs.
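The deciding quantity is arithmetic intensity: the FLOPs a kernel performs per byte it moves from GPU memory, compared against the GPU's own FLOPs-to-bandwidth ratio. In autoregressive decoding each weight is read once per step but contributes roughly two FLOPs (a multiply and an add) per sequence in the batch, so small batches sit far below the hardware ratio and are memory bound; batching raises the FLOPs done per weight read. A rough sketch using approximate published A100 specs:

```python
# A100 SXM, approximate published specs.
peak_flops = 312e12        # FP16 tensor-core FLOP/s
mem_bandwidth = 2.0e12     # HBM bytes/s
gpu_ops_per_byte = peak_flops / mem_bandwidth   # ~156 FLOPs per byte

def decode_intensity(batch_size, bytes_per_param=2):
    """FLOPs per byte of weights read during decoding.

    Each weight is read from memory once per decode step but contributes
    ~2 FLOPs (multiply + add) for every sequence in the batch.
    """
    return 2 * batch_size / bytes_per_param

for batch in (1, 8, 64, 256):
    intensity = decode_intensity(batch)
    bound = "memory bound" if intensity < gpu_ops_per_byte else "compute bound"
    print(f"batch {batch:3d}: ~{intensity:5.0f} FLOPs/byte -> {bound}")
```

Until the batch is large enough to cross the hardware ratio, faster decoding comes from moving fewer bytes (quantization, MoE sparsity), not from more compute.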