Model performance

Faster Mixtral inference with TensorRT-LLM and quantization

Mixtral 8x7B's mixture-of-experts architecture gives it faster inference than the similarly powerful Llama 2 70B, and we can make it faster still with TensorRT-LLM and int8 quantization.
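
For intuition on the quantization half of that claim, here is a minimal PyTorch sketch of per-channel int8 weight-only quantization, the general idea the article applies. It is not TensorRT-LLM's API; all function names here are ours, for illustration only.

```python
import torch

def quantize_int8_weight_only(w: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a weight matrix."""
    # Scale each row so its largest magnitude maps to 127.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    w_q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return w_q, scale

def dequant_matmul(x: torch.Tensor, w_q: torch.Tensor, scale: torch.Tensor):
    # Weights are stored in int8 (half the bytes of fp16) and dequantized
    # on the fly, so the matmul reads half as much data from memory.
    return x @ (w_q.to(x.dtype) * scale).T

w = torch.randn(4096, 4096)
w_q, scale = quantize_int8_weight_only(w)
x = torch.randn(1, 4096)
err = (dequant_matmul(x, w_q, scale) - x @ w.T).abs().max()
print(f"max abs error: {err:.4f}")
```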

A guide to LLM inference and performance

Learn whether LLM inference is compute-bound or memory-bound, and how to use that distinction to get better utilization out of your GPUs.
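
As a quick illustration of that compute-vs-memory-bound question, here is a back-of-envelope roofline calculation. The hardware figures are published A100 SXM specs; the workload is assumed to be batch-1 decoding of a 70B-parameter model at fp16.

```python
# Is batch-1 LLM token generation compute- or memory-bound on an A100?
A100_FLOPS = 312e12   # peak fp16 tensor-core FLOPS (A100 SXM)
A100_BW    = 2.0e12   # HBM bandwidth in bytes/s (~2 TB/s)

# Each generated token must read every weight once (fp16 = 2 bytes/param)
# and does roughly one multiply-add (2 FLOPs) per parameter.
params = 70e9                    # Llama 2 70B
bytes_per_token = params * 2
flops_per_token = 2 * params

arithmetic_intensity = flops_per_token / bytes_per_token  # FLOPs per byte
gpu_ops_per_byte = A100_FLOPS / A100_BW

print(f"arithmetic intensity: {arithmetic_intensity:.0f} FLOPs/byte")
print(f"A100 ops:byte ratio:  {gpu_ops_per_byte:.0f} FLOPs/byte")
# 1 FLOP/byte is far below the GPU's ~156 FLOPs/byte ratio, so
# batch-1 decoding is heavily memory-bound: bandwidth, not compute,
# sets the token rate.
```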

SDXL inference in under 2 seconds: the ultimate guide to Stable Diffusion optimization

Out of the box, SDXL 1.0 takes 8-10 seconds to generate a 1024x1024px image on an A100 GPU. Learn how to cut that to just 1.92 seconds on the same hardware.
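
For a taste of the kinds of optimizations involved, here is a sketch using Hugging Face diffusers with fp16 weights and a reduced step count. It illustrates two common SDXL speedups, not necessarily the full recipe from the guide.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the public SDXL 1.0 base model in half precision to cut
# memory traffic roughly in half versus fp32.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

image = pipe(
    "a photo of an astronaut riding a horse",
    num_inference_steps=30,  # fewer denoising steps -> lower latency
    height=1024,
    width=1024,
).images[0]
image.save("astronaut.png")
```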