How we built production-ready speculative decoding with TensorRT-LLM
Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.
A quick introduction to speculative decoding
Speculative decoding reduces LLM inference latency by using a smaller draft model to propose tokens, which the larger target model then verifies and either accepts or rejects during inference.
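To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding in plain Python. It is an illustration under stated assumptions, not the TensorRT-LLM or Engine Builder API: `toy_target`, `toy_draft`, `speculative_decode`, and `greedy_decode` are hypothetical names, and the "models" are simple deterministic functions over integer token IDs. A real system would check all draft positions in a single forward pass of the target model, which is where the latency saving comes from; the loop below checks them one by one for clarity.

```python
from typing import Callable, List

# A "model" here is any function mapping a token sequence to its next token.
NextToken = Callable[[List[int]], int]


def toy_target(tokens: List[int]) -> int:
    """Hypothetical stand-in for the large target model (greedy next token)."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def toy_draft(tokens: List[int]) -> int:
    """Hypothetical stand-in for the small draft model.

    It agrees with the target most of the time and is deliberately wrong
    whenever the sequence length is a multiple of 3.
    """
    guess = toy_target(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 50


def greedy_decode(target_next: NextToken, prompt: List[int],
                  max_new_tokens: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next: NextToken, draft_next: NextToken,
                       prompt: List[int], max_new_tokens: int,
                       k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The target model checks each proposed position in order.
        accepted: List[int] = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the target adds one more token.
            accepted.append(target_next(tokens + draft))

        tokens.extend(accepted)
    # The last round may overshoot the requested length, so trim it.
    return tokens[:goal]


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(toy_target, toy_draft, prompt, max_new_tokens=20)
    slow = greedy_decode(toy_target, prompt, max_new_tokens=20)
    print(fast == slow)
```

In this greedy variant the output matches what the target model would produce on its own, because every kept token is one the target itself would have chosen. The draft model only changes how many tokens are settled per verification round.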
Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference
Our new Speculative Decoding integration can cut latency in half for production LLM workloads.
Benchmarking fast Mistral 7B inference
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.
High performance ML inference with NVIDIA TensorRT
Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.
40% faster Stable Diffusion XL inference with NVIDIA TensorRT
Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.
Build with OpenAI’s Whisper model in five minutes
Deploy OpenAI Whisper for free on Baseten instantly from our model library. Or stick around to learn how to package and deploy Whisper with Truss.