Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference
Our new Speculative Decoding integration can cut latency in half for production LLM workloads.
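For readers new to the technique, here is a minimal sketch of a greedy speculative decoding loop, not Baseten's implementation: `draft_next` and `target_next` are hypothetical stand-ins for a small draft model and the large target model, and a real engine would verify all drafted positions in one batched forward pass rather than looping.

```python
# Minimal greedy speculative decoding sketch. `draft_next` and `target_next`
# are hypothetical callables mapping a token sequence to the next token id.

def speculative_decode(prompt, draft_next, target_next, k=4, max_new=64):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The target model verifies the proposals. In production this is
        #    a single batched forward pass over all k positions, which is
        #    where the latency win comes from; here we loop for clarity.
        accepted = 0
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted += 1
            else:
                # First mismatch: keep the accepted prefix plus the
                # target's own token, then draft again from there.
                tokens.extend(draft[:accepted])
                tokens.append(expected)
                break
        else:
            # All k drafts accepted; the target emits one bonus token.
            tokens.extend(draft)
            tokens.append(target_next(tokens))
    return tokens
```

Because every accepted draft token costs only a verification slot instead of a full sequential decode step, latency drops roughly in proportion to the acceptance rate.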
How to double tokens per second for Llama 3 with Medusa
We observe up to a 122% increase in tokens per second (i.e., 2.22x the baseline, slightly better than double) for Llama 3 after training custom Medusa heads and running the updated model with TensorRT-LLM.
Introducing automatic LLM optimization with TensorRT-LLM Engine Builder
The TensorRT-LLM Engine Builder empowers developers to deploy highly optimized, performant inference servers for open-source and fine-tuned LLMs.
Benchmarking fast Mistral 7B inference
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.
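As a rough illustration of how these two metrics are typically measured, the sketch below times a streaming response; `stream_tokens` is a hypothetical generator yielding tokens as the server produces them, and the exact benchmark methodology may differ.

```python
import time

def benchmark(stream_tokens):
    """Measure time to first token (TTFT) and steady-state tokens/sec."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream_tokens:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time to first token
        count += 1
    total = time.perf_counter() - start
    # Decode rate after the first token; one common definition among several.
    tps = (count - 1) / (total - ttft) if count > 1 else 0.0
    return ttft, tps
```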
Introduction to quantizing ML models
Quantizing ML models like LLMs makes it possible to run large models on less expensive GPUs, but it must be done carefully to avoid degrading output quality.
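To make the trade-off concrete, here is a toy symmetric int8 quantization of a single weight tensor; this is illustrative only, not the calibration pipeline a production stack like TensorRT-LLM uses.

```python
import numpy as np

# Toy symmetric int8 quantization of one fp32 weight tensor.
weights = np.random.randn(4096).astype(np.float32)

scale = np.abs(weights).max() / 127.0  # map the fp32 range onto int8
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale  # values the kernel effectively uses

# The "done carefully" part: quantization error must stay small relative
# to the weights themselves, or model quality degrades.
error = np.abs(weights - dequantized).mean()
print(f"mean abs error: {error:.5f} (scale {scale:.5f})")
```

Int8 storage is 4x smaller than fp32, which is why quantized models fit on cheaper GPUs, but outlier weights inflate the scale and can make the per-element rounding error unacceptable without careful calibration.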
How to deploy Stable Diffusion using Truss
Explore deploying Stability AI's open-source Stable Diffusion model on Baseten. This walkthrough covers the deployment process step by step.