Machine learning infrastructure that just works
Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.
Our TensorRT-LLM Engine Builder now supports speculative decoding, which can significantly speed up LLM inference.
Speculative decoding reduces LLM inference latency by using a small draft model to propose tokens, which the larger target model then verifies in a single forward pass, accepting every token that matches its own prediction.
Our new speculative decoding integration can cut latency in half for production LLM workloads.
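To make the mechanism concrete, here is a minimal sketch of greedy speculative decoding in plain PyTorch. This is not our Engine Builder's implementation: the model pair (gpt2 as target, distilgpt2 as draft, chosen only because they share a tokenizer) and the draft length K are illustrative assumptions.

```python
# Minimal greedy speculative decoding sketch (illustrative, not production).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

K = 4  # draft tokens proposed per step (illustrative choice)

tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2").eval()
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 48) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    start = ids.shape[1]
    while ids.shape[1] - start < max_new_tokens:
        base_len = ids.shape[1]

        # 1. The draft model proposes K tokens autoregressively (cheap).
        draft_ids = ids
        for _ in range(K):
            next_tok = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
        proposed = draft_ids[:, base_len:]

        # 2. The target model scores all K draft tokens in ONE forward pass.
        logits = target(draft_ids).logits
        target_choice = logits[:, base_len - 1 : -1, :].argmax(-1)

        # 3. Accept draft tokens until the first disagreement with the target.
        matches = (proposed == target_choice).squeeze(0).long()
        n_accept = int(matches.cumprod(0).sum())

        # 4. Keep accepted tokens, then append the target's own token:
        #    its correction at the mismatch, or its next token if all matched.
        ids = draft_ids[:, : base_len + n_accept]
        if n_accept < K:
            fix = target_choice[:, n_accept : n_accept + 1]
        else:
            fix = logits[:, -1, :].argmax(-1, keepdim=True)
        ids = torch.cat([ids, fix], dim=-1)
    return tok.decode(ids[0, start:], skip_special_tokens=True)

print(speculative_generate("Speculative decoding works because"))
```

Because the target verifies the whole draft in one forward pass, every accepted token saves a full target decoding step, while a rejected draft costs at most one extra target pass.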
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second in independent benchmarks.
Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.
Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.
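For a sense of what compiling one pipeline component looks like, here is a hedged sketch using the TensorRT Python API to build an FP16 engine from an exported ONNX model. The file names are hypothetical, exact flags vary by TensorRT version, and the optimizations in the post go well beyond this.

```python
# Sketch: compile one exported SDXL component (e.g. the UNet) to an
# FP16 TensorRT engine. "unet.onnx" / "unet.plan" are hypothetical names.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the ONNX graph exported from the PyTorch pipeline component.
with open("unet.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

# Enable FP16 kernels, then build and serialize the optimized engine.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine = builder.build_serialized_network(network, config)
with open("unet.plan", "wb") as f:
    f.write(engine)
```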
Deploy OpenAI Whisper instantly and for free from our model library. Or stick around to learn how to package and deploy Whisper yourself with Truss.
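As a sketch of what that packaging looks like, below is a minimal Truss model.py wrapping the openai-whisper package. The input schema (an audio URL) and the "small" checkpoint are assumptions for brevity; Python dependencies like openai-whisper and requests would be declared in the Truss's config.yaml.

```python
# model/model.py -- a minimal sketch of packaging Whisper as a Truss.
import tempfile

import requests
import whisper


class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Runs once when the deployment starts; loads Whisper weights.
        self._model = whisper.load_model("small")  # checkpoint size assumed

    def predict(self, model_input):
        # Assumed request body: {"url": "<link to an audio file>"}.
        audio = requests.get(model_input["url"]).content
        with tempfile.NamedTemporaryFile(suffix=".mp3") as f:
            f.write(audio)
            f.flush()
            result = self._model.transcribe(f.name)
        return {"text": result["text"]}
```

From there, the packaged model directory deploys to Baseten with the Truss CLI (truss push).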