Lead Developer Advocate
Machine learning infrastructure that just works
Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalable, and cost-efficiently.
Lead Developer Advocate
Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.
Save money on high-traffic model inference workloads by increasing GPU utilization to maximize performance per dollar for LLMs, SDXL, Whisper, and more.
Explore the best open source large language models for 2025 for any budget, license, and use case.
Double or triple throughput at same-or-better latencies by switching to H100 GPUs from A100s for model inference with TensorRT/TensorRT-LLM.
Quantizing ML models like LLMs makes it possible to run big models on less expensive GPUs. But it must be done carefully to avoid quality reduction.
Benchmarking Stable Diffusion XL performance across latency, throughput, and cost depends on factors from hardware to model variant to inference config.
This guide helps you interpret LLM performance metrics to make direct comparisons on latency, throughput, and cost.
Mixtral 8x7B structurally has faster inference than similarly-powerful Llama 2 70B, but we can make it even faster using TensorRT-LLM and int8 quantization.