Machine learning infrastructure that just works
Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.
In this tutorial, we'll build a streaming endpoint for the XTTS V2 text-to-speech model with real-time narration and a 200 ms time to first chunk.
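As a taste of what the tutorial covers, here is a minimal sketch of consuming such a streaming endpoint. The model URL, API key, and `text` input field are placeholders for your own deployment; the `Api-Key` authorization header follows Baseten's scheme.

```python
import requests

# Placeholders: substitute your own deployment's URL and API key.
MODEL_URL = "https://model-xxxxx.api.baseten.co/production/predict"
API_KEY = "YOUR_API_KEY"

def stream_speech(text: str, out_path: str = "narration.wav") -> None:
    """Request synthesized speech and write audio chunks as they arrive."""
    resp = requests.post(
        MODEL_URL,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"text": text},
        stream=True,  # keep the connection open and yield chunks incrementally
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        # With a streaming endpoint, the first chunk can arrive in ~200 ms,
        # long before the full narration has been synthesized.
        for chunk in resp.iter_content(chunk_size=4096):
            f.write(chunk)

if __name__ == "__main__":
    stream_speech("Hello from a streaming text-to-speech endpoint.")
```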
Learn how continuous and dynamic batching increase throughput during model inference with minimal impact on latency.
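To illustrate the core idea, here is a minimal dynamic batching sketch; continuous batching goes further by admitting and retiring sequences at every decoding step rather than once per batch. `MAX_BATCH_SIZE`, `MAX_WAIT_MS`, and `model_fn` are assumed parameters for the sketch, not library APIs.

```python
import asyncio

MAX_BATCH_SIZE = 8   # assumed cap on requests batched into one forward pass
MAX_WAIT_MS = 5      # assumed window to let the batch fill before running

request_queue: asyncio.Queue = asyncio.Queue()

async def infer(prompt: str) -> str:
    """Enqueue a request and await its result from the batching loop."""
    fut = asyncio.get_running_loop().create_future()
    await request_queue.put((prompt, fut))
    return await fut

async def batching_loop(model_fn) -> None:
    """Group queued requests so the GPU runs one forward pass per batch,
    trading a small bounded wait for much higher throughput."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await request_queue.get()]  # block until the first request
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(request_queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        # One batched call instead of one forward pass per request.
        outputs = model_fn([prompt for prompt, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)
```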
Multi-Instance GPU (MIG) lets you split a single H100 GPU across two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.
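For a sense of how a serving process targets one slice, here is a sketch assuming MIG mode is already enabled and two slices (e.g. 3g.40gb on an 80 GB H100) have been created with `nvidia-smi`. Pinning a process via `CUDA_VISIBLE_DEVICES` set to a MIG UUID is standard NVIDIA behavior.

```python
import os
import subprocess

# List devices; MIG slices appear with UUIDs of the form "MIG-<uuid>".
listing = subprocess.run(
    ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
).stdout
mig_uuids = [
    line.split("UUID: ")[1].rstrip(")")
    for line in listing.splitlines()
    if "MIG-" in line
]

# Pin this serving process to the first slice; a second process pinned to
# the other slice runs an independent replica on the same physical H100.
os.environ["CUDA_VISIBLE_DEVICES"] = mig_uuids[0]

import torch  # imported after setting CUDA_VISIBLE_DEVICES so it takes effect

print(torch.cuda.get_device_name(0))  # the MIG slice appears as device 0
```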
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.
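Time to first token (TTFT) and tokens per second (TPS) can be measured from any streaming client. Here is a small generic sketch; the `stream` argument is assumed to be an iterator yielding tokens from your deployed model.

```python
import time
from typing import Iterable, Tuple

def measure(stream: Iterable[str]) -> Tuple[float, float]:
    """Return (TTFT in seconds, steady-state tokens per second) for any
    iterator that yields generated tokens as they arrive."""
    start = time.perf_counter()
    first = start
    n = 0
    for _ in stream:
        n += 1
        if n == 1:
            first = time.perf_counter()  # time to first token ends here
    end = time.perf_counter()
    ttft = first - start
    # Exclude the first token so TPS reflects the decode phase only.
    tps = (n - 1) / (end - first) if n > 1 else 0.0
    return ttft, tps
```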
Quantizing open-source LLMs to FP8 caused a near-zero increase in perplexity while yielding material performance improvements across latency, throughput, and cost.
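Perplexity, the exponential of the mean per-token negative log-likelihood, is the standard yardstick here. Below is a minimal sketch using Hugging Face `transformers`; the model name and evaluation text are placeholders.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood per token). Comparing
    it before and after quantization quantifies any quality loss."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # of its next-token predictions over the sequence.
        loss = model(input_ids, labels=input_ids).loss
    return math.exp(loss.item())

# e.g. run the same held-out text through the original checkpoint and the
# FP8-quantized engine and compare the two numbers.
```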
Use TensorRT to achieve 40% lower latency for SDXL and sub-200 ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.
The FP8 data format has an expanded dynamic range versus INT8, which allows quantizing weights and activations for more LLMs without loss of output quality.
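A back-of-the-envelope comparison makes the dynamic range gap concrete; the values below are the standard FP8 E4M3 limits and signed INT8 magnitudes.

```python
# Dynamic range = largest / smallest positive representable magnitude.
int8_max, int8_min = 127, 1        # signed 8-bit integer magnitudes
e4m3_max = 448.0                   # largest finite value in FP8 E4M3
e4m3_min = 2.0 ** -9               # smallest positive (subnormal) E4M3 value

print(f"INT8 dynamic range:     {int8_max / int8_min:>9,.0f} : 1")
print(f"FP8 E4M3 dynamic range: {e4m3_max / e4m3_min:>9,.0f} : 1")
# ~127:1 vs ~229,376:1; outlier weights and activations that INT8 scaling
# would clip or crush still land inside FP8's representable range.
```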
Multi-cloud and multi-region infrastructure for model serving provides availability, redundancy, lower latency, cost savings, and data residency compliance.