Baseten Blog | Page 5
Continuous vs dynamic batching for AI inference
Learn how to increase throughput with minimal impact on latency during model inference with continuous and dynamic batching.
New in March 2024
Fast Mistral 7B, fractional H100 GPUs, FP8 quantization, and API endpoints for model management.
Using fractional H100 GPUs for efficient model serving
Multi-Instance GPUs enable splitting a single H100 GPU across two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.
Benchmarking fast Mistral 7B inference
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best in class time to first token and tokens per second on independent benchmarks.
33% faster LLM inference with FP8 quantization
Quantizing open-source LLMs to FP8 resulted in near-zero perplexity gains and yielded material performance improvements across latency, throughput, and cost.
High performance ML inference with NVIDIA TensorRT
Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.
FP8: Efficient model inference with 8-bit floating point numbers
The FP8 data format has an expanded dynamic range versus INT8 which allows for quantizing weights and activations for more LLMs without loss of output quality.
Announcing our Series B
We’ve spent the last four and a half years building Baseten to be the most performant, scalable, and reliable way to run your machine learning workloads.
The benefits of globally distributed infrastructure for model serving
Multi-cloud and multi-region infrastructure for model serving provides availability, redundancy, lower latency, cost savings, and data residency compliance.