Lead Developer Advocate

Philip Kiely

Continuous vs dynamic batching for AI inference

Learn how to increase throughput with minimal impact on latency during model inference with continuous and dynamic batching.

Matt Howard

1 other

Prompt: A batch of candy being processed on a fantasy assembly line

GPU guides

Using fractional H100 GPUs for efficient model serving

Multi-Instance GPUs enable splitting a single H100 GPU across two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.

Matt Howard

3 others

Prompt: Two tron-style motorcycles racing on an empty highway

Model performance

Benchmarking fast Mistral 7B inference

Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best in class time to first token and tokens per second on independent benchmarks.

Abu Qader

3 others

Prompt: a model bullet train in a snowy village.

Model performance

33% faster LLM inference with FP8 quantization

Quantizing open-source LLMs to FP8 resulted in near-zero perplexity gains and yielded material performance improvements across latency, throughput, and cost.

Pankaj Gupta

1 other

Prompt: A ship in a bottle in a dark wood library

Model performance

High performance ML inference with NVIDIA TensorRT

Use TensorRT to achieve 40% lower latency for SDXL and sub-200ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.

Justin Yi

1 other

Prompt: A friendly robot horse playing in a sunlit meadow

Glossary

FP8: Efficient model inference with 8-bit floating point numbers

The FP8 data format has an expanded dynamic range versus INT8 which allows for quantizing weights and activations for more LLMs without loss of output quality.

Pankaj Gupta

1 other

Glossary

The benefits of globally distributed infrastructure for model serving

Multi-cloud and multi-region infrastructure for model serving provides availability, redundancy, lower latency, cost savings, and data residency compliance.

Phil Howes

1 other

Prompt: a movie still of a gondola lift in the Alps

Model performance

40% faster Stable Diffusion XL inference with NVIDIA TensorRT

Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.

Pankaj Gupta

2 others

Prompt: A movie still of an astronaut coming through a technicolor wormhole

1 2 3 4 5 6 7

Machine learning infrastructure that just works

Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalable, and cost-efficiently.