
Apply model performance research in production

Get blazing-fast inference with out-of-the-box performance optimizations

Trusted by top engineering and machine learning teams
MODEL PERFORMANCE

Gain a team of model performance engineers

Use featureful inference servers

Production-grade support for critical performance features is baked into our infra. Use speculative decoding, structured outputs, LoRAs, and more from day one.
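As a rough illustration of one of these features, structured outputs typically work by passing a JSON schema alongside the prompt so the inference server constrains generation to valid JSON. The endpoint URL and payload shape below are hypothetical placeholders, not Baseten's exact API; check the docs for the real request format.

```python
import requests

# Hypothetical endpoint and payload shape, for illustration only.
API_KEY = "YOUR_API_KEY"
MODEL_URL = "https://model-xxxxxxxx.api.baseten.co/production/predict"  # placeholder

# A JSON schema the server can use to constrain the model's output.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
}

response = requests.post(
    MODEL_URL,
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "prompt": "Extract the invoice fields from: 'ACME Corp, $1,200 USD'",
        "response_format": {"type": "json_schema", "json_schema": invoice_schema},
        "max_tokens": 256,
    },
    timeout=60,
)
print(response.json())
```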

Apply research in production

Our engineers apply the latest AI research and custom runtime optimizations in production, so you get high performance and cost-efficiency without the engineering effort.

Customize speed, cost, and quality

At Baseten, low latency and high throughput are a given, but you can also customize how you balance performance, cost-efficiency, and output quality to meet your exact needs.

Get rapid inference with built-in optimizations

Automatic TensorRT runtime builds

Leverage optimized TensorRT runtimes in minutes (vs. hours), with full support for dynamic GPU allocation, FP8 quantization, structured output, and more.

We also maintain full support for inference frameworks including vLLM, TGI, TEI, and SGLang.
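To give a sense of what an automatic engine build involves, the sketch below expresses a hypothetical build specification as a Python dict (source model, quantization, batching limits, target GPU). The field names are illustrative assumptions, not the exact Baseten or Truss config schema.

```python
# Hypothetical TensorRT-LLM engine build spec, for illustration only.
# Field names are assumptions; the real config schema may differ.
engine_build = {
    "source_model": "meta-llama/Llama-3.1-8B-Instruct",  # example open-weights model
    "quantization": "fp8",        # FP8 quantization on Hopper-class GPUs
    "max_batch_size": 64,         # upper bound for dynamic batching
    "max_input_len": 8192,
    "max_output_len": 2048,
    "gpu": "H100",
}

# A managed build service takes a spec like this, compiles an optimized
# TensorRT runtime, and attaches it to an autoscaling deployment.
```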

NVIDIA Hopper GPUs

While you can use many different GPUs on Baseten, we also ensure plentiful availability of H100s, H200s, GH200s, and H100 MIGs (multi-instance GPUs) for efficient model serving, especially with TensorRT.

Production-ready speculative decoding

We natively support speculative decoding and self-speculative techniques like Medusa and Eagle. With the model orchestration fully abstracted, you can fine-tune parameters or leverage pre-built configs directly.
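Conceptually, speculative decoding has a small draft model propose several tokens that the large target model then verifies in a single pass, so most steps cost far less than full autoregressive decoding. The minimal, framework-agnostic sketch below shows a greedy version of that accept/reject loop; the `draft_next` and `target_next` callables are stand-ins, not Baseten APIs, and production engines use probabilistic verification in one batched forward pass.

```python
from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[int]], int],  # stand-in: target model's greedy next token
    draft_next: Callable[[List[int]], int],   # stand-in: small draft model's greedy next token
    prompt: List[int],
    k: int = 4,
    max_new_tokens: int = 64,
) -> List[int]:
    """Greedy speculative decoding sketch: the draft proposes k tokens,
    the target keeps the longest matching prefix, then adds one of its own."""
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2) Target model verifies the proposal; accept until the first mismatch.
        for t in proposal:
            if generated < max_new_tokens and target_next(tokens) == t:
                tokens.append(t)
                generated += 1
            else:
                break

        # 3) The target contributes one token itself, so progress is guaranteed
        #    even when every drafted token is rejected.
        if generated < max_new_tokens:
            tokens.append(target_next(tokens))
            generated += 1
    return tokens
```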

Ultra-low-latency compound AI

Chain any number of models and processing steps together using Baseten Chains. With custom hardware and autoscaling per step, we’ve seen processing times halved and GPU utilization improved 6x.
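To make "chaining models and processing steps" concrete, here is a minimal, framework-agnostic sketch of a two-step pipeline (transcribe, then summarize) where each step declares its own hardware hint. It illustrates the idea behind per-step resources in a chain; it is not the Chains SDK itself, whose actual API is documented separately.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

# Illustrative only: a toy pipeline where each step carries its own
# hardware hint, mirroring per-step hardware and autoscaling in a chain.

@dataclass
class Step:
    name: str
    hardware: str                  # e.g. "cpu", "A10G", "H100" (illustrative labels)
    run: Callable[[Any], Any]

def transcribe(audio_url: str) -> str:
    # In a real chain, this step would run a speech-to-text model on a GPU.
    return f"<transcript of {audio_url}>"

def summarize(transcript: str) -> str:
    # In a real chain, this step would call an LLM deployment.
    return f"<summary of {transcript}>"

pipeline: List[Step] = [
    Step("transcribe", hardware="A10G", run=transcribe),
    Step("summarize", hardware="H100", run=summarize),
]

def run_chain(steps: List[Step], payload: Any) -> Any:
    # Each step could scale independently; here we simply run them in sequence.
    for step in steps:
        payload = step.run(payload)
    return payload

print(run_chain(pipeline, "https://example.com/call.mp3"))
```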

Performance engineering for the next generation of AI products

Waseem Alshikh
CTO and Co-Founder of Writer

Inference for custom-built LLMs could be a major headache. Thanks to Baseten, we’re getting cost-effective high-performance model serving without any extra burden on our internal engineering teams. Instead, we get to focus our expertise on creating the best possible domain-specific LLMs for our customers.

  • 70B parameter custom LLM
  • 60% higher TPS
  • 35% lower cost per million tokens

Get started

Optimized inference on Baseten

Compare GPUs

Check out our detailed GPU benchmarks on our blog, featuring tailored recommendations for different use cases.

Try the fastest Whisper transcription

Baseten engineers built the fastest, most accurate, and cost-efficient Whisper transcription available.

Deploy an open-source model

Get a feel for the Baseten UI and inference capabilities by deploying popular open-source models directly from our Model Library.
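Once a Model Library model is deployed, invoking it is a single authenticated HTTP call. The endpoint pattern and payload keys below are a hedged sketch based on a typical deployment; each model's page documents its exact input schema.

```python
import os
import requests

# Sketch of calling a deployed Model Library model. The endpoint pattern and
# payload keys are assumptions for illustration; check the model's docs.
model_id = "YOUR_MODEL_ID"
api_key = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {api_key}"},
    json={"prompt": "Write a haiku about fast inference.", "max_tokens": 64},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```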