Blog

Expert guides and engineering deep dives to help you ship faster, scale easier, and learn along the way.

Model performance

How we run GPT OSS 120B at 500+ tokens per second on NVIDIA GPUs

Amir Haghighat

Tri Dao

Abu Qader

Bryce Dubayah

Philip Kiely

GPT OSS 120B

Model performance

Run Qwen3 Embedding on NVIDIA Blackwell GPUs

Michael Feil

and 1 other

Run Qwen3 Embedding on NVIDIA Blackwell GPUs with Baseten Embeddings Inference (BEI)

Model performance

Day zero benchmarks for Qwen 3 with SGLang on Baseten

Michael Feil

Philip Kiely

Yineng Zhang

Qwen + SGLang

Model performance

How we built BEI: high-throughput embedding, reranker, and classifier inference

Michael Feil

Philip Kiely

TensorRT-LLM for embeddings

Model performance

How multi-node inference works for massive LLMs like DeepSeek-R1

Phil Howes

Philip Kiely

Multi-node inference

Model performance

Driving model performance optimization: 2024 highlights

Pankaj Gupta

MP 2024 highlights

Model performance

How we built production-ready speculative decoding with TensorRT-LLM

Pankaj Gupta

Philip Kiely

and 1 other

Speculative Decoding with TensorRT-LLM

Model performance

A quick introduction to speculative decoding

Pankaj Gupta

Philip Kiely

and 1 other

Intro to Speculative Decoding

Model performance

Generally Available: The fastest, most accurate and cost-efficient Whisper transcription

William Gao

Derrick Yang

Rachel Rapp

and 1 other

The fastest Whisper