Baseten Blog | Page 2

Driving model performance optimization: 2024 highlights

Baseten's model performance team works to optimize customer models for latency, throughput, quality, cost, features, and developer efficiency.

Pankaj Gupta

Product

New observability features: activity logging, LLM metrics, and metrics dashboard customization

We added three new observability features for improved monitoring and debugging: an activity log, LLM metrics, and customizable metrics dashboards.

Suren Atoyan

4 others

Model performance

How we built production-ready speculative decoding with TensorRT-LLM

Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.

Pankaj Gupta

2 others

Glossary

A quick introduction to speculative decoding

Speculative decoding reduces LLM inference latency by using a smaller model to generate draft tokens that the larger target model verifies and can accept during inference.

Pankaj Gupta

2 others
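
To make the draft-and-verify idea concrete, here is a minimal sketch of the greedy variant of speculative decoding. It is an illustration under stated assumptions, not Baseten's or TensorRT-LLM's implementation: the "models" are hypothetical toy functions that map a non-empty token list to the next token, and the names `speculative_step`, `generate`, `toy_draft`, and `toy_target` are invented for this example. Production systems typically verify all draft positions in a single batched forward pass and use a probabilistic acceptance rule rather than exact matching.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a model: greedy next token given a non-empty sequence.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """One round: the draft proposes k tokens, the target verifies them."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposed position in order. (A real system
    #    would score all k positions in one forward pass; this loop is a
    #    simplification for readability.)
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> List[Token]:
    """Repeat speculative rounds until max_new_tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new_tokens]


def greedy(prefix: List[Token], target: NextToken, n: int) -> List[Token]:
    """Baseline: plain one-token-at-a-time decoding with the target only."""
    out = list(prefix)
    for _ in range(n):
        out.append(target(out))
    return out


def toy_target(seq: List[Token]) -> Token:
    # Toy "large model": counts upward modulo 10.
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    # Toy "small model": agrees with the target except after a 4.
    return 0 if seq[-1] % 5 == 4 else (seq[-1] + 1) % 10
```

In this greedy form each round yields at least one token that the target itself would have chosen, so the output matches plain target-only decoding. Any speedup comes from how often the draft's guesses are accepted.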

Product

Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference

Our new Speculative Decoding integration can cut latency in half for production LLM workloads.

Justin Yi

3 others

Model performance

Generally Available: The fastest, most accurate and cost-efficient Whisper transcription

At Baseten, we've built the most performant (1000x real-time factor), accurate, and cost-efficient speech-to-text pipeline for production AI audio transcription.