Baseten Blog | Page 1

Product

New observability features: activity logging, LLM metrics, and metrics dashboard customization

We added three new observability features for improved monitoring and debugging: an activity log, LLM metrics, and customizable metrics dashboards.

Model performance

How we built production-ready speculative decoding with TensorRT-LLM

Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.

Glossary

A quick introduction to speculative decoding

Speculative decoding reduces LLM inference latency by using a smaller draft model to propose tokens that the larger target model then verifies and either accepts or rejects.

Product

Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference

Our new Speculative Decoding integration can cut latency in half for production LLM workloads.

Model performance

Generally Available: The fastest, most accurate and cost-efficient Whisper transcription

At Baseten, we've built the most performant (1000x real-time factor), accurate, and cost-efficient speech-to-text pipeline for production AI audio transcription.

Product

Introducing Custom Servers: Deploy production-ready model servers from Docker images

Deploy production-ready model servers on Baseten directly from any Docker image using just a YAML file.

Product

Create custom environments for deployments on Baseten

Test and deploy ML models reliably with production-ready custom environments, persistent endpoints, and seamless CI/CD.

Product

Introducing canary deployments on Baseten

Our canary deployments feature lets you roll out new model deployments with minimal risk to your end-user experience.

GPU guides

Evaluating NVIDIA H200 Tensor Core GPUs for LLM inference

Are NVIDIA H200 GPUs cost-effective for model inference? We tested an 8xH200 cluster provided by Lambda to discover suitable inference workload profiles.