Baseten Blog | Page 2

Glossary

A quick introduction to speculative decoding

Speculative decoding improves LLM inference latency by using a smaller draft model to generate tokens that the larger target model then verifies and accepts in parallel.
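For a feel of the mechanism, here is a minimal toy sketch of the draft-and-verify loop in Python. The vocabulary, stand-in model distributions, and helper names are all hypothetical illustrations, not the engine described in these posts, which runs real draft and target models inside the inference server.

```python
import random

# Toy illustration only: the "models" below are hypothetical stand-in
# distributions, not a real draft/target model pair.
VOCAB = list(range(8))

def fake_dist(context, salt):
    # Deterministic pseudo-distribution over VOCAB for a given context.
    rng = random.Random(hash(tuple(context)) + salt)
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def draft_probs(context):
    return fake_dist(context, salt=0)   # stand-in for the small draft model

def target_probs(context):
    return fake_dist(context, salt=1)   # stand-in for the large target model

def speculative_step(context, k=4):
    """Draft k tokens with the small model, then let the target model accept
    each one with probability min(1, p_target / p_draft)."""
    ctx = list(context)
    drafted = []
    for _ in range(k):
        p = draft_probs(ctx)
        tok = random.choices(VOCAB, weights=p)[0]
        drafted.append((tok, p[tok]))
        ctx.append(tok)

    accepted = []
    ctx = list(context)
    for tok, p_draft in drafted:
        p_target = target_probs(ctx)[tok]
        if random.random() < min(1.0, p_target / p_draft):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # A full implementation would resample this position from an
            # adjusted target distribution; this sketch simply stops the run.
            break
    return accepted

print(speculative_step([1, 2, 3]))
```

When the draft model agrees with the target model often, several tokens are accepted per target-model pass, which is where the latency win comes from.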

Product

Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference

Our new Speculative Decoding integration can cut latency in half for production LLM workloads.

Model performance

Generally Available: The fastest, most accurate and cost-efficient Whisper transcription

At Baseten, we've built the most performant (1000x real-time factor), accurate, and cost-efficient speech-to-text pipeline for production AI audio transcription.

Product

Introducing Custom Servers: Deploy production-ready model servers from Docker images

Deploy production-ready model servers on Baseten directly from any Docker image using just a YAML file.

Product

Create custom environments for deployments on Baseten

Test and deploy ML models reliably with production-ready custom environments, persistent endpoints, and seamless CI/CD.

Product

Introducing canary deployments on Baseten

Our canary deployments feature lets you roll out new model deployments with minimal risk to your end-user experience.

GPU guides

Evaluating NVIDIA H200 Tensor Core GPUs for LLM inference

Are NVIDIA H200 GPUs cost-effective for model inference? We tested an 8xH200 cluster provided by Lambda to discover suitable inference workload profiles.

News

Export your model inference metrics to your favorite observability tool

Export model inference metrics like response time and hardware utilization to observability platforms like Grafana, New Relic, Datadog, and Prometheus.

News

Baseten partners with Google Cloud to deliver high-performance AI infrastructure to a broader audience

Baseten is now on Google Cloud Marketplace, empowering organizations with the tools to build and scale AI applications effortlessly.