Baseten Blog
How multi-node inference works for massive LLMs like DeepSeek-R1
Running DeepSeek-R1 on H100 GPUs requires multi-node inference to connect the 16 H100s needed to hold the model weights.
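As a rough back-of-the-envelope sketch of why two nodes are needed (assuming roughly 671B parameters stored natively in FP8, 80 GB per H100, and 8 GPUs per node; figures are approximate, not exact specs):

```python
# Back-of-the-envelope memory math (assumed figures, not exact specs).
params = 671e9            # DeepSeek-R1 parameter count, approximate
bytes_per_param = 1       # native FP8 weights: ~1 byte per parameter
h100_memory_gb = 80       # per-GPU memory on an H100
gpus_per_node = 8         # typical H100 node configuration

weights_gb = params * bytes_per_param / 1e9        # ~671 GB of weights
node_memory_gb = h100_memory_gb * gpus_per_node    # 640 GB per 8-GPU node

# The weights alone overflow a single 8-GPU node (and leave no room for KV cache
# or activations), so the model is sharded across two nodes -- 16 H100s -- and
# served with multi-node inference.
print(f"weights: {weights_gb:.0f} GB vs. one node: {node_memory_gb} GB")
```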
Testing Llama 3.3 70B inference performance on NVIDIA GH200 in Lambda Cloud
The NVIDIA GH200 Grace Hopper Superchip combines a Hopper GPU with a Grace Arm CPU via the high-bandwidth NVLink-C2C interconnect.
Baseten Chains is now GA for production compound AI systems
Baseten Chains delivers ultra-low-latency compound AI at scale, with custom hardware per model and simplified model orchestration.
Private, secure DeepSeek-R1 in production in US & EU data centers
Dedicated deployments of DeepSeek-R1 and DeepSeek-V3 offer private, secure, high-performance inference at a lower cost than comparable OpenAI models.
Driving model performance optimization: 2024 highlights
Baseten's model performance team works to optimize customer models for latency, throughput, quality, cost, features, and developer efficiency.
New observability features: activity logging, LLM metrics, and metrics dashboard customization
We added three new observability features for improved monitoring and debugging: an activity log, LLM metrics, and customizable metrics dashboards.
How we built production-ready speculative decoding with TensorRT-LLM
Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.
A quick introduction to speculative decoding
Speculative decoding reduces LLM inference latency by using a smaller draft model to propose tokens that the larger target model verifies, accepting as many as possible in a single forward pass.
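A minimal, library-agnostic sketch of the draft-then-verify loop (the model objects and their `next_token`/`next_tokens` helpers are placeholders, not Baseten's or TensorRT-LLM's API; production implementations also use probabilistic rejection sampling to exactly preserve the target model's output distribution, while this simplified greedy variant just shows the idea):

```python
def speculative_decode_step(draft_model, target_model, context, num_draft_tokens=4):
    """One draft-then-verify step of speculative decoding (simplified greedy variant)."""
    # 1. The small draft model cheaply proposes several tokens autoregressively.
    draft_tokens = []
    draft_context = list(context)
    for _ in range(num_draft_tokens):
        token = draft_model.next_token(draft_context)   # assumed helper on the draft model
        draft_tokens.append(token)
        draft_context.append(token)

    # 2. The large target model scores the whole proposed sequence in one forward
    #    pass, which is much cheaper than generating those tokens one at a time.
    target_tokens = target_model.next_tokens(context, draft_tokens)  # assumed helper

    # 3. Accept draft tokens until the first disagreement, then take the target's token.
    accepted = []
    for draft_tok, target_tok in zip(draft_tokens, target_tokens):
        if draft_tok == target_tok:
            accepted.append(draft_tok)
        else:
            accepted.append(target_tok)
            break
    return context + accepted
```

When the draft model agrees with the target often, each step emits several tokens for roughly the cost of one target-model forward pass, which is where the latency win comes from.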
Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference
Our new Speculative Decoding integration can cut latency in half for production LLM workloads.
Generally Available: The fastest, most accurate, and most cost-efficient Whisper transcription
At Baseten, we've built the most performant (1000x real-time factor), accurate, and cost-efficient speech-to-text pipeline for production AI audio transcription. At that real-time factor, an hour of audio transcribes in roughly 3.6 seconds.