Baseten Blog | Page 3
How to serve 10,000 fine-tuned LLMs from a single GPU
LoRA swapping with TRT-LLM supports in-flight batching and loads LoRA weights in 1-2 ms, enabling each request to hit a different fine-tune.
Using asynchronous inference in production
Learn how async inference works, how it protects against common inference failures, how it's applied in common use cases, and more.
Baseten Chains explained: building multi-component AI workflows at scale
A delightful developer experience for building and deploying compound ML inference workflows.
Introducing Baseten Chains
Learn about Baseten's new Chains framework for deploying complex ML inference workflows across compound AI systems using multiple models and components.
Comparing few-step image generation models
Few-step image generation models like LCMs, SDXL Turbo, and SDXL Lightning can generate images fast, but there's a tradeoff between speed and quality.
How latent consistency models work
Latent Consistency Models (LCMs) improve on generative AI methods to produce high-quality images in just 2-4 steps, taking less than a second for inference.
New in May 2024
AI events, multicluster model serving architecture, tokenizer efficiency, and forward-deployed engineering
What I learned as a forward-deployed engineer working at an AI startup
My first six months at Baseten exposed me to a huge range of exciting engineering challenges as I learned how to make an impact as a forward-deployed engineer.
Control plane vs workload plane in model serving infrastructure
A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.