Baseten Blog
Engineering meets ML infrastructure. Dive into curated insights, expert tutorials, and innovative techniques that make deploying ML models less daunting and more accessible. Explore the topics that resonate with today's tech landscape, and empower your developer journey with expert knowledge.
Generally Available: The fastest, most accurate and cost-efficient Whisper transcription
At Baseten, we've built the most performant (1000x real-time factor), accurate, and cost-efficient speech-to-text pipeline for production AI audio transcription.
Model performance
How we built production-ready speculative decoding with TensorRT-LLM
Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.
How to build function calling and JSON mode for open-source and fine-tuned LLMs
Use a state machine to generate token masks for logit biasing to enable function calling and structured output at the model server level.
How to double tokens per second for Llama 3 with Medusa
We observe up to a 122% increase in tokens per second for Llama 3 after training custom Medusa heads and running the updated model with TensorRT-LLM.
How to serve 10,000 fine-tuned LLMs from a single GPU
LoRA swapping with TRT-LLM supports in-flight batching and loads LoRA weights in 1-2 ms, enabling each request to hit a different fine-tune.
Hacks & projects
Deploying custom ComfyUI workflows as APIs
Easily package your ComfyUI workflow to use any custom node or model checkpoint.
CI/CD for AI model deployments
In this article, we outline a continuous integration and continuous deployment (CI/CD) pipeline for using AI models in production.
Streaming real-time text to speech with XTTS V2
In this tutorial, we'll build a streaming endpoint for the XTTS V2 text to speech model with real-time narration and 200 ms time to first chunk.
How to serve your ComfyUI model behind an API endpoint
This guide details deploying ComfyUI image generation pipelines via API for app integration, using Truss for packaging & production deployment.
GPU guides
Evaluating NVIDIA H200 Tensor Core GPUs for LLM inference
Are NVIDIA H200 GPUs cost-effective for model inference? We tested an 8xH200 cluster provided by Lambda to discover suitable inference workload profiles.
Using fractional H100 GPUs for efficient model serving
Multi-Instance GPUs enable splitting a single H100 GPU across two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.
NVIDIA A10 vs A10G for ML model inference
The A10, an Ampere-series GPU, excels in tasks like running 7B parameter LLMs. AWS's A10G variant, similar in GPU memory & bandwidth, is mostly interchangeable.
NVIDIA A10 vs A100 GPUs for LLM and Stable Diffusion inference
This article compares two popular GPUs—the NVIDIA A10 and A100—for model inference and discusses the option of using multi-GPU instances for larger models.
ML models
The best open-source image generation model
Explore the strengths and weaknesses of state-of-the-art image generation models like FLUX.1, Stable Diffusion 3, SDXL Lightning, and Playground 2.5.
Comparing few-step image generation models
Few-step image generation models like LCMs, SDXL Turbo, and SDXL Lightning can generate images fast, but there's a tradeoff when it comes to speed vs quality.
The best open source large language model
Explore the best open source large language models for 2025 for any budget, license, and use case.
Playground v2 vs Stable Diffusion XL 1.0 for text-to-image generation
Playground v2, a new text-to-image model, matches SDXL's speed & quality with a unique AAA game-style aesthetic. Ideal choice varies by use case & art taste.
Glossary
A quick introduction to speculative decoding
Speculative decoding improves LLM inference latency by using a smaller draft model to generate tokens that the larger target model can verify and accept during inference.
Building high-performance compound AI applications with MongoDB Atlas and Baseten
Use MongoDB Atlas with Baseten’s Chains framework to build high-performance compound AI systems.
Compound AI systems explained
Compound AI systems combine multiple models and processing steps, and are forming the next generation of AI products.
How latent consistency models work
Latent Consistency Models (LCMs) improve on generative AI methods to produce high-quality images in just 2-4 steps, taking less than a second for inference.
Community
SPC hackathon winners build with Llama 3.1 on Baseten
SPC hackathon winner TestNinja and finalist VibeCheck used Baseten to power apps for test generation and mood board creation.
Ten reasons to join Baseten
Baseten is a Series B startup building infrastructure for AI. We're actively hiring for many roles — here are ten reasons to join the Baseten team.
What I learned as a forward-deployed engineer working at an AI startup
My first six months at Baseten exposed me to a huge range of exciting engineering challenges as I learned how to make an impact as a forward-deployed engineer.
What I learned from my AI startup’s internal hackathon
See hackathon projects from Baseten for ML infrastructure, inference, user experience, and streaming.
Product
New observability features: activity logging, LLM metrics, and metrics dashboard customization
We added three new observability features for improved monitoring and debugging: an activity log, LLM metrics, and customizable metrics dashboards.
Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference
Our new Speculative Decoding integration can cut latency in half for production LLM workloads.
Introducing Custom Servers: Deploy production-ready model servers from Docker images
Deploy production-ready model servers on Baseten directly from any Docker image using just a YAML file.
Create custom environments for deployments on Baseten
Test and deploy ML models reliably with production-ready custom environments, persistent endpoints, and seamless CI/CD.
News
Export your model inference metrics to your favorite observability tool
Export model inference metrics like response time and hardware utilization to observability platforms like Grafana, New Relic, Datadog, and Prometheus.
Baseten partners with Google Cloud to deliver high-performance AI infrastructure to a broader audience
Baseten is now on Google Cloud Marketplace, empowering organizations with the tools to build and scale AI applications effortlessly.
Introducing Baseten Hybrid: control and flexibility in your cloud and ours
Baseten Hybrid is a multi-cloud solution that enables you to run inference in your cloud—with optional spillover into ours.
Introducing function calling and structured output for open-source and fine-tuned LLMs
Automatically add function calling and structured output capabilities to any open-source or fine-tuned large language model supported by TensorRT-LLM.