Baseten Blog

Engineering meets ML infrastructure. Dive into curated insights, expert tutorials, and innovative techniques that make deploying ML models less daunting and more accessible. Explore the topics that resonate with today's tech landscape, and empower your developer journey with expert knowledge.

How to double tokens per second for Llama 3 with Medusa

We observe up to a 122% increase in tokens per second for Llama 3 after training custom Medusa heads and running the updated model with TensorRT-LLM.

How to serve 10,000 fine-tuned LLMs from a single GPU

LoRA swapping with TRT-LLM supports in-flight batching and loads LoRA weights in 1-2 ms, enabling each request to hit a different fine-tune.

Benchmarking fast Mistral 7B inference

Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.

CI/CD for AI model deployments

In this article, we outline a continuous integration and continuous deployment (CI/CD) pipeline for using AI models in production.

Streaming real-time text to speech with XTTS V2

In this tutorial, we'll build a streaming endpoint for the XTTS V2 text-to-speech model with real-time narration and 200 ms time to first chunk.

How to serve your ComfyUI model behind an API endpoint

This guide details deploying ComfyUI image generation pipelines via API for app integration, using Truss for packaging & production deployment.

Using fractional H100 GPUs for efficient model serving

Multi-Instance GPUs enable splitting a single H100 GPU across two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.

NVIDIA A10 vs A10G for ML model inference

The A10, an Ampere-series GPU, excels in tasks like running 7B parameter LLMs. AWS's A10G variant, similar in GPU memory & bandwidth, is mostly interchangeable.

NVIDIA A10 vs A100 GPUs for LLM and Stable Diffusion inference

This article compares two popular GPUs—the NVIDIA A10 and A100—for model inference and discusses the option of using multi-GPU instances for larger models.

Comparing few-step image generation models

Few-step image generation models like LCMs, SDXL Turbo, and SDXL Lightning can generate images fast, but there's a tradeoff when it comes to speed vs quality.

The best open source large language model

Explore the best open source large language models for 2025 for any budget, license, and use case.

Playground v2 vs Stable Diffusion XL 1.0 for text-to-image generation

Playground v2, a new text-to-image model, matches SDXL's speed & quality with a unique AAA game-style aesthetic. Ideal choice varies by use case & art taste.

Compound AI systems explained

Compound AI systems combine multiple models and processing steps, and are forming the next generation of AI products.

How latent consistency models work

Latent Consistency Models (LCMs) improve on generative AI methods to produce high-quality images in just 2-4 steps, taking less than a second for inference.

Control plane vs workload plane in model serving infrastructure

A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.

Ten reasons to join Baseten

Baseten is a Series B startup building infrastructure for AI. We're actively hiring for many roles — here are ten reasons to join the Baseten team.

What I learned as a forward-deployed engineer working at an AI startup

My first six months at Baseten exposed me to a huge range of exciting engineering challenges as I learned how to make an impact as a forward-deployed engineer.

What I learned from my AI startup’s internal hackathon

See hackathon projects from Baseten covering ML infrastructure, inference, user experience, and streaming.

Introducing canary deployments on Baseten

Our canary deployments feature lets you roll out new model deployments with minimal risk to your end-user experience.

Using asynchronous inference in production

Learn how async inference works, protects against common inference failures, is applied in common use cases, and more.

Baseten Chains explained: building multi-component AI workflows at scale

A delightful developer experience for building and deploying compound ML inference workflows.

Baseten partners with Google Cloud to deliver high-performance AI infrastructure to a broader audience

Baseten is now on Google Cloud Marketplace, empowering organizations with the tools to build and scale AI applications effortlessly.

Introducing function calling and structured output for open-source and fine-tuned LLMs

Automatically add function calling and structured output capabilities to any open-source or fine-tuned large language model supported by TensorRT-LLM.

Introducing Baseten Self-hosted

Gain granular control over data locality, align with strict compliance standards, meet specific performance requirements, and more with Baseten Self-hosted.