Baseten Blog

Engineering meets ML infrastructure. Dive into curated insights, expert tutorials, and innovative techniques that make deploying ML models less daunting and more accessible. Explore the topics that resonate with today's tech landscape, and empower your developer journey with expert knowledge.

Topics

Latest, Model performance, Hacks & projects, GPU guides, ML models, Glossary, Community, Product, News

News

Announcing Baseten’s $75M Series C

Baseten raised a $75M Series C to power mission-critical AI inference for leading AI companies.

Tuhin Srivastava

Model performance

How we built high-throughput embedding, reranker, and classifier inference with TensorRT-LLM

Discover how we optimized embedding, reranker, and classifier inference using TensorRT-LLM, doubling throughput and achieving ultra-low latency at scale.

Michael Feil and 1 other

How multi-node inference works for massive LLMs like DeepSeek-R1

Running DeepSeek-R1 on H100 GPUs requires multi-node inference to connect the 16 H100s needed to hold the model weights.

Phil Howes and 1 other

Driving model performance optimization: 2024 highlights

Baseten's model performance team works to optimize customer models for latency, throughput, quality, cost, features, and developer efficiency.

Pankaj Gupta

How we built production-ready speculative decoding with TensorRT-LLM

Our TensorRT-LLM Engine Builder now supports speculative decoding, which can improve LLM inference speeds.

Pankaj Gupta and 2 others

Hacks & projects

Deploying custom ComfyUI workflows as APIs

Easily package your ComfyUI workflow to use any custom node or model checkpoint.

Het Trivedi and 1 other

CI/CD for AI model deployments

In this article, we outline a continuous integration and continuous deployment (CI/CD) pipeline for using AI models in production.

Vlad Shulman and 3 others

Streaming real-time text to speech with XTTS V2

In this tutorial, we'll build a streaming endpoint for the XTTS V2 text-to-speech model with real-time narration and a 200 ms time to first chunk.

Het Trivedi and 1 other

How to serve your ComfyUI model behind an API endpoint

This guide details deploying ComfyUI image generation pipelines via API for app integration, using Truss for packaging & production deployment.

Het Trivedi and 1 other

GPU guides

Testing Llama 3.3 70B inference performance on NVIDIA GH200 in Lambda Cloud

The NVIDIA GH200 Superchip combines an NVIDIA Hopper GPU with an ARM CPU via a high-bandwidth interconnect.

Pankaj Gupta and 1 other

Evaluating NVIDIA H200 Tensor Core GPUs for LLM inference

Are NVIDIA H200 GPUs cost-effective for model inference? We tested an 8xH200 cluster provided by Lambda to discover suitable inference workload profiles.

Pankaj Gupta and 1 other

Using fractional H100 GPUs for efficient model serving

Multi-Instance GPUs enable splitting a single H100 GPU across two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.

Matt Howard and 3 others

NVIDIA A10 vs A10G for ML model inference

The A10, an Ampere-series GPU, excels in tasks like running 7B parameter LLMs. AWS's A10G variant, similar in GPU memory & bandwidth, is mostly interchangeable.

Philip Kiely

ML models

The best open-source embedding models

Discover the best open-source embedding models for search, RAG, and recommendations—curated picks for performance, speed, and cost-efficiency.

Philip Kiely

Private, secure DeepSeek-R1 in production in US & EU data centers

Dedicated deployments of DeepSeek-R1 and DeepSeek-V3 offer private, secure, high-performance inference that's cheaper than OpenAI.

Amir Haghighat and 1 other

The best open-source image generation model

Explore the strengths and weaknesses of state-of-the-art image generation models like FLUX.1, Stable Diffusion 3, SDXL Lightning, and Playground 2.5.

Philip Kiely

Comparing few-step image generation models

Few-step image generation models like LCMs, SDXL Turbo, and SDXL Lightning can generate images fast, but there's a tradeoff when it comes to speed vs quality.

Rachel Rapp

Glossary

A quick introduction to speculative decoding

Speculative decoding improves LLM inference latency by using a smaller model to generate draft tokens that the larger target model can accept during inference.

Pankaj Gupta and 2 others

Building high-performance compound AI applications with MongoDB Atlas and Baseten

Using MongoDB Atlas and Baseten’s Chains framework for compound AI, you can build high-performance compound AI systems.

Philip Kiely

Compound AI systems explained

Compound AI systems combine multiple models and processing steps, and are forming the next generation of AI products.

Rachel Rapp

How latent consistency models work

Latent Consistency Models (LCMs) improve on generative AI methods to produce high-quality images in just 2-4 steps, taking less than a second for inference.

Rachel Rapp

Community

Building performant embedding workflows with Chroma and Baseten

Integrate Chroma’s open-source vector database with Baseten’s fast inference engine for efficient, real-time embedding inference in your AI-native apps.

Philip Kiely

SPC hackathon winners build with Llama 3.1 on Baseten

SPC hackathon winner TestNinja and finalist VibeCheck used Baseten to power apps for test generation and mood board creation.

Philip Kiely

Ten reasons to join Baseten

Baseten is a Series B startup building infrastructure for AI. We're actively hiring for many roles — here are ten reasons to join the Baseten team.

Dustin Michaels and 1 other

What I learned as a forward-deployed engineer working at an AI startup

My first six months at Baseten exposed me to a huge range of exciting engineering challenges as I learned how to make an impact as a forward-deployed engineer.

Het Trivedi

Product

Introducing Baseten Embeddings Inference: The fastest embeddings solution available

Baseten Embeddings Inference (BEI) delivers 2x higher throughput and 10% lower latency for production embedding, reranker and classification models at scale.

Michael Feil and 1 other

Baseten Chains is now GA for production compound AI systems

Baseten Chains delivers ultra-low-latency compound AI at scale, with custom hardware per model and simplified model orchestration.

Marius Killinger and 2 others

New observability features: activity logging, LLM metrics, and metrics dashboard customization

We added three new observability features for improved monitoring and debugging: an activity log, LLM metrics, and customizable metrics dashboards.

Suren Atoyan and 4 others

Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference

Our new Speculative Decoding integration can cut latency in half for production LLM workloads.

Justin Yi and 3 others

News

Export your model inference metrics to your favorite observability tool

Export model inference metrics like response time and hardware utilization to observability platforms like Grafana, New Relic, Datadog, and Prometheus.

Helen Yang and 2 others

Baseten partners with Google Cloud to deliver high-performance AI infrastructure to a broader audience

Baseten is now on Google Cloud Marketplace, empowering organizations with the tools to build and scale AI applications effortlessly.

Mike Bilodeau and 1 other

Introducing Baseten Hybrid: control and flexibility in your cloud and ours

Baseten Hybrid is a multi-cloud solution that enables you to run inference in your cloud—with optional spillover into ours.

Phil Howes and 2 others

Introducing function calling and structured output for open-source and fine-tuned LLMs

Automatically add function calling and structured output capabilities to any open-source or fine-tuned large language model supported by TensorRT-LLM.

Bryce Dubayah and 1 other