Machine learning infrastructure that just works
Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.
In this tutorial, we'll build a streaming endpoint for the XTTS V2 text-to-speech model with real-time narration and a 200 ms time to first chunk.
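As a taste of what the tutorial covers, here is a minimal sketch of consuming such a streaming endpoint. The model URL, API key, and `text` input field are placeholders for your own deployment; the `Api-Key` authorization header follows Baseten's scheme.

```python
import requests

# Placeholders: substitute your own deployment's URL and API key.
MODEL_URL = "https://model-xxxxx.api.baseten.co/production/predict"
API_KEY = "YOUR_API_KEY"

def stream_speech(text: str, out_path: str = "narration.wav") -> None:
    """Request synthesized speech and write audio chunks as they arrive."""
    resp = requests.post(
        MODEL_URL,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"text": text},
        stream=True,  # keep the connection open and yield chunks incrementally
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        # With a streaming endpoint, the first chunk can arrive in ~200 ms,
        # long before the full narration has been synthesized.
        for chunk in resp.iter_content(chunk_size=4096):
            f.write(chunk)

if __name__ == "__main__":
    stream_speech("Hello from a streaming text-to-speech endpoint.")
```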
Learn how continuous and dynamic batching increase throughput during model inference with minimal impact on latency.
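To illustrate the core idea, here is a minimal dynamic batching sketch; continuous batching goes further by admitting and retiring sequences at every decoding step rather than once per batch. `MAX_BATCH_SIZE`, `MAX_WAIT_MS`, and `model_fn` are assumed parameters for the sketch, not library APIs.

```python
import asyncio

MAX_BATCH_SIZE = 8   # assumed cap on requests batched into one forward pass
MAX_WAIT_MS = 5      # assumed window to let the batch fill before running

request_queue: asyncio.Queue = asyncio.Queue()

async def infer(prompt: str) -> str:
    """Enqueue a request and await its result from the batching loop."""
    fut = asyncio.get_running_loop().create_future()
    await request_queue.put((prompt, fut))
    return await fut

async def batching_loop(model_fn) -> None:
    """Group queued requests so the GPU runs one forward pass per batch,
    trading a small bounded wait for much higher throughput."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await request_queue.get()]  # block until the first request
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(request_queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        # One batched call instead of one forward pass per request.
        outputs = model_fn([prompt for prompt, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)
```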
Multi-Instance GPU (MIG) lets you split a single H100 GPU across two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.
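For a sense of how a serving process targets one slice, here is a sketch assuming MIG mode is already enabled and two slices (e.g. 3g.40gb on an 80 GB H100) have been created with `nvidia-smi`. Pinning a process via `CUDA_VISIBLE_DEVICES` set to a MIG UUID is standard NVIDIA behavior.

```python
import os
import subprocess

# List devices; MIG slices appear with UUIDs of the form "MIG-<uuid>".
listing = subprocess.run(
    ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
).stdout
mig_uuids = [
    line.split("UUID: ")[1].rstrip(")")
    for line in listing.splitlines()
    if "MIG-" in line
]

# Pin this serving process to the first slice; a second process pinned to
# the other slice runs an independent replica on the same physical H100.
os.environ["CUDA_VISIBLE_DEVICES"] = mig_uuids[0]

import torch  # imported after setting CUDA_VISIBLE_DEVICES so it takes effect

print(torch.cuda.get_device_name(0))  # the MIG slice appears as device 0
```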
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.
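Time to first token (TTFT) and tokens per second (TPS) can be measured from any streaming client. Here is a small generic sketch; the `stream` argument is assumed to be an iterator yielding tokens from your deployed model.

```python
import time
from typing import Iterable, Tuple

def measure(stream: Iterable[str]) -> Tuple[float, float]:
    """Return (TTFT in seconds, steady-state tokens per second) for any
    iterator that yields generated tokens as they arrive."""
    start = time.perf_counter()
    first = start
    n = 0
    for _ in stream:
        n += 1
        if n == 1:
            first = time.perf_counter()  # time to first token ends here
    end = time.perf_counter()
    ttft = first - start
    # Exclude the first token so TPS reflects the decode phase only.
    tps = (n - 1) / (end - first) if n > 1 else 0.0
    return ttft, tps
```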
Quantizing open-source LLMs to FP8 caused a near-zero increase in perplexity while yielding material performance improvements across latency, throughput, and cost.
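Perplexity, the exponential of the mean per-token negative log-likelihood, is the standard yardstick here. Below is a minimal sketch using Hugging Face `transformers`; the model name and evaluation text are placeholders.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood per token). Comparing
    it before and after quantization quantifies any quality loss."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # of its next-token predictions over the sequence.
        loss = model(input_ids, labels=input_ids).loss
    return math.exp(loss.item())

# e.g. run the same held-out text through the original checkpoint and the
# FP8-quantized engine and compare the two numbers.
```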
Use TensorRT to achieve 40% lower latency for SDXL and sub-200 ms time to first token for Mixtral 8x7B on A100 and H100 GPUs.
The FP8 data format has an expanded dynamic range versus INT8, which allows quantizing weights and activations for more LLMs without loss of output quality.
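A back-of-the-envelope comparison makes the dynamic range gap concrete; the values below are the standard FP8 E4M3 limits and signed INT8 magnitudes.

```python
# Dynamic range = largest / smallest positive representable magnitude.
int8_max, int8_min = 127, 1        # signed 8-bit integer magnitudes
e4m3_max = 448.0                   # largest finite value in FP8 E4M3
e4m3_min = 2.0 ** -9               # smallest positive (subnormal) E4M3 value

print(f"INT8 dynamic range:     {int8_max / int8_min:>9,.0f} : 1")
print(f"FP8 E4M3 dynamic range: {e4m3_max / e4m3_min:>9,.0f} : 1")
# ~127:1 vs ~229,376:1; outlier weights and activations that INT8 scaling
# would clip or crush still land inside FP8's representable range.
```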
Multi-cloud and multi-region infrastructure for model serving provides availability, redundancy, lower latency, cost savings, and data residency compliance.