Lead Developer Advocate
Machine learning infrastructure that just works
Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.
We observe up to a 122% increase in tokens per second for Llama 3 after training custom Medusa heads and running the updated model with TensorRT-LLM.
SPC hackathon winner TestNinja and finalist VibeCheck used Baseten to power apps for test generation and mood board creation.
The TensorRT-LLM Engine Builder empowers developers to deploy highly efficient, performant inference servers for open-source and fine-tuned LLMs.
Baseten is a Series B startup building infrastructure for AI. We're actively hiring for many roles — here are ten reasons to join the Baseten team.
LoRA swapping with TRT-LLM supports in-flight batching and loads LoRA weights in 1-2 ms, enabling each request to hit a different fine-tune.
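The idea behind per-request LoRA swapping can be sketched in plain Python. This is a conceptual illustration only, not TensorRT-LLM's actual API: the adapter names, the `lora_cache` structure, and the tiny rank-1 matrices are all assumptions made up for the example. The point is that the base weights stay fixed while each request in a batch selects its own cached low-rank delta.

```python
# Conceptual sketch of per-request LoRA selection in an in-flight batch.
# Matrices are lists of rows; shapes are kept tiny for readability.

def matmul(a, b):
    """Multiply two small matrices represented as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Shared base weight W, plus two cached rank-1 adapters stored as (B, A).
W = [[1.0, 0.0], [0.0, 1.0]]
lora_cache = {
    "customer-a": ([[1.0], [0.0]], [[0.0, 2.0]]),  # B is 2x1, A is 1x2
    "customer-b": ([[0.0], [1.0]], [[3.0, 0.0]]),
}

def forward(x, adapter):
    """y = x @ (W + B @ A): the request's adapter modifies the base weight."""
    B, A = lora_cache[adapter]
    W_eff = add(W, matmul(B, A))
    return matmul(x, W_eff)

# Two requests in the same batch each hit a different fine-tune:
batch = [([[1.0, 1.0]], "customer-a"), ([[1.0, 1.0]], "customer-b")]
outputs = [forward(x, name) for x, name in batch]
# outputs == [[[1.0, 3.0]], [[4.0, 1.0]]]
```

In a real serving stack the adapter weights live in a GPU-side cache so the swap costs only the 1-2 ms load mentioned above, rather than a full model reload.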
A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.
To accurately compare tokens per second between different large language models, we need to adjust for tokenizer efficiency.
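The adjustment can be sketched as follows: because different tokenizers split the same text into different numbers of tokens, raw tokens per second is not comparable across models; dividing by tokens-per-character yields a tokenizer-neutral characters-per-second rate. The helper names and all numbers below are illustrative assumptions, not figures from the article.

```python
# Sketch: normalizing tokens-per-second across models whose tokenizers
# differ in efficiency. Numbers are hypothetical.

def tokenizer_efficiency(num_tokens: int, num_chars: int) -> float:
    """Tokens emitted per character of output text (lower = more efficient)."""
    return num_tokens / num_chars

def adjusted_tps(tokens_per_second: float, num_tokens: int, num_chars: int) -> float:
    """Characters of text produced per second: a tokenizer-neutral rate."""
    return tokens_per_second / tokenizer_efficiency(num_tokens, num_chars)

# Two hypothetical models generating the same 4,000-character passage:
model_a = adjusted_tps(100.0, num_tokens=1000, num_chars=4000)  # coarser tokenizer
model_b = adjusted_tps(110.0, num_tokens=1250, num_chars=4000)  # finer tokenizer
# model_a == 400.0 chars/s, model_b == 352.0 chars/s: despite the lower
# raw token rate, model A delivers text faster.
```

The same normalization works per word instead of per character; what matters is dividing out how finely each tokenizer chops the benchmark text.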
In this article, we outline a continuous integration and continuous deployment (CI/CD) pipeline for using AI models in production.