Multi-model inference built for ultra-low latency at scale

Use Chains to orchestrate inference workflows across multiple models with a framework designed for performance.

Trusted by top engineering and machine learning teams

++++Multiple models. Multiple machines. One framework.

Simplify orchestration of multiple ML models, business logic services, and their underlying resources in pure Python using Chains.

Write multi-step ML workflows in Python that span multiple models and arbitrary code, with built-in code completion and static type checking. No YAML needed.

Define a Chainlet as an atomic component that can be reused across projects and workflows, giving you composability that is flexible but safe.

Set hardware requirements to separate GPU and CPU workloads, and define autoscaling parameters to ensure optimal performance without excess cost.

Deploy your Chain to production with each Chainlet specifying its own hardware resources, software dependencies, and scaling settings independently. Mock and test locally for fast debugging.

import truss_chains as chains
from truss import truss_config

MISTRAL_HF_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
MISTRAL_CACHE = truss_config.ModelRepo(
    repo_id=MISTRAL_HF_MODEL, allow_patterns=["*.json", "*.safetensors", "*.model"]
)
HF_ACCESS_TOKEN_NAME = "hf_access_token"

class MistralLLM(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            pip_requirements=[
                "transformers==4.38.1",
                "torch==2.0.1",
            ]
        ),
        compute=chains.Compute(cpu_count=2, gpu="A10G"),
        assets=chains.Assets(cached=[MISTRAL_CACHE], secret_keys=[HF_ACCESS_TOKEN_NAME]),
    )

    def __init__(
        self,
        # Adding the `context` to the init arguments allows us to access the
        # Hugging Face token.
        context: chains.DeploymentContext = chains.depends_context(),
    ) -> None:
        # Note the imports of the *specific* Python requirements are pushed down to
        # here. This code will only be executed on the remotely deployed Chainlet,
        # not in the local environment, so we don't need to install these packages
        # in the local dev environment.
        import torch
        import transformers

        self._model = transformers.AutoModelForCausalLM.from_pretrained(
            MISTRAL_HF_MODEL,
            torch_dtype=torch.float16,
            device_map="auto",
            use_auth_token=context.secrets[HF_ACCESS_TOKEN_NAME],
        )

        self._tokenizer = transformers.AutoTokenizer.from_pretrained(
            MISTRAL_HF_MODEL,
            device_map="auto",
            torch_dtype=torch.float16,
            use_auth_token=context.secrets[HF_ACCESS_TOKEN_NAME],
        )

        self._generate_args = {
            "max_new_tokens": 512,
            "temperature": 1.0,
            "top_p": 0.95,
            "top_k": 50,
            "repetition_penalty": 1.0,
            "no_repeat_ngram_size": 0,
            "use_cache": True,
            "do_sample": True,
            "eos_token_id": self._tokenizer.eos_token_id,
            "pad_token_id": self._tokenizer.pad_token_id,
        }

    def run_remote(self, prompt: str) -> str:
        import torch

        formatted_prompt = f"[INST] {prompt} [/INST]"
        input_ids = self._tokenizer(
            formatted_prompt, return_tensors="pt"
        ).input_ids.cuda()
        with torch.no_grad():
            output = self._model.generate(inputs=input_ids, **self._generate_args)
            result = self._tokenizer.decode(output[0])
        return result

class PhiLLM(chains.ChainletBase):
    remote_config = chains.RemoteConfig(
        docker_image=chains.DockerImage(
            pip_requirements=[
                "transformers==4.41.2",
                "torch==2.3.0",
            ]
        ),
        compute=chains.Compute(cpu_count=2, gpu="T4"),
    )
    # The rest of PhiLLM (model loading and `run_remote`) is not shown in this excerpt.

class PoemGenerator(chains.ChainletBase):
    # PhiLLM is defined above this class because `chains.depends(PhiLLM)` is
    # evaluated as soon as the class body runs.
    def __init__(self, phi_llm: PhiLLM = chains.depends(PhiLLM)) -> None:
        self._phi_llm = phi_llm

    def run_remote(self, words: list[str]) -> list[str]:
        results = []
        for word in words:
            # `Messages` is the request type passed to PhiLLM; its definition is
            # not shown in this excerpt.
            messages = Messages(
                messages=[
                    {"role": "system", "content": "You are a poet"},
                    {"role": "user", "content": f"Write a poem about {word}"},
                ]
            )
            poem = self._phi_llm.run_remote(messages)
            results.append(poem)
        return results
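
PoemGenerator receives its LLM through the constructor rather than creating it, and that pattern is what makes local mocking practical. The sketch below is a plain-Python analogue that does not import truss_chains; `FakeLLM`, `PoemGeneratorSketch`, and `demo_local_mock` are hypothetical names used only for illustration, and the real framework may offer its own local-run helpers.

```python
class FakeLLM:
    """Hypothetical stand-in for a GPU-backed Chainlet; returns canned text."""

    def run_remote(self, prompt: str) -> str:
        return f"(fake poem for: {prompt})"


class PoemGeneratorSketch:
    """Plain-Python analogue of PoemGenerator: the LLM is injected, not created."""

    def __init__(self, llm) -> None:
        self._llm = llm

    def run_remote(self, words: list[str]) -> list[str]:
        # One call to the injected dependency per input word.
        return [self._llm.run_remote(f"Write a poem about {word}") for word in words]


def demo_local_mock() -> list[str]:
    # No GPU, model download, or deployment is needed to exercise the logic.
    generator = PoemGeneratorSketch(llm=FakeLLM())
    return generator.run_remote(["bird", "plane"])
```

Because the orchestration logic only depends on an object with a `run_remote` method, the same test can run on a laptop before anything is deployed.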

Get started with Chains

Guides and examples

Retrieval-augmented generation

Connect to vector databases and augment LLM results with additional context without adding overhead to model inference.
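
As a rough illustration of the retrieval step, the sketch below ranks documents against a query and prepends the best match to the prompt, keeping retrieval separate from the model call. It is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the document list stands in for a vector database, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 1) -> str:
    """Prepend retrieved context to the question before it reaches the LLM."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

In a Chain, a step like `retrieve` could plausibly run in a CPU-only component while the LLM runs on a GPU, so the model only ever sees the finished prompt.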

Chunked audio transcription

Transcribe large audio files by splitting them into smaller chunks and processing them in parallel, so that 10-hour files are processed in minutes.
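
The split-then-parallelize idea can be sketched with the standard library alone. In the example below, `transcribe_chunk` is a hypothetical stub standing in for a call to a speech-to-text model, and the 300-second default chunk length is an arbitrary assumption rather than a recommended setting.

```python
import asyncio


def split_into_chunks(duration_s: float, chunk_s: float) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole file."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


async def transcribe_chunk(start: float, end: float) -> str:
    """Hypothetical stub; a real version would call a speech-to-text model."""
    await asyncio.sleep(0)  # stands in for waiting on a remote model call
    return f"[{start:.0f}-{end:.0f}s]"


async def transcribe(duration_s: float, chunk_s: float = 300.0) -> str:
    chunks = split_into_chunks(duration_s, chunk_s)
    # gather() issues all chunk requests concurrently and preserves input order,
    # so the partial transcripts can be joined without re-sorting.
    parts = await asyncio.gather(*(transcribe_chunk(s, e) for s, e in chunks))
    return " ".join(parts)
```

Calling `asyncio.run(transcribe(36_000))` would fan a 10-hour file out into 120 concurrent chunk requests; total time is then bounded by the slowest chunk and by how many model replicas are available, not by the file length.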

Multi-model pipelines

Build powerful compound AI systems and experiences like AI phone calling, multi-step image generation, and multimodal chat.

Key Benefits

++++Get to market faster
with products that perform better

Reduce latency, increase throughput

DAGs weren’t built for real-time inference. Chains is designed for performance and scalability by default. Minimize network hops to deliver the lowest latency possible. Automatically scale GPU and CPU resources with demand to avoid bottlenecks and outages.

Reduce GPU cost at scale

Deploying a multi-model application as a monolith wastes valuable GPU resources. Chains lets you optimize cost by selecting the right GPU or CPU size for each decoupled component (Chainlet) of your workflow.

Save hundreds of development hours

Stop wasting valuable developer time building and maintaining inference infrastructure. Start shipping new AI features faster by using Chains to enable high-performance multi-model workflows at scale from day 1.

Increase industry compliance

Ensure that your workflows meet standards for HIPAA and other regulatory compliance frameworks. Self-host Chains to control exactly where you send sensitive data, reducing the risk of violations and protecting your customers' data privacy.

Key Features

+++++Created for engineers.
Loved by enterprises.

Support for every model

Integrate any model architecture seamlessly into your workflows. Combine your own fine-tuned or bespoke models with the latest open-source and third-party models.

Delightful dev experience

Our SDK abstracts away infrastructure complexity, keeping simple tasks simple while providing robust tools for intricate operations.

Composable and extensible

Create components once, and use them universally. Chainlets allow you to easily integrate new and existing AI technologies into a fully cohesive product experience.

Expert support on demand

Our team of AI experts accelerates your project from concept to production. We optimize each part of your deployment to deliver the best possible performance at scale.

Volume-based GPU discounts

Get the best possible ROI on your GPU spend with our volume-based discounts. Reduce your incremental cost as you scale to realize the best possible unit economics.

Enterprise-grade security

Deploy with confidence, backed by enterprise-grade security protocols designed to safeguard your applications and data across all compliance requirements.

Rime’s state-of-the-art p99 latency and 100% uptime is driven by our shared laser focus on fundamentals, and we’re excited to push the frontier even further with Baseten.

Lily Clifford, Co-founder and CEO of Rime

Baseten enabled us to achieve something remarkable—delivering real-time AI phone calls with sub-400 millisecond response times. That level of speed set us apart from every competitor.
Isaiah Granet, CEO and Co-Founder of Bland AI

A week ago we reached out with a hefty goal and within days your team helped us get set up and stable for a launch. It went smoothly, entirely thanks to you guys. 100% couldn’t have gone live without the software and hardware support you guys worked through the weekend to get us. The custom optimized Whisper on Baseten’s autoscaling L4 GPUs saved us.
Vincent Wilmet, Co-founder and CTO @ toby

Inference for custom-built LLMs could be a major headache. Thanks to Baseten, we’re getting cost-effective high-performance model serving without any extra burden on our internal engineering teams. Instead, we get to focus our expertise on creating the best possible domain-specific LLMs for our customers.
Waseem Alshikh, CTO and Co-Founder of Writer

You guys have literally enabled us to hit insane revenue numbers without ever thinking about GPUs and scaling. We would be stuck in GPU AWS land without y'all. Truss files are amazing, y'all are on top of it always, and the product is well thought out. I know I ask for a lot so I just wanted to let you guys know that I am so blown away by everything Baseten.
Isaiah Granet, CEO and Co-Founder of Bland AI

Baseten gets the stuff we don't want to do out of the way. Now, our small, scrappy team can punch above our weight. It's everything from model serving, to auto-scaling, to iterating on products around those models, so we can deliver value to our customers and not worry about ML infrastructure.
Nikhil Harithas, Senior ML Engineer at Patreon

Baseten provides an easy way for us to host our models, iterate on them, and experiment without worrying about any of the DevOps involved.
Faaez Ul Haq, Head of Data Science at Pipe

Baseten has allowed us to efficiently build an entirely new machine learning platform in just 4 months. By not needing to worry about managing our model infrastructure, Laurel has been able to drastically reduce our time to develop new predictive features and maintain more than double the number of models from our old platform.
Andrew Ward, VP of Machine Learning at Laurel

Explore Baseten today

We love partnering with companies developing innovative AI products by providing the most customizable model deployment with the lowest latency.

