Baseten Chains is now GA for production compound AI systems

TL;DR

Baseten Chains is an SDK for deploying performant compound AI systems to production. Chains enables AI builders to ship ultra-low-latency compound AI systems with dedicated hardware and independent autoscaling for each step, eliminating performance bottlenecks and model orchestration headaches while keeping inference cost-efficient. With improved performance and developer tooling since beta, we’re thrilled to announce that Chains is now generally available!

A speech-to-speech Chain with independent hardware and autoscaling for each step.

Deploying compound AI systems requires a different toolset than single-model deployments. Traditional approaches often force developers to build monolithic deployments where workflows and hardware are tightly coupled, and intricate model orchestration must be implemented by hand. This is cumbersome and error-prone, leading to excess hardware, engineering, and maintenance costs, as well as performance bottlenecks.

Serving compound AI systems performantly in production brings unique challenges around:

  • Model orchestration: Managing multiple AI models and processing steps, as well as their data exchange.

  • Latency: Since models and processing steps need to pass data to one another, compound AI systems can easily incur excess latency.

  • Reliability: If one part of the system fails, the entire request fails.

  • Cost: Without sufficient composability, you can end up paying for idle GPU time and unnecessary egress costs. 

After working with industry leaders serving AI-native products at massive scale, we saw the need for a more efficient way to deploy compound AI systems in production. We set out to build a solution that lets you:

  1. Deploy ultra-low-latency compound AI systems, with efficient data exchange between models.

  2. Reduce hardware costs and eliminate performance bottlenecks with custom hardware and autoscaling per model or processing step.

  3. Save time with an expressive developer experience and comprehensive observability.

That’s why we built Baseten Chains, a framework and SDK for serving highly performant compound AI systems in production.

Now with additional performance and DevEx improvements since our beta launch, we’re thrilled to announce the general availability of Chains for production AI!


How Baseten Chains works

Baseten Chains is an SDK for building and deploying compound AI workflows, built on top of our open-source model packaging library, Truss. A “Chain” refers to your entire compound workflow, like a retrieval-augmented generation (RAG) or audio transcription pipeline. Chains are composed of “Chainlets,” the individual subtasks or business logic.

A Chainlet can be any model, data coordination, or processing step. For example, a transcription Chain could be composed of three Chainlets (sketched in code after this list) that:

  1. Download and transform incoming audio, defining chunks for further processing (data coordination)

  2. Detect silence using a voice activity detection (VAD) model, and chunk the audio accordingly (AI model and processing step)

  3. Transcribe the audio (AI model)
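
A minimal structural sketch of that pipeline might look like the following. The Chainlet names and the placeholder logic are illustrative assumptions, not Baseten’s reference implementation; only the truss_chains wiring (ChainletBase, depends, mark_entrypoint) comes from the library.

import truss_chains as chains


class ChunkAudio(chains.ChainletBase):
    # Step 1: download/transform audio and define coarse chunks (CPU-only).
    async def run_remote(self, audio_url: str) -> list[str]:
        # Placeholder logic: a real Chainlet would fetch and segment audio.
        return [f"{audio_url}#part-{i}" for i in range(2)]


class SplitOnSilence(chains.ChainletBase):
    # Step 2: refine chunk boundaries with a VAD model.
    async def run_remote(self, chunk: str) -> list[str]:
        # Placeholder logic: a real Chainlet would run a VAD model here.
        return [chunk]


class Transcribe(chains.ChainletBase):
    # Step 3: transcribe each chunk with a speech-to-text model (GPU).
    async def run_remote(self, chunk: str) -> str:
        # Placeholder logic: a real Chainlet would invoke e.g. Whisper.
        return f"<transcript of {chunk}>"


@chains.mark_entrypoint
class TranscriptionChain(chains.ChainletBase):
    # Entrypoint: wires the three steps together via chains.depends.
    def __init__(
        self,
        chunker=chains.depends(ChunkAudio),
        vad=chains.depends(SplitOnSilence),
        transcriber=chains.depends(Transcribe),
    ) -> None:
        self._chunker = chunker
        self._vad = vad
        self._transcriber = transcriber

    async def run_remote(self, audio_url: str) -> str:
        texts = []
        for chunk in await self._chunker.run_remote(audio_url):
            for segment in await self._vad.run_remote(chunk):
                texts.append(await self._transcriber.run_remote(segment))
        return " ".join(texts)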

Each Chainlet runs on its own hardware with custom autoscaling. In our transcription example, we can scale up many replicas of our chunking Chainlet to quickly chunk long audio clips in parallel. Chunking can be done on CPUs, so we can make our pipeline more cost-efficient by not using more expensive GPUs for this step.
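
In code, that hardware split is expressed per Chainlet via its remote_config. A sketch extending the Chainlets above, assuming chains.RemoteConfig and chains.Compute accept the fields shown (the specific CPU, memory, and GPU values are illustrative):

import truss_chains as chains


class ChunkAudio(chains.ChainletBase):
    # Cheap, parallelizable data coordination: many CPU-only replicas.
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(cpu_count=4, memory="8Gi"),
    )
    # run_remote(...) as in the sketch above.


class Transcribe(chains.ChainletBase):
    # Model inference: give only this step a GPU.
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(cpu_count=4, memory="16Gi", gpu="A10G"),
    )
    # run_remote(...) as in the sketch above.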

Chainlets call each other directly without a centralized “orchestration executor.” This helps us keep latencies low by eliminating the need to retrieve and send intermediary results for each step in your workflow. It also makes building complex workflows intuitive, because combining Chainlets is like calling and combining “normal” Python functions.

Chainlets run on customized hardware, scale independently, and call each other directly. This helps keep compound workflows efficient, low-latency, and low-cost.

Building compound AI systems with Chains

You can build any compound AI system with Chains while gaining the model performance and fluid horizontal scaling that Baseten specializes in. A Chain is implemented in typed Python code; the Chains library handles all of the orchestration according to how you define Chainlet interactions in the code (including networking between models, choosing transport layers and protocols, serializing data, managing data types, and more). 

For instance, the following code snippet builds a simple Chain that prints “hello” to each person in a list of provided names:

import asyncio
import truss_chains as chains


# This Chainlet does the work.
class SayHello(chains.ChainletBase):

    async def run_remote(self, name: str) -> str:
        return f"Hello, {name}"


# This Chainlet orchestrates the work.
@chains.mark_entrypoint
class HelloAll(chains.ChainletBase):

    def __init__(self, say_hello_chainlet=chains.depends(SayHello)) -> None:
        self._say_hello = say_hello_chainlet

    async def run_remote(self, names: list[str]) -> str:
        tasks = []
        for name in names:
            tasks.append(asyncio.ensure_future(
                self._say_hello.run_remote(name)))

        return "\n".join(await asyncio.gather(*tasks))

The code highlights how “chaining” two Chainlets is as easy as calling local, type-safe async Python functions, but we can easily implement more complex use cases. For instance, “SayHello” could be an LLM instead of a simple string template, allowing us to take more complex actions for each person. 
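
To exercise a Chain like this before deploying it, truss_chains provides a local debugging mode. A minimal usage sketch (the names passed in are arbitrary):

if __name__ == "__main__":
    # Runs all Chainlets in-process for fast local debugging.
    with chains.run_local():
        hello_chain = HelloAll()
        result = asyncio.run(hello_chain.run_remote(["Marius", "Sid"]))
        print(result)

Deploying the same file to Baseten is then a single CLI call, e.g. truss chains push hello.py (assuming the Chain lives in hello.py).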

Common use cases for Chains include retrieval-augmented generation (RAG), audio transcription, and speech-to-speech pipelines.

What’s new since our beta Chains release

In addition to many stability improvements, our main focus since our beta release has been on further boosting performance and developer experience.

Performance improvements

To boost performance, we’ve added: 

  • Output streaming for lower initial response times (see the sketch after this list)

  • Binary IO (extending the JSON-centered Pydantic ecosystem with NumPy array support and raw bytes) for performant serialization of numeric data
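
For instance, output streaming lets a Chainlet yield partial results as they become available rather than returning a single final value. A sketch, under the assumption that an async-generator run_remote with an AsyncIterator return annotation is the supported streaming interface; the token source is a stand-in for a real model:

from typing import AsyncIterator

import truss_chains as chains


class StreamedGreeting(chains.ChainletBase):

    async def run_remote(self, name: str) -> AsyncIterator[str]:
        # Stand-in for a model that produces output incrementally;
        # each yielded chunk is sent to the caller as it is produced.
        for word in ("Hello,", name, "nice", "to", "meet", "you!"):
            yield word + " "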

Developer experience improvements

Since beta, we’ve also added feature support for: 

  • Subclassing for easy reuse of Chainlets: with minimal code changes, you can deploy a base Chainlet on different hardware, with different concurrency, dependencies, model weights, and more (see the sketch after this list).

  • Chains Watch to live-patch deployed code: run an exact copy of your production hardware and interface, with live code patching that lets you test changes rapidly.

  • A linter to quickly identify Chains definition errors before pushing code, more expressive logging, and many UI improvements around promoting, waking, and managing Chains environments.
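
As an illustration of subclassing, a derived Chainlet can inherit run_remote unchanged and override only its deployment configuration. A sketch reusing the hypothetical Transcribe Chainlet from the transcription example above (hardware values illustrative):

class TranscribeOnBiggerGPU(Transcribe):
    # Identical transcription logic; only the hardware spec changes.
    remote_config = chains.RemoteConfig(
        compute=chains.Compute(cpu_count=8, memory="32Gi", gpu="H100"),
    )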


Deploy performant compound AI systems in production 

Baseten exists to provide the most performant, reliable, and scalable infrastructure for AI-native products. Chains enables AI builders to ship complicated multi-model workflows in a cohesive yet individualized way, coupling a delightful DevEx with ultra-low-latency inference, custom hardware, and massive horizontal scaling. 

Chains is now GA, but we’re constantly leveraging customer feedback to push the boundaries of state-of-the-art model performance in production. For rapid multi-model inference and simplified model management for your production AI workloads, reach out to our engineers, and come talk to us at NVIDIA GTC or KubeCon Europe!