Run model inference asynchronously on Baseten

We’re thrilled to announce that you can now run async inference on Baseten models!

This unlocks some powerful inference use cases:

  • Scalable processing: Schedule tens of thousands of inference requests without worrying about the complexity of queuing and model capacity. This is particularly useful for jobs involving large numbers of requests, such as generating embeddings over many documents.

  • Long-running tasks: Handle complex tasks like transcription that can take more than 20 minutes to complete.

  • Priority-based inference: Execute workloads in order of assigned priority, for example giving online jobs precedence over offline jobs.

You can leverage async inference on any Baseten model without making any changes to your model code. Check out our guide to using async inference on Baseten, along with the async inference API reference, for full details.
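As a rough sketch of the flow (the endpoint path, payload fields like `model_input` and `webhook_endpoint`, and header format below are illustrative assumptions, not the definitive API), an async request from Python might look something like this: you submit the request, get back an ID for the queued work immediately, and receive the result later rather than blocking on the call.

```python
import os
import requests

# Assumed values for illustration; substitute your own model ID and API key.
MODEL_ID = "your_model_id"
API_KEY = os.environ["BASETEN_API_KEY"]

# Submit an async inference request. Instead of waiting for the model's output,
# the call returns immediately with an identifier for the queued request; the
# result is delivered later (for example, to a webhook you host).
resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/async_predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        # The same input you would send to a synchronous predict call.
        "model_input": {"prompt": "What even is AGI?"},
        # Where to send the result once inference finishes (assumed parameter name).
        "webhook_endpoint": "https://example.com/baseten-webhook",
    },
)
resp.raise_for_status()

# The response contains an ID for the queued request, not the model output itself.
print(resp.json())
```

Because each call returns as soon as the request is queued, scheduling tens of thousands of these requests is just a loop over your inputs; queuing and model capacity are handled on Baseten's side.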

We can’t wait (and neither can async inference) for you to give it a try!