
Frontier RL with Baseten Loops

Async RL on long sequence lengths with one-click checkpoint deploys to the Baseten Inference Stack.

Training Platform

Two ways to train: pick what fits your needs

Loops (early access)

Write training logic, not infra code.

A training SDK that supports long sequence lengths, async RL, and one-click checkpoint deploys.

  • 131K+ token sequence lengths and 1T+ parameter model training. Support for the Qwen3.5/3.6 family and Kimi K2.6, with the Nemotron, DeepSeek, GLM, and MiniMax series to follow shortly.

  • Train → deploy loop: Models trained with Loops promote directly to Baseten Dedicated Inference with one command.

  • Asynchronous RL primitives, like policy versioning and non-blocking weight sync, that enable bounded off-policy learning (see the sketch after this list).

  • Full ownership of your trained weights, no lock-in.
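As a concrete illustration of the bounded off-policy idea, here is a minimal Python sketch. The Rollout type, the staleness bound, and the helper below are assumptions made for the example, not the Loops API:

    # Illustrative only: policy versioning with a staleness bound.
    # None of these names come from the Loops SDK.
    from dataclasses import dataclass

    @dataclass
    class Rollout:
        tokens: list[int]
        rewards: list[float]
        policy_version: int  # version of the sampler weights that produced it

    MAX_STALENESS = 2  # accept rollouts at most 2 policy versions behind

    def usable(rollout: Rollout, trainer_version: int) -> bool:
        # Bounded off-policy learning: discard (or down-weight) rollouts
        # sampled from weights that have fallen too far behind the trainer.
        return trainer_version - rollout.policy_version <= MAX_STALENESS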

Training Jobs (GA)

Run your existing training scripts on managed GPUs.

A framework-agnostic training product designed for teams who want bare-metal-like control on managed infra.

  • Multi-node training with automatic checkpoint syncing between nodes.

  • On-demand compute acquired in seconds.

  • Plugs into the rest of your stack, including W&B, HuggingFace, and S3, via Baseten Secrets (see the example after this list).

  • SSH access built in for live debugging on any running container.
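Wiring those integrations together is ordinary Python. The sketch below assumes Baseten Secrets surface as environment variables inside the training container; the variable names are illustrative:

    # Illustrative: pull credentials injected via Baseten Secrets
    # (assumed here to be environment variables) and connect to
    # W&B for tracking and S3 for checkpoint storage.
    import os

    import boto3   # pip install boto3
    import wandb   # pip install wandb

    wandb.login(key=os.environ["WANDB_API_KEY"])
    run = wandb.init(project="my-training-job")

    s3 = boto3.client(
        "s3",
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )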

Loops SDK

Solving the key problems with large model post-training

We've run into deployment friction, synchronous weight syncs, and unpredictable runtimes ourselves. We built Loops to solve these problems.

Train and deploy, one platform

Current Problem: Teams have to manually merge LoRAs, quantize across formats, and burn iteration cycles before serving prod traffic.

Loops: Inference is a first-class citizen in the product. After the last gradient step, your model is ready to deploy as a prod endpoint, closing the training-to-inference loop.
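As a hypothetical sketch of that promotion step (the loops module, the deploy function, and its parameters are invented for illustration; the real command may differ):

    # Hypothetical one-command promotion of a trained checkpoint to a
    # Baseten Dedicated Inference endpoint. `loops.deploy` and its
    # arguments are illustrative, not a documented API.
    from loops import deploy

    endpoint = deploy(
        checkpoint="checkpoints/final",  # last saved checkpoint
        name="my-custom-model",          # endpoint name on Baseten
    )
    print(endpoint.url)                  # ready for prod traffic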

Async RL at scale

Current Problem: Training 1T+ parameter models at long sequence lengths means hand-tuning parallelisms on fragile training libraries. True async RL is often an afterthought.

Loops: Take a gradient step with primitives like forward_backward, optim_step, and sample. Loops handles all the memory management and parallelisms. Training and sampling also overlap by pushing new weights in the background, so the trainer never waits for the weight sync.
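A minimal sketch of how a step built from those primitives might look. The signatures, the push_weights call, and the loop structure are assumptions for the example, not the documented Loops API:

    def train_loop(trainer, prompts, num_steps, sample, forward_backward, optim_step):
        # Illustrative async RL loop using the primitives named above.
        # All signatures here are assumptions, not the Loops API.
        for step in range(num_steps):
            batch = sample(prompts)                  # rollouts from the current samplers
            loss = forward_backward(trainer, batch)  # fused forward/backward pass; Loops
                                                     # manages memory and parallelism
            optim_step(trainer)                      # apply the gradient update
            trainer.push_weights(blocking=False)     # non-blocking weight sync: samplers pick
                                                     # up new weights in the background, so the
                                                     # trainer never waits
            print(f"step {step}: loss {loss:.4f}")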

Predictable performance

Current Problem: Training large models on shared infra creates painful variance, with the same script taking hours one day and minutes the next.

Loops: Scale your samplers and trainers independently on dedicated infra that delivers consistent throughput run-over-run.


Baseten helped us train models to be 23x faster and is projected to save us $1.9M, while making the process so easy that even non-ML engineers could get results in under 30 minutes.

Eric Lehman
SVP of ML, OpenEvidence

Training Expertise

Partner with world-class RL researchers

Our team trains custom models for your use case that outperform closed-source models.

Your Models

Own your model artifacts

All artifacts including model weights, evals, and training scripts belong entirely to you.

Production Inference

Continual learning from inference

Easily deploy your custom model to inference and continually improve model quality with real-world data.

Our Research


Towards infinite context windows: neural KV cache compaction

Building an intermediate memory layer is a prerequisite for continual learning in LLMs.


Dense, on-policy, or both?

Constitutional alignment as a testbed for comparing learning signals in SFT, RL, and everything in between.


Repeated KV cache for long-running agents

Finding the core barrier to repeated KV cache compression for infinite context.


Distillation without the dark

A co-evolving discriminator enables on-policy distillation from closed-source models without logit access.


Iterative SFT (iSFT): dense reward learning

Iterative grader feedback turns imperfect model outputs into gold-quality SFT data.


RGT (Rationale-Guided Training)

Upweight the strategy, not the tokens: faster training with explicit reasoning.


Get early access to Loops