Glossary
A quick introduction to speculative decoding
Speculative decoding improves LLM inference latency by using a smaller draft model to generate candidate tokens that the larger target model verifies and accepts or rejects.
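To make the draft-then-verify idea concrete, here is a minimal toy sketch in Python. The draft_model and target_model functions are placeholders over a tiny vocabulary, and verification runs greedily token by token; a real implementation verifies the whole draft in a single target forward pass and samples from adjusted distributions.

```python
import random

random.seed(0)

# Toy stand-ins for a small draft model and a large target model.
# Both return a single next token over a tiny vocabulary.
VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_model(context):
    # Hypothetical cheap model: random next-token guesses.
    return random.choice(VOCAB)

def target_model(context):
    # Hypothetical expensive model: deterministic "ground truth".
    return VOCAB[len(context) % len(VOCAB)]

def speculative_decode(context, num_draft_tokens=4, max_len=12):
    """Draft-then-verify loop: the draft model proposes a short run of
    tokens; the target model keeps the longest prefix it agrees with
    and supplies one corrected token where they first disagree."""
    while len(context) < max_len:
        # 1. Draft phase: propose several tokens cheaply.
        draft = []
        for _ in range(num_draft_tokens):
            draft.append(draft_model(context + draft))
        # 2. Verify phase: accept draft tokens while the target agrees.
        accepted = []
        for token in draft:
            expected = target_model(context + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # target's correction
                break
        context = context + accepted
    return context

print(speculative_decode(["the"]))
```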
Building high-performance compound AI applications with MongoDB Atlas and Baseten
Using MongoDB Atlas with Baseten’s Chains framework, you can build high-performance compound AI systems.
Compound AI systems explained
Compound AI systems combine multiple models and processing steps, and they form the next generation of AI products.
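As a rough illustration of what "multiple models and processing steps" can look like, here is a hypothetical three-step pipeline (retrieval, generation, moderation); the functions are stand-ins, not real APIs.

```python
# Illustrative compound AI pipeline: each step could be a separate model
# or service (retrieval, an LLM call, a guardrail check). Everything
# below is a placeholder for the real component it names.

def retrieve_context(query: str) -> list[str]:
    # In a real system this might query a vector database.
    return [f"doc snippet about {query}"]

def generate_answer(query: str, context: list[str]) -> str:
    # In a real system this would call an LLM with the query + context.
    return f"Answer to '{query}' using {len(context)} retrieved snippet(s)."

def moderate(text: str) -> str:
    # In a real system this might be a separate safety/guardrail model.
    return text if "forbidden" not in text else "[filtered]"

def compound_pipeline(query: str) -> str:
    context = retrieve_context(query)
    answer = generate_answer(query, context)
    return moderate(answer)

print(compound_pipeline("speculative decoding"))
```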
How latent consistency models work
Latent Consistency Models (LCMs) build on latent diffusion models to produce high-quality images in just 2-4 inference steps, taking less than a second per image.
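For context, this is roughly what few-step LCM inference looks like with Hugging Face diffusers, assuming a recent diffusers release with built-in LCM support and a GPU; the checkpoint name, prompt, and parameters are just examples.

```python
import torch
from diffusers import DiffusionPipeline

# Illustrative LCM inference sketch (assumes a recent diffusers release
# with Latent Consistency Model support and a CUDA GPU).
pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",  # example LCM checkpoint
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# LCMs need only a handful of denoising steps instead of the 20-50
# typical for standard latent diffusion models.
image = pipe(
    prompt="a photo of a lighthouse at sunset",
    num_inference_steps=4,
    guidance_scale=8.0,
).images[0]
image.save("lighthouse.png")
```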
Control plane vs workload plane in model serving infrastructure
A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.
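A minimal sketch of that separation, with entirely hypothetical names and endpoints: the control plane only stores routing metadata and picks a workload plane, while the workload planes are where the model servers actually run.

```python
from dataclasses import dataclass

# Illustrative control plane / workload plane split. All clouds,
# regions, and endpoints below are hypothetical.

@dataclass
class WorkloadPlane:
    cloud: str
    region: str
    endpoint: str  # where this plane's model servers are reachable

class ControlPlane:
    """Holds deployment metadata and decides which workload plane serves
    a request; it never handles request payloads or model weights."""

    def __init__(self, planes: list[WorkloadPlane]):
        self.planes = planes

    def route(self, preferred_region: str) -> WorkloadPlane:
        for plane in self.planes:
            if plane.region == preferred_region:
                return plane
        return self.planes[0]  # fall back to any available plane

planes = [
    WorkloadPlane("aws", "us-east-1", "https://infer.us-east-1.example.com"),
    WorkloadPlane("gcp", "europe-west4", "https://infer.eu-west.example.com"),
]
control = ControlPlane(planes)
print(control.route("europe-west4").endpoint)
```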
Comparing tokens per second across LLMs
To accurately compare tokens per second between different large language models, we need to adjust for tokenizer efficiency.
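A small worked example of that adjustment, with made-up numbers: raw tokens per second are scaled by how many tokens each model's tokenizer needs for the same reference text.

```python
# Illustrative normalization of tokens per second by tokenizer
# efficiency. All numbers are invented for the example.

# Suppose two models report raw decoding speeds on the same prompts:
raw_tps = {"model_a": 100.0, "model_b": 120.0}

# But their tokenizers split the same reference text into different
# numbers of tokens (fewer tokens for the same text = more efficient).
tokens_for_same_text = {"model_a": 1000, "model_b": 1300}

# Normalize to a common unit (here: model_a's tokens) so speeds are
# comparable in terms of actual text generated per second.
baseline = tokens_for_same_text["model_a"]
for name, tps in raw_tps.items():
    efficiency = baseline / tokens_for_same_text[name]
    adjusted = tps * efficiency
    print(f"{name}: raw {tps:.0f} tok/s, adjusted {adjusted:.1f} tok/s")

# model_b's higher raw speed shrinks once we account for its less
# efficient tokenizer needing more tokens for the same text.
```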
Continuous vs dynamic batching for AI inference
Learn how continuous and dynamic batching increase throughput during model inference with minimal impact on latency.
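As a sketch of the dynamic half of that comparison, the loop below collects requests until a batch fills or a short window expires; continuous batching, used for LLMs, goes further by admitting new requests between decoding steps and is not shown here. The batch size and timeout are illustrative.

```python
import time
from queue import Queue, Empty

# Illustrative dynamic batching: gather incoming requests until either
# the batch is full or a small time window expires, then run them
# through the model together.

MAX_BATCH_SIZE = 8
MAX_WAIT_SECONDS = 0.01

def collect_batch(request_queue: Queue) -> list:
    batch = []
    deadline = time.monotonic() + MAX_WAIT_SECONDS
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break
    return batch

# Usage sketch: enqueue a few requests and form one batch.
q = Queue()
for i in range(5):
    q.put(f"request-{i}")
print(collect_batch(q))  # up to 8 requests, waited at most ~10 ms
```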
FP8: Efficient model inference with 8-bit floating point numbers
The FP8 data format has a wider dynamic range than INT8, which allows weights and activations to be quantized for more LLMs without loss of output quality.
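A quick numeric illustration of that range difference, using the commonly cited values for the E4M3 FP8 format (this is arithmetic only, not a real quantization kernel):

```python
# INT8 covers a narrow, uniformly spaced integer grid; FP8 E4M3 spends
# bits on an exponent, so it reaches both larger and much smaller
# magnitudes at the cost of uneven spacing.

int8_range = (-128, 127)            # 256 evenly spaced integer values
fp8_e4m3_max = 448.0                # largest finite E4M3 value
fp8_e4m3_min_subnormal = 2.0 ** -9  # smallest positive E4M3 value

print(f"INT8 spans {int8_range[0]} to {int8_range[1]} with uniform spacing.")
print(f"FP8 E4M3 spans ±{fp8_e4m3_max}, down to ~{fp8_e4m3_min_subnormal:.4f}.")

# That wider dynamic range is a better fit for the spread-out value
# distributions of LLM weights and activations than INT8's uniform grid.
```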
The benefits of globally distributed infrastructure for model serving
Multi-cloud and multi-region infrastructure for model serving provides availability, redundancy, lower latency, cost savings, and data residency compliance.
Why GPU utilization matters for model inference
Save money on high-traffic model inference workloads by increasing GPU utilization to maximize performance per dollar for LLMs, SDXL, Whisper, and more.
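To see why utilization translates to performance per dollar, here is a back-of-the-envelope calculation with hypothetical prices and throughput:

```python
# Illustrative cost-per-token arithmetic (all numbers are hypothetical).
# Higher utilization spreads the same hourly GPU cost over more tokens.

gpu_cost_per_hour = 2.00          # hypothetical hourly price for one GPU
peak_tokens_per_second = 5000.0   # throughput if the GPU were saturated

for utilization in (0.2, 0.5, 0.9):
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000
    print(f"{utilization:.0%} utilization -> "
          f"${cost_per_million_tokens:.3f} per million tokens")
```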