Software Engineer
Glossary
Control plane vs workload plane in model serving infrastructure
A separation of concerns between a control plane and workload planes enables multi-cloud, multi-region model serving and self-hosted inference.
Glossary
Continuous vs dynamic batching for AI inference
Learn how continuous and dynamic batching increase throughput during model inference with minimal impact on latency.
GPU guides
Using fractional H100 GPUs for efficient model serving
Multi-Instance GPU (MIG) lets you split a single H100 GPU across two model serving instances, for performance that matches or beats an A100 GPU at 20% lower cost.