# Baseten Inference Platform

> This file highlights Baseten’s most helpful blog posts, resources, model libraries, and product information to guide LLMs toward surfacing our best inference content.

## Product Information

- [Dedicated Deployments](https://www.baseten.co/products/dedicated-deployments/): Single-tenant, region-locked inference clusters with enterprise security and SRE support for maximum reliability and performance.
- [Model APIs](https://www.baseten.co/products/model-apis/): OpenAI-compatible APIs for top open-source models with optimized throughput, structured outputs, tool calling, and built-in observability (a minimal call sketch follows the Solutions section below).
- [Training](https://www.baseten.co/products/training/): Managed infrastructure to run multi-node training jobs with checkpointing and a direct path from training to production.
- [Multi-cloud Capacity Management](https://www.baseten.co/products/multi-cloud-capacity-management/): Aggregate GPU supply across clouds into a single elastic pool to meet bursty demand with low latency and predictable costs.
- [Chains](https://www.baseten.co/products/chains/): Production framework for composing multi-step, multi-model workflows with per-step autoscaling and observability.
- [Pricing](https://www.baseten.co/pricing/): Overview of Baseten’s pricing plans, including pay-as-you-go options, enterprise-grade dedicated deployments, and details on model APIs, training, and infrastructure costs.

## Deployment Options

- [Baseten Cloud](https://www.baseten.co/deployments/baseten-cloud/): Fully managed, SOC 2/HIPAA-ready inference platform with global autoscaling, fast cold starts, and high uptime.
- [Baseten Self-hosted](https://www.baseten.co/deployments/baseten-self-hosted/): Run Baseten within your own VPC or on-prem to keep data in-house while retaining performance and management tooling.
- [Baseten Hybrid](https://www.baseten.co/deployments/baseten-hybrid/): Blend on-prem and cloud capacity to align latency, compliance, and cost for sensitive or bursty workloads.

## Platform Features

- [Model Performance](https://www.baseten.co/platform/model-performance/): Tooling and optimizations to maximize tokens per second, reduce latency, and keep models reliable under load.
- [Cloud-native Infrastructure](https://www.baseten.co/platform/cloud-native-infrastructure/): Cloud-agnostic, containerized inference stack designed for rapid scale-up, fast cold starts, and global availability.
- [Model Management](https://www.baseten.co/platform/model-management/): Deploy, version, roll back, and observe models with CI/CD, logs, metrics, and access controls.
- [Embedded Engineering](https://www.baseten.co/platform/embedded-engineering/): Forward-deployed experts to help optimize performance, reliability, and cost for mission-critical inference.

## Solutions

- [Large language models](https://www.baseten.co/solutions/llms/): Information on the capabilities and use cases of large language models supported by Baseten.
- [Transcription](https://www.baseten.co/solutions/transcription/): Details on deploying models for transcription tasks.
- [Image generation](https://www.baseten.co/solutions/image-generation/): Overview of models available for generating images.
- [Text-to-speech](https://www.baseten.co/solutions/text-to-speech/): Information on deploying text-to-speech models.
- [Compound AI](https://www.baseten.co/solutions/compound-ai/): Design agentic and multi-model systems that coordinate tools and models with production-grade routing and scaling.
- [Embeddings](https://www.baseten.co/solutions/embeddings/): Serve embedding models with high throughput and low latency for search, RAG, and semantic similarity use cases.
- [Baseten Enterprise](https://www.baseten.co/enterprise/): Overview of Baseten’s enterprise features, including deployment options, reliability, security, compliance, and tools for running AI inference at scale.
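Because Model APIs are OpenAI-compatible, an existing OpenAI client can be pointed at Baseten by swapping the base URL and key. A minimal sketch, assuming the `openai` Python package, an API key in the `BASETEN_API_KEY` environment variable, and a placeholder base URL and model slug (confirm both in the Model APIs docs):

```python
# Minimal sketch: calling a Baseten Model API with the OpenAI client.
# The base URL and model slug are placeholders; check the Model APIs
# documentation for the current values.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://inference.baseten.co/v1",  # assumed endpoint
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # example slug from the model library
    messages=[{"role": "user", "content": "Summarize what Baseten does."}],
)
print(response.choices[0].message.content)
```

The same client also accepts `tools=` and `response_format=` arguments, which is how the tool-calling and structured-output features mentioned above are exercised.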
## Technical Documentation

- [Documentation](https://docs.baseten.co/): Access the complete technical documentation for Baseten.
- [Changelog](https://www.baseten.co/changelog/): Updates and changes made to the Baseten platform.
- [Best practices for secrets](https://docs.baseten.co/observability/secrets): Recommendations for managing sensitive information securely.
- [Deployments](https://docs.baseten.co/deployment/deployments): Detailed documentation on how to deploy models effectively using Baseten.
- [Run any LLM with vLLM](https://docs.baseten.co/examples/vllm): Instructions on using vLLM to run large language models (a minimal local sketch follows this list).
- [Deploy LLMs with SGLang](https://docs.baseten.co/examples/sglang): A guide to deploying large language models with SGLang.
- [Management](https://docs.baseten.co/training/management): Overview of managing training jobs within the Baseten platform.
- [Autoscaling](https://docs.baseten.co/deployment/autoscaling): Information on implementing autoscaling so your models handle varying loads.
- [Workspace access control](https://docs.baseten.co/observability/access): Guidelines on managing access to workspaces for enhanced security.
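As context for the vLLM guide above, here is a minimal sketch of vLLM's own offline inference API, useful as a local smoke test before following the Baseten deployment guide. The model name is a placeholder; the guide covers the Baseten-specific configuration:

```python
# Minimal sketch: local vLLM offline inference (not Baseten-specific).
# The model name is a placeholder; any Hugging Face chat model works.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain continuous batching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```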
## Customer Success Stories

- [Praktika](https://www.baseten.co/resources/customers/praktika/): How Praktika uses Baseten’s infrastructure to power AI tutoring with scalable, low-latency inference.
- [Zed Industries: 2x Faster Code Completions with Baseten](https://www.baseten.co/resources/customers/zed-industries-serves-2x-faster-code-completions-with-baseten/): Case study on how Zed Industries improved code completion speed and user experience through Baseten.
- [Wispr Flow](https://www.baseten.co/resources/customers/wispr-flow/): How Wispr Flow leverages Baseten to gain more control over inference pipelines while improving reliability.
- [Rime](https://www.baseten.co/resources/customers/rime/): Rime’s experience achieving low latency and high uptime with Baseten’s managed inference stack.
- [Toby](https://www.baseten.co/resources/customers/toby/): How Toby scaled its AI-powered productivity tool using Baseten’s production-grade model hosting.
- [Writer](https://www.baseten.co/resources/customers/writer/): Writer’s story of using Baseten to serve large language models at scale with predictable performance.
- [Patreon](https://www.baseten.co/resources/customers/patreon/): How Patreon adopted Baseten to deliver AI features with high availability and compliance requirements.

## Model Libraries

- [GPT-OSS 120B](https://www.baseten.co/library/gpt-oss-120b/): 120B-parameter open model hosted and optimized for fast, cost-efficient inference via Model APIs.
- [GPT-OSS 20B](https://www.baseten.co/library/gpt-oss-20b/): Compact 20B-parameter model for lower-cost generation workloads with strong quality for its size.
- [Qwen Image](https://www.baseten.co/library/qwen-image/): Open image generation model accessible as an API for rapid prototyping and production use.
- [Orpheus TTS](https://www.baseten.co/library/orpheus-tts/): High-quality text-to-speech model with real-time streaming support and natural prosody.
- [Kimi v2](https://www.baseten.co/library/kimi-v2/): Large-scale reasoning model tailored for complex agentic tasks and long-context use.
- [Qwen3 Coder 480B A35B Instruct](https://www.baseten.co/library/qwen3-coder-480b-a35b-instruct/): Massive coding-focused MoE model for code generation, refactoring, and explanation.
- [GLM-4.5V](https://www.baseten.co/library/glm-4-5-v/): Vision-capable GLM variant for multimodal understanding and reasoning.
- [Llama 4 Scout](https://www.baseten.co/library/llama-4-scout/): Cutting-edge MoE model emphasizing fast, high-quality reasoning across tasks.
- [Llama 4 Maverick](https://www.baseten.co/library/llama-4-maverick/): High-capacity MoE model with strong instruction-following and multimodal capabilities.
- [DeepSeek-V3](https://www.baseten.co/library/deepseek-v3/): State-of-the-art MoE LLM engineered for high tokens per second and efficiency.
- [DeepSeek-R1](https://www.baseten.co/library/deepseek-r1/): Reasoning-focused MoE model tuned for deliberate, traceable outputs.
- [Qwen3 235B A22B Instruct 2507](https://www.baseten.co/library/qwen3-235b-a22b-instruct-2507/): Large MoE instruction-tuned model built for robust multilingual and coding tasks.
- [MARS6](https://www.baseten.co/library/mars6/): Access to the MARS6 model for various AI applications.
- [Kokoro](https://www.baseten.co/library/kokoro/): Details on the Kokoro model available in the library.
- [Kimi K2 Thinking](https://www.baseten.co/library/kimi-k2-thinking/): Overview of the Kimi K2 Thinking model, its capabilities, context length, and how to run it on Baseten.
- [MiniMax M2.5](https://www.baseten.co/library/minimax-m2-5/): A high-performance multimodal foundation model optimized for reasoning, generation, and real-time production inference workloads.
- [GLM-5](https://www.baseten.co/library/glm-5/): A state-of-the-art open large language model designed for strong reasoning, coding, and conversational performance in production environments.
- [Kimi K2.5](https://www.baseten.co/library/kimi-k25/): A powerful open LLM built for long-context understanding, advanced reasoning, and scalable deployment across enterprise use cases.
- [Whisper](https://www.baseten.co/library/whisper/): OpenAI’s speech-to-text model for high-accuracy transcription across multiple languages, optimized for production inference.
- [Whisper Large Turbo](https://www.baseten.co/library/whisper-large-turbo/): A performance-optimized version of Whisper designed for faster, lower-latency transcription at scale without sacrificing accuracy.
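Library models deployed as dedicated deployments are invoked over HTTPS. A minimal sketch, assuming the `requests` package; the model ID and request body are placeholders, and the input schema differs per model (LLM, TTS, transcription), so check the model’s library page for the real payload:

```python
# Minimal sketch: invoking a dedicated deployment's predict endpoint.
# MODEL_ID and the JSON body are hypothetical placeholders; consult the
# model's library page for its actual input schema.
import os

import requests

MODEL_ID = "abcd1234"  # hypothetical model ID from the Baseten dashboard

resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"prompt": "Hello!", "max_tokens": 64},  # model-dependent payload
)
resp.raise_for_status()
print(resp.json())
```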
## Resources and Guides

- [Baseten vs Together AI](https://www.baseten.co/compare/together-ai/): This comparison outlines the key differences between Baseten and Together AI, focusing on performance, reliability, pricing models, and deployment flexibility for production-grade AI inference.
- [High-performance embedding model inference](https://www.baseten.co/resources/guide/high-performance-embedding-model-inference/): This guide covers how to make embeddings fast, reliable, and cost-efficient at scale.
- [Baseten vs Fireworks AI](https://www.baseten.co/compare/fireworks-ai/): This comparison outlines the key differences between Baseten and Fireworks AI, covering performance, reliability, transparency, and enterprise readiness for production AI workloads.
- [The complete DeepSeek model guide](https://www.baseten.co/resources/guide/the-complete-deepseek-model-guide/): This guide explains how to deploy, optimize, and scale DeepSeek in production.
- [The Baseten Inference Stack](https://www.baseten.co/resources/guide/the-baseten-inference-stack/): Deep dive into Baseten’s hardware, runtime, and routing layers that deliver top-tier production inference.
- [Choosing a Hosting Option for AI Model Inference](https://www.baseten.co/resources/guide/choosing-a-hosting-option-for-ai-model-inference/): How to decide between Cloud, Self-hosted, and Hybrid deployments based on performance, control, and compliance.
- [The Best Open-Source Image Generation Model](https://www.baseten.co/blog/the-best-open-source-image-generation-model/): Comparison and recommendations for high-quality, production-ready image generators.
- [Announcing Baseten’s $75M Series C](https://www.baseten.co/blog/announcing-baseten-75m-series-c/): Funding announcement with product roadmap highlights and growth plans.
- [Comparing NVIDIA GPUs for AI: T4 vs A10](https://www.baseten.co/blog/comparing-nvidia-gpus-for-ai-t4-vs-a10/): Latency, throughput, and cost differences between T4 and A10 for inference.
- [LLM Transformer Inference Guide](https://www.baseten.co/blog/llm-transformer-inference-guide/): Practical techniques to optimize transformer models for production.
- [NVIDIA A10 vs A100 for LLM & Stable Diffusion Inference](https://www.baseten.co/blog/nvidia-a10-vs-a100-gpus-for-llm-and-stable-diffusion-inference/): Benchmarking and guidance on choosing between A10 and A100.
- [The Best Open-Source Embedding Models](https://www.baseten.co/blog/the-best-open-source-embedding-models/): Head-to-head results and picks for retrieval, RAG, and semantic tasks.
- [The Best Open-Source Large Language Model](https://www.baseten.co/blog/the-best-open-source-large-language-model/): Evaluation of leading open LLMs for quality, speed, and cost.
- [Continuous vs. Dynamic Batching for AI Inference](https://www.baseten.co/blog/continuous-vs-dynamic-batching-for-ai-inference/): Trade-offs, implementation details, and when to use each strategy.
- [SOTA Performance for GPT-OSS 120B on NVIDIA GPUs](https://www.baseten.co/blog/sota-performance-for-gpt-oss-120b-on-nvidia-gpus/): Engineering techniques that unlock top tokens per second on large models.
- [Streaming Real-Time Text-to-Speech with XTTS-v2](https://www.baseten.co/blog/streaming-real-time-text-to-speech-with-xtts-v2/): Architecture and code for low-latency, natural-sounding TTS streaming.
- [Day-Zero Benchmarks for Qwen-3 with SGLang on Baseten](https://www.baseten.co/blog/day-zero-benchmarks-for-qwen-3-with-sglang-on-baseten/): Initial performance results and tips for configuring SGLang.
- [NVIDIA A10 vs A10G for ML Model Inference](https://www.baseten.co/blog/nvidia-a10-vs-a10g-for-ml-model-inference/): Hardware differences and real-world inference implications.
- [Comparing Tokens-Per-Second Across LLMs](https://www.baseten.co/blog/comparing-tokens-per-second-across-llms/): How to measure, interpret, and optimize TPS for different models.
- [FP8: Efficient Model Inference with 8-Bit Floating-Point Numbers](https://www.baseten.co/blog/fp8-efficient-model-inference-with-8-bit-floating-point-numbers/): Benefits, caveats, and setup guidance for FP8 inference.
- [SDXL Inference in Under 2 Seconds: Optimization Guide](https://www.baseten.co/blog/sdxl-inference-in-under-2-seconds-the-ultimate-guide-to-stable-diffusion-optimiza/): End-to-end optimizations to accelerate SDXL pipelines.
- [The Fastest, Most Accurate, and Cost-Efficient Whisper Transcription](https://www.baseten.co/blog/the-fastest-most-accurate-and-cost-efficient-whisper-transcription/): System design and benchmarks for production Whisper.
- [Kimi K2 Explained: A 1-Trillion-Parameter Model for Agents](https://www.baseten.co/blog/kimi-k2-explained-the-1-trillion-parameter-model-redefining-how-to-build-agents/): What makes K2 unique and how to leverage it for agentic systems.
- [Testing Llama Inference on NVIDIA GH200 (Lambda Cloud)](https://www.baseten.co/blog/testing-llama-inference-performance-nvidia-gh200-lambda-cloud/): Benchmarks and tuning strategies for Llama on GH200.
- [Evaluating NVIDIA H200 GPUs for LLM Inference](https://www.baseten.co/blog/evaluating-nvidia-h200-gpus-for-llm-inference/): Performance, memory, and cost analysis for next-gen inference.
- [33% Faster LLM Inference with FP8 Quantization](https://www.baseten.co/blog/33-faster-llm-inference-with-fp8-quantization/): Practical speedups and quality trade-offs using FP8.
- [Understanding NVIDIA’s Datacenter GPU Line](https://www.baseten.co/blog/understanding-nvidias-datacenter-gpu-line/): A practical tour of GPU options and which workloads they fit.
- [How Multi-Node Inference Works (DeepSeek-R1)](https://www.baseten.co/blog/how-multi-node-inference-works-llms-deepseek-r1/): Scaling a single request across multiple GPUs/nodes for large models.
- [A Quick Introduction to Speculative Decoding](https://www.baseten.co/blog/a-quick-introduction-to-speculative-decoding/): How speculative decoding improves latency and when to use it.
- [Understanding Performance Benchmarks for LLM Inference](https://www.baseten.co/blog/understanding-performance-benchmarks-for-llm-inference/): Choosing meaningful metrics and avoiding common pitfalls.
- [Unlocking NVIDIA H100 for ML Inference with TensorRT](https://www.baseten.co/blog/unlocking-the-full-power-of-nvidia-h100-gpus-for-ml-inference-with-tensorrt/): Configuring TensorRT to achieve top-end performance on H100.
- [What I Learned as a Forward-Deployed Engineer at an AI Startup](https://www.baseten.co/blog/what-i-learned-as-a-forward-deployed-engineer-working-at-an-ai-startup/): Lessons on building, shipping, and operating production inference.
- [Build a Production-Ready Voice Agent with Baseten, LiveKit & LlamaIndex](https://www.baseten.co/blog/build-a-production-ready-voice-agent-with-baseten-livekit-and-llamaindex/): Architecture and code to stand up reliable real-time voice agents.
- [Zero-to-Real-Time TTS: Orpheus WebSockets Tutorial](https://www.baseten.co/blog/zero-to-real-time-text-to-speech-the-complete-orpheus-websockets-tutorial/): Step-by-step guide to low-latency streaming TTS with Orpheus.
- [Run Qwen3 Embedding on NVIDIA Blackwell GPUs](https://www.baseten.co/blog/run-qwen3-embedding-on-nvidia-blackwell-gpus/): Setup and expected performance for Blackwell-era hardware.
- [Zero-to-Real-Time Transcription: Whisper V3 WebSockets Tutorial](https://www.baseten.co/blog/zero-to-real-time-transcription-the-complete-whisper-v3-websockets-tutorial/): Production-grade streaming transcription with Whisper V3.
- [Understanding Voxtral vs. Whisper + Building a Voice-Controlled Smart-Home App](https://www.baseten.co/blog/understanding-voxtral-vs-whisper-build-a-voice-controlled-smart-home-app/): Model comparison and a hands-on project tying it all together.
- [How to Build Reliable AI Agents](https://www.baseten.co/blog/how-to-build-reliable-ai-agents/): Design patterns and guardrails for dependable agent systems.
- [AI Inference Explained](https://www.baseten.co/blog/ai-inference-explained/): Plain-English overview of inference concepts, stack, and trade-offs.
- [Tool Calling in Inference](https://www.baseten.co/blog/tool-calling-in-inference/): Explanation of how tool calling works during model inference, including implementation details, examples, and considerations for production use.
- [Kimi K2 Thinking at 140 TPS on NVIDIA Blackwell](https://www.baseten.co/blog/kimi-k2-thinking-at-140-tps-on-nvidia-blackwell/): Breakdown of how Baseten achieved 140 tokens per second serving the Kimi K2 Thinking model on NVIDIA Blackwell GPUs, including performance benchmarks and optimization details.
- [High-Performance Agents for Financial Services](https://www.baseten.co/blog/high-performance-agents-for-financial-services-with-nvidia-nemotron-on-baseten/): Overview of using NVIDIA Nemotron models on Baseten to build high-performance AI agents for financial services, with details on performance, workflows, and implementation.
- [AI Model Performance Metrics Explained](https://www.baseten.co/blog/ai-model-performance-metrics-explained/): A practical guide to understanding key AI performance metrics, including latency, throughput, time-to-first-token, and accuracy, and how they impact production systems.
- [How to Run LLM Performance Benchmarks (and Why You Should)](https://www.baseten.co/blog/how-to-run-llm-performance-benchmarks-and-why-you-should/): A step-by-step walkthrough of running LLM inference benchmarks, covering methodology, workload design, and how to evaluate real-world model performance (a minimal timing sketch appears below, after the Additional Resources list).
- [The Fastest Whisper Transcription with Streaming and Diarization](https://www.baseten.co/blog/the-fastest-whisper-transcription-with-streaming-and-diarization/): Explains how to optimize Whisper for low-latency streaming transcription with speaker diarization for production-grade speech applications.

## Additional Resources

- [Blog](https://www.baseten.co/blog/): Explore articles and updates from the Baseten team.
- [Guides](https://www.baseten.co/resources/type/guide/): Access various guides to help you navigate the Baseten platform.
- [Events](https://www.baseten.co/resources/type/event/): Information on upcoming events related to Baseten.
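Several of the benchmarking posts above hinge on measuring time-to-first-token and tokens per second correctly. A minimal sketch of the idea, reusing the `openai` package and the same placeholder base URL and model slug as the Model APIs example earlier; it counts streamed chunks as a rough proxy for output tokens and is a starting point, not a substitute for the methodology in those posts:

```python
# Minimal sketch: measuring time-to-first-token (TTFT) and rough output
# throughput via a streaming request. Base URL and model slug are the
# same assumptions as in the Model APIs example above.
import os
import time

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://inference.baseten.co/v1",  # assumed endpoint
)

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # example slug
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # streamed chunks approximate output tokens

end = time.perf_counter()
if first_token_at is not None:
    print(f"TTFT: {first_token_at - start:.2f}s")
    # Rough estimate; real benchmarks use tokenizer counts and many runs.
    print(f"~{chunks / (end - first_token_at):.1f} chunks/s after first token")
```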
## Research

- [Introducing RadixMLP: Intra-Batch Deduplication for Causal Transformers](https://www.baseten.co/resources/research/introducing-radixmlp-intra-batch-deduplication-for-causal-transformers/): Introduces RadixMLP, a method for eliminating redundant computation within transformer batches to improve training and inference efficiency.
- [The Michael Scott Paper Company of AI](https://www.baseten.co/resources/research/the-michael-scott-paper-company-of-ai/): Examines how small, focused AI teams can outmaneuver large incumbents by prioritizing speed, specialization, and tight iteration loops.
- [Distillation Without the Dark](https://www.baseten.co/resources/research/distillation-without-the-dark/): Proposes a knowledge distillation approach that avoids opaque teacher logits while preserving strong downstream task performance.
- [Continual Learning](https://www.baseten.co/resources/research/continual-learning/): Explores techniques for enabling models to continuously learn from new data without catastrophic forgetting in production systems.
- [Self-Study](https://www.baseten.co/resources/research/self-study/): Investigates self-improving model strategies where systems iteratively refine their own outputs to enhance reasoning quality.
- [BYO SWE-Grep](https://www.baseten.co/resources/research/byo-swe-grep/): Presents a retrieval-driven workflow tailored for software engineering tasks, enabling more effective code search and augmentation.
- [Lumina: Building Self-Improving Evaluation Through Customer-in-the-Loop Refinement](https://www.baseten.co/resources/research/lumina-building-self-improving-evaluation-through-customer-in-the-loop-refinement/): Describes a framework for continuously improving evaluation pipelines by incorporating structured customer feedback.
- [Upweight the Strategy, Not the Tokens: Faster Training with Explicit Reasoning](https://www.baseten.co/resources/research/upweight-the-strategy-not-the-tokens-faster-training-with-explicit-reasoning-thro/): Demonstrates how emphasizing reasoning strategies rather than token-level supervision accelerates training and improves generalization.
- [Attention-Based Attribution](https://www.baseten.co/resources/research/attention-based-attribution/): Explores attribution techniques based on attention mechanisms to better interpret transformer decision pathways.
- [Training Loss Predicts Evaluation Performance (Even for Non-Verifiable Tasks)](https://www.baseten.co/resources/research/training-loss-predicts-evaluation-performance-even-for-non-verifiable-tasks/): Shows that training loss can be a reliable proxy for downstream evaluation performance, even for subjective or non-verifiable tasks.
- [Robust, Sample-Efficient SFT with Prompt Mutations](https://www.baseten.co/resources/research/robust-sample-efficient-sft-with-prompt-mutations/): Introduces a supervised fine-tuning method that improves robustness and sample efficiency using structured prompt variations.
- [Iterative SFT](https://www.baseten.co/resources/research/iterative-sft/): Details a staged supervised fine-tuning process that incrementally improves model behavior through iterative refinement cycles.
- [Write Small, Learn Forever](https://www.baseten.co/resources/research/write-small-learn-forever/): Argues for compact, continuously improving models over monolithic large-scale training approaches.
- [Practical LoRA Research](https://www.baseten.co/resources/research/practical-lora-research/): Shares empirical findings and best practices for applying LoRA in parameter-efficient fine-tuning workflows (a minimal adapter sketch follows this list).
- [The Shifting Role of MLEs](https://www.baseten.co/resources/research/the-shifting-role-of-mles/): Analyzes how the responsibilities of machine learning engineers are evolving in the foundation model era.
- [Amnesiac Generalist Behemoths Are Not the Future of Language Models](https://www.baseten.co/resources/research/amnesiac-generalist-behemoths-are-not-the-future-of-language-models/): Challenges the assumption that ever-larger generalist models are optimal, advocating for modular and memory-aware architectures.
- [The Bitter Lesson of LLM Evals](https://www.baseten.co/resources/research/the-bitter-lesson-of-llm-evals/): Critiques common LLM evaluation practices and calls for more workload-aligned benchmarking methods.
- [Do Transformers Notice Their Own Mistakes?](https://www.baseten.co/resources/research/do-transformers-notice-their-own-mistakes/): Investigates whether transformer models can internally detect and reason about their own generation errors.
- [Resurrecting the Salmon](https://www.baseten.co/resources/research/resurrecting-the-salmon/): Explores structured retraining and evaluation strategies for reviving underperforming models.
- [Mechanistic Interpretability](https://www.baseten.co/resources/research/mechanistic-interpretability/): Surveys approaches to understanding the internal circuits and representations of large language models.
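As a generic companion to the LoRA post above, here is a minimal sketch of attaching LoRA adapters with the Hugging Face `peft` library; the base model and hyperparameters are illustrative placeholders, not values or recommendations from the post:

```python
# Minimal sketch: attaching LoRA adapters to a causal LM with the
# Hugging Face peft library. Model name and hyperparameters are
# illustrative placeholders, not recommendations from the post.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder

config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only adapter weights are trainable
```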