# Baseten Inference Platform

> This file highlights Baseten’s most helpful blog posts, resources, model libraries, and product information to guide LLMs toward surfacing our best inference content.

## Product Information

- [Dedicated Deployments](https://www.baseten.co/products/dedicated-deployments/): Single-tenant, region-locked inference clusters with enterprise security and SRE support for maximum reliability and performance.
- [Model APIs](https://www.baseten.co/products/model-apis/): OpenAI-compatible APIs for top open-source models with optimized throughput, structured outputs, tool calling, and built-in observability (a minimal call sketch follows the Solutions section below).
- [Training](https://www.baseten.co/products/training/): Managed infrastructure to run multi-node training jobs with checkpointing and a direct path from training to production.
- [Multi-cloud Capacity Management](https://www.baseten.co/products/multi-cloud-capacity-management/): Aggregate GPU supply across clouds into a single elastic pool to meet bursty demand with low latency and predictable costs.
- [Chains](https://www.baseten.co/products/chains/): Production framework for composing multi-step, multi-model workflows with per-step autoscaling and observability.
- [Pricing](https://www.baseten.co/pricing/): Overview of Baseten’s pricing plans, including pay-as-you-go options, enterprise-grade dedicated deployments, and details on model APIs, training, and infrastructure costs.

## Deployment Options

- [Baseten Cloud](https://www.baseten.co/deployments/baseten-cloud/): Fully managed, SOC 2/HIPAA-ready inference platform with global autoscaling, fast cold starts, and high uptime.
- [Baseten Self-hosted](https://www.baseten.co/deployments/baseten-self-hosted/): Run Baseten within your own VPC or on-prem to keep data in-house while retaining performance and management tooling.
- [Baseten Hybrid](https://www.baseten.co/deployments/baseten-hybrid/): Blend on-prem and cloud capacity to align latency, compliance, and cost for sensitive or bursty workloads.

## Platform Features

- [Model Performance](https://www.baseten.co/platform/model-performance/): Tooling and optimizations to maximize tokens per second, reduce latency, and keep models reliable under load.
- [Cloud-native Infrastructure](https://www.baseten.co/platform/cloud-native-infrastructure/): Cloud-agnostic, containerized inference stack designed for rapid scale-up, fast cold starts, and global availability.
- [Model Management](https://www.baseten.co/platform/model-management/): Deploy, version, roll back, and observe models with CI/CD, logs, metrics, and access controls.
- [Embedded Engineering](https://www.baseten.co/platform/embedded-engineering/): Forward-deployed experts to help optimize performance, reliability, and cost for mission-critical inference.

## Solutions

- [Large language models](https://www.baseten.co/solutions/llms/): Information on the capabilities and use cases of large language models supported by Baseten.
- [Transcription](https://www.baseten.co/solutions/transcription/): Details on deploying models for transcription tasks.
- [Image generation](https://www.baseten.co/solutions/image-generation/): Overview of models available for generating images.
- [Text-to-speech](https://www.baseten.co/solutions/text-to-speech/): Information on deploying text-to-speech models.
- [Compound AI](https://www.baseten.co/solutions/compound-ai/): Design agentic and multi-model systems that coordinate tools and models with production-grade routing and scaling.
- [Embeddings](https://www.baseten.co/solutions/embeddings/): Serve embedding models with high throughput and low latency for search, RAG, and semantic similarity use cases.
- [Baseten Enterprise](https://www.baseten.co/enterprise/): Overview of Baseten’s enterprise features, including deployment options, reliability, security, compliance, and tools for running AI inference at scale.
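Because Model APIs are OpenAI-compatible, an existing OpenAI client can be pointed at Baseten by swapping the base URL and key. A minimal sketch, assuming the `openai` Python package, an API key in the `BASETEN_API_KEY` environment variable, and a placeholder base URL and model slug (confirm both in the Model APIs docs):

```python
# Minimal sketch: calling a Baseten Model API with the OpenAI client.
# The base URL and model slug are placeholders; check the Model APIs
# documentation for the current values.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://inference.baseten.co/v1",  # assumed endpoint
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # example slug from the model library
    messages=[{"role": "user", "content": "Summarize what Baseten does."}],
)
print(response.choices[0].message.content)
```

The same client also accepts `tools=` and `response_format=` arguments, which is how the tool-calling and structured-output features mentioned above are exercised.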
## Technical Documentation

- [Documentation](https://docs.baseten.co/): Access the complete technical documentation for Baseten.
- [Changelog](https://www.baseten.co/changelog/): Updates and changes made to the Baseten platform.
- [Best practices for secrets](https://docs.baseten.co/observability/secrets): Recommendations for managing sensitive information securely.
- [Deployments](https://docs.baseten.co/deployment/deployments): Detailed documentation on how to deploy models effectively using Baseten.
- [Run any LLM with vLLM](https://docs.baseten.co/examples/vllm): Instructions on using vLLM to run large language models (a minimal local sketch follows this list).
- [Deploy LLMs with SGLang](https://docs.baseten.co/examples/sglang): A guide to deploying large language models with SGLang.
- [Management](https://docs.baseten.co/training/management): Overview of managing training jobs within the Baseten platform.
- [Autoscaling](https://docs.baseten.co/deployment/autoscaling): Information on implementing autoscaling so your models handle varying loads.
- [Workspace access control](https://docs.baseten.co/observability/access): Guidelines on managing access to workspaces for enhanced security.
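As context for the vLLM guide above, here is a minimal sketch of vLLM's own offline inference API, useful as a local smoke test before following the Baseten deployment guide. The model name is a placeholder; the guide covers the Baseten-specific configuration:

```python
# Minimal sketch: local vLLM offline inference (not Baseten-specific).
# The model name is a placeholder; any Hugging Face chat model works.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain continuous batching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```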
## Customer Success Stories

- [Praktika](https://www.baseten.co/resources/customers/praktika/): How Praktika uses Baseten’s infrastructure to power AI tutoring with scalable, low-latency inference.
- [Zed Industries: 2x Faster Code Completions with Baseten](https://www.baseten.co/resources/customers/zed-industries-serves-2x-faster-code-completions-with-baseten/): Case study on how Zed Industries improved code completion speed and user experience through Baseten.
- [Wispr Flow](https://www.baseten.co/resources/customers/wispr-flow/): How Wispr Flow leverages Baseten to gain more control over inference pipelines while improving reliability.
- [Rime](https://www.baseten.co/resources/customers/rime/): Rime’s experience achieving low latency and high uptime with Baseten’s managed inference stack.
- [Toby](https://www.baseten.co/resources/customers/toby/): How Toby scaled its AI-powered productivity tool using Baseten’s production-grade model hosting.
- [Writer](https://www.baseten.co/resources/customers/writer/): Writer’s story of using Baseten to serve large language models at scale with predictable performance.
- [Patreon](https://www.baseten.co/resources/customers/patreon/): How Patreon adopted Baseten to deliver AI features with high availability and compliance requirements.

## Model Libraries

- [GPT-OSS 120B](https://www.baseten.co/library/gpt-oss-120b/): 120B-parameter open model hosted and optimized for fast, cost-efficient inference via Model APIs.
- [GPT-OSS 20B](https://www.baseten.co/library/gpt-oss-20b/): Compact 20B-parameter model for lower-cost generation workloads with strong quality for its size.
- [Qwen Image](https://www.baseten.co/library/qwen-image/): Open image generation model accessible as an API for rapid prototyping and production use.
- [Orpheus TTS](https://www.baseten.co/library/orpheus-tts/): High-quality text-to-speech model with real-time streaming support and natural prosody.
- [Kimi v2](https://www.baseten.co/library/kimi-v2/): Large-scale reasoning model tailored for complex agentic tasks and long-context use.
- [Qwen3 Coder 480B A35B Instruct](https://www.baseten.co/library/qwen3-coder-480b-a35b-instruct/): Massive coding-focused MoE model for code generation, refactoring, and explanation.
- [GLM-4.5V](https://www.baseten.co/library/glm-4-5-v/): Vision-capable GLM variant for multimodal understanding and reasoning.
- [Llama 4 Scout](https://www.baseten.co/library/llama-4-scout/): Cutting-edge MoE model emphasizing fast, high-quality reasoning across tasks.
- [Llama 4 Maverick](https://www.baseten.co/library/llama-4-maverick/): High-capacity MoE model with strong instruction-following and multimodal capabilities.
- [DeepSeek-V3](https://www.baseten.co/library/deepseek-v3/): State-of-the-art MoE LLM engineered for high tokens per second and efficiency.
- [DeepSeek-R1](https://www.baseten.co/library/deepseek-r1/): Reasoning-focused MoE model tuned for deliberate, traceable outputs.
- [Qwen3 235B A22B Instruct 2507](https://www.baseten.co/library/qwen3-235b-a22b-instruct-2507/): Large MoE instruction-tuned model built for robust multilingual and coding tasks.
- [MARS6](https://www.baseten.co/library/mars6/): Access to the MARS6 model for various AI applications.
- [Kokoro](https://www.baseten.co/library/kokoro/): Details on the Kokoro model available in the library.
- [Kimi K2 Thinking](https://www.baseten.co/library/kimi-k2-thinking/): Overview of the Kimi K2 Thinking model, its capabilities, context length, and how to run it on Baseten.
- [MiniMax M2.5](https://www.baseten.co/library/minimax-m2-5/): A high-performance multimodal foundation model optimized for reasoning, generation, and real-time production inference workloads.
- [GLM-5](https://www.baseten.co/library/glm-5/): A state-of-the-art open large language model designed for strong reasoning, coding, and conversational performance in production environments.
- [Kimi K2.5](https://www.baseten.co/library/kimi-k25/): A powerful open LLM built for long-context understanding, advanced reasoning, and scalable deployment across enterprise use cases.
- [Whisper](https://www.baseten.co/library/whisper/): OpenAI’s speech-to-text model for high-accuracy transcription across multiple languages, optimized for production inference.
- [Whisper Large Turbo](https://www.baseten.co/library/whisper-large-turbo/): A performance-optimized version of Whisper designed for faster, lower-latency transcription at scale without sacrificing accuracy.
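Library models deployed as dedicated deployments are invoked over HTTPS. A minimal sketch, assuming the `requests` package; the model ID and request body are placeholders, and the input schema differs per model (LLM, TTS, transcription), so check the model’s library page for the real payload:

```python
# Minimal sketch: invoking a dedicated deployment's predict endpoint.
# MODEL_ID and the JSON body are hypothetical placeholders; consult the
# model's library page for its actual input schema.
import os

import requests

MODEL_ID = "abcd1234"  # hypothetical model ID from the Baseten dashboard

resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"prompt": "Hello!", "max_tokens": 64},  # model-dependent payload
)
resp.raise_for_status()
print(resp.json())
```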
## Resources and Guides

- [Baseten vs Together AI](https://www.baseten.co/compare/together-ai/): This comparison outlines the key differences between Baseten and Together AI, focusing on performance, reliability, pricing models, and deployment flexibility for production-grade AI inference.
- [High-performance embedding model inference](https://www.baseten.co/resources/guide/high-performance-embedding-model-inference/): This guide covers how to make embeddings fast, reliable, and cost-efficient at scale.
- [Baseten vs Fireworks AI](https://www.baseten.co/compare/fireworks-ai/): This comparison outlines the key differences between Baseten and Fireworks AI, covering performance, reliability, transparency, and enterprise readiness for production AI workloads.
- [The complete DeepSeek model guide](https://www.baseten.co/resources/guide/the-complete-deepseek-model-guide/): This guide explains how to deploy, optimize, and scale DeepSeek in production.
- [The Baseten Inference Stack](https://www.baseten.co/resources/guide/the-baseten-inference-stack/): Deep dive into Baseten’s hardware, runtime, and routing layers that deliver top-tier production inference.
- [Choosing a Hosting Option for AI Model Inference](https://www.baseten.co/resources/guide/choosing-a-hosting-option-for-ai-model-inference/): How to decide between Cloud, Self-hosted, and Hybrid deployments based on performance, control, and compliance.
- [The Best Open-Source Image Generation Model](https://www.baseten.co/blog/the-best-open-source-image-generation-model/): Comparison and recommendations for high-quality, production-ready image generators.
- [Announcing Baseten’s $75M Series C](https://www.baseten.co/blog/announcing-baseten-75m-series-c/): Funding announcement with product roadmap highlights and growth plans.
- [Comparing NVIDIA GPUs for AI: T4 vs A10](https://www.baseten.co/blog/comparing-nvidia-gpus-for-ai-t4-vs-a10/): Latency, throughput, and cost differences between T4 and A10 for inference.
- [LLM Transformer Inference Guide](https://www.baseten.co/blog/llm-transformer-inference-guide/): Practical techniques to optimize transformer models for production.
- [NVIDIA A10 vs A100 for LLM & Stable Diffusion Inference](https://www.baseten.co/blog/nvidia-a10-vs-a100-gpus-for-llm-and-stable-diffusion-inference/): Benchmarking and guidance on choosing between A10 and A100.
- [The Best Open-Source Embedding Models](https://www.baseten.co/blog/the-best-open-source-embedding-models/): Head-to-head results and picks for retrieval, RAG, and semantic tasks.
- [The Best Open-Source Large Language Model](https://www.baseten.co/blog/the-best-open-source-large-language-model/): Evaluation of leading open LLMs for quality, speed, and cost.
- [Continuous vs. Dynamic Batching for AI Inference](https://www.baseten.co/blog/continuous-vs-dynamic-batching-for-ai-inference/): Trade-offs, implementation details, and when to use each strategy.
- [SOTA Performance for GPT-OSS 120B on NVIDIA GPUs](https://www.baseten.co/blog/sota-performance-for-gpt-oss-120b-on-nvidia-gpus/): Engineering techniques that unlock top tokens per second on large models.
- [Streaming Real-Time Text-to-Speech with XTTS-v2](https://www.baseten.co/blog/streaming-real-time-text-to-speech-with-xtts-v2/): Architecture and code for low-latency, natural-sounding TTS streaming.
- [Day-Zero Benchmarks for Qwen-3 with SGLang on Baseten](https://www.baseten.co/blog/day-zero-benchmarks-for-qwen-3-with-sglang-on-baseten/): Initial performance results and tips for configuring SGLang.
- [NVIDIA A10 vs A10G for ML Model Inference](https://www.baseten.co/blog/nvidia-a10-vs-a10g-for-ml-model-inference/): Hardware differences and real-world inference implications.
- [Comparing Tokens-Per-Second Across LLMs](https://www.baseten.co/blog/comparing-tokens-per-second-across-llms/): How to measure, interpret, and optimize TPS for different models.
- [FP8: Efficient Model Inference with 8-Bit Floating-Point Numbers](https://www.baseten.co/blog/fp8-efficient-model-inference-with-8-bit-floating-point-numbers/): Benefits, caveats, and setup guidance for FP8 inference.
- [SDXL Inference in Under 2 Seconds: Optimization Guide](https://www.baseten.co/blog/sdxl-inference-in-under-2-seconds-the-ultimate-guide-to-stable-diffusion-optimiza/): End-to-end optimizations to accelerate SDXL pipelines.
- [The Fastest, Most Accurate, and Cost-Efficient Whisper Transcription](https://www.baseten.co/blog/the-fastest-most-accurate-and-cost-efficient-whisper-transcription/): System design and benchmarks for production Whisper.
- [Kimi K2 Explained: A 1-Trillion-Parameter Model for Agents](https://www.baseten.co/blog/kimi-k2-explained-the-1-trillion-parameter-model-redefining-how-to-build-agents/): What makes K2 unique and how to leverage it for agentic systems.
- [Testing Llama Inference on NVIDIA GH200 (Lambda Cloud)](https://www.baseten.co/blog/testing-llama-inference-performance-nvidia-gh200-lambda-cloud/): Benchmarks and tuning strategies for Llama on GH200.
- [Evaluating NVIDIA H200 GPUs for LLM Inference](https://www.baseten.co/blog/evaluating-nvidia-h200-gpus-for-llm-inference/): Performance, memory, and cost analysis for next-gen inference.
- [33% Faster LLM Inference with FP8 Quantization](https://www.baseten.co/blog/33-faster-llm-inference-with-fp8-quantization/): Practical speedups and quality trade-offs using FP8.
- [Understanding NVIDIA’s Datacenter GPU Line](https://www.baseten.co/blog/understanding-nvidias-datacenter-gpu-line/): A practical tour of GPU options and which workloads they fit.
- [How Multi-Node Inference Works (DeepSeek-R1)](https://www.baseten.co/blog/how-multi-node-inference-works-llms-deepseek-r1/): Scaling a single request across multiple GPUs/nodes for large models.
- [A Quick Introduction to Speculative Decoding](https://www.baseten.co/blog/a-quick-introduction-to-speculative-decoding/): How speculative decoding improves latency and when to use it.
- [Understanding Performance Benchmarks for LLM Inference](https://www.baseten.co/blog/understanding-performance-benchmarks-for-llm-inference/): Choosing meaningful metrics and avoiding common pitfalls.
- [Unlocking NVIDIA H100 for ML Inference with TensorRT](https://www.baseten.co/blog/unlocking-the-full-power-of-nvidia-h100-gpus-for-ml-inference-with-tensorrt/): Configuring TensorRT to achieve top-end performance on H100.
- [What I Learned as a Forward-Deployed Engineer at an AI Startup](https://www.baseten.co/blog/what-i-learned-as-a-forward-deployed-engineer-working-at-an-ai-startup/): Lessons on building, shipping, and operating production inference.
- [Build a Production-Ready Voice Agent with Baseten, LiveKit & LlamaIndex](https://www.baseten.co/blog/build-a-production-ready-voice-agent-with-baseten-livekit-and-llamaindex/): Architecture and code to stand up reliable real-time voice agents.
- [Zero-to-Real-Time TTS: Orpheus WebSockets Tutorial](https://www.baseten.co/blog/zero-to-real-time-text-to-speech-the-complete-orpheus-websockets-tutorial/): Step-by-step guide to low-latency streaming TTS with Orpheus.
- [Run Qwen3 Embedding on NVIDIA Blackwell GPUs](https://www.baseten.co/blog/run-qwen3-embedding-on-nvidia-blackwell-gpus/): Setup and expected performance for Blackwell-era hardware.
- [Zero-to-Real-Time Transcription: Whisper V3 WebSockets Tutorial](https://www.baseten.co/blog/zero-to-real-time-transcription-the-complete-whisper-v3-websockets-tutorial/): Production-grade streaming transcription with Whisper V3.
- [Understanding Voxtral vs. Whisper + Building a Voice-Controlled Smart-Home App](https://www.baseten.co/blog/understanding-voxtral-vs-whisper-build-a-voice-controlled-smart-home-app/): Model comparison and a hands-on project tying it all together.
- [How to Build Reliable AI Agents](https://www.baseten.co/blog/how-to-build-reliable-ai-agents/): Design patterns and guardrails for dependable agent systems.
- [AI Inference Explained](https://www.baseten.co/blog/ai-inference-explained/): Plain-English overview of inference concepts, stack, and trade-offs.
- [Tool Calling in Inference](https://www.baseten.co/blog/tool-calling-in-inference/): Explanation of how tool calling works during model inference, including implementation details, examples, and considerations for production use.
- [Kimi K2 Thinking at 140 TPS on NVIDIA Blackwell](https://www.baseten.co/blog/kimi-k2-thinking-at-140-tps-on-nvidia-blackwell/): Breakdown of how Baseten achieved 140 tokens per second serving the Kimi K2 Thinking model on NVIDIA Blackwell GPUs, including performance benchmarks and optimization details.
- [High-Performance Agents for Financial Services](https://www.baseten.co/blog/high-performance-agents-for-financial-services-with-nvidia-nemotron-on-baseten/): Overview of using NVIDIA Nemotron models on Baseten to build high-performance AI agents for financial services, with details on performance, workflows, and implementation.
- [AI Model Performance Metrics Explained](https://www.baseten.co/blog/ai-model-performance-metrics-explained/): A practical guide to understanding key AI performance metrics, including latency, throughput, time-to-first-token, and accuracy, and how they impact production systems.
- [How to Run LLM Performance Benchmarks (and Why You Should)](https://www.baseten.co/blog/how-to-run-llm-performance-benchmarks-and-why-you-should/): A step-by-step walkthrough of running LLM inference benchmarks, covering methodology, workload design, and how to evaluate real-world model performance (a minimal timing sketch appears below, after the Additional Resources list).
- [The Fastest Whisper Transcription with Streaming and Diarization](https://www.baseten.co/blog/the-fastest-whisper-transcription-with-streaming-and-diarization/): Explains how to optimize Whisper for low-latency streaming transcription with speaker diarization for production-grade speech applications.

## Additional Resources

- [Blog](https://www.baseten.co/blog/): Explore articles and updates from the Baseten team.
- [Guides](https://www.baseten.co/resources/type/guide/): Access various guides to help you navigate the Baseten platform.
- [Events](https://www.baseten.co/resources/type/event/): Information on upcoming events related to Baseten.
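Several of the benchmarking posts above hinge on measuring time-to-first-token and tokens per second correctly. A minimal sketch of the idea, reusing the `openai` package and the same placeholder base URL and model slug as the Model APIs example earlier; it counts streamed chunks as a rough proxy for output tokens and is a starting point, not a substitute for the methodology in those posts:

```python
# Minimal sketch: measuring time-to-first-token (TTFT) and rough output
# throughput via a streaming request. Base URL and model slug are the
# same assumptions as in the Model APIs example above.
import os
import time

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://inference.baseten.co/v1",  # assumed endpoint
)

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # example slug
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # streamed chunks approximate output tokens

end = time.perf_counter()
if first_token_at is not None:
    print(f"TTFT: {first_token_at - start:.2f}s")
    # Rough estimate; real benchmarks use tokenizer counts and many runs.
    print(f"~{chunks / (end - first_token_at):.1f} chunks/s after first token")
```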
## Research

- [Introducing RadixMLP: Intra-Batch Deduplication for Causal Transformers](https://www.baseten.co/resources/research/introducing-radixmlp-intra-batch-deduplication-for-causal-transformers/): Introduces RadixMLP, a method for eliminating redundant computation within transformer batches to improve training and inference efficiency.
- [The Michael Scott Paper Company of AI](https://www.baseten.co/resources/research/the-michael-scott-paper-company-of-ai/): Examines how small, focused AI teams can outmaneuver large incumbents by prioritizing speed, specialization, and tight iteration loops.
- [Distillation Without the Dark](https://www.baseten.co/resources/research/distillation-without-the-dark/): Proposes a knowledge distillation approach that avoids opaque teacher logits while preserving strong downstream task performance.
- [Continual Learning](https://www.baseten.co/resources/research/continual-learning/): Explores techniques for enabling models to continuously learn from new data without catastrophic forgetting in production systems.
- [Self-Study](https://www.baseten.co/resources/research/self-study/): Investigates self-improving model strategies where systems iteratively refine their own outputs to enhance reasoning quality.
- [BYO SWE-Grep](https://www.baseten.co/resources/research/byo-swe-grep/): Presents a retrieval-driven workflow tailored for software engineering tasks, enabling more effective code search and augmentation.
- [Lumina: Building Self-Improving Evaluation Through Customer-in-the-Loop Refinement](https://www.baseten.co/resources/research/lumina-building-self-improving-evaluation-through-customer-in-the-loop-refinement/): Describes a framework for continuously improving evaluation pipelines by incorporating structured customer feedback.
- [Upweight the Strategy, Not the Tokens: Faster Training with Explicit Reasoning](https://www.baseten.co/resources/research/upweight-the-strategy-not-the-tokens-faster-training-with-explicit-reasoning-thro/): Demonstrates how emphasizing reasoning strategies rather than token-level supervision accelerates training and improves generalization.
- [Attention-Based Attribution](https://www.baseten.co/resources/research/attention-based-attribution/): Explores attribution techniques based on attention mechanisms to better interpret transformer decision pathways.
- [Training Loss Predicts Evaluation Performance (Even for Non-Verifiable Tasks)](https://www.baseten.co/resources/research/training-loss-predicts-evaluation-performance-even-for-non-verifiable-tasks/): Shows that training loss can be a reliable proxy for downstream evaluation performance, even for subjective or non-verifiable tasks.
- [Robust, Sample-Efficient SFT with Prompt Mutations](https://www.baseten.co/resources/research/robust-sample-efficient-sft-with-prompt-mutations/): Introduces a supervised fine-tuning method that improves robustness and sample efficiency using structured prompt variations.
- [Iterative SFT](https://www.baseten.co/resources/research/iterative-sft/): Details a staged supervised fine-tuning process that incrementally improves model behavior through iterative refinement cycles.
- [Write Small, Learn Forever](https://www.baseten.co/resources/research/write-small-learn-forever/): Argues for compact, continuously improving models over monolithic large-scale training approaches.
- [Practical LoRA Research](https://www.baseten.co/resources/research/practical-lora-research/): Shares empirical findings and best practices for applying LoRA in parameter-efficient fine-tuning workflows (a minimal adapter sketch follows this list).
- [The Shifting Role of MLEs](https://www.baseten.co/resources/research/the-shifting-role-of-mles/): Analyzes how the responsibilities of machine learning engineers are evolving in the foundation model era.
- [Amnesiac Generalist Behemoths Are Not the Future of Language Models](https://www.baseten.co/resources/research/amnesiac-generalist-behemoths-are-not-the-future-of-language-models/): Challenges the assumption that ever-larger generalist models are optimal, advocating for modular and memory-aware architectures.
- [The Bitter Lesson of LLM Evals](https://www.baseten.co/resources/research/the-bitter-lesson-of-llm-evals/): Critiques common LLM evaluation practices and calls for more workload-aligned benchmarking methods.
- [Do Transformers Notice Their Own Mistakes?](https://www.baseten.co/resources/research/do-transformers-notice-their-own-mistakes/): Investigates whether transformer models can internally detect and reason about their own generation errors.
- [Resurrecting the Salmon](https://www.baseten.co/resources/research/resurrecting-the-salmon/): Explores structured retraining and evaluation strategies for reviving underperforming models.
- [Mechanistic Interpretability](https://www.baseten.co/resources/research/mechanistic-interpretability/): Surveys approaches to understanding the internal circuits and representations of large language models.
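As a generic companion to the LoRA post above, here is a minimal sketch of attaching LoRA adapters with the Hugging Face `peft` library; the base model and hyperparameters are illustrative placeholders, not values or recommendations from the post:

```python
# Minimal sketch: attaching LoRA adapters to a causal LM with the
# Hugging Face peft library. Model name and hyperparameters are
# illustrative placeholders, not recommendations from the post.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder

config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only adapter weights are trainable
```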