Baseten Blog | Page 4
Comparing tokens per second across LLMs
To accurately compare tokens per second between different large language models, we need to adjust for tokenizer efficiency.
New in April 2024
Use four new best-in-class LLMs, stream synthesized speech with XTTS, and deploy models with CI/CD.
CI/CD for AI model deployments
In this article, we outline a continuous integration and continuous deployment (CI/CD) pipeline for using AI models in production.
Streaming real-time text to speech with XTTS V2
In this tutorial, we'll build a streaming endpoint for the XTTS V2 text-to-speech model with real-time narration and a 200 ms time to first chunk.
Continuous vs dynamic batching for AI inference
Learn how continuous and dynamic batching increase throughput during model inference with minimal impact on latency.
New in March 2024
Fast Mistral 7B, fractional H100 GPUs, FP8 quantization, and API endpoints for model management.
Using fractional H100 GPUs for efficient model serving
Multi-Instance GPUs enable splitting a single H100 GPU across two model serving instances for performance that matches or beats an A100 GPU at a 20% lower cost.
Benchmarking fast Mistral 7B inference
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.
33% faster LLM inference with FP8 quantization
Quantizing open-source LLMs to FP8 resulted in a near-zero increase in perplexity and yielded material performance improvements across latency, throughput, and cost.