Product

Introducing our Speculative Decoding Engine Builder integration for ultra-low-latency LLM inference

Our new Speculative Decoding integration can cut latency in half for production LLM workloads.
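
For context on why the technique helps: speculative decoding has a small draft model propose several tokens cheaply, which the target model then verifies, accepting the longest matching prefix. Below is a minimal greedy sketch of the idea, not Baseten's implementation; `draft` and `target` are hypothetical stand-ins for the two models.

```python
# A minimal sketch of greedy speculative decoding, independent of any
# particular serving stack. `draft` and `target` are hypothetical
# callables, each mapping a token sequence to its greedy next token.

def speculative_decode(target, draft, prompt, max_new_tokens, k=4):
    """Let the cheap draft model propose k tokens, then keep only the
    prefix the target model agrees with."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposed, ctx = [], list(tokens)
        for _ in range(k):
            nxt = draft(ctx)
            proposed.append(nxt)
            ctx.append(nxt)
        # 2. Target model verifies the proposals. A real engine checks
        #    all k positions in one batched forward pass; we loop here
        #    for clarity.
        accepted = 0
        for i, tok in enumerate(proposed):
            if target(tokens + proposed[:i]) != tok:
                break
            accepted += 1
        tokens.extend(proposed[:accepted])
        # 3. Whether the draft was right or wrong, the target's own
        #    prediction at the first unverified position yields one
        #    more correct token for free.
        tokens.append(target(tokens))
    return tokens[: len(prompt) + max_new_tokens]
```

With greedy decoding the output is identical to running the target model alone; the latency win comes from the target verifying several tokens per forward pass instead of producing them one at a time.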

Model performance

How to double tokens per second for Llama 3 with Medusa

We observe up to a 122% increase in tokens per second for Llama 3 after training custom Medusa heads and serving the updated model with TensorRT-LLM.
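
Medusa sidesteps a separate draft model by attaching extra decoding heads to the base model, each predicting a token further ahead, with the predictions verified in a single pass. A rough sketch of what one such head can look like; the structure and sizes are illustrative (roughly Llama 3 8B-shaped), not the exact heads trained in the post.

```python
import torch
import torch.nn as nn

# Illustrative Medusa head: a small residual MLP on the base model's
# final hidden state with its own LM head, predicting the token k+1
# positions ahead. Sizes and structure here are assumptions.

class MedusaHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual block, then project to vocabulary logits.
        h = hidden_states + self.act(self.proj(hidden_states))
        return self.lm_head(h)

# One head per extra speculated position: the base LM head still
# predicts position +1, and the k-th Medusa head predicts position +1+k.
heads = nn.ModuleList([MedusaHead(4096, 128256) for _ in range(4)])
```

Because the heads share the base model's forward pass, speculation adds little extra compute; accepted tokens are verified much as in draft-model speculative decoding.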

News

Introducing automatic LLM optimization with TensorRT-LLM Engine Builder

The TensorRT-LLM Engine Builder lets developers deploy efficient, high-performance inference servers for open-source and fine-tuned LLMs.

Model performance

Benchmarking fast Mistral 7B inference

Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second on independent benchmarks.

Glossary

Introduction to quantizing ML models

Quantizing ML models like LLMs makes it possible to run large models on less expensive GPUs, but it must be done carefully to avoid degrading output quality.
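
As a toy illustration of the core idea, here is symmetric per-tensor INT8 weight quantization in PyTorch: weights stored as int8 plus one floating-point scale use half the memory of FP16 in exchange for a small rounding error. Production schemes (per-channel scales, FP8, activation quantization) are more careful, which is the "done carefully" part.

```python
import torch

# Toy symmetric per-tensor INT8 quantization: store the weights as int8
# plus one scale, and dequantize on the fly.

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0                    # map largest weight to 127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                          # one FP32 weight matrix
q, scale = quantize_int8(w)
print((dequantize(q, scale) - w).abs().mean())       # mean rounding error
```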

ML models

How to deploy Stable Diffusion using Truss

A step-by-step walkthrough of deploying Stability AI's open-source Stable Diffusion model on Baseten using Truss.
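
At a high level, a Truss packages model code behind a `Model` class with `load()` and `predict()` hooks that Truss wraps in a model server. A sketch of what that can look like for Stable Diffusion, assuming the Hugging Face `diffusers` pipeline; the model ID and base64 response encoding are illustrative choices, not prescribed ones.

```python
# model/model.py — a sketch of a Truss model class for Stable Diffusion,
# following Truss's load()/predict() convention.

import base64
from io import BytesIO

import torch
from diffusers import StableDiffusionPipeline


class Model:
    def __init__(self, **kwargs):
        self._pipeline = None

    def load(self):
        # Runs once at startup so the weights are resident on the GPU
        # before the first request arrives.
        self._pipeline = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        ).to("cuda")

    def predict(self, model_input: dict) -> dict:
        # Generate one image and return it as JSON-safe base64 text.
        image = self._pipeline(model_input["prompt"]).images[0]
        buffer = BytesIO()
        image.save(buffer, format="PNG")
        return {"image": base64.b64encode(buffer.getvalue()).decode()}
```

With an accompanying config.yaml declaring the Python requirements and GPU resources, deployment is typically a `truss push` away.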

Machine learning infrastructure that just works

Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.