Software Engineer
Machine learning infrastructure that just works
Baseten provides all the infrastructure you need to deploy and serve ML models performantly, scalably, and cost-efficiently.
Our new Speculative Decoding integration can cut latency in half for production LLM workloads.
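For illustration, here is a minimal sketch of speculative decoding using Hugging Face Transformers' assisted generation, not our integration itself. The model names are assumptions; the draft model must share the target model's tokenizer.

```python
# A minimal sketch of speculative (assisted) decoding with Hugging Face
# Transformers: a small draft model proposes several tokens, which the
# large target model verifies in a single forward pass, so the output
# matches what the target model alone would produce.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed target model
draft_name = "meta-llama/Llama-3.2-1B-Instruct"   # assumed draft model (same tokenizer family)

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(
    target_name, torch_dtype=torch.float16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer(
    "Explain speculative decoding in one sentence.", return_tensors="pt"
).to(target.device)

# Passing `assistant_model` enables assisted generation: drafted tokens
# that the target model would also have produced are accepted in bulk,
# which is where the latency savings come from.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```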
We observe up to a 122% increase in tokens per second for Llama 3 after training custom Medusa heads and running the updated model with TensorRT-LLM.
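Conceptually, Medusa adds small residual heads on top of the base model's final hidden state, each guessing one additional future token for the base model to verify in a single pass. The sketch below is illustrative, not our training code; the hidden size and vocabulary size are assumed Llama 3 values.

```python
# A conceptual sketch of Medusa-style heads: each head predicts the token
# k+1 steps ahead from the base model's last hidden state.
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One residual-block head predicting a token further in the future."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the head close to the base model's
        # representation, which keeps training cheap and stable.
        return self.lm_head(hidden + self.act(self.proj(hidden)))

hidden_size, vocab_size, num_heads = 4096, 128256, 4  # assumed Llama 3 sizes
heads = nn.ModuleList(MedusaHead(hidden_size, vocab_size) for _ in range(num_heads))

# Stand-in for the base model's final hidden state at the current position.
last_hidden = torch.randn(1, 1, hidden_size)
# Each head proposes a candidate token for positions t+2, t+3, ... which
# the base model then accepts or rejects.
candidates = [head(last_hidden).argmax(-1) for head in heads]
```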
The TensorRT-LLM Engine Builder lets developers deploy highly efficient, performant inference servers for open-source and fine-tuned LLMs.
Running Mistral 7B in FP8 on H100 GPUs with TensorRT-LLM, we achieve best-in-class time to first token and tokens per second in independent benchmarks.
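As a rough illustration of what FP8 quantization does (not TensorRT-LLM's actual kernels), this sketch scales a weight tensor into the E4M3 range using PyTorch's float8 dtype and measures the rounding error; the tensor shape is an assumption.

```python
# Per-tensor FP8 (E4M3) weight quantization, sketched with PyTorch's
# float8 dtype. Hardware like the H100 runs matmuls natively in FP8.
import torch

FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8(w: torch.Tensor):
    # Scale the tensor so its largest magnitude maps onto the FP8 range,
    # then cast; the scale is kept for dequantization at matmul time.
    scale = w.abs().max() / FP8_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

w = torch.randn(4096, 4096, dtype=torch.float16)
w_fp8, scale = quantize_fp8(w)

# Dequantize to measure the rounding error introduced by the FP8 cast.
err = (w_fp8.to(torch.float16) * scale - w).abs().max()
print(f"max abs error: {err.item():.4f}")
```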
Quantizing ML models like LLMs makes it possible to run large models on less expensive GPUs, but it must be done carefully to avoid degrading output quality.
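A toy example of why care is needed: with naive per-tensor int8 quantization, a single outlier weight (common in LLMs) stretches the quantization scale and inflates the error for every other weight. The dimensions and outlier value below are illustrative.

```python
# Naive symmetric int8 quantization of a weight matrix, with output error
# measured against the full-precision layer. This is a toy illustration,
# not a production quantization recipe.
import torch

def int8_quantize(w: torch.Tensor):
    scale = w.abs().max() / 127.0                      # one scale per tensor
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

torch.manual_seed(0)
w = torch.randn(1024, 1024)
w[0, 0] = 50.0            # a single outlier weight, as often seen in LLMs
x = torch.randn(8, 1024)

q, scale = int8_quantize(w)
y_ref = x @ w.T                                        # full-precision output
y_quant = x @ (q.float() * scale).T                    # dequantized output
rel_err = (y_quant - y_ref).norm() / y_ref.norm()
print(f"relative output error: {rel_err.item():.2%}")
```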
This walkthrough details how to deploy Stability AI's open-source Stable Diffusion model on Baseten.
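The core of such a deployment is a Truss model class along the lines of the sketch below, which loads the `diffusers` pipeline once at startup and serves one image per request. The model ID and response format here are assumptions; the walkthrough's exact packaging may differ.

```python
# A sketch of a Truss model class for Stable Diffusion: `load` runs once
# when the server starts, `predict` runs on every request.
import base64
from io import BytesIO

import torch
from diffusers import StableDiffusionPipeline

class Model:
    def __init__(self, **kwargs):
        self._pipe = None

    def load(self):
        # Assumed model ID; loading here keeps request latency low.
        self._pipe = StableDiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-2-1",
            torch_dtype=torch.float16,
        ).to("cuda")

    def predict(self, model_input: dict) -> dict:
        image = self._pipe(model_input["prompt"]).images[0]
        buffer = BytesIO()
        image.save(buffer, format="PNG")
        # Return the image base64-encoded so the response is plain JSON.
        return {"image_b64": base64.b64encode(buffer.getvalue()).decode()}
```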