33% faster LLM inference with FP8 quantization
Quantizing open-source LLMs to FP8 resulted in a near-zero increase in perplexity while yielding material performance improvements across latency, throughput, and cost.
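As a rough illustration of the idea (not the pipeline from the post), recent PyTorch builds expose the FP8 E4M3 dtype, so you can measure the round-trip error of casting FP16 weights to FP8 and back; the tensor shape here is hypothetical:

```python
import torch

# Hypothetical weight tensor standing in for one layer of an LLM.
w = torch.randn(1024, 1024, dtype=torch.float16)

# Round-trip through FP8 E4M3. Production FP8 quantization also applies
# per-tensor or per-channel scaling before the cast; this sketch skips that.
w_fp8 = w.to(torch.float8_e4m3fn)
w_back = w_fp8.to(torch.float16)

rel_err = (w.float() - w_back.float()).abs().mean() / w.float().abs().mean()
print(f"mean relative round-trip error: {rel_err.item():.4f}")
```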
FP8: Efficient model inference with 8-bit floating point numbers
The FP8 data format has an expanded dynamic range versus INT8, which allows weights and activations to be quantized for more LLMs without loss of output quality.
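To make the dynamic-range comparison concrete, here is a small sketch (assuming a PyTorch build with FP8 dtypes) that prints the representable ranges of FP8 E4M3 versus INT8:

```python
import torch

# FP8 E4M3: 1 sign bit, 4 exponent bits, 3 mantissa bits.
fp8 = torch.finfo(torch.float8_e4m3fn)
int8 = torch.iinfo(torch.int8)

print(f"FP8 E4M3 range: [{fp8.min}, {fp8.max}]")    # roughly [-448, 448]
print(f"FP8 E4M3 smallest normal: {fp8.tiny}")
print(f"INT8 range:     [{int8.min}, {int8.max}]")  # [-128, 127]

# FP8's non-uniform spacing concentrates precision near zero, where most
# weight and activation values live, while still covering outliers.
```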
40% faster Stable Diffusion XL inference with NVIDIA TensorRT
Using NVIDIA TensorRT to optimize each component of the SDXL pipeline, we improved SDXL inference latency by 40% and throughput by 70% on NVIDIA H100 GPUs.
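The post covers a full multi-component pipeline; as a minimal sketch of the core step, building a TensorRT engine from an ONNX export of one component looks roughly like this (the file names are hypothetical, and exact flags vary by TensorRT version):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Explicit-batch network definition (the default in recent TensorRT versions).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the ONNX export of one pipeline component, e.g. the UNet.
parser = trt.OnnxParser(network, logger)
with open("unet.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

# Build an FP16 engine and serialize it to disk for later inference.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine = builder.build_serialized_network(network, config)
with open("unet.plan", "wb") as f:
    f.write(engine)
```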
Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT
Double or triple throughput at the same or better latency by switching from A100 to H100 GPUs for model inference with TensorRT/TensorRT-LLM.
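A like-for-like comparison comes down to measuring latency and throughput on each GPU; a bare-bones harness might look like the sketch below (the linear layer is a stand-in workload, not a model from the post, and a CUDA-capable GPU is assumed):

```python
import time
import torch

# Stand-in workload; in practice this would be a TensorRT/TensorRT-LLM engine.
model = torch.nn.Linear(4096, 4096).half().cuda().eval()
batch = torch.randn(32, 4096, dtype=torch.float16, device="cuda")

# Warm up so one-time CUDA initialization doesn't skew the numbers.
with torch.no_grad():
    for _ in range(10):
        model(batch)
torch.cuda.synchronize()

iters = 100
start = time.perf_counter()
with torch.no_grad():
    for _ in range(iters):
        model(batch)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"latency:    {elapsed / iters * 1000:.2f} ms/batch")
print(f"throughput: {iters * batch.shape[0] / elapsed:.0f} samples/s")
```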
Faster Mixtral inference with TensorRT-LLM and quantization
Mixtral 8x7B is structurally faster at inference than the similarly powerful Llama 2 70B thanks to its mixture-of-experts architecture, but we can make it even faster using TensorRT-LLM and int8 quantization.
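As a toy illustration of the int8 idea (a sketch of symmetric per-channel weight-only quantization, not TensorRT-LLM's internal implementation; the matrix shape is hypothetical):

```python
import torch

def quantize_int8_per_channel(w: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a weight matrix."""
    # One scale per output row so rows with small weights keep precision.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

# Hypothetical weight matrix standing in for one expert's projection.
w = torch.randn(4096, 14336)
q, scale = quantize_int8_per_channel(w)
err = (w - dequantize(q, scale)).abs().mean() / w.abs().mean()
print(f"int8 storage: {q.numel()} bytes vs {w.numel() * 4} bytes fp32")
print(f"mean relative error: {err.item():.4f}")
```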
Technical deep dive: Truss live reload
Truss's live reload feature transforms iterative development, turning the lengthy 3-30 minute model deployment loop into a near-instant step.
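Conceptually (this is a sketch of the pattern, not Truss's actual implementation), live reload swaps a full redeploy for patching changed files into the running server; the `my_truss` directory and `push_patch` helper below are hypothetical:

```python
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class PatchOnChange(FileSystemEventHandler):
    """Push only the changed file to the running server instead of redeploying."""

    def on_modified(self, event):
        if event.is_directory:
            return
        push_patch(Path(event.src_path))

def push_patch(path: Path) -> None:
    # Stand-in for the real transport (e.g., an HTTP endpoint on the server
    # that receives the file and re-imports the model code).
    print(f"patching {path} into the running deployment...")

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(PatchOnChange(), path="my_truss", recursive=True)
    observer.start()
    print("watching my_truss/ for changes; Ctrl+C to stop")
    try:
        observer.join()
    except KeyboardInterrupt:
        observer.stop()
```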