Introducing automatic LLM optimization with TensorRT-LLM Engine Builder

TL;DR

Today, we’re launching the TensorRT-LLM Engine Builder, which empowers every developer to deploy extremely efficient and performant inference servers for open-source and fine-tuned LLMs in minutes. The Engine Builder replaces hours of tedious procurement, installation, and validation with automated deploy-time engine creation.

The TensorRT-LLM Engine Builder anchors a complete pipeline from model weights to low-latency, high-throughput production inference for open-source and fine-tuned models.

In a single command, you can now build and serve a wide range of foundation models like Llama, Mistral, and Whisper, plus fine-tuned variants. The Truss framework gives you full control to customize your model server, while the Baseten platform provides dedicated deployments with automatic traffic-based scaling, logging and metrics for observability, and best-in-class security and compliance.

Manual compilation vs engine builder: same great performance, 90% less effort

Why we made the TensorRT-LLM engine builder

TensorRT-LLM is an open-source toolbox created by NVIDIA for optimizing large language model inference. TensorRT and TensorRT-LLM are extremely performant; we’ve achieved results like 33% faster LLM inference, 40% faster SDXL inference, and 3x better LLM throughput.

We often use TensorRT-LLM to support our custom models for teams like Writer. For their latest industry-specific Palmyra LLMs, Palmyra-Med-70B and Palmyra-Fin-70B, Writer saw 60% higher tokens per second with TensorRT-LLM inference engines deployed on Baseten.

While TensorRT-LLM is incredibly powerful, we found ourselves and our customers repeatedly facing three issues when trying to use it in production:

  1. It can take the better part of an hour just to spin up a GPU instance and wait for all of the required runtimes and packages to finish installing.

  2. The GPUs used for engine building must exactly match the production hardware – for Llama 3.1 405B that means you’re tracking down at least 8 extra H100 GPUs.

  3. Once the TensorRT-LLM engine is built and validated, it needs to be exported, packaged, and deployed manually.

All told, building engines is often more an exercise in patience than engineering: it can take hours of babysitting the build process to produce a single engine.

To eliminate manual work from the engine building process and bring the power of TensorRT-LLM to more teams, we created the TensorRT-LLM Engine Builder, which automatically builds optimized model serving engines at deploy time from a single configuration file.

TensorRT-LLM works by converting model weights into an inference engine. The TensorRT-LLM engine builder handles this entire process automatically during model deployment. With the engine builder, you no longer need to procure or configure a separate GPU instance for compilation, deal with compatibility issues, or manually export finished engines.
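
For a concrete picture, here is a minimal sketch of what that configuration might look like for a Llama model in a Truss’s config.yaml. The field names shown (checkpoint_repository, max_seq_len, and so on) are illustrative and can vary between Truss versions, so treat this as a starting point and check the Engine Builder documentation for the current schema. Deploying a config like this through the standard Truss workflow (truss push) kicks off the engine build as part of the deployment.

# Illustrative config.yaml for the TensorRT-LLM Engine Builder (field names may vary by Truss version)
model_name: llama-3-8b-engine
resources:
  accelerator: H100
  use_gpu: true
trt_llm:
  build:
    base_model: llama                              # model architecture family
    checkpoint_repository:                         # where the engine builder pulls weights from
      source: HF
      repo: meta-llama/Meta-Llama-3-8B-Instruct
    max_seq_len: 8192                              # longest combined input + output the engine will serve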

How TensorRT-LLM makes model inference faster

In computing, optimization comes from specialization. For example, GPUs are better at inference than CPUs because GPUs specialize in matrix multiplication. Building a model serving engine follows the same philosophy. To improve your model’s performance, you need to bake in constraints.

A TensorRT inference engine is built for a specific model, GPU, sequence shape, and batch size. TensorRT-LLM uses this information to compile optimized CUDA instructions to maximize every aspect of the model’s performance and take advantage of every feature of the chosen hardware.

TensorRT-LLM is compatible with over 50 LLMs, along with similarly architected models like Whisper and certain large vision models, and it supports fine-tuned versions of these foundation models. During the engine building process, TensorRT-LLM can also apply post-training quantization for further speed and efficiency gains. For production serving, it offers features like in-flight batching and LoRA swapping, as well as advanced optimization techniques like speculative sampling.

Using TensorRT-LLM, you can build inference engines optimized for latency, throughput, cost, or a balance of the three.

How to use the TensorRT-LLM engine builder on Baseten

TensorRT-LLM Engine Builder Demo

The TensorRT-LLM engine builder is built into Truss, our open-source model packaging framework. To use the engine builder, install the latest version of Truss:

pip install --upgrade truss

One thing that makes TensorRT-LLM so powerful is the wide range of options provided for inference optimization. The engine builder supports the full set of parameters for TensorRT-LLM, so you’re not sacrificing control for convenience.

To build an engine for a given LLM on a given GPU, it’s helpful to first think about your goals.

  • What kinds of inputs and outputs are you expecting?

  • Do you need to support a large number of concurrent requests?

  • Do you want the lowest possible latency, regardless of cost?

With your goals set, building an inference engine becomes straightforward. Based on your use case, set values like the following (an example configuration follows this list):

  • Sequence shapes: TensorRT-LLM compiles CUDA instructions for specific input and output sequence lengths. Correctly predicting sequence shapes improves performance.

  • Batch size: how many requests to process at once. Larger batch sizes raise throughput and lower cost per token, but add latency to individual requests.

  • Quantization: running a model at a lower precision improves performance and cost but may affect output quality and must be carefully validated.
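
As a rough illustration, these choices map onto build settings in the same trt_llm.build block shown earlier. The option names below are indicative rather than exhaustive; consult the Engine Builder reference for the exact options your Truss version supports.

trt_llm:
  build:
    # Sequence shapes: size the engine for the prompts and completions you expect
    max_input_len: 4096
    max_seq_len: 5120          # input plus output tokens
    # Batch size: larger batches raise throughput and cut cost per token,
    # but add latency to individual requests
    max_batch_size: 64
    # Quantization: lower precision improves speed and cost;
    # validate output quality before relying on it in production
    quantization_type: fp8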

When you deploy the Engine Builder Truss, your TensorRT-LLM engine will be seamlessly built and deployed to the model inference server. You get full control over the model server – you can access the engine object directly in the Truss’s Python interface – plus all of the benefits of deploying a model on Baseten like autoscaling in response to traffic, logging and metrics, and secure and compliant inference.

Sign up for Baseten today to access the Engine Builder – get started with our end-to-end guide to TensorRT-LLM engine building or follow along with our demo video. You can also try example implementations for Llama, Mistral, and Whisper models.

If you have any questions about how to get the best possible performance for LLMs in production, especially for fine-tuned and custom LLMs, let us know! We’ll get in touch to discuss your use case and support your team’s experimentation.