Custom medical and financial LLMs from Writer see 60% higher tokens per second with Baseten

  • 70B-parameter custom LLMs

  • 60% higher TPS

  • 35% lower cost per million tokens

Background

Writer is the leading full-stack generative AI platform with the quality and security required in the enterprise. It is composed of four parts: Palmyra (custom-built LLMs), graph-based RAG, AI guardrails, and AI Studio — a suite of development tools for building AI apps on top of the Writer platform. Writer is trusted by hundreds of world-class enterprises like Vanguard, Kenvue, and Salesforce to power their AI workflows.

Today, Writer is adding new domain-specific models to its family of LLMs:

  • Palmyra-Med-70B, tailored for the healthcare sector, surpasses the accuracy of widely recognized models like Med-PaLM-2 and GPT-4 by a notable margin.

  • Palmyra-Fin-70B demonstrates superior analytical performance over models such as Claude 3.5 Sonnet and GPT-4o in financial evaluations, excelling in tasks like long-form financial analysis and risk assessment. It even passed the CFA Level III exam, one of the most difficult financial tests available.

These domain-specific models bring unparalleled accuracy to two industries with strict compliance standards and complex use cases.

Problem

Writer makes state-of-the-art LLMs, but needs to match that with state-of-the-art performance in production. Optimizing model inference is detail-oriented engineering work that distracts from the team’s core competency: model training.

Writer set out to build domain-specific models for increased precision, accuracy, and compliance. Another essential benefit is resource efficiency: domain-specific models are more cost-effective to train and deploy than larger, more general models. That benefit is only unlocked by high-performance inference infrastructure.

When building Palmyra-Med-70B and Palmyra-Fin-70B, Writer’s engineering team needed to develop a model serving infrastructure capable of supporting these extremely powerful models.

Serving 70-billion-parameter LLMs is a challenge. These models require multiple high-end A100 or H100 GPUs to run at all, and getting them to run fast means applying cutting-edge research techniques and leveraging new model optimization frameworks.
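For a rough sense of scale, here is a back-of-envelope sketch of the memory footprint (generic arithmetic for any 70B model, not Writer’s deployment specifics):

# Back-of-envelope memory math for a 70B-parameter LLM served in FP16.
# Illustrative only: real deployments also budget for the KV cache,
# activations, and serving-framework overhead.

params = 70e9        # model parameters
bytes_per_param = 2  # FP16 stores each weight in 2 bytes

weight_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weight_gb:.0f} GB")  # ~140 GB

# No single 80 GB A100 or H100 can hold the weights, so the model is
# sharded across GPUs, e.g. 4-way tensor parallelism on 80 GB cards:
print(f"Per GPU at TP=4: ~{weight_gb / 4:.0f} GB")  # ~35 GB, leaving
                                                    # headroom for KV cache

The memory left over on each GPU after the weights is what bounds batch size and context length, which is why engine-level optimization matters so much for throughput.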

Writer’s ML engineers needed a simple but powerful platform to deploy the models they create reliably and securely with production-ready performance. That need brought their team to Baseten.

Solution

Adopting Baseten as an inference platform was a sea change for Writer’s engineering team. Working with Baseten’s dedicated model performance team and forward-deployed engineers, Writer achieved state-of-the-art performance on its custom 70-billion-parameter large language models.

Baseten’s model performance engineers analyzed Writer’s custom LLMs and worked closely with Writer’s team to understand what tradeoffs made sense for inference. Together, Writer and Baseten found a solution that meets their ambitious requirements for latency, throughput, and cost.

Using TensorRT-LLM, NVIDIA’s model performance optimization SDK, we built a model-specific engine for each LLM. These engines compile specialized CUDA kernels tuned to the sequence shapes and batch sizes that match real-world use. They also support features like in-flight batching for seamless production serving.
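As a minimal sketch of what this looks like in code, recent TensorRT-LLM releases expose a high-level Python API; the model name, parallelism, and shape limits below are illustrative placeholders, not Writer’s actual engine configuration:

# Sketch of building and serving an engine with TensorRT-LLM's high-level
# Python API (recent releases). All names and numbers are placeholders;
# Palmyra engines were tuned per model.
from tensorrt_llm import LLM, BuildConfig, SamplingParams

# The engine is compiled for the batch sizes and sequence shapes expected
# in production, which is where much of the speedup comes from.
build_config = BuildConfig(
    max_batch_size=64,
    max_input_len=4096,
    max_seq_len=8192,
)

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder 70B model
    tensor_parallel_size=4,                        # shard across 4 GPUs
    build_config=build_config,
)

# In-flight (continuous) batching happens in the underlying executor: new
# requests join the running batch instead of waiting for it to drain.
outputs = llm.generate(
    ["Summarize the key findings of this clinical study: ..."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)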

Result

Writer and Baseten worked together to create secure, private, high-performance deployments for Writer’s latest Palmyra LLMs. With TensorRT-LLM engines deployed on Baseten, Writer surpassed its performance requirements ahead of launching the new models.

In a benchmark running the LLMs in FP16 on four NVIDIA A100 GPUs, Writer saw:

  • 60% higher tokens per second

  • 23% lower time to first token

  • 35% lower cost per million tokens
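The throughput and cost figures are tightly linked: at a fixed hardware price, cost per token is inversely proportional to tokens per second. Here is a sketch with purely hypothetical prices (not Writer’s or Baseten’s actual rates):

# How higher throughput lowers cost per million tokens. All prices and
# throughput numbers here are hypothetical, for illustration only.
gpu_hour_price = 10.0  # $/hour for a hypothetical 4xA100 instance
baseline_tps = 500.0   # hypothetical tokens/second before optimization
optimized_tps = baseline_tps * 1.60  # 60% higher tokens per second

def cost_per_million_tokens(tps: float) -> float:
    return gpu_hour_price / (tps * 3600) * 1e6

before = cost_per_million_tokens(baseline_tps)
after = cost_per_million_tokens(optimized_tps)
print(f"${before:.2f} -> ${after:.2f} per million tokens")
# A 60% throughput gain alone implies 1 - 1/1.6 = 37.5% lower cost per
# token; the reported 35% is consistent with that, net of real-world
# deployment overheads.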

Inference for custom-built LLMs could be a major headache. Thanks to Baseten, we’re getting cost-effective high-performance model serving without any extra burden on our internal engineering teams. Instead, we get to focus our expertise on creating the best possible domain-specific LLMs for our customers.

Waseem Alshikh, CTO and Co-Founder of Writer

The improved performance means Writer customers can enjoy faster response times and higher token throughput when interacting with Palmyra models. 

What's next

High-performance inference is essential for serving models in production. End users demand fast applications, and AI platforms need to operate models at scale while keeping costs under control. With Baseten, Writer’s team was able to bring custom large language models to market with best-in-class performance.

Writer is excited to explore additional performance optimizations, including H100 GPUs, FP8 quantization, and newer research techniques like speculative decoding, to further improve model speed and reduce operational costs.
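As a flavor of that last technique: speculative decoding uses a small, fast draft model to propose several tokens that the large target model then verifies. Below is a toy greedy-decoding sketch; the draft_model and target_model callables are hypothetical stand-ins, and production implementations verify all draft tokens in a single batched forward pass rather than one at a time:

# Toy sketch of greedy speculative decoding. draft_model and target_model
# are hypothetical callables mapping a token sequence to the argmax next
# token. This loop only shows the accept/reject logic; the real speedup
# comes from verifying all k draft tokens in one target-model forward pass.
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]

def speculative_decode(
    prompt: List[Token],
    draft_model: NextToken,   # small, fast model that proposes tokens
    target_model: NextToken,  # large model whose output must be matched
    k: int = 4,               # draft tokens proposed per step
    max_new_tokens: int = 128,
) -> List[Token]:
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The draft model cheaply proposes k tokens autoregressively.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. The target model checks each proposal in order: accept while
        #    they match, substituting its own token at the first mismatch.
        accepted: List[Token] = []
        for tok in draft:
            expected = target_model(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)
                break
        else:
            # Every draft token was accepted; the target adds a bonus token.
            accepted.append(target_model(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len]

When the draft model agrees with the target model often, each step emits several tokens for roughly the cost of one large-model pass, which is why the technique reduces latency without changing the output of greedy decoding.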

Explore Baseten today

We love partnering with companies developing innovative AI products, giving them the most customizable model deployments with the lowest latency.