How to build function calling and JSON mode for open-source and fine-tuned LLMs

Today, we announced support for function calling and structured output for LLMs deployed with our TensorRT-LLM Engine Builder. This adds support at the model server level for two key features:

  1. Function calling: also known as “tool use,” this feature lets you pass a set of defined tools to an LLM as part of the request body. Based on the prompt, the model selects and returns the most appropriate function/tool from the provided options.

  2. Structured output: an evolution of “JSON mode,” this feature enforces an output schema defined as part of the LLM input. The LLM output is guaranteed to adhere to the provided schema, with full Pydantic support.

To introduce these features, we built new capabilities into our customized version of NVIDIA’s Triton inference server. This engineering deep dive explains how the implementation works under the hood: defining schemas and tools, building a state machine, and using logit biasing to force valid output.

And the best part? Thanks to pre-computed token masks, there’s minimal latency impact from using either feature after the first call with a given schema is completed. You can expect the same tokens per second when generating JSON as when generating ordinary text.

If you’re looking to get started quickly with these new features, check out our launch announcement and docs for function calling and structured output. For implementation details, keep reading!

How structured output is generated

To understand how it’s possible to guarantee structured output, we need to dive into the details of how a token is generated during LLM inference. If you’re familiar with LLM inference, you’ll know that a new token is generated on each forward pass through the model. During that forward pass:

  1. A vector of logits is outputted from the final layer of the LLM’s neural network.

  2. A normalization function like softmax is applied to turn the logits into probabilities.

  3. Using these probabilities, a token is selected. Depending on settings like top_p, top_k, beam_width, and temperature, this may not always be the highest-probability token.

Structured output uses logit biasing in the first step to guarantee valid tokens are generated.
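
As a concrete (if highly simplified) sketch, here’s what one decoding step looks like over a toy vocabulary. This is plain NumPy for illustration, not the actual TensorRT-LLM kernels, and structured output intervenes right at step 1:

```python
import numpy as np

vocab_size = 8                         # real models have ~128K tokens
logits = np.random.randn(vocab_size)   # 1. logits from the model's final layer
                                       #    (structured output biases these before step 2)

def softmax(x, temperature=1.0):
    x = x / temperature
    x = x - x.max()                    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

probs = softmax(logits, temperature=0.7)   # 2. normalize logits into probabilities

# 3. sample a token; with settings like temperature and top_p in play,
#    this isn't always the highest-probability token
next_token = int(np.random.choice(vocab_size, p=probs))
```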

Logit biasing ensures token validity

The length of the logit vector outputted in the first step is equal to the number of tokens in the model’s vocabulary. For example, Llama 3 LLMs have a vocabulary of ~128,000 tokens. Thus, the logit vector will have about 128K values. Each logit in the vector is a score representing how much the LLM thinks that the given token from the vocabulary could be the next token in the output sequence.

For structured output, we only want to generate valid tokens. For example, an array in JSON must have both an opening and closing bracket: [1, 2, 3]. If we already have generated [1, 2, 3 then the valid options are:

  1. A comma, a space, and another value such as 4: , 4.

  2. A closing bracket to end the array: ].

From the model’s vocabulary, most of the possible tokens will not be valid at certain points when generating structured output. Logit biasing guarantees valid output structure by identifying every invalid token and setting its score to negative infinity, ensuring that the invalid tokens cannot be generated.
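
Continuing the array example, here’s a minimal sketch of that biasing step with a made-up mini-vocabulary and a hand-picked set of valid tokens (in practice, the valid set comes from the state machine described below):

```python
import numpy as np

# Made-up mini-vocabulary; real vocabularies have ~128K tokens
vocab = ["[", "]", ",", " ", "4", "{", "foo", "\n"]
logits = np.random.randn(len(vocab))

# Having generated "[1, 2, 3", only a comma, a space, a digit, or "]" is valid next
valid_ids = {vocab.index(","), vocab.index(" "), vocab.index("4"), vocab.index("]")}

# Logit biasing: every invalid token's score becomes -inf, so after softmax its
# probability is exactly zero and it can never be sampled
mask = np.array([i in valid_ids for i in range(len(vocab))])
biased = np.where(mask, logits, -np.inf)

probs = np.exp(biased - biased.max())
probs /= probs.sum()
print({tok: round(float(p), 3) for tok, p in zip(vocab, probs)})
```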

âś•

This discussion of logit biasing raises a natural question: how do we know where we are in the output schema and which tokens are valid?

State machine provides token requirements

The model server running beneath the inference process is responsible for tracking output format using a state machine. This model server is a modified version of NVIDIA Triton with extra capabilities that we call “Briton” (Baseten + Triton = Briton).

Using Outlines, an industry-standard library that also powers vLLM, the Briton model server takes the schema passed in the request, transforms it into a regular expression, then generates a state machine from that regex. We chose Outlines for its robust feature set and reliability.

However, Outlines is written in Python, while TensorRT-LLM and Triton run in C++ for speed and efficiency. To handle this, we first generate the state machine in Python, then serialize it to Protocol Buffers and load it into the model server.
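
As a rough sketch of that Python step, the flow looks something like the snippet below. Module paths and helper names vary between Outlines versions, so treat this as illustrative rather than the exact Briton code:

```python
import json
import interegular
from pydantic import BaseModel
from outlines.fsm.json_schema import build_regex_from_schema  # path differs in some Outlines versions

class Point(BaseModel):
    x: int
    y: int

# 1. Output schema (JSON Schema) -> regular expression matching every valid output string
regex_str = build_regex_from_schema(json.dumps(Point.model_json_schema()))

# 2. Regular expression -> finite state machine; this is what gets serialized
#    to Protocol Buffers and loaded into the C++ model server
fsm = interegular.parse_pattern(regex_str).to_fsm()
print(f"{len(fsm.states)} states")
```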

Once loaded into the model server, the state machine makes the logit biasing process incredibly efficient. The state machine is cached in memory, and an appropriate token mask – a list of 1s and 0s corresponding to valid and invalid tokens – is created for each node of the state machine for logit biasing. This means these calculations aren’t made at inference time; instead, existing masks are applied based on which state is active.
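
In spirit, the cached masks behave like the toy lookup below: one pre-computed mask per state, built once when a schema is first seen and then simply looked up on each decoding step. The states and vocabulary here are invented purely for illustration:

```python
import numpy as np

vocab = ["{", "}", '"', ":", ",", "name", "42", " "]

# Hypothetical state machine: state id -> tokens that are valid in that state
# (in Briton these states come from the schema's regex, not hand-written rules)
valid_tokens = {
    0: {"{"},          # expecting the opening brace
    1: {'"', "}"},     # expecting a key or the closing brace
    2: {":"},          # expecting a colon after a key
}

# Pre-compute one mask per state, once
masks = {state: np.array([tok in allowed for tok in vocab])
         for state, allowed in valid_tokens.items()}

def bias_logits(logits, state):
    # At inference time, applying a cached mask is a single elementwise operation
    return np.where(masks[state], logits, -np.inf)
```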

With no token mask calculations happening during token generation, this approach to logit biasing has a negligible effect on model performance, so you’ll get the same high tokens per second that you’re used to from TensorRT-LLM while also ensuring that every token is valid for the provided output schema.

How to use function calling

Function calling works by providing LLMs with a structured description of a set of tools. Based on the prompt, the model selects the most appropriate tool or tools for the task described. Functions can be anything: API calls, ORM access, SQL queries, or just a script.

âś•
A function written to be passed to an LLM — note the descriptive docstring.
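
The original screenshot isn’t reproduced here, but a function written for an LLM might look like the hypothetical example below, where the docstring and type hints give the model everything it needs to decide when to pick the tool:

```python
def get_current_weather(city: str, unit: str = "celsius") -> dict:
    """Get the current weather for a city.

    Args:
        city: Name of the city, e.g. "San Francisco".
        unit: Temperature unit, either "celsius" or "fahrenheit".
    """
    # The LLM never executes this body; your application calls it after the
    # model selects the tool and returns arguments for it.
    ...
```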

It’s essential to understand that function calling does not give the LLM the capability to execute code. Instead, function calling asks the LLM to choose the most appropriate function from the list of available tools. The actual function execution needs to happen in the same environment that made the LLM call.

Our function calling implementation follows the OpenAI API spec for compatibility, but applies to any model served with TensorRT-LLM via the Engine Builder that has built-in function calling capabilities (e.g. Llama 3.1 Instruct, but not Llama 3). Using the same logit biasing process that creates structured output, Briton (the modified Triton inference server) guarantees schematically correct tool responses.

âś•
Example payload with function calling via the "tools" key
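
The payload from the screenshot isn’t reproduced here, but because the implementation follows the OpenAI API spec, a request with the "tools" key and the code that executes the model’s choice look roughly like this sketch (the model name and helper function are placeholders):

```python
import json

payload = {
    "model": "llama-3.1-70b-instruct",   # placeholder model name
    "messages": [
        {"role": "user", "content": "What's the weather in Paris right now?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["city"],
                },
            },
        }
    ],
}

# The model's response names a tool and its arguments; executing it is up to you.
def run_tool_calls(message: dict, tools: dict):
    results = []
    for call in message.get("tool_calls", []):
        fn = tools[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append(fn(**args))   # execution happens here, in your environment
    return results
```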

Function calling is critical for building agentic workflows and other advanced Compound AI systems. To use function calling for yourself, check out our function calling example in the documentation.

How to use structured output

The more general structured output feature forces LLMs to return output that adheres to a Pydantic schema. Structured output is valid JSON, but goes beyond JSON mode with support for required and optional fields, multiple data types, and additional validations like maximum length.

To start, define your output schema as a Pydantic model.

âś•
Pydantic model for a "Person" object. The schema can be passed to an LLM to structure output.
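
The exact fields from the screenshot aren’t reproduced here, but a "Person" schema along those lines might look like this:

```python
from typing import Optional
from pydantic import BaseModel, Field

class Person(BaseModel):
    first_name: str = Field(..., max_length=50)   # validations like max_length are enforced
    last_name: str = Field(..., max_length=50)
    age: int                                      # required field
    email: Optional[str] = None                   # optional field
```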

Then, when you add the schema to the LLM call, the model server will build the schema into a state machine and use it for token masking as described above. The LLM inference arguments match the OpenAI API spec for structured output to ensure maximum compatibility.

âś•
Example LLM request payload with a response schema.
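
Again as a sketch rather than the exact payload from the screenshot, a request that attaches the schema above via the OpenAI-style response_format field looks roughly like this (check the structured output docs for the exact field names your deployment expects):

```python
payload = {
    "model": "llama-3.1-70b-instruct",   # placeholder model name
    "messages": [
        {"role": "user",
         "content": "Extract the person described in this bio: ..."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": Person.model_json_schema(),   # the Pydantic model defined above
        },
    },
}
```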

Structured output is useful for a wide range of Compound AI applications as the guaranteed schema adherence means you can integrate LLMs into larger systems without worrying about type errors. To try structured output for your application, start with our structured output example in the documentation.

What to build with function calling and structured output

While the implementation behind these new features is interesting, what’s even more exciting is the use cases they enable.

Function calling unlocks a wide range of agentic use cases for open source LLMs. With function calling, you can give agents access to a set of tools to accomplish tasks. As we saw above, the LLM can only select the best tool; it can’t execute the API call or run the function itself, which is where multi-step AI systems come in.

These multi-step, often multi-model systems are commonly known as Compound AI. When building multi-stage Compound AI systems, structured output is critical. With structured output, each component of the system can communicate in valid JSON, preventing errors and avoiding parsing overhead.

As you build with function calling and structured output, remember that these model server features don’t enhance output quality; they only enforce format. Clear prompting and techniques like few-shot prompting still have their place for getting quality output within the enforced structure.

Get started building: