GPT vs Llama: Migrate to open source LLMs seamlessly
TL;DR
If you’re using the ChatCompletions API and want to experiment with open source LLMs for your generative AI application, we’ve built a bridge that lets you try out models like Llama with just three tiny code changes.
In the past few months, researchers have released powerful open source large language models like Llama and Qwen. Open source LLMs demonstrate strong capabilities on tasks from chat completion to code generation, and many can be run on cost-effective hardware.
One barrier to adopting open source ML models is the time it takes to re-integrate the new model into your application. Models have different input and output formats, support different parameters, and require different prompting strategies.
To make it easier to experiment with open source models, we’ve created a new endpoint for LLMs hosted on Baseten that’s compatible with OpenAI’s ChatCompletions API. With this endpoint and a supported model, you can go from GPT-3.5 to open source LLMs like Llama 3.2 with:
One-click model deployment.
Zero pip install commands.
Three tiny code changes.
Follow along with the video above or the tutorial below and you’ll be working with open source LLMs in no time!
Deploy Llama 3.2 11B
Any LLM can be served with an OpenAI-compatible model server. In particular, vLLM servers are OpenAI compatible out of the box, and TensorRT-LLM Engine Builder servers can be made compatible by adding metadata to the config.yaml.
Let’s deploy Llama 3.2 11B for this tutorial:
Select Llama 3.2 11B from the model library.
Click “Deploy on Baseten.”
Get your model ID by opening the “Call model” modal and copying it from the model endpoint.
Create an API key for your Baseten workspace with the “Generate API key” button.
You’ll need the API key and model ID to call the model endpoint in the next step.
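The code samples in the next section read your Baseten API key from an environment variable and build the bridge URL from your model ID. Here’s a minimal sketch of that setup; the model ID shown is a hypothetical placeholder:

import os

# Set the key in your shell first, e.g. `export BASETEN_API_KEY=YOUR_KEY`,
# so the scripts below can read it from the environment.
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Hypothetical placeholder; substitute the model ID you copied above.
model_id = "abcd1234"
base_url = f"https://bridge.baseten.co/{model_id}/v1"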
Note that with open source models deployed on Baseten, you’re charged per minute of GPU usage rather than per token. Your Baseten account comes with free credits to fund experimentation, and all models deployed from the model library automatically take advantage of scale to zero with fast cold starts to save you money while the model is not in use.
Update your model usage script
Here’s the fun part of the project: you can drop this model into existing code that uses the ChatCompletions API with only minor changes.
In fact, the code will still use the OpenAI Python client, so you don’t need to install any new libraries. Let’s take a look at some code; we’ll cover the differences below.
Standard inference
Here’s a code sample for using OpenAI’s ChatCompletions API with GPT-3.5:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"]
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

print(response.choices[0].message.content)
And here’s the same code sample for Llama 3.2 on Baseten:
from openai import OpenAI
import os

client = OpenAI(
    # api_key=os.environ["OPENAI_API_KEY"],
    api_key=os.environ["BASETEN_API_KEY"],
    # Add base_url
    base_url="https://bridge.baseten.co/{model_id}/v1"
)

response = client.chat.completions.create(
    # model="gpt-3.5-turbo",
    model="llama-3-2-11b",
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

print(response.choices[0].message.content)
Rather than making you play spot-the-difference, we’ll highlight the three small code changes that make this work:
Replace the OPENAI_API_KEY with your BASETEN_API_KEY in the client object.
Set the base_url in the client object to https://bridge.baseten.co/{model_id}/v1, where {model_id} is the ID of your deployed model.
In the client.chat.completions.create() call, set model to llama-3-2-11b instead of gpt-3.5-turbo.
The response format will be exactly the same, though token usage values will not be calculated. The endpoint reference docs have complete information on supported inputs and outputs.
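For example, you can keep reading the same fields you’d read from a GPT-3.5 response. A quick sketch, where the values in the comments are illustrative and, per the note above, the usage token counts shouldn’t be relied on:

print(response.model)                       # e.g. "llama-3-2-11b"
print(response.choices[0].message.role)     # "assistant"
print(response.choices[0].message.content)  # the generated answer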
Streaming inference
LLMs on Baseten can also stream responses. To stream Llama responses, pass stream=True in the ChatCompletions API call and parse the streaming response as needed:
from openai import OpenAI
import os

client = OpenAI(
    # api_key=os.environ["OPENAI_API_KEY"],
    api_key=os.environ["BASETEN_API_KEY"],
    # Add base_url
    base_url="https://bridge.baseten.co/{model_id}/v1"
)

response = client.chat.completions.create(
    # model="gpt-3.5-turbo",
    model="llama-3-2-11b",
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta)
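If you’d rather collect the whole response as a single string, accumulate the deltas instead of printing each chunk as it arrives. A minimal sketch (note that delta.content can be None on some chunks, so guard for it):

full_text = ""
for chunk in response:
    delta = chunk.choices[0].delta
    if delta.content:  # content can be None, e.g. on the final chunk
        full_text += delta.content
print(full_text)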
Explore open source models
Any LLM can be served with an OpenAI-compatible ChatCompletions model server. Get started by adapting an example, like this implementation of Ultravox, or let us know what you need at support@baseten.co.
And while this bridge makes it easier to get started with LLMs like Llama, there’s a wide world of open source models to explore. Our model library hosts dozens of curated open source models ranging from LLMs to models like FLUX and Whisper. Get started with our checklist for switching to open source or dive deeper with our guide to open source alternatives for ML models.