GPT vs Llama: Migrate to open source LLMs seamlessly
TL;DR
If you’re using the ChatCompletions API and want to experiment with open source LLMs for your generative AI application, we’ve built a bridge that lets you try out models like Llama with just three tiny code changes.
In the past few months, researchers have released powerful open source large language models like Llama and Qwen. Open source LLMs demonstrate strong capabilities on tasks from chat completion to code generation, and many can be run on cost-effective hardware.
One barrier to adopting open source ML models is the time it takes to re-integrate the new model into your application. Models have different input and output formats, support different parameters, and require different prompting strategies.
To make it easier to experiment with open source models, we’ve created a new endpoint for LLMs hosted on Baseten that’s compatible with OpenAI’s ChatCompletions API. With this endpoint and a supported model, you can go from GPT-3.5 to open source LLMs like Llama 3.2 with:
One-click model deployment.
Zero pip install commands.
Three tiny code changes.
Follow along with the video above or the tutorial below and you’ll be working with open source LLMs in no time!
Deploy Llama 3.2 11B
Any LLM can be served with an OpenAI-compatible model server. In particular, vLLM servers are OpenAI compatible out of the box, and TensorRT-LLM Engine Builder servers can be made compatible by adding metadata to the config.yaml.
Let’s deploy Llama 3.2 11B for this tutorial:
Select Llama 3.2 11B from the model library.
Click “Deploy on Baseten.”
Get your model ID by opening the “Call model” modal and copying it from the model endpoint.
Create an API key for your Baseten workspace with the “Generate API key” button.
You’ll need the API key and model ID to call the model endpoint in the next step.
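The code samples in the next section read your Baseten API key from an environment variable and build the bridge URL from your model ID. Here’s a minimal sketch of that setup; the model ID shown is a hypothetical placeholder:

import os

# Set the key in your shell first, e.g. `export BASETEN_API_KEY=YOUR_KEY`,
# so the scripts below can read it from the environment.
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Hypothetical placeholder; substitute the model ID you copied above.
model_id = "abcd1234"
base_url = f"https://bridge.baseten.co/{model_id}/v1"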
Note that with open source models deployed on Baseten, you’re charged per minute of GPU usage rather than per token. Your Baseten account comes with free credits to fund experimentation, and all models deployed from the model library automatically take advantage of scale to zero with fast cold starts to save you money while the model is not in use.
Update your model usage script
Here’s the fun part of the project: you can drop this model into existing code that uses the ChatCompletions API with only minor changes.
In fact, the code will still use the OpenAI Python client, so you don’t need to install any new libraries. Let’s take a look at some code; we’ll cover the differences below.
Standard inference
Here’s a code sample for using OpenAI’s ChatCompletions API with GPT-3.5:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"]
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

print(response.choices[0].message.content)
And here’s the same code sample for Llama 3.2 on Baseten:
from openai import OpenAI
import os

client = OpenAI(
    # api_key=os.environ["OPENAI_API_KEY"],
    api_key=os.environ["BASETEN_API_KEY"],
    # Add base_url
    base_url="https://bridge.baseten.co/{model_id}/v1"
)

response = client.chat.completions.create(
    # model="gpt-3.5-turbo",
    model="llama-3-2-11b",
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

print(response.choices[0].message.content)
Rather than making you play spot-the-difference, we’ll highlight the three small code changes that make this work:
Replace the OPENAI_API_KEY with your BASETEN_API_KEY in the client object.
Set the base_url in the client object to https://bridge.baseten.co/{model_id}/v1, where {model_id} is the ID of your deployed model.
In the client.chat.completions.create() call, set model to llama-3-2-11b instead of gpt-3.5-turbo.
The response format will be exactly the same, though token usage values will not be calculated. The endpoint reference docs have complete information on supported inputs and outputs.
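For example, you can keep reading the same fields you’d read from a GPT-3.5 response. A quick sketch, where the values in the comments are illustrative and, per the note above, the usage token counts shouldn’t be relied on:

print(response.model)                       # e.g. "llama-3-2-11b"
print(response.choices[0].message.role)     # "assistant"
print(response.choices[0].message.content)  # the generated answer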
Streaming inference
LLMs on Baseten can also stream responses. To stream Llama responses, pass stream=True in the ChatCompletions API call and parse the streaming response as needed:
from openai import OpenAI
import os

client = OpenAI(
    # api_key=os.environ["OPENAI_API_KEY"],
    api_key=os.environ["BASETEN_API_KEY"],
    # Add base_url
    base_url="https://bridge.baseten.co/{model_id}/v1"
)

response = client.chat.completions.create(
    # model="gpt-3.5-turbo",
    model="llama-3-2-11b",
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta)
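If you’d rather collect the whole response as a single string, accumulate the deltas instead of printing each chunk as it arrives. A minimal sketch (note that delta.content can be None on some chunks, so guard for it):

full_text = ""
for chunk in response:
    delta = chunk.choices[0].delta
    if delta.content:  # content can be None, e.g. on the final chunk
        full_text += delta.content
print(full_text)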
Explore open source models
Any LLM can be served with an OpenAI-compatible ChatCompletions model server. Get started by adapting an example, like this implementation of Ultravox, or let us know what you need at support@baseten.co.
And while this bridge makes it easier to get started with LLMs like Llama, there’s a wide world of open source models to explore. Our model library hosts dozens of curated open source models ranging from LLMs to models like FLUX and Whisper. Get started with our checklist for switching to open source or dive deeper with our guide to open source alternatives for ML models.