Llama 3.1 Nemotron Ultra 253B
A high-efficiency distillation of Llama 3.1 405B with leading accuracy for reasoning, tool calling, chat, and instruction following.
Deploy Llama 3.1 Nemotron Ultra 253B behind an API endpoint in seconds.
Example usage
Input
import os
import requests

# Replace the empty string with your model id below
model_id = ""
baseten_api_key = os.environ["BASETEN_API_KEY"]

messages = [
    {"role": "user", "content": "Write a limerick about the wonders of GPU computing."},
]
data = {
    "messages": messages,
    "stream": True,
    "max_new_tokens": 512
}

# Call the model endpoint
res = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data,
    stream=True
)

# Print the generated tokens as they are streamed back
for content in res.iter_content():
    print(content.decode("utf-8"), end="", flush=True)
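With streaming enabled, the endpoint returns raw token text as it is generated rather than a single JSON document. If you would rather receive the whole completion at once, the same endpoint can be called with "stream": False. The sketch below reuses the endpoint and headers from the example above; the assumption that the non-streaming response body is JSON that res.json() can parse is ours, so check the response format for your deployment.

import os
import requests

model_id = ""  # replace with your model id
baseten_api_key = os.environ["BASETEN_API_KEY"]

# Non-streaming variant: request the full completion in one response
data = {
    "messages": [
        {"role": "user", "content": "Write a limerick about the wonders of GPU computing."}
    ],
    "stream": False,
    "max_new_tokens": 512,
}

res = requests.post(
    f"https://model-{model_id}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {baseten_api_key}"},
    json=data,
)
res.raise_for_status()

# Assumption: with streaming disabled, the body is a JSON payload containing the generated text
print(res.json())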