MARS6

MARS6 is a frontier text-to-speech model by CAMB.AI with voice/prosody cloning capabilities in 10 languages. MARS6 must be licensed for commercial use; we can help!

Deploy MARS6 behind an API endpoint in seconds.

Example usage

This model requires at least four inputs:

  1. text: The input text to be spoken

  2. audio_ref: A base64-encoded audio file containing speech from a single speaker

  3. ref_text: The transcript of what is spoken in audio_ref (can be left as None)

  4. language: The language code of the target language

The model attempts to generate speech in the style of the reference audio. By default, the output is an HTTP/1.1 chunked-encoding response carrying an ADTS AAC audio stream, but it can be configured to stream in flac format instead, or to skip streaming entirely and return the whole response as a base64-encoded flac file. A minimal request payload looks like this:

data = {
    "text": "The quick brown fox jumps over the lazy dog",
    "audio_ref": encoded_str,
    "ref_text": prompt_txt,
    "language": "en-us",  # Target language, in this case English.
    # "top_p": 0.7, # Optionally specify a top_p (default 0.7)
    # "temperature": 0.7, # Optionally specify a temperature (default 0.7)
    # "chunk_length": 200, # Optional text chunk length for splitting long pieces of input text. Default 200
    # "max_new_tokens": 0, # Optional limit on max number of new tokens, default is zero (unlimited)
    # "repetition_penalty": 1.5 # Optional rep penalty, default 1.5
}
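If you only need to save the streamed audio to disk rather than decode it on the fly, here is a minimal sketch (assuming the default "adts" stream format; url, headers, and data are the same variables defined in the full example under Input, and the output filename is just an example):

# Sketch: write the default ADTS AAC stream straight to a file as it arrives.
import requests

with requests.post(url, headers=headers, json=data, stream=True, timeout=300) as response:
    response.raise_for_status()
    with open("mars6_output.aac", "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)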
Input
import base64
import time
import torchaudio
import requests
import IPython.display as ipd
import librosa, librosa.display
import torch
import io
from torchaudio.io import StreamReader

# Step 1: set endpoint url and api key:
url = "<YOUR PREDICTION ENDPOINT>"
headers = {"Authorization": "Api-Key <YOUR API KEY>"}

# Step 2: pick reference audio to clone, encode it as base64
file_path = "ref_debug.flac"  # any valid audio filepath, ideally between 6s-90s.
# NOTE: duration=5 truncates the reference to its first 5 seconds; remove it to use the full clip.
wav, sr = librosa.load(file_path, sr=None, mono=True, offset=0, duration=5)
io_data = io.BytesIO()
torchaudio.save(io_data, torch.from_numpy(wav)[None], sample_rate=sr, format="wav")
io_data.seek(0)
encoded_data = base64.b64encode(io_data.read())
encoded_str = encoded_data.decode("utf-8")
# OPTIONAL: specify the transcript of the reference/prompt (slightly speeds up inference, and may make it sound a bit better).
prompt_txt = None  # if unspecified, can be left as None

# Step 3: define other inference settings:
data = {
    "text": "The quick brown fox jumps over the lazy dog",
    "audio_ref": encoded_str,
    "ref_text": prompt_txt,
    "language": "en-us",  # Target language, in this case English.
    # "top_p": 0.7, # Optionally specify a top_p (default 0.7)
    # "temperature": 0.7, # Optionally specify a temperature (default 0.7)
    # "chunk_length": 200, # Optional text chunk length for splitting long pieces of input text. Default 200
    # "max_new_tokens": 0, # Optional limit on max number of new tokens, default is zero (unlimited)
    # "repetition_penalty": 1.5, # Optional rep penalty, default 1.5
    # "stream": True, # Whether to stream the response back as an HTTP/1.1 chunked-encoding response, or run to completion and return the base64-encoded file.
    # "stream_format": "adts", # 'adts' or 'flac' for stream format. Default 'adts'
}

st = time.time()


class UnseekableWrapper:
    """Wraps the raw HTTP response so StreamReader treats it as an unseekable stream."""

    def __init__(self, obj):
        self.obj = obj

    def read(self, n):
        return self.obj.read(n)


# Step 4: Send the POST request (note the first request might be a bit slow, but following requests should be fast)
response = requests.post(url, headers=headers, json=data, stream=True, timeout=300)
streamer = StreamReader(UnseekableWrapper(response.raw))
streamer.add_basic_audio_stream(
    11025, buffer_chunk_size=3, sample_rate=44100, num_channels=1
)

# Step 4.1: check the header format of the returned stream response
for i in range(streamer.num_src_streams):
    print(streamer.get_src_stream_info(i))

# Step 5: stream the response back and decode it on-the-fly
audio_samples = []
for chunks in streamer.stream():
    audio_chunk = chunks[0]
    audio_samples.append(
        audio_chunk._elem.squeeze()
    )  # this is now just a (T,) float waveform, however you can set your own output format above.
    print(
        f"Playing audio chunk of size {audio_chunk._elem.squeeze().shape} at {time.time() - st:.2f}s."
    )
    # If you wish, you can also play each chunk as you receive it, e.g. using IPython:
    # ipd.display(ipd.Audio(audio_chunk._elem.squeeze().numpy(), rate=44100, autoplay=True))

# Step 6: concatenate all the audio chunks and play the full audio (if you didn't play them on the fly above)
final_full_audio = torch.concat(audio_samples, dim=0)  # (T,) float waveform @ 44.1kHz
# ipd.display(ipd.Audio(final_full_audio.numpy(), rate=44100))
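If you want to keep the result, a small follow-up sketch saves the concatenated waveform with torchaudio (assuming the variables from the example above; torchaudio.save expects a (channels, time) tensor, and the filename is just an example):

# Optional: write the full 44.1 kHz waveform to a wav file.
torchaudio.save("mars6_output.wav", final_full_audio[None], sample_rate=44100)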
JSON output
{
    "result": "iVBORw0KGgoAAAANSUhEU"
}
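When streaming is disabled, the endpoint returns JSON like the above, with the generated audio as a base64-encoded flac file in the result field. A sketch of requesting and decoding it (assuming the same url, headers, and data variables as in the example above, with the stream option set to False):

import base64, io
import requests
import torchaudio

# Re-use the payload from above, but disable streaming so the whole file comes back as base64.
response = requests.post(url, headers=headers, json={**data, "stream": False}, timeout=300)
flac_bytes = base64.b64decode(response.json()["result"])
wav, sr = torchaudio.load(io.BytesIO(flac_bytes), format="flac")  # (channels, T) float waveform and sample rate
print(wav.shape, sr)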

Deploy any model in just a few commands

Avoid getting tangled in complex deployment processes. Deploy best-in-class open-source models and take advantage of optimized serving for your own models.

$ truss init -- example stable-diffusion-2-1-base ./my-sd-truss
$ cd ./my-sd-truss
$ export BASETEN_API_KEY=MdNmOCXc.YBtEZD0WFOYKso2A6NEQkRqTe
$ truss push
INFO Serializing Stable Diffusion 2.1 truss.
INFO Making contact with Baseten 👋 👽
INFO 🚀 Uploading model to Baseten 🚀
Upload progress: 0% | | 0.00G/2.39G