MARS6

MARS6 is a frontier text-to-speech model by CAMB.AI with voice/prosody cloning capabilities in 10 languages. MARS6 must be licensed for commercial use; we can help!

Deploy MARS6 behind an API endpoint in seconds.

Example usage

This model requires at least four inputs:

  1. text: The input text to be spoken

  2. audio_ref: A base64-encoded audio file containing speech from a single speaker

  3. ref_text: The transcript of what is spoken in audio_ref (can be left as None)

  4. language: The language code of the target language

The model attempts to generate speech in the style of the reference audio. By default, the output is an HTTP/1.1 chunked-encoding response carrying an ADTS AAC audio stream, but it can be configured to stream in flac format instead, or to skip streaming entirely and return the whole response as a base64-encoded flac file. A minimal request payload looks like this:

data = {
    "text": "The quick brown fox jumps over the lazy dog",
    "audio_ref": encoded_str,
    "ref_text": prompt_txt,
    "language": "en-us",  # Target language, in this case English.
    # "top_p": 0.7, # Optionally specify a top_p (default 0.7)
    # "temperature": 0.7, # Optionally specify a temperature (default 0.7)
    # "chunk_length": 200, # Optional text chunk length for splitting long pieces of input text. Default 200
    # "max_new_tokens": 0, # Optional limit on max number of new tokens, default is zero (unlimited)
    # "repetition_penalty": 1.5 # Optional rep penalty, default 1.5
}
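If you only need to save the streamed audio to disk rather than decode it on the fly, here is a minimal sketch (assuming the default "adts" stream format; url, headers, and data are the same variables defined in the full example under Input, and the output filename is just an example):

# Sketch: write the default ADTS AAC stream straight to a file as it arrives.
import requests

with requests.post(url, headers=headers, json=data, stream=True, timeout=300) as response:
    response.raise_for_status()
    with open("mars6_output.aac", "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)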
Input
import base64
import time
import torchaudio
import requests
import IPython.display as ipd
import librosa, librosa.display
import torch
import io
from torchaudio.io import StreamReader

# Step 1: set endpoint url and api key:
url = "<YOUR PREDICTION ENDPOINT>"
headers = {"Authorization": "Api-Key <YOUR API KEY>"}

# Step 2: pick reference audio to clone, encode it as base64
file_path = "ref_debug.flac"  # any valid audio filepath, ideally between 6s-90s.
# NOTE: duration=5 truncates the reference to its first 5 seconds; remove it to use the full clip.
wav, sr = librosa.load(file_path, sr=None, mono=True, offset=0, duration=5)
io_data = io.BytesIO()
torchaudio.save(io_data, torch.from_numpy(wav)[None], sample_rate=sr, format="wav")
io_data.seek(0)
encoded_data = base64.b64encode(io_data.read())
encoded_str = encoded_data.decode("utf-8")
# OPTIONAL: specify the transcript of the reference/prompt (slightly speeds up inference, and may make it sound a bit better).
prompt_txt = None  # if unspecified, can be left as None

# Step 3: define other inference settings:
data = {
    "text": "The quick brown fox jumps over the lazy dog",
    "audio_ref": encoded_str,
    "ref_text": prompt_txt,
    "language": "en-us",  # Target language, in this case English.
    # "top_p": 0.7, # Optionally specify a top_p (default 0.7)
    # "temperature": 0.7, # Optionally specify a temperature (default 0.7)
    # "chunk_length": 200, # Optional text chunk length for splitting long pieces of input text. Default 200
    # "max_new_tokens": 0, # Optional limit on max number of new tokens, default is zero (unlimited)
    # "repetition_penalty": 1.5, # Optional rep penalty, default 1.5
    # "stream": True, # Whether to stream the response back as an HTTP/1.1 chunked-encoding response, or run to completion and return the base64-encoded file.
    # "stream_format": "adts", # 'adts' or 'flac' for stream format. Default 'adts'
}

st = time.time()


class UnseekableWrapper:
    """Wraps the raw HTTP response so StreamReader treats it as an unseekable stream."""

    def __init__(self, obj):
        self.obj = obj

    def read(self, n):
        return self.obj.read(n)


# Step 4: Send the POST request (note the first request might be a bit slow, but following requests should be fast)
response = requests.post(url, headers=headers, json=data, stream=True, timeout=300)
streamer = StreamReader(UnseekableWrapper(response.raw))
streamer.add_basic_audio_stream(
    11025, buffer_chunk_size=3, sample_rate=44100, num_channels=1
)

# Step 4.1: check the header format of the returned stream response
for i in range(streamer.num_src_streams):
    print(streamer.get_src_stream_info(i))

# Step 5: stream the response back and decode it on-the-fly
audio_samples = []
for chunks in streamer.stream():
    audio_chunk = chunks[0]
    audio_samples.append(
        audio_chunk._elem.squeeze()
    )  # this is now just a (T,) float waveform, however you can set your own output format above.
    print(
        f"Playing audio chunk of size {audio_chunk._elem.squeeze().shape} at {time.time() - st:.2f}s."
    )
    # If you wish, you can also play each chunk as you receive it, e.g. using IPython:
    # ipd.display(ipd.Audio(audio_chunk._elem.squeeze().numpy(), rate=44100, autoplay=True))

# Step 6: concatenate all the audio chunks and play the full audio (if you didn't play them on the fly above)
final_full_audio = torch.concat(audio_samples, dim=0)  # (T,) float waveform @ 44.1kHz
# ipd.display(ipd.Audio(final_full_audio.numpy(), rate=44100))
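If you want to keep the result, a small follow-up sketch saves the concatenated waveform with torchaudio (assuming the variables from the example above; torchaudio.save expects a (channels, time) tensor, and the filename is just an example):

# Optional: write the full 44.1 kHz waveform to a wav file.
torchaudio.save("mars6_output.wav", final_full_audio[None], sample_rate=44100)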
JSON output
{
    "result": "iVBORw0KGgoAAAANSUhEU"
}
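When streaming is disabled, the endpoint returns JSON like the above, with the generated audio as a base64-encoded flac file in the result field. A sketch of requesting and decoding it (assuming the same url, headers, and data variables as in the example above, with the stream option set to False):

import base64, io
import requests
import torchaudio

# Re-use the payload from above, but disable streaming so the whole file comes back as base64.
response = requests.post(url, headers=headers, json={**data, "stream": False}, timeout=300)
flac_bytes = base64.b64decode(response.json()["result"])
wav, sr = torchaudio.load(io.BytesIO(flac_bytes), format="flac")  # (channels, T) float waveform and sample rate
print(wav.shape, sr)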

Deploy any model in just a few commands

Avoid getting tangled in complex deployment processes. Deploy best-in-class open-source models and take advantage of optimized serving for your own models.

$ truss init -- example stable-diffusion-2-1-base ./my-sd-truss
$ cd ./my-sd-truss
$ export BASETEN_API_KEY=MdNmOCXc.YBtEZD0WFOYKso2A6NEQkRqTe
$ truss push
INFO Serializing Stable Diffusion 2.1 truss.
INFO Making contact with Baseten 👋 👽
INFO 🚀 Uploading model to Baseten 🚀
Upload progress: 0% | | 0.00G/2.39G