Running the MeloTTS Model on MaixPy MaixCAM

2025-08-15

Update history

Date	Version	Author	Update content
2025-08-15	1.0.0	lxowalle	Initial version

Introduction

MeloTTS is a high-quality multilingual text-to-speech library jointly developed by MIT and MyShell.ai. Currently, the melotts-zh model is supported. It can synthesize both Chinese and English speech, although its English synthesis quality is not yet optimal.

The default output audio is PCM data with a sample rate of 44100 Hz, single channel, and 16-bit depth.

Sample rate: The number of times sound is sampled per second.

Channels: The number of audio channels captured per sample. Single channel means mono audio, and dual channel means stereo (left and right channels). To reduce AI inference complexity, single-channel audio is generally used.

Bit depth: The number of bits used to represent each sample. At 16-bit depth, each sample is usually stored as a 16-bit signed integer. A higher bit depth captures finer audio detail.
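The three parameters above determine how much data one second of audio occupies and how a raw buffer maps to sample values. The following is a minimal sketch in plain Python (no MaixPy required). The helper names are hypothetical, and the little-endian byte order used for decoding is an assumption, not something stated in this guide.

```python
import struct

def pcm_byte_rate(sample_rate=44100, channels=1, bit_depth=16):
    """Bytes of raw PCM needed for one second of audio."""
    return sample_rate * channels * (bit_depth // 8)

def pcm_duration_seconds(num_bytes, sample_rate=44100, channels=1, bit_depth=16):
    """Playback duration of a raw PCM buffer that is num_bytes long."""
    return num_bytes / pcm_byte_rate(sample_rate, channels, bit_depth)

def decode_int16_samples(pcm_bytes):
    """Unpack 16-bit signed samples (little-endian order is assumed here)."""
    count = len(pcm_bytes) // 2
    return list(struct.unpack("<%dh" % count, pcm_bytes[:count * 2]))

# Three hand-made samples standing in for real synthesized audio.
demo = struct.pack("<3h", 0, 32767, -32768)

print(pcm_byte_rate())              # 44100 * 1 * 2 = 88200 bytes per second
print(pcm_duration_seconds(88200))  # 1.0
print(decode_int16_samples(demo))   # [0, 32767, -32768]
```

With the default format, a buffer of 88200 bytes therefore corresponds to one second of mono audio.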

Downloading the Model

Supported models:

Model	Platform	Memory Requirement	Description
melotts-maixcam2	MaixCAM2	1 GB	base

Refer to the Large Model User Guide to download the model.
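After downloading, it can help to confirm that the model file is where the script expects it before trying to load it. The sketch below uses only the Python standard library. The helper name is hypothetical, and the path is the one used in the example on this page, so adjust it if the model was saved elsewhere.

```python
import os

# Path taken from the example on this page; change it if the model lives elsewhere.
MODEL_PATH = "/root/models/melotts-maixcam2/melotts-zh.mud"

def model_is_present(path=MODEL_PATH):
    """Return True if the model description file exists at the given path."""
    return os.path.isfile(path)

if not model_is_present():
    print("Model file not found:", MODEL_PATH)
```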

Running the Model with MaixPy

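from maix import nn, audio

# Only MaixCAM2 supports this model.
# The player's sample rate must match the model's 44100 Hz output.
sample_rate = 44100
p = audio.Player(sample_rate=sample_rate)
p.volume(80)  # Output volume, range 0 to 100

# Load the melotts-zh model, setting the speech speed and the language.
melotts = nn.MeloTTS(model="/root/models/melotts-maixcam2/melotts-zh.mud", speed=0.8, language='zh')

# Synthesize the text, return the result as PCM data, then play it.
pcm = melotts.infer('你好', output_pcm=True)
p.play(pcm)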

Notes:

Import the nn module first to create a MeloTTS model object:

from maix import nn

Choose the model to load. Currently, the melotts-zh model is supported:
- speed sets the speed of the synthesized speech
- language sets the synthesis language

melotts = nn.MeloTTS(model="/root/models/melotts-maixcam2/melotts-zh.mud", speed=0.8, language='zh')

Start inference:
- The text to synthesize here is 'hello'
- Set output_pcm=True to return PCM data

pcm = melotts.infer('hello', output_pcm=True)

Use the audio playback module to play the generated audio:
- Make sure the player's sample rate matches the model’s output (44100 Hz)
- Use p.volume(80) to control the output volume (range: 0–100)
- Play the PCM generated by MeloTTS with p.play(pcm)

p = audio.Player(sample_rate=sample_rate)
p.volume(80)
p.play(pcm)
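Instead of playing the audio immediately, the PCM data can also be wrapped in a WAV container so it can be inspected on a PC. This is a minimal sketch using Python's standard wave module with the format described in the introduction (44100 Hz, mono, 16-bit). The helper name is hypothetical, and it assumes the PCM data is available as a Python bytes object. How the value returned by melotts.infer converts to bytes is not covered in this guide, so a buffer of silence stands in for it here.

```python
import os
import tempfile
import wave

def save_pcm_as_wav(pcm_bytes, path, sample_rate=44100, channels=1, sample_width=2):
    """Write raw PCM bytes to a WAV file (sample_width is in bytes, 2 = 16-bit)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)

# Half a second of silence as a stand-in for synthesized speech.
silence = b"\x00\x00" * 22050
out_path = os.path.join(tempfile.gettempdir(), "melotts_demo.wav")
save_pcm_as_wav(silence, out_path)
print("Wrote", out_path)
```

The header values must match the real PCM format. A mismatched sample rate makes the speech play back too fast or too slow.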
