Running the MeloTTS Model on MaixPy MaixCAM

Update history
| Date | Version | Author | Update content |
| --- | --- | --- | --- |
| 2025-08-15 | 1.0.0 | lxowalle | Initial version |

Introduction

MeloTTS is a high-quality multilingual text-to-speech library jointly developed by MIT and MyShell.ai. MaixPy currently supports the melotts-zh model, which can synthesize both Chinese and English speech, although its English synthesis is not yet optimal.

By default, the model outputs raw PCM audio with a sample rate of 44100 Hz, a single channel (mono), and 16-bit depth.

Sample rate: The number of times sound is sampled per second.

Channels: The number of audio channels captured per sample. Single channel means mono audio, and dual channel means stereo (left and right channels). To reduce AI inference complexity, single-channel audio is generally used.

Bit depth: The number of bits used to represent each sample. At 16-bit depth, each sample is usually stored as a 16-bit signed integer. Higher bit depth captures finer audio detail.
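To make these three parameters concrete, the sketch below works out how many bytes one second of raw PCM occupies in the default format, and how long a buffer of a given size plays. The helper names `pcm_bytes_per_second` and `pcm_duration_seconds` are illustrative only and are not part of MaixPy.

```python
# Illustrative helpers (not part of MaixPy) for reasoning about raw PCM sizes.

def pcm_bytes_per_second(sample_rate=44100, channels=1, bit_depth=16):
    """Bytes of raw PCM produced per second of audio."""
    return sample_rate * channels * (bit_depth // 8)


def pcm_duration_seconds(num_bytes, sample_rate=44100, channels=1, bit_depth=16):
    """Playback duration of a raw PCM buffer that is num_bytes long."""
    return num_bytes / pcm_bytes_per_second(sample_rate, channels, bit_depth)


# Default MeloTTS output format: 44100 Hz, mono, 16-bit.
print(pcm_bytes_per_second())        # 88200
print(pcm_duration_seconds(176400))  # 2.0
```

The same arithmetic shows why the player's sample rate has to match the data: a buffer played at half its true sample rate lasts twice as long and sounds lower in pitch.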

Downloading the Model

Supported models:

| Model | Platform | Memory Requirement | Description |
| --- | --- | --- | --- |
| melotts-maixcam2 | MaixCAM2 | 1 GB | base |

Refer to the Large Model User Guide to download the model.

Running the Model with MaixPy

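from maix import nn, audio

# Only MaixCAM2 supports this model.
# The player's sample rate must match the model's 44100 Hz output.
sample_rate = 44100
p = audio.Player(sample_rate=sample_rate)
p.volume(80)  # output volume, range 0-100

# Load the melotts-zh model.
melotts = nn.MeloTTS(model="/root/models/melotts-maixcam2/melotts-zh.mud", speed=0.8, language='zh')

# Synthesize the text, request PCM output, and play it.
pcm = melotts.infer('你好', output_pcm=True)
p.play(pcm)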

Notes:

  1. First import the nn module, which is needed to create a MeloTTS model object:
from maix import nn
  2. Choose the model to load. Currently, the melotts-zh model is supported:
    • speed sets the speed of the synthesized speech
    • language sets the language type
melotts = nn.MeloTTS(model="/root/models/melotts-maixcam2/melotts-zh.mud", speed=0.8, language='zh')
  3. Start inference:
    • The text to synthesize here is '你好' ("hello")
    • Set output_pcm=True to return PCM data
pcm = melotts.infer('你好', output_pcm=True)
  4. Use the audio playback module to play the generated audio:
    • Make sure the player's sample rate matches the model’s output (44100 Hz)
    • Use p.volume(80) to set the output volume (range: 0–100)
    • Play the PCM data generated by MeloTTS with p.play(pcm)
p = audio.Player(sample_rate=sample_rate)
p.volume(80)
p.play(pcm)
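Because the output is raw PCM with a known format, it can also be wrapped in a WAV container for later playback or inspection. The following is a minimal sketch using only Python's standard `wave` module. The helper `save_pcm_as_wav` and the demo file name are hypothetical, and the sketch assumes the buffer returned by `infer` can be converted with `bytes(pcm)`, which should be checked on the device.

```python
import os
import tempfile
import wave


def save_pcm_as_wav(pcm, path, sample_rate=44100, channels=1, bit_depth=16):
    """Wrap raw PCM data in a WAV header and write it to path."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(bit_depth // 8)  # bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))  # assumes pcm is bytes-like


# Stand-in data so the sketch runs anywhere: 0.5 s of silence
# (22050 samples * 2 bytes each). On the device, pass the buffer
# returned by melotts.infer(..., output_pcm=True) instead.
silence = bytes(22050 * 2)
demo_path = os.path.join(tempfile.gettempdir(), "melotts_demo.wav")
save_pcm_as_wav(silence, demo_path)
```

The defaults mirror the model's output format (44100 Hz, mono, 16-bit), so a WAV file written this way plays at the correct speed in any standard audio player.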