MaixCAM MaixPy Keyword recognition
Update history
Date | Version | Author | Update content |
---|---|---|---|
2024-10-08 | 1.0.0 | 916BGAI | Initial document |
Introduction
MaixCAM has ported the Maix-Speech offline speech library, which provides continuous Chinese numeral recognition, keyword recognition, and large-vocabulary speech recognition. It can recognize audio in PCM and WAV formats, as well as live input from the onboard microphone.
Maix-Speech
Maix-Speech is an offline speech recognition library designed specifically for embedded environments. Its speech recognition algorithms have been heavily optimized, which significantly reduces memory usage while maintaining excellent recognition accuracy. For details, refer to the Maix-Speech documentation.
Keyword recognition
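from maix import app, nn

# Load the acoustic model and use the onboard microphone as input
speech = nn.Speech("/root/models/am_3332_192_int8.mud")
speech.init(nn.SpeechDevice.DEVICE_MIC)

# Keywords in Pinyin (syllables separated by spaces) and their thresholds
kw_tbl = ['xiao3 ai4 tong2 xue2',
          'ni3 hao3',
          'tian1 qi4 zen3 me yang4']
kw_gate = [0.1, 0.1, 0.1]

# Called by the kws decoder with one probability per registered keyword
def callback(data:list[float], len: int):
    for i in range(len):
        print(f"\tkw{i}: {data[i]:.3f};", end=' ')
    print("\n")

# Register the kws decoder with automatic near-sound processing enabled
speech.kws(kw_tbl, kw_gate, callback, True)

# Run recognition one frame at a time until exit or the input runs out
while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out\n")
        break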
Usage
- Import the app and nn modules
from maix import app, nn
- Load the acoustic model
speech = nn.Speech("/root/models/am_3332_192_int8.mud")
- You can also load the am_7332 acoustic model; larger models provide higher accuracy but consume more resources.
- Choose the corresponding audio device
speech.init(nn.SpeechDevice.DEVICE_MIC)
speech.init(nn.SpeechDevice.DEVICE_MIC, "hw:0,0") # Specify the audio input device
- The calls above use the onboard microphone. WAV and PCM audio are also supported as input.
speech.init(nn.SpeechDevice.DEVICE_WAV, "path/audio.wav") # Using WAV audio input
speech.init(nn.SpeechDevice.DEVICE_PCM, "path/audio.pcm") # Using PCM audio input
- Note that WAV input must have a 16 kHz sample rate and use the S16_LE storage format. You can use the arecord tool to record audio in this format.
arecord -d 5 -r 16000 -c 1 -f S16_LE audio.wav
- When recognizing PCM/WAV input, if you want to reset the data source, for example to recognize the next WAV file, you can use the speech.device method, which automatically clears the cache:
speech.device(nn.SpeechDevice.DEVICE_WAV, "path/next.wav")
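Before handing a file to DEVICE_WAV, it can help to confirm that it matches the format requirement above. The following is a minimal sketch that uses only Python's standard wave module. The helper name wav_problems is hypothetical and not part of MaixPy, and the mono check is an assumption taken from the -c 1 flag in the arecord command rather than a documented requirement.

```python
import wave


def wav_problems(path: str) -> list[str]:
    """Return a list of mismatches against 16 kHz, 16-bit PCM, mono."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != 16000:
            problems.append(f"sample rate is {wf.getframerate()} Hz, expected 16000")
        if wf.getsampwidth() != 2:  # 2 bytes per sample corresponds to S16_LE
            problems.append(f"sample width is {wf.getsampwidth()} bytes, expected 2")
        if wf.getnchannels() != 1:  # assumption: mono, as in the arecord example
            problems.append(f"channel count is {wf.getnchannels()}, expected 1")
    return problems
```

An empty list means the file matches the checked properties; otherwise each entry names one property that would need to be fixed before recognition.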
- Set up the decoder
kw_tbl = ['xiao3 ai4 tong2 xue2',
'ni3 hao3',
'tian1 qi4 zen3 me yang4']
kw_gate = [0.1, 0.1, 0.1]
def callback(data:list[float], len: int):
    for i in range(len):
        print(f"\tkw{i}: {data[i]:.3f};", end=' ')
    print("\n")
speech.kws(kw_tbl, kw_gate, callback, True)
Multiple decoders can be configured simultaneously. Once registered, the kws decoder outputs a list of probabilities for all registered keywords from the last frame. Users can observe these probability values and set their own activation thresholds.
When setting up the kws decoder, you need to provide a keyword list in which each keyword is written in Pinyin with its syllables separated by spaces, a keyword probability threshold list in the same order, and a flag that specifies whether to enable automatic near-sound processing. If the flag is set to True, different tones of the same Pinyin are treated as similar words and their probabilities are accumulated. Finally, you need to provide a callback function to handle the decoded data.
Users can also manually register near-sound words using the speech.similar method, with a maximum of 10 near-sound words registered for each Pinyin. (Note that registering near-sound words through this interface overrides the near-sound table generated by enabling automatic near-sound processing.)
similar_char = ['zhen3', 'zheng3']
speech.similar('zen3', similar_char)
- If a decoder is no longer needed, you can deinitialize it by calling the speech.dec_deinit method.
speech.dec_deinit(nn.SpeechDecoder.DECODER_KWS)
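The text above leaves the activation decision to the user: the callback receives one probability per keyword and the user applies a threshold. The following is a minimal sketch of such a callback in plain Python. The names make_kws_callback and on_hit are hypothetical helpers, not MaixPy APIs, and the sketch re-checks the thresholds itself because it does not assume that the decoder filters by kw_gate before invoking the callback.

```python
def make_kws_callback(keywords, thresholds, on_hit):
    """Build a callback with the (data, len) signature used by speech.kws.

    on_hit(index, keyword, prob) is a hypothetical user handler, called for
    every keyword whose probability reaches its threshold.
    """
    if len(keywords) != len(thresholds):
        raise ValueError("keywords and thresholds must have the same length")

    def callback(data, length):
        for i in range(min(length, len(keywords))):
            if data[i] >= thresholds[i]:
                on_hit(i, keywords[i], data[i])

    return callback


kw_tbl = ['xiao3 ai4 tong2 xue2', 'ni3 hao3', 'tian1 qi4 zen3 me yang4']
kw_gate = [0.1, 0.1, 0.1]
hits = []
callback = make_kws_callback(kw_tbl, kw_gate,
                             lambda i, kw, p: hits.append((i, kw, p)))

# Simulate one frame of decoder output; on the device the decoder makes this call.
callback([0.959, 0.000, 0.000], 3)
print(hits)
```

On the device, the returned function would be passed as the third argument, as in speech.kws(kw_tbl, kw_gate, callback, True).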
- Recognition
while not app.need_exit():
    frames = speech.run(1)
    if frames < 1:
        print("run out\n")
        break
Use the speech.run method to run speech recognition. Its parameter specifies the number of frames to run per call, and it returns the number of frames actually processed. You can run 1 frame at a time and perform other processing in between, or run continuously in a single thread and stop it from an external thread.
To clear the cache of recognized results, use the speech.clear method.
When switching decoders during recognition, the first frame after the switch may produce incorrect results. You can call speech.skip_frames(1) to skip that frame and ensure the accuracy of subsequent results.
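The frame-by-frame loop combines naturally with the speech.device method when several WAV files are recognized in sequence. The sketch below shows one way this might be arranged. The function recognize_files and its should_stop parameter are hypothetical. It assumes that speech is an nn.Speech object that has already been initialized and has a decoder registered, and it relies only on the device and run methods as described in this document.

```python
def recognize_files(speech, wav_device, paths, should_stop=lambda: False):
    """Run recognition over several WAV files, one after another.

    speech      -- an initialized nn.Speech object with a decoder registered
    wav_device  -- the device constant, e.g. nn.SpeechDevice.DEVICE_WAV
    should_stop -- callable returning True to abort, e.g. app.need_exit
    Returns a dict mapping each processed path to its frame count.
    """
    frames_per_file = {}
    for path in paths:
        if should_stop():
            break
        speech.device(wav_device, path)  # switch the data source
        total = 0
        while not should_stop():
            frames = speech.run(1)       # one frame per call
            if frames < 1:               # input exhausted
                break
            total += frames
        frames_per_file[path] = total
    return frames_per_file
```

On the device, a call might look like recognize_files(speech, nn.SpeechDevice.DEVICE_WAV, ["path/audio.wav", "path/next.wav"], app.need_exit), with results delivered through the registered decoder callback.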
Recognition Results
If the above program runs successfully, speaking into the onboard microphone will yield keyword recognition results, such as:
kws log 2.048s, len 24
decoder_kws_init get 3 kws
00, xiao3 ai4 tong2 xue2
01, ni3 hao3
02, tian1 qi4 zen3 me yang4
find shared memory(491520), saved:491520
kw0: 0.959; kw1: 0.000; kw2: 0.000; # xiao3 ai4 tong2 xue2
kw0: 0.000; kw1: 0.930; kw2: 0.000; # ni3 hao3
kw0: 0.000; kw1: 0.000; kw2: 0.961; # tian1 qi4 zen3 me yang4
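When such output is captured on a host machine, for example to check a test recording, the probability lines can be parsed back into numbers. The following is a small sketch using Python's re module. The helpers parse_kws_line and best_keyword are hypothetical, and the pattern assumes the exact "kwN: P;" layout printed by the example callback.

```python
import re

# Matches fields such as "kw1: 0.930;" as printed by the example callback
_KW_FIELD = re.compile(r"kw(\d+):\s*([0-9]*\.?[0-9]+);")


def parse_kws_line(line: str) -> dict[int, float]:
    """Map keyword index to probability; empty if the line has no kw fields."""
    return {int(i): float(p) for i, p in _KW_FIELD.findall(line)}


def best_keyword(line: str, keywords: list[str]):
    """Return (keyword, probability) for the highest-scoring field, or None."""
    probs = parse_kws_line(line)
    if not probs:
        return None
    idx = max(probs, key=probs.get)
    return keywords[idx], probs[idx]
```

Lines that carry no probability fields, such as the decoder initialization messages, yield an empty dictionary and can be skipped.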