Last year I dug into Arduino Based TTS Solutions and concluded that the available engines do not provide good-quality audio. I therefore recommended an approach based on recorded audio samples.
Today I took the opportunity to create the new arduino-simple-tts project, which is based on this approach. As a proof of concept I have implemented speaking numbers and announcing the time. All the relevant words have been recorded as MP3 files and are stored in program memory.
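To make the idea of speaking numbers from recorded words more concrete, here is a minimal sketch of how a number could be split into the words that need a recording. This is not the code used in the project: the function name numberToWords, the upper-case word keys, and the limit of 0 to 99 are all assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not the library's implementation): splits a number
// in the range 0..99 into the words for which a recording would be needed.
std::vector<std::string> numberToWords(int n) {
  static const char* kOnes[] = {
      "ZERO",    "ONE",     "TWO",       "THREE",    "FOUR",
      "FIVE",    "SIX",     "SEVEN",     "EIGHT",    "NINE",
      "TEN",     "ELEVEN",  "TWELVE",    "THIRTEEN", "FOURTEEN",
      "FIFTEEN", "SIXTEEN", "SEVENTEEN", "EIGHTEEN", "NINETEEN"};
  static const char* kTens[] = {"",      "",      "TWENTY",  "THIRTY", "FORTY",
                                "FIFTY", "SIXTY", "SEVENTY", "EIGHTY", "NINETY"};

  std::vector<std::string> words;
  if (n < 0 || n > 99) return words;  // out of range: nothing to say

  if (n < 20) {
    words.push_back(kOnes[n]);  // 0..19 each have their own word
    return words;
  }
  words.push_back(kTens[n / 10]);                    // e.g. "FORTY"
  if (n % 10 != 0) words.push_back(kOnes[n % 10]);   // e.g. "TWO"
  return words;
}
```

With this split, 42 becomes the two words "FORTY" and "TWO", so only 28 recordings are needed to cover all numbers from 0 to 99.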
Arduino Sketch
Here is an example sketch which implements a speaking clock:
#include "SimpleTTS.h"
#include "AudioCodecs/CodecMP3Helix.h"
#include "AudioLibs/AudioKit.h"
#include "TimeInfo.h"
// Output
TimeToText ttt;
AudioKitStream i2s; // replace with alternative Audio Sink if needed: AnalogAudioStream, I2SStream etc.
MP3DecoderHelix mp3;
AudioDictionary dictionary(ExampleAudioDictionaryValues);
TextToSpeech tts(ttt, i2s, mp3, dictionary);
// Determine Time
TimeInfo timeInfo;
const char* ssid = "SSID";
const char* password = "password";
void setup() {
  Serial.begin(115200);
  AudioLogger::instance().begin(Serial, AudioLogger::Info);
  // setup i2s
  auto cfg = i2s.defaultConfig();
  cfg.sample_rate = 24000;
  cfg.channels = 1;
  i2s.begin(cfg);
  // We announce the time only every 5 minutes
  timeInfo.setEveryMinutes(5);
  // start WIFI and time
  timeInfo.begin(ssid, password);
  ttt.say(timeInfo.time());
}
void loop() {
  // speech output
  if (timeInfo.update()) {
    ttt.say(timeInfo.time());
  }
}
The TimeToText class translates the time into words, which are the input to the TextToSpeech class that handles the audio. TextToSpeech is based on my Arduino Audio Tools library, so we need to feed it with an audio sink and an MP3 decoder. The audio samples are looked up with the help of the AudioDictionary. As part of the sketch I have implemented the TimeInfo class, which just retrieves the time from a time server and determines whether we need to announce a new time.
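The dictionary lookup can be pictured as a simple table that maps each word to the bytes of its recording. The following is only a sketch under my own assumptions: SampleEntry, kSamples and findSample are hypothetical names, the byte arrays are placeholders rather than real MP3 files, and the actual AudioDictionary may be organized differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical dictionary entry: a word and the MP3 bytes recorded for it.
struct SampleEntry {
  const char* word;
  const uint8_t* data;
  size_t size;
};

// Placeholder bytes only; a real table would hold complete MP3 files.
static const uint8_t kOneMp3[] = {0xFF, 0xFB, 0x90};
static const uint8_t kTwoMp3[] = {0xFF, 0xFB, 0x90, 0x64};

static const SampleEntry kSamples[] = {
    {"ONE", kOneMp3, sizeof(kOneMp3)},
    {"TWO", kTwoMp3, sizeof(kTwoMp3)},
};

// Returns the entry for a word, or nullptr if no recording exists for it.
const SampleEntry* findSample(const char* word) {
  for (const SampleEntry& entry : kSamples) {
    if (std::strcmp(entry.word, word) == 0) return &entry;
  }
  return nullptr;
}
```

Each word produced for the time would be looked up this way, and the bytes found would then be passed on to the MP3 decoder. A word without a recording yields nullptr, so the caller can skip it or log a warning.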
The full source code is available on GitHub.
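The check for whether a new time needs to be announced can be sketched independently of WiFi and the time server. The class below is a hypothetical stand-in and not the TimeInfo class from the sketch: the name AnnouncementTimer and the method shouldAnnounce are assumptions, and it simply reports true once for every minute that is a multiple of the configured interval.

```cpp
// Hypothetical stand-in for the "announce only every N minutes" check.
class AnnouncementTimer {
 public:
  explicit AnnouncementTimer(int everyMinutes)
      : everyMinutes_(everyMinutes > 0 ? everyMinutes : 1) {}

  // Returns true once per minute that is a multiple of the interval.
  bool shouldAnnounce(int hour, int minute) {
    if (minute % everyMinutes_ != 0) return false;  // not on the interval
    int stamp = hour * 60 + minute;                 // minute of the day
    if (stamp == lastStamp_) return false;          // already announced
    lastStamp_ = stamp;
    return true;
  }

 private:
  int everyMinutes_;
  int lastStamp_ = -1;  // minute of the day of the last announcement
};
```

Remembering the last announced minute matters because loop() runs many times per minute: without it the clock would repeat the same announcement until the minute is over.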
Memory Requirements
The sketch, including the audio data, uses only 37% of the program storage:
Sketch uses 1171698 bytes (37%) of program storage space. Maximum is 3145728 bytes.
Global variables use 48232 bytes (14%) of dynamic memory, leaving 279448 bytes for
I think this is quite impressive: we have plenty of headroom before we need to resort to storing the samples on an SD card.
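The headroom can be worked out directly from the numbers in the build output. The helper names below are my own, and the average size of a recording is left as a parameter because I have not measured it.

```cpp
#include <cstdint>

// Figures taken from the build output above.
constexpr uint32_t kSketchBytes = 1171698;  // program storage used
constexpr uint32_t kFlashBytes = 3145728;   // program storage available

// Bytes still free for additional recordings.
constexpr uint32_t headroomBytes(uint32_t used, uint32_t total) {
  return total - used;
}

// Usage in whole percent, rounded down like the build output.
constexpr uint32_t percentUsed(uint32_t used, uint32_t total) {
  return static_cast<uint32_t>(static_cast<uint64_t>(used) * 100 / total);
}

// How many more recordings fit, for an assumed average recording size.
constexpr uint32_t additionalSamples(uint32_t headroom, uint32_t avgSampleBytes) {
  return avgSampleBytes == 0 ? 0 : headroom / avgSampleBytes;
}

// headroomBytes(kSketchBytes, kFlashBytes) is 1974030 bytes,
// percentUsed(kSketchBytes, kFlashBytes) is 37.
```

So roughly 1.9 MB of program storage remain for further words before an SD card becomes necessary.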
Next Steps
I see three things that could be improved:
- There are some unnaturally long gaps between some numbers: we could filter them out.
- We need some functionality that helps us to record the text input.
- It would be cool to extend the example to support speech recognition, so that it replies to the request: “what’s the time?”
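For the first point, one possible way to filter out the gaps is to cut the silence at the beginning and end of each recording. The sketch below assumes that the gaps come from quiet lead-in and lead-out sections of the recordings, and that we work on decoded 16-bit PCM samples rather than on the MP3 bytes. The function name trimSilence and the threshold parameter are hypothetical and not part of the project.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: removes leading and trailing samples whose absolute
// value is at or below the threshold. Quiet samples inside the word are kept.
std::vector<int16_t> trimSilence(const std::vector<int16_t>& pcm,
                                 int16_t threshold) {
  auto isQuiet = [threshold](int16_t sample) {
    return std::abs(static_cast<int>(sample)) <= threshold;
  };

  std::size_t begin = 0;
  std::size_t end = pcm.size();
  while (begin < end && isQuiet(pcm[begin])) ++begin;  // skip leading silence
  while (end > begin && isQuiet(pcm[end - 1])) --end;  // skip trailing silence

  return std::vector<int16_t>(pcm.begin() + begin, pcm.begin() + end);
}
```

The same trimming could also be done once on the recordings before they are converted to MP3, which would avoid any extra work on the microcontroller.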
1 Comment
Len Struttmann · 13. January 2023 at 2:58
I got this to work! Thanks for posting this, it will save me HOURS of time. I’m building a Talking Thermometer.