I started to look into Text-to-Speech (TTS) synthesis on microcontrollers, with the final goal of comparing different engines.
Since I did not want to bother connecting the microcontroller to an output device, I decided to just render the result to a web browser with an ESP32 before committing to any solution.
Unfortunately there are no Arduino engines that provide the result as a stream, so I started to “extend” some projects. The first solution is “SAM”:
I created this project with the intention of providing SAM as an Arduino library that has a simple API and supports different output alternatives.
SAM is a very small Text-To-Speech (TTS) program written in C that runs on most popular platforms. It is an adaptation to C of the speech software SAM (Software Automatic Mouth) for the Commodore C64, published in 1982 by Don’t Ask Software (now SoftVoice, Inc.). It includes a Text-To-Phoneme converter called reciter and a Phoneme-To-Speech routine for the final output. It is so small that it also works on embedded computers.
The Arduino sketch for the web server is quite small because I am using my arduino-audio-tools. SAM writes directly to the WebClient stream in a callback:
#include "AudioServer.h"
#include "sam_arduino.h"

using namespace audio_tools;

AudioWAVServer server("ssid","password");
int channels = 1;
int bits_per_sample = 8;

// Callback which provides the audio data
void outputData(Stream &out){
  Serial.print("providing data...");
  SAM sam(out, false);
  sam.setOutputChannels(channels);
  sam.setOutputBitsPerSample(bits_per_sample);
  sam.say("hallo, I am SAM");
}

void setup(){
  Serial.begin(115200);
  // start data sink - provide a callback
  server.begin(outputData, SAM::sampleRate(), channels, bits_per_sample);
}

// Arduino loop
void loop() {
  // Handle new connections
  server.doLoop();
}
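The sample rate, channel count and bits per sample are passed to server.begin() because a WAV stream has to announce them to the browser in its header. As a rough illustration of where these three values end up, here is a minimal sketch of a canonical 44-byte PCM WAV header in plain C++. This is not the code used by arduino-audio-tools: the function name makeWavHeader is made up for this example, and the 22050 Hz rate in the usage note is an assumption, not a value taken from the library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of arduino-audio-tools).
// Builds a canonical 44-byte PCM WAV header; all numeric fields are little-endian.
std::array<uint8_t, 44> makeWavHeader(uint32_t sampleRate, uint16_t channels,
                                      uint16_t bitsPerSample, uint32_t dataBytes) {
  assert(channels > 0 && bitsPerSample % 8 == 0);
  std::array<uint8_t, 44> h{};

  // Write a 32-bit or 16-bit value byte by byte, lowest byte first
  auto put32 = [&h](std::size_t pos, uint32_t v) {
    for (int i = 0; i < 4; i++) h[pos + i] = (v >> (8 * i)) & 0xFF;
  };
  auto put16 = [&h](std::size_t pos, uint16_t v) {
    h[pos] = v & 0xFF;
    h[pos + 1] = (v >> 8) & 0xFF;
  };

  // Bytes per sample frame (all channels together)
  uint16_t blockAlign = static_cast<uint16_t>(channels * bitsPerSample / 8);

  std::memcpy(&h[0], "RIFF", 4);
  put32(4, 36 + dataBytes);            // size of everything after this field
  std::memcpy(&h[8], "WAVE", 4);
  std::memcpy(&h[12], "fmt ", 4);
  put32(16, 16);                       // size of the PCM fmt chunk
  put16(20, 1);                        // audio format 1 = uncompressed PCM
  put16(22, channels);
  put32(24, sampleRate);
  put32(28, sampleRate * blockAlign);  // bytes per second
  put16(32, blockAlign);
  put16(34, bitsPerSample);
  std::memcpy(&h[36], "data", 4);
  put32(40, dataBytes);                // size of the audio data that follows
  return h;
}
```

Calling makeWavHeader(22050, 1, 8, 1000) would describe 1000 bytes of 8-bit mono audio at an assumed 22050 Hz. For a live stream the total length is not known up front; how the library deals with that is not covered by this sketch.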
Well, the result took quite some time to generate (23 seconds), and it does not sound great.
I am afraid that this slowness is preventing I2S from working…
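The worry can be put into numbers: I2S consumes samples at a fixed rate, so the engine has to generate audio at least as fast as it is played back. Here is a minimal sketch of that check in plain C++. The function names are made up for illustration, and since the length of the generated audio is not measured in this post, the figures in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Playback duration in seconds of raw PCM data
double pcmSeconds(uint64_t dataBytes, uint32_t sampleRate,
                  uint16_t channels, uint16_t bitsPerSample) {
  uint32_t bytesPerSecond = sampleRate * channels * (bitsPerSample / 8);
  assert(bytesPerSecond > 0);
  return static_cast<double>(dataBytes) / bytesPerSecond;
}

// Real-time factor: generation time divided by playback time.
// A value above 1 means the audio is produced slower than it is consumed.
double realTimeFactor(double generationSeconds, double audioSeconds) {
  assert(audioSeconds > 0.0);
  return generationSeconds / audioSeconds;
}

// A fixed-rate sink can only be fed without gaps if the factor is at most 1
bool canFeedRealTime(double rtf) {
  return rtf <= 1.0;
}
```

If the sentence came out as, say, 44100 bytes of 8-bit mono audio at an assumed 22050 Hz, pcmSeconds would report 2 seconds, and 23 seconds of generation time would give a real-time factor of 11.5, far too slow for gap-free output. The 23 seconds measured here may also include the transfer to the browser, so the pure synthesis time could be lower.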
1 Comment
Jason · 5. September 2021 at 3:09
Sounds like the pitch and speed are doubled or more? Cut them in half, you might be surprised! Remember, the higher the value, the lower/slower the result.