Speech synthesis, also referred to as text-to-speech (TTS), is the opposite task of speech recognition (also referred to as speech-to-text): given a text input, it produces artificial, human-like speech audio.

Speech synthesis consists of two main parts: input preprocessing and audio generation.

Input preprocessing

In the preprocessing part, the input text is converted into a list of phonemes plus prosody control information.

Normalization

In this phase, text elements such as acronyms, special characters, numbers, emails, internet URLs and similar elements are normalized to plain text. For static elements like symbols and numbers, a hash map can be used for the conversion. For other elements, mostly acronyms, the task is not trivial, since the same element can be read in more than one way depending on the context (SQL can be read as S-Q-L, "sequel" or Structured Query Language).
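A minimal sketch of the hash-map approach for the static elements (symbol and digit tables are illustrative; a real normalizer would also verbalize multi-digit numbers, dates, currency amounts, and disambiguate acronyms from context):

```python
import re

# Static lookup tables: fixed symbols and digits map directly to words
SYMBOLS = {"&": "and", "%": "percent", "+": "plus"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text):
    # Replace each known symbol with its spoken form
    for sym, word in SYMBOLS.items():
        text = text.replace(sym, f" {word} ")
    # Read numbers digit by digit (a real system would verbalize "25" as "twenty-five")
    text = re.sub(r"\d", lambda m: DIGITS[m.group()] + " ", text)
    # Collapse any extra whitespace introduced by the replacements
    return " ".join(text.split())

normalize("R&D grew 5%")  # "R and D grew five percent"
```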

Prosody

Prosody is defined as the collection of characteristics that make speech sound less monotonic and more natural. These characteristics include volume, speed, pitch variation and pauses. Prosody makes speech more human-like and easier to understand.

In this preprocessing phase, prosodic information is added to the speech that has to be generated. This is the most difficult part of the entire speech synthesis process.
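One way to picture the output of this phase is an intermediate representation where each phoneme carries prosodic targets that the audio generation stage will realize. This is a hypothetical sketch (field names and values are illustrative, not from any particular system):

```python
# Each entry pairs a phoneme with prosodic targets: duration, pitch,
# and relative energy. "pau" marks a pause, which has no pitch.
phonemes = [
    {"phoneme": "HH",  "duration_ms": 60,  "pitch_hz": 120,  "energy": 0.8},
    {"phoneme": "AY",  "duration_ms": 140, "pitch_hz": 135,  "energy": 1.0},
    {"phoneme": "pau", "duration_ms": 200, "pitch_hz": None, "energy": 0.0},
]

# Total utterance length follows from the per-phoneme durations
total_ms = sum(p["duration_ms"] for p in phonemes)
```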

Graphemes-to-Phonemes conversion

The last preprocessing step consists of converting the written text from graphemes to phonemes. This process is based on a set of rules and an optional grapheme-phoneme mapping for common words. The process also varies significantly with the chosen language.

Nowadays, generative models can perform this conversion using deep learning techniques.
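The rule-plus-dictionary approach can be sketched like this (the lexicon entries and letter rules are toy examples using ARPAbet-style symbols, not a real phoneme inventory):

```python
# Exception dictionary: common words whose pronunciation the rules get wrong
LEXICON = {"speech": ["S", "P", "IY", "CH"]}

# Naive fallback: one letter -> one phoneme (real rules are context-sensitive)
LETTER_RULES = {"c": "K", "a": "AE", "t": "T"}

def g2p(word):
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]          # dictionary lookup for known words
    # Otherwise apply per-letter rules, keeping unknown letters as-is
    return [LETTER_RULES.get(ch, ch.upper()) for ch in word]

g2p("speech")  # ['S', 'P', 'IY', 'CH']
g2p("cat")     # ['K', 'AE', 'T']
```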

todo: isn’t this part of the generation?

Audio generation

The generation of the actual waveform that models the speech can be done with three main approaches.

Concatenative synthesis

In concatenative synthesis, segments of recorded real speech (of high quality, coming from a single speaker) are concatenated, with some processing, to create the final waveform. Signal processing methods such as MBROLA (Multi-Band Resynthesis OverLap-Add) are used to combine phonemes and add prosodic information.
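The simplest form of such processing is an overlap-add-style join: the end of one segment is crossfaded into the start of the next so the seam is less audible. A pure-Python sketch (real systems like MBROLA also do pitch-synchronous, multi-band processing):

```python
def crossfade_concat(a, b, overlap):
    # Join two waveform segments (lists of samples) with a linear
    # crossfade over `overlap` samples: `a` fades out while `b` fades in.
    fade = [i / overlap for i in range(overlap)]
    mixed = [a[len(a) - overlap + i] * (1 - fade[i]) + b[i] * fade[i]
             for i in range(overlap)]
    return a[:len(a) - overlap] + mixed + b[overlap:]

# Toy segments standing in for two recorded speech units
out = crossfade_concat([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=2)
# Output length: len(a) + len(b) - overlap
```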

The challenging part is modelling co-articulation: concatenated phonemes need to match each other in a natural way.

One solution is to record diphones, units that start from the middle of one phoneme and continue until the middle of the next. It is also possible to use three consecutive phonemes (called triphones).

The problem with this approach is that if we have n phonemes, then we could have at most n² diphones and n³ triphones (at most, since not all combinations make sense), which makes the recording part very expensive for a large phoneme set.
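To make the growth concrete, taking roughly 44 phonemes for English (the exact count depends on the dialect and phoneme inventory chosen):

```python
n = 44                  # approximate number of English phonemes
diphones = n ** 2       # every ordered pair of phonemes
triphones = n ** 3      # every ordered triple
print(diphones, triphones)  # 1936 85184
```

So moving from diphones to triphones multiplies the (upper bound on the) number of units to record by another factor of n.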

Formant synthesis

A formant is a characteristic frequency component that determines the quality of a speech sound.

In formant synthesis the waveform is not generated from human speech samples as in concatenative synthesis; instead, a set of phonological rules controls the generation of the audio signal.

Parameters such as fundamental frequency, voicing and noise levels are varied over time to create a waveform of artificial speech.1
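A minimal source-filter sketch of this idea: a glottal pulse train at the fundamental frequency is passed through resonators tuned to formant frequencies. All numbers here (sample rate, f0, the two formant frequencies and bandwidths loosely resembling a vowel /a/) are illustrative:

```python
import math

def resonator(signal, freq, bandwidth, fs):
    # Two-pole IIR resonator centered at `freq` Hz with the given bandwidth:
    # y[n] = x[n] + a1*y[n-1] + a2*y[n-2], poles at radius r, angle theta
    r = math.exp(-math.pi * bandwidth / fs)
    theta = 2 * math.pi * freq / fs
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y = [0.0, 0.0]
    for x in signal:
        y.append(x + a1 * y[-1] + a2 * y[-2])
    return y[2:]

fs = 16000   # sample rate in Hz
f0 = 120     # fundamental frequency: rate of the glottal pulses

# Voiced source: an impulse train at f0 (a quarter second of audio)
source = [1.0 if n % (fs // f0) == 0 else 0.0 for n in range(fs // 4)]

# Cascade resonators at two formant frequencies to shape the spectrum
wave = source
for formant_freq, bw in [(700, 130), (1220, 70)]:
    wave = resonator(wave, formant_freq, bw, fs)
```

Varying f0, the formant frequencies, and the mix of pulse versus noise in the source over time is what produces different (if robotic-sounding) phonemes.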

The main problem with formant synthesis is sound quality: the output lacks naturalness, although it is generally intelligible.

Since they do not require a database of human speech templates, formant synthesis algorithms are well suited to embedded systems.1

Articulatory synthesis

In articulatory synthesis, the algorithm tries to generate human speech at a much more concrete level than formant synthesis. It uses a set of computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. 2

This technique is complex and requires a lot of computational resources. It generates speech by controlling the speech articulators (jaw, tongue, lips, etc.). Changing an articulator's position changes the shape of the vocal tract, and hence the type of speech produced.

An example of an articulatory synthesis implementation is Gnuspeech (an example of the result is here: Gnuspeech on NeXTSTEP - YouTube).


tags: hci

Footnotes

  1. Speech synthesis > Formant Synthesis - Wikipedia

  2. Speech synthesis > Articulatory Synthesis - Wikipedia