To adjust speech speed and pitch in text-to-speech (TTS) systems, developers can combine markup languages, API parameters, and model-specific controls. Most modern TTS platforms, such as Amazon Polly, Google Cloud Text-to-Speech, and Microsoft Azure Speech, support Speech Synthesis Markup Language (SSML). The SSML <prosody> tag lets you modify speed (e.g., rate="fast" or rate="150%") and pitch (e.g., pitch="high" or pitch="+5%"). For example, <prosody rate="slow" pitch="low">Hello</prosody> generates slower, deeper speech. APIs often expose direct parameters as well, such as Google’s speaking_rate (e.g., 1.5 for 50% faster) or pitch (e.g., -3.0 to lower the voice by three semitones). Open-source tools like MaryTTS or Festival allow similar adjustments through configuration files or code.
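As a rough illustration of the SSML route, the sketch below builds a <prosody> string in Python using only the standard library. The helper name prosody_ssml is hypothetical and not part of any vendor SDK. The resulting string would still need to be passed to whichever TTS client you use, and each platform accepts its own subset of rate and pitch values.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody_ssml(text, rate=None, pitch=None):
    """Wrap text in an SSML <prosody> element inside a <speak> root.

    Hypothetical helper for illustration: rate and pitch are passed through
    as-is (e.g. "slow", "150%", "low", "+5%"); the TTS engine validates them.
    """
    attrs = ""
    if rate is not None:
        attrs += f" rate={quoteattr(rate)}"
    if pitch is not None:
        attrs += f" pitch={quoteattr(pitch)}"
    # Escape &, <, > so user text cannot break the markup.
    return f"<speak><prosody{attrs}>{escape(text)}</prosody></speak>"


ssml = prosody_ssml("Hello & welcome", rate="slow", pitch="low")
print(ssml)
```

Escaping the text matters here because an unescaped ampersand or angle bracket in user input would make the SSML document invalid.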
Post-processing audio output is another approach. After generating speech, developers can use digital signal processing (DSP) libraries to modify speed and pitch independently. For instance, librosa in Python provides time_stretch to change speed without altering pitch and pitch_shift to change pitch without altering duration. Tools like FFmpeg can apply filters to WAV or MP3 files: atempo changes speed while preserving pitch, and asetrate changes the playback sample rate, which shifts pitch and speed together, so it is usually paired with aresample and atempo to restore the original duration. However, these methods may introduce artifacts if overused. For example, a single atempo stage set above 2.0 skips samples rather than blending them, so large speed-ups are typically split across chained stages, while extreme pitch shifts can make voices sound robotic. This approach is useful when fine-grained control is needed beyond what the TTS engine natively supports.
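The FFmpeg filters mentioned above can be composed programmatically. The following sketch only builds filter strings and does not invoke FFmpeg. The function names are hypothetical, and the 0.5 to 2.0 per-stage range is a conservative assumption meant to keep each atempo stage in its blending regime.

```python
def atempo_chain(speed):
    """Split a speed factor into atempo stages that each stay within 0.5-2.0."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    stages = []
    while speed > 2.0:          # peel off 2x stages for large speed-ups
        stages.append(2.0)
        speed /= 2.0
    while speed < 0.5:          # peel off 0.5x stages for large slow-downs
        stages.append(0.5)
        speed /= 0.5
    stages.append(speed)        # remaining factor is now within range
    return ",".join(f"atempo={s:g}" for s in stages)


def pitch_shift_filter(semitones, sample_rate):
    """Shift pitch by resampling, then undo the resulting tempo change."""
    ratio = 2.0 ** (semitones / 12.0)   # frequency ratio for the shift
    return (
        f"asetrate={round(sample_rate * ratio)},"  # raises/lowers pitch and speed
        f"aresample={sample_rate},"                # back to the original rate
        + atempo_chain(1.0 / ratio)                # restore original duration
    )


print(atempo_chain(3.0))
print(pitch_shift_filter(12, 22050))
```

A string produced this way could be passed to FFmpeg's audio filter option, for example ffmpeg -i in.wav -filter:a "<filter string>" out.wav, where in.wav and out.wav are placeholder file names.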
Finally, custom TTS models enable deeper adjustments. Training or fine-tuning models like Tacotron 2 or FastSpeech allows developers to build speed and pitch control directly into the synthesis process. For example, FastSpeech uses a duration predictor to set the length of each phoneme, which determines speaking rate. Scaling the predicted durations with a multiplier during inference speeds up or slows down the output. Similarly, shifting or scaling the fundamental frequency (F0) contour in models that predict it explicitly, such as FastPitch, alters pitch before a vocoder like WaveGlow renders the waveform. Platforms like Coqui TTS or NVIDIA’s NeMo provide APIs to tweak these variables programmatically. This method requires more technical effort but can produce more precise, natural-sounding results than post-processing.
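To make the duration-multiplier idea concrete, here is a simplified, framework-free sketch of a FastSpeech-style length regulator and a semitone-based F0 shift. It works on plain Python lists rather than model tensors. The function names, the example phonemes, and the minimum of one frame per phoneme are all assumptions made for illustration, not the API of any particular toolkit.

```python
def length_regulate(phonemes, durations, alpha=1.0):
    """Repeat each phoneme for its predicted frame count scaled by alpha.

    alpha > 1.0 lengthens phonemes (slower speech); alpha < 1.0 shortens them.
    """
    expanded = []
    for ph, d in zip(phonemes, durations):
        # Round half up; assumption: every phoneme keeps at least one frame.
        frames = max(1, int(d * alpha + 0.5))
        expanded.extend([ph] * frames)
    return expanded


def shift_f0(f0_hz, semitones):
    """Scale voiced F0 values by a semitone offset; 0.0 marks unvoiced frames."""
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


phonemes = ["HH", "AH", "L", "OW"]   # made-up example sequence
durations = [4, 6, 2, 8]             # made-up predicted frames per phoneme

normal = length_regulate(phonemes, durations)             # 20 frames
faster = length_regulate(phonemes, durations, alpha=0.5)  # 10 frames
print(len(normal), len(faster))
print(shift_f0([100.0, 0.0, 110.0], 12))
```

In a real model the expanded sequence would be hidden states fed to the decoder, and the shifted F0 contour would condition synthesis before the vocoder runs.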