What options exist for tuning speech speed and pitch in TTS?

To adjust speech speed and pitch in text-to-speech (TTS) systems, developers can combine markup languages, API parameters, and model-specific controls. Most modern TTS platforms, including Amazon Polly, Google Cloud Text-to-Speech, and Microsoft Azure Speech, support Speech Synthesis Markup Language (SSML). The SSML <prosody> element modifies speed through its rate attribute (e.g., rate="fast" or rate="150%") and pitch through its pitch attribute (e.g., pitch="high" or pitch="+5%"). For example, <prosody rate="slow" pitch="low">Hello</prosody> generates slower, deeper speech. Supported values vary by platform and voice, so check each provider’s SSML reference. APIs often expose direct parameters as well, such as Google’s speaking_rate (e.g., 1.5 for 50% faster than normal) and pitch (e.g., -3.0 to lower the voice by three semitones). Open-source tools like MaryTTS or Festival allow similar adjustments through configuration files or code.
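As a minimal sketch of the SSML approach, the snippet below builds a <prosody> wrapper using only Python's standard library. The helper name build_prosody_ssml is hypothetical and not part of any TTS SDK, and which rate and pitch values a given provider or voice accepts is an assumption to confirm in its documentation. The result is just a string that you would pass to whichever TTS API you use.

```python
import xml.etree.ElementTree as ET


def build_prosody_ssml(text: str, rate: str = "medium", pitch: str = "medium") -> str:
    """Wrap text in <speak><prosody ...> and return the SSML as a string.

    build_prosody_ssml is a hypothetical helper for illustration only.
    """
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = text  # ElementTree escapes &, < and > when serializing
    return ET.tostring(speak, encoding="unicode")


ssml = build_prosody_ssml("Hello", rate="slow", pitch="low")
print(ssml)
```

Building the markup with an XML library rather than string concatenation keeps user-supplied text from breaking the SSML when it contains characters such as & or <.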

Post-processing the audio output is another approach. After generating speech, developers can use digital signal processing (DSP) libraries to modify speed and pitch independently. For instance, the Python library librosa provides librosa.effects.time_stretch to change speed without altering pitch and librosa.effects.pitch_shift to change pitch without altering duration. FFmpeg can apply audio filters to WAV or MP3 files: atempo changes speed while preserving pitch, whereas asetrate changes pitch by reinterpreting the sample rate, which also changes speed, so it is usually combined with aresample and a compensating atempo. However, these methods can introduce artifacts if pushed too far. For example, increasing speed beyond 200% with atempo can distort audio, and extreme pitch shifts can make voices sound robotic. This approach is useful when fine-grained control is needed beyond what the TTS engine natively supports.
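The FFmpeg filters mentioned above can be combined into filter strings, sketched here under a few assumptions. The helper names atempo_chain and pitch_shift_filter are hypothetical. The sketch assumes each atempo stage should stay within 0.5 to 2.0, so larger speed changes are split across chained stages, and it assumes that asetrate changes speed along with pitch, so a pitch-only shift resamples back to the original rate and adds a compensating atempo. It only builds strings and does not invoke FFmpeg.

```python
def atempo_chain(speed: float) -> str:
    """Split a speed factor into atempo stages that each stay within 0.5-2.0."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    stages = []
    remaining = speed
    while remaining > 2.0:
        stages.append(2.0)
        remaining /= 2.0
    while remaining < 0.5:
        stages.append(0.5)
        remaining /= 0.5
    stages.append(remaining)
    return ",".join(f"atempo={s:.4f}" for s in stages)


def pitch_shift_filter(semitones: float, sample_rate: int) -> str:
    """Build a filter string that shifts pitch while keeping duration roughly constant."""
    factor = 2.0 ** (semitones / 12.0)
    # asetrate raises or lowers pitch AND speed; aresample restores the
    # output sample rate; atempo undoes the speed change.
    return (
        f"asetrate={round(sample_rate * factor)},"
        f"aresample={sample_rate},"
        f"{atempo_chain(1.0 / factor)}"
    )


print(atempo_chain(3.0))
print(pitch_shift_filter(12, 44100))
```

A resulting string would typically be passed to FFmpeg's audio filter option, for example ffmpeg -i input.wav -filter:a "atempo=1.2500" output.wav, where the file names are placeholders.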

Finally, custom TTS models enable deeper adjustments. Training or fine-tuning models like Tacotron 2 or FastSpeech lets developers build speed and pitch control directly into the synthesis process. For example, FastSpeech uses a duration predictor to set the length of each phoneme, which determines speaking rate; scaling the predicted durations by a multiplier during inference speeds up or slows down the output. Similarly, models that predict the fundamental frequency (F0) explicitly, such as FastPitch, let you shift or scale the F0 contour to alter pitch before a vocoder like WaveGlow renders the waveform. Toolkits like Coqui TTS or NVIDIA’s NeMo expose some of these variables programmatically. This method requires more technical effort but can produce more natural-sounding results than post-processing.
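To make the duration-multiplier idea concrete, here is a toy sketch in plain Python, not code from FastSpeech or any toolkit. The function names regulate_length and shift_f0 are hypothetical, the per-phoneme durations and F0 values are made-up inputs, and treating an F0 of 0.0 as an unvoiced frame is an assumption. Durations are scaled by a multiplier alpha before each phoneme is repeated for that many frames, so alpha below 1 shortens the output and alpha above 1 lengthens it.

```python
def regulate_length(phonemes, durations, alpha=1.0):
    """Repeat each phoneme for round(duration * alpha) frames."""
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    expanded = []
    for phoneme, frames in zip(phonemes, durations):
        repeats = max(0, round(frames * alpha))
        expanded.extend([phoneme] * repeats)
    return expanded


def shift_f0(f0_contour, semitones):
    """Scale voiced F0 values by a semitone offset; 0.0 marks unvoiced frames."""
    factor = 2.0 ** (semitones / 12.0)
    return [f0 * factor if f0 > 0 else 0.0 for f0 in f0_contour]


phonemes = ["HH", "AH", "L"]
durations = [2, 4, 6]  # frames per phoneme, illustrative values only

normal = regulate_length(phonemes, durations)
faster = regulate_length(phonemes, durations, alpha=0.5)
print(len(normal), len(faster))
print(shift_f0([100.0, 0.0, 200.0], 12))
```

In a real model the same scaling is applied to hidden-state sequences and predicted contours rather than to lists of labels, but the arithmetic is the same.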
