Sound demos for "WaveFlow: A Compact Flow-based Model for Raw Audio"

Authors: Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song

Paper: arXiv. Published at ICML 2020.

Code: PaddlePaddle reimplementation in the Parakeet toolkit. Note that the following samples and the results in the paper were obtained from an internal PyTorch implementation.



Our small WaveFlow has 5.91M parameters (i.e., 64 residual channels) and can synthesize 22.05 kHz high-fidelity raw audio 42.6× faster than real time on a GPU. In contrast, WaveGlow requires 87.8M parameters (i.e., 256 residual channels) to generate high-fidelity audio, and its performance degrades quickly with fewer residual channels. We also present audio samples from Gaussian autoregressive WaveNet and ClariNet.
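To put these figures in perspective, the short sketch below works through the arithmetic. It assumes that "N× faster than real time" means N seconds of audio are produced per second of wall-clock time; the helper names are hypothetical and are not part of any WaveFlow codebase.

```python
# Figures quoted in the paragraph above.
SAMPLE_RATE_HZ = 22_050          # 22.05 kHz audio
SPEEDUP_OVER_REAL_TIME = 42.6    # small WaveFlow on a GPU
WAVEFLOW_PARAMS_M = 5.91         # small WaveFlow, 64 residual channels
WAVEGLOW_PARAMS_M = 87.8         # WaveGlow, 256 residual channels


def samples_per_second(sample_rate_hz: float, speedup: float) -> float:
    """Audio samples generated per wall-clock second (assumed definition)."""
    return sample_rate_hz * speedup


def seconds_to_synthesize(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to generate a clip of the given duration."""
    return audio_seconds / speedup


def parameter_ratio(larger_m: float, smaller_m: float) -> float:
    """How many times more parameters the larger model has."""
    return larger_m / smaller_m


if __name__ == "__main__":
    rate = samples_per_second(SAMPLE_RATE_HZ, SPEEDUP_OVER_REAL_TIME)
    print(f"Throughput: about {rate:,.0f} samples per second")
    clip = seconds_to_synthesize(10.0, SPEEDUP_OVER_REAL_TIME)
    print(f"A 10 s clip takes about {clip:.2f} s to synthesize")
    ratio = parameter_ratio(WAVEGLOW_PARAMS_M, WAVEFLOW_PARAMS_M)
    print(f"WaveGlow has about {ratio:.1f}x the parameters of small WaveFlow")
```

Under that assumption, the small model generates roughly 0.94 million samples per second while using about one fifteenth of the parameters WaveGlow requires.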

Audio synthesis conditioned on mel spectrogram

WaveFlow (64-layer, res. channels = 64)     WaveGlow (96-layer, res. channels = 64)     ClariNet (60-layer, res. channels = 64)

WaveFlow (64-layer, res. channels = 128)    WaveGlow (96-layer, res. channels = 128)    WaveNet (30-layer, res. channels = 128)

WaveFlow (64-layer, res. channels = 256)    WaveGlow (96-layer, res. channels = 256)    Ground-truth (recorded speech)


Text-to-speech synthesis

The rainbow passage: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow. The rainbow is a division of white light into many beautiful colors. These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon. There is, according to legend, a boiling pot of gold at one end. People look, but no one ever finds it.

Deep Voice 3 + WaveFlow    Deep Voice 3 + WaveGlow    Deep Voice 3 + WaveNet    Recorded human speech (reference only)