Sound demos for "WaveFlow: A Compact Flow-based Model for Raw Audio"

Our small WaveFlow has 5.91M parameters (i.e., 64 residual channels), and it can synthesize 22.05 kHz high-fidelity raw audio 42.6× faster than real time on a GPU. In contrast, WaveGlow requires 87.8M parameters (i.e., 256 residual channels) to generate high-fidelity audio, and its performance degrades quickly as the number of residual channels is reduced. We also present audio samples from Gaussian autoregressive WaveNet and ClariNet.
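To put the figures above in concrete terms, the following is a minimal arithmetic sketch that uses only the numbers quoted in the paragraph (22.05 kHz sample rate, 42.6× real-time factor, 5.91M and 87.8M parameters). The function and constant names are illustrative and do not come from any WaveFlow code release.

```python
# Back-of-the-envelope arithmetic from the figures quoted above.
# All names here are hypothetical helpers, not part of any WaveFlow codebase.

SAMPLE_RATE_HZ = 22_050      # 22.05 kHz audio
REALTIME_FACTOR = 42.6       # reported speed-up over real time on a GPU
WAVEFLOW_PARAMS_M = 5.91     # small WaveFlow, 64 residual channels
WAVEGLOW_PARAMS_M = 87.8     # WaveGlow, 256 residual channels


def samples_per_second(sample_rate_hz: float, realtime_factor: float) -> float:
    """Waveform samples generated per wall-clock second."""
    return sample_rate_hz * realtime_factor


def synthesis_time_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time needed to synthesize `audio_seconds` of audio."""
    return audio_seconds / realtime_factor


throughput = samples_per_second(SAMPLE_RATE_HZ, REALTIME_FACTOR)
param_ratio = WAVEGLOW_PARAMS_M / WAVEFLOW_PARAMS_M

print(f"Throughput: about {throughput:,.0f} samples per second")
print(f"10 s of audio takes about {synthesis_time_seconds(10.0, REALTIME_FACTOR):.3f} s")
print(f"WaveGlow has about {param_ratio:.1f}x as many parameters as small WaveFlow")
```

Under these numbers, the small WaveFlow produces roughly 939 thousand samples per second while using about one fifteenth of WaveGlow's parameter count.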

Audio synthesis conditioned on mel spectrogram

WaveFlow (64-layer, res. channels = 256)    WaveGlow (96-layer, res. channels = 256)    Ground-truth (recorded speech)

WaveFlow (64-layer, res. channels = 128)    WaveGlow (96-layer, res. channels = 128)    WaveNet (30-layer, res. channels = 128)

WaveFlow (64-layer, res. channels = 64)     WaveGlow (96-layer, res. channels = 64)     ClariNet (60-layer, res. channels = 64)

Text-to-speech synthesis

The rainbow passage: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow. The rainbow is a division of white light into many beautiful colors. These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon. There is, according to legend, a boiling pot of gold at one end. People look, but no one ever finds it.

Deep Voice 3 + WaveFlow    Deep Voice 3 + WaveGlow    Deep Voice 3 + WaveNet    Recorded human speech (reference only)