Our small WaveFlow has 5.91M parameters (with 64 residual channels), and it can synthesize 22.05 kHz high-fidelity raw audio 42.6× faster than real time on a GPU. In contrast, WaveGlow requires 87.8M parameters (with 256 residual channels) to generate high-fidelity audio, and its audio quality degrades quickly as the number of residual channels is reduced. We also present audio samples from Gaussian autoregressive WaveNet and ClariNet.
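To make the speed and size figures above concrete, the following minimal sketch converts the reported 42.6× real-time factor at 22.05 kHz into a sample throughput (roughly 939,000 samples per second) and compares the two parameter counts. The function and constant names are hypothetical and chosen only for illustration; the 10-second clip length is an assumed example, not a measurement from this page.

```python
# Figures quoted in the text above.
SAMPLE_RATE_HZ = 22_050        # 22.05 kHz audio
SPEEDUP = 42.6                 # small WaveFlow, times faster than real time
WAVEFLOW_PARAMS_M = 5.91       # millions of parameters, 64 residual channels
WAVEGLOW_PARAMS_M = 87.8       # millions of parameters, 256 residual channels


def samples_per_second(speedup: float, sample_rate_hz: int) -> float:
    """Audio samples generated per wall-clock second at a given speedup."""
    return speedup * sample_rate_hz


def synthesis_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to synthesize a clip of the given duration."""
    return audio_seconds / speedup


throughput = samples_per_second(SPEEDUP, SAMPLE_RATE_HZ)
param_ratio = WAVEGLOW_PARAMS_M / WAVEFLOW_PARAMS_M

# Assumed example: a 10-second clip (hypothetical, for illustration only).
clip_time = synthesis_time_seconds(10.0, SPEEDUP)

print(f"Throughput: {throughput:,.0f} samples/s")
print(f"Time to synthesize a 10 s clip: {clip_time:.3f} s")
print(f"WaveGlow / WaveFlow parameter ratio: {param_ratio:.1f}x")
```

Under these figures, WaveGlow at 256 residual channels carries roughly 15 times as many parameters as the small WaveFlow.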
Audio synthesis conditioned on mel spectrogram
WaveFlow (64-layer, res. channels = 64)
WaveGlow (96-layer, res. channels = 64)
ClariNet (60-layer, res. channels = 64)
WaveFlow (64-layer, res. channels = 128)
WaveGlow (96-layer, res. channels = 128)
WaveNet (30-layer, res. channels = 128)
WaveFlow (64-layer, res. channels = 256)
WaveGlow (96-layer, res. channels = 256)
Ground-truth (recorded speech)
Text-to-speech synthesis
The rainbow passage: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow. The rainbow is a division of white light into many beautiful colors. These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon. There is, according to legend, a boiling pot of gold at one end. People look, but no one ever finds it.