Deep Idol

Singing Machines

Machine learning researchers have recently improved the capabilities of vocal synthesis by experimenting with end-to-end unsupervised modeling approaches. The advantage of these models is their ability to discover the sonic characteristics of any set of sounds without the detailed hand-engineering of features that previous systems required. One promising generative model for raw audio, known as WaveNet, can be trained to synthesize speech from a variety of speakers and languages. Deep Idol was built from a similar system with the goal of developing a synthesizer capable of imitating any sound.

Our Process

Deep Idol is a neural audio synthesis system. It utilizes an “unconditional end-to-end neural audio generation model” [3] called SampleRNN. The system is similar to WaveNet but “has different modules operating at different clock-rates” [3], taking on a hierarchical structure that improves the long-term continuity of the generated sounds. With the intention of generating sounds beyond the capability of the human voice and into a more extensive musical repertoire, we tested different sampling methods, model architectures, memory gates, sample rates, and layer densities on a diverse range of audio datasets.

Data Pre-Processing

The Deep Idol system starts by preprocessing audio into 3,200 eight-second chunks of raw audio data (FLAC). Each chunk is then randomly assigned to one of three sets used to train, test, and validate the model.
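The chunk-and-split step can be sketched as follows. This is a minimal illustration, not the actual Deep Idol code: the function name, split ratios, and the in-memory array standing in for decoded FLAC are all our own assumptions.

```python
import numpy as np

def chunk_and_split(audio, sample_rate=16000, chunk_seconds=8,
                    ratios=(0.88, 0.06, 0.06), seed=0):
    """Slice raw audio into fixed-length chunks and randomly assign
    each chunk to the train/test/validation sets."""
    chunk_len = sample_rate * chunk_seconds
    n_chunks = len(audio) // chunk_len          # drop the trailing remainder
    chunks = audio[:n_chunks * chunk_len].reshape(n_chunks, chunk_len)

    rng = np.random.default_rng(seed)
    order = rng.permutation(n_chunks)
    n_train = int(n_chunks * ratios[0])
    n_test = int(n_chunks * ratios[1])
    return (chunks[order[:n_train]],                   # training set
            chunks[order[n_train:n_train + n_test]],   # test set
            chunks[order[n_train + n_test:]])          # validation set

# one minute of silence stands in for a decoded FLAC file
audio = np.zeros(16000 * 60, dtype=np.float32)
train, test, valid = chunk_and_split(audio)
```

In practice the audio would be decoded from FLAC with a library such as soundfile before chunking.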

Making A Model

After our data is pre-processed, the model can begin training on it. Over a few days, our RNNs (recurrent neural networks) sufficiently study an artist’s style, getting more accurate in their recreation with each iteration.


The generated signal begins as chaotic noise, but as training continues, out emerges a uniquely styled piece of music composed of a great depth of textural layers: unpredictable swarming and sonic fluttering randomly breaking free from the neural atmospheric warmth. Audio goes in, and its stylistic essence is extracted by sampling from the model’s distribution sample by sample to synthesize musical phrases.
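The sample-by-sample generation loop can be illustrated with a toy sketch. The stand-in model below outputs a uniform distribution over 256 quantized amplitude levels (the “chaotic noise” of an untrained network); a trained SampleRNN would replace `predict_proba` with the real network’s output distribution.

```python
import numpy as np

def generate(predict_proba, n_samples, n_levels=256, seed=0):
    """Autoregressive synthesis: draw each new sample from the model's
    predicted distribution, conditioned on everything generated so far."""
    rng = np.random.default_rng(seed)
    signal = []
    for _ in range(n_samples):
        probs = predict_proba(signal)          # distribution over levels
        signal.append(rng.choice(n_levels, p=probs))
    return np.array(signal)

# stand-in model: a uniform distribution, i.e. the untrained "noise" state
uniform = lambda history: np.full(256, 1 / 256)
audio = generate(uniform, n_samples=100)
```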


Vocalist Style Synthesis

As soon as we got our system running, we attempted to bring Kurt back to sing for us by training Deep Idol on a collection of the late singer’s a cappellas. Listening for signs of life, we stood by as our bot’s first words began to emerge through the dense wall of growls, screams, and distortion. Slowly, over the course of a day, the vocal model began to zero in on the speaker, pumping out waveforms which began to resemble Kurt’s voice. Two-tiered networks with four layers, trained on about forty minutes of Kurt a cappellas, successfully generated grungy screaming and bellowing melodic phrases with occasional English words cutting through.

Kurt 4 layer GRU

Slim 4 layer GRU

For percussion datasets, we have found good results with five to seven layers.

5 RNN layers

7 RNN layers

Beatboxing With Deeper Networks and Skip Connections

Vowels like ooh and aah were easy to reproduce with basic parameters, but a world-class beatboxer like Reeps One produces a much wider range of complex sounds than a typical vocalist, including deep sub-bass tones, smacks, and breaths. Extending the number of layers from five to nine produces a more capable model with more connections, but it takes longer to train.
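The idea of a deeper stack with skip connections can be sketched in a toy form: each layer’s output is added directly into the final sum, so early layers contribute to the output (and receive gradient) without passing through every later layer. This is our own illustration of the general pattern, not the SampleRNN implementation; the recurrence here is a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_layer(x, w):
    """Toy recurrent layer: tanh of a learned mix of the input and a
    one-step shift of the sequence (standing in for the hidden state)."""
    return np.tanh(x @ w + np.roll(x, 1, axis=0) @ w)

def deep_skip_net(x, n_layers=9, width=32):
    """Stack of recurrent layers where every layer's output also feeds
    the final sum directly via a skip connection."""
    skip_total = np.zeros_like(x)
    h = x
    for _ in range(n_layers):
        w = rng.normal(scale=0.1, size=(width, width))
        h = rnn_layer(h, w)
        skip_total += h          # skip connection straight to the output
    return skip_total

x = rng.normal(size=(16, 32))    # (time steps, features)
y = deep_skip_net(x)
```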

Beyond The Human Voice – GRU vs. LSTM

The vocal-optimized parameters suggested by the creators of SampleRNN didn’t perform well with a percussion dataset, so we went searching for ways to generate drums. Gated recurrent units (GRUs), proposed by Cho et al. [2014], are being selected for many text-to-speech systems for their improved speed over LSTM gated units. We found that aperiodic sounds with complex spectra and a quick decay all suffer from noise explosions and long silent periods when audio is generated. Long short-term memory (LSTM) units, initially proposed by Hochreiter and Schmidhuber [1997], seemed to consistently improve these results, leading to more stable audio.
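The structural difference behind that trade-off can be shown with minimal single-step cells (our own toy sketch, with biases omitted): the GRU has two gates and folds everything into one hidden state, which makes it cheaper, while the LSTM spends three gates plus a dedicated cell state that carries long-range information.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(x, h, W):
    """One GRU step: two gates (update z, reset r), no separate cell state."""
    z = sigmoid(W["z"] @ np.concatenate([h, x]))
    r = sigmoid(W["r"] @ np.concatenate([h, x]))
    h_tilde = np.tanh(W["h"] @ np.concatenate([r * h, x]))
    return (1 - z) * h + z * h_tilde

def lstm_step(x, h, c, W):
    """One LSTM step: three gates (input i, forget f, output o) plus a
    dedicated cell state c that carries long-range information."""
    hx = np.concatenate([h, x])
    i, f, o = (sigmoid(W[k] @ hx) for k in ("i", "f", "o"))
    c = f * c + i * np.tanh(W["c"] @ hx)
    return o * np.tanh(c), c

rng = np.random.default_rng(0)
d = 8                                             # hidden and input size
W = {k: rng.normal(scale=0.1, size=(d, 2 * d)) for k in "zrhifoc"}
x, h, c = (rng.normal(size=d) for _ in range(3))
h_gru = gru_step(x, h, W)
h_lstm, c_new = lstm_step(x, h, c, W)
```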



Towards High-Def Audio

Striving for higher bitrates comes at the cost of training time. Due to the sampling theorem, we can only capture frequencies up to half the sample rate. At a 16 kHz sample rate, this cuts all those energetic treble frequencies above 8 kHz. But we don’t need to say goodbye to those crispy hi-hats if we can wait a little longer.
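The arithmetic is simple enough to state directly (the sampling-theorem limit is often called the Nyquist frequency):

```python
def nyquist(sample_rate_hz):
    """Highest frequency a sampled signal can represent (sampling theorem)."""
    return sample_rate_hz / 2

assert nyquist(16_000) == 8_000    # 16 kHz training audio tops out at 8 kHz
assert nyquist(44_100) == 22_050   # CD quality keeps the full audible range
```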



Aesthetics of Neural Synthesis

Left on its own, Deep Idol turns a solo vocalist into a lush choir of ghostly voices, rock bands into crunchy cubist jazz, and cross-breeds multiple recordings into a surrealist chimera of sound. While much of our research strived to achieve a realistic recreation of the original data, there is an interesting sound associated with neural synthesis which has aesthetic merit of its own. Artifacts like tube warmth, tape hiss, vinyl distortion, and extreme pitch shifting are all very popular in modern production. Much like other signature timbral characteristics achieved through vintage sound production, neural sound can be exploited by pioneering artists to reveal the unique nature of the medium. We approach music technology as artists, so the glitch-inspired creative possibilities excite us as much as the technical replication abilities of RNNs. As has been the case throughout music history, this new technology will surely usher in new trends for electronic music making as it becomes more accessible and ubiquitous in production.

Neural Synthesis Meets Crowd Remix

Fitting the output of Deep Idol into a novel interactive user experience brings this phrase synthesizer into the space of expressive digital instruments. Using a concatenative audio engine like Crowd Remix for intelligent sorting, clustering, and re-arranging of generated phrases provides users with a complete workstation for composition and production, with the ability to create variations of a recording with ease. Neural synthesis will enable us to push beyond what has been possible with traditional synthesis methods.
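Crowd Remix’s engine is not described here, but feature-based sorting of generated phrases can be illustrated with a toy example: ordering phrases by spectral centroid, a common proxy for perceived brightness. The feature choice and pure-tone test phrases are our own assumptions.

```python
import numpy as np

def spectral_centroid(phrase, sample_rate=16000):
    """Brightness proxy: magnitude-weighted mean frequency of a phrase."""
    mags = np.abs(np.fft.rfft(phrase))
    freqs = np.fft.rfftfreq(len(phrase), d=1 / sample_rate)
    return (freqs * mags).sum() / mags.sum()

def sort_phrases(phrases):
    """Arrange generated phrases from darkest to brightest timbre."""
    return sorted(phrases, key=spectral_centroid)

# three synthetic "phrases": pure tones standing in for generated audio
t = np.arange(16000) / 16000
phrases = [np.sin(2 * np.pi * f * t) for f in (440.0, 110.0, 880.0)]
ordered = sort_phrases(phrases)   # darkest tone (110 Hz) comes first
```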


Never before have music producers had the ability to imitate the style of a singer’s voice by merely providing audio examples of what to model. RNNs with gated units are fantastic at learning texture and provide improved rhythmic continuity, making them a great candidate for a musical phrase generator. We are able to generate long, complex phrases of music by employing a multi-tier RNN with many layers as our point of departure. As electronic musicians, we found the possibilities inspiring and began to extend these spoken-word techniques for musical creation, in hopes of discovering a way to build a concatenative synth capable of learning any timbre we have examples of.


[1] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555 [cs.NE], 2014.

[2] Alex Graves. Generating Sequences With Recurrent Neural Networks. arXiv:1308.0850 [cs.NE], 2014.

[3] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio. Char2wav: End-to-End Speech Synthesis. ICLR 2017 workshop submission, 2017.

[4] Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, Shubho Sengupta, Mohammad Shoeybi. Deep Voice: Real-time Neural Text-to-Speech. arXiv:1702.07825 [cs.CL], 2017.