
People have been trying to make machine-generated music for a long time. Some of the earliest examples were musicians punching holes in piano rolls to create complex melodies unplayable by humans (see Conlon Nancarrow, 1947).
More recently, it has looked like electronic music in the form of MIDI files, where songs can be symbolically represented by specifying attributes such as the instrument, pitch, duration, and timing. But what does it look like for AI to run the whole generation process?
<iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fw.soundcloud.com%2Fplayer%2F%3Furl%3Dhttps%253A%252F%252Fapi.soundcloud.com%252Ftracks%252F1000521730%26show_artwork%3Dtrue&display_name=SoundCloud&url=https%3A%2F%2Fsoundcloud.com%2Fuser-935478966%2Fb-deep-h-53k-30s-l0-9&image=https%3A%2F%2Fsoundcloud.com%2Fimages%2Ffb_placeholder.png&key=a19fcc184b9711e1b4764040d3dc5c07&type=text%2Fhtml&schema=soundcloud" title="AI-generated &quot;Human music&quot; (more below). Interestingly, the two SoundCloud comments this received are from bots. It seems the highest compliments for AI-generated music come from AI-generated spam" height="166" width="800"></iframe>
This article explores generative audio techniques by training OpenAI’s Jukebox on hours of house music.
Generative Audio Approaches
Historically, generative audio models were trained on datasets of these symbolic musical representations. These early (read: 2017) models like DeepBach or MidiNet produced ‘musical scores’ that mimic human melodies with impressive results.
But it’s a lot harder to train and generate raw audio. A typical audio file runs at 44.1kHz, meaning that each second contains 44,100 distinct values. This represents orders of magnitude more complexity than the symbolic models that merely learn from files containing "play middle C from 00:01 to 00:02 on a Steinway piano."
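To make that scale concrete, here is a quick back-of-the-envelope calculation (the 44.1kHz figure comes from the text above; the three-minute song length is an illustrative assumption):

```python
SAMPLE_RATE = 44_100                      # samples per second at 44.1kHz

one_second = SAMPLE_RATE                  # 44,100 distinct values per second
three_minute_song = SAMPLE_RATE * 60 * 3  # roughly 7.9 million values

print(one_second, three_minute_song)
```

Compare this with a symbolic MIDI representation, where the same three minutes might be described by just a few hundred note events.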
The hard problem of learning long-term dependencies in audio files has recently been addressed by OpenAI’s Jukebox and DeepMind’s WaveNet. There are three main classes of approaches to generating raw audio, each with its pros and cons.

Generative Adversarial Networks (GANs)
GANs learn to map input vectors (typically of much smaller dimension than the data) to examples in the target domain. They are a combination of two neural networks trained against one another: the generator (G) tries to produce synthetic output that will fool the discriminator, while the discriminator (D) tries to tell whether a given output is real or generated.

As the model trains, the generator becomes capable of creating outputs that rival real examples.
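As a minimal sketch of that adversarial setup, the toy NumPy example below scores one real and one generated vector (the linear generator, logistic discriminator, and all dimensions are illustrative assumptions; gradient updates are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy generator G: maps a small latent vector to a "data" vector.
W_g = rng.normal(size=(16, 4))      # data_dim=16, latent_dim=4
def G(z):
    return W_g @ z

# Toy discriminator D: scores how "real" a data vector looks (0 to 1).
w_d = rng.normal(size=16)
def D(x):
    return sigmoid(w_d @ x)

real = rng.normal(size=16)          # stand-in for a real training example
fake = G(rng.normal(size=4))        # generated example

# D wants D(real) -> 1 and D(fake) -> 0; G wants D(fake) -> 1.
d_loss = -np.log(D(real)) - np.log(1.0 - D(fake))
g_loss = -np.log(D(fake))
```

In a real GAN these two losses are minimized in alternation, each network updating against the other.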

Theoretically, GANs are the fastest architecture for generation of the three discussed here and offer better global control over conditioning variables. However, the great success GANs have had in producing high-resolution images has not yet translated to the audio domain. In practice, they are mostly used for producing the timbre of individual notes from an instrument and short clips (<4s) rather than generating new scores.
Autoregressive Models
Autoregressive models compress the prior terms (think: past few seconds of music) into a hidden state used to predict the next bit of audio. While an RNN sees only one input sample at each time step and retains the influence of past samples in its hidden state, WaveNet has explicit access to a past window of input samples.

This type of model is best at short audio generation. Generated samples typically lack long-term structure, so autoregressive models are most often used for speech generation. However, they are seriously slow: generating a new element requires all prior elements to already exist, so generation is inherently sequential. Normalizing-flow models (Parallel WaveNet) added parallelism to these models, making generation 300x faster.
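The sequential bottleneck is easy to see in code. In this toy sketch (the fixed coefficients and window size are illustrative, standing in for a learned model like WaveNet), each new sample can only be computed after all previous samples exist:

```python
import numpy as np

WINDOW = 4                                # how many past samples the model sees
coeffs = np.array([0.5, 0.2, 0.2, 0.1])   # toy stand-in for learned weights

def predict_next(history):
    window = history[-WINDOW:]
    # Weight the most recent samples most heavily.
    return float(coeffs[:len(window)] @ window[::-1])

audio = [1.0]                             # seed sample
for _ in range(100):                      # inherently sequential loop
    audio.append(predict_next(np.array(audio)))
```

Generating one second of 44.1kHz audio this way requires 44,100 sequential steps, which is why naive autoregressive generation is so slow.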
Variational Autoencoders (VAEs)
Traditional autoencoders map input data into the ‘latent space’ to learn the compressed, underlying representation of the dataset.

These latent variables can be used to generate new synthetic samples from lower-dimensional vectors: input a latent vector into the decoder and a synthetic sample is created. However, because a plain autoencoder’s latent space has no enforced structure, decoding a randomly chosen vector rarely produces meaningful, novel output.
To solve this problem, variational autoencoders (VAEs) instead learn the parameters for the underlying distribution for the latent space, allowing for random sampling at generation time to create new synthetic samples.

Internally, variational autoencoders store the input as a distribution over latent space, rather than a single point. The VAE learns parameters of a probability distribution in the latent space from which samples can be drawn.
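A minimal sketch of that sampling step (the mean and log-variance values are illustrative stand-ins for what a real VAE encoder would produce):

```python
import numpy as np

rng = np.random.default_rng(0)

# The encoder outputs a distribution, not a single point:
mu = np.array([0.3, -1.2])         # latent mean (illustrative values)
log_var = np.array([-0.5, 0.1])    # latent log-variance (illustrative values)

# Training-time sampling via the reparameterization trick:
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# Generation-time sampling: draw from the prior N(0, I) and decode.
z_new = rng.standard_normal(mu.shape)
```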

GANs typically yield better results because they don’t deal with any explicit probability density estimation. VAEs fail to generate sharp outputs when the model doesn’t learn the true posterior distribution.

VAEs are best at structuring data spaces and used to discover low-dimensional parametric representations of the data. However, because they learn distributions of parameters rather than actual values, the generated samples can be imprecise. Their output might be probabilistically correct, but not actually correct.
Breaking down Jukebox
OpenAI’s Jukebox uses a combination of these techniques. In order to handle longer audio files, it downsamples the inputs into a lower-resolution space. In this compressed space (128x smaller than the input), it is much easier to generate novel audio samples. After the new audio is generated, it is upsampled back to a higher-fidelity audio level.
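In raw numbers (the 44.1kHz rate and 128x factor both come from the text; the per-second framing is just arithmetic):

```python
SAMPLE_RATE = 44_100
COMPRESSION = 128                 # compressed space is 128x smaller

# One second of raw audio collapses to ~345 compressed values.
tokens_per_second = SAMPLE_RATE / COMPRESSION
```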
Compressing Audio Files
Training a model to generate audio sequences requires developing another model to convert training data (audio files) from their original 44.1kHz resolution down to the lower-dimensional space and back. In this lower-dimensional space, it is much easier (and computationally cheaper) to train generative models. To accomplish this, Jukebox uses hierarchical VQ-VAEs. Think of the VQ-VAE as a translator, encoding raw audio into this lower-dimensional space and decoding it back.

Vector quantized variational autoencoders (VQ-VAE) build on VAEs and allow for the generation of more realistic samples. As discussed above, VAEs often generate ‘blurry’ synthetic data because they pull from a distribution. VQ-VAEs fix this by instead quantizing (think: turning into discretized vectors) the latent variables.
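The quantization step itself is just a nearest-neighbor lookup into a learned table of code vectors. A sketch (the codebook size, dimensions, and random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

codebook = rng.normal(size=(512, 64))   # 512 learned codes, 64 dims each
z_e = rng.normal(size=64)               # continuous encoder output

# Snap the encoder output to its nearest codebook entry.
distances = np.linalg.norm(codebook - z_e, axis=1)
code_index = int(np.argmin(distances))  # the discrete token
z_q = codebook[code_index]              # quantized latent fed to the decoder
```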

Generated samples are composed of bits of audio that mimic portions of audio in the training set, rather than bits of audio that probabilistically should exist in the training set. The results are typically much more realistic. For visual intuition about how VQ-VAEs differ from VAEs, see below for a comparison of their latent spaces.

A single VQ-VAE can effectively capture local structure but struggles with long-term coherence. The generated audio might sound reasonable on a small timescale, but it doesn’t sound quite as good over a longer timeframe. Imagine the bridge of a song that just keeps continuing instead of transitioning into the next verse. Locally, this sounds coherent, but globally, this makes no sense. To better learn these long-term musical structures, Jukebox incorporates multiple VQ-VAEs for different timescales: sub-second, second, and ten second. These hierarchical VQ-VAEs help capture long-range correlations in waveforms.
Training a Deep House Prior
Before we can train our generative model, we need to train the VQ-VAE to handle our input audio. Because it’s an autoencoder, we want the output to sound as close as possible to the input, so any reconstruction loss is penalized. Additionally, we minimize the codebook loss, the difference between the samples of raw audio and their ‘vector-quantized’ representation. The ‘loss’ is the random noise we hear in the reconstructed audio not present in the original sample.
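The pieces of that objective can be sketched as follows (all tensors are random stand-ins, the stop-gradient bookkeeping of the real VQ-VAE objective is omitted, and the 0.25 commitment weight is a common default rather than a value from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=1000)                   # chunk of raw audio (stand-in)
x_hat = x + 0.01 * rng.normal(size=1000)    # decoder reconstruction
z_e = rng.normal(size=64)                   # encoder output
z_q = z_e + 0.05 * rng.normal(size=64)      # its vector-quantized version

recon_loss = np.mean((x - x_hat) ** 2)      # penalize reconstruction error
codebook_loss = np.mean((z_e - z_q) ** 2)   # pull codes toward encoder outputs
commit_loss = 0.25 * codebook_loss          # keep the encoder committed to codes

total_loss = recon_loss + codebook_loss + commit_loss
```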
Now that we can work in this discretized space, we can train our prior to predict audio sequences. Given a number of past inputs and auxiliary data, a transformer model can be trained to predict what sound comes next. Transformers use attention mechanisms to boost training speed when learning translation and prediction on sequences. Because they offer computational advantages over RNNs for longer sequences, transformers are an ideal architecture for raw audio. Jukebox uses sparse transformers to reduce the computational complexity of each layer from O(n²) to O(n√n).
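To see what that complexity reduction buys, compare the operation counts per layer (the sequence length n is an illustrative choice):

```python
import math

n = 8192                              # tokens in the sequence (illustrative)

dense_ops = n ** 2                    # full attention: O(n^2)
sparse_ops = n * int(math.sqrt(n))    # sparse attention: O(n * sqrt(n))

ratio = dense_ops / sparse_ops        # roughly 91x fewer comparisons here
```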

To generate audio, we train transformers to predict the next encoded sound given the prior encoded sounds for the songs in our dataset. During training, we condition the model on auxiliary information about the input audio to enable us to have parameters to control the generation. In practice, this looks like passing in information about the genre, artist, and lyrics along with the training audio, so that at generation time we can pass in the artist or genre we want it to output.

The top-level prior is trained on ~50 hours of deep house music with the vocal track removed (c/o spleeter). The prior was trained for 4 days on an NVIDIA Tesla V100 GPU with a weight decay of 0.01. When the training loss was around 2 (~90k steps), the learning rate (LR) was annealed to 0. Learning rate annealing means decreasing the LR over time as you near the end of training, to ensure you don’t overshoot the minimum.
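A linear anneal is one simple way to implement this (the base learning rate and linear shape are illustrative assumptions; the text only says the LR was annealed to 0 around ~90k steps):

```python
def annealed_lr(step, total_steps, base_lr=3e-4):
    """Linearly decay the learning rate to 0 by the end of training."""
    frac = min(step / total_steps, 1.0)
    return base_lr * (1.0 - frac)

# Early in training the LR is near base_lr; by the final step it reaches 0.
```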
While 90k steps may seem like a while, the quality of the generated tracks markedly improves over time. For reference, compare these early tracks (generated at 33k and 53k steps) to the final track (after 90k steps) in the final section.
Generating New House Tracks
To generate new house music, we first sample from our trained prior and then upsample the track back to its original sample rate. Finally, the generated sample is passed through the VQ-VAE decoder and turned back into a waveform.

The top-level prior (level 2), conditioned on the information about artist and genre, is sampled to generate a new sequence of music tokens in the style of the training data. While we can ‘prime’ the model to continue from an input audio file, the samples below were all unprimed and sampled entirely from the learned top-level prior. The level 2 generated audio samples contain the long-range structure and semantics of the music but are low audio quality due to the compression.
To remove the ‘grain’ from the generated audio, we upsample it through lower and lower levels. These upsampler models are similar to our top-level prior transformer but are additionally conditioned on the tokens generated at the level above. The Level 1 upsampler is passed the Level 2 sample (and its output is then passed to the Level 0 upsampler).

These upsamplers add in local musical structure like timbre and greatly improve the sound quality.
(Deep) House
So what does computer-generated house music sound like? Below are some of the best samples, but for more you can check out the entire playlist of outputs on SoundCloud.
As a friend said, "It’s like a computer is trying to learn how to pass off as Kygo to its computer friends."
Limitations
Long-term Structure: While the original paper lists this as a limitation, it’s less of a problem for our vocal-free house music, because there are no vocals to clearly delineate verses from choruses (and house music is repetitive).
Generation Speed: While these tracks were generated in parallel, because Jukebox is built on an autoregressive model, generation of a single track isn’t parallelized. These 90-second samples took ~10 hours to generate.
Overfitting: Another concern is IP infringement from the generated samples. In a few of the tracks, if the audio fades out completely, the track restarts a few seconds later with this intro.
This was likely because the DJ who produced a lot of songs in the training dataset opens all of his songs this way, and the model learned to replicate the introduction.
I ran SoundHound against all of the auto-generated tracks and got no matches, but it’s unclear whether this is because Jukebox is overfitting vocal-removed songs that don’t produce a match, or because they’re genuinely novel samples that are just heavily inspired by the training dataset. Because of SoundHound’s ‘hum a tune’ functionality, my intuition is that it’s the latter; however, further study should investigate IP and AI-generated content.






