
Custom Audio Classification with TensorFlow

An end-to-end example project with code.

Hands-on Tutorials

Mel-scaled spectrogram. Visualization created by the author.

For a recent research project, I had the chance to explore the world of audio classification. Based on the code that I created for this task, I'll guide you through an end-to-end machine learning project. The goal is to identify the gender of a person speaking in an audio sample.

We will begin with gathering the initial data. Based on this, we generate our training, test, and validation data. For easier handling, we will store our data as TFRecords files. After we have finished the data engineering, we create our model. With it at hand we can – finally – do the training.

Task description

Given an audio sample of a person speaking, determine whether it is a female or a male speaker.

This statement makes the task at hand a classification problem; we have a data sample and assign one or more labels to it. To learn as much as possible, we create the data from scratch, based on freely available audio samples.

Overview

For clarity I have split the post into several sections:

The first section is the data acquisition step.

The second section covers data generation.

The third section is the generation of the TFRecords files. I have written a practical introduction to this data format here.

The fourth section combines all we have done so far to do the actual training.

All sections come with accompanying code, in the form of Colab notebooks:

Data Acquisition

Photo by Markus Spiske on Unsplash

Downloading, imports, and project setup

Our first step is getting the data. For this project, I have chosen a freely available audio dataset, the LibriSpeech ASR corpus. It contains short recordings of volunteers reading audiobooks from the LibriVox project [1].

The dataset is provided in subsets of various sizes. There are several train and test sets with speech of varying difficulty (e.g., unclear pronunciation, accents). I selected the 6.3 GB train-clean-100 subset – feel free to choose another, possibly smaller version; just remember to adapt the following code in that case.

For demonstration purposes, the code is written to be run in Jupyter notebooks, e.g. on Google Colab. We download the data to a directory on the mounted (!) Google Drive; adapt the paths if you choose a different location. You will use the data throughout the following post.

We begin with creating our project folder, downloading the data and extracting it to our drive:
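A minimal sketch of this setup step. All paths here are placeholders for your mounted Drive, and the archive URL points at OpenSLR's LibriSpeech mirror; the actual 6.3 GB download is left to the notebook's shell cells:

```python
from pathlib import Path

# Hypothetical project location; adapt to your mounted Google Drive.
PROJECT_DIR = Path("drive/MyDrive/audio_classification")
DATA_DIR = PROJECT_DIR / "data"
SUBSET = "train-clean-100"
# OpenSLR hosts the LibriSpeech archives under resource 12.
DOWNLOAD_URL = f"https://www.openslr.org/resources/12/{SUBSET}.tar.gz"

def setup_project() -> str:
    """Create the project folders and return the archive URL.

    The download and extraction themselves would run as shell cells:
        !wget {DOWNLOAD_URL} -P {DATA_DIR}
        !tar -xzf {DATA_DIR}/train-clean-100.tar.gz -C {DATA_DIR}
    """
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    return DOWNLOAD_URL
```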

Now that the data is on our drive we listen to 10 random samples:

Metadata

Afterwards, we take a look at the metadata file to later identify all speakers that are present in our downloaded subset. After inspection, we remove the first twelve lines, which contain information unnecessary for our task:

One exemplary line is

20   | F | train-other-500  | 30.07 | name

The first entry is the ID, the second entry the gender, the third entry the subset this speaker is present in, the fourth column lists the total duration of audio content that this particular speaker contributes, and the last column contains the speaker’s name.

Next, we go over all such lines (that is, over all speakers), and temporarily store the gender if the speaker is present in our dataset (this is the case when the third column reads train-clean-100). Based on this metadata we then sort the speakers into separate male and female lists – this is the crucial part of the data acquisition step.
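The parsing and sorting step can be sketched as follows; the example lines and file names are illustrative, but they follow the pipe-separated layout of LibriSpeech's SPEAKERS.txt shown above:

```python
import pickle

# Example lines in the metadata format (ID | gender | subset | minutes | name),
# after the twelve header lines were stripped.
metadata_lines = [
    "20   | F | train-other-500  | 30.07 | name_a",
    "345  | M | train-clean-100  | 25.10 | name_b",
    "1001 | F | train-clean-100  | 24.80 | name_c",
]

available_speakers = {}
for line in metadata_lines:
    speaker_id, gender, subset, *_ = [col.strip() for col in line.split("|")]
    # Only keep speakers that are part of our downloaded subset.
    if subset == "train-clean-100":
        available_speakers[speaker_id] = gender

# Sort the speakers into separate male and female lists.
male_speakers = [s for s, g in available_speakers.items() if g == "M"]
female_speakers = [s for s, g in available_speakers.items() if g == "F"]

# Persist the lists for the data-generation step.
with open("speaker.pkl", "wb") as f:
    pickle.dump({"male": male_speakers, "female": female_speakers}, f)
```

Speaker 20 is skipped because they belong to train-other-500, not to our downloaded subset.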

Notice the dirs list that we generated above: we will iterate over the downloaded directories, which are named after the ID of the speaker whose audio files they contain. If a directory's name is 345, it contains samples from the speaker with the ID 345. Based on this ID, we can check the speaker's gender by looking up the value stored for that key in our available_speakers mapping.

Lastly, we save the lists to our drive; in the next section, we will use them to generate our training, test, and validation data.


That’s it for this section. You can find the corresponding notebook hosted on Colab here. Leave a short note if you have suggestions or find any errors.

Audio Data Generation

Photo by Kelly Sikkema on Unsplash

In this section, we’ll go over the creation of our audio data: we cover creating training, validation, and test data. To prevent leaking test and validation data into our actual training data, we will refer to the speaker.pkl file that we created in the previous section.

Imports

Let us start with installing and importing all necessary packages. I decided to use a Python package called PyDub to create the audio files; it is easy to use and supports overlaying and appending audio samples. We import the argparse package so we can easily turn our Colab notebook into a .py file supporting command line arguments:

Helper functions

Next, we implement several helper functions. We start with a simple one to dynamically create our dataset paths. Note the mixed argument: While I have not implemented this feature for this project, you can experiment with the code to create a third class, male-female mixed speech samples.

Going on, we implement a function that returns all the speech samples available for a speaker. In the last section, we created a speaker.pkl file that stores the directory paths of the speakers. With this method, we get all samples by a particular speaker (not only the directory, but the actual .flac files):

In the next method, we create the train, test, and validation splits for the speaker samples. Note that so far we have not created any training data; we merely sort our speech samples into non-overlapping splits. We load the speakers, and for each gender use the first 80 % of the speakers as training speakers, the next 10 % as validation speakers, and the last 10 % as test speakers. To make this process repeatable, we sort the list beforehand (we could further modify this by shuffling with the seed argument).

In the return statement, using the previous method, we convert our list of speakers to a list of actual samples. So far each subset contains the path to the speakers’ directories; now we obtain the paths to audio samples by these speakers:
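The split logic itself can be sketched in a few lines; the function name and signature here are assumptions, but the 80/10/10 scheme and the sorting-for-repeatability step match the description above:

```python
def make_splits(speakers):
    """Sort for repeatability, then split 80/10/10 without overlap."""
    speakers = sorted(speakers)
    n_train = int(0.8 * len(speakers))
    n_valid = int(0.1 * len(speakers))
    train = speakers[:n_train]
    valid = speakers[n_train:n_train + n_valid]
    test = speakers[n_train + n_valid:]
    return train, valid, test
```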

Generating speech samples

We can finally create the two functions that generate the actual data used for the training, test, and validation subsets. For a list of file paths (speech_list), we load the speech samples, randomly permutate the audio, and store them. Then, we start to overlay the samples onto a base sound (base_sound), which could be anything but is in our case simply a 60-second audio sample of silence.

Note that we implement two functions here: The first function overlays all the speech samples at the same time location, simulating a room full of people talking. The second function overlays the samples in sequence, with 0 to 5 seconds of pause between:
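The PyDub overlay itself is one call per sample; the interesting part of the second function is the timing. Here is a sketch of that sequential-placement logic, assuming durations in milliseconds (the function name is hypothetical):

```python
import random

def sequential_positions(durations_ms, seed=None):
    """Start offset of each speech sample when samples are placed
    in sequence with a random 0-5 second pause in between."""
    rng = random.Random(seed)
    positions = []
    cursor = 0
    for duration in durations_ms:
        positions.append(cursor)
        # Advance past this sample plus a random pause of 0-5 seconds.
        cursor += duration + rng.randint(0, 5000)
    return positions

# With PyDub, each sample would then be overlaid onto a 60 s silent base:
#   base_sound = AudioSegment.silent(duration=60_000)
#   for sample, pos in zip(samples, positions):
#       base_sound = base_sound.overlay(sample, position=pos)
```

The first function (everyone talking at once) is the degenerate case where every position is 0.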

You have probably noticed that we called a function permutate() in the previous code. The function is short: we randomly vary the loudness and the length of the speech. It can serve as a starting point for more complex augmentations:
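A minimal stand-in for that augmentation, working on a raw NumPy array instead of a PyDub segment (PyDub would use dB gain and milliseconds; the ranges here are illustrative):

```python
import numpy as np

def permutate(audio, rng):
    """Randomly vary loudness and length of a raw audio array."""
    gain = rng.uniform(0.5, 1.5)   # loudness: scale amplitude +/- 50 %
    keep = rng.uniform(0.7, 1.0)   # length: keep 70-100 % of the samples
    new_length = int(len(audio) * keep)
    return audio[:new_length] * gain

rng = np.random.default_rng(42)
audio = np.ones(22050, dtype=np.float32)  # one second of dummy audio
augmented = permutate(audio, rng)
```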

We now create a small wrapper function that wraps the three functions above:

Before we finally create our main method, we need a function that draws speech samples from the female/male train, test, and validation subsets. We implement this function in the next step. Given a number_of_samples, a gender (indicating which sets to draw from), and a subset (the actual subset to draw from), we return a list of random samples from the according subset. If gender is 1 (True), we draw from the female sets; otherwise, we draw our random sample from the male sets:
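A sketch of that drawing function; the data-structure layout (dicts mapping subset names to file-path lists) is an assumption for illustration:

```python
import random

def draw_samples(number_of_samples, gender, subset,
                 female_sets, male_sets, seed=None):
    """Return `number_of_samples` random paths from the according subset.

    `gender` is 1 (True) for female, 0 for male; `subset` is one of
    "train", "valid", "test".
    """
    rng = random.Random(seed)
    source = female_sets if gender else male_sets
    return rng.sample(source[subset], number_of_samples)

female_sets = {"train": ["f1.flac", "f2.flac", "f3.flac"]}
male_sets = {"train": ["m1.flac", "m2.flac"]}
picked = draw_samples(2, gender=1, subset="train",
                      female_sets=female_sets, male_sets=male_sets, seed=0)
```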

Main method

Now, the core method, create_sample(). We create a base_sound onto which we overlay all speech samples. I decided to work with data of length 60 seconds, so I set the duration to 60 000 milliseconds. We then decide whether we want to create a female or a male training sample. Given these two pieces of information, we sample some speech fragments to overlay, using the method implemented earlier. If, in principle, we allow multiple speakers to speak at the same time stamp, we now decide whether we do so for this concrete sample (this is why we implemented two separate methods before). We then create the speech overlay and check whether our created data sample is actually long enough.

If this sanity check holds – which is important for the next section – we store our data sample on disk. For this, we need the subset argument, which tells us for which, well, subset we use this generated sample. This subset is also the label information, which we return together with the out_name:

We use these two pieces of information to create a CSV file during the process; this file contains a path→label mapping. We do this to have an overview of the files we created. Sure, we could iterate over the created folders later on, but that is inefficient when we can record the mapping as a byproduct.
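The bookkeeping is a one-liner per sample; here is a sketch (file and column names are assumptions):

```python
import csv

def append_mapping(csv_path, out_name, label):
    """Append one path -> label row to the dataset overview CSV."""
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow([out_name, label])

append_mapping("train.csv", "train/0.wav", 0)   # a male sample
append_mapping("train.csv", "train/1.wav", 1)   # a female sample

with open("train.csv") as f:
    rows = list(csv.reader(f))
```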

Command line arguments

We also use the argparse module to define our command line arguments and any default parameters. The --output_path argument points to a folder where our dataset will be stored, and the --speaker_file argument points to the speaker file created in the previous section. The other arguments are self-explanatory; the --mixed argument is listed for completeness in case you want to expand the script to create a mixed category, and has no functionality. The num_{train,test,valid}_samples arguments control the number of samples in the according subsets. In research and for large datasets the split usually is 80 %, 10 %, and 10 %, but for our purposes we just create an equal number of samples for each split.

To call all functions, we use Python's if __name__ == "__main__" statement. The lists of speech samples, both male and female, are available globally when created here. We create our output paths for the data subsets, and then create our train, test, and validation subsets, with the loop's index as the name of the file currently created. The progress is visualized with the help of the tqdm package, which displays a live progress bar. After each subset, we save the list of newly created samples into a CSV file, for use in the next section:

On Colab the creation of one file takes around 5 seconds, which is slow. This might be because we write our data to the (slow) Google Drive storage. On a machine with SSDs and better CPUs this is faster. Since we only create our audio data once, this is negligible though.


That’s it for this section. The code is available as a Colab notebook here. In the next section, we use our created audio data to create TFRecord files, to make training efficient and fast.

Creating TFRecord files

Photo by Mike Benna on Unsplash

In this section, we’ll go over writing our recently created audio data into TensorFlow’s TFRecord file format. I assume some familiarity with TFRecord’s layout, but will recap what we need along the way. If you are unfamiliar with it, you can read my introductions [here](https://towardsdatascience.com/a-practical-guide-to-tfrecords-584536bc786c) (where we train a CNN) or here (where we cover parsing all sorts of data with TensorFlow). Essentially, this format stores the data in a serial layout, making it fast to read from the drive. Truth be told, this conversion step might not be necessary for our small dataset of a mere few hundred samples and MBs.

For larger datasets this is quite handy – for a research project, we had datasets of a few hundred GBs, and data loading quickly became the bottleneck. After some optimization of the data handling, our time per epoch went from 10 minutes to two-ish minutes. That was better.

Imports

As always, we start with importing all necessary packages; there is nothing unusual here. We use the pandas library for loading the CSV files we created in the previous section, and librosa to load the audio data from the disk. We also define a global variable, tfr_dir, which will later point to the directory where all TFRecord files are stored:

Helper functions

We then create four short helper functions that convert a data point to a Feature object. The Feature object is the core object stored within a TFRecord file, that is, within a single Example object (as we will see shortly):

With these helper functions in place, we can define the layout of an example file. This dictionary contains all features that we might need later on: The sampling rate, the label, the shape, and the actual data. The following function takes this information and returns the dictionary filled with converted Feature objects:
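A sketch of these converters and the layout function. The helper names follow TensorFlow's documentation convention; the exact feature keys ("audio", "label", "shape", "sampling_rate") are assumptions matching the description above:

```python
import tensorflow as tf

# Standard converters from Python values to tf.train.Feature objects.
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def parse_single_audio(audio, label, sampling_rate):
    """Fill the example layout with converted Feature objects."""
    features = {
        "audio": _bytes_feature(tf.io.serialize_tensor(audio).numpy()),
        "label": _int64_feature(label),
        "shape": _int64_feature(audio.shape[0]),
        "sampling_rate": _int64_feature(sampling_rate),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))

example = parse_single_audio(tf.zeros(4), label=1, sampling_rate=22050)
```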

Main method

Now we come to the core method, which creates the TFRecord files. This method is rather long, but nothing special happens here; it is a more sophisticated version of the method presented [here](https://towardsdatascience.com/a-practical-guide-to-tfrecords-584536bc786c). For better understanding, we will go over it in chunks, starting with the loading of the CSV files and determining the number of shards (think containers) that we need to hold all our data. TensorFlow suggests keeping the size of a single shard above a hundred MBs, which means storing a substantial number of samples per shard. As we have a small dataset, we will just use two shards per data subset (train, validation, test) later on.

We load our CSV file with pandas, shuffle it, and then determine the number of shards. The +1 offset accounts for cases where the number of samples falls between two boundaries: with a shard size – the parameter max_files – of 100 and only 50 data points, 50 // 100 would evaluate to 0. The +1 ensures that we have at least one shard to hold our data. On the other hand, if our number of files is an integer multiple of our shard size, we do not need an extra (empty) shard; we handle this case with the if-statement:
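The shard arithmetic in isolation (the function name is hypothetical, the logic follows the description above):

```python
def determine_shard_count(n_files, max_files=100):
    """Number of shards needed to hold n_files, with max_files per shard."""
    n_shards = n_files // max_files + 1   # +1 so e.g. 50 files still get one shard
    if n_files % max_files == 0:
        n_shards -= 1                     # no empty trailing shard needed
    return n_shards
```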

In the following chunk we first create a file counter to count the number of files we successfully store into the shards. We then iterate over each shard (dubbed splits) and create a file name for this TFRecord file, following this convention:

{current_shard}_{all_shards}_{subset}.tfrecords

With this information, we create a writer to write the file to disk. We then fill the shard, stopping early if we have already stored all audio samples. We load the samples in the lower part and check whether each sample has a duration of 60 seconds (60 seconds × 22 050 Hz sampling rate = 1 323 000 data points):

With our audio file read from disk, we can now create an Example object holding all the information we determined above. This Example object is then written to the current TFRecord file. We lastly increase the file_counter, close the writer when we have looped over all shards, and print a last statement to close this method:

In a similar spirit, we create a function that loads some audio samples from a given subset and stores them in a numpy array. This array can be used later during training to log some statistics (as we will see in the last section). This is completely optional; we have done the crucial part with the method above.

We consume our CSV file; if it only contains a few samples, we can optionally decide to use all of them for monitoring and not just a small subset. We create an empty numpy array as a placeholder, then iterate over our subset and fill the array as we go. We finally store the audio data and the according labels in two separate files:

Our main method first overwrites the global tfr_dir variable (which is accessed in the two methods above), and then creates the TFRecord files for all subsets. Optionally, it creates the monitoring samples, again for each subset. We control this through the args parameter, which holds all our command line arguments:

Command line arguments

Finally, we call the main method with Python's __name__ check. This starts the creation of our TFRecord files:


That’s it for this section. The code is available as a Colab notebook here. In the next section, we finally have contact with the Deep Learning part of this post, defining our model and callbacks to be used during training.

Training

Photo by Alina Grubnyak on Unsplash

In this section, I cover working with the TFRecord data, setting up callbacks, setting up our network, and finally the training itself.

Imports

As usual, we start with the imports. A particularity is that we'll use the raw audio data as input and let the network (and thus the GPUs) do any further computation. For this, we use the kapre Python package, a collection of custom TensorFlow layers explicitly designed for audio-related tasks. Secondly, we also install the Weights & Biases package, wandb. W&B offers seamless integration of model and statistics tracking. I've set up the code to not depend on W&B (as a free account is mandatory), so you are able to run it in any case:

Helper functions

In the next step, we implement two helper functions.

First, we code a function to set the seeds for any random generators we might use, either explicitly or implicitly (for TensorFlow, my experience is that this is sadly not reliable):

The second function creates a logger object, logging anything of interest to a file. This is a common best practice; I save the logs to verify and comprehend any later results. We will use the returned logger object throughout the code:

Data handling

Now we handle the data related part, namely parsing the TFRecord files that we created in the last section, and creating a TFRecordDataset. To load any TFRecord files, we first create a function that returns a list of files matching a pattern.

Remember the names we gave our TFRecord files? Based on their filenames, we sort the files into train, test, and validation sets, logging (and printing) the number of shards we found per subset:

Now that we have a way of getting the files, we just need a method to extract the single data samples stored within. In the last section, we defined the structure of a single sample with this code:

We will mimic that now, but this time extracting a sample, rather than writing. We define a dictionary placeholder for our data (akin to above), and then fill it with the data from a single element (one audio sample in our case).

We then take the feature (our raw audio data), and parse it to a tensor, reshaping it afterwards, based on the data shape we stored in x and y. For our task, the feature and its label are of importance, so we only return these two:

So far we can find all the TFRecord files and parse a single example. However, our model does not work on a single example only, so we need to create a structure that automatically parses all samples from our files. This is where a TFRecordDataset comes in handy.

We first create an AUTOTUNE object, which automatically determines the best parameters regarding prefetching and parallelization. Given a list of TFRecord files, we create the aforementioned TFRecordDataset, setting the number of bytes in the read buffer to 1000.

Next, we map every element to the parse_tfr_elem() function. Upon loading, every element "passes" this function and is transformed into a {feature, label} pair. We set both the number of parallel calls and the number of elements to prefetch to AUTOTUNE.

If we provided a cache path, we enable caching of the dataset to this location; otherwise, we cache it in RAM. This option is useful if your data resides on a slow hard disk, as it prevents reloading the samples each epoch. Caching is optional; just delete the four lines if you do not need it.

Lastly, we shuffle, batch, and repeat the dataset. If we did not enable repetition, the dataset would be exhausted after one iteration:
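The whole pipeline can be sketched end-to-end. The feature keys mirror the hypothetical layout from the writing step, and the buffer/shuffle sizes are illustrative; a tiny dummy example is round-tripped at the end to show the parsing works:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def parse_tfr_elem(element):
    """Parse one serialized Example back into a (feature, label) pair."""
    layout = {
        "audio": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
        "shape": tf.io.FixedLenFeature([], tf.int64),
        "sampling_rate": tf.io.FixedLenFeature([], tf.int64),
    }
    content = tf.io.parse_single_example(element, layout)
    feature = tf.io.parse_tensor(content["audio"], out_type=tf.float32)
    feature = tf.reshape(feature, shape=[content["shape"], 1])
    return feature, content["label"]

def get_dataset(filenames, batch_size):
    dataset = tf.data.TFRecordDataset(filenames, buffer_size=1000)
    dataset = dataset.map(parse_tfr_elem, num_parallel_calls=AUTOTUNE)
    dataset = dataset.shuffle(100).batch(batch_size).repeat()
    return dataset.prefetch(AUTOTUNE)

# Round trip one tiny dummy example through a TFRecord file.
audio = tf.zeros(8, dtype=tf.float32)
example = tf.train.Example(features=tf.train.Features(feature={
    "audio": tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[tf.io.serialize_tensor(audio).numpy()])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    "shape": tf.train.Feature(int64_list=tf.train.Int64List(value=[8])),
    "sampling_rate": tf.train.Feature(int64_list=tf.train.Int64List(value=[22050])),
}))
with tf.io.TFRecordWriter("demo.tfrecords") as writer:
    writer.write(example.SerializeToString())

features, labels = next(iter(get_dataset(["demo.tfrecords"], batch_size=1)))
```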

Distribution strategy

With the next code, we dynamically define under which strategy to train the model. The advantage of writing strategy-independent code is that it works with a single training device, but also with two, three, or even more. (So far, I've used this code to do multi-device training with two cards, and TPU training.) The code is not mine, but comes from Hugging Face's transformers library:

Callbacks

The next two code blocks are for the callbacks during training, which I’ve covered in slightly more detail here.

We implement a ClassificationReportCallback, which writes sklearn’s classification report to TensorBoard every k epochs. If we want to log our experiment with W&B, we optionally upload the report there:

The second custom callback is the ConfusionMatrixCallback, which, well, creates a confusion matrix every frequency epochs. The code is largely based on a tutorial from TensorFlow:

With our two custom callbacks in place, we now need a method to set them up. To do this, we first create a TensorBoard directory for everything that we log, and create a file_writer object. With this object, we can conveniently log images, scalars, and text data to TensorBoard, as we have already seen in the two callbacks above.

We then instantiate a TensorBoard callback to monitor training; nothing special here. What's more interesting is setting up our own callbacks. We begin with the ConfusionMatrixCallback. As the name hints, it generates a confusion matrix every k epochs. To create this plot, we parse some sample data – remember the data that we additionally saved into a numpy array in the TFRecord generation step? We'll use that now to generate live confusion matrices, which are written to TensorBoard with the file_writer object we previously created.

We create two of these callbacks, one for the validation samples and one for the training samples.

The next callback that we instantiate is our ClassificationReportCallback. We use it to log training statistics every k epochs:

              precision    recall  f1-score   support

        male       0.00      0.00      0.00        10
      female       0.60      1.00      0.75        15

    accuracy                           0.60        25

The last callback we use is the default EarlyStopping callback. A common problem in Deep Learning is overfitting: the model – deeper and larger ones are more prone to it – reaches a high score on the training data but fails to generalize to unseen data. There are several ways to counter this; one is using a separate validation set. We stop the training once the score on this subset has not improved for patience epochs; we also enable restoring the model's weights from this best state.
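A typical configuration of this built-in callback; the monitored metric and the patience value here are illustrative choices, not necessarily the ones used in the notebook:

```python
import tensorflow as tf

# Watch the validation loss, stop after `patience` epochs without
# improvement, and roll back to the best weights seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,
    restore_best_weights=True,
)
```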

If we use W&B, we lastly set up their callback, which automatically handles logging data to their servers.

With all the callbacks in place we then return a list of them:

Audio features

Now we finally come to the model setup. To understand it, let’s have a short excursion into the world of audio classification:


You might have noticed that we write raw audio data to our TFRecord files, and that we also feed raw audio data to our model. The usual approach, which I also used in my thesis, is to use features instead. Features are measurable or observable characteristics of the data; a common audio feature is the (Mel-scaled) spectrogram.

Let me first cover the spectrogram before we go further. The normal spectrogram shows the power of a frequency at a given time stamp:

Note that we use a log scale, and that higher values – indicating more power – appear brighter. Based on this normal spectrogram, we have the Mel-scaled spectrogram:

The human auditory system can process frequencies up to 20 000 Hertz (give or take, depending on age, training, and giftedness), but not linearly: above 12 000 Hz, the ability to differentiate frequencies – or sounds – decreases strongly [2]. The Mel scale builds on this fact; lower frequency ranges are mapped to more bins, higher frequencies to fewer bins.

In other words: for lower frequencies, we have narrow bins, that is, each bin captures only a short range of frequencies. For higher frequencies, we have wide bins, that is, a wide range of frequencies is mapped to the same bin. Notice that the y-axis is still logarithmic but only goes up to 4 kHz – and shows more energy in this range, compared to the normal spectrogram.

Another useful addition are Delta features, approximations of the derivative of a curve (in our case, the audio signal) at a given time stamp. I have read some papers where the Mel-spectrogram and Delta features are combined [3]. And that's what we do here: our first channel holds the Mel-spectrogram, our second channel its Delta values.

Now, why not pre-compute them? Because if we were to change any influential parameter, we would have to do it all over again. The benefit of having the network do the computation is, first, that we can experiment faster; second, that we can parameterize the creation; and third, that we leave our original data untouched. And we don't clutter our hard drive with more folders and files.

That being said, we can now return to defining the network.


Model

The first layer is our input layer; it receives a data sample of shape (1323000, 1). The first dimension comes from 60 seconds × 22 050 Hz, or 60 seconds with 22 050 data samples per second. The second entry, 1, is the number of channels, which is mono in our case. (On a side note, kapre supports more channels; I've seen 6-channel audio inputs.)

Next, we define our audio-specific Mel-spectrogram layer. This layer, as the name hints, generates a Mel-scaled spectrogram from the raw audio input. The parameters control the resulting spectrogram – covering them is not the scope of this post. If you are curious about their effects, you can consult librosa’s documentation.

The next layer is the Delta layer, computing the delta values for the spectrogram input. We combine them to yield a two-channel tensor.

With the next operation, framing by the Frame layer, we split the input into frames. This is a common technique. Think of a long audio sample: Instead of keeping it in this long version, you chunk it up into (overlapping) segments. That’s what happens here.

The next blocks are pretty standard: we use a convolution – batch norm – dropout – pooling block. The convolution's kernels learn low-level features; the batch normalization scales the output's mean close to 0 and its standard deviation close to 1 (that's one of those cases where statistics, or rather distributions, play an important role). The dropout layer prevents overfitting, and the MaxPooling layer halves each dimension except the last, due to us setting the pool size to (2, 2, 2).

This block is repeated four times, with the last block ending in a global max pooling operation. This operation simply selects the maximum value per axis. We do this to keep the number of weights small – we could have directly followed with a dense layer, but that would have ramped up the number of weights quite drastically.

We end our model with a dense layer with two neurons, one per class, and softmax activation. This activation returns a probability distribution over the neurons, which sums up to 1. To infer the class, we take the position of the highest value: if our prediction vector were [0.2, 0.8], the class would be 1 (female, for our overall task).

The corresponding code:
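Since kapre may not be available everywhere, here is a sketch of the backbone only: the kapre Mel-spectrogram, Delta, and Frame layers would sit in front of this and produce the framed two-channel input. The input shape and filter counts are illustrative stand-ins, not the notebook's actual hyperparameters:

```python
import tensorflow as tf

def conv_block(x, filters, dropout=0.1):
    """Convolution -> batch norm -> dropout -> pooling, as described above."""
    x = tf.keras.layers.Conv3D(filters, kernel_size=3, padding="same",
                               activation="relu")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(dropout)(x)
    # Pool size (2, 2, 2) halves each dimension except the channel axis.
    x = tf.keras.layers.MaxPooling3D(pool_size=(2, 2, 2))(x)
    return x

def build_model(frames_shape=(16, 16, 16, 2)):
    """Backbone sketch; kapre's audio layers would feed this input."""
    inputs = tf.keras.Input(shape=frames_shape)
    x = inputs
    for filters in (8, 16, 32, 64):   # four blocks, filter counts illustrative
        x = conv_block(x, filters)
    x = tf.keras.layers.GlobalMaxPooling3D()(x)
    outputs = tf.keras.layers.Dense(2, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_model()
```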

The code above relies on a configuration file. The next method creates our model configuration:

Why do we work with a configuration file rather than hard-coding the hyperparameters? First, you can easily log this configuration to disk; second, you can adapt and load such a configuration from disk; and third, you can incorporate it into a hyperparameter search.

Setup

With our main functions defined above, we now create several wrapper functions. Following the principles of clean code, these functions are rather simple; each mainly does one thing: calling the previous code.

The first code is for setting up the model. Given a training strategy, the model configuration (number of kernels, dropout probability, etc.), and the command line arguments (the args parameter), we create a model and compile it. As a loss, we use SparseCategoricalCrossentropy, which is the same as CategoricalCrossentropy but works with raw class numbers rather than requiring them to be encoded as one-hot vectors. The same holds for the sparse categorical accuracy.
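To see why the two losses agree, here is the arithmetic in plain NumPy for the example prediction [0.2, 0.8] used earlier:

```python
import numpy as np

prediction = np.array([0.2, 0.8])     # softmax output for one sample
sparse_label = 1                      # raw class index (female)
one_hot_label = np.array([0.0, 1.0])  # the same label, one-hot encoded

# SparseCategoricalCrossentropy: index directly with the class number.
sparse_loss = -np.log(prediction[sparse_label])

# CategoricalCrossentropy: the one-hot vector selects the same entry.
categorical_loss = -np.sum(one_hot_label * np.log(prediction))
```

Both evaluate to -log(0.8) ≈ 0.223; only the label encoding differs.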

Secondly, we also save a visualization of the model to our out_dir, the main directory where everything is written to. The parameters are all self-explanatory except rankdir. This parameter controls the orientation of the plot; ‘TB’ is for vertical plots (as seen above), ‘LR’ creates horizontal plots:

The next function saves our trained model to disk. We create the model directory if it does not exist already, and then save the model there. Because we utilize custom layers, we use the ‘TF’ save format:

With the next function, we evaluate our model on the test data. We log a nicely formatted message containing the loss and accuracy we have achieved, and return these two values. The test_dataset does not need to be passed to the function; it's available globally (you will see soon), as is the logger object:

The next short function only serves one purpose: Model training. It accepts the model, a list of callbacks to apply during training, and the command line arguments (of which we only need the number of epochs). Since we set the dataset to repeat endlessly (see above), we need to tell TensorFlow how many batches one epoch contains, both for the training and validation dataset. Once training finishes, we return the trained model:

With these short functions in place, we create a function that calls them appropriately. We begin by setting the seeds, then request our model configuration. If we want to use W&B, we log in at this point (completely optional). We then determine the distribution strategy used to train the model and set up our model with it. Lastly, we initialize all the callbacks, which we return together with the model:

This function is called by our main() function, which is responsible for calling the setup function, printing the model’s summary, and training. We can also log our results to W&B (completely optional), and then just run some steps to shut down the code:

Training

And now, the code to run this all: We use the argparse library to parse the command line arguments. They control the directory to save the model, the number of training epochs, the batch size, the seed, and many more – just check their description:

After we have parsed the command line arguments, we use Python's if __name__ == "__main__" statement to get everything going.

We create our output directory and determine the actual batch size. If we have two or more devices to train on, we multiply our per_device_batch_size with the number of devices to speed up training. You can experiment with these values; they are just a solid foundation.

With our batch size determined, we then determine the number of steps needed to iterate through our train, test, and validation sets. This value is calculated by dividing the dataset length (50 in our case, since we have 50 audio samples per set) by the batch size.
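That calculation in isolation; the rounding-up for a partial last batch is an assumption on my part (plain floor division also works if the dataset length is a multiple of the batch size):

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Batches needed to see every sample once; round up for a partial batch."""
    return math.ceil(num_samples / batch_size)

train_steps = steps_per_epoch(50, 8)   # 50 samples per set in our case
```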

We also load our additional numpy data which is used by our callbacks.

Everything that happens here is available globally; we therefore do not have to pass anything to the main method – the args parameter is just for clarification. Once main() is called, it'll set up the network, train it, save it, and evaluate it. The script has successfully finished when we see

Script finished.

After trying to detect the gender of the speakers in the test set, I get

Finished evaluation. Loss: 0.3812, Accuracy: 0.9583.

That’s it. The code is available in a Colab notebook here.


Summary

To sum up what we did:

Our overall task was to classify the gender of the speaker, using custom audio data.

We started by downloading the public LibriSpeech dataset, which consists of short audio samples of volunteers reading audiobooks, and created an index of the speakers, sorted by their gender.

With these files, we created our custom audio dataset; one class was the male class (label 0), the other class was the female class (1). We created the training, the validation, and the test sets.

In the third step, we created a TFRecord dataset from our raw FLAC audio data. We did this to speed up data loading – the data is saved sequentially – and also to learn how to work with it (TFRecords are really useful. But somewhat complex. See here for a hands-on guide).

In our fourth and last step, we combined everything we did before: We wrote code to load our dataset, we used custom callbacks, and yes, we also did some Deep Learning stuff: creating and training the model.

Where to go from now?

There are some things that you might have noticed.

The first is the size of the dataset. You can increase it to several hundred or thousand samples – I have used the code to create a dataset of 10 000 samples, so that’s definitely possible. In this case you have to run this locally, since the 15 GB Google Drive storage is not sufficient.

A second noteworthy thing is the unstable training: the loss and accuracy jump around. Try different learning rates first, then experiment with other optimizers.

A third option is to try completely different architectures – a Siamese network, an LSTM.


That marks the end of this blog post.

The links to the Colab notebooks, to run things for yourself:

References

[1] Panayotov, Vassil, et al., LibriSpeech: an ASR corpus based on public domain audio books (2015), IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

[2] R. J. Pumphrey, Upper limit of frequency for human hearing (1950), Nature

[3] Karol J. Piczak, Environmental sound classification with convolutional neural networks (2015), IEEE 25th International Workshop on Machine Learning for Signal Processing

