What did the bird say? Part 5 - Dataset Preprocessing for CNNs

You do not like birds? You just do not know how to cook them

Posted by snakers41 on July 29, 2017

Looks like someone managed to separate bird songs from noise. Is it useful? Probably...

Article list

What did the bird say? Bird voice recognition. Part 1 - the beginning

What did the bird say? Bird voice recognition. Part 2 - taxonomy

What did the bird say? Bird voice recognition. Part 3 - Listen to the birds

What did the bird say? Bird voice recognition. Part 4 - Dataset choice, data download and pre-processing, visualization and analysis

What did the bird say? Bird voice recognition. Part 5 - Data pre-processing for CNNs

What did the bird say? Bird voice recognition. Part 6 - Neural network MVP

What did the bird say? Bird voice recognition. Part 7 - full dataset preprocessing (169GB)

What did the bird say? Bird voice recognition. Part 8 - fast Squeeze-net on 800k images - 500s per epoch (10-20x speed up)

0. Remind me, why am I reading this?

Actually I have no idea, but last time we left off listening to random bird songs and looking at their spectrograms. This time we need to pre-process all the data properly to feed it to our CNN (convolutional neural network). In this article I will not bore you with the details of the actual implementation and will point out only the most important code snippets.

So, below you can find the materials:

  • ipynb (as usual, use collapsible headers for easiest navigation);
  • HTML;

If you have the collapsible headers plug-in installed, then the notebook will look like this:

Without further ado, let's begin!

1. Analyze which data we managed to download

Well, for the proof of concept stage we now have a nice dataframe of shape (9277, 34), indicating that we have ca. 9k bird songs.

Song length distribution looks like this (log10 scale)

File size distribution looks like this (log10 scale) - which in its turn means that a third of the full dataset (the most representative data) will weigh ca. 50-100GB

Top bird class distribution

2. Test SVD / NMF for sound vs. noise decomposition

In the recent computational linear algebra course by fast.ai I stumbled upon a list of matrix decomposition methods in lesson 2 that are also used for noise / voice separation. Let's try to apply them here and see the result. (For more fast.ai information consider their online textbook and their series of online videos.)

In a nutshell, SVD (Singular Value Decomposition) and NMF (non-negative matrix factorization) decompose matrices numerically.

  • SVD is an exact decomposition and NMF is an approximate one;
  • Also NMF produces only non-negative values and requires non-negative matrices to work with (which is really useful for real world applications);
  • In NMF you explicitly set the number of components, while SVD gives you the full decomposition;


To quote fast.ai, SVD is an exact decomposition, since the matrices it creates are big enough to fully cover the original matrix. SVD is extremely widely used in linear algebra, and specifically in data science, including:

  • semantic analysis
  • collaborative filtering/recommendations (winning entry for Netflix Prize)
  • data compression
  • principal component analysis


To quote fast.ai, nonnegative matrix factorization (NMF) is a non-exact factorization that factors into one skinny positive matrix and one short positive matrix. NMF is NP-hard and non-unique. There are a number of variations on it, created by adding different constraints.

Applications of NMF:

  • Face Decompositions
  • Collaborative Filtering, eg movie recommendations
  • Audio source separation
  • Chemistry
  • Bioinformatics and Gene Expression
  • Topic Modeling

It's easy to mentally map this approach onto separating a bird song from background noise. The result - as far as bird song / noise separation is concerned - NMF works and SVD does not.
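As a minimal sketch of how such a decomposition can be run (the spectrogram here is a random stand-in and the number of components is an assumption), scikit-learn's NMF could be applied like this:

```python
import numpy as np
from sklearn.decomposition import NMF

# NMF requires a non-negative matrix, so shift the log-scaled
# spectrogram (values in [-80, 0] dB) into the positive range first
spec = np.random.uniform(-80, 0, size=(128, 400))  # stand-in for a real mel spectrogram
spec_pos = spec - spec.min()

# factor the (n_mels x n_frames) matrix into W (spectral patterns)
# and H (their activations over time)
model = NMF(n_components=2, init='random', random_state=0)
W = model.fit_transform(spec_pos)   # shape (128, 2)
H = model.components_               # shape (2, 400)

# reconstruct each component separately to inspect song vs. noise
component_0 = np.outer(W[:, 0], H[0, :])
component_1 = np.outer(W[:, 1], H[1, :])
```

With a real spectrogram, one component typically captures the broadband background noise and the other the tonal bird song.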

NMF decomposition of random bird song

Note that I could not play back the decomposed tracks directly, because they were log-scaled (in dB, between -80 and 0 dB), and I had to shift them into positive numbers and then scale them back.

NMF decomposition of random bird song

Component importance

First and second components

Initially I wanted to include NMF in my pre-processing pipeline, but I decided against it for simplicity and speed. In the actual mobile app it will definitely make sense to use some kind of noise reduction / sound preprocessing, but that is a topic for a separate investigation.

Also, scientists are known to add noise to the data (e.g. see more in Andrew Ng's MOOC about ML pipelines) for regularization, data augmentation and sample extension purposes. So for the proof of concept I abandoned this idea.

3. Data preprocessing

Initially I wanted to try scikit-cuda or PyTorch for pre-processing, but I learned a couple of things:

  • scikit-cuda does not really install well from packages;
  • Numpy arrays read from and write to consumer-level SSDs blazingly fast, as long as you only have O(n)-type sequential operations;
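A quick sanity check of the sequential I/O claim (the array size and temp path here are illustrative):

```python
import os
import tempfile
import time
import numpy as np

# write and read back a ~40 MB float32 array to measure sequential I/O
arr = np.random.rand(128, 80000).astype(np.float32)
path = os.path.join(tempfile.gettempdir(), 'bench.npy')
np.save(path, arr)

start = time.time()
loaded = np.load(path)
elapsed = time.time() - start  # typically a fraction of a second on an SSD
```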

After some considerations, I chose the following hyperparameters for my dataset:

  • Ca. 7k bird songs;
  • Only bird genera with vernacular names (for ease of interpretation) and 50-90 bird songs per class (average ~50-70);
  • Only bird songs longer than 10 seconds;
  • Sliding window with length of ~5 seconds and ~2-3 second gap;
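The frame counts used below map to these durations under the standard librosa defaults (sr=22050, hop_length=512 - assumptions here, the notebook may use slightly different values):

```python
sr = 22050          # sample rate used when computing mel spectrograms
hop_length = 512    # default librosa hop; an assumption for this estimate

def frames_to_seconds(n_frames, sr=sr, hop_length=hop_length):
    """Approximate duration of n_frames spectrogram columns."""
    return n_frames * hop_length / sr

window_s = frames_to_seconds(200)  # ~4.6 s window
step_s = frames_to_seconds(300)    # ~7.0 s between window starts
gap_s = step_s - window_s          # ~2.3 s gap between windows
```

This is where the 200-frame window and 300-frame step in the cutting code below come from.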

# keep only processed songs longer than 10 seconds
cols = ['id', 'length', 'sr', 'mel_size', 'length_s']
sub_sample_df = sample_df[(sample_df['length_s'] > 10) & (sample_df.processed == 1)][cols]
sub_sample_df['w_id'] = ''  # placeholder for the sliding-window id

To check that the preprocessed .npy arrays with spectrograms really contain sound, I did the following simple test (I could not play the audio, though, for some reason):

import numpy as np
import matplotlib.pyplot as plt

sr = 22050
song = np.load('sound_features/mel_' + '97975' + '.npy')
plt.imshow(song, aspect='auto', interpolation='none', origin='lower')
# so the arrays really store sound =)

It produced this chart without any stylized elements:

After some fiddling, I produced this code snippet that cuts bird songs into sliding windows of ~5s with ~2-3s intervals. Amazingly enough, it took only a couple of minutes for the whole process to finish.

result_list = []
for index, row in sub_sample_df.iterrows():
    song = np.load('sound_features/mel_' + str(row.id) + '.npy')
    pointer = 0
    counter = 0
    song_length = song.shape[1]
    # slide a 200-frame (~5s) window with a 300-frame step,
    # i.e. leaving a ~2-3s gap between consecutive windows
    while pointer < song_length:
        counter = counter + 1
        np.save('sound_features_cut/' + str(row.id) + '_' + str(counter) + '.npy',
                song[:, pointer:pointer + 200])
        result_list.append([row.id, row.length, row.sr, row.mel_size, row.length_s, counter])
        pointer = pointer + 300
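One caveat of the loop above: the last slice of a song can be shorter than 200 frames. If fixed-width windows are needed downstream, a hypothetical helper (not part of the original notebook) could pad them:

```python
import numpy as np

def pad_window(window, width=200):
    """Right-pad a spectrogram slice with its minimum value (silence in dB)
    so every saved window has the same number of frames."""
    if window.shape[1] >= width:
        return window[:, :width]
    pad = width - window.shape[1]
    return np.pad(window, ((0, 0), (0, pad)), mode='constant',
                  constant_values=window.min())

short = np.random.uniform(-80, 0, size=(128, 150))
fixed = pad_window(short)  # shape (128, 200)
```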

This process can be illustrated with these charts:


Reasons for choosing such preprocessing routine:

  1. Speed and simplicity;
  2. Data annotation is costly and takes a lot of time. For 100k songs it would take an unreasonable amount of time;
  3. In a real environment (a mobile app) it's better to have relatively light-weight models (i.e. analyzing each 5s of sound) and then decide on the bird class using some kind of simple voting algorithm (logistic regression, regression trees, simple voting rules etc.);
  4. Birds have distinct 'sentences', but reasonable expert pattern recognition and annotation is simply not feasible;
  5. There are CNN-inspired approaches to more automated image / array segmentation, but they require some kind of annotated input anyway;

This technique produces a dataset of ca. 62k windows spanning 132 classes.

4. Working with NNs - data preprocessing for Keras

Ok, now we have the dataset stored in .npy arrays. This dataset actually takes only ca. 3-5GB of storage. Ideally we could fit it into RAM and feed it to Keras. But despite this being the 'right' approach, I actually decided to convert all of the .npy arrays to .jpg pictures and feed those to Keras. Why? Here is a list of reasons:

  • The full dataset will contain 10-100x more data than now, and it WILL NOT fit in the RAM;
  • Image preprocessing routines in Keras that work exceptionally well for regularization purposes work only with pictures;
  • Code reuse;

(Important remark: the final trained production model will be exported to TensorFlow as a computational graph and fed spectrograms directly - no worries.)

Surprisingly enough, this process is also blazingly fast. A 70-30 train-validation split is used.

# create validation and train sets for Keras
# note: sampling frac=0.3 and frac=0.7 independently would let the two
# sets overlap, so sample the validation set once and use the rest for train
import scipy.misc

valid_df = df.sample(frac=0.3)
train_df = df.drop(valid_df.index)

for subset, subset_df in [('valid', valid_df), ('train', train_df)]:
    for index, row in subset_df.iterrows():
        file_name = currentDir + '/sound_features_cut/' + str(row.id) + '_' + str(row.w_id) + '.npy'
        folder_name = str(row.vgenus)
        new_file_name = (currentDir + '/sound_features_cut/' + subset + '/'
                         + folder_name + '/' + str(row.id) + '_' + str(row.w_id) + '.jpg')
        song = np.load(file_name)
        scipy.misc.imsave(new_file_name, song)

All of this produces the dataset ready for CNN:
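With the train/ and valid/ folders laid out per class as above, feeding them to Keras could look like the sketch below (directory layout, image size and the Keras 2.x-era API are assumptions based on the snippets in this post; the tiny stand-in directory tree is built here only so the example is self-contained):

```python
import os
import tempfile
import numpy as np
from PIL import Image
from keras.preprocessing.image import ImageDataGenerator

# build a tiny stand-in directory tree: <root>/train/<genus>/<id>_<w_id>.jpg
root = tempfile.mkdtemp()
for genus in ('Turdus', 'Parus'):
    os.makedirs(os.path.join(root, 'train', genus))
    spec = np.random.randint(0, 255, size=(128, 200), dtype=np.uint8)
    Image.fromarray(spec).save(os.path.join(root, 'train', genus, '1_1.jpg'))

# rescale pixel values to [0, 1] and stream batches from disk,
# inferring labels from the per-genus sub-folders
train_gen = ImageDataGenerator(rescale=1./255).flow_from_directory(
    os.path.join(root, 'train'),
    target_size=(128, 200),   # (n_mels, n_frames) of the saved spectrograms
    batch_size=32,
    class_mode='categorical')
```

The same generator class also provides the image augmentation options (shifts, zooms, flips) mentioned above as one of the reasons for going the .jpg route.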