What did the bird say? Part 6 - CNN proof of concept

Or how we got ~70% accuracy with 132 classes on a ~7k bird song dataset

Posted by snakers41 on July 29, 2017

Looks familiar, right? Actually this is roughly the 100th epoch. This is the best TL;DR of the article.

Article list

What did the bird say? Bird voice recognition. Part 1 - the beginning

What did the bird say? Bird voice recognition. Part 2 - taxonomy

What did the bird say? Bird voice recognition. Part 3 - Listen to the birds

What did the bird say? Bird voice recognition. Part 4 - Dataset choice, data download and pre-processing, visualization and analysis

What did the bird say? Bird voice recognition. Part 5 - Data pre-processing for CNNs

What did the bird say? Bird voice recognition. Part 6 - Neural network MVP

What did the bird say? Bird voice recognition. Part 7 - full dataset preprocessing (169GB)

What did the bird say? Bird voice recognition. Part 8 - fast Squeeze-net on 800k images - 500s per epoch (10-20x speed up)


A plain vanilla VGG-16 style convnet with added regularization (batch normalization, image transformations, rolling-window image preprocessing, no noise removal) got us ~70% accuracy on the validation dataset. The dataset contained ~7k bird songs split into ~60k files covering 132 bird genera.

Further investigation of squeeze-net type architectures / larger dataset sizes will be made with the primary goal of deploying the model into a mobile application capable of recognizing bird genus / species by listening to the bird's song.

1. MVP model

I will not describe the tedious stuff in detail in this article; you will be able to find all of it in the files below, with ample comments and structure:

  • ipynb (as usual, use collapsible headers for easiest navigation);
  • HTML;

Well, first of all let me remind you what our input data looks like. We have ca. 60k bird song fragments for 132 bird genera that look like this:

When thinking about MVP model architecture I kept in mind the following factors:

  • Simple 1-hidden-layer NN (for testing purposes) produced poor results - max. ca 10% accuracy;
  • This particular task seems much more difficult than CIFAR or MNIST and much less difficult than general computer vision tasks, and the patterns are generally easy to see with the naked eye. This probably requires some "heavy artillery";
  • Right now 60-70% of all published deep learning papers and blog posts more or less cover either technical subjects (optimization, meta-learning, new architectures, saving time, etc.) or subjects related to computer vision, which is considered to be "solved" for small scale applications with 99% accuracy;
  • Computer speech recognition is not considered to be solved with 99% accuracy, and I have seen far fewer papers and blog posts on speech / sound related topics. Also, when was the last time you used speech-to-text apps / APIs / interfaces?;
  • For me Stanford MIR (2017) was a great educational source of information on back-of-the-envelope sound processing and inspiration;
  • The final model will be deployed in the mobile application;
  • Input "images" are (64,200) in size, but probably on the mobile device we will need to process them multiple times per second, so investigating squeeze-net architectures for actual models may be a good option (though model weights require only ~5-6MB of space even now - probably this will grow as more classes are added);

Without doing much architecture testing (and keeping in mind that the dataset can be expanded 10-100x) I decided to train a plain vanilla VGG-16 style network from scratch using code snippets from my previous project.

I used this image augmentation procedure in Keras:

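genImage = imageGeneratorSugar(
    featurewise_center = False,
    samplewise_center = False,
    featurewise_std_normalization = False,
    samplewise_std_normalization = False,
    rotation_range = 0,          # no rotations
    width_shift_range = 0.2,     # shift along the time axis
    height_shift_range = 0,      # no shift along the frequency axis
    shear_range = 0,
    zoom_range = 0.1,            # slight scaling
    # (any further arguments are cut off in the original listing)
)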

Basically we can shift the sound horizontally (in time) and probably scale it a little bit, but rotations and vertical shifts are off-limits, because in real life sound does not transform like this. I guess in real life a bird may stop singing, or sing a slightly higher or lower note, or pause a little, or stretch its sounds. Just look at this picture - it's clearly shifted, stretched and contains pauses.
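To make the "horizontal shifts only" idea concrete, here is a minimal NumPy sketch. It is not what the Keras generator does internally: the function name shift_time_axis is hypothetical, and filling the vacated columns with zeros is an assumption made for simplicity (Keras fills with the nearest pixels by default). The 0.2 fraction mirrors width_shift_range above.

```python
import numpy as np

def shift_time_axis(spectrogram, max_fraction=0.2, rng=None):
    """Shift a (freq, time) spectrogram left or right along the time axis only.

    Vacated columns are zero-filled (an assumption); the frequency axis
    is never touched, so pitch information stays where it is.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n_time = spectrogram.shape[1]
    max_shift = int(max_fraction * n_time)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(spectrogram)
    if shift > 0:
        # Move content to the right (later in time).
        shifted[:, shift:] = spectrogram[:, :n_time - shift]
    elif shift < 0:
        # Move content to the left (earlier in time).
        shifted[:, :n_time + shift] = spectrogram[:, -shift:]
    else:
        shifted[:] = spectrogram
    return shifted, shift

# Toy (64, 200) "image" matching the input size used in this post.
spec = np.arange(64 * 200, dtype=np.float32).reshape(64, 200)
aug, shift = shift_time_axis(spec, max_fraction=0.2, rng=np.random.default_rng(0))
print(aug.shape, shift)
```

The point of the design is that each row (frequency bin) keeps its position, which is exactly why height_shift_range and rotation_range are set to 0.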

This is the plain vanilla modern CNN model architecture:

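from keras.models import Sequential
from keras.layers import BatchNormalization, Convolution2D, Dense, Flatten
from keras.optimizers import Adam

def getTestModelNormalize(inputShapeTuple, classNumber):
    # This listing appears abbreviated; the full layer stack is in the notebook.
    model = Sequential([
            BatchNormalization(axis=1, input_shape = inputShapeTuple),
            Convolution2D(32, (3,3), activation='relu'),
            Convolution2D(64, (3,3), activation='relu'),
            Flatten(),  # added so the Dense layers receive one vector per sample
            Dense(200, activation='relu'),
            Dense(classNumber, activation='softmax')
    ])
    model.compile(Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
    return model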

Since the images are small, I trained the model using 512 pictures per batch for ~100 epochs in several takes with an early-stopping callback (it was not really triggered - my dataset choice and preprocessing probably did heavy regularization for me) to achieve ~70% accuracy.

batchSize = 128*4
train_batches = get_batches(train_path, genImage, batch_size=batchSize, imageSizeTuple = (64,200))
valid_batches = get_batches(valid_path, genImage, batch_size=batchSize,  imageSizeTuple = (64,200))
model2 = getTestModelNormalize(classNumber=132,inputShapeTuple=(64,200,3))
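For readers unfamiliar with early stopping, the rule can be sketched in plain Python. This is only an illustration of the patience logic, not the Keras callback itself; the function name, the patience value and the loss curves below are all made up for the example (the post does not say which settings were used).

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation loss curves, for illustration only.
still_improving = [2.0, 1.6, 1.4, 1.3, 1.25, 1.2]
plateaued = [2.0, 1.6, 1.5, 1.5, 1.55, 1.6, 1.7]

stop_a = early_stopping_epoch(still_improving, patience=3)
stop_b = early_stopping_epoch(plateaued, patience=3)
print(stop_a, stop_b)
```

A curve that keeps improving, like the first one, never triggers the rule, which matches the "not really triggered" observation above.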

This is what data augmentation looks like:

2. Interesting stuff

With 132 classes I can basically evaluate the model's performance using the following rules of thumb:

  • Use accuracy score (simple accuracy is ok, I do not want to get too fancy);
  • Look at which classes get predicted best / worst;
  • Look at which bird genera / families / orders have the most problems;
  • Look at activation visualizations to prove / disprove intuitions;

After running predictions on the validation set and producing accuracy scores for each bird genus, the accuracy distribution looks like this.
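Per-genus accuracy is simply the share of correctly predicted fragments within each true class. A minimal sketch, assuming the true and predicted labels are available as two parallel lists (the function name and the toy labels below are illustrative, not taken from the notebook):

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Map each true label to the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Toy labels for illustration only, not real results.
y_true = ["Turdus", "Turdus", "Turdus", "Parus", "Parus", "Strix"]
y_pred = ["Turdus", "Parus", "Turdus", "Parus", "Turdus", "Strix"]

acc = per_class_accuracy(y_true, y_pred)
# List classes from worst to best predicted.
for genus, value in sorted(acc.items(), key=lambda kv: kv[1]):
    print(f"{genus}: {value:.2f}")
```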

It's only natural to assume that this has something to do with class imbalance / sample limitations. Let's plot accuracy vs. total sample size per genus.
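Besides plotting, the relationship can be summarized with a correlation coefficient and a least-squares slope. The sketch below uses made-up sample counts and accuracies purely to show the computation; the function name is hypothetical and the numbers are not results from this project.

```python
import math

def pearson_and_slope(xs, ys):
    """Return (Pearson r, least-squares slope) for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy), sxy / sxx

# Toy data: fragments per genus and the accuracy reached for that genus.
samples_per_genus = [100, 200, 300, 400, 500]
accuracy_per_genus = [0.35, 0.50, 0.55, 0.70, 0.75]

r, slope = pearson_and_slope(samples_per_genus, accuracy_per_genus)
print(f"r = {r:.3f}, accuracy gain per extra sample = {slope:.4f}")
```

An r close to 1 would support the "more data => more accuracy" reading; a value near 0 would suggest that something other than sample size drives the per-genus differences.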

Looks interesting. But what happens becomes much clearer when we throw bird taxonomy into the mix (naked eye inspection is also cool, but this is much cooler).

The obvious conclusions are:

  1. Perching birds are the most numerous and the most difficult to tell apart. Because they sing! Also they clearly follow a linear pattern - more data => more accuracy;
  2. Overall, all the bird orders follow this rule - more data => more accuracy;
  3. Non-singing birds are much easier to classify;
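Bringing taxonomy into the picture amounts to grouping per-genus accuracy by order. A minimal sketch, assuming a genus-to-order lookup is available: the mapping below uses well-known genera chosen only for illustration (they are not necessarily in the dataset), and the accuracy values are toy numbers, not results.

```python
from collections import defaultdict

# Illustrative lookup; Passeriformes are the perching birds.
GENUS_TO_ORDER = {
    "Turdus": "Passeriformes",
    "Parus": "Passeriformes",
    "Strix": "Strigiformes",
    "Anas": "Anseriformes",
}

# Toy per-genus accuracies, for illustration only.
genus_accuracy = {"Turdus": 0.6, "Parus": 0.5, "Strix": 0.9, "Anas": 0.8}

def mean_accuracy_by_order(genus_accuracy, genus_to_order):
    """Unweighted mean of per-genus accuracy within each order."""
    grouped = defaultdict(list)
    for genus, acc in genus_accuracy.items():
        grouped[genus_to_order[genus]].append(acc)
    return {order: sum(vals) / len(vals) for order, vals in grouped.items()}

by_order = mean_accuracy_by_order(genus_accuracy, GENUS_TO_ORDER)
for order, value in sorted(by_order.items(), key=lambda kv: kv[1]):
    print(f"{order}: {value:.2f}")
```

Note that this is an unweighted mean over genera; weighting by the number of fragments per genus is an equally reasonable choice and may change the ranking.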

3. A small treat 

Just look at these cool visualizations produced using keras-vis. It would actually be cool to combine them with listening to the real sound and looking at random spectrograms, but I will think about this when the final model is ready.

Class activation visualizations

These are shown for a list of random classes. I can clearly see some patterns there. I think if we train this model on 10-100x more data, these distinctions will become much more apparent.

Attention maps

We can clearly see that the network is paying attention mostly to the actual bird call and not the background noise, which is inspiring.

4. What is next?

Try a squeeze-net architecture, try bigger datasets, and prepare for deployment in TensorFlow.