What did the bird say? Part 8 - fast Squeeze-net on 800k images - 500s per epoch (instead of 10k - 20k seconds)

What worked, what did not, and how to speed up your model training 20x

Posted by snakers41 on August 19, 2017


TLDR of the article - training a CNN with ca. 500k files in the train dataset (batch size 128) takes only ca. 500s per epoch


Article list

What did the bird say? Bird voice recognition. Part 1 - the beginning

What did the bird say? Bird voice recognition. Part 2 - taxonomy

What did the bird say? Bird voice recognition. Part 3 - Listen to the birds

What did the bird say? Bird voice recognition. Part 4 - Dataset choice, data download and pre-processing, visualization and analysis

What did the bird say? Bird voice recognition. Part 5 - Data pre-processing for CNNs

What did the bird say? Bird voice recognition. Part 6 - Neural network MVP

What did the bird say? Bird voice recognition. Part 7 - full dataset preprocessing (169GB)

What did the bird say? Bird voice recognition. Part 8 - fast Squeeze-net on 800k images - 500s per epoch (10-20x speed up)

0. TLDR - what helps train CNNs 10-20x faster with reasonable accuracy?

So, if you have read my previous articles, you may have noticed that I had everything ready to train the bird voice recognition model except for the actual pipeline, model architecture and the training itself. After researching what was available on the subject, these approaches helped me speed up my training time 10-20x:

  • Loading and preprocessing files in multiple threads in Keras - 5-10x speed bump (it saturates when you reach the theoretical IO bandwidth of your drive);


    A noticeable speed bump, right? The trend continues with more processes if you also factor image preprocessing into the equation

  • Using an SSD instead of an HDD - 1.5-2x speed bump (actually in my case I tried an mdadm RAID-10 array, but for some reason it saturated after processing ca. 10% of the images);



    The first one is the mdadm RAID-10 array, the second one is the SSD

  • Using the Squeeze-net architecture instead of more traditional (VGG-16) and heavier models (ResNet, AlexNet, Inception nets) - 4-5x speed bump;
  • Loading your files into RAM will probably give you some additional speed bump, but images are stored compressed, and a simple calculation (800k pics * 200x64 pixels * 1 channel * 4 bytes per pixel in float32) shows that the full grayscale dataset would take around 40GB of RAM (I had 16 when building the model);
  • Using cyclical learning rates may help you with faster / more stable convergence (see the sketch right after this list);
  • Also practice tells me that using batchnorm in the CNN architecture helps, but I did not test it here. When applying it in heavier CNN designs (U-Net) I noticed a 2-3x speed-up in training time;
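
Since cyclical learning rates are mentioned above, here is a minimal sketch of the triangular schedule from the cyclical learning rates paper, built on the standard Keras LearningRateScheduler callback. The base_lr, max_lr and step_size values are made up - tune them for your own model.

from keras.callbacks import LearningRateScheduler
import numpy as np

# Hypothetical values - tune for your own model; step_size is in epochs
base_lr, max_lr, step_size = 1e-4, 1e-2, 5

def triangular_clr(epoch):
    # Triangular policy: the rate ramps linearly from base_lr to max_lr
    # and back down over every 2 * step_size epochs
    cycle = np.floor(1 + epoch / (2 * step_size))
    x = np.abs(epoch / step_size - 2 * cycle + 1)
    return float(base_lr + (max_lr - base_lr) * max(0.0, 1 - x))

clr_callback = LearningRateScheduler(triangular_clr)
# then add clr_callback to the callbacks list passed to fit_generator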

1. Useful links on the subject

Actually, when you have this many files, there are three options for organizing them:

  • Have a folder for each category (which did not suit me because I wanted to test a species vs. genera model) and use flow_from_directory from Keras;
  • Write image augmentation / generators from scratch and / or use some Kaggle kernel boilerplates;
  • Use some kind of solution described here;


I ended up using a mixed approach, having written a custom generator on top of the Keras generators (a sketch of the base generator follows below).
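
For reference, here is a minimal sketch of the kind of Keras base generator such a wrapper can sit on. The path, target size and batch size below are hypothetical placeholders; note shuffle=False, which keeps the filename order aligned with the yielded batches (the custom generator shown later relies on this).

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = datagen.flow_from_directory(
    'data/train',           # hypothetical path with one sub-folder per class
    target_size=(64, 200),  # (height, width) of the spectrogram images
    color_mode='grayscale',
    batch_size=128,
    class_mode=None,        # yield images only - labels are attached later
    shuffle=False)          # keep order so filenames line up with labels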

A couple of useful links on Mobile nets and Squeeze net:

  1. A couple of Keras implementations of Squeeze net - 1, 2;
  2. Squeeze-net article;
  3. Mobile-net article;


Also, this amazing repository and set of tools / notebooks will help you get started with multi-threaded Keras image loading.

Also, here is a nice compilation of one-hot encoding techniques.

2. What did not work

Having figured out the 500-seconds-per-epoch model, the biggest disappointment was the fact that I could not train a model (or I am just bad at tuning hyper-parameters) to recognize a bird species by its song. In my final dataset there were 213 bird genera vs. 1623 bird species. Probably closely related bird species just sound too similar to distinguish.

As for bird genera, I am quite sure that acceptable accuracy can be achieved (it's a work in progress now).

I also did a lot of fiddling with loading files into memory and sparse arrays, but none of it turned out to be necessary in the end.

3. Snippets, code, useful examples for models in Keras

Here is the list of files (if you want to use my code, please also download and unpack the tools archive into the folder where you will keep the notebook).


As usual, I use the collapsible headings plugin for navigation; all the end-to-end training scripts are at the end of the notebook.

Squeeze net architecture

Basically all it takes is this:

from keras import backend as K
from keras.layers import (Input, Convolution2D, MaxPooling2D, Activation,
                          Dropout, GlobalAveragePooling2D, concatenate)
from keras.models import Model

# Layer-name fragments used by the reference Keras SqueezeNet implementation
sq1x1 = "squeeze1x1"
exp1x1 = "expand1x1"
exp3x3 = "expand3x3"
relu = "relu_"

def fire_module(x, fire_id, squeeze=16, expand=64):
    s_id = 'fire' + str(fire_id) + '/'
    if K.image_data_format() == 'channels_first':
        channel_axis = 1
    else:
        channel_axis = 3

    # Squeeze: 1x1 convolutions reduce the number of channels
    x = Convolution2D(squeeze, (1, 1), padding='valid', name=s_id + sq1x1)(x)
    x = Activation('relu', name=s_id + relu + sq1x1)(x)
    # Expand: parallel 1x1 and 3x3 convolutions, concatenated channel-wise
    left = Convolution2D(expand, (1, 1), padding='valid', name=s_id + exp1x1)(x)
    left = Activation('relu', name=s_id + relu + exp1x1)(left)
    right = Convolution2D(expand, (3, 3), padding='same', name=s_id + exp3x3)(x)
    right = Activation('relu', name=s_id + relu + exp3x3)(right)
    x = concatenate([left, right], axis=channel_axis, name=s_id + 'concat')
    return x

# Original SqueezeNet from the paper.
def SqueezeNet(input_shape=None, classes=1000):
    img_input = Input(shape=input_shape)
    x = Convolution2D(64, (3, 3), strides=(2, 2), padding='valid', name='conv1')(img_input)
    x = Activation('relu', name='relu_conv1')(x)
    x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), name='pool1')(x)
    x = fire_module(x, fire_id=2, squeeze=16, expand=64)
    x = fire_module(x, fire_id=3, squeeze=16, expand=64)
    x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), name='pool3')(x)
    x = fire_module(x, fire_id=4, squeeze=32, expand=128)
    x = fire_module(x, fire_id=5, squeeze=32, expand=128)
    x = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), name='pool5')(x)
    x = fire_module(x, fire_id=6, squeeze=48, expand=192)
    x = fire_module(x, fire_id=7, squeeze=48, expand=192)
    x = fire_module(x, fire_id=8, squeeze=64, expand=256)
    x = fire_module(x, fire_id=9, squeeze=64, expand=256)
    x = Dropout(0.5, name='drop9')(x)
    x = Convolution2D(classes, (1, 1), padding='valid', name='conv10')(x)
    x = Activation('relu', name='relu_conv10')(x)
    x = GlobalAveragePooling2D()(x)
    out = Activation('softmax', name='loss')(x)
    inputs = img_input
    model = Model(inputs, out, name='squeezenet')
    return model


I will also experiment with batch norm a bit later.
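
For completeness, a minimal usage sketch, assuming channels_last ordering: the (64, 200, 1) input corresponds to the 200x64 grayscale spectrograms discussed above, the 213 classes correspond to the genera, and the optimizer choice here is arbitrary.

model = SqueezeNet(input_shape=(64, 200, 1), classes=213)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()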

Custom generator

This is a simple generator that I built on top of the Keras generators; it allows feeding arbitrary labels to Keras. Note that code_dict is a dictionary that simply maps file ids to their respective label-encoded classes.

def custom_train_generator():
    # Relies on module-level state: the underlying Keras generator,
    # the label dictionary and the dataset / class counts
    global batch_size
    global train_generator
    global code_dict
    global train_image_count
    global sp_classes_len

    # The underlying generator must be created with shuffle=False so that
    # the filename order lines up with the batches it yields
    train_filenames = train_generator.filenames
    train_filenames = list(map(extract_id, train_filenames))
    while True:
        for start in range(0, train_image_count, batch_size):
            imgs = next(train_generator)
            file_list = train_filenames[start:start + batch_size]
            labels = indices_to_one_hot([code_dict[k] for k in file_list],
                                        sp_classes_len)
            yield (imgs, labels)


indices_to_one_hot is a handy function that transforms indices into one-hot encoded arrays (which Keras needs).

import numpy as np

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]
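
For example, with three classes:

indices_to_one_hot([0, 2, 1], 3)
# array([[1., 0., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.]])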


Nice callbacks

Do you want to just leave your model to train and forget about it? This example may be useful to get you started with Keras callbacks. It saves the best iterations of your model, reduces the learning rate on plateau and logs the training process.


from keras.callbacks import (EarlyStopping, ReduceLROnPlateau,
                             ModelCheckpoint, CSVLogger)

callbacks = [EarlyStopping(monitor='val_acc',
                           patience=10,
                           verbose=1,
                           min_delta=1e-4,
                           mode='max'),
             ReduceLROnPlateau(monitor='val_acc',
                               factor=0.1,
                               patience=4,
                               verbose=1,
                               epsilon=1e-4,  # renamed min_delta in later Keras
                               mode='max'),
             ModelCheckpoint(monitor='val_acc',
                             filepath='squeeze_net_200_aug.h5',
                             save_best_only=True,
                             save_weights_only=False,
                             mode='max'),
             CSVLogger('last_training_log.csv', separator=',', append=False)
            ]

# pass custom_train_generator() as the generator if you use the custom
# label generator described above
model.fit_generator(
    generator=train_generator,
    validation_data=valid_generator,
    steps_per_epoch=int(np.floor(train_image_count / batch_size)),
    validation_steps=int(np.floor(valid_image_count / batch_size)),
    epochs=100,
    callbacks=callbacks
)


Notes on training process

Basically, I know three ways people handle writing their exploratory training / ML scripts:

  • Structure your code into modules / classes / functions and run your script in the console:
    • Advantages:
      • You can easily upload your code to Github and share it;
      • If properly structured (train.py, preprocess.py, model.py), it can be really easy to take the best bits into your project;
      • Easier to understand for programmers;
    • Disadvantages:
      • May require some prior experience in programming;
      • Your experiments are not stored by default;
      • You may need to have some experience / skills with IDEs;
      • More difficult to understand for non-programmers;
  • Write, debug and run your script in a Jupyter notebook:
    • Advantages:
      • All your experiments are stored;
      • Easier to understand for non-programmers;
      • Easier to debug and write relatively simple scripts;
      • Easier to reuse code and launch many experiments;
      • Unparalleled visualization capabilities;
      • Easy to share results visually with a broader audience;
    • Disadvantages:
      • May be unfamiliar to programmers;
      • Code is messier, and sometimes it's difficult to follow the flow;
      • If your internet connection is unstable, longer (10-30 hour) training runs are undesirable (if your browser tab dies, the script will not die, but it's impossible to reattach to the session);
      • If your notebook is big, it introduces some handling overhead (longer save times, minor glitches);
      • Overall less stable than just running scripts from the console;
  • My approach:
    • Do all the visualizations, experiments, debugging and code writing in a Jupyter notebook;
    • Actually run the scripts in a tmux console;
    • Use csv logs to store the results of longer experiments;


Loading files in parallel


Basically, using this example it becomes as easy as adding a couple of lines to your generator. Just follow the link and you will be happy)
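
As a rough sketch (the argument names below are from the Keras versions of that era - max_q_size and pickle_safe were later renamed max_queue_size and use_multiprocessing), the gist is to let fit_generator pre-fetch batches with several worker threads:

model.fit_generator(
    generator=custom_train_generator(),
    steps_per_epoch=int(np.floor(train_image_count / batch_size)),
    epochs=100,
    workers=8,      # number of loader threads pre-fetching batches
    max_q_size=16   # size of the internal batch queue
)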