Animal detection in the jungle - 1TB+ of data, 90%+ accuracy and 3rd place in the competition, Part 2

Final solution description

Posted by thinline72 on January 9, 2018


In this continuation of Alexander’s post I’ll describe our final pipeline as well as the things that didn’t help. In case you missed the previous post, I highly recommend reading it. Alex describes the competition, shares lots of insights and even talks about the drama that happened at the very end of the competition.

In short: it was a multi-label video classification challenge with an aggregated log loss metric. Here are the competition website with the description and the GitHub repo with our solution.
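For readers unfamiliar with the metric, here is a minimal sketch of an aggregated log loss, assuming it means the binary log loss averaged over every video and every class. The function name and the clipping constant `eps` are illustrative and not taken from the competition code.

```python
import math

def aggregated_log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss over every (video, class) pair.

    y_true: rows of 0/1 labels; y_pred: rows of predicted probabilities.
    eps is an assumed clipping constant that keeps log() finite.
    """
    total, count = 0.0, 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to (0, 1)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count
```

With a metric like this, confident wrong predictions are punished heavily, which is why well-calibrated probabilities mattered more than hard class decisions.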

Just a few words about the competition from my side. It was very exciting and fun to participate, and I really enjoyed the process. Moreover, I got solid new experience, learnt lots of new things and improved my DL skills. I started out on my old PC with a GTX 780 on board, so at first I could only use the micro dataset (64x64 video size, 2 FPS, ~3GB). Even with it I achieved a pretty decent score, and it looks like the micro dataset alone was enough to get into the Top 10. Then I teamed up with Alex and we started using the original dataset (~1TB of video). Alex gave me access to his personal machine with a 1070 Ti and started extracting features on his friend’s server with 2x 1080 Ti. Limited resources along with a pretty big dataset forced us to invest more time in benchmarking different libs for working with videos and numpy arrays, as well as in engineering a better pipeline.

I’m also very happy that I teamed up with Alex. He is a very smart and practical guy who shares a lot of useful information in his Telegram channel and on his blog. You should definitely consider following him!

Without further ado, here is a description of our final pipeline.

1. Feature extraction:

As I mentioned above, Alex extracted features from many different pre-trained networks like resnet152, inception-resnet, inception4 and nasnet. We found that it’s better to extract features not only from the last layer before the FC layers, but also from skip connections.

We also extracted metadata like width, height and file size from both the original and micro datasets. Interestingly, combining the two worked much better than using the original dataset’s metadata alone. Generally, metadata was very useful here, especially for blank/not blank predictions, as blank videos usually have considerably smaller file sizes. Simple thresholding on file size alone lets you separate almost 25% of the blank videos (the biggest class, btw).
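The thresholding idea can be sketched roughly as follows. The function name, the toy file sizes and the 300,000-byte threshold are all made up for illustration; the post does not state the threshold that was actually used.

```python
def blank_rule_quality(file_sizes, is_blank, threshold):
    """Return (recall, precision) of the rule 'size < threshold => blank'."""
    flagged = [blank for size, blank in zip(file_sizes, is_blank) if size < threshold]
    n_blank = sum(is_blank)
    true_pos = sum(flagged)
    recall = true_pos / n_blank if n_blank else 0.0
    precision = true_pos / len(flagged) if flagged else 0.0
    return recall, precision

# Toy data (hypothetical): file sizes in bytes and blank flags.
sizes = [120_000, 150_000, 900_000, 1_100_000, 200_000, 950_000]
blank = [1, 1, 0, 0, 1, 1]
recall, precision = blank_rule_quality(sizes, blank, threshold=300_000)
```

In practice you would sweep the threshold on the training set and pick the largest value that still keeps precision close to 1.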

2. Stratified train/validation splitting:

Class distribution was very unbalanced. There was a “lion” class that contained just 2 samples out of ~200k total videos! Moreover, it was multi-label classification, so it required a more specific splitting. Fortunately, I had some code from the Planet: Understanding the Amazon from Space competition. With this splitting our val score was always a bit worse than the LB score.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def multilabel_stratified_kfold_sampling(Y, n_splits=10, random_state=0):
    # Y is a binary label matrix of shape (n_samples, n_classes);
    # random_state must be an int because it is offset per class below
    train_folds = [[] for _ in range(n_splits)]
    valid_folds = [[] for _ in range(n_splits)]

    n_classes = Y.shape[1]
    inx = np.arange(Y.shape[0])
    for cl in range(n_classes):
        # split on one label column at a time, with a different seed per class
        sss = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state + cl)
        for fold, (train_index, test_index) in enumerate(sss.split(inx, Y[:, cl])):
            b_train_inx, b_valid_inx = inx[train_index], inx[test_index]
            # to ensure there is no repetition within each split and between the splits
            train_folds[fold] = train_folds[fold] + list(set(list(b_train_inx)) - set(train_folds[fold]) - set(valid_folds[fold]))
            valid_folds[fold] = valid_folds[fold] + list(set(list(b_valid_inx)) - set(train_folds[fold]) - set(valid_folds[fold]))
    # folds can differ in length, so return one index array per fold
    return [np.array(f) for f in train_folds], [np.array(f) for f in valid_folds]

3. Things that we tried:

After extracting features we had a matrix for each video with a shape like (45, 3000), where 45 is the number of frames and 3000 is the number of features per frame.

Things that we tried and that made it into the final solution:

  • We started with different kinds of RNNs, but they didn’t really give a better score and they took much longer to train (even with the fast CuDNN implementations in Keras). We still used a few trained RNN models in the final stacking, though.
  • Using RNNs encouraged me to try Attention, and it gave us a great improvement. At first, we used this implementation on top of 2x CuDNNGRU and plain features. Once I realised that removing the RNNs didn’t really worsen our results, I invested more time in different Keras attention implementations and found this one. Attention worked for us as a kind of learnable pooling that let us go from a video matrix of size (45, 3000) to a video vector of size 3000.
  • Max-Min pooling gave us a reasonable boost in score and worked much better than the default Max/Avg poolings. The idea behind it was to get a better representation of how features change across the timeline.
  • Concatenating features extracted from different models worked very well but required more time and resources to train.
  • Training a separate model for the blank/not blank classes. It gave only a slightly better score for this class, but it was worth using because blank was the biggest class.
  • Pseudo-labelling always gave a slight boost for our models. Usually, training started to overfit after ~10-12 epochs. It looks like pseudo-labelling introduced some diversity into the training process and improved our validation score.
  • Stacking trained models together.
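To make the two pooling ideas from the list above more concrete, here is a minimal NumPy sketch. It assumes that Max-Min pooling means the per-feature maximum minus the per-feature minimum over frames, and it models attention as a softmax-weighted average with a single weight vector `w`. Both function names, the random `w` and the final concatenation are illustrative assumptions, not the Keras layers we actually used.

```python
import numpy as np

def max_min_pooling(frames):
    """Per-feature range over time: (n_frames, n_features) -> (n_features,)."""
    return frames.max(axis=0) - frames.min(axis=0)

def attention_pooling(frames, w):
    """Softmax-weighted average over frames.

    w is a (n_features,) vector that would be learned during training;
    here it is simply passed in.
    """
    scores = frames @ w                 # one score per frame
    scores = scores - scores.max()      # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ frames             # (n_features,)

rng = np.random.default_rng(0)
video = rng.normal(size=(45, 3000))    # stand-in for one video's feature matrix
w = rng.normal(size=3000) * 0.01       # stand-in for learned attention weights
pooled = np.concatenate([attention_pooling(video, w), max_min_pooling(video)])
```

Note that with `w` set to all zeros the attention pooling reduces to plain average pooling, which is why it can be seen as a learnable generalisation of it.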

Things that we tried but that didn’t work well for this task:

  • Focal Loss: generally, it works pretty well and adds some diversity to the models, but it wasn’t really helpful with the provided metric. I’d definitely try it in other projects, though.
  • Handling imbalanced classes: it didn’t really help with the provided metric.
  • 1D/3D Convolutions
  • Optical Flow

4. Final solution description:

We trained 9 models, each on 5 stratified folds, using extracted features:

  • 3 CuDNNGRU models with AttentionWeightedAverage and Max-Min pooling layers based on resnet152, inception-resnet and inception4.
  • 4 models with Attention and Max-Min pooling layers based on resnet152, inception-resnet, inception4 and nasnet.
  • 1 concat model with Attention and Max-Min pooling layers (similar to the previous ones) but with a concatenation of resnet152, inception-resnet and inception4 extracted features.
  • 1 concat blank model (similar to the previous one) but for blank/not blank predictions only.

We found that 15 training epochs + 5 pseudo-labelling training epochs were enough to get a pretty decent score. Batch size was 64 (44/20) for single-feature models and 48 (32/16) for the nasnet and concat models. Generally, a bigger batch size was better; the choice mostly depended on disk I/O and training speed. Predictions from the models were stacked together via a meta-model, a NN with 2 FC layers trained on 10 stratified folds, to get the final score.
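A rough sketch of how such mixed batches could be put together is shown below. It assumes that 64 (44/20) means 44 labelled training samples plus 20 pseudo-labelled test samples per batch, which is an interpretation rather than something spelled out above. The function name and its parameters are hypothetical.

```python
import numpy as np

def mixed_batches(n_train, n_pseudo, n_train_per_batch=44, n_pseudo_per_batch=20, seed=0):
    """Yield (train_idx, pseudo_idx) index pairs for one epoch over the training set."""
    rng = np.random.default_rng(seed)
    train_order = rng.permutation(n_train)
    for start in range(0, n_train, n_train_per_batch):
        train_idx = train_order[start:start + n_train_per_batch]
        # pseudo-labelled samples are drawn at random for every batch
        pseudo_idx = rng.choice(n_pseudo, size=n_pseudo_per_batch, replace=False)
        yield train_idx, pseudo_idx

# Toy sizes (hypothetical): 440 labelled and 100 pseudo-labelled samples.
batches = list(mixed_batches(n_train=440, n_pseudo=100))
```

Keeping the ratio fixed per batch means the pseudo-labels never dominate a gradient step, no matter how large the test set is relative to the training set.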