Animal detection in the jungle - 1TB+ of data, 90%+ accuracy and 3rd place in the competition

Or: what we learned, how to win prizes in such competitions, some useful advice + some trivia

Posted by snakers41 on December 25, 2017


The whole competition in a nutshell - a random video with a leopard. Note that all the videos are 15 seconds long and there are 400k of them...

Final scores at 3am when the competition ended - I was on a train, but my colleague submitted the winning submission just 10 minutes before the competition ended

Buy me a coffee

Become a Patron

Article list

1TB jungle video classification Part 1 - general overview

1TB jungle video classification Part 2 - final pipeline description

0. Competition choice criteria, post structure

If you follow my blog, you probably know how and why I choose competitions. In a nutshell - in late 2017 the majority of competitions hosted on Kaggle are not that interesting, and/or offer too little money for little to zero educational value, and/or have 100+ people submitting during the first day because the latest competitions are not really difficult - just stack 20 models at your pleasure. The most prominent examples in recent history are this and this - while interesting on paper, such competitions are nothing but casinos with GPUs instead of chips.

For these reasons (decent prize, no heavy marketing support resulting in 100+ easy submissions on day one, challenging task, interesting and new domain, worthy cause) - we chose this competition.

In a nutshell - you are given ~200k train videos, ~80k test videos (and 120k unlabeled videos, WTF?). Videos are labeled as a whole with 24 animal classes, i.e. video N is marked as containing a specific animal (or lack thereof), e.g. video 1 - class 1, video 2 - class 2, etc.

Also, in this competition I participated together with a subscriber of my telegram channel (channel, web translation). So for brevity (and given the sheer size of information we had to digest), this post is structured as follows:

  • TLDR sections for people looking for best working solutions and quick links;
  • All the code and jupyter notebooks I refer to are available here. If the repo is not public, I have not published it yet (pending resolution with the hosts of the competition). The code is encapsulated and the notebooks use Jupyter extensions (codefolding, table of contents, collapsible headers) for readability - but little attention was paid to turning the code into a tutorial, so read at your own risk;
  • All of my code was written in PyTorch; my colleague mostly used Keras;

1. TLDR1 - initial naive approach + compilation of useful links on the subject

To jump-start yourself on the topic, I compiled a list of highly relevant links in the order you should probably read them, if you want to solve a similar problem. For starters, you should be familiar with computer vision, basic math (linear algebra and calculus), machine learning and basic ML architectures.

Starting links:

  • Post - simple / naive yet very efficient models;
  • A benchmark shared on drivendata;

Intuitions / best articles on LSTMs (it turned out LSTM / GRU were not the best solutions overall, but they led us to attention, which was crucial in our case):

  • Understanding the concept of LSTMs 1 2 3 4;
  • Understanding the concept of attention 1 2 in RNNs;
  • LSTM visualizations for text models 1 2;

Then, if you are into PyTorch, a list of the best models / examples on LSTMs / attention I was able to find (I did not have time to test attention thoroughly, but my colleague did so in Keras):

  • Basic examples 1 2;
  • Advanced examples + attention 1 2 3 4;
  • An interesting article on attention for Keras and Pytorch;

Academic articles on the topic

Note that academic papers usually overcomplicate things / are not really reproducible / tackle 10x more difficult tasks - so read them with a grain of salt.

Nevertheless, the below papers contain some basic starter architectures (like Bag of Frames, same as the drivendata starter) and they note that some sort of attention / trainable pooling increases the scores:

  • Most relevant paper on the subject - overkill for our purposes, but contains some useful benchmarks and starter architectures;
  • Some naive but simple benchmarks for working with video classification;
  • A paper that hints that attention / learnable pooling may be useful in video classification;
  • Less applicable paper - but some ideas on frame pooling;
  • Interesting idea - use multi-level resnet features to spot fast action;
  • Ideally this paper would enable us to create bboxes from videos and do learnable object detection, but the code they shared (despite being advertised as Python-only) contained custom C++ drivers, so we decided not to go down this road =)

2. TLDR2 - best working approaches for this challenge. Our pipelines and other leaders' pipelines

2.1 Best approaches

Ours (3rd place):

Approximately halfway into the competition I teamed up with Savva Kolbachev. Initially, before going in for full-size videos, I tried the following approaches:

  • Motion detection - just for the sake of trying it;
  • Stitching 64x64 video frames into 1 image;
  • Matrix factorization - also just for the sake of trying it;
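The frame-stitching idea above can be sketched as follows - a minimal example where sampled 64x64 frames are tiled into one big image so that a single 2D CNN can look at the whole clip at once (the 5x5 grid and frame count are illustrative assumptions, not our exact settings):

```python
import numpy as np

def stitch_frames(frames, grid=(5, 5)):
    """Tile n = rows*cols frames of shape (64, 64, 3) into one big image.
    frames: array of shape (n, 64, 64, 3)."""
    rows, cols = grid
    n, h, w, c = frames.shape
    assert n == rows * cols, "need exactly rows*cols frames"
    # (rows, cols, h, w, c) -> (rows, h, cols, w, c) -> (rows*h, cols*w, c)
    return (frames.reshape(rows, cols, h, w, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(rows * h, cols * w, c))

frames = np.zeros((25, 64, 64, 3), dtype=np.uint8)
print(stitch_frames(frames).shape)  # (320, 320, 3)
```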

As far as I remember at that time Savva was using LSTMs + some basic encoders for 64x64 videos.

After trial and error, we ended up with more or less the following pipeline (it will be described in more detail in the next article):

  • Extract 3 or 4 sets of skip features from a list of the best encoders (we tried various resnets w/ and w/o augmentations, inception4, inception-resnet2, densenet, nasnet + other models) - 45 frames per video;
  • Use metadata in the model (i.e. video file size and resolution);
  • Use an attention layer in the model;
  • Feed all the vectors into final fully-connected layers;
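A minimal PyTorch sketch of such a pipeline - attention pooling over per-frame encoder features, concatenated with metadata, fed to fully-connected layers. The layer sizes, feature dimension and the exact attention formulation here are illustrative assumptions, not our exact architecture:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Soft attention over the time axis: learns one scalar weight per frame."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, x):                           # x: (batch, frames, feat_dim)
        w = torch.softmax(self.score(x), dim=1)     # (batch, frames, 1)
        return (w * x).sum(dim=1)                   # (batch, feat_dim)

class VideoClassifier(nn.Module):
    """Attention-pooled frame features + metadata -> 24-way multi-label head."""
    def __init__(self, feat_dim=2048, meta_dim=2, n_classes=24):
        super().__init__()
        self.pool = AttentionPool(feat_dim)
        self.head = nn.Sequential(
            nn.Linear(feat_dim + meta_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, frames, meta):                # frames: (B, T, F), meta: (B, M)
        pooled = self.pool(frames)
        return self.head(torch.cat([pooled, meta], dim=1))  # raw logits

model = VideoClassifier()
logits = model(torch.randn(4, 45, 2048), torch.randn(4, 2))  # 45 frames per video
print(logits.shape)  # torch.Size([4, 24])
```

The raw logits would go through a sigmoid + binary cross-entropy loss per class, since each video is scored independently on all 24 classes.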

The final solution boasted ~90%+ accuracy and 0.9+ ROC AUC scores for each class.

Some benchmarks (GRU + 256 hidden units) among the best performing encoders - from top to bottom - fine-tuned inception resnet2, densenet, resnet, inception4, inception-resnet2

Dmytro (1st place) - a really simple pipeline:

  • Finetune a number of networks (resnet, inception, xception) on random frames from videos;
  • Predict classes for 32 frames per video;
  • Calculate histograms of these predictions to eliminate time from the model;
  • Run a simple meta-model on the resulting features;
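The histogram trick in the steps above can be sketched like this - collapsing per-frame class probabilities into order-invariant features (the bin count and shapes are assumptions; the point is only that shuffling the frames leaves the features unchanged):

```python
import numpy as np

def histogram_features(frame_probs, n_bins=10):
    """Collapse per-frame class probabilities (frames x classes) into
    order-invariant histogram features, one histogram per class, flattened."""
    n_frames, n_classes = frame_probs.shape
    feats = []
    for c in range(n_classes):
        hist, _ = np.histogram(frame_probs[:, c], bins=n_bins, range=(0.0, 1.0))
        feats.append(hist / n_frames)  # normalize: each class histogram sums to 1
    return np.concatenate(feats)

# 32 frames, 24 classes -> 24 * 10 = 240 features per video
probs = np.random.rand(32, 24)
print(histogram_features(probs).shape)  # (240,)
```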

What we missed / failed to test properly

  • When performing end-to-end model fine-tuning (pre-trained ImageNet encoder + pre-trained GRU), we got a low train loss (0.03) and a high validation loss (0.13), which we took as a sign of weakness of this approach;
  • It turned out that Dmytro experienced the same behavior, but nevertheless finished the experiment;
  • We abandoned this approach because of time constraints. When I later did a proper benchmark of features extracted from the fine-tuned network, it performed more or less the same in my tests, but Savva did not do a proper test in his pipeline. As a result we invested the last week into trying new encoders instead of fine-tuning the ones we had already tried;
  • Using input data augmentations. Probably because the input data is quite noisy (usually animals are prominently present for just 1-2s of the video);

Some examples of augmentations we tried - no real boost compared to plain feature extraction

2.2 What we failed / did not have time to try

  • Properly fine-tuning the encoders we used;
  • Trying learnable pooling / attention for skip connection features we extracted from the encoders;

2.3 Key killer features / eureka moments / counter-intuitive moments

  • Counter-intuitive moments:
    • Plain meta-data gives a score of ~0.06 on the LB (60-70% accuracy);
    • I achieved ~0.04 on the LB using only 64x64 videos, whereas a guy from the community claimed to have achieved ~0.03 using only 64x64;
    • Plain min-max layer gave ~0.06 - 0.07 on the LB;
    • Poor performance of the end-to-end base encoder does not necessarily imply poor performance of the whole pipeline;

  • Key killer features / eureka moments:
    • W/o fine-tuning, features extracted with skip connections (i.e. extracting not only the last fc layer, but also some features from the middle of the network) perform much better than plain extracted features. On my GRU-256 benchmark it meant a ~0.01 score boost w/o any stacking or blending;
    • A significant boost provided by plain features - meta-data + a plain min-max layer (which kind of resonates with Dmytro's solution);
    • Even if some model behaves worse than its counterpart, their ensemble will still be better because of the logloss;

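The skip-feature extraction mentioned above can be sketched with PyTorch forward hooks - here with a tiny stand-in CNN instead of a real ImageNet encoder, purely to illustrate the mechanics:

```python
import torch
import torch.nn as nn

# Sketch: besides the encoder's final output, capture intermediate ("skip")
# feature maps with forward hooks and pool them into flat vectors.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # "early" block
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # "middle" block
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # "late" block
)

captured = {}
def make_hook(name):
    def hook(module, inputs, output):
        captured[name] = output.mean(dim=(2, 3))  # global average pool
    return hook

encoder[3].register_forward_hook(make_hook("middle"))  # after middle ReLU
encoder[5].register_forward_hook(make_hook("late"))    # after late ReLU

with torch.no_grad():
    encoder(torch.randn(1, 3, 64, 64))

skip_features = torch.cat([captured["middle"], captured["late"]], dim=1)
print(skip_features.shape)  # torch.Size([1, 96])  (32 + 64 channels)
```

With a real pretrained encoder the idea is the same: register hooks on a couple of intermediate blocks and concatenate the pooled maps with the final features.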
3. Basic dataset EDA and description, metric overview

3.1 Dataset

Basic EDA was done in this notebook. As mentioned before, we have ~1TB of data: 200k+ labeled videos, ~80k test videos and ~120k unlabeled videos. The hosts of the competition were quite late to share the dataset via torrent, and the direct download was rather slow (it started at speeds of ~1-2 MB/s, but quickly faded to 300-500 KB/s, probably as the number of participants increased).

3 versions of the dataset were shared:

  • Full 1TB version;
  • ~3GB 64x64 cropped 2FPS version - surprisingly, it was enough to get ~75-80% accuracy and a ~0.03 score on the LB. Wow!;
  • 16x16 version - basically useless;

Good things about the dataset:

  • Its existence in the first place. The size and quality of the data is fascinating. Per-video annotation is quite scarce in itself, but we have ~300k annotated videos (though I believe they just annotated larger chunks and then split them up);
  • Helpful admins;
  • Data downloads are free (unlike on other platforms - come on guys, I am spending a LOT of time solving this challenge, give me FREE data at least);
  • Admins are not stuck in the prehistoric age and know about torrents;
  • This is the competition with the best train / test split I have ever seen. Our LB scores were always ~5% better than local validation;

Bad things about the dataset:

  • The torrent was shared late;
  • The torrent had only one seeder at first, located in the USA region;
  • Inside the torrent was just one tar.gz file - which basically meant that you could either download the whole dataset or nothing, i.e. no partial downloads;
  • The 1TB tar.gz file was not actually compressed. Though the videos themselves are compressed, compression may still have reduced the overhead by 10-20%;

The whole competition took 2 months, but in my case it took 2-3 weeks just to download the data and untar the archive. I was using only Linux console tools - but tar does not really support multi-threaded operation unless the archive is compressed. Wow. There is a hack though - you can launch multiple untar processes at the same time.
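That hack can be sketched in stdlib Python - split the member list into chunks and extract chunks concurrently, each worker with its own file handle. A rough sketch: tar is a sequential format, so every worker still scans the archive, and the gain comes from overlapping IO, not from true random access:

```python
import tarfile
from multiprocessing.pool import ThreadPool

def extract_chunk(args):
    """Extract one chunk of members; each worker opens its own handle."""
    archive, members, dest = args
    with tarfile.open(archive) as tar:
        for name in members:
            tar.extract(name, path=dest)

def parallel_untar(archive, dest, workers=4):
    with tarfile.open(archive) as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
    # round-robin split of members across workers
    chunks = [names[i::workers] for i in range(workers)]
    with ThreadPool(workers) as pool:
        pool.map(extract_chunk, [(archive, chunk, dest) for chunk in chunks])
```

The same split works with multiple `tar` console processes, which is what we actually did.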

3.2 Basic EDA

To be honest, I did not really do proper EDA just because of the sheer size of the data. But below are the key insights into the data that were easy enough to discover.

Datasets in a nutshell

Labeled data classes - the data is very UNBALANCED. On the other hand, the train / test split is done properly - so the same distribution of classes holds for the test set

Same data on a chart - normal scale vs. log scale

Summary statistics on video sizes ((height, width, channels): count):

  • (404, 720, 3): 170424
  • (480, 720, 3): 21695
  • (540, 720, 3): 770
  • (540, 960, 3): 7372
  • (640, 960, 3): 26
  • (720, 960, 3): 3714

PC1 and PC2 obtained via PCA - they easily separate day and night, but separating cameras is trickier

Same chart - but more beautiful

Justification for the video file-size heuristic. Video file size in bytes on a log10 scale. Videos w/o animals (blue) vs. videos w/ animals (orange). Unsurprisingly, due to compression, videos w/o animals are smaller

3.3 Metric

This convoluted formula is just the average logloss across all 24 classes.
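A minimal sketch of this metric - log loss computed per class, then averaged over the 24 classes (the clipping epsilon is a common convention, not something specified by the hosts):

```python
import numpy as np

def mean_column_logloss(y_true, y_pred, eps=1e-15):
    """Per-class binary log loss, averaged across classes.
    y_true, y_pred: arrays of shape (n_videos, n_classes)."""
    p = np.clip(y_pred, eps, 1 - eps)   # clip to avoid log(0)
    ll = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return ll.mean(axis=0).mean()       # mean per class, then across classes

y_true = np.array([[1, 0], [0, 1]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
print(round(mean_column_logloss(y_true, y_pred), 4))  # 0.1643
```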


Pros:

  • Popular and simple - exists in all DL packages (though it required some minor fiddling in pytorch);
  • Easy to get a decent score on the LB just using meta-data, 64x64 videos and heuristics;
  • Easy to classify the 4 most popular classes to rank high (and in practice this would be a nice good-enough solution);
  • More difficult to abuse than accuracy;

Cons:

  • Not intuitive or illustrative;
  • Very sensitive even to a small number of overconfident incorrect predictions;
  • Just adding new models usually improves the metric, so it leads to less and less practical solutions;

4. Alternative approaches

As far as I know, the following could also be done:

  • Some video hashing tricks;
  • Going into object detection:
    • Do object detection on 64x64 videos;
    • Create bboxes and translate them to full-res videos;
    • Do a 2- or 3-stage pipeline;
  • Try some knowledge distillation techniques to build bboxes from fine-tuned networks - difficult;

I made object detection work, but I did not pursue this approach because I considered it unreliable - I did not want to invest into manual labeling because of the sheer size of the data + I did not believe that even 64x64 motion-based approaches would be stable. They were highly unstable on full-res videos.

5. Competition pulse and drama

Just for lulz - some standings on the LB and some drama.

My first sensible submission

I jump over Savva, we start thinking about teaming up

Savva retaliates and we decide to team up, mainly to beat ZFTurbo =)

Scores with several days remaining - Daniel starts to worry us

Scores with ~1 day remaining - Daniel is a pain in the ass!

Scores after our final submit 10 minutes before the deadline (!) at 2:40 am (!). Looks like Daniel overfitted his model...

6. Basic advice for competitions

1. Try several simple approaches;
2. Make sure that you really understand what works and why it works;
3. Read all the reasonable papers on the subject, but do not invest a lot of time into one approach unless:
  1. It is simple and easy to implement;
  2. You are 100% sure that it will work well;
  3. You can more or less combine a model from existing parts - no writing new complicated models or layers from scratch;
4. Do not be afraid of the sheer complexity of the task. Modern libraries and code give you super powers;
5. 90% of what is written in papers is rubbish. You just do not have 6 months to test everything to find out that papers usually value form over essence (with rare and beautiful exceptions);
6. Resort to stacking / blending only as a last-resort, all-in, balls-to-the-wall strategy;
7. Communicate with people, team up. It is a sort of stacking and a hedge against tunnel vision. Together you will learn not 2x faster, but 10x faster;
8. Share your code and approaches;

7. Basic advice for research and deployable models

1. In 95% of cases - no stacking or blending, as it will basically just give you +5-10%;
2. Your model and pipeline should be concise, fast and engineer-friendly;
3. All of the above applies, but discounted by the increased requirements for deployability;

8. Kudos

As usual, kudos to ZFTurbo and Dmytro for interesting conversations, sharing their experience and support. Also, as a tradition, ZFTurbo made a cool video.

9. Our rig

We used 3 machines:

  • My weak server with a 1070 Ti. Since I had no direct access to the machine, I had to ask friends to install an NVMe SSD as we became space-bound;
  • Savva's weak machine with a GPU;
  • My friend's server with 2x1080 Ti;

My friend installing an NVMe SSD into my box over Skype instructions
