Identify fish challenge - playing with object detection

Or working with large scale video datasets for beginners

Posted by snakers41 on October 29, 2017

0. Some fun


Advertised capabilities of the current sota models - YOLO9000 and SSD



Fishing in the deep sea - expectations



Fishing in the deep sea - reality

1. Why this particular challenge?

Well, basically because of a list of reasons:

  • Now (in late 2017) Kaggle (after being bought by Google) is dominated by teams of people who usually win by stacking 10-30 models. Otherwise you will not get the necessary +0.5% boost;
  • There have been a lot of not very interesting competitions lately with tabular data (either stacking 30 XGBoost models or industry expertise is required), i.e. this experience is not directly transferable to other domains and is not interesting in itself;
  • In a recent plain-vanilla CV competition (classification of 5000+ online items with 300x300 pictures) stacking started 1 month before the end, after a guy posted his 0.7-scoring model weights with a note "stack at your pleasure". This more or less turns such competitions into casinos with stacks of GPUs instead of stacks of chips / money;
  • Most competitions' domains are plain boring and / or too "business" related;
  • When I was first listening to the fast.ai course, they focused heavily on the previous instance of a similar competition, which raised my interest in this field;
  • I am a novice in this field (ca. ~1 year of experience, whereas many of the current people on Kaggle have a 3-4 year track record);


So I really liked the domain, the high barriers to entry and the steel nerves required to even make a submission (300+GB of unpacked data => out of 400+ participants only a few dozen managed to build a reasonable pipeline).

In a nutshell, in this competition there are ca. 1100 training videos and 667 test videos. We needed to:

  • Count the fish properly (0.6 of the metric);
  • Classify fish species (0.3);
  • Measure fish length (0.1);


The target metric (ranging from 0 to 1) was, in my view, unnecessarily complicated. Nobody in the community (among people who took the metric implementation from the forum) could match their local validation with the leaderboard scores. As usual there was an opinion that the public and private portions of the LB are heavily unbalanced and that the scoring is bugged (casino?). But what makes participating in such competitions worthwhile is the interesting domain and the motivation to learn more about the current sota methods.



Heavy class imbalance (SSD vs YOLOv2 prediction sums). Also possible boat / video length imbalance on private and public LB



The metric has 3 parts - 0.6 for counting fish, 0.3 for fish classification, 0.1 for length estimation


To make things worse, a non-expert cannot easily tell one flatfish species from another - they look really similar:


2. Model selection criteria. Key considerations

Well, without the knowledge now summarized in section 8, my decision process was the following:

  • We have ca. 100+GB of videos, which when unpacked into pictures (I did not know then that you can source pictures directly from videos) weigh ca. 300+GB;
  • The basic task - finding and classifying fish is well covered by well-known image classification and object detection architectures;
  • I worked extensively with U-Net on a previous challenge (I scored 66th w/o resorting to stacking and such) - it has amazing powers, but works very, very slowly - ca. 1 frame per second;
  • Since the task is really complicated and success is not guaranteed, ideally I need the simplest end-to-end pipeline, consisting of no more than 1 model;
  • Also I decided to learn pytorch as a key part of this competition (even if I fail the task, I will at least learn more about pytorch);




All you need to know about current sota object detection algorithms


Also, after reading up on the subject and assessing the available implementations in the target frameworks - keras and pytorch (YOLO, YOLOv2, SSD) - I decided to try YOLOv2 in keras due to its simplicity and SSD in pytorch as a challenge.

Luckily, the convolutional part of these models' architecture is really simple (which cannot be said about the preprocessing, augmentation and evaluation parts, but those are helpfully provided by the available implementations).



3. Data EDA + how to analyze 300+GB of data efficiently?

Well, the data is videos - so the easiest approach to EDA is watching videos and adding bounding boxes to them. I know a couple of basic ways to do that:

  • Use the provided annotation data to do the basic table data EDA;
  • Convert videos to images, add bounding boxes, convert back to videos;
  • Read directly from videos into memory, add bounding boxes, write to disk as video (see the sketch right after this list);

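As an illustration of the third option, here is a minimal sketch (file paths and box coordinates are made up for the example, and it requires an OpenCV build with video support - see the pip caveat below) of reading a video frame by frame, drawing a box, and writing the result back to disk:

import cv2

# hypothetical input / output paths and box coordinates, just to show the read -> draw -> write loop
in_path, out_path = 'train_videos/some_video.mp4', 'annotated.avi'
xmin, ymin, xmax, ymax = 100, 100, 400, 400

cap = cv2.VideoCapture(in_path)
fps = cap.get(cv2.CAP_PROP_FPS) or 10.0
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*'XVID'), fps, (w, h))

while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (255, 0, 0), 2)
    writer.write(frame)

cap.release()
writer.release()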

All of this basically requires the usage of the following instruments:

  • List and dictionary comprehensions in python3;
  • Ffmpeg video encoding and decoding;
  • Open-cv for handling images (and possibly videos, but the pip package ships open-cv without video bindings);
  • You may also try imageio;
  • Also, if you read my notebooks, use the collapsible headers and table of contents plug-ins from here (otherwise my hierarchical notebooks will seem unreadable);


Without further ado, here is my notebook, and some useful snippets from it:

Downloading the data easily

import collections
file_dict = collections.OrderedDict()
file_dict['submission.csv'] = 'https://s3.amazonaws.com/drivendata/data/48/public/submission_format_zeros.csv'
file_dict['train.csv'] = 'https://s3.amazonaws.com/drivendata/data/48/public/training.csv'
file_dict['train.7z'] = 'https://s3.amazonaws.com/drivendata-public-assets/train_videos_2.7z'
file_dict['test.7z'] = 'https://s3.amazonaws.com/drivendata-public-assets/test_videos_2.7z'

for file,url in file_dict.items():
    url_q = "'" + url + "'"
    ! wget --continue --no-check-certificate --no-proxy -O $file $url_q


Unpacking the files

! mkdir -p data
! mkdir -p data/train
! mkdir -p data/test
!7z x train.7z -aos


Draw a bbox with open-cv

import cv2
def draw_box(image,xmin,ymin,xmax,ymax,label):
    # draw the rectangle and put the class label just below its top-left corner
    cv2.rectangle(image, (xmin,ymin), (xmax,ymax), (255,0,0), 2)
    cv2.putText(image, label, 
                (xmin, ymin + 16), 
                0, 
                1e-3 * image.shape[0], 
                (255,0,0), 2)
    return image


Doing any basic pivot tables with table data

import pandas as pd
df_sub = pd.read_csv('submission.csv')
df_sub.head()
df_sub[df_sub.video_id == '01rFQwp0fqXLHg33']
table = pd.pivot_table(
    df_sub,
    index=["video_id"],
    values=["frame"],
    aggfunc={"frame":pd.Series.nunique
            },
    fill_value=0)
table


Unpacking videos into pictures using ffmpeg

import glob
g = glob.glob('test_videos/*.mp4')
from tqdm import tqdm
folder = 'test_pics'
with tqdm(total=len(g)) as pbar:
    
    for f in g:
        video_id = f.split('/')[1].split('.')[0]
        !mkdir -p $folder/$video_id        
        !ffmpeg -v 16 -i $f -vsync 0 -qscale:v 2 $folder/$video_id/%04d.jpg
            
        pbar.update(1)

Comparing frame number vs. the annotated data

import os
cpt = {r:len(files) for r, d, files in os.walk('test_pics/')}
del cpt['test_pics/']
cpt = {k.split('/')[1]: v for k, v in cpt.items()}
for key,value in cpt.items():
    if (table.loc[key].frame)!=value:
        print('Frame count error with video {}\n'.format(key))
        print('{} vs {}'.format(table.loc[key].frame,value))


Making videos to see the original annotation

from random import shuffle
import glob
import matplotlib.pyplot as plt
%matplotlib inline
import cv2
non_zero_vids = list(cpt_non_zero.keys())
shuffle(non_zero_vids)
sample_vids = non_zero_vids[0:100]
g = glob.glob('train_videos/*.mp4')
species_list = ['species_fourspot', 'species_grey sole','species_other', 'species_plaice', 'species_summer','species_windowpane', 'species_winter']
def draw_box(image,xmin,ymin,xmax,ymax,label):
    cv2.rectangle(image, (xmin,ymin), (xmax,ymax), (255,0,0), 2)
    cv2.putText(image, label, 
                (xmin, ymin + 16), 
                0, 
                1e-3 * image.shape[0], 
                (255,0,0), 2)
        
    return image 
with tqdm(total=len(sample_vids)) as pbar:
    for vid in sample_vids:
        # print(vid)
        !mkdir -p inspect_vids/$vid
        !cp -a ../extra_space/train_pics/$vid/*.jpg inspect_vids/$vid
        imgs = glob.glob('../extra_space/train_pics/{}/*.jpg'.format(vid))
        imgs.sort()
        i = 0
        for img in imgs:
            data = df[(df.video_id == vid) & (df.frame == int(img.split('/')[4].split('.')[0])-1)][['x1','y1','x2','y2']].values 
            if data.shape[0]>0:
                label = df[(df.video_id == vid) & (df.frame == int(img.split('/')[4].split('.')[0])-1)][species_list].values.argmax()
                label = species_list[label]
                data = data[0]
                # print(data)
                try:
                    np_img = cv2.imread(img) 
                    np_img = draw_box(np_img,int(data[2]),int(data[3]),int(data[0]),int(data[1]),label)
                    # print(img.replace('../extra_space/train_pics/','inspect_vids/'))
                    cv2.imwrite(img.replace('../extra_space/train_pics/','inspect_vids/'), np_img)
                except Exception as e: 
                    # print (str(e))
                    pass
            else:
                dst = img.replace('../extra_space/train_pics/','inspect_vids/')
            i += 1
        vid_file = 'inspect_vids/vids/'+vid+'.avi'
        !ffmpeg -loglevel quiet -f image2 -framerate 25 -i inspect_vids/$vid/%04d.jpg $vid_file
        pbar.update(1)

At this stage we can directly and easily see which data was provided to us:



As you can see, the fish was not annotated in the usual fashion (so that it mostly fits into the bounding box) - it was annotated head-to-tail. That may be useful if you want a multi-stage pipeline which first finds the fish head and tail locations, but I wanted a more or less end-to-end solution. It's really easy to fix - these fish are roughly as wide as they are long, so we can just turn the bounding boxes into squares that fit the whole fish.

Adjusting the bounding boxes

def change_coords(x1,y1,x2,y2,f_len):
    # turn the head-to-tail box into a square with side f_len centered on
    # the same point, clipped to the 1280x720 frame
    max_x = 1280
    max_y = 720
    
    x_av = (x2+x1)/2
    y_av = (y2+y1)/2
    
    if(x_av-f_len/2)>max_x:
        x1_new = max_x
    elif (x_av-f_len/2)<0:
        x1_new = 0
    else:
        x1_new=x_av-f_len/2 
        
    if(x_av+f_len/2)>max_x:
        x2_new = max_x
    else:
        x2_new=x_av+f_len/2 
        
    if(y_av-f_len/2)>max_y:
        y1_new = max_y
    elif (y_av-f_len/2)<0:
        y1_new = 0
    else:
        y1_new=y_av-f_len/2 
    if(y_av+f_len/2)>max_y:
        y2_new = max_y
    else:
        y2_new=y_av+f_len/2
    return x1_new,y1_new,x2_new,y2_new


Pandas is not really well suited to applying functions that take data from 4 columns and write it back to 4 columns, so we have to use a small trick:

df['_'] = df.apply(lambda row: change_coords(row['x1'],row['y1'],row['x2'],row['y2'],row['length']), axis=1)
df[['x1_new','y1_new','x2_new','y2_new']] = df['_'].apply(pd.Series)
del df['_']


The end result looks like this:

from random import shuffle
import glob
import matplotlib.pyplot as plt
%matplotlib inline
import cv2
def draw_box(image,xmin,ymin,xmax,ymax,label):
    cv2.rectangle(image, (xmin,ymin), (xmax,ymax), (255,0,0), 2)
    cv2.putText(image, label, 
                (xmin, ymin + 16), 
                0, 
                1e-3 * image.shape[0], 
                (255,0,0), 2)
        
    return image 

img = '../extra_space/train_pics/5Nvu5esMH64QdnoT/0038.jpg'
data = df[(df.video_id == '5Nvu5esMH64QdnoT') & (df.frame == int(img.split('/')[4].split('.')[0])-1)][['x1','y1','x2','y2']].values 
data = data[0]
np_img = cv2.imread(img) 
plt.imshow(np_img)
plt.show()
np_img = draw_box(np_img,int(data[2]),int(data[3]),int(data[0]),int(data[1]),'fish')
plt.imshow(np_img)
plt.show()
data = df[(df.video_id == '5Nvu5esMH64QdnoT') & (df.frame == int(img.split('/')[4].split('.')[0])-1)][['x1_new','y1_new','x2_new','y2_new']].values 
data = data[0]
np_img = cv2.imread(img) 
np_img = draw_box(np_img,int(data[2]),int(data[3]),int(data[0]),int(data[1]),'fish')
plt.imshow(np_img)
plt.show()




And the fish rectangle turns into a fish square! Flatfish are square anyway...


Create annotated videos

Now we can easily create annotated videos with square boxes

# create one video with bigger boxes
from random import shuffle
import glob
import matplotlib.pyplot as plt
%matplotlib inline
import cv2
non_zero_vids = list(cpt_non_zero.keys())
shuffle(non_zero_vids)
sample_vids = non_zero_vids[0:5]
g = glob.glob('train_videos/*.mp4')
species_list = ['species_fourspot', 'species_grey sole','species_other', 'species_plaice', 'species_summer','species_windowpane', 'species_winter']
def draw_box(image,xmin,ymin,xmax,ymax,label):
    cv2.rectangle(image, (xmin,ymin), (xmax,ymax), (255,0,0), 2)
    cv2.putText(image, label, 
                (xmin, ymin + 16), 
                0, 
                1e-3 * image.shape[0], 
                (255,0,0), 2)
        
    return image 
with tqdm(total=len(sample_vids)) as pbar:
    
    for vid in sample_vids:
        print(vid)
        !mkdir -p inspect_vids/$vid
        !cp -a ../extra_space/train_pics/$vid/*.jpg inspect_vids/$vid
        imgs = glob.glob('../extra_space/train_pics/{}/*.jpg'.format(vid))
        imgs.sort()
        i = 0
        for img in imgs:
            data = df[(df.video_id == vid) & (df.frame == int(img.split('/')[4].split('.')[0])-1)][['x1_new','y1_new','x2_new','y2_new']].values 
            if data.shape[0]>0:
                label = df[(df.video_id == vid) & (df.frame == int(img.split('/')[4].split('.')[0])-1)][species_list].values.argmax()
                label = species_list[label]
                data = data[0]
                # print(data)
                try:
                    np_img = cv2.imread(img) 
                    np_img = draw_box(np_img,int(data[2]),int(data[3]),int(data[0]),int(data[1]),label)
                    # print(img.replace('../extra_space/train_pics/','inspect_vids/'))
                    cv2.imwrite(img.replace('../extra_space/train_pics/','inspect_vids/'), np_img)
                except Exception as e: 
                    # print (str(e))
                    pass
            else:
                dst = img.replace('../extra_space/train_pics/','inspect_vids/')
            i += 1
        vid_file = 'inspect_vids/vids/'+vid+'.avi'
        !ffmpeg -loglevel quiet -f image2 -framerate 25 -i inspect_vids/$vid/%04d.jpg $vid_file
        pbar.update(1)


Let's take a look at what we have now:


Key takeaways (after watching 200+ fishy videos):

  • It's really difficult to tell one species from another;
  • Videos run at ca. 10 fps;
  • Most of the time the fish is belly down, but sometimes it's belly up;
  • There are 4 cameras in the training set;
  • Fish classes are heavily skewed towards just one flatfish species (an annotation error, perhaps?);
  • Every fish is put on the measuring scale (which I did not use, alas);
  • The annotated frames are the unobstructed fish frames right after the fisherman puts the fish on the scale (which I also did not use - instead I tried to filter the data with naive algorithms);
  • Location annotation is done really well;
  • Most of the time fish are swapped really fast, ca. 1-2s per fish;


4. Choosing a base model implementation + necessary class extensions / caveats

After extensive googling, I decided to use these two implementations, due to their relative simplicity, openness and friendliness of the authors:

  • Keras yolov2;
  • When I was first downloading the implementation, the author did not provide weights in Keras format and his conversion script failed, so I had to use this one instead. Now he seems to provide the weights in Keras format;
  • SSD in pytorch
    • If you want to learn more about SSD, read the original paper, and read this small series of articles, which I also found handy (Part 1);


Code

I did not clean my code / make it really readable - so brace yourselves if you want to tap into some of the below snippets.


Key take-aways

  • Multi-GPU was not directly and easily supported in Keras when I was writing the code for the challenge;
  • When working with Keras, 90% of the time you will be rewriting generators - pay attention to them being thread-safe;
  • Pytorch supports multi-GPU and easily extendable generator classes out of the box;
  • The pytorch SSD has excellent augmentation examples;


Feeding the Keras generator

Basically, most of these implementations build their pre-processing pipelines around Pascal VOC or MS COCO, which at this moment are terribly outdated (you will see obsolete python 2 libraries and awfully long xml parsing scripts). So the first part of the exercise is bracing yourself and writing your own data generator on the basis of the provided one.

Feeding the generator (after reverse-engineering the xml parsing scripts) turned out to be really easy (the link points to the final train script):

import os
import glob
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# BatchGenerator is the (modified) generator class from the keras yolov2 implementation

generator_config = {
    'IMAGE_H'         : IMAGE_H, 
    'IMAGE_W'         : IMAGE_W,
    'GRID_H'          : GRID_H,  
    'GRID_W'          : GRID_W,
    'BOX'             : BOX,
    'LABELS'          : LABELS,
    'CLASS'           : len(LABELS),
    'ANCHORS'         : ANCHORS,
    'BATCH_SIZE'      : BATCH_SIZE,
    'TRUE_BOX_BUFFER' : 2,
}

def change_coords(x1,y1,x2,y2,f_len):
    
    max_x = 1280
    max_y = 720
    
    x_av = (x2+x1)/2
    y_av = (y2+y1)/2
    
    if(x_av-f_len/2)>max_x:
        x1_new = max_x
    elif (x_av-f_len/2)<0:
        x1_new = 0
    else:
        x1_new=x_av-f_len/2 
        
    if(x_av+f_len/2)>max_x:
        x2_new = max_x
    else:
        x2_new=x_av+f_len/2 
        
    if(y_av-f_len/2)>max_y:
        y1_new = max_y
    elif (y_av-f_len/2)<0:
        y1_new = 0
    else:
        y1_new=y_av-f_len/2 
    if(y_av+f_len/2)>max_y:
        y2_new = max_y
    else:
        y2_new=y_av+f_len/2
    return x1_new,y1_new,x2_new,y2_new
def create_annot_dict(filename,label,xmin,xmax,ymin,ymax):
    try:
        return {
            'filename':filename,
            'object': [{
                'name': label,
                'xmax': int(xmax),
                'xmin': int(xmin),
                'ymax': int(ymax),
                'ymin': int(ymin)
            }]
        }    
    except:
        return {
            'filename':filename,
            'object': []
        }
df = pd.read_csv('train.csv')
df['_'] = df.apply(lambda row: change_coords(row['x1'],row['y1'],row['x2'],row['y2'],row['length']), axis=1)
df[['x1_new','y1_new','x2_new','y2_new']] = df['_'].apply(pd.Series)
del df['_']
cpt = {r:len(files) for r, d, files in os.walk('../extra_space/train_pics/')}
del cpt['../extra_space/train_pics/']
cpt = {k.split('/')[3]: v for k, v in cpt.items()}
cpt_zero = {k: v for k, v in cpt.items() if v == 0}
cpt_non_zero = {k: v for k, v in cpt.items() if v > 0}
species_list = LABELS
df['species'] = df[species_list].idxmax(axis=1)
df.loc[df.fish_number.isnull()==True, 'species'] = np.nan
df['path'] = df.apply(lambda row: '../extra_space/train_pics/{}/{}.jpg'.format(row['video_id'],str(row['frame']+1).zfill(4)), axis=1)
df['dict'] = df.apply(lambda row: create_annot_dict(row['path'],row['species'],row['x1_new'],row['x2_new'],row['y1_new'],row['y2_new']), axis=1)
blank_imgs = []
imgs_fish = set(list(df[df['fish_number'].notnull()].path.values))
# Create a list with all the images wo fishes using sets
# List comprehension is soooo fast compared to iteration
for vid in list(cpt_non_zero.keys()):
    imgs = glob.glob('../extra_space/train_pics/{}/*.jpg'.format(vid))
    imgs.sort()
    imgs = set(imgs)
    blank_imgs += [({'object':[], 'filename':img_path}) for img_path in imgs-imgs_fish]
annotated_imgs = []    
# Add images with annotation
# annotated_imgs += [(row.dict) for index, row in df[df['fish_number'].notnull()].iterrows()]
annotated_imgs += [(row.dict) for index, row in df[ (df['fish_number'].notnull()) & (~df.video_id.isin(list(cpt_zero.keys()))) ].iterrows()]

blank_imgs = random.sample(blank_imgs, int(len(blank_imgs)*(1-REMOVE_NEGATIVE_ITEMS)))
all_imgs = annotated_imgs + blank_imgs
train_imgs, valid_imgs = train_test_split(all_imgs, test_size=VALID_SHARE, random_state=42)
del all_imgs,blank_imgs,annotated_imgs,df
train_batch = BatchGenerator(train_imgs, generator_config, jitter=False, aug_freq = AUG_FREQ)
valid_batch = BatchGenerator(valid_imgs, generator_config, jitter=False, aug_freq = AUG_FREQ)


Multi-GPU in Keras

Also, in the case of keras YOLO, I just had to infer the format supported by their generator and make sure I utilized 2 GPUs. In keras this is not straightforward and kind of hacky, and their generator, I believe, does not support multiple workers.

This is the hacky multi-GPU keras implementation I tested and used.

from keras.layers import merge
from keras.layers.core import Lambda
from keras.models import Model
import tensorflow as tf
def make_parallel(model, gpu_count):
    def get_slice(data, idx, parts):
        shape = tf.shape(data)
        size = tf.concat([ shape[:1] // parts, shape[1:] ],axis=0)
        stride = tf.concat([ shape[:1] // parts, shape[1:]*0 ],axis=0)
        start = stride * idx
        return tf.slice(data, start, size)
    outputs_all = []
    for i in range(len(model.outputs)):
        outputs_all.append([])
    #Place a copy of the model on each GPU, each getting a slice of the batch
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i) as scope:
                inputs = []
                #Slice each input into a piece for processing on this GPU
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(get_slice, output_shape=input_shape, arguments={'idx':i,'parts':gpu_count})(x)
                    inputs.append(slice_n)                
                outputs = model(inputs)
                
                if not isinstance(outputs, list):
                    outputs = [outputs]
                
                #Save all the outputs for merging back together later
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])
    # merge outputs on CPU
    with tf.device('/cpu:0'):
        merged = []
        for outputs in outputs_all:
            merged.append(merge(outputs, mode='concat', concat_axis=0))
            
        return Model(input=model.inputs, output=merged)


First you create a model, then you do the model = make_parallel(model, gpu_count=2) bit, and then you load the weights. Note that the weights from the original model and the new model are not directly interchangeable - the above script basically replaces your model with a wrapper holding two copies of it, which is not ideal for loading weights back (you have to load the weights into the modified model).
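
For clarity, a minimal sketch of that order of operations (the model constructor and weight file name here are hypothetical placeholders):

# build the single-GPU model first (architecture definition omitted here)
model = build_yolo_model()                      # hypothetical constructor
model = make_parallel(model, gpu_count=2)       # wrap it for 2 GPUs
# weights saved from the parallel model go back into the parallel model
model.load_weights('yolo_parallel_weights.h5')  # hypothetical file name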


Thread-safe generator in Keras

When I started training the model, I realized that it was not fully utilizing the server's resources because its generator was using only one thread. I had to read up on writing thread-safe generators in Keras. So I changed the provided generator and came up with this one.
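
For reference, a minimal sketch of the standard pattern for making a Keras generator thread-safe - serializing next() calls with a lock (the toy batch_generator below just stands in for the real data generator):

import threading

class ThreadSafeIter:
    """Wraps an iterator so that calls to next() are serialized by a lock."""
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)

def threadsafe_generator(f):
    """Decorator that makes a generator function thread-safe."""
    def g(*args, **kwargs):
        return ThreadSafeIter(f(*args, **kwargs))
    return g

@threadsafe_generator
def batch_generator(items, batch_size=16):
    # toy generator standing in for the real batch generator
    while True:
        for i in range(0, len(items), batch_size):
            yield items[i:i + batch_size]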


Predictions in Keras

It's tempting to use the provided predict or predict_batch in Keras, but in this case, due to custom data manipulations, I had to improvise a quick and dirty single-threaded queue-based solution.
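
A rough sketch of what such a queue-based setup looks like - a producer thread reads and preprocesses frames into a queue while the main thread pulls batches and calls model.predict (the preprocessing and input size here are placeholders):

import queue
import threading
import cv2
import numpy as np

def producer(paths, q, batch_size=32):
    # hypothetical preprocessing: load and resize frames, stands in for the real pipeline
    batch, names = [], []
    for p in paths:
        img = cv2.resize(cv2.imread(p), (416, 416)) / 255.0
        batch.append(img)
        names.append(p)
        if len(batch) == batch_size:
            q.put((np.array(batch), names))
            batch, names = [], []
    if batch:
        q.put((np.array(batch), names))
    q.put(None)  # poison pill signalling the end of the data

def predict_all(model, paths):
    q = queue.Queue(maxsize=8)
    t = threading.Thread(target=producer, args=(paths, q))
    t.start()
    results = []
    while True:
        item = q.get()
        if item is None:
            break
        batch, names = item
        preds = model.predict(batch)
        results.extend(zip(names, preds))
    t.join()
    return results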

Pytorch takeaways

  • The pytorch SSD implementation had ample augmentations, which also serve as a basis for my current work;
  • Pytorch also provides multi-node and multi-GPU boilerplate out of the box - no need to reinvent the wheel;
  • I basically just needed to rewrite the data generator classes and add basic validation to the train script to make sure there was no over-fitting;
  • The original implementation had no proper runtime validation in the pipeline and no batch-based inference scripts (also their inference post-processing is not multi-GPU friendly, but I should not really complain =) );
  • A couple of basic visdom charts were the cherry on top (a tiny logging sketch follows below);

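For what it's worth, a minimal sketch of logging a loss curve to visdom (assuming a visdom server is already running; the loss values here are toy numbers):

import numpy as np
import visdom

vis = visdom.Visdom()  # assumes `python -m visdom.server` is running locally
win = None
for epoch, loss in enumerate([1.2, 0.9, 0.7, 0.6]):  # toy loss values
    if win is None:
        win = vis.line(X=np.array([epoch]), Y=np.array([loss]),
                       opts=dict(title='train loss'))
    else:
        vis.line(X=np.array([epoch]), Y=np.array([loss]),
                 win=win, update='append')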



Original augmentations in SSD pytorch



My augmentations with 3 crops - notice the higher resolution - SSD with crops converged to a 10x lower cost on train and valid



Human-friendly augmented sample


Quick links


Piece of code I adored

I really enjoyed this piece of code that handles random crops for images + bounding boxes. Just take a look at how it tackles this fiddly task with a randomized approach:

import numpy as np
from numpy import random  # the implementation relies on numpy's random module here
# jaccard_numpy (IoU of the gt boxes against a rect) is provided alongside this class in the repo

class RandomSampleCrop(object):
    """Crop
    Arguments:
        img (Image): the image being input during training
        boxes (Tensor): the original bounding boxes in pt form
        labels (Tensor): the class labels for each bbox
        mode (float tuple): the min and max jaccard overlaps
    Return:
        (img, boxes, classes)
            img (Image): the cropped image
            boxes (Tensor): the adjusted bounding boxes in pt form
            labels (Tensor): the class labels for each bbox
    """
    def __init__(self):
        self.sample_options = (
            # using entire original input image
            None,
            # sample a patch s.t. MIN jaccard w/ obj in .1,.3,.4,.7,.9
            (0.1, None),
            (0.3, None),
            (0.7, None),
            (0.9, None),
            # randomly sample a patch
            (None, None),
        )
    def __call__(self, image, boxes=None, labels=None):
        height, width, _ = image.shape
        while True:
            # randomly choose a mode
            mode = random.choice(self.sample_options)
            if mode is None:
                return image, boxes, labels
            min_iou, max_iou = mode
            if min_iou is None:
                min_iou = float('-inf')
            if max_iou is None:
                max_iou = float('inf')
            # max trails (50)
            for _ in range(50):
                current_image = image
                w = random.uniform(0.3 * width, width)
                h = random.uniform(0.3 * height, height)
                # aspect ratio constraint b/t .5 & 2
                if h / w < 0.5 or h / w > 2:
                    continue
                left = random.uniform(width - w)
                top = random.uniform(height - h)
                # convert to integer rect x1,y1,x2,y2
                rect = np.array([int(left), int(top), int(left+w), int(top+h)])
                # calculate IoU (jaccard overlap) b/t the cropped and gt boxes
                overlap = jaccard_numpy(boxes, rect)
                # is min and max overlap constraint satisfied? if not try again
                if overlap.min() < min_iou and max_iou < overlap.max():
                    continue
                # cut the crop from the image
                current_image = current_image[rect[1]:rect[3], rect[0]:rect[2],
                                              :]
                # keep overlap with gt box IF center in sampled patch
                centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
                # mask in all gt boxes that above and to the left of centers
                m1 = (rect[0] < centers[:, 0]) * (rect[1] < centers[:, 1])
                # mask in all gt boxes that under and to the right of centers
                m2 = (rect[2] > centers[:, 0]) * (rect[3] > centers[:, 1])
                # mask in that both m1 and m2 are true
                mask = m1 * m2
                # have any valid boxes? try again if not
                if not mask.any():
                    continue
                # take only matching gt boxes
                current_boxes = boxes[mask, :].copy()
                # take only matching gt labels
                current_labels = labels[mask]
                # should we use the box left and top corner or the crop's
                current_boxes[:, :2] = np.maximum(current_boxes[:, :2],
                                                  rect[:2])
                # adjust to crop (by substracting crop's left,top)
                current_boxes[:, :2] -= rect[:2]
                current_boxes[:, 2:] = np.minimum(current_boxes[:, 2:],
                                                  rect[2:])
                # adjust to crop (by substracting crop's left,top)
                current_boxes[:, 2:] -= rect[:2]
                return current_image, current_boxes, current_labels


5. Choosing post-processing heuristics

With the technical part out of the way, we can now handle the post-processing heuristics. The thing is - SSD and YOLO can predict bounding boxes and class probabilities, but they cannot really predict fish sequences or count fish.

Fish length is easy - I tried simple linear regression (95% accuracy), regression forests (90% due to overfitting) and CNNs (97-98% on binned data, but too complicated for such a simple task). Given that fish length is only 0.1 of the score and local validation did not really work, I decided to focus more on post-processing and filtering fish counts and class probabilities.
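
For illustration, a hedged sketch of the linear-regression variant, assuming the box diagonal in pixels is the single feature (the exact features I used may have differed):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv').dropna(subset=['length'])
# box diagonal in pixels as a proxy for the head-to-tail distance
diag = np.sqrt((df.x2 - df.x1) ** 2 + (df.y2 - df.y1) ** 2)
X = diag.values.reshape(-1, 1)
y = df.length.values

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
reg = LinearRegression().fit(X_tr, y_tr)
print('holdout R^2:', reg.score(X_va, y_va))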

In hindsight, it turned out that dividing (and conquering) the task into a series of simpler tasks (find the ruler, extract it, rotate it, classify the fish, use an RNN to count fish) was the way to go, but in my case there was a steep learning curve in learning SSD / YOLO / pytorch at the same time (as well as spending my first 2-3 weeks at a new job), so I just stuck to naive post-processing techniques and grid searching for the best solution, until ZFTurbo suggested that I could use the train dataset to validate my post-processing technique by checking the predicted number of fish.




Post processing in a nutshell - we need to decide which probabilities to keep and how to separate fish N from fish N+1


It also turned out that I was not the only one - some of the people who managed to submit a reasonable solution used more or less the same technique:

  • Predict fish location and class probabilities;
  • Apply some kind of threshold, e.g. 0.4 / 0.5 / 0.9, to the data (if the max probability in a row is less than the threshold, then there is no fish);
  • Iterate over predictions sorted by video and frame number;
  • If there is a gap in the data (N consecutive rows with no fish), increment the fish number;
  • Repeat (a minimal sketch of this heuristic follows below);

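A minimal sketch of this counting heuristic (the max_prob column is illustrative, and the threshold / gap values are exactly the parameters being grid searched):

import pandas as pd

def assign_fish_numbers(preds, threshold=0.5, gap=2):
    # preds: one video's predictions with 'frame' and 'max_prob' columns;
    # rows below the threshold count as 'no fish', and a run of `gap`
    # empty frames increments the fish number
    preds = preds.sort_values('frame').copy()
    fish_number = 0
    empty_run = gap  # so the first confident detection starts fish #1
    numbers = []
    for _, row in preds.iterrows():
        if row['max_prob'] < threshold:
            empty_run += 1
            numbers.append(None)
        else:
            if empty_run >= gap:
                fish_number += 1
            empty_run = 0
            numbers.append(fish_number)
    preds['fish_number'] = numbers
    return preds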

I also used a trick (kudos to Konstantin Lopuhin) - feeding 3 square crops of the images to SSD300 instead of resizing the whole picture. It gave me a 10x lower cost function on the train set, but no real improvement on the LB.
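
A sketch of what this cropping might look like for the 1280x720 frames - three overlapping 720x720 squares (left, center, right) resized to the SSD300 input resolution (the exact crop positions in the actual pipeline may have differed):

import cv2

def three_square_crops(frame, out_size=300):
    # cut three overlapping 720x720 squares out of a 1280x720 frame
    # and resize each one to the SSD input resolution
    h, w, _ = frame.shape  # expected 720, 1280
    side = h
    offsets = [0, (w - side) // 2, w - side]  # left, center, right crop origins
    crops = []
    for x0 in offsets:
        crop = frame[:, x0:x0 + side]
        crops.append(cv2.resize(crop, (out_size, out_size)))
    return crops, offsets  # offsets are needed to map boxes back to frame coordinates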

In hindsight, going for a more end-to-end solution was key to getting a 0.7+ score.

6. Experiment log, or am I just unlucky, or is all of this a hoax?

When I first entered the competition, things looked optimistic.


It all started very promisingly




Then I really struggled through tons of experiments, but only marginally improved my original score (0.55 => 0.57).





As you can see above, I was disappointed that a better model (SSD300 + crops) did not really work despite all the tricks. Maybe it's a glitch somewhere, but it is kind of disheartening.

I probably should have paid more attention to the trick proposed by ZFTurbo - validating the post-processing script by running the model on the train dataset and comparing the number of predicted fish. I used MSE and simple histograms to match my values, but they did not really translate to the LB.
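
In code this validation boils down to something like the sketch below, assuming fish_number enumerates fish within a video (as the annotation suggests) and given a hypothetical Series of predicted per-video counts:

import numpy as np
import pandas as pd

train = pd.read_csv('train.csv')
# true number of distinct fish per video
true_counts = train.groupby('video_id').fish_number.max().fillna(0)

# pred_counts: hypothetical Series of predicted fish per video, produced by
# running the detector + counting heuristic over the train videos
def count_mse(pred_counts, true_counts):
    joined = pd.concat([pred_counts.rename('pred'), true_counts.rename('true')],
                       axis=1, join='inner')
    return float(np.mean((joined.pred - joined.true) ** 2))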

My initial guess 



Optimized algorithm (lower score) - I got 2 minima for SSD - 2 frames + 0.9 threshold, and 0.4 threshold + 2 frames


Also, almost at the end I noticed that SSD with 3 crops gave strange predictions - it usually predicted at least 2 classes with probabilities > 0.9, which I tried to mitigate by manually selecting only the top class. Overall, SSD had higher probabilities in its outputs.

As a finishing touch, enjoy a couple of videos:

Yolov2 + 0.5 threshold



SSD300 - NO threshold at all


7. End result, advice, limitations, take-aways

  • On the last day of the competition I was 14th on the public LB. I expect a major shake-up, though. Since my best model is simple enough, I expect that some overfitters (if there were any) would lose some positions, though my thresholding may be naive;
  • End-to-end solutions rule;
  • Explicitly using all the additional information available may give you a proper boost;
  • Steel nerves and balls are necessary till the end;
  • When extending framework / model classes, invest a lot of time into testing;
  • Try to use tools that are as simple as possible if you want to win money. Try to divide and conquer;
  • Do not be afraid to test some naive end-to-end solutions as part of your pipeline (unless it takes more than 1-2 hours to implement), even if they increase overall complexity;

8. Alternative pipelines

These are other guys' pipelines, that scored in the top results:

  • First approach - inference takes 1-2 days:
    • Unet to locate the fish;
    • Classic classification methods to classify the fish;
    • Naive post-processing;
    • Update - also used XGBoost on the outputs of the previous steps for post-processing;
  • Second approach (similar to mine):
    • SSD300 + square frame cropping;
    • Naive post-processing;
  • Third approach:
    • Manual annotation of the ruler on which the fish is laid;
    • Manual annotation of whether the fish is fully visible / obstructed;
    • SSD to locate the ruler, DenseNet to classify the fish on the manually rotated crops;
    • GRU/LSTM to find the optimal split points for counting fish;


9. Kudos

Kudos to ZFTurbo (Kaggle, blog) for kindly sharing useful ideas on fine-tuning post-processing scripts.

Kudos to Konstantin Lopuhin (github, Kaggle) for his fixes to bugs in the original SSD implementation (1, 2) (there was no batch processing for inference!) and for kind advice on the data pipeline.

Kudos to Dmitry Poplavsky (Kaggle) for sharing basic details of his pipeline.

Kudos to all of the SSD and YoloV2 repository authors and contributors as well as paper authors.


10. Final treat

As a final treat - enjoy the video from one of the leaders of the competition about his pipeline: