Solving class imbalance on Google open images

Or potentially building a model cascade to alleviate severe class imbalance

Posted by snakers41 on July 23, 2018


One half of the Google Open Images dendrogram, just so that you understand the scale of the challenge


I achieved ~90% accuracy (0.88 F1 score) on multi-class classification on Open Images with 76 classes.

The only thing remaining is to test a coarse semseg model. Code will be released after the competition ends.

Where the number 76 comes from:

  • ~40% of images are marked with level-0 classes;
  • ~40% of images are marked with level-1 classes;
  • ~11% of images are children of the above level-1 class images, plus some manual fiddling;
  • overall I believe I covered ~1.2m images out of 1.5m;
  • at first I had 90 classes, but that was due to manual errors and some inconsistencies within the hierarchy json file;

Some of my best runs

Why Google Open Images

AFAIK, as of 2018 this is one of the largest openly available image datasets with class labels as well as bounding boxes. I am interested in this dataset not so much because of the corresponding competition, but because:

  • You can test novel CNN-related techniques on this dataset (e.g. stochastic weight averaging, focal loss, gradually unfreezing the encoder, some forms of cyclical learning rates etc);
  • You can try novel / non-standard approaches with this dataset;
  • You can try building model cascades. This experience may be transferrable to any hierarchical task with CNNs;

Open Images V4 in a nutshell:

  • ~1.5m images with bounding boxes;
  • ~15m bounding boxes;
  • there is an explicit hierarchy of classes provided as a json file;
  • ~600 classes with heavy class imbalance;
  • ~40% of the dataset is level-0 labelled objects (i.e. they have no parent);
  • ~40% of the dataset is level-1 labelled objects (i.e. they have a parent);
  • ~11% are level-3 objects;
  • there are 76 classes that cover 91% of the dataset (level 0 + level 1 + some fiddling with level 3 and some manual adjustments);
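Traversing the hierarchy json is where most of my annotation headaches came from, so a minimal sketch of how the levels can be assigned may help. The toy `root` dict below is a hypothetical stand-in for the real hierarchy file; it mimics the structure where each node has a `LabelName` and children can live under two distinct keys, `Subcategory` and `Part`:

```python
def assign_levels(node, level=0, levels=None):
    """Recursively walk the class hierarchy and record each label's depth.

    Children can appear under BOTH 'Subcategory' and 'Part' keys -
    missing one of the two child types is an easy mistake to make.
    """
    if levels is None:
        levels = {}
    label = node['LabelName']
    # the same class can occur in several places; keep the shallowest depth
    levels[label] = min(level, levels.get(label, level))
    for key in ('Subcategory', 'Part'):
        for child in node.get(key, []):
            assign_levels(child, level + 1, levels)
    return levels

# hypothetical stand-in for the real hierarchy json
root = {
    'LabelName': 'Entity',
    'Subcategory': [
        {'LabelName': 'Vehicle',
         'Subcategory': [{'LabelName': 'Car'}]},
        {'LabelName': 'Person',
         'Part': [{'LabelName': 'Human leg'}]},
    ],
}

levels = assign_levels(root)
```

Here `Vehicle` and `Person` land on level 1, while `Car` (a subcategory) and `Human leg` (a part) both land on level 2, even though they reach that depth through different child types.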

Some pitfalls of the dataset / competition

The key problem I have with this dataset is its raw, non-curated nature. Just take a look at some examples I picked:

Basically, this means that:

  • You have to tackle a complicated challenge - object detection;
  • There are a lot of small objects that appear in flocks, which YOLO-like models seem to struggle with;
  • Objects vary greatly in size and shape;
  • Annotation is messy and non-uniform - bounding boxes are imprecise and classes are not consistent across the dataset;
  • Also, 2 months for this competition seems a bit on the short side;

Mastering some techniques / potential avenues of tackling the issue

Basically I see only 2 approaches to solving this problem (and no, I am not really planning to participate in the actual competition, unless my test runs prove to be very efficient):

  • Use an SSD-like object detector with the following features: (i) encoder skip connections (ii) object anchors (iii) maybe some kind of cascade in the parametrization (a leg is a part of a person);
  • Do it in two stages:
    • Run a multi-class classification model;
    • Have a number of smaller second-level models that do some form of coarse semantic segmentation (which automatically yields object locations) on curated subsets of object types;
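The routing logic of this two-stage cascade is simple enough to sketch. Everything below is a hypothetical illustration - the stand-in classifier, the cluster map, and the model names are placeholders, not my actual pipeline:

```python
def cascade_predict(image, classifier, semseg_models, cluster_of):
    """Stage 1: a multi-class classifier picks a class for the image.
    Stage 2: the coarse-semseg model owning that class's cluster
    localises the objects, which yields the bounding regions."""
    cls = classifier(image)
    cluster = cluster_of[cls]          # follow the class-hierarchy clusters
    mask = semseg_models[cluster](image)
    return cls, mask

# toy stand-ins that only demonstrate the routing
cluster_of = {'car': 'vehicles', 'bus': 'vehicles', 'person': 'humans'}
classifier = lambda img: 'car'
semseg_models = {
    'vehicles': lambda img: 'vehicle-mask',
    'humans': lambda img: 'human-mask',
}

cls, mask = cascade_predict(None, classifier, semseg_models, cluster_of)
```

The point of the design is that each second-stage model only ever sees a curated subset of classes, so its own class imbalance is far milder than the full dataset's.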

The problem I have with object detection is that I do not have a proper working object detection pipeline, and the pipelines I used in the past were more or less black boxes. In practice this means that if you want to achieve a high result on such a task, you need to invest at least 2-3 weeks of work to build such a pipeline. On the other hand, I already have a semseg pipeline, and a classification pipeline is very easy to build.

As a side perk this also means that you will have a nice dataset and a challenge to test new CNN-related ideas.

Evaluating the 2-stage approach

Basically you need to do 2 things:

  • Build a proper multi-class classification model - at least ~90% accuracy;
  • Build at least one proper coarse semseg model to prove that it works properly on some subclass;
    • Then test that this approach scales to other classes;

I did a benchmark of multi-class classification models and of approaches useful in general with multi-tier classifiers. The basic idea is to follow the graph structure of class dependencies: train a good multi-class classifier => train coarse semseg models for each big cluster.

What worked

  • Using SOTA classifiers from ImageNet (ResNet-152 - papers say it transfers best);
  • Pre-training with a frozen encoder (otherwise the model performed worse);
  • Best-performing architecture so far - ResNet-152 (a couple of others still to try);
  • Collecting image aspect ratios => binarising them => dividing them into 3 major clusters (2:1, 1:2, 1:1);
  • Training on 0.25 of the original resolution (I can probably boost accuracy simply by raising this);
  • Using adaptive pooling for different aspect ratio clusters;
  • Fixing the following issues with annotation:
    • Sometimes the same class has 2 places in the hierarchy;
    • Some inconsistencies in the provided json file - there are basically 2 child types - subcategory and part - which I glossed over at first;
    • Being more careful when traversing the class tree, fixing some issues manually, i.e. some classes are not present in the provided class description file;
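The adaptive-pooling trick is what makes one classifier head work across all three aspect-ratio clusters: whatever spatial size the encoder's feature map has, the pooled output is fixed. A minimal sketch with PyTorch's `nn.AdaptiveAvgPool2d` (the feature-map shapes below are illustrative, not my exact ones):

```python
import torch
import torch.nn as nn

# One pooling layer serves all three aspect-ratio clusters, because the
# output spatial size is fixed regardless of the input spatial size.
pool = nn.AdaptiveAvgPool2d((1, 1))

wide = torch.randn(4, 2048, 8, 16)     # 2:1 cluster feature map
tall = torch.randn(4, 2048, 16, 8)     # 1:2 cluster feature map
square = torch.randn(4, 2048, 12, 12)  # 1:1 cluster feature map

# every cluster yields the same (batch, channels) vector for the FC head
outs = [pool(fmap).flatten(1) for fmap in (wide, tall, square)]
```

Because the flattened vector is always `(batch, 2048)`, the same fully-connected classifier can sit on top of the encoder for every cluster.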

Fixing issues with annotation boosts  (left chart - green vs. red)  accuracy

A dataset inconsistency

What did not work or did not significantly improve results:

  • Oversampling (I solved it elegantly using built-in PyTorch sampling);
  • Oversampling + minor augs;
  • Using modest or minor augs (10% or 25% of images augmented);
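By "built-in PyTorch sampling" I mean the weighted-sampler route; a minimal sketch of the idea (the toy 9:1 label tensor is illustrative):

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

torch.manual_seed(0)

# toy 9:1 imbalanced labels standing in for the real dataset
labels = torch.tensor([0] * 90 + [1] * 10)

# each sample's weight is the inverse of its class frequency,
# so both classes are drawn with roughly equal probability
class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                replacement=True)
loader = DataLoader(TensorDataset(labels), batch_size=20, sampler=sampler)

sampled = torch.cat([batch[0] for batch in loader])
minority_frac = (sampled == 1).float().mean().item()
```

With these weights the minority class shows up in roughly half of the drawn samples per epoch instead of 10%, without duplicating files on disk.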

What did not work at all

  • Using 1xN + Nx1 convolutions instead of pooling - too heavy;
  • Using some minimal avg. pooling (like 16x16), then using different 1xN + Nx1 convolutions for different clusters - performed mostly worse than plain adaptive pooling;
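For clarity, a sketch of the factorised head that underperformed plain adaptive pooling: pool coarsely to a fixed 16x16 grid, then collapse that grid with a 1xN and an Nx1 convolution (one such pair per aspect-ratio cluster). Channel sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

# the variant that lost to plain AdaptiveAvgPool2d((1, 1))
head = nn.Sequential(
    nn.AdaptiveAvgPool2d((16, 16)),             # coarse fixed-size grid
    nn.Conv2d(2048, 512, kernel_size=(1, 16)),  # 1xN: collapse the width
    nn.Conv2d(512, 512, kernel_size=(16, 1)),   # Nx1: collapse the height
)

fmap = torch.randn(2, 2048, 24, 12)  # any encoder output shape works
out = head(fmap)                     # -> (2, 512, 1, 1)
```

The two convolutions add roughly 2048*512*16 + 512*512*16 weights per cluster, which is why this option was too heavy for what it delivered.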

Some error analysis

I basically stumbled upon the dataset's problems when I analyzed performance across classes. This led me to understand that something was wrong with the parametrization / classes.
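The analysis itself is just per-class F1 plotted against class size; a pure-Python sketch of the metric (toy labels for illustration):

```python
from collections import Counter

def per_class_f1(y_true, y_pred, classes):
    """Per-class F1 - the breakdown that exposes clusters of
    classes the model quietly fails on."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

# toy predictions for illustration
y_true = ['car', 'car', 'person', 'person', 'dog']
y_pred = ['car', 'person', 'person', 'person', 'car']

f1 = per_class_f1(y_true, y_pred, ['car', 'person', 'dog'])
sizes = Counter(y_true)  # pair F1 with class size to spot the weird cluster
```

Scattering `f1` against `sizes` per class is exactly the kind of chart below: a healthy model shows F1 rising with class size, so a cluster of large classes with low F1 points at an annotation or parametrization problem rather than a capacity problem.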

F1 score vs. class size. Note the weird cluster at the bottom

Same behavior - but in histogram form

Weird behaviour of human-centered clusters