Solving class imbalance on Google open images

Or potentially building a model cascade to alleviate severe class imbalance

Posted by snakers41 on July 23, 2018


One half of the Google Open Images dendrogram, just so that you understand the scale of the challenge


I achieved ~90% accuracy (0.88 F1 score) on multi-class classification on Open Images with 76 classes.

The only thing remaining is to test a coarse semseg model. Code will be released after the competition ends.

Where the number 76 comes from:

  • ~40% of images are marked with level-0 classes;
  • ~40% of images are marked with level-1 classes;
  • ~11% of images are children of the above level-1 class images, plus some manual fiddling;
  • overall I believe I covered ~1.2m images out of 1.5m;
  • at first I had 90 classes, but that was due to manual errors and some inconsistencies within the hierarchy json file;

Some of my best runs

Why Google Open Images

AFAIK, as of 2018 this is one of the largest openly available image datasets with class labels as well as bounding boxes. I am interested in this dataset not so much because of the corresponding competition, but because:

  • You can test novel CNN-related techniques on this dataset (e.g. stochastic weight averaging, focal loss, gradually unfreezing the encoder, some forms of cyclical learning rates etc);
  • You can try novel / non-standard approaches with this dataset;
  • You can try building model cascades. This experience may be transferrable to any hierarchical task with CNNs;

Open Images V4 in a nutshell:

  • ~1.5m images with bounding boxes;
  • ~15m bounding boxes;
  • there is an explicit hierarchy of classes provided as a json file;
  • ~600 classes with heavy class imbalance;
  • ~40% of the dataset is level-0 labelled objects (i.e. they have no parent);
  • ~40% of the dataset is level-1 labelled objects (i.e. they have a parent);
  • ~11% are level-3 objects;
  • there are 76 classes that cover 91% of the dataset (level 0 + level 1 + some fiddling with level 3 and some manual adjustments);
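Traversing the hierarchy json is where most of my annotation headaches came from, so a minimal sketch of how the levels can be assigned may help. The toy `root` dict below is a hypothetical stand-in for the real hierarchy file; it mimics the structure where each node has a `LabelName` and children can live under two distinct keys, `Subcategory` and `Part`:

```python
def assign_levels(node, level=0, levels=None):
    """Recursively walk the class hierarchy and record each label's depth.

    Children can appear under BOTH 'Subcategory' and 'Part' keys -
    missing one of the two child types is an easy mistake to make.
    """
    if levels is None:
        levels = {}
    label = node['LabelName']
    # the same class can occur in several places; keep the shallowest depth
    levels[label] = min(level, levels.get(label, level))
    for key in ('Subcategory', 'Part'):
        for child in node.get(key, []):
            assign_levels(child, level + 1, levels)
    return levels

# hypothetical stand-in for the real hierarchy json
root = {
    'LabelName': 'Entity',
    'Subcategory': [
        {'LabelName': 'Vehicle',
         'Subcategory': [{'LabelName': 'Car'}]},
        {'LabelName': 'Person',
         'Part': [{'LabelName': 'Human leg'}]},
    ],
}

levels = assign_levels(root)
```

Here `Vehicle` and `Person` land on level 1, while `Car` (a subcategory) and `Human leg` (a part) both land on level 2, even though they reach that depth through different child types.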

Some pitfalls of the dataset / competition

The key problem I have with this dataset is its raw, non-curated nature. Just take a look at some examples I picked:

Basically, this means that:

  • You have to tackle a complicated challenge - object detection;
  • There are a lot of small objects that appear in flocks, which YOLO-like models seem to struggle with;
  • Objects vary greatly in size and shape;
  • Annotation is messy and non-uniform - bounding boxes are imprecise and classes are not consistent across the dataset;
  • Also, 2 months for this competition seems a bit on the short side;

Mastering some techniques / potential avenues of tackling the issue

Basically I see only 2 approaches to solving this problem (and no, I am not really planning to participate in the actual competition, unless my test runs prove to be very efficient):

  • Use an SSD-like object detector with the following features: (i) encoder skip connections (ii) object anchors (iii) maybe some kind of cascade in the parametrization (a leg is a part of a person);
  • Do it in two stages:
    • Run a multi-class classification model;
    • Have a number of smaller second-level models that do some form of coarse semantic segmentation (which automatically yields object locations) on curated subsets of object types;
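The routing logic of this two-stage cascade is simple enough to sketch. Everything below is a hypothetical illustration - the stand-in classifier, the cluster map, and the model names are placeholders, not my actual pipeline:

```python
def cascade_predict(image, classifier, semseg_models, cluster_of):
    """Stage 1: a multi-class classifier picks a class for the image.
    Stage 2: the coarse-semseg model owning that class's cluster
    localises the objects, which yields the bounding regions."""
    cls = classifier(image)
    cluster = cluster_of[cls]          # follow the class-hierarchy clusters
    mask = semseg_models[cluster](image)
    return cls, mask

# toy stand-ins that only demonstrate the routing
cluster_of = {'car': 'vehicles', 'bus': 'vehicles', 'person': 'humans'}
classifier = lambda img: 'car'
semseg_models = {
    'vehicles': lambda img: 'vehicle-mask',
    'humans': lambda img: 'human-mask',
}

cls, mask = cascade_predict(None, classifier, semseg_models, cluster_of)
```

The point of the design is that each second-stage model only ever sees a curated subset of classes, so its own class imbalance is far milder than the full dataset's.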

The problem I have with object detection is that I do not have a proper working object detection pipeline, and the pipelines I used in the past were more or less black boxes. In practice this means that if you want to achieve a high result on such a task, you need to invest at least 2-3 weeks of work to build such a pipeline. On the other hand, I already have a semseg pipeline, and a classification pipeline is very easy to build.

As a side perk this also means that you will have a nice dataset and a challenge to test new CNN-related ideas.

Evaluating the 2-stage approach

Basically you need to do 2 things:

  • Build a proper multi-class classification model - at least ~90% accuracy;
  • Build at least one proper coarse semseg model to prove that it works properly on some subclass;
    • Then test that this approach scales to other classes;

I did a benchmark of multi-class classification models and of approaches useful in general with multi-tier classifiers. The basic idea is to follow the graph structure of class dependencies: train a good multi-class classifier => train coarse semseg models for each big cluster.

What worked

  • Using SOTA classifiers from ImageNet (ResNet-152 - papers say it transfers best);
  • Pre-training with a frozen encoder (otherwise the model performed worse);
  • Best-performing architecture so far - ResNet-152 (a couple of others still to try);
  • Collecting image aspect ratios => binarising them => dividing them into 3 major clusters (2:1, 1:2, 1:1);
  • Training on 0.25 of the original resolution (I can probably boost accuracy simply by raising this);
  • Using adaptive pooling for different aspect ratio clusters;
  • Fixing the following issues with annotation:
    • Sometimes the same class has 2 places in the hierarchy;
    • Some inconsistencies in the provided json file - there are basically 2 child types - subcategory and part - which I glossed over at first;
    • Being more careful when traversing the class tree, fixing some issues manually, i.e. some classes are not present in the provided class description file;
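The adaptive-pooling trick is what makes one classifier head work across all three aspect-ratio clusters: whatever spatial size the encoder's feature map has, the pooled output is fixed. A minimal sketch with PyTorch's `nn.AdaptiveAvgPool2d` (the feature-map shapes below are illustrative, not my exact ones):

```python
import torch
import torch.nn as nn

# One pooling layer serves all three aspect-ratio clusters, because the
# output spatial size is fixed regardless of the input spatial size.
pool = nn.AdaptiveAvgPool2d((1, 1))

wide = torch.randn(4, 2048, 8, 16)     # 2:1 cluster feature map
tall = torch.randn(4, 2048, 16, 8)     # 1:2 cluster feature map
square = torch.randn(4, 2048, 12, 12)  # 1:1 cluster feature map

# every cluster yields the same (batch, channels) vector for the FC head
outs = [pool(fmap).flatten(1) for fmap in (wide, tall, square)]
```

Because the flattened vector is always `(batch, 2048)`, the same fully-connected classifier can sit on top of the encoder for every cluster.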

Fixing issues with annotation boosts  (left chart - green vs. red)  accuracy

A dataset inconsistency

What did not work or did not significantly improve results:

  • Oversampling (I solved it elegantly using built-in PyTorch sampling);
  • Oversampling + minor augs;
  • Using modest or minor augs (10% or 25% of images augmented);
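By "built-in PyTorch sampling" I mean the weighted-sampler route; a minimal sketch of the idea (the toy 9:1 label tensor is illustrative):

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

torch.manual_seed(0)

# toy 9:1 imbalanced labels standing in for the real dataset
labels = torch.tensor([0] * 90 + [1] * 10)

# each sample's weight is the inverse of its class frequency,
# so both classes are drawn with roughly equal probability
class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                replacement=True)
loader = DataLoader(TensorDataset(labels), batch_size=20, sampler=sampler)

sampled = torch.cat([batch[0] for batch in loader])
minority_frac = (sampled == 1).float().mean().item()
```

With these weights the minority class shows up in roughly half of the drawn samples per epoch instead of 10%, without duplicating files on disk.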

What did not work at all

  • Using 1xN + Nx1 convolutions instead of pooling - too heavy;
  • Using some minimal avg. pooling (like 16x16), then using different 1xN + Nx1 convolutions for different clusters - performed mostly worse than plain adaptive pooling;
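For clarity, a sketch of the factorised head that underperformed plain adaptive pooling: pool coarsely to a fixed 16x16 grid, then collapse that grid with a 1xN and an Nx1 convolution (one such pair per aspect-ratio cluster). Channel sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

# the variant that lost to plain AdaptiveAvgPool2d((1, 1))
head = nn.Sequential(
    nn.AdaptiveAvgPool2d((16, 16)),             # coarse fixed-size grid
    nn.Conv2d(2048, 512, kernel_size=(1, 16)),  # 1xN: collapse the width
    nn.Conv2d(512, 512, kernel_size=(16, 1)),   # Nx1: collapse the height
)

fmap = torch.randn(2, 2048, 24, 12)  # any encoder output shape works
out = head(fmap)                     # -> (2, 512, 1, 1)
```

The two convolutions add roughly 2048*512*16 + 512*512*16 weights per cluster, which is why this option was too heavy for what it delivered.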

Some error analysis

I basically stumbled upon the dataset's problems when I analyzed performance across classes. This led me to understand that something was wrong with the parametrization / classes.
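The analysis itself is just per-class F1 plotted against class size; a pure-Python sketch of the metric (toy labels for illustration):

```python
from collections import Counter

def per_class_f1(y_true, y_pred, classes):
    """Per-class F1 - the breakdown that exposes clusters of
    classes the model quietly fails on."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

# toy predictions for illustration
y_true = ['car', 'car', 'person', 'person', 'dog']
y_pred = ['car', 'person', 'person', 'person', 'car']

f1 = per_class_f1(y_true, y_pred, ['car', 'person', 'dog'])
sizes = Counter(y_true)  # pair F1 with class size to spot the weird cluster
```

Scattering `f1` against `sizes` per class is exactly the kind of chart below: a healthy model shows F1 rising with class size, so a cluster of large classes with low F1 points at an annotation or parametrization problem rather than a capacity problem.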

F1 score vs. class size. Note the weird cluster at the bottom

Same behavior - but in histogram form

Weird behaviour of human-centered clusters