Looks really good, huh? This is end-to-end evaluation. Two major failure cases - inconsistent performance with smaller objects near bigger houses + some spurious False Positives
In 2018 ML competitions are plagued by:
- "Let's stack 250 models" type solutions;
- Very nasty / imbalanced / random data-sets, which essentially turn any potentially interesting / decent competition into a casino;
- Lack of inquisitive mentality - just stack everything - people do not really try to understand which technique really works and how much gain it provides;
- Too much BS and marketing;
When looking at the CrowdAI mapping challenge, I tried several "new" techniques, which worked really well in Semseg and can be partially applied to other domains:
- Looking at the internal structure of the data / understanding the key failure cases => leveraging it via loss weighting (a bit of self-supervised fashion);
- Polishing your Semseg CNN fine-tuning - namely training last epochs of your CNN with higher DICE loss weight;
I will not be releasing any code at this stage, because the competition is not yet over, but I will gladly share my insights to facilitate discussion.
Why the CrowdAI mapping challenge?
I believe it is a really good playground for testing new CNN-related ideas for a number of reasons:
- The dataset is really well-balanced;
- The hosts were kind enough to provide a lot of starter-kit boilerplate to do the most tedious tasks;
- There is a stellar open solution to the task;
I believe the following Semseg CNN related techniques to be standard in 2018, but I will list them anyway for completeness:
- Using UNet / LinkNet style CNNs;
- Using pre-trained encoders on Imagenet;
- Trying various combinations of encoder-decoder architectures;
- Trying augmentations of vastly different "heaviness" (you should tune capacity of your model + augmentations for each dataset);
(Funnily enough, they rarely write about this in papers).
From an ML perspective it is great. No, really. Their write-up is amazing.
Weight illustrations in the open solution. Note that they do not show the final weights. If they did, the need to calculate n^2 distances would be less apparent.
These are my corresponding weights
And this is how the CNN sees the weights after parametrization / squeezing - it makes no real difference how the distances are calculated
But I have a number of serious bones to pick with it and I have voiced my concerns in the gitter chat here, but I will also repeat them here:
- This solution is blatant marketing of a US$100-per-month-minimum solution for ML, which I guess is supposed to solve scaling / experimentation problems, but I cannot see why Tensorboard + simple logs + bash scripts cannot solve this for free without the code bulk + introducing extra dependencies. Ofc, using this platform is not mandatory, but why would you otherwise?;
- The code bulk in the repo - it contains 3-4 levels of abstraction - decorators on decorators. It's really ok if you have invested in a team of at least several people that compete in interesting competitions as their job, but when you are publishing your code for the public - it looks a bit like those obfuscated repositories with TF code. Their code is really good, but you have to dig through all of these extra layers to get to the "meat";
- The advertised neptune.ml experiment tracking capabilities. Yeah, it tracks all of the hyper-params. But if you follow a link to their open spreadsheet with experiments, you cannot really understand much from there => then what is the tracking advantage? Yeah, running your model on amazon with 2 lines of code is great, but also extremely expensive. Maybe it's better to just assemble a devbox? It looks like they wrote a stellar write-up, but used this tracking only for "illustration", not like a real tool that is crucial;
- The idea about distance weighting is really good - but they implemented it line by line as written in the UNet paper, by calculating n^2 distances, which in my opinion is over-engineering. Just visually, after "squishing", even one distance transform provides similar mask weights;
- From these points my biggest concern is that new ML practitioners will see this and possibly may draw the following conclusions, which ARE HIGHLY DETRIMENTAL TO THE ML COMMUNITY FROM AN EDUCATIONAL STANDPOINT:
- You need some paid proprietary tools to track experiments. No, you do not. It starts to make sense when you have a team of at least 5 people and you have a LOT of different pipelines;
- You need to invest into highly abstract code. No, you do not. CLI scripts + clean code + bash scripts is enough;
- Implementing UNet paper word-by-word is ok, but probably calculating n^2 distances is not useful;
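To make the single-distance-transform point concrete, here is a minimal sketch. It is not the open solution's code and not the exact parametrization used for the illustrations in this post. The function name `distance_weight_map` and the rescaling of the weights to `[1, w0]` are assumptions made for illustration; only `scipy.ndimage.distance_transform_edt` is a real library call.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def distance_weight_map(mask, w0=5.0, sigma=10.0):
    """Per-pixel loss weights from ONE distance transform.

    mask: 2D binary array, 1 = building, 0 = background.
    Assumption: weights are squeezed into [1, w0]; the exact
    parametrization is not specified in the post.
    """
    mask = np.asarray(mask).astype(bool)
    if not mask.any():
        # No objects on the tile: every pixel keeps the base weight.
        return np.ones(mask.shape, dtype=np.float64)
    # Distance from each background pixel to the nearest building pixel
    # (building pixels themselves get distance 0).
    dist = distance_transform_edt(~mask)
    # Gaussian decay: weight w0 on and right next to buildings,
    # falling towards 1 for background far from any building.
    return 1.0 + (w0 - 1.0) * np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
```

Compared to the UNet-paper recipe, this needs no per-object distance maps, so the cost does not grow with the number of buildings on a tile.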
What worked? Which internal structure did I explore?
Current state / my simple ablation analysis
I have not yet implemented the second stage of the pipeline, as the second stage is delayed and, last time I checked, the submits were frozen.
But here is a very brief table with my major tests (I ran ~100 different tests, training ca. 20-30 models till convergence).
| Architecture | Histogram based F1 score (0.5 IOU) (**) | Hard DICE (0.5) (+) | Proper F1 score (0.5 IOU) | LB / polygon F1 (*) | Comment |
|---|---|---|---|---|---|
| Best ResNet + LinkNet | .823 | ~0.9+ | NA | NA | |
| Best ResNet + UNet | .816 | ~0.9+ | NA | NA | 2-3x slower than LinkNet |
| Best InceptionResNet-based model | .786 | ~0.9+ | NA | NA | |
| Best DenseNet-based model | .803 | ~0.9+ | NA | NA | |
| Best LinkNext-based model | .820 | ~0.9+ | NA | NA | |
| Best ResNet + LinkNet + weighting tricks (***) | .874 | ~.94 | ~.9 | TBD | epoch = 0.1 of the dataset (random); 5 epochs with lr 1e-3, unfreeze encoder; 10 epochs with lr 1e-4; 20 epochs with lr 1e-5; then increase DICE weight x10 |
| Best ResNet + LinkNet + weighting tricks + faster schedule + higher size-based weight | | | | | No encoder freeze, start with lr 1e-4; 10 epochs with lr 1e-4; 20 epochs with lr 1e-5; then increase DICE weight x10 |
(*) Challenge hosts provide a tool for local evaluation. It is based on jsons, that have to be filled with polygons + some confidence metric. The authors of the open solution claim that this weighting has a really major impact on the score. I did not test this yet, but I believe it can easily add 2-3 percentage points to the score (i.e. reduce False Positives + make the score .9 => .92-.93 for example);
(**) I just adopted a histogram based F1 score from DS Bowl. I guess it works best when you have many objects. We did a naive implementation of a proper F1 score together with visualization - it is slow, but shows much higher score;
(***) My best runs with (i) size based mask weighting (ii) distance-based mask weighting (iii) polishing with 10x DICE loss component weight;
(+) This is the hard DICE score @ 0.5 threshold, essentially the % of correctly guessed pixels. For some runs I do not remember the best score exactly, hence the approximate values;
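For reference, the two simpler metrics from the footnotes can be sketched as follows. This is a naive illustration and not the evaluation code used for the table. The function names `hard_dice` and `instance_f1` are hypothetical, and `instance_f1` assumes that predictions and ground truth are already available as integer label maps (0 = background, one id per building).

```python
import numpy as np


def hard_dice(pred_prob, target, threshold=0.5, eps=1e-7):
    """DICE between the thresholded prediction and the binary target."""
    pred = np.asarray(pred_prob) > threshold
    target = np.asarray(target).astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


def instance_f1(pred_labels, true_labels, iou_threshold=0.5):
    """Naive (slow) F1 over objects: a ground-truth object counts as found
    if some predicted object overlaps it with IoU above the threshold.
    For thresholds >= 0.5 each prediction can match at most one object."""
    pred_ids = [i for i in np.unique(pred_labels) if i != 0]
    true_ids = [i for i in np.unique(true_labels) if i != 0]
    tp = 0
    for t in true_ids:
        t_mask = true_labels == t
        for p in pred_ids:
            p_mask = pred_labels == p
            inter = np.logical_and(t_mask, p_mask).sum()
            union = np.logical_or(t_mask, p_mask).sum()
            if inter / union > iou_threshold:
                tp += 1
                break
    fp = len(pred_ids) - tp   # predictions that matched nothing
    fn = len(true_ids) - tp   # ground-truth objects that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

The double loop is quadratic in the number of objects per tile, which is why a proper per-object F1 is slow compared to a histogram-based approximation.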
Some learning curves. (3) - one of the best runs. (2) - effect of untimely LR decay
My best run so far - starting with lr 1e-4 + higher size-based weights - trains faster, achieves a bit better score
And here is the gist of my architecture search.
Best model - the fattest ResNet152
I did some ablation tests and:
-- ResNet101 was a bit worse
-- ResNext was close, but heavier
-- VGG-family models over-fitted heavily
-- Inception family models were good, but worse than ResNet
LinkNet and UNet based models were close with fat ResNet encoders, but UNet is 3-4x slower
Just for lulz, maybe it's worth leaving a UNet for a couple of days? =) But that borders on memorizing the dataset ...
Note, that ideally (0) and (1) should be also tested with the whole pipeline, which I did not do yet (my friend will do the second pipeline part).
Played with different levels of augs, small augs were the best, model does not overfit (therefore it is very interesting to see the second stage data - maybe the data will be different => all the training know-hows will become useless, as it always happens with the competitions on Kaggle ..)
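As an illustration of what "small augs" can mean for overhead imagery, here is a minimal sketch of a light augmentation step. The exact set of augmentations used in these runs is not listed, so the choice of flips and 90-degree rotations, and the function name `light_augment`, are assumptions.

```python
import numpy as np


def light_augment(image, mask, rng):
    """Apply the same random flip / 90-degree rotation to image and mask.

    image: (H, W, C) array, mask: (H, W) array.
    """
    if rng.random() < 0.5:                    # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                    # vertical flip
        image, mask = image[::-1], mask[::-1]
    k = int(rng.integers(0, 4))               # 0, 90, 180 or 270 degrees
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    # Flips and rot90 return views; copy so downstream code gets plain arrays.
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)
```

These transforms only rearrange pixels, so they change neither object sizes nor the class balance of a tile.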
(3) Training regime
Freeze encoder, tune the decoder with lr 1e-3 and adam for 0.1 of the dataset (randomly ofc)
Unfreeze, train with lr 1e-4 and adam for 1.0 of the dataset (randomly ofc)
Train with lr 1e-5 for 1.0 of the dataset
Increase DICE loss 10x and train as long as you want - this possibly may be very fragile (!) if the delayed test dataset is different
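The regime above can be written down as a small stage table plus a combined loss. This is a sketch under assumptions, not the training code: the loss form `BCE + dice_weight * (1 - soft DICE)`, the base `dice_weight` of 1.0, the learning rate of the final polishing stage, and all names (`Stage`, `SCHEDULE`, `bce_dice_loss`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Stage:
    name: str
    lr: float
    dataset_fraction: Optional[float]  # None = train as long as you want
    freeze_encoder: bool
    dice_weight: float


# lr / dataset fractions follow the list above; the polishing-stage lr
# and the base dice_weight of 1.0 are assumptions.
SCHEDULE = [
    Stage("decoder warm-up", 1e-3, 0.1, True, 1.0),
    Stage("full fine-tuning", 1e-4, 1.0, False, 1.0),
    Stage("low-lr fine-tuning", 1e-5, 1.0, False, 1.0),
    Stage("DICE polishing", 1e-5, None, False, 10.0),
]


def bce_dice_loss(prob, target, dice_weight=1.0, eps=1e-7):
    """Binary cross-entropy plus a weighted soft-DICE term (assumed form)."""
    prob = np.clip(np.asarray(prob, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)
    bce = -np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob))
    dice = (2 * np.sum(prob * target) + eps) / (np.sum(prob) + np.sum(target) + eps)
    return bce + dice_weight * (1.0 - dice)
```

With this form, raising `dice_weight` in the last stage shifts the optimization from per-pixel calibration towards overlap quality, which is the "polishing" effect described above.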
(4) Loss weighting
Visually I saw no difference in using only one distance transform vs. calculating distances between each object => they are squished anyway
UNet weighting worked best with --w0 5.0 --sigma 10.0 , which means that the weights are distributed [1;5]
Size weighting worked, and gave +3-5% F1 score
Distance weighting did not improve the result, but together with size weighting there was a slight improvement
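A minimal sketch of size-based weighting, where pixels of small buildings get a higher loss weight than pixels of large ones. The formula `sqrt(ref_area / area)` clipped to `[1, max_weight]`, the default values and the name `size_weight_map` are assumptions for illustration; the exact weighting used in these runs is not given here.

```python
import numpy as np
from scipy.ndimage import label


def size_weight_map(mask, ref_area=400.0, max_weight=5.0):
    """Per-pixel weights that grow as the object a pixel belongs to shrinks.

    mask: 2D binary array, 1 = building, 0 = background.
    """
    mask = np.asarray(mask).astype(bool)
    # Connected components (4-connectivity is scipy's 2D default).
    labels, n_objects = label(mask)
    if n_objects == 0:
        return np.ones(mask.shape, dtype=np.float64)
    areas = np.bincount(labels.ravel())   # areas[0] is the background
    # Objects at or above ref_area keep weight 1, smaller ones get more.
    obj_w = np.clip(np.sqrt(ref_area / areas[1:]), 1.0, max_weight)
    lookup = np.concatenate(([1.0], obj_w))  # background weight = 1
    return lookup[labels]
```

A map like this can be combined with a distance-based map (for example by multiplying or adding them) before it is applied to the per-pixel loss; which combination works best is something to test rather than assume.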
Things I decided not to try
Things I decided not to try - morphological operations, erosion, borders etc
- Morphological operations - they are unnecessary here;
- Deep Watershed inspired approaches;
- Proposal based models - they have been shown to be a good start on DS Bowl, but much harder to tune later on;
- Recurrent models - too complicated, too big objects;
Things yet to try
- Build a 2 stage end-to-end pipeline with polygons + meta-data and LightGBM;
- Try the following tricks:
- CLR + SGD / Adam / AdamW;
- Weight-based ensembling;
- Gradual encoder unfreezing (just for the sake of it), batch-norm freeze;
- A couple of ablation tests:
- Maybe lighter UNet;
- Maybe A/B test other weighting approaches;
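For the CLR item, the triangular cyclical learning rate policy is simple enough to sketch in a few lines. The default values below (`base_lr`, `max_lr`, `step_size`) are placeholders, not tuned settings, and the function name `triangular_clr` is hypothetical.

```python
import math


def triangular_clr(iteration, base_lr=1e-5, max_lr=1e-4, step_size=2000):
    """Triangular cyclical learning rate.

    The lr rises linearly from base_lr to max_lr over step_size
    iterations, then falls back over the next step_size iterations.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

The returned value would be written into the optimizer's learning rate before every batch, regardless of whether SGD, Adam or AdamW is used.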
If you are interested in this topic - please feel free to comment - we will release our code with more rigorous tests / references / explanations!