How we participated in the SpaceNet Three Road Detector challenge

And how we got into the top 10 =)

Posted by snakers41 on February 7, 2018



TLDR for this competition. A summary of architectures, hacks, training regimes and competition data EDA



0. TLDR

See the image above. Competition link. We ended up in the provisional top 10, meaning that our position may still change within this top 10 after additional testing by the hosts.

Also I invested quite some time in writing idiomatic PyTorch code + data generators. You are welcome to use the code release.

We may also write a post about understanding and dissecting sknw, which all the top competitors ended up using for the mask => graph transformation.



Competition target in a nutshell


1. Competition

As you may know, we sometimes participate in competitions (1 and 2) and have specific criteria for selecting them: i) reasonable competitiveness (usually heavily marketed Kaggle competitions attract 1000+ stackers); ii) general interest in the topic; iii) challenge.

In the case of this competition we had it all:

  • Tough challenge and a high probability that the top ods.ai team would be a worthy competitor (they participated in the last 3-4 satellite imaging challenges in 2017);
  • The challenge required not only building proper semantic segmentation models, but also building a GRAPH out of the predicted masks. This entailed a convoluted and interesting metric (part 2 and their repo). People who tried the repo told me that the Python code diverged from the leaderboard heavily, so everyone ended up just using their visualization tool written in Java;
  • Interesting domain - satellite imaging - with ample data (see the above image for details) and ~4000 HD images from 4 cities with different channels (basically we had HD grayscale images, 8-channel MUL images and pan-sharpened MUL images as well as RGB images - more about pan-sharpening here);
  • Interesting data (graphs, geojson data, satellite images) and a new domain to work with (TLDR - use skimage and rasterio, all the other libraries are HORRIBLE). It also turned out that skimage works well with 16-bit images, tiff images and other weird formats - see the sketch after this list;
  • I do not know whether it is good or bad, but the TopCoder platform requires a high level of professionalism from competitors (you have to submit a dockerized version of your application, which is really boring for me - not because of the Docker part, but because of the painstaking packaging required). The platform itself is outright horrible, and the code quality provided by the hosts of the competition is also an abomination. It bothers me a lot when, after you have invested a LOT of time into competing, they ask you to invest MUCH more time to package your submission;
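
On the skimage point above - a minimal sketch of reading a 16-bit multi-band GeoTIFF and stretching it to 8 bits (the file name and the 2-98 percentile window are my own illustrative choices, not the competition pipeline):

# Illustrative only: read a 16-bit multi-band GeoTIFF with skimage and
# stretch it to 8 bits using a percentile window (some dynamic range is lost,
# which is what the 8-bit histogram in the EDA section below shows).
import numpy as np
from skimage import io

img16 = io.imread('MUL_tile.tif')                 # e.g. an (H, W, 8) uint16 array
lo, hi = np.percentile(img16, (2, 98))            # clip extreme values before scaling
img8 = np.clip((img16.astype(np.float32) - lo) / (hi - lo), 0, 1)
img8 = (img8 * 255).astype(np.uint8)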


My brief rant on current top data science platforms:

  • Kaggle - hordes of stackers and mostly business-oriented competitions from 2017 onwards;
  • TopCoder - sometimes very interesting and sophisticated niche stuff, but overall the platform and requirements are horrible;
  • DrivenData - the smallest and the best one so far. Interesting domains, nice hosts who do their job and do not require you to do their job;





We can see that soil and asphalt do not really have peaks in their reflectivity, unlike water or grass - so I invested quite some time in testing various channel combinations


2. Brief EDA, interesting edge cases

You can see the input data composition in the top image, but there are a few interesting / noteworthy edge cases that kind of explain what is going on under the hood.


The original 16-bit image histogram


8-bit image histogram, so some information is lost


Paved roads vs non-paved



An example of paved roads vs. non paved roads vs. intersections


Road lane composition - notice the 12-lane roads, at a border terminal I guess



Typical bad annotation for Paris 



3. Worthy reading

This is a list of the best materials / resources on the topic I read / compiled when participating in the competition.

  • Tinkoff competition / basic info about satellite imaging 1;
  • Information about satellite image channels and pan sharpening 1;
  • Relevant domain papers - only this one;
  • Relevant semantic segmentation papers
  • Single best mask to graph repository - sknw;
  • Amazon jungle competition details and experience 1 2 3 4;
  • Feature engineering 1 2;
  • Haze removal: paper, a couple of repositories 1 2 (they work, but they are slow);
  • Repositories /  pipelines 
    • Semantic segmentation 1;
    • Carvana competition materials (best code ever);


To sum all the architectures up:

  • UNET + ImageNet is king, but LinkNet is better (2x faster and lighter with little loss in accuracy);
  • Transfer learning is king;
  • Sknw is the best mask => graph tool, but it required some fiddling;




Unet + transfer learning



LinkNet


Loss explanation. Some academics still use MSE...


4. Baselines and model failure cases

Well, the majority of the information is in the TLDR section above; this section just elaborates a bit on the details. The leaderboard score ranged from 0 to roughly 900k for ~900 test images (a score of 0 - 1 was assigned to each image). The final metric was calculated as an average between cities, which was underwhelming, because Paris dragged the overall score down significantly.



Score distribution per tile for some of the best performing models. As you can see, Paris has a lot of vegetation / rural tiles with shitty annotation, which affects the score. Also Shanghai and Khartoum have a higher divergence between t-p and p-t scores, i.e. it's much easier for the model to make sure all the predicted graph edges are correct, but much more difficult to find ALL graph edges


I tested several dozen ideas and architectures, and to my dismay, only the most simple and naive ones worked (this was kind of uniform among the contestants).

Loss

  1. BCE + DICE / BCE + (1 - DICE) - behaved more or less the same;
  2. Loss with clipping behaved horribly;
  3. N * BCE + DICE and BCE + N * DICE did not work in my case. Competitors later shared that the metric to monitor is HARD DICE and that the optimal loss was 4 * BCE + DICE - see the sketch after this list;
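
A minimal PyTorch sketch of such a combined loss (the class name and smoothing constant are mine; the weights follow the 4 * BCE + DICE variant mentioned above - this is not the exact competition code):

import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Weighted BCE + (1 - soft DICE) on raw logits; smooth avoids division by zero."""
    def __init__(self, bce_weight=4.0, dice_weight=1.0, smooth=1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.bce_weight = bce_weight
        self.dice_weight = dice_weight
        self.smooth = smooth

    def forward(self, logits, targets):
        probs = torch.sigmoid(logits)
        intersection = (probs * targets).sum()
        dice = (2.0 * intersection + self.smooth) / (probs.sum() + targets.sum() + self.smooth)
        return self.bce_weight * self.bce(logits, targets) + self.dice_weight * (1.0 - dice)

HARD DICE here is simply the same ratio computed on (probs > 0.5) instead of the soft probabilities - that is the quantity worth monitoring.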


CNNs

  1. LinkNet34 (ResNet34 encoder + decoder) - was the best in terms of speed / accuracy;
  2. LinkNet50, LinkNext101 (ResNeXt + LinkNet), VGG11-ResNet - all behaved roughly the same, but required 2-4x more resources;
  3. All the encoders were pre-trained on ImageNet, ofc;
  4. In the 8-channel network I just replaced the first convolution (see the sketch after this list), but it behaved more or less the same;
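
A hedged sketch of item 4 - swapping the first convolution of a torchvision ResNet34 for an 8-channel one (the weight initialization below is just one reasonable option, not necessarily what I used):

import torch
import torch.nn as nn
from torchvision.models import resnet34

def resnet34_encoder(in_channels=8, pretrained=True):
    """ResNet34 encoder accepting an arbitrary number of input channels."""
    net = resnet34(pretrained=pretrained)
    old = net.conv1                                   # original 3-channel 7x7 stride-2 conv
    net.conv1 = nn.Conv2d(in_channels, old.out_channels,
                          kernel_size=old.kernel_size, stride=old.stride,
                          padding=old.padding, bias=False)
    with torch.no_grad():
        # copy the pretrained RGB filters into the first three channels
        # and initialise the remaining channels with their mean
        net.conv1.weight[:, :3] = old.weight
        net.conv1.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)
    return net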


Processing

  1. The mask is binarized;
  2. I did an ablation analysis of all of the channel combinations;
  3. The best were vegetation, rgb, 8 channels, urban - but the differences were minimal (3-5%) - see the sketch after this list;
  4. Naturally, HD images behaved better;
  5. 8- or 16-bit images also behaved more or less the same;
  6. Image normalization - subtracting the ImageNet mean and dividing by the ImageNet std gave a 2-3% loss reduction, but no difference on the leaderboard;
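
A rough sketch of items 2-3 and 6; the WorldView-3 band order and the exact band combinations below are my assumptions for illustration, not the competition configuration:

# Illustrative: pick a named channel combination from an 8-band MUL image
# (assumed band order!) and normalize it with the ImageNet statistics.
import numpy as np

BANDS = {'coastal': 0, 'blue': 1, 'green': 2, 'yellow': 3,
         'red': 4, 'red_edge': 5, 'nir1': 6, 'nir2': 7}

COMBOS = {
    'rgb':        ['red', 'green', 'blue'],
    'vegetation': ['nir1', 'red', 'green'],       # false-colour composite
    'urban':      ['nir2', 'red_edge', 'coastal'],
}

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def select_and_normalize(mul_img, combo='rgb'):
    """mul_img: a float (H, W, 8) array already scaled to [0, 1]."""
    idx = [BANDS[name] for name in COMBOS[combo]]
    img = mul_img[..., idx]
    return (img - IMAGENET_MEAN) / IMAGENET_STD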


Masks

  1. Masks from the APLS repository - 10% of them throw an exception because of the lack of roads;
  2. Just loading the linestrings into skimage and feeding ~10% additional blank images (see the sketch after this list) - produces a better loss, but a lower leaderboard score;
  3. Wide masks, i.e. where the road width corresponds to its real width - produce a much lower loss, but a lower score on the leaderboard;
  4. Paved / non-paved roads - models for non-paved roads work much worse due to the lack of data;
  5. Multi-channel mask models also did not work;
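
A minimal sketch of item 2, i.e. rasterizing road linestrings into binary masks with skimage (the coordinate convention, tile size and road width below are assumptions):

# Illustrative: draw road linestrings (given in pixel coordinates) onto a blank
# mask and thicken the 1-px lines to an approximate road width.
import numpy as np
from skimage.draw import line
from skimage.morphology import dilation, disk

def linestrings_to_mask(linestrings, shape=(1300, 1300), width=11):
    # linestrings: list of vertex lists like [(x0, y0), (x1, y1), ...]
    mask = np.zeros(shape, dtype=np.uint8)
    for coords in linestrings:
        for (x0, y0), (x1, y1) in zip(coords[:-1], coords[1:]):
            rr, cc = line(int(round(y0)), int(round(x0)),
                          int(round(y1)), int(round(x1)))
            ok = (rr >= 0) & (rr < shape[0]) & (cc >= 0) & (cc < shape[1])
            mask[rr[ok], cc[ok]] = 1
    return dilation(mask, disk(width // 2))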


Meta-models

  1. Siamese networks for refining the predictions of wide + narrow networks did not work;


Pipelines

  1. 3- or 8-channel images behaved more or less the same;
  2. 90-degree rotations, horizontal and vertical flips, and minor affine transformations decreased the loss by 15-20% (see the sketch after this list);
  3. My best model was trained on full resolution, but 800x800 crops worked as well;
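
A minimal sketch of the geometric augmentations from item 2 - random D4 transforms (90-degree rotations + flips) applied jointly to the image and the mask; the affine part is omitted:

import random
import numpy as np

def d4_augment(img, mask):
    k = random.randint(0, 3)                    # number of 90-degree rotations
    img, mask = np.rot90(img, k), np.rot90(mask, k)
    if random.random() < 0.5:                   # horizontal flip
        img, mask = np.fliplr(img), np.fliplr(mask)
    if random.random() < 0.5:                   # vertical flip
        img, mask = np.flipud(img), np.flipud(mask)
    return np.ascontiguousarray(img), np.ascontiguousarray(mask)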

Ensembling

  1. 3x folds + 4x TTA did not give any boost in my case (see the sketch below);
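
For reference, the flip TTA that was tried looks roughly like this (predict_fn is a hypothetical single-image inference function returning a probability mask):

import numpy as np

def predict_tta(predict_fn, img):
    # predict on the original and the flipped versions, undo the flips, average
    preds = [
        predict_fn(img),
        np.fliplr(predict_fn(np.fliplr(img))),
        np.flipud(predict_fn(np.flipud(img))),
        np.flipud(np.fliplr(predict_fn(np.flipud(np.fliplr(img))))),
    ]
    return np.mean(preds, axis=0)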


Additional ideas that supposedly boosted the score on the leaderboard:

  1. LinkNet34 (LinkNet + Resnet) => Inception encoder - +20-30k;
  2. Train on full resolution; on inference: downsize => predict => upsize => ensemble (see the sketch after this list) - +10-20k;
  3. Post-process graphs and connect edges to the frames of the tile - +20-30k;
  4. RGB image (as opposed to MUL) inference - +10-15k due to corrupted data;
  5. 4 * BCE + 1 * DICE, monitoring HARD DICE instead of the whole loss - +10-30k;
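
A rough sketch of idea 2 (the 0.5 scale factor is an arbitrary choice for the example; predict_fn is again a hypothetical single-image inference function):

import numpy as np
from skimage.transform import resize

def multiscale_predict(predict_fn, img, scale=0.5):
    # average a full-resolution prediction with an upsampled low-resolution one
    full = predict_fn(img)
    h, w = img.shape[:2]
    small = resize(img, (int(h * scale), int(w * scale)) + img.shape[2:],
                   preserve_range=True)
    up = resize(predict_fn(small), (h, w), preserve_range=True)
    return (full + up) / 2.0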


Main failure case. No one solved this as far as I know. Wide masks were supposed to fix it


The main reason why wide masks failed - they produce polyps and octopuses =)



Same failure case



Paris - bad annotation, forest area - also a failure case



Multi-tier roads - wide model prediction vs narrow



Model hallucinates roads on top of parking lots =)



Sometimes PyTorch glitches out and produces such artifacts


5. Building graphs and main failure cases

The key ingredient was sknw + adding graph edges for curvy roads (kudos to Dmytro for his advice!).
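
In a nutshell, the mask => graph step can be sketched like this (a minimal illustration, not the full competition pipeline; the threshold is arbitrary):

# Binarize the predicted mask, skeletonize it and let sknw build a networkx
# graph; the 'pts' arrays on the edges hold the pixel coordinates along each
# road, which is where the extra points for curvy roads can be taken from.
import numpy as np
import sknw
from skimage.morphology import skeletonize

def mask_to_graph(prob_mask, threshold=0.5):
    binary = prob_mask > threshold
    skeleton = skeletonize(binary).astype(np.uint16)
    return sknw.build_sknw(skeleton)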


This alone pretty much guaranteed that you would be in the top 10



Without these steps all masks end up looking like this


Alternative pipelines I tried:

  1. Skeletonize + corner detection from skimage;
  2. Some variants of dilation from here;
  3. They produced decent edges, but sknw was much better;


Additional stuff that sometimes helped other competitors:

  1. Dilation as a pre- or post-processing step;
  2. Connecting graph edges to the frames of the image tiles;


Visualization tool 


6. A visual treat - whole city images

Following a tradition from the previous competitions, a guy from the chat made full city images from the tiles using the geo-referencing data from the tif images.

You can see them in high resolution by following the links:

  1. Vegas;
  2. Paris;
  3. Shanghai;
  4. Khartoum;




Vegas - looks really nice