Applying Deep Watershed Transform to Kaggle Data Science Bowl 2018 (dockerized solution)

And why this competition was a lottery

Posted by snakers41 on April 16, 2018

One of the best performing cases - the network can clearly distinguish the merged nuclei using learned energy

Overall solution pipeline



For the dockerized solution, go here.

Every year Kaggle hosts a Data Science Bowl competition. Last year's competition was nothing short of extraordinary. You name it:

  • New and interesting domain (3D imaging), worthy cause (lung cancer);
  • Large dataset (50+ GB);
  • Alluring prizes;

Unfortunately, when the Bowl was hosted last year, I was not yet ready to participate in it. This year, after the purchase of Kaggle by Google, I have started noticing strange trends (here and here). In a nutshell, machine learning competitions used to seem beneficial for the host as well as for the community in general, but now they look more and more like community annotation efforts (US$100k for 1000+ annotations of the 3000-strong stage 2 test set does not really seem too pricey).

In short:

  • A small stage 1 dataset (~600 train images and 65 stage 1 test images) combined with a large stage 2 dataset seems a bit fishy;
  • The data distribution of stage 2 has nothing to do with that of stage 1;
  • So, in the end, the winning model will be the best reasonable model that saw the most external / manually annotated data (I believe that if a simple UNet w/o any kind of nuclei separation can achieve 0.5+, then it is probably due to annotation);
  • Also, Kaggle is notorious for not preventing cheating - in this particular case model re-training was allowed after the stage 2 data was released;

On the other hand, the task itself - instance segmentation - is very interesting despite the small amount of data. You are expected to produce an accurate binary mask as well as to tell the individual objects apart. I understand that such datasets are scarce by default, but people from industry (and the challenge hosts) reported that some companies possess similar datasets with terabytes of data.

Basic computer vision tasks. Also add object classification, i.e. just saying that the image contains a person

In this article I will explain my approach to solving this problem, share my Deep Watershed Transform inspired pipeline, list some alternative approaches and solutions and provide my opinion of how such competitions are to be properly organized.

Data / EDA or why machine learning is not magic

The train dataset contained approximately 600 images and the test dataset contained 65 images. The delayed test dataset from stage 2 contained ~3000 images.

Images from stage 1 had the following resolutions, which in itself posed a minor challenge - how do you build a unified pipeline for such images?

256x256      358
256x320      112
520x696       96
360x360       91
512x640       21
1024x1024     16
260x347        9
512x680        8
603x1272       6
524x348        4
519x253        4
520x348        4
519x162        2
519x161        2
1040x1388      1
390x239        1

The images fell mostly into 3 clusters, which were easy to detect using K-Means:

  • Images with black background;
  • Images with dye;
  • Images with white background;
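As a rough illustration of this kind of clustering, here is a minimal sketch. The features (per-channel mean and standard deviation) and the helper names `image_features` and `kmeans` are assumptions made for the example, not the author's actual code, and the K-Means below is a bare-bones NumPy version with a deterministic farthest-point initialisation rather than a library implementation.

```python
import numpy as np

def image_features(img):
    """Per-channel mean and std of an HxWx3 uint8 image, scaled to [0, 1]."""
    img = img.astype(np.float64) / 255.0
    return np.concatenate([img.mean(axis=(0, 1)), img.std(axis=(0, 1))])

def assign(x, centroids):
    """Index of the nearest centroid for every row of x."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(x, k=3, n_iter=50):
    """Plain Lloyd's K-Means; returns (labels, centroids)."""
    # Farthest-point initialisation (assumption: chosen to keep the sketch deterministic).
    centroids = [x[0]]
    for _ in range(k - 1):
        nearest = np.min([np.linalg.norm(x - c, axis=1) for c in centroids], axis=0)
        centroids.append(x[nearest.argmax()])
    centroids = np.array(centroids, dtype=np.float64)
    for _ in range(n_iter):
        labels = assign(x, centroids)
        updated = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(x, centroids), centroids
```

With a list of loaded images `imgs`, something like `labels, _ = kmeans(np.array([image_features(im) for im in imgs]))` would give one cluster id per image.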

This is the main reason why converting RGB images to 3-channel greyscale helped on the public LB.
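The conversion itself is tiny. The sketch below is one possible way to do it, assuming uint8 RGB input and the common ITU-R BT.601 luma weights; the article does not say which weights were actually used, and `to_gray3` is a hypothetical helper name.

```python
import numpy as np

def to_gray3(img):
    """Convert an HxWx3 RGB uint8 image to greyscale replicated over 3 channels."""
    weights = np.array([0.299, 0.587, 0.114])  # BT.601 luma weights (assumption)
    gray = img[..., :3].astype(np.float64) @ weights
    gray = np.clip(np.rint(gray), 0, 255).astype(np.uint8)
    # Keep 3 channels so that a pre-trained RGB encoder can be used unchanged.
    return np.repeat(gray[..., None], 3, axis=2)
```

Keeping three identical channels means dark, dyed and white-background images all reach the network in the same format, which is presumably why it helped across clusters.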

Dark cluster images

Dyed image mosaic

Nuclei mosaic - nuclei vary in shapes / sizes / colours

The stage 2 test dataset contained ~3000 images, and on initial review 50%+ of these images had nothing to do with the train dataset, which caused a lot of controversy. So, now you have to participate on Kaggle for free, spend time optimizing your model, and then annotate 3000 images, also for free? None of this sounds like proper organization of such a high-profile challenge.

Here are the most notable failure cases from the test dataset:

I would suspect that those small things on the background are nuclei

Honestly no idea what this is

Looks like muscles. Once again, are those white things nuclei, or what?

Night sky ... are these nuclei, or just noise?

Deep Watershed transform

If you do not know about watershed, then follow this link. Intuitively, watershed is really simple - it treats your image as a mountain landscape (height = pixel / mask intensity) and floods the basins from the markers that you choose, until the waters meet each other. You can find a lot of tutorials with OpenCV or skimage, but basically all of them revolve around these questions:

  • How do we choose the markers from which we will flood the images?
  • How do we determine the watershed borders?
  • How do we determine the height of the landscape?
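For reference, the classical (non-learned) recipe from most tutorials answers these three questions with hand-crafted choices. The sketch below assumes a binary mask as input and uses local maxima of the distance transform as markers; the function name `classic_watershed` and the `window` parameter are illustrative, not taken from any particular tutorial.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def classic_watershed(mask, window=11):
    """Split a binary mask into instances with a distance-transform watershed."""
    mask = mask.astype(bool)
    # Landscape height: distance of every foreground pixel to the background.
    distance = ndi.distance_transform_edt(mask)
    # Markers: local maxima of the distance map inside a window x window neighbourhood.
    peaks = (distance == ndi.maximum_filter(distance, size=window)) & mask
    markers, _ = ndi.label(peaks)
    # Borders: flooding is restricted to the mask; -distance turns peaks into basins.
    return watershed(-distance, markers, mask=mask)
```

The weak spot is the marker choice: on irregular blobs the distance map tends to have several local maxima per object, so the result depends heavily on `window`.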

Deep Watershed Transform (DWT) helps us solve some of these issues.

Main motivation from the original paper

The idea is to make the CNN learn 2 things - unit vectors pointing towards (or away from) the boundary, and energy (mountain height) levels

In practice, if you just apply plain watershed, it will most likely result in over-segmentation. This is the intuition behind DWT - make a CNN learn the "mountain landscape" for you.

The authors of the original paper used 2 separate VGG-like CNNs to learn:

  • The watershed energy itself (energy, elevation - whatever you may call it);
  • The unit vectors pointing to (away from) the borders - to help the CNN learn the boundaries;
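To make the two targets concrete, here is a minimal sketch of how they could be derived from a ground-truth mask: the energy is taken as the Euclidean distance to the background, and the unit vectors as its normalised gradient, which points away from the nearest border. This is an illustration under those assumptions, not the paper's exact recipe (the paper additionally quantises the energy into discrete levels), and `direction_target` is a hypothetical name.

```python
import numpy as np
from scipy import ndimage as ndi

def direction_target(mask, eps=1e-8):
    """Return (unit_vectors, energy) for a binary mask.

    unit_vectors has shape (2, H, W) with (dy, dx) components and is zero
    outside the mask; energy is the distance to the nearest background pixel.
    """
    mask = mask.astype(bool)
    energy = ndi.distance_transform_edt(mask)
    gy, gx = np.gradient(energy)
    norm = np.sqrt(gy ** 2 + gx ** 2)
    inside = mask & (norm > eps)
    safe_norm = np.maximum(norm, eps)  # avoids division by zero on flat regions
    uy = np.where(inside, gy / safe_norm, 0.0)
    ux = np.where(inside, gx / safe_norm, 0.0)
    return np.stack([uy, ux]), energy
```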

In practice, it turns out you do not necessarily need a separate CNN. You can just use your UNet and train it to predict several target masks:

  • The merged nuclei masks (gt_mask);
  • Several thinned / eroded merged nuclei masks (3 levels of erosion - 1 pixel, 3 pixels, 5 or 7 pixels);
  • Centers of the nuclei (did not help in my case);
  • Unit vectors (helped a little bit, locally);
  • Borders (helped a bit, locally);
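A minimal sketch of how such target channels could be built from an instance label map is shown below. It is only an illustration: `build_targets` is a hypothetical helper, and eroding each nucleus separately (rather than eroding the already merged mask) is an assumption made here so that touching nuclei end up separated in the eroded channels.

```python
import numpy as np
from scipy import ndimage as ndi

def build_targets(instances, erosion_levels=(1, 3, 5)):
    """instances: HxW int array, 0 = background, 1..N = nucleus ids.

    Returns a float32 array of shape (C, H, W) with channels:
    merged mask, one eroded mask per erosion level, borders.
    """
    merged = instances > 0
    eroded = [np.zeros_like(merged) for _ in erosion_levels]
    borders = np.zeros_like(merged)
    for nucleus_id in np.unique(instances):
        if nucleus_id == 0:
            continue
        nucleus = instances == nucleus_id
        for level, pixels in enumerate(erosion_levels):
            eroded[level] |= ndi.binary_erosion(nucleus, iterations=pixels)
        # Border = pixels of the nucleus removed by a 1 pixel erosion.
        borders |= nucleus & ~ndi.binary_erosion(nucleus)
    return np.stack([merged, *eroded, borders]).astype(np.float32)
```

The stacked array can then be used as a multi-channel target for a UNet with one sigmoid output per channel.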

Then you just need a bit of magic to combine this and voilà, you will have your learned energy. I did not experiment much with architectures, but Dmytro told me that for him it made little difference whether he used an additional CNN or not.

For me the best post-processing (energy_baseline function) was simply:

  • Sum the predicted mask and 3 levels of eroded masks;
  • Apply a threshold of 0.4 to produce watershed seeds;
  • Flood the original thresholded masks from these markers;
  • Use distance transform as "landscape height" measure;

So, to answer the initial questions:

  • How do we choose the markers from which we will flood the images? => thresholded energy, estimated by the CNN
  • How do we determine the watershed borders? => thresholded mask
  • How do we determine the height of the landscape? => distance transform
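The post-processing steps above can be sketched roughly as follows. This is not the actual `energy_baseline` function from the repository: the channel layout of `pred`, the mask threshold of 0.5 and the normalisation of the summed energy by the number of channels (so that the 0.4 threshold lives in the [0, 1] range) are all assumptions made for the example.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def energy_baseline_sketch(pred, mask_threshold=0.5, seed_threshold=0.4):
    """pred: (4, H, W) sigmoid outputs; channel 0 = mask, 1..3 = eroded masks.

    Returns an HxW int label map, 0 = background.
    """
    mask = pred[0] > mask_threshold
    # Sum mask + eroded masks; dividing by the channel count is an assumption.
    energy = pred.sum(axis=0) / pred.shape[0]
    # Thresholded energy gives the watershed seeds (markers).
    markers, _ = ndi.label(energy > seed_threshold)
    # Distance transform of the thresholded mask is the landscape height.
    distance = ndi.distance_transform_edt(mask)
    # Flood the thresholded mask from the markers.
    return watershed(-distance, markers, mask=mask)
```

The key difference from the classical recipe is only where the markers come from: they are read off the network output instead of being guessed from the geometry of the mask.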

A step by step post-processing

Learned gradients are not too steep to be used as watershed lines

Sometimes blob detection also helped to estimate the centers of the nuclei, but overall it did not improve the score.

Other approaches

Personally I would divide the possible approaches to this problem into 4 categories:

  1. UNet-like architectures (UNet + pre-trained Resnet34, UNet + pre-trained VGG16, etc) + Deep Watershed Transform inspired post-processing. UNet (and its cousin, LinkNet) is known to be a universal and easy-to-use tool for semantic segmentation tasks, and energy estimation can be added to it end-to-end;
  2. Recurrent architectures. I found only this more or less relevant paper (which even came with a code release);
  3. Proposal based models like Mask-RCNN. Though more difficult to use (and lacking a proper implementation in PyTorch), it was reported to give a higher score out of the box, but to be mostly impossible to improve further;
  4. Other moonshots described here;

For me, favouring DWT + UNet was a no-brainer because of the simplicity of this approach (you can just feed energy layers as additional mask channels) and its applicability to other domains. I also really liked the recurrent extensions of the UNet, but I did not have enough time to try them properly.

In case of recurrent UNet, I believe that there are 3 key ingredients added to the ordinary case:

  • ConvLSTM layer;
  • Stopping loss - that is supposed to tell the CNN that it has learned enough nuclei;
  • And usage of Hungarian algorithm to match the predicted nuclei with the ground truths;
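The matching step on its own is easy to illustrate. The sketch below pairs predicted and ground-truth instance masks by maximising the total IoU with SciPy's `linear_sum_assignment`; it shows only the assignment, not the loss used in the recurrent paper, and `match_instances` is a hypothetical helper name.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(pred_masks, gt_masks):
    """pred_masks: (P, H, W) bool, gt_masks: (G, H, W) bool.

    Returns a list of (pred_index, gt_index, iou) for the optimal one-to-one matching.
    """
    p = pred_masks.reshape(len(pred_masks), -1).astype(np.float64)
    g = gt_masks.reshape(len(gt_masks), -1).astype(np.float64)
    intersection = p @ g.T
    union = p.sum(axis=1)[:, None] + g.sum(axis=1)[None, :] - intersection
    iou = intersection / np.maximum(union, 1.0)
    # The solver minimises cost, so negate the IoU to maximise it.
    rows, cols = linear_sum_assignment(-iou)
    return [(int(r), int(c), float(iou[r, c])) for r, c in zip(rows, cols)]
```

Because the network emits nuclei in an arbitrary order, some matching of this kind is needed before a per-instance loss can be computed.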

All of this seemed a bit overwhelming at first, but I will definitely try it in the future. My intuition, though, is that this method combines 2 memory-intensive architectures - an RNN and an encoder-decoder network - which may be impractical in real usage outside of small datasets and resolutions.

A description of the ConvLSTM layer

Recurrent UNet architecture

My pipeline

You can find the details here. In a nutshell:

  • Unet with VGG16 encoder;
  • Deep Watershed-like solution with learned energy;
  • A lot of augmentations, including converting images to grayscale;
  • Train model on 256x256 random crops;
  • Predict on resized images (probably this is the wrong choice);
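Random cropping is one simple way to cope with the mix of resolutions during training. The sketch below assumes `image` is HxWxC and `target` is HxWxK, and reflect-pads images that are smaller than the crop size (several stage 1 images are narrower than 256 pixels); the padding strategy and the name `random_crop` are assumptions, not necessarily what the actual pipeline does.

```python
import numpy as np

def random_crop(image, target, size=256, rng=None):
    """Take the same random size x size crop from an image and its target masks."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    pad_h, pad_w = max(size - h, 0), max(size - w, 0)
    if pad_h or pad_w:
        # Reflect-pad small images up to the crop size (assumption).
        pad = ((0, pad_h), (0, pad_w), (0, 0))
        image = np.pad(image, pad, mode="reflect")
        target = np.pad(target, pad, mode="reflect")
        h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    window = (slice(top, top + size), slice(left, left + size))
    return image[window], target[window]
```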

How to host such competitions properly

In a nutshell, like in this competition, but without a couple of annoying topcoder perks.

  • Larger train dataset and more balanced train/test split;
  • Clear limitations on external data and resources;
  • Dockerization and code freeze;
  • Stage 2 has to be evaluated by the hosts;
  • No model re-training between stage 1 and stage 2;
  • Reproducibility of results;


As usual, kudos to my DS competition role model Dmytro for fruitful discussions and tips.

Links / resources / further reading
