Exploring the limits of unsupervised Machine Learning in Computer Vision

Or continuation of flat saga (there is no free lunch indeed)

Posted by snakers41 on April 29, 2018


Just look at these beautiful and 100% photo-realistic faces - I guess this is more or less representative of what GANs can achieve now




TLDR

As a continuation of this saga and this review, I decided to apply heavier unsupervised learning algorithms to try to build a semantic search engine for a dataset of 1M flat images. I also tried testing the GANs on the LSUN dataset.

Since the time of the review, one more article came out, but alas, with no proper reference implementation.

As a reminder, here is the challenge:

  • There is a very large (in theory virtually unlimited, 10-20M+ images) dataset of Russian flat images;
  • Add LSUN, and unlabeled data is literally too abundant;
  • The task is to find images of the same flat / room taken from different angles (at different times, from different positions in the room, etc);
  • There is very little annotation: the flats are sorted, but this sorting is very coarse (a window of 100-200 images may contain 3-5 positively matching samples);


From this exercise and from my other similar experiences here are the brief conclusions:

  • Cluster analysis:
    • Features from a pre-trained ImageNet encoder => PCA => UMAP => HDBSCAN work really well for image clustering;
    • The clusters detected, though, are not really granular (i.e. outdoors, toilets, baths, living rooms, houses, etc);
  • Siamese networks:
    • Any siamese network / hard negative mining inspired methods just did not work - the annotation data is too coarse;
  • GANs as encoder:
    • At some point I had the following idea:
      • Train a GAN using the latest SOTA methods (i.e. Wasserstein loss, progressive growing, some other tricks from papers);
      • Use a Generator from the GAN to recover embeddings from new images;
      • Use these embeddings for search / distance calculation;
    • Using the latest developments in GANs this kind of works, but I did not achieve anything close to photo-realistic results at resolutions higher than 128x128 (I believe that 256x256 is the minimum);
    • 128x128 showed more or less decent images;
    • Training GANs is really tricky, probably not reproducible, and very much DATASET dependent;
    • The resulting reproducible celebrity images are nowhere near the boasted HD images from NVIDIA;
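
The last step of that idea, searching by embedding distance, can be sketched independently of how the embeddings are produced. This is a minimal illustration in plain Python, assuming each image has already been encoded as a fixed-length vector. The function names (`cosine_similarity`, `top_k`), the image ids and the toy vectors are hypothetical, and cosine similarity is just one common choice of measure, not necessarily the one used in these experiments.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query, index, k=5):
    """Return the k (image_id, score) pairs most similar to the query.

    `index` maps a (hypothetical) image id to its embedding vector.
    """
    scored = [(image_id, cosine_similarity(query, emb))
              for image_id, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional embeddings, made up for illustration only.
index = {
    "flat_a_view_1": [0.9, 0.1, 0.0],
    "flat_a_view_2": [0.8, 0.2, 0.1],
    "flat_b_view_1": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
print(results)
```

A linear scan in pure Python like this would be slow for 1M images; a vectorised or approximate nearest-neighbour index would be the usual replacement.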

Why GANs, which GANs exactly, and which results did I get?

Why try GANs for this task? Well, because:

  • I know for a fact that similar, easier GAN-related tasks are solved in practice:
    • Photo style transfer;
    • Super resolution (because perfect annotation is easy to obtain) - and this can also be done with an encoder-decoder CNN, like UNet;
    • Some photo reconstruction tasks, like fixing cracks etc;
  • A lot of smart people have turned their attention to this topic (judging by the papers and the news). Usually it means either hype or the fact that some architectures suddenly started working;
  • There is no quality annotation in my case, besides the images themselves;
  • GANs are a cool thing to try;
  • My favourite framework (PyTorch) suits this task really well;
  • The original progressive growing paper featured this image, which motivated me. Of course, they did not share 100% reproducible code and settings to reproduce it;


I understand that 200 people working at NVIDIA produce a dozen papers per year, and that I probably just do not know which knobs to turn, but honestly I do not have 3-6 months to devote to writing my own GAN loop or fine-tuning the knobs.



On the other hand, it should be noted that:

  • GANs are notoriously unstable and difficult to train;
  • GANs are dataset dependent (surprise, anyone?). If it works for CelebA or LSUN, it does not mean that it will work on your dataset;
  • The simpler the domain (i.e. 32x32 anime images will work fine, 100%), the higher the chance it will work;
  • In my case, I trained them on a subset of flat images that I had cleaned from noise, keeping only the most interesting and prominent cluster;
  • Playing with network capacity is a bit of a challenge: you have an encoder-decoder network, which is already very memory intensive;
  • Training time with progressive growing is 3-10x that of a similarly sized UNet;


For this task DCGAN + progressive growing seemed to be a natural fit. There is a list of PyTorch implementations (1, 2, 3). I used the first one, which works mostly OK, but you have to rewrite the dataset class and seriously dig into the batch-size options, which are poorly documented. Judging by the timelines, writing and testing such an implementation may itself take 1-3 months. All of these are in a more or less semi-finished state, and it can be seen from the authors' logs and newer issues that all of them hit some kind of ceiling in their work. That said, their 256x256 images on the CelebA-HQ dataset look believable.

I did not store a lot of images, but for such a diverse dataset images up to 128x128 looked more or less decent (a neat thing: the GAN hallucinates a cian.ru watermark):


128x128 images look mostly decent


256x256 images


After playing with the knobs, I decided to call it a day and abandon this challenge. Even if my GAN had worked perfectly, I would still have had to write a network to distill its encoding mechanism into another CNN =)

Cleaning and clustering the data

How to cluster such data?

As I mentioned above - GANs work better for simple domains (probably this is the reason that the most featured HD dataset was human faces). Prior to running a GAN, in addition to my previous attempts I cleaned the images using the following algorithm:

  • Use skip connections from an Inception v4 neural network as the encoder (~3072-dimensional vectors);
  • Apply PCA to reduce the dimensionality to 10;
  • Apply UMAP and HDBSCAN to cluster the images;
  • Select only the single most prominent cluster;
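
The steps above can be sketched roughly as follows. This is not the original code, just a minimal outline assuming the encoder features are already available as an `(n_images, n_dims)` array. The function names `cluster_features` and `most_prominent_cluster` are hypothetical, and the UMAP / HDBSCAN parameters (2-D projection, `n_neighbors=15`, `min_dist=0.0`, `min_cluster_size=100`) are illustrative guesses rather than the settings actually used. Only the 10 PCA components come from the description above.

```python
from collections import Counter


def most_prominent_cluster(labels):
    """Return the label of the largest cluster, ignoring HDBSCAN noise (-1)."""
    counts = Counter(int(label) for label in labels if int(label) != -1)
    if not counts:
        return None  # everything was classified as noise
    return counts.most_common(1)[0][0]


def cluster_features(features, min_cluster_size=100):
    """PCA -> UMAP -> HDBSCAN on an (n_images, n_dims) float array.

    The heavy dependencies are imported here so that the helper above
    stays usable without them.
    """
    from sklearn.decomposition import PCA
    import umap      # pip install umap-learn
    import hdbscan   # pip install hdbscan

    # Step 1: PCA down to 10 dimensions, as in the list above.
    reduced = PCA(n_components=10).fit_transform(features)
    # Step 2: UMAP projection (2-D here is an assumption, handy for plotting).
    projected = umap.UMAP(n_components=2, n_neighbors=15,
                          min_dist=0.0).fit_transform(reduced)
    # Step 3: density-based clustering; label -1 means "noise".
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    labels = clusterer.fit_predict(projected)
    return projected, labels


# Selecting the cluster from a toy label array (-1 marks noise):
toy_labels = [0, 0, 1, -1, 1, 1, -1, 2]
keep = most_prominent_cluster(toy_labels)
print(keep)
```

With real features, `projected, labels = cluster_features(features)` followed by `features[labels == most_prominent_cluster(labels)]` would keep only the images from the largest cluster.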


Here I should say a couple of words about lmcinnes (Leland McInnes) and his two amazing libraries, HDBSCAN and UMAP. In a nutshell:

  • UMAP is a faster t-SNE on steroids (the author claims it to be 10-100x faster) that projects high-dimensional data into lower dimensions while preserving local distances;
  • HDBSCAN is a very advanced clustering technique that scores high on these criteria:
    • Don't be wrong (HDBSCAN has a separate "noise" class for points it cannot confidently assign to any cluster);
    • Intuitive parameters (the single most important parameter is the minimum cluster size);
    • Stability over several runs;
    • Performance;
    • Support for non-globular clusters;

As I mentioned multiple times on my channel, these libraries are simply awesome, but let me share a number of additional links:


Clustering results

If you do not yet believe me, just see these results.



UMAP manifold with clusters created by HDBSCAN. Red dots are noise.


Now look at some of the clusters:



Most prominent cluster - living rooms



Toilets



Outdoors



Baths

Siamese networks?

I have tried a number of siamese architectures with different losses (similar to this) and dataset preparation techniques - but nothing really worked.

CNNs can handle noisy data (if you swap 25-30% of labels randomly, accuracy does not fall dramatically), but I guess this is too noisy.
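
A rough back-of-the-envelope calculation shows just how noisy the coarse ordering is. The sketch below assumes (hypothetically) that "positive" pairs are drawn uniformly at random from a window of neighbouring images, and that each window contains exactly one group of truly matching images. It uses the window sizes quoted in the challenge description. The function name `true_positive_pair_share` is made up for illustration.

```python
from math import comb


def true_positive_pair_share(window_size, matching_images):
    """Share of image pairs in a window that truly show the same flat.

    Assumes exactly one group of `matching_images` matching images per
    window and uniform sampling of pairs (both are simplifications).
    """
    return comb(matching_images, 2) / comb(window_size, 2)


best_case = true_positive_pair_share(window_size=100, matching_images=5)
worst_case = true_positive_pair_share(window_size=200, matching_images=3)
print(f"best case:  {best_case:.4%}")
print(f"worst case: {worst_case:.4%}")
```

Under these assumptions well under 1% of naively sampled "positive" pairs are real matches, so the label noise is far beyond the 25-30% that CNNs tolerate.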

Other methods I did not try

One guy mentioned that such a task can be solved with classical SIFT features, but he mentioned this after I pulled the plug. Alas.

Also, I had an idea borrowed from the LSUN 2016 challenge: use UNets to predict room geometry (predict the room box by predicting floor, wall and ceiling masks) and then search rooms by geometry.

But can it realistically be done in practice, even with unlimited resources / money / investments?

Well, of course yes. Here is a list of things that can be done given unlimited time / resources / investments / effort etc:

  • Train a plethora of weak classifiers and detectors, e.g. whether there is a TV / bed / window / cabinet etc. in the frame. Use this data + some kind of meta-data (address) to train tree-based classifiers;
  • Invest much more time into producing a working encoder using GANs. Implement progressive growing of GANs from scratch;
  • Invest into annotation directly - but I believe that this would be the worst investment, since there is no literature applying siamese networks to REAL datasets in the wild (i.e. something close to ImageNet level);


A brief note on NVIDIA and hype



The hype train. It's even more real than Mickey Mouse


I have noticed many times, especially when reading Medium or NVIDIA blog posts (an impression reinforced by the lack of proper supporting code), that many ML papers / articles / events pursue the objective of riding the hype train, i.e. the presented results may be curated post hoc and not reproducible per se. This is always a risk when you apply SOTA architectures in practice.

I have also noticed that high-profile teams sometimes publish dubious articles (like the link above) on some hot topic, claiming top results in some competitions but not bothering to upload their results to the leaderboard.
