GAN paper list and review

My brief guide / notes I made when reading GAN papers

Posted by snakers41 on January 4, 2018

Generative Adversarial Networks (GANs) in a nutshell, as of 2016. They are becoming more and more photo-realistic day by day

Just check out this video to get a retrospective on what GANs can do nowadays in 2017-2018


0. What?

Inspired by this repository, for professional reasons I need to read all the most promising / influential / state-of-the-art GAN-related papers, as well as papers related to creating latent space variables for a certain domain. So I decided to make a small curated list of GAN-related papers (and papers related to extracting and training latent space variables in cases when we have no explicit annotation) for myself and for my channel. This serves several purposes:

  • motivates me to be more thorough in paper reviews / note the available code implementations;
  • serves as a notebook and facilitates discussion with my peers / colleagues / friends;
  • is accessible to a non-technical audience, unlike Github, and is a good supplement for my telegram channel;
  • serves as some kind of CV, because the underlying work cannot be published under NDA;

Also note - I use all this stuff in an applied fashion, I am not a scientist - i.e. my motivation is the result, not writing code or papers.

Also in a broader sense, if you want to find the best papers in some field:

  • follow influential thinkers via twitter API;
  • ask around;
  • see what is used in competitions;
  • use this website;

1. How is it different from awesome lists?

Well, in essence it is not. A couple of minor differences though:

  • Following my article from here and this page - I will add a highly subjective influence / must-read score because all awesome lists become piles of garbage in the end;
  • More cats and lulz;
  • Only high-level gist and very sparse details;
  • Basic navigation - 

2. TLDR;

Well, the TLDR in the form of a sortable dynamic table is located here.

Notice that it should look like this more or less (note the sorting)

3. More detailed descriptions

Some trivia at first:

  • Some examples of what is achievable;
  • Selecting optimizers for image style transfer;
  • How to train GANs
  • GAN issues:

(1) DeViSE: A Deep Visual-Semantic Embedding Model  ★★★★☆ (8/10)

(2) A Neural Algorithm of Artistic Style ★★★★★ (99/10)

  • key ideas
    • style and content can be split in CNNs
    • higher layers = content, lower layers = style - illustration

    • VGG16 used as an encoder
    • Loss contains 2 weighted components - content loss and style loss - different combinations

    • content loss - MSE between the activations of a chosen CNN layer for the generated image and the content image
    • style loss - MSE of gram matrices
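The two loss components can be written down in a few lines of numpy (a toy sketch, not the paper's code - the function names are mine, and real implementations compute these on VGG activations, not raw arrays):

```python
import numpy as np

def content_loss(feat_gen, feat_content):
    """MSE between activations of one CNN layer: generated vs. content image."""
    return np.mean((feat_gen - feat_content) ** 2)

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) feature map: channel-channel correlations,
    which capture style while discarding spatial layout."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_gen, feat_style):
    """MSE between Gram matrices of the generated and style images."""
    return np.mean((gram_matrix(feat_gen) - gram_matrix(feat_style)) ** 2)
```

The total loss is then a weighted sum of `content_loss` over higher layers and `style_loss` over lower layers.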

(3) Generative Adversarial Networks ★★★★★ (99/10)

  • meta-data
    • authors - Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
  • similar purpose appoaches (just for list - I see names for first or second time): restricted Boltzmann machines (RBN), Deep belief networks (DBNs), noise-contrastive estimation (NCE), generative stochastic network (GSN)
  • key actors:
    • prior for input noise variables p_z(z)
    • generator G(z,theta_g)
    • discriminator D(x,theta_d)
    • G and D compete - generative model is pitted against an adversary discriminator
    • D is trained to maximize the probability of assigning the correct label to both training examples and samples from G
    • G is trained to minimize log(1−D(G(z)))
  • training the model
    • iterative approach - k steps of training D and one step for G
    • early in training, rather than training G to minimize log(1−D(G(z))) (which saturates when D easily rejects fakes), G can be trained to maximize log D(G(z))
    • exact likelihood is not tractable (!!!)
  • illustrations - generated images
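The competing objectives above can be sketched in numpy (a toy illustration with my own function names; `d_real` / `d_fake` stand for D's sigmoid outputs on real and generated samples):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """D maximizes log D(x) + log(1 - D(G(z))); returned negated as a loss.
    Inputs are D's outputs in (0, 1)."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def g_loss_saturating(d_fake):
    """Original G objective: minimize log(1 - D(G(z))). Its gradient vanishes
    early in training, when D confidently rejects fakes (d_fake near 0)."""
    return np.mean(np.log(1.0 - d_fake))

def g_loss_non_saturating(d_fake):
    """The paper's practical alternative: maximize log D(G(z)),
    i.e. minimize -log D(G(z)) -- much stronger gradients early on."""
    return -np.mean(np.log(d_fake))
```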


(4) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks - DCGAN

  • meta-data
  • key ideas
    • stable GAN training across a range of datasets with a list of hacks
    • features learned by DCGAN are strong for transfer learning with accuracy almost as good as sota models
    • good properties for latent space algebra
  • brief novelties
    • Their D: Convolutional networks. No linear layers. No pooling, instead strided layers. LeakyReLUs.
    • Their G: Starts with 100d noise vector. Generates with linear layers 1024x4x4 values. Then uses fractionally strided convolutions (move by 0.5 per step) to upscale to 512x8x8. This is continued till Cx32x32 or Cx64x64. The last layer is a convolution to 3x32x32/3x64x64 (Tanh activation).
    • The fractionally strided convolutions basically do the same as the progressive upscaling in a Laplacian pyramid. So it's essentially one Laplacian pyramid in a single network, with all upscalers trained jointly, leading to higher-quality images.
    • They use Adam as their optimizer. To decrease instability issues they decreased the learning rate to 0.0002 (from 0.001) and the momentum/beta1 to 0.5 (from 0.9).
  • Architecture guidelines for stable Deep Convolutional GANs
    • Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
    • Use batchnorm in both the generator and the discriminator
    • Remove fully connected hidden layers for deeper architectures
    • Use ReLU activation in generator for all layers except for the output, which uses Tanh.
    • Use LeakyReLU activation in the discriminator for all layers
  • illustrations:
    • generator

    • generated bedrooms

    • bedroom latent space interpolation

    • latent space operations
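The generator's shape progression is easy to verify with the standard transposed-convolution output-size formula (a sketch with my own helper name, assuming the kernel-4 / stride-2 / padding-1 setup common in DCGAN implementations):

```python
def deconv_out(size, kernel=4, stride=2, pad=1):
    """Output spatial size of a fractionally strided (transposed) convolution."""
    return (size - 1) * stride - 2 * pad + kernel

# DCGAN-style generator: 100-d noise -> 1024x4x4 -> ... -> 3x64x64
shapes = [(1024, 4)]
for ch in (512, 256, 128, 3):
    shapes.append((ch, deconv_out(shapes[-1][1])))
print(shapes)  # [(1024, 4), (512, 8), (256, 16), (128, 32), (3, 64)]
```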

(5) Perceptual Losses for Real-Time Style Transfer and Super-Resolution ★★★★★ (10/10)

  • (also read their supplementary materials!)
  • key:
    • Same in essence as (2) but 1000x faster and claims to be more robust
    • another definition of perceptual loss
    • used for style-transfer and super-resolution
  • illustrations
    • article in a nutshell

    • calculating Gram matrix efficiently

    • style transfer examples

    • super-resolution examples

    • a hint that resnets may be better

  • they follow DCGAN model guidelines
  • 2 components
    • image transformation network
    • loss network (loss is derived from a network)
  • 2 losses
    • feature reconstruction loss is the (squared, normalized) Euclidean distance between feature representations
    • style reconstruction loss is then the squared Frobenius norm of the difference between the Gram matrices of the output and target images
  • loss network pre-trained on imagenet
  • use residual connections (why not use resnet then?)
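Both perceptual losses can be sketched directly from those definitions (toy numpy with my own function names; in the paper the features come from a fixed VGG-16 loss network pre-trained on imagenet):

```python
import numpy as np

def feature_reconstruction_loss(feat_out, feat_target):
    """Squared, normalized Euclidean distance between (C, H, W) feature maps."""
    return np.sum((feat_out - feat_target) ** 2) / feat_out.size

def style_reconstruction_loss(feat_out, feat_target):
    """Squared Frobenius norm of the difference between Gram matrices."""
    def gram(f):
        c, h, w = f.shape
        m = f.reshape(c, h * w)  # the "efficient" Gram trick: reshape + matmul
        return m @ m.T / (c * h * w)
    diff = gram(feat_out) - gram(feat_target)
    return np.sum(diff ** 2)
```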

(6) Matching Networks for One Shot Learning ★★★☆☆ (7/10)

  • authors Vinyals Oriol, Blundell Charles, Lillicrap Timothy, Kavukcuoglu Koray, Wierstra Daan
  • mathematically verbose and formula-heavy, but too scarce on practical implications for my taste
  • inspired by seq2seq
  • key aim - reduce the amount of data necessary for training
  • 2 novelties
    • modeling level - an attention mechanism implemented as an LSTM over CNN features
    • training procedure - train 1 label per batch
  • high level description
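The attention idea can be sketched as cosine-similarity softmax attention over an embedded support set (a toy numpy illustration with my own names; the paper additionally embeds query and support with CNN/LSTM networks):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict(query_emb, support_embs, support_labels, n_classes):
    """Attend over the support set: cosine similarity -> softmax -> weighted
    sum of one-hot support labels gives the class distribution for the query."""
    def unit(v):
        return v / np.linalg.norm(v)
    sims = np.array([unit(query_emb) @ unit(s) for s in support_embs])
    attn = softmax(sims)
    probs = np.zeros(n_classes)
    for a, y in zip(attn, support_labels):
        probs[y] += a
    return probs
```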

(7) Image-to-Image Translation with Conditional Adversarial Networks ★★★★★ (99/10)

  • General solution for pix2pix translation using conditional GANs
  • CNNs alone produce blurry images => GANs solve this
  • cGANs use structured loss unlike plain CNNs

  • GANs learn a mapping from random noise vector z to output image y: G : z → y. Conditional GANs learn a mapping from observed image x and random noise vector z, to y: G : {x, z} → y.
  • open question - stochasticity of the output
  • Unet as G and PatchGAN as D

  • tries to classify if each N × N patch in an image is real or fake
  • run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of D
  • training schedule
    • one gradient descent step on D, then one step on G
    • SGD + Adam
  • smart test time evaluation ideas
    • at test time - dropout is also applied, as well as bnorm (!)
    • use AMT for person based evaluation
    • use FCN-like models to test original and generated images vs labels
  • ablation tests
    • patch size is important, 70x70 is the optimal one
    • cGAN + L1 loss is the best loss option
    • U-net with skips is better than without skips

    • generator can be applied convolutionally on larger images
  • list of "solved" tasks on 256x256 image size

    • Semantic labels→photo 2975 training images (Cityscapes)
    • Architectural labels→photo 400 training images
    • Maps ↔ aerial photograph 1096 training images (Google Maps)
    • BW→color 1.2 million training images (Imagenet)
    • Edges→shoes 50k training images (UT Zappos50K)
  • illustrations
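The generator objective (cGAN term on the PatchGAN's per-patch outputs, plus a weighted L1 term) can be sketched as follows (toy numpy, my function names; λ = 100 in the paper, and the GAN term here uses the non-saturating form):

```python
import numpy as np

def pix2pix_g_loss(d_fake_patches, g_out, target, lam=100.0):
    """Generator loss: fool D on every NxN patch (averaged over the patch grid)
    plus lambda-weighted L1 distance to the ground-truth image."""
    gan_term = -np.mean(np.log(d_fake_patches))  # per-patch D outputs in (0, 1)
    l1_term = np.mean(np.abs(g_out - target))    # L1 keeps output near target
    return gan_term + lam * l1_term
```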

(8) Unsupervised Image-to-Image Translation Networks ★★★★★ (10/10)

  • authors Ming-Yu Liu, Thomas Breuel, Jan Kautz (Nvidia)
  • summary
  • link

  • a paper behind this amazing 2 minute paper episode
  • shared-latent space assumption
  • basically the whole architecture is

  • mostly shared encoder, 2 pairs
  • 2 G+D pairs
  • training
    • ADAM
  • When testing on the (satellite) map dataset:
    • Weight sharing between encoders and between generators improved accuracy
    • The cycle consistency loss improved accuracy
    • Using 4-6 layers (as opposed to just 3) in the discriminator improved accuracy.
    • Translations that added details (e.g. night to day) were harder for the model.
  • After training, the features from each discriminator seem to be quite good for the respective dataset (i.e. unsupervised learned features).
  • Loss
    • Their loss consists of three components:
      • VAE-loss: Reconstruction loss (absolute distance) and KL term on z (to keep it close to the standard normal distribution). Most weight is put on the reconstruction loss
      • GAN-loss: Standard as in other GANs, i.e. cross-entropy.
      • Cycle-Consistency-loss: For an image I_A, it is expected to look the same after switching back and forth between image styles, i.e. I_A = G1(E2( G2(E1(I_A)) )) (switch from style A to B, then from B to A). The cycle consistency loss uses a reconstruction loss and two KL-terms (one for the first E(.) and one for the second).
  • results
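The VAE part of the loss can be sketched as follows (toy numpy with my own names and illustrative weights; with the unit-variance assumption the KL term to a standard normal reduces to 0.5·‖μ‖² per sample):

```python
import numpy as np

def kl_to_standard_normal(mu):
    """KL(N(mu, I) || N(0, I)) per sample -- with unit variance this is
    just 0.5 * ||mu||^2, averaged over the batch."""
    return 0.5 * np.sum(mu ** 2, axis=-1).mean()

def vae_loss(x, x_rec, mu, rec_weight=10.0, kl_weight=0.1):
    """Reconstruction (absolute distance) dominates; the KL term keeps the
    shared latent code z close to N(0, I). Weights here are illustrative."""
    rec = np.mean(np.abs(x - x_rec))
    return rec_weight * rec + kl_weight * kl_to_standard_normal(mu)
```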

(9) Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks - CycleGAN ★★★★★ (10/10)

  • authors Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Berkeley AI Research (BAIR) laboratory, UC Berkeley
  • link
  • original release 
  • code explanation 
  • pytorch 
  • a paper behind this episode of 2 minute papers
  • best examples

  • key novelties:
    • mostly based on pix2pix ideas
    • cycle consistency assumption and loss - "if we translate, e.g., a sentence from English to French, and then translate it back from French to English, we should arrive back at the original sentence"
  • key moving parts:

    • 2 mapping functions G : X → Y and F : Y → X
    • 2 discriminators Dx and Dy
    • loss : two types of terms: adversarial losses + cycle consistency losses
    • model can be viewed as training two “autoencoders”: learn one autoencoder F ◦ G : X → X jointly with another G ◦ F : Y → Y
  • evaluation
    • Amazon Mechanical Turk perceptual studies
    • Semantic segmentation metrics using FCN
  • caveats
    • though much better than pix2pix and the other baselines, it still cannot really fool the AMT evaluation

  • vs baselines
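The cycle consistency loss is just two L1 reconstruction terms (a toy numpy sketch, my function name; G and F are the two mapping networks):

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """L1 cycle loss: F(G(x)) should recover x, and G(F(y)) should recover y."""
    return np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))
```

With a pair of toy mappings that invert each other, e.g. `G = lambda v: v + 1` and `F = lambda v: v - 1`, the loss is exactly zero, which is what cycle consistency rewards.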

(10) Deep Photo Style Transfer ★★★★☆ (8/10)

  • authors Luan Fujun, Paris Sylvain, Shechtman Eli, Bala Kavita (Adobe)
  • link 
  • the original paper behind this 2-minute-papers episode
  • TLDR - looks VERY cool, but most likely NOT PRACTICAL in 2017/2018 as it requires explicit region masks

  • key contributions
    • photorealistic generic style transfer 

  • failure cases

  • suppressing distortions 

    • satisfying photo realistic style transfers in a broad variety of scenarios, including transfer of the time of day, weather, season, and artistic edits
    • changes in output colour would have to preserve the ratios and distance in the input style colours
    • it is achieved by regularization - a "custom fully differentiable energy term inspired by the Matting Laplacian"
    • Gram matrix of the neural responses => prevents any region from being ignored
    • though looks very impressive, requires having explicit masks for regions, which complicates the pipeline


Progressive Growing of GANs for Improved Quality, Stability, and Variation

  • link 
  • authors Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen (NVIDIA)
  • an original paper behind this video
  • caveat NOT practical in 2017/2018
    • GANs are 2-3x slower to train than ordinary CNNs (also Unet-based architectures are heavy)
    • This approach slows down the training speed by a factor of 3x-6x
    • overall, the whole process is very complicated
  • idea in a nutshell

  • CNN architecture

  • key novelties
    • training methodology for GANs - start with low-resolution images, then progressively increase the resolution by adding layers to the networks
    • increase variation of the training data using statistic layers (standard deviation)
    • custom normalization schemes (as opposed to bnorm proposed earlier)
    • normalize the feature vector in each pixel to unit length in the generator after each convolutional layer
    • lr equalization
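The per-pixel feature normalization mentioned above is essentially a one-liner (numpy sketch, my name for the helper): divide each pixel's channel vector by its RMS across channels.

```python
import numpy as np

def pixel_norm(feat, eps=1e-8):
    """Normalize the feature vector at each pixel of a (C, H, W) map to
    (roughly) unit length: divide by the RMS over the channel axis."""
    return feat / np.sqrt(np.mean(feat ** 2, axis=0, keepdims=True) + eps)
```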

(13) Wasserstein GAN ★★★★★ (99/10)

  • authors Martin Arjovsky, Soumith Chintala, Léon Bottou (Courant Institute of Mathematical Sciences, Facebook)
  • link 
  • claims to cure the main training problems of GANs
  • did not beat the math there yet
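Even without the math, the practical change is small: the critic outputs raw scores (no sigmoid, no log) and its weights are clipped to enforce the Lipschitz constraint (toy numpy sketch, my function names; the paper clips to c = 0.01):

```python
import numpy as np

def critic_loss(c_real, c_fake):
    """WGAN critic maximizes E[f(x)] - E[f(G(z))], an estimate of the
    Wasserstein-1 distance; returned negated as a loss to minimize.
    Inputs are raw scores, not probabilities."""
    return -(np.mean(c_real) - np.mean(c_fake))

def clip_weights(weights, c=0.01):
    """After each critic update, clip every weight into [-c, c]."""
    return [np.clip(w, -c, c) for w in weights]
```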
