**Normalization techniques other than batch norm:**

**Weight normalization** (used in TCN, paper):

- Decouples length of weight vectors from their direction;
- Does not introduce any dependencies between the examples in a minibatch;
- Can be applied successfully to recurrent models such as LSTMs;
- Tested only on relatively small-scale settings (CIFAR-10 + VAEs + DQN);
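A minimal NumPy sketch of the reparameterization behind weight normalization: each weight vector `w` is expressed as a magnitude `g` times a direction `v / ||v||`, so length and direction are learned separately. The function and variable names here are illustrative, not from the paper's code.

```python
import numpy as np

def weight_norm_forward(v, g, x):
    """Linear layer with weight normalization: w = g * v / ||v||, per output unit."""
    norms = np.linalg.norm(v, axis=1, keepdims=True)  # ||v|| per output row
    w = g[:, None] * v / norms                        # magnitude decoupled from direction
    return x @ w.T

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 3))   # direction parameters (one row per output unit)
g = np.ones(4)                    # magnitude parameters
x = rng.standard_normal((2, 3))   # a batch of 2 inputs
y = weight_norm_forward(v, g, x)

# Each effective weight row has norm exactly g, regardless of v's scale.
w = g[:, None] * v / np.linalg.norm(v, axis=1, keepdims=True)
assert np.allclose(np.linalg.norm(w, axis=1), g)
```

Note that nothing here touches batch statistics, which is why the method is minibatch-independent and safe for recurrent models.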

**Instance norm** (used in style transfer)

- Proposed originally for style transfer;
- Essentially batch norm computed for a single image;
- The mean and standard deviation are calculated per channel separately for each object in a mini-batch;
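The per-sample, per-channel statistics can be sketched in a few lines of NumPy (an illustrative implementation, without the learnable affine parameters):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization for x of shape (N, C, H, W):
    normalize each sample's each channel over its own spatial dims only."""
    mean = x.mean(axis=(2, 3), keepdims=True)  # one mean per (sample, channel)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(1).standard_normal((2, 3, 4, 4))
out = instance_norm(x)
```

Contrast with batch norm, which would average over `axis=(0, 2, 3)` and thus mix statistics across the batch.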

**Layer norm** (used in Transformer, paper)

- Designed especially for sequential/recurrent networks;
- Computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;
- The mean and standard deviation are calculated separately over the last certain number of dimensions;
- Unlike Batch Normalization and Instance Normalization, which apply a scalar scale and bias to each entire channel/plane with the affine option, Layer Normalization applies **per-element scale and bias**;
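The points above can be sketched as follows: statistics are taken over the last (feature) dimension of each example independently, and the learned `gamma`/`beta` have one entry per element, not per channel. This is an illustrative NumPy version, not the Transformer's actual code.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the last dimension of x.
    gamma and beta are per-element scale and bias (same shape as the last dim)."""
    mean = x.mean(axis=-1, keepdims=True)   # per-example statistics, no batch mixing
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(2).standard_normal((2, 5))  # (batch, features)
gamma, beta = np.ones(5), np.zeros(5)
out = layer_norm(x, gamma, beta)
```

Because each example is normalized on its own, the result is identical at batch size 1 and at batch size 100, which is what makes it a natural fit for variable-length sequence models.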

**Articles / posts**:

- Nvidia Jetson Nano released;
- The Bitter Lesson: more compute > smarter methods;
- Lighter LSTM? A PR in PyTorch was [closed](https://arxiv.org/pdf/1903.08023.pdf);
- Semantic information (class) + batch norm?
- Music completion - trained similarly to NLP models;
- Building data science workflows;
- Facebook builds … telegram?
- Chinese ML research will overtake the USA soon;
- Tumblr was 30% porn;
- Mobile anti-viruses are mostly fraud;
- Google covers its new GAN paper;
- Learning semantic embeddings using only similarity;
- Google investigates benefits of huge batch sizes;