2020 DS/ML digest 01

Posted by snakers41 on January 17, 2020


Maybe old news, but ALBERT is the same as BERT, but 5x smaller

  • Allocate the model’s capacity more efficiently
  • Input embeddings are factorized to be smaller than the hidden size => an 80% reduction in the parameters of the projection block
  • Parameter sharing => a 90% parameter reduction for the attention-feedforward block (a 70% reduction overall)
  • Only 12M parameters, an 89% parameter reduction compared to the BERT-base model with respectable performance
  • Then scale this smaller model
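The embedding-factorization arithmetic above can be sanity-checked with a quick back-of-the-envelope calculation. The configuration numbers below (vocab size, hidden size, factorized embedding size) are illustrative assumptions based on the usual BERT-base / ALBERT-base configs, not taken from the linked post:

```python
# Back-of-the-envelope check of ALBERT's factorized embedding trick.
# V, H, E are assumed typical values, not quoted from the article.

V, H, E = 30_000, 768, 128  # vocab, hidden size, factorized embedding size

bert_embedding_params = V * H            # one big V x H lookup table
albert_embedding_params = V * E + E * H  # small V x E lookup + E x H projection

reduction = 1 - albert_embedding_params / bert_embedding_params
print(f"BERT embedding block:   {bert_embedding_params:,} params")
print(f"ALBERT embedding block: {albert_embedding_params:,} params")
print(f"Reduction: {reduction:.0%}")  # in the same ballpark as the ~80% quoted above
```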

New efficient transformer from Google

  • For LSTM the context window covers from dozens to about a hundred words
  • Context window used by Transformer extends to thousands of words
  • Transformer is prohibitive for
    • Large input lengths (N**2 scaling)
    • Large memory footprint if you store all of the activations
  • Context windows of up to 1 million words, all on a single accelerator and using only 16GB of memory
  • Key ideas
    • Locality-sensitive-hashing (LSH)
    • Reversible residual layers
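The LSH idea can be sketched in a few lines (my own toy illustration of the principle, not Reformer's exact scheme): hash token vectors with random hyperplane projections, then restrict attention to tokens that land in the same bucket instead of computing all N**2 pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(x, n_hashes=4):
    """Hash each row of x into one of 2**n_hashes buckets."""
    # random hyperplanes; the sign pattern of the projections is the hash
    planes = rng.standard_normal((x.shape[1], n_hashes))
    bits = (x @ planes) > 0                            # (N, n_hashes) booleans
    return bits.astype(int) @ (2 ** np.arange(n_hashes))  # pack bits into an int

tokens = rng.standard_normal((16, 8))  # 16 token vectors, dim 8
buckets = lsh_buckets(tokens)

# attention would now be computed only inside each (small) bucket:
for b in np.unique(buckets):
    idx = np.where(buckets == b)[0]
    # ... softmax(Q[idx] @ K[idx].T) over this group only
```

Similar vectors project to the same side of most hyperplanes, so they tend to share a bucket — which is exactly what attention needs, since softmax is dominated by the closest query/key pairs.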

Also, I remember that FAIR did some (a bit deeper) optimizations of transformer modules last year as well


English ASR system comparison
Some real-life specs of commercial ASR systems:

  • Real-time oriented
  • Batch ASR
  • We tested this app - its quality is poor and it is very slow

Mostly PR piece on how voice transfer can be applied for mute people
Surprise-surprise, Google uses its on-device STT networks for a transcription app. It works only with Pixel phones
Facebook publishes Wav2Letter inference framework

Python / DS / coding / libraries

Facebook’s open source year in review
Python tips
Some python speed tricks
Some obvious python speed tricks
Software explosion makes people write bad code
Some naive pandas speed tricks
Awesome rant about developers
Pandas 1.0 released, release notes

  • Python 3.6+, Python 2 dropped
  • Major release
  • New documentation website
  • Experimental data types for booleans and strings
  • Experimental NA scalar
  • Using Numba in rolling.apply
  • Custom windows for rolling operations

Surprise-surprise - Canada is the best market for developers
Basic ideas on how to work with larger than RAM data
Some numba hacks (RU)
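The custom rolling windows from the pandas 1.0 notes above are exposed via `pandas.api.indexers.BaseIndexer` — you subclass it and return arbitrary per-row window bounds. A small sketch (the forward-looking indexer here is my own toy example):

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class ForwardIndexer(BaseIndexer):
    """Toy custom window: look `window_size` rows forward instead of backward."""
    def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
        # step=None keeps the signature compatible with newer pandas versions
        start = np.arange(num_values, dtype=np.int64)
        end = np.minimum(start + self.window_size, num_values)
        return start, end

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
fwd_sum = s.rolling(ForwardIndexer(window_size=2), min_periods=1).sum()
print(fwd_sum)  # each value summed with the one after it
```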


Yandex Search has superior facial recognition. Most likely this feature will be nerfed in the future
Semi-new idea - binarized neural networks
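The core trick behind binarized networks, in a toy numpy illustration of my own (not from the linked article): weights and activations are quantized to {-1, +1}, so a dot product collapses into XNOR + popcount on bit vectors — which is where the speed and memory wins come from:

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize(x):
    """Quantize a real-valued vector to {-1, +1}."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

w = binarize(rng.standard_normal(64))
a = binarize(rng.standard_normal(64))

# float-style dot product on the +-1 vectors...
dot = int(w.astype(np.int32) @ a.astype(np.int32))

# ...equals the XNOR/popcount formulation on {0, 1} bits:
# each matching sign contributes +1, each mismatch -1
wb, ab = (w > 0), (a > 0)
matches = np.count_nonzero(~(wb ^ ab))  # popcount of XNOR
print(dot, 2 * matches - len(w))        # the two agree
```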
Ad targeting hell (RU)
Knowledge debt
Transfer learning in medical imaging does not work:

  • Shallower models are sufficient to achieve good performance
  • Performance on Imagenet is not indicative of performance in medicine
  • Transfer learning => significant speedup in the time taken for the model to converge
  • Feature-independent benefit of pretraining - the weight scaling
  • Feature reuse primarily occurs in the lowest layers
  • Hybrid transfer learning - transfer some weights, just init others with random weights scaled properly

How the new portrait mode works in Google Pixel - dual pixel + 2 cameras + siamese networks
Brief review of self-supervised learning ideas in CV by fast.ai
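The hybrid transfer-learning bullet above can be sketched like this (my own reading of the idea, not the paper's exact recipe): reuse the lowest pretrained layers as-is, and re-initialize the rest with random weights rescaled to match the mean/std of the corresponding pretrained tensor — keeping the "weight scaling" benefit without feature reuse:

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_random_init(pretrained_w):
    """Random weights with the same mean/std as the pretrained tensor."""
    w = rng.standard_normal(pretrained_w.shape)
    return w * pretrained_w.std() + pretrained_w.mean()

# stand-in "pretrained" layer weights, lowest layer first
pretrained = [rng.standard_normal((16, 16)) * s for s in (0.1, 0.05, 0.02)]

transferred = pretrained[:1]                              # reuse lowest layer(s)
reinit = [scaled_random_init(w) for w in pretrained[1:]]  # scale-matched random
new_model_weights = transferred + reinit
```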

Deploy / devops

Tracing is a very interesting concept for large distributed apps
Deploying GPU apps with batching
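The batching pattern is simple enough to sketch (a generic server-side dynamic batcher, not the linked post's implementation): requests go into a queue, and a worker drains up to `max_batch` of them — waiting at most `timeout` seconds for stragglers — so the model runs once per batch instead of once per request:

```python
import queue
import threading

def batch_worker(q, model_fn, max_batch=8, timeout=0.01):
    """Drain requests from q and run model_fn on whole batches."""
    while True:
        item = q.get()                  # block for the first request
        if item is None:                # shutdown sentinel
            return
        batch = [item]
        while len(batch) < max_batch:   # opportunistically drain more requests
            try:
                nxt = q.get(timeout=timeout)
            except queue.Empty:
                break
            if nxt is None:
                q.put(None)             # keep the sentinel for the outer loop
                break
            batch.append(nxt)
        inputs, futures = zip(*batch)
        for fut, out in zip(futures, model_fn(list(inputs))):
            fut.put(out)                # hand each result back to its caller

# usage: each client enqueues (input, reply_queue) and blocks on reply_queue.get()
```

On a GPU this matters because per-call overhead dominates small inputs; one padded batch amortizes it across all waiting requests.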


A library for model distillation. Not sure if it is worth trying, even just for lulz