2020 DS/ML digest 10

Posted by snakers41 on August 17, 2020

Ben Evans is making his email digest a paid feature. RIP.
No more direct links, just emails.
I wonder why. He should not be strapped for cash …

Regulating technology - https://www.ben-evans.com/benedictevans/2020/7/23/regulating-technology
A great review on what Tiktok is and how it works - https://www.eugenewei.com/blog/2020/8/3/tiktok-and-the-sorting-hat
COVID Google trends in UK - https://www.thinkwithgoogle.com/intl/en-gb/marketing-resources/content-marketing/story-uk-lockdown-through-google-trends/

Speech

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations


Fast Transformers with Clustered Attention

  • Paper http://arxiv.org/abs/2007.04825
  • Blog https://clustered-transformers.github.io./blog/index.html
  • Code https://github.com/idiap/fast-transformers / https://fast-transformers.github.io/
  • Clustered attention: instead of computing attention for every query, it groups queries into clusters and computes attention only for the centroids
  • Tested on WSJ and Switchboard
  • Supposedly speeds up convergence on a given compute budget (!!! a correct metric!!!)
  • Not another LibriSpeech overfitting exercise, looks like they use only 1 GPU
  • Looks like they promise a slight WER improvement with 20% faster models and the same performance with 40% faster models
  • Their library requires additional C++ libraries
  • Looks like they just use transformer layers + MEL filterbanks, i.e. there is no stride in their model other than STFT stride
  • This kind of devalues their results - you can achieve FASTER speed-ups with simpler means
  • The speed advantage is visible for sequences over 1000 - 2000 items. 512 is the standard length for transformers
  • As abstract research this is very cool, but for ASR this is very limited
  • General convergence charts are disappointing (see Appendix), looks like their “compute budget” is just some arbitrary inference time limit, i.e. their WER equivalence figures are cooked
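
To make the centroid idea above concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: plain k-means stands in for whatever clustering the paper uses, and all function names (`attend`, `kmeans`, `clustered_attention`) are made up for illustration. Each query simply reuses the attention output computed for its cluster centroid.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))


def kmeans(points, n_clusters, n_iters=10, seed=0):
    """Plain Lloyd's k-means (an assumption, used here as a stand-in)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, n_clusters)]
    for _ in range(n_iters):
        assignment = [nearest(p, centroids) for p in points]
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Re-assign once more so the assignment matches the final centroids.
    return centroids, [nearest(p, centroids) for p in points]


def clustered_attention(queries, keys, values, n_clusters):
    """Compute attention once per query cluster and share it with members."""
    centroids, assignment = kmeans(queries, n_clusters)
    per_centroid = [attend(c, keys, values) for c in centroids]
    return [per_centroid[a] for a in assignment]
```

With N queries, M keys and C clusters, the attention part needs C x M dot products instead of N x M, plus the clustering cost. That is roughly why the gain only shows up on long sequences.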

ML

AutoML-Zero: Evolving Code that Learns https://ai.googleblog.com/2020/07/automl-zero-evolving-code-that-learns.html

  • A preliminary work
  • Genetic algorithm tries to write code to solve an ML task
  • The code is evolved to produce the final algorithm
  • This final algorithm includes techniques such as:
    • Noise injection as data augmentation
    • Gradient normalization
    • Weight averaging
  • Improvement over the baseline also transfers to datasets that are not used during search
  • When we reduce the amount of data, the noisy ReLU emerges, which helps with regularization
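
For reference, here is what the three techniques listed above look like when written by hand for a toy linear model. This is a hand-made sketch, not the evolved program from the post; `train_linear`, `mse` and every hyperparameter are made up for illustration.

```python
import math
import random


def train_linear(data, n_steps=2000, lr=0.05, noise_std=0.05, seed=0):
    """SGD on a linear model with three hand-written tricks.

    `data` is a list of (features, target) pairs. Returns averaged weights.
    """
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    w_avg = [0.0] * dim
    for step in range(1, n_steps + 1):
        x, y = rng.choice(data)
        # 1. Noise injection: perturb the inputs as a cheap data augmentation.
        x_noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
        err = sum(wi * xi for wi, xi in zip(w, x_noisy)) - y
        grad = [err * xi for xi in x_noisy]
        # 2. Gradient normalization: rescale the gradient to unit length.
        norm = math.sqrt(sum(g * g for g in grad))
        if norm > 0:
            grad = [g / norm for g in grad]
        w = [wi - lr * g for wi, g in zip(w, grad)]
        # 3. Weight averaging: keep a running mean of all weight iterates.
        w_avg = [a + (wi - a) / step for a, wi in zip(w_avg, w)]
    return w_avg


def mse(weights, data):
    """Mean squared error of a linear model on `data`."""
    return sum((sum(wi * xi for wi, xi in zip(weights, x)) - y) ** 2
               for x, y in data) / len(data)
```

Usage is just `train_linear(data)` on a list of `(features, target)` pairs. The averaged weights are returned instead of the last iterate because, with unit-length gradient steps, the last iterate keeps jittering around the optimum.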

New MLperf is out https://mlperf.org/training-results-0-7/
Shortcuts: How Neural Networks Love to Cheat - https://thegradient.pub/shortcuts-neural-networks-love-to-cheat/
PyTorch 1.6 https://github.com/pytorch/pytorch/releases/tag/v1.6.0
Why You Should Do NLP Beyond English https://ruder.io/nlp-beyond-english/
NLP status tracker https://nlpprogress.com/
Challenges of Comparing Human and Machine Perception https://thegradient.pub/challenges-of-comparing-human-and-machine-perception/
How ML progresses in leaps - Nvidia engineers are essentially promoting their new GPUs by pushing mixed precision in PyTorch - https://github.com/pytorch/pytorch/issues/25081
An example using RPC and DDP - https://github.com/pytorch/examples/blob/master/distributed/rpc/ddp_rpc/main.py
Nvidia negotiating to buy ARM, lol
Trouble at Intel - https://digitstodollars.com/2020/07/28/trouble-at-intel/
Live HDR+ and Dual Exposure Controls on Pixel 4 and 4a https://ai.googleblog.com/2020/08/live-hdr-and-dual-exposure-controls-on.html
How GPT-3 works - http://jalammar.github.io/how-gpt3-works-visualizations-animations/
How Facebook looks for security issues in Python - https://engineering.fb.com/security/pysa/
Truth about satellite images - https://medium.com/@joemorrison/the-commercial-satellite-imagery-business-model-is-broken-6f0e437ec29d
Synthetic gradients - https://www.youtube.com/watch?v=1z_Gv98-mkQ&feature=youtu.be
Apple has effectively killed the Identifier for Advertisers (IDFA) by forcing developers to request permission for tracking before accessing it - https://www.revenuecat.com/blog/skadnetwork-and-ios-14-privacy-changes
Neural Radiance Fields for Unconstrained Photo Collections https://nerf-w.github.io/
Trainable autoencoder for clustering - https://www.dlology.com/blog/how-to-do-unsupervised-clustering-with-keras/

Advanced SQLAlchemy Features You Need To Start Using - https://martinheinz.dev/blog/28

  • Tbh, ORM is overrated and leads to more hassle and more database load than necessary
  • On the other hand, the proper way (writing stored procedures) may be too much hassle sometimes


A vector search library from Google

Training GANs - From Theory to Practice

Python

Python logging mixed with tracing - https://eliot.readthedocs.io/en/stable/quickstart.html

Other links

https://github.com/iodide-project/pyodide
https://johanneskopf.de/publications/pixelart/
https://medium.com/swlh/sick-of-javascript-just-use-browser-python-4b9679efe08b