2020 DS/ML digest 07


Posted by snakers41 on April 30, 2020

Misc

Cliqz has been shut down, but their blog is awesome anyway - https://0x65.dev/
RabbitMQ protocol description - all you need to know about messaging systems: https://www.rabbitmq.com/tutorials/amqp-concepts.html (a minimal pika sketch follows this list)
The VR Winter - https://www.ben-evans.com/benedictevans/2020/5/8/the-vr-winter
This is just epic - https://github.com/donnemartin/system-design-primer
“New” online IDE (VS Code) inside GitHub - https://github.blog/2020-05-06-new-from-satellite-2020-github-codespaces-github-discussions-securing-code-in-private-repositories-and-more/#codespaces
GitHub sponsors out of beta - https://github.blog/2020-05-12-github-sponsors-is-out-of-beta-for-sponsored-organizations/
Ben Evans as cool as ever - https://mailchi.mp/37cd98ba9239/benedicts-newsletter-no-451137?e=b7fff6bc1c
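To make the AMQP vocabulary from the RabbitMQ tutorial above concrete, here is a minimal publish sketch with the pika client (the queue name is made up for illustration):

    import pika

    # Producer side: declare a queue and publish through the default exchange,
    # which routes by routing_key straight to the queue of the same name.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    channel.queue_declare(queue="task_queue", durable=True)
    channel.basic_publish(
        exchange="",                  # default (nameless) direct exchange
        routing_key="task_queue",     # with the default exchange this is just the queue name
        body=b"Hello, world!",
        properties=pika.BasicProperties(delivery_mode=2),  # mark the message as persistent
    )
    connection.close()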

Speech

English ASR comparison, part 3 - https://blog.timbunce.org/2020/05/17/a-comparison-of-automatic-speech-recognition-asr-systems-part-3/
Google Duplex in Spanish - https://venturebeat.com/2020/04/29/google-duplex-now-speaks-spanish-starts-calling-businesses-in-spain-to-update-hours/
New compact LibriSpeech CNN from Google - http://arxiv.org/abs/2005.03191

New TTS from Facebook - https://ai.facebook.com/blog/a-highly-efficient-real-time-text-to-speech-system-deployed-on-cpus/

  • Cool, but looks like very custom engineering that cannot easily be reproduced

New flow-based model from Nvidia to replace Tacotron


NLP

Biggest typo correction corpus ever https://github.com/mhagiwara/github-typo-corpus - someone stole my idea!
This word does not exist - https://www.thisworddoesnotexist.com/

ML

NumPy on GPU - JAX - https://sjmielke.com/jax-purify.htm (minimal example below)
A new go-to attention module for CNNs? - http://arxiv.org/abs/1910.03151
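A minimal sketch of the “NumPy on GPU” point about JAX above: jax.numpy mirrors the NumPy API, jit compiles to XLA, and grad differentiates plain Python functions (the toy loss is just for illustration):

    import jax
    import jax.numpy as jnp

    def mse(w, x, y):
        # ordinary NumPy-style code; runs on GPU/TPU if one is available
        return jnp.mean((x @ w - y) ** 2)

    grad_mse = jax.jit(jax.grad(mse))   # compiled gradient w.r.t. the first argument
    w = jnp.zeros(3)
    x = jnp.ones((8, 3))
    y = jnp.ones(8)
    print(grad_mse(w, x, y))

And a rough PyTorch sketch of the channel attention block from the paper above (ECA): global average pooling followed by a cheap 1D convolution across channels; the kernel size of 3 is just an illustrative default:

    import torch
    import torch.nn as nn

    class ECA(nn.Module):
        """Efficient channel attention: gate channels with a tiny 1D conv, no bottleneck."""
        def __init__(self, k_size: int = 3):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)

        def forward(self, x):                                   # x: (B, C, H, W)
            y = self.pool(x)                                    # (B, C, 1, 1)
            y = self.conv(y.squeeze(-1).transpose(-1, -2))      # convolve across the channel axis
            y = torch.sigmoid(y.transpose(-1, -2).unsqueeze(-1))
            return x * y                                        # broadcast the gates over H and W

    print(ECA()(torch.randn(2, 64, 16, 16)).shape)              # torch.Size([2, 64, 16, 16])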

A much faster JSON library - https://github.com/ultrajson/ultrajson
Using Neural Networks to Find Answers in Tables - https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html
Repeating data to train NN faster - https://ai.googleblog.com/2020/05/speeding-up-neural-network-training.html
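The “repeating data” trick above is data echoing: when the input pipeline is the bottleneck, reuse each already-loaded batch a few times instead of waiting for the next one. A minimal sketch of the idea (the echo factor of 2 and the loader / step names are placeholders):

    def echo_batches(loader, echo_factor=2):
        """Data echoing sketch: yield each loaded batch `echo_factor` times,
        so the accelerator is not idle while the input pipeline catches up.
        (The paper also studies example-level echoing with a shuffle buffer.)"""
        for batch in loader:
            for _ in range(echo_factor):
                yield batch

    # for batch in echo_batches(train_loader, echo_factor=2):
    #     train_step(batch)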

OpenAI Jukebox - https://openai.com/blog/jukebox/

  • OpenAI continues to pump out hype bait
  • CNN encoder, hierarchical VQ-VAE; 3 compression levels (8x, 32x, and 128x) with a codebook size of 2048 at each level
  • A cascade of Transformers generates codes from top to bottom level, after which the bottom-level decoder can convert them to raw audio
  • Each of these models has 72 layers of factorized self-attention on a context of 8192 codes, which corresponds to approximately 24 seconds, 6 seconds, and 1.5 seconds of raw audio at the top, middle and bottom levels, respectively (a quick sanity check of these numbers follows this list)
  • It takes approximately 9 hours to fully render one minute of audio through their models
  • Of course, OpenAI does not publish its scraping scripts and aligned datasets …
  • OpenAI’s existence looks like an elaborate hoax to justify astronomical ML expenditure
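A quick sanity check of the context lengths quoted above, assuming 44.1 kHz raw audio:

    # 8192 codes of context at the three compression rates
    sample_rate = 44_100
    for level, hop in [("top", 128), ("middle", 32), ("bottom", 8)]:
        seconds = 8192 * hop / sample_rate
        print(f"{level}: {seconds:.1f} s")   # ~23.8 s, ~5.9 s and ~1.5 s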


CRFs in PyTorch - useful for hierarchical tagging tasks with heavy class imbalance and obvious class relations
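No library is named above; one option is the pytorch-crf package, which gives a linear-chain CRF layer on top of per-token emissions (the tag count and shapes below are made up):

    # pip install pytorch-crf
    import torch
    from torchcrf import CRF

    num_tags, batch, seq_len = 5, 2, 7
    crf = CRF(num_tags, batch_first=True)

    emissions = torch.randn(batch, seq_len, num_tags)   # e.g. logits from an LSTM / transformer head
    tags = torch.randint(num_tags, (batch, seq_len))

    loss = -crf(emissions, tags)        # negative log-likelihood of the tag sequences
    best_paths = crf.decode(emissions)  # Viterbi decoding: a list of tag lists
    print(loss.item(), best_paths)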


Optimizing Multiple Loss Functions with Loss-Conditional Training
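The idea of that post in a few lines: instead of training one model per loss weighting, sample the weights for each batch, feed them to the network as an extra conditioning input, and optimize the correspondingly weighted loss. A rough sketch (the toy model and the way conditioning is injected are my simplification, not theirs):

    import torch
    import torch.nn as nn

    class ConditionedModel(nn.Module):
        """Toy model that receives the loss weights as an extra input."""
        def __init__(self, in_dim=32, out_dim=32, n_losses=2):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim + n_losses, 64),
                                     nn.ReLU(),
                                     nn.Linear(64, out_dim))

        def forward(self, x, weights):
            cond = weights.expand(x.size(0), -1)          # broadcast the weights over the batch
            return self.net(torch.cat([x, cond], dim=1))

    def training_step(model, optimizer, x, target, loss_fns):
        weights = torch.rand(1, len(loss_fns))            # sample a random weighting per batch
        weights = weights / weights.sum()
        out = model(x, weights)
        loss = sum(w * fn(out, target) for w, fn in zip(weights.squeeze(0), loss_fns))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()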


Dask hello world parallel quick-start - https://habr.com/ru/post/499086/ (a toy example follows this block)
Plain chart templating in Python example - https://habr.com/ru/post/501246/
Now it is possible to learn from graphs without supervision - https://ai.googleblog.com/2020/05/understanding-shape-of-large-scale-data.html
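A hello-world for the Dask post above: dask.delayed builds a task graph lazily, and compute() executes it in parallel:

    import dask

    @dask.delayed
    def inc(x):
        return x + 1

    @dask.delayed
    def add(a, b):
        return a + b

    total = add(inc(1), inc(2))   # nothing has run yet, this is just a task graph
    print(total.compute())        # 5 - the two inc() calls can execute in parallel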

New Nvidia GPUs - https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ - TLDR +25% apples to apples
Applying Bayes' rule to hyper-parameter search - https://distill.pub/2020/bayesian-optimization/ (TLDR: an expertly crafted grid / binary search is not worse; a scikit-optimize sketch follows this block)
OpenAI (or ClosedAI?) writes about efficiency … and does nothing, lol https://openai.com/blog/ai-and-efficiency/
Ensembling can be useful in production - https://arxiv.org/pdf/2005.00570.pdf
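For the Bayesian optimization write-up above, a minimal sketch with scikit-optimize (the objective and search space are toy stand-ins):

    from skopt import gp_minimize
    from skopt.space import Real

    def objective(params):
        lr, weight_decay = params
        # stand-in for "train a model and return the validation metric"
        return (lr - 1e-3) ** 2 + weight_decay

    result = gp_minimize(
        objective,
        dimensions=[Real(1e-5, 1e-1, prior="log-uniform"),   # learning rate
                    Real(1e-6, 1e-2, prior="log-uniform")],  # weight decay
        n_calls=20,
        random_state=0,
    )
    print(result.x, result.fun)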

Open-Sourcing BiT: Exploring Large-Scale Pre-training for Computer Vision - https://ai.googleblog.com/2020/05/open-sourcing-bit-exploring-large-scale.html

  • They release models in PyTorch and JAX =)
  • ILSVRC-2012 (1.28M images with 1000 classes), ImageNet-21k (14M images with ~21k classes) and JFT (300M images with ~18k classes)
  • In order to profit from more data, one also needs to increase model capacity
  • Training duration becomes crucial
  • Replacing batch normalization with group normalization is beneficial (a minimal swap is sketched after this list)
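A minimal PyTorch illustration of the last point, i.e. swapping batch normalization for group normalization in a conv block (channel and group counts are arbitrary; the paper actually pairs group normalization with weight standardization, omitted here):

    import torch.nn as nn

    def conv_block(in_ch, out_ch, groups=32):
        # Normalization no longer depends on batch statistics, which helps with
        # the small per-device batches used in large-scale pre-training.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.GroupNorm(num_groups=groups, num_channels=out_ch),   # instead of nn.BatchNorm2d(out_ch)
            nn.ReLU(inplace=True),
        )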


Float formats for DL - https://medium.com/@moocaholic/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407
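A quick way to see the trade-offs from that post directly in PyTorch (bfloat16 support needs a reasonably recent build):

    import torch

    for dtype in (torch.float32, torch.float16, torch.bfloat16):
        info = torch.finfo(dtype)
        print(dtype, "bits:", info.bits, "eps:", info.eps, "max:", info.max)

    # bfloat16 keeps float32's exponent range but has few mantissa bits;
    # float16 has more precision but overflows much earlier:
    print(torch.tensor(1e20, dtype=torch.bfloat16))   # representable
    print(torch.tensor(1e20, dtype=torch.float16))    # inf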

Turns out PyTorch has some non-core features?


Datasets

CV meta learning dataset - https://ai.googleblog.com/2020/05/announcing-meta-dataset-dataset-of.html

  • Existing approaches have trouble leveraging heterogeneous training data sources
  • Some models are more capable than others of exploiting additional data at test time
  • The adaptation algorithm of a meta-learner is more heavily responsible for its performance than the fact that it is trained end-to-end (i.e. meta-trained)