2020 DS/ML digest 04

Posted by snakers41 on March 30, 2020

Speech

Towards end-to-end speech - https://drive.google.com/file/d/1Rpob1-C223L9UWTiLJ6_Dy12mTA3YyTn/view
MLPerf has a working group on building a 100k-hour speech dataset

NLP

Finally, Google focuses on how to REDUCE compute for NLP model pre-training:

  • New pre-training task, called replaced token detection
  • Trains a bidirectional model (like an MLM) while learning from all input positions (like an LM)
  • Instead of corrupting the input by replacing tokens with “[MASK]” as in BERT, they replace some input tokens with incorrect, but somewhat plausible, fakes
  • This chart is very important
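
To make the replaced token detection setup concrete, here is a minimal sketch of how the corrupted input and the per-position labels could be built. It is not the authors' implementation: in the actual approach the fakes come from a small generator model, while here the hypothetical helper `corrupt_tokens` just samples a random different token from a toy vocabulary, and `replace_prob` is an assumed parameter.

```python
import random


def corrupt_tokens(tokens, vocab, replace_prob=0.15, seed=0):
    """Build a replaced-token-detection example.

    Returns (corrupted, labels), where labels[i] is 1 if position i
    was replaced with a fake token and 0 if it kept the original.
    Assumption: fakes are sampled uniformly from `vocab`; a real setup
    would sample them from a small masked language model instead.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Pick a fake that differs from the original token
            fake = rng.choice([v for v in vocab if v != tok])
            corrupted.append(fake)
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "cooked", "meal", "ate", "a", "dog"]
corrupted, labels = corrupt_tokens(tokens, vocab, replace_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

The discriminator then gets a training signal at every position (one binary label per token), not only at the roughly 15% of positions that are masked in BERT-style MLM training.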

Measuring Compositional Generalization - https://ai.googleblog.com/2020/03/measuring-compositional-generalization.html

NLP transfer learning almanac

ML in General

What each neuron does within a NN - https://distill.pub/2020/circuits/zoom-in/
The state of autonomous vehicles
Peltier devices for cooling - they work but are very inefficient - https://pc-01.tech/peltie/
Linux CLI complexity
Reducing python package footprint
Confidence calibration in ML
CT SCANS DO NOT WORK for detecting COVID-19
Asyncio vs trio https://habr.com/ru/company/barsgroup/blog/490872/
Very cool story of a failed autonomous truck company https://medium.com/starsky-robotics-blog/the-end-of-starsky-robotics-acb8a6a8a5f5
Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5 https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/
Python CI/CD using Docker, GitHub Actions and Makefiles?
Mesh TensorFlow for 3D high-res scans https://ai.googleblog.com/2020/02/ultra-high-resolution-image-analysis.html
Quote from the post: "Our solution is totally different: we implemented a data communication step called halo exchange. Before every convolution operation, each spatial partition exchanges (receives and sends) margins with its neighbors, effectively expanding the image segment at its margins."
Recent ML Paper review https://habr.com/ru/company/ods/blog/493016/
Google adds radar to their Pixel smartphones https://ai.googleblog.com/2020/03/soli-radar-based-perception-and.html
Fast and easy infinitely wide networks https://ai.googleblog.com/2020/03/fast-and-easy-infinitely-wide-networks.html
An alternative to embedding visualization - https://distill.pub/2020/grand-tour/
When Bloom filters don't bloom https://blog.cloudflare.com/when-bloom-filters-dont-bloom/
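
A side note on the halo exchange idea from the Mesh TensorFlow item in this section: below is a minimal single-process NumPy sketch of it. This is not the Mesh TensorFlow API. The helpers `conv2d_same` and `conv_with_halo` are hypothetical, the image is split only along rows, and the "exchange" is simulated by slicing neighbor partitions instead of sending data between devices. It assumes a kernel with an odd size of at least 3 and partitions that are at least as tall as the halo.

```python
import numpy as np


def conv2d_same(img, kernel):
    """Naive 'same' 2D cross-correlation with zero padding (odd kernel sizes)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


def conv_with_halo(img, kernel, n_parts):
    """Convolve row partitions independently after a simulated halo exchange."""
    halo = kernel.shape[0] // 2  # number of margin rows needed from each neighbor
    parts = np.array_split(img, n_parts, axis=0)
    width = img.shape[1]
    outputs = []
    for idx, part in enumerate(parts):
        # "Receive" margins: last rows of the upper neighbor, first rows of the
        # lower neighbor. At the image border there is no neighbor, so use zeros.
        top = parts[idx - 1][-halo:] if idx > 0 else np.zeros((halo, width))
        bottom = parts[idx + 1][:halo] if idx < len(parts) - 1 else np.zeros((halo, width))
        expanded = np.vstack([top, part, bottom])
        # Convolve the expanded segment, then drop the halo rows again
        outputs.append(conv2d_same(expanded, kernel)[halo:-halo])
    return np.vstack(outputs)


rng = np.random.default_rng(0)
img = rng.random((8, 6))
kernel = rng.random((3, 3))
# The partitioned result should match the convolution over the whole image
print(np.allclose(conv_with_halo(img, kernel, n_parts=4), conv2d_same(img, kernel)))
```

The point of the exchange is that each partition only needs a thin margin from its neighbors, so no single device ever has to hold the full image.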

Datasets

Russian news dataset https://github.com/ods-ai-ml4sg/proj_news_viz/releases/tag/data
Smithsonian museum image dataset https://www.si.edu/search?edan_q=cat&edan_fq=metadata_usage%3ACC0 OR media_usage%3ACC0
Google Taskmaster dialogue dataset https://research.google/tools/datasets/taskmaster-2/ - 17,289 dialogs in seven domains: restaurants (3276), food ordering (1050), movies (3047), hotels (2355), flights (2481), music (1602), and sports (3478)