2020 DS/ML digest 11

https://pics.spark-in.me/upload/4cf6e7aa0ff44be3713aaff16ce5cef5.jpg

Posted by snakers41 on September 23, 2020

Tech

Amazon’s profits, AWS and advertising - https://www.ben-evans.com/benedictevans/2020/9/6/amazons-profits
Robot roach - https://backyardbrains.com/products/roboroach
Github container registry https://github.blog/2020-09-01-introducing-github-container-registry/ - expensive AF
TSMC and Graphcore Prepare for AI Acceleration on 3nm
Oculus Quest 2, now $100 cheaper at $300

Speech

Well … our models have been released - https://github.com/snakers4/silero-models - like, share, repost!
We are about to be included in the torch.hub model list; Soumith Chintala said we made a great contribution - https://github.com/pytorch/hub/pull/153#pullrequestreview-493132338
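
If you want to try the models, here is a minimal loading sketch via torch.hub, following the entry points documented in the silero-models README at the time; treat the exact function and argument names as assumptions and check the repo for the current API:

    import torch

    device = torch.device('cpu')

    # entry point and argument names as per the silero-models README (assumed; see the repo)
    model, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                           model='silero_stt',
                                           language='en',
                                           device=device)
    (read_batch, split_into_batches, read_audio, prepare_model_input) = utils

    # transcribe a local wav file (the path is a placeholder)
    batch = read_batch(['speech_sample.wav'])
    inputs = prepare_model_input(batch, device=device)
    output = model(inputs)
    for example in output:
        print(decoder(example.cpu()))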

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

  • Arxiv
  • Masks the speech input in the latent space
  • Solves a contrastive task defined over a quantization of the latent representations which are jointly learned
  • Released under fairseq
  • 53k hours of unlabeled data
  • Ten minutes of labeled data => 5.7/10.1 WER on the noisy/clean test sets of Librispeech
  • All of Librispeech => 1.9/3.5 WER using a simple baseline model architecture
  • Objective:
    • Encodes speech audio via a multi-layer convolutional neural network and then masks spans of the resulting latent speech representations
    • Latent representations fed to a Transformer network to build contextualized representations
    • Model trained via a contrastive task where the true latent is to be distinguished from distractors (see the sketch after this list)
  • Two model configurations
    • BASE: 12 transformer blocks, dim 768, ffn 3,072 and 8 attn heads
    • Batches built by cropping 250k audio samples (or 15.6sec) from each example
    • Crops are batched together to not exceed 1.4m samples per GPU (5 audios? lol)
    • 64 V100 GPUs for 1.6 days, total batch size is 1.6h
    • LARGE 24 transformer, dim 1,024, ffn 4,096 and 16 attn heads
    • 320K audio samples (20sec) with a limit of 1.2M samples per GPU
    • 128 V100 GPUs over 2.3 days for Librispeech and 5.2 days for LibriVox
    • Total batch size is 2.7h
    • Dropout 0.1 in the Transformer, at the output of the feature encoder and the input to the quantization module.
    • Layers are dropped at a rate of 0.05 for BASE and 0.2 for LARGE
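
To make the objective above concrete, below is a heavily simplified sketch of a contrastive (InfoNCE-style) loss over masked time steps: the context network output at a masked position has to identify the true quantized latent among distractors sampled from other masked positions of the same utterance. This is only an illustration of the idea, not the fairseq implementation; the function name, shapes, temperature and distractor sampling are my own assumptions.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, targets, mask, num_distractors=10, temperature=0.1):
        """context: (B, T, D) transformer outputs; targets: (B, T, D) quantized latents;
        mask: (B, T) bool, True where the input was masked. Simplified illustration only."""
        losses = []
        for b in range(context.size(0)):
            masked_idx = mask[b].nonzero(as_tuple=True)[0]
            for t in masked_idx:
                # candidates: the true latent plus distractors drawn from
                # other masked time steps of the same utterance
                pool = masked_idx[masked_idx != t]
                if len(pool) == 0:
                    continue
                picks = pool[torch.randint(len(pool), (num_distractors,))]
                candidates = torch.cat([targets[b, t].unsqueeze(0), targets[b, picks]], dim=0)
                # cosine similarity between the context vector and every candidate
                logits = F.cosine_similarity(context[b, t].unsqueeze(0), candidates) / temperature
                # the true latent sits at index 0
                losses.append(F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
        return torch.stack(losses).mean()

    # toy shapes: batch of 2 utterances, 50 frames, dim 768
    ctx, tgt = torch.randn(2, 50, 768), torch.randn(2, 50, 768)
    mask = torch.rand(2, 50) < 0.5
    print(contrastive_loss(ctx, tgt, mask))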

ML / Blogs / Code

Google Pose Estimation on Mobile - https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html
Machine learning for colonoscopy - https://ai.googleblog.com/2020/08/using-machine-learning-to-detect.html
RPI baby-sitter - https://habr.com/ru/company/recognitor/blog/516232/
Nvidia to acquire ARM - https://digitstodollars.com/2020/09/14/nvida-to-acquire-arm-first-impressions/
Training GANs is still a cluster-fuck - https://twitter.com/theshawwn/status/1303431881761800193
The Hessian Penalty: A Weak Prior for Unsupervised Disentanglement - https://www.wpeebles.com/hessian-penalty
The Technology Behind Google’s Recent Improvements in Flood Forecasting - https://ai.googleblog.com/2020/09/the-technology-behind-our-recent.html
The medical AI floodgates open, at a cost of $1000 per patient - https://lukeoakdenrayner.wordpress.com/2020/09/06/the-medical-ai-floodgates-open-at-a-cost-of-1000-per-patient/
Writing More Idiomatic and Pythonic Code - https://martinheinz.dev/blog/32
MiniGPT and some hilarious commits by Karpathy - https://github.com/karpathy/minGPT
Google Maps. Google is now feeding satellite data much more directly back into its mapping products. Machine learning, of course. Link
OpenAI Licenses GPT-3 Technology to Microsoft https://openai.com/blog/openai-licenses-gpt-3-technology-to-microsoft/

New from fast.ai

Papers

ODS paper reviews:


Are we done with ImageNet? - https://arxiv.org/abs/2006.07159

  • Significantly more robust procedure for collecting human annotations of the ImageNet validation set
  • With these new labels => reassess the accuracy of recently proposed ImageNet classifiers
  • Their gains turn out to be substantially smaller than those reported on the original labels
  • ImageNet labels are no longer the best predictors of this independently-collected set


Descending through a Crowded Valley — Benchmarking Deep Learning Optimizers

  • http://arxiv.org/abs/2007.01547
  • Optimizer’s performance highly depends on the test problem
  • Evaluating 14 optimizers on 8 deep learning problems using 4 different schedules
  • No clear winner. Some optimizers are frequently decent, but they also generally perform similarly, switching their relative positions in the ranking
  • You can expect to do about equally well by taking almost any method from the benchmark and tuning it, as you would by investing the same computational resources into running a selection of optimizers in their default (“one-shot”) settings and picking the winner (see the sketch below)
  • Three different tuning budgets and four selected learning rate schedules
  • Non-adaptive optimization methods require more tuning
  • Adaptive methods have well-working default parameters
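
As a trivial illustration of the “one-shot” protocol (my own toy example, not the paper’s benchmark code), the sketch below trains the same small model with several torch.optim optimizers at their default hyperparameters and picks the winner on a validation split:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    # toy regression data as a stand-in for a real benchmark problem
    X, y = torch.randn(512, 20), torch.randn(512, 1)
    X_val, y_val = torch.randn(128, 20), torch.randn(128, 1)

    def run(optimizer_cls, steps=200):
        model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
        # default hyperparameters only ("one-shot"); note that e.g. SGD would need an explicit lr
        opt = optimizer_cls(model.parameters())
        loss_fn = nn.MSELoss()
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(X), y).backward()
            opt.step()
        with torch.no_grad():
            return loss_fn(model(X_val), y_val).item()

    candidates = [torch.optim.Adam, torch.optim.AdamW, torch.optim.RMSprop, torch.optim.Adadelta]
    scores = {cls.__name__: run(cls) for cls in candidates}
    print(min(scores.items(), key=lambda kv: kv[1]))  # the optimizer with the best validation loss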


Understanding Deep Learning on Controlled Noisy Labels

  • https://ai.googleblog.com/2020/08/understanding-deep-learning-on.html
  • https://google.github.io/controlled-noisy-web-labels/index.html
  • Studying the impact of the noise level — the percentage of examples with incorrect labels in the dataset — on model performance
  • Web label noise
  • Mini-ImageNet, for coarse-grained image classification, and Stanford Cars, for fine-grained image classification. We gradually replace clean images in these datasets with incorrectly labeled images gathered from the web, following standard methods for the construction of synthetic datasets
  • 213k noisy annotated images
  • Create 10 different datasets with progressively higher levels of label noise (from 0% clean data to 80% data with erroneous labels)
  • Goal is to assign high weights to correctly labeled examples and zero weights to incorrectly labeled examples
  • Use importance sampling to select another example in the same mini-batch according to this weight distribution (see the sketch below)
  • New Findings
    • Deep neural networks generalize much better on web label noise
    • Deep neural networks may NOT learn patterns first when trained on web label noise
    • ImageNet architectures generalize on noisy training labels when the networks are fine-tuned
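
A minimal sketch of the weighting-plus-importance-sampling step (a simplified, MentorMix-flavoured illustration under my own assumptions, not Google’s released code): per-example weights are derived from the loss, and a partner example is then drawn from the same mini-batch with probability proportional to those weights, e.g. for mixing:

    import torch
    import torch.nn.functional as F

    def weights_and_partners(logits, labels, temperature=1.0):
        """Toy illustration: low-loss examples get high weights (assumed clean),
        then a partner index per example is drawn via importance sampling
        over the weight distribution inside the mini-batch."""
        per_example_loss = F.cross_entropy(logits, labels, reduction='none')
        # smaller loss -> larger weight; softmax over the negative loss
        weights = F.softmax(-per_example_loss / temperature, dim=0)
        # importance sampling: pick a partner example according to the weights
        partners = torch.multinomial(weights, num_samples=logits.size(0), replacement=True)
        return weights, partners

    logits = torch.randn(8, 10)          # mini-batch of 8, 10 classes
    labels = torch.randint(0, 10, (8,))
    w, idx = weights_and_partners(logits, labels)
    print(w, idx)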


REALM: Integrating Retrieval into Language Representation Models

  • https://ai.googleblog.com/2020/08/realm-integrating-retrieval-into.html
  • In typical large language models, world knowledge is captured in an abstract way in the model weights;
  • … REALM models instead retrieve textual world knowledge explicitly from raw text documents, instead of memorizing all the knowledge in the model parameters;
  • The fill-in-the-blank task, called masked language modeling, has been widely used for pre-training language representation models;
  • A knowledge retriever first retrieves another piece of text from an external document collection as the supporting knowledge — in our experiments, we use the Wikipedia text corpus — and then feeds this supporting text as well as the original text into a language representation model;
  • The fill-in-the-blank task can be used to train the knowledge retriever indirectly, without any human annotations;
  • The selection of the best document is formulated as maximum inner product search (MIPS); see the sketch after this list;
  • In REALM, we use the ScaNN package to conduct MIPS efficiently, which makes finding the maximum inner product value relatively cheap, given that the document vectors are pre-computed;
  • Updating document vectors every 500 training steps, instead of every step, achieves good performance and keeps training tractable;
  • The effectiveness of REALM is evaluated by applying it to open-domain question answering (Open-QA)
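
The MIPS step itself is easy to picture: with pre-computed document vectors, retrieval is just an argmax of inner products between the query embedding and the document matrix. A brute-force numpy sketch is below (ScaNN does the same thing approximately and much faster at scale; the array names and sizes here are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    doc_vectors = rng.normal(size=(100_000, 128)).astype(np.float32)  # pre-computed document embeddings
    query = rng.normal(size=(128,)).astype(np.float32)                # embedding of the masked sentence

    # maximum inner product search, brute force: score every document and take the top-k
    scores = doc_vectors @ query
    top_k = np.argsort(-scores)[:5]
    print(top_k, scores[top_k])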

Understanding View Selection for Contrastive Learning

  • https://ai.googleblog.com/2020/08/understanding-view-selection-for.html
  • If one doesn’t have annotated labels readily available, how does one select the views to which the representations should be invariant?
  • One should reduce the mutual information between views while keeping task-relevant information intact
  • Data augmentation is a way to reduce mutual information, and increasing data augmentation indeed leads to decreasing mutual information while improving downstream classification accuracy
  • Views that yield the best results should discard as much information in the input as possible except for the task-relevant information (e.g., object labels), which the authors call the InfoMin principle
  • Sample different patches from the same images, and reduce their mutual information simply by increasing the distance between the patches (see the sketch below)
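
To illustrate the last point, a toy sketch (my own simplification, not the paper’s code): two patches are cropped from the same image at a controllable offset and an InfoNCE loss is computed on their embeddings; increasing the offset is one way to lower the mutual information between the two views:

    import torch
    import torch.nn.functional as F

    def two_patches(images, patch=32, offset=16):
        """Crop two patches per image whose top-left corners are `offset` pixels apart (toy version)."""
        v1 = images[:, :, :patch, :patch]
        v2 = images[:, :, offset:offset + patch, offset:offset + patch]
        return v1, v2

    def info_nce(z1, z2, temperature=0.1):
        """Standard InfoNCE between two batches of embeddings; matching rows are positives."""
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature
        targets = torch.arange(z1.size(0))
        return F.cross_entropy(logits, targets)

    encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
    images = torch.rand(16, 3, 96, 96)                  # toy batch
    v1, v2 = two_patches(images, patch=32, offset=48)   # larger offset -> less shared content
    loss = info_nce(encoder(v1), encoder(v2))
    print(loss.item())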


DeepSpeed: Extreme-scale model training for everyone


Meet the startup that helped Microsoft build the world of Flight Simulator


On-device Supermarket Product Recognition - https://ai.googleblog.com/2020/07/on-device-supermarket-product.html:

  • A proper and cool pipeline of models
  • The most novel part is using metric learning to find objects (see the sketch after the quote below):

    The regions of interest (ROIs) from the detector are sent to the embedder model, which then computes a 64-dimension embedding. The embedder model is initially trained from a large classification model (i.e., the teacher model, based on NASNet), which spans tens of thousands of classes. An embedding layer is added in the model to project the input image into an ‘embedding space’, i.e., a vector space where two points being close means that the images they represent are visually similar (e.g., two images show the same product). Analyzing only the embeddings ensures that the model is flexible and does not need to be retrained every time it is to be expanded to new products. However, because the teacher model is too large to be used directly on-device, the embeddings it generates are used to train a smaller, mobile-friendly student model that learns to map the input images to the same points in the embedding space as the teacher network. Finally, we apply principal component analysis (PCA) to reduce the dimensionality of the embedding vectors from 256 to 64, streamlining the embeddings for storing on-device.
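
A rough scikit-learn sketch of that index-and-look-up idea (purely illustrative; the dimensions follow the quote above, everything else is assumed): ROI embeddings are reduced with PCA and matched against a product database via nearest-neighbour search.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)

    # stand-in for embeddings produced by the embedder for the product catalogue
    catalogue_embeddings = rng.normal(size=(5_000, 256)).astype(np.float32)
    catalogue_labels = np.arange(5_000)  # one id per known product

    # 256 -> 64 dimensions, as in the quote above
    pca = PCA(n_components=64).fit(catalogue_embeddings)
    index = NearestNeighbors(n_neighbors=1, metric='euclidean').fit(pca.transform(catalogue_embeddings))

    # at inference time: embed the detector ROI, project it, look up the closest product
    roi_embedding = rng.normal(size=(1, 256)).astype(np.float32)
    _, nearest = index.kneighbors(pca.transform(roi_embedding))
    print('predicted product id:', catalogue_labels[nearest[0, 0]])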


Language-Agnostic BERT Sentence Embedding - https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html

  • Language-agnostic cross-lingual sentence embeddings for 109 languages
  • Trained on 17B monolingual sentences and 6B bilingual sentence pairs using MLM and TLM pre-training
  • A BERT-like architecture: a 12-layer transformer with a 500k token vocabulary, pre-trained using MLM and TLM on 109 languages
  • Evaluated using the Tatoeba corpus. Average accuracy (%) on Tatoeba datasets:

      Model      14 Langs   36 Langs   82 Langs   All Langs
      m~USE*     93.9       -          -          -
      LASER      95.3       84.4       75.9       65.5
      LaBSE      95.3       95.0       87.3       83.7
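
If you just want to play with the embeddings, here is a minimal sketch via the sentence-transformers port (the 'LaBSE' model name is what that port is published under, which is an assumption on my side; the official release is a TF Hub module):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # 'LaBSE' is the name used by the sentence-transformers port (assumption;
    # the official model is released as a TF Hub module)
    model = SentenceTransformer('LaBSE')

    # the second sentence is the Russian translation of the first one
    sentences = ['dogs are great', 'собаки прекрасны', 'the weather is terrible']
    emb = model.encode(sentences)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    # cosine similarities: the English/Russian pair should score the highest
    print(np.round(emb @ emb.T, 3))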


FAIR has also updated their LASER sentence encoder tool