2020 DS/ML digest 02

Posted by snakers41 on February 4, 2020

Speech

A LibriVox subset with speaker and book labels, but without STT labels - https://ai.facebook.com/tools/libri-light

Scaling Up Online Speech Recognition Using ConvNets

  • link
  • Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC)
  • Asymmetric padding for convolutions:
    • Adds more (zero) padding at the start of the input
    • Reduces the dependency of the current output on the future input and hence reduces the latency of the model (see the padding sketch after this list)
  • Architecture
    • Two 15-channel, three 19-channel, four 23-channel, and five 27-channel TDS blocks
    • Separated by 1-D convolution layers which increase the number of output channels / perform subsampling
  • By extending the beam search graph, they allow for correction of partial transcriptions generated earlier
  • Pruning:
    • only top-K (e.g. K = 50) tokens according to the acoustic model score
    • only the blank symbol is proposed if its posterior probability is larger than 0.95
  • Standalone, modular platform to perform the full online inference procedure
    • 16-bit floating point FBGEMM
  • Dataset
    • 1 million utterances (∼13.7K hours) of in-house English videos publicly shared by users
  • Training setup
    • wav2letter++
    • 64 GPUs for each experiment
    • 80-dim log mel-scale filter banks as input features
    • with STFTs computed on 25ms Hamming windows strided by 10ms (see the feature sketch after this list)
    • 5000 sub-word tokens generated from SentencePiece toolkit
  • Some benchmarks (CPUs with 18 physical cores and 64 GB of RAM)
  • Limiting future context from 5s to 250ms costs only a ∼4% relative WER degradation
  • New ideas on how to manage concurrency - chunkify into small pieces?
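
On the asymmetric padding idea above - a minimal PyTorch sketch (not the paper's code; shapes and channel counts are made up) of how placing most of the zero padding on the left limits how much future context each output frame sees:

```python
import torch
import torch.nn.functional as F

# Toy example: 1-D convolution over a spectrogram-like input.
# Shapes and channel counts are made up for illustration.
batch, channels, frames = 1, 80, 100
x = torch.randn(batch, channels, frames)

kernel_size = 9
weight = torch.randn(80, 80, kernel_size)  # (out_ch, in_ch, kernel)

# Symmetric padding: each output frame sees 4 future frames.
y_sym = F.conv1d(F.pad(x, (4, 4)), weight)

# Asymmetric padding: 7 zeros on the left, 1 on the right, so each
# output frame depends on at most 1 future frame - less future
# context means lower streaming latency.
y_asym = F.conv1d(F.pad(x, (7, 1)), weight)

print(y_sym.shape, y_asym.shape)  # both (1, 80, 100)
```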
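
On the input features above - the described front-end (80-dim log-mel filter banks, 25 ms Hamming windows, 10 ms stride) roughly corresponds to the following torchaudio transform, assuming 16 kHz audio (the sample rate is my assumption):

```python
import torch
import torchaudio

# 80-dim log-mel filter banks from 25 ms Hamming windows, 10 ms stride.
# At 16 kHz: 25 ms = 400 samples, 10 ms = 160 samples.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,
    win_length=400,
    hop_length=160,
    n_mels=80,
    window_fn=torch.hamming_window,
)

waveform = torch.randn(1, 16000)             # 1 second of fake audio
features = torch.log(mel(waveform) + 1e-6)   # (1, 80, ~101 frames)
print(features.shape)
```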

From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition

  • Link
  • A follow-up to this paper - one more method that avoids relying on phonetic knowledge, even in English!
  • An alternative to end-to-end ASR; it establishes that hybrid systems can be improved by dropping the reliance on phonetic knowledge
  • So the idea is that using context-dependent graphemes is better than using phonemes
  • Some info on how the datasets are collected
  • This in-house anonymized dataset is collected through crowd-sourcing. It consists of utterances recorded via mobile devices where crowd-sourced workers ask a smart assistant to carry out certain actions, such as calling a friend, playing music, or getting weather information. The training set comprises 15.7M utterances (12.5K hours) from 20K speakers. The development set for hyper-parameter tuning consists of 48 speakers disjoint from the training set, totaling 34.4K utterances (32.6 hours). The test set (ast-test) contains 58.8K utterances (50.4 hours) from 300 speakers that are unseen in both training and development. The average utterance length is 2.9 seconds with a standard deviation of 1.3 seconds.

Python

Do docker pull without docker daemon - cool reverse engineering
PostgreSQL explain tool https://explain.tensor.ru/plan/
Websockets vs gRPC? Or HTTP2 vs HTTP?
Habr IT job salaries in Russia
More Python tricks
Alpine Linux does not support pip wheels

ML

What is Natural gradient descent?
Using GANs to create teeth prosthetics
OpenAI now uses PyTorch
A year in ML for Google
Allegedly there is an American FindFace analogue with 3bn images selling its database to law enforcement
Apple bought Xnor.ai

Client side media scanning
Google releasing some materials of its fly brain mapping project
Use Julia to write fast ML code?
Decent analytic report from a recent “tabular” ML competition (RU)
AI dungeon https://play.aidungeon.io/
Yeah right, UNET can predict weather

Analysis of GPT-2 success

  • TLDR - such networks do not understand anything and are more like very expensive toys
  • In essence, GPT-2 has been a monumental experiment in Locke’s hypothesis, and so far it has failed. Empiricism has been given every advantage in the world; thus far it hasn’t worked.
  • Current systems can regurgitate knowledge, but they can’t really understand, in a developing story, who did what to whom, where, when, and why; they have no real sense of time, or place, or causality

Google’s Meena, a 2.6 billion parameter end-to-end trained neural conversational model

  • A single Evolved Transformer encoder block and 13 Evolved Transformer decoder blocks
  • 2.6 billion parameters and is trained on 341 GB of text, filtered from public domain social media conversations
  • Conceptually, perplexity represents the number of choices the model is trying to choose from when producing the next token (see the sketch after this list)
  • Model itself not published
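
On the perplexity point - it is just the exponent of the average per-token negative log-likelihood; a tiny sketch with made-up probabilities:

```python
import math

# Hypothetical per-token probabilities assigned by a language model
# to the tokens of one reference sentence.
token_probs = [0.2, 0.05, 0.5, 0.1]

# Perplexity = exp(mean negative log-likelihood per token).
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(nll)

# Roughly: "on average the model was choosing between ~perplexity
# equally likely tokens at each step".
print(round(perplexity, 2))
```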

I like this tweet

New library and boosting approach for tabular data?

  • Not very popular compared to older libraries (yet)
  • Key claim - they add uncertainty estimation to boosting / tree methods, which was a problem before
  • Plain GBMs, in regression tasks, output only a scalar value (see the sketch below)
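
I have not dug into the library itself, but the "scalar output" limitation above is easy to see with vanilla sklearn boosting; the usual workaround (not the new library's approach, just a baseline sketch) is to fit extra quantile-loss models for a crude prediction interval:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy 1-D regression problem with heteroscedastic noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1 + 0.05 * X[:, 0])

# A plain GBM gives a single scalar prediction per sample ...
point = GradientBoostingRegressor().fit(X, y)

# ... while quantile losses give a crude 80% prediction interval.
lower = GradientBoostingRegressor(loss="quantile", alpha=0.1).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)

X_test = np.array([[2.0], [8.0]])
print(point.predict(X_test))
print(lower.predict(X_test), upper.predict(X_test))
```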

LaserTagger by Google - the key idea is that instead of running a full seq2seq pipeline, you just tag the edits (keep / delete / add a phrase) to apply to the source tokens (see the sketch below)
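
A toy illustration of the tagging idea (hand-written tags and a made-up tag format, not the actual LaserTagger code): a model predicts one edit tag per source token, and the output is reconstructed by applying those edits.

```python
# Toy tag-and-apply scheme: KEEP / DELETE, optionally with a phrase
# to insert before the token. Tags here are hand-written; in the real
# system a model predicts them.
def apply_tags(tokens, tags):
    out = []
    for token, tag in zip(tokens, tags):
        op, _, added = tag.partition("|")
        if added:                 # e.g. "DELETE|and" inserts "and" first
            out.append(added)
        if op == "KEEP":
            out.append(token)
    return " ".join(out)

# Sentence fusion example: merge two sentences into one.
tokens = ["Turing", "was", "born", "in", "1912", ".",
          "He", "died", "in", "1954", "."]
tags   = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "DELETE",
          "DELETE|and", "KEEP", "KEEP", "KEEP", "KEEP"]

print(apply_tags(tokens, tags))
# Turing was born in 1912 and died in 1954 .
```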

Looks like current PyTorch quantization is just built on top of their CPU matrix multiplication library
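
For reference, a minimal sketch of the dynamic int8 quantization API in PyTorch (circa 1.3/1.4); on x86 CPUs the quantized linear layers dispatch to FBGEMM kernels under the hood:

```python
import torch
import torch.nn as nn

# Toy float model.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Dynamic quantization: nn.Linear weights are stored as int8,
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # torch.Size([1, 10])
```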

Data

https://prodi.gy/demo