2019 DS/ML digest 15

Posted by snakers41 on August 21, 2019

Speech


NLP

  • Finally, FAIR tries to simplify transformer networks:
    • TLDR: adaptive attention span + merging the self-attention and feed-forward layers into one;
    • They boast a 3-4x training / inference speed-up;
    • This will probably end up as a layer in PyTorch as well;
  • You can replicate GPT-2. But why?
  • ACL 2019 highlights;


Cool stuff

  • New colormap from Google that combines the best of the Jet and Inferno colormaps;
  • Reflinks for copy-on-write (COW) dataset management:
    • Looks so cool … copy-on-write behavior;
    • But “like with hard links, reflinks only work within a given mounted volume”;
    • Also requires an exotic FS on your disk (one with reflink support, e.g. Btrfs or XFS);
    • Any write to the cloned file causes new data blocks to be allocated to hold that data. The cloned file appears changed, and the original file is unmodified. The clone is perfectly suitable for the case of duplicating a dataset, allowing modifications to the dataset without polluting the original dataset.
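
To make the reflink idea above a bit more concrete, here is a minimal Python sketch of cloning a dataset file. It is Linux-only and rests on assumptions that are not from the linked post: the helper name `clone_file` and the file names are hypothetical, and the clone is requested through the Linux `FICLONE` ioctl. When the filesystem does not support reflinks (or the two paths are on different volumes), the sketch silently falls back to a regular byte-for-byte copy, so it shows the behavior rather than guaranteeing any space savings.

```python
import fcntl
import os
import shutil
import tempfile

# Linux ioctl request number for FICLONE (assumption: Linux only).
FICLONE = 0x40049409


def clone_file(src, dst):
    """Hypothetical helper: reflink src to dst if the FS allows it.

    Returns True when a reflink was made, False when it fell back
    to an ordinary copy.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the kernel to share src's data blocks with dst.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return True
        except OSError:
            # Unsupported filesystem or a cross-volume clone.
            pass
    shutil.copyfile(src, dst)
    return False


# Small demo: modify the clone and check the original is untouched.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "train.csv")
    clone = os.path.join(tmp, "train_clone.csv")
    with open(original, "w") as f:
        f.write("id,label\n1,cat\n")

    used_reflink = clone_file(original, clone)

    with open(clone, "a") as f:
        f.write("2,dog\n")

    with open(original) as f:
        original_text = f.read()
    with open(clone) as f:
        clone_text = f.read()
```

On the command line, GNU `cp --reflink=auto src dst` gives a similar "clone if possible, copy otherwise" behavior.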

ML


Dealing with noisy labels


Datasets / competitions

  • Waymo open-sources its self-driving dataset. The data itself is behind a registration wall that requires Gmail auth;
  • Facebook’s Deep Fake challenge and dataset;
  • Fast MRI challenge;
  • New natural dialogue datasets from Google;
  • Lyft 3D segmentation competition;