2020 DS/ML digest 03

Posted by snakers41 on March 1, 2020

NLP

FastText minification - https://habr.com/ru/post/489474/:

  • Not very useful in practice, but it is nice that someone does this; a minimal sketch of the core idea follows below
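
The linked post is in Russian; the gist is shrinking fastText models to fit modest RAM budgets. A minimal sketch of the core idea, i.e. pruning the vocabulary and reducing dimensionality (function name and parameters here are illustrative, not the post's actual code):

```python
# Shrink a word-vector table by keeping only the most frequent words
# and reducing the dimensionality with PCA.
import numpy as np
from sklearn.decomposition import PCA

def shrink_embeddings(words, vectors, keep=100_000, new_dim=100):
    """words: list sorted by frequency; vectors: (n_words, dim) float32 matrix."""
    kept = vectors[:keep]              # prune the long tail of the vocabulary
    pca = PCA(n_components=new_dim)    # project the kept vectors to fewer dims
    reduced = pca.fit_transform(kept).astype(np.float32)
    return words[:keep], reduced
```

In practice fastText also stores a large subword n-gram matrix, which has to be pruned and/or quantized in the same spirit.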


Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer:

  • 11 billion parameters
  • Encoder-decoder models generally outperformed “decoder-only” language models
  • Fill-in-the-blank-style denoising objectives worked best (see the sketch below)
  • The most important factor was the overall computational cost
  • Training on in-domain data can be beneficial, but pre-training on smaller datasets can lead to detrimental overfitting
  • Multitask learning could be close to competitive with a pre-train-then-fine-tune approach, but requires carefully choosing how often the model is trained on each task
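
What "fill-in-the-blank-style denoising" looks like in practice: contiguous spans of the input are replaced with sentinel tokens, and the model learns to emit the dropped spans. A minimal sketch using the Hugging Face transformers port of T5 (the sentinel-token example follows the T5 documentation; this only illustrates the objective, not the training loop):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Corrupted input: contiguous spans replaced with sentinel tokens.
input_ids = tokenizer.encode(
    "The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt"
)

# The training target is the dropped-out spans, each prefixed by its
# sentinel, e.g. "<extra_id_0> cute dog <extra_id_1> the"; here we
# just sample the model's own completion.
outputs = model.generate(input_ids, max_length=20)
print(tokenizer.decode(outputs[0]))
```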

ML

Libraries / repos

Datasets

  • TyDi QA: A Multilingual Question Answering Benchmark dataset
  • English NLP datasets
  • The largest NMT dataset by FAIR:
    • Based on Common Crawl
    • Mined using translation + FAISS (see the mining sketch after this list)
    • 3.5 billion parallel sentences
    • 661 million are aligned with English
    • 17 language pairs have more than 30 million parallel sentences, 82 more than 10 million, and most more than one million
  • Transparent objects dataset
  • Google StreetView + navigation instructions dataset
  • Open Images V6 - with localized narratives
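
On the FAISS-mined NMT dataset: the pipeline pairs sentences across languages by nearest-neighbor search over multilingual sentence embeddings. A minimal toy sketch of the FAISS half of the idea, assuming you already have sentence embeddings for both languages (the multilingual encoder is out of scope, and the 0.8 threshold below is hypothetical):

```python
import numpy as np
import faiss

d = 1024                                  # embedding dimension (assumed)
rng = np.random.default_rng(0)
src = rng.random((1_000, d), dtype=np.float32)  # stand-ins for real embeddings
tgt = rng.random((5_000, d), dtype=np.float32)

faiss.normalize_L2(src)                   # after L2 norm, inner product == cosine
faiss.normalize_L2(tgt)

index = faiss.IndexFlatIP(d)              # exact inner-product index
index.add(tgt)

scores, nn = index.search(src, 1)         # best target candidate per source sentence
pairs = [(i, int(nn[i, 0]))
         for i in range(len(src)) if scores[i, 0] > 0.8]
```

At Common Crawl scale an exact flat index is infeasible, so the actual FAIR pipeline relies on compressed, approximate FAISS indexes and a margin criterion over the k nearest neighbors rather than a raw cosine cutoff.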