2019 DS/ML digest 02

Posted by snakers41 on January 31, 2019

NLP - Highlight of the week - LASER

  • Plain PyTorch 1.0 / numpy / FAISS based;
  • Release, library, paper;
  • Looks like an off-shoot of their “unsupervised” NMT project;
LASER’s vector representations of sentences are generic with respect to both the input language and the NLP task. The tool maps a sentence in any language to point in a high-dimensional space with the goal that the same statement in any language will end up in the same neighborhood. This representation could be seen as a universal language in a semantic vector space. We have observed that the distance in that space correlates very well to the semantic closeness of the sentences.
  • Alleged pros:
    • It delivers extremely fast performance, processing up to 2,000 sentences per second on GPU.
    • The sentence encoder is implemented in PyTorch with minimal external dependencies.
    • Languages with limited resources can benefit from joint training over many languages.
    • The model supports the use of multiple languages in one sentence.
    • Performance improves as new languages are added, as the system learns to recognize characteristics of language families.
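
To make the "same neighborhood" idea from the quote a bit more concrete, here is a minimal sketch of nearest-neighbor lookup by cosine similarity. It does not call LASER itself: the 3-dimensional vectors are made up for illustration (real sentence embeddings are much higher-dimensional), and the helper names `cosine_similarity` and `nearest_sentence` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_sentence(query_vec, candidates):
    """Return the (sentence, similarity) pair closest to query_vec."""
    scored = [
        (text, cosine_similarity(query_vec, vec))
        for text, vec in candidates.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Made-up toy vectors standing in for real encoder output (assumption).
toy_embeddings = {
    "The cat sits on the mat.": [0.90, 0.10, 0.00],
    "Le chat est assis sur le tapis.": [0.85, 0.15, 0.05],
    "Stock prices fell sharply today.": [0.00, 0.20, 0.95],
}

# Pretend this is the embedding of a German sentence with the same meaning.
query_vec = [0.88, 0.12, 0.02]

best_text, best_score = nearest_sentence(query_vec, toy_embeddings)
print(best_text, round(best_score, 3))
```

With a shared multilingual space, the two "cat" sentences should score close to each other and far from the unrelated one, regardless of language. For millions of sentences you would swap the brute-force loop for an index such as FAISS, which is what the release is built on.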

Articles/ blog posts

  • Algorithms are often implemented without ways to address mistakes.
  • AI makes it easier to not feel responsible.
  • AI encodes & magnifies bias.
  • Optimizing metrics above all else leads to negative outcomes.
  • There is no accountability for big tech companies.

Papers

  • Self-feeding chatbots? No, it is just a clever annotation mining technique. This is a ranking challenge - not a generative one;

Datasets

  • NLP in English - Natural Questions - a new 300k-sized QA dataset by Google + a leaderboard topped by … Google’s BERT - looks like scale matters;