2018 DS/ML digest 24

Posted by snakers41 on September 20, 2018

This was a week of Google’s posts … they write them faster than I can read, lol.
Google is taking the ML market by zerg rush.


  • Simple and efficient semantic embeddings for rare words, n-grams, and any other language feature:
    • Looks like a simple and cool method to estimate embeddings for OOV / RARE words;
    • I would suppose that it works only for FULL words on LARGE datasets - otherwise we have n-grams;
    • Word w (with embedding v_w) is surrounded by a sequence c (of words). u_c_w is the average of the word embeddings of the words in c;
    • A * u_c_w is a good estimate for v_w. A is a matrix estimated per dataset;
    • Can also be used to estimate embeddings for short phrases and word combinations (!!!);
  • A good detailed explanation of triplet loss (can be used for the same purpose as the above link - learn to calculate phrase embeddings);
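
The estimate from the first item above (v_w ≈ A * u_c_w) can be sketched in a few lines. This is only a toy illustration, not the authors' code: the function names `context_average` and `fit_transform_matrix` are hypothetical, the embeddings are random, and the targets are generated from a hidden linear map so that the least-squares fit can be checked. With real data the rows of U would be averaged context vectors of known words and the rows of V their pre-trained embeddings.

```python
import numpy as np

def context_average(context_ids, emb):
    """u_c_w: mean of the embeddings of the words in context c."""
    return emb[context_ids].mean(axis=0)

def fit_transform_matrix(U, V):
    """Least-squares A such that A @ u is close to v for each row pair (u, v)."""
    # lstsq solves U @ X ~= V, so X is A transposed.
    A_T, *_ = np.linalg.lstsq(U, V, rcond=None)
    return A_T.T

rng = np.random.default_rng(0)
vocab, dim = 50, 8
emb = rng.normal(size=(vocab, dim))      # toy "pre-trained" embeddings (assumption)

# Toy supervision (assumption): 200 contexts of 5 words each, with targets
# produced by a hidden linear map so the recovered A can be verified.
A_true = rng.normal(size=(dim, dim))
contexts = rng.integers(0, vocab, size=(200, 5))
U = np.stack([context_average(c, emb) for c in contexts])
V = U @ A_true.T

A = fit_transform_matrix(U, V)

# Estimate an embedding for an unseen word from a single context.
new_context = rng.integers(0, vocab, size=5)
v_est = A @ context_average(new_context, emb)
```

The same call works for a phrase or n-gram: average the embeddings of the surrounding words and multiply by A.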

Market / articles / posts:

  • (RU) Knowledge distillation with UNet;
  • One of the best approaches to the TrackML particle tracking challenge - I had a lot of discussions about this myself (including with proper physicists) and spent quite some time on EDA, but I could not come up with a way to create initial seed proposals that would make the task tractable. Looks like it was tractable to just brute-force it w/o seeds. Nice. The final LB looks similar to Kalman filter accuracy, ~90%;
  • TF 2.0 is coming - looks like they want to take all the best features from Keras / PyTorch. Only time will tell. I am sceptical…;
  • New inference cards from Tesla. Hard to comment, this is a very niche thing;
  • Nice illustration about impostor syndrome;
  • (RU) a bit about processes in Linux;
  • Yet another library for image augmentations - very early stage. I kind of do not understand why you need these when you can improvise a plain pipeline using the core classes from this;

News / datasets / challenges from Google:

  • New inclusive images challenge. Surprisingly, it looks really decent as a community effort for a change. Key idea - test the models with new data from new sources. The size of the train dataset (3.6 GB) is on the lower side;
  • Google’s new Conceptual Captions dataset - 3.3M images with captions. Looks like they even did some EDA and cleaning on the dataset. This also looks VERY interesting because you can try novel sequence-to-sequence models (e.g. the Transformer) here. But ofc this is Google - you need to invest in Google Cloud and submit all of your code for free. Bummer;
  • Unrestricted adversarial challenge from Google. Amazing community effort. But no carrot for me personally - sounds like a cheap way to source the datasets and solution code for free;
  • Cool idea from Google - for Text2Speech you can share datasets between similar languages via a common phonetic representation;
  • The What-If Tool in TensorBoard - looks really cool, but ofc requires a TF model and seems a bit too niche;

New architectures from Google:

  • CNN Music recognition:
    • Embeddings:
      • Semantic fingerprint - a low-dimensional embedding of 8-second music clips sliced into 2-second intervals, calculated on the fly;
      • On-device song database for search;
    • Matching:
      • Optimized KNN search;
      • For the narrowed-down sample, the full list of embeddings is downloaded and treated as sequences to improve accuracy;
    • CNN details:
      • Embedding size of 128;
      • Fingerprint calculated every 0.5 s;
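
The two-stage matching described above (KNN to narrow down the candidates, then comparing whole sequences of fingerprints) can be sketched roughly as follows. This is a toy sketch under loud assumptions, not Google's implementation: the fingerprints are random unit vectors instead of CNN outputs, the KNN is brute force rather than an optimized index, and the function names (`knn_candidates`, `sequence_score`, `match`) and the mean-cosine alignment score are made up for illustration. Only the 128-dimensional embedding size comes from the notes.

```python
import numpy as np

EMB_DIM = 128  # embedding size quoted in the notes above

def normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def knn_candidates(query, db_emb, db_song, k=3):
    """Stage 1: brute-force k-NN per query fingerprint, returns candidate song ids."""
    sims = query @ db_emb.T                              # (n_query, n_db)
    top = np.argpartition(-sims, k - 1, axis=1)[:, :k]   # k best DB rows per query row
    return np.unique(db_song[top])

def sequence_score(query, song_seq):
    """Stage 2: best mean cosine similarity over all alignments of the query."""
    n, m = len(query), len(song_seq)
    if m < n:
        return float("-inf")
    return max(
        float(np.mean(np.sum(query * song_seq[o:o + n], axis=1)))
        for o in range(m - n + 1)
    )

def match(query, db_emb, db_song, k=3):
    """Narrow down with k-NN, then rank candidates by sequence-level score."""
    candidates = knn_candidates(query, db_emb, db_song, k)
    scored = {int(s): sequence_score(query, db_emb[db_song == s]) for s in candidates}
    return max(scored, key=scored.get), scored

# Toy database (assumption): 4 songs, 40 random fingerprints each.
rng = np.random.default_rng(0)
n_songs, per_song = 4, 40
db_emb = normalize(rng.normal(size=(n_songs * per_song, EMB_DIM)))
db_song = np.repeat(np.arange(n_songs), per_song)

# Query: a noisy copy of 7 consecutive fingerprints from song 2.
true_song, start, n_query = 2, 10, 7
clean = db_emb[db_song == true_song][start:start + n_query]
query = normalize(clean + 0.05 * rng.normal(size=clean.shape))

best, scores = match(query, db_emb, db_song)
```

Scoring the whole sequence is what makes the second stage more robust than a single nearest-neighbour hit: a wrong song can match one fingerprint by chance, but rarely several in a row at a consistent offset.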