This was a week of Google posts… they write them faster than I can read, lol.
Google is taking the ML market by zerg rush.
NLP
- Simple and efficient semantic embeddings for rare words, n-grams, and any other language feature:
  - Looks like a simple and cool method to estimate embeddings for OOV / RARE words;
  - I would suppose it works only for FULL words on LARGE datasets - otherwise we are back to n-grams;
  - A word w (with embedding v_w) is surrounded by a context c (a sequence of words); u_c_w is the average of the embeddings of the words in c;
  - A * u_c_w is a good estimate of v_w, where A is a matrix estimated per dataset (see the sketch after this list);
  - It can also be used to estimate embeddings for short phrases and word combinations (!!!);
- A good detailed explanation of triplet loss (it can be used for the same purpose as the link above - learning phrase embeddings); a minimal sketch of the loss also follows below;
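
Since the induction step above is just a linear map over averaged context vectors, here is a minimal numpy sketch of the idea. All the names (emb, contexts) are hypothetical placeholders, and fitting A by least squares on words that already have good embeddings is my reading of the method, not code from the paper:

```python
import numpy as np

# emb:      dict word -> np.ndarray of shape (d,)   (pretrained embeddings)
# contexts: dict word -> list of context-word lists (collected from a corpus)
# Both are hypothetical placeholders.

def avg_context_embedding(context_words, emb):
    """u_c_w: the average embedding of the words surrounding w."""
    vecs = [emb[t] for t in context_words if t in emb]
    return np.mean(vecs, axis=0)

def fit_A(emb, contexts):
    """Estimate the per-dataset matrix A by least squares, so that
    A @ u_c_w ~ v_w for words we already have good embeddings for."""
    U = np.stack([avg_context_embedding(c, emb)
                  for w, cs in contexts.items() for c in cs])
    V = np.stack([emb[w]
                  for w, cs in contexts.items() for c in cs])
    A_T, *_ = np.linalg.lstsq(U, V, rcond=None)  # solves U @ A.T ~ V
    return A_T.T

def induce_embedding(oov_contexts, emb, A):
    """Embedding for a rare / OOV word (or a short phrase):
    A times the average of its context embeddings."""
    u = np.mean([avg_context_embedding(c, emb) for c in oov_contexts], axis=0)
    return A @ u
```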
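
And since triplet loss itself fits in a few lines, a minimal numpy version of one common variant (squared L2 distances; PyTorch also ships a ready-made nn.TripletMarginLoss):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the positive closer to the anchor than the negative,
    by at least `margin` in squared L2 distance; hinge at zero."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```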
Market / articles / posts:
- (RU) Knowledge distillation with UNet;
- One of the best approaches to solving the TrackML particle tracking challenge. I myself had a lot of discussions about it (also with proper physicists) and spent quite some time on EDA, but I could not come up with an idea for how to create the initial seed proposals that would make the task tractable. Looks like it was tractable simply by brute force, without seeds. Nice. The final LB score (~90%) looks similar to Kalman filter accuracy;
- TF 2.0 is coming - looks like they want to take all the best features from Keras / PyTorch. Only time will tell. I am sceptical…;
- New Tesla inference cards from NVIDIA. Hard to comment - this is a very niche thing;
- Nice illustration of impostor syndrome;
- (RU) A bit about processes in Linux;
- Yet another library for image augmentations - very early stage. I do not really understand why you need these when you can improvise a plain pipeline from core classes yourself (see the sketch after this list);
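
To illustrate the "plain pipeline" point - a minimal sketch of an improvised augmentation pipeline built from core torchvision classes (torchvision is my assumption here; the linked posts may have other core classes in mind):

```python
from torchvision import transforms

# A plain augmentation pipeline composed from stock transform classes -
# no dedicated augmentation library required.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop + resize
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])

# Usage: tensor = train_tf(pil_image), where pil_image is a PIL.Image.
```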
News / datasets / challenges from Google
- New Inclusive Images challenge. Surprisingly, it looks really decent as a community effort for a change. Key idea - test the models on new data from new sources. The size of the train dataset (3.6 GB) is on the lower side;
- Google’s new Conceptual Captions dataset - 3.3m images with captions. Looks like they even did some EDA and cleaning on the dataset. This also looks VERY interesting because you can try novel sequence-to-sequence models (the Transformer) here. But ofc this is Google - you need to invest in Google Cloud and submit all of your code for free. Bummer;
- Unrestricted adversarial examples challenge from Google. Amazing community effort, but no carrot for me personally - sounds like a cheap way to source datasets and solution code for free;
- Cool idea from Google - for text-to-speech you can share datasets across similar languages via a phonetic representation;
- The What-If Tool in TensorBoard - looks really cool, but ofc requires a TF model and seems a bit too niche;
New architectures from Google
- CNN music recognition (a toy sketch of the pipeline follows this list):
  - Embeddings:
    - A semantic fingerprint - a low-dimensional embedding of an 8-second music clip sliced into 2-second intervals, calculated on the fly;
    - An on-device song database for search;
  - Matching:
    - Optimized KNN search;
    - For a narrowed-down sample of candidates, the full list of embeddings is downloaded, and the embeddings are treated as sequences to improve accuracy;
  - CNN details:
    - Embedding size 128;
    - A fingerprint is calculated every 0.5s;
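
A toy end-to-end sketch of this fingerprint-and-match pipeline. Everything concrete here is a placeholder: the random projection stands in for the on-device CNN, the brute-force loop stands in for the optimized KNN index, and the sample rate is assumed; only the 128-d embedding size, the 2-second window, and the 0.5s step come from the post:

```python
import numpy as np

EMB_SIZE = 128   # embedding size from the post
STEP_S = 0.5     # one fingerprint every 0.5 s
WIN_S = 2.0      # each embedding covers a 2-second interval
SR = 16000       # assumed sample rate

rng = np.random.default_rng(0)
# Placeholder "model": a fixed random projection instead of the real CNN.
W = rng.standard_normal((EMB_SIZE, int(SR * WIN_S)))

def fingerprints(audio):
    """One L2-normalized 128-d embedding per 0.5 s step over the clip."""
    step, win = int(SR * STEP_S), int(SR * WIN_S)
    embs = []
    for start in range(0, len(audio) - win + 1, step):
        e = W @ audio[start:start + win]
        embs.append(e / np.linalg.norm(e))
    return np.stack(embs)

def match(query_embs, song_db):
    """Brute-force stand-in for the optimized KNN search: align the query
    fingerprint sequence against each song's sequence and keep the best
    average cosine similarity (the "treat embeddings as sequences" trick)."""
    best_song, best_score = None, -np.inf
    for song_id, song_embs in song_db.items():
        max_offset = max(1, len(song_embs) - len(query_embs) + 1)
        for off in range(max_offset):
            seg = song_embs[off:off + len(query_embs)]
            score = float(np.mean(np.sum(query_embs[:len(seg)] * seg, axis=1)))
            if score > best_score:
                best_song, best_score = song_id, score
    return best_song, best_score

# Usage: song_db = {"song_a": fingerprints(track_a), ...}
#        match(fingerprints(eight_second_query), song_db)
```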