Balancing accuracy vs. exploration is always tricky
In a previous article we described how we solved a 5000-class NLP classification task using a single BiLSTM neural network with an embedding-bag layer. This network helped our users find the services they were looking for and was robust to input typos (casing and other nasty things present in morphologically rich languages). Each time a user types something into the search bar, we suggest the 5 most relevant services (according to the NN), sorted by confidence and …
Even though top-5 classification accuracy is ~95%, there is no guarantee that the other four suggested services are appropriate. In some cases we got something like this:
In our dataset we have one ground-truth label for each search query, but we want our classifier to predict 5 relevant services to balance accuracy and exploration. So we need some way to tell the NN that there is still only one GT label, but that there may be several “similar” services that are good enough too. As with many classification problems, you can simply swap the CE loss for the BCE loss (PyTorch implementation), weighting the ground-truth service with 1 and “similar” services with some constant < 1:
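A minimal sketch of this swap, assuming `nn.BCEWithLogitsLoss` over raw logits; the class indices and the 0.15 weight below are illustrative, not from the article:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 5000 services, a batch of 2 queries.
num_classes = 5000
logits = torch.randn(2, num_classes)  # raw network outputs

# Soft targets: ground truth gets 1.0, "similar" services a constant < 1.
targets = torch.zeros(2, num_classes)
targets[0, 42] = 1.0           # ground-truth service for query 0 (made-up index)
targets[0, [7, 99]] = 0.15     # "similar" services (made-up indices and weight)
targets[1, 123] = 1.0          # ground-truth service for query 1

# BCEWithLogitsLoss accepts continuous targets in [0, 1],
# so the soft labels plug in directly.
loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)
```

Unlike CE, BCE treats each class as an independent binary decision, which is what lets several classes carry nonzero target weight at once.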
The main problem, then, is how to find similar services for each annotated query. That is quite a difficult task to do manually, so we need another way.
- The first idea: take a heuristic service graph and treat vertices within some distance as “similar”. But heuristics are usually biased (and the problem is that you do not know how biased they are or where exactly); it is usually better to rely on statistics;
- Hence the second idea: build an N x N co-occurrence (affinity) matrix (where N is the total number of services), each cell of which counts how many times two services appeared together (in a user profile, for example), normalize each row (or column) with a min-max scaler, and treat all services above some threshold as “similar”.
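The second idea can be sketched as follows; the tiny `profiles` dataset and N = 6 are made up purely for illustration:

```python
import numpy as np

# Hypothetical data: each profile is a set of service ids (N = 6 services).
N = 6
profiles = [{0, 1, 2}, {1, 2}, {0, 3}, {2, 4, 5}]

# Co-occurrence counts: cooc[i, j] = number of profiles containing both i and j.
cooc = np.zeros((N, N))
for p in profiles:
    for i in p:
        for j in p:
            if i != j:
                cooc[i, j] += 1

# Min-max normalize each row so values fall into [0, 1].
row_min = cooc.min(axis=1, keepdims=True)
row_max = cooc.max(axis=1, keepdims=True)
affinity = (cooc - row_min) / np.maximum(row_max - row_min, 1e-8)

# Services above some threshold are treated as "similar".
threshold = 0.5
similar_to_1 = np.where(affinity[1] >= threshold)[0]
```

Note that per-row min-max normalization makes the matrix asymmetric in general: service A may be “similar” to B without B being “similar” to A.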
So now we trust the expression “tell me who you go with and I’ll tell you who you are”. It probably won’t ruin your classifier, but it will add some generalization (or flexibility), because the network will tend to establish more complex connections between labels on its own (thanks to backpropagation).
We trained such a network with a 0.5 threshold for the affinity matrix, weighting these relatives by 0.15 in the BCE loss. Result:
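Building the soft target vector from an affinity row with those two parameters might look like this; the 5-element `affinity_row` and `gt_index` are made-up values:

```python
import numpy as np

# Hypothetical affinity row for one query's ground-truth service (values in [0, 1]).
affinity_row = np.array([0.0, 0.8, 0.3, 0.55, 0.1])
gt_index = 0
threshold, relative_weight = 0.5, 0.15  # values used in the article

# Services above the threshold get the relative weight; the rest get 0.
target = np.where(affinity_row >= threshold, relative_weight, 0.0)
target[gt_index] = 1.0  # ground truth keeps full weight
# target → [1.0, 0.15, 0.0, 0.15, 0.0]
```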
It looks better, doesn’t it? One more example:
So: if classes are somehow related to each other by nature, by statistics, or by common sense, it is possible to create a matrix of relative proximity (not necessarily symmetric) and use it to help the classifier understand the inner connections between classes. It may take a few iterations to pick the best parameters, but the result is worth it!
One last example of weighting:
Another idea (not tested yet): treat each row of the affinity matrix as the label vector itself and use an L2 or cosine loss as the training objective. We’ll test it and report the results.
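A sketch of what that untested variant could look like, assuming the target for each query is the affinity row of its ground-truth service (the random tensors stand in for real data):

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: batch of 2 queries, N = 5000 services.
N = 5000
logits = torch.randn(2, N)            # network outputs
affinity_targets = torch.rand(2, N)   # affinity rows of the GT services

# Cosine loss: 1 - cosine similarity between prediction and target row.
cos_loss = (1 - F.cosine_similarity(logits, affinity_targets, dim=1)).mean()

# Or a plain L2 (MSE) loss on sigmoid-squashed outputs.
l2_loss = F.mse_loss(torch.sigmoid(logits), affinity_targets)
```

This turns the problem into regression on the affinity row rather than weighted multi-label classification, so thresholding the matrix is no longer needed.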