The categorical cross-entropy loss after softmax activation is the method of choice for classification;
Training a CNN classifier from scratch on small datasets does not work well;
In contrast to this, we show that the cosine loss function provides significantly better performance than cross-entropy on datasets with only a handful of samples per class.
The problem is that it looks like they use ImageNet for some kind of transfer learning;
Key idea - use information “between” the labels, because some classes clearly correlate;
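To make the comparison in the notes above concrete, here is a minimal NumPy sketch of the two losses. It assumes the cosine loss is one minus the cosine similarity between the L2-normalized network output and a one-hot target vector; that definition, and the function names `cosine_loss` and `softmax_cross_entropy`, are illustrative assumptions rather than code from the paper. Replacing the one-hot targets with class embeddings that place related classes close together would be one way to use the information "between" the labels.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot row vectors."""
    return np.eye(num_classes)[labels]


def cosine_loss(outputs, labels, eps=1e-12):
    """Mean of 1 - cos(output, target), with one-hot targets (assumption)."""
    targets = one_hot(labels, outputs.shape[1])
    # L2-normalize each output row; eps guards against division by zero.
    norms = np.linalg.norm(outputs, axis=1, keepdims=True)
    unit_outputs = outputs / np.maximum(norms, eps)
    # One-hot targets already have unit length, so the dot product is the cosine.
    cosine = np.sum(unit_outputs * targets, axis=1)
    return float(np.mean(1.0 - cosine))


def softmax_cross_entropy(outputs, labels):
    """Mean categorical cross-entropy after softmax, for comparison."""
    shifted = outputs - outputs.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))


if __name__ == "__main__":
    outputs = np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 1.5]])
    labels = np.array([0, 2])
    print("cosine loss:  ", cosine_loss(outputs, labels))
    print("cross-entropy:", softmax_cross_entropy(outputs, labels))
```

Under this definition the cosine loss is bounded (between 0 and 2) and depends only on the direction of the output vector, whereas cross-entropy keeps decreasing as the logit of the correct class grows.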
Must … not … post BERT papers
BioBERT: pre-trained biomedical language representation model for biomedical text mining;