What nobody will tell you about what they do

A couple of real and funny cases from practice

Posted by snakers41 on December 4, 2017

A typical reaction when you have found such errors in your work or in your colleagues' work


So, people usually brag about the high places they achieved in competitions (usually disregarding the ugly, impractical solution they used, or the 99% borrowed code they do not understand), their possessions, money, trivia, mobile phones, women and other sorts of trivial and stupid shit (vain and vanity sound similar, right?).

But I really rarely see people brag about the process of becoming better, overcoming hardship and making stupid and not-so-stupid errors. There is a saying that a stupid man never learns, a clever man learns from his mistakes and a wise man learns from the errors of the former two (i.e. he or she in turn learns from fools?). You see, our modern world has a really short attention span; people have always been focusing on the tip of the iceberg, which inevitably leads to 1000 "zerg" solutions being valued over 1 well-balanced one and to a "who screams louder gets the reward" attitude.

Without further ado, I would like to present a funny list of mistakes similar to this post. They may seem a bit more sophisticated / less obvious, but when you really understand what caused them, it inevitably makes you either laugh or marvel at the fragile nature of our existence. So behold)

1. OpenCV reads and writes images in BGR format rather than RGB by default

Well, when I found out about this, I realized why so many Kaggle kernels are so verbose in invoking their cv2 methods. This post on Stack Overflow pretty much sums it all up. Combine it with the really poor (I would say terrible and cringe-inducing) OpenCV documentation for Python, and you will see why many younger data scientists may be unaware of this.

In a nutshell, as of late 2017 the majority of libraries and frameworks I have encountered (PyTorch, TensorFlow, MoviePy, PIL, scikit-image, Keras to name a few) assume the RGB image format for most of their methods, pre-trained models and iterators. But it looks like in the early days of computer vision, with the prominence of OpenCV, the BGR format was dominant. In Keras, for example, if you look under the hood of the model zoo, you will see that for some of the older models the channels are indeed swapped. Wow.

Just for the sake of demonstration (yes, I did my own benchmarks, but I am too lazy to dig them out) this image

with this code

import skimage.io
import cv2

# skimage reads the image as RGB, but cv2.imwrite assumes BGR channel order,
# so the red and blue channels end up swapped in the saved file
img = skimage.io.imread('sample.png')
cv2.imwrite('sample_out_1.png', img)


turns into this:



Now consider that neural networks are famously good at correcting your mistakes (a channel swap is just a fixed permutation, which can be learned as a simple matrix multiplication) and that in many cases people use models pre-trained on RGB data - and you have a bomb)

Solutions? I basically see 3 major options:

  1. Add an explicit cvtColor to your OpenCV calls, as in the snippet below;
  2. Just use Pillow or scikit-image and forget about all of this;
  3. If you are working with video - on the channel we did a benchmark (if you do not have Telegram - go here) of video reading libraries (hint - all of them use ffmpeg and are mostly the same);


import skimage.io
import cv2

# explicitly convert RGB (skimage) to BGR before handing the image to cv2.imwrite
img = skimage.io.imread('sample.png')
cv2.imwrite('sample_out_2.png', cv2.cvtColor(img, cv2.COLOR_RGB2BGR))


2. Mind simple operators - they may be very deceptive

So you have written a fairly complex and sophisticated data-iterator class extension to sweep through your video dataset. After rigorous testing everything seems to work (and even all the models seem to train almost properly). But on the held-out test dataset your results are very poor.

What is the problem? After a couple of hours of additional testing you find the culprit - this little gem.

# BUG: the order of a Python set is not guaranteed, so this mapping can change between runs
unique_values = list(set([x for x in self.video_dict.values()]))
self.idx_2_value = {i: value for i, value in enumerate(unique_values)}
self.value_2_idx = {value: i for i, value in enumerate(unique_values)}


The thing is, the set() operation in Python is not ordered: the iteration order depends on hashing and, for strings, may even change between interpreter runs because of hash randomization. Basically I was training a proper model, but then just forgetting which label is which. This is the correct version.

# sorting makes the label <-> index mapping deterministic across runs
unique_values = sorted(list(set([x for x in self.video_dict.values()])))
self.idx_2_value = {i: value for i, value in enumerate(unique_values)}
self.value_2_idx = {value: i for i, value in enumerate(unique_values)}


I have also encountered the same problem with dict.keys() in Python 3, but there it happened to produce the same ordering for my alphabetical keys. It's really nice when you make a mistake, but then you also make a second one that negates the first one.
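To see the effect for yourself, here is a minimal self-contained sketch (the label names are made up): run it twice as a script and, unless PYTHONHASHSEED is fixed, the set-based mapping may differ between runs while the sorted one stays stable.

labels = ['dog', 'cat', 'bird', 'fish']

unique_values = list(set(labels))                   # order not guaranteed across runs
print({i: v for i, v in enumerate(unique_values)})  # e.g. {0: 'cat', 1: 'fish', ...}

unique_values = sorted(set(labels))                 # deterministic order
print({i: v for i, v in enumerate(unique_values)})  # always {0: 'bird', 1: 'cat', 2: 'dog', 3: 'fish'}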


3. So you have written a really complex and well-tested class. Pay attention to how it is invoked

Usually you pay more attention to the more complex parts of the task at hand. In my case I struggled to debug a sophisticated model, when the model was just fine - I was simply invoking it with one parameter set to 2 instead of 15, which roughly decreased the model's capacity several times (it was an RNN). What is more annoying, in this case I used PyTorch, and PyTorch accepts a dynamic sequence length for GRU and LSTM, so one error was indeed masked by this feature of the framework.
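A minimal sketch of how silently this goes through (the shapes and constants below are made up for illustration): PyTorch RNN modules accept whatever sequence length they are given at call time, so a wrong constant at the call site produces no error, just a much weaker model.

import torch
import torch.nn as nn

gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)

# intended call site: 15 time steps per clip
out, _ = gru(torch.randn(8, 15, 64))   # output shape (8, 15, 128)

# buggy call site: a wrong constant crops the clips to 2 steps
out, _ = gru(torch.randn(8, 2, 64))    # output shape (8, 2, 128) - runs fine, no warning,
                                       # the model just sees 2 frames instead of 15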


4. Data leakage is a real issue in corporate environments, especially in young companies

I remember reading an article claiming that more than 50% of Kaggle competitions have leaks and that people usually exploit them. I have also heard an opinion that in corporate environments, due to people not caring and other reasons, up to 70% of ML algorithms work only because of leaks. This obviously cannot be backed by any rigorous research, because people are really reluctant to admit such mistakes (basically, if a data scientist working for a salary exploits leaks instead of avoiding them, he is incompetent and should be fired). You may take it or you may leave it.

I just would like to present you with the following case. At my new job I was presented with the fact that there was a task where an intern had achieved 95-97-98% classification accuracy. Also a list of fun facts:

  • The whole department kind of helped him write the code - there was no clear code owner;
  • His best model took 14+ hours per epoch to train;
  • It took him 2-3 weeks to push at least some code to the repository;
  • No ablation analysis / real insight was given into which part of the model was the killer feature (a couple of counter-intuitive things supposedly boosted his accuracy);


So, challenged with finding the truth, some 6 weeks later we had the following:

  • Having trained more than 200 models for various architectures and for ablation analysis purposes, we found a leak in his code;
  • I had to learn a new framework from scratch in limited time (it was a hiring requirement) - I am glad it was PyTorch, not TensorFlow;
  • I had to admin my own PC, create a set of Docker environments and re-install the system from scratch for speed;
  • I had to deal with ops providing me with outdated Apache Spark interfaces with no Python support;
  • My final accuracy was around 87% and the ROC AUC score around 0.91 (which is not bad);
  • The management had issues with my work, because "why cannot a middle/senior guy get a higher score than an intern?". Funny enough, the chief data scientist at least understands that leaks are a bad thing =)

 

The leak was really subtle (or rather egregious). You see, the guy read all the data strictly sequentially from disk (so each batch contained essentially one class) and did not freeze the batch normalization layers of the network during the validation phase. So the network just learned the data average of one class, then of the other, and its batch statistics let the model distinguish them. That is basically regressing y on itself with extra steps.
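For illustration, a minimal PyTorch sketch of the two fixes (the toy data and model below are made up): shuffle the training data so batches are not class-homogeneous, and switch the network to eval mode for validation so batch norm uses its running statistics instead of the statistics of the current, nearly single-class batch.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# toy stand-ins for the real video data and model (names and shapes are made up)
train_ds = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 2, (256,)))
val_ds = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 2, (64,)))
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)   # not sequential per-class reads
val_loader = DataLoader(val_ds, batch_size=32, shuffle=False)

model.train()                        # batch norm accumulates running stats on shuffled batches
for x, y in train_loader:
    logits = model(x)                # loss / backprop omitted for brevity

model.eval()                         # batch norm is now "frozen": it uses running mean/var,
with torch.no_grad():                # so a nearly single-class batch cannot leak its label
    for x, y in val_loader:
        logits = model(x)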



This quote seems to be a nice fit here


Moral of the story - if you do your job properly, you will always be the bad guy. Just accept it and be ready to fight back against mediocrity, false advertising and false promises. It is generally hard and takes time.

If you are not a "yes-man" or you cannot manage people's expectations, you will always encounter this. I can manage them, but when people are biased from the start (99% accuracy is the goal, even though it took scientists decades to get to 80% on some comparable benchmarks), then it is a bit more tricky.

5. A word on data normalization

Well, people from the future: read your documentation properly and try to understand the boilerplate code. Just read this thread - the issue is that a guy prepared a pre-trained Inception4 network (by converting TF-slim weights, as evidenced by comparing the weights directly), but his normalization schedule differs a bit. Wise people understand that neural networks can work around such issues, but random people reading the docs will not appreciate normalization being different for this one model.

Also mind that scikit-image and OpenCV do not normalize images to [0;1] on read (PIL-based pipelines such as torchvision's ToTensor do), while the majority of neural networks expect data in the normalized [0;1] range (first scale to [0;1], then subtract the mean and divide by the standard deviation).
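For reference, a minimal sketch of the usual pipeline (the per-channel mean/std below are the common ImageNet values used by many model zoos - an assumption, always check the stats expected by the specific pre-trained model you use):

import numpy as np
from PIL import Image

img = np.asarray(Image.open('sample.png').convert('RGB'))  # uint8, values in [0, 255]

x = img.astype(np.float32) / 255.0                          # first bring values to [0, 1]
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)    # assumed ImageNet per-channel mean
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)     # assumed ImageNet per-channel std
x = (x - mean) / std                                        # then standardize per channel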

Final note - how to combat such trivia and not get discouraged?

Here is the really tricky and philosophical part. I did a post on a similar subject here about how to stack black boxes responsibly.

So, everything is simple and complicated at the same time:

  1. You can compare your performance only to yourself - apples to apples (contrary to what the world thinks; logic like "10 Chinese will do your job" holds, but it is kind of disheartening);
  2. The world is focused on getting short-term unsustainable gains and bragging about them. This should not involve you;
  3. You should be calm and see all challenges as an opportunity to learn;
  4. Help others when you can. People will not accept real advice - they will tend to push their work onto you - but you need to be patient too;
  5. Be as public about your findings as possible, but be cool about it - life is a race, but it is not at the same time;


Pink Floyd's "Time" is as timeless as ever.