UPDATE: I made a mistake, read about it here
I got a test case after a job interview. The company itself looked like a dream at first, I'd say. It's not a tech giant, nor is it an easy-going startup. Still, it's screwed up in some other ways (I'll leave my comment in the last part of the article for those interested). The article focuses mainly on their test case, because I found it strikingly beautiful and inspirational. I don't know yet if I've solved the case correctly.
Some of the projections
The dataset consists of two parts, train and test, but unlike the standard cases, this is not a supervised learning case: the train dataset isn't labelled. Instead, the task states that there are two huge clusters of genetic markers found in the genomes of people from two different continents, and that the two clusters are represented evenly throughout the train dataset. The goal is to find those clusters in the train dataset and then split the test dataset the same way.
- The dataset is not really huge (250k samples in each of the two parts), but it could be a challenge for unsupervised learning algorithms.
- I don't have a bioinformatics background, so I guess I may miss some important details in the preprocessing steps. I was in love with genetics as a kid, but that's it (you know, all those simplified things like inheritance of phenotypic traits, genetic diseases and blood types). I understand what DNA and RNA are and how proteins are encoded and decoded.
BTW, you can extract DNA on your own:
Part 1. Basic EDA and transformations
There are three columns describing genetic markers. After some manipulations I ended up not using them at all. Besides that, there are 100 columns of genetic markers (or their parts, I am not sure) and they look like this (each row is a person):
The marker parts repeat, and there are 1196 distinct ones in the whole dataset.
I made an assumption that it's not actually important where a marker sits in the row. What matters is how many times each part appears among a person's 100 columns. Such an assumption could be terribly wrong, but hey, it ended up looking just great.
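The counting idea can be sketched roughly like this. It is a minimal illustration, not the code from my notebook: the marker strings ("AC", "GT", ...) and the helper name count_encode are made up, and the real rows have 100 columns instead of 3.

```python
from collections import Counter

import numpy as np


def count_encode(rows, vocabulary=None):
    """Turn rows of marker parts into a (n_rows, n_vocab) count matrix.

    Column order inside a row is ignored; only the number of occurrences
    of each part is kept.
    """
    if vocabulary is None:
        # Build the vocabulary from the data (1196 parts in the real dataset).
        vocabulary = sorted({part for row in rows for part in row})
    index = {part: j for j, part in enumerate(vocabulary)}
    counts = np.zeros((len(rows), len(vocabulary)), dtype=np.int32)
    for i, row in enumerate(rows):
        for part, n in Counter(row).items():
            j = index.get(part)
            if j is not None:  # parts missing from the vocabulary are skipped
                counts[i, j] = n
    return counts, vocabulary


# Hypothetical toy rows standing in for the 100 marker columns.
train_rows = [["AC", "GT", "AC"], ["GT", "TT", "TT"]]
X_train, vocab = count_encode(train_rows)

# The test part reuses the train vocabulary so the columns line up.
test_rows = [["TT", "AC", "ZZ"]]
X_test, _ = count_encode(test_rows, vocabulary=vocab)
```

Reusing the train vocabulary for the test part keeps both matrices in the same feature space, which matters for every step that follows.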
For those who are interested in label encoding techniques, check out this article and also read some articles about embeddings.
Now it's time to visualize our dataset. Using PCA for dimensionality reduction helps us to visualize any dataset and possibly spot the structure.
With 10 components, PCA's cumulative explained variance ratio reaches 99%. The first 3 components account for 83% of the variance, so we will look at a 3D image and 2D projections.
2D projections of the train dataset.
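For reference, the explained variance check could look roughly like the sketch below. The data here is a synthetic random count matrix (an assumption, just to keep the snippet runnable), so the printed numbers will not match the 99% and 83% I got on the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the count matrix (real one: 250k rows x 1196 parts).
rng = np.random.default_rng(0)
X = rng.poisson(lam=1.0, size=(500, 40)).astype(float)

# Keep 10 components, as in the text.
pca = PCA(n_components=10, random_state=0)
X_pca = pca.fit_transform(X)

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("first 3 components:", round(float(cumulative[2]), 3))
print("all 10 components:", round(float(cumulative[-1]), 3))

# The first three columns of X_pca are what goes into the 3D image
# and the pairwise 2D projections.
first_three = X_pca[:, :3]
```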
After some time spent staring at the projections and building a model out of bamboo sticks and strings, I realized that the structure is just a tetrahedron with bent edges.
I transformed and visualized the test dataset to make sure its distribution was similar to that of the train dataset.
2D projections of the test dataset.
I've tried a bunch of techniques. I thought UMAP would do well, but apparently, the dataset was too large for UMAP to handle.
HDBSCAN worked too well :) I mean, you can clearly see that there are 6 clusters (because a tetrahedron has 6 edges), and HDBSCAN finds them without hesitation. But I've been told to find 2 clusters, not 6. And when I tune the min_cluster_size parameter to look for clusters of at least 100k samples, HDBSCAN gives up. So if you have a dataset with a complex structure and the number of clusters is unknown, use HDBSCAN.
HDBSCAN clusters. Red dots are noise.
Since people's genomes are still somewhat isolated even inside relatively small areas like countries and regions (and there are many social factors influencing such isolation), it is reasonable to assume hierarchical clustering should work well. Well, it did. On a small dataset (~50k).
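A rough sketch of the "small dataset" trick: draw a random subsample and run agglomerative clustering only on that. Everything specific here is an assumption for illustration: the two synthetic blobs, the Ward linkage and the sample size (the real subsample was about 50k rows).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the PCA-reduced train set: two well-separated blobs.
rng = np.random.default_rng(0)
X_pca = np.vstack([
    rng.normal(-3.0, 1.0, size=(1000, 3)),
    rng.normal(3.0, 1.0, size=(1000, 3)),
])

# Agglomerative clustering gets expensive in memory as the number of rows
# grows, so only a random subsample is clustered.
sample_size = 300
sample_idx = rng.choice(len(X_pca), size=sample_size, replace=False)
X_sample = X_pca[sample_idx]

# Linkage choice is an assumption; Ward is the sklearn default.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
sample_labels = agg.fit_predict(X_sample)
print(np.bincount(sample_labels))
```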
And then I tried k-means. And it worked surprisingly well. It sliced the tetrahedron as expected, leaving two huge opposite edges in two different clusters.
K-Means clustering result. 2D projections of the first 3 components.
I felt like a kid, playing with the projections of the other components. They all look like 4 vertices connected by 6 flexible edges, with the edges pulled by some forces in different directions. I also tried to recreate what I saw in the projections with my poor model made of bamboo sticks and strings, twisting it in all possible ways. Yeah, I am definitely still a kid.
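The k-means step itself is short. A minimal sketch, again on synthetic blobs rather than the real tetrahedron, with n_init and random_state picked only to make the run reproducible:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the PCA-reduced train set.
rng = np.random.default_rng(0)
X_pca = np.vstack([
    rng.normal(-3.0, 1.0, size=(200, 3)),
    rng.normal(3.0, 1.0, size=(200, 3)),
])

# Ask for exactly 2 clusters, as the task demands.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
train_labels = kmeans.fit_predict(X_pca)

# The task says the two clusters are represented evenly,
# so the cluster sizes are a cheap sanity check.
sizes = np.bincount(train_labels, minlength=2)
print("cluster sizes:", sizes)
```

If the two sizes came out very different on the real train set, that would be a hint that the split does not match the "evenly represented" statement from the task.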
Beautiful, isn't it?
Yeah, definitely beautiful
Training a model and prediction
So, after clustering we can use the cluster labels for training a model. Since I am lazy, I just use the gradient boosting classifier from sklearn. It does a great job anyway (I could train a neural network using PyTorch, but it would be overkill, right?).
Results are pretty decent (but remember that I don't have ground truth to compare to, only clustering labels):
Given a trained model, prediction is easy. Nothing to discuss, actually.
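The whole train-and-predict step can be sketched as follows. This is an illustration on synthetic data with default hyperparameters, not the exact notebook code. The accuracy it prints only measures agreement with the k-means labels, not with any real ground truth.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the transformed train and test parts.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(-3.0, 1.0, size=(200, 3)),
    rng.normal(3.0, 1.0, size=(200, 3)),
])
X_test = np.vstack([
    rng.normal(-3.0, 1.0, size=(50, 3)),
    rng.normal(3.0, 1.0, size=(50, 3)),
])

# Cluster labels play the role of targets.
pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_train)

# Hold out part of the train set to check the classifier against the labels.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, pseudo_labels, test_size=0.25, stratify=pseudo_labels, random_state=0
)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_fit, y_fit)
val_accuracy = accuracy_score(y_val, clf.predict(X_val))
print("agreement with cluster labels:", val_accuracy)

# Split the test dataset with the trained model.
test_pred = clf.predict(X_test)
```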
A Jupyter notebook is on GitHub for those curious enough.
A comment I promised
Well, I am not a kid to be naive about how the real world works. For a short time I worked in a company I didn't like much. On leaving I was given a piece of advice: to keep my opinions to myself, especially when no one asks for my opinion directly.
No one asks me for my opinion now. I still think that I am a human, a social being, not a silent worker meant to smile whenever some HR person passes by. I have an opinion, and I might find someone who thinks the same way I do. I see something that makes me sad, and I want to speak about it.
So here is the comment about the company that gave me the test case I described above (not the one where I worked):
They offered me a low salary (half of what I wanted and 30% less than the minimum I could work for). They explained it by saying that a bioinformatics scientist's salary is much lower in Russia than in other fields. So I went to Glassdoor to find some information. Well, we are not in the USA, but I still don't get why the gap between a common data scientist and a bioinformatics scientist is so huge in Russia. Probably there are passionate people who just don't understand their value and work for the joy, not for the salary. That is sad, actually.
They also said that it is not easy to get into the bioinformatics field from other fields, so the company invests a lot of its time and other resources in newcomers, and that's why they offer a low salary at the start. Well, as far as I know, if you are undervalued from the beginning, it's a road to nowhere. You end up losing your money and time.
So, the dream is just another cruel reality. No doubt. So sad.