Why word vectors
A standard Word2Vec illustration from a seminal paper (I will do a paper review on the topic some time in the future)
Using some form of word vectors / embeddings is nowadays considered one of the standard approaches for practical NLP tasks.
In my opinion, the two most useful approaches in practice are:
- Using fastText vectors (notice there is a Python version now), especially for "complicated" languages with cases;
- Using end-to-end CNN-based approaches, and maybe using the CNN's embeddings to get vectors (e.g. consider this transformer tutorial);
In this article we will just assume that you have your word vectors and you need to use them for some downstream task.
So what is the fuss all about?
Imagine that you have a reasonably large dataset, for example with 1-10m queries / phrases / sentences. Ideally, to provide value for business / applied tasks, you need to:
- Investigate the data structure;
- Build a streamlined annotation process (data exploration => annotation => models => distillation => even more annotation => better models etc);
- Be able to match internal datasets with external ones;
In practice, matching usually implies calculating some kind of distance, in the most simplistic case cosine distance. Well, 1-10m * 1-10m equals a LOT of operations.
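To make that cost concrete, here is a minimal sketch of the brute-force pairwise cosine distance matrix (the sizes and variable names below are illustrative assumptions, not from the original post):

```python
import numpy as np

# Toy illustration of the brute-force approach: an N x N cosine
# distance matrix costs O(N^2 * d) operations. Sizes here are tiny;
# with N around 10^6 the same computation becomes intractable.
rng = np.random.default_rng(42)
N, d = 1000, 300                  # pretend these are 300-dim word vectors
X = rng.normal(size=(N, d))

# Normalize rows; then 1 - Xn @ Xn.T gives all pairwise cosine distances
norms = np.linalg.norm(X, axis=1, keepdims=True)
Xn = X / norms
D = 1.0 - Xn @ Xn.T               # shape (N, N)

print(D.shape)                    # (1000, 1000)
```

Even at this toy scale the matrix has a million entries; at 10^6 phrases it would have 10^12, which is why the tricks below matter.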
In real applied tasks you may have little to no annotation. Hence you will have to be creative and do something of this sort:
- Find some annotation;
- Enrich it with some external sources;
- Manually check / correct it;
And the problem is - calculating 10^6 * 10^6 distances is very slow.
In a nutshell - understand the internal structure of your data and exploit some optimization tricks:
- 4x speed-up - cluster your data and dispose of the rubbish clusters;
- 10x speed-up - use some form of multiprocessing / multiple cores for calculation (or maybe do them in batch form on the GPU);
- 5x speed-up - use numba to JIT-compile your distance function to machine code;
- 10x speed up - if you do even more proper clustering, you can dispose of very distant clusters that are most likely not relevant;
In one applied case I managed to speed up my calculation from 27 days to 4-5 hours.
So, using UMAP + HDBSCAN you can remove ~50% of rubbish data. But you can also cluster your dataset into a reasonable number of clusters (notice - you do not set the number of clusters as a parameter of HDBSCAN, which is arguably its best advantage) - 5k-10k. Then you can just calculate distances between your target phrases and these clusters and throw away the distant ones.
In plain terms - 10^3 x 10^6 << 10^6 x 10^6. Also, cosine distance at such a large scale is usually normally distributed, so you can get away with keeping only the pairs whose distance is smaller than mean minus one standard deviation.
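The mean-minus-one-standard-deviation cut can be sketched as follows (the distance matrix here is simulated with a normal distribution; in practice it would hold your phrase-to-cluster cosine distances):

```python
import numpy as np

# Simulated phrase-to-cluster cosine distance matrix
# (10k phrases x 100 cluster centroids), roughly normal as the
# post observes for large-scale cosine distances.
rng = np.random.default_rng(1)
dist = rng.normal(loc=0.8, scale=0.1, size=(10_000, 100)).clip(0, 2)

# Keep only pairs closer than (mean - one standard deviation)
threshold = dist.mean() - dist.std()
keep_mask = dist < threshold
print(keep_mask.mean())  # fraction of pairs kept, roughly 0.16
```

For a normal distribution this keeps about 16% of the pairs, so the follow-up exact computation runs on a small fraction of the original candidate set.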
Numba snippet that worked for me
```python
import numpy as np
import numba

# Note: the original snippet used @numba.jit(target='cpu', ...), a
# kwarg removed in recent numba versions, and m = u.shape (a tuple)
# instead of u.shape[0]; both are fixed here.
@numba.jit(nopython=True)
def fast_cosine(u, v):
    m = u.shape[0]
    udotv = 0.0
    u_norm = 0.0
    v_norm = 0.0
    for i in range(m):
        # Skip dimensions where either vector has a NaN entry
        if np.isnan(u[i]) or np.isnan(v[i]):
            continue
        udotv += u[i] * v[i]
        u_norm += u[i] * u[i]
        v_norm += v[i] * v[i]
    u_norm = np.sqrt(u_norm)
    v_norm = np.sqrt(v_norm)
    if (u_norm == 0) or (v_norm == 0):
        ratio = 1.0
    else:
        ratio = udotv / (u_norm * v_norm)
    return 1 - ratio
```
borrowed from here