**Why word vectors**

*A standard Word2Vec illustration from a seminal paper (I will do a paper review on the topic some time in the future)*

Using some form of word vectors / embeddings is nowadays considered one of the **standard approaches for practical NLP-related tasks**.

In my opinion, these two approaches are the most useful in practice:

- Using fasttext vectors (notice there is a Python version now), especially for "complicated" languages with cases;
- Using end-to-end CNN-based approaches and maybe using the CNN's embeddings to get vectors (e.g. consider this transformer tutorial);

In this article we will just assume that you have your word vectors and you need to use them for some downstream task.

### So what is the fuss all about?

Imagine that you have a reasonably large dataset, for example with 1-10m queries / phrases / sentences. Ideally, to provide value for business / applied tasks, you need to:

- Investigate the data structure;
- Build a streamlined annotation process (data exploration => annotation => models => distillation => even more annotation => better models etc);
- Be able to match internal datasets with external ones;

In practice, matching usually implies calculating some kind of distance - in the simplest case, cosine distance. Well, **1-10m * 1-10m** equals a LOT of operations.

In real applied tasks you may have little to no annotation. Hence you will have to be creative and do something of this sort:

- Find some annotation;
- Enrich it with some external sources;
- Manually check / correct it;
- Repeat;

And the problem is - calculating 10^6 * 10^6 distances is very slow.

**The solution**

In a nutshell - understand the internal structure of your data and exploit some optimization tricks:

- **4x speed-up** - cluster your data and dispose of the rubbish clusters;
- **10x speed-up** - use some form of multiprocessing / multiple cores for the calculations (or maybe do them in batch form on the GPU);
- **5x speed-up** - use **numba** to compile your distance function into fast machine code;
- **10x speed-up** - if you do even more proper clustering, you can dispose of very distant clusters that are most likely not relevant;
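The multiprocessing trick can be sketched roughly like this (the function names and chunking scheme are my own illustration, not the original code):

```python
import numpy as np
from multiprocessing import Pool


def chunk_distances(args):
    # Cosine distances between one chunk of query vectors and all centroids,
    # done as a single normalized matrix product
    chunk, centroids = args
    chunk_n = chunk / np.linalg.norm(chunk, axis=1, keepdims=True)
    cent_n = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return 1.0 - chunk_n @ cent_n.T


def parallel_distances(vectors, centroids, n_workers=4):
    # Split the query matrix row-wise and fan the chunks out to worker processes
    chunks = np.array_split(vectors, n_workers)
    with Pool(n_workers) as pool:
        parts = pool.map(chunk_distances, [(c, centroids) for c in chunks])
    return np.vstack(parts)
```

The same chunking works for a GPU batch version - just replace the worker with a matrix multiply on the device.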

In one applied case I managed to speed up my calculations **from 27 days to 4-5 hours**.

#### Clustering

In my view, UMAP + HDBSCAN work reasonably well for word vectors. UMAP even has a cosine distance option (though everything is not 100% smooth there).

Also, HDBSCAN has a great perk - it produces a -1 "rubbish" cluster. In practice you may end up with something like this, which has reasonably good local structure.

So, using UMAP + HDBSCAN you can remove ~50% of the rubbish data. But you can also cluster your dataset into a reasonable number of clusters - say 5k-10k (notice that you do not set the number of clusters as a parameter of HDBSCAN, which is arguably its best advantage). Then you can just calculate distances between your target phrases and these clusters and throw away the distant ones.

In plain terms - 10^3 x 10^6 << 10^6 x 10^6. Also, at such a large scale cosine distances are usually approximately normally distributed, so you can get away with keeping only the pairs whose distance is smaller than the mean minus one standard deviation.
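The mean-minus-std cut-off can be written as a tiny helper (purely illustrative):

```python
import numpy as np


def keep_close(distances):
    # Keep only pairs whose cosine distance is below mean - 1 std,
    # exploiting the roughly normal distribution of distances at scale
    threshold = distances.mean() - distances.std()
    return distances < threshold
```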

#### Numba snippet that worked for me

```python
import numba
import numpy as np


@numba.jit(nopython=True)  # note: the original used target='cpu', which recent numba versions no longer accept
def fast_cosine(u, v):
    # Cosine distance between two 1-D vectors, skipping NaN components
    m = u.shape[0]
    udotv = 0.0
    u_norm = 0.0
    v_norm = 0.0
    for i in range(m):
        if np.isnan(u[i]) or np.isnan(v[i]):
            continue
        udotv += u[i] * v[i]
        u_norm += u[i] * u[i]
        v_norm += v[i] * v[i]
    u_norm = np.sqrt(u_norm)
    v_norm = np.sqrt(v_norm)
    if (u_norm == 0) or (v_norm == 0):
        # Degenerate vectors get the maximum distance
        ratio = 1.0
    else:
        ratio = udotv / (u_norm * v_norm)
    return 1.0 - ratio
```

borrowed from here