# Speeding up word distance calculation 100x

## My plain approach to faster and more practical cosine distance calculation

Posted by snakers41 on August 12, 2018

### Why word vectors

A standard Word2Vec illustration from a seminal paper (I will do a paper review on the topic some time in the future).

Using some form of word vectors / embeddings is nowadays considered one of the standard approaches for practical NLP-related tasks.

In my opinion, the two most useful approaches in practice are:

• Using fastText vectors (notice there is a Python version now), especially for "complicated" languages with cases;
• Using end-to-end CNN-based approaches, and possibly taking the CNN's embeddings as vectors (e.g. consider this transformer tutorial);

In this article we will just assume that you have your word vectors and you need to use them for some downstream task.

### So what is the fuss all about?

Imagine that you have a reasonably large dataset, for example with 1-10m queries / phrases / sentences. Ideally, to provide value for business / applied tasks, you need to:

• Investigate the data structure;
• Build a streamlined annotation process (data exploration => annotation => models => distillation => even more annotation => better models etc);
• Be able to match internal datasets with external ones;

In practice, matching usually implies calculating some kind of distance, in the most simplistic case e.g. cosine distance. Well, 1-10m * 1-10m equals a LOT of operations.

In real applied tasks you may have little to no annotation. Hence you will have to be creative and do something of this sort:

• Find some annotation;
• Enrich it with some external sources;
• Manually check / correct it;
• Repeat;

And the problem is that calculating 10^6 * 10^6 distances is very slow.

### The solution

In a nutshell - understand the internal structure of your data and exploit some optimization tricks:

1. 4x speed-up - cluster your data and dispose of the rubbish clusters;
2. 10x speed-up - use some form of multiprocessing / multiple cores for calculation (or maybe do them in batch form on the GPU);
3. 5x speed-up - use numba to compile your distance function into C;
4. 10x speed up - if you do even more proper clustering, you can dispose of very distant clusters that are most likely not relevant;

In one applied case I managed to speed up my calculation from 27 days to 4-5 hours.
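For the batch-on-GPU point above, the simplest batch formulation of cosine distance is a single matrix multiplication over L2-normalized rows; a minimal numpy sketch (the names and shapes here are my own illustration, not from the original setup):

```python
import numpy as np

def batch_cosine_distances(A, B):
    """Cosine distances between every row of A and every row of B.

    One matrix multiplication replaces a Python double loop; the same
    expression runs unchanged on the GPU with cupy or torch tensors.
    """
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A_n @ B_n.T

# toy example: 1000 "phrases" vs 100 "cluster centroids", 300-dim vectors
rng = np.random.default_rng(42)
phrases = rng.random((1000, 300))
centroids = rng.random((100, 300))
d = batch_cosine_distances(phrases, centroids)  # shape (1000, 100)
```

The same per-chunk call also parallelizes trivially across cores, since each chunk of rows is independent.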

#### Clustering

In my view, UMAP + HDBSCAN work reasonably well for word vectors. UMAP even has a cosine distance option (though everything is not 100% smooth there).

Also, HDBSCAN has a great perk: it produces a -1 "rubbish" cluster. In practice you may end up with something like this, which has reasonably good local structure.

So, using UMAP + HDBSCAN you can remove ~50% of rubbish data. But you can also cluster your dataset into a reasonable number of clusters, say 5k-10k (notice that you do not set the number of clusters as a parameter of HDBSCAN, which is arguably its best advantage). Then you can just calculate distances between your target phrases and these clusters and throw away the distant ones.
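As a concrete illustration of dropping the -1 "rubbish" cluster (a sketch with made-up labels; HDBSCAN itself returns such an array as its `labels_` attribute):

```python
import numpy as np

# labels as HDBSCAN's .labels_ would return them; -1 marks "rubbish" points
labels = np.array([0, 1, -1, 2, -1, 1, 0])
vectors = np.arange(len(labels) * 3, dtype=float).reshape(len(labels), 3)

keep = labels != -1
clean_vectors = vectors[keep]   # only points assigned to a real cluster
clean_labels = labels[keep]
```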

In plain terms: 10^3 x 10^6 << 10^6 x 10^6. Also, cosine distance at such a large scale is usually normally distributed, so you can get away with keeping only the relations whose distance is smaller than the mean minus one standard deviation.
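The mean-minus-one-standard-deviation cut-off takes a couple of numpy lines; a sketch, where `d` is a stand-in for the phrase-to-cluster distance matrix (the actual distances would come from your own data):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for a roughly normally distributed phrase-to-cluster
# cosine distance matrix
d = rng.normal(loc=0.6, scale=0.1, size=(1000, 100)).clip(0.0, 2.0)

threshold = d.mean() - d.std()
# (phrase_idx, cluster_idx) pairs close enough to keep
pairs = np.argwhere(d < threshold)
```

For a normal distribution this keeps roughly the closest ~16% of pairs, which is where the final ~10x reduction comes from.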

#### Numba snippet that worked for me

```python
import numba
import numpy as np

@numba.jit(nopython=True)
def fast_cosine(u, v):
    m = u.shape[0]
    udotv = 0.0
    u_norm = 0.0
    v_norm = 0.0
    for i in range(m):
        # skip dimensions where either vector is NaN
        if np.isnan(u[i]) or np.isnan(v[i]):
            continue

        udotv += u[i] * v[i]
        u_norm += u[i] * u[i]
        v_norm += v[i] * v[i]

    u_norm = np.sqrt(u_norm)
    v_norm = np.sqrt(v_norm)

    if (u_norm == 0) or (v_norm == 0):
        ratio = 1.0
    else:
        ratio = udotv / (u_norm * v_norm)
    return 1 - ratio
```

borrowed from here
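For sanity-checking the compiled function, here is the same NaN-skipping logic in plain numpy (a reference sketch of my own, not from the original post; note that the snippet above returns a distance of 0 for zero-norm vectors, since `ratio` is set to 1.0):

```python
import numpy as np

def cosine_distance_nan(u, v):
    """NaN-aware cosine distance mirroring the numba snippet:
    dimensions where either vector is NaN are ignored."""
    mask = ~(np.isnan(u) | np.isnan(v))
    u_m, v_m = u[mask], v[mask]
    denom = np.linalg.norm(u_m) * np.linalg.norm(v_m)
    if denom == 0.0:
        return 0.0  # matches the snippet: ratio = 1.0 -> distance 0
    return 1.0 - float(u_m @ v_m) / denom

u = np.array([1.0, 0.0, np.nan])
v = np.array([1.0, 0.0, 2.0])
```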