Similarity between multiple vectors having the same length

Objective : Compute a similarity between two users on the basis of their skills

Approach : I trained a word2vec model using the gensim library on the set of skills obtained from job descriptions. The model seems to work well when queried with model.wv.most_similar.

Problem : The vocabulary of skills on which the model was trained doesn’t match the skills I currently have, so I replaced each current skill with the closest skill from the model’s vocabulary, measuring spelling similarity with SequenceMatcher from the difflib module. For example, “PyTorch” was among my current skills, but the model’s vocabulary only had “torch”. SequenceMatcher gave “torch” the highest similarity of all vocabulary entries, so I replaced “PyTorch” with “torch”, looked up its vector with model.wv["torch"], and stored it in a dictionary so that I won’t have to compute it again and again.

Function to compute the same :

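from difflib import SequenceMatcher

def new_to_old_embedding(skill_embeddings, new_skill, model):
    """ Computing embeddings for new skills from app by mapping new skills with old skills from model's vocabulary

    Returns:
        dict: Embeddings of new skills after mapping with old skills
    """
    # Vocabulary of the trained model; the attribute name index_to_key assumes gensim 4.x
    old_skills = model.wv.index_to_key
    if new_skill not in old_skills:
        thresh = 0.6  # minimum spelling similarity required for a replacement
        replaced_skill = ''
        for old_skill in old_skills:
            spell_sim = SequenceMatcher(None, old_skill, new_skill).ratio()
            if spell_sim > thresh:
                # Best match so far; later candidates must beat this score
                thresh = spell_sim
                replaced_skill = old_skill
        # Skip the skill when nothing clears the threshold, since looking up '' would fail
        if replaced_skill:
            skill_embeddings[new_skill] = model.wv[replaced_skill]
    else:
        skill_embeddings[new_skill] = model.wv[new_skill]
    return skill_embeddings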

Similarly, for each of my current skills, I found the nearest skill by spelling, computed its vector representation, and stored it in a Python dictionary.
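As a rough sketch of the spelling-based lookup described above, the standard library's difflib.get_close_matches wraps the same SequenceMatcher ratio and returns the best candidates above a cutoff. The helper name closest_vocab_skill, the sample vocabulary, and the choice to compare case-insensitively are assumptions made for illustration, not part of the original code.

```python
from difflib import get_close_matches


def closest_vocab_skill(new_skill, vocab, cutoff=0.6):
    """Return the vocabulary entry spelled most like new_skill, or None.

    `vocab` stands in for the model's vocabulary (hypothetical here).
    Comparison is case-insensitive so that "PyTorch" can match "torch".
    """
    # Map lowercased spellings back to the original vocabulary entries.
    lowered = {skill.lower(): skill for skill in vocab}
    matches = get_close_matches(new_skill.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


vocab = ["torch", "python", "opencv"]  # made-up vocabulary for illustration
print(closest_vocab_skill("PyTorch", vocab))     # torch
print(closest_vocab_skill("Kubernetes", vocab))  # None
```

Returning None when nothing clears the cutoff makes the "no good replacement" case explicit, so the caller can decide whether to skip the skill or handle it some other way.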

Now, if user1 has skills = ["OpenCV", "Python"] and user2 has skills = ["Machine Learning", "Deep Learning", "Python"], and I already have the vector representation of each skill stored in a dictionary, how can I compute the similarity between these two sets of skills?

In other words, I have to find a similarity between two matrices of dimensions (m, L) and (n, L),
where

m is the number of skills for user1
n is the number of skills for user2
L is the length of the vector representing a skill, which is fixed (300 in my case)

I did find this question, but since my problem is an NLP problem, I was not sure whether it would work.

Answers:

Method 1

One option would be to average the multiple vectors together for each set-of-skills, then compute the cosine-similarity between those average vectors.

The next version of Gensim will have a utility method on KeyedVectors that will let you supply a list of keys (words), and return the average of all those vectors. Until that’s released, you could use its source code as a model for your own calculations:

https://github.com/RaRe-Technologies/gensim/blob/97cef997032c3222645ebdc898c199a7b63e5395/gensim/models/keyedvectors.py#L462

There’s also a utility method to calculate the cosine-similarity between one vector and a list of others, KeyedVectors.cosine_similarities(), that you could use on those averages:

docs: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.cosine_similarities

source: https://github.com/RaRe-Technologies/gensim/blob/97cef997032c3222645ebdc898c199a7b63e5395/gensim/models/keyedvectors.py#L1147
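The averaging approach above can be sketched without any Gensim utilities. The following is a minimal pure-Python illustration, assuming the skill vectors are already stored in a dictionary as in the question; the function names and the 3-dimensional toy embeddings are invented for the example (real vectors would have L = 300, and with NumPy arrays np.mean(vectors, axis=0) gives the same average).

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def skill_set_similarity(skills_a, skills_b, skill_embeddings):
    """Cosine similarity between the averaged embeddings of two skill lists."""
    mean_a = mean_vector([skill_embeddings[s] for s in skills_a])
    mean_b = mean_vector([skill_embeddings[s] for s in skills_b])
    return cosine_similarity(mean_a, mean_b)


# Toy 3-dimensional embeddings, invented for illustration only.
skill_embeddings = {
    "OpenCV": [0.9, 0.1, 0.0],
    "Python": [0.5, 0.5, 0.0],
    "Machine Learning": [0.2, 0.8, 0.1],
    "Deep Learning": [0.3, 0.9, 0.2],
}
user1 = ["OpenCV", "Python"]
user2 = ["Machine Learning", "Deep Learning", "Python"]
print(skill_set_similarity(user1, user2, skill_embeddings))
```

Because each set is collapsed to a single vector first, the two users can have different numbers of skills (m and n) and the result is still one number between -1 and 1.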

But comparing sets-of-vectors by their average, while straightforward & common, is only one of many possible ways.

Another option is something called “Word Mover’s Distance” (WMD), which is more expensive to calculate (especially on larger sets), because it searches for the minimal set of changes needed to ‘shift’ one set-of-meanings to match the other. But the resulting distances (smaller for more-similar sets) can sometimes better capture what’s meaningful.

It’s available as a method on KeyedVectors: you supply two lists of keys (words) that should be in the set-of-KeyedVectors, and it returns the calculated distance:

docs: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.wmdistance
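To give a feel for the idea without an optimal-transport solver, here is a sketch of the "relaxed" variant of WMD, in which every vector simply moves to its nearest neighbour in the other set. This is a swapped-in simplification, not what Gensim's wmdistance computes: it only gives a lower bound on the true WMD, and it assumes every skill carries equal weight. The function name relaxed_wmd is hypothetical.

```python
import math


def relaxed_wmd(vectors_a, vectors_b):
    """Relaxed Word Mover's Distance with uniform weights.

    Each vector travels to its nearest neighbour (Euclidean distance) in the
    other set; the larger of the two one-way averages is returned. Smaller
    values mean more-similar sets. This is a lower bound on the full WMD.
    """
    def one_way(source, target):
        # Average distance from each source vector to its closest target vector.
        return sum(min(math.dist(s, t) for t in target) for s in source) / len(source)

    return max(one_way(vectors_a, vectors_b), one_way(vectors_b, vectors_a))


# Toy 2-dimensional vectors, invented for illustration only.
set_a = [[0.0, 0.0], [1.0, 0.0]]
set_b = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
print(relaxed_wmd(set_a, set_b))
```

Unlike the averaging approach, this keeps the individual skills separate, so one unmatched skill in either set pushes the distance up rather than being diluted in a mean.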

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
