I have the following code, which computes the cosine similarity between the descriptions of TV shows and movies:
for i, row in df.iterrows():
    doc = nlp(row['description'])
    similarities[i] = {}
    # print(row['title'])
    for j, row2 in df.iterrows():
        doc2 = nlp(row2['description'])
        # print(f"{row['title']} x {row2['title']}: {doc.similarity(doc2):.10f}")
        similarities[i][j] = doc.similarity(doc2)
I’ve also written this function, which takes two titles as arguments and returns their similarity:
def lookup(title1, title2):
    return similarities[lookup_by_title(title1)][lookup_by_title(title2)]
My issue is that the dataframe I loop through has 4884 rows, so I end up with about 23.8 million computations (4884 × 4884). I’m wondering what the best way is to run the computations once and save the results somewhere efficiently.
Answers:
Method 1
After you calculate similarities the first time, you can dump it to a local file. On later runs, instead of repeating the computations, just load similarities from that file.
You can use pickle for this; see a nice tutorial here.
I’m copying the samples in case the webpage becomes unavailable in the future. In your case, of course, you need to replace config_dictionary with similarities:
Dump:
# Step 1
import pickle
config_dictionary = {'remote_hostname': 'google.com', 'remote_port': 80}
# Step 2
with open('config.dictionary', 'wb') as config_dictionary_file:
    # Step 3
    pickle.dump(config_dictionary, config_dictionary_file)
Load:
# Step 1
import pickle
# Step 2
with open('config.dictionary', 'rb') as config_dictionary_file:
    # Step 3
    config_dictionary = pickle.load(config_dictionary_file)
    # After config_dictionary is read from file
    print(config_dictionary)
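Putting the dump and load steps together for your case, here is a minimal sketch of a "compute once, then reuse" pattern. The helper name load_or_compute, the file name similarities.pickle, and the stand-in compute_similarities function are all assumptions made for illustration; in your code, compute_similarities would wrap the nested loop from the question and return the similarities dict.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the object cached at `path`, or call `compute()` and cache its result."""
    if os.path.exists(path):
        # Cache file exists: skip the expensive computation entirely.
        with open(path, 'rb') as cache_file:
            return pickle.load(cache_file)
    result = compute()
    with open(path, 'wb') as cache_file:
        pickle.dump(result, cache_file)
    return result


# Hypothetical stand-in for the expensive nested loop over the dataframe.
calls = []


def compute_similarities():
    calls.append(1)  # record each real computation
    return {0: {0: 1.0, 1: 0.5}, 1: {0: 0.5, 1: 1.0}}


# A temporary directory is used here only so the sketch leaves no files behind.
cache_path = os.path.join(tempfile.mkdtemp(), 'similarities.pickle')

first = load_or_compute(cache_path, compute_similarities)   # computes and writes the file
second = load_or_compute(cache_path, compute_similarities)  # reads the file instead
```

If the dataframe or the descriptions change, delete the cache file so the similarities are recomputed. Also note that pickle files should only be loaded from sources you trust.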
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.