What is a good way to save high-dimensional data so the computation doesn’t rerun every time?

I have the following code, which computes the cosine similarity between the descriptions of TV shows and movies:

    for i, row in df.iterrows():
        doc = nlp(row['description'])
        similarities[i] = {}
        # print(row['title'])
        for j, row2 in df.iterrows():
            doc2 = nlp(row2['description'])
            #print(f"{row['title']} x {row2['title']}: {doc.similarity(doc2):.10f}")
            similarities[i][j] = doc.similarity(doc2)

I’ve also written this function, which takes two titles as arguments and returns their similarity:

    def lookup(title1, title2):
        return similarities[lookup_by_title(title1)][lookup_by_title(title2)]

My issue is that the dataframe I loop through has 4884 rows, so I end up with about 23.8 million computations. I’m wondering what the best way is to run the computations once and save the results somewhere efficiently.

Answers:

Method 1

After you calculate similarities the first time, you can dump it to a local file. On later runs, instead of redoing the computations, just load similarities from that file.

You can use pickle for this; see a nice tutorial here.

I’m copying the samples in case the webpage becomes unavailable in the future. In your case, of course, you need to replace config_dictionary with similarities:

Dump:

    # Step 1
    import pickle

    config_dictionary = {'remote_hostname': 'google.com', 'remote_port': 80}

    # Step 2
    with open('config.dictionary', 'wb') as config_dictionary_file:

        # Step 3
        pickle.dump(config_dictionary, config_dictionary_file)

Load:

    # Step 1
    import pickle

    # Step 2
    with open('config.dictionary', 'rb') as config_dictionary_file:

        # Step 3
        config_dictionary = pickle.load(config_dictionary_file)

        # After config_dictionary is read from file
        print(config_dictionary)
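
Applied to the question, the dump and load steps can be combined into a single "load if cached, otherwise compute and save" helper. This is only a minimal sketch: the helper name `load_or_compute`, the file name `similarities.pickle`, and the toy `compute_similarities` function are assumptions made for illustration; in practice `compute_similarities` would hold the nested loop over the dataframe.

```python
import os
import pickle


def load_or_compute(cache_path, compute_fn):
    """Return the pickled result at cache_path, computing and saving it if missing."""
    if os.path.exists(cache_path):
        # Cache hit: skip the expensive computation entirely.
        with open(cache_path, "rb") as cache_file:
            return pickle.load(cache_file)

    # Cache miss: run the computation once and save it for next time.
    result = compute_fn()
    with open(cache_path, "wb") as cache_file:
        pickle.dump(result, cache_file)
    return result


def compute_similarities():
    # Hypothetical stand-in for the expensive nested loop over df;
    # the values here are toy data, not real similarity scores.
    return {0: {0: 1.0, 1: 0.5}, 1: {0: 0.5, 1: 1.0}}
```

Calling `similarities = load_or_compute("similarities.pickle", compute_similarities)` runs the computation only when the file does not exist yet. If the dataframe changes, delete the file so the results are recomputed. Only unpickle files you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.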


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
