Python NLP spaCy: improve bi-gram extraction from a dataframe, and with named entities?

I am using Python and spaCy as my NLP library, working on a big dataframe that contains feedback about different cars, which looks like this:

‘feedback’ column contains natural language text to be processed,
‘lemmatized’ column contains lemmatized version of the feedback text,
‘entities’ column contains named entities extracted from the feedback text (I’ve trained the pipeline so that it will recognise car models and brands, labelling these as ‘CAR_BRAND’ and ‘CAR_MODEL’)

I then created the following function, which applies the spaCy pipeline (nlp_token) to each row of my dataframe and extracts any [noun + verb], [verb + noun], [adj + noun], [adj + proper noun] combinations.

def bi_gram(x):
    doc = nlp_token(x)
    result = []
    text = ''
    for i in range(len(doc)):
        j = i+1
        if j < len(doc):
            if (doc[i].pos_ == "NOUN" and doc[j].pos_ == "VERB") or (doc[i].pos_ == "VERB" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "PROPN"):
                text = doc[i].text + " " + doc[j].text
                result.append(text)
        i = i+1
        return result

Then I applied this function to ‘lemmatized’ column:

df['bi_gram'] = df['lemmatized'].apply(bi_gram)

This is where I have a problem…

This is producing only one bigram per row maximum. How can I tweak the code so that more than one bigram can be extracted and put in a column? (Also are there more linguistic combinations I should try?)
Is there a possibility to find out what people are saying about ‘CAR_BRAND’ and ‘CAR_MODEL’ named entities extracted in the ‘entities’ column? For example ‘Cool Porsche’ – Some brands or models are made of more than two words so it’s tricky to tackle.

I am very new to NLP. If there is a more efficient way to tackle this, any advice will be super helpful!
Many thanks for your help in advance.

Answers:

Method 1

spaCy has a built-in pattern matching engine that’s perfect for your application – it’s documented here and in a more extensive usage guide. It allows you to define patterns in a readable and easy-to-maintain way, as lists of dictionaries that define the properties of the tokens to be matched.

Set up the pattern matcher

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm") # or whatever model you choose

matcher = Matcher(nlp.vocab)

# your patterns
patterns = {
    "noun_verb": [{"POS": "NOUN"}, {"POS": "VERB"}],
    "verb_noun": [{"POS": "VERB"}, {"POS": "NOUN"}],
    "adj_noun": [{"POS": "ADJ"}, {"POS": "NOUN"}],
    "adj_propn": [{"POS": "ADJ"}, {"POS": "PROPN"}],
}

# add the patterns to the matcher
for pattern_name, pattern in patterns.items():
    matcher.add(pattern_name, [pattern])
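For the second part of the question (what people say about CAR_BRAND and CAR_MODEL), token patterns can also test the entity type. The following is a minimal sketch, assuming spaCy v3 and the custom labels from the question. The example sentence is made up, and the Doc is tagged by hand only so that the snippet runs without the custom-trained pipeline; with your own pipeline you would call it on the feedback text instead.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer-only pipeline, used here just for its vocab

entity_matcher = Matcher(nlp.vocab)

# an adjective followed by one or more tokens that belong to a car entity
entity_pattern = [
    {"POS": "ADJ"},
    {"ENT_TYPE": {"IN": ["CAR_BRAND", "CAR_MODEL"]}, "OP": "+"},
]

# greedy="LONGEST" keeps the longest match and drops shorter overlapping ones
entity_matcher.add("adj_car", [entity_pattern], greedy="LONGEST")

# hand-annotated stand-in (assumption) for the output of a trained pipeline
doc = Doc(
    nlp.vocab,
    words=["I", "love", "my", "cool", "Land", "Rover", "."],
    pos=["PRON", "VERB", "PRON", "ADJ", "PROPN", "PROPN", "PUNCT"],
    ents=["O", "O", "O", "O", "B-CAR_BRAND", "I-CAR_BRAND", "O"],
)

phrases = [doc[start:end].text for _, start, end in entity_matcher(doc)]
print(phrases)
```

Because the pattern repeats the entity token with "OP": "+", a brand or model name made of several tokens is captured as a whole, provided the entity recogniser has labelled all of its tokens.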

Extract matches

doc = nlp("The dog chased cats. Fast cats usually escape dogs.")
matches = matcher(doc)

matches is a list of tuples containing

a match id,
the start index of the matched bit and
the end index (exclusive).

This is a test output adapted from the spaCy usage guide:

for match_id, start, end in matches:
    
    # Get string representation
    string_id = nlp.vocab.strings[match_id]

    # The matched span
    span = doc[start:end]
    
    print(repr(span.text))
    print(match_id, string_id, start, end)
    print()

Result

'dog chased'
1211260348777212867 noun_verb 1 3

'chased cats'
8748318984383740835 verb_noun 2 4

'Fast cats'
2526562708749592420 adj_noun 5 7

'escape dogs'
8748318984383740835 verb_noun 8 10
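To get every bigram of a row into one dataframe cell, collect all matched spans in a list and return that list only after the loop has finished. In the function from the question, the return statement is indented inside the for loop, so the function exits during the first iteration, which is why at most one bigram comes back. Below is a minimal sketch, assuming spaCy v3; extract_bigrams is a hypothetical helper name, and the Doc is tagged by hand so the snippet runs without a downloaded model.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer-only pipeline, used here just for its vocab

matcher = Matcher(nlp.vocab)
matcher.add("noun_verb", [[{"POS": "NOUN"}, {"POS": "VERB"}]])
matcher.add("verb_noun", [[{"POS": "VERB"}, {"POS": "NOUN"}]])
matcher.add("adj_noun", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])
matcher.add("adj_propn", [[{"POS": "ADJ"}, {"POS": "PROPN"}]])


def extract_bigrams(doc):
    """Return the text of every matched span, ordered by position in the doc."""
    matches = sorted(matcher(doc), key=lambda m: (m[1], m[2]))
    return [doc[start:end].text for _, start, end in matches]


# hand-tagged stand-in (assumption) for
# nlp("The dog chased cats. Fast cats usually escape dogs.") with a trained model
words = ["The", "dog", "chased", "cats", ".",
         "Fast", "cats", "usually", "escape", "dogs", "."]
pos = ["DET", "NOUN", "VERB", "NOUN", "PUNCT",
       "ADJ", "NOUN", "ADV", "VERB", "NOUN", "PUNCT"]
doc = Doc(nlp.vocab, words=words, pos=pos)

bigrams = extract_bigrams(doc)
print(bigrams)

# With a trained pipeline and the dataframe from the question, something like:
# df["bi_gram"] = [extract_bigrams(d) for d in nlp.pipe(df["feedback"])]
```

Running the matcher on the original feedback text rather than on the lemmatized column is probably the safer choice, since the tagger is trained on natural sentences. If you need lemmas in the result, doc[start:end].lemma_ can be used in place of .text.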

Some ideas for improvement

Named entity recognition should be able to detect multi-word expressions, so brand and/or model names that consist of more than one token shouldn’t be an issue if everything is set up correctly
Matching dependency patterns instead of linear patterns might slightly improve your results
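As a rough illustration of the dependency idea, the sketch below pairs each adjectival modifier with the noun it attaches to, using token.dep_ and token.head instead of adjacency. It assumes spaCy v3 and the "amod" label used by the English models. The sentence and its annotations are written by hand so the snippet runs without a parser; adjective_noun_pairs is a hypothetical helper name.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer-only pipeline, used here just for its vocab

# hand-annotated stand-in (assumption) for a parsed sentence;
# heads are absolute token indices, the root points at itself
doc = Doc(
    nlp.vocab,
    words=["The", "fast", "red", "car", "arrived"],
    pos=["DET", "ADJ", "ADJ", "NOUN", "VERB"],
    heads=[3, 3, 3, 4, 4],
    deps=["det", "amod", "amod", "nsubj", "ROOT"],
)


def adjective_noun_pairs(doc):
    """Collect (adjective, noun) pairs linked by an amod dependency."""
    pairs = []
    for token in doc:
        if token.dep_ == "amod" and token.head.pos_ in {"NOUN", "PROPN"}:
            pairs.append((token.text, token.head.text))
    return pairs


pairs = adjective_noun_pairs(doc)
print(pairs)
```

A linear adj + noun pattern would only find "red car" in this sentence, whereas the dependency link also recovers "fast car", because "fast" modifies "car" without being next to it.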

That being said, what you’re trying to do, a kind of sentiment analysis, is quite a difficult task that’s normally tackled with machine learning approaches and heaps of training data. So don’t expect too much from simple heuristics.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
