Analyzer ignoring certain words when used in Sklearn Tfidf

Here is my code:

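import re
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(string, n=4):
    # Remove ',', '-', '.', '/' and the literal 'sBD' before building n-grams.
    string = re.sub(r'[,-./]|sBD',r'', string)
    # Slide a window of n characters across the string.
    ngrams = zip(*[string[i:] for i in range(n)])
    R = [''.join(ngram) for ngram in ngrams]
    if len(R) == 0:
        return string
    else:
        return R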

L = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']

# Raw string so that \b means a word boundary rather than a backspace character.
vectorizer = TfidfVectorizer(min_df = 0, token_pattern=r'(?u)\b\w+\b', analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(L)

print(vectorizer.vocabulary_)

The output of vocabulary is {'a': 0}.

I am confused about where "aa" and "aaa" went. As you can see in my ngrams function, I return the string itself if its length is less than the parameter n (which is 4 in the code above).

The token regex is also written to accept single characters.

Answers:


Method 1

This is a theory.

I believe TfidfVectorizer expects the analyzer function to return a sequence of tokens. Notice the inputs vs. outputs of your ngrams function:

'a'  -> 'a'
'aa' -> 'aa'
'aaa' -> 'aaa'
'aaaa' -> ['aaaa']
'aaaaa' -> ['aaaa','aaaa']

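A string is itself a sequence of characters, so in the first three cases you are returning a sequence that consists of repeats of the single letter 'a'. That would make 'a' the only token the vectorizer ever sees for those inputs.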
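To make the difference concrete, here is a small plain-Python sketch. It does not use sklearn; count_tokens is a hypothetical stand-in that assumes the vectorizer simply iterates over whatever the analyzer returns and tallies each item.

```python
from collections import Counter

# Iterating over a string yields its characters one at a time,
# while iterating over a one-element list yields the whole string.
print(list('aaa'))    # ['a', 'a', 'a']
print(list(['aaa']))  # ['aaa']

def count_tokens(analyzed):
    """Hypothetical stand-in for the counting step: tally each item of the iterable."""
    return Counter(analyzed)

print(count_tokens('aaa'))    # Counter({'a': 3})
print(count_tokens(['aaa']))  # Counter({'aaa': 1})
```

Under that assumption, a bare string 'aaa' contributes three 'a' tokens, whereas ['aaa'] contributes the single token 'aaa'.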

If my theory is correct, you need to replace

        return string

with

        return [string]
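If the theory holds, the change can be checked without sklearn. The sketch below is only an illustration: ngrams_fixed is the question's function with the one-line change applied, and build_vocabulary is a hypothetical helper that assumes the vocabulary is just the sorted set of tokens produced by the analyzer.

```python
import re

def ngrams_fixed(string, n=4):
    # Same cleanup as in the question.
    string = re.sub(r'[,-./]|sBD', r'', string)
    grams = [''.join(g) for g in zip(*[string[i:] for i in range(n)])]
    # Always return a list, so a short string stays one whole token.
    return grams if grams else [string]

def build_vocabulary(docs, analyzer):
    """Hypothetical helper: map each distinct token to an index in sorted order."""
    tokens = sorted({tok for doc in docs for tok in analyzer(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

L = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']
print(build_vocabulary(L, ngrams_fixed))  # {'a': 0, 'aa': 1, 'aaa': 2}
```

Passing analyzer=ngrams_fixed to TfidfVectorizer in place of ngrams should then let it see 'aa' and 'aaa' as separate tokens.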


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
