Here is my code:
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(string, n=4):
    string = re.sub(r'[,-./]|sBD',r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    R = [''.join(ngram) for ngram in ngrams]
    if len(R) == 0:
        return string
    else:
        return R

L = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']
vectorizer = TfidfVectorizer(min_df = 0, token_pattern='(?u)\b\w+\b', analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(L)
print(vectorizer.vocabulary_)
The output of vocabulary is {'a': 0}.
I am confused about where "aa" and "aaa" went. As you can see in my ngrams function, I return the string itself if its length is less than the parameter n (which is 4 in the code above).
The token regex is also written so that it accepts single characters.
Answers:
Method 1
This is a theory.
I believe TfidfVectorizer expects the analyzer function to return a sequence of tokens. Notice the inputs vs. outputs of your ngrams function:
'a' -> 'a'
'aa' -> 'aa'
'aaa' -> 'aaa'
'aaaa' -> ['aaaa']
'aaaaa' -> ['aaaa','aaaa']
A string is itself a sequence of characters, so in the first three cases you are returning a sequence that consists of repeats of the single letter 'a', and each character ends up being treated as its own token.
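To illustrate the point about strings being sequences, here is a small sketch of what a consumer sees when it iterates over the analyzer's return value. Note that count_tokens is a hypothetical stand-in for the vectorizer's counting step, not scikit-learn's actual code; it only assumes that the return value is iterated token by token.

```python
from collections import Counter

def count_tokens(analyzer_output):
    # Hypothetical stand-in: iterate over whatever the analyzer
    # returned and count each item as one token.
    return Counter(analyzer_output)

# A bare string is iterated character by character.
print(count_tokens('aaa'))    # Counter({'a': 3})

# A one-element list keeps the whole string as a single token.
print(count_tokens(['aaa']))  # Counter({'aaa': 1})
```

If the vectorizer behaves like this, 'a', 'aa' and 'aaa' would all contribute only the token 'a', which matches the vocabulary in the question.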
If my theory is correct, you need to replace
return string
with
return [string]
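Putting that change into the full function might look like the sketch below. It keeps the regex from the question unchanged. The build_vocabulary helper is hypothetical: it only mimics, under the assumption stated above, how a vectorizer might collect tokens and number them in sorted order, so the check runs without scikit-learn.

```python
import re

def ngrams(string, n=4):
    # Same cleanup regex as in the question, left unchanged.
    string = re.sub(r'[,-./]|sBD', r'', string)
    grams = [''.join(gram) for gram in zip(*[string[i:] for i in range(n)])]
    # Always return a list, so a short string stays one token
    # instead of being split into its characters.
    return grams if grams else [string]

def build_vocabulary(docs, analyzer):
    # Hypothetical helper: gather every token the analyzer yields
    # and assign indices in sorted order.
    tokens = {tok for doc in docs for tok in analyzer(doc)}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}

L = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']
print(build_vocabulary(L, ngrams))  # {'a': 0, 'aa': 1, 'aaa': 2}
```

With this version passed as analyzer=ngrams, I would expect vectorizer.vocabulary_ to contain 'a', 'aa' and 'aaa' as separate entries, though I have not confirmed that against a specific scikit-learn version.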
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.