I am trying to learn how to tag Spanish words using NLTK.
From the NLTK book, it is quite easy to tag English words using their example. Because I am new to NLTK and to language processing in general, I am quite confused about how to proceed.
I have downloaded the cess_esp corpus. Is there a way to specify a corpus in nltk.pos_tag? I looked at the pos_tag documentation and didn't see anything that suggested I could. I feel like I'm missing some key concepts. Do I have to manually tag the words in my text against the cess_esp corpus? (By manually I mean tokenize my sentence and run it against the corpus.) Or am I off the mark entirely? Thank you.
Answers:
Method 1
First you need to read the tagged sentences from a corpus. NLTK provides a uniform interface, so you need not bother with the different formats of the different corpora; you can simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml .
Then you have to choose a tagger and train it. There are fancier options, but you can start with the N-gram taggers.
Then you can use the tagger to tag the sentence you want. Here's some example code:
from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt
# Read the corpus into a list;
# each entry in the list is one tagged sentence.
cess_sents = cess.tagged_sents()
# Train the unigram tagger on the whole corpus.
uni_tag = ut(cess_sents)
sentence = "Hola , esta foo bar ."
# The tagger takes a list of tokens.
uni_tag.tag(sentence.split(" "))
# Split the corpus into training and testing sets.
train = int(len(cess_sents)*90/100) # first 90% for training
# Train a bigram tagger on the training data only.
bi_tag = bt(cess_sents[:train])
# Evaluate on the remaining 10%. Slicing from `train` (not train+1)
# keeps every held-out sentence. Recent NLTK releases deprecate
# evaluate() in favor of accuracy().
bi_tag.evaluate(cess_sents[train:])
# Use the tagger.
bi_tag.tag(sentence.split(" "))
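A bigram tagger trained on its own assigns None to any token whose context it has not seen, and every following token in that sentence tends to get None as well. A common remedy is to chain taggers with the backoff parameter. The following is a minimal sketch of that idea, assuming a tiny hand-made training set with made-up tags so that it runs without downloading cess_esp; in practice train_sents would be cess.tagged_sents().

```python
from nltk import BigramTagger, DefaultTagger, UnigramTagger

# Toy training data with hypothetical tags (illustration only).
train_sents = [
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB"), (".", "PUNCT")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ"), (".", "PUNCT")],
]

tokens = "el perro come .".split()

# Without backoff: the unseen word "perro" gets None, and so does what follows.
plain = BigramTagger(train_sents).tag(tokens)
print(plain)

# With backoff: bigram -> unigram -> default tag for unknown words.
default_tagger = DefaultTagger("NOUN")
uni_tagger = UnigramTagger(train_sents, backoff=default_tagger)
bi_tagger = BigramTagger(train_sents, backoff=uni_tagger)

tagged = bi_tagger.tag(tokens)
print(tagged)
# [('el', 'DET'), ('perro', 'NOUN'), ('come', 'VERB'), ('.', 'PUNCT')]
```

Choosing "NOUN" as the default is only an assumption for this toy tagset; with cess_esp you would pick one of its own tags.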
Training a tagger on a large corpus may take a significant time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger in a file for later re-use.
See the Storing Taggers section in http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html
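As a rough illustration of storing a tagger, the sketch below pickles a trained tagger and loads it back. The toy training data, its tags, and the file name toy_unigram.pickle are assumptions made for the example; with the real corpus you would pass cess.tagged_sents() instead.

```python
import os
import pickle
import tempfile

from nltk import UnigramTagger

# Toy training data with hypothetical tags (illustration only).
train_sents = [
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB"), (".", "PUNCT")],
]
tagger = UnigramTagger(train_sents)

# Save the trained tagger to a file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "toy_unigram.pickle")
with open(path, "wb") as outfile:
    pickle.dump(tagger, outfile, protocol=pickle.HIGHEST_PROTOCOL)

# Later, possibly in another process: load it back and tag without retraining.
with open(path, "rb") as infile:
    restored = pickle.load(infile)

print(restored.tag(["el", "gato"]))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.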
Method 2
Given the tutorial in the previous answer, here’s a more object-oriented approach from spaghetti tagger: https://github.com/alvations/spaghetti-tagger
# -*- coding: utf-8 -*-
# Python 3 / NLTK 3 port: pickle replaces cPickle,
# and tag_sents() replaces the older batch_tag().
import pickle
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

def loadtagger(taggerfilename):
    # Load a pickled tagger from disk.
    with open(taggerfilename, 'rb') as infile:
        return pickle.load(infile)

def traintag(corpusname, corpus):
    # Function to save tagger.
    def savetagger(tagfilename, tagger):
        with open(tagfilename, 'wb') as outfile:
            pickle.dump(tagger, outfile, -1)
    # Training UnigramTagger.
    uni_tag = ut(corpus)
    savetagger(corpusname + '_unigram.tagger', uni_tag)
    # Training BigramTagger.
    bi_tag = bt(corpus)
    savetagger(corpusname + '_bigram.tagger', bi_tag)
    print("Tagger trained with", corpusname, "using "
          "UnigramTagger and BigramTagger.")

# Function to unchunk corpus: split multiword expressions
# joined by "_" into separate tokens and drop the tags.
def unchunk(corpus):
    nomwe_corpus = []
    for i in corpus:
        nomwe = " ".join([j[0].replace("_", " ") for j in i])
        nomwe_corpus.append(nomwe.split())
    return nomwe_corpus

class cesstag():
    def __init__(self, mwe=True):
        self.mwe = mwe
        # Train tagger if it's used for the first time.
        try:
            loadtagger('cess_unigram.tagger').tag(['estoy'])
            loadtagger('cess_bigram.tagger').tag(['estoy'])
        except IOError:
            print("*** First-time use of cess tagger ***")
            print("Training tagger ...")
            from nltk.corpus import cess_esp as cess
            cess_sents = cess.tagged_sents()
            traintag('cess', cess_sents)
            # Trains the tagger with no MWE: re-tag the unchunked
            # sentences with the tagger that was just saved.
            cess_nomwe = unchunk(cess.tagged_sents())
            tagged_cess_nomwe = batch_pos_tag(cess_nomwe)
            traintag('cess_nomwe', tagged_cess_nomwe)
            print()
        # Load tagger.
        if self.mwe:
            self.uni = loadtagger('cess_unigram.tagger')
            self.bi = loadtagger('cess_bigram.tagger')
        else:
            self.uni = loadtagger('cess_nomwe_unigram.tagger')
            self.bi = loadtagger('cess_nomwe_bigram.tagger')

def pos_tag(tokens, mmwe=True):
    tagger = cesstag(mmwe)
    return tagger.uni.tag(tokens)

def batch_pos_tag(sentences, mmwe=True):
    tagger = cesstag(mmwe)
    return tagger.uni.tag_sents(sentences)

tagger = cesstag()
print(tagger.uni.tag('Mi colega me ayuda a programar cosas .'.split()))
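To make the "no MWE" step more concrete, here is a small sketch of what unchunking does to one sentence. The sample sentence and its tags are made up for illustration; the function mirrors the logic of unchunk above in a more compact form.

```python
def unchunk(tagged_sents):
    """Split multiword tokens such as 'Buenos_Aires' into words, dropping tags."""
    return [
        " ".join(word.replace("_", " ") for word, _tag in sent).split()
        for sent in tagged_sents
    ]

# Hypothetical tagged sentence with one multiword expression.
toy = [[("Vive", "v"), ("en", "s"), ("Buenos_Aires", "np"), (".", "F")]]
print(unchunk(toy))
# [['Vive', 'en', 'Buenos', 'Aires', '.']]
```

Because the tags are dropped, the unchunked sentences have to be tagged again before they can serve as training data, which is why the code above runs them through batch_pos_tag first.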
Method 3
I ended up here searching for POS taggers for languages other than English. Another option for your problem is the spaCy library, which offers POS tagging for multiple languages such as Dutch, German, French, Portuguese, Spanish, Norwegian, Italian, Greek and Lithuanian.
From the spaCy documentation:
import es_core_news_sm
nlp = es_core_news_sm.load()
doc = nlp("El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.")
print([(w.text, w.pos_) for w in doc])
leads to:
[('El', 'DET'), ('copal', 'NOUN'), ('se', 'PRON'), ('usa', 'VERB'),
('principalmente', 'ADV'), ('para', 'ADP'), ('sahumar', 'VERB'),
('en', 'ADP'), ('distintas', 'DET'), ('ocasiones', 'NOUN'),
('como', 'SCONJ'), ('lo', 'PRON'), ('son', 'AUX'), ('las', 'DET'),
('fiestas', 'NOUN'), ('religiosas', 'ADJ'), ('.', 'PUNCT')]
and to visualize in a notebook:
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 120})
Method 4
The following script gives you a quick way to get a "bag of words" from Spanish sentences. Note that to do it correctly you must tokenize the sentences before tagging, so 'religiosas.' must be separated into two tokens: 'religiosas' and '.'.
# -*- coding: utf-8 -*-
# about the tagger: http://nlp.stanford.edu/software/tagger.shtml
# about the tagset: nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html
import nltk
# This class was named POSTagger in older NLTK releases.
from nltk.tag.stanford import StanfordPOSTagger

def isNoun(tag):
    # Assumed definition (the helper is missing from the original post):
    # in this tagset, noun tags start with 'n' (e.g. nc0s000, np00000).
    return tag.startswith('n')

spanish_postagger = StanfordPOSTagger('models/spanish.tagger', 'stanford-postagger.jar', encoding='utf8')
sentences = ['El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.','Las flores, hojas y frutos se usan para aliviar la tos y también se emplea como sedante.']
for sent in sentences:
    # Whitespace split only; punctuation stays attached to words.
    words = sent.split()
    tagged_words = spanish_postagger.tag(words)
    nouns = []
    for (word, tag) in tagged_words:
        print(word + ' ' + tag)
        if isNoun(tag): nouns.append(word)
    print(nouns)
Gives (captured under Python 2, hence the u'' prefixes):
El da0000
copal nc0s000
se p0000000
usa vmip000
principalmente rg
para sp000
sahumar vmn0000
en sp000
distintas di0000
ocasiones nc0p000
como cs
lo pp000000
son vsip000
las da0000
fiestas nc0p000
religiosas. np00000
[u'copal', u'ocasiones', u'fiestas', u'religiosas.']
Las da0000
flores, np00000
hojas nc0p000
y cc
frutos nc0p000
se p0000000
usan vmip000
para sp000
aliviar vmn0000
la da0000
tos nc0s000
y cc
también rg
se p0000000
emplea vmip000
como cs
sedante. nc0s000
[u'flores,', u'hojas', u'frutos', u'tos', u'sedante.']
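Tokens such as 'religiosas.' and 'flores,' in this output keep their punctuation because the sentences were split on whitespace only. A minimal sketch of a tokenizer that separates punctuation is shown below; simple_tokenize is a hypothetical helper based on a regular expression, not part of NLTK or the Stanford tagger, and a proper tokenizer would handle more cases (abbreviations, numbers with decimals, and so on).

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters or digits (Unicode-aware in Python 3,
    # so accented words like "también" stay whole);
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Las flores, hojas y frutos se usan para aliviar la tos.")
print(tokens)
# ['Las', 'flores', ',', 'hojas', 'y', 'frutos', 'se', 'usan', 'para', 'aliviar', 'la', 'tos', '.']
```

The resulting list can be passed to a tagger's tag() method in place of sent.split().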
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
