NLTK Tagging spanish words using a corpus

I am trying to learn how to tag spanish words using NLTK.

From the nltk book, It is quite easy to tag english words using their example. Because I am new to nltk and all language processing, I am quite confused on how to proceeed.

I have downloaded the cess_esp corpus. Is there a way to specifiy a corpus in nltk.pos_tag. I looked at the pos_tag documentation and didn’t see anything that suggested I could. I feel like i’m missing some key concepts. Do I have to manually tag the words in my text agains the cess_esp corpus? (by manually I mean tokenize my sentance and run it agains the corpus) Or am I off the mark entirely. Thank you

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

First you need to read the tagged sentence from a corpus. NLTK provides a nice interface to no bother with different formats from the different corpora; you can simply import the corpus use the corpus object functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml .

Then you have to choose your choice of tagger and train the tagger. There are more fancy options but you can start with the N-gram taggers.

Then you can use the tagger to tag the sentence you want. Here’s an example code:

from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

# Read the corpus into a list, 
# each entry in the list is one sentence.
cess_sents = cess.tagged_sents()

# Train the unigram tagger
uni_tag = ut(cess_sents)

sentence = "Hola , esta foo bar ."

# Tagger reads a list of tokens.
uni_tag.tag(sentence.split(" "))

# Split corpus into training and testing set.
train = int(len(cess_sents)*90/100) # 90%

# Train a bigram tagger with only training data.
bi_tag = bt(cess_sents[:train])

# Evaluates on testing data remaining 10%
bi_tag.evaluate(cess_sents[train+1:])

# Using the tagger.
bi_tag.tag(sentence.split(" "))

Training a tagger on a large corpus may take a significant time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger in a file for later re-use.

Please look at Storing Taggers section in http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html

Method 2

Given the tutorial in the previous answer, here’s a more object-oriented approach from spaghetti tagger: https://github.com/alvations/spaghetti-tagger

#-*- coding: utf8 -*-

from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt
from cPickle import dump,load

def loadtagger(taggerfilename):
    infile = open(taggerfilename,'rb')
    tagger = load(infile); infile.close()
    return tagger

def traintag(corpusname, corpus):
    # Function to save tagger.
    def savetagger(tagfilename,tagger):
        outfile = open(tagfilename, 'wb')
        dump(tagger,outfile,-1); outfile.close()
        return
    # Training UnigramTagger.
    uni_tag = ut(corpus)
    savetagger(corpusname+'_unigram.tagger',uni_tag)
    # Training BigramTagger.
    bi_tag = bt(corpus)
    savetagger(corpusname+'_bigram.tagger',bi_tag)
    print "Tagger trained with",corpusname,"using" +
                "UnigramTagger and BigramTagger."
    return

# Function to unchunk corpus.
def unchunk(corpus):
    nomwe_corpus = []
    for i in corpus:
        nomwe = " ".join([j[0].replace("_"," ") for j in i])
        nomwe_corpus.append(nomwe.split())
    return nomwe_corpus

class cesstag():
    def __init__(self,mwe=True):
        self.mwe = mwe
        # Train tagger if it's used for the first time.
        try:
            loadtagger('cess_unigram.tagger').tag(['estoy'])
            loadtagger('cess_bigram.tagger').tag(['estoy'])
        except IOError:
            print "*** First-time use of cess tagger ***"
            print "Training tagger ..."
            from nltk.corpus import cess_esp as cess
            cess_sents = cess.tagged_sents()
            traintag('cess',cess_sents)
            # Trains the tagger with no MWE.
            cess_nomwe = unchunk(cess.tagged_sents())
            tagged_cess_nomwe = batch_pos_tag(cess_nomwe)
            traintag('cess_nomwe',tagged_cess_nomwe)
            print
        # Load tagger.
        if self.mwe == True:
            self.uni = loadtagger('cess_unigram.tagger')
            self.bi = loadtagger('cess_bigram.tagger')
        elif self.mwe == False:
            self.uni = loadtagger('cess_nomwe_unigram.tagger')
            self.bi = loadtagger('cess_nomwe_bigram.tagger')

def pos_tag(tokens, mmwe=True):
    tagger = cesstag(mmwe)
    return tagger.uni.tag(tokens)

def batch_pos_tag(sentences, mmwe=True):
    tagger = cesstag(mmwe)
    return tagger.uni.batch_tag(sentences)

tagger = cesstag()
print tagger.uni.tag('Mi colega me ayuda a programar cosas .'.split())

Method 3

I ended up here searching for POS taggers for other languages then English. Another option for your problem is using the Spacy library. Which offers POS tagging for multiple languages such as Dutch, German, French, Portuguese, Spanish, Norwegian, Italian, Greek and Lithuanian.

From the Spacy Documentation:

import es_core_news_sm
nlp = es_core_news_sm.load()
doc = nlp("El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.")
print([(w.text, w.pos_) for w in doc])

leads to:

[(‘El’, ‘DET’), (‘copal’, ‘NOUN’), (‘se’, ‘PRON’), (‘usa’, ‘VERB’),
(‘principalmente’, ‘ADV’), (‘para’, ‘ADP’), (‘sahumar’, ‘VERB’),
(‘en’, ‘ADP’), (‘distintas’, ‘DET’), (‘ocasiones’, ‘NOUN’), (‘como’,
‘SCONJ’), (‘lo’, ‘PRON’), (‘son’, ‘AUX’), (‘las’, ‘DET’), (‘fiestas’,
‘NOUN’), (‘religiosas’, ‘ADJ’), (‘.’, ‘PUNCT’)]

and to visualize in a notebook:

displacy.render(doc, style='dep', jupyter = True, options = {'distance': 120})

Method 4

The following script gives you a quick approach to get a “bag of words” in Spanish sentences. Note that if you want to do it correctly you must tokenize the sentences before tag, so ‘religiosas.’ must be separated in two tokens ‘religiosas’,’.’

#-*- coding: utf8 -*-

# about the tagger: http://nlp.stanford.edu/software/tagger.shtml 
# about the tagset: nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html

import nltk

from nltk.tag.stanford import POSTagger

spanish_postagger = POSTagger('models/spanish.tagger', 'stanford-postagger.jar', encoding='utf8')

sentences = ['El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.','Las flores, hojas y frutos se usan para aliviar la tos y también se emplea como sedante.']

for sent in sentences:

    words = sent.split()
    tagged_words = spanish_postagger.tag(words)

    nouns = []

    for (word, tag) in tagged_words:

        print(word+' '+tag).encode('utf8')
        if isNoun(tag): nouns.append(word)

    print(nouns)

Gives:

El da0000
copal nc0s000
se p0000000
usa vmip000
principalmente rg
para sp000
sahumar vmn0000
en sp000
distintas di0000
ocasiones nc0p000
como cs
lo pp000000
son vsip000
las da0000
fiestas nc0p000
religiosas. np00000
[u'copal', u'ocasiones', u'fiestas', u'religiosas.']
Las da0000
flores, np00000
hojas nc0p000
y cc
frutos nc0p000
se p0000000
usan vmip000
para sp000
aliviar vmn0000
la da0000
tos nc0s000
y cc
también rg
se p0000000
emplea vmip000
como cs
sedante. nc0s000
[u'flores,', u'hojas', u'frutos', u'tos', u'sedante.']

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating