I’m reading a string of text and I need to detect its language code (en, de, fr, es, etc.).
Is there a simple way to do this in Python?
Answers:
Method 1
If you need to detect the language in response to a user action, you could use the Google AJAX Language API:
#!/usr/bin/env python
# NOTE: Python 2 code (urllib2, unicode, print statement)
import json
import urllib, urllib2

def detect_language(text,
                    userip=None,
                    referrer="http://stackoverflow.com/q/4545977/4279",
                    api_key=None):
    query = {'q': text.encode('utf-8') if isinstance(text, unicode) else text}
    if userip: query.update(userip=userip)
    if api_key: query.update(key=api_key)
    url = 'https://ajax.googleapis.com/ajax/services/language/detect?v=1.0&%s'%(
        urllib.urlencode(query))
    request = urllib2.Request(url, None, headers=dict(Referer=referrer))
    d = json.load(urllib2.urlopen(request))
    if d['responseStatus'] != 200 or u'error' in d['responseData']:
        raise IOError(d)
    return d['responseData']['language']

print detect_language("Python - can I detect unicode string language code?")
Output
en
Google Translate API v2
Default limit 100000 characters/day (no more than 5000 at a time).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import urllib, urllib2
from operator import itemgetter

def detect_language_v2(chunks, api_key):
    """
    chunks: either string or sequence of strings

    Return list of corresponding language codes
    """
    if isinstance(chunks, basestring):
        chunks = [chunks]
    url = 'https://www.googleapis.com/language/translate/v2'
    data = urllib.urlencode(dict(
        q=[t.encode('utf-8') if isinstance(t, unicode) else t
           for t in chunks],
        key=api_key,
        target="en"), doseq=1)
    # the request length MUST be < 5000
    if len(data) > 5000:
        raise ValueError("request is too long, see "
                         "http://code.google.com/apis/language/translate/terms.html")
    #NOTE: use POST to allow more than 2K characters
    request = urllib2.Request(url, data,
                              headers={'X-HTTP-Method-Override': 'GET'})
    d = json.load(urllib2.urlopen(request))
    if u'error' in d:
        raise IOError(d)
    return map(itemgetter('detectedSourceLanguage'), d['data']['translations'])
The v2 API also lets you request language detection explicitly, without asking for a translation:
def detect_language_v2(chunks, api_key):
    """
    chunks: either string or sequence of strings

    Return list of corresponding language codes
    """
    if isinstance(chunks, basestring):
        chunks = [chunks]
    url = 'https://www.googleapis.com/language/translate/v2/detect'
    data = urllib.urlencode(dict(
        q=[t.encode('utf-8') if isinstance(t, unicode) else t
           for t in chunks],
        key=api_key), doseq=True)
    # the request length MUST be < 5000
    if len(data) > 5000:
        raise ValueError("request is too long, see "
                         "http://code.google.com/apis/language/translate/terms.html")
    #NOTE: use POST to allow more than 2K characters
    request = urllib2.Request(url, data,
                              headers={'X-HTTP-Method-Override': 'GET'})
    d = json.load(urllib2.urlopen(request))
    # for each chunk, keep the candidate with the highest confidence
    return [sorted(L, key=itemgetter('confidence'))[-1]['language']
            for L in d['data']['detections']]
Example:
print detect_language_v2(
    ["Python - can I detect unicode string language code?",
     u"матрёшка",
     u"打水"], api_key=open('api_key.txt').read().strip())
Output
[u'en', u'ru', u'zh-CN']
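The code above is Python 2 (urllib2, basestring, unicode). A rough Python 3 sketch of the same detect call might look like the following. The endpoint URL, the 5000-character check, the override header and the response shape are taken from the answer above; the helper names build_detect_request and languages_from_response are made up for illustration, and whether the service still accepts requests in this form has not been verified.

```python
import json
from operator import itemgetter
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint as given in the answer above; not re-verified against the live service.
DETECT_URL = "https://www.googleapis.com/language/translate/v2/detect"


def build_detect_request(chunks, api_key):
    """Build (but do not send) the POST request for one or more text chunks."""
    if isinstance(chunks, str):
        chunks = [chunks]
    # urlencode percent-encodes str values as UTF-8, so the result is pure ASCII.
    data = urlencode({"q": list(chunks), "key": api_key}, doseq=True).encode("ascii")
    if len(data) > 5000:
        raise ValueError("request is too long")
    # POST with a method override, as in the original, to allow longer queries.
    return Request(DETECT_URL, data, headers={"X-HTTP-Method-Override": "GET"})


def languages_from_response(d):
    """Pick the highest-confidence language for each chunk in a parsed response."""
    if "error" in d:
        raise IOError(d)
    return [sorted(candidates, key=itemgetter("confidence"))[-1]["language"]
            for candidates in d["data"]["detections"]]


def detect_language_v2(chunks, api_key):
    """Send the request and return one language code per chunk."""
    with urlopen(build_detect_request(chunks, api_key)) as response:
        return languages_from_response(json.load(response))
```

Splitting request building from response parsing keeps both parts checkable without network access or an API key.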
Method 2
In my case I only need to determine two languages so I just check the first character:
import unicodedata

def is_greek(term):
    return 'GREEK' in unicodedata.name(term.strip()[0])

def is_hebrew(term):
    return 'HEBREW' in unicodedata.name(term.strip()[0])
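The two checks above raise IndexError on an empty string and look only at the very first character, so a leading digit or punctuation mark defeats them. A slightly more defensive variant, as a Python 3 sketch, could scan for the first alphabetic character instead. The function name first_letter_script and its scripts parameter are hypothetical, not part of the original answer.

```python
import unicodedata


def first_letter_script(term, scripts=("GREEK", "HEBREW")):
    """Return the script keyword found in the Unicode name of the first
    alphabetic character of term, or None if there is no match."""
    for ch in term:
        if ch.isalpha():
            # The default "" avoids a ValueError for unnamed code points.
            name = unicodedata.name(ch, "")
            for script in scripts:
                if script in name:
                    return script
            return None  # first letter belongs to some other script
    return None  # no alphabetic character at all


print(first_letter_script("  123 αβγ"))
print(first_letter_script("שלום"))
print(first_letter_script("hello"))
```

Like the original, this only inspects one character, so it suits the case where each input is known to be written in a single script.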
Method 3
Have a look at guess-language:
Attempts to determine the natural language of a selection of Unicode (utf-8) text.
But as the name says, it guesses the language. You can’t expect 100% correct results.
Edit:
guess-language is unmaintained, but there is a fork that supports Python 3: guess_language-spirit
Method 4
Look at Natural Language Toolkit and Automatic Language Identification using Python for ideas.
I would like to know if a Bayesian filter can get language right but I can’t write a proof of concept right now.
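For what it's worth, a proof of concept along those lines can be quite small. The sketch below is a naive Bayes classifier over character trigrams with add-one smoothing and a uniform prior over languages. The class name, the helper functions and the two toy training sentences are all invented for illustration; a usable model would need far more training text per language.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, padded with spaces."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageGuesser:
    """Toy naive Bayes classifier over character n-grams (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language code -> Counter of n-grams
        self.vocab = set()  # every n-gram seen in any language

    def train(self, language, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.vocab.update(grams)

    def guess(self, text):
        grams = char_ngrams(text, self.n)
        best_lang, best_score = None, float("-inf")
        for lang, counter in self.counts.items():
            total = sum(counter.values())
            # Sum of log-likelihoods with add-one (Laplace) smoothing;
            # the prior is uniform, so it is left out of the comparison.
            score = sum(
                math.log((counter[g] + 1) / (total + len(self.vocab)))
                for g in grams
            )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


def build_toy_guesser():
    """Train on two made-up sentences; far too little data for real use."""
    guesser = NaiveBayesLanguageGuesser()
    guesser.train("en", "the quick brown fox jumps over the lazy dog "
                        "and the cat sits on the mat")
    guesser.train("de", "der schnelle braune fuchs springt über den faulen hund "
                        "und die katze sitzt auf der matte")
    return guesser


guesser = build_toy_guesser()
print(guesser.guess("the dog and the cat"))
print(guesser.guess("der hund und die katze"))
```

Character n-grams are used here rather than whole words so that unseen words still contribute evidence through their letter patterns.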
Method 5
A useful article suggests that the open-source library CLD is the best bet for detecting language in Python.
The article compares the speed and accuracy of three solutions:
- language-detection or its Python port langdetect
- Tika
- Chromium Language Detection (CLD)
I wasted my time with langdetect; now I am switching to CLD, which is 16x faster than langdetect and has 98.8% accuracy.
Method 6
Try the Universal Encoding Detector; it's a port of the chardet module from Firefox to Python.
Method 7
If you only have a limited number of possible languages, you could use a set of dictionaries (possibly only including the most common words) of each language and then check the words in your input against the dictionaries.
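A minimal sketch of that idea in Python 3, assuming a handful of very common words per language is enough to separate the candidates. The word lists and the function name guess_by_dictionary are illustrative assumptions, not a vetted resource.

```python
import re

# Tiny illustrative word lists; real use would need larger, curated ones.
COMMON_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "les", "et", "est", "un", "une", "pas"},
    "es": {"el", "la", "los", "las", "y", "es", "un", "una", "que", "no"},
}


def guess_by_dictionary(text, dictionaries=COMMON_WORDS):
    """Return the language whose word list matches the most words in text,
    or None if no word matches any list."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in dictionaries.items()
    }
    # On a tie, max() returns the first best language in dictionary order.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_by_dictionary("The cat is in the garden and it is happy"))
print(guess_by_dictionary("Der Hund ist nicht in der Küche"))
```

Short inputs and words shared between languages (for example "la" or "un") make this unreliable, so it works best on whole sentences from a small set of candidate languages.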
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.