Convert between Unicode Normalization Forms on the unix command-line

In Unicode, some character combinations have more than one representation.

For example, the character ä can be represented as

  • “ä”, that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
  • “ä”, that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).

According to the Unicode standard, the two representations are equivalent but in different “normalization forms”, see UAX #15: Unicode Normalization Forms.
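To make the two representations concrete, here is a minimal Python sketch using the standard library's unicodedata module. The helper name utf8_hex is only an illustration and is not part of any library.

```python
import unicodedata

composed = "\u00e4"      # single codepoint U+00E4 (composed form, NFC)
decomposed = "a\u0308"   # U+0061 followed by U+0308 (decomposed form, NFD)

def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

print(utf8_hex(composed))      # c3 a4
print(utf8_hex(decomposed))    # 61 cc 88

# The strings render identically but compare unequal until normalized.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```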

The unix toolbox has all kinds of text transformation tools; sed, tr, iconv, and Perl come to mind. How can I do quick and easy NF conversion on the command-line?

Answers:


Method 1

You can use the uconv utility from ICU. Normalization is achieved through transliteration (-x).

$ uconv -x any-nfd <<<ä | hd
00000000  61 cc 88 0a                                       |a...|
00000004
$ uconv -x any-nfc <<<ä | hd
00000000  c3 a4 0a                                          |...|
00000003

On Debian, Ubuntu and other derivatives, uconv is in the icu-devtools package (libicu-dev in older releases). On Fedora, Red Hat and other derivatives, and in BSD ports, it’s in the icu package.

Method 2

Python has the unicodedata module in its standard library, which can convert between Unicode representations through the unicodedata.normalize() function:

import unicodedata

s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'

t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2) 
print(ascii(t1)) 

t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
print(t3 == t4)
print(ascii(t3))

Running with Python 3.x:

$ python3 test.py
True
'Spicy Jalape\xf1o'
True
'Spicy Jalapen\u0303o'

Python isn’t well suited for shell one-liners, but it can be done if you don’t want to create an external script:

$ python3 -c $'import unicodedata\nprint(unicodedata.normalize("NFC", "ääääää"))'
ääääää

For Python 2.x you have to add an encoding line (# -*- coding: utf-8 -*-) and mark strings as Unicode with the u prefix:

$ python -c $'# -*- coding: utf-8 -*-\nimport unicodedata\nprint(unicodedata.normalize("NFC", u"ääääää"))'
ääääää
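If the one-liner gets unwieldy, a small filter script is an alternative. The following is a minimal sketch, not a polished tool: the file name nfconv.py and the function name normalize_stream are hypothetical, and it assumes the input is valid UTF-8.

```python
#!/usr/bin/env python3
"""nfconv.py (hypothetical name): normalize UTF-8 text from stdin to stdout."""
import io
import sys
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalize_stream(form, src, dst):
    """Copy src to dst line by line, normalizing each line to the given form."""
    if form not in FORMS:
        raise ValueError(f"unknown normalization form: {form}")
    for line in src:
        dst.write(unicodedata.normalize(form, line))

if __name__ == "__main__":
    # First argument selects the form; default to NFC.
    form = sys.argv[1].upper() if len(sys.argv) > 1 else "NFC"
    # Wrap the binary streams so the encoding is UTF-8 regardless of locale,
    # and pass newline="" so line endings are left untouched.
    src = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8", newline="")
    dst = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", newline="")
    normalize_stream(form, src, dst)
    dst.flush()
```

Assuming it is saved as nfconv.py, it could be used as: python3 nfconv.py nfd <infile >outfile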

Method 3

For completeness, with perl:

$ perl -CSA -MUnicode::Normalize=NFD -e 'print NFD($_) for @ARGV' $'\ue1' | uconv -x name
\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}
$ perl -CSA -MUnicode::Normalize=NFC -e 'print NFC($_) for @ARGV' $'a\u301' | uconv -x name
\N{LATIN SMALL LETTER A WITH ACUTE}

Method 4

Check the bytes with hexdump:

echo -e "ä\c" | hexdump -C

00000000  61 cc 88                                          |a..|
00000003

Convert with iconv (the UTF-8-MAC encoding is provided by the iconv shipped with macOS and may be missing elsewhere) and check again with hexdump:

echo -e "ä\c" | iconv -f UTF-8-MAC -t UTF-8 | hexdump -C

00000000  c3 a4                                             |..|
00000002

printf '\xc3\xa4'
ä
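As a portable alternative to reading the hex dump by eye, the following Python sketch reports both the bytes and the normalization forms a string already satisfies. It assumes Python 3.8 or later, where unicodedata.is_normalized() is available; the function name describe is only an illustration.

```python
import unicodedata

def describe(text):
    """Return (utf8_hex, forms): the UTF-8 bytes as hex, and the list of
    normalization forms that text is already in."""
    hex_bytes = " ".join(f"{b:02x}" for b in text.encode("utf-8"))
    forms = [f for f in ("NFC", "NFD", "NFKC", "NFKD")
             if unicodedata.is_normalized(f, text)]
    return hex_bytes, forms

print(describe("a\u0308"))  # ('61 cc 88', ['NFD', 'NFKD'])
print(describe("\u00e4"))   # ('c3 a4', ['NFC', 'NFKC'])
```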

Method 5

coreutils has a patch that adds a proper unorm. It works fine for me on systems with 4-byte wchars; see http://crashcourse.housegordon.org/coreutils-multibyte-support.html#unorm
The remaining problem is systems with 2-byte wchars (Cygwin, Windows, plus AIX and Solaris on 32-bit), which need to transform codepoints from the upper planes into surrogate pairs and vice versa; the underlying libunistring/gnulib cannot handle that yet.

perl has the unichars tool, which also does the various normalization forms on the cmdline. http://search.cpan.org/dist/Unicode-Tussle/script/unichars

Method 6

There’s a perl utility called Charlint available from

https://www.w3.org/International/charlint/

which does what you want. You’ll also have to download a file from

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

After the first run you’ll see Charlint complaining about incompatible entries in that file so you’ll have to delete those lines from UnicodeData.txt.

Method 7

Since uconv doesn’t seem to be well documented, and the python solution posted here isn’t actually a one-liner, here’s a one-liner using ruby:

ruby -e '$stdin.each_line {|line| puts line.unicode_normalize(:nfd)}' <infile >outfile

Documentation: https://apidock.com/ruby/v2_5_5/String/unicode_normalize


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
