In a huge UTF-8 text file, I want to show all lines that contain Japanese kanji.
What grep (or other) expression does this?
If I am not mistaken, kanji are the characters between \u4e00 and \u9fff.
I don’t need to show kana, but showing them too would not be a big problem.
Answers:
Method 1
It is impossible (without using a huge table) to tell a Japanese kanji apart from a Han ideograph not used in Japanese (e.g., a Chinese or Korean variant).
If you just want to detect any Han ideograph in the basic range (\u4e00 to \u9fff),
then they are encoded in 3 bytes: the first byte is always between 0xe4 and 0xe9, and the second and third bytes are between 0x80 and 0xbf.
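The byte ranges quoted above can be checked by encoding the endpoints of the range. A minimal sketch in Python, used here only for illustration; the helper name `utf8_hex` is made up for this example:

```python
def utf8_hex(code_point):
    """Return the UTF-8 encoding of one code point as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in chr(code_point).encode("utf-8"))

# First and last code points of the basic Han range discussed above.
print(utf8_hex(0x4E00))  # e4 b8 80
print(utf8_hex(0x9FFF))  # e9 bf bf
# Lowest code point whose encoding starts with the byte 0xe4.
print(utf8_hex(0x4000))  # e4 80 80
```

The last line suggests that a pattern built only from these byte ranges covers U+4000 through U+9FFF, so it also matches some code points below U+4E00.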
There are two difficulties in matching these bytes: first, you have to tell grep to match bytes rather than characters; second, you have to type the 0xe4, 0xe9, 0x80 and 0xbf bytes into the regular expression.
I discovered the -P switch does both; and the line you want is:
grep -P "[\xe4-\xe9][\x80-\xbf][\x80-\xbf]"
and if you want kana too:
grep -P "[\xe4-\xe9][\x80-\xbf][\x80-\xbf]|\xe3[\x81-\x83][\x80-\xbf]"
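For readers who want to try the same byte ranges outside grep, here is a rough Python equivalent that applies them to raw bytes. It is only a sketch: the names `HAN`, `KANA` and `matching_lines` are invented for this example, and the sample text is made up.

```python
import re

# Same byte ranges as the grep -P patterns above, applied to undecoded bytes.
HAN = rb"[\xe4-\xe9][\x80-\xbf][\x80-\xbf]"
KANA = rb"\xe3[\x81-\x83][\x80-\xbf]"
HAN_ONLY = re.compile(HAN)
HAN_OR_KANA = re.compile(HAN + b"|" + KANA)

def matching_lines(data, pattern):
    """Return the lines of a bytes object that contain a match for pattern."""
    return [line for line in data.splitlines() if pattern.search(line)]

sample = "hello\n漢字 here\nかな only\n".encode("utf-8")
for line in matching_lines(sample, HAN_OR_KANA):
    print(line.decode("utf-8"))
```

One caveat, stated as an assumption rather than a tested fact: on recent GNU grep in a UTF-8 locale, -P may treat \xe4 as the code point U+00E4 instead of a raw byte. Prefixing the command with LC_ALL=C is one way to ask for byte semantics.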
Method 2
According to the fifth table here, kanji are the characters between \u4e00 and \u9fff.
My implementation of grep doesn’t seem to be able to handle Unicode characters (that’s GNU grep 2.14 on Arch Linux), but we can still use \x. You can find the respective codes here or use a tool like hexedit to get them.
Anything in our range of interest above e9 be a5 returned “Invalid collation character”, so this is what I’ve come up with:
grep "["$'\xe4\xb8\x80'"-"$'\xe9\xbe\xa5'"]" file.txt
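If grep's locale handling gets in the way, the same character-range idea can be sketched in Python on decoded text. This is an alternative to the command above, not a translation of it: it uses the full \u4e00 to \u9fff range instead of stopping at e9 be a5, and the name `han_lines` is hypothetical.

```python
import re

# Character class over code points, not bytes: U+4E00 through U+9FFF.
HAN = re.compile("[\u4e00-\u9fff]")

def han_lines(lines):
    """Yield the lines that contain at least one character in U+4E00..U+9FFF."""
    for line in lines:
        if HAN.search(line):
            yield line

sample = ["plain ascii\n", "日本語 text\n", "カタカナ\n"]
print(list(han_lines(sample)))  # ['日本語 text\n']
```

Because `han_lines` accepts any iterable of strings, it should also work on a file object opened with `encoding="utf-8"`, which reads a huge file line by line instead of loading it whole.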
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.