How to use grep/ack with files in arbitrary encoding?

On my Linux desktop I have a UTF-8 locale. When I try to search some KOI8-R encoded files with grep (ack), it fails. If I manually encode the pattern into KOI8-R and pass that as an argument, it works.

Is it possible to tell grep what encoding to use for the pattern? Or any other tool?

Answers:


Method 1

If all the files you’re searching in have the same encoding:

LC_CTYPE=ru_RU.KOI8-R luit ack-grep "$(echo 'привет' | iconv -t KOI8-R)" *.txt

or in bash or zsh

LC_CTYPE=ru_RU.KOI8-R luit ack-grep "$(iconv -t KOI8-R <<<'привет')" *.txt

Or start a child shell in the desired encoding:

$ LC_CTYPE=ru_RU.KOI8-R luit
$ ack-grep 'привет' *.txt
$ exit

Luit (shipped with XFree86 and X.org) runs the program specified on its command line in the locale given by the LC_CTYPE setting, assuming a UTF-8 terminal. So the command runs in the desired locale, and Luit translates its terminal output to UTF-8.
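The pattern has to be converted in the one-liners above because grep and ack compare bytes, and the same Cyrillic word is a different byte sequence in each encoding. A minimal sketch to see this, assuming a UTF-8 shell and that iconv and od are installed (the variable names are only for illustration):

```shell
# Hex bytes of the pattern as typed in a UTF-8 terminal
utf8_hex=$(printf 'привет' | od -An -tx1 | tr -d ' \n')

# Hex bytes of the same pattern after conversion to KOI8-R
koi8_hex=$(printf 'привет' | iconv -f UTF-8 -t KOI8-R | od -An -tx1 | tr -d ' \n')

printf 'UTF-8:  %s\nKOI8-R: %s\n' "$utf8_hex" "$koi8_hex"
```

UTF-8 uses two bytes per Cyrillic letter and KOI8-R uses one, so an unconverted UTF-8 pattern can never match inside a KOI8-R file.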

Another approach, if you have a directory tree with a lot of files in a different encoding, is to mount a view of that directory tree in your preferred encoding. I think the fuseflt filesystem can do this (untested).

mkdir /utf8-view
fuseflt iconv-koi8r-utf8.conf /some/dir /utf8-view
ack-grep 'привет' /utf8-view/*.txt.utf8
fusermount -u /utf8-view

where the configuration file iconv-koi8r-utf8.conf contains

ext_in =
ext_out = *.utf8
flt_in =
flt_out = .utf8
flt_cmd = iconv -f KOI8-R -t UTF-8
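If mounting a filesystem is more than you need, the same per-file iconv conversion can be done in a plain loop. This is only a sketch: grep_enc is a hypothetical helper name, and it assumes GNU grep, whose --label option substitutes a name for "(standard input)" in the output.

```shell
# Hypothetical helper: search files stored in another encoding
# using a pattern typed in the terminal's UTF-8.
# Usage: grep_enc ENCODING PATTERN FILE...
grep_enc() {
    enc=$1
    pattern=$2
    shift 2
    status=1                      # like grep: 1 means "no match anywhere"
    for f in "$@"; do
        # Decode the file to UTF-8 on the fly so grep only ever sees UTF-8.
        # -H with --label (GNU grep) prefixes each match with the real file name.
        if iconv -f "$enc" -t UTF-8 < "$f" | grep -H --label="$f" -e "$pattern"; then
            status=0
        fi
    done
    return "$status"
}
```

It would be called as grep_enc KOI8-R 'привет' *.txt. Unlike the mounted view, this works with grep only; tools that insist on opening the files themselves still need the converted pattern or the Luit approach.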


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
