Several questions about file-system character encoding on Linux

Because I exchange a lot of files between Windows (GBK encoding) and Linux (UTF-8 encoding), I run into character encoding issues easily, for example:

Zipping or tarring files whose names contain Chinese characters on Windows, then unzipping or untarring them on Linux.
Running a migrated legacy Java web application (designed on Windows, using GBK encoding in JSP) that writes files with GBK-encoded names to disk.
Using FTP get/put to transfer files with GBK-encoded names between a Windows FTP server and a Linux client.
Switching the LANG environment variable in Linux.

The common problem in all of these cases is locating and naming files. After some searching, I found the article Using Unicode in Linux
(https://www.linux.com/news/using-unicode-linux/), which says:

the operating system and many utilities do not realize what characters the bytes in file names represent.

So it is possible to have two files with the same name (the SAME when each name is decoded with the correct character set, but DIFFERENT in bytes), such as 中文.txt, stored in different encodings:

[root@fedora test]# ls
????  中文
[root@fedora test]# ls | iconv -f GBK
中文
涓iconv: illegal input sequence at position 7
[root@fedora test]# ls 中文 && ls $'\xd6\xd0\xce\xc4' | iconv -f gbk
中文
中文
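
The byte-level difference behind this listing can be reproduced without touching the file system. The following Python sketch (the variable names are illustrative, not from the original post) encodes the same two characters in both encodings and checks that the GBK bytes are not valid UTF-8:

```python
# The same text, encoded two ways, gives two different byte sequences.
name = "中文"

utf8_bytes = name.encode("utf-8")  # what a UTF-8 locale writes to disk
gbk_bytes = name.encode("gbk")     # what a GBK locale writes to disk

print(utf8_bytes)  # b'\xe4\xb8\xad\xe6\x96\x87'
print(gbk_bytes)   # b'\xd6\xd0\xce\xc4'

# Decoding the GBK bytes as UTF-8 fails, which is why a UTF-8 terminal
# can only show placeholder characters for that name.
try:
    gbk_bytes.decode("utf-8")
    gbk_is_valid_utf8 = True
except UnicodeDecodeError:
    gbk_is_valid_utf8 = False

print(gbk_is_valid_utf8)  # False
```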

Questions:

Is it possible to configure a Linux file system to use a fixed character encoding (as NTFS uses UTF-16 internally) to store file names regardless of the LANG/LC_ALL environment?
Or, what I actually want to ask is: is it possible to make the file name 中文.txt ($'\xe4\xb8\xad\xe6\x96\x87.txt') in a zh_CN.UTF-8 environment and the file name 中文.txt ($'\xd6\xd0\xce\xc4.txt') in a zh_CN.GBK environment refer to the same file?
If it is not configurable, is it possible to patch the kernel to translate the character encoding between the file system and the current environment (just a question, not a request for an implementation)? And how large would the performance impact be if it is possible?

Answers:

Method 1

I have reformulated your questions a bit, for reasons that should
become evident when you read them in sequence.

1. Is it possible to configure a Linux filesystem to use a fixed character encoding to store file names regardless of the LANG/LC_ALL environment?

No, this is not possible: as you mention in your question, a UNIX file
name is just a sequence of bytes; the kernel knows nothing about
the encoding, which is entirely a user-space (i.e., application-level)
concept.

In other words, the kernel knows nothing about LANG/LC_*, so it cannot
translate.
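
As a rough illustration of the point that names are only bytes to the kernel, the Python sketch below creates both byte sequences from the question as separate files in a temporary directory and lists them back as raw bytes. It assumes a Linux filesystem such as ext4 or tmpfs that accepts arbitrary bytes in names; the helper name is made up for this example.

```python
import os
import tempfile


def create_both_names():
    """Create the UTF-8 and GBK spellings of 中文.txt side by side."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.fsencode(tmp)  # work with bytes paths throughout
        for raw in (b"\xe4\xb8\xad\xe6\x96\x87.txt", b"\xd6\xd0\xce\xc4.txt"):
            # open() accepts bytes paths and hands them to the kernel as-is
            with open(os.path.join(base, raw), "wb"):
                pass
        # Passing a bytes path to listdir() returns the names as raw bytes
        return sorted(os.listdir(base))


print(create_both_names())
```

Both entries coexist because no layer below the application compares them as text.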

2. Is it possible to let different file names refer to the same file?

You can have multiple directory entries referring to the same file;
you can do that with hard links or symbolic links.

Be aware, however, that file names that are not valid in the
current encoding (e.g., your GBK character string when you’re working
in a UTF-8 locale) will display badly, if at all.
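
A minimal sketch of the hard-link idea, again assuming a Linux filesystem that stores names as raw bytes: it creates a file under its GBK name, adds a second directory entry with the UTF-8 name, and checks that both refer to the same file. The function name is hypothetical.

```python
import os
import tempfile


def link_both_encodings(text="中文.txt"):
    """Give one file two names: its GBK bytes and its UTF-8 bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.fsencode(tmp)
        gbk_path = os.path.join(base, text.encode("gbk"))
        utf8_path = os.path.join(base, text.encode("utf-8"))

        with open(gbk_path, "wb") as f:
            f.write(b"hello")

        # A hard link is a second directory entry for the same inode.
        os.link(gbk_path, utf8_path)

        with open(utf8_path, "rb") as f:
            content = f.read()
        return os.path.samefile(gbk_path, utf8_path), content


print(link_both_encodings())  # (True, b'hello')
```

Each file needs its own link, so this works for a handful of known files rather than as a general translation layer.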

3. Is it possible to patch the kernel to translate the character encoding between the file system and the current environment?

You cannot patch the kernel to do this (see 1.), but you could, in
theory, patch the C library (e.g., glibc) to perform this translation:
always convert file names to UTF-8 when it calls the kernel, and
convert them back to the current encoding when it reads a file name
from the kernel.

A simpler approach could be to write an overlay filesystem with FUSE
that just redirects any filesystem request to another location after
converting the file name to/from UTF-8. Ideally you could mount this
filesystem in ~/trans, so that when an access is made to
~/trans/a/GBK/encoded/path the FUSE filesystem really accesses
/a/UTF-8/encoded/path.
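
The name translation that such a FUSE layer would have to perform can be sketched in isolation. The function below is hypothetical (it is not part of any FUSE library) and assumes every path component is valid GBK; it converts a GBK-encoded path into the UTF-8 path that would actually be accessed.

```python
def gbk_path_to_utf8(path: bytes) -> bytes:
    """Translate each component of a GBK-encoded path to UTF-8."""
    parts = path.split(b"/")
    # Decode and re-encode component by component so that the "/"
    # separators are never touched.
    return b"/".join(p.decode("gbk").encode("utf-8") for p in parts)


requested = b"/data/\xd6\xd0\xce\xc4.txt"  # what a GBK client asks for
print(gbk_path_to_utf8(requested))         # b'/data/\xe4\xb8\xad\xe6\x96\x87.txt'
```

A component that is not valid GBK raises UnicodeDecodeError here, so a real implementation would need a policy for such names.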

However, the problem with these approaches is: what do you do with
files that already exist on your filesystem and are not UTF-8 encoded?
You cannot simply pass them through untranslated, because then you
don’t know how to convert them; nor can you mangle them by translating
invalid character sequences to ?, because that could create
conflicts…
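
The conflict described above is easy to demonstrate. This sketch uses a hypothetical mangling rule (every non-ASCII byte becomes ?) and shows two distinct GBK-encoded names collapsing into one:

```python
def mangle(raw: bytes) -> bytes:
    """Hypothetical rule: replace every non-ASCII byte with b'?'."""
    return bytes(b if b < 0x80 else ord("?") for b in raw)


first = "中文.txt".encode("gbk")   # b'\xd6\xd0\xce\xc4.txt'
second = "日本.txt".encode("gbk")  # a different two-character name

print(first != second)                  # True: distinct names on disk
print(mangle(first) == mangle(second))  # True: identical after mangling
```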

Method 2

What you can do is limit the number of supported locales to UTF-8 locales only.

http://www.fifi.org/cgi-bin/man2html/usr/share/man/man5/locale.gen.5

Method 3

OEM code page selection is broken in both vanilla unzip and vanilla p7zip.
I made patches fixing this issue, and there is a PPA for Ubuntu providing p7zip with this patch applied.
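
Where patched tools are not available, the same idea can be approximated in a script. The sketch below relies on Python's zipfile module decoding entry names as CP437 when an entry's UTF-8 flag is not set, and it assumes the archive was created on a GBK system; the function name and the default encoding argument are choices made for this example, not part of the patches mentioned above.

```python
import zipfile


def recover_name(info: zipfile.ZipInfo, legacy_encoding: str = "gbk") -> str:
    """Return the entry name, re-decoding legacy names with legacy_encoding."""
    if info.flag_bits & 0x800:
        # Bit 11 set: the name was stored as UTF-8 and is already correct.
        return info.filename
    # zipfile decoded the raw bytes as CP437; undo that, then decode properly.
    return info.filename.encode("cp437").decode(legacy_encoding)


# Simulate the name zipfile would report for a GBK-named legacy entry.
legacy = zipfile.ZipInfo("中文.txt".encode("gbk").decode("cp437"))
print(recover_name(legacy) == "中文.txt")  # True
```

When iterating a real archive, the function would be applied to each item of ZipFile.infolist() before choosing the output path.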

Method 4

This issue with zips has been fixed in the most recent far2l file and archive manager. For far2l's legacy zip charset detection to work properly, your system language setting should match the one set on the system where the archive was created (Windows’ internal “zip folders” tool uses the same logic). You can also set the language just for far2l when launching it:

LANG=zh_CN.UTF-8 far2l

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
