Someone sent me a ZIP file containing files with Hebrew names (created on Windows, not sure with which tool). I use LXDE on Debian Stretch. The Gnome archive manager manages to unzip the file, but the Hebrew characters are garbled. I think I’m getting the original octets extended into Unicode characters, e.g. I have a file whose name has four characters and a .doc suffix, and the characters are: 0x008E 0x0087 0x008E 0x0085. Using the command-line unzip utility is even worse: it refuses to decompress altogether, complaining about an “Invalid or incomplete multibyte or wide character”.
So, my questions are:
- Is there another decompression utility that will decompress my files with the correct names?
- Is there something wrong with the way the file was compressed, or is it just an incompatibility of ZIP implementations? Or even misfeature/bug of the Linux ZIP utilities?
- What can I do to get the correct filenames after having decompressed using the garbled ones?
Answers:
Method 1
It sounds like the filenames are encoded in one of Windows’ proprietary codepages (CP862, 1255, etc).
- Is there another decompression utility that will decompress my files with the correct names?
I’m not aware of a zip utility that supports these code pages natively. 7z has some understanding of encodings, but I believe it has to be an encoding your system knows about more generally (you pick it by setting the LANG environment variable), and Windows codepages likely aren’t among those.
unzip -UU should work from the command line to create files with the correct bytes in their names (by disabling all Unicode support). That is probably the effect you got from GNOME’s tool already. The encoding won’t be right either way, but we can fix that below.
- Is there something wrong with the way the file was compressed, or is it just an incompatibility of ZIP implementations? Or even misfeature/bug of the Linux ZIP utilities?
The file you’ve been given was not created portably. That’s not necessarily wrong for internal use where the encoding is fixed and known in advance, although the format specification says that names are supposed to be either UTF-8 or cp437, and yours are neither. Even between Windows machines, using different codepages doesn’t work out well, but non-Windows machines have no concept of those code pages to begin with. Most tools UTF-8 encode their filenames (which still isn’t always enough to avoid problems).
- What can I do to get the correct filenames after having decompressed using the garbled ones?
If you can identify the encoding of the filenames, you can convert the bytes in the existing names into UTF-8 and move the existing files to the right name. The convmv tool essentially wraps up that process into a single command:
convmv -f cp862 -t utf8 -r .
will try to convert everything inside . from cp862 to UTF-8 (it only prints what it would do until you add --notest).
Alternatively, you can use iconv and find to move everything to their correct names. Something like:
find -mindepth 1 -exec sh -c 'mv "$1" "$(echo "$1" | iconv -f cp862 -t utf8)"' sh {} \;
will find all the files underneath the current directory and try to convert the names into UTF-8.
In either case, you can experiment with different encodings and try to find one that makes sense.
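One way to experiment is to decode the same bytes under several candidate encodings and see which result reads as Hebrew. The following is a minimal sketch, not part of the original answer: the helper names (`garbled_to_bytes`, `try_encodings`) and the list of candidate encodings are assumptions chosen for illustration, and it assumes the garbled name consists of code points U+0080..U+00FF that each stand for one original byte, as described in the question.

```python
# Hypothetical helpers for comparing candidate encodings of a garbled name.

def garbled_to_bytes(name):
    """Map each code point below U+0100 back to the single byte it came from."""
    return name.encode("latin-1")


def try_encodings(raw, encodings=("cp862", "cp1255", "cp437", "iso8859-8")):
    """Decode raw under each candidate; None marks bytes invalid in that encoding."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results
```

For the name from the question, `try_encodings(garbled_to_bytes("\x8e\x87\x8e\x85"))` returns one entry per candidate; the entry that consists of Hebrew letters points to the encoding to pass to convmv or iconv.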
Once you’ve fixed the encoding on your side, if you want to send these files back in the other direction, you may well hit the same problem on the other end. In that case, you can reverse the process before zipping the files up with -UU, since it’s likely to be very hard to fix on the Windows end.
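The renaming step from this answer could also be sketched in Python, for cases where convmv is not installed. This is an illustration under stated assumptions rather than a tested replacement: the function names are made up here, cp862 is only the guess from above, and the sketch assumes a POSIX system and a freshly extracted tree (running it on names that are already correct would mangle them). It defaults to a dry run, in the spirit of convmv without --notest.

```python
import os


def original_bytes(name):
    """Recover the bytes the archive stored for this name (assumed mapping)."""
    try:
        # Case from the question: each byte was widened to one code point.
        return name.encode("latin-1")
    except UnicodeEncodeError:
        # Otherwise fall back to the on-disk bytes of the name.
        return os.fsencode(name)


def fixed_name(name, source_encoding="cp862"):
    """Reinterpret the stored bytes in the guessed source encoding."""
    return original_bytes(name).decode(source_encoding)


def rename_tree(root, source_encoding="cp862", dry_run=True):
    """Plan (and optionally perform) renames, children before parents."""
    planned = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            new = fixed_name(name, source_encoding)
            if new == name:
                continue
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, new)
            planned.append((old_path, new_path))
            if not dry_run:
                os.rename(old_path, new_path)
    return planned
```

Calling `rename_tree(".")` returns the list of planned (old, new) pairs without touching anything; passing `dry_run=False` performs the renames once the plan looks right.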
Method 2
I had success with the command 7z x <source.zip>.
Version:
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,[...])
Potentially relevant environment:
LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_CTYPE=UTF-8
It was able to decompress all files with 8-bit characters in their filenames, with some of these characters skipped, some garbled.
Method 3
I have just had the same problem, and it turns out that my version of unzip that is available from Ubuntu repositories (UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.) can handle automatic decoding of filenames if you specify the -a switch.
unzip -a stupid.zip
Method 4
I had a similar problem with decoding a zip archive with Cyrillic characters. A very short Python script did the job properly:
#!/usr/bin/python
import zipfile
import sys
zipfile.ZipFile(sys.argv[1], 'r').extractall(sys.argv[2] if len(sys.argv) > 2 else '.')
Then just save it as unzip_enc, make it executable, and run it as unzip_enc ZIP_FILE [TARGET_DIR]
For me neither the unzip -UU, unzip -a nor LANG* environment variables did any good.
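If the plain script still produces garbled names under Python 3, a variation is to re-decode each name yourself. Python 3's zipfile module decodes names as cp437 when the archive's UTF-8 flag (bit 11 of the general purpose flags) is not set, so the original bytes can be recovered by encoding back to cp437. The sketch below is an assumption-laden illustration, not the answerer's code: `decode_name` and `extract_with_encoding` are hypothetical names, and cp862 is only a guess that would be cp866 or similar for Cyrillic archives.

```python
import os
import shutil
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def decode_name(info, encoding):
    """Return the entry name, re-decoded if it was not flagged as UTF-8."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename
    # zipfile decoded the raw bytes as cp437; undo that, then decode properly.
    return info.filename.encode("cp437").decode(encoding)


def extract_with_encoding(zip_file, target_dir=".", encoding="cp862"):
    """Extract every entry under target_dir using re-decoded names."""
    root = os.path.realpath(target_dir)
    written = []
    with zipfile.ZipFile(zip_file) as zf:
        for info in zf.infolist():
            name = decode_name(info, encoding)
            dest = os.path.realpath(os.path.join(root, name))
            # Refuse names that would escape the target directory.
            if os.path.commonpath([root, dest]) != root:
                raise ValueError("unsafe path in archive: %r" % name)
            if info.is_dir():
                os.makedirs(dest, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            with zf.open(info) as src, open(dest, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append(name)
    return written
```

A call such as `extract_with_encoding("archive.zip", "out", "cp866")` would then write the files under `out` with the re-decoded names; trying a different encoding only requires changing the last argument.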
Method 5
I had luck with this combination:
export LANG=es_MX
7z x file.zip
convmv -f cp437 -t utf8 -r .
Add --notest to convmv for the actual rename. Later I found an even better version:
LANG=es_MX.cp437 unzip -UU file.zip
convmv -f cp437 -t utf8 -r . --notest
Method 6
This issue with zips has been fixed in the most recent far2l file and archive manager. For zip legacy charset detection by far2l to work properly, your system language setting should match the one set on the system where the archive was created (Windows’ internal “zip folders” tool uses the same logic). You can also set the language for a single run:
LANG=he_IL.UTF-8 far2l
Method 7
I have a zip archive compressed on Linux (from the command line), and filenames with diacritic characters are not decompressed correctly on Windows, but I successfully unpacked it with Bandizip, which lets you set the charset from the toolbar.
Method 8
I had a problem when extracting a file with Cyrillic symbols in the name using ‘unzip’.
The problem was solved by using ‘ark’:
ark -ab archive.zip
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.