bulk rename (or correctly display) files with special characters

I have a bunch of directories and subdirectories that contain files with special characters, like this file:

robbie@phil:~$ ls test�sktest.txt
test?sktest.txt

Find reveals an escape sequence:

robbie@phil:~$ find test�sktest.txt -ls
424512 4000 -rwxr--r-x   1 robbie   robbie    4091743 Jan 26 00:34 test\323sktest.txt

The only reason I can even type their names on the console is because of tab completion. This also means I can rename them manually (and strip the special character).

I’ve set LC_ALL to UTF-8, which does not seem to help (also not on a new shell):

robbie@phil:~$ echo $LC_ALL
en_US.UTF-8

I’m connecting to the machine using ssh from my mac. It’s an Ubuntu install:

robbie@phil:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=7.10
DISTRIB_CODENAME=gutsy
DISTRIB_DESCRIPTION="Ubuntu 7.10"

Shell is Bash, TERM is set to xterm-color.

These files have been there for quite a while, and they have not been created using that install of Ubuntu. So I don’t know what the system encoding settings used to be.

I’ve tried things along the lines of:

find . -type f -ls | sed 's/[^a-zA-Z0-9]//g'

But I can’t find a solution that does everything I want:

  1. Identify all files that have undisplayable characters (the above ignores way too much)
  2. For all those files in a directory tree (recursively), execute mv oldname newname
  3. Optionally, the ability to transliterate special characters such as ä to a (not required, but would be awesome)

OR

  1. Correctly display all these files (and no errors in applications when trying to open them)

I have bits and pieces, like iterating over all files and moving them, but identifying the files and formatting them correctly for the mv command seems to be the hard part.

Any extra information as to why they do not display correctly, or how to “guess” the correct encoding are also welcome. (I’ve tried convmv but it doesn’t seem to do exactly what I want: http://j3e.de/linux/convmv/)

Answers:

Method 1

I guess you see this invalid character because the name contains a byte sequence that isn’t valid UTF-8. File names on typical unix filesystems (including yours) are byte strings, and it’s up to applications to decide on what encoding to use. Nowadays, there is a trend to use UTF-8, but it’s not universal, especially in locales that could never live with plain ASCII and have been using other encodings since before UTF-8 even existed.

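Try LC_CTYPE=en_US.iso88591 ls to see if the file name makes sense in ISO-8859-1 (latin-1). If it doesn’t, try other locales. Note that only the LC_CTYPE locale setting matters here, but LC_ALL overrides it when set, so if you have exported LC_ALL (as in the question), run LC_ALL=en_US.iso88591 ls instead.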
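
If switching locales one at a time gets tedious, a rough alternative is to decode one offending name under several candidate encodings with iconv and compare the results by eye. The following is a minimal sketch and not part of the original answer: the function name guess_encoding and the candidate list (ISO-8859-1, CP1252, CP850) are assumptions chosen for illustration, and it assumes Bash and an iconv that knows those encoding names.

```shell
# Print one file name as it would read under several candidate encodings.
# The candidate list is an assumption; replace it with your own suspects.
guess_encoding () {
  local name=$1 enc decoded
  for enc in ISO-8859-1 CP1252 CP850; do
    # iconv exits non-zero when the bytes are not valid in $enc,
    # in which case that candidate is silently skipped.
    if decoded=$(printf '%s' "$name" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
      printf '%-12s %s\n' "$enc" "$decoded"
    fi
  done
}
```

Call it with a tab-completed name, for example guess_encoding test�sktest.txt, and look for the line that reads like a real word. The output is UTF-8, so it needs a UTF-8 terminal.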

In a UTF-8 locale, the following command will show you all files whose name is not valid UTF-8:

grep-invalid-utf8 () {
  perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print'
}
find | grep-invalid-utf8
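
The pipeline above is line-based, so a file name that contains a newline would be split into two entries. A null-delimited variant is sketched below. It assumes GNU find (-print0), Bash (read -d ''), and an iconv that rejects malformed UTF-8; the function names are made up for this example. Unlike the Perl pattern, it leaves the definition of "valid UTF-8" to iconv, which may be stricter.

```shell
# Succeed if the argument is a valid UTF-8 byte string.
is_valid_utf8 () {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Print every path under a directory (default: .) that is not valid UTF-8.
list_invalid_utf8 () {
  find "${1:-.}" -print0 |
  while IFS= read -r -d '' path; do
    is_valid_utf8 "$path" || printf '%s\n' "$path"
  done
}
```

Running list_invalid_utf8 /some/tree prints one offending path per line, which is convenient for reading but not for piping into further commands if names may contain newlines.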

You can check if they make more sense in another locale with recode or iconv:

find | grep-invalid-utf8 | recode latin1..utf8
find | grep-invalid-utf8 | iconv -f latin1 -t utf8

Once you’ve determined that a bunch of file names are in a certain encoding (e.g. latin1), one way to rename them is

find | grep-invalid-utf8 |
rename 'BEGIN {binmode STDIN, ":encoding(latin1)"; use Encode;}
        $_=encode("utf8", $_)'

This uses the perl rename command available on Debian and Ubuntu. You can pass it -n to show what it would be doing without actually renaming the files.
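
Where the Perl rename command is not installed, roughly the same renaming can be done with find, iconv and mv. This is a sketch under assumptions rather than the method of the answer: it assumes Bash, GNU find and GNU mv, and that every invalid name really is in the encoding you pass (ISO-8859-1 by default). The function names and the APPLY variable are invented here. It only prints what it would do unless APPLY=1 is set.

```shell
# Convert one name from the given encoding to UTF-8.
to_utf8 () {
  printf '%s' "$1" | iconv -f "$2" -t UTF-8
}

# Rename entries under DIR whose name is not valid UTF-8.
# Usage: fix_names [DIR] [FROM_ENCODING]   (dry run unless APPLY=1)
fix_names () {
  local dir=${1:-.} from=${2:-ISO-8859-1} path parent base new
  # -depth lists a directory's contents before the directory itself, so
  # renaming a directory never invalidates a path that is still queued.
  # -mindepth 1 skips DIR itself, so every path contains a slash.
  find "$dir" -mindepth 1 -depth -print0 |
  while IFS= read -r -d '' path; do
    parent=${path%/*}
    base=${path##*/}
    # Names that are already valid UTF-8 are left alone.
    printf '%s' "$base" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 && continue
    new=$(to_utf8 "$base" "$from") || continue
    if [ "${APPLY:-0}" = 1 ]; then
      mv -n -- "$path" "$parent/$new"   # -n: never overwrite an existing file
    else
      printf 'would rename: %s -> %s\n' "$path" "$parent/$new"
    fi
  done
}
```

A cautious workflow would be fix_names /path/to/tree ISO-8859-1 first, then APPLY=1 fix_names /path/to/tree ISO-8859-1 once the preview looks right. Only the last path component is converted at each step, which is why the depth-first order matters.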

Method 2

I know this is an old question, but I have been searching all night for a similar solution. I found a few helpful tips, but they did not do exactly what I needed, so I had to mix and match a few to get the outcome I was looking for.

To simply replace special characters with a dot (.):

# LC_ALL=C makes sed work byte by byte, so bytes that are invalid in the current locale are replaced too
for f in *.txt; do mv -- "$f" "$(echo "$f" | LC_ALL=C sed 's/[^a-zA-Z0-9.]/./g')"; done

To use it in a cron job that runs every minute, I did the following:

*/1 * * * * cd /path/to/files/ && for f in *.txt; do mv -- "$f" "$(echo "$f" | LC_ALL=C sed 's/[^a-zA-Z0-9.]/./g')"; done >/dev/null 2>&1

I hope someone finds this helpful as it has made my day 🙂
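
Replacing every unusual byte with a dot throws information away. For names that are already valid UTF-8, the transliteration asked about in the question (ä to a) can be approximated with the //TRANSLIT suffix of iconv. The sketch below is an assumption-laden illustration, not something from this answer: //TRANSLIT is a GNU iconv extension, the replacement chosen for each character depends on the libc and the current locale, and characters with no approximation may come out as "?". The function name ascii_name is hypothetical.

```shell
# Print an ASCII approximation of a UTF-8 string such as a file name.
# //TRANSLIT is a GNU iconv extension; the exact output depends on the
# libc and the current locale, so check the result before renaming.
ascii_name () {
  printf '%s' "$1" | iconv -f UTF-8 -t ASCII//TRANSLIT
}
```

It could be combined with a loop like the one above, for example mv -- "$f" "$(ascii_name "$f")", after first checking what ascii_name prints for a few sample names.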

Method 3

Now that you know which encoding is used for the file names on the remote end (“latin1”, according to the comments on the first answer), you could also go the second way: run a local terminal and ssh from it in such a way that the remote file names are displayed correctly, rather than the first way of renaming them.

Like me, you could start a terminal locally that would work in that special encoding, perhaps, like this:

LC_ALL=en_US.latin1 xvt &

xvt stands for your terminal program.

Perhaps, the existing locale is called en_US.iso88591, and not en_US.latin1, as I assumed.
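
To find out which spelling your system actually uses, you can filter the output of locale -a. The helper below is only a sketch: its name is made up, and the pattern covers just the common spellings with an explicit charset suffix (iso88591, ISO-8859-1, latin1), which is an assumption about how your system names its locales.

```shell
# Keep only locale names whose charset suffix looks like ISO-8859-1.
# The trailing $ keeps ISO-8859-15 (iso885915) out of the result.
latin1_locales () {
  grep -i -E '\.(iso-?8859-?1|latin1)$'
}
```

Run locale -a | latin1_locales and use one of the printed names in place of en_US.latin1 in the command above. If nothing is printed, no such locale is generated on that machine yet.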

Method 4

This doesn’t meet the bulk requirements, but I have just had a similar problem where I had multiple versions of a file with similar names which differed only by a single weird character. Unfortunately this meant that I could not rename the offenders using the wildcard trick I usually use.

In the end I used FileZilla to connect as an SFTP client, browsed to the files and renamed them using the GUI. FileZilla handled the dodgy characters quite well.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
