I have a file in UTF-8 encoding with BOM and want to remove the BOM. Are there any Linux command-line tools to remove the BOM from the file?
$ file test.xml
test.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines
Answers:
Method 1
If you’re not sure if the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn’t.
sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt
You can also overwrite the existing file with the -i option:
sed -i '1s/^\xEF\xBB\xBF//' orig.txt
If you are using the BSD version of sed (e.g. macOS), then you need to have bash do the escaping:
sed $'1s/\xef\xbb\xbf//' < orig.txt > new.txt
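To check whether a file actually starts with a UTF-8 BOM before or after running sed, something like the following may help. This is a minimal sketch: has_utf8_bom is a hypothetical helper name, and it assumes a head that supports -c (GNU and BSD both do).

```shell
# has_utf8_bom FILE: succeed (exit status 0) if FILE starts with bytes EF BB BF.
# has_utf8_bom is a hypothetical helper name, not a standard tool.
has_utf8_bom() {
    # Dump the first three bytes as hex and drop the whitespace od emits.
    first3=$(head -c 3 < "$1" | od -An -tx1 | tr -d ' \n')
    [ "$first3" = "efbbbf" ]
}
```

It can be used as a guard, e.g. has_utf8_bom orig.txt && echo "BOM present".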
Method 2
A BOM doesn’t make sense in UTF-8, which has no byte-order ambiguity. It is generally added by mistake by buggy software on Microsoft OSes.
dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.
dos2unix test.xml
Method 3
Using Vim:
1. Open the file in Vim:
vi text.xml
2. Remove the BOM:
:set nobomb
3. Save and quit:
:wq
For a non-interactive solution, try the following command line:
vi -c ":set nobomb" -c ":wq" text.xml
That should remove the BOM, save the file and quit, all from the command line.
Method 4
It is possible to remove the BOM from a file with the tail command:
tail -c +4 withBOM.txt > withoutBOM.txt
Be aware that this chops the first 3 bytes from the file (output starts at byte 4), so be sure that the file really contains the BOM before running tail.
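One way to make the tail approach safe is to look at the first three bytes before cutting anything. The following is a sketch only: strip_bom_tail is a hypothetical function name, and it assumes head -c is available.

```shell
# strip_bom_tail IN OUT: copy IN to OUT, dropping a leading UTF-8 BOM if present.
# strip_bom_tail is a hypothetical helper name used for illustration.
strip_bom_tail() {
    first3=$(head -c 3 < "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$first3" = "efbbbf" ]; then
        # Start output at byte 4, i.e. skip exactly the 3 BOM bytes.
        tail -c +4 < "$1" > "$2"
    else
        # No BOM: copy the file unchanged.
        cat < "$1" > "$2"
    fi
}
```

For example, strip_bom_tail withBOM.txt withoutBOM.txt leaves a file without a BOM byte-for-byte identical in the output.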
Method 5
You can use
LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- filename
to remove the byte order mark from the beginning of the file, if it has any, as well as convert any CR LF newlines to LF only. The LANG=C LC_ALL=C tells the shell you want the command to run in the default C locale (also known as the default POSIX locale), where the three bytes forming the Byte Order Mark are treated as bytes. The -i option to sed means in-place. If you use -i.old, then sed saves the original file as filename.old, and the new file (with the modifications, if any) as filename.
I personally like to have this as ~/bin/ms-fix; for example, as
#!/bin/dash
# Run sed in the C locale so the BOM is matched as three raw bytes.
export LANG=C LC_ALL=C
if [ $# -gt 0 ]; then
    # File arguments given: fix each file in place.
    for FILE in "$@" ; do
        sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$FILE" || exit 1
    done
else
    # No arguments: act as a filter from stdin to stdout.
    exec sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
fi
so that if I need to apply this to say all C source files and headers (my old code from the MS-DOS era, for example!), I just run
find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/ms-fix
or, if I just want to look at such a file, without modifying it, I can run
~/bin/ms-fix < filename | less
and not see the ugly <U+FEFF> in my UTF-8 terminal.
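To see what the sed program does without touching a real file, it can be fed a small synthetic sample and the result inspected with od. This is only a sketch: ms_fix_filter is a hypothetical name for the filter form of the script, and it assumes GNU sed, which understands \r and \xHH in regular expressions.

```shell
# ms_fix_filter: the same sed program as the script, used as a stdin filter.
# Assumes GNU sed (for the \r and \xHH escapes); the name is hypothetical.
ms_fix_filter() {
    LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
}

# Sample input: a BOM followed by two CRLF-terminated lines.
# The od -c dump should show neither 357 273 277 nor \r if the filter worked.
printf '\357\273\277line one\r\nline two\r\n' | ms_fix_filter | od -c
```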
Method 6
I use a vim one-liner on the regular for this:
vim --clean -c 'se nobomb|wq' filename
vim --clean -c 'bufdo se nobomb|wqa' filename1 filename2 ...
Method 7
I have a slightly different problem, and am putting this here for someone who, like me, ends up here with data full of ZERO WIDTH NO-BREAK SPACE characters (which are known as Byte Order Mark when they are the first character of the file).
I got this data by copying out of the Grafana query metrics field, and it had multiple (17) \xef\xbb\xbf sequences (which show up in vim as rate<feff>(<feff>node<feff>{<feff>job<feff>) in a single line with only 81 actual characters.
I modified Nominal Animal’s code just slightly:
LANG=C LC_ALL=C sed -e 's/\xef\xbb\xbf//g'
And the :set nobomb thing in vim only removes the very first one in the file.
I also tried this:
LANG=C vim b
Then vim doesn’t show them, but they are still there (even after a write…)
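Because these characters are invisible in most editors, it can help to count them before and after cleaning. The following is a rough sketch: count_feff is a hypothetical helper name, and it assumes a grep that supports -o and -a (GNU and BSD grep both do).

```shell
# count_feff FILE: print how many EF BB BF byte sequences FILE contains.
# count_feff is a hypothetical helper name used for illustration.
count_feff() {
    # Build the three-byte pattern with POSIX octal escapes.
    bom=$(printf '\357\273\277')
    # -a: treat the file as text; -o: one match per line; -F: fixed string.
    LC_ALL=C grep -a -o -F -- "$bom" "$1" | wc -l | tr -d ' '
}
```

Running count_feff on the file before and after the sed command should show the count dropping to 0.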
Method 8
I had the same question and ended up writing a dedicated utility bom(1) for this. It’s available at https://github.com/archiecobbs/bom.
Here’s the man page:
NAME
     bom -- Decode Unicode byte order mark

SYNOPSIS
     bom --strip [--expect types] [--lenient] [--prefer32] [--utf8] [file]
     bom --detect [--expect types] [--prefer32] [file]
     bom --print type
     bom --list
     bom --help
     bom --version

DESCRIPTION
     bom decodes, verifies, reports, and/or strips the byte order mark (BOM) at
     the start of the specified file, if any.

     When no file is specified, or when file is -, read standard input.

OPTIONS
     -d, --detect
             Report the detected BOM type to standard output and then exit.
             See SUPPORTED BOM TYPES for possible values.

     -e, --expect types
             Expect to find one of the specified BOM types, otherwise exit
             with an error.
             Multiple types may be specified, separated by commas.
             Specifying NONE is acceptable and matches when the file has no
             (supported) BOM.

     -h, --help
             Output command line usage help.

     -l, --lenient
             Silently ignore any illegal byte sequences encountered when
             converting the remainder of the file to UTF-8.
             Without this flag, bom will exit immediately with an error if an
             illegal byte sequence is encountered.
             This flag has no effect unless the --utf8 flag is given.

     --list  List the supported BOM types and exit.

     -p, --print type
             Output the byte sequence corresponding to the type byte order
             mark.

     --prefer32
             Used to disambiguate the byte sequence FF FE 00 00, which can be
             either a UTF-32LE BOM or a UTF-16LE BOM followed by a NUL
             character.
             Without this flag, UTF-16LE is assumed; with this flag, UTF-32LE
             is assumed.

     -s, --strip
             Strip the BOM, if any, from the beginning of the file and output
             the remainder of the file.

     -u, --utf8
             Convert the remainder of the file to UTF-8, assuming the
             character encoding implied by the detected BOM.
             For files with no (supported) BOM, this flag has no effect and
             the remainder of the file is copied unmodified.
             For files with a UTF-8 BOM, the identity transformation is still
             applied, so (for example) illegal byte sequences will be
             detected.

     -v, --version
             Output program version and exit.

SUPPORTED BOM TYPES
     The supported BOM types are:

     NONE    No supported BOM was detected.

     UTF-7   A UTF-7 BOM was detected.

     UTF-8   A UTF-8 BOM was detected.

     UTF-16BE
             A UTF-16 (Big Endian) BOM was detected.

     UTF-16LE
             A UTF-16 (Little Endian) BOM was detected.

     UTF-32BE
             A UTF-32 (Big Endian) BOM was detected.

     UTF-32LE
             A UTF-32 (Little Endian) BOM was detected.

     GB18030
             A GB18030 (Chinese National Standard) BOM was detected.

EXAMPLES
     To tell what kind of byte order mark a file has:

           $ bom --detect

     To normalize files with byte order marks into UTF-8, and pass other files
     through unchanged:

           $ bom --strip --utf8

     Same as previous example, but discard illegal byte sequences instead of
     generating an error:

           $ bom --strip --utf8 --lenient

     To verify a properly encoded UTF-8 or UTF-16 file with a byte-order-mark
     and output it as UTF-8:

           $ bom --strip --utf8 --expect UTF-8,UTF-16LE,UTF-16BE

     To just remove any byte order mark and get on with your life:

           $ bom --strip file

RETURN VALUES
     bom exits with one of the following values:

     0       Success.

     1       A general error occurred.

     2       The --expect flag was given but the detected BOM did not match.

     3       An illegal byte sequence was detected (and --lenient was not
             specified).

SEE ALSO
     iconv(1)

     bom: Decode Unicode byte order mark, https://github.com/archiecobbs/bom.
Method 9
Recently I found this tiny command-line tool which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at GitHub)
One small drawback: only the plain C++ source code is available for download. You have to create the makefile (with CMake, for example) and compile it yourself; binaries are not provided on this page.
Method 10
I know it’s been a while, but since I had a slightly different issue, I’m posting so others may benefit.
My text file was randomly haunted by U+FEFF characters. Luckily for me, they appeared only at the start of lines, and the set of allowed characters was limited to alphanumerics.
The command below, run in vim, removes the first character of each line if it is non-alphanumeric. Use it with caution, as your set of allowed characters might vary.
:%s/^[^a-zA-Z0-9]//g
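If other non-alphanumeric characters may legitimately start a line, a narrower variant is to remove only the U+FEFF bytes at line starts. The following is a sketch under assumptions: strip_line_start_feff is a hypothetical name, and the \xHH escapes require GNU sed.

```shell
# strip_line_start_feff: stdin filter that removes one or more U+FEFF
# characters (bytes EF BB BF) at the start of every line, and nothing else.
# Assumes GNU sed for the \xHH escapes; the function name is hypothetical.
strip_line_start_feff() {
    LC_ALL=C sed -e 's/^\(\xef\xbb\xbf\)*//'
}
```

It reads standard input and writes standard output, e.g. strip_line_start_feff < input.txt > output.txt.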
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.