How can I remove the BOM from a UTF-8 file?

I have a file in UTF-8 encoding with BOM and want to remove the BOM. Are there any linux command-line tools to remove the BOM from the file?

$ file test.xml
test.xml:  XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines

Answers:


Method 1

If you’re not sure if the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn’t.

sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt

You can also overwrite the existing file with the -i option:

sed -i '1s/^\xEF\xBB\xBF//' orig.txt

If you are using the BSD version of sed (e.g. on macOS), then you need to have bash do the escaping with its $'...' quoting:

sed $'1s/\xef\xbb\xbf//' < orig.txt > new.txt
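If neither GNU sed nor bash can be assumed, one possible variant of the sed commands above is to let printf build the three BOM bytes from octal escapes and interpolate them into the sed script. This is only a sketch: the variable name bom and the function name strip_bom are made up for illustration, and LC_ALL=C is set so that sed treats the input as plain bytes.

```shell
# Build the three BOM bytes (EF BB BF) using printf octal escapes.
bom=$(printf '\357\273\277')

# strip_bom (hypothetical helper): read stdin, write stdout,
# dropping a UTF-8 BOM at the start of line 1 if one is present.
strip_bom() {
    LC_ALL=C sed "1s/^$bom//"
}

# Usage (file names as in the answer above):
#   strip_bom < orig.txt > new.txt
```

Because the escaping is done by printf rather than by sed or by bash, the same script should behave the same under dash, bash, GNU sed and BSD sed.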

Method 2

A BOM doesn’t make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.

dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.

dos2unix test.xml

Method 3

Using VIM

  1. Open file in VIM:
     vi text.xml
  2. Remove BOM encoding:
     :set nobomb
  3. Save and quit:
     :wq

For a non-interactive solution, try the following command line:

vi -c ":set nobomb" -c ":wq" text.xml

That should remove the BOM, save the file and quit, all from the command line.

Method 4

It is possible to remove the BOM from a file with the tail command:

tail -c +4 withBOM.txt > withoutBOM.txt

Be aware that this unconditionally chops the first 3 bytes from the file (output starts at byte 4), so be sure that the file really contains the BOM before running tail.
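To make the tail approach safe to run on files that may or may not start with a BOM, the first three bytes can be checked before anything is cut. The following is a minimal sketch; has_bom and strip_bom_tail are hypothetical helper names, and it assumes a head that accepts -c (both GNU and BSD head do).

```shell
# has_bom FILE: succeed if FILE starts with the bytes EF BB BF.
has_bom() {
    [ "$(head -c 3 -- "$1" | od -A n -t x1 | tr -d ' \n')" = "efbbbf" ]
}

# strip_bom_tail FILE: print FILE without its leading BOM,
# or print it unchanged if there is no BOM.
strip_bom_tail() {
    if has_bom "$1"; then
        tail -c +4 -- "$1"
    else
        cat -- "$1"
    fi
}

# Usage (file names as in the answer above):
#   strip_bom_tail withBOM.txt > withoutBOM.txt
```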

Method 5

You can use

LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- filename

to remove the byte order mark from the beginning of the file, if it has any, as well as convert any CR LF newlines to LF only. The LANG=C LC_ALL=C tells the shell you want the command to run in the default C locale (also known as the default POSIX locale), where the three bytes forming the Byte Order Mark are treated as bytes. The -i option to sed means in-place. If you use -i.old, then sed saves the original file as filename.old, and the new file (with the modifications, if any) as filename.

I personally like to have this as ~/bin/ms-fix; for example, as

#!/bin/dash
export LANG=C LC_ALL=C
if [ $# -gt 0 ]; then
    for FILE in "$@" ; do
        sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$FILE" || exit 1
    done
else
    exec sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
fi

so that if I need to apply this to say all C source files and headers (my old code from the MS-DOS era, for example!), I just run

find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/ms-fix

or, if I just want to look at such a file, without modifying it, I can run

~/bin/ms-fix < filename | less

and not see the ugly <U+FEFF> in my UTF-8 terminal.

Method 6

I use a vim one-liner on the regular for this:

vim --clean -c 'se nobomb|wq' filename

vim --clean -c 'bufdo se nobomb|wqa' filename1 filename2 ...

Method 7

I have a slightly different problem, and am putting this here for someone who, like me, ends up here with data full of ZERO WIDTH NO-BREAK SPACE characters (which are known as Byte Order Mark when they are the first character of the file).

I got this data by copying it out of the Grafana query metrics field, and it had multiple (17) \xef\xbb\xbf sequences in a single line with only 81 actual characters. In vim they show up as rate<feff>(<feff>node<feff>{<feff>job<feff>).

I modified Nominal Animal’s code just slightly:

LANG=C LC_ALL=C sed -e 's/\xef\xbb\xbf//g'

Note that :set nobomb in vim only removes the very first one in the file.

I also tried this:

LANG=C vim b

Then vim doesn’t show them, but they are still there (even after a write).
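Since these characters are invisible in many viewers, it can help to count them before and after the cleanup. The sketch below is one way to do that; count_feff and strip_all_feff are hypothetical names, the bytes are built with printf so that no \x support is needed in sed, and grep -o is assumed to be available (it is an extension found in both GNU and BSD grep).

```shell
# The UTF-8 encoding of U+FEFF (EF BB BF), built with printf octal escapes.
feff=$(printf '\357\273\277')

# count_feff: print how many times the sequence occurs on stdin.
count_feff() {
    LC_ALL=C grep -o "$feff" | wc -l | tr -d ' '
}

# strip_all_feff: remove every occurrence, not just one at the start of the file.
strip_all_feff() {
    LC_ALL=C sed "s/$feff//g"
}

# Usage (query.txt is a placeholder file name):
#   count_feff < query.txt
#   strip_all_feff < query.txt | count_feff
```

A count of 0 after the second command indicates that no U+FEFF sequences remain.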

Method 8

I had the same question and ended up writing a dedicated utility, bom(1), for this. It’s available at https://github.com/archiecobbs/bom.

Here’s the man page:

NAME
     bom -- Decode Unicode byte order mark

SYNOPSIS
     bom --strip [--expect types] [--lenient] [--prefer32] [--utf8] [file]
     bom --detect [--expect types] [--prefer32] [file]
     bom --print type
     bom --list
     bom --help
     bom --version

DESCRIPTION
     bom decodes, verifies, reports, and/or strips the byte order mark (BOM) at the
     start of the specified file, if any.

     When no file is specified, or when file is -, read standard input.

OPTIONS
     -d, --detect
             Report the detected BOM type to standard output and then exit.

             See SUPPORTED BOM TYPES for possible values.

     -e, --expect types
             Expect to find one of the specified BOM types, otherwise exit with an
             error.

             Multiple types may be specified, separated by commas.

             Specifying NONE is acceptable and matches when the file has no (sup-
             ported) BOM.

     -h, --help
             Output command line usage help.

     -l, --lenient
             Silently ignore any illegal byte sequences encountered when converting
             the remainder of the file to UTF-8.

             Without this flag, bom will exit immediately with an error if an ille-
             gal byte sequence is encountered.

             This flag has no effect unless the --utf8 flag is given.

     --list  List the supported BOM types and exit.

     -p, --print type
             Output the byte sequence corresponding to the type byte order mark.

     --prefer32
             Used to disambiguate the byte sequence FF FE 00 00, which can be
             either a UTF-32LE BOM or a UTF-16LE BOM followed by a NUL character.

             Without this flag, UTF-16LE is assumed; with this flag, UTF-32LE is
             assumed.

     -s, --strip
             Strip the BOM, if any, from the beginning of the file and output the
             remainder of the file.

     -u, --utf8
             Convert the remainder of the file to UTF-8, assuming the character
             encoding implied by the detected BOM.

             For files with no (supported) BOM, this flag has no effect and the
             remainder of the file is copied unmodified.

             For files with a UTF-8 BOM, the identity transformation is still
             applied, so (for example) illegal byte sequences will be detected.

     -v, --version
             Output program version and exit.

SUPPORTED BOM TYPES
     The supported BOM types are:

     NONE    No supported BOM was detected.

     UTF-7   A UTF-7 BOM was detected.

     UTF-8   A UTF-8 BOM was detected.

     UTF-16BE
             A UTF-16 (Big Endian) BOM was detected.

     UTF-16LE
             A UTF-16 (Little Endian) BOM was detected.

     UTF-32BE
             A UTF-32 (Big Endian) BOM was detected.

     UTF-32LE
             A UTF-32 (Little Endian) BOM was detected.

     GB18030
             A GB18030 (Chinese National Standard) BOM was detected.

EXAMPLES
     To tell what kind of byte order mark a file has:

           $ bom --detect

     To normalize files with byte order marks into UTF-8, and pass other files
     through unchanged:

           $ bom --strip --utf8

     Same as previous example, but discard illegal byte sequences instead of gener-
     ating an error:

           $ bom --strip --utf8 --lenient

     To verify a properly encoded UTF-8 or UTF-16 file with a byte-order-mark and
     output it as UTF-8:

           $ bom --strip --utf8 --expect UTF-8,UTF-16LE,UTF-16BE

     To just remove any byte order mark and get on with your life:

           $ bom --strip file

RETURN VALUES
     bom exits with one of the following values:

     0       Success.

     1       A general error occurred.

     2       The --expect flag was given but the detected BOM did not match.

     3       An illegal byte sequence was detected (and --lenient was not speci-
             fied).

SEE ALSO
     iconv(1)

     bom: Decode Unicode byte order mark, https://github.com/archiecobbs/bom.

Method 9

Recently I found this tiny command-line tool which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at GitHub)

One small drawback: only the plain C++ source code is available for download. You have to create the makefile (with CMake, for example) and compile it yourself; binaries are not provided on this page.

Method 10

I know it’s been a while, but since I had a slightly different issue, I’m posting so others may benefit.

My text file was randomly haunted by feff characters. Luckily for me, they appeared only at the start of lines, and the set of allowed characters was limited to alphanumerics.

The vim command below removes the first character of each line if it is non-alphanumeric. Use it with caution, as your set of allowed characters might vary.

:%s/^[^a-zA-Z0-9]//g


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
