Convert a string’s character encoding from windows-1252 to utf-8

I converted a Word document (.docx) to HTML, and the converted HTML uses windows-1252 as its character encoding. In .NET, all the special characters in this windows-1252 content are displayed as ‘�’. The HTML is shown in a Rad Editor, which displays it correctly when the HTML is in UTF-8.

I tried the following code, but in vain:

Encoding wind1252 = Encoding.GetEncoding(1252);  
Encoding utf8 = Encoding.UTF8;  
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);  
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);  
char[] utf8Chars = new char[utf8.GetCharCount(utf8Bytes, 0, utf8Bytes.Length)];   
utf8.GetChars(utf8Bytes, 0, utf8Bytes.Length, utf8Chars, 0);  
string utf8String = new string(utf8Chars);

Any suggestions on how to convert the html into UTF-8?

Answers:


Method 1

This should do it:

Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;  
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);

Method 2

Actually the problem lies here

byte[] wind1252Bytes = wind1252.GetBytes(strHtml);

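We should not get the bytes from the HTML string. By the time the file content is held in a .NET string it has already been decoded, so the bytes have to be read directly from the file instead. I tried the code below and it worked: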
Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = ReadFile(Server.MapPath(HtmlFile));
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);


public static byte[] ReadFile(string filePath)      
    {      
        byte[] buffer;   
        FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);  
        try
        {
            int length = (int)fileStream.Length;  // get file length    
            buffer = new byte[length];            // create buffer     
            int count;                            // actual number of bytes read     
            int sum = 0;                          // total number of bytes read    

            // read until Read method returns 0 (end of the stream has been reached)    
            while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
                sum += count;  // sum is a buffer offset for next reading
        }
        finally
        {
            fileStream.Close();
        }
        return buffer;
    }
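
The point made in this method, that the raw file bytes are what must be interpreted as windows-1252, can also be sketched more compactly. This is a minimal sketch rather than the answerer's code: the class and method names (`HtmlEncodingSketch`, `ReadWindows1252Html`, `SaveAsUtf8`) are hypothetical, and the `RegisterProvider` call assumes a modern .NET runtime (.NET Core / .NET 5+), where code page 1252 is not available by default. On .NET Framework, where 1252 is built in, that line may be unnecessary.

```csharp
using System.IO;
using System.Text;

public static class HtmlEncodingSketch
{
    // Reads a file whose bytes are windows-1252 and returns a .NET string.
    public static string ReadWindows1252Html(string filePath)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // have to be registered before Encoding.GetEncoding(1252) works.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // Decode the raw bytes directly; no intermediate string is involved.
        byte[] rawBytes = File.ReadAllBytes(filePath);
        return windows1252.GetString(rawBytes);
    }

    // Writes the string back out as UTF-8 without a byte order mark.
    public static void SaveAsUtf8(string html, string outputPath)
    {
        File.WriteAllText(outputPath, html, new UTF8Encoding(false));
    }
}
```

Once the bytes have been decoded with the right encoding, the string itself has no "windows-1252" or "UTF-8" property; the target encoding only matters again when the string is written to a file or a response.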

Method 3

How are you planning to use the resulting HTML? In my opinion, the most appropriate way to solve your problem would be to add a meta tag that specifies the encoding. Something like:

<meta http-equiv="content-type" content="text/html;charset=UTF-8" />

Method 4

Use the Encoding.Convert method. Details are in the MSDN article on Encoding.Convert.
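
As a rough illustration of what `Encoding.Convert` does at the byte level, here is a small sketch. The wrapper name `Windows1252ToUtf8` is hypothetical, and the `RegisterProvider` line again assumes .NET Core / .NET 5+. Note that the input is a byte array that really is windows-1252 encoded, not bytes obtained from an already-decoded string.

```csharp
using System.Text;

public static class ConvertSketch
{
    // Converts windows-1252 encoded bytes to UTF-8 encoded bytes.
    public static byte[] Windows1252ToUtf8(byte[] windows1252Bytes)
    {
        // Assumption: needed on .NET Core / .NET 5+ to make code page 1252 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        return Encoding.Convert(windows1252, Encoding.UTF8, windows1252Bytes);
    }
}
```

For example, the single windows-1252 byte 0xE9 (é) becomes the two UTF-8 bytes 0xC3 0xA9.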


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
