WebClient.DownloadString() returns string with peculiar characters

I have an issue with some content that we are downloading from the web for a screen scraping tool that I am building.

in the code below, the string returned from the web client download string method returns some odd characters for the source download for a few (not all) web sites.

I have recently added http headers as below. Previously the same code was called without the headers to the same effect. I have not tried variations on the ‘Accept-Charset’ header, I don’t know much about text encoding other than the basics.

The characters, or character sequences that I refer to are:



and

Â

These characters are not seen when you use “view source” in a web browser. What could be causing this and how can I rectify the problem?

string urlData = String.Empty;
WebClient wc = new WebClient();

// Add headers to impersonate a web browser. Some web sites 
// will not respond correctly without these headers
wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12");
wc.Headers.Add("Accept", "*/*");
wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

urlData = wc.DownloadString(uri);

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

 is the windows-1252 representation of the octets EF BB BF. That’s the UTF-8 byte-order marker, which implies that your remote web page is encoded in UTF-8 but you’re reading it as if it were windows-1252. According to the docs, WebClient.DownloadString uses Webclient.Encoding as its encoding when it converts the remote resource into a string. Set it to System.Text.Encoding.UTF8 and things should theoretically work.

Method 2

The way WebClient.DownloadString is implemented is very dumb. It should get the character encoding from the Content-Type header in the response, but instead it expects the developer to tell the expected encoding beforehand. I don’t know what the developers of this class were thinking.

I have created an auxiliary class that retrieves the encoding name from the Content-Type header of the response:

public static class WebUtils
{
    public static Encoding GetEncodingFrom(
        NameValueCollection responseHeaders,
        Encoding defaultEncoding = null)
    {
        if(responseHeaders == null)
            throw new ArgumentNullException("responseHeaders");

        //Note that key lookup is case-insensitive
        var contentType = responseHeaders["Content-Type"];
        if(contentType == null)
            return defaultEncoding;

        var contentTypeParts = contentType.Split(';');
        if(contentTypeParts.Length <= 1)
            return defaultEncoding;

        var charsetPart =
            contentTypeParts.Skip(1).FirstOrDefault(
                p => p.TrimStart().StartsWith("charset", StringComparison.InvariantCultureIgnoreCase));
        if(charsetPart == null)
            return defaultEncoding;

        var charsetPartParts = charsetPart.Split('=');
        if(charsetPartParts.Length != 2)
            return defaultEncoding;

        var charsetName = charsetPartParts[1].Trim();
        if(charsetName == "")
            return defaultEncoding;

        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch(ArgumentException ex) 
        {
            throw new UnknownEncodingException(
                charsetName,   
                "The server returned data in an unknown encoding: " + charsetName, 
                ex);
        }
    }
}

(UnknownEncodingException is a custom exception class, feel free to replace for InvalidOperationException or whatever else if you want)

Then the following extension method for the WebClient class will do the trick:

public static class WebClientExtensions
{
    public static string DownloadStringAwareOfEncoding(this WebClient webClient, Uri uri)
    {
        var rawData = webClient.DownloadData(uri);
        var encoding = WebUtils.GetEncodingFrom(webClient.ResponseHeaders, Encoding.UTF8);
        return encoding.GetString(rawData);
    }
}

So in your example you would do:

urlData = wc.DownloadStringAwareOfEncoding(uri);

…and that’s it.

Method 3

var client = new WebClient { Encoding = System.Text.Encoding.UTF8 };

var json = client.DownloadString(url);

Method 4

None of them didn’t work for me for some special websites such as “www.yahoo.com”. The only way which I resolve my problem was changing DownloadString to OpenRead and using UserAgent header like sample code. However, a few sites like “www.varzesh3.com” didn’t work with any of methods!

WebClient client = new WebClient()    
client.Headers.Add(HttpRequestHeader.UserAgent, "");
var stream = client.OpenRead("http://www.yahoo.com");
StreamReader sr = new StreamReader(stream);
s = sr.ReadToEnd();

Method 5

in my case , i deleted ever header related to language ,charset etc
EXcept user agent and cookie . it worked..

 // try commenting
 //wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
 //wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x