Method to strip HTML tags not in a safe list

Is there a method that strips all HTML tags that are not on a safe tags list? If there isn’t, what would be the method to achieve it?

~~I’m looking for something just like PHP’s strip_tags function.~~

Answers:

Method 1

Do. Not. Use. Regex. To. Parse. HTML.

Use an XML parser:
MSDN Reference
Simple Tutorial
HTMLAgilityPack
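
To make the warning concrete, here is a minimal sketch of how a regex-based stripper can be bypassed. `NaiveStripDemo` and `StripScriptTags` are hypothetical names invented for this illustration, and the pattern is only one example of a naive approach, not something taken from the answers here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveStripDemo
{
    // Hypothetical regex-based stripper of the kind the advice above warns against.
    // Regex.Replace makes a single pass and does not rescan its own output.
    public static string StripScriptTags(string input)
    {
        return Regex.Replace(input, "</?script[^>]*>", string.Empty, RegexOptions.IgnoreCase);
    }
}
```

Calling `NaiveStripDemo.StripScriptTags("<scr<script>ipt>alert(1)</scr</script>ipt>")` returns `<script>alert(1)</script>`: removing the inner tags splices the outer fragments back into a working script element. A parser avoids this class of problem because it works on the document tree rather than on the raw text.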

Method 2

NullUserException’s answer is perfect. I made a small extension method to do it, and I’m posting it here in case anyone else needs it.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.IO;

namespace Extenders
{
    public static class StringExtender
    {
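        internal static void ParseHtmlDocument(XmlDocument doc, XmlNode root, string[] allowedTags, string[] allowedAttributes, string[] allowedStyleKeys)
        {
            if (root == null) root = doc.DocumentElement;

            // Snapshot the children first: ChildNodes is a live list, and replacing
            // a node while enumerating it can end the loop early and leave the
            // remaining siblings unsanitized.
            var nodes = root.ChildNodes.Cast<XmlNode>().ToList();

            foreach (XmlNode node in nodes)
            {
                if (!allowedTags.Any(x => string.Equals(x, node.Name, StringComparison.OrdinalIgnoreCase)))
                {
                    // Disallowed node: replace it (and everything inside it) with its text content.
                    var safeNode = doc.CreateTextNode(node.InnerText);
                    root.ReplaceChild(safeNode, node);
                    continue;
                }

                if (node.Attributes != null)
                {
                    // Copy the attributes to a list so they can be removed while iterating.
                    var attrList = node.Attributes.OfType<XmlAttribute>().ToList();
                    foreach (XmlAttribute attr in attrList)
                    {
                        if (!allowedAttributes.Any(x => string.Equals(x, attr.Name, StringComparison.OrdinalIgnoreCase)))
                        {
                            node.Attributes.Remove(attr);
                        }
                        // TODO: if style is allowed, check the allowed keys: values
                    }
                }

                // Recurse only into nodes that were kept.
                if (node.ChildNodes.Count > 0)
                    ParseHtmlDocument(doc, node, allowedTags, allowedAttributes, allowedStyleKeys);
            }
        }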

        public static string ParseSafeHtml(this string input, string[] allowedTags, string[] allowedAttributes, string[] allowedStyleKeys)
        {
            var xmlDoc = new XmlDocument();
            xmlDoc.LoadXml("<span>" + input + "</span>");

            ParseHtmlDocument(xmlDoc, null, allowedTags, allowedAttributes, allowedStyleKeys);

            // InnerXml serializes only the children of the wrapper <span>,
            // so the wrapper never has to be trimmed off by hand.
            return xmlDoc.DocumentElement.InnerXml;
        }
    }
}

To use:

var x = "<b>allowed</b><b class='text'>allowed attr</b><b id='5'>not allowed attr</b><i>not all<b>o</b>wed tag</i>".ParseSafeHtml((new string[] { "b", "#text" }), (new string[] { "class" }), (new string[] { }));

Which outputs:

<b>allowed</b><b class="text">allowed attr</b><b>not allowed attr</b>not allowed tag

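If an element is not allowed, it is replaced by its InnerText, so the tag and any tags nested inside it are removed while their text is kept. Note that XmlDocument.LoadXml only accepts well-formed XML, so input such as an unclosed <br> will throw an XmlException.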
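
The TODO in the attribute loop (filtering a style attribute by allowedStyleKeys) is left open in the answer. One possible way to fill it in is sketched below. `StyleFilter` and `FilterStyle` are hypothetical names, not part of the answer, and the sketch assumes simple `key: value;` declarations. It checks keys only, not values, and it splits naively on `;`, so treat it as a starting point rather than a complete CSS sanitizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StyleFilter
{
    // Hypothetical helper: keeps only the "key: value" declarations
    // whose key appears in allowedStyleKeys (case-insensitive).
    public static string FilterStyle(string style, string[] allowedStyleKeys)
    {
        if (string.IsNullOrWhiteSpace(style)) return string.Empty;

        var kept = new List<string>();
        foreach (var declaration in style.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int colon = declaration.IndexOf(':');
            if (colon <= 0) continue; // skip declarations with no key

            string key = declaration.Substring(0, colon).Trim();
            string value = declaration.Substring(colon + 1).Trim();

            if (value.Length > 0 &&
                allowedStyleKeys.Any(k => string.Equals(k, key, StringComparison.OrdinalIgnoreCase)))
            {
                kept.Add(key + ": " + value);
            }
        }

        return string.Join("; ", kept);
    }
}
```

Inside the attribute loop, an allowed style attribute could then be rewritten with something like `attr.Value = StyleFilter.FilterStyle(attr.Value, allowedStyleKeys);`. For example, filtering `color: red; position: fixed` with only `color` allowed yields `color: red`.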

Method 3

You can use the MS AntiXSS library to sanitize potentially-executable HTML. Take a look at it here:

http://msdn.microsoft.com/en-us/security/aa973814.aspx

http://wpl.codeplex.com/

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
