How to get the number of words on a web page?

I need to get the total number of words on a web page. I know about the System.Net.WebClient class, but its DownloadString() method returns the whole HTML markup, whereas what I need is only the text so that I can count the words.

Any ideas/suggestions welcome.

Answers:

Method 1

Use the HTML Agility Pack to download and parse the HTML document.

You can then query the document object and extract the inner text of all nodes.

Method 2

Take a look at HTML Agility Pack. It allows you to apply XPath expressions to an HTML document.

You want to find all text nodes and then count the words. //text() is the XPath to get all text nodes.
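HTML Agility Pack is a .NET library, but the same `//text()` query can be tried directly in a browser through the standard `document.evaluate` API. The following is only a sketch of that idea: the function name `countWordsInTextNodes` is made up for illustration, and `doc` is assumed to be a browser `Document` such as `window.document`. Note that `//text()` also matches the text inside `<script>` and `<style>` elements, so the result can overcount.

```javascript
// Sketch: count words across all text nodes selected by the XPath //text().
// Assumption: `doc` is a browser Document (for example window.document).
function countWordsInTextNodes(doc) {
  // Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
  const ORDERED_NODE_SNAPSHOT_TYPE = 7;
  const result = doc.evaluate('//text()', doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);

  let count = 0;
  for (let i = 0; i < result.snapshotLength; i++) {
    const text = result.snapshotItem(i).nodeValue || '';
    // Split on runs of whitespace and ignore empty strings.
    count += text.split(/\s+/).filter(Boolean).length;
  }
  return count;
}
```

In a browser console this could be called as `countWordsInTextNodes(document)`.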

Method 3

I see two choices:

  1. Use an HTML library to parse the string into a DOM-like tree.
  2. Use some simple text-based hacks.

Option 1 is obviously cleaner, but it introduces additional dependencies on third-party libraries.

For the text-based approach, there are several steps:

  1. Remove tags (together with their content) whose content you don’t want, for example scripts and stylesheets.
  2. Remove all other tags while keeping their content, i.e. extract the text from the remaining tags.
  3. Split the remainder using the string.Split function, with all whitespace characters as separators and the option to ignore empty result strings enabled.
  4. Count the number of entries that Split returned.
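The steps above can be sketched in JavaScript with a few regular expressions. This is a rough illustration of the text-hack approach, not a robust HTML parser: the function name `countWordsInHtml` is hypothetical, HTML entities such as `&nbsp;` are not decoded, and unusual markup can easily defeat the patterns.

```javascript
// Sketch of the text-based approach: strip markup, then count whitespace-separated words.
function countWordsInHtml(html) {
  const text = html
    // Step 1: drop <script> and <style> elements together with their content.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Also drop HTML comments, which are not visible text.
    .replace(/<!--[\s\S]*?-->/g, ' ')
    // Step 2: remove all remaining tags but keep the text between them.
    .replace(/<[^>]+>/g, ' ');

  // Step 3: split on runs of whitespace, ignoring empty entries.
  const words = text.split(/\s+/).filter(Boolean);

  // Step 4: the word count is the number of entries.
  return words.length;
}
```

For example, `countWordsInHtml('<p>Hello <b>world</b></p>')` returns 2.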

Obviously this doesn’t work well for all languages. Japanese and Chinese, for example, don’t put spaces between words.
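To illustrate the limitation, here is a small sketch comparing whitespace splitting with locale-aware segmentation. It assumes a JavaScript runtime that provides the standard `Intl.Segmenter` API (recent browsers and Node.js versions do); the function names are made up for this example, and the exact segmentation of Japanese or Chinese text depends on the runtime's dictionary data.

```javascript
// Whitespace-based counting: treats a Japanese sentence as a single "word".
function countWordsByWhitespace(text) {
  return text.split(/\s+/).filter(Boolean).length;
}

// Locale-aware counting with Intl.Segmenter (assumes the runtime supports it).
function countWordsSegmented(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    // isWordLike is false for spaces and punctuation.
    if (segment.isWordLike) count++;
  }
  return count;
}

const japanese = '今日は天気がいいです';
console.log(countWordsByWhitespace(japanese));    // whitespace splitting sees one chunk
console.log(countWordsSegmented(japanese, 'ja')); // the segmenter finds several words
```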

Method 4

http://www.wordcounttool.com/ … This is the most basic way I know.

Method 5

If you need to count only those words actually visible to the user (i.e. ignoring content hidden by CSS and including content created dynamically by JavaScript) then you will probably need to automate a browser or browser control.

It may be possible to do this entirely with client-side JavaScript:

  1. Load the first web page into an iframe.
  2. After everything is fully loaded, interrogate the runtime DOM to extract only the content that is visible to the user.
  3. Write the results into the outer page’s content area.
  4. Repeat for the next web page.
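A minimal sketch of these steps, assuming it runs in a browser and that every page is served from the same origin as the outer page (otherwise the browser blocks access to the iframe's document). All function names and the `outputElement` parameter are hypothetical. `innerText` is used because it reflects rendered text and skips content hidden by CSS. The `load` event may fire before late-running scripts have finished adding content, so a real implementation might need an extra wait.

```javascript
// Step 2: count the words in the text that is actually rendered.
function countVisibleWords(doc) {
  return doc.body.innerText.split(/\s+/).filter(Boolean).length;
}

// Step 1: load a page into an iframe and resolve with its visible word count.
function countWordsOnPage(url) {
  return new Promise((resolve, reject) => {
    const iframe = document.createElement('iframe');
    // The iframe is left visible: innerText only reflects CSS for rendered content.
    iframe.onload = () => {
      try {
        // contentDocument is null for cross-origin pages, which ends up in the catch.
        resolve(countVisibleWords(iframe.contentDocument));
      } catch (err) {
        reject(err);
      } finally {
        iframe.remove();
      }
    };
    iframe.src = url;
    document.body.appendChild(iframe);
  });
}

// Steps 3 and 4: write each result into the outer page, then move on to the next URL.
async function reportWordCounts(urls, outputElement) {
  for (const url of urls) {
    const count = await countWordsOnPage(url);
    outputElement.textContent += `${url}: ${count} words\n`;
  }
}
```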

Method 6

Okay, since the text hack is not shown here, I’ll try it the quick and dirty JS way. Hope this helps someone.

// Split on runs of whitespace and drop the empty strings that
// leading or trailing whitespace would otherwise produce.
let words = document.body.innerText.split(/\s+/).filter(Boolean);
let wordCount = words.length;

// Other stuff to copy and paste :)

/**
 * Makes the elements unique by converting to a set and then back to an array.
 * @param {any[]} array
 * @return {any[]}
 */
function uniqueArray(array) {
    return Array.from(new Set(array));
}

let uniqueWords = uniqueArray(words);
// TODO: count word frequencies (term frequency analysis),
// and you're ready to do page indexing by word tags
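The frequency count left as a TODO could look roughly like this. It is a sketch only: `termFrequency` is a hypothetical helper name, it takes any array of words (such as the one obtained by splitting `innerText`), and lowercasing each word is an assumption about how terms should be normalised.

```javascript
// Sketch: count how often each word occurs, ignoring case.
function termFrequency(words) {
  const counts = new Map();
  for (const word of words) {
    const key = word.toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Usage with a small hand-written word list.
const frequencies = termFrequency(['The', 'cat', 'saw', 'the', 'other', 'cat']);
console.log(frequencies.get('the')); // 2
console.log(frequencies.get('saw')); // 1
```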


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
