I am removing stop words from a text, roughly using this code.
I have the following:
$ cat file
file
types
extensions
$ cat stopwords
i
file
types
$ grep -vwFf stopwords file
I am expecting the result:
extensions
but I get the (I think incorrect) result:
file
extensions
It is as if the word file has been skipped in the stopwords file.
Now here’s the cool bit: if I modify the stopwords file by changing the single word/letter i on the first line to any other ASCII letter apart from f, i, l, e, then the same grep command gives me a different, and correct, result of extensions.
What is going on here and how do I fix it?
I’m using grep (BSD grep) 2.5.1-FreeBSD on Mac OS X, with GNU bash, version 4.4.12(1).
Answers:
Method 1
This was a bug in bsdgrep. It relates to a variable that tracks the part of the current line still to be scanned, which is overwritten by successive calls to the regular expression matching engine when multiple patterns are involved.
local fix
You can work around this to an extent by not using the -w option, which relies upon this variable for correct operation and thus is failing, but instead using the regular expression extensions that match the beginning and endings of words, making your stopwords file look like:
\<i\>
\<file\>
\<types\>
This workaround will also require that you do not use the -F option.
Note that the documented regular expression components [[:<:]] and [[:>:]] that the re_format manual tells you about will not work here. This is because the regular expression library that is compiled into bsdgrep has GNU regular expression compatibility support turned on. This is another bug, which is reportedly fixed.
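The workaround above can be sketched as a short script. This is only an illustration: the scratch directory and the stopwords.re file name are made up for the example, and the sed step assumes that the stop words contain no regular expression metacharacters (a word such as c++ would need escaping first).

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two sample files from the question.
printf '%s\n' file types extensions > file
printf '%s\n' i file types > stopwords

# Wrap every stop word in \< ... \> so that it only matches a whole word.
# stopwords.re is a hypothetical name for the generated pattern file.
# Assumption: the stop words contain no regular expression metacharacters.
sed 's/.*/\\<&\\>/' stopwords > stopwords.re

# No -w and no -F: the word boundaries now live in the patterns themselves.
result=$(grep -v -f stopwords.re file)
printf '%s\n' "$result"
```

Generating the pattern file from the plain stopwords list keeps the original list untouched, so it can go back to being used with -w and -F once a fixed grep is installed.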
service fix
This bug was fixed earlier this year. The fix has not yet made it into the STABLE or RELEASE flavours of FreeBSD, but is reportedly in CURRENT.
For getting this into the MacOS version of grep, that is derived from FreeBSD’s bsdgrep, please consult Apple. ☺
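To see whether a particular grep still carries the bug, the test case from the question can simply be rerun. The following is a rough sketch of such a check; the scratch directory and the verdict variable are inventions for this example, not part of any grep tooling.

```shell
# Rerun the question's test case in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' file types extensions > "$workdir/file"
printf '%s\n' i file types > "$workdir/stopwords"

result=$(grep -vwFf "$workdir/stopwords" "$workdir/file")

# A correct grep prints only "extensions"; the buggy bsdgrep also prints "file".
if [ "$result" = "extensions" ]; then
    verdict="ok"
else
    verdict="affected"
fi
printf 'grep -vwFf check: %s\n' "$verdict"

rm -r "$workdir"
```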
Further reading
- Jonathan de Boyne Pollard (2017-10-15). bsdgrep behaves incorrectly when given multiple patterns. Bug #223031. FreeBSD Bugzilla.
- Kyle Evans (2017-04-03). bsdgrep: fix matching behaviour. Revision 316477. FreeBSD source.
- Kyle Evans (2017-05-02). bsdgrep: fix -w -v matching improperly with certain patterns. Revision 317665. FreeBSD source.
- Nathan Weeks (2014-06-16). grep(1) and bsdgrep(1) do not recognize [[:<:]] and [[:>:]]. Bug #191086. FreeBSD Bugzilla.
Method 2
This code:
pl " Input data file data1 and stopwords file data2:"
head data1 data2

pl " Expected output:"
cat $E

pl " Results, grep:"
# grep -vwFf stopwords file
grep -vwFf data2 data1

pl " Results, cgrep:"
cgrep -x1 -vFf data2 data1
produces:
-----
 Input data file data1 and stopwords file data2:
==> data1 <==
file
types
extensions

==> data2 <==
i
file
types

-----
 Expected output:
extensions

-----
 Results, grep:
file
extensions

-----
 Results, cgrep:
extensions
On a system like:
OS, ker|rel, machine: Apple/BSD, Darwin 16.7.0, x86_64
Distribution        : macOS 10.12.6 (16G29), Sierra
bash GNU bash 3.2.57
More details on cgrep, available via brew, and from sourceforge:
cgrep shows context of matching patterns found in files (man)
Path    : ~/executable/cgrep
Version : 8.15
Type    : Mach-O64-bitexecutablex86_64 ...)
Home    : http://sourceforge.net/projects/cgrep/ (doc)
cheers, drl
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.