grep strange behaviour with single letter words

I am removing stop words from a text, roughly using this code.

I have the following:

$ cat file
file
types
extensions

$ cat stopwords
i
file
types

$ grep -vwFf stopwords file

I am expecting the result:
extensions

but I get the (I think incorrect) result:

file
extensions

It is as if the word file has been skipped in the stopwords file.
Now here’s the cool bit: if I modify the stopwords file by changing the single-letter word i on the first line to any other ASCII letter apart from f, i, l, e, then the same grep command gives me a different, and correct, result of extensions.

What is going on here and how do I fix it?

I’m using grep (BSD grep) 2.5.1-FreeBSD on Mac OS X, with GNU bash, version 4.4.12(1).

Answers:

Method 1

This was a bug in bsdgrep. A variable that tracks the part of the current line still to be scanned is overwritten by successive calls to the regular expression matching engine when multiple patterns are involved.

local fix

You can work around this to an extent by not using the -w option, which relies upon this variable for correct operation and therefore fails. Instead, use the regular expression extensions that match the beginnings and endings of words, making your stopwords file look like:

\<i\>
\<file\>
\<types\>

This workaround will also require that you do not use the -F option.

Note that the documented regular expression components [[:<:]] and [[:>:]] that the re_format manual tells you about will not work here. This is because the regular expression library that is compiled into bsdgrep has GNU regular expression compatibility support turned on. This is another bug, which is reportedly fixed.
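The stopwords file does not have to be converted by hand. The following is only a sketch of one way to do it, not part of the original answer. It assumes the stop words contain no regular expression metacharacters, and the scratch directory and the stopwords.re file name are made up for illustration:

```shell
# Build the question's sample files in a scratch directory (illustrative paths).
workdir=$(mktemp -d)
printf '%s\n' file types extensions > "$workdir/file"
printf '%s\n' i file types > "$workdir/stopwords"

# Drop blank lines, then wrap every stop word in \< and \>.
# Assumes the words contain no regex metacharacters such as . * [ ] \
sed -e '/^$/d' -e 's/.*/\\<&\\>/' "$workdir/stopwords" > "$workdir/stopwords.re"

# No -w and no -F: the word boundaries now come from the patterns themselves.
result=$(grep -vf "$workdir/stopwords.re" "$workdir/file")
printf '%s\n' "$result"
```

With a grep that understands \< and \>, this should leave only extensions. If the stop words can contain characters such as . or *, they would need escaping first, which this sketch does not attempt.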

service fix

This bug was fixed earlier this year. The fix has not yet made it into the STABLE or RELEASE flavours of FreeBSD, but is reportedly in CURRENT.

To get this into the macOS version of grep, which is derived from FreeBSD’s bsdgrep, please consult Apple. ☺
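To find out whether a particular installed grep still shows the problem, the test case from the question can be wrapped in a small check. This is a rough sketch; check_grep_w is a made-up helper name, and it only tests this one input, so a "not affected" result is not proof that the bug is absent in general:

```shell
# check_grep_w is a hypothetical helper, not part of any grep distribution.
# It reruns the question's test case against the grep found first in PATH.
check_grep_w() {
    tmp=$(mktemp -d) || return 2
    printf '%s\n' file types extensions > "$tmp/file"
    printf '%s\n' i file types > "$tmp/stopwords"

    # A correct grep prints only "extensions" for this input.
    got=$(grep -vwFf "$tmp/stopwords" "$tmp/file")
    rm -rf "$tmp"

    if [ "$got" = "extensions" ]; then
        echo "not affected"
    else
        # Covers the buggy output as well as any other unexpected result.
        echo "affected"
    fi
}

check_grep_w
```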

Method 2

This code:

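# pl is a helper from the author's script (not shown); judging by the output
# below, it prints a separator line followed by the given heading.
# $E names a file holding the expected output (also set elsewhere).
pl " Input data file data1 and stopwords file data2:"
head data1 data2

pl " Expected output:"
cat "$E"

pl " Results, grep:"
# grep -vwFf stopwords file
grep -vwFf data2 data1

pl " Results, cgrep:"
cgrep -x1 -vFf data2 data1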

produces:

-----
 Input data file data1 and stopwords file data2:
==> data1 <==
file
types
extensions

==> data2 <==
i
file
types

-----
 Expected output:
extensions

-----
 Results, grep:
file
extensions

-----
 Results, cgrep:
extensions

On a system like:

OS, ker|rel, machine: Apple/BSD, Darwin 16.7.0, x86_64
Distribution        : macOS 10.12.6 (16G29), Sierra
bash GNU bash 3.2.57

More details on cgrep, available via brew, and from sourceforge:

cgrep   shows context of matching patterns found in files (man)
Path    : ~/executable/cgrep
Version : 8.15
Type    : Mach-O 64-bit executable x86_64 ...)
Home    : http://sourceforge.net/projects/cgrep/ (doc)

cheers, drl


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.
