I want to take patterns that are listed in one file and find them in another file. In the second file, some of those patterns are separated by commas.
For example, the first file, F1, has genes:
ENSG00000187546
ENSG00000113492
ENSG00000166971
and the second file, F2, has those genes along with some more columns (five columns), which I need:
region gene chromosome start end
intronic ENSG00000135870 1 173921301 173921301
intergenic ENSG00000166971(dist=56181),ENSG00000103494(dist=37091) 16 53594504 53594504
ncRNA_intronic ENSG00000215231 5 5039185 5039185
intronic ENSG00000157890 15 66353740 66353740
So the gene ENSG00000166971, which is present in the second file, does not show up in the grep output because it has another gene with it, separated by a comma.
My code is:
grep -f "F1.txt" "F2.txt" > output.txt
I want those values even if only one of them is present, along with the associated data. Is there any way to do this?
Answers:
Method 1
What version of grep are you using? I tried your code and got the following results:
$ grep -f file1 file2
ENSG00000187546
ENSG00000113492
ENSG00000166971,ENSG00000186106
If you just want the results that match, you can use grep's -o switch to report only the things that match:
$ grep -o -f file1 file2
ENSG00000187546
ENSG00000113492
ENSG00000166971
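To make the difference between the two invocations concrete, here is a minimal sketch on sample data modelled on the question. The file names and the temporary directory are placeholders, and -F (treat patterns as fixed strings) is an addition that is not part of the original answer.

```shell
#!/bin/sh
# Sample data modelled on the question; names and contents are placeholders.
workdir=$(mktemp -d)
printf 'ENSG00000187546\nENSG00000166971\n' > "$workdir/file1"
printf '%s\n' \
  'intronic ENSG00000135870 1 173921301 173921301' \
  'intergenic ENSG00000166971(dist=56181),ENSG00000103494(dist=37091) 16 53594504 53594504' \
  > "$workdir/file2"

# Without -o: every row containing any pattern, with all of its columns.
full_rows=$(grep -F -f "$workdir/file1" "$workdir/file2")

# With -o: only the matched identifiers, one per line.
ids_only=$(grep -o -F -f "$workdir/file1" "$workdir/file2")

printf 'rows: %s\n' "$full_rows"
printf 'ids:  %s\n' "$ids_only"
rm -rf "$workdir"
```

If the goal is to keep the associated columns, as in the question, the form without -o is probably the one to use.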
grep version
$ grep --version
grep (GNU grep) 2.14
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
Stray characters in F1.txt?
While debugging this further I noticed several stray spaces at the end of the 2nd line in the file F1.txt. You can see them using hexdump.
$ hexdump -C ff1
00000000  45 4e 53 47 30 30 30 30  30 31 38 37 35 34 36 0a  |ENSG00000187546.|
00000010  45 4e 53 47 30 30 30 30  30 31 31 33 34 39 32 20  |ENSG00000113492 |
00000020  20 0a 45 4e 53 47 30 30  30 30 30 31 36 36 39 37  | .ENSG0000016697|
00000030  31 0a                                             |1.|
00000032
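They show up as ASCII code 20 (hexadecimal), which is the space character. You can see them here: 32 20 20 0a. Because grep -f treats each whole line as a pattern, trailing spaces included, that pattern only matches where the gene ID is followed by two spaces.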
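One possible way to work around stray trailing whitespace is to clean the pattern file before handing it to grep. The sketch below is an assumption-laden illustration, not part of the original answer: the sample file contents, the F1.clean.txt name, and the use of -F are all made up for the example.

```shell
#!/bin/sh
# Sample data modelled on the question; line 2 of F1.txt carries two trailing spaces.
workdir=$(mktemp -d)
printf 'ENSG00000187546\nENSG00000113492  \nENSG00000166971\n' > "$workdir/F1.txt"
printf '%s\n' \
  'intronic ENSG00000135870 1 173921301 173921301' \
  'intergenic ENSG00000166971(dist=56181),ENSG00000103494(dist=37091) 16 53594504 53594504' \
  'ncRNA_intronic ENSG00000113492 5 5039185 5039185' \
  > "$workdir/F2.txt"

# Strip trailing whitespace (this also removes carriage returns) and drop
# empty lines, since an empty pattern would match every line of F2.txt.
sed -e 's/[[:space:]]*$//' -e '/^$/d' "$workdir/F1.txt" > "$workdir/F1.clean.txt"

# -F treats each pattern as a fixed string rather than a regular expression.
result=$(grep -F -f "$workdir/F1.clean.txt" "$workdir/F2.txt")
printf '%s\n' "$result"
rm -rf "$workdir"
```

With the uncleaned F1.txt, the ENSG00000113492 row would be missed, because that ID is followed by a single space in F2.txt rather than two.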
Method 2
It’s very easy if you use ripgrep. Here’s how you would search log.txt for each separate line in ids.txt:
rg -f ids.txt log.txt
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.