Comparing two files using Unix and Awk

I need to compare two files, File1 and File2 (Separated by space) using 4 fields (Field 1, 2, 4 and 5 of File 1 with field1, 2, 4 and 5 of File2).

Logic:
If column 1, 2 and 4 of File 1 matches with column 1, 2 and 4 of File 2 and there is a mismatch at column 5 then both the lines from File 1 and File 2 are concatenated redirected as output.
Therefore, the output file only contains those lines where Column 1, 2 and 4 of File1 and File2 matches and Column 5 does not match.

File1:

sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/68         20      .        T       C         71       PASS     N=2     F=5;U=4
sc2/24         24      .        T       G         31       PASS     N=2     F=5;U=4

File2:

sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4 
sc2/68         20      .        T        C        71      PASS    N=2       F=5;U=4 
sc2/10         24      .        T        G        31      PASS    N=2       F=5;U=4
sc2/40         59      .        T        G        31      PASS    N=2       F=5;U=4
sc2/24         24      .        A        G        38      PASS    N=2       F=5;U=4

Output:

sc2/80         20      .        A       T        86      PASS     N=2      F=5;U=4
sc2/80         20      .        A       C        80      PASS     N=2      F=5;U=4

sc2/60         55      .        G       T        76      PASS     N=2      F=5;U=4 
sc2/60         55      .        G       C        72      PASS     N=2      F=5;U=4

I am new in the field and I appreciate your help.

Contents hide

Answers:

Method 1

Example

How it works?

Method 2

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

You can use awk. Put the following in a script, script.awk:

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
}  

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Now run it like this:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4

The script works as follows. This block creates 3 arrays, f1, f1_c14, and f1_c5. f1 contains all the lines of file1 in an array, indexed using the contents of the columns 1, 2, & 4 from file1. f1_c14 is another array with the same index (1, 2, & 4’s contents) and a value of 1. The 3rd array uses the same index as the 1st 2, with the value of the 5th column from file1.

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
}

The next block is responsible for printing lines from the 1st file, file1 under the conditions that the columns 1, 2, & 4 match the columns from file2, AND it will onlu print the line from file1 if the 5th columns of file1 and file2 do not match.

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

The 3rd block is responsible for printing the associated line from file2 there’s a corresponding line in the array f1 for file2‘s columns 1, 2, & 4. Again it only prints if the 5th columns do not match.

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Example

Running the above script like so:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4

You can use the column command to clean up the output slightly:

$ awk -f script.awk file1  file2 | column -t
sc2/80  20  .  A  T  86  PASS  N=2  F=5;U=4
sc2/80  20  .  A  C  80  PASS  N=2  F=5;U=4
sc2/60  55  .  G  T  76  PASS  N=2  F=5;U=4
sc2/60  55  .  G  C  72  PASS  N=2  F=5;U=4

How it works?

FNR == NR

This makes use of awk‘s ability to loop through files in a particular way. Here’s we’re looping through the files and when we’re on a line that’s from the first file, file, we want to run a particular block of code on this line from file1.

This example shows what FNR == NR is doing when we give it 2 simulated files. One has 4 lines in it while the other has 5 lines:

$ awk 'BEGIN {print "NRtFNRtline"} {print NR"t"FNR"t"$0}' 
     <(seq 1 4) <(seq 1 5)
NR  FNR line
1   1   1
2   2   2
3   3   3
4   4   4
5   1   1
6   2   2
7   3   3
8   4   4
9   5   5

other blocks

The other blocks, f1_c14[$1,$2,$4] AND f1[$1,$2,$4] only run when the values from those array elements has a value.

Method 2

Here’s a solution in Perl. You should save the following code in a file and run it as a script (see below):

#!/usr/bin/perl
$file1 = '/path/to/file1';
$file2 = '/path/to/file2';
open $f1,'<',$file1;
open $f2,'<',$file2;
while(<$f1>){
    ($c1,$c2,$c4,$c5) = (split / /)[0,1,3,4]; #get relevant columns in file 1
    $lines_dictionary{"$c1 $c2 $c4"}="$c5---$_"; #create a hash entry keyed by the relevant columns
}
while(<$f2>){
    ($c1,$c2,$c4,$c5) = (split / /)[0,1,3,4]; #get relevant columns in file 2
    if(exists $lines_dictionary{"$c1 $c2 $c4"}){ #if a line with similar columns was seen in file 1
        ($file1_c5,$file1_line) = split /---/,$lines_dictionary{"$c1 $c2 $c4"}; #parse the hash entry this line in file 1
        if($file1_c5 -ne $c5){ #if column 5 of file 2 doesn't match column 5 of file 1
            print "${file1_line}$_n"; #we only need one extra newline as the lines read from the files have trailing ones.
        }
    }
}
close $f1;
close $f2;

Use any text editor to paste this script into a file, modify the $file1 and $file2 variables to reflect the true locations of your files, then make the script executable by doing:

$ chmod +x /path/to/script

Finally, call the script:

$ /path/to/script

Disclaimer

This code is untested
This code assumes the pattern ‘—‘ is unlikely to occur in the 5th column.
This code assumes the lines in file 1 are unique (i.e. that each line has a different combination of “column1 column2 column4”). If there are multiple lines (not necessarily consecutive) containing the same data in the relevant columns, the script will use the last one (bottom-most in the file) of these lines.

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating